COUNT(*) vs. COUNT(1) vs. COUNT(id)


I would like to better understand the difference between these ways of using COUNT:

select COUNT(*) from tabela

select COUNT(1) from tabela

select COUNT(id) from tabela

There is already a question about the performance of two of these forms, but I would like more detail on how each one differs, adding a third form to the comparison.

  • Is there a (real) performance difference?
  • Can there be a difference in the results?
  • Is it possible to map the best use for each of the forms?
asked by anonymous 17.09.2018 / 14:15

1 answer

select COUNT(*) from tabela

In principle this takes every column of the table during the query, so all of the data ends up in memory for the DBMS to do the count (and anything else), provided there is no optimization. Many databases do optimize this simple expression in some way, and some can even answer it in constant time, O(1). Otherwise it is O(N).

select COUNT(1) from tabela

This takes a constant, that is, a value that is already in memory, so in theory it should be much faster because nothing has to be loaded from the table. That depends on some optimization being in place and, as always, on the database implementation. It can be O(1), but in most cases it will still be O(N). The difference is that the load for each row is potentially smaller.

select COUNT(id) from tabela

Here only one column is read, which is usually faster than the first form (for tables with a few short columns it may come out the same). It should come out the same as the previous form, because every row still has to be read even though nothing beyond the identifier is brought in for the count. Again, absent some optimization, this may be considerably slower than the previous form, or it may not be. It can be O(N) or O(1).

More details

Some databases have an optimization in which the total unfiltered count (no WHERE) is stored automatically and guaranteed to be up to date, in which case the complexity is O(1). The most scalable databases usually do not have this because of technical difficulties. One example I can cite is MySQL's MyISAM, which always knows the COUNT() when no filter or other criterion is used. In most cases, though, it will be O(N). If there is a WHERE, other expressions in the SELECT, a JOIN, or anything else that can affect the count, it will certainly be O(N).
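As a rough illustration of the "stored total" idea, the sketch below keeps a running count in a side table updated by triggers. This is not how MyISAM implements it internally; it only shows why an unfiltered total can be read in O(1) while a filtered count still needs a scan. It assumes SQLite through Python's sqlite3 module, and the `tabela_total` table, the trigger names, and the `nome` column are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tabela (id INTEGER PRIMARY KEY, nome TEXT);

    -- Hypothetical side table holding the precomputed total.
    CREATE TABLE tabela_total (n INTEGER NOT NULL);
    INSERT INTO tabela_total VALUES (0);

    -- Keep the stored total in sync on every insert and delete.
    CREATE TRIGGER tabela_ins AFTER INSERT ON tabela
    BEGIN UPDATE tabela_total SET n = n + 1; END;

    CREATE TRIGGER tabela_del AFTER DELETE ON tabela
    BEGIN UPDATE tabela_total SET n = n - 1; END;
""")

conn.executemany("INSERT INTO tabela (nome) VALUES (?)", [("a",), ("b",), ("c",)])
conn.execute("DELETE FROM tabela WHERE nome = 'b'")

# Reading the stored total touches a single row, whatever the table size.
stored_total = conn.execute("SELECT n FROM tabela_total").fetchone()[0]

# The unfiltered COUNT must agree with it; the filtered COUNT cannot use it.
scanned_total = conn.execute("SELECT COUNT(*) FROM tabela").fetchone()[0]
filtered = conn.execute(
    "SELECT COUNT(*) FROM tabela WHERE nome = 'a'"
).fetchone()[0]

print(stored_total, scanned_total, filtered)
```

The trade-off is that every write now pays for the counter update, which is one reason heavily concurrent engines tend to avoid a single always-current total.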

If it is O(N), the performance difference will be small, especially in tables with very short rows: to count, the engine has to read every row anyway. With very large rows there may be a difference when using COUNT(*), but only in those cases. The other two forms should then perform the same, because reading a row just to count it, without using any of its data, costs about the same as reading a simple id.
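One way to check what a particular engine does with each form is to ask it for the query plan. The sketch below does that with SQLite via Python's sqlite3 module, purely as a stand-in: other databases have their own EXPLAIN syntax and may plan these queries differently. The `nome` column is a hypothetical filler, and the plan text varies between SQLite versions, so no specific output is assumed here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tabela (id INTEGER PRIMARY KEY, nome TEXT)")
conn.executemany(
    "INSERT INTO tabela (nome) VALUES (?)",
    [(f"n{i}",) for i in range(1000)],
)

plans = {}
for expr in ("COUNT(*)", "COUNT(1)", "COUNT(id)"):
    rows = conn.execute(
        f"EXPLAIN QUERY PLAN SELECT {expr} FROM tabela"
    ).fetchall()
    # The last column of each plan row is the human-readable detail.
    plans[expr] = [row[-1] for row in rows]
    print(expr, "->", plans[expr])
```

If the three plans come out identical, the engine is treating the forms the same way and any timing difference is more likely noise or caching than the syntax itself.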

Understanding all of this depends on understanding how the storage engine of your DBMS works internally, and also how file systems work in general. Keep in mind that every time the software has to go to disk (or SSD, or any other form of storage) it is far slower, by up to 3 orders of magnitude, than fetching data that is already in memory, regardless of the size of the data, although size does shift the proportion a little in some cases. You also need to understand when secondary storage is accessed and when it is not. The cache can therefore make a huge difference: in extreme cases the same query can go from under 1 millisecond when everything is cached to over 1 second when nothing is.

  

Is there a (real) performance difference?

Whether this is a performance problem depends somewhat on the implementation, because the engine can optimize knowing that, in practice, loading the data serves no purpose in this specific query; skipping the load, even when the query nominally asks for it, changes nothing in the result. Beyond that, the only way to know is to test each case, and the result may change from one run to the next. Contrary to what people imagine, testing database performance is very complicated. People expect linear answers, but that does not happen in something with this many optimizations.

  

Can there be a difference in results?

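Not in this case, in most databases; in other cases there can be. For example, COUNT(column) skips rows in which that column is NULL, whereas COUNT(*) and COUNT(1) count every row, so the results diverge when the counted column is nullable.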
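One case where the results can differ is a nullable column: under standard SQL, COUNT over a column ignores NULLs, while COUNT(*) and COUNT(1) count rows. A minimal sketch, assuming SQLite through Python's sqlite3 module and a hypothetical nullable `email` column added to the question's `tabela`:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tabela (id INTEGER PRIMARY KEY, email TEXT)")
conn.executemany(
    "INSERT INTO tabela (email) VALUES (?)",
    [("a@x.org",), (None,), ("c@x.org",), (None,)],
)

# id is a primary key and never NULL here; email is NULL in two rows.
counts = conn.execute(
    "SELECT COUNT(*), COUNT(1), COUNT(id), COUNT(email) FROM tabela"
).fetchone()

print(counts)  # (4, 4, 4, 2)
```

With a non-null id, as in the question, all three forms return the same number; only the nullable column produces a different count.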

  

Is it possible to map the best use for each of the forms?

I think so, whatever that means. The basic approach is to test each form where it will actually be used and see which one is faster (in multiple scenarios, taking the cache into account). In other examples it may be more a matter of whether the form does what you expect, but that can only be analyzed case by case.
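A minimal sketch of such a test, assuming SQLite through Python's sqlite3 module and an in-memory table, which means everything is already "cached" and the disk-bound scenario is not covered. The `payload` column, the row count, and the repetition counts are arbitrary choices for illustration; real measurements should be taken on the actual database, with and without a warm cache.

```python
import sqlite3
import timeit

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tabela (id INTEGER PRIMARY KEY, payload TEXT)")
conn.executemany(
    "INSERT INTO tabela (payload) VALUES (?)",
    [("x" * 200,) for _ in range(20_000)],
)

def run(expr):
    """Execute one counting query and return its single result."""
    return conn.execute(f"SELECT {expr} FROM tabela").fetchone()[0]

results = {}
timings = {}
for expr in ("COUNT(*)", "COUNT(1)", "COUNT(id)"):
    results[expr] = run(expr)
    # Best of 5 batches of 20 runs, to reduce run-to-run noise.
    timings[expr] = min(
        timeit.repeat(lambda e=expr: run(e), number=20, repeat=5)
    )
    print(f"{expr:10} rows={results[expr]} best={timings[expr]:.4f}s")
```

Taking the minimum over several batches helps a little with the run-to-run variation mentioned above, but it does not replace testing on the target engine and data.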

    
17.09.2018 / 14:34