There are controversies, as can be seen in this question. Read everything, including the comments; there is very good material there, including demonstrations of how things actually play out in practice.
It seems to me that measuring an algorithm's complexity is not the same as measuring its speed. There is a relationship between the two, but it is not that straightforward.
Of course, with a large enough data set, it is impossible for an O(N^M) algorithm to be faster than an O(1) one on any hardware.
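To make that concrete, here is a minimal sketch (my own illustration, not something taken from the linked question): it times membership tests done with an O(N) linear scan against an O(1) direct-index lookup. The sizes `N` and `QUERIES` and the `present` bitmap are arbitrary choices for the example; as the data grows, the asymptotic term wins no matter the machine.

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N 100000        /* size of the data set (arbitrary)       */
#define QUERIES 10000   /* number of membership tests (arbitrary) */

int main(void) {
    long *data = malloc(N * sizeof *data);
    unsigned char *present = calloc(N, 1);   /* bitmap indexed by value */
    for (long i = 0; i < N; i++) { data[i] = i; present[i] = 1; }

    /* O(QUERIES * N): scan the array for each query. */
    clock_t t0 = clock();
    long hits = 0;
    for (long q = 0; q < QUERIES; q++) {
        long key = rand() % N;
        for (long i = 0; i < N; i++)
            if (data[i] == key) { hits++; break; }
    }
    clock_t t1 = clock();

    /* O(QUERIES): one direct-index lookup per query. */
    long hits2 = 0;
    for (long q = 0; q < QUERIES; q++)
        hits2 += present[rand() % N];
    clock_t t2 = clock();

    printf("linear scan  (O(N) each): %.3fs, %ld hits\n",
           (double)(t1 - t0) / CLOCKS_PER_SEC, hits);
    printf("direct index (O(1) each): %.3fs, %ld hits\n",
           (double)(t2 - t1) / CLOCKS_PER_SEC, hits2);
    free(data);
    free(present);
    return 0;
}
```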
Complexity analysis, the way it is usually done, does not take hardware conditions into account.
The most common approach is to classify efficiency in a neutral environment; in general, optimizations are not considered.
Imagine running this on an overloaded computer: all the theory evaporates. Imagine a machine that spends more time swapping to disk than doing the actual calculation. What is the use of knowing the algorithm's theoretical efficiency then?
In theory you can take hardware into account: you can add variables that model those conditions. But it is very rarely worth it. That is usually done in very specific academic or analysis settings, when you are building something that will be used heavily and potentially needs a lot of performance. So there are cases where it pays to consider every variable. On a day-to-day basis, you do a rough efficiency analysis to choose a suitable algorithm and then run actual speed tests with typical and extreme data to see what happens under real conditions. That usually gives better results from an engineering point of view, even if, from a scientific point of view, it is not necessarily the best.
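As an illustration of the "test with typical and extreme data" advice, here is a sketch under my own assumptions (the input shapes and the size `N` are arbitrary, and `qsort` merely stands in for whatever routine you are evaluating): it times the same sort on random, already-sorted, reversed and constant inputs.

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N 1000000   /* arbitrary input size */

static int cmp_long(const void *a, const void *b) {
    long x = *(const long *)a, y = *(const long *)b;
    return (x > y) - (x < y);
}

/* Fill the array with one of four input "shapes". */
static void fill(long *a, int shape) {
    for (long i = 0; i < N; i++) {
        switch (shape) {
        case 0:  a[i] = rand();  break;  /* typical: random   */
        case 1:  a[i] = i;       break;  /* extreme: sorted   */
        case 2:  a[i] = N - i;   break;  /* extreme: reversed */
        default: a[i] = 42;      break;  /* extreme: constant */
        }
    }
}

int main(void) {
    const char *names[] = { "random", "sorted", "reversed", "constant" };
    long *a = malloc(N * sizeof *a);
    for (int s = 0; s < 4; s++) {
        fill(a, s);
        clock_t t0 = clock();
        qsort(a, N, sizeof *a, cmp_long);
        printf("%-9s %.3fs\n", names[s],
               (double)(clock() - t0) / CLOCKS_PER_SEC);
    }
    free(a);
    return 0;
}
```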
I have seen people say that such calculations exist, but I have never seen anyone actually show them. I know they are out there, but they are well hidden :)
All told, my experience is that caching makes more of a difference than branch prediction in most cases. Algorithms with fewer branches can be more efficient on simpler machines.
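A quick way to see the cache effect (again my own sketch, not code from the original discussion; the matrix size `DIM` is arbitrary): the two loops below do exactly the same amount of arithmetic, but the column-wise traversal walks memory with a large stride and misses the cache far more often, so on a typical desktop it is usually several times slower.

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define DIM 4096   /* 4096 x 4096 ints = 64 MiB; raise it to widen the gap */

int main(void) {
    int *m = malloc((size_t)DIM * DIM * sizeof *m);
    for (size_t i = 0; i < (size_t)DIM * DIM; i++) m[i] = 1;

    /* Row-major traversal: consecutive addresses, cache friendly. */
    clock_t t0 = clock();
    long sum_rows = 0;
    for (size_t r = 0; r < DIM; r++)
        for (size_t c = 0; c < DIM; c++)
            sum_rows += m[r * DIM + c];
    clock_t t1 = clock();

    /* Column-major traversal: large stride, many cache misses. */
    long sum_cols = 0;
    for (size_t c = 0; c < DIM; c++)
        for (size_t r = 0; r < DIM; r++)
            sum_cols += m[r * DIM + c];
    clock_t t2 = clock();

    printf("row-major:    %.3fs (sum=%ld)\n",
           (double)(t1 - t0) / CLOCKS_PER_SEC, sum_rows);
    printf("column-major: %.3fs (sum=%ld)\n",
           (double)(t2 - t1) / CLOCKS_PER_SEC, sum_cols);
    free(m);
    return 0;
}
```

The exact ratio depends on the compiler and on the machine's cache hierarchy, but the pattern is the point: same complexity, very different speed.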
One algorithm may be the most appropriate for handling data on disk, while a different one may be more appropriate when the same data is in memory. Imagine how much improvement there could be if an L1 cache, or even the registers, could hold all the data to be processed.
Keep in mind that for small amounts of data (low N), in practice, it hardly matters which algorithm you use.
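For example (an illustrative sketch with arbitrary numbers: `N`, `REPS` and the choice of insertion sort versus the library's `qsort` are all my assumptions): at N = 32 both finish in a fraction of a microsecond per array, and the "worse" O(N^2) algorithm often even wins thanks to its tiny constant factor.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define N 32          /* small input: the algorithm barely matters     */
#define REPS 200000   /* repeat only to get a measurable total time    */

/* O(N^2) insertion sort, tiny constant factor. */
static void insertion_sort(long *a, int n) {
    for (int i = 1; i < n; i++) {
        long key = a[i];
        int j = i - 1;
        while (j >= 0 && a[j] > key) { a[j + 1] = a[j]; j--; }
        a[j + 1] = key;
    }
}

static int cmp_long(const void *a, const void *b) {
    long x = *(const long *)a, y = *(const long *)b;
    return (x > y) - (x < y);
}

int main(void) {
    long base[N], work[N];
    for (int i = 0; i < N; i++) base[i] = rand();

    clock_t t0 = clock();
    for (int r = 0; r < REPS; r++) {
        memcpy(work, base, sizeof base);
        insertion_sort(work, N);                   /* O(N^2)     */
    }
    clock_t t1 = clock();
    for (int r = 0; r < REPS; r++) {
        memcpy(work, base, sizeof base);
        qsort(work, N, sizeof *work, cmp_long);    /* O(N log N) */
    }
    clock_t t2 = clock();

    printf("insertion sort: %.3fs for %d runs\n",
           (double)(t1 - t0) / CLOCKS_PER_SEC, REPS);
    printf("qsort:          %.3fs for %d runs\n",
           (double)(t2 - t1) / CLOCKS_PER_SEC, REPS);
    return 0;
}
```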
See this: External and internal memory sorting algorithms.