Why does 2 * i * i tend to be faster than 2 * (i * i) when i is integer?

The two expressions, 2*i*i and 2*(i*i), are mathematically equal and should produce the same result; the only thing that changes is the order in which the multiplications are performed. Yet apparently they are treated differently by the interpreter.

In the first expression, given the absence of parentheses and considering that both operations have the same precedence, the interpreter evaluates from left to right: first 2*i is computed, and the result is then multiplied by i.

In the second expression, the parentheses change the order of evaluation: the multiplication i*i takes precedence over the multiplication by 2, so the interpreter first computes i*i and then multiplies the result by 2.
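The two evaluation orders can be made visible with a small instrumented class. This is just an illustrative sketch, not part of the question: the Traced class and its logging are invented here purely to show which operands each multiplication receives.

```python
# A wrapper whose __mul__/__rmul__ log every multiplication, making the
# evaluation order of 2*i*i versus 2*(i*i) observable.
class Traced:
    def __init__(self, value, log):
        self.value = value
        self.log = log

    def __mul__(self, other):  # handles Traced * x
        other_value = other.value if isinstance(other, Traced) else other
        self.log.append((self.value, other_value))
        return Traced(self.value * other_value, self.log)

    def __rmul__(self, other):  # handles int * Traced, e.g. 2 * i
        self.log.append((other, self.value))
        return Traced(other * self.value, self.log)

log1, log2 = [], []
i1, i2 = Traced(5, log1), Traced(5, log2)

r1 = 2 * i1 * i1    # left to right: (2 * i) first, then the result * i
r2 = 2 * (i2 * i2)  # parentheses first: (i * i), then 2 * result

print(log1)  # [(2, 5), (10, 5)]
print(log2)  # [(5, 5), (2, 25)]
print(r1.value, r2.value)  # 50 50 -- same final result either way
```

As expected, both orders produce the same value; only the intermediate operands differ.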

Mathematically, the order of the factors does not change the product, so the result should be the same. However, using the native timeit module, you can see a difference between the runtimes of these multiplications:

>>> print('2*i*i:', timeit("for i in range(1000): 2*i*i"))
2*i*i: 276.91411599100684

>>> print('2*(i*i):', timeit("for i in range(1000): 2*(i*i)"))
2*(i*i): 279.21798045100877

Tests were done at Repl.it

Note: it is important to note that the timeit function will, by default, execute the specified code snippet 1,000,000 times, and that the exact execution time may vary due to fluctuations in processor load and from one computer to another.
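For reference, a minimal sketch of how to control that iteration count explicitly via the number parameter (the value 100 here is an arbitrary choice to keep the run short):

```python
from timeit import timeit

# timeit returns the TOTAL time for all `number` executions,
# not a per-run average.
t1 = timeit("for i in range(1000): 2*i*i", number=100)
t2 = timeit("for i in range(1000): 2*(i*i)", number=100)

print(f"2*i*i:   {t1:.6f} s")
print(f"2*(i*i): {t2:.6f} s")
```

With a small number like this the results are dominated by noise; the default of one million repetitions exists precisely to average that out.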

Why is there such a time difference? Does the presence of the parentheses in the expression change the executed algorithm?

Note that the analysis should focus on the official implementation of Python, CPython, in version 3+. Comparisons with other versions or implementations are welcome as a way to enrich the answer.

asked by anonymous 03.01.2019 / 12:19

1 answer

"You already know this, but it doesn't hurt to repeat": if you think you need optimizations at this level in a snippet of Python code, you are writing that snippet in the wrong language (as you would be in any other very high-level language, such as Ruby, PHP, or even Java).

That said, there are some things that can be said about your question, and some room to speculate a bit, even without reaching a definitive answer.

First: when you want to understand differences of this kind in Python, it is worth looking at the bytecode generated for each code snippet. Comparing the bytecode makes it easier to speculate about what the differences might be. In Python, the dis function of the module of the same name lets you see the bytecode:

In [54]: def a():
    ...:     return 2 * i * i
    ...:

In [55]: def b():
    ...:     return 2 * (i * i)
    ...:

In [56]: import dis

In [57]: dis.dis(a)
  2           0 LOAD_CONST               1 (2)
              2 LOAD_GLOBAL              0 (i)
              4 BINARY_MULTIPLY
              6 LOAD_GLOBAL              0 (i)
              8 BINARY_MULTIPLY
             10 RETURN_VALUE

In [58]: dis.dis(b)
  2           0 LOAD_CONST               1 (2)
              2 LOAD_GLOBAL              0 (i)
              4 LOAD_GLOBAL              0 (i)
              6 BINARY_MULTIPLY
              8 BINARY_MULTIPLY
             10 RETURN_VALUE

So, as you can see, there actually is a small difference. In the first case, the bytecode interleaves loading the values onto the Python virtual machine's stack with the multiplication calls themselves. In the second case, it pushes all the values onto the VM stack first, and then calls the multiplication operator twice in a row.

It may be that microcode-level optimizers in the CPU can optimize this repeated call to the function invoked by the BINARY_MULTIPLY opcode (for example, the repeated call may get more hits in the CPU's branch predictor).
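This "same opcodes, different order" observation can be checked programmatically. A minimal sketch using dis.get_instructions, which accepts a source string directly; note that on CPython 3.11+ the names differ (BINARY_OP instead of BINARY_MULTIPLY, plus a RESUME prologue), but the order-only difference holds either way:

```python
import dis

# Opcode names for each expression (dis compiles the string for us).
ops_a = [ins.opname for ins in dis.get_instructions("2 * i * i")]
ops_b = [ins.opname for ins in dis.get_instructions("2 * (i * i)")]

print(ops_a)
print(ops_b)

# Same multiset of opcodes, emitted in a different order.
assert sorted(ops_a) == sorted(ops_b)
assert ops_a != ops_b
```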

Anyway, even if it is not exactly that, I would still bet my chips that what happens is some optimization at the microcode level of the CPU that gets triggered in the second case.

This is exactly the kind of thing you will rarely worry about even when coding in C, let alone in Python. And in that case, before saying "wow, let me use inline assembler", it would make more sense to look for high-level solutions that use the computer's resources more appropriately (code that uses the GPU, or the CPU's vector processing units, for example), which would give a gain orders of magnitude greater than micro-optimizing a single operation. (And in the case of Python, the "first" remedy will always be to use NumPy.)
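As an illustration of that "use NumPy" remedy, here is a minimal sketch (assuming NumPy is installed) that computes 2*i*i for a whole range of values in one vectorized expression, instead of running per-element bytecode in a Python loop:

```python
import numpy as np

i = np.arange(1000)   # 0, 1, ..., 999
result = 2 * i * i    # one vectorized pass over the whole array

# Same values the pure-Python loop would produce:
assert result[3] == 2 * 3 * 3
print(result[:5])  # first five values: 0, 2, 8, 18, 32
```

For large arrays, the loop overhead (and this whole bytecode-ordering question) disappears into a single C-level operation.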

To remove any remaining doubt about whether it is something inside the Python code itself, one could also check the C code that gets called for BINARY_MULTIPLY, which certainly has optimizations for when both operands are Python ints, but nothing beyond that (there is no extra optimization for when one of the operands is 2, for example; the runtime still has to call the code behind int.__mul__). In other words, I am talking about optimizations that sit between the VM finding the opcode and the call to int.__mul__.

By the way, the difference really is so small that other changes can flip the result; see what happens when I measure the times for the functions a and b above:

In [59]: i = 10                                                                                          

In [60]: %timeit a()                                                                                     
119 ns ± 1.71 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

In [61]: %timeit b()                                                                                     
123 ns ± 0.725 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

That is, on this machine, the 2 * i * i form is faster!
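To reduce the influence of that machine noise, a minimal sketch of the same comparison using timeit.repeat and taking the minimum of several runs (the number and repeat values here are arbitrary choices):

```python
from timeit import repeat

# min() over several runs filters out slowdowns caused by other
# processes; it approximates the best the machine can do.
best_a = min(repeat("2 * i * i", setup="i = 10",
                    number=100_000, repeat=5))
best_b = min(repeat("2 * (i * i)", setup="i = 10",
                    number=100_000, repeat=5))

print(f"2 * i * i:   {best_a:.6f} s")
print(f"2 * (i * i): {best_b:.6f} s")
# Which form wins can vary from machine to machine and run to run.
```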

03.01.2019 / 18:29