I've set up Visual Studio to compile Visual C++ with /Ox
and compiled this code (the rest of the program is omitted for simplicity).
union { unsigned long long u64 ; unsigned short u16[4] ; } x ;
union { unsigned u32 ; unsigned short u16[2] ; } i ;
i.u16[0] -= x.u16[3] ;
According to the disassembly, i was held entirely in ecx and x was in memory. I expected the compiler to generate this assembly:
sub cx , word ptr [x+6]
But what it actually generated was this, with two extra instructions in front that look unnecessary:
mov rax , qword ptr [x]
shr rax , 30h
sub cx , ax
That is, it loaded the value into a register, loaded more than it needed, and then had to shift the data into place. It is as if it were doing i.u16[0] -= (unsigned short)(x.u64 >> 48)! In addition, the data in rax is not used again afterwards (it gets overwritten), which makes both the shr and the mov pointless.
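To make the comparison concrete, here is a minimal sketch of the two ways of getting at the top 16-bit word. It is my own illustration, not compiler output. The helper names (word3_via_memory, word3_via_shift, sub_low_word) are hypothetical, and it assumes a little-endian target such as x86-64, where x.u16[3] sits at byte offset 6. It uses std::memcpy instead of the union so that the read is well defined in standard C++.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical helper: reads the 16-bit word at byte offset 6, which is
// what `word ptr [x+6]` (i.e. x.u16[3]) refers to on a little-endian target.
inline std::uint16_t word3_via_memory(std::uint64_t u64) {
    std::uint16_t w;
    std::memcpy(&w, reinterpret_cast<const unsigned char*>(&u64) + 6, sizeof w);
    return w;
}

// Hypothetical helper: the form the generated code corresponds to,
// i.e. load all 64 bits, shift right by 48 (30h), keep the low 16 bits.
inline std::uint16_t word3_via_shift(std::uint64_t u64) {
    return static_cast<std::uint16_t>(u64 >> 48);
}

// Hypothetical helper: mimics `i.u16[0] -= x.u16[3]` on a 32-bit value.
// Only the low 16 bits change (with wraparound); the high 16 bits are kept,
// which matches a 16-bit `sub cx, ...` on ecx.
inline std::uint32_t sub_low_word(std::uint32_t i, std::uint64_t x) {
    std::uint16_t lo = static_cast<std::uint16_t>(i);
    lo = static_cast<std::uint16_t>(lo - word3_via_shift(x));
    return (i & 0xFFFF0000u) | lo;
}
```

On a little-endian machine both word3 helpers should return the same value for any input, which is why the single sub with a memory operand looked sufficient to me.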
Why didn't it optimize this further? Is there some other setting I need in order to improve this code, where even a baby can see that operations could be dropped?