jacdelad wrote: Fri Apr 08, 2022 7:42 am
because these are native sizes of processor registers or whatever (I'm not so deep into asm yet)?
That is often the problem. There are very different hardware implementations of x86/x64 processors "in the wild". Some are carefully tuned to minimize production cost (which also depends on the technology available at a certain time and place), while others are brute-force optimized for one specific, narrow use case (especially gaming setups). IMHO, if you want the maximum possible speed on every single architecture, you cannot achieve that with one universal binary. You would have to compile for at least two or three CPU "types" into one binary file, check at startup which one to use, and ignore the others, as sketched below.
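GCC actually supports exactly this pattern through function multi-versioning. A minimal sketch, assuming GCC 6 or newer on x86 with glibc; the function name and the chosen target list are just illustrations:

```c
#include <stddef.h>

/* GCC emits one clone per listed target plus a "default" fallback,
 * and an IFUNC resolver picks the best match for the running CPU at
 * load time, so a single binary serves several CPU generations. */
__attribute__((target_clones("avx2", "sse4.2", "default")))
void scale(float *dst, const float *src, size_t n, float factor)
{
    /* Plain loop; each clone gets vectorized for its own target. */
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i] * factor;
}
```

A plain `gcc -O2` is enough; the dispatch itself needs no extra flags.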
As we are talking about x86 or x64 architectures: there are cases where an 8-bit wide operation is the best choice on x86, while doing the same on an x64 platform would be really bad. The one single thing I keep in mind about speed is not the execution cycles of assembler instructions, but the memory requirement of a specific task. Creating a sine table to speed up sine calculations, for example, is a good trick, but be careful about the table size. On some architectures, once the table no longer fits into the fastest CPU cache, your performance will suddenly drop a lot due to "cache thrashing"; see the sketch below.
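To make that concrete, here is a minimal sketch of such a sine table in C. The size of 4096 floats (16 KiB) is an assumption chosen to fit comfortably into a typical 32 KiB L1 data cache; the names are made up:

```c
#include <math.h>
#include <stddef.h>

#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif

#define TABLE_BITS 12
#define TABLE_SIZE (1u << TABLE_BITS)   /* 4096 floats = 16 KiB */

static float sine_table[TABLE_SIZE];

/* Fill the table once at startup. */
static void init_sine_table(void)
{
    for (size_t i = 0; i < TABLE_SIZE; i++)
        sine_table[i] = (float)sin(2.0 * M_PI * (double)i / TABLE_SIZE);
}

/* Nearest-entry lookup; assumes a non-negative angle in radians.
 * The bit mask wraps the index, which matches the 2*pi period. */
static float fast_sin(float x)
{
    unsigned idx = (unsigned)(x * (float)(TABLE_SIZE / (2.0 * M_PI)));
    return sine_table[idx & (TABLE_SIZE - 1)];
}
```

Raising TABLE_BITS buys precision, but once the table starts competing with your hot data for L1, the "fast" routine can easily end up slower overall.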
I don't know whether the latest gcc deals with these runtime differences. Ironically, Microsoft's .NET architecture, with its IL, would be able to do it: the JIT can compile for the CPU it actually runs on.
As a general rule, keep the memory requirement (and the number of memory references!) of "fast" operations low. It is often more about the memory cache and not so much about the CPU cycles. But obviously you can construct an example on some architecture that "proves" the opposite ... The sketch below shows the cache effect in isolation.
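A small self-contained sketch of the effect (all names and sizes are illustrative assumptions): both loops perform exactly the same number of additions, but the strided loop lands on a fresh 64-byte cache line with every access, so it moves roughly sixteen times as much data through the cache and is typically several times slower on x86 hardware:

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (1 << 24)   /* 16M ints = 64 MiB, far bigger than any cache */
#define STRIDE 16     /* 16 * 4 bytes = one 64-byte cache line        */

static long sum_seq(const int *a)
{
    long s = 0;
    /* Streams through memory; the prefetcher keeps the cache fed. */
    for (size_t i = 0; i < N; i++)
        s += a[i];
    return s;
}

static long sum_strided(const int *a)
{
    long s = 0;
    /* Same N additions, but each access touches a new cache line,
     * so the whole 64 MiB is pulled through the cache 16 times. */
    for (size_t off = 0; off < STRIDE; off++)
        for (size_t i = off; i < N; i += STRIDE)
            s += a[i];
    return s;
}

int main(void)
{
    int *a = calloc(N, sizeof *a);
    if (!a) return 1;
    clock_t t0 = clock();
    long s1 = sum_seq(a);
    clock_t t1 = clock();
    long s2 = sum_strided(a);
    clock_t t2 = clock();
    printf("seq: %ld (%.3fs)  strided: %ld (%.3fs)\n",
           s1, (double)(t1 - t0) / CLOCKS_PER_SEC,
           s2, (double)(t2 - t1) / CLOCKS_PER_SEC);
    free(a);
    return 0;
}
```

The instruction counts are identical; only the memory access pattern differs, which is exactly the point.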
Don't worry too much: keep memory references in "fast" routines low, and if things go bad (too slow on customer machines), start looking into optimizing the code, but always find the real limitation first. It might be code size and memory cache rather than CPU cycles. In special cases it could be something completely different (FPU, GPU, instruction set or even drivers ...).
If you are not working in compiler design, don't worry too much: keep your memory references low in critical routines and let the compiler (gcc from v6 on) do its work.
PS: Keep in mind that early optimization is (often) the devil. On the overall architecture level, however, early optimization is crucial.