Position of function in source code affects performance!!

Kaeru Gaman · Post by **Kaeru Gaman** » Mon Aug 20, 2007 1:47 am

if you read technicorn's post completely, you'll see that the effect of different alignments differs from machine to machine.

on his machine Align8 has the maximum effect; on others it may be Align16,
which is the worst on his machine.

all you can say is that Align4 is an improvement over no alignment at all,
but I would not like the compiler adding NOPs at each and every jump-in point (JIP),
bloating the exe and slowing down the code that runs through the padding before the JIP,
even if it is in an event loop that does not need it.
the compiler is not able to make such decisions.

I understand that it is inconvenient for your current problem.

dioxin · Post by **dioxin** » Mon Aug 20, 2007 12:10 pm

This is an alignment problem. Alignment of data and code is the top optimization you can make to your program. Just inserting a few NOPs to improve alignment can give a 20% improvement in speed in tight loops.

With Intel-type CPUs, data should be aligned on a boundary whose address is exactly divisible by the size of the data, with one exception. So 2-byte data should be on a 2-byte boundary, 4-byte data on a 4-byte boundary, 8-byte data on an 8-byte boundary and so on. The exception is the FPU EXT data type, which is 10 bytes long, and AMD say this should be on an 8-byte boundary.
The reason is simple. If a single item of data straddles a cache-line boundary then the CPU must make 2 data fetches, which takes twice as long. By following the above rule, only 1 fetch will be required for any item of data.
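
To make the rule above concrete, here is a minimal sketch in C, not anything from the AMD guide. The function names are made up for illustration and the 64-byte cache-line size is an assumption (it is common, but it depends on the CPU).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed cache-line size; 64 bytes is common, but check your CPU. */
#define CACHE_LINE 64u

/* Nonzero if the item occupying [addr, addr + size) touches two cache lines. */
int straddles_line(uintptr_t addr, size_t size)
{
    assert(size > 0);
    return (addr / CACHE_LINE) != ((addr + size - 1) / CACHE_LINE);
}

/* Nonzero if addr is exactly divisible by size (the rule described above). */
int naturally_aligned(uintptr_t addr, size_t size)
{
    assert(size > 0);
    return addr % size == 0;
}
```

For example, a 4-byte value at offset 62 covers bytes 62 to 65, so `straddles_line(62, 4)` is nonzero and two fetches would be needed. At offset 60 the same value is naturally aligned and stays inside one line.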

For code the alignment is more complicated. Code isn't 4-byte aligned. A CPU will fetch instructions in blocks of typically 16 bytes and decode them in parallel to save time. Single instructions can be many bytes long. If an instruction straddles the 16-byte line then that instruction cannot be executed in parallel with the others, as the whole instruction isn't available until the next block is fetched. In AMD processors this would convert a normally direct-decoded instruction (fast dedicated hardware) to a vector-decoded instruction (slower microcode).
Also, in a tight loop, if a branch takes place to an instruction toward the end of a 16-byte block then the preceding instructions in that block are fetched but not decoded, as they aren't needed. The CPU then fetches the next 16-byte line immediately. It's better if the branch target is toward the start of the 16 bytes, so that one fetch/decode will decode more useful instructions before needing to fetch again.
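
The address arithmetic behind this can be sketched in C. This is only an illustration: the function names are hypothetical and the 16-byte block size is the "typical" figure mentioned above, not a value that holds for every CPU.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed fetch-block size, taken from the "typically 16 bytes" above. */
#define FETCH_BLOCK 16u

/* Padding bytes (e.g. NOPs) needed to move addr up to the next multiple
   of align. align must be a power of two. */
unsigned padding_to_align(uintptr_t addr, unsigned align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (unsigned)((align - (addr & (align - 1))) & (align - 1));
}

/* Bytes from a branch target to the end of its fetch block, i.e. how much
   of that fetch can hold instructions that are actually wanted. */
unsigned useful_bytes_in_block(uintptr_t target)
{
    return FETCH_BLOCK - (unsigned)(target % FETCH_BLOCK);
}

/* Nonzero if an instruction of len bytes at addr crosses a block boundary. */
int crosses_block(uintptr_t addr, unsigned len)
{
    assert(len > 0);
    return (addr / FETCH_BLOCK) != ((addr + len - 1) / FETCH_BLOCK);
}
```

A branch target at 0x100E leaves only 2 useful bytes in its block, while 2 bytes of padding (`padding_to_align(0x100E, 16)`) would move it to 0x1010 and give the full 16. That padding is the size cost the earlier post objects to.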

For more details see the Athlon optimization guide:
http://www.amd.com/us-en/assets/content ... /22007.pdf

A compiler could certainly do some of these optimisations. Maybe there should be an option to optimise for speed or size.
For size, don't align any code.
For speed, align everything at the expense of a slightly larger EXE.

Paul.

pdwyer · Post by **pdwyer** » Mon Aug 20, 2007 12:58 pm

Using technicorn's code, my work Intel Core Duo gave pretty much the same numbers. My home AMD dual core gives:

Attempt: 1
ExecMax: 735
ExecMin: 547

Attempt: 2
ExecMax: 718
ExecMin: 485

which means...?

technicorn · Post by **technicorn** » Mon Aug 20, 2007 1:31 pm

Hi pdwyer,

have a look at the code-alignment on which the minimum execution time occurs.

And having different times for the first and second attempt is completely normal
for a multitasking OS.

Just moving the mouse generates thousands of interrupts that have to be serviced.

@dioxin
With that nick, I hope that just reading your post isn't toxic!?

But seriously,
as the best/worst alignment differs so much between CPUs, it would be much more effective
to have better pairing of instructions in the code.

As modern CPUs have three or more execution units, they can execute several instructions at once,
but only if they are not interdependent.

Something like this:

XOR eax,eax
MOV [var],eax

gives small code but is very bad:
it has to wait for the XOR result before it can write it to [var]

MOV [var],0

gives more code bytes, but is much quicker.
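
The same interdependence idea can be sketched at a higher level in C. This is a rough illustration with made-up function names, not a benchmark: an optimising compiler may rewrite either loop, and whether the second form is actually faster depends on the CPU.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One accumulator: every add depends on the result of the previous add. */
uint32_t sum_single(const uint32_t *v, size_t n)
{
    assert(v != NULL || n == 0);
    uint32_t s = 0;
    for (size_t i = 0; i < n; i++)
        s += v[i];
    return s;
}

/* Two accumulators: the two adds in each iteration do not depend on each
   other, so a CPU with several execution units may overlap them. */
uint32_t sum_paired(const uint32_t *v, size_t n)
{
    assert(v != NULL || n == 0);
    uint32_t s0 = 0, s1 = 0;
    size_t i = 0;
    for (; i + 1 < n; i += 2) {
        s0 += v[i];
        s1 += v[i + 1];
    }
    if (i < n)          /* odd element count: pick up the last item */
        s0 += v[i];
    return s0 + s1;
}
```

Both functions return the same value because unsigned addition wraps in a well-defined way; only the dependency chain between the adds differs.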

But it's not so easy to shuffle bigger blocks of code around for better performance
without losing its functionality.