Position of function in source code affects performance!!


Post by Kaeru Gaman »

If you read technicorn's post completely, you'll see that the effect of different alignments differs from machine to machine.

On his machine Align8 has the maximum effect; on others it may be Align16,
which is the worst on his machine.

All you can say is that Align4 is an improvement over no alignment at all,
but I would not like the compiler adding NOPs at each and every jump-in point (JIP),
bloating the exe and slowing down execution before the JIP,
even in an event loop that does not need it.
The compiler is not able to make such decisions.
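
To get a feel for the size cost mentioned above, here is a rough sketch that lays out a few code chunks one after another and counts the filler bytes needed to start each one on an aligned address. The function name and the chunk sizes are made up for illustration; a real compiler's layout is more involved.

```python
def total_padding(chunk_sizes, alignment):
    """Count filler bytes (e.g. NOPs) needed so every chunk starts aligned.

    chunk_sizes: sizes in bytes of consecutive code chunks, each of which
                 begins with a jump-in point (hypothetical input).
    alignment:   required start alignment in bytes (1 means no alignment).
    """
    address = 0
    padding = 0
    for size in chunk_sizes:
        gap = (-address) % alignment  # bytes to the next aligned address
        padding += gap
        address += gap + size
    return padding


if __name__ == "__main__":
    sizes = [5, 7, 3]  # made-up chunk sizes
    for align in (1, 4, 8, 16):
        print(align, total_padding(sizes, align))
```

Under these assumptions the padding grows with the alignment, which is the trade-off between exe size and aligned jump targets.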

I understand that this is inconvenient for your actual problem.
oh... and have a nice day.

Post by dioxin »

This is an alignment problem. Alignment of data and code is the top optimization you can make to your program. Just inserting a few NOPs to improve alignment can give a 20% improvement in speed in tight loops.

With Intel-type (x86) CPUs, data should be aligned on a boundary whose address is exactly divisible by the size of the data, with one exception. So 2-byte data should be on a 2-byte boundary, 4-byte data on a 4-byte boundary, 8-byte data on an 8-byte boundary and so on. The exception is the FPU EXT data, which is 10 bytes long, and AMD say this should be on an 8-byte boundary.
The reason is simple. If a single item of data straddles a cache-line boundary then the CPU must make 2 data fetches, which takes about twice as long. By following the above rule, only 1 fetch will be required for any naturally aligned item of data.
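
The rule above can be checked with a little address arithmetic. This is only a sketch: the helper names are hypothetical, and the 64-byte cache line is an assumption (a common size on x86 CPUs, but it is not stated in the post).

```python
CACHE_LINE = 64  # assumed cache-line size in bytes


def is_naturally_aligned(address, size):
    """True if the address is exactly divisible by the size of the data."""
    return address % size == 0


def straddles_line(address, size, line=CACHE_LINE):
    """True if the item's first and last byte fall in different cache lines."""
    first_line = address // line
    last_line = (address + size - 1) // line
    return first_line != last_line


if __name__ == "__main__":
    print(straddles_line(62, 4))  # 4-byte item at a misaligned address
    print(straddles_line(60, 4))  # same item on a 4-byte boundary
```

With these definitions, a 2-, 4- or 8-byte item on its natural boundary never straddles a line. The 10-byte EXT case is different: on an 8-byte boundary it can still cross a line (for example at address 56 with 64-byte lines).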

For code the alignment is more complicated. It isn't simply a matter of 4-byte alignment. A CPU will fetch instructions in blocks of typically 16 bytes and decode them in parallel to save time. Single instructions can be many bytes long. If an instruction straddles the 16-byte line then that instruction cannot be executed in parallel with the others, as the whole instruction isn't available until the next block is fetched. In AMD processors this would convert a normally direct-decoded instruction (fast dedicated hardware) to a vector-decoded instruction (slower microcode).
Also, in a tight loop, if a branch takes place to an instruction toward the end of a 16-byte block then the preceding instructions in that 16 bytes are fetched but not decoded, as they aren't needed. The CPU then fetches the next 16-byte line immediately. It's better if the branch target is toward the start of the 16 bytes, so that one fetch/decode will decode more useful instructions before needing to fetch again.
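
As a sketch of the arithmetic behind this, the two helpers below compute how many filler bytes would move a branch target to the start of a fetch block, and how many bytes of the first fetch are actually usable when the target sits somewhere inside the block. The names are hypothetical, and the 16-byte block size is just the "typical" figure quoted above.

```python
FETCH_BLOCK = 16  # fetch block size in bytes, the typical figure quoted above


def padding_to_align(address, alignment=FETCH_BLOCK):
    """Filler bytes (e.g. NOPs) needed to reach the next multiple of alignment."""
    return (-address) % alignment


def useful_bytes_in_first_fetch(target, block=FETCH_BLOCK):
    """Bytes from the branch target to the end of the fetch block it lies in."""
    return block - (target % block)


if __name__ == "__main__":
    target = 0x100D  # made-up branch target near the end of a block
    print(useful_bytes_in_first_fetch(target))  # little useful code in this fetch
    print(padding_to_align(target))             # NOPs that would fix it
```

A target at the very start of a block gets all 16 bytes of its first fetch; a target 13 bytes in gets only 3.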

For more details see the Athlon optimization guide:
http://www.amd.com/us-en/assets/content ... /22007.pdf


A compiler could certainly do some of these optimisations. Maybe there should be an option to optimise for speed or for size.
For size, don't align any code.
For speed, align everything at the expense of a slightly larger EXE.

Paul.

Post by pdwyer »

Using technicorn's code, my work Intel Core Duo gave pretty much the same numbers. My home AMD dual core gives:



Attempt: 1

ExecMax: 735
ExecMin: 547

Attempt: 2


ExecMax: 718
ExecMin: 485

which means...?
Paul Dwyer


Post by technicorn »

Hi pdwyer,

have a look at the code alignment at which the minimum execution time occurs.

And having different times for the first and second attempt is completely normal
for a multitasking OS.

Just moving the mouse generates thousands of interrupts that have to be served.
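
Because of that noise, a common approach is to repeat the measurement and look at the minimum, which is the run least disturbed by the OS. A minimal sketch, loosely mirroring the ExecMin/ExecMax output above; `measure` and `work` are hypothetical names and this is not the original test code:

```python
import time


def measure(func, repeats=20):
    """Run func several times and return (min, max) elapsed seconds."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        samples.append(time.perf_counter() - start)
    return min(samples), max(samples)


def work():
    """Stand-in workload: a small tight loop."""
    total = 0
    for i in range(100_000):
        total += i
    return total


if __name__ == "__main__":
    exec_min, exec_max = measure(work)
    print("ExecMin:", exec_min)
    print("ExecMax:", exec_max)
```

The spread between the two values reflects interrupts and scheduling rather than the code itself, so the minimum is usually the better number to compare between alignments.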

@dioxin
With that nick, I hope that just reading your post isn't toxic!? :lol:

But seriously,
as the best/worst alignment differs so much between CPUs, it would be much more effective
to have a better pairing of instructions in the code.

As modern CPUs have three or more execution units, they can execute several instructions at once,
but only if the instructions do not depend on each other.

Something like this:

XOR eax,eax
MOV [var],eax

gives small code but is very bad,
because the store has to wait for the XOR result before it can write it to [var].

MOV [var],0

gives more code bytes, but is much quicker.

But it's not so easy to shuffle bigger blocks of code around for better performance
without losing their functionality.