Position of function in source code affects performance!!

pdwyer
Addict
Posts: 2813
Joined: Tue May 08, 2007 1:27 pm
Location: Chiba, Japan

Position of function in source code affects performance!!

Post by pdwyer »

I'm still not sure whether I'm seeing this correctly or not.

I was trying to optimise a function, so I copied and pasted it under a different name and took out some string manipulation, and found it ran slower (as an exe, not in the IDE/debugger). Scratching my head about why it got slower when all I did was take some code out, I eventually tried swapping the positions of the two functions in the source code (getting desperate, in other words) and was surprised to see that the performance reversed. :shock:

Whichever function was higher in the source code was about 7-8% more efficient!!!

I must be doing something wrong, but I'm just copy/pasting... :?

Has anyone else seen this? Is this even possible?
Paul Dwyer

“In nature, it’s not the strongest nor the most intelligent who survives. It’s the most adaptable to change” - Charles Darwin
“If you can't explain it to a six-year old you really don't understand it yourself.” - Albert Einstein
Derek
Addict
Posts: 2354
Joined: Wed Apr 07, 2004 12:51 am
Location: England

Post by Derek »

This is a long shot, but the only thing I can think of is that the processor is unable to cache all of the program and has to load parts into the cache, or maybe that a jump/call has to be recalculated.

But if the program is only a small one, then that is obviously not going to be the case. :?
pdwyer
Addict
Posts: 2813
Joined: Tue May 08, 2007 1:27 pm
Location: Chiba, Japan

Post by pdwyer »

It's only a couple of hundred lines, no modules.

Tests on other sample code haven't been consistent with this pattern, but I am seeing some slight differences here and there.
Paul Dwyer

Derek
Addict
Posts: 2354
Joined: Wed Apr 07, 2004 12:51 am
Location: England

Post by Derek »

I suppose even a miscalculated jump etc. would not equate to a 7-8% performance loss. Have you got some code we can look at?
xgp
Enthusiast
Posts: 128
Joined: Mon Jun 13, 2005 6:03 pm

Post by xgp »

Maybe you were running fewer applications at the time you tested the performance?
Trond
Always Here
Posts: 7446
Joined: Mon Sep 22, 2003 6:45 pm
Location: Norway

Post by Trond »

I have also noticed this. If my program is like this:

procedure 1
procedure 2

test speed of procedure 1
test speed of procedure 2

and procedure 2 wins, if I move them around like this:

procedure 2
procedure 1

test speed of procedure 1
test speed of procedure 2

then procedure 1 will suddenly win.

When it happens, the difference in speed is very consistent across runs, and about 7-10%.
Derek
Addict
Posts: 2354
Joined: Wed Apr 07, 2004 12:51 am
Location: England

Post by Derek »

Code:

Procedure one()
For n=1 To 1000000
Next
EndProcedure
Procedure two()
For n=1 To 1000000
Next
EndProcedure

t=ElapsedMilliseconds()
one()
t=ElapsedMilliseconds()-t
MessageRequester("",Str(t))

t=ElapsedMilliseconds()
two()
t=ElapsedMilliseconds()-t
MessageRequester("",Str(t))
With the debugger on, the first procedure (whichever one it is) takes 31 and the second one 78, but without the debugger they both take 15, whichever one is called first.

Running exes with the number changed to 10000000 does show a slowdown depending on which one is called first. Very strange. :?
pdwyer
Addict
Posts: 2813
Joined: Tue May 08, 2007 1:27 pm
Location: Chiba, Japan

Post by pdwyer »

thank god, perhaps I'm not going mad :)

maybe the compiler is assigning variables to registers on a first come, first served basis for the whole app... although I think compilers tend to scan from the bottom of the code, so perhaps I'm speaking out of my arse :P

It'd be nice to know a bit more about how it works so that I could put different functions in different places.
Paul Dwyer

Kaeru Gaman
Addict
Posts: 4826
Joined: Sun Mar 19, 2006 1:57 pm
Location: Germany

Post by Kaeru Gaman »

one reason for different performance in different places is the ALIGN4 issue...

if a jump-in point of the code isn't at an address that is a multiple of 4
(4 bytes = 1 long = the base unit of everything in a 32-bit system),
it is slower than a jump-in point that exactly matches a multiple of 4.

thus, two loops with exactly the same machine code can have different performance,
if the jump-in point of one aligns to 4 and the other does not.
oh... and have a nice day.
pdwyer
Addict
Posts: 2813
Joined: Tue May 08, 2007 1:27 pm
Location: Chiba, Japan

Post by pdwyer »

so you mean that rather than the position in the source code, it's the order they appear in that may affect their address location? putting dummy functions in might affect this too, then, rather than just moving one up?

basically, I'm not sure I follow. is there a known bug? it sounds like the compiler could just set things up so that the function addresses are on their 32-bit boundaries (or am I just showing that I have no clue :oops: )
Paul Dwyer

Kaeru Gaman
Addict
Posts: 4826
Joined: Sun Mar 19, 2006 1:57 pm
Location: Germany

Post by Kaeru Gaman »

> sounds like the compiler could just set things up so that the function addresses are on their 32-bit boundaries
the compiler could, but most times it is such a minimal difference
that it is not worth keeping an eye on it in general.

you can set it yourself by using the !Align 4 ASM command,
but I'm not sure if it helps with For/Next,
since it depends how many instructions are placed before the jump-in point.

when you code everything in ASM, you can make intensive use of !Align 4.

but in 99.9999% of cases it's much, much more important how you optimize the code inside the loop.
dumb or smart coding there makes a hundred or even a thousand times bigger difference than Align4.
oh... and have a nice day.
Derek
Addict
Posts: 2354
Joined: Wed Apr 07, 2004 12:51 am
Location: England

Post by Derek »

Kaeru Gaman wrote: in 99.9999% of cases it's much, much more important how you optimize the code inside the loop.
dumb or smart coding there makes a hundred or even a thousand times bigger difference than Align4.
Exactly, the example I posted above wouldn't exist in a real-world program. Functions and procedures are never that simple!
jear
User
Posts: 20
Joined: Sun Dec 28, 2003 12:27 am
Location: Ammerland

Another code to play with

Post by jear »

Code:

#num = 1000000    ; for debugger version
;#num = 100000000  ; for compiled version

Define.s out

Procedure one()
  Protected dummy.l
  For n = 1 To #num : dummy = n : Next
  Delay(0) 
EndProcedure 

Procedure two() 
  Protected dummy.l
  For n = 1 To #num : dummy = n : Next
  Delay(0)  
EndProcedure 

out = "@one" + #tab$ + "@two" + #lf$
out + Str(@one()%4) + #tab$ + Str(@two()%4) + #lf$ + #lf$

For idx = 1 To 30
  t=ElapsedMilliseconds() 
  one() 
  t=ElapsedMilliseconds()-t 
  out + Str(t) + Space(2) 
  
  t=ElapsedMilliseconds() 
  two() 
  t=ElapsedMilliseconds()-t 
  out + #tab$ + Str(t) + #lf$
Next

MessageRequester("", out)

End
technicorn
Enthusiast
Posts: 105
Joined: Wed Jan 18, 2006 7:40 pm
Location: Hamburg

Post by technicorn »

This speed change is more a code-alignment issue than a caching problem,
at least for small pieces of executed code, which has nothing to do
with the overall size of your program.
Even if your prog is 1 MB+, if you are running in a short loop,
only the code just executed is cached.

But code alignment has a far bigger impact.
If you align to a multiple of 4, 8 or 16 bytes, a loop on a Pentium+ will run about 1.5-2 times faster, same for procedure entries,
but which alignment works best also depends on the CPU model.

You can test it by adding one or more "!NOP" before a loop, for example,
without changing anything else in your program,
but as you don't know the alignment of the generated code,
you have to try between one and 15 NOPs and test the speed each time.

Also, data misalignment can slow down your program dramatically,
for example:

Global a.b, a2.l

Procedure p1()
Protected a.b, a2.l ....
...
...
EndProcedure


PB handles this by grouping variables of different sizes,
or, for local variables, by always using a minimum of 4 bytes.

But you should avoid something like this:

Structure MyStruc
  a.b
  a2.l
EndStructure

Here, everything after a.b will be misaligned,
so you have to group such fields at the end of the structure yourself if possible,
or always use long if size doesn't matter.

The problem is, you can't automate the code alignment, at least not most of the time.

For/Next:
There is init code before the actual loop, so only the init would be aligned,
not the loop itself.

Procedures:
They are inserted as assembler macros at the end of the code,
so aligning the declaration would not help,
as the alignment would not travel with the macro but stay at the place of the declaration.
And for procedures there's a second problem: the loop that clears
the stack for local variables. If your procedure has many protected variables,
an unaligned clear loop can cause a longer delay than an
unaligned procedure entry address.

The only loops where alignment works are:
ForEach / Next
While / Wend
Repeat / Until

Just did some extensive testing and it works...
not.

That is, when you use assembler macros to adjust the alignment,
the assembler has to calculate the number of NOPs to insert
to align the loop, but that shifts code around and changes the addresses of
jump destinations, which can change a short jump to a long jump and
vice versa.
So the assembler needs another iteration over the code, which changes
everything again and again and again...
So the assembler gives up after a while and uses whatever
code results, which may leave some loops unaligned.

You can try this program to see that the alignment is not what you
wanted it to be.
It first aligns the code to always be on a multiple of 16 and then
misaligns it in a range of 1-16 bytes. At least it should do that,
but as you can see from the address of the loop start, it does not:

Code:

; Assembler macro to align code on a given boundary, 
; aligning loop starts on 4/8/16 byte boundaries can speed up the code 
; by about 1.5 to 2 times! 

! macro codealignjmp value 
! { 
!   local dest
! 
!   if ((value - 1) - ((($ - $$) + value - 1) mod value)) > 3 
!     jmp dest 
!   end if 
! 
!   rept value
!   \{ 
!     if ($ - $$) mod value 
!       nop 
!     end if 
!   \} 
!   dest:
! }

! macro codealign value 
! { 
!   rept value 
!   \{ 
!     if ($ - $$) mod value 
!       nop 
!     end if 
!   \} 
! }

!macro pb_codealign value {
!  local dest
!  jmp short dest
!  rb (value-1) - ($-PureBasicStart + value-1) mod value
!dest: }


Macro AlignmentTiming(alignmentValue)
  i = 0
  tStart2 = ElapsedMilliseconds(): Repeat: tStart = ElapsedMilliseconds(): Until tStart <> tStart2
  !codealignjmp 16 ; Align to a multiple of 16
  !rept alignmentValue { NOP } ; Now (mis)align to the wanted value
  loopstart#alignmentValue:
  Repeat
   i + 1
  Until i = #iEnd
  tExec = ElapsedMilliseconds() - tStart
  If tExec < tExecMin: tExecMin = tExec: EndIf
  If tExec > tExecMax: tExecMax = tExec: EndIf
  PrintN("Align " + RSet(Str(alignmentValue), 2) + ": " + RSet(Str(tExec), 5) + ", Loopaddr: " + RSet(Hex(?loopStart#alignmentValue), 8, "0"))
EndMacro

#iEnd = 200000000
Define i.l, tStart.l, tStart2.l, tExec.l, txt.s
Define n.l
Define tExecMin.l, tExecMax.l

OpenConsole()

For n = 1 To 2
  tExecMax = 0
  tExecMin = #MAXLONG
  PrintN("Attempt: " + Str(n))
  AlignmentTiming(1)
  AlignmentTiming(2)
  AlignmentTiming(3)
  AlignmentTiming(4)
  AlignmentTiming(5)
  AlignmentTiming(6)
  AlignmentTiming(7)
  AlignmentTiming(8)
  AlignmentTiming(9)
  AlignmentTiming(10)
  AlignmentTiming(11)
  AlignmentTiming(12)
  AlignmentTiming(13)
  AlignmentTiming(14)
  AlignmentTiming(15)
  AlignmentTiming(16)
  PrintN("")
  PrintN("ExecMax: " + Str(tExecMax))
  PrintN("ExecMin: " + Str(tExecMin))
  PrintN("")
  PrintN("")
Next n

PrintN("Press key to quit")
Input()
I have an AMD AthlonXP 1.4GHz, and an alignment of 8 gives the best results,
an alignment of 16 the worst!?!
pdwyer
Addict
Posts: 2813
Joined: Tue May 08, 2007 1:27 pm
Location: Chiba, Japan

Post by pdwyer »

Kaeru Gaman wrote: but in 99.9999% of cases it's much, much more important how you optimize the code inside the loop.
dumb or smart coding there makes a hundred or even a thousand times bigger difference than Align4.
Maybe, but in this case I had a function with about 10-15 areas that had some possibilities for minor (10%) improvement. Since the function was complex, I created a new copy with a new name and tweaked parts one at a time, testing the output as I went. Testing one change ended up reducing performance, which seemed odd but not impossible. I reverted the change and found that even with the same code the performance was lower.

It's very frustrating to attempt minor improvement tests in these conditions. In the reverse case you could think that you've made a minor improvement, only to find that another change affects the alignment and performance drops again.

I would have thought that code alignment is the job of the compiler; I don't touch ASM and don't really intend to.

Sprinkling my code with NOPs to align it sounds like a silly workaround to patch a compiler shortcoming.

Is code alignment in a BASIC compiler the programmer's job? I wouldn't have thought so.
Paul Dwyer
