Page 1 of 1

Strange optimization timings

Posted: Sat May 03, 2008 1:41 pm
by Derek
I thought I'd test the speed difference between using the swap command and using a temporary variable so I tried this bit of code

(make sure you turn off the debugger first)

Code: Select all

min.f=100.55
mean.f=55.34
max.f=33.21

a=0
For m=1 To 10
t=ElapsedMilliseconds()
For n=1 To 100000000

If max<mean
  Swap max,mean
EndIf

If mean<min
  Swap mean,min
EndIf

If max<mean
  Swap max,mean
EndIf

Next
t=ElapsedMilliseconds()-t
a+t
Next
MessageRequester("",Str(a/10))

min.f=100.55
mean.f=55.34
max.f=33.21

a=0
For m=1 To 10
t=ElapsedMilliseconds()
For n=1 To 100000000

If max<mean
tq.f=max
max=mean
mean=tq
EndIf

If mean<min
tq.f=mean
mean=min
min=tq
EndIf

If max<mean
tq.f=max
max=mean
mean=tq
EndIf

Next
t=ElapsedMilliseconds()-t
a+t
Next
MessageRequester("",Str(a/10))
;970 885
;948 887
;962 895
;996 882
;1007 895
so it appears that the temp variable is faster, but then I thought I'd print both results at the same time to make it easier to see so I did this

Code: Select all

min.f=100.55
mean.f=55.34
max.f=33.21

a=0
For m=1 To 10
t=ElapsedMilliseconds()
For n=1 To 100000000

If max<mean
  Swap max,mean
EndIf

If mean<min
  Swap mean,min
EndIf

If max<mean
  Swap max,mean
EndIf

Next
t=ElapsedMilliseconds()-t
a+t
Next
a2=a

min.f=100.55
mean.f=55.34
max.f=33.21

a=0
For m=1 To 10
t=ElapsedMilliseconds()
For n=1 To 100000000

If max<mean
tq.f=max
max=mean
mean=tq
EndIf

If mean<min
tq.f=mean
mean=min
min=tq
EndIf

If max<mean
tq.f=max
max=mean
mean=tq
EndIf

Next
t=ElapsedMilliseconds()-t
a+t
Next
MessageRequester("",Str(a2/10)+" "+Str(a/10))
;970 885
;948 887
;962 895
;996 882
;1007 895

;770 896
;770 896
;768 898
;771 896
and now the swap routine is 25% faster but nothing in the timing loops has been altered and the messagerequester() was outside of the timing so why should the first routine get so much faster. Nothing changed?

Posted: Sat May 03, 2008 4:11 pm
by Trond
There's a lot of things like alignment and caching that can influence the results of such microbenchmarks.

I don't know what your processor is smoking, but here both tests are consistent, with the first one being slightly slower than the second one in each test.

Also, I don't see why it isn't implemented like one of these two ways:

Code: Select all

#Times = 700000000

T = ElapsedMilliseconds()
For I = 0 To #Times
  Swap min.f, max.f
Next
T1 = ElapsedMilliseconds()-T

T = ElapsedMilliseconds()
For I = 0 To #Times
  MOV eax, min
  MOV ebx, max
  MOV max, eax
  MOV min, ebx
Next
T2 = ElapsedMilliseconds()-T

T = ElapsedMilliseconds()
For I = 0 To #Times
  MOV eax, min
  FLD max
  MOV max, eax
  FSTP min
Next
T3 = ElapsedMilliseconds()-T

MessageRequester("", Str(T1) + ", " + Str(T2) + ", " + Str(T3))

Posted: Sat May 03, 2008 6:33 pm
by Derek
It wasn't really the actual code that I was puzzled about but the fact that by moving one messagerequester you could dramatically change a timing.

I thought about caching etc but seeing as the requester isn't contained in the loop then shouldn't the loop be executed in the same time as it hasn't changed size or been realigned.

Strange that yours works fine, perhaps I'll try it on my other computer and see what happens.

Posted: Sun May 04, 2008 10:30 am
by remi_meier
I think, someone should write a How-To about making speed tests...

The problem with your benchmarks is, that you have more jumps
in there than swaps. Every IF needs a jump, every loop needs a jump
and overall, there is more code doing other things than swapping.
For the jumps, the important point is the alignment, like Trond said.
And because there are that many jumps, a simple call to MessageReq..
can move a lot of the jumps out of the line.

Actually, there is no real way to check the speed of a Swap, because
modern processors are just too complex and behave very differently
according to the environment the Swap is in.

But to at least be able to say something, you just have to put 1000
Swaps within the loop. Like that most of the time is spent on the
Swaps and less on the jumps and you could actually say something
about the results.

But anyway, most of the time it's still the algorithm that slows down
a program and only real optimizations done by the big C/C++/etc.
compilers could actually help (like moving a repeated calculation out
of a loop and things like that...). It's possible to write fast programs
in PB, but you have to think more than in C :wink: (btw, C compilers
are not the only ones that are fast, but of course a lot of effort has
been put in those to make'em faster because of the large user base).

Posted: Sun May 04, 2008 11:20 am
by Trond
remi_meier is right, you have more other things than actual swaps.

Here is my best try at making a fair benchmark. Note these things:
- I run all of them in one run, which may not be completely fair, but I spin up the processor before the first one to not give it a disadvantage.
- I test nothing but swapping inside the loops
- I run the loops fewer times by using the asm repeat directive inside them. It will repeat the gives instruction similar to a macro (so they are not looped, but physically inserted in the exe multiple times).
- I tried to align the loops equally by using !align 4. (Note: this doesn't mean the jump targets are aligned because of the counter variable code in the for comes after the align but before the jump target. But they will at the very least be equally misaligned.)

Of course it's not sure that it's completely fair, but my solution is the fastest, which is a good indication, don't you think? :wink: It's very close to the native swap, though. A temp variable is over twice as slow as either.

You may have to adjust #Tries to your faster processor, but be careful or it will run a long time, as one more try means 1000 more repetitions of the test code.

Code: Select all

A.f
B.f
T.f


#Tries = 500000

time = GetTickCount_()+250
While GetTickCount_() < time
Wend

time = GetTickCount_()
!align 4
For U = 0 To #Tries
  !repeat 1000
    Swap A, B
  !end repeat
  !align 4
Next
time1 = GetTickCount_()-time

; This one loses
time = GetTickCount_()
!align 4
For U = 0 To #Tries
  !repeat 1000
    T = A
    A = B
    B = T
  !end repeat
  !align 4
Next
time2 = GetTickCount_()-time

; This one wins
time = GetTickCount_()
!align 4
For U = 0 To #Tries
  !repeat 1000
    MOV eax, A
    MOV ebx, B
    MOV B, eax
    MOV A, ebx
  !end repeat
  !align 4
Next
time3 = GetTickCount_()-time

time = GetTickCount_()
!align 4
For U = 0 To #Tries
  !repeat 1000
    MOV eax, A 
    FLD B
    MOV B, eax 
    FSTP A
  !end repeat
  !align 4
Next
time4 = GetTickCount_()-time

R.s = Str(time1) + #CRLF$
R + Str(time2) + #CRLF$
R + Str(time3) + #CRLF$
R + Str(time4)

MessageRequester("", R)

Posted: Sun May 04, 2008 12:23 pm
by remi_meier
That's more like it :)

My results are similar:
Swap: 1548
Temp Var: 2470
MOVs: 1524
FPU: 1524

Posted: Sun May 04, 2008 3:15 pm
by SCRJ
719
2297
1203
1312

Posted: Sun May 04, 2008 3:48 pm
by Derek
I understand what you are saying, but as I said it's not really the swaps etc that I was puzzled about it's the fact that by taking a messgerequestor out of the program and putting at the end of the program made the first part faster.

The thing that puzzles me is that the first part would in theory still be aligned in the same way as it comes BEFORE the messagerequestor, I can understand that the second part might change but on my mobile AMD the second part stays more or less the same but the first part gains a 25% speed boost.

On my core2duo I don't get the speed boost so presumably it's an AMD thing or something?

Posted: Sun May 04, 2008 4:15 pm
by remi_meier
Derek wrote:I understand what you are saying, but as I said it's not really the swaps etc that I was puzzled about it's the fact that by taking a messgerequestor out of the program and putting at the end of the program made the first part faster.
Yes, I know. Just wanted to say the things I said because there are
too many speed tests like this.
Derek wrote:On my core2duo I don't get the speed boost so presumably it's an AMD thing or something?
Well, as I said, today's CPUs aren't as simple as they were a few years
ago. There are many things going on. It could be, that AMDs jump
prediction had an advantage in this situation. It could also just be,
that you had different power-safe modes (SpeedStep, etc.). It's
also possible that the prefetching worked differently with AMD and
Core2Duo. And I'm sure there are a lot more things that could make
a difference.

And btw. the results of SCRJ tell us to not believe any of our results
so far. It's just not possible to say anything about such speed tests.
A simple MessageRequester() shows completely different results and
only AMD or Intel could tell us why :wink:

Posted: Sun May 04, 2008 4:40 pm
by Derek
remi_meier wrote:And btw. the results of SCRJ tell us to not believe any of our results so far. It's just not possible to say anything about such speed tests.
A simple MessageRequester() shows completely different results and only AMD or Intel could tell us why :wink:
I agree, unless a simple speed test shows a massive improvement then I don't usually change my code accordingly but when I got a 25% increase I thought I'd mention it just in case it was something other people got as well but in this case it doesn't appear so. :)