Page 1 of 2

Optimization suggestion

Posted: Sat Aug 12, 2006 8:08 pm
by Trond
I noticed that PB 4 shows some weird behaviour in a special case, and I think it shouldn't be that hard to optimize it.

This code:
a = b

Generates this asm code:
push b
pop a

While this code:
a = b+0

Generates this asm code:
mov ebx, b
mov a, ebx

(Obviously the second one is faster.)

Posted: Sun Aug 13, 2006 1:16 am
by Killswitch
Just to be clear, which one are you trying to optimise?

Posted: Sun Aug 13, 2006 9:47 am
by Trond
The second generated asm is faster, so PB should generate that code for the first expression as well. It's a bit strange that it uses push and pop for that expression but it refuses to use push and pop for more complicated expressions, instead it says "out of cpu registers".

Posted: Sat Feb 09, 2008 9:58 pm
by kinglestat
why you say is faster?
if I remember correctly pop and push are 4 clock cycles each on 32bit architecture, while move is 19+offset calc. The only issue is that branch prediction will not "reset" on pop and push

Posted: Sat Feb 09, 2008 10:03 pm
by Trond
why you say is faster?
1. Because it is faster. Just test it.
2. Push/pop is 4 memory accesses while mov/mov is 2 memory accesses.

Posted: Sun Feb 10, 2008 11:23 am
by Derek

Code: Select all

a=0
b=0

t=ElapsedMilliseconds()
For m=1 To 3
For n=1 To 10000000
a=b
Next
Next
t=ElapsedMilliseconds()-t

t2=ElapsedMilliseconds()
For m=1 To 3
For n=1 To 10000000
a=b+0
Next
Next
t2=ElapsedMilliseconds()-t2

MessageRequester("",Str(t)+" "+Str(t2))
Run with debugger off, second one is definately faster.

Posted: Sun Feb 10, 2008 12:49 pm
by pdwyer
I would say that either PB has a reason for it or PB would fix it (especially now that you've mentioned it) (I get about 15-20% on my AMD chip)

As for optimisation I would suggest, rather than do tricky things like you did in method 2 which might not be faster in the next incremental update to the compiler if something under the hood changes, add inline ASM yourself. Then no mater what PB does or changes you know that your code does things the way you want it to.

Personally my ASM skills suck so I'll just take the perf hit till PB addresses it. I don't really want wierd things in my code that perform better only because of a peculiarity with the compiler (in the version as at the time the code was written).

I'm glad you have the skills to spot this though, it means that many people may benefit from a perf increase in a near future version :) Keep looking!

Posted: Sun Feb 10, 2008 8:52 pm
by djes
You can't be sure that a pop will always be faster than a mov. Just try on different processors, and you will have surprises!

Posted: Sun Feb 10, 2008 9:57 pm
by Trond
If it's slower it would most likely be because of misalignment.

Edit:
You can't be sure that a pop will always be faster than a mov.
I said that mov is always faster than pop...

Posted: Mon Feb 11, 2008 10:05 am
by djes
:wink:

Posted: Thu Feb 21, 2008 5:58 am
by va!n
very interesting to see this different and its speed comparsion.
The a=b+0 version runs about 3-4 times faster as just only a=b.

Posted: Thu Feb 21, 2008 10:02 am
by dell_jockey
Derek wrote:

Code: Select all

a=0
b=0

t=ElapsedMilliseconds()
For m=1 To 3
For n=1 To 10000000
a=b
Next
Next
t=ElapsedMilliseconds()-t

t2=ElapsedMilliseconds()
For m=1 To 3
For n=1 To 10000000
a=b+0
Next
Next
t2=ElapsedMilliseconds()-t2

MessageRequester("",Str(t)+" "+Str(t2))
Run with debugger off, second one is definately faster.

on my machine (a Dell Precision T3400) there's no real difference in both version. Some runs the first one wins, some runs the other one wins. The difference between the two is marginal.

Whatever, isn't this example a bit futile? It only shows once again that PureBasic is not an optimising compiler, since:

- an optimising compiler will detect that 'a' and 'b' don't change value within the loops
- so by using constant folding it should strip the assignment statements and merely insert the values for 'a' and 'b'
- the next step would be to detect that the loop controlling variables 'm' and 'n' are used for loop control only and that they are not used for anything else and are not changed inside the loops either.
- inside the loops, nothing else happens that would change the state of any other variable
- taken together, this allows an optimizing compiler to determine that there is no 'raison d'ĂȘtre' for the loops, so they can be dropped.

The bottom line: except for the timing measurement lines, for a good optimising compiler this code reduces to two assignment statements in the object code:

Code: Select all

a=0
b=0
If anything beyond this is in the object file, it only shows that PB is not optimising code, but instead - as we know already - its speed is based on linking in optimised libraries.

Posted: Mon Feb 25, 2008 12:27 pm
by va!n
impressive speed different :!: :shock:

On my DualCore T5200 @ 1,6Ghz Notebook i get following results:

Code: Select all

328 vs 110    ( a=b    vs    a=b+0 )
When using in all my source temporary +0 like topic example, to speed up the code, will there be nowhere any problems? (i.e. overwritting registers like eax, ebx.... ?)

@Fred and PureTeam:
Any way to implent this extreme speed optimisation to the 4.20 final? thx

Posted: Mon Feb 25, 2008 1:03 pm
by AND51
My results on Intel Pentium D (3.4 GHz, DualCore):

Code: Select all

 141 vs 62
(a=b vs a=b+0)

Posted: Mon Feb 25, 2008 2:45 pm
by pdwyer
Interestingly I get much closer results when I swap the position of the functions around. Could be CPU caching but I also seem to remember that some functions were faster when placed at the top of code (hard to believe I know, but I came across in on another thread). Make the a=b+0 bit the same for both functions and see if you get the same time...

Also, you might want to increase the loop count as elapsedmilliseconds precision is not too good, 16ms or so I think.

But yes, I get 20-30% increase in speed too.