Optimization suggestion
I noticed that PB 4 shows some weird behaviour in a special case, and I think it shouldn't be that hard to optimize it.
This code:
a = b
Generates this asm code:
push b
pop a
While this code:
a = b+0
Generates this asm code:
mov ebx, b
mov a, ebx
(Obviously the second one is faster.)
Why do you say it is faster?
If I remember correctly, push and pop are 4 clock cycles each on 32-bit architecture, while mov is 19 + offset calc. The only issue is that branch prediction will not "reset" on pop and push.
I may not help with your coding
Just ask about mental issues!
http://www.lulu.com/spotlight/kingwolf
http://www.sen3.net
Code: Select all
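a=0
b=0
; Time 30 million plain assignments (a=b)
t=ElapsedMilliseconds()
For m=1 To 3
  For n=1 To 10000000
    a=b
  Next
Next
t=ElapsedMilliseconds()-t
; Time the same loops using a=b+0
t2=ElapsedMilliseconds()
For m=1 To 3
  For n=1 To 10000000
    a=b+0
  Next
Next
t2=ElapsedMilliseconds()-t2
; Show both times in milliseconds: a=b first, then a=b+0
MessageRequester("",Str(t)+" "+Str(t2))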
I would say that either PB has a reason for it or PB will fix it (especially now that you've mentioned it). (I see about a 15-20% difference on my AMD chip.)
As for optimisation, I would suggest that rather than doing tricky things like your second method, which might not be faster after the next incremental compiler update if something under the hood changes, you add inline ASM yourself. Then no matter what PB does or changes, you know that your code does things the way you want it to.
Personally my ASM skills suck, so I'll just take the perf hit till PB addresses it. I don't really want weird things in my code that perform better only because of a peculiarity of the compiler (in the version current when the code was written).
I'm glad you have the skills to spot this though; it means that many people may benefit from a perf increase in a near-future version.
Keep looking!
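A minimal sketch of the inline ASM route suggested above. It assumes a 32-bit x86 PureBasic compiler that supports the EnableASM/DisableASM directives and variables declared as longs; it has not been checked against every PB version. It uses eax rather than ebx, on the understanding that eax is one of the volatile registers (eax, ecx, edx) that inline ASM may overwrite without saving.

```purebasic
; Sketch only: copy b to a with two mov instructions instead of push/pop.
; Assumption: 32-bit x86 target, both variables are longs (.l).
a.l = 0
b.l = 5

EnableASM
  MOV eax, b   ; load b into eax (volatile register, no need to preserve it)
  MOV a, eax   ; store eax into a
DisableASM

Debug a        ; shows the value of a in the debugger output
```

On a 64-bit compiler the register and variable sizes would have to be adapted, so treat this as an illustration rather than drop-in code.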

Paul Dwyer
“In nature, it’s not the strongest nor the most intelligent who survives. It’s the most adaptable to change” - Charles Darwin
“If you can't explain it to a six-year old you really don't understand it yourself.” - Albert Einstein
Derek wrote: Run with debugger off, second one is definitely faster.
Code: Select all
a=0
b=0
t=ElapsedMilliseconds()
For m=1 To 3
  For n=1 To 10000000
    a=b
  Next
Next
t=ElapsedMilliseconds()-t
t2=ElapsedMilliseconds()
For m=1 To 3
  For n=1 To 10000000
    a=b+0
  Next
Next
t2=ElapsedMilliseconds()-t2
MessageRequester("",Str(t)+" "+Str(t2))
On my machine (a Dell Precision T3400) there's no real difference between the two versions. In some runs the first one wins, in others the second. The difference between the two is marginal.
Anyway, isn't this example a bit futile? It only shows once again that PureBasic is not an optimising compiler, since:
- an optimising compiler will detect that 'a' and 'b' don't change value within the loops
- so by using constant folding it should strip the assignment statements and merely insert the values for 'a' and 'b'
- the next step would be to detect that the loop controlling variables 'm' and 'n' are used for loop control only and that they are not used for anything else and are not changed inside the loops either.
- inside the loops, nothing else happens that would change the state of any other variable
- taken together, this allows an optimising compiler to determine that there is no 'raison d'être' for the loops, so they can be dropped.
The bottom line: except for the timing measurement lines, for a good optimising compiler this code reduces to two assignment statements in the object code:
Code: Select all
a=0
b=0
Impressive speed difference!
On my dual-core T5200 @ 1.6 GHz notebook I get the following results:
Code: Select all
328 vs 110 ( a=b vs a=b+0 )
If I temporarily use +0 throughout my source, as in the topic example, to speed up the code, could that cause problems anywhere (e.g. overwriting registers like eax, ebx, ...)?
@Fred and PureTeam:
Is there any way to implement this extreme speed optimisation in the 4.20 final? Thanks.
va!n aka Thorsten
Intel i7-980X Extreme Edition, 12 GB DDR3, Radeon 5870 2GB, Windows7 x64,
My results on Intel Pentium D (3.4 GHz, DualCore):
Code: Select all
141 vs 62
(a=b vs a=b+0)
PB 4.30
Code: Select all
onErrorGoto(?Fred)
Interestingly, I get much closer results when I swap the position of the functions around. It could be CPU caching, but I also seem to remember that some functions were faster when placed at the top of the code (hard to believe, I know, but I came across it in another thread). Make the a=b+0 bit the same for both functions and see if you get the same time...
Also, you might want to increase the loop count, as the precision of ElapsedMilliseconds() is not too good, 16 ms or so I think.
But yes, I get a 20-30% increase in speed too.
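A rough sketch of the control run described above: both timed loops execute the same a=b+0 statement, so any gap between the two numbers comes from position in the code or timer noise, not from the statement itself. The constant #Loops and its value are assumptions picked for illustration (about three times the original 30 million iterations), to keep each run well above the timer resolution.

```purebasic
; Sketch only: identical statement in both timed loops, larger loop count.
#Loops = 100000000   ; assumed value, adjust for your machine

a = 0
b = 0

; First timed loop
t = ElapsedMilliseconds()
For n = 1 To #Loops
  a = b + 0
Next
t = ElapsedMilliseconds() - t

; Second timed loop, same body as the first
t2 = ElapsedMilliseconds()
For n = 1 To #Loops
  a = b + 0
Next
t2 = ElapsedMilliseconds() - t2

MessageRequester("", Str(t) + " " + Str(t2))
```

As with the original test, run it with the debugger off. If the two times still differ noticeably, the order of the loops matters on that machine and the a=b versus a=b+0 comparison should be repeated with the loops swapped.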
Paul Dwyer
“In nature, it’s not the strongest nor the most intelligent who survives. It’s the most adaptable to change” - Charles Darwin
“If you can't explain it to a six-year old you really don't understand it yourself.” - Albert Einstein