Optimization suggestion

Everything else that doesn't fall into one of the other PB categories.
Trond
Always Here
Always Here
Posts: 7446
Joined: Mon Sep 22, 2003 6:45 pm
Location: Norway

Optimization suggestion

Post by Trond »

I noticed that PB 4 shows some weird behaviour in a special case, and I think it shouldn't be that hard to optimize it.

This code:
a = b

Generates this asm code:
push b
pop a

While this code:
a = b+0

Generates this asm code:
mov ebx, b
mov a, ebx

(Obviously the second one is faster.)
Killswitch
Enthusiast
Enthusiast
Posts: 731
Joined: Wed Apr 21, 2004 7:12 pm

Post by Killswitch »

Just to be clear, which one are you trying to optimise?
~I see one problem with your reasoning: the fact is thats not a chicken~
Trond
Always Here
Always Here
Posts: 7446
Joined: Mon Sep 22, 2003 6:45 pm
Location: Norway

Post by Trond »

The second generated asm is faster, so PB should generate that code for the first expression as well. It's a bit strange that it uses push and pop for that expression but it refuses to use push and pop for more complicated expressions, instead it says "out of cpu registers".
kinglestat
Enthusiast
Enthusiast
Posts: 746
Joined: Fri Jul 14, 2006 8:53 pm
Location: Malta
Contact:

Post by kinglestat »

why you say is faster?
if I remember correctly pop and push are 4 clock cycles each on 32bit architecture, while move is 19+offset calc. The only issue is that branch prediction will not "reset" on pop and push
I may not help with your coding
Just ask about mental issues!

http://www.lulu.com/spotlight/kingwolf
http://www.sen3.net
Trond
Always Here
Always Here
Posts: 7446
Joined: Mon Sep 22, 2003 6:45 pm
Location: Norway

Post by Trond »

why you say is faster?
1. Because it is faster. Just test it.
2. Push/pop is 4 memory accesses while mov/mov is 2 memory accesses.
Derek
Addict
Addict
Posts: 2354
Joined: Wed Apr 07, 2004 12:51 am
Location: England

Post by Derek »

Code: Select all

a=0
b=0

t=ElapsedMilliseconds()
For m=1 To 3
For n=1 To 10000000
a=b
Next
Next
t=ElapsedMilliseconds()-t

t2=ElapsedMilliseconds()
For m=1 To 3
For n=1 To 10000000
a=b+0
Next
Next
t2=ElapsedMilliseconds()-t2

MessageRequester("",Str(t)+" "+Str(t2))
Run with debugger off, second one is definately faster.
User avatar
pdwyer
Addict
Addict
Posts: 2813
Joined: Tue May 08, 2007 1:27 pm
Location: Chiba, Japan

Post by pdwyer »

I would say that either PB has a reason for it or PB would fix it (especially now that you've mentioned it) (I get about 15-20% on my AMD chip)

As for optimisation I would suggest, rather than do tricky things like you did in method 2 which might not be faster in the next incremental update to the compiler if something under the hood changes, add inline ASM yourself. Then no mater what PB does or changes you know that your code does things the way you want it to.

Personally my ASM skills suck so I'll just take the perf hit till PB addresses it. I don't really want wierd things in my code that perform better only because of a peculiarity with the compiler (in the version as at the time the code was written).

I'm glad you have the skills to spot this though, it means that many people may benefit from a perf increase in a near future version :) Keep looking!
Paul Dwyer

“In nature, it’s not the strongest nor the most intelligent who survives. It’s the most adaptable to change” - Charles Darwin
“If you can't explain it to a six-year old you really don't understand it yourself.” - Albert Einstein
User avatar
djes
Addict
Addict
Posts: 1806
Joined: Sat Feb 19, 2005 2:46 pm
Location: Pas-de-Calais, France

Post by djes »

You can't be sure that a pop will always be faster than a mov. Just try on different processors, and you will have surprises!
Trond
Always Here
Always Here
Posts: 7446
Joined: Mon Sep 22, 2003 6:45 pm
Location: Norway

Post by Trond »

If it's slower it would most likely be because of misalignment.

Edit:
You can't be sure that a pop will always be faster than a mov.
I said that mov is always faster than pop...
User avatar
djes
Addict
Addict
Posts: 1806
Joined: Sat Feb 19, 2005 2:46 pm
Location: Pas-de-Calais, France

Post by djes »

:wink:
va!n
Addict
Addict
Posts: 1104
Joined: Wed Apr 20, 2005 12:48 pm

Post by va!n »

very interesting to see this different and its speed comparsion.
The a=b+0 version runs about 3-4 times faster as just only a=b.
va!n aka Thorsten

Intel i7-980X Extreme Edition, 12 GB DDR3, Radeon 5870 2GB, Windows7 x64,
dell_jockey
Enthusiast
Enthusiast
Posts: 767
Joined: Sat Jan 24, 2004 6:56 pm

Post by dell_jockey »

Derek wrote:

Code: Select all

a=0
b=0

t=ElapsedMilliseconds()
For m=1 To 3
For n=1 To 10000000
a=b
Next
Next
t=ElapsedMilliseconds()-t

t2=ElapsedMilliseconds()
For m=1 To 3
For n=1 To 10000000
a=b+0
Next
Next
t2=ElapsedMilliseconds()-t2

MessageRequester("",Str(t)+" "+Str(t2))
Run with debugger off, second one is definately faster.

on my machine (a Dell Precision T3400) there's no real difference in both version. Some runs the first one wins, some runs the other one wins. The difference between the two is marginal.

Whatever, isn't this example a bit futile? It only shows once again that PureBasic is not an optimising compiler, since:

- an optimising compiler will detect that 'a' and 'b' don't change value within the loops
- so by using constant folding it should strip the assignment statements and merely insert the values for 'a' and 'b'
- the next step would be to detect that the loop controlling variables 'm' and 'n' are used for loop control only and that they are not used for anything else and are not changed inside the loops either.
- inside the loops, nothing else happens that would change the state of any other variable
- taken together, this allows an optimizing compiler to determine that there is no 'raison d'être' for the loops, so they can be dropped.

The bottom line: except for the timing measurement lines, for a good optimising compiler this code reduces to two assignment statements in the object code:

Code: Select all

a=0
b=0
If anything beyond this is in the object file, it only shows that PB is not optimising code, but instead - as we know already - its speed is based on linking in optimised libraries.
cheers,
dell_jockey
________
http://blog.forex-trading-ideas.com
va!n
Addict
Addict
Posts: 1104
Joined: Wed Apr 20, 2005 12:48 pm

Post by va!n »

impressive speed different :!: :shock:

On my DualCore T5200 @ 1,6Ghz Notebook i get following results:

Code: Select all

328 vs 110    ( a=b    vs    a=b+0 )
When using in all my source temporary +0 like topic example, to speed up the code, will there be nowhere any problems? (i.e. overwritting registers like eax, ebx.... ?)

@Fred and PureTeam:
Any way to implent this extreme speed optimisation to the 4.20 final? thx
va!n aka Thorsten

Intel i7-980X Extreme Edition, 12 GB DDR3, Radeon 5870 2GB, Windows7 x64,
AND51
Addict
Addict
Posts: 1040
Joined: Sun Oct 15, 2006 8:56 pm
Location: Germany
Contact:

Post by AND51 »

My results on Intel Pentium D (3.4 GHz, DualCore):

Code: Select all

 141 vs 62
(a=b vs a=b+0)
PB 4.30

Code: Select all

onErrorGoto(?Fred)
User avatar
pdwyer
Addict
Addict
Posts: 2813
Joined: Tue May 08, 2007 1:27 pm
Location: Chiba, Japan

Post by pdwyer »

Interestingly I get much closer results when I swap the position of the functions around. Could be CPU caching but I also seem to remember that some functions were faster when placed at the top of code (hard to believe I know, but I came across in on another thread). Make the a=b+0 bit the same for both functions and see if you get the same time...

Also, you might want to increase the loop count as elapsedmilliseconds precision is not too good, 16ms or so I think.

But yes, I get 20-30% increase in speed too.
Paul Dwyer

“In nature, it’s not the strongest nor the most intelligent who survives. It’s the most adaptable to change” - Charles Darwin
“If you can't explain it to a six-year old you really don't understand it yourself.” - Albert Einstein
Post Reply