
Second+ procedures a little faster?

Posted: Thu May 18, 2006 2:34 am
by Dare2
Could somebody with a little time confirm this?

Having one dummy procedure before all others, and calling it once before everything else, will speed up the next procedure? That is, the first procedure called is, for some reason, slower? Is this true (or false?), especially if strings are involved?

Background:

I did some instrRev tests and accidentally discovered that switching procedures positionally in the code affected their performance.

I then tried the same procedure (under different names) and the second instance was always a little faster than the first. If I swapped them around, positionally, the speeds changed. The slower proc now had the legs. Diff was about the same, only they had swapped performance.

I placed a third, dummy, do-nothing-much procedure before both, called it once, and then ran the tests: the two identical procs ran at the same speed, that is, within a few ms of each other, with neither consistently having the edge. So externals were at work, nothing internal. Whereas before, the diff was larger and the second procedure (positioned second, top down, not second to be tested/called) always ran faster.
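
Something like this is the sort of test I mean (a rough sketch from memory, not my actual test code; Dummy, ProcA, ProcB and the string busy-work are just placeholders):

Code:

Procedure Dummy()
  ; does next to nothing - called once, up front
  Protected a.l
  a = 1
EndProcedure

Procedure.l ProcA(count.l)
  Protected i.l, s.s
  For i = 1 To count
    s = Str(i) + "x"
  Next
  ProcedureReturn Len(s)
EndProcedure

Procedure.l ProcB(count.l)
  Protected i.l, s.s
  For i = 1 To count
    s = Str(i) + "x"
  Next
  ProcedureReturn Len(s)
EndProcedure

Dummy()   ; comment this call out to test without the warm-up

time = ElapsedMilliseconds()
ProcA(200000)
ta = ElapsedMilliseconds() - time

time = ElapsedMilliseconds()
ProcB(200000)
tb = ElapsedMilliseconds() - time

MessageRequester("Results:", "ProcA: " + Str(ta) + Chr(10) + "ProcB: " + Str(tb))

Swap the ProcA and ProcB definitions around and run again to see whether the timing follows the name or the position, and comment out the Dummy() call to see whether the first-proc slowdown comes back.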

Debugger was off. No ASM, libs, etc. were involved. Just straight PureBasic code. Version 4. XP Pro with all updates.


This seems strange. Can someone confirm this?


Note: This is not important, I am just curious.

Posted: Thu May 18, 2006 3:17 am
by jack
I have seen some strange timings; sometimes adding a few instructions can change the outcome quite a bit.
Consider this example by El_Choni. Run it with debug off and wait for the requester; it will take some time.

Code:

time = ElapsedMilliseconds() 
For i=0 To 1000000000 
  !FLD tword [l_at] 
  ;b.e=0.57823984700423848847388532e-11 
  !FLD tword [l_bt] 
  ;c.e=a.e+b.e 
  !FADDP 
  !FSTP tword [l_ct] 
Next i 
tt = ElapsedMilliseconds()-time 
time = ElapsedMilliseconds() 
For i=0 To 1000000000 
  !FLD qword [l_aq] 
  ;b.e=0.57823984700423848847388532e-11 
  !FADD qword [l_bq] 
  ;c.e=a.e+b.e 
  !FSTP qword [l_cq] 
Next i 
tq = ElapsedMilliseconds()-time 

MessageRequester("Results:", "Tword operation: "+Str(tt)+Chr(10)+"Qword operation: "+Str(tq)) 

End 
!section '.data' code readable writeable 
at: 
!dt 6.938827757238428549483e-9 
bt: 
!dt 0.57823984700423848847388532e-11 
ct: 
!dt 0.0 
aq: 
!dq 6.938827757238428549483e-9 
bq: 
!dq 0.57823984700423848847388532e-11 
cq: 
!dq 0.0 
Now swap the loops and run again.

Code:

time = ElapsedMilliseconds() 
For i=0 To 1000000000 
  !FLD qword [l_aq] 
  ;b.e=0.57823984700423848847388532e-11 
  !FADD qword [l_bq] 
  ;c.e=a.e+b.e 
  !FSTP qword [l_cq] 
Next i 
tq = ElapsedMilliseconds()-time 
time = ElapsedMilliseconds() 
For i=0 To 1000000000 
  !FLD tword [l_at] 
  ;b.e=0.57823984700423848847388532e-11 
  !FLD tword [l_bt] 
  ;c.e=a.e+b.e 
  !FADDP 
  !FSTP tword [l_ct] 
Next i 
tt = ElapsedMilliseconds()-time 

MessageRequester("Results:", "Tword operation: "+Str(tt)+Chr(10)+"Qword operation: "+Str(tq)) 

End 
!section '.data' code readable writeable 
at: 
!dt 6.938827757238428549483e-9 
bt: 
!dt 0.57823984700423848847388532e-11 
ct: 
!dt 0.0 
aq: 
!dq 6.938827757238428549483e-9 
bq: 
!dq 0.57823984700423848847388532e-11 
cq: 
!dq 0.0 

Posted: Thu May 18, 2006 6:07 am
by Dare2
Heya Jack.

It is interesting.

My first post was too simplistic. I've messed around with some more stuff, and positioning is important but not just at the procedural level (as your example showed).

Some things show no problems no matter what you do. Others can show a speed drop or increase just from switching the position of one line of code, although code using strings appears more likely to have this peculiarity than string-free code.

For some things it is quite significant, ratios of approx 13:12.

In all cases I tested, the ASM output is the same statement-by-statement, only the position changed.

I am beginning to wonder if it is something like an AV taking a closer look at a particular pattern of code or sequence of events before letting it go. But it is too advanced for me.

What I do know is that my speed tests may not be worth twopence unless I test in several different contexts. Even then they may be suspect.


BTW, anyone done a good fast InStrRev? :)

Posted: Thu May 18, 2006 8:43 am
by Trond
I've run into this, but only with the debugger. But theoretically it could also happen without it for various reasons.

I think the Pentium processors have two instruction "pipes", the U pipe and the V pipe. Certain instruction pairs can run at the same time in the different pipes, while certain instructions are not pairable, and if one of these executes in one pipe the other pipe will stay idle. But this would not be relevant with the same ASM code.

Most likely the cause of this is that instructions aligned on a four-byte boundary or instructions that reference data aligned on a four-byte boundary are faster.

Note that this is just guesswork.

Posted: Thu May 18, 2006 9:13 am
by Dare2
Probably a more educated guess than mine. :)

With the pairing, if this is an example:

Code:

  MOV    ebx,dword [esp+12]
  ADD    ebx,dword [esp+PS2+8]
  MOV    dword [esp+16],ebx
Then it pairs, say, MOV & ADD, then MOV (and whatever follows, if 'pairable').

Moving this might set up a situation where the 2nd MOV cannot be paired, or perhaps the first, depending on what preceded it? Or on what it is manipulating, I guess?

Anyhow, not important, just a curiosity thing.

Thanks for the insight.

Posted: Thu May 18, 2006 9:13 am
by GedB
It could have something to do with memory paging.

Remember that the memory space is virtual and mapped to pages of real memory that are swapped in and out.

If a procedure is closer to the code calling it, there is a greater chance that it will be within the same memory page, and jumps within the same page are quicker.

Again, just guesswork.

Posted: Thu May 18, 2006 9:15 am
by Dare2
Also more educated than any I make. :)

The code is tiny though, so would that mean paging is unlikely?

Posted: Thu May 18, 2006 9:19 am
by GedB
Good point.

Also, the results of page swapping would be less consistent, more random.

Posted: Thu May 18, 2006 9:29 am
by blueznl
someone tried it on amd?

(i tried to run the samples but couldn't get them to run)

Posted: Thu May 18, 2006 10:08 am
by Dare2
Actually, any small code snippet in PureBasic would do. Have identical code in two procedures and call each in a timing loop.

Sometimes there is no difference.

Sometimes there is a consistent difference. If there is a consistent difference, favouring one, swap the two procedures and try again. Now the other (by name) procedure should outperform, but it is in the same location as the previous top performer.

It happened often enough here, when it happened, for it to be beyond coincidence.


On this box I have a P4. On one right beside it I have an AMD Sempron, but it was a cheap box with a cheap HD and the HD is knackered. I need to get another HD before I can see if AMD makes a difference.


I was actually a little nervous that some RAM was going sluggy and threatening to die. But the fact that Jack and Trond have seen similar things is reassuring.

Posted: Thu May 18, 2006 10:31 am
by Trond
I've only seen it with the debugger as far as I remember, but that's because I haven't been looking for the problem without it.

Memory page swapping is not really a possibility since memory pages should be 4 kilobytes and the two procedures should fit inside a single page.

I can't test this now but I'll test it when I get home.

Posted: Thu May 18, 2006 11:06 am
by dioxin
Dare2,
didn't I just tell you the cause in the other thread a couple of days ago?

http://www.purebasic.fr/english/viewtop ... 0&start=73

Lots of things can affect the precise timing of a short routine, but for small loops, code alignment of the branch target is very important.

The U and V pipelines are a bit out of date; they were introduced with the Pentium, but things have since moved on. These days a CPU has many pipelines (not just the one and a half that the Pentium had). An Athlon, for example, has 2 FPU and 3 integer pipelines, and it decodes blocks of 16 bytes and allocates resources as required to each instruction in those 16 bytes, so all instructions in that block can execute at the same time under the right conditions.
Instruction order can still make a difference, but it's more complex than it used to be!

Paul.

Added:
Anti-virus software would not cause this. AV looks at the entire code before executing any of it, so it may delay the time it takes to begin executing the code, but it will have no effect on timing within that code... except, maybe, for file accesses.
Trond wrote:Most likely the cause of this is that instructions aligned on a four-byte boundary or instructions that reference data aligned on a four-byte boundary are faster.
Instruction alignment is by the 16-byte block or the cache line (usually 32 bytes), not 4 bytes.

Data alignment should always tie in with the size of the target data. 2-byte values should be aligned on a 2-byte boundary, 4-byte values on a 4-byte boundary, etc. The exception is extended FP: 80-bit numbers should be aligned on an 8-byte boundary, not the 10-byte boundary you might expect.

If the compiler is written well then all data values will be properly aligned so you needn't worry about it unless you create your own data structures.
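
For your own structures, the simplest way to keep everything aligned is to order the fields largest first. A rough sketch (assuming PB packs structures with no padding and that AllocateMemory() hands back memory aligned to at least 8 bytes):

Code:

Structure SampleData
  d.d    ; 8-byte double at offset 0
  l.l    ; 4-byte long at offset 8
  w.w    ; 2-byte word at offset 12
  b.b    ; 1 byte at offset 14
EndStructure

; every field sits on a boundary matching its own size, with no padding needed
*p.SampleData = AllocateMemory(SizeOf(SampleData))
*p\d = 6.938827757238428549483e-9
*p\l = 1234
FreeMemory(*p)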

Dare2 wrote:Anyhow, not important, just a curiosity thing.
Not important enough to gain you 8% in speed? Some folk spend hours tweaking code to gain less than that!
In practice you can gain up to 20% when moving from badly aligned code to well aligned code but realistically, it's only worth the effort in critical situations or when writing in ASM.

GedB wrote:It could have something to do with memory paging.
Very unlikely, partly because, as has been pointed out, small code will usually be in the same page, and partly because the OS pages everything into memory at the outset and only pages it out when some other task needs the space. Since you don't do timings on code like this with huge applications running in the background, the OS is never going to have a reason to page your code out.

blueznl wrote:someone tried it on amd?
My arguments above are all based on AMD and I'm assuming similar conditions exist with Pentiums although the exact figures may vary.

Posted: Thu May 18, 2006 11:42 am
by Dare2
dioxin wrote:Dare2,
didn't I just tell you the cause in the other thread a couple of days ago?
Now that you mention it and bring it back to my attention, yes. In fact that is probably why I noticed it.


Sheeesh, I have a short little span of attention. :? :D

Posted: Thu May 18, 2006 11:54 am
by dioxin
Dare2,
I assume you'll all be aware of this one but it fits in with this discussion so I'll mention it here.
All code always runs more slowly the first time because it's not cached; it's usually in main memory, which is VERY slow by comparison.
The first time the code is called it gets fetched from main memory to CPU cache, which takes a while. But, provided there is never any reason to throw it out of the cache, it will remain there, and each subsequent time it gets called it will run at cache speeds.

You can use this to your advantage by cache warming. Before you need the time-critical code, call it once for a dummy run so the code and data are in cache, ready for the real, high-speed run.
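
In PB terms it's just something like this (a rough sketch; TimeCritical() is a made-up stand-in for whatever you actually need to time):

Code:

Procedure.l TimeCritical()
  ; whatever really needs to run fast goes here
  Protected i.l, x.l
  For i = 1 To 100000
    x = x + i
  Next
  ProcedureReturn x
EndProcedure

TimeCritical()                  ; dummy run: pulls the code and data into cache

time = ElapsedMilliseconds()    ; real run, now at cache speed
TimeCritical()
MessageRequester("Result:", Str(ElapsedMilliseconds() - time) + " ms")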

I suspect that your "short little span of attention" is a similar effect. Had the previous thread been written an hour ago it would still be in your brain's cache and you would have immediately recognised it.
Since it was written 2 days ago, other, more recent events have been cached, such as hungry... need food, and the explanation in the previous thread has been thrown out of brain cache to main memory, where it is not so readily accessible.

Paul.

Posted: Thu May 18, 2006 12:20 pm
by Dare2
:D

Good analogies! But now I'm hungry again (permanently cached ..)