This speed change is more a code-alignment issue than a caching problem,
at least for small pieces of code, and it has nothing to do
with the overall size of your program.
Even if your program is 1 MB+, if you are running a short loop,
only the code that was just executed is cached.
But code alignment has a far bigger impact.
If you align a loop to a multiple of 4, 8 or 16 bytes, it will run about 1.5-2 times faster on a Pentium or later; the same goes for procedure entries,
but which alignment works best also depends on the CPU model.
You can test it by adding one or more "!NOP" before a loop for example,
without changing anything else in your program,
but as you don't know the alignment of the generated code,
you have to try between one and 15 NOPs and test the speed each time.
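The NOP test described above could look roughly like this. It is only a sketch: the constant #Iterations and the variable names are made up for illustration, and you would add or remove "!NOP" lines by hand and re-run to compare the timings.

```purebasic
; Minimal timing sketch: change the number of "!NOP" lines below
; (anywhere from 1 to 15), recompile and compare the elapsed time.
; #Iterations, i and t are illustrative names, not from the original post.
#Iterations = 200000000

Define i.l, t.l

OpenConsole()

t = ElapsedMilliseconds()

; Each "!NOP" is passed straight to the assembler and shifts
; the start of the following loop by one byte.
!NOP
!NOP
!NOP

Repeat
  i + 1
Until i = #Iterations

PrintN("Elapsed ms: " + Str(ElapsedMilliseconds() - t))
PrintN("Press Enter to quit")
Input()
```

Nothing else in the program should change between runs, otherwise the code in front of the loop moves as well and the comparison is meaningless.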
Data misalignment can also slow down your program dramatically,
for example:
Global a.b, a2.l
Procedure p1()
Protected a.b, a2.l ....
...
...
EndProcedure
PB handles this by grouping variables of different sizes,
or for the local variables by always using a minimum of 4 bytes.
But you should avoid something like this:
Structure MyStruc
a.b
a2.l
EndStructure
Here, everything after a.b will be misaligned,
so you have to move such small fields to the end of the structure yourself if possible,
or always use longs if size doesn't matter.
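To check where a field actually ends up, you can print its offset with OffsetOf(). This is a small sketch; the structure names Misaligned and Reordered are only for illustration, and it assumes fields are laid out one after another without padding, as described above.

```purebasic
; Compare the offset of the long field in both layouts.
; Misaligned / Reordered are illustrative names.
Structure Misaligned
  a.b     ; 1 byte first ...
  a2.l    ; ... so the long does not start on a multiple of 4
EndStructure

Structure Reordered
  a2.l    ; long first, starts at the beginning of the structure
  a.b     ; byte moved to the end
EndStructure

OpenConsole()
PrintN("Misaligned\a2 offset: " + Str(OffsetOf(Misaligned\a2)))
PrintN("Reordered\a2 offset:  " + Str(OffsetOf(Reordered\a2)))
PrintN("Press Enter to quit")
Input()
```

An offset that is not a multiple of 4 for a long means the field is misaligned whenever the structure itself starts on a 4-byte boundary.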
The problem is, you can't automate the code-alignment, at least not most of the time.
For / Next:
There is init code before the actual loop, so only the init would be aligned,
not the loop itself
Procedures:
They are inserted as assembler macros at the end of the code,
so aligning the declaration would not help,
as the alignment would not move with the macro but would stay at the place of the declaration.
And for procedures there's a second problem: the loop that clears
the stack for local variables. If your procedure has many Protected variables,
an unaligned clear loop can cost more time than an
unaligned procedure entry address.
The only loops where alignment works are:
ForEach / Next
While / Wend
Repeat / Until
Just did some extensive testing and it works...
not.
That is, when you use assembler macros to adjust the alignment,
the assembler has to calculate the number of NOPs to insert
to align the loop, but that shifts code around and changes the addresses of
jump destinations, which can change a short jump into a long jump and
vice versa.
So the assembler needs another pass over the code, which changes
everything again and again and again...
So the assembler gives up after a number of passes and uses whatever
code results, which may or may not leave some loops unaligned.
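The calculation itself is simple. The sketch below shows it in plain PB, with a made-up helper name PaddingFor() and made-up addresses. The catch is that the address going in is not fixed: it depends on the size of every jump before it, and those sizes depend in turn on the padding.

```purebasic
; Hypothetical helper, not part of PB or the assembler:
; number of padding bytes needed to move addr up to the
; next multiple of boundary (0 if it is already aligned).
Procedure.l PaddingFor(addr.l, boundary.l)
  ProcedureReturn (boundary - (addr % boundary)) % boundary
EndProcedure

OpenConsole()
; $401003 % 16 = 3, so 13 bytes are needed to reach $401010
PrintN("Padding for $401003: " + Str(PaddingFor($401003, 16)))
; $401010 is already a multiple of 16, so no padding is needed
PrintN("Padding for $401010: " + Str(PaddingFor($401010, 16)))
PrintN("Press Enter to quit")
Input()
```

If an earlier jump grows from the short to the long form, every address after it moves by a few bytes, and each padding amount calculated in the previous pass is wrong again.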
You can try this program to see that the alignment is not what you
wanted it to be.
It first aligns the code to a multiple of 16 and then misaligns
it by 1 to 16 bytes, or at least it should,
but as you can see from the address of the loop start, it does not:
; Assembler macro to align code on a given boundary,
; aligning loop starts on 4/8/16 byte boundaries can speed up the code
; by about 1.5 to 2 times!
! macro codealignjmp value
! {
! local dest
!
! if ((value - 1) - ((($ - $$) + value - 1) mod value)) > 3
! jmp dest
! end if
!
! rept value
! \{
! if ($ - $$) mod value
! nop
! end if
! \}
! dest:
! }
! macro codealign value
! {
! rept value
! \{
! if ($ - $$) mod value
! nop
! end if
! \}
! }
!macro pb_codealign value {
! local dest
! jmp short dest
! rb (value-1) - ($-PureBasicStart + value-1) mod value
!dest: }
Macro AlignmentTiming(alignmentValue)
i = 0
tStart2 = ElapsedMilliseconds(): Repeat: tStart = ElapsedMilliseconds(): Until tStart <> tStart2
!codealignjmp 16 ; Align to a multiple of 16
!rept alignmentValue { NOP } ; Now (mis-)align by the wanted number of bytes
loopstart#alignmentValue:
Repeat
i + 1
Until i = #iEnd
tExec = ElapsedMilliseconds() - tStart
If tExec < tExecMin: tExecMin = tExec: EndIf
If tExec > tExecMax: tExecMax = tExec: EndIf
PrintN("Align " + RSet(Str(alignmentValue), 2) + ": " + RSet(Str(tExec), 5) + ", Loopaddr: " + RSet(Hex(?loopStart#alignmentValue), 8, "0"))
EndMacro
#iEnd = 200000000
Define i.l, tStart.l, tStart2.l, tExec.l, txt.s
Define n.l
Define tExecMin.l, tExecMax.l
OpenConsole()
For n = 1 To 2
tExecMax = 0
tExecMin = #MAXLONG
PrintN("Attempt: " + Str(n))
AlignmentTiming(1)
AlignmentTiming(2)
AlignmentTiming(3)
AlignmentTiming(4)
AlignmentTiming(5)
AlignmentTiming(6)
AlignmentTiming(7)
AlignmentTiming(8)
AlignmentTiming(9)
AlignmentTiming(10)
AlignmentTiming(11)
AlignmentTiming(12)
AlignmentTiming(13)
AlignmentTiming(14)
AlignmentTiming(15)
AlignmentTiming(16)
PrintN("")
PrintN("ExecMax: " + Str(tExecMax))
PrintN("ExecMin: " + Str(tExecMin))
PrintN("")
PrintN("")
Next n
PrintN("Press key to quit")
Input()
I have an AMD Athlon XP 1.4 GHz, and an alignment of 8 gives the best results,
an alignment of 16 the worst!?!