Loop & Procedure alignment

Posted: Sun Jul 31, 2016 7:12 pm
by Keya
It would be a nice performance boost to be able to align loops, for example:
  For i = 0 to 999999999 Align 8
or
  For i = 0 to 999999999 Align SizeOf(#PB_Integer)
(etc). While we can already use "!align n" for inline asm, it doesn't work with PB's For/While/Repeat loops: the loop start is embedded a few instructions deep, so it isn't immediately after the point where we could put an !align. The compiler, however, already knows exactly where that is.

It would also be great to be able to do the same with Procedures :)
  Procedure Myalignedproc(var1,var2) Align 16

Re: Loop alignment

Posted: Sun Jul 31, 2016 7:29 pm
by Bisonte
Keya wrote:
  For i = 0 to 999999999 Align 8
or
  For i = 0 to 999999999 Align SizeOf(#PB_Integer)
Is Align the same thing as this?

Code:

For i = 0 to 999999999 STEP 8

Re: Loop alignment

Posted: Sun Jul 31, 2016 7:38 pm
by Keya
:) no, Step is the loop increment ... for example "For i = 0 to 9 Step 3", i will be 0, then 3, then 6, then 9.

Loop alignment means the start of the loop (the very first instruction that the loop end repeatedly jumps back to) sits at an aligned address in memory, typically a multiple of 4 or 8. For example, if the unaligned loop originally started at 00401003, the "align 4" version would put a single NOP byte in front of it to push the loop start to the 4-aligned address 00401004. The CPU can jump back to an aligned address a little bit quicker than to a non-aligned one, in the same way it can read/write aligned data quicker. Especially handy for things like nested loops.
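
To make the padding arithmetic concrete, here is a minimal PureBasic sketch (not from the original post). The procedure names IsAligned and PaddingFor are made up for illustration, and the alignment is assumed to be a power of two.

```purebasic
; Hypothetical helpers (names are illustrative, not part of PureBasic).
; Both assume "alignment" is a power of two such as 4, 8 or 16.

Procedure.i IsAligned(address.i, alignment.i)
  ; An address is aligned when its low bits are all zero
  If (address & (alignment - 1)) = 0
    ProcedureReturn #True
  EndIf
  ProcedureReturn #False
EndProcedure

Procedure.i PaddingFor(address.i, alignment.i)
  ; Bytes of NOP padding needed to reach the next aligned address
  ProcedureReturn (alignment - (address & (alignment - 1))) & (alignment - 1)
EndProcedure

Debug IsAligned($00401003, 4)   ; 0: $00401003 is not a multiple of 4
Debug PaddingFor($00401003, 4)  ; 1: one NOP moves the loop start to $00401004
Debug PaddingFor($00401004, 4)  ; 0: already aligned, nothing to add
```

The final mask in PaddingFor is what turns a full "alignment" worth of padding into 0 when the address is already aligned.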

Re: Loop alignment

Posted: Sun Jul 31, 2016 7:53 pm
by Bisonte
ok. Thx.

Re: Loop alignment

Posted: Sun Jul 31, 2016 10:51 pm
by PMV
Shouldn't loops always be aligned to 32/64 bit? :|
I haven't done much with ASM, so this is more of a question
... aren't instructions always 32/64-bit aligned? :shock:

Re: Loop alignment

Posted: Sun Jul 31, 2016 11:53 pm
by Keya
PMV wrote: Shouldn't loops always be aligned to 32/64 bit?
They're not automatically aligned for you at the moment, no. But should they be? I'd say no, because a lot of (most?) loops don't benefit much from alignment as they don't iterate enough, especially with today's CPUs that can do squillions of ops per second, and alignment comes at the cost of extra NOP bytes (up to 15 for 16-byte alignment). But yes, Intel still recommends aligning such branch targets (function entry points, loop starts etc.) for optimization.

But when the coder knows a specific loop is called a lot it's great to be able to direct the compiler to align it! :)
PMV wrote: ... aren't instructions always 32/64-bit aligned?
No! x86 instructions are variable-length (1 to 15 bytes), so that would require useless NOPs in between just about every instruction.

For a little example of loop alignment, consider the following inline PB :) Just a little loop which increments eax until it reaches 9:

Code:

! xor eax, eax
;! align 4
!_NextIncrement:
! inc eax          ;the start of the loop (the branch target)
! cmp eax, 9
! jne _NextIncrement
Compiled, here's how it looked on my system with the !align statement commented out (so, as it'd normally be). The start of the loop could end up anywhere, and in this case it's not aligned (it only had a 1 in 4 chance of being so!):

Code:

00401040  |.  31C0            xor eax, eax
00401042  |>  40              /inc eax               ;0x00401042 is not 4-aligned
00401043  |.  83F8 09         |cmp eax, 9
00401046  |.^ 75 FA           \jnz short 00401042
You can see that it keeps jumping back to 00401042 - that's the start of the loop, but 00401042 isn't a multiple of 4 - it's not aligned. Btw, the alignment usually recommended for branch targets on x64 is actually 16 bytes!

When the !align statement is used you can see that NOPs are added (if required) to push the start of the loop to the next aligned boundary address - in this case two bytes (so two NOPs, although there are multibyte NOPs) were needed to do that so the fasm/yasm assembler has inserted them:

Code:

00401040  |.  31C0            xor eax, eax
00401042  |.  90              nop
00401043  |.  90              nop
00401044  |>  40              /inc eax               ;0x00401044 is 4-aligned
00401045  |.  83F8 09         |cmp eax, 9
00401048  |.^ 75 FA           \jnz short 00401044
There's already a lot of existing literature on this that explains it far better than I can, so if you're interested in more detailed info just google: x86 align loops :)

Re: Loop & Procedure alignment

Posted: Mon Aug 01, 2016 7:28 am
by Fred
Could you post a benchmark showcasing the perf difference?

Re: Loop & Procedure alignment

Posted: Mon Aug 01, 2016 10:47 am
by Keya
you don't believe Intel/AMD? what's wrong with several decades of Intel's recommendations that they still recommend today???
and what's wrong with making software go faster if it's VERY easy to do so??? (this isn't my recommendation, just my small feature request :))

from a quick google there's obviously a lot of various docs about it so here's just a couple from top of the search:
https://software.intel.com/en-us/articl ... structures
https://software.intel.com/en-us/articl ... erformance
also Intel® 64 and IA-32 Architectures Optimization Reference Manual, section "Code Alignment"
everyone's heard of Agner Fog's optimizations lol http://www.agner.org/optimize/optimizing_assembly.pdf

obviously nothing new that you're unaware of Fred, and fasm and yasm that PB use both support it. :) And i'm sure you're not suggesting the C compilers (virtually all of them) that add int3's/nops to align nearly ALL functions when optimized and their thorough support of "align" statements are doing so for no good reason!? :)
And even a blank Purebasic program already includes the following two lines in the .asm code generated by PB:

Code:

macro    pb_align value { rb (value-1) - ($-_PB_DataSection + value-1) mod value }
macro pb_bssalign value { rb (value-1) - ($-_PB_BSSSection  + value-1) mod value }
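
As a rough illustration (not part of the generated .asm), the arithmetic inside those two macros can be mirrored in plain PureBasic. The procedure name PbAlignPadding is hypothetical, and its offset parameter stands in for `$-_PB_DataSection`, i.e. the current position relative to the section start. Note that these macros reserve bytes (rb) to align data, not code.

```purebasic
; Hypothetical mirror of the pb_align arithmetic:
;   rb (value-1) - ($-_PB_DataSection + value-1) mod value
; "offset" plays the role of ($ - _PB_DataSection).
Procedure.i PbAlignPadding(offset.i, value.i)
  ProcedureReturn (value - 1) - ((offset + value - 1) % value)
EndProcedure

Define offset.i
For offset = 0 To 4
  ; For value = 4 the reserved byte counts are 0, 3, 2, 1, 0
  Debug "offset " + Str(offset) + " -> reserve " + Str(PbAlignPadding(offset, 4)) + " byte(s)"
Next
```
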
I'm not asking for anything major, essentially just the ability to tell Purebasic to insert an "! align" statement at a place it already knows about that I can't directly access as that code is generated, to help improve performance in some areas as per Intel's recommendations.

And obviously, as I already said in my previous post, aligning small unnested loops like "for i = 1 to 10" isn't going to make any difference. But I know you're not suggesting that PureBasic is never used for intensive loops: I had a single PB program running for over 3 months, from Dec to ~Feb, doing nothing but bulk image processing (actually 5 instances of the same process, all on Below Normal priority so I could still use my machine lol). Many loops and functions in regular programs that are called a lot, especially recursively or nested, can benefit; I'm not saying anything new or profound here. :P But yes, I accept it's just micro-optimization lol, but I guessed it would be relatively easy to add, so no harm in requesting? :P

Re: Loop & Procedure alignment

Posted: Mon Aug 01, 2016 11:37 am
by wilbert
While I like the idea of having the ability to align code, the question Fred asked is also a good one.
My own experiments with code alignment inside asm sources have often been inconsistent.
There are a lot of different CPUs and they all have different characteristics.
What improves speed on one might decrease it on another.
It's hard to do what Fred asks: produce two asm routines, one aligned and one not, and prove the aligned one is faster.
Variable alignment and being able to allocate aligned blocks of memory often have a bigger impact on performance.
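
For the last point, one common way to get an aligned block from a normal allocation is to over-allocate and round the address up. The following is only a minimal sketch: the constant names #Alignment and #Size are made up for illustration, and #Alignment is assumed to be a power of two.

```purebasic
; Minimal sketch: carve a 16-byte-aligned block out of a normal allocation.
; #Alignment and #Size are illustrative names; #Alignment must be a power of two.
#Alignment = 16
#Size      = 1024

Define raw.i, aligned.i

raw = AllocateMemory(#Size + #Alignment - 1)   ; over-allocate by up to 15 bytes
If raw
  ; Round the address up to the next multiple of #Alignment
  aligned = (raw + #Alignment - 1) & ~(#Alignment - 1)
  Debug aligned & (#Alignment - 1)             ; low bits of the aligned address
  ; ... use the #Size bytes starting at "aligned" ...
  FreeMemory(raw)                              ; always free the original address
EndIf
```

The original address has to be kept around, because FreeMemory() must be given the value that AllocateMemory() returned, not the rounded-up one.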

Re: Loop & Procedure alignment

Posted: Mon Aug 01, 2016 11:41 am
by djes
This code still works (there's more on the forum):

Code:

;djes floating point speed test
;2005

#nb = 300000000

f1.f = 103.2
f2.f = 215.45
l1.l = 103
l2.l = 215

;****************************************************************

Goto f
!SECTION '.testf' CODE READABLE EXECUTABLE ALIGN 4096
f:

Temps1 = ElapsedMilliseconds()

!ALIGN 4

For n = 1 To #nb
  ; put your code here
  f3.f = f1 * f2
Next

Temps2 = ElapsedMilliseconds()

;****************************************************************

Goto i
!SECTION '.testi' CODE READABLE EXECUTABLE ALIGN 4096
i:

Temps3 = ElapsedMilliseconds()

!ALIGN 4

For n = 1 To #nb
  ; put your code here
  l3 = l1 * l2
Next

Temps4 = ElapsedMilliseconds()

;****************************************************************
MessageRequester("Speed test", "First : " + Str(Temps2 - Temps1) + " ; Second : " + Str(Temps4 - Temps3) + Chr(10) + "Ratio = 1 / " + StrF((Temps2 - Temps1) / (Temps4 - Temps3)), 0) 
End

Re: Loop & Procedure alignment

Posted: Mon Aug 01, 2016 12:10 pm
by Keya
djes wrote:
!ALIGN 4
For n = 1 To #nb
This doesn't work unfortunately, hence my request :) It doesn't work because the instruction immediately after the !align isn't actually the start of the loop; that's a couple more instructions deeper. For example:

Code:

! align 4
For i = 1 To 3
  ! xchg ecx, ecx
Next i
Without !align, it looks like this (conveniently x86 unaligned due to 3-in-4 chance lol):

Code:

; 00401042  |.  C705 E4304000 01000000      mov dword ptr [4030E4], 1     ;<- the instruction an !align statement would affect
; 0040104C  |.  EB 00                       jmp short 0040104E
; 0040104E  |>  B8 03000000                 /mov eax, 3                   ;<- actual start of loop we want to align
; 00401053  |.  3B05 E4304000               |cmp eax, dword ptr [4030E4]
; 00401059  |.  7C 0A                       |jl short 00401065
; 0040105B  |.  87C9                        |xchg ecx, ecx
; 0040105D  |.  FF05 E4304000               |inc dword ptr [4030E4]
; 00401063  |.^ 71 E9                       \jno short 0040104E
With the !align statement it results in this differing start:

Code:

; 00401042  |.  90                          nop
; 00401043  |.  90                          nop
; 00401044  |.  C705 E4304000 01000000      mov dword ptr [4030E4], 1
So the !align has aligned that "mov dword ptr" instruction, but the one we want to align is the loop start/branch target, "mov eax, 3". You can see from the .asm output that the PureBasic compiler already knows where this is :) (Although I'm definitely not saying that makes this an easy addition for Fred! Hopefully it makes it easier at least. I also appreciate there are a million other feature requests, and even if it never gets added, no harm in asking?) :)

Code:

; For i = 1 To 3
  MOV    dword [v_i],1
  JMP   _ForSkipDebug1
_For1:                      ;<-- :)
_ForSkipDebug1:
  MOV    eax,3
  CMP    eax,dword [v_i]
  JL    _Next2
; ! xchg ecx, ecx
  xchg   ecx, ecx
; Next i
_NextContinue2:
  INC    dword [v_i]
  JNO   _For1
_Next2:
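
Until something like the requested keyword exists, the closest approximation is to write the loop by hand so that the !align lands directly in front of the branch target. The following is only a sketch under several assumptions: the ASM (fasm) backend, a main-scope variable i that the compiler exposes as v_i (as in the listing above), and made-up label names _AlignedFor and _AlignedNext.

```purebasic
; Hand-written equivalent of "For i = 1 To 3" with an aligned branch target.
; Assumptions: ASM (fasm) backend; "i" is a main-scope variable, so the
; assembler sees it as v_i; the label names are illustrative only.
DisableDebugger            ; keep debugger code out of the asm sequence

Define i.l                 ; 32-bit so "dword [v_i]" matches on x86 and x64

i = 1
!jmp _AlignedFor           ; jump over any padding the assembler inserts
!align 4
!_AlignedFor:              ; the branch target now sits on a 4-byte boundary
!mov eax, 3
!cmp eax, dword [v_i]
!jl _AlignedNext           ; leave the loop once i > 3
!xchg ecx, ecx             ; loop body
!inc dword [v_i]
!jno _AlignedFor
!_AlignedNext:

MessageRequester("Aligned loop", "i after the loop: " + Str(i))
```

This gives up the convenience of For/Next, which is exactly why a compiler-level option would be nicer.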

Re: Loop & Procedure alignment

Posted: Mon Aug 01, 2016 1:11 pm
by djes
Thank you for your answer! Actually, it wasn't the !align that I meant to show you, but the Goto/Section trick:

Code:

#nb = 300000000

Define i.i

Temps1 = ElapsedMilliseconds()

For x = 0 To #nb
  
  ; For i = 1 To 3
  i = 1
  !JMP _MyForSkipDebug1
  
  !SECTION '.test2' CODE READABLE EXECUTABLE ALIGN 4096
  
  !_MyFor1:                      ;<-- :)
  !_MyForSkipDebug1:
  !MOV    eax,3
  !CMP    eax,dword [v_i]
  !JL    _MyNext2
  !XCHG   ecx, ecx
  ; Next i
  !_MyNextContinue2:
  !INC    dword [v_i]
  !JNO   _MyFor1
  !_MyNext2:
  
Next x

Temps2 = ElapsedMilliseconds()


;****************************************************************

Temps3 = ElapsedMilliseconds()

For x = 0 To #nb
  
  ; For i = 1 To 3
  i = 1
  !JMP _MyForSkipDebug1_2
  
  !SECTION '.test2' CODE READABLE EXECUTABLE ALIGN 16
  
  !_MyFor1_2:                      ;<-- :)
  !_MyForSkipDebug1_2:
  !MOV    eax,3
  !CMP    eax,dword [v_i]
  !JL    _MyNext2_2
  !XCHG   ecx, ecx
  ; Next i
  !_MyNextContinue2_2:
  !INC    dword [v_i]
  !JNO   _MyFor1_2
  !_MyNext2_2:
  
Next x

Temps4 = ElapsedMilliseconds()

;****************************************************************
MessageRequester("Speed test", "First : " + Str(Temps2 - Temps1) + " ; Second : " + Str(Temps4 - Temps3) + Chr(10) + "Ratio = 1 / " + StrF((Temps2 - Temps1) / (Temps4 - Temps3)), 0) 
End

Re: Loop & Procedure alignment

Posted: Mon Aug 01, 2016 1:38 pm
by wilbert
No significant difference on my computer

Code:

DisableDebugger

t1 = ElapsedMilliseconds()
!mov ecx, 0
!align 8
!loop0:
!dec ecx
!jnz loop0
t2 = ElapsedMilliseconds()
!mov ecx, 0
!align 8
!nop
!loop1:
!dec ecx
!jnz loop1
t3 = ElapsedMilliseconds()

MessageRequester("Timings", "aligned "+Str(t2-t1)+" vs unaligned "+Str(t3-t2))

Re: Loop & Procedure alignment

Posted: Mon Aug 01, 2016 2:54 pm
by Keya
wilbert no i wouldn't expect much difference from that one either!

djes interesting trick! :)

Re: Loop & Procedure alignment

Posted: Mon Aug 01, 2016 3:12 pm
by wilbert
Keya wrote: wilbert no i wouldn't expect much difference from that one either!
In what cases do you expect a difference?
I tried with a bit more code in between and the result is still the same.

Code:

DisableDebugger

t1 = ElapsedMilliseconds()
!mov ecx, 0x10000000
!align 8
!loop0:
!xor ecx, 1
!rol ecx, 1
!xor ecx, 2
!ror ecx, 1
!xor ecx, 1
!rol ecx, 1
!xor ecx, 2
!ror ecx, 1
!xor ecx, 1
!rol ecx, 1
!xor ecx, 2
!ror ecx, 1
!sub ecx, 1
!jnz loop0
t2 = ElapsedMilliseconds()
!mov ecx, 0x10000000
!align 8
!nop
!loop1:
!xor ecx, 1
!rol ecx, 1
!xor ecx, 2
!ror ecx, 1
!xor ecx, 1
!rol ecx, 1
!xor ecx, 2
!ror ecx, 1
!xor ecx, 1
!rol ecx, 1
!xor ecx, 2
!ror ecx, 1
!sub ecx, 1
!jnz loop1
t3 = ElapsedMilliseconds()

MessageRequester("Timings", "aligned "+Str(t2-t1)+" vs unaligned "+Str(t3-t2))