Loop & Procedure alignment

Got an idea for enhancing PureBasic? New command(s) you'd like to see?
Keya
Addict
Posts: 1890
Joined: Thu Jun 04, 2015 7:10 am

Post by Keya »

It would be a nice performance boost to be able to align loops, for example:
  For i = 0 to 999999999 Align 8
or
  For i = 0 to 999999999 Align SizeOf(Integer)
(etc.). While we can already use "!align n" with inline asm, it doesn't work with PB's For/While/Repeat loops: the loop start is embedded a few instructions deep, so it's not immediately after the place where we could put an !align. The compiler, however, already knows exactly where it is.

Also would be great to be able to do the same with Procedures :)
  Procedure Myalignedproc(var1,var2) Align 16
Last edited by Keya on Mon Aug 01, 2016 12:32 am, edited 2 times in total.
Bisonte
Addict
Posts: 1305
Joined: Tue Oct 09, 2007 2:15 am

Re: Loop alignment

Post by Bisonte »

Keya wrote:
  For i = 0 to 999999999 Align 8
or
  For i = 0 to 999999999 Align SizeOf(Integer)
Is the align thing the same as this?

Code: Select all

For i = 0 to 999999999 STEP 8
PureBasic 6.21 (Windows x64) | Windows 11 Pro | AsRock B850 Steel Legend Wifi | R7 9800x3D | 64GB RAM | RTX 5080 | ThermaltakeView 270 TG ARGB | build by vannicom
English is not my native language... (I often use DeepL.)

Re: Loop alignment

Post by Keya »

:) no, Step is the loop increment ... for example "For i = 0 to 9 Step 3", i will be 0, then 3, then 6, then 9.
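
To make the difference concrete, here is a minimal sketch of what Step does (the Debug output only shows up when the debugger is enabled):

```purebasic
; Step only changes how much the counter grows per iteration.
; It has no effect on where the loop's machine code is placed.
Define i
For i = 0 To 9 Step 3
  Debug i   ; shows 0, 3, 6, 9
Next
```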

Loop alignment means the start of the loop (the very first instruction that the loop end repeatedly jumps back to) sits at an aligned address in memory, typically a multiple of 4 or 8. So if the unaligned loop originally started at 00401003, the "align 4" version would put a single NOP byte in front of it to push the loop start down to the 4-aligned address 00401004. The CPU can jump back there a little bit quicker than to a non-aligned address, in the same way that it can read/write quicker from aligned addresses. Especially handy for things like nested loops.
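
The address arithmetic behind that example can be sketched in PureBasic itself. AlignUp is a hypothetical helper name, not a built-in, and the sketch assumes the alignment is a power of two:

```purebasic
; Hypothetical helper: round an address up to the next multiple of
; "alignment". Assumes alignment is a power of two (4, 8, 16, ...).
Procedure.i AlignUp(address.i, alignment.i)
  ProcedureReturn (address + alignment - 1) & ~(alignment - 1)
EndProcedure

Define start.i = $401003                 ; unaligned loop start from the example
Define aligned.i = AlignUp(start, 4)

Debug Hex(aligned)                       ; 401004
Debug aligned - start                    ; 1 padding byte (a single NOP)
```

The difference between the two addresses is the number of padding bytes an assembler would have to insert in front of the loop.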
Last edited by Keya on Sun Jul 31, 2016 7:54 pm, edited 1 time in total.

Re: Loop alignment

Post by Bisonte »

ok. Thx.
PMV
Enthusiast
Posts: 727
Joined: Sat Feb 24, 2007 3:15 pm
Location: Germany

Re: Loop alignment

Post by PMV »

Shouldn't loops always be aligned to 32/64 bit? :|
I haven't done much with ASM, so this is more of a question
... aren't instructions always 32/64-bit aligned? :shock:

Re: Loop alignment

Post by Keya »

PMV wrote: Shouldn't loops always be aligned to 32/64 bit?
They're not automatically aligned for you at the moment, no. But should they be? I'd say no, because a lot of (most?) loops don't benefit much from alignment as they don't iterate enough, especially with today's CPUs that can do squillions of ops per second, and alignment comes at the cost of extra NOP bytes (1 to 15 of them when padding is needed for 16-byte alignment). But yes, Intel still recommends aligning such branch targets (function entry points, loop starts etc.) for optimization.

But when the coder knows a specific loop is executed a lot it's great to be able to direct the compiler to align it! :)
PMV wrote: ... aren't instructions always 32/64-bit aligned?
No! That would require useless NOPs in between just about every instruction.

For a little example of loop alignment, consider the following inline asm in PB :) It's just a little loop which increments eax until it reaches 9:

Code: Select all

! xor eax, eax
;! align 4
!_NextIncrement:
! inc eax          ;the start of the loop (the branch target)
! cmp eax, 9
! jne _NextIncrement
Compiled, here's how it looked on my system with the !align statement commented out (so, as it'd normally be). The start of the loop could end up anywhere, and in this case it's not aligned (it only had a 1 in 4 chance of landing on a 4-byte boundary):

Code: Select all

00401040  |.  31C0            xor eax, eax
00401042  |>  40              /inc eax               ;0x00401042 is not 4-aligned
00401043  |.  83F8 09         |cmp eax, 9
00401046  |.^ 75 FA           \jnz short 00401042
You can see that it keeps jumping back to 00401042. That's the start of the loop, but 00401042 isn't a multiple of 4, so it's not aligned. Btw, the ideal x64 alignment is actually 16!
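
The odds mentioned above can be checked with a bit of arithmetic. This is only a sketch using addresses around the one in the listing; it relies on nothing beyond plain modulo:

```purebasic
; For each of four consecutive addresses, work out how many padding
; bytes "align 4" would need to insert in front of a loop starting there.
Define address.i, padding.i
For address = $401040 To $401043
  padding = (4 - (address % 4)) % 4
  Debug Hex(address) + ": " + Str(padding) + " padding byte(s)"
Next
; Only $401040 needs no padding. $401042, the loop start in the
; listing, needs 2 bytes.
```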

When the !align statement is used, you can see that NOPs are added (if required) to push the start of the loop to the next aligned boundary address. In this case two bytes were needed (so two NOPs, although there are multibyte NOPs), and the fasm/yasm assembler has inserted them:

Code: Select all

00401040  |.  31C0            xor eax, eax
00401042  |.  90              nop
00401043  |.  90              nop
00401044  |>  40              /inc eax               ;0x00401044 is 4-aligned
00401045  |.  83F8 09         |cmp eax, 9
00401048  |.^ 75 FA           \jnz short 00401044
there's already a lot of existing literature on it and far better than i can explain so if interested in more detailed info just google: x86 align loops :)
Fred
Administrator
Posts: 18162
Joined: Fri May 17, 2002 4:39 pm
Location: France

Re: Loop & Procedure alignment

Post by Fred »

Could you post a benchmark showcasing the perf difference?

Re: Loop & Procedure alignment

Post by Keya »

you don't believe Intel/AMD? what's wrong with several decades of Intel's recommendations that they still recommend today???
and what's wrong with making software go faster if it's VERY easy to do so??? (this isn't my recommendation, just my small feature request :))

from a quick google there's obviously a lot of various docs about it so here's just a couple from top of the search:
https://software.intel.com/en-us/articl ... structures
https://software.intel.com/en-us/articl ... erformance
also Intel® 64 and IA-32 Architectures Optimization Reference Manual, section "Code Alignment"
everyone's heard of Agner Fog's optimizations lol http://www.agner.org/optimize/optimizing_assembly.pdf

Obviously nothing new that you're unaware of, Fred, and fasm and yasm, which PB uses, both support it. :) And I'm sure you're not suggesting that the C compilers (virtually all of them) that add int3s/NOPs to align nearly ALL functions when optimizing, and that thoroughly support "align" statements, are doing so for no good reason!? :)
And even a blank Purebasic program already includes the following two lines in the .asm code generated by PB:

Code: Select all

macro    pb_align value { rb (value-1) - ($-_PB_DataSection + value-1) mod value }
macro pb_bssalign value { rb (value-1) - ($-_PB_BSSSection  + value-1) mod value }
I'm not asking for anything major, essentially just the ability to tell Purebasic to insert an "! align" statement at a place it already knows about that I can't directly access as that code is generated, to help improve performance in some areas as per Intel's recommendations.

And obviously unnested loops like "for i = 1 to 10", as I already said in my previous post, aren't going to show any difference. But I know you're not suggesting that Purebasic is never used for intensive loops: I had a single PB program running for over 3 months, from Dec to ~Feb, doing nothing but bulk image processing (actually 5 instances of the same process, all on Below Normal priority so I could still use my machine lol). Many loops and functions in regular programs that are called a lot, especially recursively or nested, can benefit; I'm not saying anything new or profound here. :P But yes, I accept it's just micro-optimization lol, but I guessed it would be relatively easy to add, so no harm in requesting? :P
Last edited by Keya on Mon Aug 01, 2016 12:27 pm, edited 3 times in total.
wilbert
PureBasic Expert
Posts: 3942
Joined: Sun Aug 08, 2004 5:21 am
Location: Netherlands

Re: Loop & Procedure alignment

Post by wilbert »

While I like the idea of having the ability to align code, the question Fred asked is also a good one.
My own experiments with code alignment inside asm sources have often been inconsistent.
There are a lot of different CPUs and they all have different characteristics.
What improves speed on one might decrease it on another.
It's hard to do what Fred asks: produce two asm routines, one aligned and one not, and prove the aligned one is faster.
Variable alignment and being able to allocate aligned blocks of memory often have a bigger impact on performance.
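
For the data side, one common way to get an aligned block is to over-allocate and round the pointer up. A minimal sketch, assuming a power-of-two alignment; the constant names are made up for the example:

```purebasic
#Alignment = 16     ; assumed power of two
#BlockSize = 1024   ; usable bytes wanted

; Over-allocate so an aligned address is guaranteed to exist in the block.
*raw = AllocateMemory(#BlockSize + #Alignment - 1)
If *raw
  ; Round the raw pointer up to the next multiple of #Alignment.
  *aligned = (*raw + #Alignment - 1) & ~(#Alignment - 1)
  Debug *aligned & (#Alignment - 1)   ; 0 when the address is aligned
  FreeMemory(*raw)                    ; always free the original pointer
EndIf
```

The aligned pointer is only a view into the original block, so the original pointer has to be kept around for FreeMemory().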
Windows (x64)
Raspberry Pi OS (Arm64)
djes
Addict
Posts: 1806
Joined: Sat Feb 19, 2005 2:46 pm
Location: Pas-de-Calais, France

Re: Loop & Procedure alignment

Post by djes »

This code still works (there's more on the forum):

Code: Select all

;djes floating point speed test
;2005

#nb = 300000000

f1.f = 103.2
f2.f = 215.45
l1.l = 103
l2.l = 215

;****************************************************************

Goto f
!SECTION '.testf' CODE READABLE EXECUTABLE ALIGN 4096
f:

Temps1 = ElapsedMilliseconds()

!ALIGN 4

For n = 1 To #nb
  ; put your code here
  f3.f = f1 * f2
Next

Temps2 = ElapsedMilliseconds()

;****************************************************************

Goto i
!SECTION '.testi' CODE READABLE EXECUTABLE ALIGN 4096
i:

Temps3 = ElapsedMilliseconds()

!ALIGN 4

For n = 1 To #nb
  ; put your code here
  l3 = l1 * l2
Next

Temps4 = ElapsedMilliseconds()

;****************************************************************
MessageRequester("Speed test", "First : " + Str(Temps2 - Temps1) + " ; Second : " + Str(Temps4 - Temps3) + Chr(10) + "Ratio = 1 / " + StrF((Temps2 - Temps1) / (Temps4 - Temps3)), 0) 
End

Re: Loop & Procedure alignment

Post by Keya »

djes wrote:
!ALIGN 4
For n = 1 To #nb
This doesn't work unfortunately, hence my request :) It doesn't work because the instruction immediately after the !align isn't actually the start of the loop; that's a couple of instructions deeper. For example:

Code: Select all

! align 4
For i = 1 To 3
  ! xchg ecx, ecx
Next i
Without !align, it looks like this (conveniently unaligned on x86, thanks to the 3-in-4 chance lol):

Code: Select all

; 00401042  |.  C705 E4304000 01000000      mov dword ptr [4030E4], 1     ;<- the instruction an !align statement would affect
; 0040104C  |.  EB 00                       jmp short 0040104E
; 0040104E  |>  B8 03000000                 /mov eax, 3                   ;<- actual start of loop we want to align
; 00401053  |.  3B05 E4304000               |cmp eax, dword ptr [4030E4]
; 00401059  |.  7C 0A                       |jl short 00401065
; 0040105B  |.  87C9                        |xchg ecx, ecx
; 0040105D  |.  FF05 E4304000               |inc dword ptr [4030E4]
; 00401063  |.^ 71 E9                       \jno short 0040104E
With the !align statement, the start changes to this:

Code: Select all

; 00401042  |.  90                          nop
; 00401043  |.  90                          nop
; 00401044  |.  C705 E4304000 01000000      mov dword ptr [4030E4], 1
So it has aligned that "mov dword ptr" instruction, but the one we want to align is the loop start/branch target, "mov eax, 3". You can see from the .asm output that the Purebasic compiler already knows where this is :) (I'm definitely not saying that makes this an easy addition for Fred, but hopefully it makes it easier at least. I also appreciate there are a million other feature requests, and even if it never gets added, no harm in asking?) :)

Code: Select all

; For i = 1 To 3
  MOV    dword [v_i],1
  JMP   _ForSkipDebug1
_For1:                      ;<-- :)
_ForSkipDebug1:
  MOV    eax,3
  CMP    eax,dword [v_i]
  JL    _Next2
; ! xchg ecx, ecx
  xchg   ecx, ecx
; Next i
_NextContinue2:
  INC    dword [v_i]
  JNO   _For1
_Next2:

Re: Loop & Procedure alignment

Post by djes »

Thank you for your answer! Actually, it wasn't the !align that I was showing you, it's the Goto/Section trick:

Code: Select all

#nb = 300000000

Define i.i

Temps1 = ElapsedMilliseconds()

For x = 0 To #nb
  
  ; For i = 1 To 3
  i = 1
  !JMP _MyForSkipDebug1
  
  !SECTION '.test2' CODE READABLE EXECUTABLE ALIGN 4096
  
  !_MyFor1:                      ;<-- :)
  !_MyForSkipDebug1:
  !MOV    eax,3
  !CMP    eax,dword [v_i]
  !JL    _MyNext2
  !XCHG   ecx, ecx
  ; Next i
  !_MyNextContinue2:
  !INC    dword [v_i]
  !JNO   _MyFor1
  !_MyNext2:
  
Next x

Temps2 = ElapsedMilliseconds()


;****************************************************************

Temps3 = ElapsedMilliseconds()

For x = 0 To #nb
  
  ; For i = 1 To 3
  i = 1
  !JMP _MyForSkipDebug1_2
  
  !SECTION '.test2' CODE READABLE EXECUTABLE ALIGN 16
  
  !_MyFor1_2:                      ;<-- :)
  !_MyForSkipDebug1_2:
  !MOV    eax,3
  !CMP    eax,dword [v_i]
  !JL    _MyNext2_2
  !XCHG   ecx, ecx
  ; Next i
  !_MyNextContinue2_2:
  !INC    dword [v_i]
  !JNO   _MyFor1_2
  !_MyNext2_2:
  
Next x

Temps4 = ElapsedMilliseconds()

;****************************************************************
MessageRequester("Speed test", "First : " + Str(Temps2 - Temps1) + " ; Second : " + Str(Temps4 - Temps3) + Chr(10) + "Ratio = 1 / " + StrF((Temps2 - Temps1) / (Temps4 - Temps3)), 0) 
End

Re: Loop & Procedure alignment

Post by wilbert »

No significant difference on my computer

Code: Select all

DisableDebugger

t1 = ElapsedMilliseconds()
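!mov ecx, 0
!align 8          ; loop0 starts on an 8-byte boundary
!loop0:
!dec ecx          ; ecx wraps around from 0, so the loop runs 2^32 times
!jnz loop0
t2 = ElapsedMilliseconds()
!mov ecx, 0
!align 8
!nop              ; one byte past the boundary, so loop1 is misaligned
!loop1:
!dec ecx
!jnz loop1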
t3 = ElapsedMilliseconds()

MessageRequester("Timings", "aligned "+Str(t2-t1)+" vs unaligned "+Str(t3-t2))

Re: Loop & Procedure alignment

Post by Keya »

wilbert no i wouldn't expect much difference from that one either!

djes interesting trick! :)

Re: Loop & Procedure alignment

Post by wilbert »

Keya wrote: wilbert no i wouldn't expect much difference from that one either!
In what cases do you expect a difference?
I tried with a bit more code in between and the result is still the same.

Code: Select all

DisableDebugger

t1 = ElapsedMilliseconds()
!mov ecx, 0x10000000
!align 8
!loop0:
!xor ecx, 1
!rol ecx, 1
!xor ecx, 2
!ror ecx, 1
!xor ecx, 1
!rol ecx, 1
!xor ecx, 2
!ror ecx, 1
!xor ecx, 1
!rol ecx, 1
!xor ecx, 2
!ror ecx, 1
!sub ecx, 1
!jnz loop0
t2 = ElapsedMilliseconds()
!mov ecx, 0x10000000
!align 8
!nop
!loop1:
!xor ecx, 1
!rol ecx, 1
!xor ecx, 2
!ror ecx, 1
!xor ecx, 1
!rol ecx, 1
!xor ecx, 2
!ror ecx, 1
!xor ecx, 1
!rol ecx, 1
!xor ecx, 2
!ror ecx, 1
!sub ecx, 1
!jnz loop1
t3 = ElapsedMilliseconds()

MessageRequester("Timings", "aligned "+Str(t2-t1)+" vs unaligned "+Str(t3-t2))