Shouldn't loops always be aligned to 32/64 bit?
No, at the moment they're not aligned for you automatically. But should they be? I'd say no. A lot of (most?) loops don't benefit much from alignment because they don't iterate enough, especially on today's CPUs, which can do billions of ops per second, and alignment comes at the cost of extra NOP bytes (1 to 15 of them for a 16-byte boundary). That said, Intel still recommends aligning such branch targets (function entry points, loop starts, etc.) for optimization.
But when the coder knows a specific loop is executed a lot, it's great to be able to direct the compiler to align it!
... aren't instructions always 32/64-bit aligned?
No! x86 instructions vary in length, so that would require useless NOPs in between just about every instruction.
For a little example of loop alignment, consider the following inline assembly in PB. It's just a little loop which increments eax until it reaches 9:
Code:
! xor eax, eax
;! align 4
!_NextIncrement:
! inc eax ;the start of the loop (the branch target)
! cmp eax, 9
! jne _NextIncrement
Compiled, here's how it looked on my system with the !align statement commented out (so, as it would normally be). Without it the start of the loop can end up at any address, and in this case it isn't aligned (it only had a 1 in 4 chance of landing on a multiple of 4!):
Code:
00401040 |. 31C0 xor eax, eax
00401042 |> 40 /inc eax ;0x00401042 is not 4-aligned
00401043 |. 83F8 09 |cmp eax, 9
00401046 |.^ 75 FA \jnz short 00401042
You can see that it keeps jumping back to 00401042, which is the start of the loop, but 00401042 isn't a multiple of 4, so it's not aligned. By the way, the ideal alignment on x64 is actually 16 bytes!
When the !align statement is used, you can see that NOPs are added (if required) to push the start of the loop to the next aligned address. In this case two bytes were needed (so two single-byte NOPs, although multi-byte NOPs exist too), and the fasm/yasm assembler has inserted them:
Code:
00401040 |. 31C0 xor eax, eax
00401042 |. 90 nop
00401043 |. 90 nop
00401044 |> 40 /inc eax ;0x00401044 is 4-aligned
00401045 |. 83F8 09 |cmp eax, 9
00401048 |.^ 75 FA \jnz short 00401044
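To make the padding arithmetic concrete, here's a rough sketch in plain PB of how many NOP bytes are needed to reach the next aligned address. PaddingBytes is just a name I made up for illustration (it's not a PB or fasm function), and the addresses are the ones from the listings above, so treat it as a sketch rather than what the assembler literally does internally:

```purebasic
; Hypothetical helper (not part of PB or fasm): number of padding bytes
; needed to move 'address' up to the next multiple of 'alignment'.
Procedure.i PaddingBytes(address.i, alignment.i)
  ; address % alignment = how far we are past the previous boundary
  ; the outer % turns a result of 'alignment' into 0 (already aligned)
  ProcedureReturn (alignment - (address % alignment)) % alignment
EndProcedure

Debug PaddingBytes($00401042, 4)   ; 2  -> the two NOPs seen above
Debug PaddingBytes($00401044, 4)   ; 0  -> already 4-aligned
Debug PaddingBytes($00401042, 16)  ; 14 -> cost of 16-byte alignment here
```

The last line shows why alignment isn't free: with align 16, this particular loop would need 14 bytes of padding instead of 2.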
There's already a lot of existing literature on this that explains it far better than I can, so if you're interested in more detailed info just google:
x86 align loops