StarBootics wrote: Mon Sep 25, 2023 2:43 pm
On my computer with the C backend the difference between Method A and C is ridiculous. About the ASM backend, maybe Fred should contact the guys who created the gcc compiler and share his knowledge on optimization in this particular case, because Method B is way faster.
Thanks for the measurements! If you don't mind sharing: what CPU are you running?
It's always interesting how different microarchitectures vary, and yours has way better FP performance relative to mine but sees only marginal gains on the integer side.
Also, GCC can squeeze out some pretty great performance, but you have to compile with "-O3 -mavx2" to get it…
AVX2 isn't so bad (though you'd want to use dynamic dispatch to avoid alienating users), but -O3 may cause too many subtle bugs for it ever to be officially supported in PB. IIRC the Linux kernel devs are still rightly spooked about its stability (even though -O3 has gotten much better in recent years).
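By the way, runtime dispatch is straightforward with GCC/Clang on x86: build the hot routine once per target and pick an implementation at startup with __builtin_cpu_supports. Here's a minimal sketch of the idea (my own example, not code from this thread; the sum_divisions_* names are just placeholders):
Code: Select all
#include <stdint.h>
#include <stdio.h>

/* The same kernel built twice. The target attribute lets the compiler emit
   AVX2 instructions in this one function only, so the rest of the binary
   still runs on older CPUs. */
__attribute__((target("avx2")))
static int64_t sum_divisions_avx2(int32_t n) {
    int64_t sum = 0;
    for (int32_t i = 1; i <= n; i++)
        sum += i / 12345;
    return sum;
}

static int64_t sum_divisions_generic(int32_t n) {
    int64_t sum = 0;
    for (int32_t i = 1; i <= n; i++)
        sum += i / 12345;
    return sum;
}

int main(void) {
    __builtin_cpu_init(); /* required before __builtin_cpu_supports on older GCC */
    int64_t (*sum_divisions)(int32_t) =
        __builtin_cpu_supports("avx2") ? sum_divisions_avx2 : sum_divisions_generic;
    printf("sum: %lld\n", (long long)sum_divisions(1000000000));
    return 0;
}
GCC can also generate this kind of dispatch for you via the target_clones function attribute.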
Here's equivalent sum of divisions in simple C:
Code: Select all
#include <stdint.h>
#include <stdio.h>
#include <sys/time.h>
static int32_t DIVISOR = 12345;
static int32_t NUM_ITER = 1000000000;
uint64_t ElapsedMicroseconds() {
struct timeval current_time;
gettimeofday(&current_time, NULL);
return (uint64_t)current_time.tv_sec * 1000000 + current_time.tv_usec;
}
int main() {
int64_t sum = 0;
int64_t elapsed_time, start_time;
start_time = ElapsedMicroseconds();
for (int32_t i = 1; i <= NUM_ITER; i++) {
sum += i / DIVISOR;
}
elapsed_time = ElapsedMicroseconds() - start_time;
printf("\nMethod E: Pure C");
printf("\n sum D: %lld", sum);
printf("\n time D: %fms", elapsed_time / 1000.f);
return 0;
}
Results without AVX2 (flags in parentheses):
Code: Select all
Method E: Pure C (gcc -O2)
sum E: 40501727705054
time E: 430.404999ms
Method E: Pure C (gcc -O3)
sum E: 40501727705054
time E: 402.380005ms
Results with AVX2 (flags in parentheses):
Code: Select all
Method E: Pure C (gcc -O2 -mavx2)
sum E: 40501727705054
time E: 423.395996ms
Method E: Pure C (gcc -O3 -mavx2)
sum E: 40501727705054
time E: 134.649002ms
We can still handily beat GCC (with a ~21% speedup) with some trivial handwritten AVX2 ASM:
Code: Select all
EnableExplicit
CompilerIf #PB_Compiler_Backend <> #PB_Backend_Asm
CompilerError "Please enable the ASM compiler backend." ; Sorry but this is FASM-style ASM.
CompilerElseIf #PB_Compiler_Processor <> #PB_Processor_x64
CompilerError "Please use the x64 PB compiler." ; Sorry but no AVX2 on x86, ARM, etc.
CompilerElseIf #PB_Compiler_Version < 600
CompilerError "Please use PB 6.00 or higher." ; Sorry but recent FASM needed.
CompilerEndIf
#MAGIC_S = 13
#MAGIC_N = -1444876401
Define MAGIC_N.l = #MAGIC_N ; #MAGIC_N made visible for easy use in ASM.
Define loopFrom.l = 1
Define loopTo.l = 1000000000
Define x.l = loopFrom
Define sum.q
Define startTime, elapsedTime
If Not OpenConsole()
MessageRequester("Error:", "Failed to open console!")
End
EndIf
PrintN("")
PrintN("Method D: AVX2 i32 × 8")
startTime = ElapsedMilliseconds()
; Register assignment:
; ═══════════════════════════════════════════════════════
; ecx <- Loop counter.
; rax <- Sum.
; rdx <- Temporary state.
;
; ymm0 <- [CONST] Loop increment: (i32 × 8) [8 × 8]
; ymm1 <- [CONST] Division magic: (i32 × 8) [MAGIC × 8]
; ymm2 <- Loop counter.
; ymm3 <- Division results.
; ymm4 <- Division results. Temporary state.
; ymm5 <- Division results accumulator.
;
; Loop setup.
; ═══════════════════════════════════════════════════════
; (This could easily be optimised more generally)
; Prepare ecx loop counter.
; Calculate range [from...to] inclusive, then fast-divide by 8 via SHR 3 in order
; to determine the exact loop count required.
! movsxd rcx, [v_loopTo]
! movsxd rax, [v_loopFrom]
! sub rcx, rax
! add rcx, 1
! shr rcx, 3
; Skip the vectorised path if there are fewer than 8 iterations in total (i.e. the vectorised loop count is zero).
! cmp ecx, 0
! jle l__lbl_loop_skip
; Initialise ymm1 with division MAGIC values.
; (i32 × 8) [MAGIC × 8]
! vpbroadcastd ymm1, [v_MAGIC_N]
; Initialise ymm2 with loop counter values.
; Temporarily fill ymm4 with loop stagger values.
; (i32 × 8) [0, 1, 2, 3, 4, 5, 6, 7]
! mov rdx, 0x0706050403020100
! vmovq xmm4, rdx
! vpmovzxbd ymm4, xmm4
! vpbroadcastd ymm2, [v_loopFrom]
! vpaddd ymm2, ymm2, ymm4
; Initialise ymm0 with loop increment values.
; (i32 × 8) [8 × 8]
! mov edx, 8
! vmovd xmm0, edx
! vpbroadcastd ymm0, xmm0
; Since the range is unlikely to be perfectly divisible by 8, we calculate the index from which
; the non-vectorised path should continue once the vectorised loop has finished.
! lea eax, [ecx * 8 + eax]
! mov [v_x], eax
; Clear rax for use as sum, rdx for temporary state.
! xor rax, rax
! xor rdx, rdx
! vpxor ymm5, ymm5, ymm5
; Loop body.
; ═══════════════════════════════════════════════════════
_lbl_loop_body:
; Shift each i64 lane of ymm2 right by 32 so its odd i32 entries land in the even positions of ymm4.
! vpsrlq ymm4, ymm2, 32
; Division magic: multiply by magic number.
; ──────────────────────────────────────────────────
; Multiply even entries (i32 × 8) of ymm2 * ymm1 (#MAGIC) into (i64 × 4) ymm3.
! vpmuldq ymm3, ymm2, ymm1
; Multiply even entries (i32 × 8) of ymm4 * ymm1 (#MAGIC) into (i64 × 4) ymm4.
! vpmuldq ymm4, ymm4, ymm1
; Shuffle separate i64 × 4 vectors into an i32 × 8.
; ──────────────────────────────────────────────────
; Shift entries (i64 × 4) of ymm3 into odd places to ready for interleaving with ymm4.
; Equivalent shuffle:
; vpshufd ymm3, ymm3, 00110001b
! vpsrlq ymm3, ymm3, 32
; Interleave entries (i32 × 8) of ymm3 and ymm4 into ymm4.
! vpblendd ymm4, ymm3, ymm4, 10101010b
; Division magic: add dividend.
; ──────────────────────────────────────────────────
; Add original entries (i32 × 8) of ymm2 into ymm4.
! vpaddd ymm4, ymm4, ymm2
; Division magic: shift by magic number.
; ──────────────────────────────────────────────────
; Shift-right entries (i32 × 8) by magic value 13.
! vpsrad ymm4, ymm4, 13
; Accumulate sums into i64 × 4.
; ──────────────────────────────────────────────────
; Take high half of division results from ymm4 into ymm3.
! vextracti128 xmm3, ymm4, 1
; Sum high and low halves of division results into ymm3.
! vpaddd xmm3, xmm3, xmm4
; Spread (i32 × 8)[a, b, c, d, 0, 0, 0, 0] from ymm3
; into (i64 × 4)[a, b, c, d]
! vpmovzxdq ymm3, xmm3
; Sum into (i64 × 4) vector ymm5. Sufficient to avoid overflow for applicable loop ranges.
! vpaddq ymm5, ymm5, ymm3
; Next loop step.
; ──────────────────────────────────────────────────
; Decrement loop counter.
! sub rcx, 1
; Increment vectorised loop counter (ymm2) by the iteration increment values (ymm0).
! vpaddd ymm2, ymm2, ymm0
; Continue loop if end not reached.
! test ecx, ecx
! jnz l__lbl_loop_body
_lbl_loop_end:
; End of vectorised loop: transfer i64 sums from ymm5 to sum variable.
! vextracti128 xmm4, ymm5, 1
! vpaddq ymm4, ymm4, ymm5
! vpshufd xmm3, xmm4, 11101110b
! vpaddq xmm3, xmm3, xmm4
! vmovq [v_sum], xmm3
; Non-vectorised loop fallback.
; ═══════════════════════════════════════════════════════
_lbl_loop_skip:
For x = x To loopTo
sum + (x + (x * #MAGIC_N) >> 32) >> #MAGIC_S
Next x
elapsedTime = ElapsedMilliseconds() - startTime
PrintN(" sum D: " + sum)
PrintN(" time D: " + elapsedTime + "ms")
PrintN("Press <return> to exit...")
Input()
Results for that (PB 6.02):
Code: Select all
Method D: AVX2 i32 × 8
sum D: 40501727705054
time D: 111ms
The ASM is safe for any long loop with a non-negative range, i.e. 0…+2'147'483'647. It can be made to tolerate any possible long loop range, including -2'147'483'648…+2'147'483'647, while still beating GCC by ~13%. And by adding an extra branch and delaying the accumulation of sums, you can beat GCC by ~72% for huge loop ranges like this one. I've omitted some optimisations like these for readability.
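For reference, the loop body above is just the scalar multiply-and-shift below done eight lanes at a time. A quick standalone check in C (my own sketch; it assumes gcc/clang-style arithmetic right shifts on signed types) confirms that the constants agree with plain division by 12345 for non-negative inputs:
Code: Select all
#include <assert.h>
#include <stdint.h>

/* x / 12345 for x >= 0, using the same constants as the ASM above:
   multiply by MAGIC_N = -1444876401, take the high 32 bits of the 64-bit
   product, add x, then arithmetic-shift right by MAGIC_S = 13. */
static int32_t magic_div_12345(int32_t x) {
    int32_t hi = (int32_t)(((int64_t)x * -1444876401LL) >> 32);
    return (x + hi) >> 13;
}

int main(void) {
    for (int32_t x = 0; x <= 100000000; x++) /* spot-check a large prefix of the range */
        assert(magic_div_12345(x) == x / 12345);
    return 0;
}
It's the same strength reduction a compiler performs when the divisor is a compile-time constant; the ASM just applies it to eight dividends at once.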
This is very much a microbenchmark, but vectorisation optimisations like these can be hugely beneficial when you've a vast set of data that needs number-crunching done on it.
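If hand-written FASM feels like a step too far, the same kernel can be expressed with AVX2 intrinsics in C and compiled with -mavx2. The following is a rough sketch of my own (not the original code), assuming a non-negative range and gcc/clang arithmetic right shifts:
Code: Select all
#include <immintrin.h>
#include <stdint.h>
#include <stdio.h>

/* AVX2 sketch of the loop above: 8 lanes of x / 12345 per iteration via the
   multiply-and-shift constants (MAGIC_N = -1444876401, shift 13). */
static int64_t sum_divisions_avx2(int32_t from, int32_t to) {
    const __m256i magic = _mm256_set1_epi32(-1444876401);
    const __m256i step  = _mm256_set1_epi32(8);
    __m256i x   = _mm256_add_epi32(_mm256_set1_epi32(from),
                                   _mm256_setr_epi32(0, 1, 2, 3, 4, 5, 6, 7));
    __m256i acc = _mm256_setzero_si256();        /* 4 x i64 partial sums */

    int64_t count     = (int64_t)to - from + 1;
    int64_t vec_iters = count > 0 ? count >> 3 : 0;
    int64_t i         = from + (vec_iters << 3); /* where the scalar tail resumes */

    for (int64_t k = 0; k < vec_iters; k++) {
        /* High 32 bits of x * MAGIC_N for even and odd lanes (VPMULDQ only
           multiplies the even i32 element of each i64 group). */
        __m256i hi_even = _mm256_srli_epi64(_mm256_mul_epi32(x, magic), 32);
        __m256i hi_odd  = _mm256_mul_epi32(_mm256_srli_epi64(x, 32), magic);
        __m256i hi      = _mm256_blend_epi32(hi_even, hi_odd, 0xAA);
        /* Quotient: (x + high32(x * MAGIC_N)) >> 13, as in the ASM. */
        __m256i q = _mm256_srai_epi32(_mm256_add_epi32(hi, x), 13);
        /* Fold 8 x i32 quotients into 4 x i64 and accumulate. */
        __m128i s = _mm_add_epi32(_mm256_castsi256_si128(q),
                                  _mm256_extracti128_si256(q, 1));
        acc = _mm256_add_epi64(acc, _mm256_cvtepu32_epi64(s));
        x = _mm256_add_epi32(x, step);
    }

    /* Horizontal reduction of the 4 partial sums. */
    __m128i lo = _mm_add_epi64(_mm256_castsi256_si128(acc),
                               _mm256_extracti128_si256(acc, 1));
    int64_t sum = _mm_extract_epi64(lo, 0) + _mm_extract_epi64(lo, 1);

    /* Scalar tail for the remaining (count % 8) values. */
    for (; i <= to; i++)
        sum += (i + (((int64_t)i * -1444876401) >> 32)) >> 13;
    return sum;
}

int main(void) {
    printf("sum: %lld\n", (long long)sum_divisions_avx2(1, 1000000000));
    return 0;
}
Gate it behind a runtime AVX2 check like the one sketched earlier and it stays safe on CPUs that lack AVX2.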
AVX2 is supported by ~91.2% of Steam Hardware Survey participants. Gamers are much more likely to have modern CPUs, so expect the level of support outside the gaming population to be lower.
SSE4, on the other hand, is extremely widely supported and still offers significant gains (230.72ms in the test above).
Update: I did some experimentation on Apple Silicon, and it seems Clang really doesn't want to vectorise the code here even with all the high-performance tuning options maxed out, so a little hand-tuning yields incredible gains (>100%).