StarBootics wrote: Mon Sep 25, 2023 2:43 pm
On my computer with the C backend the difference between Method A and C is ridiculous. About the ASM backend, maybe Fred should contact the guys who created the gcc compiler and share his knowledge on optimization in this particular case, because Method B is way faster.
Thanks for the measurements! If you don't mind sharing: what CPU are you running?
It's always interesting how different microarchitectures vary, and yours has way better FP performance relative to mine but sees only marginal gains on the integer side.
Also, GCC can squeeze out some pretty great performance, but you have to compile with "-O3 -mavx2" to get it…
AVX2 isn't so bad (though you'd want to use dynamic dispatch to avoid alienating users), but -O3 may cause too many subtle bugs for it ever to be officially supported in PB. IIRC the Linux kernel devs are still rightly spooked about its stability (even though -O3 has gotten much better in recent years).
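By the way, runtime dispatch is straightforward with GCC/Clang on x86: build the hot routine once per target and pick an implementation at startup with __builtin_cpu_supports. Here's a minimal sketch of the idea (my own example, not code from this thread; the sum_divisions_* names are just placeholders):
Code: Select all
#include <stdint.h>
#include <stdio.h>

/* The same kernel built twice. The target attribute lets the compiler emit
   AVX2 instructions in this one function only, so the rest of the binary
   still runs on older CPUs. */
__attribute__((target("avx2")))
static int64_t sum_divisions_avx2(int32_t n) {
    int64_t sum = 0;
    for (int32_t i = 1; i <= n; i++)
        sum += i / 12345;
    return sum;
}

static int64_t sum_divisions_generic(int32_t n) {
    int64_t sum = 0;
    for (int32_t i = 1; i <= n; i++)
        sum += i / 12345;
    return sum;
}

int main(void) {
    __builtin_cpu_init(); /* required before __builtin_cpu_supports on older GCC */
    int64_t (*sum_divisions)(int32_t) =
        __builtin_cpu_supports("avx2") ? sum_divisions_avx2 : sum_divisions_generic;
    printf("sum: %lld\n", (long long)sum_divisions(1000000000));
    return 0;
}
GCC can also generate this kind of dispatch for you via the target_clones function attribute.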
Here's equivalent sum of divisions in simple C:
Code: Select all
#include <stdint.h>
#include <stdio.h>
#include <sys/time.h>
static int32_t DIVISOR = 12345;
static int32_t NUM_ITER = 1000000000;
uint64_t ElapsedMicroseconds() {
struct timeval current_time;
gettimeofday(&current_time, NULL);
return (uint64_t)current_time.tv_sec * 1000000 + current_time.tv_usec;
}
int main() {
int64_t sum = 0;
int64_t elapsed_time, start_time;
start_time = ElapsedMicroseconds();
for (int32_t i = 1; i <= NUM_ITER; i++) {
sum += i / DIVISOR;
}
elapsed_time = ElapsedMicroseconds() - start_time;
printf("\nMethod E: Pure C");
printf("\n sum D: %lld", sum);
printf("\n time D: %fms", elapsed_time / 1000.f);
return 0;
}
Results without AVX2 (flags in parentheses):
Code: Select all
Method E: Pure C (gcc -O2)
sum E: 40501727705054
time E: 430.404999ms
Method E: Pure C (gcc -O3)
sum E: 40501727705054
time E: 402.380005ms
Results with AVX2 (flags in parentheses):
Code: Select all
Method E: Pure C (gcc -O2 -mavx2)
sum E: 40501727705054
time E: 423.395996ms
Method E: Pure C (gcc -O3 -mavx2)
sum E: 40501727705054
time E: 134.649002ms
We can still handily beat GCC (with a ~21% speedup) with some trivial handwritten AVX2 ASM:
Code: Select all
EnableExplicit
CompilerIf #PB_Compiler_Backend <> #PB_Backend_Asm
CompilerError "Please enable the ASM compiler backend." ; Sorry but this is FASM-style ASM.
CompilerElseIf #PB_Compiler_Processor <> #PB_Processor_x64
CompilerError "Please use the x64 PB compiler." ; Sorry but no AVX2 on x86, ARM, etc.
CompilerElseIf #PB_Compiler_Version < 600
CompilerError "Please use PB 6.00 or higher." ; Sorry but recent FASM needed.
CompilerEndIf
#MAGIC_S = 13
#MAGIC_N = -1444876401
Define MAGIC_N.l = #MAGIC_N ; #MAGIC_N made visible for easy use in ASM.
Define loopFrom.l = 1
Define loopTo.l = 1000000000
Define x.l = loopFrom
Define sum.q
Define startTime, elapsedTime
If Not OpenConsole()
MessageRequester("Error:", "Failed to open console!")
End
EndIf
PrintN("")
PrintN("Method D: AVX2 i32 × 8")
startTime = ElapsedMilliseconds()
; Register assignment:
; ═══════════════════════════════════════════════════════
; ecx <- Loop counter.
; rax <- Sum.
; rdx <- Temporary state.
;
; ymm0 <- [CONST] Loop increment: (i32 × 8) [8 × 8]
; ymm1 <- [CONST] Division magic: (i32 × 8) [MAGIC × 8]
; ymm2 <- Loop counter.
; ymm3 <- Division results.
; ymm4 <- Division results. Temporary state.
; ymm5 <- Division results accumulator.
;
; Loop setup.
; ═══════════════════════════════════════════════════════
; (This could easily be optimised more generally)
; Prepare ecx loop counter.
; Calculate range [from...to] inclusive, then fast-divide by 8 via SHR 3 in order
; to determine the exact loop count required.
! movsxd rcx, [v_loopTo]
! movsxd rax, [v_loopFrom]
! sub rcx, rax
! add rcx, 1
! shr rcx, 3
; Skip the vectorised path if there are fewer than 8 iterations in total (i.e. the vectorised loop count is zero).
! cmp ecx, 0
! jle l__lbl_loop_skip
; Initialise ymm1 with division MAGIC values.
; (i32 × 8) [MAGIC × 8]
! vpbroadcastd ymm1, [v_MAGIC_N]
; Initialise ymm2 with loop counter values.
; Temporarily fill ymm4 with loop stagger values.
; (i32 × 8) [0, 1, 2, 3, 4, 5, 6, 7]
! mov rdx, 0x0706050403020100
! vmovq xmm4, rdx
! vpmovzxbd ymm4, xmm4
! vpbroadcastd ymm2, [v_loopFrom]
! vpaddd ymm2, ymm2, ymm4
; Initialise ymm0 with loop increment values.
; (i32 × 8) [8 × 8]
! mov edx, 8
! vmovd xmm0, edx
! vpbroadcastd ymm0, xmm0
; Since the range is unlikely to be perfectly divisible by 8, we calculate the index from which
; the non-vectorised path should continue once the vectorised loop has finished.
! lea eax, [ecx * 8 + eax]
! mov [v_x], eax
; Clear rax for use as sum, rdx for temporary state.
! xor rax, rax
! xor rdx, rdx
! vpxor ymm5, ymm5, ymm5
; Loop body.
; ═══════════════════════════════════════════════════════
_lbl_loop_body:
; Shift each i64 lane of ymm2 right by 32 so its odd i32 entries land in the even positions of ymm4.
! vpsrlq ymm4, ymm2, 32
; Division magic: multiply by magic number.
; ──────────────────────────────────────────────────
; Multiply even entries (i32 × 8) of ymm2 * ymm1 (#MAGIC) into (i64 × 4) ymm3.
! vpmuldq ymm3, ymm2, ymm1
; Multiply even entries (i32 × 8) of ymm4 * ymm1 (#MAGIC) into (i64 × 4) ymm4.
! vpmuldq ymm4, ymm4, ymm1
; Shuffle separate i64 × 4 vectors into an i32 × 8.
; ──────────────────────────────────────────────────
; Shift entries (i64 × 4) of ymm3 into odd places to ready for interleaving with ymm4.
; Equivalent shuffle:
; vpshufd ymm3, ymm3, 00110001b
! vpsrlq ymm3, ymm3, 32
; Interleave entries (i32 × 8) of ymm3 and ymm4 into ymm4.
! vpblendd ymm4, ymm3, ymm4, 10101010b
; Division magic: add dividend.
; ──────────────────────────────────────────────────
; Add original entries (i32 × 8) of ymm2 into ymm4.
! vpaddd ymm4, ymm4, ymm2
; Division magic: shift by magic number.
; ──────────────────────────────────────────────────
; Shift-right entries (i32 × 8) by magic value 13.
! vpsrad ymm4, ymm4, 13
; Accumulate sums into i64 × 4.
; ──────────────────────────────────────────────────
; Take high half of division results from ymm4 into ymm3.
! vextracti128 xmm3, ymm4, 1
; Sum high and low halves of division results into ymm3.
! vpaddd xmm3, xmm3, xmm4
; Spread (i32 × 8)[a, b, c, d, 0, 0, 0, 0] from ymm3
; into (i64 × 4)[a, b, c, d]
! vpmovzxdq ymm3, xmm3
; Sum into (i64 × 4) vector ymm5. Sufficient to avoid overflow for applicable loop ranges.
! vpaddq ymm5, ymm5, ymm3
; Next loop step.
; ──────────────────────────────────────────────────
; Decrement loop counter.
! sub rcx, 1
; Increment vectorised loop counter (ymm2) by the iteration increment values (ymm0).
! vpaddd ymm2, ymm2, ymm0
; Continue loop if end not reached.
! test ecx, ecx
! jnz l__lbl_loop_body
_lbl_loop_end:
; End of vectorised loop: transfer i64 sums from ymm5 to sum variable.
! vextracti128 xmm4, ymm5, 1
! vpaddq ymm4, ymm4, ymm5
! vpshufd xmm3, xmm4, 11101110b
! vpaddq xmm3, xmm3, xmm4
! vmovq [v_sum], xmm3
; Non-vectorised loop fallback.
; ═══════════════════════════════════════════════════════
_lbl_loop_skip:
For x = x To loopTo
sum + (x + (x * #MAGIC_N) >> 32) >> #MAGIC_S
Next x
elapsedTime = ElapsedMilliseconds() - startTime
PrintN(" sum D: " + sum)
PrintN(" time D: " + elapsedTime + "ms")
PrintN("Press <return> to exit...")
Input()
Results for that (PB 6.02):
Code: Select all
Method D: AVX2 i32 × 8
sum D: 40501727705054
time D: 111ms
The ASM is safe for any long loop with a non-negative range, i.e. 0…+2'147'483'647. It can be made to tolerate any possible long loop range, including -2'147'483'648…+2'147'483'647, while still beating GCC by ~13%. And by adding an extra branch and delaying the accumulation of sums, you can beat GCC by ~72% for huge loop ranges like this one. I've omitted some optimisations like these for readability.
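For reference, the loop body above is just the scalar multiply-and-shift below done eight lanes at a time. A quick standalone check in C (my own sketch; it assumes gcc/clang-style arithmetic right shifts on signed types) confirms that the constants agree with plain division by 12345 for non-negative inputs:
Code: Select all
#include <assert.h>
#include <stdint.h>

/* x / 12345 for x >= 0, using the same constants as the ASM above:
   multiply by MAGIC_N = -1444876401, take the high 32 bits of the 64-bit
   product, add x, then arithmetic-shift right by MAGIC_S = 13. */
static int32_t magic_div_12345(int32_t x) {
    int32_t hi = (int32_t)(((int64_t)x * -1444876401LL) >> 32);
    return (x + hi) >> 13;
}

int main(void) {
    for (int32_t x = 0; x <= 100000000; x++) /* spot-check a large prefix of the range */
        assert(magic_div_12345(x) == x / 12345);
    return 0;
}
It's the same strength reduction a compiler performs when the divisor is a compile-time constant; the ASM just applies it to eight dividends at once.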
This is very much a microbenchmark, but vectorisation optimisations like these can be hugely beneficial when you've a vast set of data that needs number-crunching done on it.
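If hand-written FASM feels like a step too far, the same kernel can be expressed with AVX2 intrinsics in C and compiled with -mavx2. The following is a rough sketch of my own (not the original code), assuming a non-negative range and gcc/clang arithmetic right shifts:
Code: Select all
#include <immintrin.h>
#include <stdint.h>
#include <stdio.h>

/* AVX2 sketch of the loop above: 8 lanes of x / 12345 per iteration via the
   multiply-and-shift constants (MAGIC_N = -1444876401, shift 13). */
static int64_t sum_divisions_avx2(int32_t from, int32_t to) {
    const __m256i magic = _mm256_set1_epi32(-1444876401);
    const __m256i step  = _mm256_set1_epi32(8);
    __m256i x   = _mm256_add_epi32(_mm256_set1_epi32(from),
                                   _mm256_setr_epi32(0, 1, 2, 3, 4, 5, 6, 7));
    __m256i acc = _mm256_setzero_si256();        /* 4 x i64 partial sums */

    int64_t count     = (int64_t)to - from + 1;
    int64_t vec_iters = count > 0 ? count >> 3 : 0;
    int64_t i         = from + (vec_iters << 3); /* where the scalar tail resumes */

    for (int64_t k = 0; k < vec_iters; k++) {
        /* High 32 bits of x * MAGIC_N for even and odd lanes (VPMULDQ only
           multiplies the even i32 element of each i64 group). */
        __m256i hi_even = _mm256_srli_epi64(_mm256_mul_epi32(x, magic), 32);
        __m256i hi_odd  = _mm256_mul_epi32(_mm256_srli_epi64(x, 32), magic);
        __m256i hi      = _mm256_blend_epi32(hi_even, hi_odd, 0xAA);
        /* Quotient: (x + high32(x * MAGIC_N)) >> 13, as in the ASM. */
        __m256i q = _mm256_srai_epi32(_mm256_add_epi32(hi, x), 13);
        /* Fold 8 x i32 quotients into 4 x i64 and accumulate. */
        __m128i s = _mm_add_epi32(_mm256_castsi256_si128(q),
                                  _mm256_extracti128_si256(q, 1));
        acc = _mm256_add_epi64(acc, _mm256_cvtepu32_epi64(s));
        x = _mm256_add_epi32(x, step);
    }

    /* Horizontal reduction of the 4 partial sums. */
    __m128i lo = _mm_add_epi64(_mm256_castsi256_si128(acc),
                               _mm256_extracti128_si256(acc, 1));
    int64_t sum = _mm_extract_epi64(lo, 0) + _mm_extract_epi64(lo, 1);

    /* Scalar tail for the remaining (count % 8) values. */
    for (; i <= to; i++)
        sum += (i + (((int64_t)i * -1444876401) >> 32)) >> 13;
    return sum;
}

int main(void) {
    printf("sum: %lld\n", (long long)sum_divisions_avx2(1, 1000000000));
    return 0;
}
Gate it behind a runtime AVX2 check like the one sketched earlier and it stays safe on CPUs that lack AVX2.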
AVX2 is supported by ~91.2% of Steam Hardware Survey participants. Gamers are much more likely to have modern CPUs, so expect the level of support outside the gaming population to be lower.
SSE4, on the other hand, is extremely widely supported and still offers significant gains (230.72ms in the test above).
Update: I did some experimentation on Apple Silicon, and it seems Clang really doesn't want to vectorise the code here even with all the high-performance tuning options maxed out, so a little hand-tuning yields incredible gains (>100%).