Use "& operator" instead of modulo on power of 2 divisors

Share your advanced PureBasic knowledge/code with the community.
benubi
Enthusiast
Enthusiast
Posts: 220
Joined: Tue Mar 29, 2005 4:01 pm

Use "& operator" instead of modulo on power of 2 divisors

Post by benubi »

Hello

this code is a nothingburger, not as fast as I hoped for, but still around 2x to 3x faster than using modulo % operator. When you have a divisor of power of 2, you can use an & operation with a mask. Modulo is said to be a slow operation, but it's not that slow as I imagined (at least in these tests). If you do millions of those specific modulo operations maybe this could be useful, tho; but it only works with 2^n divisors, with a mask = (2^n)-1.

In this "demo" there's a procedure and two macros, and a PB % operation version. All codes do the same: calculate the remainder, and if there's a remainder (non zero) they calculate the "padding" that should be added (divisor - remainder).

Code: Select all

; BinModOptimized.pb
; ------------------
;
; optimized modulo operations for divisors that are ALWAYS power of 2 (2,4,8 ... 1024 ... etc)
;
;
;


Procedure BinMod(value, divi)
;   Debug "value: "+value
;   Debug value % divi
;   Debug value & (divi-1)
;   If value & (divi-1)
;     Debug "Padding:" + Str(divi - ((value & (divi-1))))
;   Else
;     Debug "no padding necessary"
;   EndIf 
;   Debug "-----------------"
  ProcedureReturn value & (divi-1)
EndProcedure

Macro M_binMod(_result_, _value_, _divisor_, _paddingresult_)
  _result_ = _value_ & (_divisor_ - 1) 
  If _result_
    _paddingresult_ = _divisor_ - _result_
  Else 
    _paddingresult_ = 0
  EndIf 
EndMacro
Macro M_binMod2(_result_, _value_, _divisor_ , _divisormask_, _paddingresult_)
  _result_ = _value_ & _divisormask_
  If _result_
    _paddingresult_ = (_divisor_) - _result_
  Else 
    _paddingresult_ = 0
  EndIf 
EndMacro


#MAX_LOOPS = 1000 * 1000 * 100 ; 100 M
#MAX_RANDOM = 1234567890
#DIVISOR = 4096
#DIVISOR_MASK = #DIVISOR - 1
Define _rest, _padding
Define s1,e1

s1 = ElapsedMilliseconds()
For i=1 To #MAX_LOOPS
  _rest = BinMod(Random(#MAX_RANDOM),#DIVISOR)
  If _rest 
    _padding = #DIVISOR - _rest
  Else 
    _padding = 0
  EndIf 
Next
e1 = ElapsedMilliseconds()



Define s2,e2

s2 = ElapsedMilliseconds()
For i=1 To #MAX_LOOPS
  M_binMod(_rest, Random(#MAX_RANDOM), #DIVISOR, _padding)
Next
e2 = ElapsedMilliseconds()

Define s3,e3,_div_var

_div_var = #DIVISOR

s3 = ElapsedMilliseconds()
For i=1 To #MAX_LOOPS
  _rest = Random(#MAX_RANDOM) % _div_var
  
  If _rest 
    _padding = #DIVISOR - rest 
  Else 
    _padding = 0
  EndIf
  
Next 
e3 = ElapsedMilliseconds()

Define s4,e4


s4 = ElapsedMilliseconds()
For i=1 To #MAX_LOOPS
  M_binMod2(_rest, Random(#MAX_RANDOM), #DIVISOR, #DIVISOR_MASK, _padding)
Next 
e4 = ElapsedMilliseconds()

msg$="Test results"+#CRLF$
msg$+"Number of loops: "+FormatNumber(#MAX_LOOPS,0)+#CRLF$
msg$+"BinMod() function exec time ms: "+FormatNumber(e1-s1,0)+#CRLF$
msg$+ "M_binMod() macro exec time ms: "+FormatNumber(e2-s2,0)+#CRLF$
msg$+ "M_binMod2() macro exec time ms: "+FormatNumber(e4-s4,0)+#CRLF$
msg$+ "PureBasic % operator exec time ms: "+FormatNumber(e3-s3,0)+#CRLF$
msg$+" "+#CRLF$
msg$+"PureBasic "+FormatNumber(#PB_Compiler_Version/100,2)+"  "

CompilerSelect #PB_Compiler_OS
  CompilerCase #PB_OS_Windows
    msg$+"Windows"
  CompilerCase #PB_OS_Linux
    msg$+"Linux"
  CompilerCase #PB_OS_AmigaOS
    msg$+"AmigaOS"
  CompilerCase #PB_OS_MacOS
    msg$+"MacOS"
  CompilerCase #PB_OS_Web
    msg$+"Web"
  CompilerDefault
    msg$+"UNKNOWN-OPERATING-SYSTEM"
CompilerEndSelect
msg$+" "
CompilerSelect #PB_Compiler_Processor
  CompilerCase #PB_Processor_Arm32
    msg$+"arm-32"
  CompilerCase #PB_Processor_Arm64
    msg$+"arm-64"
  CompilerCase #PB_Processor_JavaScript
    msg$+"JavaScript"
  CompilerCase #PB_Processor_mc68000
    msg$+"mc68000"
  CompilerCase #PB_Processor_PowerPC
    msg$+"PowerPC"
  CompilerCase #PB_Processor_x64
    msg$+"x64"
  CompilerCase #PB_Processor_x86
    msg$+"x86"
  CompilerDefault
    msg$+"UNKNOWN-PROCESSOR"
CompilerEndSelect

CompilerIf #PB_Backend_Asm = #PB_Compiler_Backend
  msg$+" (ASM Backend"
CompilerElse 
  msg$+" (C Backend"
CompilerEndIf 

CompilerIf #PB_Compiler_Optimizer
  msg$+" with optimizer)"
CompilerElse
  msg$+")"
CompilerEndIf

msg$+#CRLF$

SetClipboardText(msg$)
MessageRequester("BinMod test results",msg$)

Some of my results. I haven't checked how well it works on Linux+Raspberry Pi, yet...

Code: Select all

Test results
Number of loops: 100,000,000
CheckBinMod() function exec time ms: 611
M_binMod() macro exec time ms: 396
M_binMod2() macro exec time ms: 396
PureBasic % operator exec time ms: 1,177
 
PureBasic 6.04  Windows x64 (ASM Backend)

Code: Select all

Test results
Number of loops: 100,000,000
BinMod() function exec time ms: 340
M_binMod() macro exec time ms: 342
M_binMod2() macro exec time ms: 340
PureBasic % operator exec time ms: 516
 
PureBasic 6.04  Windows x86 (C Backend with optimizer)

Code: Select all

Test results
Number of loops: 100,000,000
BinMod() function exec time ms: 359
M_binMod() macro exec time ms: 354
M_binMod2() macro exec time ms: 353
PureBasic % operator exec time ms: 1,058
 
PureBasic 6.10  Windows x64 (C Backend with optimizer)
User avatar
Mindphazer
Enthusiast
Enthusiast
Posts: 460
Joined: Mon Sep 10, 2012 10:41 am
Location: Savoie

Re: Use "& operator" instead of modulo on power of 2 divisors

Post by Mindphazer »

Here are my results on my MacBook Pro M1 :

Code: Select all

Test results
Number of loops: 100,000,000
BinMod() function exec time ms: 512
M_binMod() macro exec time ms: 448
M_binMod2() macro exec time ms: 435
PureBasic % operator exec time ms: 435

PureBasic 6.04  MacOS arm-64 (C Backend)
MacBook Pro 16" M4 Pro - 24 Gb - MacOS 15.4.1 - Iphone 15 Pro Max - iPad at home
...and unfortunately... Windows at work...
User avatar
NicTheQuick
Addict
Addict
Posts: 1519
Joined: Sun Jun 22, 2003 7:43 pm
Location: Germany, Saarbrücken
Contact:

Re: Use "& operator" instead of modulo on power of 2 divisors

Post by NicTheQuick »

Here's the result on my system:
ASM

Code: Select all

Test results
Number of loops: 100,000,000
BinMod() function exec time ms: 527
M_binMod() macro exec time ms: 348
M_binMod2() macro exec time ms: 358
PureBasic % operator exec time ms: 408
 
PureBasic 6.10  Linux x64 (ASM Backend with optimizer)
C-Backend

Code: Select all

Test results
Number of loops: 100,000,000
BinMod() function exec time ms: 333
M_binMod() macro exec time ms: 332
M_binMod2() macro exec time ms: 333
PureBasic % operator exec time ms: 353
 
PureBasic 6.10  Linux x64 (C Backend with optimizer)
As expected, the C-Backend is well aware of these optimizations and therefore every loop has nearly the same speed.
Without optimizer it looks like this:

Code: Select all

Test results
Number of loops: 100,000,000
BinMod() function exec time ms: 487
M_binMod() macro exec time ms: 383
M_binMod2() macro exec time ms: 359
PureBasic % operator exec time ms: 382
 
PureBasic 6.10  Linux x64 (C Backend)
The english grammar is freeware, you can use it freely - But it's not Open Source, i.e. you can not change it or publish it in altered way.
User avatar
STARGÅTE
Addict
Addict
Posts: 2232
Joined: Thu Jan 10, 2008 1:30 pm
Location: Germany, Glienicke
Contact:

Re: Use "& operator" instead of modulo on power of 2 divisors

Post by STARGÅTE »

I see no difference with my Ryzen 9 3900X between ASM and C or optimized C. Also M_binMod is just 20% faster.

What is you processor benubi, that the % operator is so slow for you?
Test results
Number of loops: 100,000,000
BinMod() function exec time ms: 555
M_binMod() macro exec time ms: 394
M_binMod2() macro exec time ms: 395
PureBasic % operator exec time ms: 513

PureBasic 6.10 Windows x64 (ASM Backend)
Test results
Number of loops: 100,000,000
BinMod() function exec time ms: 494
M_binMod() macro exec time ms: 371
M_binMod2() macro exec time ms: 371
PureBasic % operator exec time ms: 504

PureBasic 6.10 Windows x64 (C Backend)
Test results
Number of loops: 100,000,000
BinMod() function exec time ms: 363
M_binMod() macro exec time ms: 359
M_binMod2() macro exec time ms: 357
PureBasic % operator exec time ms: 501

PureBasic 6.10 Windows x64 (C Backend with optimizer)
PB 6.01 ― Win 10, 21H2 ― Ryzen 9 3900X, 32 GB ― NVIDIA GeForce RTX 3080 ― Vivaldi 6.0 ― www.unionbytes.de
Lizard - Script language for symbolic calculations and moreTypeface - Sprite-based font include/module
benubi
Enthusiast
Enthusiast
Posts: 220
Joined: Tue Mar 29, 2005 4:01 pm

Re: Use "& operator" instead of modulo on power of 2 divisors

Post by benubi »

My CPU, copy & paste from the device manager

Intel(R) Core(TM) i5-10400F CPU @ 2.90GHz 2.90 GHz


It seems when comparing the results, some compilers (e.g. Mac) will use that & operator optimization, while others won't but will use some other technique. I've seen somewhere there are sophisticated bitmask operations for optimized modulo operations on constants, and that compilers/optimizers may use them. This means every constant number, or a set of them, may have various bitmask tricks (idk how to describe that differently) to get a faster result. But this only works with constants, even tho I peeked into experimental procedures that may fork to those optimized subroutines but I imagine the gain to be low.

I wonder if processor hardware may do similar things, when interpreting Stargate's results. Here is how it could work (I am no hardware engineer so I imagine it from a layman/kid's perspective). It has to check the bit population of the divisor. If it is exactly 1, it must be a power of 2 number, and then it can quickly make the bitmask from it (all the lower bits following the first one). Since these are exactly predictable (32/64 possible exceptions) there could be an efficient way to take a shortcut from those alleged 400 cycles (??) the modulo operation can take, and return earlier.

That's only a suspicion but afaik divisions are the most cycle consuming compared to logical operations, so only 20% gain would mean that IMO there could be some hard/soft/middle-ware optimization happening outside of PB when the code is executed.
User avatar
STARGÅTE
Addict
Addict
Posts: 2232
Joined: Thu Jan 10, 2008 1:30 pm
Location: Germany, Glienicke
Contact:

Re: Use "& operator" instead of modulo on power of 2 divisors

Post by STARGÅTE »

The CPU instruction table for various CPUs show that the latency for DIV is aroung 15 and binary AND is just 1.
This is also similar between Intel Cannon Lake (Intel 10th gen. Core) and Zen 2 (AMD Ryzen 3000).

Probably the instruction Random() is the time limiting function in the loop.
You should subtract the time for a loop with only Random() from the other times. Then the result is different (as I would expect):

Code: Select all

s0 = ElapsedMilliseconds()
For i=1 To #MAX_LOOPS
  _rest = Random(#MAX_RANDOM)
Next
e0 = ElapsedMilliseconds()

Code: Select all

msg$+"BinMod() function exec time ms: "+FormatNumber(e1-s1-(e0-s0),0)+#CRLF$
msg$+ "M_binMod() macro exec time ms: "+FormatNumber(e2-s2-(e0-s0),0)+#CRLF$
msg$+ "M_binMod2() macro exec time ms: "+FormatNumber(e4-s4-(e0-s0),0)+#CRLF$
msg$+ "PureBasic % operator exec time ms: "+FormatNumber(e3-s3-(e0-s0),0)+#CRLF$

Code: Select all

Test results
Number of loops: 100,000,000
BinMod() function exec time ms: 221
M_binMod() macro exec time ms: 57
M_binMod2() macro exec time ms: 58
PureBasic % operator exec time ms: 178

PureBasic 6.10  Windows x64 (ASM Backend)
PB 6.01 ― Win 10, 21H2 ― Ryzen 9 3900X, 32 GB ― NVIDIA GeForce RTX 3080 ― Vivaldi 6.0 ― www.unionbytes.de
Lizard - Script language for symbolic calculations and moreTypeface - Sprite-based font include/module
Post Reply