Population count

Bare metal programming in PureBasic, for experienced users
wilbert
PureBasic Expert
PureBasic Expert
Posts: 3870
Joined: Sun Aug 08, 2004 5:21 am
Location: Netherlands

Re: Population count

Post by wilbert »

davido wrote:I've chosen the PB version of the method described by idle; it runs at 210%.
I present it, just in case it can be improved upon

Code: Select all

Procedure popcount9(i.l)
  i - (i >> 1) & $55555555
  i = (i & $33333333) + ((i >> 2) & $33333333)
  ProcedureReturn (((i + i >> 4) & $0f0f0f0f) * $01010101) >> 24 & $ff
EndProcedure
Last edited by wilbert on Fri May 15, 2015 3:00 pm, edited 1 time in total.
Windows (x64)
Raspberry Pi OS (Arm64)
davido
Addict
Addict
Posts: 1890
Joined: Fri Nov 09, 2012 11:04 pm
Location: Uttoxeter, UK

Re: Population count

Post by davido »

@wilbert,
Well squeezed! Thank you very much. :D
DE AA EB
User avatar
luis
Addict
Addict
Posts: 3876
Joined: Wed Aug 31, 2005 11:09 pm
Location: Italy

Re: Population count

Post by luis »

@davido

Did you miss the link I posted above ?

Code: Select all

int swar(uint32_t i) {
  i = i - ((i >> 1) & 0x55555555);
  i = (i & 0x33333333) + ((i >> 2) & 0x33333333);
  return (((i + (i >> 4)) & 0x0F0F0F0F) * 0x01010101) >> 24;
}
It was already there. Don't tell me "but it's in C !" :)
"Have you tried turning it off and on again ?"
A little PureBasic review
wilbert
PureBasic Expert
PureBasic Expert
Posts: 3870
Joined: Sun Aug 08, 2004 5:21 am
Location: Netherlands

Re: Population count

Post by wilbert »

luis wrote:It was already there. Don't tell me "but it's in C !" :)
I know, but it didn't work when I tried it exactly like that in PureBasic x64.
It turned out that the result of the multiplication instruction on x64 is a 64 bit value and that 64 bit value is shifted 24 bits to the right resulting in a wrong output.

A lot of the brackets aren't required due to operator priority

Code: Select all

Procedure popcount9(i.l)
  i - i >> 1 & $55555555
  i = i & $33333333 + i >> 2 & $33333333
  ProcedureReturn ((i + i >> 4) & $0f0f0f0f * $01010101) >> 24 & $ff
EndProcedure
but it might make less sense this way.
Windows (x64)
Raspberry Pi OS (Arm64)
User avatar
BasicallyPure
Enthusiast
Enthusiast
Posts: 536
Joined: Thu Mar 24, 2011 12:40 am
Location: Iowa, USA

Re: Population count

Post by BasicallyPure »

Here is one I came up with.
It yields about 285% against wilbert's assembly code.

Code: Select all

Procedure PopCount_10(k.l) ; BasicallyPure
   Macro AaS : result + k & 1 : k >> 1 : EndMacro
   Protected result
   AaS : AaS : AaS : AaS : AaS : AaS : AaS : AaS : AaS : AaS
   AaS : AaS : AaS : AaS : AaS : AaS : AaS : AaS : AaS : AaS
   AaS : AaS : AaS : AaS : AaS : AaS : AaS : AaS : AaS : AaS
   AaS : AaS
   ProcedureReturn result
EndProcedure
This is all very interesting but I don't have any idea what it can be used for.
It's obviously very useful or there wouldn't be an assembly instruction for it.

edit:
Never mind, a little research on 'PopCnt' or 'PopCount' told me more than I ever wanted to know.
It's probably a safe bet that I'll never use it.
BasicallyPure
Until you know everything you know nothing, all you have is what you believe.
davido
Addict
Addict
Posts: 1890
Joined: Fri Nov 09, 2012 11:04 pm
Location: Uttoxeter, UK

Re: Population count

Post by davido »

@luis,
My apologies, I did get the code from the link you posted.
Thank you.
DE AA EB
wilbert
PureBasic Expert
PureBasic Expert
Posts: 3870
Joined: Sun Aug 08, 2004 5:21 am
Location: Netherlands

Re: Population count

Post by wilbert »

Seems indeed to scale up to 64 bits as well :)

Code: Select all

Procedure PopCount64(x.q)
  !mov rax, [p.v_x]
  !mov rdx, rax
  !shr rdx, 1
  !and rdx, [popcount64_v55] 
  !sub rax, rdx 
  ;x = (x & $3333333333333333) + ((x >> 2) & $3333333333333333)
  !mov rdx, rax       ;x 
  !and rax, [popcount64_v33] 
  !shr rdx, 2 
  !and rdx, [popcount64_v33]
  !add rax, rdx
  ;x = (x + (x >> 4)) & $0f0f0f0f0f0f0f0f0f0f
  !mov rdx, rax 
  !shr rdx, 4 
  !add rax, rdx 
  !and rax, [popcount64_v0f]
  ;x * $0101010101010101 >> 56
  !imul rax, [popcount64_v01]
  !shr rax, 56
  ProcedureReturn
  !popcount64_v01: dq 0x0101010101010101
  !popcount64_v0f: dq 0x0f0f0f0f0f0f0f0f
  !popcount64_v33: dq 0x3333333333333333
  !popcount64_v55: dq 0x5555555555555555
EndProcedure 

x = Random($FFFFFFFF)
Debug Bin(x)
Debug Popcount64(x) 

Code: Select all

Procedure PopCount64(x.q); SSE2 (both x86 and x64)
  !movd xmm0, [p.v_x]
  !movd xmm1, [p.v_x + 4]
  !movq xmm2, [popcount64_v55]
  !movq xmm3, [popcount64_v33]
  !movq xmm4, [popcount64_v0f]
  !punpckldq xmm0, xmm1
  !movdqa xmm1, xmm0
  !psrlq xmm1, 1
  !pand xmm1, xmm2
  !psubb xmm0, xmm1 
  !movdqa xmm1, xmm0 
  !pand xmm0, xmm3
  !psrlq xmm1, 2 
  !pand xmm1, xmm3
  !paddb xmm0, xmm1
  !movdqa xmm1, xmm0
  !psrlq xmm1, 4
  !paddb xmm0, xmm1
  !pand xmm0, xmm4
  !pxor xmm1, xmm1
  !psadbw xmm0, xmm1
  !movd eax, xmm0
  ProcedureReturn
  !popcount64_v0f: dq 0x0f0f0f0f0f0f0f0f
  !popcount64_v33: dq 0x3333333333333333
  !popcount64_v55: dq 0x5555555555555555
EndProcedure
Using twice 32 bits

Code: Select all

Procedure Popcount64(x.q)
  !mov eax, [p.v_x]
  !mov ecx, [p.v_x + 4]
  !mov edx, ecx
  !shr edx, 1
  !jz popcount64_l0
  !and edx, 0x55555555
  !sub ecx, edx
  !mov edx, ecx
  !and ecx, 0x33333333
  !shr edx, 2
  !and edx, 0x33333333
  !add ecx, edx
  !mov edx, ecx
  !shr edx, 4
  !add ecx, edx
  !and ecx, 0x0f0f0f0f
  !popcount64_l0:
  !mov edx, eax
  !shr edx, 1
  !and edx, 0x55555555
  !sub eax, edx
  !mov edx, eax
  !and eax, 0x33333333
  !shr edx, 2
  !and edx, 0x33333333
  !add eax, edx
  !mov edx, eax
  !shr edx, 4
  !add eax, edx
  !and eax, 0x0f0f0f0f
  !add eax, ecx
  !imul eax, 0x01010101
  !shr eax, 24
  ProcedureReturn 
EndProcedure
A basic version should look like this

Code: Select all

Procedure PopCount64b1(i.q)
  i - i >> 1 & $5555555555555555
  i = i & $3333333333333333 + i >> 2 & $3333333333333333
  ProcedureReturn ((i + i >> 4) & $0f0f0f0f0f0f0f0f * $0101010101010101) >> 56
EndProcedure
But this fails on x64. This code however

Code: Select all

Procedure PopCount64b2(i.q)
  Protected m.q = $0101010101010101
  i - i >> 1 & $5555555555555555
  i = i & $3333333333333333 + i >> 2 & $3333333333333333
  i = (i + i >> 4) & $0f0f0f0f0f0f0f0f
  ProcedureReturn (i * m) >> 56
EndProcedure
works fine on x64.
PB bug?
Last edited by wilbert on Mon May 18, 2015 7:27 am, edited 6 times in total.
Windows (x64)
Raspberry Pi OS (Arm64)
User avatar
idle
Always Here
Always Here
Posts: 5018
Joined: Fri Sep 21, 2007 5:52 am
Location: New Zealand

Re: Population count

Post by idle »

Wilbert do you know why x64 can't use immediates for the constants?
In my x64 version above I had to load them into a register

Code: Select all

...
!mov r15, 0x5555555555555555
!and rdx, r15
...
Windows 11, Manjaro, Raspberry Pi OS
Image
wilbert
PureBasic Expert
PureBasic Expert
Posts: 3870
Joined: Sun Aug 08, 2004 5:21 am
Location: Netherlands

Re: Population count

Post by wilbert »

:oops: :oops: :oops: I completely overlooked your 64 bit version post Idle :oops: :oops: :oops:

I ran into the same problem with immediate values. After some searching I found the answer ...
64bit mode allows only signed 32bit immediates. The only instruction that can take a 64bit immediate is mov .
Windows (x64)
Raspberry Pi OS (Arm64)
User avatar
idle
Always Here
Always Here
Posts: 5018
Joined: Fri Sep 21, 2007 5:52 am
Location: New Zealand

Re: Population count

Post by idle »

wilbert wrote::oops: :oops: :oops: I completely overlooked your 64 bit version post Idle :oops: :oops: :oops:

I ran into the same problem with immediate values. After some searching I found the answer ...
64bit mode allows only signed 32bit immediates. The only instruction that can take a 64bit immediate is mov .
That's good to know, I didn't expect it'd be an issue and couldn't understand why it only worked via mov.
Windows 11, Manjaro, Raspberry Pi OS
Image
wilbert
PureBasic Expert
PureBasic Expert
Posts: 3870
Joined: Sun Aug 08, 2004 5:21 am
Location: Netherlands

Re: Population count

Post by wilbert »

Here's a speed test for some PopCount64 versions (results are copied to the clipboard).

Speed test code

Code: Select all

; PopCount 64 bit test

EnableExplicit

CompilerIf #PB_Compiler_Processor = #PB_Processor_x64
  
  Procedure PopCount0(k.q)
    !mov rax, [p.v_k]
    !popcnt rax, rax
    ProcedureReturn
  EndProcedure
  
  Procedure Popcount1(x.q); Idle
                          ;x - (x >> 1) &  $5555555555555555           
    !mov rax, [p.v_x]
    !mov rdx, rax
    !shr rdx, 1
    !mov r15, 0x5555555555555555
    !and rdx, r15
    !sub rax,rdx 
    ;x = (x & $3333333333333333) + ((x >> 2) & $3333333333333333)
    !mov rdx, rax       ;x 
    !mov r15, 0x3333333333333333 
    !and rax, r15  
    !shr rdx, 2 
    !and rdx, r15
    !add rax,rdx
    ;x = (x + (x >> 4)) & $0f0f0f0f0f0f0f0f
    !mov rdx, rax 
    !shr rdx, 4 
    !add rax,rdx 
    !mov r15, 0x0f0f0f0f0f0f0f0f
    !and rax, r15 
    ;x * 0101010101010101 >> 56
    !mov r15, 0x0101010101010101
    !imul rax, r15
    !shr rax, 56 
    ProcedureReturn  
  EndProcedure 
  
  Procedure PopCount2(x.q)
    !mov rax, [p.v_x]
    !mov rdx, rax
    !shr rdx, 1
    !and rdx, [popcount2_v55] 
    !sub rax, rdx 
    ;x = (x & $3333333333333333) + ((x >> 2) & $3333333333333333)
    !mov rdx, rax       ;x 
    !and rax, [popcount2_v33] 
    !shr rdx, 2 
    !and rdx, [popcount2_v33]
    !add rax, rdx
    ;x = (x + (x >> 4)) & $0f0f0f0f0f0f0f0f0f0f
    !mov rdx, rax 
    !shr rdx, 4 
    !add rax, rdx 
    !and rax, [popcount2_v0f]
    ;x * $0101010101010101 >> 56
    !imul rax, [popcount2_v01]
    !shr rax, 56
    ProcedureReturn
    !popcount2_v01: dq 0x0101010101010101
    !popcount2_v0f: dq 0x0f0f0f0f0f0f0f0f
    !popcount2_v33: dq 0x3333333333333333
    !popcount2_v55: dq 0x5555555555555555
  EndProcedure 
  
CompilerEndIf

Procedure PopCount3(x.q); SSE2 (both x86 and x64)
  !movd xmm0, [p.v_x]
  !movd xmm1, [p.v_x + 4]
  !movq xmm2, [popcount3_v55]
  !movq xmm3, [popcount3_v33]
  !movq xmm4, [popcount3_v0f]
  !punpckldq xmm0, xmm1
  !movdqa xmm1, xmm0
  !psrlq xmm1, 1
  !pand xmm1, xmm2
  !psubb xmm0, xmm1 
  !movdqa xmm1, xmm0 
  !pand xmm0, xmm3
  !psrlq xmm1, 2 
  !pand xmm1, xmm3
  !paddb xmm0, xmm1
  !movdqa xmm1, xmm0
  !psrlq xmm1, 4
  !paddb xmm0, xmm1
  !pand xmm0, xmm4
  !pxor xmm1, xmm1
  !psadbw xmm0, xmm1
  !movd eax, xmm0
  ProcedureReturn
  !popcount3_v0f: dq 0x0f0f0f0f0f0f0f0f
  !popcount3_v33: dq 0x3333333333333333
  !popcount3_v55: dq 0x5555555555555555
EndProcedure

Procedure Popcount4(x.q)
  !mov eax, [p.v_x]
  !mov ecx, [p.v_x + 4]
  !mov edx, ecx
  !shr edx, 1
  !jz popcount64_l0
  !and edx, 0x55555555
  !sub ecx, edx
  !mov edx, ecx
  !and ecx, 0x33333333
  !shr edx, 2
  !and edx, 0x33333333
  !add ecx, edx
  !mov edx, ecx
  !shr edx, 4
  !add ecx, edx
  !and ecx, 0x0f0f0f0f
  !popcount64_l0:
  !mov edx, eax
  !shr edx, 1
  !and edx, 0x55555555
  !sub eax, edx
  !mov edx, eax
  !and eax, 0x33333333
  !shr edx, 2
  !and edx, 0x33333333
  !add eax, edx
  !mov edx, eax
  !shr edx, 4
  !add eax, edx
  !and eax, 0x0f0f0f0f
  !add eax, ecx
  !imul eax, 0x01010101
  !shr eax, 24
  ProcedureReturn 
EndProcedure

Procedure PopCount5(i.q)
  Protected m.q = $0101010101010101
  i - i >> 1 & $5555555555555555
  i = i & $3333333333333333 + i >> 2 & $3333333333333333
  i = (i + i >> 4) & $0f0f0f0f0f0f0f0f
  ProcedureReturn (i * m) >> 56
EndProcedure


; ---- Initialisation ----
Define.i i, x, x0, x1, x2, x3, x4, x5, y, rep=10000000
Define.i t0, t1, t2, t3, t4, t5
Define.s s

Dim Rnd.q(rep)
For i = 1 To rep
  Rnd(i) = Random(2147483647) << 32 | Random(2147483647)
Next


; ---- Small check whether all procedures Return the same results ----
For i = 1 To 100
  x5 = PopCount5(Rnd(i))
  x4 = PopCount4(Rnd(i))
  x3 = PopCount3(Rnd(i))
  CompilerIf #PB_Compiler_Processor = #PB_Processor_x64 
    x2 = PopCount2(Rnd(i))
    x1 = PopCount1(Rnd(i))
    x0 = PopCount0(Rnd(i))
  CompilerElse
    x2 = x3
    x1 = x2
    x0 = x1
  CompilerEndIf
  If x0 <> x1 Or
     x0 <> x2 Or
     x0 <> x3 Or
     x0 <> x4 Or
     x0 <> x5
    MessageRequester("Error",
                     "Different results for PopCount(" + Rnd(i) + ")")
    End
  EndIf
Next   


; ---- Speed test ----
CompilerIf #PB_Compiler_Processor = #PB_Processor_x64
  t0 = ElapsedMilliseconds()
  For i = 1 To rep
    x = PopCount0(Rnd(i))
  Next   
  t0 = ElapsedMilliseconds() - t0
  
  t1 = ElapsedMilliseconds()
  For i = 1 To rep
    x = PopCount1(Rnd(i))
  Next   
  t1 = ElapsedMilliseconds() - t1
  
  t2 = ElapsedMilliseconds()
  For i = 1 To rep
    x = PopCount2(Rnd(i))
  Next   
  t2 = ElapsedMilliseconds() - t2
CompilerEndIf

t3 = ElapsedMilliseconds()
For i = 1 To rep
  x = PopCount3(Rnd(i))
Next   
t3 = ElapsedMilliseconds() - t3

t4 = ElapsedMilliseconds()
For i = 1 To rep
  x = PopCount4(Rnd(i))
Next   
t4 = ElapsedMilliseconds() - t4

t5 = ElapsedMilliseconds()
For i = 1 To rep
  x = PopCount5(Rnd(i))
Next   
t5 = ElapsedMilliseconds() - t5


s = "t0 = " + t0 + " ms   (" + Int(100*t0/t4) + "%)" + #LF$ +
    "t1 = " + t1 + " ms   (" + Int(100*t1/t4) + "%)" + #LF$ +
    "t2 = " + t2 + " ms   (" + Int(100*t2/t4) + "%)" + #LF$ +
    "t3 = " + t3 + " ms   (" + Int(100*t3/t4) + "%)" + #LF$ +
    "t4 = " + t4 + " ms   (" + Int(100*t4/t4) + "%)" + #LF$ +
    "t5 = " + t5 + " ms   (" + Int(100*t5/t4) + "%)"
SetClipboardText(s)
MessageRequester("PopCount speed test", s)
Results OSX , PB x86 wrote:t0 = 0 ms (0%)
t1 = 0 ms (0%)
t2 = 0 ms (0%)
t3 = 56 ms (80%)
t4 = 70 ms (100%)
t5 = 415 ms (592%)
Results OSX , PBx64 wrote:t0 = 33 ms (46%)
t1 = 54 ms (76%)
t2 = 50 ms (70%)
t3 = 52 ms (73%)
t4 = 71 ms (100%)
t5 = 71 ms (100%)
Last edited by wilbert on Mon May 18, 2015 7:30 am, edited 4 times in total.
Windows (x64)
Raspberry Pi OS (Arm64)
Little John
Addict
Addict
Posts: 4519
Joined: Thu Jun 07, 2007 3:25 pm
Location: Berlin, Germany

Re: Population count

Post by Little John »

@BasicallyPure:
I'm using the Population Count for generating the Thue-Morse sequence.
User avatar
BasicallyPure
Enthusiast
Enthusiast
Posts: 536
Joined: Thu Mar 24, 2011 12:40 am
Location: Iowa, USA

Re: Population count

Post by BasicallyPure »

Ahh I see, it's the old Thue-Morse sequence, I should have guessed. :lol:

Nicely documented by the way.
Thanks.
BasicallyPure
Until you know everything you know nothing, all you have is what you believe.
User avatar
idle
Always Here
Always Here
Posts: 5018
Joined: Fri Sep 21, 2007 5:52 am
Location: New Zealand

Re: Population count

Post by idle »

seems I forgot to post this but it the x86 specific popcountmem functions need checking

Code: Select all


Procedure _Popcount32(x.l)
  ;x - (x >> 1) &  $55555555           
  !mov eax, [p.v_x]
  !mov edx, eax
  !shr edx, 1
  !and edx, 0x55555555 
  !sub eax,edx 
  ;x = (x & $33333333) + ((x >> 2) & $33333333)
  !mov edx, eax       ;x 
  !and eax, 0x33333333 
  !shr edx, 2 
  !and edx, 0x33333333
  !add eax,edx
  ;x = (x + (x >> 4)) & $0f0f0f0f0f
  !mov edx, eax 
  !shr edx, 4 
  !add eax,edx 
  !and eax, 0x0f0f0f0f
  ;x * 0x01010101 >> 24
  !imul eax, 0x01010101
  !shr eax, 24
  ProcedureReturn  
EndProcedure 

CompilerIf SizeOf(Integer) = 4

Procedure _popcountmem(*mem.long,len.l) 
  Protected result,a,b 
       
  !mov eax, [p.v_len] 
  !and eax, 3 
  !mov edx, [p.v_len] 
  !sub edx,eax 
  !mov [p.v_a],edx
  !mov [p.v_b],eax 
  !xor ecx,ecx
  !lwhile:
  !cmp ecx, [p.v_a] 
  !jge lend
  !mov eax, [p.p_mem] 
  !mov eax, [eax + ecx]
  !mov edx, eax
  !shr edx, 1
  !and edx, 0x55555555 
  !sub eax,edx 
  !mov edx, eax       
  !and eax, 0x33333333 
  !shr edx, 2 
  !and edx, 0x33333333
  !add eax,edx
  !mov edx, eax 
  !shr edx, 4 
  !add eax,edx 
  !and eax, 0x0f0f0f0f
  !imul eax, 0x01010101
  !shr eax, 24
  !add [p.v_result],eax   
  !add ecx,4
  !jmp lwhile
  !lend:
  
  !mov eax, [p.p_mem] 
  !mov eax, [eax + ecx]
    
  !mov edx, [p.v_b]
  !jmp dword [JT_remain + edx * 4]
  !JT_remain dd le,l1,l2,l3,l4
    
  !l1: 
  !and eax, 0xff
  !jmp le
  !l2:
  !and eax, 0xffff
  !jmp le 
  !l3:
  !and eax, 0xffffff
  !jmp le 
  !l4:
  !le: 
   
  !mov edx, eax
  !shr edx, 1
  !and edx, 0x55555555 
  !sub eax,edx 
  !mov edx, eax       
  !and eax, 0x33333333 
  !shr edx, 2 
  !and edx, 0x33333333
  !add eax,edx
  !mov edx, eax 
  !shr edx, 4 
  !add eax,edx 
  !and eax, 0x0f0f0f0f
  !imul eax, 0x01010101
  !shr eax, 24
  !add [p.v_result],eax   
   
  ProcedureReturn result   
EndProcedure   

Procedure _popcountmemSSE(*mem,len) 
  Protected result,a,b 
       
  !mov eax, [p.v_len] 
  !and eax, 3 
  !mov edx, [p.v_len] 
  !sub edx,eax 
  !mov [p.v_a],edx
  !mov [p.v_b],eax 
  !xor ecx,ecx
  !lwhile1:
  !cmp ecx, [p.v_a] 
  !jge lend1
  
  !mov eax, [p.p_mem] 
  !mov eax, [eax + ecx]
  !popcnt eax,eax   
  !add [p.v_result],eax   
  !add ecx,4
   
  !jmp lwhile1
  !lend1:
  
  !mov eax, [p.p_mem] 
  !mov eax, [eax + ecx]
    
  !mov edx, [p.v_b]
  !jmp dword [JT_remain1 + edx * 4]
  !JT_remain1 dd lle,ll1,ll2,ll3,ll4
    
  !ll1: 
  !and eax, 0xff
  !jmp lle
  !ll2:
  !and eax, 0xffff
  !jmp lle 
  !ll3:
  !and eax, 0xffffff
  !jmp lle 
  !ll4:
  !lle: 
  
  !mov edx,eax
  !popcnt eax,edx   
  !add [p.v_result],eax   
  
  ProcedureReturn result  
EndProcedure   


CompilerElse 

Procedure _Popcount64(x.i)
  ;x - (x >> 1) &  $5555555555555555           
  !mov rax, [p.v_x]
  !mov rdx, rax
  !shr rdx, 1
  !mov r15, 0x5555555555555555
  !and rdx, r15
  !sub rax,rdx 
  ;x = (x & $3333333333333333) + ((x >> 2) & $3333333333333333)
  !mov rdx, rax       ;x 
  !mov r15, 0x3333333333333333 
  !and rax, r15  
  !shr rdx, 2 
  !and rdx, r15
  !add rax,rdx
  ;x = (x + (x >> 4)) & $0f0f0f0f0f0f0f0f
  !mov rdx, rax 
  !shr rdx, 4 
  !add rax,rdx 
  !mov r15, 0x0f0f0f0f0f0f0f0f
  !and rax, r15 
  ;x * 0101010101010101 >> 56
  !mov r15, 0x0101010101010101
  !imul rax, r15
  !shr rax, 56 
  ProcedureReturn  
EndProcedure 

Procedure _popcountmem(*mem,len) 
  Protected result,a,b 
       
  !mov rax, [p.v_len] 
  !And rax, 3 
  !mov rdx, [p.v_len] 
  !sub rdx,rax 
  !mov [p.v_a],rdx
  !mov [p.v_b],rax 
  !XOr rcx,rcx
  !lwhile:
  !cmp rcx, [p.v_a] 
  !jge lend
  !mov rax, [p.p_mem] 
  !mov eax, [rax + rcx]
  !mov edx, eax
  !shr edx, 1
  !and edx, 0x55555555 
  !sub eax,edx 
  !mov edx, eax       
  !and eax, 0x33333333 
  !shr edx, 2 
  !and edx, 0x33333333
  !add eax,edx
  !mov edx, eax 
  !shr edx, 4 
  !add eax,edx 
  !and eax, 0x0f0f0f0f
  !imul eax, 0x01010101
  !shr eax, 24
  !add [p.v_result],eax   
  !add rcx,4
  !jmp lwhile
  !lend:
  
  !mov rax, [p.p_mem] 
  !mov eax, [rax + rcx]
  
  !lea rdx, [JT_remain]
  !mov rcx, [p.v_b]
  !jmp qword [rdx + rcx * 8]
  !JT_remain dq le,l1,l2,l3,l4
    
  !l1: 
  !and eax, 0xff
  !jmp le
  !l2:
  !and eax, 0xffff
  !jmp le 
  !l3:
  !and eax, 0xffffff
  !jmp le 
  !l4:
  !le: 
   
  !mov edx, eax
  !shr edx, 1
  !and edx, 0x55555555 
  !sub eax,edx 
  !mov edx, eax       
  !and eax, 0x33333333 
  !shr edx, 2 
  !and edx, 0x33333333
  !add eax,edx
  !mov edx, eax 
  !shr edx, 4 
  !add eax,edx 
  !and eax, 0x0f0f0f0f
  !imul eax, 0x01010101
  !shr eax, 24
  !add [p.v_result],eax   
   
  ProcedureReturn result   
EndProcedure   

Procedure _popcountmemSSE(*mem,len) 
  Protected result,a,b 
       
  !mov rax, [p.v_len] 
  !and rax, 7 
  !mov rdx, [p.v_len] 
  !sub rdx,rax 
  !mov [p.v_a],rdx
  !mov [p.v_b],rax 
  !xor rcx,rcx
  !lwhile1:
  !cmp rcx, [p.v_a] 
  !jge lend1
  
  !mov rax, [p.p_mem] 
  !mov rax, [rax + rcx]
  !popcnt rax,rax   
  !add [p.v_result],rax   
  !add rcx,8
   
  !jmp lwhile1
  !lend1:
  
  !mov rax, [p.p_mem] 
  !mov rax, [rax + rcx]
    
  !lea rdx, [JT_remain1]
  !mov rcx, [p.v_b]
  !jmp qword [rdx + rcx * 8]
  !JT_remain1 dq lle,ll1,ll2,ll3,ll4,ll5,ll6,ll7,ll8
    
  !ll1: 
  !and rax, 0xff
  !jmp lle
  !ll2:
  !and rax, 0xffff
  !jmp lle 
  !ll3:
  !and rax, 0xffffff
  !jmp lle 
  !ll4:
  !mov rdx,0xffffffff
  !and rax, rdx
  !jmp lle 
  !ll5:
  !mov rdx, 0xffffffffff
  !and rax, rdx
  !jmp lle 
  !ll6:
  !mov rdx,0xffffffffffff
  !and rax, rdx
  !jmp lle
  !ll7:
  !mov rdx,0xffffffffffffff 
  !and rax, rdx
  !jmp lle
  !ll8:
  !lle: 
  
  !mov rdx,rax
  !popcnt rax,rdx   
  !add [p.v_result],rax   
  
  ProcedureReturn result  
EndProcedure   
  
CompilerEndIf 

Procedure _PopCountSSE4(x.i)
  CompilerIf SizeOf(Integer) = 8 
    !popcnt rax,[p.v_x]   
  CompilerElse 
    !popcnt eax,[p.v_x]   
  CompilerEndIf   
  ProcedureReturn 
EndProcedure   
 
Prototype PopCount(x.i) 
Prototype PopCountMem(*mem,len) 
Global PopCount.PopCount 
Global PopCountMem.PopCountMem 

Procedure InitpopCount(FallBackMode.i=64) 
  Protected result 
  !mov eax, 1
  !cpuid
  !shr ecx, 23
  !and ecx, 1
  !mov [p.v_result],ecx 
  If result 
    popcount = @_PopCountSSE4()
    popcountmem = @_popcountmemSSE() 
  Else 
    CompilerIf SizeOf(Integer) = 8
      If FallBackMode = 64 
        popcount = @_popcount64()
        popcountmem = @_popcountmem()
      Else 
        popcount = @_popcount32()
        popcountmem = @_popcountmem()
      EndIf   
    CompilerElse 
      popcount = @_popcount32() 
      popcountmem = @_popcountmem()
    CompilerEndIf 
 EndIf    
EndProcedure 


CompilerIf #PB_Compiler_IsMainFile 
  
  InitpopCount() ; sets popcount to the best supported function 
  ;on x64 if you want to set the mode to the 32 function range, call it with InitpopCount(#PB_Long)
  
  x.i = %10101010101010101010101010101010
  Debug PopCount(x)  ;16
  
  mx = Random(1024,128)
  *mem = AllocateMemory(mx) 
  *pa.Ascii = *mem  
  For a = 0 To mx-1 
   *pa\a = 3 
   *pa+1
   r+2
  Next   
  Debug r     
  Debug PopCountMem(*mem,mx)   
  Debug _popcountmem(*mem,mx)  
  
CompilerEndIf
Windows 11, Manjaro, Raspberry Pi OS
Image
User avatar
Keya
Addict
Addict
Posts: 1891
Joined: Thu Jun 04, 2015 7:10 am

Re: Population count

Post by Keya »

great work everyone!:) Ive simply added a CPUID check for SSE4 POPCNT, and then just combined the 32&64bit asm routines posted here along with a Prototype function pointer so it's always pointing at the best function

Code: Select all

Prototype.i protoGetBitCnt(number.i)

Procedure.i HasSSE4_POPCNT()
  EnableASM
  Protected.i reg_bx
  CompilerIf SizeOf(Integer) = 8
    !mov [p.v_reg_bx], rbx
  CompilerElse
    !mov [p.v_reg_bx], ebx
  CompilerEndIf  
  ! mov eax, 1
  ! cpuid   
  ! xor eax, eax  
  ! bt ecx, 23   ;popcnt
  ! setc al   
  CompilerIf SizeOf(Integer) = 8
    !mov rbx, [p.v_reg_bx]
  CompilerElse
    !mov ebx, [p.v_reg_bx]
  CompilerEndIf
  ProcedureReturn
  DisableASM
EndProcedure


Procedure GetBitCountSSE4(x.i)
  CompilerIf SizeOf(Integer) = 8
    !popcnt rax,[p.v_x]   
  CompilerElse
    !popcnt eax,[p.v_x]   
  CompilerEndIf   
  ProcedureReturn
EndProcedure  


Procedure GetBitCountAsm(x.i)
  CompilerIf SizeOf(Integer) = 8
    !mov rax, [p.v_x]
    !mov rdx, rax
    !shr rdx, 1
    !mov r15, 0x5555555555555555
    !and rdx, r15
    !sub rax,rdx
    !mov rdx, rax       
    !mov r15, 0x3333333333333333
    !and rax, r15 
    !shr rdx, 2
    !and rdx, r15
    !add rax,rdx
    !mov rdx, rax
    !shr rdx, 4
    !add rax,rdx
    !mov r15, 0x0f0f0f0f0f0f0f0f
    !and rax, r15
    !mov r15, 0x0101010101010101
    !imul rax, r15
    !shr rax, 56
  CompilerElse
    !mov eax, [p.v_x]
    !mov edx, eax
    !shr edx, 1
    !and edx, 0x55555555
    !sub eax,edx
    !mov edx, eax       
    !and eax, 0x33333333
    !shr edx, 2
    !and edx, 0x33333333
    !add eax,edx
    !mov edx, eax
    !shr edx, 4
    !add eax,edx
    !and eax, 0x0f0f0f0f
    !imul eax, 0x01010101
    !shr eax, 24
  CompilerEndIf
  ProcedureReturn 
EndProcedure

 
;Self-initialize
If HasSSE4_POPCNT() = 1
  Global GetBitCnt.protoGetBitCnt = @GetBitCountSSE4()
Else
  Global GetBitCnt.protoGetBitCnt = @GetBitCountAsm()
EndIf

;---
;from now you can simply call GetBitCnt(value), and it will automatically
;be pointing to either the SSE4 or asm procedure depending on your CPU
ps. i really wish the Thread Subject included the words COUNT and BITS as I nearly missed this thread! (and besides, 32 is more of a party than a population, so yes Intel shouldve named the instruction BITCNT if not PARTYCNT) :D
Post Reply