PureBasic Forums - English

Posted: **Wed Mar 24, 2010 11:59 pm**

The normal x86 version can actually be made even faster (here) for large string by a lookup table:

Structure SCountCharsLookupTable
  A.b[256]
EndStructure

Procedure CountCharX86Lookup(*B.long, L, Char.a)
  Protected Table.SCountCharsLookupTable
  Table\A[Char] = 1
  !mov  ecx, [p.v_L]
  !mov  eax, [p.p_B]

  !push edi
  !PS_CC = 4
  !xor  edi, edi
  
  !lp2:
    !movzx edx, byte [ecx+eax-1]
    !movzx edx, byte [p.v_Table+edx+PS_CC]
    !add edi, edx
    !add ecx, -1
    !jnz lp2
    
  !mov eax, edi
  !pop edi
  ProcedureReturn
EndProcedure

Old x86: 3000-3050 ms
Lookup x86: 2700-2750 ms

Posted: **Thu Mar 25, 2010 12:24 am**

Trond wrote: Btw: can you explain how popcnt helps you check if the characters are equal??

Yes.
To understand it you must know how SSE works or lets say how SIMD works. MMX and SSE are just names for sets of instuctions. But MMX introduced a new concept of paralel processing known as SIMD (single instruction multible data).

For MMX SIMD instructions work with a new kind of register, the MMX registers. This are 64bit registers but don't hold a 64bit number instead they hold so called "packed data". Packed data just means the 64bit is not a quad word, it's 2 double worlds or 4 words or 8 byte. So what you do is put 8 byte into one register and put another 8 byte into another register. Now you can use a SIMD instruction like padd to perform a add operation between this to registers. The padd instruction will add byte 1 of the first register with byte 1 in the second register and byte 2 in the first register with byte 2 in the second register and so on. It does this in paralel, so it just takes 1 tick to do 8 add operations. With SSE they introduced 128bit registers, so you can even do 16 add operations with a single instruction in just on tick. Thats the basic concept.

You also can do a paralel compare operation. With this operation the cpu compares the corresponding bytes and set all bits of the source byte to 1 if it matches and to 0 if not. Now you can use another SIMD instruction, which will extract the sign bit of every byte in the MMX register and put it into a 32bit register like EAX. So you get a byte bit mask on MMX or a word bit mask on SSE. For our procedure we now just have to count the bits in the bit mask and we know how many matches the SIMD compare had.

It's a little bit complex at first, but it kicks ass. You can theoreticly speed up your traditional assembly procedures by 16 times. But you will not reach that, because you have some overhead.

Of course you dont need to work with packed bytes on SIMD, you also can work with packed words, packed doublewords and even packed floats and doubles.

I hope that is understandable, if not just search google for SIMD.
My english is not the best. ^^

Posted: **Thu Mar 25, 2010 12:43 am**

I understand, thanks.

Posted: **Thu Mar 25, 2010 1:02 am**

Actually, since the bitmask can only have 255 values, and the bits are packed into one "side" of the register, you can use a lookup table. Counting is not necessary.

Code: Select all

DataSection
  BitCountLookup:
  Data.a 0,1,1,2,1,2,2,3,1,2,2,3,2,3,3,4,1,2,2,3,2,3,3,4,2,3,3,4
  Data.a 3,4,4,5,1,2,2,3,2,3,3,4,2,3,3,4,3,4,4,5,2,3,3,4,3,4,4,5
  Data.a 3,4,4,5,4,5,5,6,1,2,2,3,2,3,3,4,2,3,3,4,3,4,4,5,2,3,3,4
  Data.a 3,4,4,5,3,4,4,5,4,5,5,6,2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6
  Data.a 3,4,4,5,4,5,5,6,4,5,5,6,5,6,6,7,1,2,2,3,2,3,3,4,2,3,3,4
  Data.a 3,4,4,5,2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,2,3,3,4,3,4,4,5
  Data.a 3,4,4,5,4,5,5,6,3,4,4,5,4,5,5,6,4,5,5,6,5,6,6,7,2,3,3,4
  Data.a 3,4,4,5,3,4,4,5,4,5,5,6,3,4,4,5,4,5,5,6,4,5,5,6,5,6,6,7
  Data.a 3,4,4,5,4,5,5,6,4,5,5,6,5,6,6,7,4,5,5,6,5,6,6,7,5,6,6,7,6,7,7,8
EndDataSection

Posted: **Thu Mar 25, 2010 2:31 am**

Yes, thats what Helle has done. ^^

Posted: **Thu Mar 25, 2010 3:14 am**

What am I missing? There was a contest a few years ago here:
http://www.powerbasic.com/support/forum ... 01733.html

Some of the code submitted there scans a 4MB file and counts ALL characters, not just 1, and it runs in 200ms on a really old processor, faster than any of the code posted here, and is written in basic.
What file is being used for your tests? Is it really only 1MByte?

Anyway, regardless of what happened in that other place, the limit of speed of processing and large (large means > CPU cache size) blocks of data is the speed at which you can read the data from main memory. You need to use the PREFETCH instructions at some point in your code to get the data into the cache ahead of time otherwise the fastest of code will sit idle for most of the time waiting for main memory.

I've not used the POPCNT opcode but it really doesn't look appropriate for the job you're doing here.

Code: Select all

      !CountCharSSE42_Loop:
      
        !movdqa xmm1,[edi]
        !pcmpeqb xmm1,xmm0
        !pmovmskb eax,xmm1
        
        !popcnt eax,eax
        !add ebp,eax
        
        !add edi,16
        !dec ecx

If you were to split the bulk of the job into 2048 byte blocks (that's 8 bytes at a time and 256 blocks) and just add the results of the pcmpeqb to a spare xmm register than that register will accumulate the result for you without the slow transfer to eax. At the end of each block you can then sum the 8 bytes in the register, maybe using popcnt once for every 2048 bytes instead of every 8 bytes.

Better still, fetch 7 xmm registers worth of data in 1 go then do the compare. xmm instructions like to be done in bursts, that's why the instruction set is SSE for STREAMING SIMD extensions. Stream them when you can, it's what they're designed to do.

Posted: **Thu Mar 25, 2010 4:18 am**

Road Runner wrote: Some of the code submitted there scans a 4MB file and counts ALL characters, not just 1, and it runs in 200ms on a really old processor, faster than any of the code posted here, and is written in basic.
What file is being used for your tests? Is it really only 1MByte?

What you are missing is, that the test scans the data 100,000 times to get a messurable result.

Road Runner wrote: Anyway, regardless of what happened in that other place, the limit of speed of processing and large (large means > CPU cache size) blocks of data is the speed at which you can read the data from main memory. You need to use the PREFETCH instructions at some point in your code to get the data into the cache ahead of time otherwise the fastest of code will sit idle for most of the time waiting for main memory.

CPU's that support SSE4 have a automatic prefetcher which is pretty impossible to beat, just look on the Intel technical docs. It's even better to not unrole and not prefetch because the CPU will rebuild the loop with micro ops and optimize it. But thats only for the newer generation, Core iX.

Road Runner wrote: I've not used the POPCNT opcode but it really doesn't look appropriate for the job you're doing here.

I dont get how adding the results should do any good. Read the code carefully. It's not just transfering to eax, it's extracting. The result of the compare is not the count of matches, it's a bitmask. So you have to count bits, so POPCNT is the best for the job.

Edit:
But if you think you can optimize the code, just do it. I am sure you can get a bit performance with some smart changes. But prefetching doesnt do any good on new CPU's. Guess whats the fastest way to copy memory on a Core i7. ^^ Its REP MOVSD! I am not kidding, no SSE registers, no prefetching, no unrolling, beats every SSE memory copy routine on the Core i7.

Posted: **Fri Mar 26, 2010 2:52 pm**

I know little about the i7 and I don't have one to test but if the code is to be run on a range of

processors then i7 quirks can't be relied on. PREFETCH will benefit most CPUs and it shouldn't harm the i7 performance.

I did see the 100,000 but that gave unreasonably high figures. Maximum theoretical memory bandwidth for

an i7 is around 25GB/s? This code is running at 100,000 x 1MB / 0.234s = 427GB/s. 16 times the maximum possible.
The reason must be that the data is all stored in the big cache after the first run.
For a more realistic test you need to increase the size of your test file to at least 10MB. The 1MB file

will just sit in the cache for the last 99,999 times around the loops so no wonder PREFETCH has no effect.

And no wonder the i7 results are so much faster than the other CPUs being used which don't have a big enough cache to store the whole file and need to fetch it each time from main memory.

I'm not sure the i7 does away with the need for a PREFETCH instruction, I thought it still benefitted from

PREFETCH but cleared up some problems with PREFETCH slowing things down in certain circumstances.

My suggestion to improve things is explained below.

Use xmm7 as an accumulator. Zero it at the start.
Every PCMPEQB which matches will put the value 255 in the corresponding byte of the XMM register. Those

that don't match will put 0 in the byte.
Now just sum the XMM registers in xmm7 and you'll see that xmm7 counts all 16 bytes at the same time (it counts down but that can easily be corrected at the end)
So suppose after the PCMPEQB you get this (only half of the bytes are shown here)

Code: Select all

xmm1      0    0   0  255   0   0   0    0
xmm2      0    0   0  255   0   0   0  255
xmm3    255  255   0  255   0   0 255    0
xmm4    255  255   0    0   0   0   0  255

Sum them in xmm7
xmm7    254  254   0  253   0   0 255  254
which is the equivalent of
xmm7     -2   -2   0   -3   0   0  -1   -2   <- sum these at the end to get -10 i.e 10 matches found

In order not to overflow the byte values you can only do a block of <4096 bytes at a time so the scan will

need to be broken into 2048 byte chunks but even just doing 4 lots of registers at a time it should run faster because you're looping 1/4 as much, the slow pmovmskb and popcnt instructions are avoided on every loop and the streaming from mamory should be better.

Code: Select all

original inner loop:
      !CountCharSSE42_Loop:
      
        !movdqa xmm1,[edi]
        !pcmpeqb xmm1,xmm0
        !pmovmskb eax,xmm1
        
        !popcnt eax,eax
        !add ebp,eax
        
        !add edi,16
        !dec ecx


more efficient inner loop:
      !CountCharSSE42_Loop:

      !pxor xmm7,xmm7             'this will accumulate the counts
        !movdqa xmm1,[edi]        'get 4 or more at a time
        !movdqa xmm2,[edi+16]
        !movdqa xmm3,[edi+32]
        !movdqa xmm4,[edi+48]

	!prefechnta [edi+1024]    'see if fetching future data in advance will help, comment it out if it doesn't

        !pcmpeqb xmm1,xmm0        'compare ..
        !paddb xmm7,xmm1          '.. and accumulate the results for 64 bytes per loop
        !pcmpeqb xmm2,xmm0
        !paddb xmm7,xmm2
        !pcmpeqb xmm3,xmm0
        !paddb xmm7,xmm3
        !pcmpeqb xmm4,xmm0
        !paddb xmm7,xmm4


        !add edi,64

	;at this point xmm7 contains 16 bytes, each is a sum of the matches. 
	;add these bytes for the result now or, more efficient, repeat this loop up to 63 times and then add these bytes. 
	;remember the bytes are counting down so 255 is 1, 254 is 2 etc. Just correct for that once at the end.

Posted: **Fri Mar 26, 2010 3:10 pm**

I have to ask, but what size files would you be using where it becomes necessary to use ASM rather than the built in CountString() function?

Posted: **Fri Mar 26, 2010 8:49 pm**

So tested your code but please add a effective way to sume the xmm7 register.

Your code runs 3 times faster than mine but without the final sume, returns allways 0 of course, please add the code for that so i can do a real test.
As i said prefetching doesnt do anything on the Core i7, remember Core i7 is prefetching automaticly, they build in a memory access and loop analysator to calculate the perfect prefetching. But i agree it's helpfull on older CPU's, so it's usefull if we don't need POPCNT because the code will run on older CPU's.

Foz wrote:I have to ask, but what size files would you be using where it becomes necessary to use ASM rather than the built in CountString() function?

I dont know, maybe never.

Posted: **Fri Mar 26, 2010 11:16 pm**

Code: Select all

        !pxor xmm6,xmm6       'zero xmm6
        !psadbw xmm6,xmm7     'sum the 8 upper and 8 lower bytes
        !movdqa xmm7,xmm6     'take a copy
        !psrldq xmm6,8        'shift to line up
        !paddw xmm7,xmm6      'add the 2 8byte sums to give a 16 byte sum
        
        'the low word of xmm7 should now contain the sum required.
        'but it needs correcting because it's negative.
                
	'You can get the result into a standard register at the end using:
        !movd eax,xmm7

Posted: **Sat Mar 27, 2010 4:33 pm**

Try this. It replaces the main loop in your code but will still need you to add the part before (while getting into alignment) and after (for the leftover bytes). Because it's done 2048 bytes at a time the leftover bytes may be as many as 2047.

Code: Select all

'on entry to this part xmm0 contains 16x the byte to search for,
'esi contains the remaining buffer size,
'edi points to the next buffer byte, 16 byte aligned

'break the string into 2048 byte segments and process each segment as follows
'xmm0 : every byte of xmm0 contains the character to search for in the string
'xmm1 : used to load the next few bytes from the string
'xmm2 : used to load the next few bytes from the string
'xmm3 : used to load the next few bytes from the string
'xmm4 : used to load the next few bytes from the string
'xmm5 : zero. Used to sum the bytes within a register
'xmm6 : Used to store the grand total (sum of all blocks)
'xmm7 : the temporary sum of the block of matches


        'at this point xmm0 needs to contain the bytes to match

        !pxor xmm6,xmm6         'zero the grand total so far, or could be loaded with the result of matches in the
                                'small chuck of data already processed to get the data into alignment

        'FOR Each Block of 2048 bytes
        !cmp esi,2048           'is there a full block to process
        !jl DoTailBytes         'not enough bytes so skip to the end to calculate the leftover bytes
BlockLoop:
        !pxor xmm5,xmm5         'zero xmm5
        !mov ecx,32             'do 32 chunks of 64 bytes = 2048 at a time
        !pxor xmm7,xmm7         'zero xmm7 before accumulating results for this block
MainLoop:
        !movdqa xmm1,[edi]      'get a block of 64 bytes
        !movdqa xmm2,[edi+16]
        !movdqa xmm3,[edi+32]
        !movdqa xmm4,[edi+48]

        !prefetchnta [edi+1024] 'just in case it helps

        !pcmpeqb xmm1,xmm0      'check for matches
        !paddb xmm7,xmm1        'and accumulate results
        !pcmpeqb xmm2,xmm0
        !paddb xmm7,xmm2
        !pcmpeqb xmm3,xmm0
        !paddb xmm7,xmm3
        !pcmpeqb xmm4,xmm0
        !paddb xmm7,xmm4

        !add edi,64             'point to next block

        !dec ecx                'loop counter
        !jnz MainLoop

        !pxor xmm4,xmm4       'zero xmm4
        !psubb xmm4,xmm7      'xmm4=0 so zero-xmm7 will correct for counting down
        !psadbw xmm4,xmm5     'sum the 8 upper and 8 lower bytes
        !paddq xmm6,xmm4      'add to the grand total so far

        'NEXT BLOCK
        !sub esi,2048
        !cmp esi,2048           'is there a full block left to process
        !jge BlockLoop

        'after all 2048 byte blocks are done xmm6 contains the 2 halves of the result
        !movdqa xmm7,xmm6     'take a copy
        !psrldq xmm6,8        'shift to line up
        !paddw xmm7,xmm6      'add the 2 8byte half sums to give the full result
        !movd eax,xmm7        'copy to a general register

DoTailBytes:
  '      !add eax,StartCount   'add count of matches at start to get alignment (if not done earlier)
  '      !add eax,EndCount     'you'll need to count the extra matches after the last 2048byte block to the end of data and add that too


        'job done. EAX contains the result

Posted: **Sat Mar 27, 2010 8:02 pm**

I tested this (thanks to Road Runner!):

Code: Select all

Procedure CountChars(*B.Ascii, L, Char.a)
;posted by Trond
  !mov  ecx, [p.v_L]
  !mov  dl, [p.v_Char]
  !mov  eax, [p.p_B]
 
  !push edi
  !push ebp
  !mov  ebp, eax
 
  !xor  edi, edi
  !lp:
    !xor eax, eax
    !cmp byte [ecx+ebp-1], dl
    !sete al
    !add edi, eax
    !add ecx, -1
    !jnz lp
  !mov eax, edi
 
  !pop ebp
  !pop edi
  ProcedureReturn
EndProcedure

Procedure CountCharAsmSSE2_x86(*B.Ascii, Length, Char.a)
;Windows, 32-Bit
;based on posts by Road Runner. Excellent! 
;without prefetch
;"old" loop, sorry
;not full tested!

  !mov ecx,[p.v_Length]
  !movzx edx,byte[p.v_Char]
  !push esi
  !mov esi,[p.p_B+4]  
  !push edi 
  !push ebx

  ;Duplicate Char (Byte to Dqword)
  !mov dh,dl
  !mov eax,edx
  !shl edx,10h
  !mov dx,ax
  !movd xmm0,edx
  !shufps xmm0,xmm0,0        ;xmm0=16x Char
   
  ;Check Length
  !xor eax,eax               ;eax=0, General-Counter
  !mov edx,ecx               ;ecx=Length
  !shr ecx,4                 ;div.16
  !or ecx,ecx                ;ecx=Number of 16-Byte-Chunks
  !jz Rest                   ;Length<16, edx=Length

  !mov ebx,esi               ;Test for 16-Byte-Alignment
  !and ebx,0fh
  !jz IsAli16                ;is 16-Byte-Alignment
  !dec ecx                   ;Chunk-1
  !mov edi,10h 
  !sub edi,ebx               ;Alignment-Diff

  !sub edx,edi               ;Length-edi 
  !movd ebx,xmm0             ;Char
  !Rest0:
  !cmp byte[esi],bl
  !jne @f
  !inc eax
  !@@:
  !inc esi
  !dec edi
  !jnz Rest0

  !IsAli16:                  ;Alignment16

  !or ecx,ecx                ;ecx=Number of 16-Byte-Chunks
  !jz Rest                   ;Length<16 or Chunk set to Zero, edx=Length

  !mov ebx,ecx
  !shl ebx,4                 ;mul.16
  !sub edx,ebx               ;edx=Rest-Bytes without 16-Byte-Chunks

  !movdqa xmm1,xmm0          ;Save Char
  !pxor xmm2,xmm2            ;Counter in Chunk-Loop
  
  !mov edi,ecx 
  !cmp ecx,0ffh
  !jbe @f
  !mov edi,0ffh
  !@@:
  !pcmpeqb xmm0,[esi]
  !paddb xmm2,xmm0
  !movdqa xmm0,xmm1          ;Restore Char
  !add esi,16
  !dec edi
  !jnz @b
  
  !pxor xmm3,xmm3            ;xmm3=0
  !psubb xmm3,xmm2           ;neg -> pos

  !pxor xmm4,xmm4            ;xmm4=0
  !psadbw xmm4,xmm3          ;add the 8 upper and 8 lower Bytes, Results in Bits 0-15 and Bits 64-79
  !movdqa xmm3,xmm4          ;take a copy
  !psrldq xmm4,8             ;shift right 8 Bytes
  !paddw xmm3,xmm4           ;add the 2 8 byte sums to give a 16 byte sum
  !movd ebx,xmm3
  !add eax,ebx               ;add to General-Counter  

  !sub ecx,0ffh
  !jna @f                    ;zero or less, no Chunks more
  !pxor xmm2,xmm2
  !mov edi,0ffh
  !cmp ecx,0ffh
  !ja @b
  !mov edi,ecx
  !jmp @b
  !@@:

  !movdqa xmm0,xmm1          ;Restore Char in xmm0

  !Rest:
  !or edx,edx                ;edx=Rest-Bytes         
  !jz Ende

  !movd ecx,xmm0             ;Char
  !Rest1:
  !cmp byte[esi],cl
  !jne @f
  !inc eax
  !@@:
  !inc esi
  !dec edx
  !jnz Rest1
  !Ende:

  !pop ebx
  !pop edi 
  !pop esi
 
  ProcedureReturn
EndProcedure

B.s
B = "abcdefabcdefaaaaaffffbdddffff"
B = ReplaceString(Space(1000), " ", B) ;thousand copies of B
L = Len(B)

#Tries = 100000

time = ElapsedMilliseconds()
For U = 0 To #Tries
 NumHelle = CountCharAsmSSE2_x86(@B, L, 'a')
Next
TimeHelle = ElapsedMilliseconds()-time
Helle$ = "NumHelle :  "+Str(NumHelle)+"       Time : "+Str(TimeHelle)

time = ElapsedMilliseconds()
For U = 0 To #Tries
  NumTrond = CountChars(@B, L, 'a')
Next
TimeTrond = ElapsedMilliseconds()-time
Trond$ = "NumTrond : "+Str(NumTrond)+"       Time : "+Str(TimeTrond)

MessageRequester("Results", Helle$ + #CRLF$ + Trond$)

Greetings
Helle

Posted: **Thu Jun 09, 2011 1:39 am**

These are great, thanks!

Does anyone have an assembler version of ReplaceString?

PureBasic Forums - English

Count char in string, ultra fast SSE4.2

Re: Count char in string, ultra fast SSE4.2

Re: Count char in string, ultra fast SSE4.2

Re: Count char in string, ultra fast SSE4.2

Re: Count char in string, ultra fast SSE4.2

Re: Count char in string, ultra fast SSE4.2

Re: Count char in string, ultra fast SSE4.2

Re: Count char in string, ultra fast SSE4.2

Re: Count char in string, ultra fast SSE4.2

Re: Count char in string, ultra fast SSE4.2

Re: Count char in string, ultra fast SSE4.2

Re: Count char in string, ultra fast SSE4.2

Re: Count char in string, ultra fast SSE4.2

Re: Count char in string, ultra fast SSE4.2

Re: Count char in string, ultra fast SSE4.2