PureBasic Forums - English

Posted: **Tue Aug 19, 2014 5:26 am**

Thanks everyone again for testing

It's nice to see there's so much room for speed improvement using SSE2 instructions compared to the default implementation.
It probably is possible in a similar way to also create faster unicode versions of procedures like Mid or FindString.

Tenaja wrote:Is there a way to force (or ensure) string alignment on the Word boundaries? Fred?

If you want to use this, I recommend using the second version since in almost every case it is faster compared to my first attempt.
Even with non aligned strings, it should perform similar to the first version.
Normally when memory is allocated, it always is allocated on a DWord (4 byte) boundary which of course is also a Word boundary.
So I presume all PB strings with the .s type will be aligned. Maybe Fred can confirm this.
The thing is that PB supports non aligned structures. The code below is an example where the fixed length string is not aligned on a Word boundary.

Code: Select all

Structure MyStruct
  a.a
  s.s{20}
EndStructure

Debug @A.MyStruct\s

Posted: **Tue Aug 19, 2014 7:15 am**

@wilbert

Thank you. Very useful.

Would it be advisable to use a dummy, in structures, to force word alignment?

Posted: **Tue Aug 19, 2014 7:35 am**

davido wrote:Would it be advisable to use a dummy, in structures, to force word alignment?

It's only important if the structure contains fixed length strings.
One way is do use a dummy, another solution to use align

Code: Select all

Structure MyStruct Align 2
  a.a
  s.s{20}
EndStructure

Debug @A.MyStruct\s

Posted: **Tue Aug 19, 2014 9:25 am**

Strings uses system memory allocator which usually align memory on a DWORD boundary (4 bytes).

Posted: **Tue Aug 19, 2014 3:22 pm**

Here's an updated version of my code with an additional procedure to get the length of a string with the ability to specify a maximum value and a mid procedure.
Unfortunately the mid procedure is only faster for longer strings (at least on OSX)

Code: Select all

; Procedure.i UCS2Len(*UCS2String)
; Procedure.i UCS2LenM(*UCS2String, MaxLen.i = -1)
; Procedure.s UCS2Mid(*UCS2String, StartPos.i, Length.i = -1)


CompilerIf #PB_Compiler_Processor = #PB_Processor_x64
  ; x64 Macro definitions
  Macro UCS2Len_MOV
    !movdqa xmm0, [rdx]
  EndMacro
  Macro UCS2Len_ADD
    !lea rdx, [rdx + 16]
  EndMacro    
  Macro UCS2Len_CMP
    !cmp rdx, r8
  EndMacro  
CompilerElse
  ; x86 Macro definitions  
  Macro UCS2Len_MOV
    !movdqa xmm0, [edx]
  EndMacro
  Macro UCS2Len_ADD
    !lea edx, [edx + 16]
  EndMacro    
  Macro UCS2Len_CMP
    !cmp edx, ebx
  EndMacro  
CompilerEndIf  


Procedure.i UCS2Len(*UCS2String)
  ; init and check word boundary alignment
  CompilerIf #PB_Compiler_Processor = #PB_Processor_x64
    !mov rdx, [p.p_UCS2String]
    !mov ecx, edx
    !and rdx, -16
  CompilerElse
    !mov edx, [p.p_UCS2String]
    !mov ecx, edx
    !and edx, -16
  CompilerEndIf
  UCS2Len_MOV
  !pxor xmm1, xmm1
  !sub ecx, edx
  !test ecx, 1
  !jnz ucs2len1
  
  ; handle strings aligned to word boundary
  !pcmpeqw xmm0, xmm1
  !pmovmskb eax, xmm0
  !shr eax, cl
  !shl eax, cl
  UCS2Len_ADD
  !and eax, eax
  !jnz ucs2len3
  !ucs2len0:
  UCS2Len_MOV
  !pcmpeqw xmm0, xmm1
  !pmovmskb eax, xmm0
  UCS2Len_ADD
  !and eax, eax
  !jz ucs2len0
  !jmp ucs2len3
  
  ; handle strings not aligned to word boundary
  !ucs2len1:
  !movdqa xmm3, xmm0
  !pslldq xmm0, 1
  !inc cl
  !pcmpeqw xmm0, xmm1
  !pmovmskb eax, xmm0
  !shr eax, cl
  !shl eax, cl
  UCS2Len_ADD
  !and eax, eax
  !jnz ucs2len3
  !ucs2len2:
  UCS2Len_MOV
  !movdqa xmm2, xmm0
  !pslldq xmm0, 1
  !psrldq xmm3, 15
  !por xmm0, xmm3
  !pcmpeqw xmm0, xmm1
  !pmovmskb eax, xmm0
  UCS2Len_ADD
  !movdqa xmm3, xmm2
  !and eax, eax
  !jz ucs2len2
  
  ; exit procedure
  !ucs2len3:
  !bsf ecx, eax
  !lea eax, [ecx + edx - 16]
  !sub eax, [p.p_UCS2String]
  !shr eax, 1
  ProcedureReturn
EndProcedure


Procedure.i UCS2LenM(*UCS2String, MaxLen.i = -1)
  ; init and check word boundary alignment
  CompilerIf #PB_Compiler_Processor = #PB_Processor_x64
    !mov rdx, [p.p_UCS2String]
    !mov rax, [p.v_MaxLen]
    !lea r8, [rdx + rax * 2]
    !sar rax, 63
    !or r8, rax
    !mov ecx, edx
    !and rdx, -16
  CompilerElse
    !mov edx, [p.p_UCS2String]
    !mov eax, [p.v_MaxLen]
    !push ebx
    !lea ebx, [edx + eax * 2]
    !sar eax, 31
    !or ebx, eax
    !mov ecx, edx
    !and edx, -16
  CompilerEndIf
  UCS2Len_MOV
  !pxor xmm1, xmm1
  !sub ecx, edx
  !test ecx, 1
  !jnz ucs2lenm1
  
  ; handle strings aligned to word boundary
  !pcmpeqw xmm0, xmm1
  !pmovmskb eax, xmm0
  !shr eax, cl
  !shl eax, cl
  UCS2Len_ADD
  !and eax, eax
  !jnz ucs2lenm3
  !ucs2lenm0:
  UCS2Len_MOV
  !pcmpeqw xmm0, xmm1
  !pmovmskb eax, xmm0
  UCS2Len_CMP
  UCS2Len_ADD
  !ja ucs2lenm3
  !and eax, eax
  !jz ucs2lenm0
  !jmp ucs2lenm3
  
  ; handle strings not aligned to word boundary
  !ucs2lenm1:
  !movdqa xmm3, xmm0
  !pslldq xmm0, 1
  !inc cl
  !pcmpeqw xmm0, xmm1
  !pmovmskb eax, xmm0
  !shr eax, cl
  !shl eax, cl
  UCS2Len_ADD
  !and eax, eax
  !jnz ucs2lenm3
  !ucs2lenm2:
  UCS2Len_MOV
  !movdqa xmm2, xmm0
  !pslldq xmm0, 1
  !psrldq xmm3, 15
  !por xmm0, xmm3
  !pcmpeqw xmm0, xmm1
  !pmovmskb eax, xmm0
  UCS2Len_CMP
  UCS2Len_ADD
  !ja ucs2lenm3
  !movdqa xmm3, xmm2
  !and eax, eax
  !jz ucs2lenm2
  
  ; exit procedure
  !ucs2lenm3:
  !bsf ecx, eax
  CompilerIf #PB_Compiler_Processor = #PB_Processor_x64
    !lea rax, [rcx + rdx - 16]
    !cmp rax, r8
    !cmova rax, r8
  CompilerElse
    !lea eax, [ecx + edx - 16]
    !cmp eax, ebx
    !cmova eax, ebx
    !pop ebx  
  CompilerEndIf  
  !sub eax, [p.p_UCS2String]
  !shr eax, 1
  ProcedureReturn
EndProcedure


Procedure.s UCS2Mid(*UCS2String, StartPos.i, Length.i = -1)
  If StartPos > 0
    ProcedureReturn PeekS(*UCS2String + UCS2LenM(*UCS2String, StartPos - 1) << 1, Length)
  Else
    ProcedureReturn PeekS(*UCS2String + UCS2LenM(*UCS2String, 0) << 1, Length)
  EndIf
EndProcedure

Test code for the UCS2Mid procedure (important to compile with unicode enabled and debugger disabled !)

Code: Select all

S.s = "Unicode mid string, SSE2 vs PB" + #CRLF$ + #CRLF$  

t1 = ElapsedMilliseconds()
For i = 1 To 1000000
  r.s = UCS2Mid(@"Short", 3)
Next
t2 = ElapsedMilliseconds()
For i = 1 To 1000000
  r.s = Mid("Short", 3)
Next
t3 = ElapsedMilliseconds()
S + "Short string : " + Str(t2 - t1) + "(SSE2) vs " + Str(t3 - t2) + "(PB)" + #CRLF$


t1 = ElapsedMilliseconds()
For i = 1 To 1000000
  r.s = UCS2Mid(@"A bit longer string to test with", 19, 7)
Next
t2 = ElapsedMilliseconds()
For i = 1 To 1000000
  r.s = Mid("A bit longer string to test with", 19, 7)
Next
t3 = ElapsedMilliseconds()
S + "Longer string : " + Str(t2 - t1) + "(SSE2) vs " + Str(t3 - t2) + "(PB)" + #CRLF$


SpaceString.s = Space(20000)
t1 = ElapsedMilliseconds()
For i = 1 To 1000000
  r.s = UCS2Mid(@SpaceString, 3500, 20)
Next
t2 = ElapsedMilliseconds()
For i = 1 To 1000000
  r.s = Mid(SpaceString, 3500, 20)
Next
t3 = ElapsedMilliseconds()
S + "Very long string : " + Str(t2 - t1) + "(SSE2) vs " + Str(t3 - t2) + "(PB)" + #CRLF$

S = Str(OSVersion()) + #CRLF$ + CPUName() + #CRLF$ + #CRLF$ + S
SetClipboardText("[quote]" + S + "[/quote]")

MessageRequester("Test results", S)

Posted: **Tue Aug 19, 2014 3:43 pm**

I got the following results:

Unicode mid string, SSE2 vs PB

Short string : 114(SSE2) vs 99(PB)
Longer string : 101(SSE2) vs 96(PB)
Very long string : 562(SSE2) vs 2399(PB)
Intel(R) Core(TM) i3 CPU 540 @ 3.07GHz
80 - Windows 7 64 bit - PureBasic 5.30

Might I suggest adding:

Code: Select all

S + CPUName() +  #CRLF$
S + Str(OSVersion())
SetClipboardText(S)

Posted: **Tue Aug 19, 2014 4:00 pm**

---------------------------
Test results
---------------------------
Unicode mid string, SSE2 vs PB

Windows 7 x64

AMD FX(tm)-6300 Six-Core Processor

Short string : 111(SSE2) vs 99(PB)

Longer string : 109(SSE2) vs 102(PB)

Very long string : 688(SSE2) vs 2269(PB)

---------------------------
OK
---------------------------

---------------------------
Test results
---------------------------
Unicode mid string, SSE2 vs PB

Windows 7 x86

AMD FX(tm)-6300 Six-Core Processor

Short string : 110(SSE2) vs 99(PB)

Longer string : 107(SSE2) vs 107(PB)

Very long string : 688(SSE2) vs 1936(PB)

---------------------------
OK
---------------------------

Posted: **Wed Aug 20, 2014 4:06 am**

Windows 8.1 x64-PB 5.30 x86
Intel(R) Core(TM) i3-3227U CPU @ 1.90GHz

Unicode mid string, SSE2 vs PB

Short string : 146(SSE2) vs 135(PB)
Longer string : 152(SSE2) vs 158(PB)
Very long string : 690(SSE2) vs 4085(PB)

Posted: **Wed Aug 20, 2014 4:12 am**

Linux x64-PB 5.30
AMD FX(tm)-6100 Six-Core

Unicode mid string, SSE2 vs PB

Short string : 93(SSE2) vs 64(PB)
Longer string : 83(SSE2) vs 88(PB)
Very long string : 680(SSE2) vs 2273(PB)

Posted: **Wed Aug 20, 2014 11:05 am**

Windows 7 SP1, PB 5.30 x64
Intel(R) Core(TM) i5-3450 CPU @ 3.10GHz

Unicode mid string, SSE2 vs PB

Short string : 101(SSE2) vs 86(PB)
Longer string : 82(SSE2) vs 82(PB)
Very long string : 391(SSE2) vs 2229(PB)

I think I'd prefer this version that takes a hit on small strings but I get a performance equivlaence or gain on medium to long strings.

It's when I've got to start meddling with big strings that I want the high performance

Posted: **Wed Aug 20, 2014 2:39 pm**

Foz wrote:I think I'd prefer this version that takes a hit on small strings but I get a performance equivlaence or gain on medium to long strings.

It's when I've got to start meddling with big strings that I want the high performance

I agree

I'll try if I can improve the timing for short strings a bit more but I guess there's not much room left for improvement.

Posted: **Thu Aug 21, 2014 1:55 am**

Thanks wilbert, this is real good

Posted: **Thu Aug 21, 2014 3:30 am**

Do you have any idea Wilbert how this code would compare with your SSE42 code you had posted previously in the Assembler section?

Posted: **Thu Aug 21, 2014 7:20 am**

RichAlgeni wrote:Do you have any idea Wilbert how this code would compare with your SSE42 code you had posted previously in the Assembler section?

I'm sure someone else created the SSE4.2 code you are mentioning since my computer doesn't support it.

SSE4.2 was designed for string handling so it probably would be faster.
The main problem with it would be that there are still a lot of computers that don't support it while there are very few that don't support SSE2 nowadays.
What I also like about SSE2 is that it is part of the x64 standard. All x64 processors support it. With SSE3 and later this isn't always the case.

Posted: **Thu Aug 21, 2014 6:17 pm**

Understood.

Thanks for your time and efforts Wilbert!

Rich

PureBasic Forums - English

Unicode string Len and Mid procedures using SSE2

Re: Unicode (UCS2) string length using SSE2

Re: Unicode (UCS2) string length using SSE2

Re: Unicode (UCS2) string length using SSE2

Re: Unicode (UCS2) string length using SSE2

Re: Unicode (UCS2) string length using SSE2

Re: Unicode (UCS2) string length using SSE2

Re: Unicode (UCS2) string length using SSE2

Re: Unicode string Len and Mid procedures using SSE2

Re: Unicode string Len and Mid procedures using SSE2

Re: Unicode string Len and Mid procedures using SSE2

Re: Unicode string Len and Mid procedures using SSE2

Re: Unicode string Len and Mid procedures using SSE2

Re: Unicode string Len and Mid procedures using SSE2

Re: Unicode string Len and Mid procedures using SSE2

Re: Unicode string Len and Mid procedures using SSE2