Here is a general template function that takes care of all the complicated stuff that has to be done.
The example is for CountChar(). Replace the code in the function-individual section with the code of your own function.
Many thanks for the background info!
With this knowledge I updated the code.
Lessons learned:
- On x64, using the XMM registers is faster (AMD & Intel).
- On x32, using the MMX registers is faster (AMD & Intel).
- Classic ASM code shows large speed differences between CPUs, especially between AMD and Intel (on AMD, loading a character with movzx edx and comparing the full register is very fast; on Intel it is slow. Intel prefers mov cx and comparing cx).
- On x64 a 4-char version is the best choice, on x32 a 2-char version. I also implemented an 8-char version for x64, but it needs nearly the same time as the 4-char version.
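As a rough illustration of the 4-char scheme (compare four 16-bit chars at once, mask out everything behind the end of the string, then count the hits with a popcount), here is a small model in portable C. It is only a sketch and not the ASM from this post: `count_char16`, `broadcast16`, `cmpeq16`, `popcount64` and `lowest_bit` are hypothetical helper names, a `uint64_t` stands in for the low half of an XMM register, and the sketch assumes that the buffer is readable in whole 4-char blocks up to and including the block that holds the terminator and that the search char is not 0.

```c
#include <stddef.h>
#include <stdint.h>

/* Copy one 16-bit char into all four 16-bit lanes (the job of PSHUFLW ..., 0). */
static uint64_t broadcast16(uint16_t c) {
    return (uint64_t)c * 0x0001000100010001ULL;
}

/* 0xFFFF in every 16-bit lane where a and b are equal (the job of PCMPEQW). */
static uint64_t cmpeq16(uint64_t a, uint64_t b) {
    uint64_t diff = a ^ b, result = 0;
    for (int lane = 0; lane < 4; lane++) {
        uint64_t lane_mask = 0xFFFFULL << (16 * lane);
        if ((diff & lane_mask) == 0) result |= lane_mask;
    }
    return result;
}

/* Number of set bits (the job of POPCNT). */
static unsigned popcount64(uint64_t x) {
    unsigned n = 0;
    while (x) { x &= x - 1; n++; }
    return n;
}

/* Index of the lowest set bit (the job of BSF); x must not be 0. */
static unsigned lowest_bit(uint64_t x) {
    unsigned i = 0;
    while ((x & 1) == 0) { x >>= 1; i++; }
    return i;
}

/* Assumption: s is padded so that every 4-char block up to the terminator
   is readable, and search != 0 (masked-out lanes are 0 and would match). */
size_t count_char16(const uint16_t *s, uint16_t search, size_t *out_len) {
    if (s == NULL) return 0;
    const uint64_t pattern = broadcast16(search);
    size_t count = 0;
    for (size_t i = 0; ; i += 4) {
        /* Assemble the block so that lane 0 holds the first char. */
        uint64_t block = (uint64_t)s[i]
                       | (uint64_t)s[i + 1] << 16
                       | (uint64_t)s[i + 2] << 32
                       | (uint64_t)s[i + 3] << 48;
        uint64_t nulls = cmpeq16(block, 0);
        if (nulls != 0) {
            unsigned bit = lowest_bit(nulls);   /* 0, 16, 32 or 48 */
            if (out_len) *out_len = i + bit / 16;
            block &= (1ULL << bit) - 1;         /* keep only chars before the terminator */
        }
        count += popcount64(cmpeq16(block, pattern)) / 16;  /* 16 set bits per hit */
        if (nulls != 0) return count;
    }
}
```

The mask `(1ULL << bit) - 1` plays the role of the BTS/SAR/NOT sequence in the ASM: both leave only the lanes in front of the first NullChar.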
; V1.04 2024/07/09 ; added Macros for Register Backup. Changed to Function Macros
; V1.03 2024/02/04 ; added a 2nd version with XMM Register instead of MMX (but 20% slower on Ryzen)
; V1.02 2024/02/04 ; Removed the double Compare Chars=0 and add EMMS instruction to end MMX
; V1.01 2024/02/03 ; fixed Bug in Register Backup
; Caller/callee saved registers
; The x64 ABI considers the registers RAX, RCX, RDX, R8, R9, R10, R11, And XMM0-XMM5 volatile.
; When present, the upper portions of YMM0-YMM15 And ZMM0-ZMM15 are also volatile. On AVX512VL,
; the ZMM, YMM, And XMM registers 16-31 are also volatile. When AMX support is present,
; the TMM tile registers are volatile. Consider volatile registers destroyed on function calls
; unless otherwise safety-provable by analysis such As whole program optimization.
; The x64 ABI considers registers RBX, RBP, RDI, RSI, RSP, R12, R13, R14, R15, And XMM6-XMM15 nonvolatile.
; They must be saved And restored by a function that uses them.
; MMX and SSE Registers
; MM0..MM7 : MMX : Pentium P55C (Q1 1997) and AMD K6 (Q2 1997)
; XMM0..XMM15 : SSE : Intel Core2 and AMD K8 Athlon64 (2003)
; YMM0..YMM15 : AVX256 : Intel SandyBridge (Q1 2011) and AMD Bulldozer (Q4 2011)
; [X/Y/Z]MM0..[X/Y/Z]MM31 : AVX512 : Tiger Lake (Q4 2020) and AMD Zen4 (Q4 2022)
EnableExplicit
; @f = Jump forward to next @@; @b = Jump backward to next @@
; ----------------------------------------------------------------------
; Structures to reserve Space on the Stack for ASM_PUSH, ASM_POP
; ----------------------------------------------------------------------
Structure TStack_16Byte
R.q[2]
EndStructure
Structure TStack_32Byte
R.q[4]
EndStructure
Structure TStack_48Byte
R.q[6]
EndStructure
Structure TStack_64Byte
R.q[8]
EndStructure
Structure TStack_96Byte
R.q[12]
EndStructure
Structure TStack_128Byte
R.q[16]
EndStructure
; separate Macros for EBX,RBX because this is often needed, especially for x32
Macro ASM_PUSH_EBX()
Protected mEBX
!MOV [p.v_mEBX], EBX
EndMacro
Macro ASM_POP_EBX(ptrREG)
!MOV EBX, [p.v_mEBX]
EndMacro
;- ----------------------------------------------------------------------
;- MMX Registers
;- ----------------------------------------------------------------------
; All MMX-Registers are non volatile (shared with FPU-Registers)
; After the end of use of MMX-Registers an EMMS command must follow to enable
; correct FPU operations again!
Macro ASM_PUSH_MM_0to3(ptrREG)
Protected M03.TStack_32Byte
!LEA ptrREG, [p.v_M03] ; ptrREG = @M03 = Pointer to RegisterBackupStruct
!MOVQ [ptrREG], MM0
!MOVQ [ptrREG+8], MM1
!MOVQ [ptrREG+16], MM2
!MOVQ [ptrREG+24], MM3
EndMacro
Macro ASM_POP_MM_0to3(ptrREG)
!LEA ptrREG, [p.v_M03] ; ptrREG = @M03 = Pointer to RegisterBackupStruct
!MOVQ MM0, [ptrREG]
!MOVQ MM1, [ptrREG+8]
!MOVQ MM2, [ptrREG+16]
!MOVQ MM3, [ptrREG+24]
EndMacro
Macro ASM_PUSH_MM_4to5(ptrREG)
Protected M45.TStack_32Byte
!LEA ptrREG, [p.v_M45] ; ptrREG = @M45 = Pointer to RegisterBackupStruct
!MOVQ [ptrREG], MM4
!MOVQ [ptrREG+8], MM5
EndMacro
Macro ASM_POP_MM_4to5(ptrREG)
!LEA ptrREG, [p.v_M45] ; ptrREG = @M45 = Pointer to RegisterBackupStruct
!MOVQ MM4, [ptrREG]
!MOVQ MM5, [ptrREG+8]
EndMacro
Macro ASM_PUSH_MM_4to7(ptrREG)
Protected M47.TStack_32Byte
!LEA ptrREG, [p.v_M47] ; ptrREG = @M47 = Pointer to RegisterBackupStruct
!MOVQ [ptrREG], MM4
!MOVQ [ptrREG+8], MM5
!MOVQ [ptrREG+16], MM6
!MOVQ [ptrREG+24], MM7
EndMacro
Macro ASM_POP_MM_4to7(ptrREG)
!LEA ptrREG, [p.v_M47] ; ptrREG = @M47 = Pointer to RegisterBackupStruct
!MOVQ MM4, [ptrREG]
!MOVQ MM5, [ptrREG+8]
!MOVQ MM6, [ptrREG+16]
!MOVQ MM7, [ptrREG+24]
EndMacro
;- ----------------------------------------------------------------------
;- XMM Registers
;- ----------------------------------------------------------------------
; because of unaligned Memory latency we use 2x64 Bit MOV instead of 1x128 Bit MOV
; MOVDQU [ptrREG], XMM4 -> MOVLPS [ptrREG], XMM4 and MOVHPS [ptrREG+8], XMM4
; x64 Processor can do 2 64Bit Memory transfers parallel
; XMM4:XMM5 normally are volatile and we do not have to preserve them
; ATTENTION: XMM4:XMM5 must be preserved only when __vectorcall is used
; as far as I know PB doesn't use __vectorcall in the ASM Backend. So if we use XMM4:XMM5
; within a Procedure where __vectorcall isn't used, we don't have to preserve them.
; So we keep the Macro empty. If you want to activate it, just activate the code below
Macro ASM_PUSH_XMM_4to5(ptrREG)
EndMacro
Macro ASM_POP_XMM_4to5(ptrREG)
EndMacro
; Macro ASM_PUSH_XMM_4to5(ptrREG)
; Protected X45.TStack_32Byte
; !LEA ptrREG, [p.v_X45] ; ptrREG = @X45 = Pointer to RegisterBackupStruct
; !MOVLPS [ptrREG], XMM4
; !MOVHPS [ptrREG+8], XMM4
; !MOVLPS [ptrREG+16], XMM5
; !MOVHPS [ptrREG+24], XMM5
; EndMacro
; Macro ASM_POP_XMM_4to5(ptrREG)
; !LEA ptrREG, [p.v_X45] ; ptrREG = @X45 = Pointer to RegisterBackupStruct
; !MOVLPS XMM4, [ptrREG]
; !MOVHPS XMM4, [ptrREG+8]
; !MOVLPS XMM5, [ptrREG+16]
; !MOVHPS XMM5, [ptrREG+24]
; EndMacro
; ======================================================================
Macro ASM_PUSH_XMM_6to7(ptrREG)
Protected X67.TStack_32Byte
!LEA ptrREG, [p.v_X67] ; ptrREG = @X67 = Pointer to RegisterBackupStruct
!MOVLPS [ptrREG], XMM6
!MOVHPS [ptrREG+8], XMM6
!MOVLPS [ptrREG+16], XMM7
!MOVHPS [ptrREG+24], XMM7
EndMacro
Macro ASM_POP_XMM_6to7(ptrREG)
!LEA ptrREG, [p.v_X67] ; ptrREG = @X67 = Pointer to RegisterBackupStruct
!MOVLPS XMM6, [ptrREG]
!MOVHPS XMM6, [ptrREG+8]
!MOVLPS XMM7, [ptrREG+16] ; restore XMM7 (not XMM6 a second time)
!MOVHPS XMM7, [ptrREG+24]
EndMacro
; Fast LOAD/SAVE XMM-Register; MOVDQU command for 128Bit has long latency.
; 2 x 64Bit loads are faster! Processed parallel in 1 cycle with low or 0 latency
; this optimization is taken from the AMD code optimization guide
Macro ASM_LD_XMMM(REGX, ptrREG)
!MOVLPS REGX, [ptrREG]
!MOVHPS REGX, [ptrREG+8]
EndMacro
Macro ASM_SAV_XMMM(REGX, ptrREG)
!MOVLPS [ptrREG], REGX
!MOVHPS [ptrREG+8], REGX
EndMacro
;- --------------------------------------------------
;- CountChar()
;- --------------------------------------------------
; **************************************************
; x64 Assembler Version with XMM-Registers
; **************************************************
; Procedure CountChar(*String, cSearch.c, *outLength.Integer=0)
Macro ASMx64_CountChar()
; Used Registers:
; RAX : Pointer *String
; RCX : operating Register and Bool: 1 if NullChar was found
; RDX : operating Register
; R8 : Counter
; R9 : operating Register
; XMM0 : the 4 Chars
; XMM1 : cSearch shuffled to all Words
; XMM2 : 0 to search for EndOfString
; XMM3 : the 4 Chars Backup
; If you use XMM4..XMM7 you have to backup it first
; XMM4 :
; XMM5 :
; XMM6 :
; XMM7 :
; ASM_PUSH_XMM_4to5(RDX) ; optional PUSH() see PbFw_ASM_Macros.pbi
; ----------------------------------------------------------------------
; Check *String-Pointer and MOV it to RAX as operating register
; ----------------------------------------------------------------------
!MOV RAX, [p.p_String] ; load String address
!CMP RAX, 0 ; If *String = 0
!JE .Return ; Exit
!SUB RAX, 8 ; Sub 8 to start with Add 8 in the Loop
; ----------------------------------------------------------------------
; Setup start parameter for registers
; ----------------------------------------------------------------------
; your individual setup parameters
!MOV DX, [p.v_cSearch] ; should be DX not RDX because of 1 Word
!MOVQ XMM1, RDX
!PSHUFLW XMM1, XMM1, 0 ; Shuffle/Copy Word0 to all Words
; here are the standard setup parameters
!XOR RCX, RCX ; operating Register and BOOL for EndOfStringFound
!XOR R8, R8 ; Counter = 0
!PXOR XMM2, XMM2 ; XMM2 = 0 ; Mask to search for NullChar = EndOfString
; ----------------------------------------------------------------------
; Main Loop
; ----------------------------------------------------------------------
!.Loop:
!ADD RAX, 8 ; *String + 8 => NextChars
!MOVQ XMM0, [RAX] ; load 4 Chars to XMM0
!MOVQ XMM3, [RAX] ; load 4 Chars to XMM3
!PCMPEQW XMM0, XMM2 ; Compare with 0
!MOVQ RDX, XMM0 ; RDX CompareResult contains FFFF for each NullChar
!TEST RDX, RDX ; If 0 : No NullChar found
!JZ .EndIf ; JumpIfEqual 0 => JumpToEndif if Not EndOfString
; If EndOfStringFound
; Calculate the Byteposition of EndOfString [0..6] using Bitscan
!BSF RDX, RDX ; BitScanForward => No of the LSB
!SHR RDX, 3 ; BitNo to ByteNo
!ADD RAX, RDX ; Actual StringPointer + OffsetOf_NullChar
!MOV RCX, RDX ; Save ByteOfsett of NullChar in RCX
!SUB RAX, [p.p_String] ; RAX *EndOfString - *String
!SHR RAX, 1 ; NoOfBytes to NoOfWord => Len(String)
;check for Return of Length and move it to *outLength
!MOV RDX, [p.p_outLength]
!CMP RDX, 0
!JE @f ; If *outLength
!MOV [RDX], RAX ; *outLength = Len()
!@@: ; Endif
; If a NullChar was found then create a Bitmask for setting all Chars after the NullChar to 00h
; RCX holds the Backup of the ByteOffset of the NullChar
!CMP RCX, 6 ; If NullChar is the last Char : Byte[7,6]=Word[3])
!JGE @f ; => we don't have to eliminate chars from testing
; If WordPos(EndOfString) <> 3 ; Word3 if EndOfString is in Bit 48..63 = Word 3
!SHL RCX, 3 ; ByteNo to BitNo
!NEG RCX ; RCX = -LSB
!ADD RCX, 63 ; RCX = (63-LSB)
!XOR RDX, RDX ; RDX = 0
!BTS RDX, 63 ; set Bit 63 => RDX = 8000000000000000h
!SAR RDX, CL ; Do an arithmetic Shift Right (63-LSB) : EndOfString=Word2 => Mask $FFFF.FFFF.0000.0000, Word1 $FFFF.FFFF.FFFF.0000
!NOT RDX ; Now invert our Mask so we get a Mask to filter out all Chars after EndOfString $0000.0000.FFFF.FFFF or $0000.0000.0000.FFFF
!MOVQ XMM0, RDX ; Now move this Mask to XMM0, the operating Register
!PAND XMM3, XMM0 ; XMM3 the CharBackup AND Mask => we select only Chars up to EndOfString
!@@:
!MOV RCX, 1 ; BOOL EndOfStringFound = #TRUE
!.EndIf: ; Endif ; EndOfStringFound
; ------------------------------------------------------------
; Start of function individual code! Do not use RCX here!
; ------------------------------------------------------------
; Count number of found Chars
!MOVQ XMM0, XMM3 ; Load the 4 Chars to operating Register
!PCMPEQW XMM0, XMM1 ; Compare the 4 Chars with cSearch
!MOVQ RDX, XMM0 ; CompareResult to RDX
!TEST RDX, RDX
!JZ @f ; Jump to Endif if cSearch not found
!POPCNT RDX, RDX ; Count number of set Bits (16 for each found Char)
!SHR RDX, 4 ; NoOfBits [0..64] to NoOfWords [0..4]
!ADD R8, RDX ; ADD NoOfFoundChars to Counter R8
!@@:
; ------------------------------------------------------------
!TEST RCX, RCX ; Check BOOL EndOfStringFound
!JZ .Loop ; Continue Loop if Not EndOfStringFound
!.EndLoop:
; ----------------------------------------------------------------------
; Handle Return value and POP-Registers
; ----------------------------------------------------------------------
!MOV RAX, R8 ; ReturnValue to RAX
!.Return:
; ASM_POP_XMM_4to5(RDX) ; POP non volatile Registers we PUSH'ed at start
ProcedureReturn ; RAX
EndMacro
; **************************************************
; x64 Assembler 8-Char Version with XMM-Registers
; **************************************************
; ATTENTION! Treat as experimental, the counting is not verified!
; The 8-Char Version exists only to show the speed difference between
; the 4 and 8 Char Version.
; Counting errors are most likely in the Shuffles/Shifts between Lo- and Hi-Qword.
; Compare its result with the 4 Char Version before using it.
; As I expected: the 8 Char version doesn't make sense.
; It's not faster than the 4 Char version but more complicated
; Procedure CountChar(*String, cSearch.c, *outLength.Integer=0)
Macro ASMx64_8C_CountChar()
; 8 Char-version
; Used Registers:
; RAX : Pointer *String
; RCX : operating Register and Bool: 1 if NullChar was found
; RDX : operating Register
; R8 : Counter
; R9 : operating Register
; R10 :
; R11 :
; XMM0 : the 8 Chars
; XMM1 : operating Register
; XMM2 : 0 to search for EndOfString
; XMM3 : the 8 Chars Backup
; If you use XMM4..XMM7 you have to backup it first
; XMM4 : cSearch shuffled to all Words
; XMM5 : operating Register
; XMM6 :
; XMM7 :
ASM_PUSH_XMM_4to5(RDX) ; optional PUSH() see PbFw_ASM_Macros.pbi
; ----------------------------------------------------------------------
; Check *String-Pointer and MOV it to RAX as operating register
; ----------------------------------------------------------------------
!MOV RAX, [p.p_String] ; load String address
!CMP RAX, 0 ; If *String = 0
!JE .Return ; Exit
!SUB RAX, 16 ; Sub 16 to start with Add 16 in the Loop
; ----------------------------------------------------------------------
; Setup start parameter for registers
; ----------------------------------------------------------------------
; your individual setup parameters
!MOV DX, [p.v_cSearch] ; should be DX not RDX because of 1 Word
!MOVQ XMM4, RDX
!PSHUFLW XMM4, XMM4, 0 ; Shuffle/Copy Word0 to all Words
!PSHUFD XMM4, XMM4, 01000100b ; Copy Lo-Qword to Hi-Qword
; here are the standard setup parameters
!XOR RCX, RCX ; operating Register and BOOL for EndOfStringFound
!XOR R8, R8 ; Counter = 0
!PXOR XMM2, XMM2 ; XMM2 = 0 ; Mask to search for NullChar = EndOfString
; ----------------------------------------------------------------------
; Main Loop
; ----------------------------------------------------------------------
!.Loop:
!ADD RAX, 16 ; *String + 16 => NextChars
;!MOVDQU XMM0, [RAX]
ASM_LD_XMMM(XMM0, RAX) ; optimized load 8 Chars to XMM0
!MOVDQA XMM3, XMM0 ; copy 8 Chars to XMM3
!PCMPEQW XMM0, XMM2 ; Compare with 0
!PSHUFD XMM5, XMM0, 01001110b ; Switch Hi/Lo QWord of XMM0 to XMM5
!MOVQ RDX, XMM0 ; RDX CompareResult contains FFFF for each NullChar
!MOVQ RCX, XMM5
!MOV R9, RCX ; R9 = CompareResult Hi-Qword
!OR R9, RDX ; OR CompareResult Lo-Qword; ZF is set if both are 0
!JZ .EndIf ; JumpIfZero => JumpToEndif if Not EndOfString
; If EndOfStringFound
; Calculate the Byteposition of EndOfString [0..14] using Bitscan
; start with the Lo-Qword
!TEST RDX, RDX ; CompareResult Lo-Qword = 0 ?
!JZ .NullInHi ; => the NullChar is in the Hi-Qword
!BSF RDX, RDX ; BitScanForward => No of the LSB
!SHR RDX, 3 ; BitNo to ByteNo [0..6]
!JMP @f
!.NullInHi:
!MOVQ RDX, XMM5 ; CompareResult Hi-Qword
!BSF RDX, RDX ; BitScanForward => No of the LSB
!SHR RDX, 3 ; BitNo to ByteNo
!ADD RDX, 8 ; + 8 Bytes = 4 Chars Offset for the Hi-Qword
!@@:
!ADD RAX, RDX ; Actual StringPointer + OffsetOf_NullChar
!MOV RCX, RDX ; Save ByteOffset of NullChar in RCX
!SUB RAX, [p.p_String] ; RAX *EndOfString - *String
!SHR RAX, 1 ; NoOfBytes to NoOfWord => Len(String)
;check for Return of Length and move it to *outLength
!MOV RDX, [p.p_outLength]
!TEST RDX, RDX
!JZ @f ; If *outLength
!MOV [RDX], RAX ; *outLength = Len()
!@@: ; Endif
!CMP RCX, 8 ; ByteNo >= 8 : NullChar in Hi-Qword
!JGE .Hi
; NullChar in Lo-Qword ; If ByteNo < 8
!SHL RCX, 3 ; ByteNo to BitNo
!NEG RCX ; RCX = -LSB
!ADD RCX, 63 ; RCX = (63-LSB)
!XOR RDX, RDX ; RDX = 0
!BTS RDX, 63 ; set Bit 63 => RDX = 8000000000000000h
!SAR RDX, CL ; Do an arithmetic Shift Right (63-LSB) : EndOfString=Word2 => Mask $FFFF.FFFF.0000.0000, Word1 $FFFF.FFFF.FFFF.0000
!NOT RDX ; Now invert our Mask so we get a Mask to filter out all Chars after EndOfString $0000.0000.FFFF.FFFF or $0000.0000.0000.FFFF
!MOVQ XMM0, RDX ; Lo-Qword = Mask, Hi-Qword = 0 (MOVQ clears the upper 64 Bits)
!JMP .Hi_Endif
!.Hi: ; NullChar in Hi-Qword
!SUB RCX, 8 ; ByteNo inside the Hi-Qword
!SHL RCX, 3 ; ByteNo to BitNo
!NEG RCX ; RCX = -LSB
!ADD RCX, 63 ; RCX = (63-LSB)
!XOR RDX, RDX ; RDX = 0
!BTS RDX, 63 ; set Bit 63 => RDX = 8000000000000000h
!SAR RDX, CL ; Do an arithmetic Shift Right (63-LSB) : EndOfString=Word2 => Mask $FFFF.FFFF.0000.0000, Word1 $FFFF.FFFF.FFFF.0000
!NOT RDX ; Now invert our Mask so we get a Mask to filter out all Chars after EndOfString $0000.0000.FFFF.FFFF or $0000.0000.0000.FFFF
!MOVQ XMM5, RDX ; XMM5 Lo-Qword = Mask for the Hi-Qword
!PCMPEQD XMM0, XMM0 ; XMM0 = all Bits set => keep all Chars of the Lo-Qword
!PUNPCKLQDQ XMM0, XMM5 ; XMM0 = [Hi-Qword: Mask | Lo-Qword: FFFF.FFFF.FFFF.FFFF]
!.Hi_Endif:
!PAND XMM3, XMM0 ; XMM3 the CharBackup AND Mask => we select only Chars up to EndOfString
!MOV RCX, 1 ; BOOL EndOfStringFound = #TRUE
!.EndIf: ; Endif ; EndOfStringFound
; ------------------------------------------------------------
; Start of function individual code! Do not use RCX here!
; ------------------------------------------------------------
; Count number of found Chars
!MOVDQA XMM0, XMM3 ; Load the 8 Chars to operating Register (MOVQ would copy only the Lo-Qword)
!PCMPEQW XMM0, XMM4 ; Compare the 8 Chars with cSearch
!MOVQ RDX, XMM0 ; CompareResult Lo-Qword to RDX
!TEST RDX, RDX
!JZ @f ; Jump to Endif if cSearch not found
!POPCNT RDX, RDX ; Count number of set Bits (16 for each found Char)
!SHR RDX, 4 ; NoOfBits [0..64] to NoOfWords [0..4]
!ADD R8, RDX ; ADD NoOfFoundChars to Counter R8
!@@:
!PSRLDQ XMM0, 8 ; Byte-Shift Right by 8 Bytes: Hi-Qword to Lo-Qword
!MOVQ RDX, XMM0 ; CompareResult Hi-Qword to RDX
!TEST RDX, RDX
!JZ @f ; Jump to Endif if cSearch not found
!POPCNT RDX, RDX ; Count number of set Bits (16 for each found Char)
!SHR RDX, 4 ; NoOfBits [0..64] to NoOfWords [0..4]
!ADD R8, RDX ; ADD NoOfFoundChars to Counter R8
!@@:
; ------------------------------------------------------------
!TEST RCX, RCX ; Check BOOL EndOfStringFound
!JZ .Loop ; Continue Loop if Not EndOfStringFound
!.EndLoop:
; ----------------------------------------------------------------------
; Handle Return value and POP-Registers
; ----------------------------------------------------------------------
!MOV RAX, R8 ; ReturnValue to RAX
!.Return:
ASM_POP_XMM_4to5(RDX) ; POP non volatile Registers we PUSH'ed at start
ProcedureReturn ; RAX
EndMacro
; **************************************************
; x32 Assembler Version with MMX-Registers
; **************************************************
; Procedure CountChar(*String, cSearch.c, *outLength.Integer=0)
Macro ASMx32_MMX_CountChar()
Protected N
Protected eos ; Bool EndOfString
; Used Registers:
; EAX : Pointer *String
; ECX : operating Register and Bool: 1 if NullChar was found
; EDX : operating Register
; MM0 : the 2 Chars
; MM1 : cSearch shuffled to all Words
; MM2 : 0 to search for EndOfString
; MM3 : the 2 Chars Backup
; MM4 :
; MM5 :
; MM6 :
; MM7 :
ASM_PUSH_MM_0to3(EDX) ; PUSH nonvolatile MMX-Registers
; ASM_PUSH_MM_4to5(EDX) ; PUSH nonvolatile MMX-Registers
; ----------------------------------------------------------------------
; Check *String-Pointer and MOV it to EAX as operating register
; ----------------------------------------------------------------------
!MOV EAX, [p.p_String] ; load String address
!CMP EAX, 0 ; If *String = 0
!JE .Return ; Exit
!SUB EAX, 4 ; Sub 4 to start with Add 4 in the Loop
; ----------------------------------------------------------------------
; Setup start parameter for registers
; ----------------------------------------------------------------------
; your individual setup parameters
!MOV DX, [p.v_cSearch] ; should be DX not EDX because of 1 Word
!MOVD MM1, EDX
!PSHUFW MM1, MM1, 0 ; Shuffle/Copy Word0 to all Words
; here are the standard setup parameters
!XOR ECX, ECX ; operating Register and BOOL for EndOfStringFound
!PXOR MM2, MM2 ; MM2 = 0 ; Mask to search for NullChar = EndOfString
; ----------------------------------------------------------------------
; Main Loop
; ----------------------------------------------------------------------
!.Loop:
!ADD EAX, 4 ; *String + 4 => NextChars
!MOVD MM0, [EAX] ; load 2 Chars to MM0
!MOVD MM3, [EAX] ; load 2 Chars to MM3
!PCMPEQW MM0, MM2 ; Compare with 0
!MOVD EDX, MM0 ; EDX CompareResult contains FFFF for each NullChar
!CMP EDX, 0 ; If 0 : No NullChar found
!JE .EndIf ; JumpIfEqual 0 => JumpToEndif if Not EndOfString
; If EndOfStringFound
; Calculate the Byteposition of EndOfString [0..2] using Bitscan
!BSF EDX, EDX ; BitScanForward => No of the LSB
!SHR EDX, 3 ; BitNo to ByteNo
!ADD EAX, EDX ; Actual StringPointer + OffsetOf_NullChar
!SUB EAX, [p.p_String] ; EAX *EndOfString - *String
!SHR EAX, 1 ; NoOfBytes to NoOfWord => Len(String)
;check for Return of Length and move it to *outLength
!MOV EDX, [p.p_outLength]
!CMP EDX, 0
!JE @f ; If *outLength
!MOV [EDX], EAX ; *outLength = Len()
!@@: ; Endif
; If a NullChar was found then create a Bitmask for setting all Chars after the NullChar to 00h
!MOVD ECX, MM0 ; Load compare Result of 2xChars=0 to ECX
!BSF ECX, ECX ; Find No of LSB: 0 or 16 (the compare Result is never 0 here)
!CMP ECX, 16 ; If LSB >= 16 the EndOfString is the last of the 2 Chars
!JGE @f ; => we don't have to eliminate chars from testing
; If WordPos(EndOfString) <> 1 ; Word1 if EndOfString is in Bit 16..31
!NEG ECX ; ECX = -LSB
!ADD ECX, 31 ; ECX = (31-LSB)
!XOR EDX, EDX ; EDX = 0
!BTS EDX, 31 ; set Bit 31 => EDX = 80000000h
!SAR EDX, CL ; Do an arithmetic Shift Right (31-LSB) : EndOfString=Word0 => Mask $FFFF.FFFF
!NOT EDX ; Now invert our Mask so we get a Mask to filter out all Chars after EndOfString: $0000.0000
!MOVD MM0, EDX ; Now move this Mask to MM0, the operating Register
!PAND MM3, MM0 ; MM3 the CharBackup AND Mask => we select only Chars up to EndOfString
!@@:
;!MOV ECX, 1 ; BOOL EndOfStringFound = #TRUE
!MOV [p.v_eos], DWORD 1 ; at x32 we need ECX, so Bool EndOfstring in Var
!.EndIf: ; Endif ; EndOfStringFound
; ------------------------------------------------------------
; Start of function individual code! You can use ECX here!
; ------------------------------------------------------------
; Count number of found Chars
!MOVQ MM0, MM3 ; Load the 2 Chars to operating Register
!PCMPEQW MM0, MM1 ; Compare the 2 Chars with cSearch
!MOVD EDX, MM0 ; CompareResult to EDX
!CMP EDX, 0
!JE @f ; Jump to Endif if cSearch not found
!POPCNT EDX, EDX ; Count number of set Bits (16 for each found Char)
!SHR EDX, 4 ; NoOfBits [0..32] to NoOfWords [0..2]
!MOV ECX, DWORD [p.v_N]
!ADD ECX, EDX ; ADD NoOfFoundChars to Counter
!MOV [p.v_N], ECX
!@@:
; ------------------------------------------------------------
!MOV ECX, DWORD [p.v_eos]
!Test ECX, ECX ; Check BOOL EndOfStringFound
!JZ .Loop ; Continue Loop if Not EndOfStringFound
!.EndLoop:
; Move your ReturnValue to EAX, here it's the counter
!MOV EAX, [p.v_N] ; ReturnValue to EAX
!.Return:
ASM_POP_MM_0to3(EDX) ; POP non volatile Registers we PUSH'ed at start
; ASM_POP_MM_4to5(EDX) ; POP non volatile Registers we PUSH'ed at start
!EMMS ; Empty MMX Technology State, enables FPU Register use
ProcedureReturn ; EAX
EndMacro
; **************************************************
; x32 Assembler Version Classic
; **************************************************
; Procedure CountChar(*String, cSearch.c, *outLength.Integer=0)
Macro ASMx32_CountChar()
; At x32 optimized classic ASM is a good choice! On modern CPUs like Ryzen
; it is nearly the same speed as MMX-Code. XMM-Code is much slower on all CPUs.
; On older CPUs (back to 2010, AMD and Intel) MMX is faster.
Protected memEBX
; @f = Jump forward to next @@; @b = Jump backward to next @@
; used Registers
; EAX : *String
; EBX : Counter
; ECX : operating Register
; EDX : cSearch
!MOV [p.v_memEBX], EBX ; PUSH EBX
!XOR EBX, EBX ; Counter = 0; must happen before the *String check because .Return copies EBX to EAX
!MOV EAX, [p.p_String] ; load String Address
!TEST EAX, EAX ; If *String = 0
!JZ .Return ; Then End
!SUB EAX, 2 ; Sub 2 to start with Add 2 in the Loop
; ----------------------------------------------------------------------
; Setup start parameter for registers
; ----------------------------------------------------------------------
!XOR ECX, ECX
!XOR EDX, EDX
!MOV DX, WORD[p.v_cSearch] ; cSearch\c
!.Loop:
!ADD EAX, 2 ; *String
!MOV CX, WORD [EAX] ; load Char to CX
!TEST CX, CX ; Test EndOfString
!JZ .EndLoop
; ------------------------------------------------------------
; Start of function individual code!
; ------------------------------------------------------------
; Count number of found Chars
!CMP CX, DX
!JNE @f ; If cSearch\c found
!INC EBX
!@@:
; ------------------------------------------------------------
!JMP .Loop
!.EndLoop:
; optional Return Length
!MOV ECX, [p.p_outLength]
!TEST ECX, ECX
!JZ @f
!SUB EAX, [p.p_String]
!SHR EAX,1
!MOV [ECX], EAX ; *outLength = Len
!@@:
!.Return:
!MOV EAX, EBX ; Return Counter in EAX
!MOV EBX, [p.v_memEBX] ; POP EBX
ProcedureReturn ; counter
EndMacro
; **************************************************
; Purebasic Version with *Pointer-Code
; **************************************************
; Procedure CountChar(*String, cSearch.c, *outLength.Integer=0)
Macro PB_CountChar_ptr()
Protected *pRead.Character = *String
Protected N
If Not *String
ProcedureReturn 0
EndIf
While *pRead\c ; Step through the String
If *pRead\c = cSearch
N + 1
EndIf
*pRead + SizeOf(Character)
Wend
If *outLength ; If Return Length
*outLength\i = (*pRead - *String) / SizeOf(Character)
EndIf
ProcedureReturn N
EndMacro
; **************************************************
; Purebasic Version using PB integrated functions
; **************************************************
; Procedure CountChar(*String, cSearch.c, *outLength.Integer=0)
Macro PB_CountChar_PB()
; especially on Intel CPUs PB's CountString() and Len()
; perform better than individual PB pointer code
Protected N
Protected sStr.String ; String Struct
Protected *ptr.Integer = @sStr ; Pointer to String Struct
If *String
*ptr\i = *String ; Hook *String into String Struct @sStr = *String
N = CountString(sStr\s, Chr(cSearch))
If *outLength ; If Return Length
*outLength\i = Len(sStr\s)
EndIf
*ptr\i = 0 ; Unhook String otherwise PB delete the String
EndIf
ProcedureReturn N
EndMacro
Procedure CountChar(*String, cSearch.c, *outLength.Integer=0)
; ============================================================================
; NAME: CountChar
; DESC: Counts how often the Character cSearch occurs in a String
; DESC: and optionally returns the Length of the String
; DESC:
; VAR(*String) : Pointer to the String
; VAR(cSearch.c) : The Character to count
; VAR(*outLength.Integer): Optional, a Pointer to an Int to receive the Length
; RET.i : Number of Characters found
; ============================================================================
CompilerIf #PB_Compiler_Backend = #PB_Backend_Asm And #PB_Compiler_Unicode
CompilerIf #PB_Compiler_64Bit
; ************************************************************
; x64 Assembler
; ************************************************************
ASMx64_CountChar() ; XMM-Version
CompilerElse ; #PB_Compiler_32Bit
; ************************************************************
; x32 Assembler
; ************************************************************
;ASMx32_CountChar() ; classic x86 Assembler
ASMx32_MMX_CountChar() ; x32 with MMX
CompilerEndIf
CompilerElseIf #PB_Compiler_Backend = #PB_Backend_C And #PB_Compiler_Unicode
CompilerIf #PB_Compiler_64Bit
; ************************************************************
; x64 C
; ************************************************************
PB_CountChar_PB()
CompilerElse ; #PB_Compiler_32Bit
; ************************************************************
; x32 C
; ************************************************************
PB_CountChar_PB()
CompilerEndIf
CompilerElse ; Ascii
; ************************************************************
; Ascii Strings < PB 5.5
; ************************************************************
PB_CountChar_PB()
CompilerEndIf
EndProcedure
;CountChar = @CountChar()
Procedure CountChar8(*String, cSearch.c, *outLength.Integer=0)
; ============================================================================
; NAME: CountChar8
; DESC: Counts how often the Character cSearch occurs in a String
; DESC: x64: 8-Char XMM Version, x32: classic ASM Version
; DESC:
; VAR(*String) : Pointer to the String
; VAR(cSearch.c) : The Character to count
; VAR(*outLength.Integer): Optional, a Pointer to an Int to receive the Length
; RET.i : Number of Characters found
; ============================================================================
CompilerIf #PB_Compiler_Backend = #PB_Backend_Asm And #PB_Compiler_Unicode
CompilerIf #PB_Compiler_64Bit
; ************************************************************
; x64 Assembler
; ************************************************************
ASMx64_8C_CountChar() ; XMM-Version, 8 Chars simultaneously
CompilerElse ; #PB_Compiler_32Bit
; ************************************************************
; x32 Assembler
; ************************************************************
;ASMx32_MMX_CountChar() ; x32 with MMX
ASMx32_CountChar() ; classic x86 Assembler
CompilerEndIf
CompilerElseIf #PB_Compiler_Backend = #PB_Backend_C And #PB_Compiler_Unicode
CompilerIf #PB_Compiler_64Bit
; ************************************************************
; x64 C
; ************************************************************
PB_CountChar_PB()
CompilerElse ; #PB_Compiler_32Bit
; ************************************************************
; x32 C
; ************************************************************
PB_CountChar_PB()
CompilerEndIf
CompilerElse ; Ascii
; ************************************************************
; Ascii Strings < PB 5.5
; ************************************************************
PB_CountChar_PB()
CompilerEndIf
EndProcedure
Procedure CountChar_PB(*String, cSearch.c, *outLength.Integer=0)
PB_CountChar_ptr() ; PureBasic Code with *pRead
EndProcedure
Procedure CountChar_BuildIn_Wrapped(*String, cSearch.c, *outLength.Integer=0)
PB_CountChar_PB() ; PureBasic built-in functions wrapped into a Proc
EndProcedure
EnableExplicit
Define Test$, PB$
Define L, N
CompilerIf #PB_Compiler_32Bit
PB$= "Running in x32 Version!"
CompilerElse
PB$ = "Running in x64 Version!"
CompilerEndIf
Debug PB$
Debug ""
Test$ = "I am a String for testing FastString MMX and SSE functions!"
Test$ + Test$
; Test$ = Test$ + Test$ + Test$ + Test$ + Test$ + Test$
Debug Test$
Debug ""
Debug "PB's Len() = " + Len(Test$)
Debug "PB's CountString('a') = " + CountString(Test$, Chr('a'))
Debug ""
N = CountChar_PB(@Test$, 'a', @L)
Debug "CountChar_PB('a') = " + N
Debug "CountChar_PB Len = " + L
Debug ""
Debug "x64: XMM 4-Char Version / x32 MMX-Version"
N = CountChar(@Test$, 'a', @L)
Debug "Count('a') = " + N
Debug "XMM Len = " + L
Debug ""
Debug "XMM 8-Char Version / x32 ASMx32 Classic"
N = CountChar8(@Test$, 'a', @L)
Debug "Count('a') = " + N
Debug "Len = " + L
CompilerIf Not #PB_Compiler_Debugger
Define I, t1, t2, t3, t4, t5
Define msg$
#Loops = 2000000 ; 2 Mio
; PB's built-in functions
t1 = ElapsedMilliseconds()
For I = 1 To #Loops
L = Len(Test$)
N = CountString(Test$, Chr('a'))
Next
t1 = ElapsedMilliseconds() - t1
; PB's built-in functions wrapped into a Procedure
t2 = ElapsedMilliseconds()
For I = 1 To #Loops
N = CountChar_BuildIn_Wrapped(@Test$, 'a', @L)
Next
t2 = ElapsedMilliseconds() - t2
t3 = ElapsedMilliseconds()
For I = 1 To #Loops
N = CountChar_PB(@Test$, 'a', @L)
Next
t3 = ElapsedMilliseconds() - t3
; XMM 4-Char Version
t4 = ElapsedMilliseconds()
For I = 1 To #Loops
N=CountChar(@Test$, 'a', @L)
Next
t4 = ElapsedMilliseconds() - t4
; XMM 8-Char Version
t5 = ElapsedMilliseconds()
For I = 1 To #Loops
N=CountChar8(@Test$, 'a', @L)
Next
t5 = ElapsedMilliseconds() - t5
msg$ = PB$ + #CRLF$ + #CRLF$ +"PB build in = " + Str(t1) + #CRLF$ + "PB build in wrapped = " +Str(t2) + #CRLF$
msg$ + "PB *Ptr = " + Str(t3)+ #CRLF$ + "x64: XMM 4-Char / x32: MMX = " + Str(t4) + #CRLF$ + "x64: XMM 8-Char / x32 ASMx32 Classic = "+ Str(t5)
SetClipboardText(msg$)
MessageRequester("Timing[ms] CountChar for " + Str(#Loops) + " Loops - Result copied to clipboard!", msg$)
CompilerEndIf
Running in x64 Version!
PB build in = 176
PB build in wrapped = 188
PB *Ptr = 170
x64: XMM 4-Char / x32: MMX = 52
x64: XMM 8-Char / x32 ASMx32 Classic = 45
Running in x32 Version!
PB build in = 234
PB build in wrapped = 243
PB *Ptr = 152
x64: XMM 4-Char / x32: MMX = 115
x64: XMM 8-Char / x32 ASMx32 Classic = 161
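To read the timings above as speedup factors, a tiny helper like the following can be used. This is only a sketch in C; `speedup` and `print_speedup` are hypothetical names, and the numbers fed into it are the milliseconds posted above.

```c
#include <stdio.h>

/* Speedup factor of a candidate over a baseline (both in milliseconds). */
double speedup(double baseline_ms, double candidate_ms) {
    return baseline_ms / candidate_ms;
}

/* Print one line of the comparison, e.g. for the clipboard text of the test. */
void print_speedup(const char *label, double baseline_ms, double candidate_ms) {
    printf("%-28s %.2fx\n", label, speedup(baseline_ms, candidate_ms));
}
```

With the posted values, `speedup(176, 52)` for the first block (PB built-in against the XMM 4-char version) gives a factor of about 3.4, and `speedup(234, 115)` for the x32 block (PB built-in against MMX) gives about 2.0.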