Fast String Functions using MMX

Post by SMaag

This is the start of a series of FastString functions written in MMX/XMM assembler code and optimized for x64!

Below is a general template function that contains all the complicated parts that have to be done in every such function.
The example implements CountChar(). Replace the code in the function-individual section to build your own functions!

Many thanks for the background information!
With this knowledge I updated the code.

A lot of lessons learned:
- On x64, using the XMM registers is faster (AMD & Intel).
- On x32, using the MMX registers is faster (AMD & Intel).
- In classic ASM code there are large speed differences between CPUs, especially between AMD and Intel.
  (On AMD, loading a character with movzx edx and comparing the full register is very fast; on Intel it is slow. Intel prefers mov cx and comparing cx.)
- On x64 a 4-char version is the best choice, on x32 a 2-char version. I also implemented an 8-char version for x64,
  but it needs nearly the same time as the 4-char version.

Code: Select all

; V1.04   2024/07/09  ; added Macros for Register Backup. Changed to Function Macros
; V1.03   2024/02/04  ; added a 2nd version with XMM Register instead of MMX (but 20% slower on Ryzen)
; V1.02   2024/02/04  ; Removed the double Compare Chars=0 and add EMMS instruction to end MMX
; V1.01   2024/02/03  ; fixed Bug in Register Backup

; Caller/callee saved registers
; The Windows x64 ABI considers the registers RAX, RCX, RDX, R8, R9, R10, R11, and XMM0-XMM5 volatile.
; When present, the upper portions of YMM0-YMM15 and ZMM0-ZMM15 are also volatile. On AVX512VL,
; the ZMM, YMM, and XMM registers 16-31 are also volatile. When AMX support is present,
; the TMM tile registers are volatile. Consider volatile registers destroyed on function calls
; unless otherwise safety-provable by analysis such as whole program optimization.
; The Windows x64 ABI considers registers RBX, RBP, RDI, RSI, RSP, R12, R13, R14, R15, and XMM6-XMM15 nonvolatile.
; They must be saved and restored by a function that uses them.

; MMX and SSE Registers
; MM0..MM7    :  MMX : Intel Pentium MMX P55C (Q1 1997) and AMD K6 (Q2 1997)
; XMM0..XMM15 :  SSE : AMD K8 Athlon64 (2003) and Intel 64 CPUs such as Core2 (2006); XMM8..XMM15 exist in 64-bit mode only
; YMM0..YMM15 :  AVX256 : Intel SandyBridge (Q1 2011) and AMD Bulldozer (Q4 2011)
; [X/Y/Z]MM0..[X/Y/Z]MM31 : AVX512 : Intel Skylake-X (2017), Tiger Lake (2020) and AMD Zen4 (2022)

EnableExplicit
; @f = Jump forward to next @@;  @b = Jump backward to previous @@

; ----------------------------------------------------------------------
;  Structures to reserve Space on the Stack for ASM_PUSH, ASM_POP
; ----------------------------------------------------------------------

Structure TStack_16Byte
  R.q[2]  
EndStructure

Structure TStack_32Byte
  R.q[4]  
EndStructure

Structure TStack_48Byte
  R.q[6]  
EndStructure

Structure TStack_64Byte
  R.q[8]  
EndStructure

Structure TStack_96Byte
  R.q[12]  
EndStructure

Structure TStack_128Byte
  R.q[16]  
EndStructure

; separate Macros for EBX/RBX because this is often needed, especially for x32
Macro ASM_PUSH_EBX()
  Protected mEBX
  !MOV [p.v_mEBX], EBX
EndMacro

Macro ASM_POP_EBX(ptrREG)
  !MOV EBX, [p.v_mEBX]
EndMacro

;- ----------------------------------------------------------------------
;- MMX Registers
;- ----------------------------------------------------------------------

; All MMX-Registers are non volatile (shared with the FPU-Registers).
; After the last use of the MMX-Registers an EMMS command must follow to enable
; correct FPU operations again!

Macro ASM_PUSH_MM_0to3(ptrREG)
  Protected M03.TStack_32Byte
  !LEA ptrREG, [p.v_M03]          ; ptrREG = @M03 = Pointer to RegisterBackupStruct
  !MOVQ [ptrREG], MM0
  !MOVQ [ptrREG+8], MM1
  !MOVQ [ptrREG+16], MM2
  !MOVQ [ptrREG+24], MM3
EndMacro

Macro ASM_POP_MM_0to3(ptrREG)
  !LEA ptrREG, [p.v_M03]          ; ptrREG = @M03 = Pointer to RegisterBackupStruct
  !MOVQ MM0, [ptrREG]
  !MOVQ MM1, [ptrREG+8]
  !MOVQ MM2, [ptrREG+16]
  !MOVQ MM3, [ptrREG+24]
EndMacro

Macro ASM_PUSH_MM_4to5(ptrREG)
  Protected M45.TStack_32Byte
  !LEA ptrREG, [p.v_M45]          ; ptrREG = @M45 = Pointer to RegisterBackupStruct
  !MOVQ [ptrREG], MM4
  !MOVQ [ptrREG+8], MM5
EndMacro

Macro ASM_POP_MM_4to5(ptrREG)
  !LEA ptrREG, [p.v_M45]          ; ptrREG = @M45 = Pointer to RegisterBackupStruct
  !MOVQ MM4, [ptrREG]
  !MOVQ MM5, [ptrREG+8]
EndMacro

Macro ASM_PUSH_MM_4to7(ptrREG)
  Protected M47.TStack_32Byte
  !LEA ptrREG, [p.v_M47]          ; ptrREG = @M47 = Pointer to RegisterBackupStruct
  !MOVQ [ptrREG], MM4
  !MOVQ [ptrREG+8], MM5
  !MOVQ [ptrREG+16], MM6
  !MOVQ [ptrREG+24], MM7
EndMacro

Macro ASM_POP_MM_4to7(ptrREG)
  !LEA ptrREG, [p.v_M47]          ; ptrREG = @M47 = Pointer to RegisterBackupStruct
  !MOVQ MM4, [ptrREG]
  !MOVQ MM5, [ptrREG+8]
  !MOVQ MM6, [ptrREG+16]
  !MOVQ MM7, [ptrREG+24]
EndMacro
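
; ----------------------------------------------------------------------
;  Usage sketch, added for illustration: how a PUSH/POP Macro pair is
;  meant to bracket ASM code that changes MMX-Registers.
;  Assumptions: Demo_UseMM47() is a hypothetical Procedure name, and
;  RDX is free to be used as pointer register by the Macros.
; ----------------------------------------------------------------------
CompilerIf #PB_Compiler_Backend = #PB_Backend_Asm And #PB_Compiler_64Bit
  Procedure Demo_UseMM47()
    ASM_PUSH_MM_4to7(RDX)     ; backup MM4..MM7 to the local struct M47
    !PXOR MM4, MM4            ; placeholder for your own code using MM4..MM7
    !PXOR MM7, MM7
    ASM_POP_MM_4to7(RDX)      ; restore MM4..MM7
    !EMMS                     ; end the MMX state, FPU is usable again
  EndProcedure
CompilerEndIf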

;- ----------------------------------------------------------------------
;- XMM Registers
;- ----------------------------------------------------------------------

; Because of the latency of unaligned memory access we use 2x 64-Bit MOV instead of 1x 128-Bit MOV
; MOVDQU [ptrREG], XMM4 -> MOVLPS [ptrREG], XMM4  and  MOVHPS [ptrREG+8], XMM4
; An x64 processor can do 2 64-Bit memory transfers in parallel

;  XMM4:XMM5 are normally volatile and we do not have to preserve them

; ATTENTION: XMM4:XMM5 must be preserved only when __vectorcall is used.
; As far as I know PB doesn't use __vectorcall in the ASM Backend. As long as the code
; runs within a Procedure where __vectorcall isn't used, we don't have to preserve them.
; So we keep the Macros empty. If you want to activate the backup, just activate the commented code below


Macro ASM_PUSH_XMM_4to5(ptrREG) 
EndMacro

Macro ASM_POP_XMM_4to5(ptrREG)
EndMacro

; Macro ASM_PUSH_XMM_4to5(ptrREG)
;   Protected X45.TStack_32Byte
;   !LEA ptrREG, [p.v_X45]          ; ptrREG = @X45 = Pointer to RegisterBackupStruct
;   !MOVLPS [ptrREG], XMM4
;   !MOVHPS [ptrREG+8], XMM4 
;   !MOVLPS [ptrREG+16], XMM5
;   !MOVHPS [ptrREG+24], XMM5
; EndMacro

; Macro ASM_POP_XMM_4to5(ptrREG)
;   !LEA ptrREG, [p.v_X45]          ; ptrREG = @X45 = Pointer to RegisterBackupStruct
;   !MOVLPS XMM4, [ptrREG]
;   !MOVHPS XMM4, [ptrREG+8]  
;   !MOVLPS XMM5, [ptrREG+16]
;   !MOVHPS XMM5, [ptrREG+24]
; EndMacro
; ======================================================================

Macro ASM_PUSH_XMM_6to7(ptrREG)
  Protected X67.TStack_32Byte
  !LEA ptrREG, [p.v_X67]          ; ptrREG = @X67 = Pointer to RegisterBackupStruct
  !MOVLPS [ptrREG], XMM6
  !MOVHPS [ptrREG+8], XMM6 
  !MOVLPS [ptrREG+16], XMM7
  !MOVHPS [ptrREG+24], XMM7
EndMacro

Macro ASM_POP_XMM_6to7(ptrREG)
  !LEA ptrREG, [p.v_X67]          ; ptrREG = @X67 = Pointer to RegisterBackupStruct
  !MOVLPS XMM6, [ptrREG]
  !MOVHPS XMM6, [ptrREG+8]
  !MOVLPS XMM7, [ptrREG+16]       ; the second pair restores XMM7 (it was saved from XMM7)
  !MOVHPS XMM7, [ptrREG+24]
EndMacro

; Fast LOAD/SAVE of XMM-Registers; the MOVDQU command for 128 Bit has a long latency.
; 2 x 64-Bit loads are faster! They are processed in parallel in 1 cycle with low or 0 latency.
; This optimization is taken from the AMD code optimization guide.
Macro ASM_LD_XMMM(REGX, ptrREG)
  !MOVLPS REGX, [ptrREG]
  !MOVHPS REGX, [ptrREG+8]
EndMacro

Macro ASM_SAV_XMMM(REGX, ptrREG)
  !MOVLPS [ptrREG], REGX
  !MOVHPS [ptrREG+8], REGX 
EndMacro

;- --------------------------------------------------
;- CountChar()
;- --------------------------------------------------
  
; **************************************************
; x64 Assembler Version with XMM-Registers
; ************************************************** 
; Procedure CountChar(*String, cSearch.c, *outLength.Integer=0)
Macro ASMx64_CountChar()
  
  ; Used Registers:
  ;   RAX : Pointer *String
  ;   RCX : operating Register and Bool: 1 if NullChar was found
  ;   RDX : operating Register
  ;   R8  : Counter
  ;   R9  : operating Register
  
  ;   XMM0 : the 4 Chars
  ;   XMM1 : cSearch shuffled to all Words
  ;   XMM2 : 0 to search for EndOfString
  ;   XMM3 : the 4 Chars Backup
  
  ; If you use XMM4..XMM7 you have to back them up first
  ;   XMM4 : 
  ;   XMM5 : 
  ;   XMM6 :
  ;   XMM7 :
  
  ; ASM_PUSH_XMM_4to5(RDX)     ; optional PUSH() see PbFw_ASM_Macros.pbi
  
  ; ----------------------------------------------------------------------
  ; Check *String-Pointer and MOV it to RAX as operating register
  ; ----------------------------------------------------------------------
  !MOV RAX, [p.p_String]    ; load String address
  !CMP RAX, 0               ; If *String = 0
  !JE .Return               ; Exit    
  !SUB RAX, 8               ; Sub 8 to start with Add 8 in the Loop     
  ; ----------------------------------------------------------------------
  ; Setup start parameter for registers 
  ; ----------------------------------------------------------------------     
  ; your individual setup parameters
  !MOV DX, [p.v_cSearch]    ; should be DX not RDX because of 1 Word
  !MOVQ XMM1, RDX
  !PSHUFLW XMM1, XMM1, 0    ; Shuffle/Copy Word0 to all Words 
  
  ; here are the standard setup parameters
  !XOR RCX, RCX             ; operating Register and BOOL for EndOfStringFound
  !XOR R8, R8               ; Counter = 0
  !PXOR XMM2, XMM2          ; XMM2 = 0 ; Mask to search for NullChar = EndOfString         
  ; ----------------------------------------------------------------------
  ; Main Loop
  ; ----------------------------------------------------------------------     
  !.Loop:
    !ADD RAX, 8                     ; *String + 8 => NextChars    
    !MOVQ XMM0, [RAX]               ; load 4 Chars to XMM0  
    !MOVQ XMM3, [RAX]               ; load 4 Chars to XMM3
    !PCMPEQW XMM0, XMM2             ; Compare with 0
    !MOVQ RDX, XMM0                 ; RDX CompareResult contains FFFF for each NullChar 
    !TEST RDX, RDX                  ; If 0 : No NullChar found
    !JZ .EndIf                   ; JumpIfZero => JumpToEndif if Not EndOfString  
    ; If EndOfStringFound  
      ; Calculate the byte position of EndOfString [0..6] using BitScan
      !BSF RDX, RDX                 ; BitScanForward => No of the LSB   
      !SHR RDX, 3                   ; BitNo to ByteNo
      !ADD RAX, RDX                 ; Actual StringPointer + OffsetOf_NullChar
      !MOV RCX, RDX                 ; Save ByteOffset of NullChar in RCX
      !SUB RAX, [p.p_String]        ; RAX *EndOfString - *String
      !SHR RAX, 1                   ; NoOfBytes to NoOfWord => Len(String)
      ; check for Return of Length and move it to *outLength 
      !MOV RDX, [p.p_outLength]
      !CMP RDX, 0
      !JE @f                        ; If *outLength
        !MOV [RDX], RAX             ;   *outLength = Len()
      !@@:                          ; Endif
    
      ; If a NullChar was found then create a Bitmask for setting all Chars after the NullChar to 00h 
      ; RCX holds the backup of the ByteOffset of the NullChar
      !CMP RCX, 6                   ; If NullChar is the last Char (Byte[7,6] = Word[3])
      !JGE @f                       ;  => we don't have to eliminate chars from testing
        ; If WordPos(EndOfString) <> 3  ; Word3 if EndOfString is in Bit 48..63
        !SHL RCX, 3                   ; ByteNo to BitNo
        !NEG RCX                      ; RCX = -LSB 
        !ADD RCX, 63                  ; RCX = (63-LSB)
        !XOR RDX, RDX                 ; RDX = 0
        !BTS RDX, 63                  ; set Bit 63 => RDX = 8000000000000000h
        !SAR RDX, CL                  ; Do an arithmetic Shift Right (63-LSB) : EndOfString=Word2 => Mask $FFFF.FFFF.0000.0000, Word1 $FFFF.FFFF.FFFF.0000
        !NOT RDX                      ; Now invert our Mask so we get a Mask to filter out all Chars after EndOfString $0000.0000.FFFF.FFFF or $0000.0000.0000.FFFF
        !MOVQ XMM0, RDX               ; Now move this Mask to XMM0, the operating Register
        !PAND XMM3, XMM0              ; XMM3 the CharBackup AND Mask => we select only Chars up to EndOfString 
      !@@:
      
      !MOV RCX, 1                     ; BOOL EndOfStringFound = #TRUE
    !.EndIf:                     ; Endif ; EndOfStringFound    
    
    ; ------------------------------------------------------------
    ; Start of function individual code! Do not use RCX here!
    ; ------------------------------------------------------------
    ; Count number of found Chars
    !MOVQ XMM0, XMM3              ; Load the 4 Chars to operating Register
    !PCMPEQW XMM0, XMM1           ; Compare the 4 Chars with cSearch
    !MOVQ RDX, XMM0               ; CompareResult to RDX
    !TEST RDX, RDX
    !JZ @f                        ; Jump to Endif if cSearch not found
      !POPCNT RDX, RDX            ; Count number of set Bits (16 for each found Char)
      !SHR RDX, 4                 ; NoOfBits [0..64] to NoOfWords [0..4]
      !ADD R8, RDX                ; ADD NoOfFoundChars to Counter R8
    !@@: 
    ; ------------------------------------------------------------
    
    !TEST RCX, RCX                ; Check BOOL EndOfStringFound      
    !JZ .Loop                  ; Continue Loop if Not EndOfStringFound
  !.EndLoop:
  
  ; ----------------------------------------------------------------------
  ; Handle Return value and POP-Registers
  ; ----------------------------------------------------------------------     
  !MOV RAX, R8      ; ReturnValue to RAX
  !.Return:
  
  ; ASM_POP_XMM_4to5(RDX)     ; POP non volatile Registers we PUSH'ed at start
  
  ProcedureReturn   ; RAX
  
EndMacro
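
; ----------------------------------------------------------------------
;  Worked example, added for illustration: the tail mask that the 4-Char
;  Macro above builds with BTS/SAR/NOT, reproduced in plain PureBasic.
;  Assumptions: TailMask_Sketch() is a hypothetical helper that is not
;  used by CountChar(); ByteOffset is the offset of the NullChar (0, 2, 4).
; ----------------------------------------------------------------------
Procedure.q TailMask_Sketch(ByteOffset)
  Protected mask.q = $8000000000000000      ; like BTS RDX, 63
  mask = mask >> (63 - ByteOffset * 8)      ; like SAR RDX, CL (PB's >> keeps the sign bit)
  ProcedureReturn ~mask                     ; like NOT RDX
EndProcedure
; ByteOffset 0 (NullChar is Word0) => $0000000000000000 : no Char is kept
; ByteOffset 2 (NullChar is Word1) => $000000000000FFFF : Word0 is kept
; ByteOffset 4 (NullChar is Word2) => $00000000FFFFFFFF : Word0..Word1 are kept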

; **************************************************
; x64 Assembler 8-Char Version with XMM-Registers
; ************************************************** 

; ATTENTION! BUG! Not working correctly! 

; The 8-Char Version exists only to show the speed difference between
; the 4 and 8 Char Version.
; It still has a Bug in counting (maybe a problem of the Shuffles).
; As I expected: the 8 Char version doesn't make sense.
; It's not faster than the 4 Char version but more complicated.
; Procedure CountChar(*String, cSearch.c, *outLength.Integer=0)
Macro ASMx64_8C_CountChar()
  ; 8 Char-version
  ; Used Registers:
  ;   RAX : Pointer *String
  ;   RCX : operating Register and Bool: 1 if NullChar was found
  ;   RDX : operating Register
  ;   R8  : Counter
  ;   R9  : operating Register
  ;   R10 :
  ;   R11 :
  
  ;   XMM0 : the 8 Chars
  ;   XMM1 : operating Register
  ;   XMM2 : 0 to search for EndOfString
  ;   XMM3 : the 8 Chars Backup
  
  ; If you use XMM4..XMM7 you have to back them up first
  ;   XMM4 : cSearch shuffled to all Words
  ;   XMM5 : operating Register
  ;   XMM6 :
  ;   XMM7 :
  
  ASM_PUSH_XMM_4to5(RDX)     ; optional PUSH() see PbFw_ASM_Macros.pbi
  
  ; ----------------------------------------------------------------------
  ; Check *String-Pointer and MOV it to RAX as operating register
  ; ----------------------------------------------------------------------
  !MOV RAX, [p.p_String]    ; load String address
  !CMP RAX, 0               ; If *String = 0
  !JE .Return            ; Exit    
  !SUB RAX, 16              ; Sub 16 to start with Add 16 in the Loop     
  ; ----------------------------------------------------------------------
  ; Setup start parameter for registers 
  ; ----------------------------------------------------------------------     
  ; your individual setup parameters
  !MOV DX, [p.v_cSearch]    ; should be DX not RDX because of 1 Word
  !MOVQ XMM4, RDX
  !PSHUFLW XMM4, XMM4, 0    ; Shuffle/Copy Word0 to all Words of the Lo-Qword
  !PSHUFD XMM4, XMM4, 01000100b ; Copy Lo-Qword to Hi-Qword
  
  ; here are the standard setup parameters
  !XOR RCX, RCX             ; operating Register and BOOL for EndOfStringFound
  !XOR R8, R8               ; Counter = 0
  !PXOR XMM2, XMM2          ; XMM2 = 0 ; Mask to search for NullChar = EndOfString         
  ; ----------------------------------------------------------------------
  ; Main Loop
  ; ----------------------------------------------------------------------     
  !.Loop:
    !ADD RAX, 16                    ; *String + 16 => NextChars    
    ;!MOVDQU XMM0, [RAX]
    ASM_LD_XMMM(XMM0, RAX)          ; optimized load 8 Chars to XMM0
    !MOVDQA XMM3, XMM0              ; copy 8 Chars to XMM3
    !PCMPEQW XMM0, XMM2             ; Compare with 0
    !PSHUFD XMM5, XMM0, 01001110b   ; Switch Hi/Lo QWord of XMM0 to XMM5
    
    !MOVQ RDX, XMM0                 ; RDX CompareResult contains FFFF for each NullChar
    !MOVQ RCX, XMM5
    !MOV R9, RCX                    ; R9 = CompareResult Hi-Qword
    !OR R9, RDX                     ; combine with CompareResult Lo-Qword, sets the ZeroFlag
    !JZ .EndIf                      ; JumpIfZero => JumpToEndif if Not EndOfString  
    ; If EndOfStringFound  
      ; Calculate the byte position of EndOfString [0..14] using BitScan
      ; RDX = CompareResult Lo-Qword, RCX = CompareResult Hi-Qword
      !TEST RDX, RDX
      !JZ .NullInHi                 ; Lo-Qword = 0 => NullChar is in the Hi-Qword
        !BSF RCX, RDX               ; BitScanForward => No of the LSB in the Lo-Qword
        !SHR RCX, 3                 ; BitNo to ByteNo [0..6]
        !JMP .OffsetDone
      !.NullInHi:
        !BSF RCX, RCX               ; BitScanForward in the Hi-Qword (it is <> 0 here)
        !SHR RCX, 3                 ; BitNo to ByteNo
        !ADD RCX, 8                 ; add the 8 Bytes of the Lo-Qword => [8..14]
      !.OffsetDone:
      !ADD RAX, RCX                 ; Actual StringPointer + OffsetOf_NullChar
      !SUB RAX, [p.p_String]        ; RAX *EndOfString - *String
      !SHR RAX, 1                   ; NoOfBytes to NoOfWord => Len(String)
      ; check for Return of Length and move it to *outLength 
      !MOV RDX, [p.p_outLength]
      !TEST RDX, RDX
      !JZ @f                        ; If *outLength
        !MOV [RDX], RAX             ;   *outLength = Len()
      !@@:                          ; Endif
      
      ; Create a 128 Bit Mask (R9:RDX) which keeps only the Chars before the NullChar
      !CMP RCX, 8                   ; ByteNo >= 8  : NullChar in Hi-Qword
      !JGE .Hi                          
        ; NullChar in Lo-Qword      ; If ByteNo < 8 
        !SHL RCX, 3                 ; ByteNo to BitNo [0,16,32,48]
        !MOV RDX, 1
        !SHL RDX, CL
        !DEC RDX                    ; Lo-Mask = (1 << BitNo) - 1
        !XOR R9, R9                 ; Hi-Mask = 0 : the complete Hi-Qword is behind the NullChar
        !JMP .Hi_Endif  
      !.Hi:                         ; NullChar in Hi-Qword 
        !SUB RCX, 8
        !SHL RCX, 3                 ; ByteNo to BitNo [0,16,32,48]
        !MOV R9, 1
        !SHL R9, CL
        !DEC R9                     ; Hi-Mask = (1 << BitNo) - 1
        !MOV RDX, -1                ; Lo-Mask = all Bits set : keep the complete Lo-Qword
      !.Hi_Endif:
      !MOVQ XMM0, RDX               ; Lo-Mask to XMM0
      !MOVQ XMM5, R9                ; Hi-Mask to XMM5
      !PUNPCKLQDQ XMM0, XMM5        ; XMM0 = Hi-Mask : Lo-Mask
      
      !PAND XMM3, XMM0              ; XMM3 the CharBackup AND Mask => we select only Chars up to EndOfString          
      !MOV RCX, 1                   ; BOOL EndOfStringFound = #TRUE
    !.EndIf:                      ; Endif ; EndOfStringFound    
    
    ; ------------------------------------------------------------
    ; Start of function individual code! Do not use RCX here!
    ; ------------------------------------------------------------
    ; Count number of found Chars
    !MOVDQA XMM0, XMM3            ; Load all 8 Chars to operating Register (MOVQ would copy the Lo-Qword only)
    !PCMPEQW XMM0, XMM4           ; Compare the 8 Chars with cSearch
    !MOVQ RDX, XMM0               ; CompareResult of the Lo-Qword to RDX
    !TEST RDX, RDX
    !JZ @f                        ; Jump to Endif if cSearch not found
      !POPCNT RDX, RDX            ; Count number of set Bits (16 for each found Char)
      !SHR RDX, 4                 ; NoOfBits [0..64] to NoOfWords [0..4]
      !ADD R8, RDX                ; ADD NoOfFoundChars to Counter R8
    !@@: 
    
    !PSRLDQ XMM0, 8               ; Byte-Shift right by 8 : moves the Hi-Qword to the Lo-Qword
    !MOVQ RDX, XMM0               ; CompareResult of the Hi-Qword to RDX
    !TEST RDX, RDX
    !JZ @f                        ; Jump to Endif if cSearch not found
      !POPCNT RDX, RDX            ; Count number of set Bits (16 for each found Char)
      !SHR RDX, 4                 ; NoOfBits [0..64] to NoOfWords [0..4]
      !ADD R8, RDX                ; ADD NoOfFoundChars to Counter R8
    !@@: 
    ; ------------------------------------------------------------
    
    !TEST RCX, RCX                ; Check BOOL EndOfStringFound      
    !JZ .Loop                  ; Continue Loop if Not EndOfStringFound
  !.EndLoop:
  
  ; ----------------------------------------------------------------------
  ; Handle Return value and POP-Registers
  ; ----------------------------------------------------------------------     
  !MOV RAX, R8      ; ReturnValue to RAX
  !.Return:
  
  ASM_POP_XMM_4to5(RDX)     ; POP non volatile Registers we PUSH'ed at start
  
  ProcedureReturn   ; RAX
  
EndMacro

; **************************************************
; x32 Assembler Version with MMX-Registers
; **************************************************
; Procedure CountChar(*String, cSearch.c, *outLength.Integer=0)
Macro ASMx32_MMX_CountChar()
  Protected N
  Protected eos       ; Bool EndOfString
  
  ; Used Registers:
  ;   EAX : Pointer *String
  ;   ECX : operating Register and Bool: 1 if NullChar was found
  ;   EDX : operating Register        
  
  ;   MM0 : the 2 Chars
  ;   MM1 : cSearch shuffled to all Words
  ;   MM2 : 0 to search for EndOfString
  ;   MM3 : the 2 Chars Backup
  ;   MM4 : 
  ;   MM5 :
  ;   MM6 :
  ;   MM7 :
    
  ASM_PUSH_MM_0to3(EDX)     ; PUSH nonvolatile MMX-Registers
  ; ASM_PUSH_MM_4to5(EDX)     ; PUSH nonvolatile MMX-Registers
      
  ; ----------------------------------------------------------------------
  ; Check *String-Pointer and MOV it to EAX as operating register
  ; ----------------------------------------------------------------------
  !MOV EAX, [p.p_String]    ; load String address
  !CMP EAX, 0               ; If *String = 0
  !JE .Return               ; Exit    
  !SUB EAX, 4               ; Sub 4 to start with Add 4 in the Loop     
  
  ; ----------------------------------------------------------------------
  ; Setup start parameter for registers 
  ; ----------------------------------------------------------------------     
  ; your individual setup parameters
  !MOV DX, [p.v_cSearch]    ; should be DX not EDX because of 1 Word
  !MOVD MM1, EDX
  !PSHUFW MM1, MM1, 0      ; Shuffle/Copy Word0 to all Words 
 
  ; here are the standard setup parameters
  !XOR ECX, ECX             ; operating Register and BOOL for EndOfStringFound
  !PXOR MM2, MM2            ; MM2 = 0 ; Mask to search for NullChar = EndOfString         
  ; ----------------------------------------------------------------------
  ; Main Loop
  ; ----------------------------------------------------------------------     
  !.Loop:
    !ADD EAX, 4                     ; *String + 4 => NextChars    
    !MOVD MM0, [EAX]                ; load 2 Chars to MM0  
    !MOVD MM3, [EAX]                ; load 2 Chars to MM3
    !PCMPEQW MM0, MM2               ; Compare with 0
    !MOVD EDX, MM0                  ; EDX CompareResult contains FFFF for each NullChar 
    !CMP EDX, 0                     ; If 0 : No NullChar found
    !JE .EndIf                      ; JumpIfEqual 0 => JumpToEndif if Not EndOfString  
    ; If EndOfStringFound  
      ; Calculate the byte position of EndOfString [0 or 2] using BitScan
      !BSF EDX, EDX                 ; BitScanForward => No of the LSB   
      !SHR EDX, 3                   ; BitNo to ByteNo
      !ADD EAX, EDX                 ; Actual StringPointer + OffsetOf_NullChar
      !SUB EAX, [p.p_String]        ; EAX *EndOfString - *String
      !SHR EAX, 1                   ; NoOfBytes to NoOfWord => Len(String)
      ; check for Return of Length and move it to *outLength 
      !MOV EDX, [p.p_outLength]
      !CMP EDX, 0
      !JE @f                        ; If *outLength
        !MOV [EDX], EAX             ;   *outLength = Len()
      !@@:                          ; Endif
    
      ; If a NullChar was found then create a Bitmask for setting all Chars after the NullChar to 00h 
      !MOVD ECX, MM0                ; Load compare Result of 2xChars=0 to ECX
      !BSF ECX, ECX                 ; Find No of LSB: 0 or 16 (ECX is <> 0 here)
      !CMP ECX, 16                  ; If LSB >= 16 the EndOfString is the last of the 2 Chars
      !JGE @f                       ;  => we don't have to eliminate chars from testing
      ; If WordPos(EndOfString) <> 1  ; Word1 if EndOfString is in Bit 16..31
        !NEG ECX                      ; ECX = -LSB 
        !ADD ECX, 31                  ; ECX = (31-LSB)
        !XOR EDX, EDX                 ; EDX = 0
        !BTS EDX, 31                  ; set Bit 31 => EDX = 80000000h
        !SAR EDX, CL                  ; Do an arithmetic Shift Right (31-LSB) : EndOfString=Word0 => Mask $FFFF.FFFF
        !NOT EDX                      ; Now invert our Mask so we get a Mask to filter out all Chars after EndOfString: $0000.0000
        !MOVD MM0, EDX                ; Now move this Mask to MM0, the operating Register
        !PAND MM3, MM0                ; MM3 the CharBackup AND Mask => we select only Chars up to EndOfString 
      !@@:
                
      ;!MOV ECX, 1                   ; BOOL EndOfStringFound = #TRUE
      !MOV [p.v_eos], DWORD 1        ; at x32 we need ECX, so Bool EndOfstring in Var 
    !.EndIf:                      ; Endif ; EndOfStringFound    
    
    ; ------------------------------------------------------------
    ; Start of function individual code! You can use ECX here!
    ; ------------------------------------------------------------
    ; Count number of found Chars
    !MOVQ MM0, MM3                ; Load the 2 Chars to operating Register
    !PCMPEQW MM0, MM1             ; Compare the 2 Chars with cSearch
    !MOVD EDX, MM0                ; CompareResult to EDX
    !CMP EDX, 0
    !JE @f                        ; Jump to Endif if cSearch not found
      !POPCNT EDX, EDX            ; Count number of set Bits (16 for each found Char)
      !SHR EDX, 4                 ; NoOfBits [0..32] to NoOfWords [0..2]
      !MOV ECX, DWORD [p.v_N]
      !ADD ECX, EDX               ; ADD NoOfFoundChars to Counter 
      !MOV [p.v_N], ECX
    !@@:
    ; ------------------------------------------------------------
              
    !MOV ECX, DWORD [p.v_eos]
    !Test ECX, ECX                ; Check BOOL EndOfStringFound      
    !JZ .Loop                   ; Continue Loop if Not EndOfStringFound
  !.EndLoop:     
        
  ;  Move your ReturnValue to EAX, here it's the counter
  !MOV EAX, [p.v_N]         ; ReturnValue to EAX
  !.Return:
  
  ASM_POP_MM_0to3(EDX)      ; POP non volatile Registers we PUSH'ed at start
  ; ASM_POP_MM_4to5(EDX)      ; POP non volatile Registers we PUSH'ed at start
  !EMMS                     ; Empty MMX Technology State, enables FPU Register use
  ProcedureReturn   ; EAX   
EndMacro

; **************************************************
; x32 Assembler Version Classic 
; **************************************************
; Procedure CountChar(*String, cSearch.c, *outLength.Integer=0)
Macro ASMx32_CountChar()     

  ; At x32 optimized classic ASM is a good choice! On modern CPUs like Ryzen
  ; it is nearly the same speed as MMX-Code. XMM-Code is much slower on all CPUs.
  ; On older CPUs back to 2010, AMD and Intel, MMX is faster.
  Protected memEBX
  ; @f = Jump forward to next @@;  @b = Jump backward to previous @@  
  
  ; used Registers
  ;   EAX : *String
  ;   EBX : Counter
  ;   ECX : operating Register
  ;   EDX : cSearch
  
  !MOV [p.v_memEBX], EBX      ; PUSH EBX
  !XOR EBX, EBX               ; Counter = 0 (must be cleared before the first jump to .Return)
  !MOV EAX, [p.p_String]      ; load String address
  !TEST EAX, EAX              ; If *String = 0
  !JZ .Return                 ;   Then End
  !SUB EAX, 2                 ; Sub 2 to start with Add 2 in the Loop
  
  ; ----------------------------------------------------------------------
  ; Setup start parameter for registers 
  ; ----------------------------------------------------------------------     
  !XOR ECX, ECX
  !XOR EDX, EDX
  !MOV DX, WORD [p.v_cSearch]  ; cSearch
  
  !.Loop:
    !ADD EAX, 2            ; *String
    !MOV CX, WORD [EAX]    ; load Char to CX   
    !TEST CX, CX           ; Test EndOfString
    !JZ .EndLoop
    
    ; ------------------------------------------------------------
    ; Start of function individual code!
    ; ------------------------------------------------------------
    ; Count number of found Chars 
    !CMP CX, DX 
    !JNE @f                 ; If cSearch\c found
       !INC EBX
    !@@:
    ; ------------------------------------------------------------
       
    !JMP .Loop
  !.EndLoop:
  
  ; optional Return Length
  !MOV ECX, [p.p_outLength]
  !TEST ECX, ECX
  !JZ @f
    !SUB EAX,  [p.p_String]
    !SHR EAX,1
    !MOV [ECX], EAX       ; *outLength = Len
  !@@:
  
  !.Return:
  !MOV EAX, EBX           ; Return Counter in EAX         
  !MOV EBX, [p.v_memEBX]  ; POP EBX
  ProcedureReturn   ; counter
EndMacro

; **************************************************
; Purebasic Version with *Pointer-Code
; **************************************************

; Procedure CountChar(*String, cSearch.c, *outLength.Integer=0)
Macro PB_CountChar_ptr()
  Protected *pRead.Character = *String
  Protected N
    
  If Not *String
    ProcedureReturn 0
  EndIf
  
  While *pRead\c    ; Step through the String
    If *pRead\c = cSearch
      N + 1
    EndIf
    *pRead + SizeOf(Character)
  Wend
 
  If *outLength       ; If Return Length
    *outLength\i = (*pRead - *String) / SizeOf(Character)
  EndIf
  ProcedureReturn N
EndMacro

; **************************************************
; Purebasic Version using PB integrated functions
; **************************************************

; Procedure CountChar(*String, cSearch.c, *outLength.Integer=0)
Macro PB_CountChar_PB()
  ; especially on Intel CPUs PB's CountString() and Len()
  ; perform better than individual PB pointer code
  Protected N
  Protected sStr.String           ; String Struct
  Protected *ptr.Integer = @sStr  ; Pointer to String Struct
    
  If *String
    *ptr\i = *String          ; Hook *String into String Struct @sStr = *String     
    N = CountString(sStr\s, Chr(cSearch))   
    If *outLength             ; If Return Length
      *outLength\i = Len(sStr\s)
    EndIf  
    *ptr\i = 0                ; Unhook the String, otherwise PB frees the String memory at the end of the Procedure 
  EndIf 
  ProcedureReturn N  
EndMacro
 
Procedure CountChar(*String, cSearch.c, *outLength.Integer=0)
; ============================================================================
; NAME: CountChar
; DESC: Counts Characters in a String
; DESC: This example is for CountChar
; DESC: 
; VAR(*String) : Pointer to the String
; VAR(cSearch.c) : the Character to count
; VAR(*outLength.Integer): Optional a Pointer to an Int to receive the Length
; RET.i : Number of Characters found
; ============================================================================

  CompilerIf #PB_Compiler_Backend = #PB_Backend_Asm And #PB_Compiler_Unicode
    
    CompilerIf #PB_Compiler_64Bit
    ; ************************************************************
    ; x64 Assembler
    ; ************************************************************
      ASMx64_CountChar()        ; XMM-Version
      
    CompilerElse ; #PB_Compiler_32Bit
    ; ************************************************************
    ; x32 Assembler
    ; ************************************************************
      ;ASMx32_CountChar()       ; classic x86 Assembler
      ASMx32_MMX_CountChar()    ; x32 with MMX
    CompilerEndIf 
     
  CompilerElseIf #PB_Compiler_Backend = #PB_Backend_C And #PB_Compiler_Unicode
     
    CompilerIf #PB_Compiler_64Bit
    ; ************************************************************
    ; x64 C
    ; ************************************************************
      PB_CountChar_PB()
      
    CompilerElse ; #PB_Compiler_32Bit
    ; ************************************************************
    ; x32 C
    ; ************************************************************
      PB_CountChar_PB()
      
    CompilerEndIf
    
  CompilerElse ; Ascii
  ; ************************************************************
  ; Ascii Strings < PB 5.5
  ; ************************************************************
    PB_CountChar_PB()   
    
  CompilerEndIf 
    
EndProcedure
;CountChar = @CountChar()

Procedure CountChar8(*String, cSearch.c, *outLength.Integer=0)
; ============================================================================
; NAME: CountChar8
; DESC: Counts Characters in a String
; DESC: x64: 8-Char XMM version; x32: classic x86 Assembler
; DESC: 
; VAR(*String) : Pointer to the String
; VAR(cSearch.c) : Character to count
; VAR(*outLength.Integer): Optional pointer to an Integer that receives the Length
; RET.i : Number of Characters found
; ============================================================================

  CompilerIf #PB_Compiler_Backend = #PB_Backend_Asm And #PB_Compiler_Unicode
    
    CompilerIf #PB_Compiler_64Bit
    ; ************************************************************
    ; x64 Assembler
    ; ************************************************************
      ASMx64_8C_CountChar()        ; XMM-Version, 8 Chars at once
      
    CompilerElse ; #PB_Compiler_32Bit
    ; ************************************************************
    ; x32 Assembler
    ; ************************************************************
      ASMx32_CountChar()       ; classic x86 Assembler
    CompilerEndIf 
     
  CompilerElseIf #PB_Compiler_Backend = #PB_Backend_C And #PB_Compiler_Unicode
     
    CompilerIf #PB_Compiler_64Bit
    ; ************************************************************
    ; x64 C
    ; ************************************************************
      PB_CountChar_PB()
      
    CompilerElse ; #PB_Compiler_32Bit
    ; ************************************************************
    ; x32 C
    ; ************************************************************
      PB_CountChar_PB()
      
    CompilerEndIf
    
  CompilerElse ; Ascii
  ; ************************************************************
  ; Ascii Strings < PB 5.5
  ; ************************************************************
    PB_CountChar_PB()   
    
  CompilerEndIf 
    
EndProcedure

Procedure CountChar_PB(*String, cSearch.c, *outLength.Integer=0)
  PB_CountChar_ptr()    ; PureBasic Code with *pRead
EndProcedure

Procedure CountChar_BuildIn_Wrapped(*String, cSearch.c, *outLength.Integer=0)
  PB_CountChar_PB()   ; PureBasic built-in functions wrapped in a Procedure
EndProcedure

EnableExplicit
  
Define Test$, PB$
Define L, N

CompilerIf #PB_Compiler_32Bit
  PB$= "Running in x32 Version!"  
CompilerElse
  PB$ = "Running in x64 Version!"  
CompilerEndIf
Debug PB$
Debug ""

Test$ = "I am a String for testing FastString MMX and SSE functions!"
Test$ + Test$
; Test$ = Test$ + Test$ + Test$ + Test$ + Test$ + Test$
Debug Test$
Debug ""
Debug "PB's Len() = " + Len(Test$)
Debug "PB's CountString('a') = " + CountString(Test$, Chr('a'))
Debug ""

N = CountChar_PB(@Test$, 'a', @L)
Debug "CountChar_PB('a') = " + N
Debug "CountChar_PB Len = " + L

Debug ""
Debug "x64: XMM 4-Char Version / x32 MMX-Version"
N = CountChar(@Test$, 'a', @L)
Debug "Count('a') = " + N
Debug "XMM Len = " + L

Debug ""
Debug "XMM 8-Char Version / x32 ASMx32 Classic"
N = CountChar8(@Test$, 'a', @L)
Debug "Count('a') = " + N
Debug "Len = " + L


CompilerIf Not #PB_Compiler_Debugger
  Define I, t1, t2, t3, t4, t5
  Define msg$
  
  #Loops = 2000000   ; 2 Mio
  
  ; PB's built-in functions
  t1 = ElapsedMilliseconds()
  For I = 1 To #Loops
    L = Len(Test$)
    N = CountString(Test$, Chr('a'))
  Next
  t1 = ElapsedMilliseconds() - t1
   
  ; PB's built-in functions wrapped in a Procedure
  t2 = ElapsedMilliseconds()
  For I = 1 To #Loops
    N = CountChar_BuildIn_Wrapped(@Test$, 'a', @L)
  Next
  t2 = ElapsedMilliseconds() - t2
  
 
  ; PureBasic *Pointer code
  t3 = ElapsedMilliseconds()
  For I = 1 To #Loops
    N = CountChar_PB(@Test$, 'a', @L)
  Next
  t3 = ElapsedMilliseconds() - t3

  ; XMM 4-Char Version
  t4 = ElapsedMilliseconds()
  For I = 1 To #Loops
    N=CountChar(@Test$, 'a', @L)
  Next
  t4 = ElapsedMilliseconds() - t4
  
  
  ; XMM 8-Char Version
  t5 = ElapsedMilliseconds()
  For I = 1 To #Loops
    N=CountChar8(@Test$, 'a', @L)
  Next
  t5 = ElapsedMilliseconds() - t5
  
  msg$ = PB$ +  #CRLF$ + #CRLF$ +"PB build in = " + Str(t1) + #CRLF$ + "PB build in wrapped = " +Str(t2) + #CRLF$ 
  msg$   + "PB *Ptr = " + Str(t3)+ #CRLF$ + "x64: XMM 4-Char / x32: MMX = " + Str(t4) + #CRLF$ + "x64: XMM 8-Char / x32 ASMx32 Classic  = "+ Str(t5)
  SetClipboardText(msg$)
  
  MessageRequester("Timing[ms] CountChar for " + Str(#Loops) + " Loops - Result copied to clipboard!", msg$)
  
CompilerEndIf
 
Running in x64 Version!
PB build in = 176
PB build in wrapped = 188
PB *Ptr = 170
x64: XMM 4-Char / x32: MMX = 52
x64: XMM 8-Char / x32 ASMx32 Classic = 45

Running in x32 Version!
PB build in = 234
PB build in wrapped = 243
PB *Ptr = 152
x64: XMM 4-Char / x32: MMX = 115
x64: XMM 8-Char / x32 ASMx32 Classic = 161
Last edited by SMaag on Fri Feb 09, 2024 8:58 pm, edited 4 times in total.
ChrisR
Addict
Addict
Posts: 1466
Joined: Sun Jan 08, 2017 10:27 pm
Location: France

Re: Fast String Functions using MMX

Post by ChrisR »

Removed, my results were, by mistake, with the x86 compiler.
Last edited by ChrisR on Sat Feb 03, 2024 6:24 pm, edited 1 time in total.
fryquez
Enthusiast
Enthusiast
Posts: 391
Joined: Mon Dec 21, 2015 8:12 pm

Re: Fast String Functions using MMX

Post by fryquez »

It's about 3 times faster here :)

MMX CountChar = 67
PB's Len() + CountString() = 204
PB-Code CountChar = 203
ChrisR
Addict
Addict
Posts: 1466
Joined: Sun Jan 08, 2017 10:27 pm
Location: France

Re: Fast String Functions using MMX

Post by ChrisR »

I hadn't been paying attention: I had pasted the code into a tab that was previously opened with the x86 compiler selected.
With the x64 compiler the times compare better:
MMX CountChar = 69
PB's Len() + CountString() = 325
PB-Code CountChar = 345
wilbert
PureBasic Expert
PureBasic Expert
Posts: 3942
Joined: Sun Aug 08, 2004 5:21 am
Location: Netherlands

Re: Fast String Functions using MMX

Post by wilbert »

SMaag wrote: Sat Feb 03, 2024 12:31 pm I used only standard MMX commands existing since the late '90s.
As far as I understand you have to use the EMMS instruction at the end of your procedure if you use MMX registers.
That's one of the reasons I prefer the XMM registers from the SSE standard.
Windows (x64)
Raspberry Pi OS (Arm64)
SMaag
Enthusiast
Enthusiast
Posts: 325
Joined: Sat Jan 14, 2023 6:55 pm
Location: Bavaria/Germany

Re: Fast String Functions using MMX

Post by SMaag »

wilbert wrote: Sun Feb 04, 2024 9:01 am As far as I understand you have to use the EMMS instruction at the end of your procedure if you use MMX registers.
That's one of the reasons I prefer the XMM registers from the SSE standard.
OK! I didn't know this instruction until now:
The EMMS instruction must be used to clear the MMX technology state at the end of all MMX technology procedures or subroutines and before calling other procedures or subroutines that may execute x87 floating-point instructions!

SSE is an extra set of eight 128-bit registers, separate from those in the FPU!
Until now I thought MM0 was the low part of XMM0. But that's not the case! The XMM registers are separate registers.

So I have to add the EMMS instruction! Thanks very much!

Using XMM registers would be better! But this is not so easy because, as far as I know, there is no direct MOV between the 64-bit CPU registers and the 128-bit XMM registers. (Update: I found out it is possible with MOVD in x32 and MOVQ in x64.)

I tried PCMPISTRI from the SSE4.2 instruction set for the string functions. It works, but it reads 16-byte blocks from the string. With 8-byte-aligned PB strings this reads past the end of the string. In my tests it never crashed, and I guess it never will, because the OS memory pages are 256 bytes. So normally we will never read into the memory of another process. But I'm not 100% sure! Maybe it will crash in the future! Do you know anything about this issue?
wilbert
PureBasic Expert
PureBasic Expert
Posts: 3942
Joined: Sun Aug 08, 2004 5:21 am
Location: Netherlands

Re: Fast String Functions using MMX

Post by wilbert »

SMaag wrote: Sun Feb 04, 2024 7:27 pm Until now I thought MM0 was the low part of XMM0. But that's not the case! The XMM registers are separate registers.

So I have to add the EMMS instruction! Thanks very much!

Using XMM registers would be better! But this is not so easy because, as far as I know, there is no direct MOV between the 64-bit CPU registers and the 128-bit XMM registers. (Update: I found out it is possible with MOVD in x32 and MOVQ in x64.)

I tried PCMPISTRI from the SSE4.2 instruction set for the string functions. It works, but it reads 16-byte blocks from the string. With 8-byte-aligned PB strings this reads past the end of the string. In my tests it never crashed, and I guess it never will, because the OS memory pages are 256 bytes. So normally we will never read into the memory of another process. But I'm not 100% sure! Maybe it will crash in the future! Do you know anything about this issue?
MM0-MM7 share the same space as fpu registers st0-st7 that's why the EMMS instruction is needed.

You are right about MOVD and MOVQ.
They can be used to move 32 or 64 bits from and to a XMM register.

When you are using XMM registers, XMM0-XMM5 are volatile on Windows x64 (on Linux and macOS, XMM6 and XMM7 as well; the System V AMD64 ABI treats all XMM registers as volatile), so you don't need to back up and restore XMM4 and XMM5.

As far as I know, the default OS page size is 4KiB, not 256 bytes.
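
To put the page-size point into numbers, here is a minimal sketch in plain PureBasic. It assumes a 4 KiB page, which is the usual default on x86/x64 but still an assumption here, and #PageSize and WouldCrossPage() are hypothetical names that are not part of the code posted above. As far as page protection is concerned, a read that stays inside the page holding the string data cannot fault, so the only risky 16-byte load is one that reaches into the next page.

```purebasic
; Sketch only: #PageSize is an assumption (4 KiB default page size),
; WouldCrossPage() is a hypothetical helper, not part of the posted code.
#PageSize = 4096

Procedure.i WouldCrossPage(*Addr, Bytes.i = 16)
  ; Offset of *Addr inside its memory page
  Protected Offset.i = *Addr & (#PageSize - 1)
  ; The read covers Offset .. Offset+Bytes-1.
  ; It leaves the page when Offset+Bytes is larger than the page size.
  ProcedureReturn Bool(Offset + Bytes > #PageSize)
EndProcedure

Debug WouldCrossPage(4080)   ; 0: bytes 4080..4095 stay inside the first page
Debug WouldCrossPage(4081)   ; 1: the last byte (4096) lies in the next page
```

Since 4096 is a multiple of 16, a 16-byte load from a 16-byte-aligned address can never cross a page boundary. Only unaligned block reads near the end of a page need this kind of check.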

Helle has posted some code before using PCMPISTRI
https://www.purebasic.fr/english/viewto ... 76#p375376
SMaag
Enthusiast
Enthusiast
Posts: 325
Joined: Sat Jan 14, 2023 6:55 pm
Location: Bavaria/Germany

Re: Fast String Functions using MMX

Post by SMaag »

PureBasic Help:Assembler
- On x86 processors, the available volatile registers are: eax, ecx and edx, xmm0, xmm1, xmm2 and xmm3. All others must be always preserved.
- On x64 processors, the available volatile registers are: rax, rcx, rdx, r8, r9, xmm0, xmm1, xmm2 and xmm3. All others must be always preserved.
The help says only xmm0..xmm3 are volatile! Is the help wrong here?

But now I understand: because the MMX registers are tied to the FPU registers, all MMX registers need a backup. So my code, which starts the backup at MM4, is wrong!
wilbert
PureBasic Expert
PureBasic Expert
Posts: 3942
Joined: Sun Aug 08, 2004 5:21 am
Location: Netherlands

Re: Fast String Functions using MMX

Post by wilbert »

SMaag wrote: Mon Feb 05, 2024 5:06 pm The help says only xmm0..xmm3 are volatile! Is the help wrong here?
I don't know why the help says that. :?
The calling conventions say something different.

Windows uses the "x64 calling convention" for 64 bit applications.
Linux and macOS use the "System V AMD64 ABI".

Here's a quote from the Windows x64 calling convention ...
x64 calling convention wrote:
Caller/callee saved registers

The x64 ABI considers the registers RAX, RCX, RDX, R8, R9, R10, R11, and XMM0-XMM5 volatile. When present, the upper portions of YMM0-YMM15 and ZMM0-ZMM15 are also volatile. On AVX512VL, the ZMM, YMM, and XMM registers 16-31 are also volatile. When AMX support is present, the TMM tile registers are volatile. Consider volatile registers destroyed on function calls unless otherwise safety-provable by analysis such as whole program optimization.

The x64 ABI considers registers RBX, RBP, RDI, RSI, RSP, R12, R13, R14, R15, and XMM6-XMM15 nonvolatile. They must be saved and restored by a function that uses them.
SMaag
Enthusiast
Enthusiast
Posts: 325
Joined: Sat Jan 14, 2023 6:55 pm
Location: Bavaria/Germany

Re: Fast String Functions using MMX

Post by SMaag »

As so often, it's a little more complicated than it first seems!

For the register use and the calling conventions I found this at Microsoft:

R10:R11 Volatile Must be preserved as needed by caller; used in syscall/sysret instructions
XMM4:XMM5 Volatile Must be preserved as needed by caller; fifth vector-type argument when __vectorcall is used

;https://learn.microsoft.com/en-us/cpp/b ... w=msvc-170
; x64 calling conventions - Register use

; --------------------
; x64 CPU Register
; --------------------
; RAX      Volatile     Return value register
; RCX      Volatile     First integer argument
; RDX      Volatile     Second integer argument
; R8       Volatile     Third integer argument
; R9       Volatile     Fourth integer argument
; R10:R11  Volatile     Must be preserved as needed by caller; used in syscall/sysret instructions
; R12:R15  Nonvolatile  Must be preserved by callee
; RDI      Nonvolatile  Must be preserved by callee
; RSI      Nonvolatile  Must be preserved by callee
; RBX      Nonvolatile  Must be preserved by callee
; RBP      Nonvolatile  May be used as a frame pointer; must be preserved by callee
; RSP      Nonvolatile  Stack pointer
; --------------------
; MMX-Register
; --------------------
; MM0:MM7  Nonvolatile  Registers shared with the FPU registers. An EMMS instruction is necessary
;                       after MMX register use to enable correct FPU functions again.
; --------------------
; SSE Register
; --------------------
; XMM0, YMM0              Volatile     First FP argument; first vector-type argument when __vectorcall is used
; XMM1, YMM1              Volatile     Second FP argument; second vector-type argument when __vectorcall is used
; XMM2, YMM2              Volatile     Third FP argument; third vector-type argument when __vectorcall is used
; XMM3, YMM3              Volatile     Fourth FP argument; fourth vector-type argument when __vectorcall is used
; XMM4, YMM4              Volatile     Must be preserved as needed by caller; fifth vector-type argument when __vectorcall is used
; XMM5, YMM5              Volatile     Must be preserved as needed by caller; sixth vector-type argument when __vectorcall is used
; XMM6:XMM15, YMM6:YMM15  Nonvolatile (XMM), Volatile (upper half of YMM)
;                                      Must be preserved by callee. YMM registers must be preserved as needed by caller.