Page 1 of 1

SEE String functions! For Testing!

Posted: Wed Jan 10, 2024 10:41 am
by SMaag
now the SSE String functions are generally working!
LenStr(), StringCompare, FindStr

The practical benefit is low at the moment, because there are PB implemented functions for this!
It is a more a proof of concept! The benefit is the speed compared to PB implemented function!
But only at very havy use like parsing very long text files.

see the pevious discussion here: https://www.purebasic.fr/english/viewtopic.php?t=82158

what I'm interested in is: how to do the same in C-Backend with the intrinsics macros.
here the Intel Intrinsics Guide for the SSE4.2 functions
https://www.intel.com/content/www/us/en ... expand=924

Here the Code, if anyone want to test it!

Update: 2024/08/01 - added 16-Byte align test to prevent reading over EndOfMemoryPage!

Code: Select all


; SEE FastString at PB Forum from 2012: https://www.purebasic.fr/english/viewtopic.php?p=375376#p375376

; ===========================================================================
; FILE : PbFw_Module_StringSSE.pb
; NAME : PureBasic Framework : Module String SSE [StrSSE::]
; DESC : using the MMX, SSE Registers, to speed up String operations
; DESC : CPU SSE4.2 support is needed
; DESC : 
; SOURCES:  https://en.wikibooks.org/wiki/X86_Assembly/SSE#The_Four_Instructions
;           https://en.wikibooks.org/wiki/X86_Assembly/SSE
;           https://www.strchr.com/strcmp_and_strlen_using_sse_4.2
; ===========================================================================
;
; AUTHOR   :  Stefan Maag
; DATE     :  2022/12/04
; VERSION  :  0.53 Developper Version
; COMPILER :  PureBasic 6.0
;
; LICENCE  :  MIT License see https://opensource.org/license/mit/
;             or \PbFramWork\MitLicence.txt
; ===========================================================================
;{ ChangeLog: 
 ; 2024/03/01 S.Maag : 16 Byte Aling check and manually align unaligend Strings
 ; 2024/01/09 S.Maag : SSE_StringCompare: now return -1,0,1 instead of char difference
 ;                     to be compatible with PB Command CompareMemoryString().
 ;                     Tested and bugfixed FindStr
 ; 2023/07/31 S.Maag : SSE_StringCompare: compare Bug fixed

;} 
;{ TODO: For the C-Backend; add functions using SSE
; here the link to the Intel Intrsics Guide for SSE4.2
; https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#ssetechs=SSE4_2&ig_expand=924

; Implement all the SSE optimations for the C-Backend on x86
; propably by using the C intrinsic Macros
;}
; ===========================================================================

;{ Description PCmpIStrI

; PCmpIStrI arg1, arg2, IMM8  ; ATTENTION PCmpIStrI/M needs 16Byte aligned Memory

; modified Flags
;     CF is reset If IntRes2 is zero, set otherwise
;     ZF is set If a null terminating character is found in arg2, reset otherwise
;     SF is set If a null terminating character is found in arg1, reset otherwise
;     OF is set To IntRes2[0]
;     AF is reset
;     PF is reset
; ----------------------------------------------------------------------
; IMM8[1:0] specifies the format of the 128-bit source data
; ----------------------------------------------------------------------
; 00b 	unsigned bytes(16 packed unsigned bytes)
; 01b 	unsigned words(8 packed unsigned words)
; 10b 	signed bytes(16 packed signed bytes)
; 11b 	signed words(8 packed signed words) 

; ----------------------------------------------------------------------
; IMM8[3:2] specifies the aggregation operation whose result will 
;           be placed in intermediate result 1, which we will refer to 
;           as IntRes1. The size of IntRes1 will depend on the format
;           of the source Data, 16-bit for packed bytes and 
;           8-bit For packed words: 
; ----------------------------------------------------------------------
; 00b Equal Any, arg1 is a character set, arg2 is the string to search in.
;     IntRes1[i] is set To 1 If arg2[i] is in the set represented by arg1
;
;       arg1    = "aeiou"
;       arg2    = "Example string 1"
;       IntRes1 =  0010001000010000

; 01b Ranges, arg1 is a set of character ranges i.e. "09az" means all
;     characters from 0 To 9 And from a To z., arg2 is the string To search over. 
;     IntRes1[i] is set To 1 If arg[i] is in any of the ranges represented by arg
;
;         arg1    = "09az"
;         arg2    = "Testing 1 2 3, T"
;         IntRes1 =  0111111010101000

; 10b Equal Each, arg1 is string one and arg2 is string two. 
;     IntRes1[i] is set To 1 If arg1[i] == arg2[i]
;
;         arg1    = "The quick brown "
;         arg2    = "The quack green "
;         IntRes1 =  1111110111010011

; 11b Equal Ordered, arg1 is a substring string to search for, arg2 is the 
;     string To search within. IntRes1[i] is set To 1 If the substring arg1 
;     can be found at position arg2[i]: 

;         arg1    = "he"
;         arg2    = ", he helped her "
;         IntRes1 =  0010010000001000

; ----------------------------------------------------------------------
; IMM8[5:4] specifies the polarity or the processing of IntRes1, into 
;           intermediate result 2, which will be referred To As IntRes2
; ----------------------------------------------------------------------
; 00b  Positive Polarity 	IntRes2 = IntRes1
; 01b  Negative Polarity 	IntRes2 = -1 XOr IntRes1
; 10b  Masked Positive 	  IntRes2 = IntRes1
; 11b  Masked Negative 	  IntRes2 = IntRes1 If reg/mem[i] is invalid Else ~IntRes1

; ----------------------------------------------------------------------
; IMM8[6] specifies the output selection, or how IntRes2 will be processed
;         into the output. For PCMPESTRI And PCMPISTRI, the output is an
;         index into the Data currently referenced by arg2
; ----------------------------------------------------------------------
; 0b 	Least Significant Index 	ECX contains the least significant set bit in IntRes2
; 1b 	Most Significant Index 	  ECX contains the most significant set bit in IntRes2 

; ----------------------------------------------------------------------
; IMM8[6] For PCMPESTRM and PCMPISTRM, the output is a mask reflecting 
;         all the set bits in IntRes2
; ----------------------------------------------------------------------
; 0b 	Least Significant Index 	Bit Mask, the least significant bits 
;                               of XMM0 contain the IntRes2 16(8) bit mask. 
;                               XMM0 is zero extended To 128-bits.
; 1b 	Most Significant Index 	  Byte/Word Mask, XMM0 contains IntRes2 expanded into byte/word mask 
; ----------------------------------------------------------------------

; EQUAL_ANY	        =   0000b
; RANGES		        =   0100b
; EQUAL_EACH	      =   1000b
; EQUAL_ORDERED	    =   1100b
; NEGATIVE_POLARITY = 010000b
; BYTE_MASK	       = 1000000b

; FLAGs
; OF : Overflow flag
; SF : Sign flag      ; #True if negative
; ZF : Zero flag      ; #True if zero
; AF: Auxillary (carry) flag
; PF: Parity flag
; CF: Carry flag
;}

;{ ----------------------------------------------------------------------
;    The Problem of 16 Byte operations on lower aligend memory
;  ----------------------------------------------------------------------

; If we process 16 Bytes on lower aligned memory we may run into an overflow at the
; end of meory pages when the end of String is located in the last bytes
; of the memory page and the following page is not allocated to our process.
; Yes this will happen very seldom but it can happen. So it is a source of
; crashes may happen in years in the future. Because it can happen, it will happen!
; It is only a question of time!

; A memory page in x64 Systems is 4096 Bytes
; We look on a 8 Byte aligned String at the end of memory page to show the problem

;  a String followed by a NullChar and a further NullChar then the page ends
;                          EndOfString at Byte 4092..93 and a void 00        
; | ..... 'I am a String at the end of a memroy page' 0000| 
; if we process 16 Bytes at 8 Byte align starting at Byte 4088 we read until Byte 4103
; we read 8 Bytes into the next page. Now it will crash if the next page is not
; allocated to our process! We can use 16 Byte PCMPISTRI operation
; only if we are not at the end of a memory page or we have a 16 Byte aligned memory. 
;}
;- ----------------------------------------------------------------------
;- Include Files
;  ----------------------------------------------------------------------

DeclareModule StrSSE
  
  EnableExplicit
   
  ; ----------------------------------------------------------------------
  ;- DECLARE
  ;- ----------------------------------------------------------------------
  
  Declare.i SSE_LenA(*String)
  Declare.i SSE_Len(*String)
  Declare.i SSE_StringCompare(*String1, *String2, Pos=0)
  Declare.i SSE_FindStr(*String, *StringToFind)

EndDeclareModule

Module StrSSE
    
  IncludeFile "PbFw_ASM_Macros.pbi"
  
  #EQUAL_ANY	        = %0000
  #RANGES		          = %0100
  #EQUAL_EACH	        = %1000
  #EQUAL_ORDERED	    = %1100
  #NEGATIVE_POLARITY  = %0010000
  #BYTE_MASK	        = %1000000
  
  Structure pChar   ; virtual CHAR-ARRAY, used as Pointer to overlay on strings 
    a.a[0]          ; fixed ARRAY Of CHAR Length 0
    c.c[0]          
  EndStructure

  ;- ----------------------------------------------------------------------
  ;- Module Public
  ;- ----------------------------------------------------------------------

  Procedure.i SSE_LenA(*String)
  ; ============================================================================
  ; NAME: SSE_LenA
  ; DESC: Length in number of characters of Ascii Strings
  ; DESC: Use SSE PCmpIStrI operation. This is aprox. 3 times faster than PB Len()
  ; VAR(*String): Pointer to String 1
  ; RET.i: Number of Characters
  ; ============================================================================
    
    ; ATTENTION PCmpIStrI needs 16Byte aligned Memory
    ; If memory isn't aligned we have to align it manually
    ; by processing unalinged bytes in classic way and
    ; start with PCmpIStrI at aligned psoition.
    ; The Problem of analigned reading is the end of memory page (4096Bytes)
    ; if the following page is not allocated by our process.
    ; Memory exception because we cand read memory or other process.
    
  	; IMM8[1:0]	= 00b
    ;	Src data is unsigned bytes(16 packed unsigned bytes)
    
  	; IMM8[3:2]	= 10b
    ; 	We are using Equal Each aggregation
    
  	; IMM8[5:4]	= 00b
    ;	Positive Polarity, IntRes2	= IntRes1
    
  	; IMM8[6]	= 0b
  	;	ECX contains the least significant set bit in IntRes2
  	;
    ; XMM0 XMM1 XMM2 XMM3 XMM4
    ; XMM1 = [String1] : XMM2=[String2] : XMM3=WideCharMask
    
    DisableDebugger
    CompilerIf #PB_Compiler_64Bit
      
      !XOR RDX, RDX           ; RDX = 0
      !XOR RCX, RCX           ; RCX = 0
      !MOV RAX, [p.p_String]  ; RAX = *String
      !@@:                    ; 
      !TEST RAX, 0Fh          ; Test for 16Byte align
      !JZ @f                  ; If NOT aligned
        !MOV DL, BYTE[RAX]    ;   process Char by Char until aligned
        !TEST RDX, RDX        ;   Check for EndOfString
        !JZ .Return           ;   Break if EndOfString
        !INC RAX              ; Pointer to NextChar
      !JMP @b                 ; Jump back to @@      
      !@@:                    ; from here we have 16Byte aligned address
      
      !PXOR XMM0, XMM0    
      !SUB RAX, 16
      
      !@@:  
        !ADD RAX, 16    
        !PCMPISTRI XMM0, [RAX], 0001000b ; EQUAL_EACH, unsigned_Bytes
      !JNZ @b
      ; ECX will contain the offset from eax where the first null
    	; terminating character was found.
      !ADD RAX, RCX    
      
      !.Return:
      !SUB RAX, [p.p_String] 
      ProcedureReturn
      
    CompilerElse   
      
      !XOR EDX, EDX           ; EDX = 0
      !XOR ECX, ECX           ; RCX = 0
      !MOV EAX, [p.p_String]  ; EAX = *String
      !@@:                    ; 
      !TEST EAX, 0Fh          ; Test for 16Byte align
      !JZ @f                  ; If NOT aligned
        !MOV DL, BYTE[EAX]    ;   process Char by Char until aligned
        !TEST EDX, EDX        ;   Check for EndOfString
        !JZ .Return           ;   Break if EndOfString
        !INC EAX              ;   Pointer to NextChar
      !JMP @b                 ; Jump back to @@      
      !@@:                    ; from here we have 16Byte aligned address
      
      !PXOR XMM0, XMM0    
      !SUB EAX, 16
      
      !@@:  
        !ADD EAX, 16    
        !PCMPISTRI XMM0, [EAX], 0001000b ; EQUAL_EACH, unsigned_Bytes
      !JNZ @b
      ; ECX will contain the offset from eax where the first null
    	; terminating character was found.
      !ADD EAX, ECX    
      
      !.Return:
      !SUB EAX, [p.p_String] 
      ProcedureReturn
      
    CompilerEndIf 
    
    EnableDebugger
  EndProcedure
  
  Procedure.i SSE_Len(*String)
  ; ============================================================================
  ; NAME: SSE_Len
  ; DESC: Length in number of characters of 2-Byte Char Strings
  ; DESC: Use SSE PCmpIStrI operation. This is aprox. 3 times faster than PB Len()
  ; VAR(*String): Pointer to String
  ; RET.i: Number of Characters
  ; ============================================================================
    
  	; IMM8[1:0]	= 00b
  	;	Src data is unsigned bytes(16 packed unsigned bytes)
  	; IMM8[3:2]	= 10b
  	; 	We are using Equal Each aggregation
  	; IMM8[5:4]	= 00b
  	;	Positive Polarity, IntRes2	= IntRes1
  	; IMM8[6]	= 0b
  	;	ECX contains the least significant set bit in IntRes2
    
    ; XMM0 XMM1 XMM2 XMM3 XMM4
    ; XMM1 = [String1] : XMM2=[String2] : XMM3=WideCharMask
    
    DisableDebugger
    
    CompilerIf #PB_Compiler_Backend = #PB_Backend_Asm
      
      CompilerIf #PB_Compiler_64Bit     
        
        !XOR RDX, RDX
        !XOR RCX, RCX
        !MOV RAX, [p.p_String] 
        
        !@@:
        !TEST RAX, 0Fh            ; Test for 16Byte align
        !JZ @f                    ; If NOT aligned
          !MOV DX, WORD [RAX]     ;   process Char by Char until aligned
          !TEST RDX, RDX          ;   Check for EndOfString
          !JZ .Return             ;   Break if EndOfString
          !INC RAX                ;   Pointer to NextChar
        !JMP @b                   ; Jump back to @@   
        !@@:                      ; from here we have 16Byte aligned address
        
        !PXOR XMM0, XMM0
        !SUB RAX, 16      
        
        !@@:  
          !ADD RAX, 16
          !PCMPISTRI XMM0, [RAX], 0001001b  ; EQUAL_EACH WORD
        !JNZ @b
        
        ; RCX will contain the offset from RAX where the first null
      	; terminating character was found.
        !SHL RCX, 1   ; Word to Byte
        !ADD RAX, RCX
        
        !.Return:
        !SUB RAX, [p.p_String]
        !SHR RAX, 1               ; ByteCounter to Word
        ProcedureReturn
        
      CompilerElse ; #PB_Compiler_32Bit       
        
        !XOR EDX, EDX
        !XOR ECX, ECX
        !MOV EAX, [p.p_String] 
        
        !@@:
        !TEST EAX, 0Fh            ; Test for 16Byte align
        !JZ @f                    ; If NOT aligned
          !MOV DX, WORD [EAX]     ;   process Char by Char until aligned
          !TEST EDX, EDX          ;   Check for EndOfString
          !JZ .Return             ;   Break if EndOfString
          !INC EAX                ;   Pointer to NextChar
        !JMP @b                   ; Jump back to @@   
        
        !@@:                      ; from here we have 16Byte aligned address
        !PXOR XMM0, XMM0
        !SUB EAX, 16      
        
        !@@:  
          !ADD EAX, 16  
          !PCMPISTRI XMM0, [EAX], 0001001b  ; EQUAL_EACH WORD
        !JNZ @b
        
        ; ECX will contain the offset from EAX where the first null
      	; terminating character was found.
        !SHL ECX, 1   ; Word to Byte
        !ADD EAX, ECX
        
        !.Return:
        !SUB EAX, [p.p_String]
        !SHR EAX, 1               ; Byte to Word
      CompilerEndIf
      
    CompilerElse  ; #PB_Compiler_Backend = #PB_Backend_C
      
      Protected *pStr.String 
      *pStr = *String
      ProcedureReturn Len(*pStr\s)
        
    CompilerEndIf
    
    EnableDebugger
  EndProcedure

  Procedure.i SSE_StringCompare(*String1, *String2, *Pos=0)
  ; ============================================================================
  ; NAME: SSE_StringCompare
  ; DESC: Compares 2 Strings with SSE operation (PCmpIStrI)
  ; VAR(*String1): Pointer to String 1
  ; VAR(*String2): Pointer to String 2
  ; VAR(*Pos): optional Pointer to an Int to get the CharNo which do not match 
  ; RET.i: 0=(S1=S2), 1=(S1>S2), -1=(S1<S2) #PB_String_Lower/Equal/Greater
  ; ============================================================================
        
    ; XMM0 XMM1 XMM2 XMM3 XMM4
    ; XMM1 = [String1] : XMM2=[String2]
  ;  DisableDebugger
    
    CompilerIf #PB_Compiler_Backend = #PB_Backend_Asm
      CompilerIf #PB_Compiler_64Bit 
        DisableDebugger
       
        ; used Registers
        ;   RAX : *String1
        ;   R8  : *String2
        ;   RCX : operating Register
        ;   RDX : operating Register
        
        ; ----------------------------------------------------------------------
        ; Check the *String1 and *String2 align
        ; The Problem of not aligend 16 is: Reading over the end of a 
        ; memory page. If the next page is not allocated to our programm we
        ; produce a crash! So we have to be sure do not read over the 
        ; EndOfPage if the String ends short before. PCMPISTRI process 16Bytes,
        ; so if we use PCMPISTRI at align 16 we can't read over the end of
        ; String without detecting EndOfString first!
        ; ----------------------------------------------------------------------
        !MOV RAX, [p.p_String1]
        !MOV R8, [p.p_String2]
        
        !MOV RCX, RAX           ; RCX = *String1
        !AND RCX, 0Fh           ; Filter the Aling Offset to 16Bytes      
        !MOV RDX, R8            ; RDX = *String2
        !SUB R8, RAX
        
        !AND RDX, 0Fh           ; Filter the Aling Offset to 16Bytes
        !TEST RDX, 1            ; Test for Odd align -> Align 16 not possible
        !JNZ .NotAligned        
        
        !CMP RDX, RCX           ; Test if align of String1 and String2 is identical
        !JNE .NotAligned      
        
        !TEST RCX, RCX          ; Test if it is aligend to 16Bytes (Offset ==0)
        !JZ .a16                ; aligned to 16 Bytes, we have to do nothing
        
        ; ----------------------------------------------------------------------
        ; Case I: Not aligned to 16 Bytes but it's possilbe to align manually
        ; ----------------------------------------------------------------------

        ; identical align but not to 16Bytes
        !SUB RAX, 2
        
        ; so first we compare Char by Char until the Address is 16 Byte aligned  
        !@@:                      ; Loop
          !ADD RAX, 2
          ;!ADD R8, 2
          !TEST RAX, 0Fh          ; (AND RAX, 0Fh) == 0
          !JZ .a16                ; Continue at Case III: aligend to 16 Byte
          !MOV CX, WORD[RAX]
          !CMP CX, WORD[RAX+R8]
          !JA .GREATER
          !JB .LOWER
          ; if identical check for EndOfString
          !TEST CX, 0             ; TEST results in 0 if CX==0
          !JZ .EQUAL
        !JMP @b                   ; Not EndOfString -> Repeat Loop  
        
        ; ----------------------------------------------------------------------
        ; Case II:
        ; A complete different align of *String1 and *String2, so it is not
        ; possible to aling both toghether to 16Byte. In this case we have
        ; 2 options:
        ;   I) we don't use PCMPISTRI and do a classic Char by Char compare
        ;  II) we have to check EndOfMemPage and use classic Char by Char
        ;      at end of MemoryPages (4096 Bytes). But that's more 
        ;      complicated
        ; ----------------------------------------------------------------------
        
        !.NotAligned:             ; Not aligned : a complet different align       
        !SUB RAX, 2
         
        !@@:
          !ADD RAX, 2
          !MOV CX, WORD[RAX]
          !CMP CX, WORD[RAX+R8]
          !JA .GREATER
          !JB .LOWER         
          ; if identical check for EndOfString
          !TEST CX, 0             ; TEST results in 0 if CX==0 
          !JZ .EQUAL
        !JMP @b                   ; Not EndOfString -> Repeat Loop
                       
        ; ----------------------------------------------------------------------
        ; Case III:
        ; If *String1 And *String is aligned to 16Bytes
        ; we can use PCMPISTRI what is a 16Byte operation
        ; ----------------------------------------------------------------------
        
        !.a16:
        ; Subtract s2(RDX) from s1(RAX). This admititedly looks odd, but we
      	; can now use RDX to index into s1 and s2. As we adjust RDX to move
      	; forward into s2, we can then add RDX to RAX and this will give us
      	; the comparable offset into s1 i.e. if we take RDX + 16 then:
      	;
      	;	RDX     = RDX + 16		        = RDX + 16
      	;	RAX+RDX	= RAX -RDX + RDX + 16	= RAX + 16
      	;
      	; therefore RDX points to s2 + 16 and RAX + RDX points to s1 + 16.
      	; We only need one index, convoluted but effective.
      
       	!SUB RAX, 16		         
        !XOR RCX, RCX   
        
        !@@:
         	!ADD RAX, 16
         	!MOVDQA XMM0, [RAX]         	
         	; IMM8[1:0]	= 00b
       	  ;	00b: Src data is unsigned bytes(16 packed unsigned bytes)
       	  ;	01b: Src data is unsigned words( 8 packed unsigned words)
        	
         	; IMM8[3:2]	= 10b
        	; 	We are using Equal Each aggregation
        	; IMM8[5:4]	= 01b
        	;	Negative Polarity, IntRes2	= -1 XOR IntRes1
        	; IMM8[6]	= 0b
        	;	ECX contains the least significant set bit in IntRes2  	
        	!PCMPISTRI XMM0, [RAX+R8], 0011001b ; EQUAL_EACH + NEGATIVE_POLARITY + UNSIGNED_WORDS  	
        	; Loop while ZF=0 and CF=0:
        	;	1) We find a null in s1(RDX+RAX) ZF=1
        	;	2) We find a char that does not match CF=1  	
        !JA	@b      ; IF CF=0 And ZF=0     	  	         
      	!JC	@f      ; IF CF=1 : Jump if CF=1, we found a mismatched char      	          
       	!JMP .EQUAL	; We terminated loop due to a null character i.e. CF=0 and ZF=1 -> The Strings are equal
       	
       	!@@:
          ; ECX is the offset from the current poition in NoOfChars where the two strings do not match,
        	; so copy the respective non-matching char into DX and compare it with the position in *String2
        	; in remaining bits w/ zero. Because of 2ByteChar we have to convert Word to Byte
          
          !SHL RCX, 1             ; Number of Chars to Adress Offset
          !ADD RAX, RCX
          !MOV DX, WORD[RAX]
         	; If S1=S2 : Return (0) ; #PB_String_Equal
          ; If S1>S2 : Return (+) ; #PB_String_Greater
          ; If S1<S2 : Return (-) ; #PB_String_Lower
          !CMP DX, WORD [RAX+R8]
          !JA .GREATER
          !JB .LOWER
          ;!JMP .EQUAL 
          
        !.EQUAL:                    ; The Strings are equal
          !MOV R8, RAX  
          !XOR RAX, RAX             ; #PB_String_Equal, 0  
          !JMP @f                      
        !.LOWER:                    ; String1 < String2
          !MOV R8, RAX  
          !XOR RAX, RAX
          !DEC RAX                  ; #PB_String_Lower, -1  
          !JMP @f                      
        !.GREATER:                  ; String1 > String2
          !MOV R8, RAX  
          !XOR RAX, RAX
          !INC RAX                  ; #PB_String_Greater, 1
        !@@:  
          ; check for Return of CharNo in Pos 
          !MOV RDX, [p.p_Pos]       ; RDX = *Pos
          !TEST RDX, RDX            ; 
          !JZ .return               ; If *Pos = 0 Then return  
            !SUB R8, [p.p_String1]
            !SHR R8, 1              ; Byte to Word
            !MOV [RDX], R8          ; Pos = CharNo which do not match
        !.return:
        ProcedureReturn ; RAX
        EnableDebugger
        
      CompilerElse  ; #PB_Compiler_32Bit
        
        DisableDebugger
       
        ; used Registers
        ;   EAX : *String1
        ;   EBX  : *String2
        ;   ECX : operating Register
        ;   EDX : operating Register
        
        ASM_PUSH_EBX()
        ; ----------------------------------------------------------------------
        ; Check the *String1 abd *String2 align
        ; The Problem of not aligend 16 is: Reading over the end of a 
        ; memory page. If the next page is not allocated to our programm we
        ; produce a crash! So we have to be sure do not read over the 
        ; EndOfPage if the String ends short before. PCMPISTRI process 16Bytes,
        ; so if we use PCMPISTRI at align 16 we can't read over the end of
        ; String without detecting EndOfString first!
        ; ----------------------------------------------------------------------
        !MOV EAX, [p.p_String1]
        !MOV EBX, [p.p_String2]
        
        !MOV EXC, EAX           ; EXC = *String1
        !AND EXC, 0Fh           ; Filter the Aling Offset to 16Bytes      
        !MOV EDX, EBX            ; EDX = *String2
        !SUB EBX, EAX
        
        !AND EDX, 0Fh           ; Filter the Aling Offset to 16Bytes
        !TEST EDX, 1            ; Test for Odd align -> Align 16 not possible
        !JNZ .NotAligned        
        
        !CMP EDX, EXC           ; Test if align of String1 and String2 is identical
        !JNE .NotAligned      
        
        !TEST EXC, EXC          ; Test if it is aligend to 16Bytes (Offset ==0)
        !JZ .a16                ; aligned to 16 Bytes, we have to to nothing
        
        ; ----------------------------------------------------------------------
        ; Case I: Not aligned to 16 Bytes but it's possilbe to align manually
        ; ----------------------------------------------------------------------

        ; identical align but not to 16Bytes
        !SUB EAX, 2
        
        ; so first we compare Char by Char until the Address is 16 Byte aligned  
        !@@:                      ; Loop
          !ADD EAX, 2
          ;!ADD EBX, 2
          !TEST EAX, 0Fh          ; (AND EAX, 0Fh) == 0
          !JZ .a16                ; Continue at Case III: aligend to 16 Byte
          !MOV CX, WORD[EAX]
          !CMP CX, WORD[EAX+EBX]
          !JA .GREATER
          !JB .LOWER
          ; if identical check for EndOfString
          !TEST CX, 0             ; TEST results in 0 if CX==0
          !JZ .EQUAL
        !JMP @b                   ; Not EndOfString -> Repeat Loop  
        
        ; ----------------------------------------------------------------------
        ; Case II:
        ; A complete different align of *String1 and *String2, so it is not
        ; possible to aling both toghether to 16Byte. In this case we have
        ; 2 options:
        ;   I) we don't use PCMPISTRI and do a classic Char by Char compare
        ;  II) we have to check EndOfMemPage and use classic Char by Char
        ;      at end of MemoryPages (4096 Bytes). But that's more 
        ;      complicated
        ; ----------------------------------------------------------------------
        
        !.NotAligned:             ; Not aligned : a complet different align       
        !SUB EAX, 2
         
        !@@:
          !ADD EAX, 2
          !MOV CX, WORD[EAX]
          !CMP CX, WORD[EAX+EBX]
          !JA .GREATER
          !JB .LOWER         
          ; if identical check for EndOfString
          !TEST CX, 0             ; TEST results in 0 if CX==0 
          !JZ .EQUAL
        !JMP @b                   ; Not EndOfString -> Repeat Loop
                       
        ; ----------------------------------------------------------------------
        ; Case III:
        ; If *String1 And *String is aligned to 16Bytes
        ; we can use PCMPISTRI what is a 16Byte operation
        ; ----------------------------------------------------------------------
        
        !.a16:
        ; Subtract s2(EDX) from s1(EAX). This admititedly looks odd, but we
      	; can now use EDX to index into s1 and s2. As we adjust EDX to move
      	; forward into s2, we can then add EDX to EAX and this will give us
      	; the comparable offset into s1 i.e. if we take EDX + 16 then:
      	;
      	;	EDX     = EDX + 16		        = EDX + 16
      	;	EAX+EDX	= EAX -EDX + EDX + 16	= EAX + 16
      	;
      	; therefore EDX points to s2 + 16 and EAX + EDX points to s1 + 16.
      	; We only need one index, convoluted but effective.
      
       	!SUB EAX, 16		         
        !XOR EXC, EXC   
        
        !@@:
         	!ADD EAX, 16
         	!MOVDQA XMM0, [EAX]         	
         	; IMM8[1:0]	= 00b
       	  ;	00b: Src data is unsigned bytes(16 packed unsigned bytes)
       	  ;	01b: Src data is unsigned words( 8 packed unsigned words)
        	
         	; IMM8[3:2]	= 10b
        	; 	We are using Equal Each aggregation
        	; IMM8[5:4]	= 01b
        	;	Negative Polarity, IntRes2	= -1 XOR IntRes1
        	; IMM8[6]	= 0b
        	;	ECX contains the least significant set bit in IntRes2  	
        	!PCMPISTRI XMM0, [EAX+EBX], 0011001b ; EQUAL_EACH + NEGATIVE_POLARITY + UNSIGNED_WORDS  	
        	; Loop while ZF=0 and CF=0:
        	;	1) We find a null in s1(EDX+EAX) ZF=1
        	;	2) We find a char that does not match CF=1  	
        !JA	@b      ; IF CF=0 And ZF=0     	  	         
      	!JC	@f      ; IF CF=1 : Jump if CF=1, we found a mismatched char      	          
       	!JMP .EQUAL	; We terminated loop due to a null character i.e. CF=0 and ZF=1 -> The Strings are equal
       	
       	!@@:
          ; ECX is the offset from the current poition in NoOfChars where the two strings do not match,
        	; so copy the respective non-matching char into DX and compare it with the position in *String2
        	; in remaining bits w/ zero. Because of 2ByteChar we have to convert Word to Byte
          
          !SHL EXC, 1             ; Number of Chars to Adress Offset
          !ADD EAX, EXC
          !MOV DX, WORD[EAX]
         	; If S1=S2 : Return (0) ; #PB_String_Equal
          ; If S1>S2 : Return (+) ; #PB_String_Greater
          ; If S1<S2 : Return (-) ; #PB_String_Lower
          !CMP DX, WORD [EAX+EBX]
          !JA .GREATER
          !JB .LOWER
          ;!JMP .EQUAL 
          
        !.EQUAL:                    ; The Strings are equal
          !MOV EBX, EAX  
          !XOR EAX, EAX             ; #PB_String_Equal, 0  
          !JMP @f                      
        !.LOWER:                    ; String1 < String2
          !MOV EBX, EAX  
          !XOR EAX, EAX
          !DEC EAX                  ; #PB_String_Lower, -1  
          !JMP @f                      
        !.GREATER:                  ; String1 > String2
          !MOV EBX, EAX  
          !XOR EAX, EAX
          !INC EAX                  ; #PB_String_Greater, 1
        !@@:  
          ; check for Return of CharNo in Pos 
          !MOV EDX, [p.p_Pos]       ; EDX = *Pos
          !TEST EDX, EDX            ; 
          !JZ .return               ; If *Pos = 0 Then return  
            !SUB EBX, [p.p_String1]
            !SHR EBX, 1              ; Byte to Word
            !MOV [EDX], EBX          ; Pos = CharNo which do not match
        !.return:     
        ASM_POP_EBX()   
        ProcedureReturn ; EAX
        EnableDebugger
        
      CompilerEndIf
      
    CompilerElse    ; C-Backend
      
      ; for now use PB CompareMemoryString. So it will work on other Platforms too.
      ; maybe provide a C optimized version in the future
      ProcedureReturn CompareMemoryString(*String1, *String2)  
      
    CompilerEndIf
    
   EndProcedure
  
  Procedure.i SSE_FindStr(*String, *StringToFind)
  ; ============================================================================
  ; NAME: SSE_FindStr
  ; DESC: Try to find StringToFind in String with SSE operation (PCmpIStrI)
  ; DESC: Search for the needle in the haystack
  ; DESC: This Function is for 2Byte Character Strings only
  ; VAR(*String): Pointer to String (Haystack)
  ; VAR(*StringToFind): Pointer to StringToFind (Needle)
  ; RET.i: If found: The startposition in Characters [1..n]. Otherwise 0
  ; ============================================================================    
        
    DisableDebugger
    
    ; TODO! Solve the 16Byte align problem
    
    CompilerIf #PB_Compiler_Backend = #PB_Backend_Asm
      
      CompilerIf #PB_Compiler_64Bit 
        Protected memRAX, memRDX
        ; Returns a pointer To the first occurrence of str2 in str1, Or a null pointer If str2 is Not part of str1. 
        ; The matching process does not include the terminating null-characters, but it stops there
        ; RAX = haystack (Heuhaufen), RDX = needle (Nadel)
        
        ; XMM0 XMM1 XMM2 XMM3 XMM4
        ; XMM1 = [String1] : XMM2=[String2]
        
        !MOV RAX, [p.p_String]        ; haystack
        !MOV RDX, [p.p_StringToFind]  ; needle
        !MOVDQU XMM2, DQWORD[RDX] ; load the first 16 bytes of neddle (String to find)
    
       	!SUB RAX, 16		; Avoid extra jump in main loop
           
        ; ----------------------------------------------------------------------
        ; Find the first possible match of 16-byte fragment in haystack
        ; ----------------------------------------------------------------------
        !FindStr_MainLoop:
          !ADD RAX, 16      ; Step up Counter
          !MOVDQU XMM1, DQWORD[RAX]
         ;!PCMPISTRI XMM2, XMM1, 1100b ; EQUAL_ORDERED ; for ASCII Strings
          !PCMPISTRI XMM2, XMM1, 1101b ; EQUAL_ORDERED + UNSIGNED_WORDS; 11001b
          ; now RCX contains the offset in WORDS where a match was found
         	; Loop while ZF=0 and CF=0:
        	;	1) We find a null in s1(RAX) ZF=1
          ;	2) We find a char that does not match CF=1 
        !JA FindStr_MainLoop
        ; Jump if CF=0, we found only matching chars  
        !JNC FindStr_StrNotFound
        
        ; possible match found at WordOffset in RCX
        !ADD RCX, RCX ; Word to Byte
        !ADD RAX, RCX ; save the possible match start
                
        !MOV [p.v_memRDX], RDX ; mov edi, edx; save RDX
        !MOV [p.v_memRAX], RAX ; mov esi, eax; save RAX
        
        ; ----------------------------------------------------------------------
        ; Compare String, at possible match postion in haystack, with needle
        ; ----------------------------------------------------------------------
        !SUB RDX, RAX
        !SUB RAX, 16  ; counter
        
        !PXOR XMM3, XMM3          ; XMM3 = 0
        
        ; compare the strings
        !FindStr_Compare:
          !ADD RAX, 16  ; Counter
          !MOVDQU XMM1, DQWORD[RAX+RDX] ; Haystack          
          ; mask out invalid bytes in the haystack
         ;!PCMPISTRM XMM3, XMM1, 1011000b   ; EQUAL_EACH + NEGATIVE_POLARITY + BYTE_MASK  ; for ASCII Strings
          !PCMPISTRM XMM3, XMM1, 1011001b   ; EQUAL_EACH + NEGATIVE_POLARITY + BYTE_MASK + UNSIGNED_WORDS
          ; PCMPISTRM writes as result a Mask To XMM0, we used BYTE_MASK
          !MOVDQU XMM4, DQWORD[RAX] ; haystack  
          !PAND XMM4, XMM0
          
         ;!PCMPISTRI XMM1, XMM4, 0011000b ; EQUAL_EACH + NEGATIVE_POLARITY ; for ASCII Strings
          !PCMPISTRI XMM1, XMM4, 0011001b ; EQUAL_EACH + NEGATIVE_POLARITY + UNSIGNED_WORDS
         	; Loop while ZF=0 and CF=0:
        	;	1) We find a null in s1(RDX+RCX) ZF=1      {JA CF=0 & ZF=0} {JE : ZF=1)
          ;	2) We find a char that does not match CF=1 {JC, JNC}
          ; 3) We find a null in s2 SF=1               {JS, JNS}
          ;!JS FindStr_StrNotFound 
        !JA FindStr_Compare ; CF=0 AND ZF=0
        
        !MOV RDX, [p.v_memRDX]
        !MOV RAX, [p.v_memRAX]
        !JNC FindStr_StrFound
        
        ;!SUB RAX, 15  ; for ASCII Strings
        !SUB RAX, 14
        !JMP FindStr_MainLoop
        
        !FindStr_StrNotFound:
          !XOR RAX, RAX
          !JMP FindStr_End
          
        !FindStr_StrFound:
          ; because RAX contains the Pointer we have to calculate the Char-No.
          !SUB RAX, [p.p_String]    ; Sub the Haystack Start-Pointer
          !SHR RAX, 1  ; Byte to Word: not needed for ASCII Strings
          !ADD RAX, 1  ; Add 1 to start with 1 as first Char-No.
        !FindStr_End:
        ProcedureReturn  ; !RAX
 
      CompilerElse  ; #PB_Compiler_32Bit
        
        Protected memEAX, memEDX
        ; Returns a pointer To the first occurrence of str2 in str1, Or a null pointer If str2 is Not part of str1. 
        ; The matching process does not include the terminating null-characters, but it stops there
        ; RAX = haystack (Heuhaufen), EDX = needle (Nadel)
        
        ; XMM0 XMM1 XMM2 XMM3 XMM4
        ; XMM1 = [String1] : XMM2=[String2]
        
        !MOV EAX, [p.p_String]        ; haystack
        !MOV EDX, [p.p_StringToFind]  ; needle
        !MOVDQU XMM2, DQWORD[EDX] ; load the first 16 bytes of neddle (String to find)
    
       	!SUB EAX, 16		; Avoid extra jump in main loop
           
        ; ----------------------------------------------------------------------
        ; Find the first possible match of 16-byte fragment in haystack
        ; ----------------------------------------------------------------------
        !FindStr_MainLoop:
          !ADD EAX, 16      ; Step up Counter
          !MOVDQU XMM1, DQWORD[EAX]
         ;!PCMPISTRI XMM2, XMM1, 1100b ; EQUAL_ORDERED ; for ASCII Strings
          !PCMPISTRI XMM2, XMM1, 1101b ; EQUAL_ORDERED + UNSIGNED_WORDS; 11001b
          ; now RCX contains the offset in WORDS where a match was found
         	; Loop while ZF=0 and CF=0:
        	;	1) We find a null in s1(EAX) ZF=1
          ;	2) We find a char that does not match CF=1 
        !JA FindStr_MainLoop
        ; Jump if CF=0, we found only matching chars  
        !JNC FindStr_StrNotFound
        
        ; possible match found at WordOffset in ECX
        !ADD ECX, ECX ; Word to Byte
        !ADD EAX, ECX ; save the possible match start
                
        !MOV [p.v_memEDX], EDX ; mov edi, edx; save EDX
        !MOV [p.v_memEAX], EAX ; mov esi, eax; save EAX
        
        ; ----------------------------------------------------------------------
        ; Compare String, at possible match postion in haystack, with needle
        ; ----------------------------------------------------------------------
        !SUB EDX, EAX
        !SUB EAX, 16  ; counter
        
        !PXOR XMM3, XMM3          ; XMM3 = 0
        
        ; compare the strings
        !FindStr_Compare:
          !ADD EAX, 16  ; Counter
          !MOVDQU XMM1, DQWORD[EAX+EDX] ; Haystack          
          ; mask out invalid bytes in the haystack
         ;!PCMPISTRM XMM3, XMM1, 1011000b   ; EQUAL_EACH + NEGATIVE_POLARITY + BYTE_MASK  ; for ASCII Strings
          !PCMPISTRM XMM3, XMM1, 1011001b   ; EQUAL_EACH + NEGATIVE_POLARITY + BYTE_MASK + UNSIGNED_WORDS
          ; PCMPISTRM writes as result a Mask To XMM0, we used BYTE_MASK
          !MOVDQU XMM4, DQWORD[EAX] ; haystack  
          !PAND XMM4, XMM0
          
         ;!PCMPISTRI XMM1, XMM4, 0011000b ; EQUAL_EACH + NEGATIVE_POLARITY ; for ASCII Strings
          !PCMPISTRI XMM1, XMM4, 0011001b ; EQUAL_EACH + NEGATIVE_POLARITY + UNSIGNED_WORDS
         	; Loop while ZF=0 and CF=0:
        	;	1) We find a null in s1(EDX+ECX) ZF=1      {JA CF=0 & ZF=0} {JE : ZF=1)
          ;	2) We find a char that does not match CF=1 {JC, JNC}
          ; 3) We find a null in s2 SF=1               {JS, JNS}
          ;!JS FindStr_StrNotFound 
        !JA FindStr_Compare ; CF=0 AND ZF=0
        
        !MOV EDX, [p.v_memEDX]
        !MOV EAX, [p.v_memEAX]
        !JNC FindStr_StrFound
        
        ;!SUB EAX, 15  ; for ASCII Strings
        !SUB EAX, 14
        !JMP FindStr_MainLoop
        
        !FindStr_StrNotFound:
          !XOR EAX, EAX
          !JMP FindStr_End
          
        !FindStr_StrFound:
          ; because EAX contains the Pointer we have to calculate the Char-No.
          !SUB EAX, [p.p_String]    ; Sub the Haystack Start-Pointer
          !SHR EAX, 1  ; Byte to Word: not needed for ASCII Strings
          !ADD EAX, 1  ; Add 1 to start with 1 as first Char-No.
        !FindStr_End:
       
        ProcedureReturn 
      CompilerEndIf  ; #PB_Compiler_32Bit
      
    CompilerElse    ; C-Backend
      
      ; for now use PB FindString. So it will work on other Platforms too.
      ; maybe provide a C optimized version in the future
      Protected *pStr.String = *String
      Protected *pStrToFind.String = *StringToFind
      
      ProcedureReturn FindString(*pStr\s, *pStrToFind\s)    
    CompilerEndIf    
    
  EndProcedure
    
  ;- ----------------------------------------------------------------------
  ;- Initalisation
  ;- ----------------------------------------------------------------------
  
;   ; PCmpIStrI needs SSE4.2
;   If CPU::CpuMultiMediaFeatures\SSE4_2
;     Debug "SSE4.2 is supported"
;   EndIf
  
EndModule

CompilerIf #PB_Compiler_IsMainFile    
  ;- ----------------------------------------------------------------------
  ;- TEST-CODE
  ;- ----------------------------------------------------------------------
  
  EnableExplicit

  UseModule StrSSE
  
  Define sTest.s, sTest2.s, sASC.s
  Define sDbg.s
  Define I
  
  Dim bChar.b(255)  ; ASCII CHAR Array
  
  For I=0 To 98     ; Fill Char Array with 100 Ascii Chars
    bChar(i) = 33+I
  Next  
  
  Debug "--------------------------------------------------"
  Debug "String Len"
  Debug "--------------------------------------------------"

  sTest= Space(255)   ; Fill TestString with 255 Spaces
  
  sDbg= "PB: Len() = "  + Len(sTest) ; should be 255
  Debug sDbg
  sDbg = "SSE Len = " + Str(SSE_Len(@sTest)) ; should be 255
  Debug sDbg
  sDbg = "ASCII Len() = " + Str(SSE_LenA(@bChar(0)))  ; should be 100 Chars
  Debug sDbg
  
  ;- ----------------------------------------------------------------------

  Define.s S0, S1, S2, sQ
  Define ret
  Dim cmp.s(2)
  
  cmp(0) = "<"
  cmp(1) = "="
  cmp(2) = ">"
  sQ.s = Chr('"') ; Quotes
       ;1        10                                    48
  S0 = "Ich bin ein langer String, in welchem man nach 1234 suchen kann 5677"
  S1 = "Ich bin ein langer String, in welchem man nach 1234 suchen kann 5678"
  S2 = "Ich bin ein langer String, in welchem man nach 1234 suchen kann 5679"
   
  Debug "--------------------------------------------------"
  Debug "StringCompare"
  Debug "--------------------------------------------------"
  ;Debug ""
  ret = SSE_StringCompare(@S0, @S1)   ; =
  Debug ret
  Debug sQ + S0 + sQ + "  " + cmp(ret+1) + "  " + sQ + S1 + sQ
  
  ret = SSE_StringCompare(@S0, @S2)   ; <
  Debug ret
  Debug sQ + S0 + Sq + "  " + cmp(ret+1) +  "  " + sQ + S2 + sQ
  
  ret = SSE_StringCompare(@S2, @S1)   ; <
  Debug ret
  Debug sQ + S2 + sQ + "  " + cmp(ret+1) +  "  " + sQ + S1 + sQ
  
  Debug "--------------------------------------------------"
  Debug "FindString"
  Debug "--------------------------------------------------"
  ;Debug ""
  
  Define Search$ 
  Search$ = "1234"
  ;Search$ = "bin"
  ret = SSE_FindStr(@S0, @Search$)
  Debug ret
  
  Debug "--------------------------------------------------"

  ; ----------------------------------------------------------------------
  ; TIMING TEST
  ; ----------------------------------------------------------------------
  
  #cst_Loops = 2000000  ; 2Mio
  
  Define T1, T2, txtStrLen.s, txtStrCompare.s
  
  Debug "Stringlength"
  Debug Str(@S1 % 32) + " : " + Hex(@S1)
  Debug Str(@S2 % 16) + " : " + Hex(@S2)
  
  ; ----------------    StringLength ----------------------
  ; SSE Assembler Version
  ; S1 = Space(15000)
  
  T1 = ElapsedMilliseconds()
  For I = 1 To #cst_Loops
    ret = SSE_Len(@S1) 
  Next
  T1 = ElapsedMilliseconds() - T1
  
  ; Standard PB StringLenth
  T2 = ElapsedMilliseconds()
  For I = 1 To #cst_Loops
    ;ret = Len(S1)
    ret = MemoryStringLength(@S1)
  Next
  T2 = ElapsedMilliseconds() - T2
  
  txtStrLen = "StringLength  " + #cst_Loops + " Calls : ASM SSE = " + T1 + " / " + "PB Version = " + T2
  
  
  ; ----------------    StringCompare ----------------------
  
  ; SSE Assembler Version
  T1 = ElapsedMilliseconds()
  For I = 1 To #cst_Loops
    ret = SSE_StringCompare(@S1, @S2)
  Next
  T1 = ElapsedMilliseconds() - T1
  
  ; Standard PB StringLenth
  T2 = ElapsedMilliseconds()
  For I = 1 To #cst_Loops
    ret = CompareMemoryString(@S1, @S2)
  Next
  T2 = ElapsedMilliseconds() - T2
  
  txtStrCompare = "StringCompare " + #cst_Loops + " Calls : ASM SSE = " + T1 + " / " + "PB Version = " + T2
  
  MessageRequester("Timing results", txtStrLen + #CRLF$ + txtStrCompare)
  
CompilerEndIf

here the needed Assembler Macros for Push Registers
PbFw_ASM_Macros.pbi

Code: Select all

; ===========================================================================
;  FILE : PbFw_ASM_Macros.pbi
;  NAME : Collection of Assembler optimation Macros
;  DESC : Macros for PUSH, POP MMX and XMM Registers when using
;  DESC : in PB Procedures
;  DESC : The macros don't work in out of Procedure code becuase
;  DESC : the in Procedure Assembler variable convetion is used
;  DESC : [p.p_] [p.v_] for Pointers / Variables
;  DESC : For use outside of Procedures you have to use [p_] [v_]
; ===========================================================================
;
; AUTHOR   :  Stefan Maag
; DATE     :  2024/02/04
; VERSION  :  0.5 Developer Version
; COMPILER :  PureBasic 6.0
;
; LICENCE  :  MIT License see https://opensource.org/license/mit/
;             or \PbFramWork\MitLicence.txt
; ===========================================================================
; ChangeLog: 
;{ 
;}
;{ TODO:
;}
; ===========================================================================


; ------------------------------
; MMX and SSE Registers
; ------------------------------
; MM0..MM7    :  MMX    : Pentium P55C (Q5 1995) and AMD K6 (Q2 1997)
; XMM0..XMM15 :  SSE    : Intel Core2 and AMD K8 Athlon64 (2003)
; YMM0..YMM15 :  AVX256 : Intel SandyBridge (Q1 2011) and AMD Bulldozer (Q4 2011)
; X/Y/ZMM0..31 : AVX512 : Tiger Lake (Q4 2020) and AMD Zen4 (Q4 2022)

; ------------------------------
; Caller/callee saved registers
; ------------------------------
; The x64 ABI considers the registers RAX, RCX, RDX, R8, R9, R10, R11, And XMM0-XMM5 volatile.
; When present, the upper portions of YMM0-YMM15 And ZMM0-ZMM15 are also volatile. On AVX512VL;
; the ZMM, YMM, And XMM registers 16-31 are also volatile. When AMX support is present, 
; the TMM tile registers are volatile. Consider volatile registers destroyed on function calls
; unless otherwise safety-provable by analysis such As whole program optimization.
; The x64 ABI considers registers RBX, RBP, RDI, RSI, RSP, R12, R13, R14, R15, And XMM6-XMM15 nonvolatile.
; They must be saved And restored by a function that uses them. 

;https://learn.microsoft.com/en-us/cpp/build/x64-software-conventions?view=msvc-170
; x64 calling conventions - Register use

; ------------------------------
; x64 CPU Register
; ------------------------------
; RAX 	    Volatile 	    Return value register
; RCX 	    Volatile 	    First integer argument
; RDX 	    Volatile 	    Second integer argument
; R8 	      Volatile 	    Third integer argument
; R9 	      Volatile 	    Fourth integer argument
; R10:R11 	Volatile 	    Must be preserved As needed by caller; used in syscall/sysret instructions
; R12:R15 	Nonvolatile 	Must be preserved by callee
; RDI 	    Nonvolatile 	Must be preserved by callee
; RSI 	    Nonvolatile 	Must be preserved by callee
; RBX 	    Nonvolatile 	Must be preserved by callee
; RBP 	    Nonvolatile 	May be used As a frame pointer; must be preserved by callee
; RSP 	    Nonvolatile 	Stack pointer
; ------------------------------
; MMX-Register
; ------------------------------
; MM0:MM7   Nonvolatile   Registers shared with FPU-Register. An EMMS Command is necessary after MMX-Register use
;                         to enable correct FPU functions again. 
; ------------------------------
; SSE Register
; ------------------------------
; XMM0, YMM0 	Volatile 	  First FP argument; first vector-type argument when __vectorcall is used
; XMM1, YMM1 	Volatile 	  Second FP argument; second vector-type argument when __vectorcall is used
; XMM2, YMM2 	Volatile 	  Third FP argument; third vector-type argument when __vectorcall is used
; XMM3, YMM3 	Volatile 	  Fourth FP argument; fourth vector-type argument when __vectorcall is used
; XMM4, YMM4 	Volatile 	  Must be preserved As needed by caller; fifth vector-type argument when __vectorcall is used
; XMM5, YMM5 	Volatile 	  Must be preserved As needed by caller; sixth vector-type argument when __vectorcall is used
; XMM6:XMM15, YMM6:YMM15 	Nonvolatile (XMM), Volatile (upper half of YMM) 	Must be preserved by callee. YMM registers must be preserved As needed by caller.


; @f = Jump forward to next @@;  @b = Jump backward to next @@  
; .Loop:      ; is a local Label or SubLable. It works form the last global lable
; The PB compiler sets a global label for each Procedure, so local lables work only inside the Procedure

; ------------------------------
; Some important SIMD instructions 
; ------------------------------
; https://hjlebbink.github.io/x86doc/

; PAND, POR, PXOR, PADD ...  : SSE2
; PCMPEQW         : SSE2  : Compare Packed Data for Equal
; PSHUFLW         : SSE2  : Shuffle Packed Low Words
; PSHUFHW         : SSE2  : Shuffle Packed High Words
; PSHUFB          : SSE3  : Packed Shuffle Bytes
; PEXTR[B/W/D/Q]  : SSE4.1 : PEXTRB RAX, XMM0, 1 : loads Byte 1 of XMM0[Byte 0..7] 
; PINSR[B/W/D/Q]  : SSE4.1 : PINSRB XMM0, RAX, 1 : transfers RAX LoByte to Byte 1 of XMM0 
; PCMPESTRI       : SSE4.2 : Packed Compare Implicit Length Strings, Return Index
; PCMPISTRM       : SSE4.2 : Packed Compare Implicit Length Strings, Return Mask


;- ----------------------------------------------------------------------
;- NaN Value 32/64 Bit
; #Nan32 = $FFC00000            ; Bit representaion for the 32Bit Float NaN value
; #Nan64 = $FFF8000000000000    ; Bit representaion for the 64Bit Float NaN value
;  ----------------------------------------------------------------------

; ----------------------------------------------------------------------
;  Structures to reserve Space on the Stack for ASM_PUSH, ASM_POP
; ----------------------------------------------------------------------

Structure TStack_16Byte
  R.q[2]  
EndStructure

Structure TStack_32Byte
  R.q[4]  
EndStructure

Structure TStack_48Byte
  R.q[6]  
EndStructure

Structure TStack_64Byte
  R.q[8]  
EndStructure

Structure TStack_96Byte
  R.q[12]  
EndStructure

Structure TStack_128Byte
  R.q[16]  
EndStructure

Structure TStack_256Byte
  R.q[32]  
EndStructure

Structure TStack_512Byte
  R.q[64]  
EndStructure

;- ----------------------------------------------------------------------
;- CPU Registers
;- ----------------------------------------------------------------------

; seperate Macros for EBX,RBX because this is often needed expecally for x32
Macro ASM_PUSH_EBX()
  Protected mEBX
  !MOV [p.v_mEBX], EBX
EndMacro

Macro ASM_POP_EBX()
   !MOV EBX, [p.v_mEBX]
EndMacro

Macro ASM_PUSH_RBX()
  Protected mRBX
  !MOV [p.v_mRBX], RBX
EndMacro

Macro ASM_POP_RBX()
   !MOV RBX, [p.v_mRBX]
EndMacro
 
; The LEA instruction: LoadEffectiveAddress of a variable
Macro ASM_PUSH_R10to11(ptrREG)
  Protected R1011.TStack_16Byte
  !LEA ptrREG, [p.v_R1011]        ; RDX = @R1011 = Pionter to RegisterBackupStruct
  !MOV [ptrREG], R10
  !MOV [ptrREG+8], R11
EndMacro

Macro ASM_POP_R10to11(ptrREG)
  !LEA ptrREG, [p.v_R1011]        ; RDX = @R1011 = Pionter to RegisterBackupStruct
  !MOV R10, [ptrREG]
  !MOV R11, [ptrREG+8]
 EndMacro

Macro ASM_PUSH_R12to15(ptrREG)
  Protected R1215.TStack_32Byte
  !LEA ptrREG, [p.v_R1215]        ; RDX = @R1215 = Pionter to RegisterBackupStruct
  !MOV [ptrREG], R12
  !MOV [ptrREG+8], R13
  !MOV [ptrREG+16], R14
  !MOV [ptrREG+24], R15
EndMacro

Macro ASM_POP_R12to15(ptrREG)
  !LEA ptrREG, [p.v_R1215]        ; RDX = @R1215 = Pionter to RegisterBackupStruct
  !MOV R12, [ptrREG]
  !MOV R13, [ptrREG+8]
  !MOV R14, [ptrREG+16]
  !MOV R15, [ptrREG+24]
 EndMacro
 
;- ----------------------------------------------------------------------
;- MMX Registers
;- ----------------------------------------------------------------------

; All MMX-Registers are non volatile (shard with FPU-Reisters)
; After the end of use of MMX-Regiters an EMMS Command mus follow to enable
; correct FPU operations again!

Macro ASM_PUSH_MM_0to3(ptrREG)
  Protected M03.TStack_32Byte
  !LEA ptrREG, [p.v_M03]          ; RDX = @M03 = Pionter to RegisterBackupStruct 
  !MOVQ [ptrREG], MM0
  !MOVQ [ptrREG+8], MM1
  !MOVQ [ptrREG+16], MM2
  !MOVQ [ptrREG+24], MM3
EndMacro

Macro ASM_POP_MM_0to3(ptrREG)
  !LEA ptrREG, [p.v_M03]          ; RDX = @M03 = Pionter to RegisterBackupStruct  
  !MOVQ MM0, [ptrREG]
  !MOVQ MM1, [ptrREG+8]
  !MOVQ MM2, [ptrREG+16]
  !MOVQ MM3, [ptrREG+24]
EndMacro

Macro ASM_PUSH_MM_4to5(ptrREG)
  Protected M45.TStack_32Byte
  !LEA ptrREG, [p.v_M45]          ; RDX = @M47 = Pionter to RegisterBackupStruct 
  !MOVQ [ptrREG], MM4
  !MOVQ [ptrREG+8], MM5
EndMacro

Macro ASM_POP_MM_4to5(ptrREG)
  !LEA ptrREG, [p.v_M45]          ; RDX = @M47 = Pionter to RegisterBackupStruct  
  !MOVQ MM4, [ptrREG]
  !MOVQ MM5, [ptrREG+8]
EndMacro

Macro ASM_PUSH_MM_4to7(ptrREG)
  Protected M47.TStack_32Byte
  !LEA ptrREG, [p.v_M47]          ; RDX = @M47 = Pionter to RegisterBackupStruct 
  !MOVQ [ptrREG], MM4
  !MOVQ [ptrREG+8], MM5
  !MOVQ [ptrREG+16], MM6
  !MOVQ [ptrREG+24], MM7
EndMacro

Macro ASM_POP_MM_4to7(ptrREG)
  !LEA ptrREG, [p.v_M47]          ; RDX = @M47 = Pionter to RegisterBackupStruct  
  !MOVQ MM4, [ptrREG]
  !MOVQ MM5, [ptrREG+8]
  !MOVQ MM6, [ptrREG+16]
  !MOVQ MM7, [ptrREG+24]
EndMacro

;- ----------------------------------------------------------------------
;- XMM Registers
;- ----------------------------------------------------------------------

; because of unaligend Memory latency we use 2x64 Bit MOV instead of 1x128 Bit MOV
; MOVDQU [ptrREG], XMM4 -> MOVLPS [ptrREG], XMM4  and  MOVHPS [ptrREG+8], XMM4
; x64 Prozessor can do 2 64Bit Memory transfers parallel

;  XMM4:XMM5 normally are volatile and we do not have to preserve it

; ATTENTION: XMM4:XMM5 must be preserved only when __vectorcall is used
; as I know PB don't use __vectorcall in ASM Backend. But if we use it 
; within a Procedure where __vectorcall isn't used. We don't have to preserve.
; So wee keep the Macro empty. If you want to activate, just activate the code.


Macro ASM_PUSH_XMM_4to5(ptrREG) 
EndMacro

Macro ASM_POP_XMM_4to5(ptrREG)
EndMacro

; Macro ASM_PUSH_XMM_4to5(ptrREG)
;   Protected X45.TStack_32Byte
;   !LEA ptrREG, [p.v_X45]          ; RDX = @X45 = Pionter to RegisterBackupStruct 
;   !MOVLPS [ptrREG], XMM4
;   !MOVHPS [ptrREG+8], XMM4 
;   !MOVLPS [ptrREG+16], XMM5
;   !MOVHPS [ptrREG+24], XMM5
; EndMacro

; Macro ASM_POP_XMM_4to5(ptrREG)
;   !LEA ptrREG, [p.v_X45]          ; RDX = @X45 = Pionter to RegisterBackupStruct
;   !MOVLPS XMM4, [ptrREG]
;   !MOVHPS XMM4, [ptrREG+8]  
;   !MOVLPS XMM5, [ptrREG+16]
;   !MOVHPS XMM5, [ptrREG+24]
; EndMacro
; ======================================================================

Macro ASM_PUSH_XMM_6to7(ptrREG)
  Protected X67.TStack_32Byte
  !LEA ptrREG, [p.v_X67]          ; RDX = @X67 = Pionter to RegisterBackupStruct    
  !MOVLPS [ptrREG], XMM6
  !MOVHPS [ptrREG+8], XMM6 
  !MOVLPS [ptrREG+16], XMM7
  !MOVHPS [ptrREG+24], XMM7
EndMacro

Macro ASM_POP_XMM_6to7(ptrREG)
  !LEA ptrREG, [p.v_X67]          ; RDX = @X67 = Pionter to RegisterBackupStruct  
  !MOVLPS XMM6, [ptrREG]
  !MOVHPS XMM6, [ptrREG+8]
  !MOVLPS XMM6, [ptrREG+16]
  !MOVHPS XMM6, [ptrREG+24]  
EndMacro

; Fast LOAD/SAVE XMM-Register; MOVDQU command for 128Bit has long latency.
; 2 x64Bit loads are faster! Processed parallel in 1 cycle with low or 0 latency
; this optimation is token from AMD code optimation guide
Macro ASM_LD_XMMM(REGX, ptrREG)
  !MOVLPS REGX, [ptrREG]
  !MOVHPS REGX, [ptrREG+8]
EndMacro

Macro ASM_SAV_XMMM(REGX, ptrREG)
  !MOVLPS [ptrREG], REGX
  !MOVHPS [ptrREG+8] + REGX 
EndMacro

;- ----------------------------------------------------------------------
;- YMM Registers
;- ----------------------------------------------------------------------

; for YMM 256 Bit Registes we switch to aligned Memory commands.
; YMM needs 256Bit = 32Byte Align. So wee need 32Bytes more Memory for manual
; align it! We have to ADD 32 to the Adress and than clear the lo 5 bits
; to get an address Align 32

; ATTENTION!  When using YMM-Registers we have to preserve only the lo-parts (XMM-Part)
;             The hi-parts are always volatile. So preserving XMM-Registers is enough!
; Use this Macros only if you want to preserve the complete YMM-Registers for your own purpose!
Macro ASM_PUSH_YMM_4to5(ptrREG)
  Protected Y45.TStack_96Byte ; we need 64Byte and use 96 to get Align 32
  ; Aling Adress to 32 Byte, so we can use Aligend MOV VMOVAPD
  !LEA ptrREG, [p.b_Y45]          ; RDX = @Y45 = Pionter to RegisterBackupStruct
  !ADD ptrREG, 32
  !SHR ptrREG, 5
  !SHL ptrREG, 5
  ; Move the YMM Registers to Memory Align 32
  !VMOVAPD [ptrREG], YMM4
  !VMOVAPD [ptrREG+32], YMM5
EndMacro

Macro ASM_POP_YMM_4to5(ptrREG)
  ; Aling Address @Y45 to 32 Byte, so we can use Aligned MOV VMOVAPD
  !LEA ptrREG, [p.v_Y45]          ; RDX = @Y45 = Pionter to RegisterBackupStruct
  !ADD ptrREG, 32
  !SHR ptrREG, 5
  !SHL ptrREG, 5
  ; POP Registers from Stack
  !VMOVAPD YMM4, [ptrREG]
  !VMOVAPD YMM5, [ptrREG+32]
EndMacro

Macro ASM_PUSH_YMM_6to7(ptrREG)
  Protected Y67.TStack_96Byte ; we need 64Byte an use 96 to get Align 32
  ; Aling Adress to 32 Byte, so we can use Aligend MOV VMOVAPD
  !LEA ptrREG, [p.b_Y67]          ; RDX = @Y67 = Pionter to RegisterBackupStruct
  !ADD ptrREG, 32
  !SHR ptrREG, 5
  !SHL ptrREG, 5
  ; Move the YMM Registers to Memory Align 32
  !VMOVAPD [ptrREG], YMM6
  !VMOVAPD [ptrREG+32], YMM7
EndMacro

Macro ASM_POP_YMM_6to7(ptrREG)
  ; Aling Adress @Y67 to 32 Byte, so we can use Aligned MOV VMOVAPD
  !LEA ptrREG, [p.v_Y67]          ; RDX = @Y67 = Pionter to RegisterBackupStruct
  !ADD ptrREG, 32
  !SHR ptrREG, 5
  !SHL ptrREG, 5
  ; POP Registers from Stack
  !VMOVAPD YMM6, [ptrREG]
  !VMOVAPD YMM7, [ptrREG+32]
EndMacro

Re: SEE String functions! For Testing!

Posted: Thu May 23, 2024 3:37 pm
by jassing
Nice work! Thanks.

Re: SEE String functions! For Testing!

Posted: Thu Aug 01, 2024 8:58 am
by SMaag
Update solving the 16Byte align problem! see orgingal post!

When using SSE Command PCMPISTRI sthe string should be aligned to 16Byte because PCMPISTRI is reading 16Byte wise.
This is a problem in one case:
If we read over the end of a memory page (what is 4k at x64 OS, 256 Byte at x32 OS).
This will cause a memory exception if the follwing page is not assigned to our program.
Well, this will be a very seldom case. But it can happen and it will happen!

So we have to be sure we do not read over the end of a memory page not assigned to our program.

If we start reading with PCMPISTRI at a 16Byte aligned adress and we are at the end of a memory page, we will read exactly to the end of page with PCMPISTRI . If the string continue in the next page, it is assigned to our program and the next read will start at the beginning of that page.
In thes case we will never read with PCMPISTRI in 2 pages simultan. -> It will not crash!

The way I soled it:
Test the Address to 16Byte align.
If not aligend: Read Charcter wise und test until we reach 16Byte aligned adress!
Now use PCMPISTRI