LenStr(), StringCompare, FindStr
The practical benefit is small at the moment, because PB already has built-in functions for this!
It is more a proof of concept! The benefit is the speed compared to the built-in PB functions,
but only under very heavy use, like parsing very long text files.
See the previous discussion here: https://www.purebasic.fr/english/viewtopic.php?t=82158
What I'm interested in is how to do the same in the C backend with the intrinsic macros.
Here is the Intel Intrinsics Guide for the SSE4.2 functions:
https://www.intel.com/content/www/us/en ... expand=924
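As a starting point for the C-backend question, here is a minimal sketch of how the 2-byte-character length function might look with the SSE4.2 intrinsics from `<nmmintrin.h>`. The names `len16`, `len16_sse42` and `len16_scalar` are made up for this example, and the `target` attribute and `__builtin_cpu_supports` are GCC/Clang specifics that I assume are available; I have not tried this inside PureBasic's C backend.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Portable fallback: plain character-by-character scan. */
static size_t len16_scalar(const uint16_t *s)
{
    const uint16_t *p = s;
    while (*p != 0) ++p;
    return (size_t)(p - s);
}

#if (defined(__x86_64__) || defined(__i386__)) && defined(__GNUC__)
#include <nmmintrin.h>  /* SSE4.2 string intrinsics, e.g. _mm_cmpistri */

/* Hypothetical C counterpart of SSE_Len: length in 2-byte characters. */
__attribute__((target("sse4.2")))
static size_t len16_sse42(const uint16_t *s)
{
    const uint16_t *p = s;
    assert(((uintptr_t)s & 1u) == 0);   /* 2-byte characters must be 2-byte aligned */

    /* Scalar loop until p is 16-byte aligned, so no load can cross a page. */
    while (((uintptr_t)p & 15u) != 0) {
        if (*p == 0) return (size_t)(p - s);
        ++p;
    }

    const __m128i zero = _mm_setzero_si128();
    for (;;) {
        __m128i chunk = _mm_load_si128((const __m128i *)p);
        /* Same mode as 0001001b in the assembler code: EQUAL_EACH + unsigned words.
           The result is the index of the first null word, or 8 if there is none. */
        int idx = _mm_cmpistri(zero, chunk, _SIDD_UWORD_OPS | _SIDD_CMP_EQUAL_EACH);
        if (idx < 8) return (size_t)(p - s) + (size_t)idx;
        p += 8;                          /* next 16 bytes = 8 characters */
    }
}

static size_t len16(const uint16_t *s)
{
    /* Runtime check, because PCMPISTRI needs SSE4.2. */
    return __builtin_cpu_supports("sse4.2") ? len16_sse42(s) : len16_scalar(s);
}
#else
static size_t len16(const uint16_t *s) { return len16_scalar(s); }
#endif
```

The aligned 16-byte load may read a few bytes behind the terminating null (but never past the current 16-byte block), which is the same behaviour as the assembler version.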
Here is the code, if anyone wants to test it!
Update: 2024/08/01 - added a 16-byte alignment test to prevent reading past the end of a memory page!
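The reason for the alignment test can be shown with a little address arithmetic. The sketch below assumes a page size of 4096 bytes (as the code comments do) and uses hypothetical helper names `crosses_page` and `bytes_to_align16`; it only calculates with addresses and does not touch any memory.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum { PAGE_BYTES = 4096, VEC_BYTES = 16 };  /* assumed page and vector sizes */

/* True if a 16-byte load starting at addr would extend into the next page. */
static bool crosses_page(uintptr_t addr)
{
    return (addr % PAGE_BYTES) > (uintptr_t)(PAGE_BYTES - VEC_BYTES);
}

/* Number of bytes to process one by one before addr is 16-byte aligned. */
static size_t bytes_to_align16(uintptr_t addr)
{
    size_t n = (size_t)((VEC_BYTES - (addr % VEC_BYTES)) % VEC_BYTES);
    /* An aligned 16-byte load never crosses a page: 4096 is a multiple of 16. */
    assert(!crosses_page(addr + n));
    return n;
}
```

For example, a load at page offset 4088 covers bytes 4088..4103 and crosses into the next page, while a load at the aligned offset 4080 ends exactly at byte 4095.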
; SEE FastString at PB Forum from 2012: https://www.purebasic.fr/english/viewtopic.php?p=375376#p375376
; ===========================================================================
; FILE : PbFw_Module_StringSSE.pb
; NAME : PureBasic Framework : Module String SSE [StrSSE::]
; DESC : using the MMX, SSE Registers, to speed up String operations
; DESC : CPU SSE4.2 support is needed
; DESC :
; SOURCES: https://en.wikibooks.org/wiki/X86_Assembly/SSE#The_Four_Instructions
; https://en.wikibooks.org/wiki/X86_Assembly/SSE
; https://www.strchr.com/strcmp_and_strlen_using_sse_4.2
; ===========================================================================
;
; AUTHOR : Stefan Maag
; DATE : 2022/12/04
; VERSION : 0.53 Developer Version
; COMPILER : PureBasic 6.0
;
; LICENCE : MIT License see https://opensource.org/license/mit/
; or \PbFramWork\MitLicence.txt
; ===========================================================================
;{ ChangeLog:
; 2024/03/01 S.Maag : 16 Byte align check and manually align unaligned Strings
; 2024/01/09 S.Maag : SSE_StringCompare: now return -1,0,1 instead of char difference
; to be compatible with PB Command CompareMemoryString().
; Tested and bugfixed FindStr
; 2023/07/31 S.Maag : SSE_StringCompare: compare Bug fixed
;}
;{ TODO: For the C-Backend; add functions using SSE
; here the link to the Intel Intrinsics Guide for SSE4.2
; https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#ssetechs=SSE4_2&ig_expand=924
; Implement all the SSE optimizations for the C-Backend on x86
; probably by using the C intrinsic Macros
;}
; ===========================================================================
;{ Description PCmpIStrI
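; PCmpIStrI arg1, arg2, IMM8 ; ATTENTION: PCmpIStrI/M always reads 16 Bytes from a memory operand.
; The instruction itself accepts unaligned memory, but only 16-Byte aligned reads are safe at the end of a memory page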
; modified Flags
; CF is reset If IntRes2 is zero, set otherwise
; ZF is set If a null terminating character is found in arg2, reset otherwise
; SF is set If a null terminating character is found in arg1, reset otherwise
; OF is set To IntRes2[0]
; AF is reset
; PF is reset
; ----------------------------------------------------------------------
; IMM8[1:0] specifies the format of the 128-bit source data
; ----------------------------------------------------------------------
; 00b unsigned bytes(16 packed unsigned bytes)
; 01b unsigned words(8 packed unsigned words)
; 10b signed bytes(16 packed signed bytes)
; 11b signed words(8 packed signed words)
; ----------------------------------------------------------------------
; IMM8[3:2] specifies the aggregation operation whose result will
; be placed in intermediate result 1, which we will refer to
; as IntRes1. The size of IntRes1 will depend on the format
; of the source Data, 16-bit for packed bytes and
; 8-bit For packed words:
; ----------------------------------------------------------------------
; 00b Equal Any, arg1 is a character set, arg2 is the string to search in.
; IntRes1[i] is set To 1 If arg2[i] is in the set represented by arg1
;
; arg1 = "aeiou"
; arg2 = "Example string 1"
; IntRes1 = 0010001000010000
; 01b Ranges, arg1 is a set of character ranges i.e. "09az" means all
; characters from 0 To 9 And from a To z, arg2 is the string To search over.
; IntRes1[i] is set To 1 If arg2[i] is in any of the ranges represented by arg1
;
; arg1 = "09az"
; arg2 = "Testing 1 2 3, T"
; IntRes1 = 0111111010101000
; 10b Equal Each, arg1 is string one and arg2 is string two.
; IntRes1[i] is set To 1 If arg1[i] == arg2[i]
;
; arg1 = "The quick brown "
; arg2 = "The quack green "
; IntRes1 = 1111110111010011
; 11b Equal Ordered, arg1 is a substring string to search for, arg2 is the
; string To search within. IntRes1[i] is set To 1 If the substring arg1
; can be found at position arg2[i]:
; arg1 = "he"
; arg2 = ", he helped her "
; IntRes1 = 0010010000001000
; ----------------------------------------------------------------------
; IMM8[5:4] specifies the polarity or the processing of IntRes1, into
; intermediate result 2, which will be referred To As IntRes2
; ----------------------------------------------------------------------
; 00b Positive Polarity IntRes2 = IntRes1
; 01b Negative Polarity IntRes2 = -1 XOr IntRes1
; 10b Masked Positive IntRes2 = IntRes1
; 11b Masked Negative IntRes2 = IntRes1 If reg/mem[i] is invalid Else ~IntRes1
; ----------------------------------------------------------------------
; IMM8[6] specifies the output selection, or how IntRes2 will be processed
; into the output. For PCMPESTRI And PCMPISTRI, the output is an
; index into the Data currently referenced by arg2
; ----------------------------------------------------------------------
; 0b Least Significant Index ECX contains the least significant set bit in IntRes2
; 1b Most Significant Index ECX contains the most significant set bit in IntRes2
; ----------------------------------------------------------------------
; IMM8[6] For PCMPESTRM and PCMPISTRM, the output is a mask reflecting
; all the set bits in IntRes2
; ----------------------------------------------------------------------
; 0b Bit Mask, the least significant bits
; of XMM0 contain the IntRes2 16(8) bit mask.
; XMM0 is zero extended To 128-bits.
; 1b Byte/Word Mask, XMM0 contains IntRes2 expanded into byte/word mask
; ----------------------------------------------------------------------
; EQUAL_ANY = 0000b
; RANGES = 0100b
; EQUAL_EACH = 1000b
; EQUAL_ORDERED = 1100b
; NEGATIVE_POLARITY = 010000b
; BYTE_MASK = 1000000b
; FLAGs
; OF : Overflow flag
; SF : Sign flag ; #True if negative
; ZF : Zero flag ; #True if zero
; AF: Auxiliary (carry) flag
; PF: Parity flag
; CF: Carry flag
;}
;{ ----------------------------------------------------------------------
; The Problem of 16 Byte operations on lower aligned memory
; ----------------------------------------------------------------------
; If we process 16 Bytes on lower aligned memory we may read past the
; end of a memory page when the end of the String is located in the last bytes
; of the memory page and the following page is not allocated to our process.
; Yes, this will happen very seldom, but it can happen. So it is a source of
; crashes that may show up years in the future. Because it can happen, it will happen!
; It is only a question of time!
; A memory page in x64 Systems is 4096 Bytes
; We look on a 8 Byte aligned String at the end of memory page to show the problem
; a String followed by a NullChar and a further NullChar then the page ends
; EndOfString at Byte 4092..93 and a void 00
; | ..... 'I am a String at the end of a memory page' 0000|
; if we process 16 Bytes at 8 Byte align starting at Byte 4088 we read until Byte 4103
; we read 8 Bytes into the next page. Now it will crash if the next page is not
; allocated to our process! We can use 16 Byte PCMPISTRI operation
; only if we are not at the end of a memory page or we have a 16 Byte aligned memory.
;}
;- ----------------------------------------------------------------------
;- Include Files
; ----------------------------------------------------------------------
DeclareModule StrSSE
EnableExplicit
; ----------------------------------------------------------------------
;- DECLARE
;- ----------------------------------------------------------------------
Declare.i SSE_LenA(*String)
Declare.i SSE_Len(*String)
Declare.i SSE_StringCompare(*String1, *String2, *Pos=0)
Declare.i SSE_FindStr(*String, *StringToFind)
EndDeclareModule
Module StrSSE
IncludeFile "PbFw_ASM_Macros.pbi"
#EQUAL_ANY = %0000
#RANGES = %0100
#EQUAL_EACH = %1000
#EQUAL_ORDERED = %1100
#NEGATIVE_POLARITY = %0010000
#BYTE_MASK = %1000000
Structure pChar ; virtual CHAR-ARRAY, used as Pointer to overlay on strings
a.a[0] ; fixed ARRAY Of CHAR Length 0
c.c[0]
EndStructure
;- ----------------------------------------------------------------------
;- Module Public
;- ----------------------------------------------------------------------
Procedure.i SSE_LenA(*String)
; ============================================================================
; NAME: SSE_LenA
; DESC: Length in number of characters of Ascii Strings
; DESC: Use SSE PCmpIStrI operation. This is approx. 3 times faster than PB Len()
; VAR(*String): Pointer to String 1
; RET.i: Number of Characters
; ============================================================================
; ATTENTION: PCmpIStrI always reads 16 Bytes, so we only use it on 16-Byte aligned Memory
; If memory isn't aligned we have to align it manually
; by processing the unaligned bytes in the classic way and
; start with PCmpIStrI at the aligned position.
; The problem of unaligned reading is the end of a memory page (4096 Bytes)
; if the following page is not allocated by our process:
; a memory exception occurs because we cannot read the memory of another process.
; IMM8[1:0] = 00b
; Src data is unsigned bytes(16 packed unsigned bytes)
; IMM8[3:2] = 10b
; We are using Equal Each aggregation
; IMM8[5:4] = 00b
; Positive Polarity, IntRes2 = IntRes1
; IMM8[6] = 0b
; ECX contains the least significant set bit in IntRes2
;
; XMM0 XMM1 XMM2 XMM3 XMM4
; XMM1 = [String1] : XMM2=[String2] : XMM3=WideCharMask
DisableDebugger
CompilerIf #PB_Compiler_64Bit
!XOR RDX, RDX ; RDX = 0
!XOR RCX, RCX ; RCX = 0
!MOV RAX, [p.p_String] ; RAX = *String
!@@: ;
!TEST RAX, 0Fh ; Test for 16Byte align
!JZ @f ; If NOT aligned
!MOV DL, BYTE[RAX] ; process Char by Char until aligned
!TEST RDX, RDX ; Check for EndOfString
!JZ .Return ; Break if EndOfString
!INC RAX ; Pointer to NextChar
!JMP @b ; Jump back to @@
!@@: ; from here we have 16Byte aligned address
!PXOR XMM0, XMM0
!SUB RAX, 16
!@@:
!ADD RAX, 16
!PCMPISTRI XMM0, [RAX], 0001000b ; EQUAL_EACH, unsigned_Bytes
!JNZ @b
; ECX will contain the offset from eax where the first null
; terminating character was found.
!ADD RAX, RCX
!.Return:
!SUB RAX, [p.p_String]
ProcedureReturn
CompilerElse
!XOR EDX, EDX ; EDX = 0
!XOR ECX, ECX ; ECX = 0
!MOV EAX, [p.p_String] ; EAX = *String
!@@: ;
!TEST EAX, 0Fh ; Test for 16Byte align
!JZ @f ; If NOT aligned
!MOV DL, BYTE[EAX] ; process Char by Char until aligned
!TEST EDX, EDX ; Check for EndOfString
!JZ .Return ; Break if EndOfString
!INC EAX ; Pointer to NextChar
!JMP @b ; Jump back to @@
!@@: ; from here we have 16Byte aligned address
!PXOR XMM0, XMM0
!SUB EAX, 16
!@@:
!ADD EAX, 16
!PCMPISTRI XMM0, [EAX], 0001000b ; EQUAL_EACH, unsigned_Bytes
!JNZ @b
; ECX will contain the offset from eax where the first null
; terminating character was found.
!ADD EAX, ECX
!.Return:
!SUB EAX, [p.p_String]
ProcedureReturn
CompilerEndIf
EnableDebugger
EndProcedure
Procedure.i SSE_Len(*String)
; ============================================================================
; NAME: SSE_Len
; DESC: Length in number of characters of 2-Byte Char Strings
; DESC: Use SSE PCmpIStrI operation. This is approx. 3 times faster than PB Len()
; VAR(*String): Pointer to String
; RET.i: Number of Characters
; ============================================================================
; IMM8[1:0] = 01b
; Src data is unsigned words(8 packed unsigned words)
; IMM8[3:2] = 10b
; We are using Equal Each aggregation
; IMM8[5:4] = 00b
; Positive Polarity, IntRes2 = IntRes1
; IMM8[6] = 0b
; ECX contains the least significant set bit in IntRes2
; XMM0 XMM1 XMM2 XMM3 XMM4
; XMM1 = [String1] : XMM2=[String2] : XMM3=WideCharMask
DisableDebugger
CompilerIf #PB_Compiler_Backend = #PB_Backend_Asm
CompilerIf #PB_Compiler_64Bit
!XOR RDX, RDX
!XOR RCX, RCX
!MOV RAX, [p.p_String]
!@@:
!TEST RAX, 0Fh ; Test for 16Byte align
!JZ @f ; If NOT aligned
!MOV DX, WORD [RAX] ; process Char by Char until aligned
!TEST RDX, RDX ; Check for EndOfString
!JZ .Return ; Break if EndOfString
!ADD RAX, 2 ; Pointer to NextChar (2-Byte Chars)
!JMP @b ; Jump back to @@
!@@: ; from here we have 16Byte aligned address
!PXOR XMM0, XMM0
!SUB RAX, 16
!@@:
!ADD RAX, 16
!PCMPISTRI XMM0, [RAX], 0001001b ; EQUAL_EACH WORD
!JNZ @b
; RCX will contain the offset from RAX where the first null
; terminating character was found.
!SHL RCX, 1 ; Word to Byte
!ADD RAX, RCX
!.Return:
!SUB RAX, [p.p_String]
!SHR RAX, 1 ; ByteCounter to Word
ProcedureReturn
CompilerElse ; #PB_Compiler_32Bit
!XOR EDX, EDX
!XOR ECX, ECX
!MOV EAX, [p.p_String]
!@@:
!TEST EAX, 0Fh ; Test for 16Byte align
!JZ @f ; If NOT aligned
!MOV DX, WORD [EAX] ; process Char by Char until aligned
!TEST EDX, EDX ; Check for EndOfString
!JZ .Return ; Break if EndOfString
!ADD EAX, 2 ; Pointer to NextChar (2-Byte Chars)
!JMP @b ; Jump back to @@
!@@: ; from here we have 16Byte aligned address
!PXOR XMM0, XMM0
!SUB EAX, 16
!@@:
!ADD EAX, 16
!PCMPISTRI XMM0, [EAX], 0001001b ; EQUAL_EACH WORD
!JNZ @b
; ECX will contain the offset from EAX where the first null
; terminating character was found.
!SHL ECX, 1 ; Word to Byte
!ADD EAX, ECX
!.Return:
!SUB EAX, [p.p_String]
!SHR EAX, 1 ; Byte to Word
ProcedureReturn ; EAX
CompilerEndIf
CompilerElse ; #PB_Compiler_Backend = #PB_Backend_C
; *String points directly to the character data, so use MemoryStringLength()
ProcedureReturn MemoryStringLength(*String)
CompilerEndIf
EnableDebugger
EndProcedure
Procedure.i SSE_StringCompare(*String1, *String2, *Pos=0)
; ============================================================================
; NAME: SSE_StringCompare
; DESC: Compares 2 Strings with SSE operation (PCmpIStrI)
; VAR(*String1): Pointer to String 1
; VAR(*String2): Pointer to String 2
; VAR(*Pos): optional Pointer to an Int to get the CharNo which does not match
; RET.i: 0=(S1=S2), 1=(S1>S2), -1=(S1<S2) #PB_String_Lower/Equal/Greater
; ============================================================================
; XMM0 XMM1 XMM2 XMM3 XMM4
; XMM1 = [String1] : XMM2=[String2]
; DisableDebugger
CompilerIf #PB_Compiler_Backend = #PB_Backend_Asm
CompilerIf #PB_Compiler_64Bit
DisableDebugger
; used Registers
; RAX : *String1
; R8 : *String2
; RCX : operating Register
; RDX : operating Register
; ----------------------------------------------------------------------
; Check the *String1 and *String2 align
; The problem of not being aligned to 16 is: reading over the end of a
; memory page. If the next page is not allocated to our program we
; produce a crash! So we have to be sure not to read over the
; EndOfPage if the String ends shortly before. PCMPISTRI processes 16 Bytes,
; so if we use PCMPISTRI at align 16 we can't read over the end of
; String without detecting EndOfString first!
; ----------------------------------------------------------------------
!MOV RAX, [p.p_String1]
!MOV R8, [p.p_String2]
!MOV RCX, RAX ; RCX = *String1
!AND RCX, 0Fh ; Filter the align offset to 16Bytes
!MOV RDX, R8 ; RDX = *String2
!SUB R8, RAX ; R8 = *String2 - *String1
!AND RDX, 0Fh ; Filter the align offset to 16Bytes
!TEST RDX, 1 ; Test for Odd align -> Align 16 not possible
!JNZ .NotAligned
!CMP RDX, RCX ; Test if align of String1 and String2 is identical
!JNE .NotAligned
!TEST RCX, RCX ; Test if it is aligned to 16Bytes (Offset ==0)
!JZ .a16 ; aligned to 16 Bytes, we have to do nothing
; ----------------------------------------------------------------------
; Case I: Not aligned to 16 Bytes but it's possible to align manually
; ----------------------------------------------------------------------
; identical align but not to 16Bytes
!SUB RAX, 2
; so first we compare Char by Char until the Address is 16 Byte aligned
!@@: ; Loop
!ADD RAX, 2
;!ADD R8, 2
!TEST RAX, 0Fh ; (AND RAX, 0Fh) == 0
!JZ .a16 ; Continue at Case III: aligned to 16 Byte
!MOV CX, WORD[RAX]
!CMP CX, WORD[RAX+R8]
!JA .GREATER
!JB .LOWER
; if identical check for EndOfString
!TEST CX, CX ; TEST sets ZF if CX==0
!JZ .EQUAL
!JMP @b ; Not EndOfString -> Repeat Loop
; ----------------------------------------------------------------------
; Case II:
; A completely different align of *String1 and *String2, so it is not
; possible to align both together to 16Byte. In this case we have
; 2 options:
; I) we don't use PCMPISTRI and do a classic Char by Char compare
; II) we have to check EndOfMemPage and use classic Char by Char
; at end of MemoryPages (4096 Bytes). But that's more
; complicated
; ----------------------------------------------------------------------
!.NotAligned: ; Not aligned : a completely different align
!SUB RAX, 2
!@@:
!ADD RAX, 2
!MOV CX, WORD[RAX]
!CMP CX, WORD[RAX+R8]
!JA .GREATER
!JB .LOWER
; if identical check for EndOfString
!TEST CX, CX ; TEST sets ZF if CX==0
!JZ .EQUAL
!JMP @b ; Not EndOfString -> Repeat Loop
; ----------------------------------------------------------------------
; Case III:
; If *String1 and *String2 are aligned to 16Bytes
; we can use PCMPISTRI, which is a 16Byte operation
; ----------------------------------------------------------------------
!.a16:
; R8 holds the difference s2 - s1 (see SUB R8, RAX above). This admittedly looks
; odd, but RAX alone can now index both strings: [RAX] addresses s1 and
; [RAX+R8] addresses s2. If we advance RAX by 16 then:
;
; RAX + 16 = s1 + 16
; RAX + 16 + R8 = s1 + 16 + (s2 - s1) = s2 + 16
;
; therefore RAX points to s1 + 16 and RAX + R8 points to s2 + 16.
; We only need one index, convoluted but effective.
!SUB RAX, 16
!XOR RCX, RCX
!@@:
!ADD RAX, 16
!MOVDQA XMM0, [RAX]
; IMM8[1:0] = 00b
; 00b: Src data is unsigned bytes(16 packed unsigned bytes)
; 01b: Src data is unsigned words( 8 packed unsigned words)
; IMM8[3:2] = 10b
; We are using Equal Each aggregation
; IMM8[5:4] = 01b
; Negative Polarity, IntRes2 = -1 XOR IntRes1
; IMM8[6] = 0b
; ECX contains the least significant set bit in IntRes2
!PCMPISTRI XMM0, [RAX+R8], 0011001b ; EQUAL_EACH + NEGATIVE_POLARITY + UNSIGNED_WORDS
; Loop while ZF=0 and CF=0:
; 1) We find a null in s2 ([RAX+R8]) ZF=1
; 2) We find a char that does not match CF=1
!JA @b ; IF CF=0 And ZF=0
!JC @f ; IF CF=1 : Jump if CF=1, we found a mismatched char
!JMP .EQUAL ; We terminated loop due to a null character i.e. CF=0 and ZF=1 -> The Strings are equal
!@@:
; ECX is the offset (in characters) from the current position where the two strings do not match,
; so copy the non-matching char of *String1 into DX and compare it with the char at the same
; position in *String2. Because of 2-Byte chars we have to convert the Word offset to a Byte offset
!SHL RCX, 1 ; Number of Chars to Address Offset
!ADD RAX, RCX
!MOV DX, WORD[RAX]
; If S1=S2 : Return (0) ; #PB_String_Equal
; If S1>S2 : Return (+) ; #PB_String_Greater
; If S1<S2 : Return (-) ; #PB_String_Lower
!CMP DX, WORD [RAX+R8]
!JA .GREATER
!JB .LOWER
;!JMP .EQUAL
!.EQUAL: ; The Strings are equal
!MOV R8, RAX
!XOR RAX, RAX ; #PB_String_Equal, 0
!JMP @f
!.LOWER: ; String1 < String2
!MOV R8, RAX
!XOR RAX, RAX
!DEC RAX ; #PB_String_Lower, -1
!JMP @f
!.GREATER: ; String1 > String2
!MOV R8, RAX
!XOR RAX, RAX
!INC RAX ; #PB_String_Greater, 1
!@@:
; check for Return of CharNo in Pos
!MOV RDX, [p.p_Pos] ; RDX = *Pos
!TEST RDX, RDX ;
!JZ .return ; If *Pos = 0 Then return
!SUB R8, [p.p_String1]
!SHR R8, 1 ; Byte to Word
!MOV [RDX], R8 ; Pos = CharNo which do not match
!.return:
ProcedureReturn ; RAX
EnableDebugger
CompilerElse ; #PB_Compiler_32Bit
DisableDebugger
; used Registers
; EAX : *String1
; EBX : *String2
; ECX : operating Register
; EDX : operating Register
ASM_PUSH_EBX()
; ----------------------------------------------------------------------
; Check the *String1 and *String2 align
; The problem of not being aligned to 16 is: reading over the end of a
; memory page. If the next page is not allocated to our program we
; produce a crash! So we have to be sure not to read over the
; EndOfPage if the String ends shortly before. PCMPISTRI processes 16 Bytes,
; so if we use PCMPISTRI at align 16 we can't read over the end of
; String without detecting EndOfString first!
; ----------------------------------------------------------------------
!MOV EAX, [p.p_String1]
!MOV EBX, [p.p_String2]
!MOV ECX, EAX ; ECX = *String1
!AND ECX, 0Fh ; Filter the align offset to 16Bytes
!MOV EDX, EBX ; EDX = *String2
!SUB EBX, EAX ; EBX = *String2 - *String1
!AND EDX, 0Fh ; Filter the align offset to 16Bytes
!TEST EDX, 1 ; Test for Odd align -> Align 16 not possible
!JNZ .NotAligned
!CMP EDX, ECX ; Test if align of String1 and String2 is identical
!JNE .NotAligned
!TEST ECX, ECX ; Test if it is aligned to 16Bytes (Offset ==0)
!JZ .a16 ; aligned to 16 Bytes, we have to do nothing
; ----------------------------------------------------------------------
; Case I: Not aligned to 16 Bytes but it's possible to align manually
; ----------------------------------------------------------------------
; identical align but not to 16Bytes
!SUB EAX, 2
; so first we compare Char by Char until the Address is 16 Byte aligned
!@@: ; Loop
!ADD EAX, 2
;!ADD EBX, 2
!TEST EAX, 0Fh ; (AND EAX, 0Fh) == 0
!JZ .a16 ; Continue at Case III: aligned to 16 Byte
!MOV CX, WORD[EAX]
!CMP CX, WORD[EAX+EBX]
!JA .GREATER
!JB .LOWER
; if identical check for EndOfString
!TEST CX, CX ; TEST sets ZF if CX==0
!JZ .EQUAL
!JMP @b ; Not EndOfString -> Repeat Loop
; ----------------------------------------------------------------------
; Case II:
; A completely different align of *String1 and *String2, so it is not
; possible to align both together to 16Byte. In this case we have
; 2 options:
; I) we don't use PCMPISTRI and do a classic Char by Char compare
; II) we have to check EndOfMemPage and use classic Char by Char
; at end of MemoryPages (4096 Bytes). But that's more
; complicated
; ----------------------------------------------------------------------
!.NotAligned: ; Not aligned : a completely different align
!SUB EAX, 2
!@@:
!ADD EAX, 2
!MOV CX, WORD[EAX]
!CMP CX, WORD[EAX+EBX]
!JA .GREATER
!JB .LOWER
; if identical check for EndOfString
!TEST CX, CX ; TEST sets ZF if CX==0
!JZ .EQUAL
!JMP @b ; Not EndOfString -> Repeat Loop
; ----------------------------------------------------------------------
; Case III:
; If *String1 and *String2 are aligned to 16Bytes
; we can use PCMPISTRI, which is a 16Byte operation
; ----------------------------------------------------------------------
!.a16:
; EBX holds the difference s2 - s1 (see SUB EBX, EAX above). This admittedly looks
; odd, but EAX alone can now index both strings: [EAX] addresses s1 and
; [EAX+EBX] addresses s2. If we advance EAX by 16 then:
;
; EAX + 16 = s1 + 16
; EAX + 16 + EBX = s1 + 16 + (s2 - s1) = s2 + 16
;
; therefore EAX points to s1 + 16 and EAX + EBX points to s2 + 16.
; We only need one index, convoluted but effective.
!SUB EAX, 16
!XOR ECX, ECX
!@@:
!ADD EAX, 16
!MOVDQA XMM0, [EAX]
; IMM8[1:0] = 00b
; 00b: Src data is unsigned bytes(16 packed unsigned bytes)
; 01b: Src data is unsigned words( 8 packed unsigned words)
; IMM8[3:2] = 10b
; We are using Equal Each aggregation
; IMM8[5:4] = 01b
; Negative Polarity, IntRes2 = -1 XOR IntRes1
; IMM8[6] = 0b
; ECX contains the least significant set bit in IntRes2
!PCMPISTRI XMM0, [EAX+EBX], 0011001b ; EQUAL_EACH + NEGATIVE_POLARITY + UNSIGNED_WORDS
; Loop while ZF=0 and CF=0:
; 1) We find a null in s2 ([EAX+EBX]) ZF=1
; 2) We find a char that does not match CF=1
!JA @b ; IF CF=0 And ZF=0
!JC @f ; IF CF=1 : Jump if CF=1, we found a mismatched char
!JMP .EQUAL ; We terminated loop due to a null character i.e. CF=0 and ZF=1 -> The Strings are equal
!@@:
; ECX is the offset (in characters) from the current position where the two strings do not match,
; so copy the non-matching char of *String1 into DX and compare it with the char at the same
; position in *String2. Because of 2-Byte chars we have to convert the Word offset to a Byte offset
!SHL ECX, 1 ; Number of Chars to Address Offset
!ADD EAX, ECX
!MOV DX, WORD[EAX]
; If S1=S2 : Return (0) ; #PB_String_Equal
; If S1>S2 : Return (+) ; #PB_String_Greater
; If S1<S2 : Return (-) ; #PB_String_Lower
!CMP DX, WORD [EAX+EBX]
!JA .GREATER
!JB .LOWER
;!JMP .EQUAL
!.EQUAL: ; The Strings are equal
!MOV EBX, EAX
!XOR EAX, EAX ; #PB_String_Equal, 0
!JMP @f
!.LOWER: ; String1 < String2
!MOV EBX, EAX
!XOR EAX, EAX
!DEC EAX ; #PB_String_Lower, -1
!JMP @f
!.GREATER: ; String1 > String2
!MOV EBX, EAX
!XOR EAX, EAX
!INC EAX ; #PB_String_Greater, 1
!@@:
; check for Return of CharNo in Pos
!MOV EDX, [p.p_Pos] ; EDX = *Pos
!TEST EDX, EDX ;
!JZ .return ; If *Pos = 0 Then return
!SUB EBX, [p.p_String1]
!SHR EBX, 1 ; Byte to Word
!MOV [EDX], EBX ; Pos = CharNo which do not match
!.return:
ASM_POP_EBX()
ProcedureReturn ; EAX
EnableDebugger
CompilerEndIf
CompilerElse ; C-Backend
; for now use PB CompareMemoryString. So it will work on other Platforms too.
; maybe provide a C optimized version in the future
ProcedureReturn CompareMemoryString(*String1, *String2)
CompilerEndIf
EndProcedure
Procedure.i SSE_FindStr(*String, *StringToFind)
; ============================================================================
; NAME: SSE_FindStr
; DESC: Try to find StringToFind in String with SSE operation (PCmpIStrI)
; DESC: Search for the needle in the haystack
; DESC: This Function is for 2Byte Character Strings only
; VAR(*String): Pointer to String (Haystack)
; VAR(*StringToFind): Pointer to StringToFind (Needle)
; RET.i: If found: The startposition in Characters [1..n]. Otherwise 0
; ============================================================================
DisableDebugger
; TODO! Solve the 16Byte align problem
CompilerIf #PB_Compiler_Backend = #PB_Backend_Asm
CompilerIf #PB_Compiler_64Bit
Protected memRAX, memRDX
; Returns a pointer To the first occurrence of str2 in str1, Or a null pointer If str2 is Not part of str1.
; The matching process does not include the terminating null-characters, but it stops there
; RAX = haystack, RDX = needle
; XMM0 XMM1 XMM2 XMM3 XMM4
; XMM1 = [String1] : XMM2=[String2]
!MOV RAX, [p.p_String] ; haystack
!MOV RDX, [p.p_StringToFind] ; needle
!MOVDQU XMM2, DQWORD[RDX] ; load the first 16 bytes of needle (String to find)
!SUB RAX, 16 ; Avoid extra jump in main loop
; ----------------------------------------------------------------------
; Find the first possible match of 16-byte fragment in haystack
; ----------------------------------------------------------------------
!FindStr_MainLoop:
!ADD RAX, 16 ; Step up Counter
!MOVDQU XMM1, DQWORD[RAX]
;!PCMPISTRI XMM2, XMM1, 1100b ; EQUAL_ORDERED ; for ASCII Strings
!PCMPISTRI XMM2, XMM1, 1101b ; EQUAL_ORDERED + UNSIGNED_WORDS
; now RCX contains the offset in WORDS where a possible match starts
; Loop while ZF=0 and CF=0:
; 1) We find a null in the haystack chunk (XMM1) ZF=1
; 2) We find a possible start of the needle CF=1
!JA FindStr_MainLoop
; Jump if CF=0: end of haystack reached without a possible match
!JNC FindStr_StrNotFound
; possible match found at WordOffset in RCX
!ADD RCX, RCX ; Word to Byte
!ADD RAX, RCX ; save the possible match start
!MOV [p.v_memRDX], RDX ; mov edi, edx; save RDX
!MOV [p.v_memRAX], RAX ; mov esi, eax; save RAX
; ----------------------------------------------------------------------
; Compare String, at possible match position in haystack, with needle
; ----------------------------------------------------------------------
!SUB RDX, RAX
!SUB RAX, 16 ; counter
!PXOR XMM3, XMM3 ; XMM3 = 0
; compare the strings
!FindStr_Compare:
!ADD RAX, 16 ; Counter
!MOVDQU XMM1, DQWORD[RAX+RDX] ; needle (RDX holds needle - haystack position)
; mask out invalid bytes in the haystack
;!PCMPISTRM XMM3, XMM1, 1011000b ; EQUAL_EACH + NEGATIVE_POLARITY + BYTE_MASK ; for ASCII Strings
!PCMPISTRM XMM3, XMM1, 1011001b ; EQUAL_EACH + NEGATIVE_POLARITY + BYTE_MASK + UNSIGNED_WORDS
; PCMPISTRM writes as result a Mask To XMM0, we used BYTE_MASK
!MOVDQU XMM4, DQWORD[RAX] ; haystack
!PAND XMM4, XMM0
;!PCMPISTRI XMM1, XMM4, 0011000b ; EQUAL_EACH + NEGATIVE_POLARITY ; for ASCII Strings
!PCMPISTRI XMM1, XMM4, 0011001b ; EQUAL_EACH + NEGATIVE_POLARITY + UNSIGNED_WORDS
; Loop while ZF=0 and CF=0:
; 1) We find a null in s1(RDX+RCX) ZF=1 {JA CF=0 & ZF=0} {JE : ZF=1)
; 2) We find a char that does not match CF=1 {JC, JNC}
; 3) We find a null in s2 SF=1 {JS, JNS}
;!JS FindStr_StrNotFound
!JA FindStr_Compare ; CF=0 AND ZF=0
!MOV RDX, [p.v_memRDX]
!MOV RAX, [p.v_memRAX]
!JNC FindStr_StrFound
;!SUB RAX, 15 ; for ASCII Strings
!SUB RAX, 14
!JMP FindStr_MainLoop
!FindStr_StrNotFound:
!XOR RAX, RAX
!JMP FindStr_End
!FindStr_StrFound:
; because RAX contains the Pointer we have to calculate the Char-No.
!SUB RAX, [p.p_String] ; Sub the Haystack Start-Pointer
!SHR RAX, 1 ; Byte to Word: not needed for ASCII Strings
!ADD RAX, 1 ; Add 1 to start with 1 as first Char-No.
!FindStr_End:
ProcedureReturn ; !RAX
CompilerElse ; #PB_Compiler_32Bit
Protected memEAX, memEDX
; Returns a pointer To the first occurrence of str2 in str1, Or a null pointer If str2 is Not part of str1.
; The matching process does not include the terminating null-characters, but it stops there
; EAX = haystack, EDX = needle
; XMM0 XMM1 XMM2 XMM3 XMM4
; XMM1 = [String1] : XMM2=[String2]
!MOV EAX, [p.p_String] ; haystack
!MOV EDX, [p.p_StringToFind] ; needle
!MOVDQU XMM2, DQWORD[EDX] ; load the first 16 bytes of needle (String to find)
!SUB EAX, 16 ; Avoid extra jump in main loop
; ----------------------------------------------------------------------
; Find the first possible match of 16-byte fragment in haystack
; ----------------------------------------------------------------------
!FindStr_MainLoop:
!ADD EAX, 16 ; Step up Counter
!MOVDQU XMM1, DQWORD[EAX]
;!PCMPISTRI XMM2, XMM1, 1100b ; EQUAL_ORDERED ; for ASCII Strings
!PCMPISTRI XMM2, XMM1, 1101b ; EQUAL_ORDERED + UNSIGNED_WORDS
; now ECX contains the offset in WORDS where a possible match starts
; Loop while ZF=0 and CF=0:
; 1) We find a null in the haystack chunk (XMM1) ZF=1
; 2) We find a possible start of the needle CF=1
!JA FindStr_MainLoop
; Jump if CF=0: end of haystack reached without a possible match
!JNC FindStr_StrNotFound
; possible match found at WordOffset in ECX
!ADD ECX, ECX ; Word to Byte
!ADD EAX, ECX ; save the possible match start
!MOV [p.v_memEDX], EDX ; mov edi, edx; save EDX
!MOV [p.v_memEAX], EAX ; mov esi, eax; save EAX
; ----------------------------------------------------------------------
; Compare String, at possible match position in haystack, with needle
; ----------------------------------------------------------------------
!SUB EDX, EAX
!SUB EAX, 16 ; counter
!PXOR XMM3, XMM3 ; XMM3 = 0
; compare the strings
!FindStr_Compare:
!ADD EAX, 16 ; Counter
!MOVDQU XMM1, DQWORD[EAX+EDX] ; needle (EDX holds needle - haystack position)
; mask out invalid bytes in the haystack
;!PCMPISTRM XMM3, XMM1, 1011000b ; EQUAL_EACH + NEGATIVE_POLARITY + BYTE_MASK ; for ASCII Strings
!PCMPISTRM XMM3, XMM1, 1011001b ; EQUAL_EACH + NEGATIVE_POLARITY + BYTE_MASK + UNSIGNED_WORDS
; PCMPISTRM writes as result a Mask To XMM0, we used BYTE_MASK
!MOVDQU XMM4, DQWORD[EAX] ; haystack
!PAND XMM4, XMM0
;!PCMPISTRI XMM1, XMM4, 0011000b ; EQUAL_EACH + NEGATIVE_POLARITY ; for ASCII Strings
!PCMPISTRI XMM1, XMM4, 0011001b ; EQUAL_EACH + NEGATIVE_POLARITY + UNSIGNED_WORDS
; Loop while ZF=0 and CF=0:
; 1) We find a null in s1(EDX+ECX) ZF=1 {JA CF=0 & ZF=0} {JE : ZF=1)
; 2) We find a char that does not match CF=1 {JC, JNC}
; 3) We find a null in s2 SF=1 {JS, JNS}
;!JS FindStr_StrNotFound
!JA FindStr_Compare ; CF=0 AND ZF=0
!MOV EDX, [p.v_memEDX]
!MOV EAX, [p.v_memEAX]
!JNC FindStr_StrFound
;!SUB EAX, 15 ; for ASCII Strings
!SUB EAX, 14
!JMP FindStr_MainLoop
!FindStr_StrNotFound:
!XOR EAX, EAX
!JMP FindStr_End
!FindStr_StrFound:
; because EAX contains the Pointer we have to calculate the Char-No.
!SUB EAX, [p.p_String] ; Sub the Haystack Start-Pointer
!SHR EAX, 1 ; Byte to Word: not needed for ASCII Strings
!ADD EAX, 1 ; Add 1 to start with 1 as first Char-No.
!FindStr_End:
ProcedureReturn
CompilerEndIf ; #PB_Compiler_32Bit
CompilerElse ; C-Backend
; For now, use PB's FindString() so it also works on other platforms.
; A C-optimized version may be provided in the future.
Protected *pStr.String = *String
Protected *pStrToFind.String = *StringToFind
ProcedureReturn FindString(*pStr\s, *pStrToFind\s)
CompilerEndIf
EndProcedure
;- ----------------------------------------------------------------------
;- Initialisation
;- ----------------------------------------------------------------------
; ; PCmpIStrI needs SSE4.2
; If CPU::CpuMultiMediaFeatures\SSE4_2
; Debug "SSE4.2 is supported"
; EndIf
EndModule
CompilerIf #PB_Compiler_IsMainFile
;- ----------------------------------------------------------------------
;- TEST-CODE
;- ----------------------------------------------------------------------
EnableExplicit
UseModule StrSSE
Define sTest.s, sTest2.s, sASC.s
Define sDbg.s
Define I
Dim bChar.b(255) ; ASCII char array, zero-initialised
For I = 0 To 99 ; fill the array with 100 ASCII chars; bChar(100) stays 0 as terminator
bChar(I) = 33 + I
Next
Debug "--------------------------------------------------"
Debug "String Len"
Debug "--------------------------------------------------"
sTest= Space(255) ; Fill TestString with 255 Spaces
sDbg= "PB: Len() = " + Len(sTest) ; should be 255
Debug sDbg
sDbg = "SSE Len = " + Str(SSE_Len(@sTest)) ; should be 255
Debug sDbg
sDbg = "ASCII Len() = " + Str(SSE_LenA(@bChar(0))) ; should be 100 Chars
Debug sDbg
;- ----------------------------------------------------------------------
Define.s S0, S1, S2, sQ
Define ret
Dim cmp.s(2)
cmp(0) = "<"
cmp(1) = "="
cmp(2) = ">"
sQ.s = Chr('"') ; Quotes
;1 10 48
S0 = "Ich bin ein langer String, in welchem man nach 1234 suchen kann 5677"
S1 = "Ich bin ein langer String, in welchem man nach 1234 suchen kann 5678"
S2 = "Ich bin ein langer String, in welchem man nach 1234 suchen kann 5679"
Debug "--------------------------------------------------"
Debug "StringCompare"
Debug "--------------------------------------------------"
;Debug ""
ret = SSE_StringCompare(@S0, @S1) ; < (S0 ends with 5677, S1 with 5678)
Debug ret
Debug sQ + S0 + sQ + " " + cmp(ret+1) + " " + sQ + S1 + sQ
ret = SSE_StringCompare(@S0, @S2) ; <
Debug ret
Debug sQ + S0 + sQ + " " + cmp(ret+1) + " " + sQ + S2 + sQ
ret = SSE_StringCompare(@S2, @S1) ; > (S2 ends with 5679, S1 with 5678)
Debug ret
Debug sQ + S2 + sQ + " " + cmp(ret+1) + " " + sQ + S1 + sQ
Debug "--------------------------------------------------"
Debug "FindString"
Debug "--------------------------------------------------"
;Debug ""
Define Search$
Search$ = "1234"
;Search$ = "bin"
ret = SSE_FindStr(@S0, @Search$)
Debug ret
Debug "--------------------------------------------------"
; ----------------------------------------------------------------------
; TIMING TEST
; ----------------------------------------------------------------------
#cst_Loops = 2000000 ; 2Mio
Define T1, T2, txtStrLen.s, txtStrCompare.s
Debug "Stringlength"
Debug Str(@S1 % 16) + " : " + Hex(@S1) ; offset of S1 from a 16-byte boundary : address
Debug Str(@S2 % 16) + " : " + Hex(@S2) ; offset of S2 from a 16-byte boundary : address
; ---------------- StringLength ----------------------
; SSE Assembler Version
; S1 = Space(15000)
T1 = ElapsedMilliseconds()
For I = 1 To #cst_Loops
ret = SSE_Len(@S1)
Next
T1 = ElapsedMilliseconds() - T1
; Standard PB StringLength
T2 = ElapsedMilliseconds()
For I = 1 To #cst_Loops
;ret = Len(S1)
ret = MemoryStringLength(@S1)
Next
T2 = ElapsedMilliseconds() - T2
txtStrLen = "StringLength " + #cst_Loops + " Calls : ASM SSE = " + T1 + " / " + "PB Version = " + T2
; ---------------- StringCompare ----------------------
; SSE Assembler Version
T1 = ElapsedMilliseconds()
For I = 1 To #cst_Loops
ret = SSE_StringCompare(@S1, @S2)
Next
T1 = ElapsedMilliseconds() - T1
; Standard PB StringCompare
T2 = ElapsedMilliseconds()
For I = 1 To #cst_Loops
ret = CompareMemoryString(@S1, @S2)
Next
T2 = ElapsedMilliseconds() - T2
txtStrCompare = "StringCompare " + #cst_Loops + " Calls : ASM SSE = " + T1 + " / " + "PB Version = " + T2
MessageRequester("Timing results", txtStrLen + #CRLF$ + txtStrCompare)
CompilerEndIf
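For the C-Backend question, here is a minimal sketch of how the string-length part could look with the SSE4.2 intrinsics. It is not the module's code and not tested inside PB: the function names (`sketch_len_utf16` and its helpers) are made up for illustration, it assumes GCC or Clang on x86/x64, and it assumes the string is a 0-terminated UTF-16 buffer like a PB string. It uses the same PCMPISTRI trick as the ASM version (compare against an all-zero register with EQUAL_EACH + UNSIGNED_WORDS), walks char by char until the pointer is 16-byte aligned so that no load crosses a memory page, and falls back to a plain loop when SSE4.2 is not available.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar fallback: count UTF-16 code units up to the terminating 0. */
static size_t sketch_len_scalar(const uint16_t *s)
{
    const uint16_t *p = s;
    while (*p != 0) p++;
    return (size_t)(p - s);
}

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#define SKETCH_HAVE_SSE42 1
#include <nmmintrin.h>   /* SSE4.2 intrinsics: _mm_cmpistri */

/* The target attribute lets this one function use SSE4.2 even if the
   file is compiled without -msse4.2 (GCC/Clang only, an assumption). */
__attribute__((target("sse4.2")))
static size_t sketch_len_sse42(const uint16_t *s)
{
    const uint16_t *p = s;

    /* Walk char by char until p is 16-byte aligned. An aligned 16-byte
       load can never cross the end of a memory page. */
    while (((uintptr_t)p & 15) != 0) {
        if (*p == 0) return (size_t)(p - s);
        p++;
    }

    const __m128i zero = _mm_setzero_si128();
    for (;;) {
        __m128i chunk = _mm_load_si128((const __m128i *)p);
        /* EQUAL_EACH against an empty string: the result index is the
           position of the first 0 word in chunk, or 8 if there is none. */
        int idx = _mm_cmpistri(zero, chunk,
                               _SIDD_UWORD_OPS | _SIDD_CMP_EQUAL_EACH);
        if (idx < 8) return (size_t)(p - s) + (size_t)idx;
        p += 8;   /* 8 words = 16 bytes */
    }
}
#endif

/* Length in characters of a 0-terminated UTF-16 string. */
size_t sketch_len_utf16(const uint16_t *s)
{
    assert(s != NULL);
#ifdef SKETCH_HAVE_SSE42
    /* Runtime check, same idea as the SSE4_2 flag test in the module. */
    if (__builtin_cpu_supports("sse4.2")) return sketch_len_sse42(s);
#endif
    return sketch_len_scalar(s);
}
```

Like the ASM version, the aligned loads may read up to 14 bytes behind the terminating 0 inside the same 16-byte block, which is harmless for the CPU but outside what the C standard guarantees. How such a function gets included in PB's C-Backend is not covered by this sketch.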
PbFw_ASM_Macros.pbi
Code: Select all
; ===========================================================================
; FILE : PbFw_ASM_Macros.pbi
; NAME : Collection of Assembler optimization Macros
; DESC : Macros to PUSH/POP MMX and XMM Registers when they are used
; DESC : in PB Procedures
; DESC : The macros don't work in code outside of Procedures because
; DESC : the in-Procedure Assembler variable convention is used:
; DESC : [p.p_] [p.v_] for Pointers / Variables
; DESC : For use outside of Procedures you have to use [p_] [v_]
; ===========================================================================
;
; AUTHOR : Stefan Maag
; DATE : 2024/02/04
; VERSION : 0.5 Developer Version
; COMPILER : PureBasic 6.0
;
; LICENCE : MIT License see https://opensource.org/license/mit/
; or \PbFramWork\MitLicence.txt
; ===========================================================================
; ChangeLog:
;{
;}
;{ TODO:
;}
; ===========================================================================
; ------------------------------
; MMX and SSE Registers
; ------------------------------
; MM0..MM7 : MMX : Pentium P55C (Q1 1997) and AMD K6 (Q2 1997)
; XMM0..XMM15 : SSE : XMM0..7 since Pentium III (1999); XMM8..15 only in x64 mode, since AMD K8 Athlon64 (2003)
; YMM0..YMM15 : AVX256 : Intel SandyBridge (Q1 2011) and AMD Bulldozer (Q4 2011)
; X/Y/ZMM0..31 : AVX512 : Intel Skylake-X (2017) and AMD Zen4 (2022)
; ------------------------------
; Caller/callee saved registers
; ------------------------------
; The x64 ABI considers the registers RAX, RCX, RDX, R8, R9, R10, R11, and XMM0-XMM5 volatile.
; When present, the upper portions of YMM0-YMM15 and ZMM0-ZMM15 are also volatile. On AVX512VL,
; the ZMM, YMM, and XMM registers 16-31 are also volatile. When AMX support is present,
; the TMM tile registers are volatile. Consider volatile registers destroyed on function calls
; unless otherwise safety-provable by analysis such as whole program optimization.
; The x64 ABI considers registers RBX, RBP, RDI, RSI, RSP, R12, R13, R14, R15, and XMM6-XMM15 nonvolatile.
; They must be saved and restored by a function that uses them.
;https://learn.microsoft.com/en-us/cpp/build/x64-software-conventions?view=msvc-170
; x64 calling conventions - Register use
; ------------------------------
; x64 CPU Register
; ------------------------------
; RAX      Volatile     Return value register
; RCX      Volatile     First integer argument
; RDX      Volatile     Second integer argument
; R8       Volatile     Third integer argument
; R9       Volatile     Fourth integer argument
; R10:R11  Volatile     Must be preserved as needed by caller; used in syscall/sysret instructions
; R12:R15  Nonvolatile  Must be preserved by callee
; RDI      Nonvolatile  Must be preserved by callee
; RSI      Nonvolatile  Must be preserved by callee
; RBX      Nonvolatile  Must be preserved by callee
; RBP      Nonvolatile  May be used as a frame pointer; must be preserved by callee
; RSP      Nonvolatile  Stack pointer
; ------------------------------
; MMX-Register
; ------------------------------
; MM0:MM7 Nonvolatile Registers shared with FPU-Register. An EMMS Command is necessary after MMX-Register use
; to enable correct FPU functions again.
; ------------------------------
; SSE Register
; ------------------------------
; XMM0, YMM0  Volatile  First FP argument; first vector-type argument when __vectorcall is used
; XMM1, YMM1  Volatile  Second FP argument; second vector-type argument when __vectorcall is used
; XMM2, YMM2  Volatile  Third FP argument; third vector-type argument when __vectorcall is used
; XMM3, YMM3  Volatile  Fourth FP argument; fourth vector-type argument when __vectorcall is used
; XMM4, YMM4  Volatile  Must be preserved as needed by caller; fifth vector-type argument when __vectorcall is used
; XMM5, YMM5  Volatile  Must be preserved as needed by caller; sixth vector-type argument when __vectorcall is used
; XMM6:XMM15, YMM6:YMM15  Nonvolatile (XMM), Volatile (upper half of YMM)  Must be preserved by callee. YMM registers must be preserved as needed by caller.
; @f = Jump forward to the next @@; @b = Jump backward to the previous @@
; .Loop: ; is a local label (sub-label). It belongs to the last global label.
; The PB compiler sets a global label for each Procedure, so local labels work only inside the Procedure
; ------------------------------
; Some important SIMD instructions
; ------------------------------
; https://hjlebbink.github.io/x86doc/
; PAND, POR, PXOR, PADD ... : SSE2
; PCMPEQW : SSE2 : Compare Packed Data for Equal
; PSHUFLW : SSE2 : Shuffle Packed Low Words
; PSHUFHW : SSE2 : Shuffle Packed High Words
; PSHUFB : SSSE3 : Packed Shuffle Bytes
; PEXTR[B/W/D/Q] : SSE4.1 : PEXTRB RAX, XMM0, 1 : loads Byte 1 of XMM0[Byte 0..15]
; PINSR[B/W/D/Q] : SSE4.1 : PINSRB XMM0, RAX, 1 : transfers RAX LoByte to Byte 1 of XMM0
; PCMPISTRI : SSE4.2 : Packed Compare Implicit Length Strings, Return Index
; PCMPISTRM : SSE4.2 : Packed Compare Implicit Length Strings, Return Mask
;- ----------------------------------------------------------------------
;- NaN Value 32/64 Bit
; #Nan32 = $FFC00000 ; Bit representation for the 32Bit Float NaN value
; #Nan64 = $FFF8000000000000 ; Bit representation for the 64Bit Float NaN value
; ----------------------------------------------------------------------
; ----------------------------------------------------------------------
; Structures to reserve Space on the Stack for ASM_PUSH, ASM_POP
; ----------------------------------------------------------------------
Structure TStack_16Byte
R.q[2]
EndStructure
Structure TStack_32Byte
R.q[4]
EndStructure
Structure TStack_48Byte
R.q[6]
EndStructure
Structure TStack_64Byte
R.q[8]
EndStructure
Structure TStack_96Byte
R.q[12]
EndStructure
Structure TStack_128Byte
R.q[16]
EndStructure
Structure TStack_256Byte
R.q[32]
EndStructure
Structure TStack_512Byte
R.q[64]
EndStructure
;- ----------------------------------------------------------------------
;- CPU Registers
;- ----------------------------------------------------------------------
; separate Macros for EBX/RBX because they are needed often, especially on x32
Macro ASM_PUSH_EBX()
Protected mEBX
!MOV [p.v_mEBX], EBX
EndMacro
Macro ASM_POP_EBX()
!MOV EBX, [p.v_mEBX]
EndMacro
Macro ASM_PUSH_RBX()
Protected mRBX
!MOV [p.v_mRBX], RBX
EndMacro
Macro ASM_POP_RBX()
!MOV RBX, [p.v_mRBX]
EndMacro
; The LEA instruction: LoadEffectiveAddress of a variable
Macro ASM_PUSH_R10to11(ptrREG)
Protected R1011.TStack_16Byte
!LEA ptrREG, [p.v_R1011] ; ptrREG (e.g. RDX) = @R1011 = Pointer to RegisterBackupStruct
!MOV [ptrREG], R10
!MOV [ptrREG+8], R11
EndMacro
Macro ASM_POP_R10to11(ptrREG)
!LEA ptrREG, [p.v_R1011] ; ptrREG (e.g. RDX) = @R1011 = Pointer to RegisterBackupStruct
!MOV R10, [ptrREG]
!MOV R11, [ptrREG+8]
EndMacro
Macro ASM_PUSH_R12to15(ptrREG)
Protected R1215.TStack_32Byte
!LEA ptrREG, [p.v_R1215] ; ptrREG (e.g. RDX) = @R1215 = Pointer to RegisterBackupStruct
!MOV [ptrREG], R12
!MOV [ptrREG+8], R13
!MOV [ptrREG+16], R14
!MOV [ptrREG+24], R15
EndMacro
Macro ASM_POP_R12to15(ptrREG)
!LEA ptrREG, [p.v_R1215] ; ptrREG (e.g. RDX) = @R1215 = Pointer to RegisterBackupStruct
!MOV R12, [ptrREG]
!MOV R13, [ptrREG+8]
!MOV R14, [ptrREG+16]
!MOV R15, [ptrREG+24]
EndMacro
;- ----------------------------------------------------------------------
;- MMX Registers
;- ----------------------------------------------------------------------
; All MMX-Registers are nonvolatile (shared with the FPU-Registers).
; After the last use of the MMX-Registers an EMMS Command must follow to enable
; correct FPU operations again!
Macro ASM_PUSH_MM_0to3(ptrREG)
Protected M03.TStack_32Byte
!LEA ptrREG, [p.v_M03] ; ptrREG (e.g. RDX) = @M03 = Pointer to RegisterBackupStruct
!MOVQ [ptrREG], MM0
!MOVQ [ptrREG+8], MM1
!MOVQ [ptrREG+16], MM2
!MOVQ [ptrREG+24], MM3
EndMacro
Macro ASM_POP_MM_0to3(ptrREG)
!LEA ptrREG, [p.v_M03] ; ptrREG (e.g. RDX) = @M03 = Pointer to RegisterBackupStruct
!MOVQ MM0, [ptrREG]
!MOVQ MM1, [ptrREG+8]
!MOVQ MM2, [ptrREG+16]
!MOVQ MM3, [ptrREG+24]
EndMacro
Macro ASM_PUSH_MM_4to5(ptrREG)
Protected M45.TStack_32Byte
!LEA ptrREG, [p.v_M45] ; ptrREG (e.g. RDX) = @M45 = Pointer to RegisterBackupStruct
!MOVQ [ptrREG], MM4
!MOVQ [ptrREG+8], MM5
EndMacro
Macro ASM_POP_MM_4to5(ptrREG)
!LEA ptrREG, [p.v_M45] ; ptrREG (e.g. RDX) = @M45 = Pointer to RegisterBackupStruct
!MOVQ MM4, [ptrREG]
!MOVQ MM5, [ptrREG+8]
EndMacro
Macro ASM_PUSH_MM_4to7(ptrREG)
Protected M47.TStack_32Byte
!LEA ptrREG, [p.v_M47] ; ptrREG (e.g. RDX) = @M47 = Pointer to RegisterBackupStruct
!MOVQ [ptrREG], MM4
!MOVQ [ptrREG+8], MM5
!MOVQ [ptrREG+16], MM6
!MOVQ [ptrREG+24], MM7
EndMacro
Macro ASM_POP_MM_4to7(ptrREG)
!LEA ptrREG, [p.v_M47] ; ptrREG (e.g. RDX) = @M47 = Pointer to RegisterBackupStruct
!MOVQ MM4, [ptrREG]
!MOVQ MM5, [ptrREG+8]
!MOVQ MM6, [ptrREG+16]
!MOVQ MM7, [ptrREG+24]
EndMacro
;- ----------------------------------------------------------------------
;- XMM Registers
;- ----------------------------------------------------------------------
; Because of the latency of unaligned memory access we use 2x 64-bit MOV instead of 1x 128-bit MOV:
; MOVDQU [ptrREG], XMM4 -> MOVLPS [ptrREG], XMM4 and MOVHPS [ptrREG+8], XMM4
; An x64 processor can do two 64-bit memory transfers in parallel.
; XMM4:XMM5 are normally volatile, so we do not have to preserve them.
; ATTENTION: XMM4:XMM5 must be preserved only when __vectorcall is used.
; As far as I know, PB doesn't use __vectorcall in the ASM Backend, so inside a Procedure
; where __vectorcall isn't used we don't have to preserve them.
; So we keep the Macros empty. If you want to activate them, just uncomment the code below.
Macro ASM_PUSH_XMM_4to5(ptrREG)
EndMacro
Macro ASM_POP_XMM_4to5(ptrREG)
EndMacro
; Macro ASM_PUSH_XMM_4to5(ptrREG)
; Protected X45.TStack_32Byte
; !LEA ptrREG, [p.v_X45] ; ptrREG (e.g. RDX) = @X45 = Pointer to RegisterBackupStruct
; !MOVLPS [ptrREG], XMM4
; !MOVHPS [ptrREG+8], XMM4
; !MOVLPS [ptrREG+16], XMM5
; !MOVHPS [ptrREG+24], XMM5
; EndMacro
; Macro ASM_POP_XMM_4to5(ptrREG)
; !LEA ptrREG, [p.v_X45] ; ptrREG (e.g. RDX) = @X45 = Pointer to RegisterBackupStruct
; !MOVLPS XMM4, [ptrREG]
; !MOVHPS XMM4, [ptrREG+8]
; !MOVLPS XMM5, [ptrREG+16]
; !MOVHPS XMM5, [ptrREG+24]
; EndMacro
; ======================================================================
Macro ASM_PUSH_XMM_6to7(ptrREG)
Protected X67.TStack_32Byte
!LEA ptrREG, [p.v_X67] ; ptrREG (e.g. RDX) = @X67 = Pointer to RegisterBackupStruct
!MOVLPS [ptrREG], XMM6
!MOVHPS [ptrREG+8], XMM6
!MOVLPS [ptrREG+16], XMM7
!MOVHPS [ptrREG+24], XMM7
EndMacro
Macro ASM_POP_XMM_6to7(ptrREG)
!LEA ptrREG, [p.v_X67] ; ptrREG (e.g. RDX) = @X67 = Pointer to RegisterBackupStruct
!MOVLPS XMM6, [ptrREG]
!MOVHPS XMM6, [ptrREG+8]
!MOVLPS XMM7, [ptrREG+16] ; restore XMM7 (the original restored XMM6 twice)
!MOVHPS XMM7, [ptrREG+24]
EndMacro
; Fast LOAD/SAVE of an XMM-Register: the 128-bit MOVDQU command has a long latency.
; Two 64-bit loads are faster! They are processed in parallel in 1 cycle with low or 0 latency.
; This optimization is taken from the AMD code optimization guide.
Macro ASM_LD_XMMM(REGX, ptrREG)
!MOVLPS REGX, [ptrREG]
!MOVHPS REGX, [ptrREG+8]
EndMacro
Macro ASM_SAV_XMMM(REGX, ptrREG)
!MOVLPS [ptrREG], REGX
!MOVHPS [ptrREG+8], REGX
EndMacro
;- ----------------------------------------------------------------------
;- YMM Registers
;- ----------------------------------------------------------------------
; For the 256-bit YMM registers we switch to aligned memory commands.
; YMM needs 256 bit = 32 byte alignment, so we need 32 bytes more memory to align
; it manually! We have to ADD 32 to the address and then clear the low 5 bits
; to get an address aligned to 32.
; ATTENTION! When using YMM-Registers we have to preserve only the lo-parts (XMM-Part).
; The hi-parts are always volatile. So preserving the XMM-Registers is enough!
; Use these Macros only if you want to preserve the complete YMM-Registers for your own purpose!
Macro ASM_PUSH_YMM_4to5(ptrREG)
Protected Y45.TStack_96Byte ; we need 64 bytes and use 96 to get Align 32
; Align the address to 32 bytes, so we can use the aligned MOV VMOVAPD
!LEA ptrREG, [p.v_Y45] ; ptrREG (e.g. RDX) = @Y45 = Pointer to RegisterBackupStruct
!ADD ptrREG, 32
!SHR ptrREG, 5
!SHL ptrREG, 5
; Move the YMM Registers to Memory Align 32
!VMOVAPD [ptrREG], YMM4
!VMOVAPD [ptrREG+32], YMM5
EndMacro
Macro ASM_POP_YMM_4to5(ptrREG)
; Align the address @Y45 to 32 bytes, so we can use the aligned MOV VMOVAPD
!LEA ptrREG, [p.v_Y45] ; ptrREG (e.g. RDX) = @Y45 = Pointer to RegisterBackupStruct
!ADD ptrREG, 32
!SHR ptrREG, 5
!SHL ptrREG, 5
; POP Registers from Stack
!VMOVAPD YMM4, [ptrREG]
!VMOVAPD YMM5, [ptrREG+32]
EndMacro
Macro ASM_PUSH_YMM_6to7(ptrREG)
Protected Y67.TStack_96Byte ; we need 64 bytes and use 96 to get Align 32
; Align the address to 32 bytes, so we can use the aligned MOV VMOVAPD
!LEA ptrREG, [p.v_Y67] ; ptrREG (e.g. RDX) = @Y67 = Pointer to RegisterBackupStruct
!ADD ptrREG, 32
!SHR ptrREG, 5
!SHL ptrREG, 5
; Move the YMM Registers to Memory Align 32
!VMOVAPD [ptrREG], YMM6
!VMOVAPD [ptrREG+32], YMM7
EndMacro
Macro ASM_POP_YMM_6to7(ptrREG)
; Align the address @Y67 to 32 bytes, so we can use the aligned MOV VMOVAPD
!LEA ptrREG, [p.v_Y67] ; ptrREG (e.g. RDX) = @Y67 = Pointer to RegisterBackupStruct
!ADD ptrREG, 32
!SHR ptrREG, 5
!SHL ptrREG, 5
; POP Registers from Stack
!VMOVAPD YMM6, [ptrREG]
!VMOVAPD YMM7, [ptrREG+32]
EndMacro