Helpful Assembler Macros for SSE SIMD
Posted: Wed Nov 12, 2025 6:37 pm
For those who are still using the ASM backend with inline assembler code, or for those who want to start.
Here is a collection of helpful/important ASM macros I use in my code. The version is still in a developer state because not all macros are 100% tested, but they should work!
----------------------------------------------------------------------
I ran into some conflicts and changed a lot again.
Now it is possible to use the macros inside and outside of Procedures.
Changed the macro names to follow a standard naming convention, and because SIMD vector commands
like MUL/DIV are not the same as the mathematical vector Mul/Div, the macros are now named
ASM_SIMD instead of ASM_Vec4.
Update V0.6; 2025/11/14
The test code shows the use of the .Vector4 functions with SSE commands.
Code:
; ===========================================================================
; FILE : PbFw_ASM_Macros.pbi
; NAME : Collection of Assembler Macros for SIMD SSE instructions
; DESC : Since PB has a C backend, Assembler code in PB only makes sense if
; DESC : it is used for SIMD instructions. Generally SIMD is suitable for all vector
; DESC : arithmetic like 3D graphics, color operations and complex number arithmetic.
; DESC : This Library provides a general basic set of Macros for SIMD vector operations
; DESC : on 2D and 4D data, and furthermore the necessary Macros for preserving nonvolatile
; DESC : registers:
; DESC : Macros for PUSH, POP of MMX and XMM Registers
; DESC : Macros for SIMD .Vector4 functions using SSE commands
; DESC : Macros for SIMD .Vector2 functions using SSE commands
; DESC : The Macros now work inside and outside of Procedures, but need the EnableASM statement.
; DESC : The Macros are Structure-name independent. A fixed data definition is used, but
; DESC : we can pass any Structure to it without name checking. So it does not matter if we
; DESC : use the PB .Vector4 or our own Structure that implements the 4 floats (like 'VECf' from the Vector Module)
; SOURCES:
; A full description of the Intel Assembler Commands
; https://hjlebbink.github.io/x86doc/
; ===========================================================================
;
; AUTHOR : Stefan Maag
; DATE : 2024/02/04
; VERSION : 0.6 Developer Version
; COMPILER : PureBasic 6.0+
;
; LICENCE : MIT License see https://opensource.org/license/mit/
; or \PbFramWork\MitLicence.txt
; ===========================================================================
; ChangeLog:
; {
; 2025/11/14 S.Maag : I ran into some practical conflicts.
; - sometimes for 2D coordinates it is not suitable to use 4D coordinates.
; Because of that I added the 2-dimensional SSE commands for double floats.
; - Naming convention: the SIMD SSE vector functions MUL/DIV... are not the same as
; the mathematically correct vector MUL/DIV. Because of that I changed the naming
; to exactly what it is: SIMD (Single Instruction Multiple Data).
; Like ASM_Vec4_ADD_PS -> ASM_SIMD_ADD_4PS (SIMD ADD 4 packed single)
; 2025/11/13 S.Maag : modified Macros to use inside and outside of Procedures.
; This is possible with the PB ASM preprocessor (EnableASM).
; Changed the register loads from !MOV REG, [p.v_var] to MOV REG, var
; Added the _VectorPointerToREG Macro to handle *vec or vec automatically.
; 2025/11/12 S.Maag : added/changed some comments. Repaired bugs in the XMM functions.
; For the Vector4 functions we have to determine the VarType of the
; Vector4 Structure: #ASM_VAR or #ASM_PTR. Need the LEA command for #ASM_VAR
; and the MOV command for #ASM_PTR
; 2024/08/01 S.Maag : added Register Load/Save Macros and Vector Macros
; for packed SingleFloat and packed DoubleWord
;{ TODO:
; - Add functions for correct shuffling of 2D/4D data to use the SIMD vertical ADD/SUB functions instead
; of horizontal ADD/SUB, because the combination of shuffling and vertical ADD/SUB is faster than
; the horizontal ADD/SUB. All this is needed for 3D graphics vector and matrix multiplication.
; There is heavy use of combined multiply & add. The first MMX integration had the combined multiply and
; horizontal ADD. Later, shuffle commands were added to the instruction set. A combination of MUL, SHUF and vertical ADD
; has much lower latency than the older horizontal functions.
; - Add functions for 4D double float vectors. But this is a little bit more complicated than it seems.
; For 4D double we have to change to the 256-bit YMM registers. Those are the AVX instructions
; with the 'V' prefix like VMAXPD. But the 256-bit 'V' instructions behave differently compared to
; the 128-bit instructions, so an exact study of the documentation is needed first.
; - Add Functions for fast SIMD Color operations
;}
; ===========================================================================
; ------------------------------
; MMX and SSE Registers
; ------------------------------
; MM0..MM7 : MMX : Pentium MMX P55C (Q1 1997) and AMD K6 (Q2 1997)
; XMM0..XMM15 : SSE : XMM0..XMM7 since Pentium III (1999); XMM8..XMM15 in x64 mode since AMD K8 Athlon64 (2003)
; YMM0..YMM15 : AVX256 : Intel SandyBridge (Q1 2011) and AMD Bulldozer (Q4 2011)
; X/Y/ZMM0..31 : AVX512 : Tiger Lake (Q4 2020) and AMD Zen4 (Q4 2022)
; ------------------------------
; Caller/callee saved registers
; ------------------------------
; The x64 ABI considers the registers RAX, RCX, RDX, R8, R9, R10, R11, and XMM0-XMM5 volatile.
; When present, the upper portions of YMM0-YMM15 and ZMM0-ZMM15 are also volatile. On AVX512VL,
; the ZMM, YMM, and XMM registers 16-31 are also volatile. When AMX support is present,
; the TMM tile registers are volatile. Consider volatile registers destroyed on function calls
; unless otherwise safety-provable by analysis such as whole program optimization.
; The x64 ABI considers registers RBX, RBP, RDI, RSI, RSP, R12, R13, R14, R15, and XMM6-XMM15 nonvolatile.
; They must be saved and restored by a function that uses them.
; https://learn.microsoft.com/en-us/cpp/build/x64-software-conventions?view=msvc-170
; x64 calling conventions - Register use
; ------------------------------
; x64 CPU Register
; ------------------------------
; RAX Volatile Return value register
; RCX Volatile First integer argument
; RDX Volatile Second integer argument
; R8 Volatile Third integer argument
; R9 Volatile Fourth integer argument
; R10:R11 Volatile Must be preserved as needed by caller; used in syscall/sysret instructions
; R12:R15 Nonvolatile Must be preserved by callee
; RDI Nonvolatile Must be preserved by callee
; RSI Nonvolatile Must be preserved by callee
; RBX Nonvolatile Must be preserved by callee
; RBP Nonvolatile May be used as a frame pointer; must be preserved by callee
; RSP Nonvolatile Stack pointer
; ------------------------------
; MMX-Register
; ------------------------------
; MM0:MM7 Nonvolatile Registers shared with the FPU registers. An EMMS command is necessary after MMX register use
; to enable correct FPU functions again.
; ------------------------------
; SSE Register
; ------------------------------
; XMM0, YMM0 Volatile First FP argument; first vector-type argument when __vectorcall is used
; XMM1, YMM1 Volatile Second FP argument; second vector-type argument when __vectorcall is used
; XMM2, YMM2 Volatile Third FP argument; third vector-type argument when __vectorcall is used
; XMM3, YMM3 Volatile Fourth FP argument; fourth vector-type argument when __vectorcall is used
; XMM4, YMM4 Volatile Must be preserved as needed by caller; fifth vector-type argument when __vectorcall is used
; XMM5, YMM5 Volatile Must be preserved as needed by caller; sixth vector-type argument when __vectorcall is used
; XMM6:XMM15, YMM6:YMM15 Nonvolatile (XMM), Volatile (upper half of YMM) Must be preserved by callee. YMM registers must be preserved as needed by caller.
; @f = Jump forward to the next @@; @b = Jump backward to the previous @@
; .Loop: ; is a local label or sub-label. It works from the last global label.
; The PB compiler sets a global label for each Procedure, so local labels work only inside the Procedure.
; ------------------------------
; Some important SIMD instructions
; ------------------------------
; https://hjlebbink.github.io/x86doc/
; PAND, POR, PXOR, PADD ... : SSE2
; PCMPEQW : SSE2 : Compare Packed Data for Equal
; PSHUFLW : SSE2 : Shuffle Packed Low Words
; PSHUFHW : SSE2 : Shuffle Packed High Words
; PSHUFB : SSSE3 : Packed Shuffle Bytes !
; PEXTR[B/W/D/Q] : SSE4.1 : PEXTRB RAX, XMM0, 1 : loads Byte 1 of XMM0[Byte 0..15]
; PINSR[B/W/D/Q] : SSE4.1 : PINSRB XMM0, RAX, 1 : transfers RAX LoByte to Byte 1 of XMM0
; PCMPESTRI : SSE4.2 : Packed Compare Explicit Length Strings, Return Index
; PCMPISTRM : SSE4.2 : Packed Compare Implicit Length Strings, Return Mask
;- ----------------------------------------------------------------------
;- NaN Value 32/64 Bit
; #Nan32 = $FFC00000 ; Bit representation of the 32-bit Float NaN value
; #Nan64 = $FFF8000000000000 ; Bit representation of the 64-bit Float NaN value
; ----------------------------------------------------------------------
; --------------------------------------------------
; Assembler Datasection Definition
; --------------------------------------------------
; db Define Byte = 1 byte
; dw Define Word = 2 bytes
; dd Define Doubleword = 4 bytes
; dq Define Quadword = 8 bytes
; dt Define ten Bytes = 10 bytes
; !label: dq 21, 22, 23
; --------------------------------------------------
; ----------------------------------------------------------------------
; Structures to reserve Space on the Stack for ASM_PUSH, ASM_POP
; ----------------------------------------------------------------------
Structure TStack_16Byte
R.q[2]
EndStructure
Structure TStack_32Byte
R.q[4]
EndStructure
Structure TStack_48Byte
R.q[6]
EndStructure
Structure TStack_64Byte
R.q[8]
EndStructure
Structure TStack_96Byte
R.q[12]
EndStructure
Structure TStack_128Byte
R.q[16]
EndStructure
Structure TStack_256Byte
R.q[32]
EndStructure
Structure TStack_512Byte
R.q[64]
EndStructure
Macro AsmCodeIsInProc
Bool(#PB_Compiler_Procedure <> #Null$)
EndMacro
;- ----------------------------------------------------------------------
;- CPU Registers
;- ----------------------------------------------------------------------
; separate Macros for EBX/RBX because this is often needed, especially for x32
; It is not really a PUSH/POP; it is more a SAVE/RESTORE!
; ATTENTION! Use EnableASM in your code before using the Macros
; By using the PB ASM preprocessor and the Define statement instead of Protected
; we can now use the Macros inside and outside of a Procedure.
; Inside a Procedure PB handles Define and Protected in the same way.
Macro ASM_PUSH_EBX()
Define mEBX
MOV mEBX, EBX
; !MOV [p.v_mEBX], EBX
EndMacro
Macro ASM_POP_EBX()
MOV EBX, mEBX
; !MOV EBX, [p.v_mEBX]
EndMacro
Macro ASM_PUSH_RBX()
Define mRBX
MOV mRBX, RBX
; !MOV [p.v_mRBX], RBX
EndMacro
Macro ASM_POP_RBX()
MOV RBX, mRBX
;!MOV RBX, [p.v_mRBX]
EndMacro
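; ----------------------------------------------------------------------
; Usage sketch (example only, kept as a comment): how the RBX SAVE/RESTORE
; Macros are meant to be used inside a Procedure on x64. The Procedure name
; is hypothetical; EnableASM must be active.
; Procedure UseRBX_Example()
;   ASM_PUSH_RBX()        ; save the nonvolatile RBX before we clobber it
;   !MOV RBX, 42          ; ... any code that uses RBX ...
;   ASM_POP_RBX()         ; restore RBX before leaving the Procedure
; EndProcedure
; ----------------------------------------------------------------------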
; The LEA instruction: LoadEffectiveAddress of a variable
Macro ASM_PUSH_R10to11(_REG=RDX)
Define R1011.TStack_16Byte
LEA _REG, R1011
;!LEA _REG, [p.v_R1011] ; RDX = @R1011 = Pointer to RegisterBackupStruct
!MOV [_REG], R10
!MOV [_REG+8], R11
EndMacro
Macro ASM_POP_R10to11(_REG=RDX)
LEA _REG, R1011
; !LEA _REG, [p.v_R1011] ; RDX = @R1011 = Pointer to RegisterBackupStruct
!MOV R10, [_REG]
!MOV R11, [_REG+8]
EndMacro
Macro ASM_PUSH_R12to15(_REG=RDX)
Define R1215.TStack_32Byte
LEA _REG, R1215
; !LEA _REG, [p.v_R1215] ; RDX = @R1215 = Pointer to RegisterBackupStruct
!MOV [_REG], R12
!MOV [_REG+8], R13
!MOV [_REG+16], R14
!MOV [_REG+24], R15
EndMacro
Macro ASM_POP_R12to15(_REG=RDX)
LEA _REG, R1215
; !LEA _REG, [p.v_R1215] ; RDX = @R1215 = Pointer to RegisterBackupStruct
!MOV R12, [_REG]
!MOV R13, [_REG+8]
!MOV R14, [_REG+16]
!MOV R15, [_REG+24]
EndMacro
;- ----------------------------------------------------------------------
;- MMX Registers (on x64 don't use MMX registers; XMM is better!)
;- ----------------------------------------------------------------------
; All MMX registers are nonvolatile (shared with the FPU registers)
; After the end of MMX register use an EMMS command must follow to enable
; correct FPU operations again!
Macro ASM_PUSH_MM_0to3(_REG=RDX)
Define M03.TStack_32Byte
LEA _REG, M03
; !LEA _REG, [p.v_M03] ; RDX = @M03 = Pointer to RegisterBackupStruct
!MOVQ [_REG], MM0
!MOVQ [_REG+8], MM1
!MOVQ [_REG+16], MM2
!MOVQ [_REG+24], MM3
EndMacro
Macro ASM_POP_MM_0to3(_REG=RDX)
LEA _REG, M03
; !LEA _REG, [p.v_M03] ; RDX = @M03 = Pointer to RegisterBackupStruct
!MOVQ MM0, [_REG]
!MOVQ MM1, [_REG+8]
!MOVQ MM2, [_REG+16]
!MOVQ MM3, [_REG+24]
EndMacro
Macro ASM_PUSH_MM_4to5(_REG=RDX)
Define M45.TStack_32Byte
LEA _REG, M45
; !LEA _REG, [p.v_M45] ; RDX = @M45 = Pointer to RegisterBackupStruct
!MOVQ [_REG], MM4
!MOVQ [_REG+8], MM5
EndMacro
Macro ASM_POP_MM_4to5(_REG=RDX)
LEA _REG, M45
;!LEA _REG, [p.v_M45] ; RDX = @M45 = Pointer to RegisterBackupStruct
!MOVQ MM4, [_REG]
!MOVQ MM5, [_REG+8]
EndMacro
Macro ASM_PUSH_MM_4to7(_REG=RDX)
Define M47.TStack_32Byte
LEA _REG, M47
; !LEA _REG, [p.v_M47] ; RDX = @M47 = Pointer to RegisterBackupStruct
!MOVQ [_REG], MM4
!MOVQ [_REG+8], MM5
!MOVQ [_REG+16], MM6
!MOVQ [_REG+24], MM7
EndMacro
Macro ASM_POP_MM_4to7(_REG=RDX)
LEA _REG, M47
;!LEA _REG, [p.v_M47] ; RDX = @M47 = Pointer to RegisterBackupStruct
!MOVQ MM4, [_REG]
!MOVQ MM5, [_REG+8]
!MOVQ MM6, [_REG+16]
!MOVQ MM7, [_REG+24]
EndMacro
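; ----------------------------------------------------------------------
; Usage sketch (example only, kept as a comment): preserving MM4..MM7 around
; your own MMX code. The Procedure name is hypothetical; EnableASM must be
; active. Remember the EMMS at the end to re-enable the FPU!
; Procedure UseMMX_Example()
;   ASM_PUSH_MM_4to7()    ; save MM4..MM7
;   !MOVQ MM4, MM0        ; ... any MMX code that uses MM4..MM7 ...
;   ASM_POP_MM_4to7()     ; restore MM4..MM7
;   !EMMS                 ; re-enable correct FPU operation after MMX use
; EndProcedure
; ----------------------------------------------------------------------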
;- ----------------------------------------------------------------------
;- XMM Registers
;- ----------------------------------------------------------------------
; Because of unaligned memory latency we use 2x 64-bit MOV instead of 1x 128-bit MOV:
; MOVDQU [ptrREG], XMM4 -> MOVLPS [ptrREG], XMM4 and MOVHPS [ptrREG+8], XMM4
; An x64 processor can do two 64-bit memory transfers in parallel.
; XMM4:XMM5 are normally volatile and we do not have to preserve them.
; ATTENTION: XMM4:XMM5 must be preserved only when __vectorcall is used.
; As far as I know PB does not use __vectorcall in the ASM backend, so inside a Procedure
; where __vectorcall isn't used we don't have to preserve them.
; So we keep the Macros empty. If you want the preservation, just activate the commented code below.
Macro ASM_PUSH_XMM_4to5(_REG=RDX)
EndMacro
Macro ASM_POP_XMM_4to5(_REG=RDX)
EndMacro
; Macro ASM_PUSH_XMM_4to5(REG=RDX)
; Define X45.TStack_32Byte
; LEA REG, X45
; ; !LEA REG, [p.v_X45] ; RDX = @X45 = Pointer to RegisterBackupStruct
; !MOVLPS [REG], XMM4
; !MOVHPS [REG+8], XMM4
; !MOVLPS [REG+16], XMM5
; !MOVHPS [REG+24], XMM5
; EndMacro
; Macro ASM_POP_XMM_4to5(REG)
; LEA REG, X45
; ; !LEA REG, [p.v_X45] ; RDX = @X45 = Pointer to RegisterBackupStruct
; !MOVLPS XMM4, [REG]
; !MOVHPS XMM4, [REG+8]
; !MOVLPS XMM5, [REG+16]
; !MOVHPS XMM5, [REG+24]
; EndMacro
; ======================================================================
Macro ASM_PUSH_XMM_6to7(_REG=RDX)
Define X67.TStack_32Byte
LEA _REG, X67
; !LEA _REG, [p.v_X67] ; RDX = @X67 = Pointer to RegisterBackupStruct
!MOVLPS [_REG], XMM6
!MOVHPS [_REG+8], XMM6
!MOVLPS [_REG+16], XMM7
!MOVHPS [_REG+24], XMM7
EndMacro
Macro ASM_POP_XMM_6to7(_REG=RDX)
LEA _REG, X67
;!LEA _REG, [p.v_X67] ; RDX = @X67 = Pointer to RegisterBackupStruct
!MOVLPS XMM6, [_REG]
!MOVHPS XMM6, [_REG+8]
!MOVLPS XMM7, [_REG+16]
!MOVHPS XMM7, [_REG+24]
EndMacro
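; ----------------------------------------------------------------------
; Usage sketch (example only, kept as a comment): preserving the nonvolatile
; XMM6/XMM7 when your own SSE code uses them. The Procedure name is
; hypothetical; EnableASM must be active.
; Procedure UseXMM67_Example()
;   ASM_PUSH_XMM_6to7()   ; save XMM6/XMM7 (nonvolatile in the x64 ABI)
;   !MOVAPS XMM6, XMM0    ; ... any SSE code that clobbers XMM6/XMM7 ...
;   ASM_POP_XMM_6to7()    ; restore XMM6/XMM7 before returning
; EndProcedure
; ----------------------------------------------------------------------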
;- ----------------------------------------------------------------------
;- YMM Registers
;- ----------------------------------------------------------------------
; for the 256-bit YMM registers we switch to aligned memory commands (much faster than unaligned)
; YMM needs 256-bit = 32-byte alignment. So we need 32 bytes of extra memory to align it
; manually! We have to ADD 32 to the address and then clear the low 5 bits
; to get an address aligned to 32.
; ATTENTION! When using YMM registers we only have to preserve the lo parts (the XMM part).
; The hi parts are always volatile, so preserving the XMM registers is enough!
; Use these Macros only if you want to preserve the complete YMM registers for your own purpose!
Macro ASM_PUSH_YMM_4to5(_REG=RDX)
Define Y45.TStack_96Byte ; we need 64 bytes and use 96 to get Align 32
; Align the address to 32 bytes, so we can use the aligned MOV VMOVAPD
LEA _REG, Y45
; !LEA _REG, [p.v_Y45] ; RDX = @Y45 = Pointer to RegisterBackupStruct
!ADD _REG, 32
!SHR _REG, 5
!SHL _REG, 5
; Move the YMM Registers to Memory Align 32
!VMOVAPD [_REG], YMM4
!VMOVAPD [_REG+32], YMM5
EndMacro
Macro ASM_POP_YMM_4to5(_REG=RDX)
; Align the address @Y45 to 32 bytes, so we can use the aligned MOV VMOVAPD
LEA _REG, Y45
;!LEA _REG, [p.v_Y45] ; RDX = @Y45 = Pointer to RegisterBackupStruct
!ADD _REG, 32
!SHR _REG, 5
!SHL _REG, 5
; POP Registers from Stack
!VMOVAPD YMM4, [_REG]
!VMOVAPD YMM5, [_REG+32]
EndMacro
Macro ASM_PUSH_YMM_6to7(_REG=RDX)
Define Y67.TStack_96Byte ; we need 64 bytes and use 96 to get Align 32
; Align the address to 32 bytes, so we can use the aligned MOV VMOVAPD
LEA _REG, Y67
; !LEA _REG, [p.v_Y67] ; RDX = @Y67 = Pointer to RegisterBackupStruct
!ADD _REG, 32
!SHR _REG, 5
!SHL _REG, 5
; Move the YMM Registers to Memory Align 32
!VMOVAPD [_REG], YMM6
!VMOVAPD [_REG+32], YMM7
EndMacro
Macro ASM_POP_YMM_6to7(_REG=RDX)
; Align the address @Y67 to 32 bytes, so we can use the aligned MOV VMOVAPD
LEA _REG, Y67
; !LEA _REG, [p.v_Y67] ; RDX = @Y67 = Pointer to RegisterBackupStruct
!ADD _REG, 32
!SHR _REG, 5
!SHL _REG, 5
; POP Registers from Stack
!VMOVAPD YMM6, [_REG]
!VMOVAPD YMM7, [_REG+32]
EndMacro
;- ----------------------------------------------------------------------
;- Load/Save Registers from/to Value or Pointer
;- ----------------------------------------------------------------------
; These Macros are just templates for the correct code!
;
; Load a Register with a variable
Macro ASM_LD_REG_Var(_REG, var)
MOV _REG, var
; !MOV _REG, [p.v_#var]
EndMacro
; Save Register to variable
Macro ASM_SAV_REG_Var(_REG, var)
MOV var, _REG
; !MOV [p.v_#var], _REG
EndMacro
; Load Register with a PointerVar
Macro ASM_LD_REG_Ptr(_REG, pVar)
MOV _REG, pVar
; !MOV _REG, [p.p_#pVar]
EndMacro
; Save Pointer in Register to PointerVar
Macro ASM_SAV_REG_Ptr(_REG, pVar)
MOV pVar, _REG
; !MOV [p.p_#pVar], _REG
EndMacro
; load the Register with the Pointer of a var
Macro ASM_LD_REG_VarPtr(_REG, var)
LEA _REG, var
; !LEA _REG, [p.v_#var]
EndMacro
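; ----------------------------------------------------------------------
; Usage sketch (example only, kept as a comment): the template Macros in action.
; The Procedure name and variable names are hypothetical; EnableASM must be active.
; Procedure LoadSave_Example(*mem)
;   Define val.i = 123
;   ASM_LD_REG_Var(RAX, val)      ; RAX = val
;   ASM_SAV_REG_Var(RAX, val)     ; val = RAX
;   ASM_LD_REG_Ptr(RDX, *mem)     ; RDX = *mem (the pointer value)
;   ASM_LD_REG_VarPtr(RCX, val)   ; RCX = @val (address of the local variable)
; EndProcedure
; ----------------------------------------------------------------------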
; ----------------------------------------------------------------------
; Lo latency LOAD/SAVE 128 Bit XMM-Register
; ----------------------------------------------------------------------
; The MOVDQU command for 128 bits has a long latency.
; 2x 64-bit loads are faster! They are processed in parallel in 1 cycle with low or no latency.
; This optimization is taken from the AMD code optimization guide.
; (For 2020+ processors like AMD Ryzen it does not matter, because Ryzen can load
; 128 bits at the same speed as 2x 64 bits. For older processors the 2x 64-bit load is faster.)
; ATTENTION! You have to load the MemoryPointer to REG first
; _XMM : The XMM Register to load with data from Memory
; REG : The Register containing the Pointer to the Memory
Macro ASM_LD_XMM(_XMM=XMM0, _REG=RDX)
!MOVLPS _XMM, [_REG]
!MOVHPS _XMM, [_REG+8]
EndMacro
; !MOVDQU _XMM, [_REG] ; alternative 128Bit direct load -> has long latency on Processors older than Ryzen
Macro ASM_SAV_XMM(_XMM=XMM0, _REG=RDX)
!MOVLPS [_REG], _XMM
!MOVHPS [_REG+8], _XMM
EndMacro
; !MOVDQU [_REG], _XMM ; alternative 128-bit direct store -> has long latency on processors older than Ryzen
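; ----------------------------------------------------------------------
; Usage sketch (example only, kept as a comment): copying a Vector4 through XMM0
; with the low-latency load/save Macros. The Procedure name is hypothetical;
; EnableASM must be active.
; Procedure CopyVector4_Example(*dst.Vector4, *src.Vector4)
;   ASM_LD_REG_Ptr(RDX, *src)     ; RDX = *src
;   ASM_LD_XMM(XMM0, RDX)         ; XMM0 = the 16 bytes at [*src]
;   ASM_LD_REG_Ptr(RDX, *dst)     ; RDX = *dst
;   ASM_SAV_XMM(XMM0, RDX)        ; [*dst] = XMM0
; EndProcedure
; ----------------------------------------------------------------------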
; ----------------------------------------------------------------------
; Helper Macro to determine _vector_ is a StructureVar or StructurePointer
; ----------------------------------------------------------------------
; SizeOf(StructureVar) = 16, SizeOf(StructurePointer) = 8
; TypeOf(StructureVar) = 7, TypeOf(StructurePointer) = 7 -> TypeOf does not work!
; If _vector_ is a pointer to a Vector we have to use : MOV REG, _vector_
; If _vector_ is the Structure itself we have to use : LEA REG, _vector_ ; LoadEffectiveAddress
; The problem solved here is mixed calls with a pointer to a Vector and a Vector variable.
; This example Proc shows the problem!
; Procedure VecTest(*InVec.Vector4)
; Protected v.Vector4, res.Vector4
; v\x = 1.0 : v\y = 1.0 : v\z = 1.0 : v\w = 0.0
;
; ASM_SIMD_ADD_4PS(*InVec, v) ; Now this is possible because the Macro auto-detects pointer or var at compile time
;
; EndProcedure
Macro _VectorPointerToREG(_REG_, _vector_)
CompilerIf #PB_Compiler_Procedure <> #Null$
CompilerIf SizeOf(_vector_)=SizeOf(Integer)
MOV _REG_, _vector_
CompilerElse
LEA _REG_, _vector_
CompilerEndIf
CompilerElse
MOV _REG_, _vector_
CompilerEndIf
EndMacro
; ----------------------------------------------------------------------
;- ----------------------------------------------------------------------
;- Vector4 PackedSingle ADD, SUB, MUL, DIV (for 4x32Bit Float Vectors)
;- ----------------------------------------------------------------------
; use to speed up standard 3D graphics with SSE
; Vector4 is a predefined Structure in PB
; Structure Vector4
; x.f
; y.f
; z.f
; w.f
; EndStructure
; SSE Extension Functions
; 2025/11/13 : Changed from direct 128-bit loads to 2x 64-bit loads because of the high latency of
; unaligned 128-bit loads (MOVDQU) on older processors. Now uses the Macro ASM_LD_XMM instead of MOVDQU
; 4PS := 4 packed single (32Bit)
; _XMMA = _vec1 + _vec2
Macro ASM_SIMD_ADD_4PS(_vec1, _vec2, _XMMA=XMM0, _XMMB=XMM1, _REGA=RAX, _REGD=RDX)
_VectorPointerToREG(_REGA, _vec1)
_VectorPointerToREG(_REGD, _vec2)
ASM_LD_XMM(_XMMA, _REGA)
ASM_LD_XMM(_XMMB, _REGD)
!ADDPS _XMMA, _XMMB ; Add packed single float
EndMacro
; _XMMA = _vec1 - _vec2
Macro ASM_SIMD_SUB_4PS(_vec1, _vec2, _XMMA=XMM0, _XMMB=XMM1, _REGA=RAX, _REGD=RDX)
_VectorPointerToREG(_REGA, _vec1)
_VectorPointerToREG(_REGD, _vec2)
ASM_LD_XMM(_XMMA, _REGA)
ASM_LD_XMM(_XMMB, _REGD)
!SUBPS _XMMA, _XMMB ; Sub packed single float
EndMacro
; _XMMA = _vec1 * _vec2
Macro ASM_SIMD_MUL_4PS(_vec1, _vec2, _XMMA=XMM0, _XMMB=XMM1, _REGA=RAX, _REGD=RDX)
_VectorPointerToREG(_REGA, _vec1)
_VectorPointerToREG(_REGD, _vec2)
ASM_LD_XMM(_XMMA, _REGA)
ASM_LD_XMM(_XMMB, _REGD)
!MULPS _XMMA, _XMMB ; Mul packed single float
EndMacro
; _XMMA = _vec1 / _vec2
Macro ASM_SIMD_DIV_4PS(_vec1, _vec2, _XMMA=XMM0, _XMMB=XMM1, _REGA=RAX, _REGD=RDX)
_VectorPointerToREG(_REGA, _vec1)
_VectorPointerToREG(_REGD, _vec2)
ASM_LD_XMM(_XMMA, _REGA)
ASM_LD_XMM(_XMMB, _REGD)
!DIVPS _XMMA, _XMMB
EndMacro
; _XMMA\x = Min(_vec1\x, _vec2\x) : _XMMA\y = Min(_vec1\y, _vec2\y) ...
Macro ASM_SIMD_MIN_4PS(_vec1, _vec2, _XMMA=XMM0, _XMMB=XMM1, _REGA=RAX, _REGD=RDX)
_VectorPointerToREG(_REGA, _vec1)
_VectorPointerToREG(_REGD, _vec2)
ASM_LD_XMM(_XMMA, _REGA)
ASM_LD_XMM(_XMMB, _REGD)
!MINPS _XMMA, _XMMB ; Minimum of packed single float
EndMacro
; _XMMA\x = Max(_vec1\x, _vec2\x) : _XMMA\y = Max(_vec1\y, _vec2\y) ...
Macro ASM_SIMD_MAX_4PS(_vec1, _vec2, _XMMA=XMM0, _XMMB=XMM1, _REGA=RAX, _REGD=RDX)
_VectorPointerToREG(_REGA, _vec1)
_VectorPointerToREG(_REGD, _vec2)
ASM_LD_XMM(_XMMA, _REGA)
ASM_LD_XMM(_XMMB, _REGD)
!MAXPS _XMMA, _XMMB ; Maximum of packed single float
EndMacro
;- ----------------------------------------------------------------------
;- Vector4 PackedDoubleWord ADD, SUB, MUL (for 4x32Bit Integer Vectors)
;- ----------------------------------------------------------------------
; Structure Vector4L ; This is not predefined in PB
; x.l
; y.l
; z.l
; w.l
; EndStructure
; SSE Extension Functions
; use for direct integer pixel position calculations
; 4PDW := 4 packed double words (32Bit)
; _XMMA = _vec1 + _vec2
Macro ASM_SIMD_ADD_4PDW(_vec1, _vec2, _XMMA=XMM0, _XMMB=XMM1, _REGA=RAX, _REGD=RDX)
_VectorPointerToREG(_REGA, _vec1)
_VectorPointerToREG(_REGD, _vec2)
ASM_LD_XMM(_XMMA, _REGA)
ASM_LD_XMM(_XMMB, _REGD)
!PADDD _XMMA, _XMMB ; Add packed DoubleWord integers
EndMacro
; _XMMA = _vec1 - _vec2
Macro ASM_SIMD_SUB_4PDW(_vec1, _vec2, _XMMA=XMM0, _XMMB=XMM1, _REGA=RAX, _REGD=RDX)
_VectorPointerToREG(_REGA, _vec1)
_VectorPointerToREG(_REGD, _vec2)
ASM_LD_XMM(_XMMA, _REGA)
ASM_LD_XMM(_XMMB, _REGD)
!PSUBD _XMMA, _XMMB ; Subtract packed DoubleWord integers
EndMacro
; _XMMA = _vec1 * _vec2
Macro ASM_SIMD_MUL_4PDW(_vec1, _vec2, _XMMA=XMM0, _XMMB=XMM1, _REGA=RAX, _REGD=RDX)
_VectorPointerToREG(_REGA, _vec1)
_VectorPointerToREG(_REGD, _vec2)
ASM_LD_XMM(_XMMA, _REGA)
ASM_LD_XMM(_XMMB, _REGD)
!PMULLD _XMMA, _XMMB ; Multiply packed DoubleWord integers, keep the low 32 bits of each product (SSE4.1)
EndMacro
; An instruction to divide packed DoubleWord integers does not exist, because the CPU cycle count would depend on the operands.
; _XMMA\x = Min(_vec1\x, _vec2\x) : _XMMA\y = Min(_vec1\y, _vec2\y) ...
Macro ASM_SIMD_MIN_4PDW(_vec1, _vec2, _XMMA=XMM0, _XMMB=XMM1, _REGA=RAX, _REGD=RDX)
_VectorPointerToREG(_REGA, _vec1)
_VectorPointerToREG(_REGD, _vec2)
ASM_LD_XMM(_XMMA, _REGA)
ASM_LD_XMM(_XMMB, _REGD)
!PMINSD _XMMA, _XMMB ; Minimum of signed packed Doubleword Integers
EndMacro
; _XMMA\x = Max(_vec1\x, _vec2\x) : _XMMA\y = Max(_vec1\y, _vec2\y) ...
Macro ASM_SIMD_MAX_4PDW(_vec1, _vec2, _XMMA=XMM0, _XMMB=XMM1, _REGA=RAX, _REGD=RDX)
_VectorPointerToREG(_REGA, _vec1)
_VectorPointerToREG(_REGD, _vec2)
ASM_LD_XMM(_XMMA, _REGA)
ASM_LD_XMM(_XMMB, _REGD)
!PMAXSD _XMMA, _XMMB ; Maximum of signed packed Doubleword Integers
EndMacro
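; ----------------------------------------------------------------------
; Usage sketch (example only, kept as a comment): adding two 4x32-bit integer
; vectors, e.g. for pixel positions. Vector4L is the Structure from the comment
; above (not predefined in PB); the Procedure name is hypothetical; EnableASM
; must be active.
; Structure Vector4L
;   x.l
;   y.l
;   z.l
;   w.l
; EndStructure
; Procedure AddVector4L_Example(*res.Vector4L, *v1.Vector4L, *v2.Vector4L)
;   ASM_SIMD_ADD_4PDW(*v1, *v2)   ; XMM0 = v1 + v2 (packed DoubleWord add)
;   ASM_LD_REG_Ptr(RDX, *res)     ; RDX = *res
;   ASM_SAV_XMM(XMM0, RDX)        ; *res = XMM0
; EndProcedure
; ----------------------------------------------------------------------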
;- ----------------------------------------------------------------------
;- Vector2 PackedDouble ADD, SUB, MUL, DIV (for 2x64Bit Double Vectors)
;- ----------------------------------------------------------------------
; use for 2D Double Float coordinates and Complex Number math
; 2PD := 2 packed double (64Bit)
; _XMMA = _vec1 + _vec2
Macro ASM_SIMD_ADD_2PD(_vec1, _vec2, _XMMA=XMM0, _XMMB=XMM1, _REGA=RAX, _REGD=RDX)
_VectorPointerToREG(_REGA, _vec1)
_VectorPointerToREG(_REGD, _vec2)
ASM_LD_XMM(_XMMA, _REGA)
ASM_LD_XMM(_XMMB, _REGD)
!ADDPD _XMMA, _XMMB ; Add packed double float
EndMacro
; _XMMA = _vec1 - _vec2
Macro ASM_SIMD_SUB_2PD(_vec1, _vec2, _XMMA=XMM0, _XMMB=XMM1, _REGA=RAX, _REGD=RDX)
_VectorPointerToREG(_REGA, _vec1)
_VectorPointerToREG(_REGD, _vec2)
ASM_LD_XMM(_XMMA, _REGA)
ASM_LD_XMM(_XMMB, _REGD)
!SUBPD _XMMA, _XMMB ; Sub packed double float
EndMacro
; _XMMA = _vec1 * _vec2
Macro ASM_SIMD_MUL_2PD(_vec1, _vec2, _XMMA=XMM0, _XMMB=XMM1, _REGA=RAX, _REGD=RDX)
_VectorPointerToREG(_REGA, _vec1)
_VectorPointerToREG(_REGD, _vec2)
ASM_LD_XMM(_XMMA, _REGA)
ASM_LD_XMM(_XMMB, _REGD)
!MULPD _XMMA, _XMMB ; Mul packed double float
EndMacro
; _XMMA = _vec1 / _vec2
Macro ASM_SIMD_DIV_2PD(_vec1, _vec2, _XMMA=XMM0, _XMMB=XMM1, _REGA=RAX, _REGD=RDX)
_VectorPointerToREG(_REGA, _vec1)
_VectorPointerToREG(_REGD, _vec2)
ASM_LD_XMM(_XMMA, _REGA)
ASM_LD_XMM(_XMMB, _REGD)
!DIVPD _XMMA, _XMMB
EndMacro
; _XMMA\x = Min(_vec1\x, _vec2\x) : _XMMA\y = Min(_vec1\y, _vec2\y) ...
Macro ASM_SIMD_MIN_2PD(_vec1, _vec2, _XMMA=XMM0, _XMMB=XMM1, _REGA=RAX, _REGD=RDX)
_VectorPointerToREG(_REGA, _vec1)
_VectorPointerToREG(_REGD, _vec2)
ASM_LD_XMM(_XMMA, _REGA)
ASM_LD_XMM(_XMMB, _REGD)
!MINPD _XMMA, _XMMB ; Minimum of packed double float
EndMacro
; _XMMA\x = Max(_vec1\x, _vec2\x) : _XMMA\y = Max(_vec1\y, _vec2\y) ...
Macro ASM_SIMD_MAX_2PD(_vec1, _vec2, _XMMA=XMM0, _XMMB=XMM1, _REGA=RAX, _REGD=RDX)
_VectorPointerToREG(_REGA, _vec1)
_VectorPointerToREG(_REGD, _vec2)
ASM_LD_XMM(_XMMA, _REGA)
ASM_LD_XMM(_XMMB, _REGD)
!MAXPD _XMMA, _XMMB ; Maximum of packed double float
EndMacro
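; ----------------------------------------------------------------------
; Usage sketch (example only, kept as a comment): adding two complex numbers
; (2x 64-bit double) with the 2PD Macros. The Complex Structure and the
; Procedure name are hypothetical; EnableASM must be active.
; Structure Complex
;   re.d
;   im.d
; EndStructure
; Procedure AddComplex_Example(*res.Complex, *c1.Complex, *c2.Complex)
;   ASM_SIMD_ADD_2PD(*c1, *c2)    ; XMM0 = {re1+re2, im1+im2}
;   ASM_LD_REG_Ptr(RDX, *res)     ; RDX = *res
;   ASM_SAV_XMM(XMM0, RDX)        ; *res = XMM0
; EndProcedure
; ----------------------------------------------------------------------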
CompilerIf #PB_Compiler_IsMainFile
;- ----------------------------------------------------------------------
;- TEST-CODE
;- ----------------------------------------------------------------------
EnableExplicit
Macro DbgVector4(_v4)
Debug "x=" + _v4\x
Debug "y=" + _v4\y
Debug "z=" + _v4\z
Debug "w=" + _v4\w
EndMacro
EnableASM
Procedure AddVector4(*vecResult.Vector4, *vec1.Vector4, *vec2.Vector4)
With *vecResult
\x = *vec1\x + *vec2\x
\y = *vec1\y + *vec2\y
\z = *vec1\z + *vec2\z
\w = *vec1\w + *vec2\w
EndWith
EndProcedure
Procedure AddVector4_SSE(*vecResult.Vector4, *vec1.Vector4, *vec2.Vector4)
ASM_SIMD_ADD_4PS(*vec1, *vec2) ; XMM0 = vec1 + vec2
ASM_LD_REG_Ptr(RDX, *vecResult) ; RDX = *vecResult
; !MOV RDX, [p.p_vecResult] ; or alternatively the ASM code
ASM_SAV_XMM(XMM0, RDX) ; vecResult = XMM0 : low latency 128-bit store
EndProcedure
; because all variables in the ASM Macros are defined as in a Procedure, we have to use
; a Procedure for the TestCode
Procedure Test()
Protected.Vector4 v1, v2, vres
v1\x = 1.0
v1\y = 12.0
v1\z = 3.0
v1\w = 14.0
v2\x = 11.0
v2\y = 2.0
v2\z = 13.0
v2\w = 4.0
Debug "SSE Vector4 operations"
Debug "v1.Vector4"
DbgVector4(v1)
Debug ""
Debug "v2.Vector4"
DbgVector4(v2)
; example adding two Vector4 Structures with SSE Commands
; vres = v1 + v2
ASM_SIMD_Add_4PS(v1, v2) ; ADD v1 to v2 using XMM0/XMM1 -> result in XMM0. Using RAX/RDX as pointer Registers
ASM_LD_REG_VarPtr(RDX, vres) ; load Register with Pointer @vres
ASM_SAV_XMM() ; save XMM0 to Memory pointed by RDX : vres = XMM0
Debug ""
Debug "v1 + v2"
DbgVector4(vres)
; example multiply two Vector4 Structures with SSE Commands
; vres = v1 * v2
ASM_SIMD_MUL_4PS(v1, v2) ; MUL v1 by v2 using XMM0/XMM1 -> result in XMM0. Using RAX/RDX as pointer Registers
ASM_LD_REG_VarPtr(RDX, vres) ; load Register with Pointer @vres
ASM_SAV_XMM() ; save XMM0 to Memory pointed by RDX : vres = XMM0
; example divide two Vector4 Structures with SSE Commands
; vres = vres / v2 -> result will be v1
ASM_SIMD_DIV_4PS(vres, v2) ; DIV vres by v2 using XMM0/XMM1 -> result in XMM0. Using RAX/RDX as pointer Registers
ASM_LD_REG_VarPtr(RDX, vres) ; load Register with Pointer @vres
ASM_SAV_XMM() ; save XMM0 to Memory pointed by RDX : vres = XMM0
Debug ""
Debug "vres / v2 -> result will be v1"
DbgVector4(vres)
; example minimum of two Vector4 Structures with SSE Commands
; vres = Min(v1, v2)
ASM_SIMD_MIN_4PS(v1, v2) ; MIN of v1 and v2 using XMM0/XMM1 -> result in XMM0. Using RAX/RDX as pointer Registers
ASM_LD_REG_VarPtr(RDX, vres) ; load Register with Pointer @vres
ASM_SAV_XMM() ; save XMM0 to Memory pointed by RDX : vres = XMM0
Debug ""
Debug "Min(v1, v2)"
DbgVector4(vres)
; example maximum of two Vector4 Structures with SSE Commands
; vres = Max(v1, v2)
ASM_SIMD_MAX_4PS(v1, v2) ; MAX of v1 and v2 using XMM0/XMM1 -> result in XMM0. Using RAX/RDX as pointer Registers
ASM_LD_REG_VarPtr(RDX, vres) ; load Register with Pointer @vres
ASM_SAV_XMM() ; save XMM0 to Memory pointed by RDX : vres = XMM0
Debug ""
Debug "Max(v1, v2)"
DbgVector4(vres)
Debug "--------------------------------------------------"
Debug ""
Debug "Timing Test"
Debug "First test the result of classic ADD and SSE ADD"
Debug "Classic ADD:"
AddVector4(vres, v1, v2)
DbgVector4(vres)
Debug "SSE ADD:"
AddVector4_SSE(vres, v1, v2)
DbgVector4(vres)
Debug ""
Define I, t1, t2
#Loops = 1000 * 10000
DisableDebugger
; because DisableDebugger does not switch off the Debugger completely -> do not run the timing code when
; compiling with the Debugger (PB 6.21)
CompilerIf Not #PB_Compiler_Debugger
; Classic ADD
AddVector4(vres, v1, v2) ; load Proc into the cache
t1 = ElapsedMilliseconds()
For I = 0 To #Loops
AddVector4(vres, v1, v2)
Next
t1 = ElapsedMilliseconds() - t1
; SSE ADD
AddVector4_SSE(vres, v1, v2) ; load Proc into the cache
t2 = ElapsedMilliseconds()
For I = 0 To #Loops
AddVector4_SSE(vres, v1, v2)
Next
t2 = ElapsedMilliseconds() - t2
EnableDebugger
OpenConsole()
PrintN("Debugger off for SpeedTest to get the correct timing")
PrintN( "Result for Loops=" + #Loops)
PrintN( "Classic ADD ms=" + t1)
PrintN( "SSE SIMD ADD ms=" + t2)
PrintN("Press a Key")
Input()
CompilerEndIf
EndProcedure
DisableASM
Test()
CompilerEndIf