
Helpful Assembler Macros for SSE SIMD

Posted: Wed Nov 12, 2025 6:37 pm
by SMaag
For those who are still using the ASM Backend with inline Assembler Code, or for those who want to start using it.

Here is a collection of helpful/important ASM Macros I use in my code. The version is still in a developer state because not all Macros are 100% tested, but they should work!

----------------------------------------------------------------------
I ran into some conflicts and changed a lot again.
Now it is possible to use the Macros inside and outside of Procedures.
Changed the Macro names to a standard naming convention, because SIMD vector commands
like MUL/DIV are not the same as the mathematical vector Mul/Div. The Macros are now named
ASM_SIMD instead of ASM_Vec4.

Update V0.6; 2025/11/14
The TestCode shows the use of .Vector4 functions with SSE-Commands

Code:

; ===========================================================================
;  FILE : PbFw_ASM_Macros.pbi
;  NAME : Collection of Assembler Macros for SIMD SSE instructions
;  DESC : Since PB has a C-Backend, Assembler Code in PB only makes sense if
;  DESC : it is used for SIMD instructions. Generally SIMD is suitable for all vector 
;  DESC : arithmetic like 3D Graphics, Color operations and Complex number arithmetic.
;  DESC : This Library provides a general basic set of Macros for SIMD vector operations
;  DESC : for 2D and 4D data, together with the necessary Macros for preserving non-volatile 
;  DESC : Registers:
;  DESC : Macros for PUSH, POP of MMX and XMM Registers 
;  DESC : Macros for SIMD .Vector4 functions using SSE-Commands
;  DESC : Macros for SIMD .Vector2 functions using SSE-Commands
;  DESC : The Macros now work inside and outside of Procedures, but need the EnableASM Statement
;  DESC : The Macros are Structure-Name independent. A fixed Data definition is used, but
;  DESC : we can pass any Structure to it without Name checking. So it does not matter if we
;  DESC : use the PB .Vector4 or our own Structure that implements the 4 floats (like 'VECf' from the Vector Module) 
;  SOURCES:
;   A full description of the Intel Assembler Commands
;   https://hjlebbink.github.io/x86doc/

; ===========================================================================
;
; AUTHOR   :  Stefan Maag
; DATE     :  2024/02/04
; VERSION  :  0.6 Developer Version
; COMPILER :  PureBasic 6.0+
;
; LICENCE  :  MIT License see https://opensource.org/license/mit/
;             or \PbFramWork\MitLicence.txt
; ===========================================================================
; ChangeLog: 
; {
; 2025/11/14 S.Maag : I ran into some practical conflicts.
;                     - sometimes for 2D coordinates it is not suitable to use 4D coordinates.
;                       Because of that I added the 2-dimensional SSE Commands for double floats.
;                     - Naming convention: SIMD SSE vector functions MUL/DIV... are not the same as 
;                       the mathematically correct Vector MUL/DIV. Because of that I changed the naming 
;                       to exactly what it is: SIMD (Single Instruction Multiple Data).
;                       Like ASM_Vec4_ADD_PS -> ASM_SIMD_ADD_4PS (SIMD ADD 4 packed single)

; 2025/11/13 S.Maag : modified Macros for use inside and outside of Procedures.
;                     This is possible with the PB ASM preprocessor (EnableASM).
;                     Changed the register loads from !MOV REG, [p.v_var] to MOV REG, var
;                     Added the _VectorPointerToREG Macro to handle *vec or vec automatically.

; 2025/11/12 S.Maag : added/changed some comments. Repaired bugs in XMM functions.
;                     For Vector4 Functions we have to determine the VarType of the 
;                     Vector4 Structure: #ASM_VAR or #ASM_PTR. Need the LEA command for #ASM_VAR
;                     and the MOV command for #ASM_PTR

; 2024/08/01 S.Maag : added Register Load/Save Macros and Vector Macros
;                     for packed SingleFloat and packed DoubleWord

;{ TODO:
; - Add Functions for correct Shuffling of 2D/4D data to use the SIMD vertical ADD/SUB functions instead
;   of horizontal ADD/SUB. The combination of Shuffling and vertical ADD/SUB is faster than the
;   horizontal ADD/SUB. All this is needed for 3D Graphics Vector and Matrix Multiplication.
;   There is a heavy use of combined Multiply & ADD. The first MMX integration was the combined Multiply and
;   horizontal ADD. Later the Shuffle commands were added to the instruction set. A combination of MUL, SHUF, vertical ADD
;   has much lower latency than the older horizontal functions.

; - Add Functions for 4D Double Float Vectors. But this is a little bit more complicated than it seems.
;   For the 4D Double we have to change to the 256-Bit YMM-Registers. Those are the commands in the SSE instruction set
;   with the 'V' prefix like VMAXPD. But the 256 Bit instructions with 'V' have different functions compared to
;   the 128 Bit instructions. So first an exact study of the documentation is needed.

; - Add Functions for fast SIMD Color operations

;}
; ===========================================================================

; ------------------------------
; MMX and SSE Registers
; ------------------------------
; MM0..MM7    :  MMX    : Pentium P55C (Q1 1997) and AMD K6 (Q2 1997)
; XMM0..XMM15 :  SSE    : Intel Core2 and AMD K8 Athlon64 (2003)
; YMM0..YMM15 :  AVX256 : Intel SandyBridge (Q1 2011) and AMD Bulldozer (Q4 2011)
; X/Y/ZMM0..31 : AVX512 : Tiger Lake (Q4 2020) and AMD Zen4 (Q4 2022)

; ------------------------------
; Caller/callee saved registers
; ------------------------------
; The x64 ABI considers the registers RAX, RCX, RDX, R8, R9, R10, R11, and XMM0-XMM5 volatile.
; When present, the upper portions of YMM0-YMM15 and ZMM0-ZMM15 are also volatile. On AVX512VL,
; the ZMM, YMM, and XMM registers 16-31 are also volatile. When AMX support is present, 
; the TMM tile registers are volatile. Consider volatile registers destroyed on function calls
; unless otherwise safety-provable by analysis such as whole program optimization.
; The x64 ABI considers registers RBX, RBP, RDI, RSI, RSP, R12, R13, R14, R15, and XMM6-XMM15 nonvolatile.
; They must be saved and restored by a function that uses them. 

;https://learn.microsoft.com/en-us/cpp/build/x64-software-conventions?view=msvc-170
; x64 calling conventions - Register use

; ------------------------------
; x64 CPU Register
; ------------------------------
; RAX 	    Volatile 	    Return value register
; RCX 	    Volatile 	    First integer argument
; RDX 	    Volatile 	    Second integer argument
; R8 	      Volatile 	    Third integer argument
; R9 	      Volatile 	    Fourth integer argument
; R10:R11 	Volatile 	    Must be preserved as needed by caller; used in syscall/sysret instructions
; R12:R15 	Nonvolatile 	Must be preserved by callee
; RDI 	    Nonvolatile 	Must be preserved by callee
; RSI 	    Nonvolatile 	Must be preserved by callee
; RBX 	    Nonvolatile 	Must be preserved by callee
; RBP 	    Nonvolatile 	May be used as a frame pointer; must be preserved by callee
; RSP 	    Nonvolatile 	Stack pointer
; ------------------------------
; MMX-Register
; ------------------------------
; MM0:MM7   Nonvolatile   Registers shared with FPU-Register. An EMMS Command is necessary after MMX-Register use
;                         to enable correct FPU functions again. 
; ------------------------------
; SSE Register
; ------------------------------
; XMM0, YMM0 	Volatile 	  First FP argument; first vector-type argument when __vectorcall is used
; XMM1, YMM1 	Volatile 	  Second FP argument; second vector-type argument when __vectorcall is used
; XMM2, YMM2 	Volatile 	  Third FP argument; third vector-type argument when __vectorcall is used
; XMM3, YMM3 	Volatile 	  Fourth FP argument; fourth vector-type argument when __vectorcall is used
; XMM4, YMM4 	Volatile 	  Must be preserved as needed by caller; fifth vector-type argument when __vectorcall is used
; XMM5, YMM5 	Volatile 	  Must be preserved as needed by caller; sixth vector-type argument when __vectorcall is used
; XMM6:XMM15, YMM6:YMM15 	Nonvolatile (XMM), Volatile (upper half of YMM) 	Must be preserved by callee. YMM registers must be preserved as needed by caller.


; @f = Jump forward to the next @@;  @b = Jump backward to the previous @@  
; .Loop:      ; is a local Label or SubLabel. It works from the last global label
; The PB compiler sets a global label for each Procedure, so local labels work only inside the Procedure
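; A minimal (commented) sketch of an anonymous-label count-down loop inside a Procedure,
; assuming the x64 ASM Backend and EnableASM (name and value are illustrative only):
;   Procedure CountDown(n.i)
;     MOV RCX, n              ; EnableASM form: load the PB variable into RCX
;     !@@:                    ; anonymous label
;     !DEC RCX
;     !JNZ @b                 ; jump back to the nearest preceding @@
;   EndProcedure
; The same loop could use a local label instead: !.Loop: ... !JNZ .Loop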

; ------------------------------
; Some important SIMD instructions 
; ------------------------------
; https://hjlebbink.github.io/x86doc/

; PAND, POR, PXOR, PADD ...  : SSE2
; PCMPEQW         : SSE2  : Compare Packed Data for Equal
; PSHUFLW         : SSE2  : Shuffle Packed Low Words
; PSHUFHW         : SSE2  : Shuffle Packed High Words
; PSHUFB          : SSSE3 : Packed Shuffle Bytes !
; PEXTR[B/W/D/Q]  : SSE4.1 : PEXTRB RAX, XMM0, 1 : loads Byte 1 of XMM0[Byte 0..15] 
; PINSR[B/W/D/Q]  : SSE4.1 : PINSRB XMM0, RAX, 1 : transfers RAX LoByte to Byte 1 of XMM0 
; PCMPESTRI       : SSE4.2 : Packed Compare Explicit Length Strings, Return Index
; PCMPISTRM       : SSE4.2 : Packed Compare Implicit Length Strings, Return Mask

;- ----------------------------------------------------------------------
;- NaN Value 32/64 Bit
; #Nan32 = $FFC00000            ; Bit representation for the 32Bit Float NaN value
; #Nan64 = $FFF8000000000000    ; Bit representation for the 64Bit Float NaN value
;  ----------------------------------------------------------------------

; --------------------------------------------------
; Assembler Datasection Definition
; --------------------------------------------------
; db 	Define Byte = 1 byte
; dw 	Define Word = 2 bytes
; dd 	Define Doubleword = 4 bytes
; dq 	Define Quadword = 8 bytes
; dt 	Define ten Bytes = 10 bytes
; !label: dq 21, 22, 23
; --------------------------------------------------
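; A tiny (commented) usage sketch; the label name 'MyConsts' is illustrative.
; Place the data line where execution does not run into it (e.g. inside a DataSection):
;   !MyConsts: dq 21, 22, 23
;   !MOV RAX, [MyConsts+8]    ; RAX = 22 (offsets are in bytes, 8 = one quadword)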

; ----------------------------------------------------------------------
;  Structures to reserve Space on the Stack for ASM_PUSH, ASM_POP
; ----------------------------------------------------------------------

Structure TStack_16Byte
  R.q[2]  
EndStructure

Structure TStack_32Byte
  R.q[4]  
EndStructure

Structure TStack_48Byte
  R.q[6]  
EndStructure

Structure TStack_64Byte
  R.q[8]  
EndStructure

Structure TStack_96Byte
  R.q[12]  
EndStructure

Structure TStack_128Byte
  R.q[16]  
EndStructure

Structure TStack_256Byte
  R.q[32]  
EndStructure

Structure TStack_512Byte
  R.q[64]  
EndStructure

Macro AsmCodeIsInProc
  Bool(#PB_Compiler_Procedure <> #Null$)  
EndMacro

;- ----------------------------------------------------------------------
;- CPU Registers
;- ----------------------------------------------------------------------

; separate Macros for EBX/RBX because this is often needed, especially for x32
; It is not really a PUSH/POP, it is more a SAVE/RESTORE!

; ATTENTION! Use EnableASM in your code before using the Macros
; By using the PB ASM preprocessor and the Define Statement instead of Protected
; we can use the Macros now inside and outside of a Procedure.
; Inside a Procedure PB handles Define and Protected in the same way.
Macro ASM_PUSH_EBX()
  Define mEBX  
  MOV mEBX, EBX
  ; !MOV [p.v_mEBX], EBX
EndMacro

Macro ASM_POP_EBX()
  MOV EBX, mEBX
  ; !MOV EBX, [p.v_mEBX]
EndMacro

Macro ASM_PUSH_RBX()
  Define mRBX
  MOV mRBX, RBX
  ; !MOV [p.v_mRBX], RBX
EndMacro

Macro ASM_POP_RBX()
  MOV RBX, mRBX
  ;!MOV RBX, [p.v_mRBX]
EndMacro
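
; Minimal (commented) usage sketch, assuming EnableASM and the x64 ASM Backend:
;   Procedure UseRBX()
;     ASM_PUSH_RBX()          ; save the nonvolatile RBX
;     !MOV RBX, 1234          ; RBX is now free for our own use
;     ASM_POP_RBX()           ; restore RBX before leaving the Procedure
;   EndProcedure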
 
; The LEA instruction: LoadEffectiveAddress of a variable
Macro ASM_PUSH_R10to11(_REG=RDX)
  Define R1011.TStack_16Byte
  LEA _REG, R1011
  ;!LEA _REG, [p.v_R1011]        ; RDX = @R1011 = Pointer to RegisterBackupStruct
  !MOV [_REG], R10
  !MOV [_REG+8], R11
EndMacro

Macro ASM_POP_R10to11(_REG=RDX)
  LEA _REG, R1011
  ; !LEA _REG, [p.v_R1011]        ; RDX = @R1011 = Pointer to RegisterBackupStruct
  !MOV R10, [_REG]
  !MOV R11, [_REG+8]
EndMacro

Macro ASM_PUSH_R12to15(_REG=RDX)
  Define R1215.TStack_32Byte
  LEA _REG, R1215
  ; !LEA _REG, [p.v_R1215]        ; RDX = @R1215 = Pointer to RegisterBackupStruct
  !MOV [_REG], R12
  !MOV [_REG+8], R13
  !MOV [_REG+16], R14
  !MOV [_REG+24], R15
EndMacro

Macro ASM_POP_R12to15(_REG=RDX)
  LEA _REG, R1215
  ; !LEA _REG, [p.v_R1215]        ; RDX = @R1215 = Pointer to RegisterBackupStruct
  !MOV R12, [_REG]
  !MOV R13, [_REG+8]
  !MOV R14, [_REG+16]
  !MOV R15, [_REG+24]
EndMacro
 
;- ----------------------------------------------------------------------
;- MMX Registers (on x64 don't use MMX-Registers! XMM is better!)
;- ----------------------------------------------------------------------

; All MMX-Registers are non volatile (shared with the FPU-Registers)
; After the end of use of the MMX-Registers an EMMS command must follow to enable
; correct FPU operations again!

Macro ASM_PUSH_MM_0to3(_REG=RDX)
  Define M03.TStack_32Byte
  LEA _REG, M03
  ; !LEA _REG, [p.v_M03]          ; RDX = @M03 = Pointer to RegisterBackupStruct 
  !MOVQ [_REG], MM0
  !MOVQ [_REG+8], MM1
  !MOVQ [_REG+16], MM2
  !MOVQ [_REG+24], MM3
EndMacro

Macro ASM_POP_MM_0to3(_REG=RDX)
  LEA _REG, M03
  ; !LEA _REG, [p.v_M03]          ; RDX = @M03 = Pointer to RegisterBackupStruct  
  !MOVQ MM0, [_REG]
  !MOVQ MM1, [_REG+8]
  !MOVQ MM2, [_REG+16]
  !MOVQ MM3, [_REG+24]
EndMacro

Macro ASM_PUSH_MM_4to5(_REG=RDX)
  Define M45.TStack_32Byte
  LEA _REG, M45
  ; !LEA _REG, [p.v_M45]          ; RDX = @M45 = Pointer to RegisterBackupStruct 
  !MOVQ [_REG], MM4
  !MOVQ [_REG+8], MM5
EndMacro

Macro ASM_POP_MM_4to5(_REG=RDX)
  LEA _REG, M45
  ;!LEA _REG, [p.v_M45]          ; RDX = @M45 = Pointer to RegisterBackupStruct  
  !MOVQ MM4, [_REG]
  !MOVQ MM5, [_REG+8]
EndMacro

Macro ASM_PUSH_MM_4to7(_REG=RDX)
  Define M47.TStack_32Byte
  LEA _REG, M47
  ; !LEA _REG, [p.v_M47]          ; RDX = @M47 = Pointer to RegisterBackupStruct 
  !MOVQ [_REG], MM4
  !MOVQ [_REG+8], MM5
  !MOVQ [_REG+16], MM6
  !MOVQ [_REG+24], MM7
EndMacro

Macro ASM_POP_MM_4to7(_REG=RDX)
  LEA _REG, M47
  ;!LEA _REG, [p.v_M47]          ; RDX = @M47 = Pointer to RegisterBackupStruct  
  !MOVQ MM4, [_REG]
  !MOVQ MM5, [_REG+8]
  !MOVQ MM6, [_REG+16]
  !MOVQ MM7, [_REG+24]
EndMacro

;- ----------------------------------------------------------------------
;- XMM Registers
;- ----------------------------------------------------------------------

; because of unaligned Memory latency we use 2x 64Bit MOV instead of 1x 128Bit MOV
; MOVDQU [ptrREG], XMM4 -> MOVLPS [ptrREG], XMM4  and  MOVHPS [ptrREG+8], XMM4
; an x64 Processor can do two 64Bit Memory transfers in parallel

; XMM4:XMM5 normally are volatile and we do not have to preserve them

; ATTENTION: XMM4:XMM5 must be preserved only when __vectorcall is used.
; As far as I know PB does not use __vectorcall in the ASM Backend. So if we use them 
; within a Procedure where __vectorcall isn't used, we don't have to preserve them.
; So we keep the Macros empty. If you want to preserve them anyway, just activate the commented code below.


Macro ASM_PUSH_XMM_4to5(_REG=RDX) 
EndMacro

Macro ASM_POP_XMM_4to5(_REG=RDX)
EndMacro

; Macro ASM_PUSH_XMM_4to5(REG=RDX)
;   Define X45.TStack_32Byte
;   LEA REG, X45
;   ; !LEA REG, [p.v_X45]          ; RDX = @X45 = Pointer to RegisterBackupStruct 
;   !MOVLPS [REG], XMM4
;   !MOVHPS [REG+8], XMM4 
;   !MOVLPS [REG+16], XMM5
;   !MOVHPS [REG+24], XMM5
; EndMacro

; Macro ASM_POP_XMM_4to5(REG)
;   LEA REG, X45
;   ; !LEA REG, [p.v_X45]          ; RDX = @X45 = Pointer to RegisterBackupStruct
;   !MOVLPS XMM4, [REG]
;   !MOVHPS XMM4, [REG+8]  
;   !MOVLPS XMM5, [REG+16]
;   !MOVHPS XMM5, [REG+24]
; EndMacro
; ======================================================================

Macro ASM_PUSH_XMM_6to7(_REG=RDX)
  Define X67.TStack_32Byte
  LEA _REG, X67
  ; !LEA _REG, [p.v_X67]          ; RDX = @X67 = Pointer to RegisterBackupStruct    
  !MOVLPS [_REG], XMM6
  !MOVHPS [_REG+8], XMM6 
  !MOVLPS [_REG+16], XMM7
  !MOVHPS [_REG+24], XMM7
EndMacro

Macro ASM_POP_XMM_6to7(_REG=RDX)
  LEA _REG, X67
  ;!LEA _REG, [p.v_X67]          ; RDX = @X67 = Pointer to RegisterBackupStruct  
  !MOVLPS XMM6, [_REG]
  !MOVHPS XMM6, [_REG+8]
  !MOVLPS XMM7, [_REG+16]
  !MOVHPS XMM7, [_REG+24]  
EndMacro

;- ----------------------------------------------------------------------
;- YMM Registers
;- ----------------------------------------------------------------------

; for the YMM 256Bit Registers we switch to aligned Memory commands (much faster than unaligned)
; YMM needs a 256Bit = 32Byte alignment. So we need 32 Bytes more Memory to align it
; manually! We have to ADD 32 to the address and then clear the lo 5 bits
; to get an address with Align 32 (see the worked example below)

; ATTENTION!  When using YMM-Registers we have to preserve only the lo-parts (XMM-Part)
;             The hi-parts are always volatile. So preserving the XMM-Registers is enough!
; Use these Macros only if you want to preserve the complete YMM-Registers for your own purpose!
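; Worked example of the align-up arithmetic (the address value is illustrative only):
;   _REG = @Y45               = ...17h    ; arbitrary, unaligned start of the 96-byte buffer
;   _REG + 32                 = ...37h
;   SHR 5 then SHL 5          = ...20h    ; lo 5 bits cleared -> 32-byte aligned, still inside the 96-byte buffer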
Macro ASM_PUSH_YMM_4to5(_REG=RDX)
  Define Y45.TStack_96Byte ; we need 64Byte and use 96 to get Align 32
  ; Align Address to 32 Byte, so we can use aligned MOV VMOVAPD
  LEA _REG, Y45
  ; !LEA _REG, [p.v_Y45]          ; RDX = @Y45 = Pointer to RegisterBackupStruct
  !ADD _REG, 32
  !SHR _REG, 5
  !SHL _REG, 5
  ; Move the YMM Registers to Memory Align 32
  !VMOVAPD [_REG], YMM4
  !VMOVAPD [_REG+32], YMM5
EndMacro

Macro ASM_POP_YMM_4to5(_REG=RDX)
  ; Align Address @Y45 to 32 Byte, so we can use aligned MOV VMOVAPD
  LEA _REG, Y45
  ;!LEA _REG, [p.v_Y45]          ; RDX = @Y45 = Pointer to RegisterBackupStruct
  !ADD _REG, 32
  !SHR _REG, 5
  !SHL _REG, 5
  ; POP Registers from Stack
  !VMOVAPD YMM4, [_REG]
  !VMOVAPD YMM5, [_REG+32]
EndMacro

Macro ASM_PUSH_YMM_6to7(_REG=RDX)
  Define Y67.TStack_96Byte ; we need 64Byte and use 96 to get Align 32
  ; Align Address to 32 Byte, so we can use aligned MOV VMOVAPD
  LEA _REG, Y67
  ; !LEA _REG, [p.v_Y67]          ; RDX = @Y67 = Pointer to RegisterBackupStruct
  !ADD _REG, 32
  !SHR _REG, 5
  !SHL _REG, 5
  ; Move the YMM Registers to Memory Align 32
  !VMOVAPD [_REG], YMM6
  !VMOVAPD [_REG+32], YMM7
EndMacro

Macro ASM_POP_YMM_6to7(_REG=RDX)
  ; Align Address @Y67 to 32 Byte, so we can use aligned MOV VMOVAPD
  LEA _REG, Y67
  ; !LEA _REG, [p.v_Y67]          ; RDX = @Y67 = Pointer to RegisterBackupStruct
  !ADD _REG, 32
  !SHR _REG, 5
  !SHL _REG, 5
  ; POP Registers from Stack
  !VMOVAPD YMM6, [_REG]
  !VMOVAPD YMM7, [_REG+32]
EndMacro

;- ----------------------------------------------------------------------
;- Load/Save Registers from/to Value or Pointer
;- ----------------------------------------------------------------------

; These Macros are just there as a template for the correct code!
; 
; Load a Register with a variable
Macro ASM_LD_REG_Var(_REG, var)
  MOV _REG, var
  ; !MOV _REG, [p.v_#var] 
EndMacro

; Save Register to variable 
Macro ASM_SAV_REG_Var(_REG, var)
  MOV var, _REG
  ; !MOV [p.v_#var], _REG 
EndMacro

; Load Register with a PointerVar
Macro ASM_LD_REG_Ptr(_REG, pVar)
  MOV _REG, pVar
  ; !MOV _REG, [p.p_#pVar]  
EndMacro

; Save Pointer in Register to PointerVar
Macro ASM_SAV_REG_Ptr(_REG, pVar)
  MOV pVar, _REG
  ; !MOV [p.p_#pVar], _REG  
EndMacro

; load the Register with the Pointer of a var
Macro ASM_LD_REG_VarPtr(_REG, var)
  LEA _REG, var      
  ; !LEA _REG, [p.v_#var]      
EndMacro

; ----------------------------------------------------------------------
; Lo latency LOAD/SAVE 128 Bit XMM-Register
; ----------------------------------------------------------------------
; MOVDQU command for 128Bit has long latency.
; 2x 64Bit loads are faster! Processed in parallel in 1 cycle with low or 0 latency.
; This optimization is taken from the AMD code optimization guide.
; (for 2020+ Processors like AMD Ryzen it does not matter, because Ryzen can load
; 128 Bit at the same speed as 2x 64Bit. For older Processors the 2x 64Bit load is faster)

; ATTENTION! You have to load the MemoryPointer to REG first

; _XMM : The XMM Register to load with data from Memory
; REG  : The Register containing the Pointer to the Memory
Macro ASM_LD_XMM(_XMM=XMM0, _REG=RDX)
  !MOVLPS _XMM, [_REG]
  !MOVHPS _XMM, [_REG+8]
EndMacro
; !MOVDQU _XMM, [_REG]    ; alternative 128Bit direct load -> has long latency on Processors older than Ryzen

Macro ASM_SAV_XMM(_XMM=XMM0, _REG=RDX)
  !MOVLPS [_REG], _XMM
  !MOVHPS [_REG+8], _XMM 
EndMacro
; !MOVDQU [_REG], _XMM   ; alternative 128Bit direct store -> has long latency on Processors older than Ryzen

; ----------------------------------------------------------------------
; Helper Macro to determine whether _vector_ is a StructureVar or a StructurePointer
; ----------------------------------------------------------------------
; SizeOf(StructureVar) = 16   SizeOf(StructurePointer) = 8
; TypeOf(StructureVar) = 7    TypeOf(StructurePointer) = 7 -> TypeOf does not work!

; If _vector_ is a PointerToVector we have to use : MOV REG, _vector_
; If _vector_ is the Structure we have to use     : LEA REG, _vector_  ; LoadEffectiveAddress
; The problem solved is mixed calls with PointerToVector and VectorVar.

; This example Proc shows the Problem!
; Procedure VecTest(*InVec.Vector4)
;   Protected v.Vector4, res.Vector4 
;   v\x = 1.0 : v\y = 1.0 : v\z = 1.0 : v\w = 0.0 
;   
;   ASM_SIMD_ADD_4PS(*InVec, v) ; Now this is possible because the compiler auto-detects Pointer or var
;   
; EndProcedure

Macro _VectorPointerToREG(_REG_, _vector_)
  CompilerIf #PB_Compiler_Procedure <> #Null$
    CompilerIf SizeOf(_vector_)=SizeOf(Integer)
      MOV _REG_, _vector_
    CompilerElse
      LEA _REG_, _vector_
    CompilerEndIf
  CompilerElse
    MOV _REG_, _vector_
  CompilerEndIf
EndMacro
; ----------------------------------------------------------------------

;- ----------------------------------------------------------------------
;- Vector4 PackedSingle ADD, SUB, MUL, DIV (for 4x32Bit Float Vectors)
;- ----------------------------------------------------------------------

; use to speed up standard 3D Graphics with SSE 

; Vector4 is a predefined Structure in PB
;   Structure Vector4
;     x.f
;     y.f
;     z.f
;     w.f
;   EndStructure 

; SSE Extension Functions

; 2025/11/13 : Changed from direct 128Bit loads to 2x 64Bit loads because of the high latency of 
;              unaligned 128Bit loads (MOVDQU) on older processors. Now uses Macro ASM_LD_XMM instead of MOVDQU

; 4PS := 4 packed single (32Bit)

; _XMMA = _vec1 + _vec2
Macro ASM_SIMD_ADD_4PS(_vec1, _vec2, _XMMA=XMM0, _XMMB=XMM1, _REGA=RAX, _REGD=RDX)
  _VectorPointerToREG(_REGA, _vec1)
  _VectorPointerToREG(_REGD, _vec2)    
  ASM_LD_XMM(_XMMA, _REGA)
  ASM_LD_XMM(_XMMB, _REGD) 
  !ADDPS _XMMA, _XMMB       ; Add packed single float
EndMacro

; _XMMA = _vec1 - _vec2
Macro ASM_SIMD_SUB_4PS(_vec1, _vec2, _XMMA=XMM0, _XMMB=XMM1, _REGA=RAX, _REGD=RDX)
  _VectorPointerToREG(_REGA, _vec1)
  _VectorPointerToREG(_REGD, _vec2)    
  ASM_LD_XMM(_XMMA, _REGA)
  ASM_LD_XMM(_XMMB, _REGD) 
  !SUBPS _XMMA, _XMMB       ; Sub packed single float
EndMacro

; _XMMA = _vec1 * _vec2
Macro ASM_SIMD_MUL_4PS(_vec1, _vec2, _XMMA=XMM0, _XMMB=XMM1, _REGA=RAX, _REGD=RDX)
  _VectorPointerToREG(_REGA, _vec1)
  _VectorPointerToREG(_REGD, _vec2)    
  ASM_LD_XMM(_XMMA, _REGA)
  ASM_LD_XMM(_XMMB, _REGD) 
  !MULPS _XMMA, _XMMB       ; Mul packed single float
EndMacro

; _XMMA = _vec1 / _vec2
Macro ASM_SIMD_DIV_4PS(_vec1, _vec2, _XMMA=XMM0, _XMMB=XMM1, _REGA=RAX, _REGD=RDX)
  _VectorPointerToREG(_REGA, _vec1)
  _VectorPointerToREG(_REGD, _vec2)    
  ASM_LD_XMM(_XMMA, _REGA)
  ASM_LD_XMM(_XMMB, _REGD) 
  !DIVPS _XMMA, _XMMB
EndMacro

; _XMMA\x = Min(_vec1\x, _vec2\x) : _XMMA\y = Min(_vec1\y, _vec2\y) ...
Macro ASM_SIMD_MIN_4PS(_vec1, _vec2, _XMMA=XMM0, _XMMB=XMM1, _REGA=RAX, _REGD=RDX)
  _VectorPointerToREG(_REGA, _vec1)
  _VectorPointerToREG(_REGD, _vec2)    
  ASM_LD_XMM(_XMMA, _REGA)
  ASM_LD_XMM(_XMMB, _REGD) 
  !MINPS _XMMA, _XMMB       ; Minimum of packed single float
EndMacro

; _XMMA\x = Max(_vec1\x, _vec2\x) : _XMMA\y = Max(_vec1\y, _vec2\y) ...
Macro ASM_SIMD_MAX_4PS(_vec1, _vec2, _XMMA=XMM0, _XMMB=XMM1, _REGA=RAX, _REGD=RDX)
  _VectorPointerToREG(_REGA, _vec1)
  _VectorPointerToREG(_REGD, _vec2)    
  ASM_LD_XMM(_XMMA, _REGA)
  ASM_LD_XMM(_XMMB, _REGD) 
  !MAXPS _XMMA, _XMMB       ; Maximum of packed single float
EndMacro

;- ----------------------------------------------------------------------
;- Vector4 PackedDoubleWord ADD, SUB, MUL (for 4x32Bit Integer Vectors)
;- ----------------------------------------------------------------------

;   Structure Vector4L    ; This is not predefined in PB
;     x.l
;     y.l
;     z.l
;     w.l
;   EndStructure 

; SSE Extension Functions
; use for direct Integer Pixel position calculations 

; 4PDW := 4 packed double words (32Bit)
; _XMMA = _vec1 + _vec2
Macro ASM_SIMD_ADD_4PDW(_vec1, _vec2, _XMMA=XMM0, _XMMB=XMM1, _REGA=RAX, _REGD=RDX)
  _VectorPointerToREG(_REGA, _vec1)
  _VectorPointerToREG(_REGD, _vec2)    
  ASM_LD_XMM(_XMMA, _REGA)
  ASM_LD_XMM(_XMMB, _REGD) 
  !PADDD _XMMA, _XMMB       ; Add packed DoubleWord integers
EndMacro

; _XMMA = _vec1 - _vec2
Macro ASM_SIMD_SUB_4PDW(_vec1, _vec2, _XMMA=XMM0, _XMMB=XMM1, _REGA=RAX, _REGD=RDX)
  _VectorPointerToREG(_REGA, _vec1)
  _VectorPointerToREG(_REGD, _vec2)    
  ASM_LD_XMM(_XMMA, _REGA)
  ASM_LD_XMM(_XMMB, _REGD) 
  !PSUBD _XMMA, _XMMB       ; Subtract packed DoubleWord integers
EndMacro

; _XMMA = _vec1 * _vec2
Macro ASM_SIMD_MUL_4PDW(_vec1, _vec2, _XMMA=XMM0, _XMMB=XMM1, _REGA=RAX, _REGD=RDX)
  _VectorPointerToREG(_REGA, _vec1)
  _VectorPointerToREG(_REGD, _vec2)    
  ASM_LD_XMM(_XMMA, _REGA)
  ASM_LD_XMM(_XMMB, _REGD) 
  !PMULLD _XMMA, _XMMB      ; Multiply packed DoubleWord Integers, keep the low 32 bits (SSE4.1)
EndMacro

; An instruction to divide packed DoubleWord Integers (a PDIVDQ) does not exist, because the number of CPU cycles would depend on the operands 

; _XMMA\x = Min(_vec1\x, _vec2\x) : _XMMA\y = Min(_vec1\y, _vec2\y) ...
Macro ASM_SIMD_MIN_4PDW(_vec1, _vec2, _XMMA=XMM0, _XMMB=XMM1, _REGA=RAX, _REGD=RDX)
  _VectorPointerToREG(_REGA, _vec1)
  _VectorPointerToREG(_REGD, _vec2)    
  ASM_LD_XMM(_XMMA, _REGA)
  ASM_LD_XMM(_XMMB, _REGD) 
  !PMINSD _XMMA, _XMMB      ; Minimum of signed packed Doubleword Integers
EndMacro

; _XMMA\x = Max(_vec1\x, _vec2\x) : _XMMA\y = Max(_vec1\y, _vec2\y) ...
Macro ASM_SIMD_MAX_4PDW(_vec1, _vec2,  _XMMA=XMM0, _XMMB=XMM1, _REGA=RAX, _REGD=RDX)
  _VectorPointerToREG(_REGA, _vec1)
  _VectorPointerToREG(_REGD, _vec2)
  ASM_LD_XMM(_XMMA, _REGA)
  ASM_LD_XMM(_XMMB, _REGD) 
  !PMAXSD _XMMA, _XMMB      ; Maximum of signed packed Doubleword Integers
EndMacro

;- ----------------------------------------------------------------------
;- Vector2 PackedDouble ADD, SUB, MUL, DIV (for 2x64Bit Double Vectors)
;- ----------------------------------------------------------------------

; use for 2D Double Float coordinates and Complex Number math

; 2PD := 2 packed double (64Bit)
; _XMMA = _vec1 + _vec2
Macro ASM_SIMD_ADD_2PD(_vec1, _vec2, _XMMA=XMM0, _XMMB=XMM1, _REGA=RAX, _REGD=RDX)
  _VectorPointerToREG(_REGA, _vec1)
  _VectorPointerToREG(_REGD, _vec2)    
  ASM_LD_XMM(_XMMA, _REGA)
  ASM_LD_XMM(_XMMB, _REGD) 
  !ADDPD _XMMA, _XMMB       ; Add packed double float
EndMacro

; _XMMA = _vec1 - _vec2
Macro ASM_SIMD_SUB_2PD(_vec1, _vec2, _XMMA=XMM0, _XMMB=XMM1, _REGA=RAX, _REGD=RDX)
  _VectorPointerToREG(_REGA, _vec1)
  _VectorPointerToREG(_REGD, _vec2)    
  ASM_LD_XMM(_XMMA, _REGA)
  ASM_LD_XMM(_XMMB, _REGD) 
  !SUBPD _XMMA, _XMMB       ; Sub packed double float
EndMacro

; _XMMA = _vec1 * _vec2
Macro ASM_SIMD_MUL_2PD(_vec1, _vec2, _XMMA=XMM0, _XMMB=XMM1, _REGA=RAX, _REGD=RDX)
  _VectorPointerToREG(_REGA, _vec1)
  _VectorPointerToREG(_REGD, _vec2)    
  ASM_LD_XMM(_XMMA, _REGA)
  ASM_LD_XMM(_XMMB, _REGD) 
  !MULPD _XMMA, _XMMB       ; Mul packed double float
EndMacro

; _XMMA = _vec1 / _vec2
Macro ASM_SIMD_DIV_2PD(_vec1, _vec2, _XMMA=XMM0, _XMMB=XMM1, _REGA=RAX, _REGD=RDX)
  _VectorPointerToREG(_REGA, _vec1)
  _VectorPointerToREG(_REGD, _vec2)    
  ASM_LD_XMM(_XMMA, _REGA)
  ASM_LD_XMM(_XMMB, _REGD) 
  !DIVPD _XMMA, _XMMB
EndMacro

; _XMMA\x = Min(_vec1\x, _vec2\x) : _XMMA\y = Min(_vec1\y, _vec2\y) ...
Macro ASM_SIMD_MIN_2PD(_vec1, _vec2, _XMMA=XMM0, _XMMB=XMM1, _REGA=RAX, _REGD=RDX)
  _VectorPointerToREG(_REGA, _vec1)
  _VectorPointerToREG(_REGD, _vec2)    
  ASM_LD_XMM(_XMMA, _REGA)
  ASM_LD_XMM(_XMMB, _REGD) 
  !MINPD _XMMA, _XMMB       ; Minimum of packed double float
EndMacro

; _XMMA\x = Max(_vec1\x, _vec2\x) : _XMMA\y = Max(_vec1\y, _vec2\y) ...
Macro ASM_SIMD_MAX_2PD(_vec1, _vec2, _XMMA=XMM0, _XMMB=XMM1, _REGA=RAX, _REGD=RDX)
  _VectorPointerToREG(_REGA, _vec1)
  _VectorPointerToREG(_REGD, _vec2)    
  ASM_LD_XMM(_XMMA, _REGA)
  ASM_LD_XMM(_XMMB, _REGD) 
  !MAXPD _XMMA, _XMMB       ; Maximum of packed double float
EndMacro

CompilerIf #PB_Compiler_IsMainFile
  ;- ----------------------------------------------------------------------
  ;- TEST-CODE
  ;- ----------------------------------------------------------------------
  
  EnableExplicit
  
  Macro DbgVector4(_v4)
    Debug "x=" + _v4\x
    Debug "y=" + _v4\y
    Debug "z=" + _v4\z
    Debug "w=" + _v4\w   
  EndMacro
  
  EnableASM
  
  Procedure AddVector4(*vecResult.Vector4, *vec1.Vector4, *vec2.Vector4)
    With *vecResult
      \x = *vec1\x + *vec2\x
      \y = *vec1\y + *vec2\y
      \z = *vec1\z + *vec2\z
      \w = *vec1\w + *vec2\w
    EndWith
  EndProcedure
  
  Procedure AddVector4_SSE(*vecResult.Vector4, *vec1.Vector4, *vec2.Vector4)
    ASM_SIMD_ADD_4PS(*vec1, *vec2)    ; XMM0 = vec1 + vec2 
    
    ASM_LD_REG_Ptr(RDX, *vecResult) ; RDX = *vecResult
    ; !MOV RDX, [p.p_vecResult]     ; or alternatively the raw ASM code
    
    ASM_SAV_XMM(XMM0, RDX)          ; *vecResult = XMM0 : low latency 128Bit store
  EndProcedure

  ; because all variables in the ASM Macros are defined as inside a Procedure, we have to use
  ; a Procedure for the test code
  Procedure Test()
    Protected.Vector4 v1, v2, vres
    v1\x = 1.0
    v1\y = 12.0
    v1\z = 3.0
    v1\w = 14.0
    
    v2\x = 11.0
    v2\y = 2.0
    v2\z = 13.0
    v2\w = 4.0
        
    Debug "SSE Vector4 operations"
    Debug "v1.Vector4"
    DbgVector4(v1)
    Debug ""
    Debug "v2.Vector4"
    DbgVector4(v2)

    ; example adding two Vector4 Structures with SSE Commands
    ; vres = v1 + v2
    ASM_SIMD_ADD_4PS(v1, v2)          ; ADD v1 to v2 using XMM0/XMM1 -> result in XMM0. Using RAX/RDX as pointer Registers
    ASM_LD_REG_VarPtr(RDX, vres) ; load Register with Pointer @vres
    ASM_SAV_XMM()                 ; save XMM0 to Memory pointed by RDX : vres = XMM0
    
    Debug ""
    Debug "v1 + v2"
    DbgVector4(vres)
    
    ; example multiply two Vector4 Structures with SSE Commands
    ; vres = v1 * v2
    ASM_SIMD_MUL_4PS(v1, v2)        ; MUL v1 by v2 using XMM0/XMM1 -> result in XMM0. Using RAX/RDX as pointer Registers
    ASM_LD_REG_VarPtr(RDX, vres) ; load Register with Pointer @vres
    ASM_SAV_XMM()                 ; save XMM0 to Memory pointed by RDX : vres = XMM0
    
    ; example divide two Vector4 Structures with SSE Commands
    ; vres = vres / v2    -> result will be v1
    ASM_SIMD_DIV_4PS(vres, v2)        ; DIV vres by v2 using XMM0/XMM1 -> result in XMM0. Using RAX/RDX as pointer Registers
    ASM_LD_REG_VarPtr(RDX, vres) ; load Register with Pointer @vres
    ASM_SAV_XMM()                 ; save XMM0 to Memory pointed by RDX : vres = XMM0

    Debug ""
    Debug "vres / v2 -> result will be v1"
    DbgVector4(vres)
    
    ; example minimum of two Vector4 Structures with SSE Commands
    ; vres = Min(v1, v2)
    ASM_SIMD_MIN_4PS(v1, v2)        ; MIN of v1 and v2 using XMM0/XMM1 -> result in XMM0. Using RAX/RDX as pointer Registers
    ASM_LD_REG_VarPtr(RDX, vres) ; load Register with Pointer @vres
    ASM_SAV_XMM()                 ; save XMM0 to Memory pointed by RDX : vres = XMM0
    
    Debug ""
    Debug "Min(v1, v2)"
    DbgVector4(vres)
    
    ; example maximum of two Vector4 Structures with SSE Commands
    ; vres = Max(v1, v2)
    ASM_SIMD_MAX_4PS(v1, v2)        ; MAX of v1 and v2 using XMM0/XMM1 -> result in XMM0. Using RAX/RDX as pointer Registers
    ASM_LD_REG_VarPtr(RDX, vres) ; load Register with Pointer @vres
    ASM_SAV_XMM()                 ; save XMM0 to Memory pointed by RDX : vres = XMM0
    
    Debug ""
    Debug "Max(v1, v2)"
    DbgVector4(vres)
    
    Debug "--------------------------------------------------"
    Debug ""
    
    Debug "Timing Test"
    Debug "First test the result of classic ADD and SSE ADD"
    
    Debug "Classic ADD:"
    AddVector4(vres, v1, v2)
    DbgVector4(vres)
    
    Debug "SSE ADD:"
    AddVector4_SSE(vres, v1, v2)
    DbgVector4(vres)
    
    Debug ""
    
    Define I, t1, t2
    
    #Loops = 1000 * 10000
    
    DisableDebugger
    ; because DisableDebugger does not switch off the Debugger completely -> do not run the timing code when
    ; compiling with the Debugger (PB 6.21)
    CompilerIf Not #PB_Compiler_Debugger
      ; Classic ADD
      AddVector4(vres, v1, v2)      ; Load Proc to Cache  
      t1 = ElapsedMilliseconds()
      For I =0 To #Loops
        AddVector4(vres, v1, v2)
      Next
      t1 = ElapsedMilliseconds() - t1
      
      ; SSE ADD
      AddVector4_SSE(vres, v1, v2)   ; Load Proc to Cache  
      t2 = ElapsedMilliseconds()
      For I =0 To #Loops
        AddVector4_SSE(vres, v1, v2)
      Next
      t2 = ElapsedMilliseconds() - t2
      EnableDebugger
      
      OpenConsole()
      PrintN("Debugger off for SpeedTest to get the correct timing")
  
      PrintN( "Result for Loops=" + #Loops)
      PrintN( "Classic ADD  ms=" + t1)
      PrintN( "SSE SIMD ADD ms=" + t2)
      PrintN("Press a Key")
      Input()
    CompilerEndIf
  
  EndProcedure
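
  ; Additional (hedged) demo of the 2PD (2 packed double) Macros.
  ; The Structure TVector2D is not a built-in PB Structure; it is only defined here for the test.
  Structure TVector2D
    x.d
    y.d
  EndStructure
  
  Procedure Test2PD()
    Protected.TVector2D a, b, res
    a\x = 1.5  : a\y = 2.5
    b\x = 10.0 : b\y = 20.0
    
    ASM_SIMD_ADD_2PD(a, b)          ; XMM0 = a + b (element-wise), using XMM0/XMM1 and RAX/RDX
    ASM_LD_REG_VarPtr(RDX, res)     ; RDX = @res
    ASM_SAV_XMM()                   ; res = XMM0
    
    Debug ""
    Debug "SSE Vector2 double operations"
    Debug "a + b : x=" + res\x + "  y=" + res\y   ; expected x=11.5  y=22.5
  EndProcedure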
  
  DisableASM

  Test() 
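  Test2PD()       ; run the additional 2PD demo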
 
CompilerEndIf


Re: Helpful Assembler Macros

Posted: Wed Nov 12, 2025 9:47 pm
by idle
nice thanks