Matrix Structure Speed test

Post by StarBootics »

Hello everyone,

Apparently, using static arrays inside structures is pretty darn slow compared with plain fields.

Code:

; <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
; Project name : Matrix Structure Speed test
; File Name : Matrix Structure Speed test.pb
; File version: 1.0.0
; Programming : OK
; Programmed by : StarBootics
; Date : 20-03-2021
; Last Update : 20-03-2021
; PureBasic code : V5.73LTS
; Platform : Windows, Linux, MacOS X
; <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<

Structure Line
 
  Element1D.f[4]
 
EndStructure

Structure Matrix4f
 
  Element2D.Line[4]
 
EndStructure

Structure Matrix44f
   e.f[16]
EndStructure

Structure Matrix44
 
  e11.f
  e21.f
  e31.f
  e41.f
  e12.f
  e22.f
  e32.f
  e42.f
  e13.f
  e23.f
  e33.f
  e43.f
  e14.f
  e24.f
  e34.f
  e44.f
 
EndStructure


Macro AccessMatrix44f(MatrixA, Line, Column)
   MatrixA\e[Column * 4 + Line]
EndMacro

Macro AccessMatrix4f(MatrixA, Line, Column)
 
  MatrixA\Element2D[Column]\Element1D[Line]
 
EndMacro

Procedure Identity4f(*MatrixA.Matrix4f)
 
  For i = 0 To 3
   
    For j = 0 To 3
     
      If i = j
        AccessMatrix4f(*MatrixA, i, j) = 1.0
      Else
        AccessMatrix4f(*MatrixA, i, j) = 0.0
      EndIf
     
    Next 
   
  Next
 
EndProcedure

Procedure Identity44f(*MatrixA.Matrix44f)
 
  For i = 0 To 3
   
    For j = 0 To 3
     
      If i = j
        AccessMatrix44f(*MatrixA, i, j) = 1.0
      Else
        AccessMatrix44f(*MatrixA, i, j) = 0.0
      EndIf
     
    Next 
   
  Next
 
EndProcedure

Procedure Identity44(*Identity.Matrix44)
 
  *Identity\e11 = 1.0
  *Identity\e21 = 0.0
  *Identity\e31 = 0.0
  *Identity\e41 = 0.0
  *Identity\e12 = 0.0
  *Identity\e22 = 1.0
  *Identity\e32 = 0.0
  *Identity\e42 = 0.0
  *Identity\e13 = 0.0
  *Identity\e23 = 0.0
  *Identity\e33 = 1.0
  *Identity\e43 = 0.0
  *Identity\e14 = 0.0
  *Identity\e24 = 0.0
  *Identity\e34 = 0.0
  *Identity\e44 = 1.0
 
EndProcedure

For TestID = 0 To 4
  
  TempsDepart0 = ElapsedMilliseconds()
  
  For Index = 0 To 100000
    Identity4f(MatrixA.Matrix4f)
  Next
  
  TempsEcoule0 = ElapsedMilliseconds()-TempsDepart0
  
  TempsDepart1 = ElapsedMilliseconds()
  
  For Index = 0 To 100000
    Identity44f(MatrixB.Matrix44f)
  Next
  
  TempsEcoule1 = ElapsedMilliseconds()-TempsDepart1
  
  TempsDepart2 = ElapsedMilliseconds()
  
  For Index = 0 To 100000
    Identity44(MatrixC.Matrix44)
  Next
  
  TempsEcoule2 = ElapsedMilliseconds()-TempsDepart2
  
  MessageRequester("Elapsed Time", "Matrix4f : " + Str(TempsEcoule0) + " milliseconds" + #LF$ + "Matrix44f : " + Str(TempsEcoule1) +" milliseconds"+ #LF$ + "Matrix44 : " + Str(TempsEcoule2) +" milliseconds")
  
Next

; <<<<<<<<<<<<<<<<<<<<<<<
; <<<<< END OF FILE <<<<<
; <<<<<<<<<<<<<<<<<<<<<<<
When I run the code above, this is what I get:

Matrix4f : 10 milliseconds
Matrix44f : 8 milliseconds
Matrix44 : 2 milliseconds

Matrix4f : 13 milliseconds
Matrix44f : 8 milliseconds
Matrix44 : 2 milliseconds

Matrix4f : 13 milliseconds
Matrix44f : 8 milliseconds
Matrix44 : 2 milliseconds

Matrix4f : 12 milliseconds
Matrix44f : 8 milliseconds
Matrix44 : 2 milliseconds

Matrix4f : 12 milliseconds
Matrix44f : 9 milliseconds
Matrix44 : 2 milliseconds

The question is: what makes the static arrays so slow? The memory offset calculation? The nested For loops? Or both?

Thanks in advance.
StarBootics

Post by STARGÅTE »

It is mainly the nested loops, because they involve multiple comparisons and jumps.
Whenever you want to speed up code like this, write out the individual lines instead of using a For loop.

Btw: If you want to speed up 4D matrix stuff, you can (or have to) use the SSE instruction set via inline ASM.
SSE gives you 128-bit registers, so you can calculate with 4 floats at once.

Here is an example:

Code:

Procedure.i IdentitySSE( *m4fOut.Matrix44, fValue.f = 1.0 )
	
	CompilerIf #PB_Compiler_Processor = #PB_Processor_x64
		! MOV     rax, [p.p_m4fOut]
		! MOVSS   xmm0, [p.v_fValue]
		! MOVUPS  [rax+00], xmm0
		! SHUFPS  xmm0, xmm0, 10010011b   ; Coordinate rotation: xyzw -> wxyz
		! MOVUPS  [rax+16], xmm0
		! SHUFPS  xmm0, xmm0, 10010011b
		! MOVUPS  [rax+32], xmm0
		! SHUFPS  xmm0, xmm0, 10010011b
		! MOVUPS  [rax+48], xmm0
	CompilerElse
		! MOV     eax, [p.p_m4fOut]
		! MOVSS   xmm0, [p.v_fValue]
		! MOVUPS  [eax+00], xmm0
		! SHUFPS  xmm0, xmm0, 10010011b   ; Coordinate rotation: xyzw -> wxyz
		! MOVUPS  [eax+16], xmm0
		! SHUFPS  xmm0, xmm0, 10010011b
		! MOVUPS  [eax+32], xmm0
		! SHUFPS  xmm0, xmm0, 10010011b
		! MOVUPS  [eax+48], xmm0
	CompilerEndIf
	
	ProcedureReturn
	
EndProcedure
---------------------------
Elapsed Time
---------------------------
Matrix4f : 27 milliseconds
Matrix44f : 27 milliseconds
Matrix44 : 6 milliseconds
MatrixSSE: 2 milliseconds
---------------------------
OK
---------------------------
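
For readers who do not know SSE, the lane rotation used in IdentitySSE can be followed in plain PureBasic. This is only a sketch to explain the idea, not the code from the post and not a faster replacement: a four-float structure stands in for the xmm0 register, and the names Vec4, Mat16, RotateLanes and IdentityByRotation are made up for this illustration.

```purebasic
; Hypothetical helper types, for illustration only.
Structure Vec4
  f.f[4]              ; stands in for one 128-bit SSE register (4 floats)
EndStructure

Structure Mat16
  e.f[16]             ; 4x4 matrix, stored column after column
EndStructure

; Emulates SHUFPS xmm0, xmm0, 10010011b: (x, y, z, w) becomes (w, x, y, z).
Procedure RotateLanes(*v.Vec4)
  Protected w.f = *v\f[3]
  *v\f[3] = *v\f[2]
  *v\f[2] = *v\f[1]
  *v\f[1] = *v\f[0]
  *v\f[0] = w
EndProcedure

Procedure IdentityByRotation(*m.Mat16, Value.f = 1.0)
  Protected reg.Vec4, col, lane
  ; MOVSS loads one float and clears the other lanes: (Value, 0, 0, 0)
  reg\f[0] = Value
  reg\f[1] = 0.0
  reg\f[2] = 0.0
  reg\f[3] = 0.0
  For col = 0 To 3
    For lane = 0 To 3               ; MOVUPS stores all 4 lanes as one column
      *m\e[col * 4 + lane] = reg\f[lane]
    Next
    RotateLanes(@reg)               ; SHUFPS moves Value to the next lane
  Next
EndProcedure

Define m.Mat16
IdentityByRotation(@m)
Debug m\e[0]    ; first diagonal element
Debug m\e[1]    ; an off-diagonal element
```

The ASM version does the same with one store per column and one shuffle between stores, which is where the speed comes from.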

Post by idle »

The slowdown is due to the compares (CMPs) from the nested loops and the If statement rather than the offset calculations.
Wherever you have branching, there is the potential for slowdowns.

If you need loops, it's better to use Repeat or While loops, and if you can avoid an If, do so.

Code:

Procedure Identity44f(*MatrixA.Matrix44f)
    
  CopyMemory(?_Identity44,*MatrixA,64)
     
  DataSection : _Identity44:  
  Data.q $000000003F800000,$0000000000000000
  Data.q $3F80000000000000,$0000000000000000
  Data.q $0000000000000000,$000000003F800000
  Data.q $0000000000000000,$3F80000000000000
  EndDataSection 
  
EndProcedure
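
If a loop is still wanted, the advice about While loops and avoiding the If could look roughly like the sketch below. It is an assumption about what is meant, not code from the thread: the structure FlatMatrix16 and the procedure IdentityStride are hypothetical names. The loop condition is still a compare, but the If inside the loop body is gone, and the loop runs 4 times instead of 16.

```purebasic
; Hypothetical names, for illustration only.
Structure FlatMatrix16
  e.f[16]
EndStructure

Procedure IdentityStride(*m.FlatMatrix16)
  Protected i = 0
  ; All-zero bytes are 0.0 for every float in the matrix.
  FillMemory(*m, SizeOf(FlatMatrix16), 0)
  ; The diagonal of a 4x4 matrix sits at indices 0, 5, 10 and 15.
  While i < 16
    *m\e[i] = 1.0
    i + 5
  Wend
EndProcedure

Define id.FlatMatrix16
IdentityStride(@id)
Debug id\e[5]     ; a diagonal element
Debug id\e[6]     ; an off-diagonal element
```

Whether this beats the CopyMemory version above would have to be measured; it only shows how the branch in the loop body can be removed.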

Post by StarBootics »

STARGÅTE wrote: Btw: If you want to speed up 4D matrix stuff, you can (or have to) use the ASM SSE instructions set.
In SSE you have 128bit registers and you can calculate with 4 floats at once.
Here an example: (IdentitySSE procedure, see the post above)

It's an interesting piece of code. I really need to learn ASM coding.

idle wrote: The slow down is due to the Cmps from the nested loops and IF statement rather than the offset calculations

The offset calculation appears to have an impact as well. This is V2.0.0, without any loops or If statements:

Code:

; <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
; Project name : Matrix Structure Speed test
; File Name : Matrix Structure Speed test.pb
; File version: 2.0.0
; Programming : OK
; Programmed by : StarBootics
; Date : 20-03-2021
; Last Update : 20-03-2021
; PureBasic code : V5.73LTS
; Platform : Windows, Linux, MacOS X
; <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<

Structure Line
 
  Element1D.f[4]
 
EndStructure

Structure Matrix4f
 
  Element2D.Line[4]
 
EndStructure

Structure Matrix44f
   e.f[16]
EndStructure

Structure Matrix44
 
  e11.f
  e21.f
  e31.f
  e41.f
  e12.f
  e22.f
  e32.f
  e42.f
  e13.f
  e23.f
  e33.f
  e43.f
  e14.f
  e24.f
  e34.f
  e44.f
 
EndStructure


Macro AccessMatrix44f(MatrixA, Line, Column)
   MatrixA\e[Column * 4 + Line]
EndMacro

Macro AccessMatrix4f(MatrixA, Line, Column)
 
  MatrixA\Element2D[Column]\Element1D[Line]
 
EndMacro

Procedure Identity4f(*MatrixA.Matrix4f)
  
  AccessMatrix4f(*MatrixA, 0, 0) = 1.0
  AccessMatrix4f(*MatrixA, 1, 0) = 0.0
  AccessMatrix4f(*MatrixA, 2, 0) = 0.0
  AccessMatrix4f(*MatrixA, 3, 0) = 0.0
  
  AccessMatrix4f(*MatrixA, 0, 1) = 0.0
  AccessMatrix4f(*MatrixA, 1, 1) = 1.0
  AccessMatrix4f(*MatrixA, 2, 1) = 0.0
  AccessMatrix4f(*MatrixA, 3, 1) = 0.0
  
  AccessMatrix4f(*MatrixA, 0, 2) = 0.0
  AccessMatrix4f(*MatrixA, 1, 2) = 0.0
  AccessMatrix4f(*MatrixA, 2, 2) = 1.0
  AccessMatrix4f(*MatrixA, 3, 2) = 0.0
  
  AccessMatrix4f(*MatrixA, 0, 3) = 0.0
  AccessMatrix4f(*MatrixA, 1, 3) = 0.0
  AccessMatrix4f(*MatrixA, 2, 3) = 0.0
  AccessMatrix4f(*MatrixA, 3, 3) = 1.0
  
    
  
  
;   For i = 0 To 3
;    
;     For j = 0 To 3
;      
;       If i = j
;         AccessMatrix4f(*MatrixA, i, j) = 1.0
;       Else
;         AccessMatrix4f(*MatrixA, i, j) = 0.0
;       EndIf
;      
;     Next 
;    
;   Next
 
EndProcedure

Procedure Identity44f(*MatrixA.Matrix44f)
  
  AccessMatrix44f(*MatrixA, 0, 0) = 1.0
  AccessMatrix44f(*MatrixA, 1, 0) = 0.0
  AccessMatrix44f(*MatrixA, 2, 0) = 0.0
  AccessMatrix44f(*MatrixA, 3, 0) = 0.0
  
  AccessMatrix44f(*MatrixA, 0, 1) = 0.0
  AccessMatrix44f(*MatrixA, 1, 1) = 1.0
  AccessMatrix44f(*MatrixA, 2, 1) = 0.0
  AccessMatrix44f(*MatrixA, 3, 1) = 0.0
  
  
  AccessMatrix44f(*MatrixA, 0, 2) = 0.0
  AccessMatrix44f(*MatrixA, 1, 2) = 0.0
  AccessMatrix44f(*MatrixA, 2, 2) = 1.0
  AccessMatrix44f(*MatrixA, 3, 2) = 0.0
  
  AccessMatrix44f(*MatrixA, 0, 3) = 0.0
  AccessMatrix44f(*MatrixA, 1, 3) = 0.0
  AccessMatrix44f(*MatrixA, 2, 3) = 0.0
  AccessMatrix44f(*MatrixA, 3, 3) = 1.0
  
  
;   For i = 0 To 3
;    
;     For j = 0 To 3
;      
;       If i = j
;         AccessMatrix44f(*MatrixA, i, j) = 1.0
;       Else
;         AccessMatrix44f(*MatrixA, i, j) = 0.0
;       EndIf
;      
;     Next 
;    
;   Next
 
EndProcedure

Procedure Identity44(*Identity.Matrix44)
 
  *Identity\e11 = 1.0
  *Identity\e21 = 0.0
  *Identity\e31 = 0.0
  *Identity\e41 = 0.0
  *Identity\e12 = 0.0
  *Identity\e22 = 1.0
  *Identity\e32 = 0.0
  *Identity\e42 = 0.0
  *Identity\e13 = 0.0
  *Identity\e23 = 0.0
  *Identity\e33 = 1.0
  *Identity\e43 = 0.0
  *Identity\e14 = 0.0
  *Identity\e24 = 0.0
  *Identity\e34 = 0.0
  *Identity\e44 = 1.0
 
EndProcedure

For TestID = 0 To 4
  
  TempsDepart0 = ElapsedMilliseconds()
  
  For Index = 0 To 100000
    Identity4f(MatrixA.Matrix4f)
  Next
  
  TempsEcoule0 = ElapsedMilliseconds()-TempsDepart0
  
  TempsDepart1 = ElapsedMilliseconds()
  
  For Index = 0 To 100000
    Identity44f(MatrixB.Matrix44f)
  Next
  
  TempsEcoule1 = ElapsedMilliseconds()-TempsDepart1
  
  TempsDepart2 = ElapsedMilliseconds()
  
  For Index = 0 To 100000
    Identity44(MatrixC.Matrix44)
  Next
  
  TempsEcoule2 = ElapsedMilliseconds()-TempsDepart2
  
  MessageRequester("Elapsed Time", "Matrix4f : " + Str(TempsEcoule0) + " milliseconds" + #LF$ + "Matrix44f : " + Str(TempsEcoule1) +" milliseconds"+ #LF$ + "Matrix44 : " + Str(TempsEcoule2) +" milliseconds")
  
Next

; <<<<<<<<<<<<<<<<<<<<<<<
; <<<<< END OF FILE <<<<<
; <<<<<<<<<<<<<<<<<<<<<<<
Matrix4f : 6 milliseconds
Matrix44f : 4 milliseconds
Matrix44 : 2 milliseconds

Matrix4f : 8 milliseconds
Matrix44f : 11 milliseconds
Matrix44 : 2 milliseconds

Matrix4f : 8 milliseconds
Matrix44f : 4 milliseconds
Matrix44 : 2 milliseconds

Matrix4f : 6 milliseconds
Matrix44f : 5 milliseconds
Matrix44 : 2 milliseconds

Matrix4f : 5 milliseconds
Matrix44f : 5 milliseconds
Matrix44 : 2 milliseconds

As you can see, the offset calculation still has an impact, but the performance is much better without the loops. It's clear that it's better to create the structure without static arrays.
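
To make "offset calculation" concrete: all three layouts occupy the same 64 bytes, and only the way an element's address is obtained differs. The sketch below uses hypothetical names (Column4, NestedMatrix, FlatMatrix, FieldMatrix, ElementOffset) that mirror Line, Matrix4f, Matrix44f and Matrix44 from the test. It says nothing about what the compiler actually emits; it only shows that a named field has a fixed offset while an indexed element needs the (Column * 4 + Line) arithmetic.

```purebasic
; Hypothetical names; the layouts mirror the structures from the test.
#FloatSize = 4                  ; a .f field takes 4 bytes

Structure Column4
  v.f[4]
EndStructure

Structure NestedMatrix          ; like Matrix4f
  c.Column4[4]
EndStructure

Structure FlatMatrix            ; like Matrix44f
  e.f[16]
EndStructure

Structure FieldMatrix           ; like Matrix44
  e11.f : e21.f : e31.f : e41.f
  e12.f : e22.f : e32.f : e42.f
  e13.f : e23.f : e33.f : e43.f
  e14.f : e24.f : e34.f : e44.f
EndStructure

; Byte offset of the element in row Row and column Col (both 0-based).
Procedure.i ElementOffset(Row, Col)
  ProcedureReturn (Col * 4 + Row) * #FloatSize
EndProcedure

Define fields.FieldMatrix
Define *flat.FlatMatrix
Define *nested.NestedMatrix
*flat = @fields                 ; view the same memory through the other layouts
*nested = @fields

fields\e32 = 7.0                ; row 3, column 2

Debug SizeOf(NestedMatrix)      ; the three sizes are identical
Debug SizeOf(FlatMatrix)
Debug SizeOf(FieldMatrix)
Debug OffsetOf(FieldMatrix\e32) ; offset known at compile time
Debug ElementOffset(2, 1)       ; same offset, computed at run time
Debug *flat\e[1 * 4 + 2]        ; reads the value written through e32
Debug *nested\c[1]\v[2]         ; same element again
```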

Best regards
StarBootics

Post by StarBootics »

Hello everyone,

Here is a speed test that is much closer to real life: concatenating matrices.

Code:

; <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
; Project name : Matrix Multiply
; File Name : Matrix Multiply.pb
; File version: 1.0.0
; Programming : OK
; Programmed by : StarBootics
; Date : 07-06-2021
; Last Update : 07-06-2021
; PureBasic code : V5.73 LTS
; Platform : Windows, Linux, MacOS X
; <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<

Structure Matrix44
  
  VirtualTable.i
  e11.f
  e21.f
  e31.f
  e41.f
  e12.f
  e22.f
  e32.f
  e42.f
  e13.f
  e23.f
  e33.f
  e43.f
  e14.f
  e24.f
  e34.f
  e44.f
  
EndStructure

Procedure.i IdentitySSE(*Identity.Matrix44)
  
  *m4fOut = *Identity + OffsetOf(Matrix44\e11)
  fValue.f = 1.0 
  
  CompilerIf #PB_Compiler_Processor = #PB_Processor_x64
    ! MOV     rax, [p.p_m4fOut]
    ! MOVSS   xmm0, [p.v_fValue]
    ! MOVUPS  [rax+00], xmm0
    ! SHUFPS  xmm0, xmm0, 10010011b   ; Coordinate rotation: xyzw -> wxyz
    ! MOVUPS  [rax+16], xmm0
    ! SHUFPS  xmm0, xmm0, 10010011b
    ! MOVUPS  [rax+32], xmm0
    ! SHUFPS  xmm0, xmm0, 10010011b
    ! MOVUPS  [rax+48], xmm0
  CompilerElse
    ! MOV     eax, [p.p_m4fOut]
    ! MOVSS   xmm0, [p.v_fValue]
    ! MOVUPS  [eax+00], xmm0
    ! SHUFPS  xmm0, xmm0, 10010011b   ; Coordinate rotation: xyzw -> wxyz
    ! MOVUPS  [eax+16], xmm0
    ! SHUFPS  xmm0, xmm0, 10010011b
    ! MOVUPS  [eax+32], xmm0
    ! SHUFPS  xmm0, xmm0, 10010011b
    ! MOVUPS  [eax+48], xmm0
  CompilerEndIf
  
  ProcedureReturn
EndProcedure

Procedure Multiply(*This.Matrix44, *Other.Matrix44)
  
  e11.f = *This\e11 * *Other\e11 + *This\e12 * *Other\e21 + *This\e13 * *Other\e31 + *This\e14 * *Other\e41
  e12.f = *This\e11 * *Other\e12 + *This\e12 * *Other\e22 + *This\e13 * *Other\e32 + *This\e14 * *Other\e42
  e13.f = *This\e11 * *Other\e13 + *This\e12 * *Other\e23 + *This\e13 * *Other\e33 + *This\e14 * *Other\e43
  e14.f = *This\e11 * *Other\e14 + *This\e12 * *Other\e24 + *This\e13 * *Other\e34 + *This\e14 * *Other\e44
  e21.f = *This\e21 * *Other\e11 + *This\e22 * *Other\e21 + *This\e23 * *Other\e31 + *This\e24 * *Other\e41
  e22.f = *This\e21 * *Other\e12 + *This\e22 * *Other\e22 + *This\e23 * *Other\e32 + *This\e24 * *Other\e42
  e23.f = *This\e21 * *Other\e13 + *This\e22 * *Other\e23 + *This\e23 * *Other\e33 + *This\e24 * *Other\e43
  e24.f = *This\e21 * *Other\e14 + *This\e22 * *Other\e24 + *This\e23 * *Other\e34 + *This\e24 * *Other\e44
  e31.f = *This\e31 * *Other\e11 + *This\e32 * *Other\e21 + *This\e33 * *Other\e31 + *This\e34 * *Other\e41
  e32.f = *This\e31 * *Other\e12 + *This\e32 * *Other\e22 + *This\e33 * *Other\e32 + *This\e34 * *Other\e42
  e33.f = *This\e31 * *Other\e13 + *This\e32 * *Other\e23 + *This\e33 * *Other\e33 + *This\e34 * *Other\e43
  e34.f = *This\e31 * *Other\e14 + *This\e32 * *Other\e24 + *This\e33 * *Other\e34 + *This\e34 * *Other\e44
  e41.f = *This\e41 * *Other\e11 + *This\e42 * *Other\e21 + *This\e43 * *Other\e31 + *This\e44 * *Other\e41
  e42.f = *This\e41 * *Other\e12 + *This\e42 * *Other\e22 + *This\e43 * *Other\e32 + *This\e44 * *Other\e42
  e43.f = *This\e41 * *Other\e13 + *This\e42 * *Other\e23 + *This\e43 * *Other\e33 + *This\e44 * *Other\e43
  e44.f = *This\e41 * *Other\e14 + *This\e42 * *Other\e24 + *This\e43 * *Other\e34 + *This\e44 * *Other\e44
  
  *This\e11 = e11
  *This\e12 = e12
  *This\e13 = e13
  *This\e14 = e14
  
  *This\e21 = e21
  *This\e22 = e22
  *This\e23 = e23
  *This\e24 = e24
  
  *This\e31 = e31
  *This\e32 = e32
  *This\e33 = e33
  *This\e34 = e34
  
  *This\e41 = e41
  *This\e42 = e42
  *This\e43 = e43
  *This\e44 = e44
  
EndProcedure

Procedure Translation(*This.Matrix44, Trans_x.f, Trans_y.f, Trans_z.f)
  
  *This\e11 = 1.0
  *This\e12 = 0.0
  *This\e13 = 0.0
  *This\e14 = Trans_x
  
  *This\e21 = 0.0
  *This\e22 = 1.0
  *This\e23 = 0.0
  *This\e24 = Trans_y
  
  *This\e31 = 0.0
  *This\e32 = 0.0
  *This\e33 = 1.0
  *This\e34 = Trans_z
  
  *This\e41 = 0.0
  *This\e42 = 0.0
  *This\e43 = 0.0
  *This\e44 = 1.0
  
EndProcedure

Procedure RotateX(*This.Matrix44, Theta.f)
  
  ; Protected Cos.f = Cos(Theta)
  ; Protected Sin.f = Sin(Theta) 
  
  Protected Cos.f, Sin.f
  
  !FLD dword [p.v_Theta]
  !FSINCOS
  !FSTP dword [p.v_Cos]
  !FSTP dword [p.v_Sin] 
  
  *This\e11 = 1.0
  *This\e12 = 0.0
  *This\e13 = 0.0
  *This\e14 = 0.0
  
  *This\e21 = 0.0
  *This\e22 = Cos
  *This\e23 = -Sin
  *This\e24 = 0.0
  
  *This\e31 = 0.0
  *This\e32 = Sin
  *This\e33 = Cos
  *This\e34 = 0.0
  
  *This\e41 = 0.0
  *This\e42 = 0.0
  *This\e43 = 0.0
  *This\e44 = 1.0
  
EndProcedure

Procedure RotateY(*This.Matrix44, Theta.f)
  
  ; Protected Cos.f = Cos(Theta)
  ; Protected Sin.f = Sin(Theta) 
  
  Protected Cos.f, Sin.f
  
  !FLD dword [p.v_Theta]
  !FSINCOS
  !FSTP dword [p.v_Cos]
  !FSTP dword [p.v_Sin] 
  
  *This\e11 = Cos
  *This\e12 = 0.0
  *This\e13 = Sin
  *This\e14 = 0.0
  
  *This\e21 = 0.0
  *This\e22 = 1.0
  *This\e23 = 0.0
  *This\e24 = 0.0
  
  *This\e31 = -Sin
  *This\e32 = 0.0
  *This\e33 = Cos
  *This\e34 = 0.0
  
  *This\e41 = 0.0
  *This\e42 = 0.0
  *This\e43 = 0.0
  *This\e44 = 1.0
  
EndProcedure

Translation(Translation.Matrix44, 25.0, 15.0, 3.0)
Translation(InvTranslation.Matrix44, -25.0, -15.0, -3.0)
RotateX(RotationX.Matrix44, Radian(25.0))
RotateX(InvRotationX.Matrix44, Radian(-25.0))
Translation(SlideZ.Matrix44, 0.0, 0.0, 1.0)
RotateY(RotationY.Matrix44, Radian(25.0))
RotateY(InvRotationY.Matrix44, Radian(-25.0))

For TestID = 0 To 4

  TempsDepart0 = ElapsedMilliseconds()
  
  For Index = 0 To 10000
    IdentitySSE(AnimationMatrix.Matrix44)
    Multiply(AnimationMatrix, InvTranslation)
    Multiply(AnimationMatrix, InvRotationY)
    Multiply(AnimationMatrix, InvRotationX)
    Multiply(AnimationMatrix, SlideZ)
    Multiply(AnimationMatrix, RotationX)
    Multiply(AnimationMatrix, RotationY)
    Multiply(AnimationMatrix, Translation)
  Next

  TempsEcoule0 = ElapsedMilliseconds()-TempsDepart0
 
  MessageRequester("Elapsed Time", "Matrix Multiply : " + Str(TempsEcoule0) + " milliseconds")
  
Next

; <<<<<<<<<<<<<<<<<<<<<<<
; <<<<< END OF FILE <<<<<
; <<<<<<<<<<<<<<<<<<<<<<<
This is the result I get:

7 milliseconds
13 milliseconds
10 milliseconds
7 milliseconds
22 milliseconds

I think the Multiply procedure needs to be accelerated using ASM SSE instructions, but this is out of reach for me at the moment.

If some volunteer could try this code with the C backend and full optimization and post some results, that would be nice.

Best regards
StarBootics

Post by STARGÅTE »

StarBootics wrote: Mon Jun 07, 2021 4:55 pm I think the Multiply procedure need to be accelerated using ASM SSE Instructions
Of course!
If the matrix structure looks like this:

Code:

Structure UB2D_MATRIX4f
	I11.f : I21.f : I31.f : I41.f
	I12.f : I22.f : I32.f : I42.f
	I13.f : I23.f : I33.f : I43.f
	I14.f : I24.f : I34.f : I44.f
EndStructure
The multiplication is:
(Don't be afraid of the length of the ASM code, it is much faster than the native PB code)

Code:

Structure UB2D_MATRIX4f
	I11.f : I21.f : I31.f : I41.f
	I12.f : I22.f : I32.f : I42.f
	I13.f : I23.f : I33.f : I43.f
	I14.f : I24.f : I34.f : I44.f
EndStructure

Procedure.i UB2D_m4fMultiplication( *m4fResult.UB2D_MATRIX4f, *m4fLeft.UB2D_MATRIX4f, *m4fRight.UB2D_MATRIX4f )
	
	Protected Backup.UB2D_MATRIX4f
	
	CompilerIf #PB_Compiler_Processor = #PB_Processor_x64
		! MOV rax, [p.p_m4fResult]
		! MOV rcx, [p.p_m4fLeft]
		! MOV rdx, [p.p_m4fRight]
		; Load the left matrix
		! MOVUPS xmm0, [rcx+00]
		! MOVUPS xmm1, [rcx+16]
		! MOVUPS xmm2, [rcx+32]
		! MOVUPS xmm3, [rcx+48]
		; Back up xmm4-xmm6
		! MOVUPS [p.v_Backup+00], xmm4
		! MOVUPS [p.v_Backup+16], xmm5
		! MOVUPS [p.v_Backup+32], xmm6
	CompilerElse
		! MOV eax, [p.p_m4fResult]
		! MOV ecx, [p.p_m4fLeft]
		! MOV edx, [p.p_m4fRight]
		; Load the left matrix
		! MOVUPS xmm0, [ecx+00]
		! MOVUPS xmm1, [ecx+16]
		! MOVUPS xmm2, [ecx+32]
		! MOVUPS xmm3, [ecx+48]
		; Back up xmm4-xmm6
		! MOVUPS [p.v_Backup+00], xmm4
		! MOVUPS [p.v_Backup+16], xmm5
		! MOVUPS [p.v_Backup+32], xmm6
	CompilerEndIf
	
	CompilerIf #PB_Compiler_Processor = #PB_Processor_x64
		; Multiply with the right matrix (1st column)
		! MOVUPS xmm4, [rdx+00]
		! MOVAPS xmm6, xmm4
		! SHUFPS xmm6, xmm6, 00000000b
		! MULPS  xmm6, xmm0
		! MOVAPS xmm5, xmm4
		! SHUFPS xmm5, xmm5, 01010101b
		! MULPS  xmm5, xmm1
		! ADDPS  xmm6, xmm5
		! MOVAPS xmm5, xmm4
		! SHUFPS xmm5, xmm5, 10101010b
		! MULPS  xmm5, xmm2
		! ADDPS  xmm6, xmm5
		! MOVAPS xmm5, xmm4
		! SHUFPS xmm5, xmm5, 11111111b
		! MULPS  xmm5, xmm3
		! ADDPS  xmm6, xmm5
		! MOVUPS [rax+00], xmm6
		; Multiply with the right matrix (2nd column)
		! MOVUPS xmm4, [rdx+16]
		! MOVAPS xmm6, xmm4
		! SHUFPS xmm6, xmm6, 00000000b
		! MULPS  xmm6, xmm0
		! MOVAPS xmm5, xmm4
		! SHUFPS xmm5, xmm5, 01010101b
		! MULPS  xmm5, xmm1
		! ADDPS  xmm6, xmm5
		! MOVAPS xmm5, xmm4
		! SHUFPS xmm5, xmm5, 10101010b
		! MULPS  xmm5, xmm2
		! ADDPS  xmm6, xmm5
		! MOVAPS xmm5, xmm4
		! SHUFPS xmm5, xmm5, 11111111b
		! MULPS  xmm5, xmm3
		! ADDPS  xmm6, xmm5
		! MOVUPS [rax+16], xmm6
		; Multiply with the right matrix (3rd column)
		! MOVUPS xmm4, [rdx+32]
		! MOVAPS xmm6, xmm4
		! SHUFPS xmm6, xmm6, 00000000b
		! MULPS  xmm6, xmm0
		! MOVAPS xmm5, xmm4
		! SHUFPS xmm5, xmm5, 01010101b
		! MULPS  xmm5, xmm1
		! ADDPS  xmm6, xmm5
		! MOVAPS xmm5, xmm4
		! SHUFPS xmm5, xmm5, 10101010b
		! MULPS  xmm5, xmm2
		! ADDPS  xmm6, xmm5
		! MOVAPS xmm5, xmm4
		! SHUFPS xmm5, xmm5, 11111111b
		! MULPS  xmm5, xmm3
		! ADDPS  xmm6, xmm5
		! MOVUPS [rax+32], xmm6
		; Multiply with the right matrix (4th column)
		! MOVUPS xmm4, [rdx+48]
		! MOVAPS xmm6, xmm4
		! SHUFPS xmm6, xmm6, 00000000b
		! MULPS  xmm6, xmm0
		! MOVAPS xmm5, xmm4
		! SHUFPS xmm5, xmm5, 01010101b
		! MULPS  xmm5, xmm1
		! ADDPS  xmm6, xmm5
		! MOVAPS xmm5, xmm4
		! SHUFPS xmm5, xmm5, 10101010b
		! MULPS  xmm5, xmm2
		! ADDPS  xmm6, xmm5
		! MOVAPS xmm5, xmm4
		! SHUFPS xmm5, xmm5, 11111111b
		! MULPS  xmm5, xmm3
		! ADDPS  xmm6, xmm5
		! MOVUPS [rax+48], xmm6
		; Restore xmm4-xmm6
		! MOVUPS xmm4, [p.v_Backup+00]
		! MOVUPS xmm5, [p.v_Backup+16]
		! MOVUPS xmm6, [p.v_Backup+32]
	CompilerElse
		; Multiply with the right matrix (1st column)
		! MOVUPS xmm4, [edx+00]
		! MOVAPS xmm6, xmm4
		! SHUFPS xmm6, xmm6, 00000000b
		! MULPS  xmm6, xmm0
		! MOVAPS xmm5, xmm4
		! SHUFPS xmm5, xmm5, 01010101b
		! MULPS  xmm5, xmm1
		! ADDPS  xmm6, xmm5
		! MOVAPS xmm5, xmm4
		! SHUFPS xmm5, xmm5, 10101010b
		! MULPS  xmm5, xmm2
		! ADDPS  xmm6, xmm5
		! MOVAPS xmm5, xmm4
		! SHUFPS xmm5, xmm5, 11111111b
		! MULPS  xmm5, xmm3
		! ADDPS  xmm6, xmm5
		! MOVUPS [eax+00], xmm6
		; Multiply with the right matrix (2nd column)
		! MOVUPS xmm4, [edx+16]
		! MOVAPS xmm6, xmm4
		! SHUFPS xmm6, xmm6, 00000000b
		! MULPS  xmm6, xmm0
		! MOVAPS xmm5, xmm4
		! SHUFPS xmm5, xmm5, 01010101b
		! MULPS  xmm5, xmm1
		! ADDPS  xmm6, xmm5
		! MOVAPS xmm5, xmm4
		! SHUFPS xmm5, xmm5, 10101010b
		! MULPS  xmm5, xmm2
		! ADDPS  xmm6, xmm5
		! MOVAPS xmm5, xmm4
		! SHUFPS xmm5, xmm5, 11111111b
		! MULPS  xmm5, xmm3
		! ADDPS  xmm6, xmm5
		! MOVUPS [eax+16], xmm6
		; Multiply with the right matrix (3rd column)
		! MOVUPS xmm4, [edx+32]
		! MOVAPS xmm6, xmm4
		! SHUFPS xmm6, xmm6, 00000000b
		! MULPS  xmm6, xmm0
		! MOVAPS xmm5, xmm4
		! SHUFPS xmm5, xmm5, 01010101b
		! MULPS  xmm5, xmm1
		! ADDPS  xmm6, xmm5
		! MOVAPS xmm5, xmm4
		! SHUFPS xmm5, xmm5, 10101010b
		! MULPS  xmm5, xmm2
		! ADDPS  xmm6, xmm5
		! MOVAPS xmm5, xmm4
		! SHUFPS xmm5, xmm5, 11111111b
		! MULPS  xmm5, xmm3
		! ADDPS  xmm6, xmm5
		! MOVUPS [eax+32], xmm6
		; Multiply with the right matrix (4th column)
		! MOVUPS xmm4, [edx+48]
		! MOVAPS xmm6, xmm4
		! SHUFPS xmm6, xmm6, 00000000b
		! MULPS  xmm6, xmm0
		! MOVAPS xmm5, xmm4
		! SHUFPS xmm5, xmm5, 01010101b
		! MULPS  xmm5, xmm1
		! ADDPS  xmm6, xmm5
		! MOVAPS xmm5, xmm4
		! SHUFPS xmm5, xmm5, 10101010b
		! MULPS  xmm5, xmm2
		! ADDPS  xmm6, xmm5
		! MOVAPS xmm5, xmm4
		! SHUFPS xmm5, xmm5, 11111111b
		! MULPS  xmm5, xmm3
		! ADDPS  xmm6, xmm5
		! MOVUPS [eax+48], xmm6
		; Restore xmm4-xmm6
		! MOVUPS xmm4, [p.v_Backup+00]
		! MOVUPS xmm5, [p.v_Backup+16]
		! MOVUPS xmm6, [p.v_Backup+32]
	CompilerEndIf
	
	ProcedureReturn
	
EndProcedure
Edit: Bug-Fix
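
What the ASM does per column can be written out in plain PureBasic as a sketch. This is not a replacement for the SSE code and will not be as fast; it only mirrors the order of operations (broadcast one component of the right matrix, multiply it with a whole column of the left matrix, accumulate). The names ColMajorMatrix and MultiplyByColumns are made up for this illustration, and the matrix is assumed to be stored column after column like UB2D_MATRIX4f.

```purebasic
; Hypothetical names, for illustration only.
Structure ColMajorMatrix
  e.f[16]       ; element (row, col), 0-based, lives at e[col * 4 + row]
EndStructure

Procedure MultiplyByColumns(*r.ColMajorMatrix, *a.ColMajorMatrix, *b.ColMajorMatrix)
  Protected tmp.ColMajorMatrix    ; temporary, so *r may be the same matrix as *a or *b
  Protected col, k, row
  Protected s.f
  For col = 0 To 3
    For row = 0 To 3
      tmp\e[col * 4 + row] = 0.0
    Next
    For k = 0 To 3
      s = *b\e[col * 4 + k]       ; SHUFPS: broadcast one component of the right column
      For row = 0 To 3            ; MULPS + ADDPS handle these 4 rows in one step
        tmp\e[col * 4 + row] = tmp\e[col * 4 + row] + s * *a\e[k * 4 + row]
      Next
    Next
  Next
  CopyMemory(@tmp, *r, SizeOf(ColMajorMatrix))
EndProcedure

; Usage: multiplying by the identity matrix must return the original matrix.
Define m.ColMajorMatrix, ident.ColMajorMatrix, product.ColMajorMatrix, n
For n = 0 To 15 : m\e[n] = n + 1 : Next
For n = 0 To 3 : ident\e[n * 5] = 1.0 : Next
MultiplyByColumns(@product, @m, @ident)
Debug product\e[6]
Debug m\e[6]
```

In the SSE version the innermost loop disappears, because one MULPS and one ADDPS process all four rows of a column at once.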

Post by skywalk »

STARGÅTE wrote:(Don't be afraid of the length of the ASM code, it is much faster than the native PB code)
And PB v6 + C emit + optimization will produce the same or better result?
We should leave handcoding ASM for SIMD stuff. :twisted:

Post by STARGÅTE »

skywalk wrote: Mon Jun 07, 2021 7:27 pm
STARGÅTE wrote:(Don't be afraid of the length of the ASM code, it is much faster than the native PB code)
And PB v6 + C emit + optimization will produce the same or better result?
We should leave handcoding ASM for SIMD stuff. :twisted:
I don't know.
Measurements on my system give the following results:
PB 6.00, native code, ASM backend: 45 ns
PB 6.00, native code, C backend: 41 ns
PB 6.00, native code, C backend + optimization: 16 ns
PB 6.00, ASM Code, ASM backend: 6.3 ns
In the end, you also need SSE instructions (intrinsics) in C to get the full optimization.

For the tests, I had to use this strange code with a random matrix, because the C optimizer removes the code when the result matrix isn't used afterwards.
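
Before the full listing, the idea behind the "strange code" can be shown in a much smaller sketch. It is an assumption about the general technique, not the measurement code used for the numbers above: the inputs come from Random() so they are not known at compile time, and a checksum of the results is printed so the work cannot simply be thrown away. SmallMatrix, ScaleMatrix and #Iterations are hypothetical names, and ScaleMatrix is only a stand-in for the procedure under test.

```purebasic
; Hypothetical benchmark skeleton, for illustration only.
#Iterations = 1000000

Structure SmallMatrix
  e.f[16]
EndStructure

; Stand-in for the procedure that is being measured.
Procedure ScaleMatrix(*m.SmallMatrix, Factor.f)
  Protected i
  For i = 0 To 15
    *m\e[i] = *m\e[i] * Factor
  Next
EndProcedure

Define m.SmallMatrix, i, StartTime.q, ElapsedMs.q, Checksum.f

RandomSeed(1)
For i = 0 To 15
  m\e[i] = 1.0 + Random(1000) / 1000.0    ; run-time values, not compile-time constants
Next

StartTime = ElapsedMilliseconds()
For i = 1 To #Iterations
  ScaleMatrix(@m, 1.0)
  Checksum + m\e[0]                       ; consume the result of every call
Next
ElapsedMs = ElapsedMilliseconds() - StartTime

If OpenConsole()
  ; Printing the checksum keeps the result observable in a release build.
  PrintN("ns per call: " + StrD(ElapsedMs * 1000000.0 / #Iterations, 1))
  PrintN("checksum: " + StrF(Checksum))
EndIf
```

Whether a given compiler version and optimization level really removes an unused loop has to be checked case by case; the checksum only makes that removal harder.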

Code:

Structure UB2D_MATRIX4f
	I11.f : I21.f : I31.f : I41.f
	I12.f : I22.f : I32.f : I42.f
	I13.f : I23.f : I33.f : I43.f
	I14.f : I24.f : I34.f : I44.f
EndStructure

Procedure.i UB2D_m4fMultiplicationASM( *m4fResult.UB2D_MATRIX4f, *m4fLeft.UB2D_MATRIX4f, *m4fRight.UB2D_MATRIX4f )
	
	Protected Backup.UB2D_MATRIX4f
	
	CompilerIf #PB_Compiler_Processor = #PB_Processor_x64
		! MOV rax, [p.p_m4fResult]
		! MOV rcx, [p.p_m4fLeft]
		! MOV rdx, [p.p_m4fRight]
		; Load the left matrix
		! MOVUPS xmm0, [rcx+00]
		! MOVUPS xmm1, [rcx+16]
		! MOVUPS xmm2, [rcx+32]
		! MOVUPS xmm3, [rcx+48]
		; Back up xmm4-xmm6
		! MOVUPS [p.v_Backup+00], xmm4
		! MOVUPS [p.v_Backup+16], xmm5
		! MOVUPS [p.v_Backup+32], xmm6
	CompilerElse
		! MOV eax, [p.p_m4fResult]
		! MOV ecx, [p.p_m4fLeft]
		! MOV edx, [p.p_m4fRight]
		; Load the left matrix
		! MOVUPS xmm0, [ecx+00]
		! MOVUPS xmm1, [ecx+16]
		! MOVUPS xmm2, [ecx+32]
		! MOVUPS xmm3, [ecx+48]
		; Back up xmm4-xmm6
		! MOVUPS [p.v_Backup+00], xmm4
		! MOVUPS [p.v_Backup+16], xmm5
		! MOVUPS [p.v_Backup+32], xmm6
	CompilerEndIf
	
	CompilerIf #PB_Compiler_Processor = #PB_Processor_x64
		; Multiply with the right matrix (1st column)
		! MOVUPS xmm4, [rdx+00]
		! MOVAPS xmm6, xmm4
		! SHUFPS xmm6, xmm6, 00000000b
		! MULPS  xmm6, xmm0
		! MOVAPS xmm5, xmm4
		! SHUFPS xmm5, xmm5, 01010101b
		! MULPS  xmm5, xmm1
		! ADDPS  xmm6, xmm5
		! MOVAPS xmm5, xmm4
		! SHUFPS xmm5, xmm5, 10101010b
		! MULPS  xmm5, xmm2
		! ADDPS  xmm6, xmm5
		! MOVAPS xmm5, xmm4
		! SHUFPS xmm5, xmm5, 11111111b
		! MULPS  xmm5, xmm3
		! ADDPS  xmm6, xmm5
		! MOVUPS [rax+00], xmm6
		; Multiply with the right matrix (2nd column)
		! MOVUPS xmm4, [rdx+16]
		! MOVAPS xmm6, xmm4
		! SHUFPS xmm6, xmm6, 00000000b
		! MULPS  xmm6, xmm0
		! MOVAPS xmm5, xmm4
		! SHUFPS xmm5, xmm5, 01010101b
		! MULPS  xmm5, xmm1
		! ADDPS  xmm6, xmm5
		! MOVAPS xmm5, xmm4
		! SHUFPS xmm5, xmm5, 10101010b
		! MULPS  xmm5, xmm2
		! ADDPS  xmm6, xmm5
		! MOVAPS xmm5, xmm4
		! SHUFPS xmm5, xmm5, 11111111b
		! MULPS  xmm5, xmm3
		! ADDPS  xmm6, xmm5
		! MOVUPS [rax+16], xmm6
		; Multiply with the right matrix (3rd column)
		! MOVUPS xmm4, [rdx+32]
		! MOVAPS xmm6, xmm4
		! SHUFPS xmm6, xmm6, 00000000b
		! MULPS  xmm6, xmm0
		! MOVAPS xmm5, xmm4
		! SHUFPS xmm5, xmm5, 01010101b
		! MULPS  xmm5, xmm1
		! ADDPS  xmm6, xmm5
		! MOVAPS xmm5, xmm4
		! SHUFPS xmm5, xmm5, 10101010b
		! MULPS  xmm5, xmm2
		! ADDPS  xmm6, xmm5
		! MOVAPS xmm5, xmm4
		! SHUFPS xmm5, xmm5, 11111111b
		! MULPS  xmm5, xmm3
		! ADDPS  xmm6, xmm5
		! MOVUPS [rax+32], xmm6
		; Multiply with the right matrix (4th column)
		! MOVUPS xmm4, [rdx+48]
		! MOVAPS xmm6, xmm4
		! SHUFPS xmm6, xmm6, 00000000b
		! MULPS  xmm6, xmm0
		! MOVAPS xmm5, xmm4
		! SHUFPS xmm5, xmm5, 01010101b
		! MULPS  xmm5, xmm1
		! ADDPS  xmm6, xmm5
		! MOVAPS xmm5, xmm4
		! SHUFPS xmm5, xmm5, 10101010b
		! MULPS  xmm5, xmm2
		! ADDPS  xmm6, xmm5
		! MOVAPS xmm5, xmm4
		! SHUFPS xmm5, xmm5, 11111111b
		! MULPS  xmm5, xmm3
		! ADDPS  xmm6, xmm5
		! MOVUPS [rax+48], xmm6
		; Restore xmm4-xmm6
		! MOVUPS xmm4, [p.v_Backup+00]
		! MOVUPS xmm5, [p.v_Backup+16]
		! MOVUPS xmm6, [p.v_Backup+32]
	CompilerElse
		; Multiply with the right matrix (1st column)
		! MOVUPS xmm4, [edx+00]
		! MOVAPS xmm6, xmm4
		! SHUFPS xmm6, xmm6, 00000000b
		! MULPS  xmm6, xmm0
		! MOVAPS xmm5, xmm4
		! SHUFPS xmm5, xmm5, 01010101b
		! MULPS  xmm5, xmm1
		! ADDPS  xmm6, xmm5
		! MOVAPS xmm5, xmm4
		! SHUFPS xmm5, xmm5, 10101010b
		! MULPS  xmm5, xmm2
		! ADDPS  xmm6, xmm5
		! MOVAPS xmm5, xmm4
		! SHUFPS xmm5, xmm5, 11111111b
		! MULPS  xmm5, xmm3
		! ADDPS  xmm6, xmm5
		! MOVUPS [eax+00], xmm6
		; Multiply with the right matrix (2nd column)
		! MOVUPS xmm4, [edx+16]
		! MOVAPS xmm6, xmm4
		! SHUFPS xmm6, xmm6, 00000000b
		! MULPS  xmm6, xmm0
		! MOVAPS xmm5, xmm4
		! SHUFPS xmm5, xmm5, 01010101b
		! MULPS  xmm5, xmm1
		! ADDPS  xmm6, xmm5
		! MOVAPS xmm5, xmm4
		! SHUFPS xmm5, xmm5, 10101010b
		! MULPS  xmm5, xmm2
		! ADDPS  xmm6, xmm5
		! MOVAPS xmm5, xmm4
		! SHUFPS xmm5, xmm5, 11111111b
		! MULPS  xmm5, xmm3
		! ADDPS  xmm6, xmm5
		! MOVUPS [eax+16], xmm6
		; Multiply with the right matrix (3rd column)
		! MOVUPS xmm4, [edx+32]
		! MOVAPS xmm6, xmm4
		! SHUFPS xmm6, xmm6, 00000000b
		! MULPS  xmm6, xmm0
		! MOVAPS xmm5, xmm4
		! SHUFPS xmm5, xmm5, 01010101b
		! MULPS  xmm5, xmm1
		! ADDPS  xmm6, xmm5
		! MOVAPS xmm5, xmm4
		! SHUFPS xmm5, xmm5, 10101010b
		! MULPS  xmm5, xmm2
		! ADDPS  xmm6, xmm5
		! MOVAPS xmm5, xmm4
		! SHUFPS xmm5, xmm5, 11111111b
		! MULPS  xmm5, xmm3
		! ADDPS  xmm6, xmm5
		! MOVUPS [eax+32], xmm6
		; Multiply with the right matrix (4th column)
		! MOVUPS xmm4, [edx+48]
		! MOVAPS xmm6, xmm4
		! SHUFPS xmm6, xmm6, 00000000b
		! MULPS  xmm6, xmm0
		! MOVAPS xmm5, xmm4
		! SHUFPS xmm5, xmm5, 01010101b
		! MULPS  xmm5, xmm1
		! ADDPS  xmm6, xmm5
		! MOVAPS xmm5, xmm4
		! SHUFPS xmm5, xmm5, 10101010b
		! MULPS  xmm5, xmm2
		! ADDPS  xmm6, xmm5
		! MOVAPS xmm5, xmm4
		! SHUFPS xmm5, xmm5, 11111111b
		! MULPS  xmm5, xmm3
		! ADDPS  xmm6, xmm5
		! MOVUPS [eax+48], xmm6
		; Restore xmm4-xmm6
		! MOVUPS xmm4, [p.v_Backup+00]
		! MOVUPS xmm5, [p.v_Backup+16]
		! MOVUPS xmm6, [p.v_Backup+32]
	CompilerEndIf
	
	ProcedureReturn
	
EndProcedure


Procedure.i UB2D_m4fMultiplicationNativ( *m4fResult.UB2D_MATRIX4f, *m4fLeft.UB2D_MATRIX4f, *m4fRight.UB2D_MATRIX4f )
	
	Protected m4fBackup.UB2D_MATRIX4f
	
	m4fBackup\I11 = *m4fLeft\I11 * *m4fRight\I11 + *m4fLeft\I12 * *m4fRight\I21 + *m4fLeft\I13 * *m4fRight\I31 + *m4fLeft\I14 * *m4fRight\I41
	m4fBackup\I12 = *m4fLeft\I11 * *m4fRight\I12 + *m4fLeft\I12 * *m4fRight\I22 + *m4fLeft\I13 * *m4fRight\I32 + *m4fLeft\I14 * *m4fRight\I42
	m4fBackup\I13 = *m4fLeft\I11 * *m4fRight\I13 + *m4fLeft\I12 * *m4fRight\I23 + *m4fLeft\I13 * *m4fRight\I33 + *m4fLeft\I14 * *m4fRight\I43
	m4fBackup\I14 = *m4fLeft\I11 * *m4fRight\I14 + *m4fLeft\I12 * *m4fRight\I24 + *m4fLeft\I13 * *m4fRight\I34 + *m4fLeft\I14 * *m4fRight\I44
	
	m4fBackup\I21 = *m4fLeft\I21 * *m4fRight\I11 + *m4fLeft\I22 * *m4fRight\I21 + *m4fLeft\I23 * *m4fRight\I31 + *m4fLeft\I24 * *m4fRight\I41
	m4fBackup\I22 = *m4fLeft\I21 * *m4fRight\I12 + *m4fLeft\I22 * *m4fRight\I22 + *m4fLeft\I23 * *m4fRight\I32 + *m4fLeft\I24 * *m4fRight\I42
	m4fBackup\I23 = *m4fLeft\I21 * *m4fRight\I13 + *m4fLeft\I22 * *m4fRight\I23 + *m4fLeft\I23 * *m4fRight\I33 + *m4fLeft\I24 * *m4fRight\I43
	m4fBackup\I24 = *m4fLeft\I21 * *m4fRight\I14 + *m4fLeft\I22 * *m4fRight\I24 + *m4fLeft\I23 * *m4fRight\I34 + *m4fLeft\I24 * *m4fRight\I44
	
	m4fBackup\I31 = *m4fLeft\I31 * *m4fRight\I11 + *m4fLeft\I32 * *m4fRight\I21 + *m4fLeft\I33 * *m4fRight\I31 + *m4fLeft\I34 * *m4fRight\I41
	m4fBackup\I32 = *m4fLeft\I31 * *m4fRight\I12 + *m4fLeft\I32 * *m4fRight\I22 + *m4fLeft\I33 * *m4fRight\I32 + *m4fLeft\I34 * *m4fRight\I42
	m4fBackup\I33 = *m4fLeft\I31 * *m4fRight\I13 + *m4fLeft\I32 * *m4fRight\I23 + *m4fLeft\I33 * *m4fRight\I33 + *m4fLeft\I34 * *m4fRight\I43
	m4fBackup\I34 = *m4fLeft\I31 * *m4fRight\I14 + *m4fLeft\I32 * *m4fRight\I24 + *m4fLeft\I33 * *m4fRight\I34 + *m4fLeft\I34 * *m4fRight\I44
	
	m4fBackup\I41 = *m4fLeft\I41 * *m4fRight\I11 + *m4fLeft\I42 * *m4fRight\I21 + *m4fLeft\I43 * *m4fRight\I31 + *m4fLeft\I44 * *m4fRight\I41
	m4fBackup\I42 = *m4fLeft\I41 * *m4fRight\I12 + *m4fLeft\I42 * *m4fRight\I22 + *m4fLeft\I43 * *m4fRight\I32 + *m4fLeft\I44 * *m4fRight\I42
	m4fBackup\I43 = *m4fLeft\I41 * *m4fRight\I13 + *m4fLeft\I42 * *m4fRight\I23 + *m4fLeft\I43 * *m4fRight\I33 + *m4fLeft\I44 * *m4fRight\I43
	m4fBackup\I44 = *m4fLeft\I41 * *m4fRight\I14 + *m4fLeft\I42 * *m4fRight\I24 + *m4fLeft\I43 * *m4fRight\I34 + *m4fLeft\I44 * *m4fRight\I44
	
	CopyMemory(@m4fBackup, *m4fResult, SizeOf(UB2D_MATRIX4f))
	
	ProcedureReturn *m4fResult
		
EndProcedure


Procedure.i UB2D_m4fRandom( *m4fResult.UB2D_MATRIX4f, fMax.f = 1.0, fMin.f = 0.0 )
	
	Protected I.i
	
	For I = 0 To 15
		PokeF(*m4fResult + SizeOf(Float)*I, (fMax-fMin) * 4.6566128752457969241e-10 * Random(2147483647) + fMin )
	Next
	
	ProcedureReturn *m4fResult
	
EndProcedure

Procedure   UB2D_m4fPrint( *m4fSource.UB2D_MATRIX4f )
	
	With *m4fSource
		PrintN( RSet(StrF(\I11, 3), 9)+RSet(StrF(\I12, 3), 9)+RSet(StrF(\I13, 3), 9)+RSet(StrF(\I14, 3), 9) )
		PrintN( RSet(StrF(\I21, 3), 9)+RSet(StrF(\I22, 3), 9)+RSet(StrF(\I23, 3), 9)+RSet(StrF(\I24, 3), 9) )
		PrintN( RSet(StrF(\I31, 3), 9)+RSet(StrF(\I32, 3), 9)+RSet(StrF(\I33, 3), 9)+RSet(StrF(\I34, 3), 9) )
		PrintN( RSet(StrF(\I41, 3), 9)+RSet(StrF(\I42, 3), 9)+RSet(StrF(\I43, 3), 9)+RSet(StrF(\I44, 3), 9) )
	EndWith
	
EndProcedure


Define.UB2D_MATRIX4f A, B
Define Time.i, TimeBias.i, I.i

#Count = 10000000

OpenConsole()
RandomSeed(1)
UB2D_m4fRandom(A, 0.6438257)
UB2D_m4fRandom(B, 0.6438257)

TimeBias = ElapsedMilliseconds()
For I = 1 To #Count
Next
TimeBias = ElapsedMilliseconds() - TimeBias
Time = ElapsedMilliseconds()
For I = 1 To #Count
	UB2D_m4fMultiplicationNativ(A, A, B)
	;UB2D_m4fMultiplicationASM(A, A, B)
Next
Time = ElapsedMilliseconds() - Time

UB2D_m4fPrint(A)

PrintN("Time: " + Str(Time-TimeBias)+" ms")
PrintN("Single Time: " + StrF(1.0e6*(Time-TimeBias)/#Count, 3)+" ns")
Input()

PB 6.01 ― Win 10, 21H2 ― Ryzen 9 3900X, 32 GB ― NVIDIA GeForce RTX 3080 ― Vivaldi 6.0 ― www.unionbytes.de
Lizard - Script language for symbolic calculations and more
Typeface - Sprite-based font include/module
StarBootics
Addict
Posts: 984
Joined: Sun Jul 07, 2013 11:35 am
Location: Canada

Re: Matrix Structure Speed test

Post by StarBootics »

Hello STARGATE,

My matrix structure actually looks more like this:

Code: Select all

Structure Matrix44
  
  VirtualTable.i
  e11.f : e21.f : e31.f : e41.f
  e12.f : e22.f : e32.f : e42.f
  e13.f : e23.f : e33.f : e43.f
  e14.f : e24.f : e34.f : e44.f
  
EndStructure
The elements of the matrix are laid out the same way (for compatibility with OpenGL); the only difference is the VirtualTable field in front of them.
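
As a quick sanity check of that layout, here is a minimal sketch (not part of the original project) that prints where the first float sits. It only uses the compiler functions OffsetOf() and SizeOf(), and it assumes default structure packing, i.e. no Align keyword on the structure.

```purebasic
; Sketch: inspect the layout of Matrix44 with a leading VirtualTable field.
; Assumption: default structure packing (no Align keyword).
Structure Matrix44
  VirtualTable.i
  e11.f
  e21.f
  e31.f
  e41.f
  e12.f
  e22.f
  e32.f
  e42.f
  e13.f
  e23.f
  e33.f
  e43.f
  e14.f
  e24.f
  e34.f
  e44.f
EndStructure

; Offset of the first float equals SizeOf(Integer): 8 on x64, 4 on x86
Debug "Offset of e11: " + Str(OffsetOf(Matrix44\e11))

; Total size: the pointer-sized field plus 16 floats of 4 bytes each
Debug "SizeOf(Matrix44): " + Str(SizeOf(Matrix44))

; The 16 floats stay contiguous, so each column starts 16 bytes after the previous one
Debug "e12 - e11: " + Str(OffsetOf(Matrix44\e12) - OffsetOf(Matrix44\e11))
```

Any routine that expects 16 contiguous floats therefore has to skip OffsetOf(Matrix44\e11) bytes first, and that offset differs between 32-bit and 64-bit builds.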

I have tried to adapt your code to mine, but I get an Invalid Memory Access error. (For simplicity, I removed the 32-bit code since I don't need it.)

Code: Select all

Procedure Multiply(*This.Matrix44, *Other.Matrix44)
  
  ; This procedure is supposed to mimic : This *= Other math operator in C++

  Protected *Result.Matrix44 = AllocateStructure(Matrix44)
  Protected *Backup.Matrix44 = AllocateStructure(Matrix44)
  
  *m4fResult = *Result + OffsetOf(Matrix44\e11)
  *m4fBackup = *Backup + OffsetOf(Matrix44\e11)
  *m4fThis = *This + OffsetOf(Matrix44\e11)
  *m4fOther = *Other + OffsetOf(Matrix44\e11)
  
  CompilerIf #PB_Compiler_Processor = #PB_Processor_x64
    ! MOV rax, [p.p_m4fResult]
    ! MOV rcx, [p.p_m4fThis]
    ! MOV rdx, [p.p_m4fOther]
    ; Load the left matrix
    ! MOVUPS xmm0, [rcx+00]
    ! MOVUPS xmm1, [rcx+16]
    ! MOVUPS xmm2, [rcx+32]
    ! MOVUPS xmm3, [rcx+48]
    ; Back up xmm4-xmm6 (xmm7 is not touched by this code)
    ! MOVUPS [p.p_m4fBackup+00], xmm4
    ! MOVUPS [p.p_m4fBackup+16], xmm5
    ! MOVUPS [p.p_m4fBackup+32], xmm6
  CompilerEndIf
  
  CompilerIf #PB_Compiler_Processor = #PB_Processor_x64
    ; Multiply by the right matrix (column 1)
    ! MOVUPS xmm4, [rdx+00]
    ! MOVAPS xmm6, xmm4
    ! SHUFPS xmm6, xmm6, 00000000b
    ! MULPS  xmm6, xmm0
    ! MOVAPS xmm5, xmm4
    ! SHUFPS xmm5, xmm5, 01010101b
    ! MULPS  xmm5, xmm1
    ! ADDPS  xmm6, xmm5
    ! MOVAPS xmm5, xmm4
    ! SHUFPS xmm5, xmm5, 10101010b
    ! MULPS  xmm5, xmm2
    ! ADDPS  xmm6, xmm5
    ! MOVAPS xmm5, xmm4
    ! SHUFPS xmm5, xmm5, 11111111b
    ! MULPS  xmm5, xmm3
    ! ADDPS  xmm6, xmm5
    ! MOVUPS [rax+00], xmm6
    ; Multiply by the right matrix (column 2)
    ! MOVUPS xmm4, [rdx+16]
    ! MOVAPS xmm6, xmm4
    ! SHUFPS xmm6, xmm6, 00000000b
    ! MULPS  xmm6, xmm0
    ! MOVAPS xmm5, xmm4
    ! SHUFPS xmm5, xmm5, 01010101b
    ! MULPS  xmm5, xmm1
    ! ADDPS  xmm6, xmm5
    ! MOVAPS xmm5, xmm4
    ! SHUFPS xmm5, xmm5, 10101010b
    ! MULPS  xmm5, xmm2
    ! ADDPS  xmm6, xmm5
    ! MOVAPS xmm5, xmm4
    ! SHUFPS xmm5, xmm5, 11111111b
    ! MULPS  xmm5, xmm3
    ! ADDPS  xmm6, xmm5
    ! MOVUPS [rax+16], xmm6
    ; Multiply by the right matrix (column 3)
    ! MOVUPS xmm4, [rdx+32]
    ! MOVAPS xmm6, xmm4
    ! SHUFPS xmm6, xmm6, 00000000b
    ! MULPS  xmm6, xmm0
    ! MOVAPS xmm5, xmm4
    ! SHUFPS xmm5, xmm5, 01010101b
    ! MULPS  xmm5, xmm1
    ! ADDPS  xmm6, xmm5
    ! MOVAPS xmm5, xmm4
    ! SHUFPS xmm5, xmm5, 10101010b
    ! MULPS  xmm5, xmm2
    ! ADDPS  xmm6, xmm5
    ! MOVAPS xmm5, xmm4
    ! SHUFPS xmm5, xmm5, 11111111b
    ! MULPS  xmm5, xmm3
    ! ADDPS  xmm6, xmm5
    ! MOVUPS [rax+32], xmm6
    ; Multiply by the right matrix (column 4)
    ! MOVUPS xmm4, [rdx+48]
    ! MOVAPS xmm6, xmm4
    ! SHUFPS xmm6, xmm6, 00000000b
    ! MULPS  xmm6, xmm0
    ! MOVAPS xmm5, xmm4
    ! SHUFPS xmm5, xmm5, 01010101b
    ! MULPS  xmm5, xmm1
    ! ADDPS  xmm6, xmm5
    ! MOVAPS xmm5, xmm4
    ! SHUFPS xmm5, xmm5, 10101010b
    ! MULPS  xmm5, xmm2
    ! ADDPS  xmm6, xmm5
    ! MOVAPS xmm5, xmm4
    ! SHUFPS xmm5, xmm5, 11111111b
    ! MULPS  xmm5, xmm3
    ! ADDPS  xmm6, xmm5
    ! MOVUPS [rax+48], xmm6
    ; Restore xmm4-xmm6 (xmm7 is not touched by this code)
    ! MOVUPS xmm4, [p.p_m4fBackup+00]
    ! MOVUPS xmm5, [p.p_m4fBackup+16]
    ! MOVUPS xmm6, [p.p_m4fBackup+32]
  CompilerEndIf
  
  CopyMemory(*Result, *This, SizeOf(Matrix44))
  FreeStructure(*Result)
  FreeStructure(*Backup)
  
  ProcedureReturn 
EndProcedure
As I said before, my knowledge of assembler is very limited.
Any help or explanation would be welcome.

Best regards
StarBootics
The Stone Age did not end due to a shortage of stones !
STARGÅTE
Addict
Posts: 2084
Joined: Thu Jan 10, 2008 1:30 pm
Location: Germany, Glienicke

Re: Matrix Structure Speed test

Post by STARGÅTE »

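The protected "Backup" variable is a buffer for the registers xmm4 to xmm7, which have to be saved before they are changed. In your version, [p.p_m4fBackup+..] addresses the stack slot of the pointer variable itself rather than the memory it points to, so the 16-byte stores overwrite neighbouring stack data.
You need a buffer structure with a size of 4*16 bytes here: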

Try this:

Code: Select all

Structure Matrix44
  
  VirtualTable.i
  e11.f : e21.f : e31.f : e41.f
  e12.f : e22.f : e32.f : e42.f
  e13.f : e23.f : e33.f : e43.f
  e14.f : e24.f : e34.f : e44.f
  
EndStructure

Structure Buffer
	sse.f[16]
EndStructure

Procedure Multiply(*This.Matrix44, *Other.Matrix44)
	
	Protected Backup.Buffer
	
	; This procedure is supposed to mimic : This *= Other math operator in C++

  CompilerIf #PB_Compiler_Processor = #PB_Processor_x64
		! MOV rax, [p.p_This]
		! MOV rcx, [p.p_This]
		! MOV rdx, [p.p_Other]
		; Load the left matrix
		! MOVUPS xmm0, [rcx+00+8]  ; + 8 because of VirtualTable.i
		! MOVUPS xmm1, [rcx+16+8]  ; + 8 because of VirtualTable.i
		! MOVUPS xmm2, [rcx+32+8]  ; + 8 because of VirtualTable.i
		! MOVUPS xmm3, [rcx+48+8]  ; + 8 because of VirtualTable.i
		; Back up xmm4-xmm6 (xmm7 is not touched by this code)
		! MOVUPS [p.v_Backup+00], xmm4
		! MOVUPS [p.v_Backup+16], xmm5
		! MOVUPS [p.v_Backup+32], xmm6
  CompilerEndIf
  
  CompilerIf #PB_Compiler_Processor = #PB_Processor_x64
    ; Multiply by the right matrix (column 1)
    ! MOVUPS xmm4, [rdx+00+8] ; + 8 because of VirtualTable.i
    ! MOVAPS xmm6, xmm4
    ! SHUFPS xmm6, xmm6, 00000000b
    ! MULPS  xmm6, xmm0
    ! MOVAPS xmm5, xmm4
    ! SHUFPS xmm5, xmm5, 01010101b
    ! MULPS  xmm5, xmm1
    ! ADDPS  xmm6, xmm5
    ! MOVAPS xmm5, xmm4
    ! SHUFPS xmm5, xmm5, 10101010b
    ! MULPS  xmm5, xmm2
    ! ADDPS  xmm6, xmm5
    ! MOVAPS xmm5, xmm4
    ! SHUFPS xmm5, xmm5, 11111111b
    ! MULPS  xmm5, xmm3
    ! ADDPS  xmm6, xmm5
    ! MOVUPS [rax+00+8], xmm6  ; + 8 because of VirtualTable.i
    ; Multiply by the right matrix (column 2)
    ! MOVUPS xmm4, [rdx+16+8] ; + 8 because of VirtualTable.i
    ! MOVAPS xmm6, xmm4
    ! SHUFPS xmm6, xmm6, 00000000b
    ! MULPS  xmm6, xmm0
    ! MOVAPS xmm5, xmm4
    ! SHUFPS xmm5, xmm5, 01010101b
    ! MULPS  xmm5, xmm1
    ! ADDPS  xmm6, xmm5
    ! MOVAPS xmm5, xmm4
    ! SHUFPS xmm5, xmm5, 10101010b
    ! MULPS  xmm5, xmm2
    ! ADDPS  xmm6, xmm5
    ! MOVAPS xmm5, xmm4
    ! SHUFPS xmm5, xmm5, 11111111b
    ! MULPS  xmm5, xmm3
    ! ADDPS  xmm6, xmm5
    ! MOVUPS [rax+16+8], xmm6  ; + 8 because of VirtualTable.i
    ; Multiply by the right matrix (column 3)
    ! MOVUPS xmm4, [rdx+32+8] ; + 8 because of VirtualTable.i
    ! MOVAPS xmm6, xmm4
    ! SHUFPS xmm6, xmm6, 00000000b
    ! MULPS  xmm6, xmm0
    ! MOVAPS xmm5, xmm4
    ! SHUFPS xmm5, xmm5, 01010101b
    ! MULPS  xmm5, xmm1
    ! ADDPS  xmm6, xmm5
    ! MOVAPS xmm5, xmm4
    ! SHUFPS xmm5, xmm5, 10101010b
    ! MULPS  xmm5, xmm2
    ! ADDPS  xmm6, xmm5
    ! MOVAPS xmm5, xmm4
    ! SHUFPS xmm5, xmm5, 11111111b
    ! MULPS  xmm5, xmm3
    ! ADDPS  xmm6, xmm5
    ! MOVUPS [rax+32+8], xmm6  ; + 8 because of VirtualTable.i
    ; Multiply by the right matrix (column 4)
    ! MOVUPS xmm4, [rdx+48+8] ; + 8 because of VirtualTable.i
    ! MOVAPS xmm6, xmm4
    ! SHUFPS xmm6, xmm6, 00000000b
    ! MULPS  xmm6, xmm0
    ! MOVAPS xmm5, xmm4
    ! SHUFPS xmm5, xmm5, 01010101b
    ! MULPS  xmm5, xmm1
    ! ADDPS  xmm6, xmm5
    ! MOVAPS xmm5, xmm4
    ! SHUFPS xmm5, xmm5, 10101010b
    ! MULPS  xmm5, xmm2
    ! ADDPS  xmm6, xmm5
    ! MOVAPS xmm5, xmm4
    ! SHUFPS xmm5, xmm5, 11111111b
    ! MULPS  xmm5, xmm3
    ! ADDPS  xmm6, xmm5
    ! MOVUPS [rax+48+8], xmm6  ; + 8 because of VirtualTable.i
    ; Restore xmm4-xmm6 (xmm7 is not touched by this code)
    ! MOVUPS xmm4, [p.v_Backup+00]
    ! MOVUPS xmm5, [p.v_Backup+16]
    ! MOVUPS xmm6, [p.v_Backup+32]
  CompilerEndIf
	  
  ProcedureReturn 
EndProcedure
Edit: Sorry, bug fix.
skywalk
Addict
Posts: 3994
Joined: Wed Dec 23, 2009 10:14 pm
Location: Boston, MA

Re: Matrix Structure Speed test

Post by skywalk »

STARGÅTE wrote: I don't know.
Measurements on my system give the following results:
PB 6.00, native code, ASM backend: 45 ns
PB 6.00, native code, C backend: 41 ns
PB 6.00, native code, C backend + optimization: 16 ns
PB 6.00, ASM Code, ASM backend: 6.3 ns
At the end, you need also the C-SSE-instructions for optimization.
This is great and matches Fred's earlier blog post. There is an easy improvement of roughly 2.5x to 3x (41-45 ns down to 16 ns) just from the C optimizer. Then, crafty ASM guys can do about 2.5x better again (16 ns down to 6.3 ns). 8)

The next question is an actual data comparison with the C optimizer:
Do you really get the same numeric results?
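
One way to check this is to run the same inputs through both versions and compare the results element by element. The sketch below is only an illustration: `FloatArray16` and `MatricesAlmostEqual` are hypothetical names, not part of anyone's library, and the tolerance of 1.0e-5 is an assumed value. A tolerance is used instead of exact equality because rounding and intermediate precision may differ between backends.

```purebasic
; Sketch: compare two 4x4 float matrices element by element.
; Assumption: *A and *B each point at 16 contiguous floats.
; FloatArray16 and MatricesAlmostEqual are hypothetical helper names.
Structure FloatArray16
  f.f[16]
EndStructure

Procedure.i MatricesAlmostEqual(*A.FloatArray16, *B.FloatArray16, RelTol.f = 1.0e-5)
  Protected I.i, Diff.f, Scale.f
  
  For I = 0 To 15
    Diff = Abs(*A\f[I] - *B\f[I])
    
    ; Scale the tolerance by the larger magnitude, but never below 1.0,
    ; so values close to zero are compared with an absolute tolerance
    Scale = Abs(*A\f[I])
    If Abs(*B\f[I]) > Scale
      Scale = Abs(*B\f[I])
    EndIf
    If Scale < 1.0
      Scale = 1.0
    EndIf
    
    If Diff > RelTol * Scale
      ProcedureReturn #False
    EndIf
  Next
  
  ProcedureReturn #True
EndProcedure

Define X.FloatArray16, Y.FloatArray16, K.i

For K = 0 To 15
  X\f[K] = K * 0.5
  Y\f[K] = K * 0.5
Next

Debug MatricesAlmostEqual(@X, @Y)  ; identical contents

Y\f[5] + 0.01                      ; disturb one element
Debug MatricesAlmostEqual(@X, @Y)  ; one element now differs by 0.01
```

For a matrix structure that starts with a VirtualTable field, the pointer passed in would have to be advanced by OffsetOf(Matrix44\e11) first.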
The nice thing about standards is there are so many to choose from. ~ Andrew Tanenbaum
StarBootics
Addict
Posts: 984
Joined: Sun Jul 07, 2013 11:35 am
Location: Canada

Re: Matrix Structure Speed test

Post by StarBootics »

Hello everyone,

1st: Thanks to STARGATE; luckily you already had some code written. On your computer, even the native PureBasic code already executes pretty fast.

2nd: @skywalk: I can't say anything about the C backend since I'm on Linux, but the assembly-optimized code works absolutely fine. The animations work the same way as before, and indeed a little bit faster. That being said, I'm testing everything at 30 FPS, so I have plenty of time between two frames to do the calculations. This is for the editor; the game will have a bigger workload every frame, so the less time wasted the better.
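
To put the 30 FPS remark into numbers, here is a small back-of-the-envelope sketch. It assumes the per-call timings quoted earlier in the thread (45 ns native, 6.3 ns SSE), which were measured on a different machine, so the results are only an order of magnitude: a frame lasts about 33.3 ms, which leaves room for roughly 740,000 native or 5.3 million SSE multiplications if nothing else runs.

```purebasic
; Sketch: how many 4x4 multiplications fit into one frame at 30 FPS.
; Assumption: the per-call timings (45 ns native, 6.3 ns SSE) are the ones
; quoted earlier in this thread, measured on a different machine.
#FramesPerSecond = 30

Define FrameBudgetNs.d = 1.0e9 / #FramesPerSecond   ; frame time in nanoseconds

Debug "Frame budget: " + StrD(FrameBudgetNs / 1.0e6, 1) + " ms"
Debug "Native code: " + Str(Int(FrameBudgetNs / 45.0)) + " multiplications per frame"
Debug "SSE code: " + Str(Int(FrameBudgetNs / 6.3)) + " multiplications per frame"
```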

Best regards
StarBootics
StarBootics
Addict
Posts: 984
Joined: Sun Jul 07, 2013 11:35 am
Location: Canada

Re: Matrix Structure Speed test

Post by StarBootics »

Hello everyone,

Here is some more SSE acceleration demonstration:

Code: Select all

; <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
; Project name : Matrix Structure Speed Test 3
; File Name : Matrix Structure Speed Test 3.pb
; File version: 1.0.0
; Programming : OK
; Programmed by : StarBootics
; Date : June 9th, 2021
; Last Update : June 9th, 2021
; PureBasic code : V5.73 LTS
; Platform : Windows, Linux, MacOS X
; <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<

Structure Matrix44
  
  VirtualTable.i
  e11.f
  e21.f
  e31.f
  e41.f
  e12.f
  e22.f
  e32.f
  e42.f
  e13.f
  e23.f
  e33.f
  e43.f
  e14.f
  e24.f
  e34.f
  e44.f
  
EndStructure

Structure Vector4
  sse.f[4]
EndStructure

Procedure Add(*This.Matrix44, *Other.Matrix44)
  
    *This\e11 + *Other\e11
    *This\e21 + *Other\e21
    *This\e31 + *Other\e31
    *This\e41 + *Other\e41
    *This\e12 + *Other\e12
    *This\e22 + *Other\e22
    *This\e32 + *Other\e32
    *This\e42 + *Other\e42
    *This\e13 + *Other\e13
    *This\e23 + *Other\e23
    *This\e33 + *Other\e33
    *This\e43 + *Other\e43
    *This\e14 + *Other\e14
    *This\e24 + *Other\e24
    *This\e34 + *Other\e34
    *This\e44 + *Other\e44
  
EndProcedure

Procedure Subtract(*This.Matrix44, *Other.Matrix44)

    *This\e11 - *Other\e11
    *This\e21 - *Other\e21
    *This\e31 - *Other\e31
    *This\e41 - *Other\e41
    *This\e12 - *Other\e12
    *This\e22 - *Other\e22
    *This\e32 - *Other\e32
    *This\e42 - *Other\e42
    *This\e13 - *Other\e13
    *This\e23 - *Other\e23
    *This\e33 - *Other\e33
    *This\e43 - *Other\e43
    *This\e14 - *Other\e14
    *This\e24 - *Other\e24
    *This\e34 - *Other\e34
    *This\e44 - *Other\e44
  
EndProcedure

Procedure ProductByScalar(*This.Matrix44, P_Scalar.f)
  
  *This\e11 * P_Scalar
  *This\e21 * P_Scalar
  *This\e31 * P_Scalar
  *This\e41 * P_Scalar
  *This\e12 * P_Scalar
  *This\e22 * P_Scalar
  *This\e32 * P_Scalar
  *This\e42 * P_Scalar
  *This\e13 * P_Scalar
  *This\e23 * P_Scalar
  *This\e33 * P_Scalar
  *This\e43 * P_Scalar
  *This\e14 * P_Scalar
  *This\e24 * P_Scalar
  *This\e34 * P_Scalar
  *This\e44 * P_Scalar
  
EndProcedure

Procedure AddSSE(*This.Matrix44, *Other.Matrix44)

  CompilerIf #PB_Compiler_Processor = #PB_Processor_x64
    
    ! MOV rax, [p.p_This]
    ! MOV rcx, [p.p_Other]
    
    ! MOVUPS xmm0, [rax+00+8]
    ! MOVUPS xmm1, [rcx+00+8]
    ! ADDPS  xmm1, xmm0
    ! MOVUPS [rax+00+8], xmm1
    
    ! MOVUPS xmm0, [rax+16+8]
    ! MOVUPS xmm1, [rcx+16+8]
    ! ADDPS  xmm1, xmm0
    ! MOVUPS [rax+16+8], xmm1
    
    ! MOVUPS xmm0, [rax+32+8]
    ! MOVUPS xmm1, [rcx+32+8]
    ! ADDPS  xmm1, xmm0
    ! MOVUPS [rax+32+8], xmm1
    
    ! MOVUPS xmm0, [rax+48+8]
    ! MOVUPS xmm1, [rcx+48+8]
    ! ADDPS  xmm1, xmm0
    ! MOVUPS [rax+48+8], xmm1
    
  CompilerElse
    ; x86: VirtualTable.i is only 4 bytes, so the floats start at offset 4
    ! MOV eax, [p.p_This]
    ! MOV ecx, [p.p_Other]
    
    ! MOVUPS xmm0, [eax+00+4]
    ! MOVUPS xmm1, [ecx+00+4]
    ! ADDPS  xmm1, xmm0
    ! MOVUPS [eax+00+4], xmm1
    
    ! MOVUPS xmm0, [eax+16+4]
    ! MOVUPS xmm1, [ecx+16+4]
    ! ADDPS  xmm1, xmm0
    ! MOVUPS [eax+16+4], xmm1
    
    ! MOVUPS xmm0, [eax+32+4]
    ! MOVUPS xmm1, [ecx+32+4]
    ! ADDPS  xmm1, xmm0
    ! MOVUPS [eax+32+4], xmm1
    
    ! MOVUPS xmm0, [eax+48+4]
    ! MOVUPS xmm1, [ecx+48+4]
    ! ADDPS  xmm1, xmm0
    ! MOVUPS [eax+48+4], xmm1
    
  CompilerEndIf

EndProcedure

Procedure SubtractSSE(*This.Matrix44, *Other.Matrix44)

  CompilerIf #PB_Compiler_Processor = #PB_Processor_x64
    
    ! MOV rax, [p.p_This]
    ! MOV rcx, [p.p_Other]
    
    ! MOVUPS xmm1, [rax+00+8]
    ! MOVUPS xmm0, [rcx+00+8]
    ! SUBPS  xmm1, xmm0
    ! MOVUPS [rax+00+8], xmm1
    
    ! MOVUPS xmm1, [rax+16+8]
    ! MOVUPS xmm0, [rcx+16+8]
    ! SUBPS  xmm1, xmm0
    ! MOVUPS [rax+16+8], xmm1
    
    ! MOVUPS xmm1, [rax+32+8]
    ! MOVUPS xmm0, [rcx+32+8]
    ! SUBPS  xmm1, xmm0
    ! MOVUPS [rax+32+8], xmm1
    
    ! MOVUPS xmm1, [rax+48+8]
    ! MOVUPS xmm0, [rcx+48+8]
    ! SUBPS  xmm1, xmm0
    ! MOVUPS [rax+48+8], xmm1
    
  CompilerElse
    ; x86: VirtualTable.i is only 4 bytes, so the floats start at offset 4
    ! MOV eax, [p.p_This]
    ! MOV ecx, [p.p_Other]
    
    ! MOVUPS xmm1, [eax+00+4]
    ! MOVUPS xmm0, [ecx+00+4]
    ! SUBPS  xmm1, xmm0
    ! MOVUPS [eax+00+4], xmm1
    
    ! MOVUPS xmm1, [eax+16+4]
    ! MOVUPS xmm0, [ecx+16+4]
    ! SUBPS  xmm1, xmm0
    ! MOVUPS [eax+16+4], xmm1
    
    ! MOVUPS xmm1, [eax+32+4]
    ! MOVUPS xmm0, [ecx+32+4]
    ! SUBPS  xmm1, xmm0
    ! MOVUPS [eax+32+4], xmm1
    
    ! MOVUPS xmm1, [eax+48+4]
    ! MOVUPS xmm0, [ecx+48+4]
    ! SUBPS  xmm1, xmm0
    ! MOVUPS [eax+48+4], xmm1
    
  CompilerEndIf
  
EndProcedure

Procedure ProductByScalarSSE(*This.Matrix44, P_Scalar.f)
  
  Protected Vector.Vector4
  
  Vector\sse[0] = P_Scalar
  Vector\sse[1] = P_Scalar
  Vector\sse[2] = P_Scalar
  Vector\sse[3] = P_Scalar
  
  CompilerIf #PB_Compiler_Processor = #PB_Processor_x64
    
    ! MOV rax, [p.p_This]
    ! MOVUPS xmm1, [rax+00+8]
    ! MOVUPS xmm0, [p.v_Vector]
    ! MULPS  xmm1, xmm0
    ! MOVUPS [rax+00+8], xmm1
    
    ! MOVUPS xmm1, [rax+16+8]
    ! MULPS  xmm1, xmm0
    ! MOVUPS [rax+16+8], xmm1
    
    ! MOVUPS xmm1, [rax+32+8]
    ! MULPS  xmm1, xmm0
    ! MOVUPS [rax+32+8], xmm1
    
    ! MOVUPS xmm1, [rax+48+8]
    ! MULPS  xmm1, xmm0
    ! MOVUPS [rax+48+8], xmm1

  CompilerElse
    
    ; x86: VirtualTable.i is only 4 bytes, so the floats start at offset 4
    ! MOV eax, [p.p_This]
    ! MOVUPS xmm0, [p.v_Vector]
    
    ! MOVUPS xmm1, [eax+00+4]
    ! MULPS  xmm1, xmm0
    ! MOVUPS [eax+00+4], xmm1
    
    ! MOVUPS xmm1, [eax+16+4]
    ! MULPS  xmm1, xmm0
    ! MOVUPS [eax+16+4], xmm1
    
    ! MOVUPS xmm1, [eax+32+4]
    ! MULPS  xmm1, xmm0
    ! MOVUPS [eax+32+4], xmm1
    
    ! MOVUPS xmm1, [eax+48+4]
    ! MULPS  xmm1, xmm0
    ! MOVUPS [eax+48+4], xmm1
    
  CompilerEndIf
  
EndProcedure


MatA.Matrix44
MatA\e11 = 1.0
MatA\e21 = 2.0
MatA\e31 = 3.0
MatA\e41 = 4.0

MatA\e12 = 5.0
MatA\e22 = 6.0
MatA\e32 = 7.0
MatA\e42 = 8.0

MatA\e13 = 9.0
MatA\e23 = 10.0
MatA\e33 = 11.0
MatA\e43 = 12.0

MatA\e14 = 13.0
MatA\e24 = 14.0
MatA\e34 = 15.0
MatA\e44 = 16.0

MatB.Matrix44
MatB\e11 = 1.0
MatB\e21 = 1.0
MatB\e31 = 1.0
MatB\e41 = 1.0

MatB\e12 = 1.0
MatB\e22 = 1.0
MatB\e32 = 1.0
MatB\e42 = 1.0

MatB\e13 = 1.0
MatB\e23 = 1.0
MatB\e33 = 1.0
MatB\e43 = 1.0

MatB\e14 = 1.0
MatB\e24 = 1.0
MatB\e34 = 1.0
MatB\e44 = 1.0

For TestID = 0 To 4
  
  TempsDepart0 = ElapsedMilliseconds()
  
  For Index = 0 To 100000
    Add(MatA, MatB)
    Subtract(MatA, MatB)
  Next
  
  TempsEcoule0 = ElapsedMilliseconds()-TempsDepart0
  
  TempsDepart1 = ElapsedMilliseconds()
  
  For Index = 0 To 100000
    AddSSE(MatA, MatB)
    SubtractSSE(MatA, MatB)
  Next
  
  TempsEcoule1 = ElapsedMilliseconds()-TempsDepart1
  
  TempsDepart2 = ElapsedMilliseconds()
  
  For Index = 0 To 100000
    ProductByScalar(MatB, 1.0)
  Next
  
  TempsEcoule2 = ElapsedMilliseconds()-TempsDepart2
  
  TempsDepart3 = ElapsedMilliseconds()
  
  For Index = 0 To 100000
    ProductByScalarSSE(MatB, 1.0)
  Next
  
  TempsEcoule3 = ElapsedMilliseconds()-TempsDepart3
   
  MessageRequester("Elapsed Time", "Matrix Add-Subtract : " + Str(TempsEcoule0) + " milliseconds" + #LF$ + "Matrix Add-Subtract SSE : " + Str(TempsEcoule1) + " milliseconds" + #LF$ + "Matrix ProductByScalar : " + Str(TempsEcoule2) + " milliseconds" + #LF$ + "Matrix ProductByScalar SSE : " + Str(TempsEcoule3) + " milliseconds")
  
Next

; <<<<<<<<<<<<<<<<<<<<<<<
; <<<<< END OF FILE <<<<<
; <<<<<<<<<<<<<<<<<<<<<<<
Matrix Add-Subtract : 8 milliseconds
Matrix Add-Subtract SSE : 2 milliseconds
Matrix ProductByScalar : 4 milliseconds
Matrix ProductByScalar SSE : 2 milliseconds

Matrix Add-Subtract : 36 milliseconds
Matrix Add-Subtract SSE : 8 milliseconds
Matrix ProductByScalar : 6 milliseconds
Matrix ProductByScalar SSE : 2 milliseconds

Matrix Add-Subtract : 36 milliseconds
Matrix Add-Subtract SSE : 8 milliseconds
Matrix ProductByScalar : 18 milliseconds
Matrix ProductByScalar SSE : 5 milliseconds

Matrix Add-Subtract : 40 milliseconds
Matrix Add-Subtract SSE : 10 milliseconds
Matrix ProductByScalar : 8 milliseconds
Matrix ProductByScalar SSE : 2 milliseconds

Matrix Add-Subtract : 27 milliseconds
Matrix Add-Subtract SSE : 3 milliseconds
Matrix ProductByScalar : 6 milliseconds
Matrix ProductByScalar SSE : 3 milliseconds
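
To summarise the five runs above, the following sketch averages the posted timings and derives the speed-up. The numbers are copied from the results as posted; nothing is measured here. Averaged, Add-Subtract takes about 29.4 ms against 6.2 ms with SSE (roughly 4.7x), and ProductByScalar about 8.4 ms against 2.8 ms (3x). Since ElapsedMilliseconds() only resolves whole milliseconds, the 2 to 3 ms SSE figures are coarse, so these ratios are indicative at best.

```purebasic
; Sketch: average the five posted runs (in milliseconds) and derive the speed-up.
; The timings are copied from the results above, not measured by this code.
Define AddSub.d    = (8 + 36 + 36 + 40 + 27) / 5.0
Define AddSubSSE.d = (2 + 8 + 8 + 10 + 3) / 5.0
Define Scalar.d    = (4 + 6 + 18 + 8 + 6) / 5.0
Define ScalarSSE.d = (2 + 2 + 5 + 2 + 3) / 5.0

Debug "Add-Subtract: " + StrD(AddSub, 1) + " ms, SSE: " + StrD(AddSubSSE, 1) + " ms, speed-up: x" + StrD(AddSub / AddSubSSE, 1)
Debug "ProductByScalar: " + StrD(Scalar, 1) + " ms, SSE: " + StrD(ScalarSSE, 1) + " ms, speed-up: x" + StrD(Scalar / ScalarSSE, 1)
```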
Best regards
StarBootics