## Matrix Structure Speed test

StarBootics

### Matrix Structure Speed test

Hello everyone,

Apparently, using static arrays inside structures is pretty darn slow compared to plain fields.


``````; <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
; Project name : Matrix Structure Speed test
; File Name : Matrix Structure Speed test.pb
; File version: 1.0.0
; Programming : OK
; Programmed by : StarBootics
; Date : 20-03-2021
; Last Update : 20-03-2021
; PureBasic code : V5.73LTS
; Platform : Windows, Linux, MacOS X
; <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<

Structure Line

Element1D.f[4]

EndStructure

Structure Matrix4f

Element2D.Line[4]

EndStructure

Structure Matrix44f
e.f[16]
EndStructure

Structure Matrix44

e11.f
e21.f
e31.f
e41.f
e12.f
e22.f
e32.f
e42.f
e13.f
e23.f
e33.f
e43.f
e14.f
e24.f
e34.f
e44.f

EndStructure

Macro AccessMatrix44f(MatrixA, Line, Column)
MatrixA\e[Column * 4 + Line]
EndMacro

Macro AccessMatrix4f(MatrixA, Line, Column)

MatrixA\Element2D[Column]\Element1D[Line]

EndMacro

Procedure Identity4f(*MatrixA.Matrix4f)

For i = 0 To 3

For j = 0 To 3

If i = j
AccessMatrix4f(*MatrixA, i, j) = 1.0
Else
AccessMatrix4f(*MatrixA, i, j) = 0.0
EndIf

Next

Next

EndProcedure

Procedure Identity44f(*MatrixA.Matrix44f)

For i = 0 To 3

For j = 0 To 3

If i = j
AccessMatrix44f(*MatrixA, i, j) = 1.0
Else
AccessMatrix44f(*MatrixA, i, j) = 0.0
EndIf

Next

Next

EndProcedure

Procedure Identity44(*Identity.Matrix44)

*Identity\e11 = 1.0
*Identity\e21 = 0.0
*Identity\e31 = 0.0
*Identity\e41 = 0.0
*Identity\e12 = 0.0
*Identity\e22 = 1.0
*Identity\e32 = 0.0
*Identity\e42 = 0.0
*Identity\e13 = 0.0
*Identity\e23 = 0.0
*Identity\e33 = 1.0
*Identity\e43 = 0.0
*Identity\e14 = 0.0
*Identity\e24 = 0.0
*Identity\e34 = 0.0
*Identity\e44 = 1.0

EndProcedure

For TestID = 0 To 4

TempsDepart0 = ElapsedMilliseconds()

For Index = 0 To 100000
Identity4f(MatrixA.Matrix4f)
Next

TempsEcoule0 = ElapsedMilliseconds()-TempsDepart0

TempsDepart1 = ElapsedMilliseconds()

For Index = 0 To 100000
Identity44f(MatrixB.Matrix44f)
Next

TempsEcoule1 = ElapsedMilliseconds()-TempsDepart1

TempsDepart2 = ElapsedMilliseconds()

For Index = 0 To 100000
Identity44(MatrixC.Matrix44)
Next

TempsEcoule2 = ElapsedMilliseconds()-TempsDepart2

MessageRequester("Elapsed Time", "Matrix4f : " + Str(TempsEcoule0) + " milliseconds" + #LF$ + "Matrix44f : " + Str(TempsEcoule1) +" milliseconds"+ #LF$ + "Matrix44 : " + Str(TempsEcoule2) +" milliseconds")

Next

; <<<<<<<<<<<<<<<<<<<<<<<
; <<<<< END OF FILE <<<<<
; <<<<<<<<<<<<<<<<<<<<<<<``````
When I run the code above, this is what I get:

Matrix4f : 10 milliseconds
Matrix44f : 8 milliseconds
Matrix44 : 2 milliseconds

Matrix4f : 13 milliseconds
Matrix44f : 8 milliseconds
Matrix44 : 2 milliseconds

Matrix4f : 13 milliseconds
Matrix44f : 8 milliseconds
Matrix44 : 2 milliseconds

Matrix4f : 12 milliseconds
Matrix44f : 8 milliseconds
Matrix44 : 2 milliseconds

Matrix4f : 12 milliseconds
Matrix44f : 9 milliseconds
Matrix44 : 2 milliseconds

The question is: what makes the use of static arrays so slow? The memory offset calculation? The nested For loops? Or both?

Thanks in advance.
StarBootics
The Stone Age did not end due to a shortage of stones !
STARGÅTE

### Re: Matrix Structure Speed test

It is mainly the nested loop, because it introduces multiple comparisons and jumps.
Whenever you want to speed up code, write out the individual lines instead of using a For loop.

Btw: If you want to speed up 4D matrix stuff, you can (or have to) use the SSE instruction set in ASM.
SSE gives you 128-bit registers, so you can calculate with 4 floats at once.

Here is an example:


``````Procedure.i IdentitySSE( *m4fOut.Matrix44, fValue.f = 1.0 )

CompilerIf #PB_Compiler_Processor = #PB_Processor_x64
! MOV     rax, [p.p_m4fOut]
! MOVSS   xmm0, [p.v_fValue]
! MOVUPS  [rax+00], xmm0
! SHUFPS  xmm0, xmm0, 10010011b   ; Rotate the components: xyzw -> wxyz
! MOVUPS  [rax+16], xmm0
! SHUFPS  xmm0, xmm0, 10010011b
! MOVUPS  [rax+32], xmm0
! SHUFPS  xmm0, xmm0, 10010011b
! MOVUPS  [rax+48], xmm0
CompilerElse
! MOV     eax, [p.p_m4fOut]
! MOVSS   xmm0, [p.v_fValue]
! MOVUPS  [eax+00], xmm0
! SHUFPS  xmm0, xmm0, 10010011b   ; Rotate the components: xyzw -> wxyz
! MOVUPS  [eax+16], xmm0
! SHUFPS  xmm0, xmm0, 10010011b
! MOVUPS  [eax+32], xmm0
! SHUFPS  xmm0, xmm0, 10010011b
! MOVUPS  [eax+48], xmm0
CompilerEndIf

ProcedureReturn

EndProcedure``````
Elapsed time:

Matrix4f : 27 milliseconds
Matrix44f : 27 milliseconds
Matrix44 : 6 milliseconds
MatrixSSE: 2 milliseconds
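
For readers who do not know SSE, here is a rough plain-PureBasic model of what the `SHUFPS xmm0, xmm0, 10010011b` step does when source and destination are the same register. It is only a sketch: `Vec4` and `ShufflePS` are made-up names for this illustration, and the real instruction of course works on a register, not on memory.

```purebasic
; Plain-PureBasic model of "SHUFPS xmm0, xmm0, imm8" (same source and destination).
; Vec4 and ShufflePS are hypothetical names used only for this illustration.
Structure Vec4
c.f[4]
EndStructure

Procedure ShufflePS(*v.Vec4, Mask.i)

Protected t.Vec4, i.i

CopyMemory(*v, @t, SizeOf(Vec4))

For i = 0 To 3
; Each 2-bit field of the mask selects the source lane for lane i
*v\c[i] = t\c[(Mask >> (2 * i)) & 3]
Next

EndProcedure

Define v.Vec4, Row.i

v\c[0] = 1.0 ; MOVSS from memory loads one float and clears the other three lanes

For Row = 0 To 3
Debug StrF(v\c[0], 0) + " " + StrF(v\c[1], 0) + " " + StrF(v\c[2], 0) + " " + StrF(v\c[3], 0)
ShufflePS(@v, %10010011) ; lanes (x, y, z, w) become (w, x, y, z)
Next
```

With the mask `10010011b`, lane 0 takes the old lane 3, lane 1 the old lane 0, and so on, so the single 1.0 moves one lane further at every step. Storing the register after each shuffle is what produces the four columns of the identity matrix.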
PB 5.73 ― Win 10, 20H2 ― Ryzen 9 3900X ― Radeon RX 5600 XT ITX ― Vivaldi 3.6 ― www.unionbytes.de
idle

### Re: Matrix Structure Speed test

The slowdown is due to the CMPs from the nested loops and the If statement rather than the offset calculations.
Wherever you have branching, there's the potential for slowdowns.

If you need to use loops, it's better to use Repeat or While loops, and if you can avoid an If, do it.


``````Procedure Identity44f(*MatrixA.Matrix44f)

CopyMemory(?_Identity44,*MatrixA,64)

DataSection : _Identity44:
Data.q $000000003F800000,$0000000000000000
Data.q $3F80000000000000,$0000000000000000
Data.q $0000000000000000,$000000003F800000
Data.q $0000000000000000,$3F80000000000000
EndDataSection

EndProcedure
``````
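
If a loop has to stay for some reason, the If can still be avoided. The following is only a sketch of that idea, not a measured result: it clears the whole matrix with `FillMemory()` and then writes the four diagonal elements. `Identity44fDiag` is a made-up name, and the structure is the `Matrix44f` from the first post.

```purebasic
; Sketch only: Identity44fDiag is a hypothetical name, not taken from the posts above.
Structure Matrix44f
e.f[16]
EndStructure

Procedure Identity44fDiag(*MatrixA.Matrix44f)

Protected i.i

; All-zero bytes are 0.0 for IEEE-754 floats
FillMemory(*MatrixA, SizeOf(Matrix44f), 0)

; The diagonal elements sit at indices 0, 5, 10 and 15
For i = 0 To 3
*MatrixA\e[i * 5] = 1.0
Next

EndProcedure

Define m.Matrix44f
Identity44fDiag(@m)
Debug m\e[0] ; a diagonal element
Debug m\e[1] ; an off-diagonal element
```

The inner loop and the `i = j` comparison are gone; only the four-step loop counter check remains. Whether this beats the `CopyMemory()` version would have to be timed.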
StarBootics

### Re: Matrix Structure Speed test

STARGÅTE wrote:Btw: If you want to speed up 4D matrix stuff, you can (or have to) use the ASM SSE instructions set.
In SSE you have 128bit registers and you can calculate with 4 floats at once.

It's an interesting piece of code. I really need to learn ASM coding.
idle wrote: The slowdown is due to the CMPs from the nested loops and the If statement rather than the offset calculations.
The offset calculation appears to have an impact as well. This is V2.0.0, without any loops or If statements:


``````; <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
; Project name : Matrix Structure Speed test
; File Name : Matrix Structure Speed test.pb
; File version: 2.0.0
; Programming : OK
; Programmed by : StarBootics
; Date : 20-03-2021
; Last Update : 20-03-2021
; PureBasic code : V5.73LTS
; Platform : Windows, Linux, MacOS X
; <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<

Structure Line

Element1D.f[4]

EndStructure

Structure Matrix4f

Element2D.Line[4]

EndStructure

Structure Matrix44f
e.f[16]
EndStructure

Structure Matrix44

e11.f
e21.f
e31.f
e41.f
e12.f
e22.f
e32.f
e42.f
e13.f
e23.f
e33.f
e43.f
e14.f
e24.f
e34.f
e44.f

EndStructure

Macro AccessMatrix44f(MatrixA, Line, Column)
MatrixA\e[Column * 4 + Line]
EndMacro

Macro AccessMatrix4f(MatrixA, Line, Column)

MatrixA\Element2D[Column]\Element1D[Line]

EndMacro

Procedure Identity4f(*MatrixA.Matrix4f)

AccessMatrix4f(*MatrixA, 0, 0) = 1.0
AccessMatrix4f(*MatrixA, 1, 0) = 0.0
AccessMatrix4f(*MatrixA, 2, 0) = 0.0
AccessMatrix4f(*MatrixA, 3, 0) = 0.0

AccessMatrix4f(*MatrixA, 0, 1) = 0.0
AccessMatrix4f(*MatrixA, 1, 1) = 1.0
AccessMatrix4f(*MatrixA, 2, 1) = 0.0
AccessMatrix4f(*MatrixA, 3, 1) = 0.0

AccessMatrix4f(*MatrixA, 0, 2) = 0.0
AccessMatrix4f(*MatrixA, 1, 2) = 0.0
AccessMatrix4f(*MatrixA, 2, 2) = 1.0
AccessMatrix4f(*MatrixA, 3, 2) = 0.0

AccessMatrix4f(*MatrixA, 0, 3) = 0.0
AccessMatrix4f(*MatrixA, 1, 3) = 0.0
AccessMatrix4f(*MatrixA, 2, 3) = 0.0
AccessMatrix4f(*MatrixA, 3, 3) = 1.0

;   For i = 0 To 3
;
;     For j = 0 To 3
;
;       If i = j
;         AccessMatrix4f(*MatrixA, i, j) = 1.0
;       Else
;         AccessMatrix4f(*MatrixA, i, j) = 0.0
;       EndIf
;
;     Next
;
;   Next

EndProcedure

Procedure Identity44f(*MatrixA.Matrix44f)

AccessMatrix44f(*MatrixA, 0, 0) = 1.0
AccessMatrix44f(*MatrixA, 1, 0) = 0.0
AccessMatrix44f(*MatrixA, 2, 0) = 0.0
AccessMatrix44f(*MatrixA, 3, 0) = 0.0

AccessMatrix44f(*MatrixA, 0, 1) = 0.0
AccessMatrix44f(*MatrixA, 1, 1) = 1.0
AccessMatrix44f(*MatrixA, 2, 1) = 0.0
AccessMatrix44f(*MatrixA, 3, 1) = 0.0

AccessMatrix44f(*MatrixA, 0, 2) = 0.0
AccessMatrix44f(*MatrixA, 1, 2) = 0.0
AccessMatrix44f(*MatrixA, 2, 2) = 1.0
AccessMatrix44f(*MatrixA, 3, 2) = 0.0

AccessMatrix44f(*MatrixA, 0, 3) = 0.0
AccessMatrix44f(*MatrixA, 1, 3) = 0.0
AccessMatrix44f(*MatrixA, 2, 3) = 0.0
AccessMatrix44f(*MatrixA, 3, 3) = 1.0

;   For i = 0 To 3
;
;     For j = 0 To 3
;
;       If i = j
;         AccessMatrix44f(*MatrixA, i, j) = 1.0
;       Else
;         AccessMatrix44f(*MatrixA, i, j) = 0.0
;       EndIf
;
;     Next
;
;   Next

EndProcedure

Procedure Identity44(*Identity.Matrix44)

*Identity\e11 = 1.0
*Identity\e21 = 0.0
*Identity\e31 = 0.0
*Identity\e41 = 0.0
*Identity\e12 = 0.0
*Identity\e22 = 1.0
*Identity\e32 = 0.0
*Identity\e42 = 0.0
*Identity\e13 = 0.0
*Identity\e23 = 0.0
*Identity\e33 = 1.0
*Identity\e43 = 0.0
*Identity\e14 = 0.0
*Identity\e24 = 0.0
*Identity\e34 = 0.0
*Identity\e44 = 1.0

EndProcedure

For TestID = 0 To 4

TempsDepart0 = ElapsedMilliseconds()

For Index = 0 To 100000
Identity4f(MatrixA.Matrix4f)
Next

TempsEcoule0 = ElapsedMilliseconds()-TempsDepart0

TempsDepart1 = ElapsedMilliseconds()

For Index = 0 To 100000
Identity44f(MatrixB.Matrix44f)
Next

TempsEcoule1 = ElapsedMilliseconds()-TempsDepart1

TempsDepart2 = ElapsedMilliseconds()

For Index = 0 To 100000
Identity44(MatrixC.Matrix44)
Next

TempsEcoule2 = ElapsedMilliseconds()-TempsDepart2

MessageRequester("Elapsed Time", "Matrix4f : " + Str(TempsEcoule0) + " milliseconds" + #LF$ + "Matrix44f : " + Str(TempsEcoule1) +" milliseconds"+ #LF$ + "Matrix44 : " + Str(TempsEcoule2) +" milliseconds")

Next

; <<<<<<<<<<<<<<<<<<<<<<<
; <<<<< END OF FILE <<<<<
; <<<<<<<<<<<<<<<<<<<<<<<``````
Matrix4f : 6 milliseconds
Matrix44f : 4 milliseconds
Matrix44 : 2 milliseconds

Matrix4f : 8 milliseconds
Matrix44f : 11 milliseconds
Matrix44 : 2 milliseconds

Matrix4f : 8 milliseconds
Matrix44f : 4 milliseconds
Matrix44 : 2 milliseconds

Matrix4f : 6 milliseconds
Matrix44f : 5 milliseconds
Matrix44 : 2 milliseconds

Matrix4f : 5 milliseconds
Matrix44f : 5 milliseconds
Matrix44 : 2 milliseconds

As you can see, the offset calculation still has an impact, but the performance is much better than with the loops. It's clear that it's better to create a structure without using static arrays.

Best regards
StarBootics
StarBootics

### Re: Matrix Structure Speed test

Hello everyone,

Here is a speed test much closer to real-life use of matrices: concatenating matrices.


``````; <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
; Project name : Matrix Multiply
; File Name : Matrix Multiply.pb
; File version: 1.0.0
; Programming : OK
; Programmed by : StarBootics
; Date : 07-06-2021
; Last Update : 07-06-2021
; PureBasic code : V5.73 LTS
; Platform : Windows, Linux, MacOS X
; <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<

Structure Matrix44

VirtualTable.i
e11.f
e21.f
e31.f
e41.f
e12.f
e22.f
e32.f
e42.f
e13.f
e23.f
e33.f
e43.f
e14.f
e24.f
e34.f
e44.f

EndStructure

Procedure.i IdentitySSE(*Identity.Matrix44)

*m4fOut = *Identity + OffsetOf(Matrix44\e11)
fValue.f = 1.0

CompilerIf #PB_Compiler_Processor = #PB_Processor_x64
! MOV     rax, [p.p_m4fOut]
! MOVSS   xmm0, [p.v_fValue]
! MOVUPS  [rax+00], xmm0
! SHUFPS  xmm0, xmm0, 10010011b   ; Rotate the components: xyzw -> wxyz
! MOVUPS  [rax+16], xmm0
! SHUFPS  xmm0, xmm0, 10010011b
! MOVUPS  [rax+32], xmm0
! SHUFPS  xmm0, xmm0, 10010011b
! MOVUPS  [rax+48], xmm0
CompilerElse
! MOV     eax, [p.p_m4fOut]
! MOVSS   xmm0, [p.v_fValue]
! MOVUPS  [eax+00], xmm0
! SHUFPS  xmm0, xmm0, 10010011b   ; Rotate the components: xyzw -> wxyz
! MOVUPS  [eax+16], xmm0
! SHUFPS  xmm0, xmm0, 10010011b
! MOVUPS  [eax+32], xmm0
! SHUFPS  xmm0, xmm0, 10010011b
! MOVUPS  [eax+48], xmm0
CompilerEndIf

ProcedureReturn
EndProcedure

Procedure Multiply(*This.Matrix44, *Other.Matrix44)

e11.f = *This\e11 * *Other\e11 + *This\e12 * *Other\e21 + *This\e13 * *Other\e31 + *This\e14 * *Other\e41
e12.f = *This\e11 * *Other\e12 + *This\e12 * *Other\e22 + *This\e13 * *Other\e32 + *This\e14 * *Other\e42
e13.f = *This\e11 * *Other\e13 + *This\e12 * *Other\e23 + *This\e13 * *Other\e33 + *This\e14 * *Other\e43
e14.f = *This\e11 * *Other\e14 + *This\e12 * *Other\e24 + *This\e13 * *Other\e34 + *This\e14 * *Other\e44
e21.f = *This\e21 * *Other\e11 + *This\e22 * *Other\e21 + *This\e23 * *Other\e31 + *This\e24 * *Other\e41
e22.f = *This\e21 * *Other\e12 + *This\e22 * *Other\e22 + *This\e23 * *Other\e32 + *This\e24 * *Other\e42
e23.f = *This\e21 * *Other\e13 + *This\e22 * *Other\e23 + *This\e23 * *Other\e33 + *This\e24 * *Other\e43
e24.f = *This\e21 * *Other\e14 + *This\e22 * *Other\e24 + *This\e23 * *Other\e34 + *This\e24 * *Other\e44
e31.f = *This\e31 * *Other\e11 + *This\e32 * *Other\e21 + *This\e33 * *Other\e31 + *This\e34 * *Other\e41
e32.f = *This\e31 * *Other\e12 + *This\e32 * *Other\e22 + *This\e33 * *Other\e32 + *This\e34 * *Other\e42
e33.f = *This\e31 * *Other\e13 + *This\e32 * *Other\e23 + *This\e33 * *Other\e33 + *This\e34 * *Other\e43
e34.f = *This\e31 * *Other\e14 + *This\e32 * *Other\e24 + *This\e33 * *Other\e34 + *This\e34 * *Other\e44
e41.f = *This\e41 * *Other\e11 + *This\e42 * *Other\e21 + *This\e43 * *Other\e31 + *This\e44 * *Other\e41
e42.f = *This\e41 * *Other\e12 + *This\e42 * *Other\e22 + *This\e43 * *Other\e32 + *This\e44 * *Other\e42
e43.f = *This\e41 * *Other\e13 + *This\e42 * *Other\e23 + *This\e43 * *Other\e33 + *This\e44 * *Other\e43
e44.f = *This\e41 * *Other\e14 + *This\e42 * *Other\e24 + *This\e43 * *Other\e34 + *This\e44 * *Other\e44

*This\e11 = e11
*This\e12 = e12
*This\e13 = e13
*This\e14 = e14

*This\e21 = e21
*This\e22 = e22
*This\e23 = e23
*This\e24 = e24

*This\e31 = e31
*This\e32 = e32
*This\e33 = e33
*This\e34 = e34

*This\e41 = e41
*This\e42 = e42
*This\e43 = e43
*This\e44 = e44

EndProcedure

Procedure Translation(*This.Matrix44, Trans_x.f, Trans_y.f, Trans_z.f)

*This\e11 = 1.0
*This\e12 = 0.0
*This\e13 = 0.0
*This\e14 = Trans_x

*This\e21 = 0.0
*This\e22 = 1.0
*This\e23 = 0.0
*This\e24 = Trans_y

*This\e31 = 0.0
*This\e32 = 0.0
*This\e33 = 1.0
*This\e34 = Trans_z

*This\e41 = 0.0
*This\e42 = 0.0
*This\e43 = 0.0
*This\e44 = 1.0

EndProcedure

Procedure RotateX(*This.Matrix44, Theta.f)

; Protected Cos.f = Cos(Theta)
; Protected Sin.f = Sin(Theta)

Protected Cos.f, Sin.f

!FLD dword [p.v_Theta]
!FSINCOS
!FSTP dword [p.v_Cos]
!FSTP dword [p.v_Sin]

*This\e11 = 1.0
*This\e12 = 0.0
*This\e13 = 0.0
*This\e14 = 0.0

*This\e21 = 0.0
*This\e22 = Cos
*This\e23 = -Sin
*This\e24 = 0.0

*This\e31 = 0.0
*This\e32 = Sin
*This\e33 = Cos
*This\e34 = 0.0

*This\e41 = 0.0
*This\e42 = 0.0
*This\e43 = 0.0
*This\e44 = 1.0

EndProcedure

Procedure RotateY(*This.Matrix44, Theta.f)

; Protected Cos.f = Cos(Theta)
; Protected Sin.f = Sin(Theta)

Protected Cos.f, Sin.f

!FLD dword [p.v_Theta]
!FSINCOS
!FSTP dword [p.v_Cos]
!FSTP dword [p.v_Sin]

*This\e11 = Cos
*This\e12 = 0.0
*This\e13 = Sin
*This\e14 = 0.0

*This\e21 = 0.0
*This\e22 = 1.0
*This\e23 = 0.0
*This\e24 = 0.0

*This\e31 = -Sin
*This\e32 = 0.0
*This\e33 = Cos
*This\e34 = 0.0

*This\e41 = 0.0
*This\e42 = 0.0
*This\e43 = 0.0
*This\e44 = 1.0

EndProcedure

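Translation(Translation.Matrix44, 25.0, 15.0, 3.0)
Translation(InvTranslation.Matrix44, -25.0, -15.0, -3.0)
Translation(SlideZ.Matrix44, 0.0, 0.0, 1.0)

; The loop below also multiplies by four rotation matrices that this listing
; never initialises. The angles (in radians) are placeholder values chosen
; for illustration only.
RotateX(RotationX.Matrix44, 0.5)
RotateX(InvRotationX.Matrix44, -0.5)
RotateY(RotationY.Matrix44, 0.25)
RotateY(InvRotationY.Matrix44, -0.25)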

For TestID = 0 To 4

TempsDepart0 = ElapsedMilliseconds()

For Index = 0 To 10000
IdentitySSE(AnimationMatrix.Matrix44)
Multiply(AnimationMatrix, InvTranslation)
Multiply(AnimationMatrix, InvRotationY)
Multiply(AnimationMatrix, InvRotationX)
Multiply(AnimationMatrix, SlideZ)
Multiply(AnimationMatrix, RotationX)
Multiply(AnimationMatrix, RotationY)
Multiply(AnimationMatrix, Translation)
Next

TempsEcoule0 = ElapsedMilliseconds()-TempsDepart0

MessageRequester("Elapsed Time", "Matrix Multiply : " + Str(TempsEcoule0) + " milliseconds")

Next

; <<<<<<<<<<<<<<<<<<<<<<<
; <<<<< END OF FILE <<<<<
; <<<<<<<<<<<<<<<<<<<<<<<``````
This is the result I get:

7 milliseconds
13 milliseconds
10 milliseconds
7 milliseconds
22 milliseconds

I think the Multiply procedure needs to be accelerated using ASM SSE instructions, but this is out of reach for me at the moment.

If a volunteer could try this code with the C backend and full optimization and post the results, that would be nice.

Best regards
StarBootics
STARGÅTE

### Re: Matrix Structure Speed test

StarBootics wrote: Mon Jun 07, 2021 4:55 pm I think the Multiply procedure need to be accelerated using ASM SSE Instructions
Of course!
If the matrix structure is like:


``````Structure UB2D_MATRIX4f
I11.f : I21.f : I31.f : I41.f
I12.f : I22.f : I32.f : I42.f
I13.f : I23.f : I33.f : I43.f
I14.f : I24.f : I34.f : I44.f
EndStructure
``````
The multiplication is:
(Don't be afraid of the length of the ASM code, it is much faster than the native PB code)


``````Structure UB2D_MATRIX4f
I11.f : I21.f : I31.f : I41.f
I12.f : I22.f : I32.f : I42.f
I13.f : I23.f : I33.f : I43.f
I14.f : I24.f : I34.f : I44.f
EndStructure

Procedure.i UB2D_m4fMultiplication( *m4fResult.UB2D_MATRIX4f, *m4fLeft.UB2D_MATRIX4f, *m4fRight.UB2D_MATRIX4f )

Protected Backup.UB2D_MATRIX4f

CompilerIf #PB_Compiler_Processor = #PB_Processor_x64
! MOV rax, [p.p_m4fResult]
! MOV rcx, [p.p_m4fLeft]
! MOV rdx, [p.p_m4fRight]
! MOVUPS xmm0, [rcx+00]
! MOVUPS xmm1, [rcx+16]
! MOVUPS xmm2, [rcx+32]
! MOVUPS xmm3, [rcx+48]
; Back up xmm4-xmm6
! MOVUPS [p.v_Backup+00], xmm4
! MOVUPS [p.v_Backup+16], xmm5
! MOVUPS [p.v_Backup+32], xmm6
CompilerElse
! MOV eax, [p.p_m4fResult]
! MOV ecx, [p.p_m4fLeft]
! MOV edx, [p.p_m4fRight]
! MOVUPS xmm0, [ecx+00]
! MOVUPS xmm1, [ecx+16]
! MOVUPS xmm2, [ecx+32]
! MOVUPS xmm3, [ecx+48]
; Back up xmm4-xmm6
! MOVUPS [p.v_Backup+00], xmm4
! MOVUPS [p.v_Backup+16], xmm5
! MOVUPS [p.v_Backup+32], xmm6
CompilerEndIf

; For each column of the right matrix: broadcast its four elements, multiply
; them with the four columns of the left matrix (xmm0-xmm3) and accumulate
; the partial products in xmm6 with ADDPS.
CompilerIf #PB_Compiler_Processor = #PB_Processor_x64
; Multiply by the right matrix (column 1)
! MOVUPS xmm4, [rdx+00]
! MOVAPS xmm6, xmm4
! SHUFPS xmm6, xmm6, 00000000b
! MULPS  xmm6, xmm0
! MOVAPS xmm5, xmm4
! SHUFPS xmm5, xmm5, 01010101b
! MULPS  xmm5, xmm1
! ADDPS  xmm6, xmm5
! MOVAPS xmm5, xmm4
! SHUFPS xmm5, xmm5, 10101010b
! MULPS  xmm5, xmm2
! ADDPS  xmm6, xmm5
! MOVAPS xmm5, xmm4
! SHUFPS xmm5, xmm5, 11111111b
! MULPS  xmm5, xmm3
! ADDPS  xmm6, xmm5
! MOVUPS [rax+00], xmm6
; Multiply by the right matrix (column 2)
! MOVUPS xmm4, [rdx+16]
! MOVAPS xmm6, xmm4
! SHUFPS xmm6, xmm6, 00000000b
! MULPS  xmm6, xmm0
! MOVAPS xmm5, xmm4
! SHUFPS xmm5, xmm5, 01010101b
! MULPS  xmm5, xmm1
! ADDPS  xmm6, xmm5
! MOVAPS xmm5, xmm4
! SHUFPS xmm5, xmm5, 10101010b
! MULPS  xmm5, xmm2
! ADDPS  xmm6, xmm5
! MOVAPS xmm5, xmm4
! SHUFPS xmm5, xmm5, 11111111b
! MULPS  xmm5, xmm3
! ADDPS  xmm6, xmm5
! MOVUPS [rax+16], xmm6
; Multiply by the right matrix (column 3)
! MOVUPS xmm4, [rdx+32]
! MOVAPS xmm6, xmm4
! SHUFPS xmm6, xmm6, 00000000b
! MULPS  xmm6, xmm0
! MOVAPS xmm5, xmm4
! SHUFPS xmm5, xmm5, 01010101b
! MULPS  xmm5, xmm1
! ADDPS  xmm6, xmm5
! MOVAPS xmm5, xmm4
! SHUFPS xmm5, xmm5, 10101010b
! MULPS  xmm5, xmm2
! ADDPS  xmm6, xmm5
! MOVAPS xmm5, xmm4
! SHUFPS xmm5, xmm5, 11111111b
! MULPS  xmm5, xmm3
! ADDPS  xmm6, xmm5
! MOVUPS [rax+32], xmm6
; Multiply by the right matrix (column 4)
! MOVUPS xmm4, [rdx+48]
! MOVAPS xmm6, xmm4
! SHUFPS xmm6, xmm6, 00000000b
! MULPS  xmm6, xmm0
! MOVAPS xmm5, xmm4
! SHUFPS xmm5, xmm5, 01010101b
! MULPS  xmm5, xmm1
! ADDPS  xmm6, xmm5
! MOVAPS xmm5, xmm4
! SHUFPS xmm5, xmm5, 10101010b
! MULPS  xmm5, xmm2
! ADDPS  xmm6, xmm5
! MOVAPS xmm5, xmm4
! SHUFPS xmm5, xmm5, 11111111b
! MULPS  xmm5, xmm3
! ADDPS  xmm6, xmm5
! MOVUPS [rax+48], xmm6
; Restore xmm4-xmm6
! MOVUPS xmm4, [p.v_Backup+00]
! MOVUPS xmm5, [p.v_Backup+16]
! MOVUPS xmm6, [p.v_Backup+32]
CompilerElse
; Multiply by the right matrix (column 1)
! MOVUPS xmm4, [edx+00]
! MOVAPS xmm6, xmm4
! SHUFPS xmm6, xmm6, 00000000b
! MULPS  xmm6, xmm0
! MOVAPS xmm5, xmm4
! SHUFPS xmm5, xmm5, 01010101b
! MULPS  xmm5, xmm1
! ADDPS  xmm6, xmm5
! MOVAPS xmm5, xmm4
! SHUFPS xmm5, xmm5, 10101010b
! MULPS  xmm5, xmm2
! ADDPS  xmm6, xmm5
! MOVAPS xmm5, xmm4
! SHUFPS xmm5, xmm5, 11111111b
! MULPS  xmm5, xmm3
! ADDPS  xmm6, xmm5
! MOVUPS [eax+00], xmm6
; Multiply by the right matrix (column 2)
! MOVUPS xmm4, [edx+16]
! MOVAPS xmm6, xmm4
! SHUFPS xmm6, xmm6, 00000000b
! MULPS  xmm6, xmm0
! MOVAPS xmm5, xmm4
! SHUFPS xmm5, xmm5, 01010101b
! MULPS  xmm5, xmm1
! ADDPS  xmm6, xmm5
! MOVAPS xmm5, xmm4
! SHUFPS xmm5, xmm5, 10101010b
! MULPS  xmm5, xmm2
! ADDPS  xmm6, xmm5
! MOVAPS xmm5, xmm4
! SHUFPS xmm5, xmm5, 11111111b
! MULPS  xmm5, xmm3
! ADDPS  xmm6, xmm5
! MOVUPS [eax+16], xmm6
; Multiply by the right matrix (column 3)
! MOVUPS xmm4, [edx+32]
! MOVAPS xmm6, xmm4
! SHUFPS xmm6, xmm6, 00000000b
! MULPS  xmm6, xmm0
! MOVAPS xmm5, xmm4
! SHUFPS xmm5, xmm5, 01010101b
! MULPS  xmm5, xmm1
! ADDPS  xmm6, xmm5
! MOVAPS xmm5, xmm4
! SHUFPS xmm5, xmm5, 10101010b
! MULPS  xmm5, xmm2
! ADDPS  xmm6, xmm5
! MOVAPS xmm5, xmm4
! SHUFPS xmm5, xmm5, 11111111b
! MULPS  xmm5, xmm3
! ADDPS  xmm6, xmm5
! MOVUPS [eax+32], xmm6
; Multiply by the right matrix (column 4)
! MOVUPS xmm4, [edx+48]
! MOVAPS xmm6, xmm4
! SHUFPS xmm6, xmm6, 00000000b
! MULPS  xmm6, xmm0
! MOVAPS xmm5, xmm4
! SHUFPS xmm5, xmm5, 01010101b
! MULPS  xmm5, xmm1
! ADDPS  xmm6, xmm5
! MOVAPS xmm5, xmm4
! SHUFPS xmm5, xmm5, 10101010b
! MULPS  xmm5, xmm2
! ADDPS  xmm6, xmm5
! MOVAPS xmm5, xmm4
! SHUFPS xmm5, xmm5, 11111111b
! MULPS  xmm5, xmm3
! ADDPS  xmm6, xmm5
! MOVUPS [eax+48], xmm6
; Restore xmm4-xmm6
! MOVUPS xmm4, [p.v_Backup+00]
! MOVUPS xmm5, [p.v_Backup+16]
! MOVUPS xmm6, [p.v_Backup+32]
CompilerEndIf

ProcedureReturn

EndProcedure``````
Edit: Bug-Fix
skywalk

### Re: Matrix Structure Speed test

STARGÅTE wrote:(Don't be afraid of the length of the ASM code, it is much faster than the native PB code)
And PB v6 + C emit + optimization will produce the same or better result?
We should leave handcoding ASM for SIMD stuff.
The nice thing about standards is there are so many to choose from. ~ Andrew Tanenbaum
STARGÅTE

### Re: Matrix Structure Speed test

skywalk wrote: Mon Jun 07, 2021 7:27 pm
STARGÅTE wrote:(Don't be afraid of the length of the ASM code, it is much faster than the native PB code)
And PB v6 + C emit + optimization will produce the same or better result?
We should leave handcoding ASM for SIMD stuff.
I don't know.
Measurements on my system give the following results:
PB 6.00, native code, ASM backend: 45 ns
PB 6.00, native code, C backend: 41 ns
PB 6.00, native code, C backend + optimization: 16 ns
PB 6.00, ASM Code, ASM backend: 6.3 ns
In the end, you would also need the SSE instructions on the C side to get the same optimization.

For the tests I had to use this strange code with random matrices, because the C optimizer removes the code entirely when the result matrix is never used afterwards.


``````Structure UB2D_MATRIX4f
I11.f : I21.f : I31.f : I41.f
I12.f : I22.f : I32.f : I42.f
I13.f : I23.f : I33.f : I43.f
I14.f : I24.f : I34.f : I44.f
EndStructure

Procedure.i UB2D_m4fMultiplicationASM( *m4fResult.UB2D_MATRIX4f, *m4fLeft.UB2D_MATRIX4f, *m4fRight.UB2D_MATRIX4f )

Protected Backup.UB2D_MATRIX4f

CompilerIf #PB_Compiler_Processor = #PB_Processor_x64
! MOV rax, [p.p_m4fResult]
! MOV rcx, [p.p_m4fLeft]
! MOV rdx, [p.p_m4fRight]
! MOVUPS xmm0, [rcx+00]
! MOVUPS xmm1, [rcx+16]
! MOVUPS xmm2, [rcx+32]
! MOVUPS xmm3, [rcx+48]
; Back up xmm4-xmm6
! MOVUPS [p.v_Backup+00], xmm4
! MOVUPS [p.v_Backup+16], xmm5
! MOVUPS [p.v_Backup+32], xmm6
CompilerElse
! MOV eax, [p.p_m4fResult]
! MOV ecx, [p.p_m4fLeft]
! MOV edx, [p.p_m4fRight]
! MOVUPS xmm0, [ecx+00]
! MOVUPS xmm1, [ecx+16]
! MOVUPS xmm2, [ecx+32]
! MOVUPS xmm3, [ecx+48]
; Back up xmm4-xmm6
! MOVUPS [p.v_Backup+00], xmm4
! MOVUPS [p.v_Backup+16], xmm5
! MOVUPS [p.v_Backup+32], xmm6
CompilerEndIf

; For each column of the right matrix: broadcast its four elements, multiply
; them with the four columns of the left matrix (xmm0-xmm3) and accumulate
; the partial products in xmm6 with ADDPS.
CompilerIf #PB_Compiler_Processor = #PB_Processor_x64
; Multiply by the right matrix (column 1)
! MOVUPS xmm4, [rdx+00]
! MOVAPS xmm6, xmm4
! SHUFPS xmm6, xmm6, 00000000b
! MULPS  xmm6, xmm0
! MOVAPS xmm5, xmm4
! SHUFPS xmm5, xmm5, 01010101b
! MULPS  xmm5, xmm1
! ADDPS  xmm6, xmm5
! MOVAPS xmm5, xmm4
! SHUFPS xmm5, xmm5, 10101010b
! MULPS  xmm5, xmm2
! ADDPS  xmm6, xmm5
! MOVAPS xmm5, xmm4
! SHUFPS xmm5, xmm5, 11111111b
! MULPS  xmm5, xmm3
! ADDPS  xmm6, xmm5
! MOVUPS [rax+00], xmm6
; Multiply by the right matrix (column 2)
! MOVUPS xmm4, [rdx+16]
! MOVAPS xmm6, xmm4
! SHUFPS xmm6, xmm6, 00000000b
! MULPS  xmm6, xmm0
! MOVAPS xmm5, xmm4
! SHUFPS xmm5, xmm5, 01010101b
! MULPS  xmm5, xmm1
! ADDPS  xmm6, xmm5
! MOVAPS xmm5, xmm4
! SHUFPS xmm5, xmm5, 10101010b
! MULPS  xmm5, xmm2
! ADDPS  xmm6, xmm5
! MOVAPS xmm5, xmm4
! SHUFPS xmm5, xmm5, 11111111b
! MULPS  xmm5, xmm3
! ADDPS  xmm6, xmm5
! MOVUPS [rax+16], xmm6
; Multiply by the right matrix (column 3)
! MOVUPS xmm4, [rdx+32]
! MOVAPS xmm6, xmm4
! SHUFPS xmm6, xmm6, 00000000b
! MULPS  xmm6, xmm0
! MOVAPS xmm5, xmm4
! SHUFPS xmm5, xmm5, 01010101b
! MULPS  xmm5, xmm1
! ADDPS  xmm6, xmm5
! MOVAPS xmm5, xmm4
! SHUFPS xmm5, xmm5, 10101010b
! MULPS  xmm5, xmm2
! ADDPS  xmm6, xmm5
! MOVAPS xmm5, xmm4
! SHUFPS xmm5, xmm5, 11111111b
! MULPS  xmm5, xmm3
! ADDPS  xmm6, xmm5
! MOVUPS [rax+32], xmm6
; Multiply by the right matrix (column 4)
! MOVUPS xmm4, [rdx+48]
! MOVAPS xmm6, xmm4
! SHUFPS xmm6, xmm6, 00000000b
! MULPS  xmm6, xmm0
! MOVAPS xmm5, xmm4
! SHUFPS xmm5, xmm5, 01010101b
! MULPS  xmm5, xmm1
! ADDPS  xmm6, xmm5
! MOVAPS xmm5, xmm4
! SHUFPS xmm5, xmm5, 10101010b
! MULPS  xmm5, xmm2
! ADDPS  xmm6, xmm5
! MOVAPS xmm5, xmm4
! SHUFPS xmm5, xmm5, 11111111b
! MULPS  xmm5, xmm3
! ADDPS  xmm6, xmm5
! MOVUPS [rax+48], xmm6
; Restore xmm4-xmm6
! MOVUPS xmm4, [p.v_Backup+00]
! MOVUPS xmm5, [p.v_Backup+16]
! MOVUPS xmm6, [p.v_Backup+32]
CompilerElse
; Multiply by the right matrix (column 1)
! MOVUPS xmm4, [edx+00]
! MOVAPS xmm6, xmm4
! SHUFPS xmm6, xmm6, 00000000b
! MULPS  xmm6, xmm0
! MOVAPS xmm5, xmm4
! SHUFPS xmm5, xmm5, 01010101b
! MULPS  xmm5, xmm1
! ADDPS  xmm6, xmm5
! MOVAPS xmm5, xmm4
! SHUFPS xmm5, xmm5, 10101010b
! MULPS  xmm5, xmm2
! ADDPS  xmm6, xmm5
! MOVAPS xmm5, xmm4
! SHUFPS xmm5, xmm5, 11111111b
! MULPS  xmm5, xmm3
! ADDPS  xmm6, xmm5
! MOVUPS [eax+00], xmm6
; Multiply by the right matrix (column 2)
! MOVUPS xmm4, [edx+16]
! MOVAPS xmm6, xmm4
! SHUFPS xmm6, xmm6, 00000000b
! MULPS  xmm6, xmm0
! MOVAPS xmm5, xmm4
! SHUFPS xmm5, xmm5, 01010101b
! MULPS  xmm5, xmm1
! ADDPS  xmm6, xmm5
! MOVAPS xmm5, xmm4
! SHUFPS xmm5, xmm5, 10101010b
! MULPS  xmm5, xmm2
! ADDPS  xmm6, xmm5
! MOVAPS xmm5, xmm4
! SHUFPS xmm5, xmm5, 11111111b
! MULPS  xmm5, xmm3
! ADDPS  xmm6, xmm5
! MOVUPS [eax+16], xmm6
; Multiply by the right matrix (column 3)
! MOVUPS xmm4, [edx+32]
! MOVAPS xmm6, xmm4
! SHUFPS xmm6, xmm6, 00000000b
! MULPS  xmm6, xmm0
! MOVAPS xmm5, xmm4
! SHUFPS xmm5, xmm5, 01010101b
! MULPS  xmm5, xmm1
! ADDPS  xmm6, xmm5
! MOVAPS xmm5, xmm4
! SHUFPS xmm5, xmm5, 10101010b
! MULPS  xmm5, xmm2
! ADDPS  xmm6, xmm5
! MOVAPS xmm5, xmm4
! SHUFPS xmm5, xmm5, 11111111b
! MULPS  xmm5, xmm3
! ADDPS  xmm6, xmm5
! MOVUPS [eax+32], xmm6
; Multiply by the right matrix (column 4)
! MOVUPS xmm4, [edx+48]
! MOVAPS xmm6, xmm4
! SHUFPS xmm6, xmm6, 00000000b
! MULPS  xmm6, xmm0
! MOVAPS xmm5, xmm4
! SHUFPS xmm5, xmm5, 01010101b
! MULPS  xmm5, xmm1
! ADDPS  xmm6, xmm5
! MOVAPS xmm5, xmm4
! SHUFPS xmm5, xmm5, 10101010b
! MULPS  xmm5, xmm2
! ADDPS  xmm6, xmm5
! MOVAPS xmm5, xmm4
! SHUFPS xmm5, xmm5, 11111111b
! MULPS  xmm5, xmm3
! ADDPS  xmm6, xmm5
! MOVUPS [eax+48], xmm6
; Restore xmm4-xmm6
! MOVUPS xmm4, [p.v_Backup+00]
! MOVUPS xmm5, [p.v_Backup+16]
! MOVUPS xmm6, [p.v_Backup+32]
CompilerEndIf

ProcedureReturn

EndProcedure

Procedure.i UB2D_m4fMultiplicationNativ( *m4fResult.UB2D_MATRIX4f, *m4fLeft.UB2D_MATRIX4f, *m4fRight.UB2D_MATRIX4f )

Protected m4fBackup.UB2D_MATRIX4f

m4fBackup\I11 = *m4fLeft\I11 * *m4fRight\I11 + *m4fLeft\I12 * *m4fRight\I21 + *m4fLeft\I13 * *m4fRight\I31 + *m4fLeft\I14 * *m4fRight\I41
m4fBackup\I12 = *m4fLeft\I11 * *m4fRight\I12 + *m4fLeft\I12 * *m4fRight\I22 + *m4fLeft\I13 * *m4fRight\I32 + *m4fLeft\I14 * *m4fRight\I42
m4fBackup\I13 = *m4fLeft\I11 * *m4fRight\I13 + *m4fLeft\I12 * *m4fRight\I23 + *m4fLeft\I13 * *m4fRight\I33 + *m4fLeft\I14 * *m4fRight\I43
m4fBackup\I14 = *m4fLeft\I11 * *m4fRight\I14 + *m4fLeft\I12 * *m4fRight\I24 + *m4fLeft\I13 * *m4fRight\I34 + *m4fLeft\I14 * *m4fRight\I44

m4fBackup\I21 = *m4fLeft\I21 * *m4fRight\I11 + *m4fLeft\I22 * *m4fRight\I21 + *m4fLeft\I23 * *m4fRight\I31 + *m4fLeft\I24 * *m4fRight\I41
m4fBackup\I22 = *m4fLeft\I21 * *m4fRight\I12 + *m4fLeft\I22 * *m4fRight\I22 + *m4fLeft\I23 * *m4fRight\I32 + *m4fLeft\I24 * *m4fRight\I42
m4fBackup\I23 = *m4fLeft\I21 * *m4fRight\I13 + *m4fLeft\I22 * *m4fRight\I23 + *m4fLeft\I23 * *m4fRight\I33 + *m4fLeft\I24 * *m4fRight\I43
m4fBackup\I24 = *m4fLeft\I21 * *m4fRight\I14 + *m4fLeft\I22 * *m4fRight\I24 + *m4fLeft\I23 * *m4fRight\I34 + *m4fLeft\I24 * *m4fRight\I44

m4fBackup\I31 = *m4fLeft\I31 * *m4fRight\I11 + *m4fLeft\I32 * *m4fRight\I21 + *m4fLeft\I33 * *m4fRight\I31 + *m4fLeft\I34 * *m4fRight\I41
m4fBackup\I32 = *m4fLeft\I31 * *m4fRight\I12 + *m4fLeft\I32 * *m4fRight\I22 + *m4fLeft\I33 * *m4fRight\I32 + *m4fLeft\I34 * *m4fRight\I42
m4fBackup\I33 = *m4fLeft\I31 * *m4fRight\I13 + *m4fLeft\I32 * *m4fRight\I23 + *m4fLeft\I33 * *m4fRight\I33 + *m4fLeft\I34 * *m4fRight\I43
m4fBackup\I34 = *m4fLeft\I31 * *m4fRight\I14 + *m4fLeft\I32 * *m4fRight\I24 + *m4fLeft\I33 * *m4fRight\I34 + *m4fLeft\I34 * *m4fRight\I44

m4fBackup\I41 = *m4fLeft\I41 * *m4fRight\I11 + *m4fLeft\I42 * *m4fRight\I21 + *m4fLeft\I43 * *m4fRight\I31 + *m4fLeft\I44 * *m4fRight\I41
m4fBackup\I42 = *m4fLeft\I41 * *m4fRight\I12 + *m4fLeft\I42 * *m4fRight\I22 + *m4fLeft\I43 * *m4fRight\I32 + *m4fLeft\I44 * *m4fRight\I42
m4fBackup\I43 = *m4fLeft\I41 * *m4fRight\I13 + *m4fLeft\I42 * *m4fRight\I23 + *m4fLeft\I43 * *m4fRight\I33 + *m4fLeft\I44 * *m4fRight\I43
m4fBackup\I44 = *m4fLeft\I41 * *m4fRight\I14 + *m4fLeft\I42 * *m4fRight\I24 + *m4fLeft\I43 * *m4fRight\I34 + *m4fLeft\I44 * *m4fRight\I44

CopyMemory(@m4fBackup, *m4fResult, SizeOf(UB2D_MATRIX4f))

ProcedureReturn *m4fResult

EndProcedure

Procedure.i UB2D_m4fRandom( *m4fResult.UB2D_MATRIX4f, fMax.f = 1.0, fMin.f = 0.0 )

Protected I.i

For I = 0 To 15
PokeF(*m4fResult + SizeOf(Float)*I, (fMax-fMin) * 4.6566128752457969241e-10 * Random(2147483647) + fMin )
Next

ProcedureReturn *m4fResult

EndProcedure

Procedure   UB2D_m4fPrint( *m4fSource.UB2D_MATRIX4f )

With *m4fSource
PrintN( RSet(StrF(\I11, 3), 9)+RSet(StrF(\I12, 3), 9)+RSet(StrF(\I13, 3), 9)+RSet(StrF(\I14, 3), 9) )
PrintN( RSet(StrF(\I21, 3), 9)+RSet(StrF(\I22, 3), 9)+RSet(StrF(\I23, 3), 9)+RSet(StrF(\I24, 3), 9) )
PrintN( RSet(StrF(\I31, 3), 9)+RSet(StrF(\I32, 3), 9)+RSet(StrF(\I33, 3), 9)+RSet(StrF(\I34, 3), 9) )
PrintN( RSet(StrF(\I41, 3), 9)+RSet(StrF(\I42, 3), 9)+RSet(StrF(\I43, 3), 9)+RSet(StrF(\I44, 3), 9) )
EndWith

EndProcedure

Define.UB2D_MATRIX4f A, B
Define Time.i, TimeBias.i, I.i

#Count = 10000000

OpenConsole()
RandomSeed(1)
UB2D_m4fRandom(A, 0.6438257)
UB2D_m4fRandom(B, 0.6438257)

TimeBias = ElapsedMilliseconds()
For I = 1 To #Count
Next
TimeBias = ElapsedMilliseconds() - TimeBias
Time = ElapsedMilliseconds()
For I = 1 To #Count
UB2D_m4fMultiplicationNativ(A, A, B)
;UB2D_m4fMultiplicationASM(A, A, B)
Next
Time = ElapsedMilliseconds() - Time

UB2D_m4fPrint(A)

PrintN("Time: " + Str(Time-TimeBias)+" ms")
PrintN("Single Time: " + StrF(1.0e6*(Time-TimeBias)/#Count, 3)+" ns")
Input()

``````
PB 5.73 ― Win 10, 20H2 ― Ryzen 9 3900X ― Radeon RX 5600 XT ITX ― Vivaldi 3.6 ― www.unionbytes.de
Lizard - Script language for symbolic calculations and more | Typeface - Sprite-based font include/module
StarBootics
Enthusiast
Posts: 705
Joined: Sun Jul 07, 2013 11:35 am

### Re: Matrix Structure Speed test

Hello STARGATE,

The matrix structure is more like this :

``````Structure Matrix44

VirtualTable.i
e11.f : e21.f : e31.f : e41.f
e12.f : e22.f : e32.f : e42.f
e13.f : e23.f : e33.f : e43.f
e14.f : e24.f : e34.f : e44.f

EndStructure``````
The elements of the matrix are laid out in the same order (for compatibility with OpenGL); the only difference is the VirtualTable field at the start of the structure.
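
The effect of the extra field on the memory layout can be checked directly. This is a minimal sketch, assuming the `Matrix44` layout shown above and relying on `.i` being pointer-sized (8 bytes in a 64-bit compile, 4 bytes in a 32-bit one):

```purebasic
Structure Matrix44
  VirtualTable.i
  e11.f : e21.f : e31.f : e41.f
  e12.f : e22.f : e32.f : e42.f
  e13.f : e23.f : e33.f : e43.f
  e14.f : e24.f : e34.f : e44.f
EndStructure

; Offset of the first matrix element: the size of the VirtualTable field
Debug OffsetOf(Matrix44\e11)

; Distance between the starts of two consecutive columns (4 floats)
Debug OffsetOf(Matrix44\e12) - OffsetOf(Matrix44\e11)

; Number of bytes occupied by the 16 floats that follow VirtualTable
Debug SizeOf(Matrix44) - OffsetOf(Matrix44\e11)
```

Any raw pointer arithmetic or inline assembly that addresses the floats has to add the first of these offsets to the structure's base address.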

I have tried to adapt your code to mine, but I get an Invalid Memory Access error. (For simplicity I have removed the 32-bit code since I don't need it.)

``````Procedure Multiply(*This.Matrix44, *Other.Matrix44)

; This procedure is supposed to mimic : This *= Other math operator in C++

Protected *Result.Matrix44 = AllocateStructure(Matrix44)
Protected *Backup.Matrix44 = AllocateStructure(Matrix44)

*m4fResult = *Result + OffsetOf(Matrix44\e11)
*m4fBackup = *Backup + OffsetOf(Matrix44\e11)
*m4fThis = *This + OffsetOf(Matrix44\e11)
*m4fOther = *Other + OffsetOf(Matrix44\e11)

CompilerIf #PB_Compiler_Processor = #PB_Processor_x64
! MOV rax, [p.p_m4fResult]
! MOV rcx, [p.p_m4fThis]
! MOV rdx, [p.p_m4fOther]
! MOVUPS xmm0, [rcx+00]
! MOVUPS xmm1, [rcx+16]
! MOVUPS xmm2, [rcx+32]
! MOVUPS xmm3, [rcx+48]
; Back up xmm4-xmm6 (xmm7 is not used below)
! MOVUPS [p.p_m4fBackup+00], xmm4
! MOVUPS [p.p_m4fBackup+16], xmm5
! MOVUPS [p.p_m4fBackup+32], xmm6
CompilerEndIf

CompilerIf #PB_Compiler_Processor = #PB_Processor_x64
; Multiplication with the right matrix (1st column)
! MOVUPS xmm4, [rdx+00]
! MOVAPS xmm6, xmm4
! SHUFPS xmm6, xmm6, 00000000b
! MULPS  xmm6, xmm0
! MOVAPS xmm5, xmm4
! SHUFPS xmm5, xmm5, 01010101b
! MULPS  xmm5, xmm1
! ADDPS  xmm6, xmm5
! MOVAPS xmm5, xmm4
! SHUFPS xmm5, xmm5, 10101010b
! MULPS  xmm5, xmm2
! ADDPS  xmm6, xmm5
! MOVAPS xmm5, xmm4
! SHUFPS xmm5, xmm5, 11111111b
! MULPS  xmm5, xmm3
! ADDPS  xmm6, xmm5
! MOVUPS [rax+00], xmm6
; Multiplication with the right matrix (2nd column)
! MOVUPS xmm4, [rdx+16]
! MOVAPS xmm6, xmm4
! SHUFPS xmm6, xmm6, 00000000b
! MULPS  xmm6, xmm0
! MOVAPS xmm5, xmm4
! SHUFPS xmm5, xmm5, 01010101b
! MULPS  xmm5, xmm1
! ADDPS  xmm6, xmm5
! MOVAPS xmm5, xmm4
! SHUFPS xmm5, xmm5, 10101010b
! MULPS  xmm5, xmm2
! ADDPS  xmm6, xmm5
! MOVAPS xmm5, xmm4
! SHUFPS xmm5, xmm5, 11111111b
! MULPS  xmm5, xmm3
! ADDPS  xmm6, xmm5
! MOVUPS [rax+16], xmm6
; Multiplication with the right matrix (3rd column)
! MOVUPS xmm4, [rdx+32]
! MOVAPS xmm6, xmm4
! SHUFPS xmm6, xmm6, 00000000b
! MULPS  xmm6, xmm0
! MOVAPS xmm5, xmm4
! SHUFPS xmm5, xmm5, 01010101b
! MULPS  xmm5, xmm1
! ADDPS  xmm6, xmm5
! MOVAPS xmm5, xmm4
! SHUFPS xmm5, xmm5, 10101010b
! MULPS  xmm5, xmm2
! ADDPS  xmm6, xmm5
! MOVAPS xmm5, xmm4
! SHUFPS xmm5, xmm5, 11111111b
! MULPS  xmm5, xmm3
! ADDPS  xmm6, xmm5
! MOVUPS [rax+32], xmm6
; Multiplication with the right matrix (4th column)
! MOVUPS xmm4, [rdx+48]
! MOVAPS xmm6, xmm4
! SHUFPS xmm6, xmm6, 00000000b
! MULPS  xmm6, xmm0
! MOVAPS xmm5, xmm4
! SHUFPS xmm5, xmm5, 01010101b
! MULPS  xmm5, xmm1
! ADDPS  xmm6, xmm5
! MOVAPS xmm5, xmm4
! SHUFPS xmm5, xmm5, 10101010b
! MULPS  xmm5, xmm2
! ADDPS  xmm6, xmm5
! MOVAPS xmm5, xmm4
! SHUFPS xmm5, xmm5, 11111111b
! MULPS  xmm5, xmm3
! ADDPS  xmm6, xmm5
! MOVUPS [rax+48], xmm6
; Restore xmm4-xmm6
! MOVUPS xmm4, [p.p_m4fBackup+00]
! MOVUPS xmm5, [p.p_m4fBackup+16]
! MOVUPS xmm6, [p.p_m4fBackup+32]
CompilerEndIf

CopyMemory(*Result, *This, SizeOf(Matrix44))
FreeStructure(*Result)
FreeStructure(*Backup)

ProcedureReturn
EndProcedure``````
As I said before, my knowledge of assembler is very limited.
Any help or explanation would be welcome.

Best regards
StarBootics
The Stone Age did not end due to a shortage of stones !
STARGÅTE
Posts: 1501
Joined: Thu Jan 10, 2008 1:30 pm
Location: Germany, Glienicke
Contact:

### Re: Matrix Structure Speed test

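The protected "Backup" variable is for the registers xmm4 to xmm7, which have to be saved before they are changed. In your version, [p.p_m4fBackup+00] refers to the pointer variable itself on the stack, not to the memory it points to, so the 16-byte stores write past it.
What you need here is a buffer structure with a size of 4*16 bytes: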

Try this:

``````Structure Matrix44

VirtualTable.i
e11.f : e21.f : e31.f : e41.f
e12.f : e22.f : e32.f : e42.f
e13.f : e23.f : e33.f : e43.f
e14.f : e24.f : e34.f : e44.f

EndStructure

Structure Buffer
sse.f[16]
EndStructure

Procedure Multiply(*This.Matrix44, *Other.Matrix44)

Protected Backup.Buffer

; This procedure is supposed to mimic : This *= Other math operator in C++

CompilerIf #PB_Compiler_Processor = #PB_Processor_x64
! MOV rax, [p.p_This]
! MOV rcx, [p.p_This]
! MOV rdx, [p.p_Other]
! MOVUPS xmm0, [rcx+00+8]  ; + 8 because of VirtualTable.i
! MOVUPS xmm1, [rcx+16+8]  ; + 8 because of VirtualTable.i
! MOVUPS xmm2, [rcx+32+8]  ; + 8 because of VirtualTable.i
! MOVUPS xmm3, [rcx+48+8]  ; + 8 because of VirtualTable.i
; Back up xmm4-xmm6 (xmm7 is not used below)
! MOVUPS [p.v_Backup+00], xmm4
! MOVUPS [p.v_Backup+16], xmm5
! MOVUPS [p.v_Backup+32], xmm6
CompilerEndIf

CompilerIf #PB_Compiler_Processor = #PB_Processor_x64
; Multiplication with the right matrix (1st column)
! MOVUPS xmm4, [rdx+00+8] ; + 8 because of VirtualTable.i
! MOVAPS xmm6, xmm4
! SHUFPS xmm6, xmm6, 00000000b
! MULPS  xmm6, xmm0
! MOVAPS xmm5, xmm4
! SHUFPS xmm5, xmm5, 01010101b
! MULPS  xmm5, xmm1
! ADDPS  xmm6, xmm5
! MOVAPS xmm5, xmm4
! SHUFPS xmm5, xmm5, 10101010b
! MULPS  xmm5, xmm2
! ADDPS  xmm6, xmm5
! MOVAPS xmm5, xmm4
! SHUFPS xmm5, xmm5, 11111111b
! MULPS  xmm5, xmm3
! ADDPS  xmm6, xmm5
! MOVUPS [rax+00+8], xmm6  ; + 8 because of VirtualTable.i
; Multiplication with the right matrix (2nd column)
! MOVUPS xmm4, [rdx+16+8] ; + 8 because of VirtualTable.i
! MOVAPS xmm6, xmm4
! SHUFPS xmm6, xmm6, 00000000b
! MULPS  xmm6, xmm0
! MOVAPS xmm5, xmm4
! SHUFPS xmm5, xmm5, 01010101b
! MULPS  xmm5, xmm1
! ADDPS  xmm6, xmm5
! MOVAPS xmm5, xmm4
! SHUFPS xmm5, xmm5, 10101010b
! MULPS  xmm5, xmm2
! ADDPS  xmm6, xmm5
! MOVAPS xmm5, xmm4
! SHUFPS xmm5, xmm5, 11111111b
! MULPS  xmm5, xmm3
! ADDPS  xmm6, xmm5
! MOVUPS [rax+16+8], xmm6  ; + 8 because of VirtualTable.i
; Multiplication with the right matrix (3rd column)
! MOVUPS xmm4, [rdx+32+8] ; + 8 because of VirtualTable.i
! MOVAPS xmm6, xmm4
! SHUFPS xmm6, xmm6, 00000000b
! MULPS  xmm6, xmm0
! MOVAPS xmm5, xmm4
! SHUFPS xmm5, xmm5, 01010101b
! MULPS  xmm5, xmm1
! ADDPS  xmm6, xmm5
! MOVAPS xmm5, xmm4
! SHUFPS xmm5, xmm5, 10101010b
! MULPS  xmm5, xmm2
! ADDPS  xmm6, xmm5
! MOVAPS xmm5, xmm4
! SHUFPS xmm5, xmm5, 11111111b
! MULPS  xmm5, xmm3
! ADDPS  xmm6, xmm5
! MOVUPS [rax+32+8], xmm6  ; + 8 because of VirtualTable.i
; Multiplication with the right matrix (4th column)
! MOVUPS xmm4, [rdx+48+8] ; + 8 because of VirtualTable.i
! MOVAPS xmm6, xmm4
! SHUFPS xmm6, xmm6, 00000000b
! MULPS  xmm6, xmm0
! MOVAPS xmm5, xmm4
! SHUFPS xmm5, xmm5, 01010101b
! MULPS  xmm5, xmm1
! ADDPS  xmm6, xmm5
! MOVAPS xmm5, xmm4
! SHUFPS xmm5, xmm5, 10101010b
! MULPS  xmm5, xmm2
! ADDPS  xmm6, xmm5
! MOVAPS xmm5, xmm4
! SHUFPS xmm5, xmm5, 11111111b
! MULPS  xmm5, xmm3
! ADDPS  xmm6, xmm5
! MOVUPS [rax+48+8], xmm6  ; + 8 because of VirtualTable.i
; Restore xmm4-xmm6
! MOVUPS xmm4, [p.v_Backup+00]
! MOVUPS xmm5, [p.v_Backup+16]
! MOVUPS xmm6, [p.v_Backup+32]
CompilerEndIf

ProcedureReturn
EndProcedure
``````
Edit: Sorry, bug fix.
skywalk
Posts: 3516
Joined: Wed Dec 23, 2009 10:14 pm
Location: Boston, MA

### Re: Matrix Structure Speed test

STARGÅTE wrote: I don't know.
Measurements on my system give the following results:
PB 6.00, native code, ASM backend: 45 ns
PB 6.00, native code, C backend: 41 ns
PB 6.00, native code, C backend + optimization: 16 ns
PB 6.00, ASM code, ASM backend: 6.3 ns
In the end, you also need the SSE instructions in C for optimization.

This is great and matches Fred's earlier blog. There is an easy improvement of nearly 3x just with the C optimizer (45 ns down to 16 ns). Then crafty ASM guys can do about 2.5x better again (16 ns down to 6.3 ns).

The next question is an actual data comparison with the C optimizer.
Do you really get the same numeric results?
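
One way to check this is to run both code paths on the same input and compare the results element by element. The following is a minimal sketch, assuming the `Matrix44` layout with a leading `VirtualTable.i` used elsewhere in this thread; `MaxAbsDifference` is a hypothetical helper name, not part of the posted code, and the values filled in below are stand-ins rather than real multiplication results.

```purebasic
EnableExplicit

Structure Matrix44
  VirtualTable.i
  e11.f : e21.f : e31.f : e41.f
  e12.f : e22.f : e32.f : e42.f
  e13.f : e23.f : e33.f : e43.f
  e14.f : e24.f : e34.f : e44.f
EndStructure

; Hypothetical helper: largest absolute difference between two matrices
Procedure.f MaxAbsDifference(*A.Matrix44, *B.Matrix44)
  Protected I.i, Diff.f, MaxDiff.f = 0.0
  Protected *ElemA = *A + OffsetOf(Matrix44\e11) ; skip VirtualTable
  Protected *ElemB = *B + OffsetOf(Matrix44\e11)

  For I = 0 To 15
    Diff = Abs(PeekF(*ElemA + I * SizeOf(Float)) - PeekF(*ElemB + I * SizeOf(Float)))
    If Diff > MaxDiff
      MaxDiff = Diff
    EndIf
  Next

  ProcedureReturn MaxDiff
EndProcedure

Define Native.Matrix44, Optimized.Matrix44

; Stand-in values; in practice these would be the outputs of the native
; and the optimized multiplication applied to the same input matrices.
Native\e11 = 1.0  : Optimized\e11 = 1.0
Native\e44 = 0.25 : Optimized\e44 = 0.250001

If MaxAbsDifference(@Native, @Optimized) < 0.00001
  Debug "Results agree within tolerance"
Else
  Debug "Results differ"
EndIf
```

A tolerance is used rather than an exact comparison because bit-identical results are not guaranteed once the order of the floating-point additions changes; the tolerance value here is an arbitrary choice for illustration.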
The nice thing about standards is there are so many to choose from. ~ Andrew Tanenbaum
StarBootics
Enthusiast
Posts: 705
Joined: Sun Jul 07, 2013 11:35 am

### Re: Matrix Structure Speed test

Hello everyone,

1st: Thanks, STARGATE. It's lucky that you already had some code written. On your computer, even the native PureBasic code runs pretty fast already.

2nd: @skywalk: I can't say anything about the C backend since I'm on Linux, but the assembly-optimized code works absolutely fine. The animations behave the same way as before and run a little faster. That being said, I'm testing everything at 30 FPS, so I have plenty of time between two frames to do the calculations. This is for the editor; the game will have a bigger workload every frame, so the less time wasted the better.

Best regards
StarBootics
StarBootics
Enthusiast
Posts: 705
Joined: Sun Jul 07, 2013 11:35 am

### Re: Matrix Structure Speed test

Hello everyone,

Here is some more SSE acceleration demonstration:

``````; <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
; Project name : Matrix Structure Speed Test 3
; File Name : Matrix Structure Speed Test 3.pb
; File version: 1.0.0
; Programming : OK
; Programmed by : StarBootics
; Date : June 9th, 2021
; Last Update : June 9th, 2021
; PureBasic code : V5.73 LTS
; Platform : Windows, Linux, MacOS X
; <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<

Structure Matrix44

VirtualTable.i
e11.f
e21.f
e31.f
e41.f
e12.f
e22.f
e32.f
e42.f
e13.f
e23.f
e33.f
e43.f
e14.f
e24.f
e34.f
e44.f

EndStructure

Structure Vector4
sse.f[4]
EndStructure

Procedure Add(*This.Matrix44, *Other.Matrix44)

*This\e11 + *Other\e11
*This\e21 + *Other\e21
*This\e31 + *Other\e31
*This\e41 + *Other\e41
*This\e12 + *Other\e12
*This\e22 + *Other\e22
*This\e32 + *Other\e32
*This\e42 + *Other\e42
*This\e13 + *Other\e13
*This\e23 + *Other\e23
*This\e33 + *Other\e33
*This\e43 + *Other\e43
*This\e14 + *Other\e14
*This\e24 + *Other\e24
*This\e34 + *Other\e34
*This\e44 + *Other\e44

EndProcedure

Procedure Subtract(*This.Matrix44, *Other.Matrix44)

*This\e11 - *Other\e11
*This\e21 - *Other\e21
*This\e31 - *Other\e31
*This\e41 - *Other\e41
*This\e12 - *Other\e12
*This\e22 - *Other\e22
*This\e32 - *Other\e32
*This\e42 - *Other\e42
*This\e13 - *Other\e13
*This\e23 - *Other\e23
*This\e33 - *Other\e33
*This\e43 - *Other\e43
*This\e14 - *Other\e14
*This\e24 - *Other\e24
*This\e34 - *Other\e34
*This\e44 - *Other\e44

EndProcedure

Procedure ProductByScalar(*This.Matrix44, P_Scalar.f)

*This\e11 * P_Scalar
*This\e21 * P_Scalar
*This\e31 * P_Scalar
*This\e41 * P_Scalar
*This\e12 * P_Scalar
*This\e22 * P_Scalar
*This\e32 * P_Scalar
*This\e42 * P_Scalar
*This\e13 * P_Scalar
*This\e23 * P_Scalar
*This\e33 * P_Scalar
*This\e43 * P_Scalar
*This\e14 * P_Scalar
*This\e24 * P_Scalar
*This\e34 * P_Scalar
*This\e44 * P_Scalar

EndProcedure

Procedure AddSSE(*This.Matrix44, *Other.Matrix44)

CompilerIf #PB_Compiler_Processor = #PB_Processor_x64

! MOV rax, [p.p_This]
! MOV rcx, [p.p_Other]

! MOVUPS xmm0, [rax+00+8]
! MOVUPS xmm1, [rcx+00+8]
! ADDPS  xmm1, xmm0
! MOVUPS [rax+00+8], xmm1

! MOVUPS xmm0, [rax+16+8]
! MOVUPS xmm1, [rcx+16+8]
! ADDPS  xmm1, xmm0
! MOVUPS [rax+16+8], xmm1

! MOVUPS xmm0, [rax+32+8]
! MOVUPS xmm1, [rcx+32+8]
! ADDPS  xmm1, xmm0
! MOVUPS [rax+32+8], xmm1

! MOVUPS xmm0, [rax+48+8]
! MOVUPS xmm1, [rcx+48+8]
! ADDPS  xmm1, xmm0
! MOVUPS [rax+48+8], xmm1

CompilerElse
! MOV eax, [p.p_This]
! MOV ecx, [p.p_Other]

; + 4 instead of + 8: VirtualTable.i is only 4 bytes on x86
! MOVUPS xmm0, [eax+00+4]
! MOVUPS xmm1, [ecx+00+4]
! ADDPS  xmm1, xmm0
! MOVUPS [eax+00+4], xmm1

! MOVUPS xmm0, [eax+16+4]
! MOVUPS xmm1, [ecx+16+4]
! ADDPS  xmm1, xmm0
! MOVUPS [eax+16+4], xmm1

! MOVUPS xmm0, [eax+32+4]
! MOVUPS xmm1, [ecx+32+4]
! ADDPS  xmm1, xmm0
! MOVUPS [eax+32+4], xmm1

! MOVUPS xmm0, [eax+48+4]
! MOVUPS xmm1, [ecx+48+4]
! ADDPS  xmm1, xmm0
! MOVUPS [eax+48+4], xmm1

CompilerEndIf

EndProcedure

Procedure SubtractSSE(*This.Matrix44, *Other.Matrix44)

CompilerIf #PB_Compiler_Processor = #PB_Processor_x64

! MOV rax, [p.p_This]
! MOV rcx, [p.p_Other]

! MOVUPS xmm1, [rax+00+8]
! MOVUPS xmm0, [rcx+00+8]
! SUBPS  xmm1, xmm0
! MOVUPS [rax+00+8], xmm1

! MOVUPS xmm1, [rax+16+8]
! MOVUPS xmm0, [rcx+16+8]
! SUBPS  xmm1, xmm0
! MOVUPS [rax+16+8], xmm1

! MOVUPS xmm1, [rax+32+8]
! MOVUPS xmm0, [rcx+32+8]
! SUBPS  xmm1, xmm0
! MOVUPS [rax+32+8], xmm1

! MOVUPS xmm1, [rax+48+8]
! MOVUPS xmm0, [rcx+48+8]
! SUBPS  xmm1, xmm0
! MOVUPS [rax+48+8], xmm1

CompilerElse
! MOV eax, [p.p_This]
! MOV ecx, [p.p_Other]

; + 4 instead of + 8: VirtualTable.i is only 4 bytes on x86
! MOVUPS xmm1, [eax+00+4]
! MOVUPS xmm0, [ecx+00+4]
! SUBPS  xmm1, xmm0
! MOVUPS [eax+00+4], xmm1

! MOVUPS xmm1, [eax+16+4]
! MOVUPS xmm0, [ecx+16+4]
! SUBPS  xmm1, xmm0
! MOVUPS [eax+16+4], xmm1

! MOVUPS xmm1, [eax+32+4]
! MOVUPS xmm0, [ecx+32+4]
! SUBPS  xmm1, xmm0
! MOVUPS [eax+32+4], xmm1

! MOVUPS xmm1, [eax+48+4]
! MOVUPS xmm0, [ecx+48+4]
! SUBPS  xmm1, xmm0
! MOVUPS [eax+48+4], xmm1

CompilerEndIf

EndProcedure

Procedure ProductByScalarSSE(*This.Matrix44, P_Scalar.f)

Protected Vector.Vector4

Vector\sse[0] = P_Scalar
Vector\sse[1] = P_Scalar
Vector\sse[2] = P_Scalar
Vector\sse[3] = P_Scalar

CompilerIf #PB_Compiler_Processor = #PB_Processor_x64

! MOV rax, [p.p_This]
! MOVUPS xmm1, [rax+00+8]
! MOVUPS xmm0, [p.v_Vector]
! MULPS  xmm1, xmm0
! MOVUPS [rax+00+8], xmm1

! MOVUPS xmm1, [rax+16+8]
! MULPS  xmm1, xmm0
! MOVUPS [rax+16+8], xmm1

! MOVUPS xmm1, [rax+32+8]
! MULPS  xmm1, xmm0
! MOVUPS [rax+32+8], xmm1

! MOVUPS xmm1, [rax+48+8]
! MULPS  xmm1, xmm0
! MOVUPS [rax+48+8], xmm1

CompilerElse

! MOV eax, [p.p_This]
! MOVUPS xmm0, [p.v_Vector]

; + 4 instead of + 8: VirtualTable.i is only 4 bytes on x86
! MOVUPS xmm1, [eax+00+4]
! MULPS  xmm1, xmm0
! MOVUPS [eax+00+4], xmm1

! MOVUPS xmm1, [eax+16+4]
! MULPS  xmm1, xmm0
! MOVUPS [eax+16+4], xmm1

! MOVUPS xmm1, [eax+32+4]
! MULPS  xmm1, xmm0
! MOVUPS [eax+32+4], xmm1

! MOVUPS xmm1, [eax+48+4]
! MULPS  xmm1, xmm0
! MOVUPS [eax+48+4], xmm1

CompilerEndIf

EndProcedure

MatA.Matrix44
MatA\e11 = 1.0
MatA\e21 = 2.0
MatA\e31 = 3.0
MatA\e41 = 4.0

MatA\e12 = 5.0
MatA\e22 = 6.0
MatA\e32 = 7.0
MatA\e42 = 8.0

MatA\e13 = 9.0
MatA\e23 = 10.0
MatA\e33 = 11.0
MatA\e43 = 12.0

MatA\e14 = 13.0
MatA\e24 = 14.0
MatA\e34 = 15.0
MatA\e44 = 16.0

MatB.Matrix44
MatB\e11 = 1.0
MatB\e21 = 1.0
MatB\e31 = 1.0
MatB\e41 = 1.0

MatB\e12 = 1.0
MatB\e22 = 1.0
MatB\e32 = 1.0
MatB\e42 = 1.0

MatB\e13 = 1.0
MatB\e23 = 1.0
MatB\e33 = 1.0
MatB\e43 = 1.0

MatB\e14 = 1.0
MatB\e24 = 1.0
MatB\e34 = 1.0
MatB\e44 = 1.0

For TestID = 0 To 4

TempsDepart0 = ElapsedMilliseconds()

For Index = 0 To 100000
Add(MatA, MatB)
Subtract(MatA, MatB)
Next

TempsEcoule0 = ElapsedMilliseconds()-TempsDepart0

TempsDepart1 = ElapsedMilliseconds()

For Index = 0 To 100000
AddSSE(MatA, MatB)
SubtractSSE(MatA, MatB)
Next

TempsEcoule1 = ElapsedMilliseconds()-TempsDepart1

TempsDepart2 = ElapsedMilliseconds()

For Index = 0 To 100000
ProductByScalar(MatB, 1.0)
Next

TempsEcoule2 = ElapsedMilliseconds()-TempsDepart2

TempsDepart3 = ElapsedMilliseconds()

For Index = 0 To 100000
ProductByScalarSSE(MatB, 1.0)
Next

TempsEcoule3 = ElapsedMilliseconds()-TempsDepart3

MessageRequester("Elapsed Time", "Matrix Add-Subtract : " + Str(TempsEcoule0) + " milliseconds" + #LF$ + "Matrix Add-Subtract SSE : " + Str(TempsEcoule1) + " milliseconds" + #LF$ + "Matrix ProductByScalar : " + Str(TempsEcoule2) + " milliseconds" + #LF$ + "Matrix ProductByScalar SSE : " + Str(TempsEcoule3) + " milliseconds")

Next

; <<<<<<<<<<<<<<<<<<<<<<<
; <<<<< END OF FILE <<<<<
; <<<<<<<<<<<<<<<<<<<<<<<``````
Matrix Add-Subtract SSE : 2 milliseconds
Matrix ProductByScalar : 4 milliseconds
Matrix ProductByScalar SSE : 2 milliseconds

Matrix Add-Subtract SSE : 8 milliseconds
Matrix ProductByScalar : 6 milliseconds
Matrix ProductByScalar SSE : 2 milliseconds

Matrix Add-Subtract SSE : 8 milliseconds
Matrix ProductByScalar : 18 milliseconds
Matrix ProductByScalar SSE : 5 milliseconds
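
To compare these totals with the per-call nanosecond figures quoted earlier in the thread, they can be converted with the same `1.0e6 * milliseconds / count` formula used in the multiplication benchmark. A minimal sketch; `NanosecondsPerIteration` is a hypothetical helper name, and the 2 ms and 8 ms inputs are simply two of the totals listed above:

```purebasic
; Hypothetical helper: average time per loop iteration in nanoseconds
Procedure.d NanosecondsPerIteration(TotalMilliseconds.i, Iterations.i)
  ProcedureReturn 1.0e6 * TotalMilliseconds / Iterations
EndProcedure

#Iterations = 100001 ; "For Index = 0 To 100000" runs 100001 times

Debug StrD(NanosecondsPerIteration(2, #Iterations), 1) + " ns per iteration"
Debug StrD(NanosecondsPerIteration(8, #Iterations), 1) + " ns per iteration"
```

Totals of only a few milliseconds are close to the resolution of `ElapsedMilliseconds()`, so a larger iteration count, such as the 10000000 used in the multiplication benchmark, should give steadier figures.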