Page 1 of 2
Optimizing rotations for quad
Posted: Tue Aug 23, 2011 11:15 pm
by netmaestro
Last week wilbert wrote some code for me for rotating quads using just the general purpose registers. This turned out to be faster than any other tries using SSE. Today I tried to do the same, without looking at his procedures and then see how similar the code was. The result was interesting in that for rotations of 31 bits or less, my code turned out faster (here, anyway) and for rotations of 32 or more bits, wilbert's code is faster. So a good question might be, is there a way to merge the two codes somehow to get the best features of both? (assuming the speed differences are happening on others' machines too)
Code: Select all
Procedure.q rotr64(val.q, n) ; By netmaestro
!mov eax, [esp+4]
!mov edx, [esp+8]
!mov ecx, [esp+12]
!cmp ecx, 31
!jg large1
!mov ebx, edx
!shrd edx, eax, cl
!shrd eax, ebx, cl
!jmp exit1
!large1:
!push ebx
!sub ecx, 32
!mov ebx, edx
!shrd edx, eax, 31
!shrd eax, ebx, 31
!mov ebx, edx
!shrd edx, eax, 1
!shrd eax, ebx, 1
!mov ebx, edx
!shrd edx, eax, cl
!shrd eax, ebx, cl
!pop ebx
!exit1:
ProcedureReturn
EndProcedure
Procedure.q Rotr64_(val.q, n) ; By wilbert
!mov ecx,[esp + 12]
!and cl,63
!cmp cl,32
!jb rotr64_1
!mov edx,[esp + 4]
!mov eax,[esp + 8]
!jmp rotr64_2
!rotr64_1:
!mov eax,[esp + 4]
!mov edx,[esp + 8]
!rotr64_2:
!and cl,31
!jz rotr64_3
!push eax
!shrd eax, edx, cl
!push eax
!mov eax, [esp + 4]
!shrd edx, eax, cl
!pop eax
!add esp,4
!rotr64_3:
ProcedureReturn
EndProcedure
tm1 = ElapsedMilliseconds()
For i = 1 To 100000000
b.q = rotR64(a, 31)
Next
tm2 = ElapsedMilliseconds()
For i = 1 To 100000000
b.q = rotr64_(a, 31)
Next
tm3 = ElapsedMilliseconds()
msg.s = "test1: " + Str(tm2 - tm1) + Chr(13) + Chr(10)
msg + "test2: " + Str(tm3 - tm2) + Chr(13) + Chr(10)
MessageRequester("Speed", msg)
Btw (offtopic): Thanks for the new subforum, Fred & Freak

Re: Optimizing rotations for quad
Posted: Wed Aug 24, 2011 1:46 am
by Demivec
@netmaestro: I gave a cursory look at the routines. It would seem wilbert's gives a consistent speed for most of the rotations from 0 - 63 bits while yours is faster for rotations of 0 - 31 bits and slower for rotations of 32 - 63 bits. The speed gain for the cases when yours was faster is basically equivalent to the speed loss for the cases when it was slower. This means if the rotation amounts are equal it should come out the same.
Ignoring those results for a minute though, your routine doesn't preserve the ebx register for rotations that are less than 32 bits. By correcting this you may find the results are slower for your routine for every case. I made the correction to your code and it seems you still have a slight gain for rotating by 0 - 31 bits.
My results for the length of time to rotate a quad once by each possible bit combination from 0 - 63 is as follows: 82035ms (wilbert) and 84528ms (netmaestro). After the modification it was 81703ms (wilbert) and 87532ms (netmaestro). Just for curiosity here are the results for just rotations 0 - 31 both before and after the correction: 34624ms / 37093ms (netmaestro) and 40658ms / 39923 (wilbert).
With a little more time I see what I can do to answer your original question about combing the best of both.
Here's the corrected code I used for your routine:
Code: Select all
Procedure.q rotr64(val.q, n) ; By netmaestro with minor corrections by Demivec
!MOV Eax, [Esp+4]
!MOV Edx, [Esp+8]
!MOV Ecx, [Esp+12]
!PUSH Ebx
!CMP Ecx, 31
!JG large1
!MOV Ebx, Edx
!SHRD Edx, Eax, cl
!SHRD Eax, Ebx, cl
!JMP exit1
!large1:
!SUB Ecx, 32
!MOV Ebx, Edx
!SHRD Edx, Eax, 31
!SHRD Eax, Ebx, 31
!MOV Ebx, Edx
!SHRD Edx, Eax, 1
!SHRD Eax, Ebx, 1
!MOV Ebx, Edx
!SHRD Edx, Eax, cl
!SHRD Eax, Ebx, cl
!exit1:
!POP Ebx
ProcedureReturn
EndProcedure
Re: Optimizing rotations for quad
Posted: Wed Aug 24, 2011 2:31 am
by Demivec
Here's a version with an improvement:
Code: Select all
Procedure.q rotr64(val.q, n) ; By netmaestro with modifications by Demivec
!MOV Eax, [Esp+4]
!MOV Edx, [Esp+8]
!MOV Ecx, [Esp+12]
!PUSH Ebx
!CMP Ecx, 31
!JG large1
!MOV Ebx, Edx
!SHRD Edx, Eax, cl
!SHRD Eax, Ebx, cl
!JMP exit1
!large1:
!SUB Ecx, 32
!MOV Ebx, Eax
!MOV Eax, Edx
!MOV Edx, Ebx
!SHRD Edx, Eax, cl
!SHRD Eax, Ebx, cl
!exit1:
!POP Ebx
ProcedureReturn
EndProcedure
My timing show it beating wilbert's (on my single core AMD CPU) for almost all rotations of 32 - 63 bits and tying for rotations of 0 - 31 bits. wilbert still tromps you on 1 or 2 special cases like rotations of 0 or 32 bits.
Re: Optimizing rotations for quad
Posted: Wed Aug 24, 2011 4:41 am
by Demivec
@Edit: incorrect code removed.
Re: Optimizing rotations for quad
Posted: Wed Aug 24, 2011 4:53 am
by netmaestro
Thanks for working on it! For now though, your latest fails QC due diligence on the 32 bits rotation only:
Code: Select all
Procedure.q rotr64(val.q, n) ; By netmaestro with modifications by Demivec
!MOV Eax, [Esp+4]
!MOV Edx, [Esp+8]
!MOV Ecx, [Esp+12]
!PUSH Ebx
!CMP Ecx, 31
!JG large1
!MOV Ebx, Edx
!SHRD Edx, Eax, cl
!SHRD Eax, Ebx, cl
!JMP exit1
!large1:
!AND cl,31
!JZ exit1
!MOV Ebx, Eax
!MOV Eax, Edx
!MOV Edx, Ebx
!SHRD Edx, Eax, cl
!SHRD Eax, Ebx, cl
!exit1:
!POP Ebx
ProcedureReturn
EndProcedure
Procedure.q Rotr64_(val.q, n) ; By wilbert
!mov ecx,[esp + 12]
!and cl,63
!cmp cl,32
!jb rotr64_1
!mov edx,[esp + 4]
!mov eax,[esp + 8]
!jmp rotr64_2
!rotr64_1:
!mov eax,[esp + 4]
!mov edx,[esp + 8]
!rotr64_2:
!and cl,31
!jz rotr64_3
!push eax
!shrd eax, edx, cl
!push eax
!mov eax, [esp + 4]
!shrd edx, eax, cl
!pop eax
!add esp,4
!rotr64_3:
ProcedureReturn
EndProcedure
a.q = 4123456789
For i=0 To 63
b.q = rotr64(a, i)
c.q = rotr64_(a, i)
If c<>b
Debug "oops: " + Str(i)
EndIf
Next
Re: Optimizing rotations for quad
Posted: Wed Aug 24, 2011 6:07 am
by wilbert
With SSE2 register
Code: Select all
Procedure.q Rotr64_(val.q, n) ; By wilbert
!mov ecx,[esp + 12]
!btr ecx, 5
!jc rotr64_1
!mov eax,[esp + 4]
!mov edx,[esp + 8]
!jmp rotr64_2
!rotr64_1:
!mov edx,[esp + 4]
!mov eax,[esp + 8]
!rotr64_2:
!movd xmm0, ebx
!mov ebx, eax
!shrd eax, edx, cl
!shrd edx, ebx, cl
!movd ebx, xmm0
ProcedureReturn
EndProcedure
Without
Code: Select all
Procedure.q Rotr64_(val.q, n) ; By wilbert
!mov ecx,[esp + 12]
!btr ecx, 5
!jc rotr64_1
!mov eax,[esp + 4]
!mov edx,[esp + 8]
!jmp rotr64_2
!rotr64_1:
!mov edx,[esp + 4]
!mov eax,[esp + 8]
!rotr64_2:
!push ebx
!mov ebx, eax
!shrd eax, edx, cl
!shrd edx, ebx, cl
!pop ebx
ProcedureReturn
EndProcedure
Re: Optimizing rotations for quad
Posted: Wed Aug 24, 2011 6:13 am
by netmaestro
Did you test those for accuracy? Run against your earlier plain asm results, the sse fails on 49-63 and the next one fails on 2-63. Something isn't right.
Re: Optimizing rotations for quad
Posted: Wed Aug 24, 2011 6:29 am
by wilbert
Try again. I updated the code. I forgot one instruction.
I also used the and mask with value 63 for a reason.
When you rotate a quad 66 times to the right, the result should be the same as when you rotate it twice.
Re: Optimizing rotations for quad
Posted: Wed Aug 24, 2011 6:37 am
by netmaestro
Perfect, they both work now. The second one is faster, I'm doing more tests but I think it's turning out faster than your original (which hasn't been beat by anything yet)
Re: Optimizing rotations for quad
Posted: Wed Aug 24, 2011 6:57 am
by netmaestro
I also used the and mask with value 63 for a reason.
When you rotate a quad 66 times to the right, the result should be the same as when you rotate it twice.
That's going to take a while to sink in

but it'll come!
Re: Optimizing rotations for quad
Posted: Wed Aug 24, 2011 7:25 am
by wilbert
netmaestro wrote:That's going to take a while to sink in

but it'll come!
I'm sure you will figure it out
As for speed, if it is really important, you should do it by simulating a C style function
Code: Select all
Procedure rotr64_addr__() ; By wilbert
!mov eax, rotr64_addr_start
ProcedureReturn
!rotr64_addr_start:
!mov ecx,[esp + 12]
!btr ecx, 5
!jc rotr64_1
!mov eax,[esp + 4]
!mov edx,[esp + 8]
!jmp rotr64_2
!rotr64_1:
!mov edx,[esp + 4]
!mov eax,[esp + 8]
!rotr64_2:
!push ebx
!mov ebx, eax
!shrd eax, edx, cl
!shrd edx, ebx, cl
!pop ebx
!ret
EndProcedure
PrototypeC.q proto_rotr64(val.q, n)
Global Rotr64.proto_rotr64 = rotr64_addr__()
When I tested things, a cdecl call seems to be a little bit faster compared to stdcall (please correct me if I'm wrong).
You also avoid some pushing and popping PureBasic does for the functions it generates.
Re: Optimizing rotations for quad
Posted: Wed Aug 24, 2011 7:31 am
by netmaestro
When I tested things, a cdecl call seems to be a little bit faster compared to stdcall (please correct me if I'm wrong).
On Mac, it's probably faster but all the tests I've done on Windows show the same performance. In my sha256/224 project I've left the prototype in because even though I can't realize a speed difference here on Win7, it's no slower for sure and on Mac it's probably faster. So it stays.
Re: Optimizing rotations for quad
Posted: Wed Aug 24, 2011 7:52 am
by wilbert
netmaestro wrote:On Mac, it's probably faster but all the tests I've done on Windows show the same performance. In my sha256/224 project I've left the prototype in because even though I can't realize a speed difference here on Win7, it's no slower for sure and on Mac it's probably faster. So it stays.
It might have to do with the stack. On OS X the stack (esp) has to be aligned on 16 bytes at the moment you make a call otherwise the app probably will crash.
Re: Optimizing rotations for quad
Posted: Wed Aug 24, 2011 10:29 am
by Helle
@wilbert:
SHRD works intern with an and-bit-mask for CL; in 32-bit-mode are only bits 0-4 used. You can remove "and ecx, 63".
Not:
Code: Select all
!and ecx, 63
!btr ecx, 5
!jc rotr64_1
But:
Code: Select all
!test ecx,100000b ;or cmp ecx,31 / jg ...
!jnz rotr64_1
Helle
Re: Optimizing rotations for quad
Posted: Wed Aug 24, 2011 11:01 am
by wilbert
You are right Helle.
I removed the and instruction; should make it a little bit faster.