Page 1 of 2
Inline ASM slower than equivalent PB code...
Posted: Wed Jul 09, 2003 9:00 pm
by matthew180
Greetings,
I was doing some inline ASM testing using a simple function and found some distrubing results. Basically I was writing a modulus function since PB lacks this as a built in command, and decided I'd write it with ASM since you get modulus functionality for free with the DIV (divide) instruction. And, since a modulus function requires a divide, multiply, and subtraction in native PB, I assumed the ASM version would be quite faster... WRONG!
Here are the guts of each function:
ASM:
MOV eax, dividend
CDQ
IDIV divisor
MOV dividend, edx
ProcedureReturn dividend
PB:
ProcedureReturn dividend - ((dividend / divisor) * divisor)
In ASM, the result of the IDIV function is a quotient and remainder, and of course the remainder is the modulus, so you're basically done. The PB version has a multiply and subtraction to perform on top of the divide (which is common to both versions), so how is the PB version faster??? By a factor of two!!
All I can assume from this is that the preserve/restore part of doing inline ASM is quite costly. I can't tell what PB includes before and after the inline ASM code, but probably a push and pop of every register (complete state preservation), which is causing the slowdown.
So, based on this, before using inline ASM, make sure the ASM code will be executing something like a loop that has to execute many times. You have to save the preserve/restore time in your loop over the time it takes to execute the same loop using PB code; otherwise you lose the speed gained by using ASM.
Maybe Fred can provide some insight as to what might be going on? It seems to me that using ASM should speed things up, not slow them down.
Matthew
Posted: Wed Jul 09, 2003 9:44 pm
by Kale
Maybe this will help give you more info (it means nothing to me

)?
Code: Select all
Procedure MOD(a,b)
ProcedureReturn a-a/b*b
EndProcedure
Procedure ASM_MOD(dividend, divisor)
MOV eax, dividend
CDQ
IDIV divisor
MOV dividend, edx
ProcedureReturn dividend
EndProcedure
Debug MOD(20,7)
Debug ASM_MOD(20,7)
ASM Output:
Code: Select all
; Procedure MOD(a,b)
JMP _EndProcedure0
_Procedure0:
PUSH ebx
PUSH ecx
PUSH ebp
PUSH esi
PUSH edi
MOV esi,esp
SUB esp,8
MOV eax,esp
MOV edx,eax
ADD edx,8
_ClearLoop1:
MOV dword [eax],0
ADD eax,4
CMP eax,edx
JNE _ClearLoop1
MOV eax, dword [esi+24]
MOV dword [esp+0],eax
MOV eax, dword [esi+28]
MOV dword [esp+4],eax
; ProcedureReturn a-a/b*b
MOV ebx,dword [esp+0]
MOV edi,dword [esp+0]
PUSH dword [esp+4]
MOV eax,edi
POP edi
CDQ
IDIV edi
MOV edi, eax
IMUL edi,dword [esp+4]
SUB ebx,edi
MOV eax,ebx
JMP _EndProcedure1
; EndProcedure
XOR eax,eax
_EndProcedure1:
ADD esp,8
POP edi
POP esi
POP ebp
POP ecx
POP ebx
RET 8
_EndProcedure0:
;
; Procedure ASM_MOD(dividend, divisor)
JMP _EndProcedure2
_Procedure2:
PUSH ebx
PUSH ecx
PUSH ebp
PUSH esi
PUSH edi
MOV esi,esp
SUB esp,8
MOV eax,esp
MOV edx,eax
ADD edx,8
_ClearLoop3:
MOV dword [eax],0
ADD eax,4
CMP eax,edx
JNE _ClearLoop3
MOV eax, dword [esi+24]
MOV dword [esp+0],eax
MOV eax, dword [esi+28]
MOV dword [esp+4],eax
; MOV eax, dividend
MOV eax,dword [esp+0]
; CDQ
CDQ
; IDIV divisor
IDIV dword [esp+4]
; MOV dividend, edx
MOV dword [esp+0],edx
; ProcedureReturn dividend
MOV eax,dword [esp+0]
JMP _EndProcedure3
; EndProcedure
XOR eax,eax
_EndProcedure3:
ADD esp,8
POP edi
POP esi
POP ebp
POP ecx
POP ebx
RET 8
_EndProcedure2:
;
; Debug MOD(20,7)
; Debug ASM_MOD(20,7)
; ExecutableFormat=Windows
; CursorPosition=13
; FirstLine=1
; EOF
compiled with pbcompiler.exe mod.pb /COMMENTED /INLINEASM
Edit: changed middle of MOD() procedure from ProcedureReturn 20-20/7*7 to ProcedureReturn a-a/b*b and asm to match

Posted: Thu Jul 10, 2003 12:31 am
by VPureBasic
Hi,
Try this! Use "!" before asm words. You'll see something very different for speed and source length...
Roger
Posted: Thu Jul 10, 2003 4:43 am
by matthew180
Roger,
What exactally is the "!" supposed to do? Where can I get a reference on this use? I'm getting errors now when I try to compile the code and I have no way to know how to fix them.
Matthew
Posted: Thu Jul 10, 2003 8:33 am
by tinman
matthew180 wrote:What exactally is the "!" supposed to do? Where can I get a reference on this use? I'm getting errors now when I try to compile the code and I have no way to know how to fix them.
I think you have to do it at the start of the line. What it does is tell the compiler to pass that line straight to the assembler. If it's not in the inline assembly section of the manual then its probably missing.
Posted: Thu Jul 10, 2003 9:01 am
by Fred
In the Kale code, you can see than PB do as much 'compiler' optimisation as possible, so the whole numerical expression is reduced to its result as it's numeric only, so it's of course faster. About your the speed problem, it's odd as the PB code produce this:
Code: Select all
MOV ebx,dword [esp+0]
MOV edi,dword [esp+0]
PUSH dword [esp+4]
MOV eax,edi
POP edi
CDQ
IDIV edi
MOV edi, eax
IMUL edi,dword [esp+4]
SUB ebx,edi
MOV eax,ebx
JMP _EndProcedure1
while your code is (as expected):
Code: Select all
MOV eax, dividend
MOV eax,dword [esp+0]
CDQ
IDIV dword [esp+4]
MOV dword [esp+0],edx
MOV eax,dword [esp+0]
JMP _EndProcedure3
So it should be way faster (no multiply).
Posted: Thu Jul 10, 2003 12:18 pm
by jack
as tinman said, putting a ! at the beginning of a asm
statement, makes the compiler pass that line of code
directly to fasm, therefore your variable references
will no longer work as is.
here's your procedure to ilustrate.
Code: Select all
Procedure ASM_MOD(dividend, divisor)
! MOV eax, [esp]
! CDQ
! IDIV dword [esp+4]
! MOV [esp], edx
ProcedureReturn dividend
EndProcedure
notice that since you are using a procedure,
the variables, dividend and divisor, are pushed
onto the stack, and since they are of type long,
they occupy 4 bytes.
if you inline that procedure in your main program,
it becomes:
Code: Select all
divisor.l=2
For i.l=1 To 20
dividend.l=i
! MOV eax, [v_dividend]
! CDQ
! IDIV [v_divisor]
! MOV [v_dividend], edx
PrintN(Str(i)+" "+Str(dividend))
Next i
or what's beter, don't use ! when you are referencing variables.
Code: Select all
Procedure ASM_MOD(dividend, divisor)
MOV eax, dividend
! CDQ
IDIV divisor
MOV dividend, edx
ProcedureReturn dividend
EndProcedure
actually, the ! is only needed for some instructions, that
PureBasic don't recognize.
Posted: Thu Jul 10, 2003 1:31 pm
by Fred
Note: using '!' or inline asm is exactly the same regarding size or speed. You just have more control with '!' as you can even use the FASM macro/commands sets.
Posted: Thu Jul 10, 2003 3:08 pm
by freak
Using the '!', you don't have to remember enabling 'Enable InlineASM support'
Timo
Posted: Thu Jul 10, 2003 3:46 pm
by matthew180
First I'd like to say thanks to everyone for the information! I love messing with crap like this!
Well, I wrote the same test in C on my FreeBSD box, and using microseconds instead of milliseconds I was able to demonstrate that my version without the extra IMUL and SUB was indeed faster. However, it was only 3 milliseconds faster over 1000000 iterations... The gcc compiler created almost the exact code that pbcompile.exe created! Cudos to Fred! The C code was, however, about twice as fast as the PB code, and that's on my 300MHz P2 FreeBSD box, but my PB code is running on my 1.4GHz P4 workstation...
C function:
long
mod(long dividend, long divisor)
{
if ( divisor == 0 )
return(0);
return(dividend - ((dividend / divisor) * divisor));
}
gcc ASM of that function:
mod:
pushl %ebp
movl %esp,%ebp
cmpl $0,12(%ebp)
jne .L14
xorl %eax,%eax
jmp .L13
.p2align 2,0x90
.L14:
movl 8(%ebp),%eax
leal 12(%ebp),%ecx
cltd
idivl (%ecx)
movl %eax,%ecx
imull 12(%ebp),%ecx
movl 8(%ebp),%edx
subl %ecx,%edx
movl %edx,%eax
jmp .L13
.p2align 2,0x90
.L13:
leave
ret
If you don't know Unix style ASM, it gets a little funny looking. Commands end with a size indicator: b = byte(8), w = word(16), l = long(32). All registers are referenced with a %, and the parameters are passed as SRC, DEST instead of DEST, SRC. Also, that CLTD instruction equates to CDQ.
I'm going to attribute the lack of speed increase to the parallel nature of today's CPUs. The IMUL and SUB are probably already fetched, decoded, and ready to execute by the time IDIV is done. Kind of depressing really, takes all the fun out of ASM...
I am, however, still confused about the the PB speed issue and use of '!'. If I use ! in front of the commands, and convert variable access to [esp + n] then the ASM version runs as fast as the PB version (and in some cases a few milliseconds faster.) But without the ! in front of *every* command, it slows back to about double the PB version. But, looking at the ASM output from pbcompiler.exe, not using ! simply generates the same code as when using the !... Like this:
MOV eax, dividend
CDQ
IDIV divisor
MOV dividend, edx
That code gets converted to this by pbcompiler.exe:
; MOV eax, dividend
MOV eax,dword [esp+0]
; CDQ
CDQ
; IDIV divisor
IDIV dword [esp+4]
; MOV dividend, edx
MOV dword [esp+0],edx
But that's exactally what I had to do when I used the !, like this:
!MOV eax, dword [esp + 0]
!CDQ
!IDIV dword [esp + 4]
!MOV dword [esp + 0], edx
It complies to the *same* code, so why is the version without the ! two times slower? This is very confusing.
Thanks,
Matthew
Posted: Thu Jul 10, 2003 4:36 pm
by Pupil
Are you running the PB tests with debugger turned on, if so this will steal some CPU cycles from the actual task of MOD:ing

Posted: Thu Jul 10, 2003 4:48 pm
by matthew180
Yes, debug is on, but it is on in both tests (ASM version and PB version), so it should wash out. Here is my test:
Code: Select all
#ITERATIONS = 1000000
; ASM signed.
time_start = gettickcount_()
For i = 1 To #ITERATIONS Step 1
result = fl_mod_asm(i, 10)
Next
time_stop = gettickcount_()
Debug "ASM Signed Modulus, " + Str(#ITERATIONS) + " iterations: " + Str(time_stop - time_start) + " milliseconds."
The only thing I change between tests is the function call name in the loop... I think this is about as accurate as I can get, and both functions should be tested equally this way.
Like I said earlier, my C version was only showing about 3 milliseconds difference over 1 million iterations. The C code is faster, but after looking at PB's ASM output, I assume that is due to the heavy save/restore than PB is doing. Also, there is this funny loop which I'm not exactally sure why this has to happen:
Code: Select all
MOV esi,esp
SUB esp,8
MOV eax,esp
MOV edx,eax
ADD edx,8
_ClearLoop3:
MOV dword [eax],0
ADD eax,4
CMP eax,edx
JNE _ClearLoop3
MOV eax, dword [esi+24]
MOV dword [esp+0],eax
MOV eax, dword [esi+28]
MOV dword [esp+4],eax
It looks like the parameters are passed in at stack offsets 24 and 28, and this just moves them down to become zero based in the stack... Maybe I'm missing something, but is all this really necessary?
Matthew
Posted: Thu Jul 10, 2003 5:03 pm
by Fred
With debug ON, extra code are added to all, even inline asm one, that's why it's much slower. Always disable debugger for speed test as you don't know what pb does in debug mode (for example I check all divide against 0 division, which slow down a lot the runtime).
Posted: Thu Jul 10, 2003 5:08 pm
by Pupil
It might be that when debugger is turned ON the compiler inserts breakpoints for each line(Fred?) and that would explain why there's a speed difference between "normal" inline ASM versus the '!' inline ASM. This would also explain why the PB version is faster than the ASM version, because in the asm version you would get breakpoints after each asm command which when counted would add up to more breakpoints than the PB version...
If this assumptions of mine is totally wrong, i must say that i haven't got a clue of what's going on

Posted: Thu Jul 10, 2003 5:38 pm
by matthew180
Well, turning off debug sure made a *huge* difference!
Basically I'm down to both versions executing 1 million iterations in 47 milliseconds, every time. This is down from about 331 milliseconds with debugging on. Also, with debugging on the times varied, but with it off the times are always 47 milliseconds. This is on my 1.4GHz workstation, which makes it faster than my C code on my 300MHz P2, as it should be.
On my other 900MHz Athlon FreeBSD box, the C code runs at 68134 microseconds for the ASM version, and 72458 microseconds for the C version with the extra IMUL and SUB. So, about 4.324 milliseconds difference, and still slower (but not as much) than the 1.4GHz P4, which is a good thing sice it is only running at 900MHz, and we like to see PB going just as fast as C!
Thanks everyone for the info. I would, though, like to see the same 4.3 millisecond difference in PB, but I don't think I can get the timing that close... Is there a microsecond counter available to PB? Something like the Unix gettimeofday() function would be nice!
Matthew