Page 1 of 2

Inline ASM slower than equivalent PB code...

Posted: Wed Jul 09, 2003 9:00 pm
by matthew180
Greetings,

I was doing some inline ASM testing using a simple function and found some distrubing results. Basically I was writing a modulus function since PB lacks this as a built in command, and decided I'd write it with ASM since you get modulus functionality for free with the DIV (divide) instruction. And, since a modulus function requires a divide, multiply, and subtraction in native PB, I assumed the ASM version would be quite faster... WRONG!

Here are the guts of each function:

ASM:
MOV eax, dividend
CDQ
IDIV divisor
MOV dividend, edx
ProcedureReturn dividend

PB:
ProcedureReturn dividend - ((dividend / divisor) * divisor)

In ASM, the result of the IDIV function is a quotient and remainder, and of course the remainder is the modulus, so you're basically done. The PB version has a multiply and subtraction to perform on top of the divide (which is common to both versions), so how is the PB version faster??? By a factor of two!!

All I can assume from this is that the preserve/restore part of doing inline ASM is quite costly. I can't tell what PB includes before and after the inline ASM code, but probably a push and pop of every register (complete state preservation), which is causing the slowdown.

So, based on this, before using inline ASM, make sure the ASM code will be executing something like a loop that has to execute many times. You have to save the preserve/restore time in your loop over the time it takes to execute the same loop using PB code; otherwise you lose the speed gained by using ASM.

Maybe Fred can provide some insight as to what might be going on? It seems to me that using ASM should speed things up, not slow them down.

Matthew

Posted: Wed Jul 09, 2003 9:44 pm
by Kale
Maybe this will help give you more info (it means nothing to me :lol: )?

Code: Select all

Procedure MOD(a,b)
    ProcedureReturn a-a/b*b
EndProcedure

Procedure ASM_MOD(dividend, divisor)
    MOV eax, dividend
    CDQ
    IDIV divisor
    MOV dividend, edx
    ProcedureReturn dividend
EndProcedure

Debug MOD(20,7)
Debug ASM_MOD(20,7)
ASM Output:

Code: Select all

; Procedure MOD(a,b)
  JMP   _EndProcedure0
_Procedure0:
  PUSH   ebx
  PUSH   ecx
  PUSH   ebp
  PUSH   esi
  PUSH   edi
  MOV    esi,esp
  SUB    esp,8
  MOV    eax,esp
  MOV    edx,eax
  ADD    edx,8
_ClearLoop1:
  MOV    dword [eax],0
  ADD    eax,4
  CMP    eax,edx
  JNE   _ClearLoop1                                                                                                                     
  MOV    eax, dword [esi+24]
  MOV    dword [esp+0],eax
  MOV    eax, dword [esi+28]
  MOV    dword [esp+4],eax
; ProcedureReturn a-a/b*b
  MOV    ebx,dword [esp+0]
  MOV    edi,dword [esp+0]
  PUSH   dword [esp+4]
  MOV    eax,edi
  POP    edi
  CDQ
  IDIV   edi
  MOV    edi, eax
  IMUL   edi,dword [esp+4]
  SUB    ebx,edi
  MOV    eax,ebx
  JMP   _EndProcedure1
; EndProcedure
  XOR    eax,eax
_EndProcedure1:
  ADD    esp,8
  POP    edi
  POP    esi
  POP    ebp
  POP    ecx
  POP    ebx
  RET    8
_EndProcedure0:
; 
; Procedure ASM_MOD(dividend, divisor)
  JMP   _EndProcedure2
_Procedure2:
  PUSH   ebx
  PUSH   ecx
  PUSH   ebp
  PUSH   esi
  PUSH   edi
  MOV    esi,esp
  SUB    esp,8
  MOV    eax,esp
  MOV    edx,eax
  ADD    edx,8
_ClearLoop3:
  MOV    dword [eax],0
  ADD    eax,4
  CMP    eax,edx
  JNE   _ClearLoop3                                                                                                                     
  MOV    eax, dword [esi+24]
  MOV    dword [esp+0],eax
  MOV    eax, dword [esi+28]
  MOV    dword [esp+4],eax
; MOV eax, dividend
  MOV    eax,dword [esp+0]
; CDQ
  CDQ
; IDIV divisor
  IDIV   dword [esp+4]
; MOV dividend, edx
  MOV    dword [esp+0],edx
; ProcedureReturn dividend
  MOV    eax,dword [esp+0]
  JMP   _EndProcedure3
; EndProcedure
  XOR    eax,eax
_EndProcedure3:
  ADD    esp,8
  POP    edi
  POP    esi
  POP    ebp
  POP    ecx
  POP    ebx
  RET    8
_EndProcedure2:
; 
; Debug MOD(20,7)
; Debug ASM_MOD(20,7)
; ExecutableFormat=Windows
; CursorPosition=13
; FirstLine=1
; EOF
compiled with pbcompiler.exe mod.pb /COMMENTED /INLINEASM

Edit: changed middle of MOD() procedure from ProcedureReturn 20-20/7*7 to ProcedureReturn a-a/b*b and asm to match :oops:

Posted: Thu Jul 10, 2003 12:31 am
by VPureBasic
Hi,

Try this! Use "!" before asm words. You'll see something very different for speed and source length...

Roger

Posted: Thu Jul 10, 2003 4:43 am
by matthew180
Roger,

What exactally is the "!" supposed to do? Where can I get a reference on this use? I'm getting errors now when I try to compile the code and I have no way to know how to fix them.

Matthew

Posted: Thu Jul 10, 2003 8:33 am
by tinman
matthew180 wrote:What exactally is the "!" supposed to do? Where can I get a reference on this use? I'm getting errors now when I try to compile the code and I have no way to know how to fix them.
I think you have to do it at the start of the line. What it does is tell the compiler to pass that line straight to the assembler. If it's not in the inline assembly section of the manual then its probably missing.

Posted: Thu Jul 10, 2003 9:01 am
by Fred
In the Kale code, you can see than PB do as much 'compiler' optimisation as possible, so the whole numerical expression is reduced to its result as it's numeric only, so it's of course faster. About your the speed problem, it's odd as the PB code produce this:

Code: Select all

  MOV    ebx,dword [esp+0]
  MOV    edi,dword [esp+0]
  PUSH   dword [esp+4]
  MOV    eax,edi
  POP    edi
  CDQ
  IDIV   edi
  MOV    edi, eax
  IMUL   edi,dword [esp+4]
  SUB    ebx,edi
  MOV    eax,ebx
  JMP   _EndProcedure1
while your code is (as expected):

Code: Select all

  MOV eax, dividend 
  MOV    eax,dword [esp+0]
  CDQ
  IDIV   dword [esp+4]
  MOV    dword [esp+0],edx
  MOV    eax,dword [esp+0]
  JMP   _EndProcedure3
So it should be way faster (no multiply).

Posted: Thu Jul 10, 2003 12:18 pm
by jack
as tinman said, putting a ! at the beginning of a asm
statement, makes the compiler pass that line of code
directly to fasm, therefore your variable references
will no longer work as is.

here's your procedure to ilustrate.

Code: Select all

Procedure ASM_MOD(dividend, divisor) 
!    MOV eax, [esp] 
!    CDQ 
!    IDIV dword [esp+4] 
!    MOV [esp], edx 
    ProcedureReturn dividend 
EndProcedure
notice that since you are using a procedure,
the variables, dividend and divisor, are pushed
onto the stack, and since they are of type long,
they occupy 4 bytes.

if you inline that procedure in your main program,
it becomes:

Code: Select all

divisor.l=2 
For i.l=1 To 20
  dividend.l=i
!    MOV eax, [v_dividend] 
!    CDQ 
!    IDIV [v_divisor] 
!    MOV [v_dividend], edx 
  PrintN(Str(i)+"  "+Str(dividend))
Next i
or what's beter, don't use ! when you are referencing variables.

Code: Select all

Procedure ASM_MOD(dividend, divisor) 
     MOV eax, dividend 
!    CDQ 
     IDIV divisor 
     MOV dividend, edx 
    ProcedureReturn dividend 
EndProcedure
actually, the ! is only needed for some instructions, that
PureBasic don't recognize.

Posted: Thu Jul 10, 2003 1:31 pm
by Fred
Note: using '!' or inline asm is exactly the same regarding size or speed. You just have more control with '!' as you can even use the FASM macro/commands sets.

Posted: Thu Jul 10, 2003 3:08 pm
by freak
Using the '!', you don't have to remember enabling 'Enable InlineASM support' :wink:

Timo

Posted: Thu Jul 10, 2003 3:46 pm
by matthew180
First I'd like to say thanks to everyone for the information! I love messing with crap like this! ;-)

Well, I wrote the same test in C on my FreeBSD box, and using microseconds instead of milliseconds I was able to demonstrate that my version without the extra IMUL and SUB was indeed faster. However, it was only 3 milliseconds faster over 1000000 iterations... The gcc compiler created almost the exact code that pbcompile.exe created! Cudos to Fred! The C code was, however, about twice as fast as the PB code, and that's on my 300MHz P2 FreeBSD box, but my PB code is running on my 1.4GHz P4 workstation...

C function:

long
mod(long dividend, long divisor)
{
if ( divisor == 0 )
return(0);

return(dividend - ((dividend / divisor) * divisor));
}

gcc ASM of that function:

mod:
pushl %ebp
movl %esp,%ebp
cmpl $0,12(%ebp)
jne .L14
xorl %eax,%eax
jmp .L13
.p2align 2,0x90
.L14:
movl 8(%ebp),%eax
leal 12(%ebp),%ecx
cltd
idivl (%ecx)
movl %eax,%ecx
imull 12(%ebp),%ecx
movl 8(%ebp),%edx
subl %ecx,%edx
movl %edx,%eax
jmp .L13
.p2align 2,0x90
.L13:
leave
ret

If you don't know Unix style ASM, it gets a little funny looking. Commands end with a size indicator: b = byte(8), w = word(16), l = long(32). All registers are referenced with a %, and the parameters are passed as SRC, DEST instead of DEST, SRC. Also, that CLTD instruction equates to CDQ.

I'm going to attribute the lack of speed increase to the parallel nature of today's CPUs. The IMUL and SUB are probably already fetched, decoded, and ready to execute by the time IDIV is done. Kind of depressing really, takes all the fun out of ASM... ;-)

I am, however, still confused about the the PB speed issue and use of '!'. If I use ! in front of the commands, and convert variable access to [esp + n] then the ASM version runs as fast as the PB version (and in some cases a few milliseconds faster.) But without the ! in front of *every* command, it slows back to about double the PB version. But, looking at the ASM output from pbcompiler.exe, not using ! simply generates the same code as when using the !... Like this:

MOV eax, dividend
CDQ
IDIV divisor
MOV dividend, edx

That code gets converted to this by pbcompiler.exe:

; MOV eax, dividend
MOV eax,dword [esp+0]
; CDQ
CDQ
; IDIV divisor
IDIV dword [esp+4]
; MOV dividend, edx
MOV dword [esp+0],edx

But that's exactally what I had to do when I used the !, like this:

!MOV eax, dword [esp + 0]
!CDQ
!IDIV dword [esp + 4]
!MOV dword [esp + 0], edx

It complies to the *same* code, so why is the version without the ! two times slower? This is very confusing.

Thanks,
Matthew

Posted: Thu Jul 10, 2003 4:36 pm
by Pupil
Are you running the PB tests with debugger turned on, if so this will steal some CPU cycles from the actual task of MOD:ing :)

Posted: Thu Jul 10, 2003 4:48 pm
by matthew180
Yes, debug is on, but it is on in both tests (ASM version and PB version), so it should wash out. Here is my test:

Code: Select all

#ITERATIONS = 1000000

; ASM signed.
time_start = gettickcount_()

For i = 1 To #ITERATIONS Step 1
  result = fl_mod_asm(i, 10)
Next

time_stop = gettickcount_()

Debug "ASM Signed Modulus, " + Str(#ITERATIONS) + " iterations: " + Str(time_stop - time_start) + " milliseconds."

The only thing I change between tests is the function call name in the loop... I think this is about as accurate as I can get, and both functions should be tested equally this way.

Like I said earlier, my C version was only showing about 3 milliseconds difference over 1 million iterations. The C code is faster, but after looking at PB's ASM output, I assume that is due to the heavy save/restore than PB is doing. Also, there is this funny loop which I'm not exactally sure why this has to happen:

Code: Select all

  MOV    esi,esp
  SUB    esp,8
  MOV    eax,esp
  MOV    edx,eax
  ADD    edx,8
_ClearLoop3:
  MOV    dword [eax],0
  ADD    eax,4
  CMP    eax,edx
  JNE   _ClearLoop3                                                                                                                     
  MOV    eax, dword [esi+24]
  MOV    dword [esp+0],eax
  MOV    eax, dword [esi+28]
  MOV    dword [esp+4],eax
It looks like the parameters are passed in at stack offsets 24 and 28, and this just moves them down to become zero based in the stack... Maybe I'm missing something, but is all this really necessary?

Matthew

Posted: Thu Jul 10, 2003 5:03 pm
by Fred
With debug ON, extra code are added to all, even inline asm one, that's why it's much slower. Always disable debugger for speed test as you don't know what pb does in debug mode (for example I check all divide against 0 division, which slow down a lot the runtime).

Posted: Thu Jul 10, 2003 5:08 pm
by Pupil
It might be that when debugger is turned ON the compiler inserts breakpoints for each line(Fred?) and that would explain why there's a speed difference between "normal" inline ASM versus the '!' inline ASM. This would also explain why the PB version is faster than the ASM version, because in the asm version you would get breakpoints after each asm command which when counted would add up to more breakpoints than the PB version...

If this assumptions of mine is totally wrong, i must say that i haven't got a clue of what's going on ;)

Posted: Thu Jul 10, 2003 5:38 pm
by matthew180
Well, turning off debug sure made a *huge* difference!

Basically I'm down to both versions executing 1 million iterations in 47 milliseconds, every time. This is down from about 331 milliseconds with debugging on. Also, with debugging on the times varied, but with it off the times are always 47 milliseconds. This is on my 1.4GHz workstation, which makes it faster than my C code on my 300MHz P2, as it should be.

On my other 900MHz Athlon FreeBSD box, the C code runs at 68134 microseconds for the ASM version, and 72458 microseconds for the C version with the extra IMUL and SUB. So, about 4.324 milliseconds difference, and still slower (but not as much) than the 1.4GHz P4, which is a good thing sice it is only running at 900MHz, and we like to see PB going just as fast as C! :-)

Thanks everyone for the info. I would, though, like to see the same 4.3 millisecond difference in PB, but I don't think I can get the timing that close... Is there a microsecond counter available to PB? Something like the Unix gettimeofday() function would be nice! ;-)

Matthew