Calling a "PB" Routine compared to a "C"

inc. · Post by **inc.** » Tue Jun 05, 2007 2:59 pm

As announced, I released a PB port of the avisynth C interface, so I am now able to write video-editing plugins in PB.
Besides the filter concept itself, speed is a very important point. I used a simple "invert" routine in which every byte in the pixel array is inverted: "pix = ~pix" or "pix = 255-pix".
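As a quick sanity check that these invert forms really are interchangeable for 8-bit samples (including the "^= 255" form used in the C routine further down), here is a minimal C sketch. The function names are made up for illustration and are not part of the avisynth interface.

```c
#include <assert.h>
#include <stdint.h>

/* Three ways to invert one 8-bit sample (illustrative names only). */
static uint8_t invert_not(uint8_t pix) { return (uint8_t)~pix; }
static uint8_t invert_sub(uint8_t pix) { return (uint8_t)(255 - pix); }
static uint8_t invert_xor(uint8_t pix) { return (uint8_t)(pix ^ 255); }

/* Asserts that all three forms agree for every possible byte value. */
static void check_invert_forms(void)
{
    for (int v = 0; v <= 255; v++) {
        uint8_t p = (uint8_t)v;
        assert(invert_not(p) == invert_sub(p));
        assert(invert_not(p) == invert_xor(p));
    }
}
```

Calling check_invert_forms() once at startup is enough; the casts matter because C promotes the byte to int before applying ~.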

I noticed that the C code compiled with VC++ Express causes about 40% CPU load when processing the video during the realtime preview.
The PB version of the same routine reaches 100% CPU load, so I checked my PB code, and it seems OK.

The main routine that loads the processor is a very simple invert-in-place routine performed on a two-dimensional pixel array.

Code: Select all

Structure pPixel
  pix.b[0]
EndStructure

For y = 0 To height-1
  For x = 0 To width-1
    *pData\pix[x] = ~*pData\pix[x]
  Next
  *pData + width
Next

The same issue occurs when using the Peek/Poke approach ...

Code: Select all

For y = 0 To height-1
  For x = 0 To width-1
    PokeB(*pData+x, ~PeekB(*pData+x))
  Next
  *pData + width
Next

So I set up a test environment in which the same code, using the StdCall convention, is compiled as DLLs with the following compilers: PureBasic, VC++ Express, and the GCC shipped with wxDevC++.

The Purebasic Dll code:

Code: Select all

Structure pPixel
  pix.b[0]
EndStructure
; StdCall

ProcedureDLL Test_PB(*pData, width.l, height.l)
  For y = 0 To height-1
    For x = 0 To width-1
      PokeB(*pData+x, ~PeekB(*pData+x))
    Next
    *pData + width
  Next
EndProcedure 

ProcedureDLL Test_PB2(*pData.pPixel, width.l, height.l)
  For y = 0 To height-1
    For x = 0 To width-1
      *pData\pix[x] = ~*pData\pix[x]
    Next
    *pData + width
  Next
EndProcedure

The VC++Express and GCC Dll code:
(on GCC I used Test_GCC as symbol name)

Code: Select all

#include <windows.h>
__declspec(dllexport) void __stdcall Test_VC(BYTE * pData, int width, int height )
{
	int x,y ;
	for (y = 0; y < height; y++) {
		for (x = 0; x < width; x++)
			pData[x] ^= 255;
		pData += width;
	}
}

In VC++ Express I chose Multithreaded DLL output and left the optimization at the default (maximise speed). Compiled as C code (/TC).
In wxDevC++ I chose a C project and left everything else at default.

Finally, I used PB code that imports the DLLs and measures the speed of the external routines:

Code: Select all

Prototype proto_Test_VC(*p, w.l, h.l)
Prototype proto_Test_PB(*p, w.l, h.l)
Prototype proto_Test_PB2(*p, w.l, h.l)
Prototype proto_Test_GCC(*p, w.l, h.l)

OpenLibrary(0,"D:\test\VC_test\release\test.dll")
Test_VC.proto_Test_VC = GetFunction(0, "_test_VC@12")

OpenLibrary(1,"D:\test\PB_test\test_PB.dll")
Test_PB.proto_Test_PB = GetFunction(1, "Test_PB")
Test_PB2.proto_Test_PB2 = GetFunction(1, "Test_PB2")

OpenLibrary(2,"D:\test\GCC_Test\Output\MingW\Test_GCC.dll")
Test_GCC.proto_Test_GCC = GetFunction(2, "Test_GCC@12")

    
y.l=0
x.l=0
row_size = 100*720*(SizeOf(Long))
pitch = row_size
height = 576
    
Dim Pic.l(row_size,height)
*pData = @Pic()
    
temp = ElapsedMilliseconds()
    
Test_VC(*pData, row_size, height)
    
timeVC = ElapsedMilliseconds()-temp
    
temp = ElapsedMilliseconds()
    
Test_PB(*pData, row_size, height)
    
timePB = ElapsedMilliseconds()-temp

temp = ElapsedMilliseconds()
    
Test_PB2(*pData, row_size, height)
    
timePB2 = ElapsedMilliseconds()-temp

temp = ElapsedMilliseconds()
    
Test_GCC(*pData, row_size, height)
    
timeGCC = ElapsedMilliseconds()-temp
    
mess.s + "VC 80 : "+Str(timeVC)+Chr(13)
mess +   "GCC : "+Str(timeGCC)+Chr(13)
mess +   "PB using poke: "+Str(timePB)+Chr(13)
mess +   "PB using pointer: "+Str(timePB2)+Chr(13)

MessageRequester("Info", mess)
    
CloseLibrary(0)
CloseLibrary(1)
CloseLibrary(2)

So here are my results in ms (debugger off):

VC 80 : 375
GCC : 265
PB using poke : 2093
PB using pointer : 2704

WOW!

Did I miss something here? I mean, this is what avisynth relies on: processing pixel arrays with For/Next loops like these over 2D arrays. Not to mention what would happen with more complex tasks.
In the results above, the PB build is roughly 5.5 to 10 times slower than the VC or GCC output. I do hope this is due to an error of mine in the test approach.

I also see that in PB the Peek/Poke approach is faster than accessing the bytes via a pointer. Hmmm ... I thought it would be the opposite!?
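One way to cross-check a measurement like this is to time each routine over several runs inside a single process. The following is only a sketch in C, not the harness used for the numbers above: invert_fn and time_invert_ms are hypothetical names, and clock() reports CPU time rather than the wall-clock time that ElapsedMilliseconds() returns.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Signature shared by the routines under test (hypothetical name). */
typedef void (*invert_fn)(uint8_t *data, int width, int height);

/* Plain byte-wise invert, same shape as the C routine in this thread. */
static void invert_bytes(uint8_t *data, int width, int height)
{
    for (int y = 0; y < height; y++) {
        for (int x = 0; x < width; x++)
            data[x] ^= 255;
        data += width;
    }
}

/* Runs fn `runs` times on the buffer and returns the elapsed CPU time in ms. */
static double time_invert_ms(invert_fn fn, uint8_t *data,
                             int width, int height, int runs)
{
    assert(fn != NULL && data != NULL && runs > 0);
    clock_t start = clock();
    for (int i = 0; i < runs; i++)
        fn(data, width, height);
    return 1000.0 * (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

Because the routine works in place, an even number of runs leaves the buffer as it was, which makes it easy to verify that a routine under test really touched every byte it should.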

Your comments ....

Greets
Inc.

PS: It would be interesting to see how fast the For/Next part of the first PB DLL code above would be when written in inline ASM.
Could someone write a PB inline-ASM version of it?

inc. · Post by **inc.** » Tue Jun 05, 2007 3:14 pm

Code: Select all

ProcedureDLL Test_PB(*pData, width.l, height.l)
  For y = 0 To height-1
    For x = 0 To width-1
      PokeB(*pData+x, ~PeekB(*pData+x))
    Next
    *pData + width
  Next
EndProcedure

... is translated by PB to asm like this ...

Code: Select all

; ProcedureDLL Test_PB(*pData, width.l, height.l)
macro MP0{
_Procedure0:
  PUSH   ebx
  PS0=16
  XOR    eax,eax
  PUSH   eax
  PUSH   eax
; For y = 0 To height-1
  MOV    dword [esp],0
_For1:
  MOV    ebx,dword [esp+PS0+8]
  DEC    ebx
  CMP    ebx,dword [esp]
  JL    _Next2
; For x = 0 To width-1
  MOV    dword [esp+4],0
_For3:
  MOV    ebx,dword [esp+PS0+4]
  DEC    ebx
  CMP    ebx,dword [esp+4]
  JL    _Next4
; PokeB(*pData+x, ~PeekB(*pData+x))
  MOV    ebx,dword [esp+PS0+0]
  ADD    ebx,dword [esp+4]
  MOV    eax,ebx
  CALL   PB_PeekB
  MOV    ebx,eax
  NOT    ebx
  PUSH   ebx
  MOV    ebx,dword [esp+PS0+4]
  ADD    ebx,dword [esp+8]
  MOV    eax,ebx
  CALL   PB_PokeB
; Next
_NextContinue4:
  INC    dword [esp+4]
  JMP   _For3
_Next4:
; *pData + width
  MOV    ebx,dword [esp+PS0+0]
  ADD    ebx,dword [esp+PS0+4]
  MOV    dword [esp+PS0+0],ebx
; Next
_NextContinue2:
  INC    dword [esp]
  JMP   _For1
_Next2:
; EndProcedure 
  XOR    eax,eax
_EndProcedure1:
  ADD    esp,8
  POP    ebx
  RET    12
}
;

and here's VC++Express' Output :

Code: Select all

__declspec(dllexport) void __stdcall Test_VC(BYTE * pData, int width, int height )
{
   int x,y ;
   for (y = 0; y < height; y++) {
      for (x = 0; x < width; x++)
         pData[x] ^= 255;
      pData += width;
   }
}

... which is compiled to ...

Code: Select all

EXTRN	@__security_check_cookie@4:PROC
PUBLIC	_test_VC@12
; Function compile flags: /Ogtpy
;	COMDAT _test_VC@12
_TEXT	SEGMENT
_pData$ = 8						; size = 4
_width$ = 12						; size = 4
_height$ = 16						; size = 4
_test_VC@12 PROC					; COMDAT
; File d:\test\vc_test\test\test.cpp
; Line 36
	mov	eax, DWORD PTR _height$[esp-4]
	test	eax, eax
	jle	SHORT $LN4@test_VC
	mov	ecx, DWORD PTR _pData$[esp-4]
	push	esi
	mov	esi, DWORD PTR _width$[esp]
	push	edi
	mov	edi, eax
$LL6@test_VC:
; Line 37
	xor	eax, eax
	test	esi, esi
	jle	SHORT $LN1@test_VC
	npad	6
$LL3@test_VC:
; Line 38
	mov	dl, BYTE PTR [eax+ecx]
	not	dl
	mov	BYTE PTR [eax+ecx], dl
	add	eax, 1
	cmp	eax, esi
	jl	SHORT $LL3@test_VC
$LN1@test_VC:
; Line 39
	add	ecx, esi
	sub	edi, 1
	jne	SHORT $LL6@test_VC
	pop	edi
	pop	esi
$LN4@test_VC:
; Line 41
	ret	12					; 0000000cH
_test_VC@12 ENDP
_TEXT	ENDS

Trond · Post by **Trond** » Tue Jun 05, 2007 3:24 pm

inc. wrote:
PB using poke : 2093
PB using pointer : 2704

That was weird. Pointer is faster here (it should also be faster in theory).

inc. · Post by **inc.** » Tue Jun 05, 2007 4:02 pm

Hmmm, I'm using PB 4.02.

In the properties of the PB DLL project I switched from "all CPU" to "SSE2", and the result is this:

PB 40 using poke : 3125
PB 40 using pointer : 2297

Now when using "mmx" ...

PB 40 using poke : 2062
PB 40 using pointer : 2297

"Dynamic CPU" ...

PB 40 using poke : 3094
PB 40 using pointer : 2328

hmmmm weird ....

but anyhow ..... PB's external routine is still almost 9 times slower.

EDIT:

After installing Deem's Optimizer for PB 4.0
http://www.purebasic.fr/english/viewtop ... er&start=0

I get this when using "dynamic" CPU ...

PB 40 using poke : 3172
PB 40 using pointer : 1734

But that is still slow compared to 344 for VC80 or 235 for GCC.

Fred · Post by **Fred** » Tue Jun 05, 2007 6:12 pm

A faster one:

Code: Select all

ProcedureDLL Test_PB2(*pData.pPixel, width.l, height.l)
  width-1
  height-1
  For y = 0 To height
    *Cursor.BYTE = *pData
    For x = 0 To width
      *Cursor\b = ~*Cursor\b
      *Cursor+1
    Next
    *pData + width + 1 ; width was decremented above, so add 1 to advance a full row
  Next
EndProcedure

Trond · Post by **Trond** » Tue Jun 05, 2007 6:48 pm

Since you are allocating the memory in a multiple of sizeof(long) there is no reason to do the bytes individually (I think). This code is very fast:

Code: Select all

ProcedureDLL Test_PB4(*pData.Long, width.l, height.l)
  ; No "- 1" adjustments here: the end address must cover all width*height bytes
  width = width*height+*pData ; width now holds the end address
  While *pData < width
    *pData\l = ~*pData\l
    *pData + SizeOf(Long)
  Wend
EndProcedure
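For comparison, the same idea (inverting four bytes per step instead of one) can be sketched in C. This is an illustration under assumptions, not code that was benchmarked in this thread: invert_words is a hypothetical name, the rows are assumed to be contiguous (pitch equal to width, as in the test above), and a byte-wise tail loop is added so the total size does not have to be a multiple of four.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Inverts width*height contiguous bytes, four at a time where possible. */
static void invert_words(uint8_t *data, size_t width, size_t height)
{
    assert(data != NULL);
    size_t total = width * height;
    size_t i = 0;

    /* Main loop: load, invert and store one 32-bit word per iteration.
       memcpy keeps the access valid even if data is not 4-byte aligned. */
    for (; i + 4 <= total; i += 4) {
        uint32_t w;
        memcpy(&w, data + i, sizeof w);
        w = ~w;
        memcpy(data + i, &w, sizeof w);
    }

    /* Tail: any remaining 1 to 3 bytes are inverted individually. */
    for (; i < total; i++)
        data[i] = (uint8_t)~data[i];
}
```

An optimizing C compiler may already vectorize the plain byte loop by itself, so how much this helps in C depends on the compiler and its settings.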

inc. · Post by **inc.** » Tue Jun 05, 2007 6:49 pm

OK, let's see ...

(PB DLL compiled using "All CPU", Deem's Optimizer off)

VC 80 : 328
GCC : 234
PB 40 using poke : 2032
PB 40 using pointer : 2281
PB 40 using Freds : 1469
PB 40 using Tronds : 390

Ah, ...

but it's a somewhat more complicated approach ....

Trond · Post by **Trond** » Tue Jun 05, 2007 7:04 pm

Inline asm for bytes:

Code: Select all

ProcedureDLL Test_PB5(pData, width, height)
  ; No "- 1" adjustments: the end address must cover all width*height bytes
  !mov eax, [p.v_width]
  !mov ecx, [p.v_height]
  !mov edx, [p.v_pData]
  !imul ecx, eax
  !add ecx, edx
  !l_pb5_loop1:
  !not byte [edx]
  !add edx, 1
  !cmp edx, ecx
  !jl l_pb5_loop1
EndProcedure

Inline asm for longs:

Code: Select all

ProcedureDLL Test_PB5(pData, width, height)
  ; No "- 1" adjustments: the end address must cover all width*height bytes
  !mov eax, [p.v_width]
  !mov ecx, [p.v_height]
  !mov edx, [p.v_pData]
  !imul ecx, eax
  !add ecx, edx
  !align 4
  !l_pb5_loop1:
  !not dword [edx]
  !add edx, 4
  !cmp edx, ecx
  !jl l_pb5_loop1
EndProcedure

kinglestat · Post by **kinglestat** » Thu Jul 12, 2007 8:52 am

Trond: I have a question

where did you get the \l from on the line
*pData\l = ~*pData\l

I'm still green on PB pointers

Thanks

KingLestat

kinglestat · Post by **kinglestat** » Thu Jul 12, 2007 8:59 am

inc: I am curious what would happen if you optimize the C function with the trond way?

eesau · Post by **eesau** » Thu Jul 12, 2007 9:03 am

Kinglestat: the \L is a field of the .LONG-structure. Looking at the procedure parameters, you can see that *pData is defined as a structured pointer, *pData.LONG.

kinglestat · Post by **kinglestat** » Thu Jul 12, 2007 10:41 am

yes I can see that
but why not simply use *pData.l ?
.l is long no?

eesau · Post by **eesau** » Thu Jul 12, 2007 11:01 am

kinglestat wrote:
yes I can see that
but why not simply use *pData.l ?
.l is long no?

Yes it is, but that would be trying to change the pointer type, which, pardon the pun, is pointless (and as far as I can tell, not possible). You could use just *pData to point to the data, but then you would have to use peeks and pokes to manipulate the data in memory, which is slower than using a structured pointer.

I'm a fairly new PB user, so I'm really not 100% sure about all of this -- feel free to correct me if I'm wrong.

inc. · Post by **inc.** » Thu Jul 12, 2007 11:53 am

kinglestat wrote:
inc: I am curious what would happen if you optimize the C function with the trond way?

To me the function is not really optimized but written directly in ASM, which is a totally different approach.

For complex math routines or functions, direct ASM is sometimes the way to go if you have ASM skills, but a simple loop within a simple loop should come out somewhat faster in the ASM code the PB compiler generates. In C I also don't need ASM skills to make such a routine run faster, because the C compiler already outputs more optimized code.

I am aware that PureBasic doesn't come with an optimizing compiler, but ... forgive me, isn't it more important to get the compiler's output better optimized than to add more native libs like drag'n'drop etc. to PB?

No offense to the devs at all! This opinion is written with all my respect for the development of, and overall sympathy for, PureBasic.