Calling a "PB" routine compared to a "C" routine
Posted: Tue Jun 05, 2007 2:59 pm
As announced, I released a PB port of the AviSynth C interface, so I am now able to write video-editing plugins in PB.
Besides the video-editing filter concept itself, speed is a very important point. I used a simple "invert" routine in which every byte of the pixel array is inverted ("pix = ~pix" or "pix = 255-pix").
I noticed that the C code compiled with VC++ Express gives me about 40% CPU load when processing the video during the realtime preview.
The PB version of the same routine reaches 100% CPU load, so I checked my PB code, and it seems OK.
The main routine that loads the processor is a very simple invert-in-place routine performed on a two-dimensional pixel array.
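Code:
Structure pPixel
  pix.b[0]            ; zero-length byte array, used to index raw memory
EndStructure

; *pData is declared as *pData.pPixel
For y = 0 To height-1
  For x = 0 To width-1
    *pData\pix[x] = ~*pData\pix[x]
  Next
  *pData + width      ; advance to the next row
Next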
The same issue shows up when using the Peek/Poke approach:
Code:
For y = 0 To height-1
  For x = 0 To width-1
    PokeB(*pData+x, ~PeekB(*pData+x))
  Next
  *pData + width      ; advance to the next row
Next
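The invert operation appears in three spellings in this thread: "pix = ~pix", "pix = 255-pix", and the "pData[x] ^= 255" of the C routine. As a small sanity check that all three give the same result for an unsigned byte, here is a minimal C sketch. The function names are made up for illustration and are not part of any of the DLLs.

```c
#include <stdint.h>

/* Three ways to invert one unsigned 8-bit pixel value. */
static uint8_t invert_not(uint8_t pix) { return (uint8_t)~pix; }
static uint8_t invert_sub(uint8_t pix) { return (uint8_t)(255 - pix); }
static uint8_t invert_xor(uint8_t pix) { return (uint8_t)(pix ^ 255); }

/* Returns 1 if all three forms agree for every possible byte value, else 0. */
static int invert_forms_agree(void)
{
    for (int v = 0; v <= 255; v++) {
        uint8_t p = (uint8_t)v;
        if (invert_not(p) != invert_sub(p) || invert_sub(p) != invert_xor(p))
            return 0;
    }
    return 1;
}
```

So the choice between the three forms should not change the result, only (possibly) the code the compiler generates.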
So I set up a test environment in which the same code, using the StdCall convention, is compiled as DLLs with the following compilers: PureBasic, VC++ Express, and the GCC shipped with wxDevC++.
The PureBasic DLL code:
Code:
Structure pPixel
  pix.b[0]
EndStructure

; StdCall
ProcedureDLL Test_PB(*pData, width.l, height.l)
  For y = 0 To height-1
    For x = 0 To width-1
      PokeB(*pData+x, ~PeekB(*pData+x))
    Next
    *pData + width
  Next
EndProcedure

ProcedureDLL Test_PB2(*pData.pPixel, width.l, height.l)
  For y = 0 To height-1
    For x = 0 To width-1
      *pData\pix[x] = ~*pData\pix[x]
    Next
    *pData + width
  Next
EndProcedure
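The VC++ Express and GCC DLL code (for GCC I used Test_GCC as the symbol name):
Code:
#include <windows.h>

__declspec(dllexport) void __stdcall Test_VC(BYTE *pData, int width, int height)
{
    int x, y;
    for (y = 0; y < height; y++) {
        for (x = 0; x < width; x++)
            pData[x] ^= 255;      /* invert one byte in place */
        pData += width;           /* advance to the next row */
    }
}
In VC++ Express I chose the Multithreaded DLL output and left the optimization at its default (maximize speed). It was compiled as C code (/TC).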
In wxDevC++ I chose a C project and left everything else at default.
Finally, I used PB code that imports the DLLs and measures the speed of the external routines:
Code:
Prototype proto_Test_VC(*p, w.l, h.l)
Prototype proto_Test_PB(*p, w.l, h.l)
Prototype proto_Test_PB2(*p, w.l, h.l)
Prototype proto_Test_GCC(*p, w.l, h.l)

; The export name must match the C source exactly (it is case-sensitive)
OpenLibrary(0,"D:\test\VC_test\release\test.dll")
Test_VC.proto_Test_VC = GetFunction(0, "_Test_VC@12")

OpenLibrary(1,"D:\test\PB_test\test_PB.dll")
Test_PB.proto_Test_PB = GetFunction(1, "Test_PB")
Test_PB2.proto_Test_PB2 = GetFunction(1, "Test_PB2")

OpenLibrary(2,"D:\test\GCC_Test\Output\MingW\Test_GCC.dll")
Test_GCC.proto_Test_GCC = GetFunction(2, "Test_GCC@12")

y.l=0
x.l=0
row_size = 100*720*(SizeOf(Long))
pitch = row_size
height = 576

; Dim allocates longs, so the buffer is larger than the row_size*height bytes processed
Dim Pic.l(row_size,height)
*pData = @Pic()

; Time one call of each routine on the same buffer
temp = ElapsedMilliseconds()
Test_VC(*pData, row_size, height)
timeVC = ElapsedMilliseconds()-temp

temp = ElapsedMilliseconds()
Test_PB(*pData, row_size, height)
timePB = ElapsedMilliseconds()-temp

temp = ElapsedMilliseconds()
Test_PB2(*pData, row_size, height)
timePB2 = ElapsedMilliseconds()-temp

temp = ElapsedMilliseconds()
Test_GCC(*pData, row_size, height)
timeGCC = ElapsedMilliseconds()-temp

mess.s + "VC 80 : "+Str(timeVC)+Chr(13)
mess + "GCC : "+Str(timeGCC)+Chr(13)
mess + "PB using poke: "+Str(timePB)+Chr(13)
mess + "PB using pointer: "+Str(timePB2)+Chr(13)
MessageRequester("Info", mess)

CloseLibrary(0)
CloseLibrary(1)
CloseLibrary(2)
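The test above calls each routine exactly once, so a single measurement is sensitive to whatever else the machine happens to be doing. One common way to steady the numbers is to average over several calls. The following is only a sketch of that idea in plain C, not the harness used for the results in this post; invert_plane and time_routine_ms are hypothetical names, and clock() measures CPU time rather than the wall-clock time ElapsedMilliseconds() reports.

```c
#include <stdint.h>
#include <time.h>

/* Same loop as the posted C routine, without the DLL export decoration. */
static void invert_plane(uint8_t *data, int width, int height)
{
    for (int y = 0; y < height; y++) {
        for (int x = 0; x < width; x++)
            data[x] ^= 255;
        data += width;
    }
}

typedef void (*invert_fn)(uint8_t *, int, int);

/* Hypothetical helper: average CPU time per call, in milliseconds,
   measured over `runs` consecutive calls on the same buffer. */
static double time_routine_ms(invert_fn fn, uint8_t *data,
                              int width, int height, int runs)
{
    clock_t start = clock();
    for (int i = 0; i < runs; i++)
        fn(data, width, height);
    clock_t stop = clock();
    return 1000.0 * (double)(stop - start) / CLOCKS_PER_SEC / runs;
}
```

Calling time_routine_ms(invert_plane, buffer, row_size, height, 10) would give the mean time per pass. With an even number of runs the buffer ends up back in its original state, since inverting twice is a no-op.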
So here are my results in ms (debugger off):
VC 80            : 375
GCC              : 265
PB using poke    : 2093
PB using pointer : 2704
WOW!

Did I miss something here? This is exactly what AviSynth relies on: processing pixel arrays with such For/Next loops over 2D arrays. Not to mention what would happen with more complex tasks.
In the results above, the PB build is slower than the VC and GCC output by a factor of roughly 5.6 to 10. I do hope this comes down to an error of mine in the test approach.
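Working the slowdown factors out from the four timings posted above, as a small C sketch (the constant and function names are just for illustration):

```c
#include <stdio.h>

/* Timings in ms as posted (debugger off). */
enum { VC_MS = 375, GCC_MS = 265, PB_POKE_MS = 2093, PB_PTR_MS = 2704 };

/* How many times slower `slow_ms` is than `fast_ms`. */
static double slowdown(int slow_ms, int fast_ms)
{
    return (double)slow_ms / fast_ms;
}

/* Prints each PB timing relative to each C timing. */
static void print_factors(void)
{
    printf("PB poke    vs VC  : %.1f\n", slowdown(PB_POKE_MS, VC_MS));
    printf("PB poke    vs GCC : %.1f\n", slowdown(PB_POKE_MS, GCC_MS));
    printf("PB pointer vs VC  : %.1f\n", slowdown(PB_PTR_MS, VC_MS));
    printf("PB pointer vs GCC : %.1f\n", slowdown(PB_PTR_MS, GCC_MS));
}
```

That gives about 5.6 and 7.9 for the Peek/Poke version and about 7.2 and 10.2 for the pointer version, against VC and GCC respectively.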

I also see that in PB the Peek/Poke approach is faster than accessing the bytes via a pointer. Hmmm ... I thought it would be the opposite!?
Your comments ....
Greets
Inc.
PS: It would be interesting to see how fast the For/Next part of the first PB DLL routine above would be when coded in inline ASM.
Could someone write a PB-ASM version of the PB routine?
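This is not the PB inline-ASM version asked for, only a C sketch of one idea such a version might use: inverting four bytes per loop iteration instead of one. Whether this accounts for any of the gap measured above is an assumption and was not tested here. invert_bytes_wide is a made-up name.

```c
#include <stdint.h>
#include <string.h>

/* Invert `count` bytes in place, four at a time where possible. */
static void invert_bytes_wide(uint8_t *data, size_t count)
{
    size_t i = 0;
    for (; i + 4 <= count; i += 4) {
        uint32_t word;
        memcpy(&word, data + i, 4);   /* alignment-safe 32-bit load */
        word = ~word;                 /* flips all four bytes at once */
        memcpy(data + i, &word, 4);   /* alignment-safe 32-bit store */
    }
    for (; i < count; i++)            /* leftover 0..3 bytes */
        data[i] = (uint8_t)~data[i];
}
```

Because the rows in the test are contiguous (each row starts width bytes after the previous one), the whole buffer could be handled with a single call such as invert_bytes_wide(pData, (size_t)width * height).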