PureBasic Forums - English

Posted: **Wed Sep 12, 2012 5:11 pm**

Hi guys!

Can anyone help me squeeze as much speed as possible out of the following image FilterCallback? Maybe some ASM code? (It has to run on both x86 and x64.)

Code: Select all

Procedure FilterCallback(x, y, SourceColor, TargetColor)
  
  r = Red(TargetColor)
  g = Green(TargetColor)
  b = Blue(TargetColor)
  a = Alpha(TargetColor)
  sa = Alpha(SourceColor)

  a = (a/255.0)*(sa)
  r = (r/255.0)*(sa)
  g = (g/255.0)*(sa)
  b = (b/255.0)*(sa)
  
  ProcedureReturn RGBA(r, g, b, a)
  
EndProcedure

The callback takes the alpha channel of the target and combines it with the source (alpha mask) using an image that already has pre-multiplied alpha.

Many thanks!

Chris.

Posted: **Wed Sep 12, 2012 5:24 pm**

Simple: don't use a callback. Just access the image data directly. A callback means a call for every pixel, which is extremely slow.

Posted: **Wed Sep 12, 2012 6:55 pm**

@Thorium:
With direct access you have to touch every pixel too.
FilterCallback() is not slow!

@PrincieD:

Code: Select all

Procedure FilterCallback(x, y, SourceColor, TargetColor)
	
	Protected.i R, G, B, A
	Protected S.f
	
	R = TargetColor & $FF
	G = TargetColor>>8 & $FF
	B = TargetColor>>16 & $FF
	A = TargetColor>>24 & $FF
	S = Alpha(SourceColor)/255.0
	
	R * S
	G * S
	B * S
	A * S
	
	ProcedureReturn R | G<<8 | B<<16 | A<<24
	
EndProcedure

Posted: **Wed Sep 12, 2012 7:06 pm**

Thanks very much STARGÅTE!

the speed has doubled now!
Would coding the same algorithm using asm make much of a difference?

Thanks!

Chris.

Posted: **Wed Sep 12, 2012 7:14 pm**

PrincieD wrote:Would coding the same algorithm using asm make much of a difference?

Yes, because then you can make optimal use of the registers!

Posted: **Wed Sep 12, 2012 7:50 pm**

STARGÅTE wrote:
PrincieD wrote:Would coding the same algorithm using asm make much of a difference?
Yes, because then you can make optimal use of the registers!

Cool cool

just wanted to confirm before spending lots of time coding in asm ugh lol

Thorium wrote:Simple: don't use a callback. Just access the image data directly. A callback means a call for every pixel, which is extremely slow.

Thanks Thorium, I thought this would be the case too, but I tried using GetDIBits/SetDIBits and it was really slow compared to PB's filter callback method.

Chris.

Posted: **Wed Sep 12, 2012 8:24 pm**

If you don't mind that the result can be off by 1 compared to the original function (like $62728292 instead of $63738393), you can try this:

Code: Select all

Procedure FilterCallback(x, y, SourceColor, TargetColor)
  !pxor xmm0, xmm0
  !movd xmm1, [p.v_TargetColor]
  !punpcklbw xmm0, xmm1
  !movd xmm1, [p.v_SourceColor]
  !psrld xmm1, 24
  !punpcklbw xmm1, xmm1
  !pshuflw xmm1, xmm1, 0
  !pmulhuw xmm0, xmm1
  !psrlw xmm0, 8
  !packuswb xmm0, xmm0
  !movd eax, xmm0
  ProcedureReturn
EndProcedure

If you want the same result as the original, you can try this instead. It's a little bit slower than the procedure above, but still faster than the non-ASM code:

Code: Select all

Procedure FilterCallback(x, y, SourceColor, TargetColor)
  !movd xmm0, [p.v_TargetColor]
  !pxor xmm1, xmm1
  !punpcklbw xmm0, xmm1
  !punpcklwd xmm0, xmm1
  !cvtdq2ps xmm0, xmm0
  !mov eax, dword [p.v_SourceColor]
  !shr eax, 24
  !cvtsi2ss xmm1, eax
  !mov eax, 255
  !cvtsi2ss xmm2, eax
  !divss xmm1, xmm2
  !pshufd xmm1, xmm1, 0
  !mulps xmm0, xmm1
  !cvtps2dq xmm0, xmm0
  !packssdw xmm0, xmm0
  !packuswb xmm0, xmm0
  !movd eax, xmm0
  ProcedureReturn
EndProcedure
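To see where the off-by-one in the first procedure comes from, here is a rough integer model of what it computes per channel, next to a rounded division by 255. The helper names FastChannel and RoundedChannel are made up for illustration, and the model assumes the original float code rounds to nearest when the result is stored back into an integer:

```purebasic
; Hypothetical helpers, not part of the thread's code.
; Quads (.q) keep the intermediate product in range on 32-bit builds too.

; Model of the first SSE2 procedure for one channel c and mask alpha a:
; the word c*256 is multiplied by the word a*257, pmulhuw keeps the
; high 16 bits and psrlw drops 8 more, so 24 bits are shifted out in total.
Procedure.q FastChannel(c.q, a.q)
  ProcedureReturn ((c << 8) * (a * 257)) >> 24
EndProcedure

; Integer form of round(c * a / 255), the assumed behaviour of the float version.
Procedure.q RoundedChannel(c.q, a.q)
  ProcedureReturn (c * a * 2 + 255) / 510
EndProcedure

Debug FastChannel(255, 255)    ; 254
Debug RoundedChannel(255, 255) ; 255
```

The fast path always truncates and its scale factor is slightly below 1/255, so a fully opaque mask already turns 255 into 254.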

Posted: **Wed Sep 12, 2012 9:06 pm**

wilbert wrote:If you don't mind the result can be 1 off compared to the original function (like $62728292 instead of $63738393), you can try this to see if it is faster
Code: Select all
Procedure FilterCallback(x, y, SourceColor, TargetColor)
  !pxor xmm0, xmm0
  !movd xmm1, [p.v_TargetColor]
  !punpcklbw xmm0, xmm1
  !movd xmm1, [p.v_SourceColor]
  !psrld xmm1, 24
  !punpcklbw xmm1, xmm1
  !pshuflw xmm1, xmm1, 0
  !pmulhuw xmm0, xmm1
  !psrlw xmm0, 8
  !packuswb xmm0, xmm0
  !movd eax, xmm0
  ProcedureReturn
EndProcedure

That's awesome wilbert!! really fast now

thanks!

I might try the GetDIBits/SetDIBits method again though (to avoid the JMP/RET every pixel, is that what PB is doing with the filtercallback?), I think I might have been doing something wrong.

My "rough n' ready" AlphaMaskBlit code (from the new ProGUI GDI drivers) currently looks like this anyway:

Code: Select all

  If timage = #Null
    timage = CreateImage(#PB_Any, w, h, 32)
  EndIf
  hdc = StartDrawing(ImageOutput(timage))
  BitBlt_(hdc, 0, 0, w, h, *src\sBuf\hdc, src\left, src\top, #SRCCOPY)
  DrawingMode(#PB_2DDrawing_CustomFilter)
  CustomFilterCallback(@FilterCallback())
  *dstmask.masks = *dstImg\mask
  test = GrabImage(*dstmask\handle, #PB_Any, dst\left-*dst\rc\left, dst\top-*dst\rc\top, w, h)
  DrawImage(ImageID(test), 0, 0)
  FreeImage(test)
  GdiAlphaBlend_(*dst\sBuf\hdc, dst\left, dst\top, dw, dh, hdc, 0, 0, w, h, $1000000 | alpha<<16)
  StopDrawing()

It copies the source image (with alpha already premultiplied) from a section of the source superbuffer image into a temporary image (timage) using BitBlt.
Then grabs the corresponding section of the alphamask and draws it into the temporary image using the filtercallback to combine the alphas.
Finally the temporary image is then alpha blended onto a section of the destination superbuffer using hardware accelerated GdiAlphaBlend.

I'm just wondering if there's a better way to do it. If I could get rid of any unnecessary blit operations, the speed could double again, hmm.

Thanks anyway guys, I really appreciate the help!

Chris.

Posted: **Wed Sep 12, 2012 9:50 pm**

PrincieD wrote:
Thorium wrote:Simple: don't use a callback. Just access the image data directly. A callback means a call for every pixel, which is extremely slow.
Thanks Thorium, I thought this would be the case too, but I tried using GetDIBits/SetDIBits and it was really slow compared to PB's filter callback method.

Chris.

Why use GetDIBits/SetDIBits? PB offers direct buffer access with: DrawingBuffer()

And callbacks are very slow: you have many unnecessary operations per pixel, including a call, which is always slow. The good thing about the filter callbacks is that they are easy to use, not that they are fast.

Without ASM the loop would look like this (untested):

Code: Select all


Structure Pixel_BGRA
  Blue.a
  Green.a
  Red.a
  Alpha.a
EndStructure

Procedure DoStuff(*SrcBuffer, *DstBuffer, Height.i, Width.i, SrcPitch.i, DstPitch.i)
  
  Protected *SrcPixel.Pixel_BGRA
  Protected *DstPixel.Pixel_BGRA
  Protected X.i
  Protected Y.i
  Protected S.f
  
  Height - 1
  
  For Y = 0 To Height
    
    *SrcPixel = *SrcBuffer + (SrcPitch * Y)
    *DstPixel = *DstBuffer + (DstPitch * Y)

    For X = 1 To Width
      
      S = *SrcPixel\Alpha / 255.0
      
      *DstPixel\Blue  = *DstPixel\Blue * S
      *DstPixel\Green = *DstPixel\Green * S
      *DstPixel\Red   = *DstPixel\Red * S
      *DstPixel\Alpha = *DstPixel\Alpha * S

      *SrcPixel + 4
      *DstPixel + 4
      
    Next

  Next
   
EndProcedure

Now you can speed this up by at least 5 times with ASM.

Posted: **Wed Sep 12, 2012 10:50 pm**

@Thorium:

The function DoStuff(...) is good if you want to apply it to a rectangle.
With a FilterCallback, all drawing functions work with the callback.

So if my mask is a circle, FilterCallback() works too:

Code: Select all

Enumeration
	#Window
	#Gadget
EndEnumeration


OpenWindow(#Window, 0, 0, 800, 600, "WindowTitle", #PB_Window_MinimizeGadget|#PB_Window_ScreenCentered)
CanvasGadget(#Gadget, 0, 0, WindowWidth(#Window), WindowHeight(#Window))

Procedure Invert(X, Y, SourceColor, TargetColor)
	ProcedureReturn ~TargetColor
EndProcedure

If StartDrawing(CanvasOutput(#Gadget))
	DrawingMode(#PB_2DDrawing_Gradient)
	LinearGradient(0,0,800,0) : GradientColor(0.0, $FF0000) : GradientColor(1.0, $0000FF)
	Box(0,0,800,600)
	DrawingMode(#PB_2DDrawing_CustomFilter)
	CustomFilterCallback(@Invert())
	Ellipse(300, 300, 250, 100)
	Ellipse(500, 300, 200, 250)
	StopDrawing()
EndIf

Repeat
Until WaitWindowEvent() = #PB_Event_CloseWindow

Another example with text:

Code: Select all

Enumeration
	#Window
	#Gadget
	#Font
EndEnumeration

OpenWindow(#Window, 0, 0, 768, 256, "WindowTitle", #PB_Window_MinimizeGadget|#PB_Window_ScreenCentered)
CanvasGadget(#Gadget, 0, 0, WindowWidth(#Window), WindowHeight(#Window))
LoadFont(#Font, "Arial", 62)

Procedure Rotate(X, Y, SourceColor, TargetColor)
	If SourceColor & $FF000000 And Random(1)
		ProcedureReturn RGB(Green(TargetColor),Blue(TargetColor),Red(TargetColor))
	Else
		ProcedureReturn TargetColor
	EndIf
EndProcedure

If StartDrawing(CanvasOutput(#Gadget))
	DrawingMode(#PB_2DDrawing_Gradient)
	LinearGradient(0,0,768,0) : GradientColor(0.0, $FF0000) : GradientColor(1.0, $0000FF)
	Box(0,0,768,256)
	DrawingFont(FontID(#Font))
	DrawingMode(#PB_2DDrawing_CustomFilter|#PB_2DDrawing_Transparent)
	CustomFilterCallback(@Rotate())
	DrawText(32, 64, "Some drawed text!")
	StopDrawing()
EndIf

Repeat
Until WaitWindowEvent() = #PB_Event_CloseWindow

Posted: **Wed Sep 12, 2012 11:08 pm**

@Thorium: Good code man works well at the same speed as STARGÅTE's, I might be able to get rid of one of the blit steps using this method and combined with wilbert's kick ass ASM should be plenty fast

@STARGÅTE: good point!

I'll experiment further with the different methods, at least all are faster than my original code anyway

Running on the Direct2D drivers, it handles 1000 balls at 60fps with no performance hit (0% CPU usage), as Direct2D supports alpha masks on the GPU.

So far wilbert's asm can handle 100 balls at 60fps (24% CPU usage) using the GDI drivers, not too bad - hopefully we can squeeze a bit more speed out of it!

Chris.

Posted: **Thu Sep 13, 2012 7:03 am**

Compared to my previous procedure, this is a little bit faster and shorter, but you will never get GPU speeds.
This code is the same for x86 and x64.

Code: Select all

Procedure FilterCallback(x, y, SourceColor, TargetColor)
  !movd xmm1, [p.v_SourceColor]
  !movd xmm0, [p.v_TargetColor]
  !punpcklbw xmm1, xmm1
  !punpcklbw xmm0, xmm0
  !pshuflw xmm1, xmm1, 0xff
  !pmulhuw xmm0, xmm1
  !psrlw xmm0, 8
  !packuswb xmm0, xmm0
  !movd eax, xmm0
  ProcedureReturn
EndProcedure
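As a rough check of what this shorter procedure computes per channel, it can be modelled with plain integer math. ScaleChannel is a made-up helper name and the model is only a sketch of the instruction sequence above, not a replacement for it:

```purebasic
; Hypothetical helper, not part of the thread's code.
; Quads (.q) are used so the product also fits on a 32-bit build.
Procedure.q ScaleChannel(c.q, a.q)
  ; punpcklbw reg, reg turns a byte v into the word v*257 for both the
  ; colour channel and the mask alpha. pmulhuw keeps the high 16 bits
  ; of the product and psrlw drops 8 more bits.
  ProcedureReturn ((c * 257) * (a * 257)) >> 24
EndProcedure

Debug ScaleChannel(255, 255) ; 255: an opaque mask keeps full intensity
Debug ScaleChannel(255, 0)   ; 0: a fully transparent mask clears the channel
```

The result is still truncated rather than rounded, so other inputs can differ by 1 from the float version.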

Combined with the approach from Thorium it could be something like this for x64.

Code: Select all

Procedure AlphaMultiply(*SrcBuffer, *DstBuffer, Height.i, Width.i, SrcPitch.i, DstPitch.i)
  
  !movq xmm2, [p.p_SrcBuffer]
  !movq xmm3, [p.p_DstBuffer]
  !movq xmm4, [p.v_SrcPitch]
  !movq xmm5, [p.v_DstPitch]
  
  !alpha_multiply_loop0:  
  !movq rax, xmm2
  !movq rdx, xmm3
  !mov rcx, qword [p.v_Width]
  
  !alpha_multiply_loop1:
  !movd xmm1, [rax]
  !movd xmm0, [rdx]
  !punpcklbw xmm1, xmm1
  !punpcklbw xmm0, xmm0
  !pshuflw xmm1, xmm1, 0xff
  !pmulhuw xmm0, xmm1
  !psrlw xmm0, 8
  !packuswb xmm0, xmm0
  !movd [rdx], xmm0
  !add rax, 4
  !add rdx, 4
  !dec rcx
  !jnz alpha_multiply_loop1
  
  !paddq xmm2, xmm4
  !paddq xmm3, xmm5
  !dec qword [p.v_Height]
  !jnz alpha_multiply_loop0
  
EndProcedure

For x86, using MMX registers with SSE instructions instead of SSE2, so it also runs on older hardware:

Code: Select all

Procedure AlphaMultiply(*SrcBuffer, *DstBuffer, Height.i, Width.i, SrcPitch.i, DstPitch.i)
  
  !movd mm2, [p.p_SrcBuffer]
  !movd mm3, [p.p_DstBuffer]
  !movd mm4, [p.v_SrcPitch]
  !movd mm5, [p.v_DstPitch]
  
  !alpha_multiply_loop0:  
  !movd eax, mm2
  !movd edx, mm3
  !mov ecx, dword [p.v_Width]
  
  !alpha_multiply_loop1:
  !movd mm1, [eax]
  !movd mm0, [edx]
  !punpcklbw mm1, mm1
  !punpcklbw mm0, mm0
  !pshufw mm1, mm1, 0xff
  !pmulhuw mm0, mm1
  !psrlw mm0, 8
  !packuswb mm0, mm0
  !movd [edx], mm0
  !add eax, 4
  !add edx, 4
  !dec ecx
  !jnz alpha_multiply_loop1
  
  !paddd mm2, mm4
  !paddd mm3, mm5
  !dec dword [p.v_Height]
  !jnz alpha_multiply_loop0
  
  !emms
  
EndProcedure

Posted: **Thu Sep 13, 2012 4:19 pm**

Note: You can access a rectangle of the destination and source image without any performance hit by adjusting the values of *DstBuffer, *SrcBuffer (first pixels to be processed) and DstPitch, SrcPitch (length of an image line in memory). Just make the buffer pointer point to the first pixel and calculate the pitch so that it will go to the pixel in the same column in the next line.
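A minimal sketch of that pointer adjustment, assuming a 32-bit (4 bytes per pixel) buffer whose rows are stored top-down. PixelAddress is a made-up helper name and the plain memory block only stands in for a real drawing buffer:

```purebasic
; Hypothetical helper: address of pixel (X, Y) in a 32-bit pixel buffer.
Procedure.i PixelAddress(*Buffer, Pitch.i, X.i, Y.i)
  ProcedureReturn *Buffer + Y * Pitch + X * 4
EndProcedure

Define Width = 8, Height = 4
Define Pitch = Width * 4                     ; bytes per image line in this stand-in buffer
Define *Buffer = AllocateMemory(Pitch * Height)

If *Buffer
  ; First pixel of a rectangle whose top-left corner is at (2, 1).
  Define *Start = PixelAddress(*Buffer, Pitch, 2, 1)
  Debug *Start - *Buffer                     ; 40 (1 * 32 + 2 * 4)
  FreeMemory(*Buffer)
EndIf
```

The adjusted pointer is then passed as *SrcBuffer or *DstBuffer together with the rectangle's width and height, while the pitch stays that of the full image line. With a real image the values would come from DrawingBuffer() and DrawingBufferPitch(), and DrawingBufferPixelFormat() should be checked first, since the buffer can be stored bottom-up (#PB_PixelFormat_ReversedY).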

STARGÅTE wrote: So if my mask is a circle FilterCallback() works too:

Yes, with different shapes it becomes a little bit more complicated, but not too much.
The big plus of the code is that you can process multiple pixels at once with smart SIMD code. Even without SIMD you can unroll the loop and process multiple pixels per iteration to reduce loop overhead.

Posted: **Fri Sep 14, 2012 5:07 am**

Thanks for the help guys! Especially wilbert for his excellent ASM (it would have taken me forever to attempt this and it probably wouldn't be as fast!). I think I'll go with the filtercallback method as this seems the most versatile and runs quickest with the new smaller ASM algorithm. Thorium, thanks for pointing me in the direction of SIMD too - this could potentially be even quicker but I think it may be a bit over my head lol (might try and experiment tomorrow, I'm pretty tired now heh).

Cheers!

Chris.

Posted: **Fri Sep 14, 2012 7:11 am**

Glad to help

And when it comes to SIMD, there are different ways you could approach the problem.
In a loop, you could process multiple pixels at once and that might be faster.
The callback as it is now also uses SIMD instructions, but in this case to handle the four color channels at once.

Help optimising small FilterCallback