Code optimisation or translate to assembly?

Didaktik · Post by **Didaktik** » Thu Sep 29, 2016 9:44 pm

I'm writing a converter from the ZX Spectrum screen format to RGBA.

I need it to be as fast as possible. I build a table of all variants of pixel and color, but can it be accelerated further?

Code:


Procedure InitAttributesTable()
  
  For pixels = 0 To 255 ; integer counter: a .a (unsigned byte) counter wraps past 255, so the loop would never end
    For attr = 0 To 255
      
      Paper = (attr >> 3) & $0F 
      ink   = (attr & 7)  | ((attr & 64) >> 3)
      
      bit = 128
      For z = 0 To 7
        
        If pixels & bit
          
          color.l = color( ink )
          
        Else  
          
          color = color( paper )
          
        EndIf
        
        c.l = $FF000000 | Red(color) << 16 | Green(color) << 8 | Blue(color)
        PokeL(attributes_table + ((((pixels&bit) << 8) | attr )<<2), c )
                        
        bit >> 1
      Next z 
            
    Next attr
    
  Next pixels    
    
EndProcedure
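
The table idea above can be sketched compactly in Python. This is only an illustration under assumptions: the palette values are hypothetical (the thread's color() array is not shown), and instead of the (bitmask << 8) | attr indexing used above it stores one (paper, ink) pair per attribute byte. The names make_palette and build_attr_table are made up for this sketch.

```python
def make_palette(normal=0xC0, bright=0xFF):
    """Hypothetical 16-entry palette as 0xRRGGBB: entries 0-7 normal, 8-15 bright."""
    palette = []
    for level in (normal, bright):
        for i in range(8):
            # ZX colour number: bit 0 = blue, bit 1 = red, bit 2 = green
            r = level if i & 2 else 0
            g = level if i & 4 else 0
            b = level if i & 1 else 0
            palette.append((r << 16) | (g << 8) | b)
    return palette


def build_attr_table(palette):
    """Return table[attr] = (paper_argb, ink_argb) for all 256 attribute bytes."""
    table = []
    for attr in range(256):
        paper = (attr >> 3) & 0x0F             # paper bits 3-5, bright bit becomes bit 3
        ink = (attr & 7) | ((attr & 64) >> 3)  # ink bits 0-2, bright bit becomes bit 3
        table.append((0xFF000000 | palette[paper], 0xFF000000 | palette[ink]))
    return table
```

With this layout the table has only 256 entries, and the pixel bit just selects which element of the pair to use.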

Code:


Procedure scr2texture (texture, *mem)
    
  For y = 191 To 0 Step -1
    
    pixelLine.l = 32 * ((y & $C0) | ((y << 3) & $38) | ((y >> 3) & $07)) 
    attr.l      = 6144 + ((y & $F8) << 2)                                
    
    For x = 0 To 31
      
      chr_attr.l     =  PeekA(*mem + attr + x ) 
      chr_pixels.l   =  PeekA(*mem + pixelLine + x )
    
      CopyMemory( attributes_table + ((((chr_pixels&128) << 8) | (chr_attr))<<2), texture, 4 ):texture + 4
      CopyMemory( attributes_table + ((((chr_pixels&64 ) << 8) | (chr_attr))<<2), texture, 4 ):texture + 4
      CopyMemory( attributes_table + ((((chr_pixels&32 ) << 8) | (chr_attr))<<2), texture, 4 ):texture + 4
      CopyMemory( attributes_table + ((((chr_pixels&16 ) << 8) | (chr_attr))<<2), texture, 4 ):texture + 4
      CopyMemory( attributes_table + ((((chr_pixels&8  ) << 8) | (chr_attr))<<2), texture, 4 ):texture + 4
      CopyMemory( attributes_table + ((((chr_pixels&4  ) << 8) | (chr_attr))<<2), texture, 4 ):texture + 4
      CopyMemory( attributes_table + ((((chr_pixels&2  ) << 8) | (chr_attr))<<2), texture, 4 ):texture + 4
      CopyMemory( attributes_table + ((((chr_pixels&1  ) << 8) | (chr_attr))<<2), texture, 4 ):texture + 4
      
    Next x
    
  Next y
    
EndProcedure

Original code, without optimisation:

Code:


  For y = 0 To 191; 192 lines of screen
    
    pixelLine = 32 * ((y & $C0) | ((y << 3) & $38) | ((y >> 3) & $07))
    attr      = 6144 + ((y & $F8) << 2)                               
    
    For x = 0 To 31; 32 columns of screen
      
      chr_attr     =  PeekA(*mem + attr + x )      ; PeekA reads one unsigned byte (PeekC reads 2 bytes in Unicode builds)
      chr_pixels   =  PeekA(*mem + pixelLine + x )
      
      Paper = (chr_attr >> 3) & $0F 
      ink   = (chr_attr & 7)  | ((chr_attr & 64) >> 3)
      
      bit = 128
      For z = 0 To 7; 8 pixels of column
        
        If chr_pixels & bit
          
          Plot( x<<3 + z, y, color( ink )) ; pixel color
          
        Else  
          
          Plot( x<<3 + z, y, color( Paper )); background color
          
        EndIf
        
        bit >> 1
      Next z 
      
    Next x
    
  Next y
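
For reference, the address arithmetic in the loop above can be checked in isolation. The following Python sketch mirrors the two expressions for pixelLine and attr; the function names are hypothetical and only serve the illustration.

```python
def pixel_row_offset(y):
    """Offset of the first pixel byte of screen line y (0-191) in the 6912-byte screen."""
    # y bits 7-6 stay in place, bits 2-0 move up by 3, bits 5-3 move down by 3
    return 32 * ((y & 0xC0) | ((y << 3) & 0x38) | ((y >> 3) & 0x07))


def attr_row_offset(y):
    """Offset of the first attribute byte covering screen line y."""
    # one attribute row of 32 bytes per 8 pixel lines, starting at 6144
    return 6144 + ((y & 0xF8) << 2)


if __name__ == "__main__":
    for y in (0, 1, 8, 64, 191):
        print(y, pixel_row_offset(y), attr_row_offset(y))
```

Consecutive screen lines inside a character row are 256 bytes apart in memory, which is why the pixel offset jumps from 0 to 256 between y = 0 and y = 1.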

Perhaps it makes sense to use SSE2 instructions?

Didaktik · Post by **Didaktik** » Thu Sep 29, 2016 10:15 pm

Here is an interesting tutorial on how to optimize this on the graphics card.

http://www.yoyogames.com/blog/89

djes · Post by **djes** » Thu Sep 29, 2016 10:16 pm

There are a lot of things to optimise, but the first I see from my bed is to unroll the z loop; the second is to create arrays of colour components to avoid the Red(), Green() and Blue() calls.

Didaktik · Post by **Didaktik** » Thu Sep 29, 2016 11:45 pm

djes wrote:There are a lot of things to optimise, but the first I see from my bed is to unroll the z loop; the second is to create arrays of colour components to avoid the Red(), Green() and Blue() calls.

The colors that are drawn into the texture are already stored in RGB format.
I have optimized it a little more:

Code:


  For adr.l = 0 To 6143*4 Step 4
    
    chr_pixels.l   =  PeekA(*mem + PeekL(pixels_adress_table + adr) )
    chr_attr.l     =  PeekA(*mem + PeekL(attributes_adress_table + adr)  ) 
    
    CopyMemory( attributes_table + ((((chr_pixels&128) << 8) | (chr_attr))<<2), texture, 4 ):texture + 4
    CopyMemory( attributes_table + ((((chr_pixels&64 ) << 8) | (chr_attr))<<2), texture, 4 ):texture + 4
    CopyMemory( attributes_table + ((((chr_pixels&32 ) << 8) | (chr_attr))<<2), texture, 4 ):texture + 4
    CopyMemory( attributes_table + ((((chr_pixels&16 ) << 8) | (chr_attr))<<2), texture, 4 ):texture + 4
    CopyMemory( attributes_table + ((((chr_pixels&8  ) << 8) | (chr_attr))<<2), texture, 4 ):texture + 4
    CopyMemory( attributes_table + ((((chr_pixels&4  ) << 8) | (chr_attr))<<2), texture, 4 ):texture + 4
    CopyMemory( attributes_table + ((((chr_pixels&2  ) << 8) | (chr_attr))<<2), texture, 4 ):texture + 4
    CopyMemory( attributes_table + ((((chr_pixels&1  ) << 8) | (chr_attr))<<2), texture, 4 ):texture + 4
    
  Next adr
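
The code above relies on two precomputed tables (pixels_adress_table and attributes_adress_table) whose construction is not shown. A minimal Python sketch of how such tables could be built follows, assuming they hold one source offset per output byte in output order; the flip_y parameter and the function name are assumptions made for the sketch.

```python
def build_address_tables(flip_y=False):
    """Return (pixel_addr, attr_addr): source offsets for each of the 6144 output bytes."""
    pixel_addr, attr_addr = [], []
    rows = range(191, -1, -1) if flip_y else range(192)
    for y in rows:
        # same address arithmetic as the per-line version of the converter
        p = 32 * ((y & 0xC0) | ((y << 3) & 0x38) | ((y >> 3) & 0x07))
        a = 6144 + ((y & 0xF8) << 2)
        for x in range(32):
            pixel_addr.append(p + x)
            attr_addr.append(a + x)
    return pixel_addr, attr_addr
```

Each table has 32 * 192 = 6144 entries, so the last index is 6143, which matches the loop bound 6143*4 when the entries are 4-byte longs.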

djes · Post by **djes** » Fri Sep 30, 2016 8:19 am

Seems way better! Where does the 6143 come from? As there are a lot of << 8, I see the possibility of reading one byte to the right and merging the "and" to avoid this (e.g. PeekW([chr_pixels]) & $100). Maybe the << 2 could be hardwired in assembly; at least it could be done only once, at the chr_ lines, changing the "and" accordingly...
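
One reading of the last suggestion is that the shifts can be applied once per byte instead of once per pixel. The Python sketch below checks that identity; the function names are hypothetical, and this is an interpretation of the suggestion rather than code from the thread.

```python
def index_original(chr_pixels, chr_attr, mask):
    """Table byte offset as computed per pixel in the unrolled CopyMemory lines."""
    return (((chr_pixels & mask) << 8) | chr_attr) << 2


def index_preshifted(pixels_shifted, attr_shifted, mask_shifted):
    """Same offset when the shifts were already applied once per byte.

    Expects pixels_shifted = chr_pixels << 10, attr_shifted = chr_attr << 2
    and mask_shifted = mask << 10 (a compile-time constant per unrolled line).
    """
    return (pixels_shifted & mask_shifted) | attr_shifted


if __name__ == "__main__":
    p, a = 0xA5, 0x47
    for bit in range(8):
        m = 1 << bit
        assert index_original(p, a, m) == index_preshifted(p << 10, a << 2, m << 10)
    print("identical offsets")
```

This replaces eight shift pairs per byte with two shifts, leaving one "and" and one "or" per pixel.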

wilbert · Post by **wilbert** » Fri Sep 30, 2016 11:29 am

The "Original code, without optimisation" looks pretty good.
The biggest problem probably is that you are using Plot.
Using the DrawingBuffer directly should be quite a bit faster.

Otrebor · Post by **Otrebor** » Fri Sep 30, 2016 11:25 pm

Hi

This is part of the code I use in simpleZX.
It probably needs a lot of optimisation (and a rewrite).

A few tips that helped me with speed:
I use an array to check whether something has changed since the last frame.
When a byte has no pixels set, I use Box() to draw the paper.

Code:

Dim Cor(1,7):Dim scr.u(28671):Dim mem.a(65535)
Cor(0,0)=#Black:Cor(0,1)=RGB(0,0,191):Cor(0,2)=RGB(191,0,0):Cor(0,3)=RGB(191,0,191):Cor(0,4)=RGB(0,191,0):Cor(0,5)=RGB(0,191,191):Cor(0,6)=RGB(191,191,0):Cor(0,7)=RGB(191,191,191)
Cor(1,0)=#Black:Cor(1,1)=#Blue:Cor(1,2)=#Red:Cor(1,3)=#Magenta:Cor(1,4)=#Green:Cor(1,5)=#Cyan:Cor(1,6)=#Yellow:Cor(1,7)=#White
x=32:pixel=16383:attr=22528:linha=0

      While x <> 288
        pixel+1
        paper.a = mem(attr) >>3 & 7
        ink.a  =  mem(attr) & 7
        brite.a = mem(attr) >>6 & 1
        flash.a = mem(attr) >>7
        byte.a=mem(pixel)
        If (mem(attr) <> scr(pixel+6144) ! flash) Or byte <> scr(pixel);verify in the array if something changed
          If flash & f_flash:Swap ink,paper:EndIf;IF Changed => need to draw again
          ink_=Cor(brite,ink)
          paper_=Cor(brite,paper)
          If Not byte ; All is paper (use box is faster)
            Box(x, linha-40, 8, 1,  paper_)
            x+8
          ElseIf byte=255 ; All is ink (use box is faster)
            Box(x, linha-40, 8, 1,  ink_)
            x+8
          Else
            Box(x, linha-40, 8, 1,  paper_);first draw the paper...
            For cursor = 7 To 0 Step -1
              If byte>>cursor & 1  
                Plot (x , linha-40, ink_);...and now the ink
              EndIf
              x+1
            Next
          EndIf
          scr(pixel)=byte:scr(pixel+6144)=mem(attr);save in the array
        Else
          x+8
        EndIf
        attr+1
      Wend     
      attr-32:pixel+224:x=32:a1+1    
      If a1=8
        attr+32:pixel-2016:a1=0:line+1
        If line=8:line=0:pixel+1792:EndIf 
      EndIf  
      linha+1
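
The change-detection tip used in the code above can be sketched on its own. The Python below is a simplified illustration, assuming two full 6912-byte screens (previous and current frame) and ignoring the flash handling; the names attr_offset_for and changed_bytes are hypothetical.

```python
def attr_offset_for(pixel_offset):
    """Offset of the attribute byte that colours the given pixel byte (0-6143)."""
    # keep the screen-third bits and the row/column bits, drop the line-in-cell bits
    return 6144 + (((pixel_offset >> 3) & 0x300) | (pixel_offset & 0xFF))


def changed_bytes(prev, cur):
    """Return the pixel-byte offsets that need redrawing since the previous frame."""
    dirty = []
    for off in range(6144):
        a = attr_offset_for(off)
        if cur[off] != prev[off] or cur[a] != prev[a]:
            dirty.append(off)
    return dirty
```

Only the offsets returned by changed_bytes need to be redrawn; after drawing, the current frame is kept as the previous one for the next comparison.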

wilbert · Post by **wilbert** » Sat Oct 01, 2016 5:39 pm

Didaktik wrote:Perhaps it makes sense to use SSE2 instructions?

If you do that, it's best to decode one 8x8 pixel block after the other so you don't have to load the attributes so often.
But using SSE2 only makes sense if you always need to output to a 32-bit buffer.

It helps to know exactly what your goal is.
Is the target a PB image, a CanvasGadget or simply a memory area? Do you prefer ASM or PB code? Should the blink bit also be supported?
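
The block-by-block order suggested above can be sketched in plain Python (no SSE2), just to show the traversal: the attribute is looked up once per 8x8 block and reused for its eight lines. The function name and the table[attr] = (paper, ink) layout are assumptions of this sketch.

```python
def render_blocks(scr, table):
    """scr: 6912-byte screen; table[attr] = (paper, ink). Returns 192 rows of 256 values."""
    out = [[0] * 256 for _ in range(192)]
    for block in range(768):                        # 32 x 24 character blocks
        paper, ink = table[scr[6144 + block]]       # one attribute load per block
        third, row, col = block >> 8, (block >> 5) & 7, block & 31
        base = (third << 11) | (row << 5) | col     # first pixel byte of the block
        y0 = (third << 6) | (row << 3)              # top screen line of the block
        for line in range(8):
            bits = scr[base + (line << 8)]          # lines of a block are 256 bytes apart
            dst = out[y0 + line]
            for bit in range(8):
                dst[col * 8 + bit] = ink if bits & (0x80 >> bit) else paper
    return out
```

Compared with a line-by-line loop, this reads each attribute byte once instead of eight times.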

Didaktik · Post by **Didaktik** » Sun Oct 02, 2016 5:22 pm

wilbert wrote:
Didaktik wrote:Perhaps it makes sense to use SSE2 instructions?
If you do that, it's best to decode one 8x8 pixel block after the other so you don't have to load the attributes so often.
But using SSE2 only makes sense if you always need to output to a 32-bit buffer.

It helps to know exactly what your goal is.
Is the target a PB image, a CanvasGadget or simply a memory area? Do you prefer ASM or PB code? Should the blink bit also be supported?

I use an OpenGL gadget.

I do it this way because I need the picture shown resized to 50-25% with anti-aliasing, and ResizeImage() is quite slow.

wilbert · Post by **wilbert** » Sun Oct 02, 2016 5:35 pm

Didaktik wrote:I do it this way because I need the picture shown resized to 50-25% with anti-aliasing, and ResizeImage() is quite slow.

In that case you could try the shader implementation you referred to in one of your previous posts.

wilbert · Post by **wilbert** » Mon Oct 03, 2016 8:52 am

I tried an SSE2 version since I like old 8-bit computers.
Please let me know how it performs compared to the original code.

#FlipY and #SwapRB can be used to configure the output.
The colors were taken from the JavaScript source of a Spectrum emulator.
You can change these as you like.

Module

Code:

; ZXScreen module by Wilbert

; *SCR has to point to the 6912 bytes of a ZX Spectrum screen
; *Target has to be a buffer for 32 bit color data
; When FlashState is set to 1, paper and ink will be
; swapped for blocks that have the flash bit set.

DeclareModule ZXScreen
  
  Declare RenderSCR(*SCR, *Target, FlashState = 0)
  
EndDeclareModule

Module ZXScreen
  
  #FlipY = #True
  #SwapRB = #True
  
  EnableExplicit
  DisableDebugger           ; disabling debugger is required !!
  EnableASM
  
  ;- Data section
  
  DataSection
    ColorLUT:
    Data.l $ff000000,$ffc03020,$ff1040c0,$ffc040c0,$ff10b040,$ffb0c050,$ff10c0e0,$ffc0c0c0
    Data.l $ff000000,$ffff4030,$ff3040ff,$fff070ff,$ff10e050,$ffffe050,$ff50e8ff,$ffffffff
  EndDataSection
    
  ;- Structures
  
  Structure RenderLUT
    bit_expand.l[256 * 8]   ; offset 0
    color.l[256 * 2]        ; offset 8192
    offset.u[768]           ; offset 10240
  EndStructure  
  
  ;- Global variables
  
  Global *Mem, *RenderLUT.RenderLUT
  
  ;- Init lookup table
  
  Procedure SwapRB(color.l)
    !mov eax, [p.v_color]
    !bswap eax
    !ror eax, 8
    ProcedureReturn
  EndProcedure
    
  Procedure InitTable()
    Protected.i bit, col, i, ink, paper, row
    If Not *RenderLUT
      *Mem = AllocateMemory(SizeOf(RenderLUT) + 32)
      *RenderLUT.RenderLUT = (*Mem + 31) & -32
      For i = 0 To 255
        If i & $80
          ink = (i >> 3 & 15) : paper = (i >> 3 & 8) | (i & 7)  ; flash
        Else
          paper = (i >> 3 & 15) : ink = (i >> 3 & 8) | (i & 7)  ; normal
        EndIf
        CompilerIf #SwapRB
          *RenderLUT\color[i << 1    ] = SwapRB(PeekL(?ColorLUT + paper << 2))
          *RenderLUT\color[i << 1 + 1] = SwapRB(PeekL(?ColorLUT + ink   << 2))
        CompilerElse
          *RenderLUT\color[i << 1    ] = PeekL(?ColorLUT + paper << 2)  ; paper
          *RenderLUT\color[i << 1 + 1] = PeekL(?ColorLUT + ink   << 2)  ; ink
        CompilerEndIf
      Next  
      For row = 0 To 7
        For col = 0 To 31
          *RenderLUT\offset[row << 5 | col] = row << 5 | col
          *RenderLUT\offset[row << 5 | col | 256] = row << 5 | col | 2048
          *RenderLUT\offset[row << 5 | col | 512] = row << 5 | col | 4096
        Next
      Next
      For i = 0 To 255
        For bit = 0 To 7
          If i & ($80 >> bit)
            *RenderLUT\bit_expand[i << 3 | bit] = -1
          EndIf
        Next
      Next
    EndIf
  EndProcedure
  
  InitTable()
  
  ;- Main code
  
  CompilerIf #PB_Compiler_Processor = #PB_Processor_x86
    Macro rax : eax : EndMacro
    Macro rbx : ebx : EndMacro
    Macro rcx : ecx : EndMacro
    Macro rdx : edx : EndMacro
    Macro rsi : esi : EndMacro
    Macro rdi : edi : EndMacro
    Macro rsp : esp : EndMacro
  CompilerEndIf
  
  Macro M_8x1(line)
    !movzx eax, byte [rsi + rbx + line*256] ; get bits
    !shl eax, 5
    !movdqa xmm0, [rdx    + rax]            ; expand every bit to 32 bits 
    !movdqa xmm1, [rdx+16 + rax]            ; expand every bit to 32 bits 
    !pand xmm0, xmm3
    !pand xmm1, xmm3
    !pxor xmm0, xmm2
    !pxor xmm1, xmm2
    CompilerIf #FlipY
      !movdqu [rdi    + (7-line)*1024], xmm0
      !movdqu [rdi+16 + (7-line)*1024], xmm1
    CompilerElse
      !movdqu [rdi    + line*1024], xmm0
      !movdqu [rdi+16 + line*1024], xmm1     
    CompilerEndIf
  EndMacro
  
  Procedure RenderSCR(*SCR, *Target, FlashState = 0)
    FlashState = (FlashState << 7) ! $7f
    ; backup registers without push
    ; so references to local variables stay valid
    mov [rsp -  8], rbx
    mov [rsp - 16], rsi
    mov [rsp - 24], rdi
    ; load registers
    mov rdx, *RenderLUT
    mov rsi, *SCR
    mov rdi, *Target
    CompilerIf #FlipY
      add rdi, 188416
    CompilerEndIf    
    ; block loop
    !xor ecx, ecx                           ; ecx = counter (0 - 767)
    !.block_loop:
    movzx eax, byte [rsi+6144 + rcx]        ; get attributes of block
    !and eax, [p.v_FlashState]
    movq xmm0, [rdx+8192 + rax*8]
    !pshufd xmm2, xmm0, 00000000b           ; paper
    !pshufd xmm3, xmm0, 01010101b           ; ink
    !pxor xmm3, xmm2
    movzx ebx, word [rdx+10240 + rcx*2]     ; get block offset
    M_8x1(0)                                ; block line 0
    M_8x1(1)                                ; block line 1
    M_8x1(2)                                ; block line 2
    M_8x1(3)                                ; block line 3
    M_8x1(4)                                ; block line 4
    M_8x1(5)                                ; block line 5
    M_8x1(6)                                ; block line 6
    M_8x1(7)                                ; block line 7
    add rdi, 32
    !inc ecx
    !test ecx, 31                           ; new row check
    !jnz .block_loop
    CompilerIf #FlipY
      sub rdi, 9216
    CompilerElse
      add rdi, 7168
    CompilerEndIf
    !cmp ecx, 768
    !jne .block_loop
    ; restore registers
    mov rdi, [rsp - 24]
    mov rsi, [rsp - 16]
    mov rbx, [rsp -  8]
  EndProcedure
  
EndModule
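
The core trick in M_8x1 is a branch-free select: each pixel bit is expanded to an all-ones or all-zero 32-bit mask, then combined with the colours as (mask AND (ink XOR paper)) XOR paper. A scalar Python sketch of the same arithmetic follows; it is an illustration only (no SIMD), and the function names are hypothetical.

```python
MASK32 = 0xFFFFFFFF


def build_bit_expand():
    """bit_expand[byte][n] is all ones if pixel n (leftmost first) is set, else 0."""
    return [[MASK32 if byte & (0x80 >> n) else 0 for n in range(8)]
            for byte in range(256)]


def select_row(bits, paper, ink, bit_expand):
    """Return the 8 colours for one pixel byte without branching on each bit."""
    diff = ink ^ paper                      # corresponds to pxor xmm3, xmm2
    # (mask & diff) ^ paper yields ink where mask is all ones, paper where it is 0
    return [(m & diff) ^ paper for m in bit_expand[bits]]


if __name__ == "__main__":
    table = build_bit_expand()
    print([hex(c) for c in select_row(0b10100000, 0xFF000000, 0xFFFFFFFF, table)])
```

In the module the same operation is applied to four pixels per register, so two pand/pxor pairs colour a whole 8-pixel line.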

Example

Code:

*SCR = AllocateMemory(6912)
If ReadFile(0, "zx.scr")
  ReadData(0, *SCR, 6912)
  CloseFile(0)
EndIf

CreateImage(0, 256, 192, 32)
StartDrawing(ImageOutput(0))
ZXScreen::RenderSCR(*SCR, DrawingBuffer(), 0)
StopDrawing()

OpenWindow(0, 0, 0, 256, 192, "ZXScreen", #PB_Window_SystemMenu | #PB_Window_ScreenCentered)
ImageGadget(0, 0, 0, 256, 192, ImageID(0))
Repeat
  Event = WaitWindowEvent()
Until Event = #PB_Event_CloseWindow

Otrebor · Post by **Otrebor** » Mon Oct 03, 2016 4:24 pm

Very nice

Didaktik · Post by **Didaktik** » Mon Oct 03, 2016 7:18 pm

wilbert wrote:I tried an SSE2 version since I like old 8-bit computers.
Please let me know how it performs compared to the original code.

#FlipY and #SwapRB can be used to configure the output.
The colors were taken from the JavaScript source of a Spectrum emulator.
You can change these as you like.

Thank you Wilbert!!

I convert 7 images in real time. My version hovers around 50-55 fps. Your version gives 62-64 fps.
This is on a Windows 10 tablet with an Atom processor.

I also tried to use threads, since the Atom has 4 cores, but it did not give a noticeable speed increase.
(I assigned the processing of two screens to each thread: 3 worker threads in total, plus a 4th thread that waits for them to finish.)

; t1 = CreateThread(@RenderScreens12(),1)
; t2 = CreateThread(@RenderScreens34(),1)
; t3 = CreateThread(@RenderScreens56(),1)
; WaitThread(t1)
; WaitThread(t2)
; WaitThread(t3)

wilbert · Post by **wilbert** » Mon Oct 03, 2016 7:45 pm

Didaktik wrote:I convert 7 images in real time. My version hovers around 50-55 fps. Your version gives 62-64 fps.

It's at least a few more frames per second.

Are these fps for a complete Spectrum emulator or just the screen conversion?
If the procedure I created can itself only execute about 62 times a second, an Atom is much slower than my Core i5.

Didaktik wrote:I also tried to use threads, since the Atom has 4 cores, but it did not give a noticeable speed increase.

Normally I would expect multiple threads to make a difference, but I'm not familiar with the capabilities of an Atom CPU for tablets.

Didaktik · Post by **Didaktik** » Mon Oct 03, 2016 8:57 pm

wilbert wrote:
Didaktik wrote:I convert 7 images in real time. My version hovers around 50-55 fps. Your version gives 62-64 fps.
It's at least a few more frames per second.
Are these fps for a complete Spectrum emulator or just the screen conversion?
If the procedure I created can itself only execute about 62 times a second, an Atom is much slower than my Core i5.

Didaktik wrote:I also tried to use threads, since the Atom has 4 cores, but it did not give a noticeable speed increase.
Normally I would expect multiple threads to make a difference, but I'm not familiar with the capabilities of an Atom CPU for tablets.

I'm making software for VJing. As footage I use ZX Spectrum demos, games, GIFs etc. in the native ZX Spectrum screen format.
One video frame = one unpacked 6912-byte screen.

https://www.youtube.com/watch?v=hzBvGabmBy8
https://www.youtube.com/watch?v=jlIauv1Yn9s

So I have 6 layers of video that can be mixed and distorted in different ways. In total, 6 images come from the layers, plus 1 final mixed picture and 1 picture for previewing footage.
Of course, not all the layers are in use all the time, usually 3-4 of them.
But I'm trying to make the program deliver 60 fps at maximum load.

Of course, on a standard PC the speed of scr2texture has no significant effect on performance. I ran into this only on the tablet. Its processor runs at 1.3 GHz and it possibly has other peculiarities.
