Optimize code size

TI-994A · Post by **TI-994A** » Mon Jul 30, 2012 4:47 pm

Tenaja wrote:
TI-994A wrote:...but coding in Assembly isn't exactly within the reach of the average PureBasic programmer...
I would venture to guess that "very good" code is also not within the reach of the average programmer! (PB or not!)

But good, clean code is.

xorc1zt · Post by **xorc1zt** » Mon Jul 30, 2012 8:15 pm

Code: Select all

Procedure a (va.i, vb.i, vc.i, vd.i, vf.i)
    Define r = va+vb-vc-vd+vf
    
    If (r = 0)
        Debug "0"
    ElseIf (r = 1)
        Debug "1"
    ElseIf (r = 2)
        Debug "2"
    EndIf
EndProcedure

a(1,3,5,3,5)
a(1,3,5,3,5)
a(1,3,5,3,5)

the calls

Code: Select all

; a(1,3,5,3,5)
  SUB    rsp,8
  PUSH   qword 5
  PUSH   qword 3
  PUSH   qword 5
  PUSH   qword 3
  PUSH   qword 1
  POP    rcx
  POP    rdx
  POP    r8
  POP    r9
  SUB    rsp,32
  CALL  _Procedure0
  ADD    rsp,48
; a(1,3,5,3,5)
  SUB    rsp,8
  PUSH   qword 5
  PUSH   qword 3
  PUSH   qword 5
  PUSH   qword 3
  PUSH   qword 1
  POP    rcx
  POP    rdx
  POP    r8
  POP    r9
  SUB    rsp,32
  CALL  _Procedure0
  ADD    rsp,48
; a(1,3,5,3,5)
  SUB    rsp,8
  PUSH   qword 5
  PUSH   qword 3
  PUSH   qword 5
  PUSH   qword 3
  PUSH   qword 1
  POP    rcx
  POP    rdx
  POP    r8
  POP    r9
  SUB    rsp,32
  CALL  _Procedure0
  ADD    rsp,48

fromVB · Post by **fromVB** » Tue Jul 31, 2012 3:22 pm

TI-994A wrote:
luis wrote:Actually Fred said use some kind of peephole optimization ...
Even the slightest optimization is still good news. Thanks for pointing that out, luis.

luis wrote:... there is still the fact some unused procedure are compiled in the final exe, if we want to consider the "size" side of things...
Knowing that PureBasic is a single-pass compiler, bloat caused by the failure to remove unused variables/constants/procedures/etc. is the programmer's fault. A good programmer should not only be able to code well, but should also know the capabilities and limitations of the tools and resources he uses.

luis wrote:... The code you write in PB can be great PB code, but sometime you could have written it in a total different way in hand-made ASM, so it's not a guarantee optimized code in PB can give the best optimized ASM code for that specific problem. Hand written ASM using a different approach could be better ... the code generated by the compiler, but it will always be "bloated" compared to hand written ASM ... The most astute compiler cannot beat the average ASM programmer in the way the general code layout is generated ...
It would be highly unfair to compare PureBasic-generated Assembly with manually-coded Assembly; automated translations will invariably be less efficient than manual ones, and that's true for all compilers. But still, the important thing is that PureBasic and FASM does not worsen your code. It all comes down to the programmer; if you write good, clean code, PureBasic would be able to produce fairly decent and efficient executables. If you don't, no amount of manual Assembly-tweaking, and no optimizer in the world would be able to help you.

Thank you for taking the time to explain the intricacies to me; your points directed me to some good resources in the forum, which have clarified my concerns.

So it is translated to assembly that is not optimized? I'm confused.

wilbert · Post by **wilbert** » Tue Jul 31, 2012 4:26 pm

fromVB wrote:So it is translated to assembly that is not optimized? I'm confused.

Source code

Code: Select all

A = 20
B = 32
C = A / 4 + B / 4

PB output

Code: Select all

; A = 20
  MOV    dword [v_A],20
; B = 32
  MOV    dword [v_B],32
; C = A / 4 + B / 4
  MOV    ebx,dword [v_A]
  MOV    eax,ebx
  MOV    ecx,4
  CDQ
  IDIV   ecx
  MOV    ebx,eax
  MOV    edi,dword [v_B]
  MOV    eax,edi
  MOV    ecx,4
  CDQ
  IDIV   ecx
  MOV    edi,eax
  ADD    ebx,edi
  MOV    dword [v_C],ebx

Hand optimized code

Code: Select all

mov eax, 20
mov edx, 32
mov dword [v_A], eax
mov dword [v_B], edx
add eax, edx
sar eax, 2
mov dword [v_C], eax

A human understands that A / 4 + B / 4 is the same as (A + B) / 4. The second approach is faster and requires less operations.
4 is a power of 2. Therefore a shift operation (sar) can be used to perform the division. This is faster compared to an integer division (idiv).
The hand optimized code also has less memory access (3 times instead of 5).

C compilers try to analyze the source code and perform optimizations like that automatically while PureBasic just does a literal translation.
Does that make PureBasic slow ? No. In most situations you won't notice the difference. But when you have code that is executed millions of times inside a loop it can be very rewarding to optimize that part of your application using hand written asm code.

Demivec · Post by **Demivec** » Tue Jul 31, 2012 4:46 pm

wilbert wrote:A human understands that A / 4 + B / 4 is the same as (A + B) / 4. The second approach is faster and requires less operations.
4 is a power of 2. Therefore a shift operation (sar) can be used to perform the division. This is faster compared to an integer division (idiv).
The hand optimized code also has less memory access (3 times instead of 5).

C compilers try to analyze the source code and perform optimizations like that automatically while PureBasic just does a literal translation.
Does that make PureBasic slow ? No. In most situations you won't notice the difference. But when you have code that is executed millions of times inside a loop it can be very rewarding to optimize that part of your application using hand written asm code.

Hand optimizing the assembly is not needed in the simple case you presented. Instead what is needed is hand optimizing the PureBasic before it is compiled.

Original PureBasic source code:

Code: Select all

A = 20
B = 32
C = A / 4 + B / 4

Here are your hand optimized assembly results:

Code: Select all

mov eax, 20
mov edx, 32
mov dword [v_A], eax
mov dword [v_B], edx
add eax, edx
sar eax, 2
mov dword [v_C], eax

Compare to the assembly of hand optimized PureBasic source code:

Code: Select all

; A = 20
  MOV    dword [v_A],20
; B = 32
  MOV    dword [v_B],32
; C = (A + B) >> 2
  MOV    ebx,dword [v_a]
  ADD    ebx,dword [v_b]
  SAR    ebx,2
  MOV    dword [v_C],ebx

In this case they are almost identical. The hand optimized PureBasic source code actually uses one less MOV instruction when it is compiled but it still has 5 memory accesses. I am not skilled enough to know if this results in code that is faster or slower overall

.

The point of this illustration is that there are definite improvements to be gained from wisely structuring code in PureBasic. I agree that using assembly code directly can provide the greatest improvements, but as I demonstrated, it isn't necessary to go directly to assembly to obtain at least some of these improvements.

wilbert · Post by **wilbert** » Tue Jul 31, 2012 4:57 pm

I tried to explain what compiler optimizations could look like.
It's not uncommon you can get similar results optimizing the source code.
Your remark demonstrates quite well that compiler optimizations have less impact on good quality source code.

Tenaja · Post by **Tenaja** » Tue Jul 31, 2012 5:47 pm

fromVB wrote:So it is translated to assembly that is not optimized? I'm confused.

This is correct. Check this out, for example:

Code: Select all

Procedure test1()
	Protected b.i = 1
	Protected c.i = 0
	Protected d.i = 0
	Protected e.i = 0
	Protected f.i = 0
	Protected g.i = 0
	; Insert Code here...
EndProcedure

test1()

The Proc compiles to this:

Code: Select all

_Procedure0:
  PS0=28
  XOR    eax,eax          ; initialize all local variables to 0:
  PUSH   eax
  PUSH   eax
  PUSH   eax
  PUSH   eax
  PUSH   eax
  PUSH   eax                                                                                                                                                                          
; Protected b.i = 1
  MOV    dword [esp],1
; Protected c.i = 0
  MOV    dword [esp+4],0        ; duplicate the initialization of variables to 0:
; Protected d.i = 0
  MOV    dword [esp+8],0
; Protected e.i = 0
  MOV    dword [esp+12],0
; Protected f.i = 0
  MOV    dword [esp+16],0
; Protected g.i = 0
  MOV    dword [esp+20],0

The coder is punished for exercising good programming practice of initializing every variable before use, because in this example, they are all initialized TWICE!

And regarding the previous example with the a/4 + b/4 vs (a+b) >> 2, that is the difference between a strong optimizing compiler (It does it automatically for you!) vs. one with NO optimizing, which requires YOU to code it all in a way that ends up more optimized. The biggest problem with this is that VERY FEW PB coders know how to even SEE the asm code, much less actually analyze it. Also, very few understand why one code snip compiles more efficiently than another. An optimizing compiler takes care of all of this for you so it just doesn't matter--the coder can be naive, and blissfully so, since the end result is about the same. Not so with PB; since its primary focus is beginners, who are still naive about these manual optimizations, it merrily produces poor code because of the skillset of the coder base.

And YES, it IS SLOWER. How much slower may very well be irrelevant, in most situations. Consider if you are calculating an internal window size after a major window resize; nobody will notice a dozen extra instructions being executed, even on a slow cpu.

On the other hand, if you are performing numerous CPU-intensive calculations, those dozen extra instructions could easily turn into minutes, if not hours given a large enough task.

Thorium · Post by **Thorium** » Tue Jul 31, 2012 6:08 pm

Not to forget that PB does not have any constructs to enable the programmer to use SIMD operations. With PB (and many other compilers) this is only possible using ASM. And having no SIMD means you are loosing a whole layer of parallelization.

For any parallelizable code SIMD is a great performance boost. And SIMD can't be applied automaticly by compilers. Some compilers like Intel C++ offer commands to use SIMD on the high language level.

TI-994A · Post by **TI-994A** » Wed Aug 01, 2012 10:41 am

Demivec wrote:...what is needed is hand optimizing the PureBasic before it is compiled...there are definite improvements to be gained from wisely structuring code in PureBasic....

wilbert wrote:Your remark demonstrates quite well that compiler optimizations have less impact on good quality source code.

Thanks for your illustrative examples; precisely the point I've been trying to make.

Tenaja wrote:...The coder is punished for exercising good programming practice of initializing every variable before use, because in this example, they are all initialized TWICE!

The coder is punished for not knowing that PureBasic already initializes all variables with zero. Good practices are always contingent.

Tenaja wrote:(thread link)...the Java-written Open Office is so pathetically slow with a million values in a spreadsheet that it took nearly an hour to process a simple graph. In MS Excel, it was nearly instantaneous. There is NO WAY that C is that much faster than Java in ANY benchmark test... the Java implementation either has poor memory management, or poor coding...

An optimizing compiler takes care of all of this for you so it just doesn't matter--the coder can be naive, and blissfully so, since the end result is about the same.

Even highly optimized Java, with its bytecode, JVM, and JIT compilation, cannot help poor coding. Thank you for pointing that out, although it contradicts your second statement above. The results are clearly not the same.

Vitor_Boss® · Post by **Vitor_Boss®** » Sun Aug 19, 2012 5:17 am

Thank you for all codes and explanation about ASM coding. I tried optimize memory comparison procedure and got this

Code: Select all

Procedure CmpMem(*Source, *Test, Size)
push     edi
push     esi
mov     eax, [esp+14h] ; Size
mov     esi, [esp+10h] ; *Test2
mov     edi, [esp+0Ch] ; *Source
dec eax
cmp     eax, 3
jl     loc_Byte_cmp_init ; if (Size > 4)
add esi, 0FFFFFFFDh ; Needed to alignment of last byte
add edi, 0FFFFFFFDh ; Needed to alignment of last byte
cmp   eax, 7
jl     loc_4Bytes_cmp ; if (Size > 8)
add esi, 0FFFFFFFCh ; Needed to alignment of last byte
add edi, 0FFFFFFFCh ; Needed to alignment of last byte
!loc_8Bytes_cmp:
movq     mm0, [eax+edi] ;Source[i]
pcmpeqd mm0, [eax+esi] ;Test[i]
psrlq mm0, 16
movd edx, mm0
cmp edx, 0FFFFFFFFh
jne     loc_return_0 ; If (Source[i]!=Test[i])
add     eax, 0FFFFFFF8h ; i-=8
jl     loc_return_1
cmp     eax, 7
jg      loc_8Bytes_cmp ; If (i>8)
mov     al, 7
jmp      loc_8Bytes_cmp ; If (i>7)
!loc_4Bytes_cmp:
mov     edx, [eax+edi] ;Source[i]
cmp     edx, [eax+esi] ;Test[i]
jnz     loc_return_0 ; If (Source[i]!=Test[i])
add     eax, 0FFFFFFFCh ; i-=4
jl     loc_return_1
cmp     eax, 2
jg      loc_4Bytes_cmp ; If (i>3)
mov     al, 3
jmp		loc_4Bytes_cmp
!loc_Byte_cmp_init:
test    eax, eax
jle     loc_return_0 ;If (Size > 0)
!loc_Byte_cmp:
mov     dl, [eax+edi] ;Source[i]
cmp     dl, [eax+esi] ;Test[i]
jnz      loc_return_0 ; If (Source[i]!=Test[i])
dec     eax ; --i
jge      loc_Byte_cmp ; If (i>=0)
!loc_return_1:
pop     edi
pop     esi
mov     eax, 1
!retn 0ch
!loc_return_0:
pop     edi
pop     esi
EndProcedure

In a 10 millions test loop I got up to 50% speed improvement.
All processors since 98 has MMX right?

Fred you can use it if you want.

Thorium · Post by **Thorium** » Mon Aug 20, 2012 8:26 pm

Check out the Rep and Scan instructions, they should speed it up more.
Or use 128bit SSE registers for the compare.

Vitor_Boss® · Post by **Vitor_Boss®** » Tue Aug 28, 2012 4:50 am

Thorium wrote:Check out the Rep and Scan instructions, they should speed it up more.
Or use 128bit SSE registers for the compare.

Updated with 64bit MMX instructions for a fastest results, all memory use a bus 64bits wide right?

PureBasic Forums - English

Optimize code size

Re: Optimize code size

Re: Optimize code size

Re: Optimize code size

Re: Optimize code size

Re: Optimize code size

Re: Optimize code size

Re: Optimize code size

Re: Optimize code size

Re: Optimize code size

Re: Optimize code size

Re: Optimize code size

Re: Optimize code size