PureBasic Forums - English

Posted: **Thu May 21, 2015 1:04 pm**

Your friends are at the very beginning of evolution. Can't compare C++ with

Posted: **Thu May 21, 2015 6:08 pm**

TI-994A wrote:
Thorium wrote:Actually it's very slow compared to optimizing C or C++ compilers ... the resulting executable is much slower...
Hi Thorium. Very slow? Much slower?

With no optimisation options, speed disparities are to be expected. But what do you consider very much slower?

It's not easy to give a good answer. It depends on the code. Some years ago someone benchmarked some code from different compilers and compared it to PureBasic. It's somewhere on the forums, but I can't find it. It also included Java.

As a general statement I would say PureBasic reaching 50% of the performance of C/C++ would be normal.
If the C/C++ code uses intrinsics for SIMD it should be around 10 to 20%.
As I said, you can compensate for that by using some inline assembly. So it's no issue for me.

Posted: **Thu May 21, 2015 6:43 pm**

Thorium wrote:...As a general statement I would say 50% performance would be normal, compared to C/C++.

If you use intrinsics for SIMD it should be around 10 to 20%.

Hi Thorium. At best, double the speed sounds about right. Anything more would require some painstaking, hand-coded compiler directives to achieve.

Posted: **Thu May 21, 2015 10:41 pm**

TI-994A wrote:At best, double the speed sounds about right. Anything more would require some painstaking, hand-coded compiler directives to achieve.

Not necessarily.
For example, I implemented an image filter in PureBasic and optimized it with assembly.
The result was that a simple assembly version without any instruction set extensions was already about 5 times faster. Many C/C++ compilers can achieve this without any inline assembly.
It was a very small loop with just a few variables, which is something a compiler can optimize very well.

That was the benchmark of the different implementations:

Code: Select all

Version    Throughput  Relative
PureBasic   197 MB/s    100%
Assembler  1178 MB/s    598%
MMX        7133 MB/s   3621%
SSE2      10997 MB/s   5582%

Posted: **Thu May 21, 2015 11:43 pm**

Thorium wrote:Some years ago someone benchmarked some code from different compilers and compared it to PureBasic. It's somewhere on the forums, but i can't find it. It also included Java.

The only one I remember is this one, but it isn't saying much: http://www.purebasic.fr/english/viewtop ... =7&t=48202

Not the one you remember? If it's not this one then I must have missed it.

PS: this thread is ... amusing.

Posted: **Fri May 22, 2015 4:39 am**

Thorium wrote:For example i implemented a image filter in PureBasic and optimized it with assembly ... It was a very small loop, with just a few variables.

Hi Thorium. Very impressive results; more than fifty times faster compared to vanilla PureBasic, and ten times faster than assembly.

It would be really great if we could see how each of the codes were implemented. Maybe we could all learn something about optimisation.

Posted: **Fri May 22, 2015 6:41 am**

TI-994A wrote:
Thorium wrote:For example i implemented a image filter in PureBasic and optimized it with assembly ... It was a very small loop, with just a few variables.
Hi Thorium. Very impressive results; more than fifty times faster compared to vanilla PureBasic, and ten times faster than assembly.

It would be really great if we could see how each of the codes were implemented. Maybe we could all learn something about optimisation.

It's not uncommon to get a massive speed increase when you hand code some part with assembler code especially when you can use SSE.
But even a very simple thing helps: you can use registers instead of memory to store temporary variables.
PureBasic translates code line by line, storing every variable back to memory after each statement, which is convenient if you want to mix PB code with assembler but is also inefficient.
Take for example this code:

Code: Select all

result.i = 0
t1 = ElapsedMilliseconds()
For i = 1 To 100000000
  result = result + i
  result = result - 50000
  result = result >> 5
Next  
t2 = ElapsedMilliseconds()
MessageRequester(Str(t2-t1)+" ms", Str(result))

If you look at the x64 assembler source of the For/Next loop, you will see immediately that it can be optimized: result is loaded from memory and stored back again for every single statement.

Code: Select all

; For i = 1 To 100000000
  MOV    qword [v_i],1
_For1:
  MOV    rax,100000000
  CMP    rax,qword [v_i]
  JL    _Next2
; result = result + i
  MOV    r15,qword [v_result]
  ADD    r15,qword [v_i]
  MOV    qword [v_result],r15
; result = result - 50000
  MOV    r15,qword [v_result]
  ADD    r15,-50000
  MOV    qword [v_result],r15
; result = result >> 5
  MOV    r15,qword [v_result]
  SAR    r15,5
  MOV    qword [v_result],r15
; Next  
_NextContinue2:
  INC    qword [v_i]
  JNO   _For1
_Next2:

Of course you can optimize the PB code like this:

Code: Select all

  result = (result + i - 50000) >> 5

which makes it a lot faster, but that's the kind of optimization a compiler could also make for you.

But for most applications, the computer is waiting for user input a lot of the time and you won't notice the difference.
And like Thorium said, for some special routines that are called a lot you can mix in assembler.
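
To make the register idea above a bit more concrete, here is a rough sketch of the same loop with result and i kept in registers. It is only an illustration under some assumptions: x64, the assembler backend, and the code sitting in the main scope (so the variable is reachable as v_result, as in the listing above). The label name RegLoop_Start and the choice of rax/rcx are mine.

```purebasic
; Sketch only: x64 with the assembler backend is assumed.
; The label name RegLoop_Start and the register choice are hypothetical.
result.i = 0
t1 = ElapsedMilliseconds()

; result lives in rax and i lives in rcx for the whole loop
!mov rax,[v_result]
!mov rcx,1
!RegLoop_Start:
; result = result + i
!add rax,rcx
; result = result - 50000
!sub rax,50000
; result = result >> 5
!sar rax,5
; Next
!inc rcx
!cmp rcx,100000000
!jle RegLoop_Start
; a single store once the loop is done
!mov [v_result],rax

t2 = ElapsedMilliseconds()
MessageRequester(Str(t2-t1)+" ms", Str(result))
```

No PB statement may sit between the first and the last ! line, because the compiler is free to overwrite rax and rcx in any statement it generates itself. Inside a procedure the variable would also have a different symbol name than v_result.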

Posted: **Fri May 22, 2015 7:34 am**

luis wrote: The only one I remember is this one, but it isn't saying much: http://www.purebasic.fr/english/viewtop ... =7&t=48202

Yes, that's the one.

TI-994A wrote:
Thorium wrote:For example i implemented a image filter in PureBasic and optimized it with assembly ... It was a very small loop, with just a few variables.
Hi Thorium. Very impressive results; more than fifty times faster compared to vanilla PureBasic, and ten times faster than assembly.

It would be really great if we could see how each of the codes were implemented. Maybe we could all learn something about optimisation.

It's nothing special. It was one of my first attempts at SIMD.
It undoes the "up" filter of the PNG image file format.
The code is very simple; it's only long because of the many implementations.

If you want to get into optimization, just learn some assembly and SIMD (MMX, SSE, AVX). SIMD can be hard to use but it can be very rewarding.

Code: Select all

Structure Tsi_Pixel_Channel
  Channel.a
EndStructure

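;Undoes the up filter in place: adds each byte of the previous row to the byte below it.
;Tsi_Sse2Supported and Tsi_MmxSupported are not defined in this snippet; they are assumed to be global flags set elsewhere.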
Procedure Tsi_UnFilterUp(*ImageData, Width.i, Height.i, PixelSize.i)

  CompilerSelect #PB_Compiler_Processor
  
    CompilerCase #PB_Processor_x86

      If Tsi_Sse2Supported = #True

        ;save registers
        !push esi
        !push edi
        !push ebx

        ;calculate the pointers
        !mov edi,[p.p_ImageData+12]
        !mov esi,edi
        !mov eax,[p.v_Width+12]
        !mul dword[p.v_PixelSize+12]
        !mov edx,eax
        !add edi,edx

        ;calculate the counters
        !mov eax,[p.v_Height+12]
        !dec eax
        !mul dword[p.v_Width+12]
        !mul dword[p.v_PixelSize+12]
        !mov ecx,eax
        !shr ecx,7
        !and eax,127
        !mov ebx,eax
        
        ;process a part of the data to cut the length to a multiple of 128
        !test ebx,ebx
        !je Tsi_UnFilterUp_Sse2CutLengthEnd
        
        !align 4
        !Tsi_UnFilterUp_Sse2CutLengthStart:

          !mov al,[edi]
          !add al,[esi]
          !mov [edi],al
          
          !inc esi
          !inc edi
      
        !dec ebx
        !jne Tsi_UnFilterUp_Sse2CutLengthStart

        !align 4
        !Tsi_UnFilterUp_Sse2CutLengthEnd:
        
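        ;process the rest of the data in blocks of 128 bytes
        ;note: paddb with a memory operand needs a 16-byte aligned address, so edi must be aligned here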
        !test ecx,ecx
        !je Tsi_UnFilterUp_Sse2LoopEnd
        
        !align 4
        !Tsi_UnFilterUp_Sse2LoopStart:

          !movdqu xmm0,[esi]
          !movdqu xmm1,[esi+16]
          !movdqu xmm2,[esi+32]
          !movdqu xmm3,[esi+48]
          !movdqu xmm4,[esi+64]
          !movdqu xmm5,[esi+80]
          !movdqu xmm6,[esi+96]
          !movdqu xmm7,[esi+112]
          
          !paddb xmm0,[edi]
          !paddb xmm1,[edi+16]
          !paddb xmm2,[edi+32]
          !paddb xmm3,[edi+48]
          !paddb xmm4,[edi+64]
          !paddb xmm5,[edi+80]
          !paddb xmm6,[edi+96]
          !paddb xmm7,[edi+112]

          !movdqu [edi],xmm0
          !movdqu [edi+16],xmm1
          !movdqu [edi+32],xmm2
          !movdqu [edi+48],xmm3
          !movdqu [edi+64],xmm4
          !movdqu [edi+80],xmm5
          !movdqu [edi+96],xmm6
          !movdqu [edi+112],xmm7

          !add esi,128
          !add edi,128
        
        !dec ecx
        !jne Tsi_UnFilterUp_Sse2LoopStart
        
        !align 4
        !Tsi_UnFilterUp_Sse2LoopEnd:

        ;restore the registers
        !pop ebx
        !pop edi
        !pop esi

        ;emms only matters after MMX code; it is harmless but not needed after this SSE2-only path
        !emms

      ElseIf Tsi_MmxSupported = #True

        ;save registers
        !push esi
        !push edi
        !push ebx

        ;calculate the pointers
        !mov edi,[p.p_ImageData+12]
        !mov esi,edi
        !mov eax,[p.v_Width+12]
        !mul dword[p.v_PixelSize+12]
        !mov edx,eax
        !add edi,edx

        ;calculate the counters
        !mov eax,[p.v_Height+12]
        !dec eax
        !mul dword[p.v_Width+12]
        !mul dword[p.v_PixelSize+12]
        !mov ecx,eax
        !shr ecx,6
        !and eax,63
        !mov ebx,eax

        ;process a part of the data to cut the length to a multiple of 64
        !test ebx,ebx
        !je Tsi_UnFilterUp_MmxCutLengthEnd
        
        !align 4
        !Tsi_UnFilterUp_MmxCutLengthStart:
      
          !mov al,[edi]
          !add al,[esi]
          !mov [edi],al
          
          !inc esi
          !inc edi
      
        !dec ebx
        !jne Tsi_UnFilterUp_MmxCutLengthStart

        !align 4
        !Tsi_UnFilterUp_MmxCutLengthEnd:

        ;process the rest of the data
        !test ecx,ecx
        !je Tsi_UnFilterUp_MmxLoopEnd

        !align 4
        !Tsi_UnFilterUp_MmxLoopStart:

          !movq mm0,[esi]
          !movq mm1,[esi+8]
          !movq mm2,[esi+16]
          !movq mm3,[esi+24]
          !movq mm4,[esi+32]
          !movq mm5,[esi+40]
          !movq mm6,[esi+48]
          !movq mm7,[esi+56]

          !paddb mm0,[edi]
          !paddb mm1,[edi+8]
          !paddb mm2,[edi+16]
          !paddb mm3,[edi+24]
          !paddb mm4,[edi+32]
          !paddb mm5,[edi+40]
          !paddb mm6,[edi+48]
          !paddb mm7,[edi+56]

          !movq [edi],mm0
          !movq [edi+8],mm1
          !movq [edi+16],mm2
          !movq [edi+24],mm3
          !movq [edi+32],mm4
          !movq [edi+40],mm5
          !movq [edi+48],mm6
          !movq [edi+56],mm7
          
          !add esi,64
          !add edi,64

        !dec ecx
        !jne Tsi_UnFilterUp_MmxLoopStart

        !align 4
        !Tsi_UnFilterUp_MmxLoopEnd:

        ;restore the registers
        !pop ebx
        !pop edi
        !pop esi

        ;end MMX state
        !emms
      
      Else
      
        !push esi
        !push edi
  
        !mov eax,[p.v_Height+8]
        !dec eax
        !mul dword[p.v_Width+8]
        !mul dword[p.v_PixelSize+8]
        !mov ecx,eax
        
        !mov edi,[p.p_ImageData+8]
        !mov esi,edi
        
        !mov eax,[p.v_Width+8]
        !mul dword[p.v_PixelSize+8]
        !mov edx,eax
        !add edi,edx
        
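        ;skip the loop if there is nothing to process (Height = 1), otherwise ecx would wrap around
        !test ecx,ecx
        !je Tsi_UnFilterUp_LoopEnd

        !align 4
        !Tsi_UnFilterUp_LoopStart:
        
          !mov al,[edi]
          !add al,[esi]
          !mov [edi],al
          
          !inc esi
          !inc edi
          
        !dec ecx
        !jne Tsi_UnFilterUp_LoopStart

        !Tsi_UnFilterUp_LoopEnd: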
      
        !pop edi
        !pop esi
      
      EndIf
    
    CompilerCase #PB_Processor_x64

      If Tsi_Sse2Supported = #True

        ;save registers
        !push rsi
        !push rdi

        ;calculate the pointers
        !mov rdi,[p.p_ImageData+16]
        !mov rsi,rdi
        !mov rax,[p.v_Width+16]
        !mul qword[p.v_PixelSize+16]
        !mov rdx,rax
        !add rdi,rdx

        ;calculate the counters
        !mov rax,[p.v_Height+16]
        !dec rax
        !mul qword[p.v_Width+16]
        !mul qword[p.v_PixelSize+16]
        !mov rcx,rax
        !shr rcx,7
        !and rax,127
        !mov r10,rax

        ;process a part of the data to cut the length to a multiple of 128
        !test r10,r10
        !je Tsi_UnFilterUp_Sse2CutLengthEnd
        
        !align 8
        !Tsi_UnFilterUp_Sse2CutLengthStart:
      
          !mov al,[rdi]
          !add al,[rsi]
          !mov [rdi],al
          
          !inc rsi
          !inc rdi
      
        !dec r10
        !jne Tsi_UnFilterUp_Sse2CutLengthStart

        !align 8
        !Tsi_UnFilterUp_Sse2CutLengthEnd:
        
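        ;process the rest of the data in blocks of 128 bytes
        ;note: paddb with a memory operand needs a 16-byte aligned address, so rdi must be aligned here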
        !test rcx,rcx
        !je Tsi_UnFilterUp_Sse2LoopEnd
        
        !align 8
        !Tsi_UnFilterUp_Sse2LoopStart:

          !movdqu xmm0,[rsi]
          !movdqu xmm1,[rsi+16]
          !movdqu xmm2,[rsi+32]
          !movdqu xmm3,[rsi+48]
          !movdqu xmm4,[rsi+64]
          !movdqu xmm5,[rsi+80]
          !movdqu xmm6,[rsi+96]
          !movdqu xmm7,[rsi+112]
          
          !paddb xmm0,[rdi]
          !paddb xmm1,[rdi+16]
          !paddb xmm2,[rdi+32]
          !paddb xmm3,[rdi+48]
          !paddb xmm4,[rdi+64]
          !paddb xmm5,[rdi+80]
          !paddb xmm6,[rdi+96]
          !paddb xmm7,[rdi+112]

          !movdqu [rdi],xmm0
          !movdqu [rdi+16],xmm1
          !movdqu [rdi+32],xmm2
          !movdqu [rdi+48],xmm3
          !movdqu [rdi+64],xmm4
          !movdqu [rdi+80],xmm5
          !movdqu [rdi+96],xmm6
          !movdqu [rdi+112],xmm7

          !add rsi,128
          !add rdi,128
        
        !dec rcx
        !jne Tsi_UnFilterUp_Sse2LoopStart
        
        !align 8
        !Tsi_UnFilterUp_Sse2LoopEnd:

        ;restore the registers
        !pop rdi
        !pop rsi

        ;emms only matters after MMX code; it is harmless but not needed after this SSE2-only path
        !emms

      ElseIf Tsi_MmxSupported = #True

        ;save registers
        !push rsi
        !push rdi

        ;calculate the pointers
        !mov rdi,[p.p_ImageData+16]
        !mov rsi,rdi
        !mov rax,[p.v_Width+16]
        !mul qword[p.v_PixelSize+16]
        !mov rdx,rax
        !add rdi,rdx

        ;calculate the counters
        !mov rax,[p.v_Height+16]
        !dec rax
        !mul qword[p.v_Width+16]
        !mul qword[p.v_PixelSize+16]
        !mov rcx,rax
        !shr rcx,6
        !and rax,63
        !mov r10,rax

        ;process a part of the data to cut the length to a multiple of 64
        !test r10,r10
        !je Tsi_UnFilterUp_MmxCutLengthEnd
        
        !align 8
        !Tsi_UnFilterUp_MmxCutLengthStart:
      
          !mov al,[rdi]
          !add al,[rsi]
          !mov [rdi],al
          
          !inc rsi
          !inc rdi
      
        !dec r10
        !jne Tsi_UnFilterUp_MmxCutLengthStart

        !align 8
        !Tsi_UnFilterUp_MmxCutLengthEnd:

        ;process the rest of the data
        !test rcx,rcx
        !je Tsi_UnFilterUp_MmxLoopEnd

        !align 8
        !Tsi_UnFilterUp_MmxLoopStart:

          !movq mm0,[rsi]
          !movq mm1,[rsi+8]
          !movq mm2,[rsi+16]
          !movq mm3,[rsi+24]
          !movq mm4,[rsi+32]
          !movq mm5,[rsi+40]
          !movq mm6,[rsi+48]
          !movq mm7,[rsi+56]

          !paddb mm0,[rdi]
          !paddb mm1,[rdi+8]
          !paddb mm2,[rdi+16]
          !paddb mm3,[rdi+24]
          !paddb mm4,[rdi+32]
          !paddb mm5,[rdi+40]
          !paddb mm6,[rdi+48]
          !paddb mm7,[rdi+56]

          !movq [rdi],mm0
          !movq [rdi+8],mm1
          !movq [rdi+16],mm2
          !movq [rdi+24],mm3
          !movq [rdi+32],mm4
          !movq [rdi+40],mm5
          !movq [rdi+48],mm6
          !movq [rdi+56],mm7
          
          !add rsi,64
          !add rdi,64

        !dec rcx
        !jne Tsi_UnFilterUp_MmxLoopStart

        !align 8
        !Tsi_UnFilterUp_MmxLoopEnd:

        ;restore the registers
        !pop rdi
        !pop rsi

        ;end MMX state
        !emms

      Else
      
        !push rsi
        !push rdi

        !mov rax,[p.v_Height+16]
        !dec rax
        !mul qword[p.v_Width+16]
        !mul qword[p.v_PixelSize+16]        
        !mov rcx,rax
        
        !mov rdi,[p.p_ImageData+16]
        !mov rsi,rdi
        !mov rax,[p.v_Width+16]
        !mul qword[p.v_PixelSize+16]
        !mov rdx,rax        
        !add rdi,rdx
        
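        ;skip the loop if there is nothing to process (Height = 1), otherwise rcx would wrap around
        !test rcx,rcx
        !je Tsi_UnFilterUp_LoopEnd

        !align 8
        !Tsi_UnFilterUp_LoopStart:
        
          !mov al,[rdi]
          !add al,[rsi]
          !mov [rdi],al
          
          !inc rsi
          !inc rdi
          
        !dec rcx
        !jne Tsi_UnFilterUp_LoopStart

        !Tsi_UnFilterUp_LoopEnd: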
      
        !pop rdi
        !pop rsi
      
      EndIf

    CompilerDefault
    
      Protected.i X, ByteSize
      Protected *ActualChannel.Tsi_Pixel_Channel
      Protected *PriorChannel.Tsi_Pixel_Channel
      
      *PriorChannel  = *ImageData
      *ActualChannel = *ImageData + Width * PixelSize
      
      Height - 1
      ByteSize = Width * Height * PixelSize
      
      For X = 1 To ByteSize
        
        *ActualChannel\Channel = *ActualChannel\Channel + *PriorChannel\Channel
        *ActualChannel + 1
        *PriorChannel + 1
    
      Next

  CompilerEndSelect

EndProcedure
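
For anyone who wants to see what the filter actually does to the bytes, here is a small worked example. It is a sketch under assumptions, not part of the code above: UnFilterUp_Plain and ByteCell are hypothetical names, and the procedure is just a stand-alone copy of the CompilerDefault branch so it runs without the CPU feature flags.

```purebasic
; Hypothetical stand-alone copy of the CompilerDefault branch, for illustration only.
Structure ByteCell
  b.a
EndStructure

Procedure UnFilterUp_Plain(*ImageData, Width.i, Height.i, PixelSize.i)
  Protected.i X, ByteSize
  Protected *Actual.ByteCell
  Protected *Prior.ByteCell

  *Prior  = *ImageData
  *Actual = *ImageData + Width * PixelSize
  ByteSize = Width * (Height - 1) * PixelSize

  For X = 1 To ByteSize
    ; .a is an unsigned byte, so the stored sum wraps around modulo 256
    *Actual\b = *Actual\b + *Prior\b
    *Actual + 1
    *Prior + 1
  Next
EndProcedure

; 3 pixels wide, 2 rows, 1 byte per pixel
Define *Buf = AllocateMemory(6)
If *Buf
  ; row 0 (already unfiltered)
  PokeA(*Buf + 0, 10) : PokeA(*Buf + 1, 20) : PokeA(*Buf + 2, 200)
  ; row 1 (stored as differences to row 0)
  PokeA(*Buf + 3, 1)  : PokeA(*Buf + 4, 2)  : PokeA(*Buf + 5, 100)

  UnFilterUp_Plain(*Buf, 3, 2, 1)

  ; row 1 after unfiltering: 11, 22 and 44 (200 + 100 = 300 wraps to 44)
  Debug PeekA(*Buf + 3)
  Debug PeekA(*Buf + 4)
  Debug PeekA(*Buf + 5)
  FreeMemory(*Buf)
EndIf
```

The first row is never touched; every later row is rebuilt from the row above it, which is why the loop only covers Height - 1 rows.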

Posted: **Fri May 22, 2015 7:53 am**

wilbert wrote:It's not uncommon to get a massive speed increase when you hand code some part with assembler code especially when you can use SSE.

Hi wilbert. Great example of how good coding makes a difference. Thank you.

Thorium wrote:SIMD can be hard to use but it can be very rewarding.

Thanks for the code, Thorium. That's my point exactly.

Posted: **Fri May 22, 2015 9:06 am**

This is actually the thread I remember:
http://www.purebasic.fr/english/viewtop ... 17&t=49882

Posted: **Fri May 22, 2015 10:51 am**

TI-994A wrote:
wilbert wrote:It's not uncommon to get a massive speed increase when you hand code some part with assembler code especially when you can use SSE.
Hi wilbert. Great example of how good coding makes a difference. Thank you.

Thorium wrote:SIMD can be hard to use but it can be very rewarding.
Thanks for the code, Thorium. That's my point exactly.

There are also optimizations that can be done in PureBasic that are often overlooked.

For example, the filter loop in PureBasic is fairly optimized.

-It precalculates the pointers, so inside the loop it moves to the next pixel channel just by incrementing the pointer. No need for pointer calculation inside the loop.
Always precalculate as much as possible before entering a loop. Multiplication and division especially kill performance.

-It uses just one loop to process the whole image. No nested loop, so no additional overhead. An image has height and width, but it's a continuous string of pixels in memory. So you don't need a loop for Y and a nested loop for X. You can just calculate the image size in bytes and use only one loop. You might end up processing line padding bytes as well, but it can still be faster.

-Additional optimizations could be done, for example unrolling the loop: processing multiple channels and pixels per iteration.

Code: Select all

      Protected.i X, ByteSize
      Protected *ActualChannel.Tsi_Pixel_Channel
      Protected *PriorChannel.Tsi_Pixel_Channel
     
      *PriorChannel  = *ImageData
      *ActualChannel = *ImageData + Width * PixelSize
     
      Height - 1
      ByteSize = Width * Height * PixelSize
     
      For X = 1 To ByteSize
       
        *ActualChannel\Channel = *ActualChannel\Channel + *PriorChannel\Channel
        *ActualChannel + 1
        *PriorChannel + 1
   
      Next
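
The unrolling idea mentioned above could look roughly like this. It is a minimal sketch, not a measured improvement: the structure and procedure names (FourBytes, OneByte, UnFilterUp_Unrolled) are made up for the example, and whether it is actually faster than the simple loop has to be benchmarked.

```purebasic
; Sketch of a 4x unrolled variant. All names here are hypothetical.
Structure FourBytes
  b0.a
  b1.a
  b2.a
  b3.a
EndStructure

Structure OneByte
  b.a
EndStructure

Procedure UnFilterUp_Unrolled(*ImageData, Width.i, Height.i, PixelSize.i)
  Protected.i X, ByteSize, Blocks, Rest
  Protected *A4.FourBytes, *P4.FourBytes
  Protected *A1.OneByte, *P1.OneByte

  ; everything is calculated once, before the loops
  ByteSize = Width * (Height - 1) * PixelSize
  If ByteSize <= 0
    ProcedureReturn
  EndIf
  Blocks = ByteSize >> 2
  Rest   = ByteSize & 3

  *P4 = *ImageData
  *A4 = *ImageData + Width * PixelSize

  ; main loop: four bytes per iteration
  For X = 1 To Blocks
    *A4\b0 = *A4\b0 + *P4\b0
    *A4\b1 = *A4\b1 + *P4\b1
    *A4\b2 = *A4\b2 + *P4\b2
    *A4\b3 = *A4\b3 + *P4\b3
    *A4 + 4
    *P4 + 4
  Next

  ; tail loop: the 0 to 3 bytes that are left over
  *A1 = *A4
  *P1 = *P4
  For X = 1 To Rest
    *A1\b = *A1\b + *P1\b
    *A1 + 1
    *P1 + 1
  Next
EndProcedure
```

The loop counter and the two pointer increments are now paid once per four bytes instead of once per byte, which is the whole point of unrolling.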

Posted: **Fri Oct 23, 2015 4:24 pm**

As a newbie to PureBasic (two months-ish) and only a hobbyist programmer moving from VB, I find the following from a previous post brilliant:

- executables require absolutely no dependencies or frameworks
- executables can be run out of the box without any installation

The users of the application I am attempting to convert want to be able to run it from a memory stick, just moving from one computer to another. I managed it in VB, but it took ages and is quite bloated with all the DLLs I have to include, plus the cost to my sanity.

I have found PB easy to learn, and the only problems I am having are mainly down to my understanding of the bits and pieces. As I learn more it becomes more apparent that PB is just the way to go. Just scratching the surface as yet.

The lack of lots of code samples is a big drop from VB but the forum here is good for getting answers.

I also have a couple of friends who program for a living and when they start asking me what I am learning I just say PureBasic and put on the most idiotic smile I can to make them feel at home.

Posted: **Fri Oct 23, 2015 9:19 pm**

Code: Select all

MessageRequester("Hello World","Hello World")

One line of code and I now have a GUI app for Windows, Linux, and Mac OS X, in both 32-bit and 64-bit for each, not to mention small native standalone executables.

(didn't take me long to code either! lol)
Ok a messagebox is probably cheating, but OpenWindow() is only one line too
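
For reference, a bare window looks roughly like this. The OpenWindow() call itself is a single line; the event loop around it is what keeps the window on screen. The window number, size, title and flags below are arbitrary choices for the sketch.

```purebasic
; Minimal window; the title, size and flags are arbitrary choices.
If OpenWindow(0, 0, 0, 320, 120, "Hello World", #PB_Window_SystemMenu | #PB_Window_ScreenCentered)
  ; wait until the user closes the window
  Repeat
  Until WaitWindowEvent() = #PB_Event_CloseWindow
EndIf
```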

I tried many different compilers before choosing PureBasic, and some of them couldn't even say Hello World in less than 1 megabyte!

I know 1 megabyte doesn't mean much in 2015, but for me it just doesn't quite feel right when I've seen asm exes that can do it in 1 kilobyte, and I value qualities like accountability, efficiency, performance.

Code: Select all

MessageRequester("Hello World","Hello World")
! xor rax, rax

Version 2.0 featuring inline x86/x64 assembly, because, well, we have that power and not all languages do! It got me interested in the lower levels.

try to port that to Java or VB heehee
some great examples of this posted above in this thread
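
One small caveat on the version 2.0 snippet: rax only exists on x64, so as written it will not assemble in a 32-bit build. A sketch that covers both, borrowing the CompilerSelect pattern from the filter code earlier in the thread (and assuming the assembler backend):

```purebasic
MessageRequester("Hello World", "Hello World")

; Assumes the assembler backend.
CompilerSelect #PB_Compiler_Processor
  CompilerCase #PB_Processor_x64
    !xor rax, rax
  CompilerCase #PB_Processor_x86
    !xor eax, eax
CompilerEndSelect
```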

Posted: **Fri Oct 23, 2015 9:28 pm**

collectordave wrote:

The lack of lots of code samples is a big drop from VB but the forum here is good for getting answers.

RosettaCode currently has 492 pages in its PureBasic category
http://rosettacode.org/wiki/PureBasic

The pages contain programming problems solved using PureBasic.

Posted: **Sat Oct 24, 2015 9:06 am**

Yeeeaaaahhhh you can make a product in 1/5 the turn-around but guess what: PB is sometimes 1% less efficient than a C++ solution written by one of the rare C++ coders who actually knows about compiler design and optimizations..

P.S. PB produces native machine code executables (PE, ELF, Mach-O) on all platforms, just like C and C++ compilers.

If I had about $25,000.00 for every person I know who has never finished a product, doesn't know what a buffer overflow is or how to reverse-engineer a binary, and still tells me why C and C++ are better, I'd buy the company.

Help me defend PB "reputation"
