Re: Help me defend PB "reputation"
Posted: Thu May 21, 2015 1:04 pm
Your friends are at the very beginning of evolution. Can't compare C++ with 

http://www.purebasic.com
https://www.purebasic.fr/english/
Thorium wrote: "Actually it's very slow compared to optimizing C or C++ compilers ... the resulting executable is much slower..."
TI-994A wrote: "Hi Thorium. Very slow? Much slower? With no optimisation options, speed disparities are to be expected. But what do you consider very much slower?"
It's not easy to give a good answer. It depends on the code. Some years ago someone benchmarked some code from different compilers and compared it to PureBasic. It's somewhere on the forums, but I can't find it. It also included Java.
Thorium wrote: "...As a general statement I would say 50% performance would be normal, compared to C/C++. If you use intrinsics for SIMD it should be around 10 to 20%."
Hi Thorium. At best, double the speed sounds about right. Anything more would require some painstaking, hand-coded compiler directives to achieve.
TI-994A wrote: "At best, double the speed sounds about right. Anything more would require some painstaking, hand-coded compiler directives to achieve."
Not necessarily.
Code:
PureBasic    197 MB/s   100%
Assembler   1178 MB/s   598%
MMX         7133 MB/s  3621%
SSE2       10997 MB/s  5582%
Thorium wrote: "Some years ago someone benchmarked some code from different compilers and compared it to PureBasic. It's somewhere on the forums, but I can't find it. It also included Java."
The only one I remember is this one, but it isn't saying much: http://www.purebasic.fr/english/viewtop ... =7&t=48202
Thorium wrote: "For example, I implemented an image filter in PureBasic and optimized it with assembly ... It was a very small loop, with just a few variables."
Hi Thorium. Very impressive results; more than fifty times faster compared to vanilla PureBasic, and ten times faster than assembly.
Thorium wrote: "For example, I implemented an image filter in PureBasic and optimized it with assembly ... It was a very small loop, with just a few variables."
TI-994A wrote: "Hi Thorium. Very impressive results; more than fifty times faster compared to vanilla PureBasic, and ten times faster than assembly. It would be really great if we could see how each of the codes were implemented. Maybe we could all learn something about optimisation."
It's not uncommon to get a massive speed increase when you hand-code some part in assembler, especially when you can use SSE.
Code:
result.i = 0
t1 = ElapsedMilliseconds()
For i = 1 To 100000000
  result = result + i
  result = result - 50000
  result = result >> 5
Next
t2 = ElapsedMilliseconds()
MessageRequester(Str(t2-t1)+" ms", Str(result))
Code:
; For i = 1 To 100000000
  MOV qword [v_i],1
_For1:
  MOV rax,100000000
  CMP rax,qword [v_i]
  JL _Next2
; result = result + i
  MOV r15,qword [v_result]
  ADD r15,qword [v_i]
  MOV qword [v_result],r15
; result = result - 50000
  MOV r15,qword [v_result]
  ADD r15,-50000
  MOV qword [v_result],r15
; result = result >> 5
  MOV r15,qword [v_result]
  SAR r15,5
  MOV qword [v_result],r15
; Next
_NextContinue2:
  INC qword [v_i]
  JNO _For1
_Next2:
Code:
result = (result + i - 50000) >> 5
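For anyone who wants to try this at home, here is a minimal sketch that times the three-statement loop against the single combined expression. The constant #Iterations and the names resultA, resultB, timeA and timeB are my own and not from the posts above. The generated assembly above shows every separate statement loading result from memory and storing it back again; whether the combined form avoids those round trips on your compiler version is something to measure rather than assume. Run it with the debugger disabled, otherwise the timings mean very little.

```purebasic
; Hypothetical comparison harness; constant and variable names are illustrative.
#Iterations = 100000000

; Variant A: three separate statements, as in the first listing
resultA.i = 0
t1 = ElapsedMilliseconds()
For i = 1 To #Iterations
  resultA = resultA + i
  resultA = resultA - 50000
  resultA = resultA >> 5
Next
timeA = ElapsedMilliseconds() - t1

; Variant B: the same arithmetic as one expression
resultB.i = 0
t1 = ElapsedMilliseconds()
For i = 1 To #Iterations
  resultB = (resultB + i - 50000) >> 5
Next
timeB = ElapsedMilliseconds() - t1

; Both variants perform the same operations in the same order,
; so the results should match; only the timings may differ.
If resultA <> resultB
  MessageRequester("Error", "Results differ")
Else
  MessageRequester("Timing", "Separate: " + Str(timeA) + " ms" + #LF$ + "Combined: " + Str(timeB) + " ms")
EndIf
```

Checking that both results are equal before comparing timings guards against accidentally benchmarking two different computations.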
luis wrote: "The only one I remember is this one, but it isn't saying much: http://www.purebasic.fr/english/viewtop ... =7&t=48202"
Yes, that's the one.
Thorium wrote: "For example, I implemented an image filter in PureBasic and optimized it with assembly ... It was a very small loop, with just a few variables."
TI-994A wrote: "Hi Thorium. Very impressive results; more than fifty times faster compared to vanilla PureBasic, and ten times faster than assembly. It would be really great if we could see how each of the codes were implemented. Maybe we could all learn something about optimisation."
It's nothing special. It was one of my first tries at SIMD.
Code:
Structure Tsi_Pixel_Channel
Channel.a
EndStructure
;Undoes the up filter.
Procedure Tsi_UnFilterUp(*ImageData, Width.i, Height.i, PixelSize.i)
CompilerSelect #PB_Compiler_Processor
CompilerCase #PB_Processor_x86
If Tsi_Sse2Supported = #True
;save registers
!push esi
!push edi
!push ebx
;calculate the pointers
!mov edi,[p.p_ImageData+12]
!mov esi,edi
!mov eax,[p.v_Width+12]
!mul dword[p.v_PixelSize+12]
!mov edx,eax
!add edi,edx
;calculate the counters
!mov eax,[p.v_Height+12]
!dec eax
!mul dword[p.v_Width+12]
!mul dword[p.v_PixelSize+12]
!mov ecx,eax
!shr ecx,7
!and eax,127
!mov ebx,eax
;process a part of the data to cut the length to a multiple of 128
!test ebx,ebx
!je Tsi_UnFilterUp_Sse2CutLengthEnd
!align 4
!Tsi_UnFilterUp_Sse2CutLengthStart:
!mov al,[edi]
!add al,[esi]
!mov [edi],al
!inc esi
!inc edi
!dec ebx
!jne Tsi_UnFilterUp_Sse2CutLengthStart
!align 4
!Tsi_UnFilterUp_Sse2CutLengthEnd:
;process the rest of the data
!test ecx,ecx
!je Tsi_UnFilterUp_Sse2LoopEnd
!align 4
!Tsi_UnFilterUp_Sse2LoopStart:
!movdqu xmm0,[esi]
!movdqu xmm1,[esi+16]
!movdqu xmm2,[esi+32]
!movdqu xmm3,[esi+48]
!movdqu xmm4,[esi+64]
!movdqu xmm5,[esi+80]
!movdqu xmm6,[esi+96]
!movdqu xmm7,[esi+112]
!paddb xmm0,[edi]
!paddb xmm1,[edi+16]
!paddb xmm2,[edi+32]
!paddb xmm3,[edi+48]
!paddb xmm4,[edi+64]
!paddb xmm5,[edi+80]
!paddb xmm6,[edi+96]
!paddb xmm7,[edi+112]
!movdqu [edi],xmm0
!movdqu [edi+16],xmm1
!movdqu [edi+32],xmm2
!movdqu [edi+48],xmm3
!movdqu [edi+64],xmm4
!movdqu [edi+80],xmm5
!movdqu [edi+96],xmm6
!movdqu [edi+112],xmm7
!add esi,128
!add edi,128
!dec ecx
!jne Tsi_UnFilterUp_Sse2LoopStart
!align 4
!Tsi_UnFilterUp_Sse2LoopEnd:
;restore the registers
!pop ebx
!pop edi
!pop esi
;emms is only required after MMX code; it is harmless but not needed after the XMM-only loop above
!emms
ElseIf Tsi_MmxSupported = #True
;save registers
!push esi
!push edi
!push ebx
;calculate the pointers
!mov edi,[p.p_ImageData+12]
!mov esi,edi
!mov eax,[p.v_Width+12]
!mul dword[p.v_PixelSize+12]
!mov edx,eax
!add edi,edx
;calculate the counters
!mov eax,[p.v_Height+12]
!dec eax
!mul dword[p.v_Width+12]
!mul dword[p.v_PixelSize+12]
!mov ecx,eax
!shr ecx,6
!and eax,63
!mov ebx,eax
;process a part of the data to cut the length to a multiple of 64
!test ebx,ebx
!je Tsi_UnFilterUp_MmxCutLengthEnd
!align 4
!Tsi_UnFilterUp_MmxCutLengthStart:
!mov al,[edi]
!add al,[esi]
!mov [edi],al
!inc esi
!inc edi
!dec ebx
!jne Tsi_UnFilterUp_MmxCutLengthStart
!align 4
!Tsi_UnFilterUp_MmxCutLengthEnd:
;process the rest of the data
!test ecx,ecx
!je Tsi_UnFilterUp_MmxLoopEnd
!align 4
!Tsi_UnFilterUp_MmxLoopStart:
!movq mm0,[esi]
!movq mm1,[esi+8]
!movq mm2,[esi+16]
!movq mm3,[esi+24]
!movq mm4,[esi+32]
!movq mm5,[esi+40]
!movq mm6,[esi+48]
!movq mm7,[esi+56]
!paddb mm0,[edi]
!paddb mm1,[edi+8]
!paddb mm2,[edi+16]
!paddb mm3,[edi+24]
!paddb mm4,[edi+32]
!paddb mm5,[edi+40]
!paddb mm6,[edi+48]
!paddb mm7,[edi+56]
!movq [edi],mm0
!movq [edi+8],mm1
!movq [edi+16],mm2
!movq [edi+24],mm3
!movq [edi+32],mm4
!movq [edi+40],mm5
!movq [edi+48],mm6
!movq [edi+56],mm7
!add esi,64
!add edi,64
!dec ecx
!jne Tsi_UnFilterUp_MmxLoopStart
!align 4
!Tsi_UnFilterUp_MmxLoopEnd:
;restore the registers
!pop ebx
!pop edi
!pop esi
;end MMX state
!emms
Else
!push esi
!push edi
!mov eax,[p.v_Height+8]
!dec eax
!mul dword[p.v_Width+8]
!mul dword[p.v_PixelSize+8]
!mov ecx,eax
!mov edi,[p.p_ImageData+8]
!mov esi,edi
!mov eax,[p.v_Width+8]
!mul dword[p.v_PixelSize+8]
!mov edx,eax
!add edi,edx
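;skip the loop if there is nothing to process (e.g. Height = 1), otherwise ecx would wrap around
!test ecx,ecx
!je Tsi_UnFilterUp_LoopEnd
!align 4
!Tsi_UnFilterUp_LoopStart:
!mov al,[edi]
!add al,[esi]
!mov [edi],al
!inc esi
!inc edi
!dec ecx
!jne Tsi_UnFilterUp_LoopStart
!Tsi_UnFilterUp_LoopEnd: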
!pop edi
!pop esi
EndIf
CompilerCase #PB_Processor_x64
If Tsi_Sse2Supported = #True
;save registers
!push rsi
!push rdi
;calculate the pointers
!mov rdi,[p.p_ImageData+16]
!mov rsi,rdi
!mov rax,[p.v_Width+16]
!mul qword[p.v_PixelSize+16]
!mov rdx,rax
!add rdi,rdx
;calculate the counters
!mov rax,[p.v_Height+16]
!dec rax
!mul qword[p.v_Width+16]
!mul qword[p.v_PixelSize+16]
!mov rcx,rax
!shr rcx,7
!and rax,127
!mov r10,rax
;process a part of the data to cut the length to a multiple of 128
!test r10,r10
!je Tsi_UnFilterUp_Sse2CutLengthEnd
!align 8
!Tsi_UnFilterUp_Sse2CutLengthStart:
!mov al,[rdi]
!add al,[rsi]
!mov [rdi],al
!inc rsi
!inc rdi
!dec r10
!jne Tsi_UnFilterUp_Sse2CutLengthStart
!align 8
!Tsi_UnFilterUp_Sse2CutLengthEnd:
;process the rest of the data
!test rcx,rcx
!je Tsi_UnFilterUp_Sse2LoopEnd
!align 8
!Tsi_UnFilterUp_Sse2LoopStart:
!movdqu xmm0,[rsi]
!movdqu xmm1,[rsi+16]
!movdqu xmm2,[rsi+32]
!movdqu xmm3,[rsi+48]
!movdqu xmm4,[rsi+64]
!movdqu xmm5,[rsi+80]
!movdqu xmm6,[rsi+96]
!movdqu xmm7,[rsi+112]
!paddb xmm0,[rdi]
!paddb xmm1,[rdi+16]
!paddb xmm2,[rdi+32]
!paddb xmm3,[rdi+48]
!paddb xmm4,[rdi+64]
!paddb xmm5,[rdi+80]
!paddb xmm6,[rdi+96]
!paddb xmm7,[rdi+112]
!movdqu [rdi],xmm0
!movdqu [rdi+16],xmm1
!movdqu [rdi+32],xmm2
!movdqu [rdi+48],xmm3
!movdqu [rdi+64],xmm4
!movdqu [rdi+80],xmm5
!movdqu [rdi+96],xmm6
!movdqu [rdi+112],xmm7
!add rsi,128
!add rdi,128
!dec rcx
!jne Tsi_UnFilterUp_Sse2LoopStart
!align 8
!Tsi_UnFilterUp_Sse2LoopEnd:
;restore the registers
!pop rdi
!pop rsi
;emms is only required after MMX code; it is harmless but not needed after the XMM-only loop above
!emms
ElseIf Tsi_MmxSupported = #True
;save registers
!push rsi
!push rdi
;calculate the pointers
!mov rdi,[p.p_ImageData+16]
!mov rsi,rdi
!mov rax,[p.v_Width+16]
!mul qword[p.v_PixelSize+16]
!mov rdx,rax
!add rdi,rdx
;calculate the counters
!mov rax,[p.v_Height+16]
!dec rax
!mul qword[p.v_Width+16]
!mul qword[p.v_PixelSize+16]
!mov rcx,rax
!shr rcx,6
!and rax,63
!mov r10,rax
;process a part of the data to cut the length to a multiple of 64
!test r10,r10
!je Tsi_UnFilterUp_MmxCutLengthEnd
!align 8
!Tsi_UnFilterUp_MmxCutLengthStart:
!mov al,[rdi]
!add al,[rsi]
!mov [rdi],al
!inc rsi
!inc rdi
!dec r10
!jne Tsi_UnFilterUp_MmxCutLengthStart
!align 8
!Tsi_UnFilterUp_MmxCutLengthEnd:
;process the rest of the data
!test rcx,rcx
!je Tsi_UnFilterUp_MmxLoopEnd
!align 8
!Tsi_UnFilterUp_MmxLoopStart:
!movq mm0,[rsi]
!movq mm1,[rsi+8]
!movq mm2,[rsi+16]
!movq mm3,[rsi+24]
!movq mm4,[rsi+32]
!movq mm5,[rsi+40]
!movq mm6,[rsi+48]
!movq mm7,[rsi+56]
!paddb mm0,[rdi]
!paddb mm1,[rdi+8]
!paddb mm2,[rdi+16]
!paddb mm3,[rdi+24]
!paddb mm4,[rdi+32]
!paddb mm5,[rdi+40]
!paddb mm6,[rdi+48]
!paddb mm7,[rdi+56]
!movq [rdi],mm0
!movq [rdi+8],mm1
!movq [rdi+16],mm2
!movq [rdi+24],mm3
!movq [rdi+32],mm4
!movq [rdi+40],mm5
!movq [rdi+48],mm6
!movq [rdi+56],mm7
!add rsi,64
!add rdi,64
!dec rcx
!jne Tsi_UnFilterUp_MmxLoopStart
!align 8
!Tsi_UnFilterUp_MmxLoopEnd:
;restore the registers
!pop rdi
!pop rsi
;end MMX state
!emms
Else
!push rsi
!push rdi
!mov rax,[p.v_Height+16]
!dec rax
!mul qword[p.v_Width+16]
!mul qword[p.v_PixelSize+16]
!mov rcx,rax
!mov rdi,[p.p_ImageData+16]
!mov rsi,rdi
!mov rax,[p.v_Width+16]
!mul qword[p.v_PixelSize+16]
!mov rdx,rax
!add rdi,rdx
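;skip the loop if there is nothing to process (e.g. Height = 1), otherwise rcx would wrap around
!test rcx,rcx
!je Tsi_UnFilterUp_LoopEnd
!align 8
!Tsi_UnFilterUp_LoopStart:
!mov al,[rdi]
!add al,[rsi]
!mov [rdi],al
!inc rsi
!inc rdi
!dec rcx
!jne Tsi_UnFilterUp_LoopStart
!Tsi_UnFilterUp_LoopEnd: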
!pop rdi
!pop rsi
EndIf
CompilerDefault
Protected.i X, ByteSize
Protected *ActualChannel.Tsi_Pixel_Channel
Protected *PriorChannel.Tsi_Pixel_Channel
*PriorChannel = *ImageData
*ActualChannel = *ImageData + Width * PixelSize
Height - 1
ByteSize = Width * Height * PixelSize
For X = 1 To ByteSize
*ActualChannel\Channel = *ActualChannel\Channel + *PriorChannel\Channel
*ActualChannel + 1
*PriorChannel + 1
Next
CompilerEndSelect
EndProcedure
wilbert wrote: "It's not uncommon to get a massive speed increase when you hand-code some part in assembler, especially when you can use SSE."
Hi wilbert. Great example of how good coding makes a difference. Thank you.
Thorium wrote: "SIMD can be hard to use but it can be very rewarding."
Thanks for the code, Thorium. That's my point exactly.
wilbert wrote: "It's not uncommon to get a massive speed increase when you hand-code some part in assembler, especially when you can use SSE."
Thorium wrote: "SIMD can be hard to use but it can be very rewarding."
TI-994A wrote: "Hi wilbert. Great example of how good coding makes a difference. Thank you. Thanks for the code, Thorium. That's my point exactly."
There are also optimizations that can be done in PureBasic that are often overlooked.
Code:
Protected.i X, ByteSize
Protected *ActualChannel.Tsi_Pixel_Channel
Protected *PriorChannel.Tsi_Pixel_Channel
*PriorChannel = *ImageData
*ActualChannel = *ImageData + Width * PixelSize
Height - 1
ByteSize = Width * Height * PixelSize
For X = 1 To ByteSize
  *ActualChannel\Channel = *ActualChannel\Channel + *PriorChannel\Channel
  *ActualChannel + 1
  *PriorChannel + 1
Next
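To make it easier to see what this plain loop actually computes, here is a small stand-alone sketch. The procedure name UnFilterUpPlain and the 2x3 single-channel test image are made up for illustration; only the loop body and the Tsi_Pixel_Channel structure come from the code above. Each byte from the second row onward gets the byte directly above it added, with the unsigned byte wrapping modulo 256, and later rows build on rows that were already unfiltered.

```purebasic
; Hypothetical wrapper around the plain loop; name and test data are illustrative.
Structure Tsi_Pixel_Channel
  Channel.a
EndStructure

Procedure UnFilterUpPlain(*ImageData, Width.i, Height.i, PixelSize.i)
  Protected.i X, ByteSize
  Protected *ActualChannel.Tsi_Pixel_Channel
  Protected *PriorChannel.Tsi_Pixel_Channel
  *PriorChannel  = *ImageData                       ; byte in the row above
  *ActualChannel = *ImageData + Width * PixelSize   ; byte in the current row
  ByteSize = Width * (Height - 1) * PixelSize       ; the first row has no row above it
  For X = 1 To ByteSize
    ; Channel is an unsigned byte (.a), so the sum wraps modulo 256
    *ActualChannel\Channel = *ActualChannel\Channel + *PriorChannel\Channel
    *ActualChannel + 1
    *PriorChannel + 1
  Next
EndProcedure

; 2 pixels wide, 3 rows, 1 byte per pixel
Define n.i
*Img = AllocateMemory(6)
If *Img
  PokeA(*Img + 0, 10)  : PokeA(*Img + 1, 20)   ; row 0 (left unchanged)
  PokeA(*Img + 2, 1)   : PokeA(*Img + 3, 2)    ; row 1
  PokeA(*Img + 4, 250) : PokeA(*Img + 5, 10)   ; row 2
  UnFilterUpPlain(*Img, 2, 3, 1)
  For n = 0 To 5
    Debug PeekA(*Img + n)   ; shows 10, 20, 11, 22, 5, 32 (250 + 11 wraps to 5)
  Next
  FreeMemory(*Img)
EndIf
```

The Debug output needs the debugger enabled, so this is for checking correctness only, not for timing.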
Code:
MessageRequester("Hello World","Hello World")
Code:
MessageRequester("Hello World","Hello World")
! xor rax, rax
Quote: "The lack of lots of code samples is a big drop from VB but the forum here is good for getting answers."
RosettaCode currently has 492 pages in its PureBasic category.