SwapMemory()
PureBasic has memory functions to copy and move memory, why not one to swap memory contents?
This is a common element with memory operations and would find many uses.
Something like: SwapMemory(*SourceMemoryID, *DestinationMemoryID, length)
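For what it's worth, the requested behaviour can already be approximated with the existing copy functions. A minimal sketch, assuming both regions have the same length and do not overlap; SwapMemoryViaCopy() is a made-up name for illustration, not a PureBasic built-in:

```purebasic
; Hypothetical helper (name and signature are assumptions, not a built-in).
; Swaps two non-overlapping regions of equal length via a temporary buffer.
Procedure.i SwapMemoryViaCopy(*a, *b, length.i)
  Protected *tmp, result.i = #False
  If *a And *b And length > 0
    *tmp = AllocateMemory(length)
    If *tmp
      CopyMemory(*a, *tmp, length)   ; tmp <- a
      CopyMemory(*b, *a, length)     ; a   <- b
      CopyMemory(*tmp, *b, length)   ; b   <- tmp
      FreeMemory(*tmp)
      result = #True
    EndIf
  EndIf
  ProcedureReturn result
EndProcedure
```

This needs a temporary buffer as large as the region and three full copies, which is the overhead a dedicated swap function would avoid.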
Hmm, good point! Normally you would just swap the pointers.
But I can imagine situations where you would need to physically swap the memory contents.
And when it comes to swapping partial memory then it's not possible to do so by simple pointer swapping.
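To illustrate the difference, here is a small sketch (the buffer sizes and variable names are arbitrary, chosen only for this example). Exchanging two whole allocations only needs the pointer values to change hands, while a partial swap has to move the bytes themselves:

```purebasic
Define *a, *b, *tmp, t.b
*a = AllocateMemory(16)
*b = AllocateMemory(16)
If *a And *b
  FillMemory(*a, 16, $AB, #PB_Byte)
  FillMemory(*b, 16, $CD, #PB_Byte)
  ; Whole-buffer "swap": only the two pointer values are exchanged, no bytes are moved.
  *tmp = *a : *a = *b : *b = *tmp
  ; Partial swap (here just the first byte of each buffer): the data itself has to be moved.
  t = PeekB(*a)
  PokeB(*a, PeekB(*b))
  PokeB(*b, t)
EndIf
If *a : FreeMemory(*a) : EndIf
If *b : FreeMemory(*b) : EndIf
```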
Here is a comparison between two PureBasic and two ASM implementations of a MemorySwap().
Please note that these tests only swap entire PureBasic memory allocations, and the two allocations must match in length.
Further improvements would be to modify it so that it works with any memory (PB allocated or not), and maybe to optimize the looping, etc.
These are the test results on my system using size=1024*1024*910 (which x2 equals around 1.8GB of memory in total). T1=MemorySwap() (Long), T2=MemorySwapASM() (XOR), T3=MemorySwapASM2() (XCHG), T4=MemorySwap2() (Quad).
Processor=x64, T1=2389ms, T2=2604ms, T3=6406ms, T4=1400ms.
Processor=x86, T1=2243ms, T2=2575ms, T3=6453ms, T4=1220ms.
I was rather surprised myself; it seems the PureBasic team has some very CPU cache friendly Swap code.
Note that this was on an AMD Phenom X3. How does it behave on a Core 2 or i7?
Oh, and I did an extra test on x64 using size=1024*1024*1500, which totals 3GB, something not possible using x86.
Processor=x64, T1=3963ms, T2=4344ms, T3=10656ms, T4=2335ms.
Wow! Using Quads and Swap seems to really fly on large amounts of memory.
NOTE! Only the "native" MemorySwap() and MemorySwap2() work fully on x64. The ASM variants load the pointers into 32-bit registers, so they will probably choke on x64 systems with more than 4GB of memory when the referenced memory is above the 4GB range.
And here is one (the second code block below) that will work with any memory and length, and is probably what the original poster really wanted. Speed should be the same as MemorySwap2():
EDIT: Fixed a bug in MemorySwap(*src.Quad,*dst.Quad,length.i): the byte copy part didn't reduce the length count, oops.
Code: Select all
EnableExplicit
CompilerIf #PB_Compiler_Processor=#PB_Processor_x64
#Processor=64
CompilerElse
#Processor=86
CompilerEndIf
Procedure.i MemorySwapASM2(*src,*dst) ;Swap PB allocated memory, size must be the same.
Protected result.i=#False,srclen.i,dstlen.i,t.b,l.l
srclen=MemorySize(*src)
dstlen=MemorySize(*dst)
If (srclen=dstlen) And (srclen>0)
If srclen>=SizeOf(Long)
Repeat
!MOV ecx,dword [p.p_src]
!MOV edx,dword [p.p_dst]
!MOV eax,dword [ecx]
!XCHG dword [edx],eax
!XCHG dword [ecx],eax
srclen-SizeOf(Long)
*src+SizeOf(Long)
*dst+SizeOf(Long)
Until srclen<SizeOf(Long)
EndIf
If srclen>0
Repeat
t=PeekB(*src)
PokeB(*src,PeekB(*dst))
PokeB(*dst,t)
srclen-SizeOf(Byte)
*src+SizeOf(Byte)
*dst+SizeOf(Byte)
Until srclen<SizeOf(Byte)
EndIf
result=#True
EndIf
ProcedureReturn result
EndProcedure
Procedure.i MemorySwapASM(*src,*dst) ;Swap PB allocated memory, size must be the same.
Protected result.i=#False,srclen.i,dstlen.i,t.b
srclen=MemorySize(*src)
dstlen=MemorySize(*dst)
If (srclen=dstlen) And (srclen>0)
If srclen>=SizeOf(Long)
Repeat
!MOV ecx,dword [p.p_src]
!MOV eax,dword [ecx]
!MOV edx,dword [p.p_dst]
!XOR eax,dword [edx]
!XOR dword [edx],eax
!XOR eax,dword [edx]
!MOV dword [ecx],eax
srclen-SizeOf(Long)
*src+SizeOf(Long)
*dst+SizeOf(Long)
Until srclen<SizeOf(Long)
EndIf
If srclen>0
Repeat
t=PeekB(*src)
PokeB(*src,PeekB(*dst))
PokeB(*dst,t)
srclen-SizeOf(Byte)
*src+SizeOf(Byte)
*dst+SizeOf(Byte)
Until srclen<SizeOf(Byte)
EndIf
result=#True
EndIf
ProcedureReturn result
EndProcedure
Procedure.i MemorySwap2(*src.Quad,*dst.Quad) ;Swap PB allocated memory, size must be the same.
Protected result.i=#False,srclen.i,dstlen.i,t.b
srclen=MemorySize(*src)
dstlen=MemorySize(*dst)
If (srclen=dstlen) And (srclen>0)
If srclen>=SizeOf(Quad)
Repeat
Swap *src\q,*dst\q
srclen-SizeOf(Quad)
*src+SizeOf(Quad)
*dst+SizeOf(Quad)
Until srclen<SizeOf(Quad)
EndIf
If srclen>0
Repeat
t=PeekB(*src)
PokeB(*src,PeekB(*dst))
PokeB(*dst,t)
srclen-SizeOf(Byte)
*src+SizeOf(Byte)
*dst+SizeOf(Byte)
Until srclen<SizeOf(Byte)
EndIf
result=#True
EndIf
ProcedureReturn result
EndProcedure
Procedure.i MemorySwap(*src.Long,*dst.Long) ;Swap PB allocated memory, size must be the same.
Protected result.i=#False,srclen.i,dstlen.i,t.b
srclen=MemorySize(*src)
dstlen=MemorySize(*dst)
If (srclen=dstlen) And (srclen>0)
If srclen>=SizeOf(Long)
Repeat
Swap *src\l,*dst\l
srclen-SizeOf(Long)
*src+SizeOf(Long)
*dst+SizeOf(Long)
Until srclen<SizeOf(Long)
EndIf
If srclen>0
Repeat
t=PeekB(*src)
PokeB(*src,PeekB(*dst))
PokeB(*dst,t)
srclen-SizeOf(Byte)
*src+SizeOf(Byte)
*dst+SizeOf(Byte)
Until srclen<SizeOf(Byte)
EndIf
result=#True
EndIf
ProcedureReturn result
EndProcedure
CompilerIf #PB_Compiler_Debugger
;Example
Define size.i,*source,*destination,*pos.Byte,text$
size=15
*source=AllocateMemory(size)
*destination=AllocateMemory(size)
FillMemory(*source,size,$AB,#PB_Byte)
FillMemory(*destination,size,$CD,#PB_Byte)
text$=""
For *pos=*source To *source+(size-1)
text$+RSet(Hex(*pos\b,#PB_Byte),2,"0")
Next
Debug "Src before: "+text$
text$=""
For *pos=*destination To *destination+(size-1)
text$+RSet(Hex(*pos\b,#PB_Byte),2,"0")
Next
Debug "Dst before: "+text$
Debug ""
If MemorySwapASM(*source,*destination)
text$=""
For *pos=*source To *source+(size-1)
text$+RSet(Hex(*pos\b,#PB_Byte),2,"0")
Next
Debug "Src after: "+text$
text$=""
For *pos=*destination To *destination+(size-1)
text$+RSet(Hex(*pos\b,#PB_Byte),2,"0")
Next
Debug "Dst after: "+text$
Else
Debug "Size not equal!"
EndIf
FreeMemory(*source)
FreeMemory(*destination)
CompilerElse
;Speed test, Compile without debugger to run.
Define size.i,*source,*destination
Define t.l,t1.l,t2.l,t3.l,t4.l,memerror.i=#False
timeBeginPeriod_(1)
size=1024*1024*100 ;set this as high as you are able to.
*source=AllocateMemory(size)
*destination=AllocateMemory(size)
If *source=#Null
memerror=#True
EndIf
If *destination=#Null
memerror=#True
EndIf
If Not memerror
FillMemory(*source,size,$AB,#PB_Byte)
FillMemory(*destination,size,$CD,#PB_Byte)
t1=timeGetTime_()
MemorySwap(*source,*destination)
t=timeGetTime_()
t1=t-t1
t2=timeGetTime_()
MemorySwapASM(*source,*destination)
t=timeGetTime_()
t2=t-t2
t3=timeGetTime_()
MemorySwapASM2(*source,*destination)
t=timeGetTime_()
t3=t-t3
t4=timeGetTime_()
MemorySwap2(*source,*destination)
t=timeGetTime_()
t4=t-t4
MessageRequester("Result","Processor=x"+Str(#Processor)+", T1="+Str(t1)+"ms, "+"T2="+Str(t2)+"ms, "+"T3="+Str(t3)+"ms, "+"T4="+Str(t4)+"ms.")
Else
MessageRequester("Result","Not enough memory for allocations!")
EndIf
If *source
FreeMemory(*source)
EndIf
If *destination
FreeMemory(*destination)
EndIf
timeEndPeriod_(1)
;End
CompilerEndIf
End
Code: Select all
Procedure.i MemorySwap(*src.Quad,*dst.Quad,length.i)
Protected result.i=#False,t.b
If length>0
If length>=SizeOf(Quad)
Repeat
Swap *src\q,*dst\q
length-SizeOf(Quad)
*src+SizeOf(Quad)
*dst+SizeOf(Quad)
Until length<SizeOf(Quad)
EndIf
If length>0
Repeat
t=PeekB(*src)
PokeB(*src,PeekB(*dst))
PokeB(*dst,t)
length-SizeOf(Byte)
*src+SizeOf(Byte)
*dst+SizeOf(Byte)
Until length<SizeOf(Byte)
EndIf
result=#True
EndIf
ProcedureReturn result
EndProcedure
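A possible usage sketch for the partial-swap case, assuming the three-parameter MemorySwap() above is already defined; the buffer size and fill values are arbitrary. It swaps the two halves of a single allocation, which pointer swapping cannot do:

```purebasic
; Assumes the three-parameter MemorySwap(*src, *dst, length) above is already defined.
Define *buf
*buf = AllocateMemory(16)
If *buf
  FillMemory(*buf, 8, $AB, #PB_Byte)       ; first half  = $AB
  FillMemory(*buf + 8, 8, $CD, #PB_Byte)   ; second half = $CD
  ; Swap the two non-overlapping halves of the same allocation.
  If MemorySwap(*buf, *buf + 8, 8)
    Debug "First byte: " + Hex(PeekB(*buf), #PB_Byte)
    Debug "Ninth byte: " + Hex(PeekB(*buf + 8), #PB_Byte)
  EndIf
  FreeMemory(*buf)
EndIf
```

The two regions must not overlap, since each 8-byte step reads and writes both sides.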
Last edited by Rescator on Tue Sep 08, 2009 2:36 pm, edited 1 time in total.
Yup!
I'd never have thought of doing it this way, though.
FLD and FSTP are float ops after all.
I guess the only way to make a memory swap faster would be to use MMX and SSE features. In other words, is this as fast as it gets using normal x86 instructions? Or is it possible to improve the Swap using x64 code for PureBasic x64?
Code: Select all
; Swap *src\q,*dst\q
MOV ebp,dword [esp+PS4+0]
LEA eax,[ebp]
MOV ebp,dword [esp+PS4+8-4]
FLD qword [ebp]
FLD qword [eax]
FSTP qword [ebp]
FSTP qword [eax]
Just wrote a "real" assembler procedure (real = pure assembler, no PB code inside the procedure).
Speed: 329ms on 1024*1024*910
Code: Select all
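Procedure MemorySwapASM3(*src, *dst, len.i)
; PB inline ASM rules: only eax, ecx and edx are volatile on x86, so esi/edi are saved
; in locals (PUSH/POP would shift the esp-relative variable offsets).
Protected save_esi.i, save_edi.i
!mov [p.v_save_esi],esi
!mov [p.v_save_edi],edi
!mov esi,[p.p_src]
!mov edi,[p.p_dst]
!mov ecx,[p.v_len]
!shr ecx,2 ; number of whole 4-byte blocks
!align 4
!MemorySwapASM3_LoopStart1:
!cmp ecx,0
!je MemorySwapASM3_LoopEnd1
!mov eax,[esi]
!mov edx,[edi]
!mov [esi],edx
!mov [edi],eax
!add esi,4
!add edi,4
!dec ecx
!jmp MemorySwapASM3_LoopStart1
!MemorySwapASM3_LoopEnd1:
!mov ecx,[p.v_len] ; reload the length instead of keeping a copy in ebx
!and ecx,3 ; remaining 0-3 bytes
!MemorySwapASM3_LoopStart2:
!cmp ecx,0
!je MemorySwapASM3_LoopEnd2
!mov al,[esi]
!mov dl,[edi]
!mov [esi],dl
!mov [edi],al
!inc esi
!inc edi
!dec ecx
!jmp MemorySwapASM3_LoopStart2
!MemorySwapASM3_LoopEnd2:
!mov esi,[p.v_save_esi] ; restore the preserved registers
!mov edi,[p.v_save_edi]
ProcedureReturn #True
EndProcedure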
The x64 version of my procedure is faster than the x86 one:
T1=1017 T2=1149 T3=4257 T4=538
my procedure = 276
on 1024*1024*910
Code: Select all
Procedure MemorySwapASM3(*src, *dst, len.i)
; r8/r9 are used instead of rsi/rdi/rbx: PB inline ASM rules treat only
; rax, rcx, rdx, r8 and r9 as volatile general-purpose registers on x64.
!mov r8,[p.p_src]
!mov r9,[p.p_dst]
!mov rcx,[p.v_len]
!shr rcx,3 ; number of whole 8-byte blocks
!align 4
!MemorySwapASM3_LoopStart1:
!cmp rcx,0
!je MemorySwapASM3_LoopEnd1
!mov rax,[r8]
!mov rdx,[r9]
!mov [r8],rdx
!mov [r9],rax
!add r8,8
!add r9,8
!dec rcx
!jmp MemorySwapASM3_LoopStart1
!MemorySwapASM3_LoopEnd1:
!mov rcx,[p.v_len] ; reload the length instead of keeping a copy in rbx
!and rcx,7 ; remaining 0-7 bytes
!MemorySwapASM3_LoopStart2:
!cmp rcx,0
!je MemorySwapASM3_LoopEnd2
!mov al,[r8]
!mov dl,[r9]
!mov [r8],dl
!mov [r9],al
!inc r8
!inc r9
!dec rcx
!jmp MemorySwapASM3_LoopStart2
!MemorySwapASM3_LoopEnd2:
ProcedureReturn #True
EndProcedure
Hmm! I just remembered something...
x64 CPUs, don't all of them have at least SSE2, possibly even SSE3 (aka SIMD instructions)?
If that is the case then a lot of the x64 code could just utilize those features. (I'm sure the PB GFX stuff would benefit a lot.)
But couldn't simple things like this use them too?
Ah found it: http://en.wikipedia.org/wiki/X64
Quoting the article: "SSE and SSE2 are available in 32-bit mode in modern x86 processors; however, if they're used in 32-bit programs, those programs will only work on systems with processors that have the feature. This is not an issue in 64-bit programs, as all AMD64 processors have SSE and SSE2, so using SSE and SSE2 instructions instead of x87 instructions does not reduce the set of machines on which x64 programs can be run. SSE and SSE2 are generally faster than, and duplicate most of the features of the traditional x87 instructions, MMX, and 3DNow!."
So this means all x64 CPUs (AMD/Intel) have at least SSE+SSE2 capabilities, thus one can assume by default that SSE and SSE2 are available on x64 (unlike on x86). The question is whether the PureBasic team has enough time to SIMD'ify PureBasic x64 that much. Hopefully in the long run, though.
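As a rough idea of what an SSE2 swap could look like on x64, here is an untested and unbenchmarked sketch. MemorySwapSSE2() and its label names are made up for this example, it assumes the ASM (FASM) compiler backend, and it uses unaligned loads/stores (MOVDQU) so no alignment is required of the buffers:

```purebasic
; x64 only, hypothetical procedure: swaps 16 bytes per step with SSE2, then the 0-15 byte tail.
; Only volatile registers (rax, rcx, rdx, r8, r9, xmm0, xmm1) are used.
Procedure MemorySwapSSE2(*src, *dst, len.i)
  CompilerIf #PB_Compiler_Processor = #PB_Processor_x64
    !mov r8,[p.p_src]
    !mov r9,[p.p_dst]
    !mov rcx,[p.v_len]
    !shr rcx,4                  ; number of whole 16-byte blocks
    !jz MemorySwapSSE2_Tail
    !MemorySwapSSE2_Loop:
    !movdqu xmm0,[r8]           ; load 16 bytes from each side
    !movdqu xmm1,[r9]
    !movdqu [r8],xmm1           ; store them crosswise
    !movdqu [r9],xmm0
    !add r8,16
    !add r9,16
    !dec rcx
    !jnz MemorySwapSSE2_Loop
    !MemorySwapSSE2_Tail:
    !mov rcx,[p.v_len]
    !and rcx,15                 ; remaining 0-15 bytes
    !jz MemorySwapSSE2_Done
    !MemorySwapSSE2_TailLoop:
    !mov al,[r8]
    !mov dl,[r9]
    !mov [r8],dl
    !mov [r9],al
    !inc r8
    !inc r9
    !dec rcx
    !jnz MemorySwapSSE2_TailLoop
    !MemorySwapSSE2_Done:
  CompilerEndIf
  ProcedureReturn #True
EndProcedure
```

Whether this actually beats the plain 64-bit register loop would have to be measured; on large buffers the swap is likely limited by memory bandwidth rather than by the instructions used.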
Swap with two longs uses the stack rather than registers in 4.40 b2 x86:
Code: Select all
; Swap a, b
PUSH dword [v_b]
MOV edx,dword [v_a]
MOV dword [v_b],edx
POP dword [v_a]
This could be optimized into this:
Code: Select all
!mov eax, [v_a]
!mov edx, [v_b] ; edx instead of ebx: ebx is not a volatile register on x86
!mov [v_a], edx
!mov [v_b], eax