SwapMemory()

Demivec · Post by **Demivec** » Tue Sep 08, 2009 12:00 am

PureBasic has memory functions to copy and move memory, why not one to swap memory contents?

This is a common element with memory operations and would find many uses.

Something like: SwapMemory(*SourceMemoryID, *DestinationMemoryID, length)

milan1612 · Post by **milan1612** » Tue Sep 08, 2009 9:10 am

Very good suggestion, +1

Trond · Post by **Trond** » Tue Sep 08, 2009 9:14 am

Why don't you just swap the pointers?

Rescator · Post by **Rescator** » Tue Sep 08, 2009 11:00 am

Hmm, good point! Normally you would.
But I can imagine situations where you would need to physically swap the memory contents.
And when it comes to swapping partial memory then it's not possible to do so by simple pointer swapping.

Here is a comparison between two PureBasic and two ASM implementations of a MemorySwap()
Please note that these tests just swap entire memory PureBasic allocations and they must match in length.

Further improvements would be to modify it so it'll work with any memory (PB allocated or not), and maybe some optimization in the looping etc.

These are the test results on my system using size=1024*1024*910 (Which x2 equals around 1.8GB memory total)
Processor=x64, T1=2389ms, T2=2604ms, T3=6406ms, T4=1400ms.
Processor=x86, T1=2243ms, T2=2575ms, T3=6453ms, T4=1220ms.

I was rather surprised myself, it seems the PureBasic team has some very CPU cache friendly Swap code.
Note that this was on a AMD Phenom X3, how does it behave on Core 2 or i7 ?

Oh and I did an extra test on x64 using size=1024*1024*1500
which totals to 3GB, something not possible using x86.
Processor=x64, T1=3963ms, T2=4344ms, T3=10656ms, T4=2335ms.
Wow! Using Quads and Swap seems to really fly on large memory amounts.

NOTE! Only the "native" MemorySwap() and MemorySwap2() works fully on x64, the asm variants will probably choke on x64 systems with more than 4GB mem where the referenced memory is above the 4GB range.

Code: Select all

EnableExplicit

CompilerIf #PB_Compiler_Processor=#PB_Processor_x64
 #Processor=64
CompilerElse
 #Processor=86
CompilerEndIf

Procedure.i MemorySwapASM2(*src,*dst) ;Swap PB allocated memory, size must be the same.
 Protected result.i=#False,srclen.i,dstlen.i,t.b,l.l
 srclen=MemorySize(*src)
 dstlen=MemorySize(*dst)
 If (srclen=dstlen) And (srclen>0)
  If srclen>=SizeOf(Long)
   Repeat
    !MOV ecx,dword [p.p_src]
    !MOV edx,dword [p.p_dst]
    !MOV eax,dword [ecx]
    !XCHG dword [edx],eax
    !XCHG dword [ecx],eax
    srclen-SizeOf(Long)
    *src+SizeOf(Long)
    *dst+SizeOf(Long)
   Until srclen<SizeOf(Long)
  EndIf
  If srclen>0
   Repeat
    t=PeekB(*src)
    PokeB(*src,PeekB(*dst))
    PokeB(*dst,t)
    srclen-SizeOf(Byte)
    *src+SizeOf(Byte)
    *dst+SizeOf(Byte)
   Until srclen<SizeOf(Byte)
  EndIf
  result=#True
 EndIf
 ProcedureReturn result
EndProcedure

Procedure.i MemorySwapASM(*src,*dst) ;Swap PB allocated memory, size must be the same.
 Protected result.i=#False,srclen.i,dstlen.i,t.b
 srclen=MemorySize(*src)
 dstlen=MemorySize(*dst)
 If (srclen=dstlen) And (srclen>0)
  If srclen>=SizeOf(Long)
   Repeat
    !MOV ecx,dword [p.p_src]
    !MOV eax,dword [ecx]
    !MOV edx,dword [p.p_dst]
    !XOR eax,dword [edx]
    !XOR dword [edx],eax
    !XOR eax,dword [edx]
    !MOV dword [ecx],eax
    srclen-SizeOf(Long)
    *src+SizeOf(Long)
    *dst+SizeOf(Long)
   Until srclen<SizeOf(Long)
  EndIf
  If srclen>0
   Repeat
    t=PeekB(*src)
    PokeB(*src,PeekB(*dst))
    PokeB(*dst,t)
    srclen-SizeOf(Byte)
    *src+SizeOf(Byte)
    *dst+SizeOf(Byte)
   Until srclen<SizeOf(Byte)
  EndIf
  result=#True
 EndIf
 ProcedureReturn result
EndProcedure

Procedure.i MemorySwap2(*src.Quad,*dst.Quad) ;Swap PB allocated memory, size must be the same.
 Protected result.i=#False,srclen.i,dstlen.i,t.b
 srclen=MemorySize(*src)
 dstlen=MemorySize(*dst)
 If (srclen=dstlen) And (srclen>0)
  If srclen>=SizeOf(Quad)
   Repeat
    Swap *src\q,*dst\q
    srclen-SizeOf(Quad)
    *src+SizeOf(Quad)
    *dst+SizeOf(Quad)
   Until srclen<SizeOf(Quad)
  EndIf
  If srclen>0
   Repeat
    t=PeekB(*src)
    PokeB(*src,PeekB(*dst))
    PokeB(*dst,t)
    srclen-SizeOf(Byte)
    *src+SizeOf(Byte)
    *dst+SizeOf(Byte)
   Until srclen<SizeOf(Byte)
  EndIf
  result=#True
 EndIf
 ProcedureReturn result
EndProcedure

Procedure.i MemorySwap(*src.Long,*dst.Long) ;Swap PB allocated memory, size must be the same.
 Protected result.i=#False,srclen.i,dstlen.i,t.b
 srclen=MemorySize(*src)
 dstlen=MemorySize(*dst)
 If (srclen=dstlen) And (srclen>0)
  If srclen>=SizeOf(Long)
   Repeat
    Swap *src\l,*dst\l
    srclen-SizeOf(Long)
    *src+SizeOf(Long)
    *dst+SizeOf(Long)
   Until srclen<SizeOf(Long)
  EndIf
  If srclen>0
   Repeat
    t=PeekB(*src)
    PokeB(*src,PeekB(*dst))
    PokeB(*dst,t)
    srclen-SizeOf(Byte)
    *src+SizeOf(Byte)
    *dst+SizeOf(Byte)
   Until srclen<SizeOf(Byte)
  EndIf
  result=#True
 EndIf
 ProcedureReturn result
EndProcedure

CompilerIf #PB_Compiler_Debugger

;Example
Define size.i,*source,*destination,*pos.Byte,text$
size=15
*source=AllocateMemory(size)
*destination=AllocateMemory(size)
FillMemory(*source,size,$AB,#PB_Byte)
FillMemory(*destination,size,$CD,#PB_Byte)

text$=""
For *pos=*source To *source+(size-1)
 text$+RSet(Hex(*pos\b,#PB_Byte),2,"0")
Next
Debug "Src before: "+text$
text$=""
For *pos=*destination To *destination+(size-1)
 text$+RSet(Hex(*pos\b,#PB_Byte),2,"0")
Next
Debug "Dst before: "+text$
Debug ""

If MemorySwapASM(*source,*destination)
 text$=""
 For *pos=*source To *source+(size-1)
  text$+RSet(Hex(*pos\b,#PB_Byte),2,"0")
 Next
 Debug "Src after: "+text$
 text$=""
 For *pos=*destination To *destination+(size-1)
  text$+RSet(Hex(*pos\b,#PB_Byte),2,"0")
 Next
 Debug "Dst after: "+text$
Else
 Debug "Size not equal!"
EndIf

FreeMemory(*source)
FreeMemory(*destination)

CompilerElse

;Speed test, Compile without debugger to run.
Define size.i,*source,*destination
Define t.l,t1.l,t2.l,t3.l,t4.l,memerror.i=#False
timeBeginPeriod_(1)
size=1024*1024*100 ;set this as high as you are able to.
*source=AllocateMemory(size)
*destination=AllocateMemory(size)
If *source=#Null
 memerror=#True
EndIf
If *destination=#Null
 memerror=#True
EndIf
If Not memerror
 FillMemory(*source,size,$AB,#PB_Byte)
 FillMemory(*destination,size,$CD,#PB_Byte)
 
 t1=timeGetTime_()
 MemorySwap(*source,*destination)
 t=timeGetTime_()
 t1=t-t1
 t2=timeGetTime_()
 MemorySwapASM(*source,*destination)
 t=timeGetTime_()
 t2=t-t2
 t3=timeGetTime_()
 MemorySwapASM2(*source,*destination)
 t=timeGetTime_()
 t3=t-t3
 t4=timeGetTime_()
 MemorySwap2(*source,*destination)
 t=timeGetTime_()
 t4=t-t4
 
 MessageRequester("Result","Processor=x"+Str(#Processor)+", T1="+Str(t1)+"ms, "+"T2="+Str(t2)+"ms, "+"T3="+Str(t3)+"ms, "+"T4="+Str(t4)+"ms.")
Else
 MessageRequester("Result","Not enough memory for allocations!")
EndIf

If *source
 FreeMemory(*source)
EndIf
If *destination
 FreeMemory(*destination)
EndIf

 timeEndPeriod_(1)
;End

CompilerEndIf

End

And here is one that will work with any memory and length, and probably what the original poster really wanted, speed should be the same as MemorySwap2() above:

Code: Select all

Procedure.i MemorySwap(*src.Quad,*dst.Quad,length.i)
 Protected result.i=#False,t.b
 If length>0
  If length>=SizeOf(Quad)
   Repeat
    Swap *src\q,*dst\q
    length-SizeOf(Quad)
    *src+SizeOf(Quad)
    *dst+SizeOf(Quad)
   Until length<SizeOf(Quad)
  EndIf
  If length>0
   Repeat
    t=PeekB(*src)
    PokeB(*src,PeekB(*dst))
    PokeB(*dst,t)
    length-SizeOf(Byte)
    *src+SizeOf(Byte)
    *dst+SizeOf(Byte)
   Until length<SizeOf(Byte)
  EndIf
  result=#True
 EndIf
 ProcedureReturn result
EndProcedure

EDIT: Fixed a bug with MemorySwap(*src.Quad,*dst.Quad,length.i) the byte copy part didn't reduce length count, oops.

Post by **Fred** » Tue Sep 08, 2009 11:16 am

You use XCHG which implements an implicit lock, that's why it's slower (i think).

Rescator · Post by **Rescator** » Tue Sep 08, 2009 11:29 am

You're correct, I read that too.
But in this case that memory bus lock is of no use as the memory could change during swap anyway since one value has to be kept in a register.

I'm still impressed by the speed of your Swap code though, I thought for sure that the XOR method was the fastest. O.o

Post by **Fred** » Tue Sep 08, 2009 11:31 am

PB swap is inlined, so you can take a look to the generated code. It does use register only

Rescator · Post by **Rescator** » Tue Sep 08, 2009 11:41 am

Yup!

Code: Select all

; Swap *src\q,*dst\q
  MOV    ebp,dword [esp+PS4+0]
  LEA    eax,[ebp]
  MOV    ebp,dword [esp+PS4+8-4]
  FLD    qword [ebp]
  FLD    qword [eax]
  FSTP   qword [ebp]
  FSTP   qword [eax]

I'd never have though of doing it this way though.
FLD and FSTP are float ops after all.

I guess the only way to make a Memory Swap faster would be to use MMX and SSE features, in other words this is as fast as it gets using normal x86 calls? or is it possible to improve the Swap using x64 code for PureBasic x64 ?

Post by **Fred** » Tue Sep 08, 2009 12:57 pm

It already does on x64.

Thorium · Post by **Thorium** » Tue Sep 08, 2009 2:31 pm

On Core i7 920: T1=1227 T2=1146 T3=4254 T4=492 with 1024*1024*910

Thorium · Post by **Thorium** » Tue Sep 08, 2009 3:06 pm

Just wrote a "real" Assembler procedure. real = pure assembler, no PB in procedure.

Speed: 329ms on 1024*1024*910

Code: Select all

Procedure MemorySwapASM3(*src, *dst, len.i)

  !mov esi,[p.p_src]
  !mov edi,[p.p_dst]
  !mov ecx,[p.v_len]
  
  !mov ebx,ecx
  !shr ecx,2
  
  !align 4
  !MemorySwapASM3_LoopStart1:
  !cmp ecx,0
  !je MemorySwapASM3_LoopEnd1

    !mov eax,[esi]
    !mov edx,[edi]
    !mov [esi],edx
    !mov [edi],eax
    !add esi,4
    !add edi,4
    !dec ecx

  !jmp MemorySwapASM3_LoopStart1
  !MemorySwapASM3_LoopEnd1:
  
  !mov ecx,ebx
  !and ecx,3
  
  !MemorySwapASM3_LoopStart2:
  !cmp ecx,0
  !je MemorySwapASM3_LoopEnd2
  
    !mov al,[esi]
    !mov dl,[edi]
    !mov [esi],dl
    !mov [edi],al
    !inc esi
    !inc edi
    !dec ecx
  
  !jmp MemorySwapASM3_LoopStart2
  !MemorySwapASM3_LoopEnd2:

  ProcedureReturn #True
  
EndProcedure

Thorium · Post by **Thorium** » Tue Sep 08, 2009 3:24 pm

x64 Version of my procedure is faster than x86:

T1=1017 T2=1149 T3=4257 T4=538
my prodecure = 276
on 1024*1024*910

Code: Select all

Procedure MemorySwapASM3(*src, *dst, len.i)

  !mov rsi,[p.p_src]
  !mov rdi,[p.p_dst]
  !mov rcx,[p.v_len]
  
  !mov rbx,rcx
  !shr rcx,3
  
  !align 4
  !MemorySwapASM3_LoopStart1:
  !cmp rcx,0
  !je MemorySwapASM3_LoopEnd1

    !mov rax,[rsi]
    !mov rdx,[rdi]
    !mov [rsi],rdx
    !mov [rdi],rax
    !add rsi,8
    !add rdi,8
    !dec rcx

  !jmp MemorySwapASM3_LoopStart1
  !MemorySwapASM3_LoopEnd1:
  
  !mov rcx,rbx
  !and rcx,7
  
  !MemorySwapASM3_LoopStart2:
  !cmp rcx,0
  !je MemorySwapASM3_LoopEnd2
  
    !mov al,[rsi]
    !mov dl,[rdi]
    !mov [rsi],dl
    !mov [rdi],al
    !inc rsi
    !inc rdi
    !dec rcx
  
  !jmp MemorySwapASM3_LoopStart2
  !MemorySwapASM3_LoopEnd2:

  ProcedureReturn #True
  
EndProcedure

Rescator · Post by **Rescator** » Tue Sep 08, 2009 3:41 pm

Hmm! I just remembered something...
x64 CPU's, don't all of them have at least SSE2, possibly even SSE3? (aka SIMD instructions?)
If that is the case then a lot of the x64 code could just utilize those features. (I'm sure PB GFX stuff would benefit a lot)
But simple things like this could too?

Ah found it: http://en.wikipedia.org/wiki/X64

SSE and SSE2 are available in 32-bit mode in modern x86 processors; however, if they're used in 32-bit programs, those programs will only work on systems with processors that have the feature. This is not an issue in 64-bit programs, as all AMD64 processors have SSE and SSE2, so using SSE and SSE2 instructions instead of x87 instructions does not reduce the set of machines on which x64 programs can be run. SSE and SSE2 are generally faster than, and duplicate most of the features of the traditional x87 instructions, MMX, and 3DNow!.

So this means all x64 CPU's (AMD/Intel) have at least SSE+SSE2 capabilities, thus one can assume by default that SSE and SSE2 is available on x64 (unlike on x86). The question is whether the PureBasic team has enough time to SIMD'ify PureBasic x64 that much, hopefully in the long run though.

Trond · Post by **Trond** » Tue Sep 08, 2009 3:45 pm

Swap with two longs uses the stack rather than registers in 4.40 b2 x86:

Code: Select all

; Swap a, b
  PUSH   dword [v_b]
  MOV    edx,dword [v_a]
  MOV    dword [v_b],edx
  POP    dword [v_a]

This could be optimized into this:

Code: Select all

  !mov eax, [v_a]
  !mov ebx, [v_b]
  !mov [v_a], ebx
  !mov [v_b], eax

Trond · Post by **Trond** » Tue Sep 08, 2009 3:49 pm

And from what I heard, when the fpu loads a denormalized number, it normalizes it if possible. In other words, quad swap will change the values.