Faster memory copy (optimized for AMD)

Share your advanced PureBasic knowledge/code with the community.
User avatar
blueznl
PureBasic Expert
PureBasic Expert
Posts: 6166
Joined: Sat May 17, 2003 11:31 am
Contact:

Post by blueznl »

newbie wrote: EDIT : I have a question :
Does that optimization means that it is by now faster to write :

var1.s = "toto"
var2.s = ""
CopyMemoryAMD(@var1, @var2, 4)

than to use the normal way :

var2 = var1
the above doesn't make sense... and will cause a crash :-)

you're copying four bytes to a string with zero length, thus overwriting other stuff

this may make sense:

var1.s = "toto"
var2.s = "1234"
CopyMemoryAMD(@var1, @var2, 4)
( PB6.00 LTS Win11 x64 Asrock AB350 Pro4 Ryzen 5 3600 32GB GTX1060 6GB)
( The path to enlightenment and the PureBasic Survival Guide right here... )
tonnelier
User
User
Posts: 45
Joined: Mon May 03, 2004 10:34 pm
Location: France

Post by tonnelier »

Strange : on my notebook (pentium mobile 1.7GHz), after copymemoryAMD, the destination area is not equal to the source one.
There are just almost the first 10% of the source that have been copied.

Any idea?
Have you test the correctness of the copy?
User avatar
Psychophanta
Always Here
Always Here
Posts: 5153
Joined: Wed Jun 11, 2003 9:33 pm
Location: Anare
Contact:

Post by Psychophanta »

Oh! Could be this is in fact valid only for AMD? :? :(
http://www.zeitgeistmovie.com

while (world==business) world+=mafia;
tonnelier
User
User
Posts: 45
Joined: Mon May 03, 2004 10:34 pm
Location: France

Post by tonnelier »

I've not examined precisely the code.
But there can't be goods reasons for it works on AMD and not on Pentium :
the only thing that can be different is the speed, because of the caches and prefetching.
But anyway, it must copy correctly the datas for pentium too, if it works on AMD.

Personnaly, I've made a movememory in ASM, using prefetching, which works on Pentiums and AMD, but it runs only 40% faster than PB's copymemory function.
El_Choni
TailBite Expert
TailBite Expert
Posts: 1007
Joined: Fri Apr 25, 2003 6:09 pm
Location: Spain

Post by El_Choni »

About memory copy failing with this snippet, it's weird, I've tested it a lot of times in my system and data is always copied correctly. Maybe something to do with SSE2 support?
El_Choni
venom
User
User
Posts: 56
Joined: Fri Jul 25, 2003 1:54 pm
Location: Australia

Post by venom »

On my Pentium 4 it crashes if I set the size to 800
like:

Code: Select all

a=AllocateMemory(800)
b=AllocateMemory(800)

CopyMemoryAMD(a,b,800)
tonnelier
User
User
Posts: 45
Joined: Mon May 03, 2004 10:34 pm
Location: France

Post by tonnelier »

As for me : it crashes for some sizes.
And when it works, it doesn't copy all the datas.

But the code doesn't use AMD specificals instructions or registers...
:?: :?:
venom
User
User
Posts: 56
Joined: Fri Jul 25, 2003 1:54 pm
Location: Australia

Post by venom »

A size of 1024 works, but like tonnelier it doesn't copy all.

Here it copies until the 129th byte.

Code: Select all

a=AllocateMemory(1024)
b=AllocateMemory(1024)

For i=0 To 1023
  PokeB(a+i,5)
  PokeB(b+i,100)
Next

CopyMemoryAMD(a,b,1024)

For i=0 To 1023
  If PeekB(a+i)<>PeekB(b+i) 
    ;if the data isn't that same report it
    Debug "Byte "+Str(i+1)+" Src: "+Str(PeekB(a+i))+" Dest: "+Str(PeekB(b+i))
  EndIf
Next
Max.²
Enthusiast
Enthusiast
Posts: 175
Joined: Wed Jul 28, 2004 8:38 am

Post by Max.² »

I did test it and indeed it

a) copies exactly 1/8th of the memory correctly for
b) it fails for unaligned memory blocks

There are some minor exceptions, like 63 bytes are copied correctly (all data, not crash), but I guess that has to do with the different routines for different sizes.

So basically, it should be solveable.

Edit: I am not very deep in assembler, but from what I see, the assembler commands work with bits while the PB variables are assigned with bytes. Would fit to the /8 divisor...

Edit2:

Could this be the problem?

Code: Select all

len = size/8
MOV ecx, len ; number of QWORDS (8 bytes) 
I found some similar code on the net:

Code: Select all

static FORCEINLINE void* _op_memcpy_3dnow(void *dest, const void *src,
unsigned nbytes)
{
 __asm
 {
  mov  ecx, [nbytes]  ; number of bytes to copy
When I change to

Code: Select all

Len=Size
it crashes, but for example

Code: Select all

Len=Size/2
reliably copies 4 time as much, which makes me believe that this really is the first problem. There seems to be another one when using the correct size though.

I also noticed that aligning is commented (and doesn't work if uncommented), but isn't that necessary (could have to do with the not-multiple-of... crashes, or?)?
El_Choni
TailBite Expert
TailBite Expert
Posts: 1007
Joined: Fri Apr 25, 2003 6:09 pm
Location: Spain

Post by El_Choni »

Yes, you've found it. LOL, no wonder the procedure worked dozens of times faster than CopyMemory: it was copying only 1/8 of the data :mrgreen:

The worst of this is that the error is mine... Without noticing it, I mixed two sources from this asm thread while looking for a good copy memory algorithm: http://board.flatassembler.net/viewtopic.php?t=2571

And hence all those outstanding results. I my defense, I must say I remember having checked data integrity in my first attempts, but as that part of the code was slowing the research, and data was always correct, I stripped it. When I came to the code I posted here, the error was in.

So, here is the true, definitive, real copy memory algorithm optimized for AMD, with text it includes at the beginning:

Code: Select all

; Copyright (c) 2001 Advanced Micro Devices, Inc.
;
;LIMITATION OF LIABILITY:  THE MATERIALS ARE PROVIDED *AS IS* WITHOUT ANY
;EXPRESS OR IMPLIED WARRANTY OF ANY KIND INCLUDING WARRANTIES OF MERCHANTABILITY,
;NONINFRINGEMENT OF THIRD-PARTY INTELLECTUAL PROPERTY, OR FITNESS FOR ANY
;PARTICULAR PURPOSE.  IN NO EVENT SHALL AMD OR ITS SUPPLIERS BE LIABLE FOR ANY
;DAMAGES WHATSOEVER (INCLUDING, WITHOUT LIMITATION, DAMAGES FOR LOSS OF PROFITS,
;BUSINESS INTERRUPTION, LOSS OF INFORMATION) ARISING OUT OF THE USE OF OR
;INABILITY TO USE THE MATERIALS, EVEN IF AMD HAS BEEN ADVISED OF THE POSSIBILITY
;OF SUCH DAMAGES.  BECAUSE SOME JURISDICTIONS PROHIBIT THE EXCLUSION OR LIMITATION
;OF LIABILITY FOR CONSEQUENTIAL OR INCIDENTAL DAMAGES, THE ABOVE LIMITATION MAY
;NOT APPLY TO YOU.
;
;AMD does not assume any responsibility for any errors which may appear in the
;Materials nor any responsibility to support or update the Materials.  AMD retains
;the right to make changes to its test specifications at any time, without notice.
;
;NO SUPPORT OBLIGATION: AMD is not obligated to furnish, support, or make any
;further information, software, technical information, know-how, or show-how
;available to you.
;
;So that all may benefit from your experience, please report  any  problems
;or  suggestions about this software to 3dsdk.support@amd.com
;
;AMD Developer Technologies, M/S 585
;Advanced Micro Devices, Inc.
;5900 E. Ben White Blvd.
;Austin, TX 78741
;3dsdk.support@amd.com

; Very optimized memcpy() routine for all AMD Athlon and Duron family.
; This code uses any of FOUR different basic copy methods, depending
; on the transfer size.
; NOTE:  Since this code uses MOVNTQ (also known as "Non-Temporal MOV" or
; "Streaming Store"), and also uses the software prefetchnta instructions,
; be sure you're running on Athlon/Duron or other recent CPU before calling!

Procedure CopyMemoryAMD(*src, *dst, size)
  #CACHEBLOCK = $80
  #CACHEBLOCKPREFETCH = #CACHEBLOCK/2
  #CACHEBLOCKTOP = #CACHEBLOCK*64
  #UNCACHED_COPY = 197*1024
  #UNCACHED_COPYPREFETCH = #UNCACHED_COPY/64
  #TINY_BLOCK_COPY = 64
  #IN_CACHE_COPY = 64*1024
  #IN_CACHE_COPYBIG = #IN_CACHE_COPY/64
  MOV esi, *src ; source array
  MOV edi, *dst ; destination array
  MOV ecx, size
  MOV  ebx, ecx  ; keep a copy of count
  CLD
  CMP  ecx, #TINY_BLOCK_COPY
  JB  l_memcpy_ic_3 ; tiny? skip mmx copy
  CMP  ecx, 32*1024  ; don't align between 32k-64k because
  JBE  l_memcpy_do_align ;  it appears to be slower
  CMP  ecx, 64*1024
  JBE  l_memcpy_align_done
  memcpy_do_align:
  MOV  ecx, 8   ; a trick that's faster than rep movsb...
  SUB  ecx, edi  ; align destination to qword
  AND  ecx, 7 ; 111b  ; get the low bits
  SUB  ebx, ecx  ; update copy count
  NEG  ecx    ; set up to jump into the array
  ADD  ecx, l_memcpy_align_done
  JMP  ecx    ; jump to array of movsb's
  ;align 4
  !movsb
  !movsb
  !movsb
  !movsb
  !movsb
  !movsb
  !movsb
  !movsb
  memcpy_align_done:   ; destination is dword aligned
  MOV  ecx, ebx  ; number of bytes left to copy
  SHR  ecx, 6   ; get 64-byte block count
  JZ  l_memcpy_ic_2 ; finish the last few bytes
  CMP  ecx, #IN_CACHE_COPYBIG ; too big 4 cache? use uncached copy
  JAE  l_memcpy_uc_test
  ;    !align 16
  memcpy_ic_1:   ; 64-byte block copies, in-cache copy
  !prefetchnta [esi+(200*64/34+192)]  ; start reading ahead
  !movq mm0, [esi+0] ; read 64 bits
  !movq mm1, [esi+8]
  !movq [edi+0], mm0 ; write 64 bits
  !movq [edi+8], mm1 ;    note:  the normal !movq writes the
  !movq mm2, [esi+16] ;    data to cache; a cache line will be
  !movq mm3, [esi+24] ;    allocated as needed, to store the data
  !movq [edi+16], mm2
  !movq [edi+24], mm3
  !movq mm0, [esi+32]
  !movq mm1, [esi+40]
  !movq [edi+32], mm0
  !movq [edi+40], mm1
  !movq mm2, [esi+48]
  !movq mm3, [esi+56]
  !movq [edi+48], mm2
  !movq [edi+56], mm3
  ADD  esi, 64   ; update source pointer
  ADD  edi, 64   ; update destination pointer
  DEC  ecx    ; count down
  JNZ  l_memcpy_ic_1 ; last 64-byte block?
  memcpy_ic_2:
  MOV  ecx, ebx  ; has valid low 6 bits of the byte count
  memcpy_ic_3:
  SHR  ecx, 2   ; dword count
  AND  ecx, 15 ; %1111  ; only look at the "remainder" bits
  NEG  ecx    ; set up to jump into the array
  ADD  ecx, l_memcpy_last_few
  JMP  ecx    ; jump to array of movsd's
  memcpy_uc_test:
  CMP  ecx, #UNCACHED_COPYPREFETCH ; big enough? use block prefetch copy
  JAE  l_memcpy_bp_1
  memcpy_64_test:
  OR  ecx, ecx  ; tail end of block prefetch will jump here
  JZ  l_memcpy_ic_2 ; no more 64-byte blocks left
  memcpy_uc_1:    ; 64-byte blocks, uncached copy
  !prefetchnta [esi+(200*64/34+192)]  ; start reading ahead
  !movq mm0, [esi+0]  ; read 64 bits
  ADD  edi, 64   ; update destination pointer
  !movq mm1, [esi+8]
  ADD  esi, 64   ; update source pointer
  !movq mm2, [esi-48]
  !movntq [edi-64], mm0 ; write 64 bits, bypassing the cache
  !movq mm0, [esi-40] ;    note: !movntq also prevents the CPU
  !movntq [edi-56], mm1 ;    from READING the destination address
  !movq mm1, [esi-32] ;    into the cache, only to be over-written
  !movntq [edi-48], mm2 ;    so that also helps performance
  !movq mm2, [esi-24]
  !movntq [edi-40], mm0
  !movq mm0, [esi-16]
  !movntq [edi-32], mm1
  !movq mm1, [esi-8]
  !movntq [edi-24], mm2
  !movntq [edi-16], mm0
  DEC  ecx
  !movntq [edi-8], mm1
  JNZ  l_memcpy_uc_1 ; last 64-byte block?
  JMP  l_memcpy_ic_2  ; almost done
  memcpy_bp_1:   ; large blocks, block prefetch copy
  CMP  ecx, #CACHEBLOCK   ; big enough to run another prefetch loop?
  JL  l_memcpy_64_test   ; no, back to regular uncached copy
  MOV  eax, #CACHEBLOCKPREFETCH  ; block prefetch loop, unrolled 2X
  ADD  esi, #CACHEBLOCKTOP ; move to the top of the block
  ;    !align 16
  memcpy_bp_2:
  MOV  edx, [esi-64]  ; grab one address per cache line
  MOV  edx, [esi-128]  ; grab one address per cache line
  SUB  esi, 128   ; go reverse order
  DEC  eax     ; count down the cache lines
  JNZ  l_memcpy_bp_2  ; keep grabbing more lines into cache
  MOV  eax, #CACHEBLOCK  ; now that it's in cache, do the copy
  ;    !align 16
  memcpy_bp_3:
  !movq mm0, [esi]  ; read 64 bits
  !movq mm1, [esi+ 8]
  !movq mm2, [esi+16]
  !movq mm3, [esi+24]
  !movq mm4, [esi+32]
  !movq mm5, [esi+40]
  !movq mm6, [esi+48]
  !movq mm7, [esi+56]
  ADD  esi, 64    ; update source pointer
  !movntq [edi], mm0  ; write 64 bits, bypassing cache
  !movntq [edi+ 8], mm1  ;    note: !movntq also prevents the CPU
  !movntq [edi+16], mm2  ;    from READING the destination address
  !movntq [edi+24], mm3  ;    into the cache, only to be over-written,
  !movntq [edi+32], mm4  ;    so that also helps performance
  !movntq [edi+40], mm5
  !movntq [edi+48], mm6
  !movntq [edi+56], mm7
  ADD  edi, 64    ; update dest pointer
  DEC  eax     ; count down
  JNZ  l_memcpy_bp_3  ; keep copying
  SUB  ecx, #CACHEBLOCK  ; update the 64-byte block count
  JMP  l_memcpy_bp_1  ; keep processing chunks
  ;The smallest copy uses the X86 "!movsd" instruction, in an optimized
  ;form which is an "unrolled loop".   Then it handles the last few bytes.
  ;    !align 4
  !movsd
  !movsd   ; perform last 1-15 dword copies
  !movsd
  !movsd
  !movsd
  !movsd
  !movsd
  !movsd
  !movsd
  !movsd   ; perform last 1-7 dword copies
  !movsd
  !movsd
  !movsd
  !movsd
  !movsd
  !movsd
  memcpy_last_few:  ; dword aligned from before !movsd's
  MOV  ecx, ebx ; has valid low 2 bits of the byte count
  AND  ecx, 3 ; %11 ; the last few cows must come home
  JZ  l_memcpy_final ; no more, let's leave
  REP  movsb  ; the last 1, 2, or 3 bytes
  memcpy_final:
  !emms    ; clean up the  state
  !sfence    ; flush the write buffer
EndProcedure

manyk = Pow(2, 26)
source = AllocateMemory(manyk)
destination = AllocateMemory(manyk)
If source And destination
  For a=0 To manyk-1 Step 4
    PokeL(source+a, Random($fffffff))
  Next
  time = ElapsedMilliseconds()
  For a=1 To 10
    CopyMemoryAMD(source, destination, manyk)
  Next
  manyk_AMD.s = Str((ElapsedMilliseconds()-time))
  For a=0 To manyk-1 Step 4
    If PeekL(source+a)<>PeekL(destination+a)
      MessageRequester("Wrong data", "CopyMemoryAMD 64 MB at offset "+Str(a))
      Break
    EndIf
  Next

  time = ElapsedMilliseconds()
  For a=1 To 10
    CopyMemory(source, destination, manyk)
  Next
  manyk_PB.s=Str((ElapsedMilliseconds()-time))
  FreeMemory(source)
  FreeMemory(destination)
  source = 0
  destination = 0
  manyk_times.f = Val(manyk_PB)/Val(manyk_AMD)
Else
  MessageRequester("Error", "Could not allocate two "+Str(manyk/1024)+" KB blocks.")
EndIf

ameg = Pow(2, 20)
source = AllocateMemory(ameg)
destination = AllocateMemory(ameg)
If source And destination
  For a=0 To ameg-1 Step 4
    PokeL(source+a, Random($fffffff))
  Next
  time = ElapsedMilliseconds()
  For a=1 To 10000
    CopyMemoryAMD(source, destination, ameg)
  Next
  onek_AMD.s = Str((ElapsedMilliseconds()-time))
  For a=0 To ameg-1 Step 4
    If PeekL(source+a)<>PeekL(destination+a)
      MessageRequester("Wrong data", "CopyMemoryAMD 1 MB at offset "+Str(a))
      Break
    EndIf
  Next

  time = ElapsedMilliseconds()
  For a=1 To 10000
    CopyMemory(source, destination, ameg)
  Next
  onek_PB.s = Str((ElapsedMilliseconds()-time))
  FreeMemory(source)
  FreeMemory(destination)
  source = 0
  destination = 0
  onek_times.f = Val(onek_PB)/Val(onek_AMD)
Else
  MessageRequester("Error", "Could not allocate two "+Str(ameg/1024)+" KB blocks.")
EndIf

hk = 102400
source = AllocateMemory(hk)
destination = AllocateMemory(hk)
If source And destination
  For a=0 To hk-1 Step 4
    PokeB(source+a, Random($fffffff))
  Next
  time = ElapsedMilliseconds()
  For a=1 To 10000
    CopyMemoryAMD(source, destination, hk)
  Next
  hundredk_AMD.s = Str((ElapsedMilliseconds()-time))
  For a=0 To hk-1 Step 4
    If PeekL(source+a)<>PeekL(destination+a)
      MessageRequester("Wrong data", "CopyMemoryAMD 100 K at offset "+Str(a))
      Break
    EndIf
  Next

  time = ElapsedMilliseconds()
  For a=1 To 10000
    CopyMemory(source, destination, hk)
  Next
  hundredk_PB.s = Str((ElapsedMilliseconds()-time))
  FreeMemory(source)
  FreeMemory(destination)
  source = 0
  destination = 0
  hundredk_times.f = Val(hundredk_PB)/Val(hundredk_AMD)
Else
  MessageRequester("Error", "Could not allocate two "+Str(hk/1024)+" KB blocks.")
EndIf

results.s="--- 64 MB tranfer test ---"+#LFCR$
results.s+"AMD Function : "+ manyk_AMD +#LFCR$
results.s+"Pure Function : "+ manyk_PB +#LFCR$
results.s+"AMD Function is "+StrF(manyk_times)+" times faster."+#LFCR$
results.s+#LFCR$
results.s+"--- 1 MB tranfer test ---"+#LFCR$
results.s+"AMD Function : "+ onek_AMD +#LFCR$
results.s+"Pure Function : "+ onek_PB +#LFCR$
results.s+"AMD Function is "+StrF(onek_times)+" times faster."+#LFCR$
results.s+#LFCR$
results.s+"--- 100kb tranfer test ---"+#LFCR$
results.s+"AMD Function : "+ hundredk_AMD +#LFCR$
results.s+"Pure Function : "+ hundredk_PB +#LFCR$
results.s+"AMD Function is "+StrF(hundredk_times)+" times faster."+#LFCR$

MessageRequester("Test Results", results.s) 
And now, the results are more humble, but still interesting and worthy to be studied for future implementation in PureBasic:

Code: Select all

---------------------------
Test Results
---------------------------
--- 64 MB tranfer test ---
AMD Function : 641
Pure Function : 1625
AMD Function is 2.535101 times faster.

--- 1 MB tranfer test ---
AMD Function : 10156
Pure Function : 25203
AMD Function is 2.481587 times faster.

--- 100kb tranfer test ---
AMD Function : 453
Pure Function : 406
AMD Function is 0.896247 times faster. ; <--- XXDDDD
My apologies for the initial error, I hope nobody has really used the wrong procedure in any app (have you?) :mrgreen:
El_Choni
Xombie
Addict
Addict
Posts: 898
Joined: Thu Jul 01, 2004 2:51 am
Location: Tacoma, WA
Contact:

Post by Xombie »

Yes, definitely 'slower' than the previous code :D

Now I get

--- 64 MB transfer test ---
AMD Function : 406
Pure Function : 766
AMD Function is 1.886700 times faster.

--- 1 MB transfer test ---
AMD Function : 5703
Pure Function : 12969
AMD Function is 2.274066 times faster.

--- 100kb transfer test ---
AMD Function : 344
Pure Function : 312
AMD Function is 0.906977 times faster.

So while not as fast as the previous code, still faster. I'm still on the AMD Athlon 64 3800+ with 1gig of ram.

Thanks for the correction! :D
Last edited by Xombie on Fri Jan 14, 2005 6:36 am, edited 1 time in total.
Max.²
Enthusiast
Enthusiast
Posts: 175
Joined: Wed Jul 28, 2004 8:38 am

Post by Max.² »

Ah, cool.

On my system, except for the 100kb blocks, where PB is faster by some %, the new routine is 3 to 3.5 times faster.

Later I will check if the same stuff is copied. :lol:

AMD 2400+, 1 GB, btw.

PS: On my Intel Centrino 1600 Notebook, the result is ~2/3/0.5 times faster than the original PB function.
Last edited by Max.² on Fri Jan 14, 2005 8:37 am, edited 1 time in total.
Bonne_den_kule
Addict
Addict
Posts: 841
Joined: Mon Jun 07, 2004 7:10 pm

Post by Bonne_den_kule »

My results:
--- 64 MB tranfer test ---

AMD Function : 688

Pure Function : 1609

AMD Function is 2.338663 times faster.


--- 1 MB tranfer test ---

AMD Function : 10875

Pure Function : 25953

AMD Function is 2.386483 times faster.


--- 100kb tranfer test ---

AMD Function : 469

Pure Function : 468

AMD Function is 0.997868 times faster.
Using a AMD AthlonXP 2500+

Why is it slower on th 100kb copy test?
El_Choni
TailBite Expert
TailBite Expert
Posts: 1007
Joined: Fri Apr 25, 2003 6:09 pm
Location: Spain

Post by El_Choni »

Why is it slower on th 100kb copy test?
My guess is that, for that memory block size, both functions use the same technic, but the CopyMemoryAMD procedure has the overhead of the procedure 'epilogue' code and size checking, so the results are a bit slower, but only Fred can tell. If implemented as SSE2 version of CopyMemory, there shouldn't be almost any difference.

About the code alignment (which I had to comment), I think it would improve performance (and would be possible to include it when implemented as a lib function), but I doubt it can provoke any crash. Not so sure about memory blocks alignment though...
El_Choni
User avatar
djes
Addict
Addict
Posts: 1806
Joined: Sat Feb 19, 2005 2:46 pm
Location: Pas-de-Calais, France

Align

Post by djes »

To use the align asm command, just do

Code: Select all

!ALIGN 4 ;uppercase, the fasm directive
and not

Code: Select all

!align 4 ;lowercase, the macro
:wink:
Post Reply