Faster memory copy (optimized for AMD)

Share your advanced PureBasic knowledge/code with the community.
El_Choni
TailBite Expert
Posts: 1007
Joined: Fri Apr 25, 2003 6:09 pm
Location: Spain

Faster memory copy (optimized for AMD)

Post by El_Choni »

Code updated for 5.20+

Hi,

This code is not mine, but from AMD itself. It copies memory as fast as an AMD CPU can do it. It works on Intel too, of course, and will probably be faster than the native CopyMemory() even there.

This procedure runs 12 times faster than the native CopyMemory() on an AMD Athlon XP 3200+. It would be cool to hear test results from other AMDs or, even better, from Pentiums.

Enjoy it.

EDIT: this is the working code, although I haven't been able to use all those 'ALIGN' directives. FAsm complains: "section is not aligned enough for this operation".

Code:

; Copyright (c) 2001 Advanced Micro Devices, Inc.
;
;LIMITATION OF LIABILITY:  THE MATERIALS ARE PROVIDED *AS IS* WITHOUT ANY
;EXPRESS OR IMPLIED WARRANTY OF ANY KIND INCLUDING WARRANTIES OF MERCHANTABILITY,
;NONINFRINGEMENT OF THIRD-PARTY INTELLECTUAL PROPERTY, OR FITNESS FOR ANY
;PARTICULAR PURPOSE.  IN NO EVENT SHALL AMD OR ITS SUPPLIERS BE LIABLE FOR ANY
;DAMAGES WHATSOEVER (INCLUDING, WITHOUT LIMITATION, DAMAGES FOR LOSS OF PROFITS,
;BUSINESS INTERRUPTION, LOSS OF INFORMATION) ARISING OUT OF THE USE OF OR
;INABILITY TO USE THE MATERIALS, EVEN IF AMD HAS BEEN ADVISED OF THE POSSIBILITY
;OF SUCH DAMAGES.  BECAUSE SOME JURISDICTIONS PROHIBIT THE EXCLUSION OR LIMITATION
;OF LIABILITY FOR CONSEQUENTIAL OR INCIDENTAL DAMAGES, THE ABOVE LIMITATION MAY
;NOT APPLY TO YOU.
;
;AMD does not assume any responsibility for any errors which may appear in the
;Materials nor any responsibility to support or update the Materials.  AMD retains
;the right to make changes to its test specifications at any time, without notice.
;
;NO SUPPORT OBLIGATION: AMD is not obligated to furnish, support, or make any
;further information, software, technical information, know-how, or show-how
;available to you.
;
;So that all may benefit from your experience, please report  any  problems
;or  suggestions about this software to 3dsdk.support@amd.com
;
;AMD Developer Technologies, M/S 585
;Advanced Micro Devices, Inc.
;5900 E. Ben White Blvd.
;Austin, TX 78741
;3dsdk.support@amd.com

; Very optimized memcpy() routine for all AMD Athlon and Duron family.
; This code uses any of FOUR different basic copy methods, depending
; on the transfer size.
; NOTE:  Since this code uses MOVNTQ (also known as "Non-Temporal MOV" or
; "Streaming Store"), and also uses the software prefetchnta instructions,
; be sure you're running on Athlon/Duron or other recent CPU before calling!

Procedure CopyMemoryAMD(*src, *dst, size)
  #CACHEBLOCK = $80
  #CACHEBLOCKPREFETCH = #CACHEBLOCK/2
  #CACHEBLOCKTOP = #CACHEBLOCK*64
  #UNCACHED_COPY = 197*1024
  #UNCACHED_COPYPREFETCH = #UNCACHED_COPY/64
  #TINY_BLOCK_COPY = 64
  #IN_CACHE_COPY = 64*1024
  #IN_CACHE_COPYBIG = #IN_CACHE_COPY/64
  EnableASM
  MOV esi, *src ; source array
  MOV edi, *dst ; destination array
  MOV ecx, size
  MOV  ebx, ecx  ; keep a copy of count
  CLD
  CMP  ecx, #TINY_BLOCK_COPY
  JB  l_copymemoryamd_memcpy_ic_3 ; tiny? skip mmx copy
  CMP  ecx, 32*1024  ; don't align between 32k-64k because
  JBE  l_copymemoryamd_memcpy_do_align ;  it appears to be slower
  CMP  ecx, 64*1024
  JBE  l_copymemoryamd_memcpy_align_done
  memcpy_do_align:
  MOV  ecx, 8   ; a trick that's faster than rep movsb...
  SUB  ecx, edi  ; align destination to qword
  And  ecx, 7 ; 111b  ; get the low bits
  SUB  ebx, ecx  ; update copy count
  NEG  ecx    ; set up to jump into the array
  ADD  ecx, l_copymemoryamd_memcpy_align_done
  JMP  ecx    ; jump to array of movsb's
  !ALIGN 4
  !movsb
  !movsb
  !movsb
  !movsb
  !movsb
  !movsb
  !movsb
  !movsb
  memcpy_align_done:   ; destination is dword aligned
  MOV  ecx, ebx  ; number of bytes left to copy
  SHR  ecx, 6   ; get 64-byte block count
  JZ  l_copymemoryamd_memcpy_ic_2 ; finish the last few bytes
  CMP  ecx, #IN_CACHE_COPYBIG ; too big 4 cache? use uncached copy
  JAE  l_copymemoryamd_memcpy_uc_test
  ;!ALIGN 16
  memcpy_ic_1:   ; 64-byte block copies, in-cache copy
  !prefetchnta [esi+(200*64/34+192)]  ; start reading ahead
  !movq mm0, [esi+0] ; read 64 bits
  !movq mm1, [esi+8]
  !movq [edi+0], mm0 ; write 64 bits
  !movq [edi+8], mm1 ;    note:  the normal !movq writes the
  !movq mm2, [esi+16] ;    data to cache; a cache line will be
  !movq mm3, [esi+24] ;    allocated as needed, to store the data
  !movq [edi+16], mm2
  !movq [edi+24], mm3
  !movq mm0, [esi+32]
  !movq mm1, [esi+40]
  !movq [edi+32], mm0
  !movq [edi+40], mm1
  !movq mm2, [esi+48]
  !movq mm3, [esi+56]
  !movq [edi+48], mm2
  !movq [edi+56], mm3
  ADD  esi, 64   ; update source pointer
  ADD  edi, 64   ; update destination pointer
  DEC  ecx    ; count down
  JNZ  l_copymemoryamd_memcpy_ic_1 ; last 64-byte block?
  memcpy_ic_2:
  MOV  ecx, ebx  ; has valid low 6 bits of the byte count
  memcpy_ic_3:
  SHR  ecx, 2   ; dword count
  And  ecx, 15 ; %1111  ; only look at the "remainder" bits
  NEG  ecx    ; set up to jump into the array
  ADD  ecx, l_copymemoryamd_memcpy_last_few
  JMP  ecx    ; jump to array of movsd's
  memcpy_uc_test:
  CMP  ecx, #UNCACHED_COPYPREFETCH ; big enough? use block prefetch copy
  JAE  l_copymemoryamd_memcpy_bp_1
  memcpy_64_test:
  Or  ecx, ecx  ; tail end of block prefetch will jump here
  JZ  l_copymemoryamd_memcpy_ic_2 ; no more 64-byte blocks left
  memcpy_uc_1:    ; 64-byte blocks, uncached copy
  !prefetchnta [esi+(200*64/34+192)]  ; start reading ahead
  !movq mm0, [esi+0]  ; read 64 bits
  ADD  edi, 64   ; update destination pointer
  !movq mm1, [esi+8]
  ADD  esi, 64   ; update source pointer
  !movq mm2, [esi-48]
  !movntq [edi-64], mm0 ; write 64 bits, bypassing the cache
  !movq mm0, [esi-40] ;    note: !movntq also prevents the CPU
  !movntq [edi-56], mm1 ;    from READING the destination address
  !movq mm1, [esi-32] ;    into the cache, only to be over-written
  !movntq [edi-48], mm2 ;    so that also helps performance
  !movq mm2, [esi-24]
  !movntq [edi-40], mm0
  !movq mm0, [esi-16]
  !movntq [edi-32], mm1
  !movq mm1, [esi-8]
  !movntq [edi-24], mm2
  !movntq [edi-16], mm0
  DEC  ecx
  !movntq [edi-8], mm1
  JNZ  l_copymemoryamd_memcpy_uc_1 ; last 64-byte block?
  JMP  l_copymemoryamd_memcpy_ic_2  ; almost done
  memcpy_bp_1:   ; large blocks, block prefetch copy
  CMP  ecx, #CACHEBLOCK   ; big enough to run another prefetch loop?
  JL  l_copymemoryamd_memcpy_64_test   ; no, back to regular uncached copy
  MOV  eax, #CACHEBLOCKPREFETCH  ; block prefetch loop, unrolled 2X
  ADD  esi, #CACHEBLOCKTOP ; move to the top of the block
  ;!ALIGN 16
  memcpy_bp_2:
  MOV  edx, [esi-64]  ; grab one address per cache line
  MOV  edx, [esi-128]  ; grab one address per cache line
  SUB  esi, 128   ; go reverse order
  DEC  eax     ; count down the cache lines
  JNZ  l_copymemoryamd_memcpy_bp_2  ; keep grabbing more lines into cache
  MOV  eax, #CACHEBLOCK  ; now that it's in cache, do the copy
  ;!ALIGN 16
  memcpy_bp_3:
  !movq mm0, [esi]  ; read 64 bits
  !movq mm1, [esi+ 8]
  !movq mm2, [esi+16]
  !movq mm3, [esi+24]
  !movq mm4, [esi+32]
  !movq mm5, [esi+40]
  !movq mm6, [esi+48]
  !movq mm7, [esi+56]
  ADD  esi, 64    ; update source pointer
  !movntq [edi], mm0  ; write 64 bits, bypassing cache
  !movntq [edi+ 8], mm1  ;    note: !movntq also prevents the CPU
  !movntq [edi+16], mm2  ;    from READING the destination address
  !movntq [edi+24], mm3  ;    into the cache, only to be over-written,
  !movntq [edi+32], mm4  ;    so that also helps performance
  !movntq [edi+40], mm5
  !movntq [edi+48], mm6
  !movntq [edi+56], mm7
  ADD  edi, 64    ; update dest pointer
  DEC  eax     ; count down
  JNZ  l_copymemoryamd_memcpy_bp_3  ; keep copying
  SUB  ecx, #CACHEBLOCK  ; update the 64-byte block count
  JMP  l_copymemoryamd_memcpy_bp_1  ; keep processing chunks
  ;The smallest copy uses the X86 "!movsd" instruction, in an optimized
  ;form which is an "unrolled loop".   Then it handles the last few bytes.
  !ALIGN 4
  !movsd
  !movsd   ; perform last 1-15 dword copies
  !movsd
  !movsd
  !movsd
  !movsd
  !movsd
  !movsd
  !movsd
  !movsd   ; perform last 1-7 dword copies
  !movsd
  !movsd
  !movsd
  !movsd
  !movsd
  !movsd
  memcpy_last_few:  ; dword aligned from before !movsd's
  MOV  ecx, ebx ; has valid low 2 bits of the byte count
  And  ecx, 3 ; %11 ; the last few cows must come home
  JZ  l_copymemoryamd_memcpy_final ; no more, let's leave
  REP  movsb  ; the last 1, 2, or 3 bytes
  memcpy_final:
  !emms    ; clean up the MMX state
  !sfence    ; flush the write buffer
  DisableASM
EndProcedure
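
; Sketch summary of the dispatch above (derived from the constants and
; compares in the procedure; this note is not part of the original AMD source):
;   size < 64 bytes      -> unrolled movsd/movsb tail copy (memcpy_ic_3)
;   64 bytes .. 64 KB    -> in-cache MMX copy with movq (memcpy_ic_1)
;   64 KB .. ~197 KB     -> uncached copy with movntq streaming stores (memcpy_uc_1)
;   >= ~197 KB           -> block prefetch copy: read a #CACHEBLOCK chunk into
;                           cache first, then stream it out (memcpy_bp_1)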

manyk = Pow(2, 26)
source = AllocateMemory(manyk)
destination = AllocateMemory(manyk)
If source And destination
  For a=0 To manyk-1 Step 4
    PokeL(source+a, Random($fffffff))
  Next
  time = ElapsedMilliseconds()
  For a=1 To 10
    CopyMemoryAMD(source, destination, manyk)
  Next
  manyk_AMD.s = Str((ElapsedMilliseconds()-time))
  For a=0 To manyk-1 Step 4
    If PeekL(source+a)<>PeekL(destination+a)
      MessageRequester("Wrong data", "CopyMemoryAMD 64 MB at offset "+Str(a))
      Break
    EndIf
  Next

  time = ElapsedMilliseconds()
  For a=1 To 10
    CopyMemory(source, destination, manyk)
  Next
  manyk_PB.s=Str((ElapsedMilliseconds()-time))
  FreeMemory(source)
  FreeMemory(destination)
  source = 0
  destination = 0
  manyk_times.f = Val(manyk_PB)/Val(manyk_AMD)
Else
  MessageRequester("Error", "Could not allocate two "+Str(manyk/1024)+" KB blocks.")
EndIf

ameg = Pow(2, 20)
source = AllocateMemory(ameg)
destination = AllocateMemory(ameg)
If source And destination
  For a=0 To ameg-1 Step 4
    PokeL(source+a, Random($fffffff))
  Next
  time = ElapsedMilliseconds()
  For a=1 To 10000
    CopyMemoryAMD(source, destination, ameg)
  Next
  onek_AMD.s = Str((ElapsedMilliseconds()-time))
  For a=0 To ameg-1 Step 4
    If PeekL(source+a)<>PeekL(destination+a)
      MessageRequester("Wrong data", "CopyMemoryAMD 1 MB at offset "+Str(a))
      Break
    EndIf
  Next

  time = ElapsedMilliseconds()
  For a=1 To 10000
    CopyMemory(source, destination, ameg)
  Next
  onek_PB.s = Str((ElapsedMilliseconds()-time))
  FreeMemory(source)
  FreeMemory(destination)
  source = 0
  destination = 0
  onek_times.f = Val(onek_PB)/Val(onek_AMD)
Else
  MessageRequester("Error", "Could not allocate two "+Str(ameg/1024)+" KB blocks.")
EndIf

hk = 102400
source = AllocateMemory(hk)
destination = AllocateMemory(hk)
If source And destination
  For a=0 To hk-1 Step 4
    PokeL(source+a, Random($fffffff))
  Next
  time = ElapsedMilliseconds()
  For a=1 To 10000
    CopyMemoryAMD(source, destination, hk)
  Next
  hundredk_AMD.s = Str((ElapsedMilliseconds()-time))
  For a=0 To hk-1 Step 4
    If PeekL(source+a)<>PeekL(destination+a)
      MessageRequester("Wrong data", "CopyMemoryAMD 100 K at offset "+Str(a))
      Break
    EndIf
  Next

  time = ElapsedMilliseconds()
  For a=1 To 10000
    CopyMemory(source, destination, hk)
  Next
  hundredk_PB.s = Str((ElapsedMilliseconds()-time))
  FreeMemory(source)
  FreeMemory(destination)
  source = 0
  destination = 0
  hundredk_times.f = Val(hundredk_PB)/Val(hundredk_AMD)
Else
  MessageRequester("Error", "Could not allocate two "+Str(hk/1024)+" KB blocks.")
EndIf

results.s="--- 64 MB tranfer test ---"+#LFCR$
results.s+"AMD Function : "+ manyk_AMD +#LFCR$
results.s+"Pure Function : "+ manyk_PB +#LFCR$
results.s+"AMD Function is "+StrF(manyk_times)+" times faster."+#LFCR$
results.s+#LFCR$
results.s+"--- 1 MB tranfer test ---"+#LFCR$
results.s+"AMD Function : "+ onek_AMD +#LFCR$
results.s+"Pure Function : "+ onek_PB +#LFCR$
results.s+"AMD Function is "+StrF(onek_times)+" times faster."+#LFCR$
results.s+#LFCR$
results.s+"--- 100kb tranfer test ---"+#LFCR$
results.s+"AMD Function : "+ hundredk_AMD +#LFCR$
results.s+"Pure Function : "+ hundredk_PB +#LFCR$
results.s+"AMD Function is "+StrF(hundredk_times)+" times faster."+#LFCR$

MessageRequester("Test Results", results.s)
Regards,
Last edited by El_Choni on Thu Feb 24, 2005 12:07 pm, edited 1 time in total.
El_Choni
Xombie
Addict
Posts: 898
Joined: Thu Jul 01, 2004 2:51 am
Location: Tacoma, WA

Post by Xombie »

That's awesome, El_Choni! I can't wait to try this out. I have an AMD 64 3800 and eventually I'll be needing a good CopyMemory function.

I'm also intensely curious as to how it performs on an Intel processor versus native code. It'd be great if it also ran faster there.
Num3
PureBasic Expert
Posts: 2812
Joined: Fri Apr 25, 2003 4:51 pm
Location: Portugal, Lisbon

Post by Num3 »

OK, I've tested this routine and here are the results for a P4 @ 1400 MHz:

1 MB memory buffer -> 13x faster than PB routine
100K memory buffer -> 13x faster than PB routine

Here's the code I used (remember to turn the debugger off!):

10000 memory copies for each run...
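
As a side note, a compile-time guard can catch an accidental debugger run before the timing starts (my addition, not in the code below):

Code:

CompilerIf #PB_Compiler_Debugger
  MessageRequester("Warning", "Turn the debugger off before timing, or the results will be meaningless.")
CompilerEndIf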

Code:

Procedure CopyMemoryAMD(*src, *dst, size)
  #CACHEBLOCK = $80
  #CACHEBLOCKPREFETCH = #CACHEBLOCK/2
  #CACHEBLOCKTOP = #CACHEBLOCK*64
  #UNCACHED_COPY = 197*1024
  #UNCACHED_COPYPREFETCH = #UNCACHED_COPY/64
  #TINY_BLOCK_COPY = 64
  #IN_CACHE_COPY = 64*1024
  #IN_CACHE_COPYBIG = #IN_CACHE_COPY/64
  MOV esi, *src ; source array
  MOV edi, *dst ; destination array
  MOV ecx, size ; byte count (the routine expects bytes, not qwords)
  
  MOV  ebx, ecx  ; keep a copy of count
  CLD
  CMP  ecx, #TINY_BLOCK_COPY
  JB  l_memcpy_ic_3 ; tiny? skip mmx copy
  CMP  ecx, 32*1024  ; don't align between 32k-64k because
  JBE  l_memcpy_do_align ;  it appears to be slower
  CMP  ecx, 64*1024
  JBE  l_memcpy_align_done
  memcpy_do_align:
  MOV  ecx, 8   ; a trick that's faster than rep movsb...
  SUB  ecx, edi  ; align destination to qword
  And  ecx, 7 ; 111b  ; get the low bits
  SUB  ebx, ecx  ; update copy count
  NEG  ecx    ; set up to jump into the array
  ADD  ecx, l_memcpy_align_done
  JMP  ecx    ; jump to array of movsb's
  ;align 4
  !movsb
  !movsb
  !movsb
  !movsb
  !movsb
  !movsb
  !movsb
  !movsb
  memcpy_align_done:   ; destination is dword aligned
  MOV  ecx, ebx  ; number of bytes left to copy
  SHR  ecx, 6   ; get 64-byte block count
  JZ  l_memcpy_ic_2 ; finish the last few bytes
  CMP  ecx, #IN_CACHE_COPYBIG ; too big 4 cache? use uncached copy
  JAE  l_memcpy_uc_test
  
  ;    !align 16
  memcpy_ic_1:   ; 64-byte block copies, in-cache copy
  !prefetchnta [esi+(200*64/34+192)]  ; start reading ahead
  !movq mm0, [esi+0] ; read 64 bits
  !movq mm1, [esi+8]
  !movq [edi+0], mm0 ; write 64 bits
  !movq [edi+8], mm1 ;    note:  the normal !movq writes the
  !movq mm2, [esi+16] ;    data to cache; a cache line will be
  !movq mm3, [esi+24] ;    allocated as needed, to store the data
  !movq [edi+16], mm2
  !movq [edi+24], mm3
  !movq mm0, [esi+32]
  !movq mm1, [esi+40]
  !movq [edi+32], mm0
  !movq [edi+40], mm1
  !movq mm2, [esi+48]
  !movq mm3, [esi+56]
  !movq [edi+48], mm2
  !movq [edi+56], mm3
  ADD  esi, 64   ; update source pointer
  ADD  edi, 64   ; update destination pointer
  DEC  ecx    ; count down
  JNZ  l_memcpy_ic_1 ; last 64-byte block?
  memcpy_ic_2:
  MOV  ecx, ebx  ; has valid low 6 bits of the byte count
  memcpy_ic_3:
  SHR  ecx, 2   ; dword count
  And  ecx, 15 ; %1111  ; only look at the "remainder" bits
  NEG  ecx    ; set up to jump into the array
  ADD  ecx, l_memcpy_last_few
  JMP  ecx    ; jump to array of movsd's
  memcpy_uc_test:
  CMP  ecx, #UNCACHED_COPYPREFETCH ; big enough? use block prefetch copy
  JAE  l_memcpy_bp_1
  memcpy_64_test:
  Or  ecx, ecx  ; tail end of block prefetch will jump here
  JZ  l_memcpy_ic_2 ; no more 64-byte blocks left
  
  memcpy_uc_1:    ; 64-byte blocks, uncached copy
  !prefetchnta [esi+(200*64/34+192)]  ; start reading ahead
  !movq mm0, [esi+0]  ; read 64 bits
  ADD  edi, 64   ; update destination pointer
  !movq mm1, [esi+8]
  ADD  esi, 64   ; update source pointer
  !movq mm2, [esi-48]
  !movntq [edi-64], mm0 ; write 64 bits, bypassing the cache
  !movq mm0, [esi-40] ;    note: !movntq also prevents the CPU
  !movntq [edi-56], mm1 ;    from READING the destination address
  !movq mm1, [esi-32] ;    into the cache, only to be over-written
  !movntq [edi-48], mm2 ;    so that also helps performance
  !movq mm2, [esi-24]
  !movntq [edi-40], mm0
  !movq mm0, [esi-16]
  !movntq [edi-32], mm1
  !movq mm1, [esi-8]
  !movntq [edi-24], mm2
  !movntq [edi-16], mm0
  DEC  ecx
  !movntq [edi-8], mm1
  JNZ  l_memcpy_uc_1 ; last 64-byte block?
  JMP  l_memcpy_ic_2  ; almost done
  
  memcpy_bp_1:   ; large blocks, block prefetch copy
  CMP  ecx, #CACHEBLOCK   ; big enough to run another prefetch loop?
  JL  l_memcpy_64_test   ; no, back to regular uncached copy
  MOV  eax, #CACHEBLOCKPREFETCH  ; block prefetch loop, unrolled 2X
  ADD  esi, #CACHEBLOCKTOP ; move to the top of the block
  ;    !align 16
  memcpy_bp_2:
  MOV  edx, [esi-64]  ; grab one address per cache line
  MOV  edx, [esi-128]  ; grab one address per cache line
  SUB  esi, 128   ; go reverse order
  DEC  eax     ; count down the cache lines
  JNZ  l_memcpy_bp_2  ; keep grabbing more lines into cache
  MOV  eax, #CACHEBLOCK  ; now that it's in cache, do the copy
  ;    !align 16
  memcpy_bp_3:
  !movq mm0, [esi]  ; read 64 bits
  !movq mm1, [esi+ 8]
  !movq mm2, [esi+16]
  !movq mm3, [esi+24]
  !movq mm4, [esi+32]
  !movq mm5, [esi+40]
  !movq mm6, [esi+48]
  !movq mm7, [esi+56]
  ADD  esi, 64    ; update source pointer
  !movntq [edi], mm0  ; write 64 bits, bypassing cache
  !movntq [edi+ 8], mm1  ;    note: !movntq also prevents the CPU
  !movntq [edi+16], mm2  ;    from READING the destination address
  !movntq [edi+24], mm3  ;    into the cache, only to be over-written,
  !movntq [edi+32], mm4  ;    so that also helps performance
  !movntq [edi+40], mm5
  !movntq [edi+48], mm6
  !movntq [edi+56], mm7
  ADD  edi, 64    ; update dest pointer
  DEC  eax     ; count down
  JNZ  l_memcpy_bp_3  ; keep copying
  SUB  ecx, #CACHEBLOCK  ; update the 64-byte block count
  JMP  l_memcpy_bp_1  ; keep processing chunks
  ;The smallest copy uses the X86 "!movsd" instruction, in an optimized
  ;form which is an "unrolled loop".   Then it handles the last few bytes.
  ;    !align 4
  !movsd
  !movsd   ; perform last 1-15 dword copies
  !movsd
  !movsd
  !movsd
  !movsd
  !movsd
  !movsd
  !movsd
  !movsd   ; perform last 1-7 dword copies
  !movsd
  !movsd
  !movsd
  !movsd
  !movsd
  !movsd
  memcpy_last_few:  ; dword aligned from before !movsd's
  MOV  ecx, ebx ; has valid low 2 bits of the byte count
  And  ecx, 3 ; %11 ; the last few cows must come home
  JZ  l_memcpy_final ; no more, let's leave
  REP  movsb  ; the last 1, 2, or 3 bytes
  memcpy_final:
  !emms    ; clean up the MMX state
  !sfence    ; flush the write buffer
EndProcedure 


source.l=AllocateMemory(1024000)
destination.l=AllocateMemory(1024000)

For a=0 To 1024000-1
  PokeB(source+a, Random(15))
Next

time.l=ElapsedMilliseconds()

For a=1 To 10000
  CopyMemoryAMD(source,destination,1024000)
Next

onek_AMD.s=Str((ElapsedMilliseconds()-time))

time.l=ElapsedMilliseconds()

For a=1 To 10000
  CopyMemory(source,destination,1024000)
Next

onek_PB.s=Str((ElapsedMilliseconds()-time))

source.l=AllocateMemory(102400)
destination.l=AllocateMemory(102400)

For a=0 To 102400-1
  PokeB(source+a, Random(15))
Next

time.l=ElapsedMilliseconds()

For a=1 To 10000
  CopyMemoryAMD(source,destination,102400)
Next

hundredk_AMD.s=Str((ElapsedMilliseconds()-time))

time.l=ElapsedMilliseconds()

For a=1 To 10000
  CopyMemory(source,destination,102400)
Next


hundredk_PB.s=Str((ElapsedMilliseconds()-time))

FreeMemory(source)
FreeMemory(destination)

results.s="--- 1 MB tranfer test ---"+#LFCR$
results.s+"AMD Function : "+ onek_AMD +#LFCR$
results.s+"Pure Function : "+ onek_PB +#LFCR$
results.s+#LFCR$
results.s+"--- 100kb tranfer test ---"+#LFCR$
results.s+"AMD Function : "+ hundredk_AMD +#LFCR$
results.s+"Pure Function : "+ hundredk_PB +#LFCR$


MessageRequester("Test Results",results.s)
Max.²
Enthusiast
Posts: 175
Joined: Wed Jul 28, 2004 8:38 am

Post by Max.² »

On an Intel Centrino 1600 notebook, the 1 MB transfer test is 13 times faster than the original, the 100 KB test 15 times.

Edit:

What are the minimum requirements for this routine, CPU-wise?
BalrogSoft
Enthusiast
Posts: 203
Joined: Sat Apr 26, 2003 6:33 pm
Location: Spain

Post by BalrogSoft »

I tried it on my Acer TM 2001 notebook with a 2.6 GHz P4 Celeron, and I got these results:

1 MB transfer: 12x faster than PB CopyMemory
100 KB transfer: 67x faster than PB CopyMemory!!!

Very good work...
El_Choni
TailBite Expert
Posts: 1007
Joined: Fri Apr 25, 2003 6:09 pm
Location: Spain

Post by El_Choni »

@Max2: PC with MMX.

It's funny, the routine is supposed to be optimized for AMD but runs faster on Pentiums 8O

EDIT: I've modified Num3's code a bit and added one more test:

Code:

Procedure CopyMemoryAMD(*src, *dst, size)
  #CACHEBLOCK = $80
  #CACHEBLOCKPREFETCH = #CACHEBLOCK/2
  #CACHEBLOCKTOP = #CACHEBLOCK*64
  #UNCACHED_COPY = 197*1024
  #UNCACHED_COPYPREFETCH = #UNCACHED_COPY/64
  #TINY_BLOCK_COPY = 64
  #IN_CACHE_COPY = 64*1024
  #IN_CACHE_COPYBIG = #IN_CACHE_COPY/64
  MOV esi, *src ; source array
  MOV edi, *dst ; destination array
  MOV ecx, size ; byte count (the routine expects bytes, not qwords)
 
  MOV  ebx, ecx  ; keep a copy of count
  CLD
  CMP  ecx, #TINY_BLOCK_COPY
  JB  l_memcpy_ic_3 ; tiny? skip mmx copy
  CMP  ecx, 32*1024  ; don't align between 32k-64k because
  JBE  l_memcpy_do_align ;  it appears to be slower
  CMP  ecx, 64*1024
  JBE  l_memcpy_align_done
  memcpy_do_align:
  MOV  ecx, 8   ; a trick that's faster than rep movsb...
  SUB  ecx, edi  ; align destination to qword
  AND  ecx, 7 ; 111b  ; get the low bits
  SUB  ebx, ecx  ; update copy count
  NEG  ecx    ; set up to jump into the array
  ADD  ecx, l_memcpy_align_done
  JMP  ecx    ; jump to array of movsb's
  ;align 4
  !movsb
  !movsb
  !movsb
  !movsb
  !movsb
  !movsb
  !movsb
  !movsb
  memcpy_align_done:   ; destination is dword aligned
  MOV  ecx, ebx  ; number of bytes left to copy
  SHR  ecx, 6   ; get 64-byte block count
  JZ  l_memcpy_ic_2 ; finish the last few bytes
  CMP  ecx, #IN_CACHE_COPYBIG ; too big 4 cache? use uncached copy
  JAE  l_memcpy_uc_test
 
  ;    !align 16
  memcpy_ic_1:   ; 64-byte block copies, in-cache copy
  !prefetchnta [esi+(200*64/34+192)]  ; start reading ahead
  !movq mm0, [esi+0] ; read 64 bits
  !movq mm1, [esi+8]
  !movq [edi+0], mm0 ; write 64 bits
  !movq [edi+8], mm1 ;    note:  the normal !movq writes the
  !movq mm2, [esi+16] ;    data to cache; a cache line will be
  !movq mm3, [esi+24] ;    allocated as needed, to store the data
  !movq [edi+16], mm2
  !movq [edi+24], mm3
  !movq mm0, [esi+32]
  !movq mm1, [esi+40]
  !movq [edi+32], mm0
  !movq [edi+40], mm1
  !movq mm2, [esi+48]
  !movq mm3, [esi+56]
  !movq [edi+48], mm2
  !movq [edi+56], mm3
  ADD  esi, 64   ; update source pointer
  ADD  edi, 64   ; update destination pointer
  DEC  ecx    ; count down
  JNZ  l_memcpy_ic_1 ; last 64-byte block?
  memcpy_ic_2:
  MOV  ecx, ebx  ; has valid low 6 bits of the byte count
  memcpy_ic_3:
  SHR  ecx, 2   ; dword count
  AND  ecx, 15 ; %1111  ; only look at the "remainder" bits
  NEG  ecx    ; set up to jump into the array
  ADD  ecx, l_memcpy_last_few
  JMP  ecx    ; jump to array of movsd's
  memcpy_uc_test:
  CMP  ecx, #UNCACHED_COPYPREFETCH ; big enough? use block prefetch copy
  JAE  l_memcpy_bp_1
  memcpy_64_test:
  OR  ecx, ecx  ; tail end of block prefetch will jump here
  JZ  l_memcpy_ic_2 ; no more 64-byte blocks left
 
  memcpy_uc_1:    ; 64-byte blocks, uncached copy
  !prefetchnta [esi+(200*64/34+192)]  ; start reading ahead
  !movq mm0, [esi+0]  ; read 64 bits
  ADD  edi, 64   ; update destination pointer
  !movq mm1, [esi+8]
  ADD  esi, 64   ; update source pointer
  !movq mm2, [esi-48]
  !movntq [edi-64], mm0 ; write 64 bits, bypassing the cache
  !movq mm0, [esi-40] ;    note: !movntq also prevents the CPU
  !movntq [edi-56], mm1 ;    from READING the destination address
  !movq mm1, [esi-32] ;    into the cache, only to be over-written
  !movntq [edi-48], mm2 ;    so that also helps performance
  !movq mm2, [esi-24]
  !movntq [edi-40], mm0
  !movq mm0, [esi-16]
  !movntq [edi-32], mm1
  !movq mm1, [esi-8]
  !movntq [edi-24], mm2
  !movntq [edi-16], mm0
  DEC  ecx
  !movntq [edi-8], mm1
  JNZ  l_memcpy_uc_1 ; last 64-byte block?
  JMP  l_memcpy_ic_2  ; almost done
 
  memcpy_bp_1:   ; large blocks, block prefetch copy
  CMP  ecx, #CACHEBLOCK   ; big enough to run another prefetch loop?
  JL  l_memcpy_64_test   ; no, back to regular uncached copy
  MOV  eax, #CACHEBLOCKPREFETCH  ; block prefetch loop, unrolled 2X
  ADD  esi, #CACHEBLOCKTOP ; move to the top of the block
  ;    !align 16
  memcpy_bp_2:
  MOV  edx, [esi-64]  ; grab one address per cache line
  MOV  edx, [esi-128]  ; grab one address per cache line
  SUB  esi, 128   ; go reverse order
  DEC  eax     ; count down the cache lines
  JNZ  l_memcpy_bp_2  ; keep grabbing more lines into cache
  MOV  eax, #CACHEBLOCK  ; now that it's in cache, do the copy
  ;    !align 16
  memcpy_bp_3:
  !movq mm0, [esi]  ; read 64 bits
  !movq mm1, [esi+ 8]
  !movq mm2, [esi+16]
  !movq mm3, [esi+24]
  !movq mm4, [esi+32]
  !movq mm5, [esi+40]
  !movq mm6, [esi+48]
  !movq mm7, [esi+56]
  ADD  esi, 64    ; update source pointer
  !movntq [edi], mm0  ; write 64 bits, bypassing cache
  !movntq [edi+ 8], mm1  ;    note: !movntq also prevents the CPU
  !movntq [edi+16], mm2  ;    from READING the destination address
  !movntq [edi+24], mm3  ;    into the cache, only to be over-written,
  !movntq [edi+32], mm4  ;    so that also helps performance
  !movntq [edi+40], mm5
  !movntq [edi+48], mm6
  !movntq [edi+56], mm7
  ADD  edi, 64    ; update dest pointer
  DEC  eax     ; count down
  JNZ  l_memcpy_bp_3  ; keep copying
  SUB  ecx, #CACHEBLOCK  ; update the 64-byte block count
  JMP  l_memcpy_bp_1  ; keep processing chunks
  ;The smallest copy uses the X86 "!movsd" instruction, in an optimized
  ;form which is an "unrolled loop".   Then it handles the last few bytes.
  ;    !align 4
  !movsd
  !movsd   ; perform last 1-15 dword copies
  !movsd
  !movsd
  !movsd
  !movsd
  !movsd
  !movsd
  !movsd
  !movsd   ; perform last 1-7 dword copies
  !movsd
  !movsd
  !movsd
  !movsd
  !movsd
  !movsd
  memcpy_last_few:  ; dword aligned from before !movsd's
  MOV  ecx, ebx ; has valid low 2 bits of the byte count
  AND  ecx, 3 ; %11 ; the last few cows must come home
  JZ  l_memcpy_final ; no more, let's leave
  REP  movsb  ; the last 1, 2, or 3 bytes
  memcpy_final:
  !emms    ; clean up the MMX state
  !sfence    ; flush the write buffer
EndProcedure

manyk = Pow(2, 26)
source = AllocateMemory(manyk)
destination = AllocateMemory(manyk)
For a=0 To manyk-1 Step 4
  PokeL(source+a, Random($fffffff))
Next
time = ElapsedMilliseconds()
For a=1 To 10
  CopyMemoryAMD(source, destination, manyk)
Next
manyk_AMD.s = Str((ElapsedMilliseconds()-time))
time = ElapsedMilliseconds()
For a=1 To 10
  CopyMemory(source, destination, manyk)
Next
manyk_PB.s=Str((ElapsedMilliseconds()-time))
FreeMemory(source)
FreeMemory(destination)
manyk_times = Val(manyk_PB)/Val(manyk_AMD)

ameg = Pow(2, 20)
source = AllocateMemory(ameg)
destination = AllocateMemory(ameg)
For a=0 To ameg-1 Step 4
  PokeL(source+a, Random($fffffff))
Next
time = ElapsedMilliseconds()
For a=1 To 10000
  CopyMemoryAMD(source, destination, ameg)
Next
onek_AMD.s = Str((ElapsedMilliseconds()-time))
time = ElapsedMilliseconds()
For a=1 To 10000
  CopyMemory(source, destination, ameg)
Next
onek_PB.s = Str((ElapsedMilliseconds()-time))
FreeMemory(source)
FreeMemory(destination)
onek_times = Val(onek_PB)/Val(onek_AMD)

hk = 102400
source = AllocateMemory(hk)
destination = AllocateMemory(hk)
For a=0 To hk-1 Step 4
  PokeL(source+a, Random($fffffff))
Next
time = ElapsedMilliseconds()
For a=1 To 10000
  CopyMemoryAMD(source, destination, hk)
Next
hundredk_AMD.s = Str((ElapsedMilliseconds()-time))
time = ElapsedMilliseconds()
For a=1 To 10000
  CopyMemory(source, destination, hk)
Next
hundredk_PB.s = Str((ElapsedMilliseconds()-time))
FreeMemory(source)
FreeMemory(destination)
hundredk_times = Val(hundredk_PB)/Val(hundredk_AMD)

results.s="--- 64 MB tranfer test ---"+#LFCR$
results.s+"AMD Function : "+ manyk_AMD +#LFCR$
results.s+"Pure Function : "+ manyk_PB +#LFCR$
results.s+"AMD Function is "+Str(manyk_times)+" times faster."+#LFCR$
results.s+#LFCR$
results.s+"--- 1 MB tranfer test ---"+#LFCR$
results.s+"AMD Function : "+ onek_AMD +#LFCR$
results.s+"Pure Function : "+ onek_PB +#LFCR$
results.s+"AMD Function is "+Str(onek_times)+" times faster."+#LFCR$
results.s+#LFCR$
results.s+"--- 100kb tranfer test ---"+#LFCR$
results.s+"AMD Function : "+ hundredk_AMD +#LFCR$
results.s+"Pure Function : "+ hundredk_PB +#LFCR$
results.s+"AMD Function is "+Str(hundredk_times)+" times faster."+#LFCR$

MessageRequester("Test Results", results.s)
Now the figures are nearer yours:

Code:

---------------------------
Test Results
---------------------------
--- 64 MB transfer test ---
AMD Function : 78
Pure Function : 1547
AMD Function is 19 times faster.

--- 1 MB transfer test ---
AMD Function : 547
Pure Function : 25078
AMD Function is 45 times faster.

--- 100 KB transfer test ---
AMD Function : 16
Pure Function : 391
AMD Function is 24 times faster.
Regards,
Last edited by El_Choni on Thu Nov 25, 2004 1:56 pm, edited 1 time in total.
El_Choni
traumatic
PureBasic Expert
Posts: 1661
Joined: Sun Apr 27, 2003 4:41 pm
Location: Germany

Post by traumatic »

El_Choni wrote: @Max2: PC with MMX.
I just tried it at work on a PII :roll: (with MMX) and it crashes.
Are you sure about the requirements?
Good programmers don't comment their code. It was hard to write, should be hard to read.
El_Choni
TailBite Expert
Posts: 1007
Joined: Fri Apr 25, 2003 6:09 pm
Location: Spain

Post by El_Choni »

Are you sure about the requirements?
No, I'm not sure. I guess it uses some MMX 2 instructions, or whatever. I'll check this.

EDIT: OK, I've learned that prefetchnta, movntq and sfence are part of SSE, which is implemented on the P3 and later, and on the Athlon XP and later. Here is some code to check for SSE support:

Code:

Procedure IsSSESupported() ; Returns 33554432 if supported, 0 if not supported
  result = 0
  XOR EDX, EDX        ; zero edx first, so we don't get a bogus result if CPUID is unavailable
  MOV eax, 1          ; CPUID level 1
  !CPUID              ; EDX = feature flags
  AND edx, $2000000   ; test bit 25 of the feature flags
  MOV result, edx     ; <>0 if SSE is supported
  ProcedureReturn result
EndProcedure

Debug IsSSESupported()
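
With that check in place, a cautious caller can dispatch at runtime and fall back to the native routine on CPUs without SSE (a sketch; the wrapper name SafeCopyMemory is made up, it's not from the posts above):

Code:

Procedure SafeCopyMemory(*src, *dst, size)
  Static sseChecked, sseOk
  If sseChecked = 0
    sseOk = IsSSESupported() ; test once, remember the answer
    sseChecked = 1
  EndIf
  If sseOk
    CopyMemoryAMD(*src, *dst, size)
  Else
    CopyMemory(*src, *dst, size) ; native fallback
  EndIf
EndProcedure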
traumatic, could you try stripping the prefetchnta and sfence instructions, and changing all the movntq to movq instructions, to see how it compares on the Pentium II?
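
For reference, the substitution would look like this at each store in the copy loops (untested sketch; movq is plain MMX and has been available since the original Pentium MMX):

Code:

; SSE streaming store, as in the routine above:
; !movntq [edi-64], mm0
; plain MMX store, cacheable, PII-safe:
!movq [edi-64], mm0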
Last edited by El_Choni on Thu Nov 25, 2004 2:24 pm, edited 1 time in total.
El_Choni
Num3
PureBasic Expert
PureBasic Expert
Posts: 2812
Joined: Fri Apr 25, 2003 4:51 pm
Location: Portugal, Lisbon

Post by Num3 »

El_Choni wrote: @Max2: PC with MMX.

It's funny, the routine is supposed to be optimized for AMD but runs faster on Pentiums 8O
I haven't tested it with my home PC's Athlon XP 2000+, but I guess it will come closer to those numbers, because I have my memory at 333 MHz, and here at the office it's only at 133 MHz.

I also think that the difference is because of the size of the L1 cache: AMD still goes for 64 KB and Intel is now going for 1 MB!!!
Bonne_den_kule
Addict
Posts: 841
Joined: Mon Jun 07, 2004 7:10 pm

Post by Bonne_den_kule »

Results on my Athlon XP 2500+:
1 MB: 13x faster than pure
100 KB: 2x faster

*edited*
Here are my results when it is saved as an exe:

64 MB: 20x faster
1 MB: 40x faster
100 KB: 31x faster
Max.²
Enthusiast
Posts: 175
Joined: Wed Jul 28, 2004 8:38 am

Post by Max.² »

P4-2.4GHz:

17/18/9

Intel Centrino 1.6GHz:

19/11/13


Noticed that the results get worse from time to time. Guess XP memory handling is getting more and more involved...
Max.²
Enthusiast
Posts: 175
Joined: Wed Jul 28, 2004 8:38 am

Post by Max.² »

Couldn't that be something for Fred to add as an optimized command, so the compiler options (Dynamic, MMX, SSE, ...) would make sense?
freedimension
Enthusiast
Posts: 613
Joined: Tue May 06, 2003 2:50 pm
Location: Germany

Post by freedimension »

mobile AMD Athlon XP 1500+ (with lots of programs running)
--- 64 MB transfer test ---
AMD Function : 330
Pure Function : 5028
AMD Function is 15 times faster.

--- 1 MB transfer test ---
AMD Function : 2373
Pure Function : 77972
AMD Function is 32 times faster.

--- 100 KB transfer test ---
AMD Function : 20
Pure Function : 711
AMD Function is 35 times faster.
<°)))o><²³
traumatic
PureBasic Expert
Posts: 1661
Joined: Sun Apr 27, 2003 4:41 pm
Location: Germany

Post by traumatic »

El_Choni wrote: traumatic, could you try stripping the prefetchnta and sfence instructions, and changing all the movntq to movq instructions to see how it compares in the Pentium II?
Maybe I did something wrong (as I don't really know ASM) but that way it still crashes.

BTW, according to IsSSESupported(), this CPU does support SSE.
Good programmers don't comment their code. It was hard to write, should be hard to read.
El_Choni
TailBite Expert
Posts: 1007
Joined: Fri Apr 25, 2003 6:09 pm
Location: Spain

Post by El_Choni »

Weird, it should work... Could you check where the crash occurs precisely?
El_Choni