Faster memory copy (optimized for AMD)

traumatic · Post by **traumatic** » Thu Nov 25, 2004 3:49 pm

El_Choni wrote:Weird, it should work... Could you check where the crash occurs precisely?

As I said, I'm at work right now (having a computer like this at home was no fun at all), so I can't really have a closer look right now, but the crash occurs at memcpy_bp_2, at least as far as I can see.

If that already gives you a clue and you have some further code to test, of course I'll try to help as much as I can (regardless of my more than limited ASM knowledge).

El_Choni · Post by **El_Choni** » Thu Nov 25, 2004 5:00 pm

@traumatic, could you try this code (when you can) on your Pentium 2?

Code:

Procedure CopyMemoryAMD(*src, *dst, size)
  #CACHEBLOCK = $80
  #CACHEBLOCKPREFETCH = #CACHEBLOCK/2
  #CACHEBLOCKTOP = #CACHEBLOCK*64
  #UNCACHED_COPY = 197*1024
  #UNCACHED_COPYPREFETCH = #UNCACHED_COPY/64
  #TINY_BLOCK_COPY = 64
  #IN_CACHE_COPY = 64*1024
  #IN_CACHE_COPYBIG = #IN_CACHE_COPY/64
    MOV esi, *src ; source array
    MOV edi, *dst ; destination array
    MOV ecx, size ; number of bytes to copy (everything below counts bytes, not qwords)
   
    MOV  ebx, ecx  ; keep a copy of count
    CLD
    CMP  ecx, #TINY_BLOCK_COPY
    JB  l_memcpy_ic_3 ; tiny? skip mmx copy
    CMP  ecx, 32*1024  ; don't align between 32k-64k because
    JBE  l_memcpy_do_align ;  it appears to be slower
    CMP  ecx, 64*1024
    JBE  l_memcpy_align_done
  memcpy_do_align:
    MOV  ecx, 8   ; a trick that's faster than rep movsb...
    SUB  ecx, edi  ; align destination to qword
    AND  ecx, 7 ; 111b  ; get the low bits
    SUB  ebx, ecx  ; update copy count
    NEG  ecx    ; set up to jump into the array
    ADD  ecx, l_memcpy_align_done
    JMP  ecx    ; jump to array of movsb's
    ;!align 4
    !movsb
    !movsb
    !movsb
    !movsb
    !movsb
    !movsb
    !movsb
    !movsb
  memcpy_align_done:   ; destination is dword aligned
    MOV  ecx, ebx  ; number of bytes left to copy
    SHR  ecx, 6   ; get 64-byte block count
    JZ  l_memcpy_ic_2 ; finish the last few bytes
    CMP  ecx, #IN_CACHE_COPYBIG ; too big 4 cache? use uncached copy
    JAE  l_memcpy_uc_test
   
    ;    !align 16
  memcpy_ic_1:   ; 64-byte block copies, in-cache copy
    ;!prefetchnta [esi+(200*64/34+192)]  ; start reading ahead
    !movq mm0, [esi+0] ; read 64 bits
    !movq mm1, [esi+8]
    !movq [edi+0], mm0 ; write 64 bits
    !movq [edi+8], mm1 ;    note:  the normal !movq writes the
    !movq mm2, [esi+16] ;    data to cache; a cache line will be
    !movq mm3, [esi+24] ;    allocated as needed, to store the data
    !movq [edi+16], mm2
    !movq [edi+24], mm3
    !movq mm0, [esi+32]
    !movq mm1, [esi+40]
    !movq [edi+32], mm0
    !movq [edi+40], mm1
    !movq mm2, [esi+48]
    !movq mm3, [esi+56]
    !movq [edi+48], mm2
    !movq [edi+56], mm3
    ADD  esi, 64   ; update source pointer
    ADD  edi, 64   ; update destination pointer
    DEC  ecx    ; count down
    JNZ  l_memcpy_ic_1 ; last 64-byte block?
  memcpy_ic_2:
    MOV  ecx, ebx  ; has valid low 6 bits of the byte count
  memcpy_ic_3:
    SHR  ecx, 2   ; dword count
    AND  ecx, 15 ; %1111  ; only look at the "remainder" bits (16 movsd's below)
    NEG  ecx    ; set up to jump into the array
    ADD  ecx, l_memcpy_last_few
    JMP  ecx    ; jump to array of movsd's
  memcpy_uc_test:
    CMP  ecx, #UNCACHED_COPYPREFETCH ; big enough? use block prefetch copy
    JAE  l_memcpy_bp_1
  memcpy_64_test:
    OR  ecx, ecx  ; tail end of block prefetch will jump here
    JZ  l_memcpy_ic_2 ; no more 64-byte blocks left
   
  memcpy_uc_1:    ; 64-byte blocks, uncached copy
    ;!prefetchnta [esi+(200*64/34+192)]  ; start reading ahead
    !movq mm0, [esi+0]  ; read 64 bits
    ADD  edi, 64   ; update destination pointer
    !movq mm1, [esi+8]
    ADD  esi, 64   ; update source pointer
    !movq mm2, [esi-48]
    !movq [edi-64], mm0 ; write 64 bits
    !movq mm0, [esi-40] ;    note: AMD's original uses movntq for these
    !movq [edi-56], mm1 ;    stores to bypass the cache; plain movq, as
    !movq mm1, [esi-32] ;    used here, still writes through the cache,
    !movq [edi-48], mm2 ;    so this loop is not truly "uncached"
    !movq mm2, [esi-24]
    !movq [edi-40], mm0
    !movq mm0, [esi-16]
    !movq [edi-32], mm1
    !movq mm1, [esi-8]
    !movq [edi-24], mm2
    !movq [edi-16], mm0
    DEC  ecx
    !movq [edi-8], mm1
    JNZ  l_memcpy_uc_1 ; last 64-byte block?
    JMP  l_memcpy_ic_2  ; almost done
   
  memcpy_bp_1:   ; large blocks, block prefetch copy
    CMP  ecx, #CACHEBLOCK   ; big enough to run another prefetch loop?
    JL  l_memcpy_64_test   ; no, back to regular uncached copy
    MOV  eax, #CACHEBLOCKPREFETCH  ; block prefetch loop, unrolled 2X
    ADD  esi, #CACHEBLOCKTOP ; move to the top of the block
    ;    !align 16
  memcpy_bp_2:
    MOV  edx, [esi-64]  ; grab one address per cache line
    MOV  edx, [esi-128]  ; grab one address per cache line
    SUB  esi, 128   ; go reverse order
    DEC  eax     ; count down the cache lines
    JNZ  l_memcpy_bp_2  ; keep grabbing more lines into cache
    MOV  eax, #CACHEBLOCK  ; now that it's in cache, do the copy
    ;    !align 16
  memcpy_bp_3:
    !movq mm0, [esi]  ; read 64 bits
    !movq mm1, [esi+ 8]
    !movq mm2, [esi+16]
    !movq mm3, [esi+24]
    !movq mm4, [esi+32]
    !movq mm5, [esi+40]
    !movq mm6, [esi+48]
    !movq mm7, [esi+56]
    ADD  esi, 64    ; update source pointer
    !movq [edi], mm0  ; write 64 bits
    !movq [edi+ 8], mm1  ;    note: AMD's original uses movntq here to
    !movq [edi+16], mm2  ;    bypass the cache and avoid reading the
    !movq [edi+24], mm3  ;    destination first; plain movq, as used
    !movq [edi+32], mm4  ;    here, still goes through the cache
    !movq [edi+40], mm5
    !movq [edi+48], mm6
    !movq [edi+56], mm7
    ADD  edi, 64    ; update dest pointer
    DEC  eax     ; count down
    JNZ  l_memcpy_bp_3  ; keep copying
    SUB  ecx, #CACHEBLOCK  ; update the 64-byte block count
    JMP  l_memcpy_bp_1  ; keep processing chunks
    ;The smallest copy uses the X86 "!movsd" instruction, in an optimized
    ;form which is an "unrolled loop".   Then it handles the last few bytes.
    ;    !align 4
    !movsd
    !movsd   ; perform last 1-15 dword copies
    !movsd
    !movsd
    !movsd
    !movsd
    !movsd
    !movsd
    !movsd
    !movsd   ; perform last 1-7 dword copies
    !movsd
    !movsd
    !movsd
    !movsd
    !movsd
    !movsd
  memcpy_last_few:  ; dword aligned from before !movsd's
    MOV  ecx, ebx ; has valid low 2 bits of the byte count
    AND  ecx, 3 ; %11 ; the last few cows must come home
    JZ  l_memcpy_final ; no more, let's leave
    REP  movsb  ; the last 1, 2, or 3 bytes
  memcpy_final:
    !emms    ; clean up the MMX state
    ;!sfence    ; flush the write buffer
EndProcedure

manyk = Pow(2, 26)
source = AllocateMemory(manyk)
destination = AllocateMemory(manyk)
If source And destination
  time = ElapsedMilliseconds()
  For a=1 To 10
    CopyMemoryAMD(source, destination, manyk)
  Next
  manyk_AMD.s = Str((ElapsedMilliseconds()-time))
  time = ElapsedMilliseconds()
  For a=1 To 10
    CopyMemory(source, destination, manyk)
  Next
  manyk_PB.s = Str((ElapsedMilliseconds()-time))
  FreeMemory(source)
  FreeMemory(destination)
  source = 0
  destination = 0
  manyk_times = Val(manyk_PB)/Val(manyk_AMD)
Else
  MessageRequester("Error", "Could not allocate two "+Str(manyk/1024)+" KB blocks.")
EndIf

ameg = Pow(2, 20)
source = AllocateMemory(ameg)
destination = AllocateMemory(ameg)
If source And destination
  time = ElapsedMilliseconds()
  For a=1 To 10000
    CopyMemoryAMD(source, destination, ameg)
  Next
  onek_AMD.s = Str((ElapsedMilliseconds()-time))
  time = ElapsedMilliseconds()
  For a=1 To 10000
    CopyMemory(source, destination, ameg)
  Next
  onek_PB.s = Str((ElapsedMilliseconds()-time))
  FreeMemory(source)
  FreeMemory(destination)
  source = 0
  destination = 0
  onek_times = Val(onek_PB)/Val(onek_AMD)
Else
  MessageRequester("Error", "Could not allocate two "+Str(ameg/1024)+" KB blocks.")
EndIf

hk = 1024*100
source = AllocateMemory(hk)
destination = AllocateMemory(hk)
If source And destination
  time = ElapsedMilliseconds()
  For a=1 To 10000
    CopyMemoryAMD(source, destination, hk)
  Next
  hundredk_AMD.s = Str((ElapsedMilliseconds()-time))
  time = ElapsedMilliseconds()
  For a=1 To 10000
    CopyMemory(source, destination, hk)
  Next
  hundredk_PB.s = Str((ElapsedMilliseconds()-time))
  FreeMemory(source)
  FreeMemory(destination)
  source = 0
  destination = 0
  hundredk_times = Val(hundredk_PB)/Val(hundredk_AMD)
Else
  MessageRequester("Error", "Could not allocate two "+Str(hk/1024)+" KB blocks.")
EndIf

results.s="--- 64 MB transfer test ---"+#LFCR$
results.s+"AMD Function : "+ manyk_AMD +#LFCR$
results.s+"Pure Function : "+ manyk_PB +#LFCR$
results.s+"AMD Function is "+Str(manyk_times)+" times faster."+#LFCR$
results.s+#LFCR$
results.s+"--- 1 MB transfer test ---"+#LFCR$
results.s+"AMD Function : "+ onek_AMD +#LFCR$
results.s+"Pure Function : "+ onek_PB +#LFCR$
results.s+"AMD Function is "+Str(onek_times)+" times faster."+#LFCR$
results.s+#LFCR$
results.s+"--- 100kb transfer test ---"+#LFCR$
results.s+"AMD Function : "+ hundredk_AMD +#LFCR$
results.s+"Pure Function : "+ hundredk_PB +#LFCR$
results.s+"AMD Function is "+Str(hundredk_times)+" times faster."+#LFCR$

MessageRequester("Test Results", results.s)

Regards,
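
A side note on the "times faster" lines in the listing above: they divide two integers, so the fractional part is dropped (for example, 1000 ms against 300 ms would be reported as 3 rather than 3.33), and a run that finishes within one timer tick would make the divisor 0. A minimal sketch of a float version, assuming a hypothetical helper named `Speedup` that is not part of the posted code:

```purebasic
; Hypothetical helper, not part of the posted benchmark.
; Returns how many times faster the AMD routine was, as a float.
Procedure.f Speedup(pbMs.l, amdMs.l)
  Protected pb.f, amd.f
  If amdMs <= 0
    ProcedureReturn 0.0 ; elapsed time below the timer resolution: no meaningful ratio
  EndIf
  pb = pbMs   ; convert to float first so the division is not an integer one
  amd = amdMs
  ProcedureReturn pb / amd
EndProcedure

; Usage with made-up timings (milliseconds):
Debug "AMD Function is " + StrF(Speedup(1000, 300), 2) + " times faster."
```

In the benchmark this would stand in for lines such as `manyk_times = Val(manyk_PB)/Val(manyk_AMD)`, with `StrF()` used for display instead of `Str()`.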

traumatic · Post by **traumatic** » Thu Nov 25, 2004 5:38 pm

El_Choni wrote:@traumatic, could you try this code (when you can) in your Pentium 2?

Same result.

Isn't this the same code that I tried after modifying the original procedure according to your instructions?

Well, I'll go home now. I'd be glad to do some further testing tomorrow.

traumatic · Post by **traumatic** » Thu Nov 25, 2004 6:09 pm

I'm at home now

In case you're still interested in other people's results, here are mine (Athlon XP 1600+):

Code:

--- 64 MB transfer test ---
AMD Function : 109
Pure Function : 1719
AMD Function is 15 times faster.

--- 1 MB transfer test ---
AMD Function : 656
Pure Function : 30687
AMD Function is 46 times faster.

--- 100kb transfer test ---
AMD Function : 16
Pure Function : 594
AMD Function is 37 times faster.

Max.² · Post by **Max.²** » Thu Nov 25, 2004 7:08 pm

On an Athlon XP 2400+

Hm, why is traumatic's XP 1600 faster except for the 100kb?

Ah, darn; looks like the BIOS reset itself to lower values...

Code:

--- 64 MB transfer test ---
AMD Function : 100
Pure Function : 2293
AMD Function is 22 times faster.

--- 1 MB transfer test ---
AMD Function : 701
Pure Function : 39457
AMD Function is 56 times faster.

--- 100kb transfer test ---
AMD Function : 10
Pure Function : 400
AMD Function is 40 times faster.

Kale · Post by **Kale** » Thu Nov 25, 2004 7:10 pm

Intel Pentium 4 Prescott Processor 3.2GHz

* New 0.09 micron "Strained SI" manufacturing process
* Improved Hyperthreading Technology
* 1MB on chip, Full Speed L2 Cache
* Increased 16KB L1 Data Cache
* Streaming SIMD Extensions - SSE2 and 13 new SSE3 Instructions
* 31 stage "Hyper Pipelined" Technology for extremely high clock speeds
* 800MHz "Quad Pumped" Front Side Bus
* Rapid Execution Engine - ALU clocked at 2X frequency of core
* 128bit Floating Point/Multimedia unit
* Intel "NetBurst" micro-architecture
* Supported by the Intel i875P and i865G chipsets, with Hyperthreading support
* Intel MMX media enhancement technology
* Memory cacheability up to 4 GB of addressable memory space and system memory scalability up to 64 GB of physical memory
* 1.25 - 1.4V operating voltage range
* 89 - 103 Watts max power dissipation
* Transistor count: 125 million
* Die size: 112 mm²

Xombie · Post by **Xombie** » Thu Nov 25, 2004 7:17 pm

I'm not getting anywhere near what you fellas are getting. I'm running an AMD 64 3800

--- 64 MB test ---
AMD function: 47
Pure function: 813
AMD function is 17 times faster

--- 1 MB test ---
AMD function: 891
Pure function: 11937
AMD function is 13 times faster

--- 100kb test ---
AMD function: 16
Pure function: 297
AMD function is 18 times faster

Now I'm kinda depressed

Max.² · Post by **Max.²** » Thu Nov 25, 2004 7:18 pm

Xombie wrote:I'm not getting anywhere near what you fellas are getting. I'm running an AMD 64 3800

As you've got the fastest system overall, I wouldn't be depressed. But if you insist, then we can trade.

And I wouldn't be surprised if the method of measuring the time is reaching its limits.
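
To illustrate the point about the timing method: several of the 100kb results in this thread are only 10 to 16 ms in total, which may be close to the granularity of `ElapsedMilliseconds()` itself, so one tick of error is a large share of the measurement. A minimal sketch of one way around that, assuming a hypothetical helper `CopyRateMBs` (not part of the posted code) that keeps copying until a minimum time has passed and reports throughput instead of raw milliseconds:

```purebasic
; Hypothetical helper, not part of the posted benchmark.
; Copies the block repeatedly until at least minMs milliseconds have
; elapsed, then returns the throughput in MB per second.
Procedure.f CopyRateMBs(*src, *dst, size.l, minMs.l)
  Protected start.l, elapsed.l, runs.l, bytes.f, seconds.f
  If minMs < 1
    minMs = 1 ; guarantees elapsed > 0 below, so no division by zero
  EndIf
  start = ElapsedMilliseconds()
  Repeat
    CopyMemory(*src, *dst, size) ; swap in the routine under test here
    runs + 1
    elapsed = ElapsedMilliseconds() - start
  Until elapsed >= minMs
  bytes = runs
  bytes = bytes * size          ; total bytes copied, as a float to avoid overflow
  seconds = elapsed / 1000.0
  ProcedureReturn bytes / 1048576.0 / seconds
EndProcedure

; Usage: time a 100 KB block for at least half a second.
size = 1024 * 100
*src = AllocateMemory(size)
*dst = AllocateMemory(size)
If *src And *dst
  Debug StrF(CopyRateMBs(*src, *dst, size, 500), 1) + " MB/s"
  FreeMemory(*src)
  FreeMemory(*dst)
EndIf
```

The 500 ms minimum is an arbitrary choice; the idea is only that the measured interval should be many times larger than one timer tick.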

Num3 · Post by **Num3** » Thu Nov 25, 2004 7:25 pm

I suppose that the more powerful the system is, the less difference there will be between the two functions.

As I suspected, the Pentium's 1 MB L2 cache does make a difference when the memory is buffer-aligned.

I converted the code to work with 64 KB chunks (the size of my L1 cache), and I got even more speed out of the AMD copy function.

So it's a middle ground between programming a memory feed....
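
Num3's converted code is not shown, so the following is only a sketch of what copying in cache-sized chunks could look like. The constant `#CHUNK` and the procedure `CopyInChunks` are hypothetical names, and the built-in `CopyMemory()` stands in for whichever copy routine is being tested:

```purebasic
; Hypothetical sketch, not Num3's actual code.
#CHUNK = 65536 ; 64 KB, the chunk size mentioned in the post above

; Copies size bytes from *src to *dst in pieces of at most #CHUNK bytes.
Procedure CopyInChunks(*src, *dst, size.l)
  Protected offset.l, part.l
  offset = 0
  While offset < size
    part = size - offset ; bytes still to copy
    If part > #CHUNK
      part = #CHUNK      ; never copy more than one chunk at a time
    EndIf
    CopyMemory(*src + offset, *dst + offset, part)
    offset + part
  Wend
EndProcedure
```

Whether this is any faster than one large call will depend on the copy routine and the cache sizes of the machine, so it is worth measuring rather than assuming.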

traumatic · Post by **traumatic** » Thu Nov 25, 2004 7:38 pm

Max.² wrote:Hm, why is traumatic's XP 1600 faster except for the 100kb?

You really don't know? Because it's MY computer!

LuCiFeR[SD] · Post by **LuCiFeR[SD]** » Thu Nov 25, 2004 7:55 pm

Hell, that's quite impressive.

It is damn fast on this AMD Duron 1.3 GHz. This is with a compiled exe, btw...

64mb 19x faster
1mb 30x faster
100kb 121x faster

GeoTrail · Post by **GeoTrail** » Thu Nov 25, 2004 7:56 pm

Here are my results with my homebuilt box running an AMD Athlon XP 2400+ with 1 GB RAM (Kingston power)

--- 64 MB test ---
AMD function: 125
Pure function: 2016
AMD function is 16 times faster

--- 1 MB test ---
AMD function: 750
Pure function: 30156
AMD function is 40 times faster

--- 100kb test ---
AMD function: 16
Pure function: 468
AMD function is 29 times faster

Bonne_den_kule · Post by **Bonne_den_kule** » Thu Nov 25, 2004 8:49 pm

GeoTrail wrote:Here's my result with my homebuilt box running AMD Athlon XP +2400 with 1 GB RAM (Kingston power)

Looks like there is a big difference between an Athlon XP 2400+ and a 2500+. I only have 512 MB of cheap memory. Here are my results:

--- 64 MB transfer test ---
AMD Function : 93
Pure Function : 1610
AMD Function is 17 times faster.

--- 1 MB transfer test ---
AMD Function : 610
Pure Function : 25500
AMD Function is 41 times faster.

--- 100kb transfer test ---
AMD Function : 15
Pure Function : 469
AMD Function is 31 times faster.

Bonne_den_kule · Post by **Bonne_den_kule** » Thu Nov 25, 2004 8:51 pm

But don't forget, mates, that it takes more than this test to show how good and fast your CPU is.

GeoTrail · Post by **GeoTrail** » Thu Nov 25, 2004 9:32 pm

Damn, I think I'm gonna replace that PC2700 module I've got.

Then I would have three PC3200 modules instead.

That should speed things up a bit, since the other two modules currently run at the speed of the slowest module.