Faster memory copy (optimized for AMD)

traumatic
PureBasic Expert
Posts: 1661
Joined: Sun Apr 27, 2003 4:41 pm
Location: Germany

Post by traumatic »

El_Choni wrote:Weird, it should work... Could you check where the crash occurs precisely?

As I said, I'm at work right now (having a computer like this at home would be
no fun at all :P ), so I can't take a closer look yet, but as far as I can see
the crash occurs at memcpy_bp_2.

If that already gives you a clue and you have more code to test,
I'll of course help as much as I can (despite my very limited ASM knowledge ;) ).
Good programmers don't comment their code. It was hard to write, should be hard to read.
El_Choni
TailBite Expert
Posts: 1007
Joined: Fri Apr 25, 2003 6:09 pm
Location: Spain

Post by El_Choni »

@traumatic, could you try this code (when you can) on your Pentium 2?

Code: Select all

Procedure CopyMemoryAMD(*src, *dst, size)
  #CACHEBLOCK = $80
  #CACHEBLOCKPREFETCH = #CACHEBLOCK/2
  #CACHEBLOCKTOP = #CACHEBLOCK*64
  #UNCACHED_COPY = 197*1024
  #UNCACHED_COPYPREFETCH = #UNCACHED_COPY/64
  #TINY_BLOCK_COPY = 64
  #IN_CACHE_COPY = 64*1024
  #IN_CACHE_COPYBIG = #IN_CACHE_COPY/64
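    ; The routine works on a byte count (see SHR ecx, 6 and the final AND ecx, 3),
    ; so pass size itself; size/8 would copy only an eighth of the block.
    MOV esi, *src ; source array
    MOV edi, *dst ; destination array
    MOV ecx, size ; number of bytes to copy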
   
    MOV  ebx, ecx  ; keep a copy of count
    CLD
    CMP  ecx, #TINY_BLOCK_COPY
    JB  l_memcpy_ic_3 ; tiny? skip mmx copy
    CMP  ecx, 32*1024  ; don't align between 32k-64k because
    JBE  l_memcpy_do_align ;  it appears to be slower
    CMP  ecx, 64*1024
    JBE  l_memcpy_align_done
  memcpy_do_align:
    MOV  ecx, 8   ; a trick that's faster than rep movsb...
    SUB  ecx, edi  ; align destination to qword
    AND  ecx, 7 ; 111b  ; get the low bits
    SUB  ebx, ecx  ; update copy count
    NEG  ecx    ; set up to jump into the array
    ADD  ecx, l_memcpy_align_done
    JMP  ecx    ; jump to array of movsb's
    ;!align 4
    !movsb
    !movsb
    !movsb
    !movsb
    !movsb
    !movsb
    !movsb
    !movsb
  memcpy_align_done:   ; destination is dword aligned
    MOV  ecx, ebx  ; number of bytes left to copy
    SHR  ecx, 6   ; get 64-byte block count
    JZ  l_memcpy_ic_2 ; finish the last few bytes
    CMP  ecx, #IN_CACHE_COPYBIG ; too big 4 cache? use uncached copy
    JAE  l_memcpy_uc_test
   
    ;    !align 16
  memcpy_ic_1:   ; 64-byte block copies, in-cache copy
    ;!prefetchnta [esi+(200*64/34+192)]  ; start reading ahead
    !movq mm0, [esi+0] ; read 64 bits
    !movq mm1, [esi+8]
    !movq [edi+0], mm0 ; write 64 bits
    !movq [edi+8], mm1 ;    note:  the normal !movq writes the
    !movq mm2, [esi+16] ;    data to cache; a cache line will be
    !movq mm3, [esi+24] ;    allocated as needed, to store the data
    !movq [edi+16], mm2
    !movq [edi+24], mm3
    !movq mm0, [esi+32]
    !movq mm1, [esi+40]
    !movq [edi+32], mm0
    !movq [edi+40], mm1
    !movq mm2, [esi+48]
    !movq mm3, [esi+56]
    !movq [edi+48], mm2
    !movq [edi+56], mm3
    ADD  esi, 64   ; update source pointer
    ADD  edi, 64   ; update destination pointer
    DEC  ecx    ; count down
    JNZ  l_memcpy_ic_1 ; last 64-byte block?
  memcpy_ic_2:
    MOV  ecx, ebx  ; has valid low 6 bits of the byte count
  memcpy_ic_3:
    SHR  ecx, 2   ; dword count
    AND  ecx, 15 ; %1111  ; only look at the "remainder" bits (0-15 dwords; the movsd array below has 16 entries)
    NEG  ecx    ; set up to jump into the array
    ADD  ecx, l_memcpy_last_few
    JMP  ecx    ; jump to array of movsd's
  memcpy_uc_test:
    CMP  ecx, #UNCACHED_COPYPREFETCH ; big enough? use block prefetch copy
    JAE  l_memcpy_bp_1
  memcpy_64_test:
    OR  ecx, ecx  ; tail end of block prefetch will jump here
    JZ  l_memcpy_ic_2 ; no more 64-byte blocks left
   
  memcpy_uc_1:    ; 64-byte blocks, uncached copy
    ;!prefetchnta [esi+(200*64/34+192)]  ; start reading ahead
    !movq mm0, [esi+0]  ; read 64 bits
    ADD  edi, 64   ; update destination pointer
    !movq mm1, [esi+8]
    ADD  esi, 64   ; update source pointer
    !movq mm2, [esi-48]
    !movq [edi-64], mm0 ; write 64 bits
    !movq mm0, [esi-40] ;    note: AMD's original uses movntq for these
    !movq [edi-56], mm1 ;    stores to bypass the cache; plain movq goes
    !movq mm1, [esi-32] ;    through the cache but also runs on CPUs
    !movq [edi-48], mm2 ;    without movntq, such as the Pentium II
    !movq mm2, [esi-24]
    !movq [edi-40], mm0
    !movq mm0, [esi-16]
    !movq [edi-32], mm1
    !movq mm1, [esi-8]
    !movq [edi-24], mm2
    !movq [edi-16], mm0
    DEC  ecx
    !movq [edi-8], mm1
    JNZ  l_memcpy_uc_1 ; last 64-byte block?
    JMP  l_memcpy_ic_2  ; almost done
   
  memcpy_bp_1:   ; large blocks, block prefetch copy
    CMP  ecx, #CACHEBLOCK   ; big enough to run another prefetch loop?
    JL  l_memcpy_64_test   ; no, back to regular uncached copy
    MOV  eax, #CACHEBLOCKPREFETCH  ; block prefetch loop, unrolled 2X
    ADD  esi, #CACHEBLOCKTOP ; move to the top of the block
    ;    !align 16
  memcpy_bp_2:
    MOV  edx, [esi-64]  ; grab one address per cache line
    MOV  edx, [esi-128]  ; grab one address per cache line
    SUB  esi, 128   ; go reverse order
    DEC  eax     ; count down the cache lines
    JNZ  l_memcpy_bp_2  ; keep grabbing more lines into cache
    MOV  eax, #CACHEBLOCK  ; now that it's in cache, do the copy
    ;    !align 16
  memcpy_bp_3:
    !movq mm0, [esi]  ; read 64 bits
    !movq mm1, [esi+ 8]
    !movq mm2, [esi+16]
    !movq mm3, [esi+24]
    !movq mm4, [esi+32]
    !movq mm5, [esi+40]
    !movq mm6, [esi+48]
    !movq mm7, [esi+56]
    ADD  esi, 64    ; update source pointer
    !movq [edi], mm0  ; write 64 bits
    !movq [edi+ 8], mm1  ;    note: plain movq stores go through the
    !movq [edi+16], mm2  ;    cache; only a non-temporal store such as
    !movq [edi+24], mm3  ;    movntq would avoid reading the destination
    !movq [edi+32], mm4  ;    line into the cache first
    !movq [edi+40], mm5
    !movq [edi+48], mm6
    !movq [edi+56], mm7
    ADD  edi, 64    ; update dest pointer
    DEC  eax     ; count down
    JNZ  l_memcpy_bp_3  ; keep copying
    SUB  ecx, #CACHEBLOCK  ; update the 64-byte block count
    JMP  l_memcpy_bp_1  ; keep processing chunks
    ;The smallest copy uses the X86 "!movsd" instruction, in an optimized
    ;form which is an "unrolled loop".   Then it handles the last few bytes.
    ;    !align 4
    !movsd
    !movsd   ; perform last 1-15 dword copies
    !movsd
    !movsd
    !movsd
    !movsd
    !movsd
    !movsd
    !movsd
    !movsd   ; perform last 1-7 dword copies
    !movsd
    !movsd
    !movsd
    !movsd
    !movsd
    !movsd
  memcpy_last_few:  ; dword aligned from before !movsd's
    MOV  ecx, ebx ; has valid low 2 bits of the byte count
    AND  ecx, 3 ; %11 ; the last few cows must come home
    JZ  l_memcpy_final ; no more, let's leave
    REP  movsb  ; the last 1, 2, or 3 bytes
  memcpy_final:
    !emms    ; clean up the MMX state
    ;!sfence    ; flush the write buffer
EndProcedure

manyk = Pow(2, 26)
source = AllocateMemory(manyk)
destination = AllocateMemory(manyk)
If source And destination
  time = ElapsedMilliseconds()
  For a=1 To 10
    CopyMemoryAMD(source, destination, manyk)
  Next
  manyk_AMD.s = Str((ElapsedMilliseconds()-time))
  time = ElapsedMilliseconds()
  For a=1 To 10
    CopyMemory(source, destination, manyk)
  Next
  manyk_PB.s = Str((ElapsedMilliseconds()-time))
  FreeMemory(source)
  FreeMemory(destination)
  source = 0
  destination = 0
  manyk_times = Val(manyk_PB)/Val(manyk_AMD)
Else
  MessageRequester("Error", "Could not allocate two "+Str(manyk/1024)+" KB blocks.")
EndIf

ameg = Pow(2, 20)
source = AllocateMemory(ameg)
destination = AllocateMemory(ameg)
If source And destination
  time = ElapsedMilliseconds()
  For a=1 To 10000
    CopyMemoryAMD(source, destination, ameg)
  Next
  onek_AMD.s = Str((ElapsedMilliseconds()-time))
  time = ElapsedMilliseconds()
  For a=1 To 10000
    CopyMemory(source, destination, ameg)
  Next
  onek_PB.s = Str((ElapsedMilliseconds()-time))
  FreeMemory(source)
  FreeMemory(destination)
  source = 0
  destination = 0
  onek_times = Val(onek_PB)/Val(onek_AMD)
Else
  MessageRequester("Error", "Could not allocate two "+Str(ameg/1024)+" KB blocks.")
EndIf

hk = 1024*100
source = AllocateMemory(hk)
destination = AllocateMemory(hk)
If source And destination
  time = ElapsedMilliseconds()
  For a=1 To 10000
    CopyMemoryAMD(source, destination, hk)
  Next
  hundredk_AMD.s = Str((ElapsedMilliseconds()-time))
  time = ElapsedMilliseconds()
  For a=1 To 10000
    CopyMemory(source, destination, hk)
  Next
  hundredk_PB.s = Str((ElapsedMilliseconds()-time))
  FreeMemory(source)
  FreeMemory(destination)
  source = 0
  destination = 0
  hundredk_times = Val(hundredk_PB)/Val(hundredk_AMD)
Else
  MessageRequester("Error", "Could not allocate two "+Str(hk/1024)+" KB blocks.")
EndIf

results.s="--- 64 MB transfer test ---"+#LFCR$
results.s+"AMD Function : "+ manyk_AMD +#LFCR$
results.s+"Pure Function : "+ manyk_PB +#LFCR$
results.s+"AMD Function is "+Str(manyk_times)+" times faster."+#LFCR$
results.s+#LFCR$
results.s+"--- 1 MB transfer test ---"+#LFCR$
results.s+"AMD Function : "+ onek_AMD +#LFCR$
results.s+"Pure Function : "+ onek_PB +#LFCR$
results.s+"AMD Function is "+Str(onek_times)+" times faster."+#LFCR$
results.s+#LFCR$
results.s+"--- 100 KB transfer test ---"+#LFCR$
results.s+"AMD Function : "+ hundredk_AMD +#LFCR$
results.s+"Pure Function : "+ hundredk_PB +#LFCR$
results.s+"AMD Function is "+Str(hundredk_times)+" times faster."+#LFCR$

MessageRequester("Test Results", results.s)
Regards,
El_Choni
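
The benchmark above only measures time; it never checks that the destination actually matches the source. A minimal sketch of such a check follows, assuming the single-argument AllocateMemory() used in the thread. The pattern fill and the variable names are illustrative, and CopyMemory() stands in for whichever routine is under test:

```purebasic
; Hypothetical check (not from the thread): fill the source with a known
; non-zero pattern, copy it, then compare both buffers byte for byte.
size = 1024*100
*source = AllocateMemory(size)
*destination = AllocateMemory(size)
If *source And *destination
  For i = 0 To size-1
    PokeB(*source+i, i & 127)   ; repeating 0..127 pattern
  Next
  CopyMemory(*source, *destination, size)   ; swap in CopyMemoryAMD() to test it
  If CompareMemory(*source, *destination, size)
    Debug "buffers are identical"
  Else
    Debug "buffers differ"
  EndIf
  FreeMemory(*source)
  FreeMemory(*destination)
EndIf
```

A routine that copies only part of the block would show up here as "buffers differ", since the untouched tail of the destination would not carry the pattern.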

Post by traumatic »

El_Choni wrote:@traumatic, could you try this code (when you can) in your Pentium 2?
Same result.

Isn't this the same code that I tried after modifying the original procedure according to your instructions?

Well, I'm going home now; I'd be glad to do some further testing tomorrow.

Post by traumatic »

I'm at home now :)

In case you're still interested in other people's results, here are mine -> Athlon XP 1600+

Code: Select all

--- 64 MB tranfer test ---
AMD Function : 109
Pure Function : 1719
AMD Function is 15 times faster.

--- 1 MB tranfer test ---
AMD Function : 656
Pure Function : 30687
AMD Function is 46 times faster.

--- 100kb tranfer test ---
AMD Function : 16
Pure Function : 594
AMD Function is 37 times faster.
Max.²
Enthusiast
Posts: 175
Joined: Wed Jul 28, 2004 8:38 am

Post by Max.² »

On an Athlon XP 2400+

Hm, why is traumatic's XP 1600 faster except for the 100kb test?

Ah, darn; looks like the BIOS reset itself to lower values...

Code: Select all

--- 64 MB tranfer test ---
AMD Function : 100
Pure Function : 2293
AMD Function is 22 times faster.

--- 1 MB tranfer test ---
AMD Function : 701
Pure Function : 39457
AMD Function is 56 times faster.

--- 100kb tranfer test ---
AMD Function : 10
Pure Function : 400
AMD Function is 40 times faster.

Last edited by Max.² on Thu Nov 25, 2004 7:13 pm, edited 2 times in total.
Kale
PureBasic Expert
Posts: 3000
Joined: Fri Apr 25, 2003 6:03 pm
Location: Lincoln, UK

Post by Kale »

Intel Pentium 4 Prescott Processor 3.2GHz

* New .09 micron "Strained SI" manufacturing process
* Improved Hyperthreading Technology
* 1MB on chip, Full Speed L2 Cache
* Increased 16KB L1 Data Cache
* Streaming SIMD Extensions - SSE2 and 13 new SSE3 Instructions
* 31 stage "Hyper Pipelined" Technology for extremely high clock speeds
* 800MHz "Quad Pumped" Front Side Bus
* Rapid Execution Engine - ALU clocked at 2X frequency of core
* 128bit Floating Point/Multimedia unit
* Intel "NetBurst" micro-architecture
* Supported by the Intel i875P and i865G chipsets, with Hyperthreading support
* Intel MMX media enhancement technology
* Memory cacheability up to 4 GB of addressable memory space and system memory scalability up to 64 GB of physical memory
* 1.25 - 1.4V operating voltage range
* 89 - 103 Watts max power dissipation
* Transistor count: 125 million
* Die size: 112 mm2
--Kale

Xombie
Addict
Posts: 898
Joined: Thu Jul 01, 2004 2:51 am
Location: Tacoma, WA

Post by Xombie »

I'm not getting anywhere near what you fellas are getting. I'm running an AMD 64 3800

--- 64 MB test ---
AMD function: 47
Pure function: 813
AMD function is 17 times faster

--- 1 MB test ---
AMD function: 891
Pure function: 11937
AMD function is 13 times faster

--- 100kb MB test ---
AMD function: 16
Pure function: 297
AMD function is 18 times faster

Now I'm kinda depressed :cry:
Joined: Wed Jul 28, 2004 8:38 am

Post by Max.² »

Xombie wrote:I'm not getting anywhere near what you fellas are getting. I'm running an AMD 64 3800
As you've got the fastest system overall, I wouldn't be depressed. But if you insist, we can trade. :twisted:

And I wouldn't be surprised if the method of measuring the time is reaching its limits.
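
Several of the 100 KB timings in this thread are only 10 to 16 ms in total, which may be close to the resolution of ElapsedMilliseconds(). One way around that, sketched here under assumptions (MsPerCopy and minMs are hypothetical names, and CopyMemory() stands in for the routine being measured), is to keep copying until a minimum time has passed and report a floating-point average instead of an integer ratio:

```purebasic
; Hypothetical helper (not from the thread): repeat the copy until at least
; minMs milliseconds have elapsed, then return the average time per pass.
Procedure.f MsPerCopy(*src, *dst, size, minMs)
  Protected passes, start, elapsed
  Protected result.f
  start = ElapsedMilliseconds()
  Repeat
    CopyMemory(*src, *dst, size)   ; swap in the routine under test
    passes + 1
    elapsed = ElapsedMilliseconds() - start
  Until elapsed >= minMs
  result = elapsed                 ; convert to float before dividing
  result = result / passes         ; passes is always >= 1 here
  ProcedureReturn result
EndProcedure

size = 1024*100
*src = AllocateMemory(size)
*dst = AllocateMemory(size)
If *src And *dst
  Debug StrF(MsPerCopy(*src, *dst, size, 500), 4) + " ms per 100 KB copy"
  FreeMemory(*src)
  FreeMemory(*dst)
EndIf
```

Because the pass count adapts to the machine, a fast system simply performs more passes in the same window, and the ratio of two such averages never divides by a 0 ms reading unless a single pass takes no measurable time at all.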
Num3
PureBasic Expert
Posts: 2812
Joined: Fri Apr 25, 2003 4:51 pm
Location: Portugal, Lisbon

Post by Num3 »

I suppose that the more powerful the system is, the less difference there will be between the two functions.

As I suspected, the Pentium with its 1 MB L2 cache does make a difference when the memory is buffer-aligned.

I converted the code to work with 64 KB chunks (the size of my L1 cache), and I got even more speed out of the AMD copy function.

So it's a mid term between programming a memory feed....
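
Num3's modified code is not shown in the thread. One possible reading of "working with 64 KB chunks" is a wrapper that feeds an existing copy routine one cache-sized piece at a time. The sketch below is an assumption, not Num3's code: CopyMemoryChunked and #CHUNK are hypothetical names, and CopyMemory() stands in for the AMD routine:

```purebasic
#CHUNK = 64*1024   ; chunk size taken from Num3's post; tune it per CPU

; Hypothetical wrapper (not Num3's actual code): copy a large block in
; #CHUNK-sized pieces, handing each piece to an existing copy routine.
Procedure CopyMemoryChunked(*src, *dst, size)
  Protected offset, part
  While offset < size
    part = size - offset           ; bytes still to copy
    If part > #CHUNK
      part = #CHUNK                ; never hand over more than one chunk
    EndIf
    CopyMemory(*src + offset, *dst + offset, part)   ; or CopyMemoryAMD()
    offset + part
  Wend
EndProcedure
```

Whether this helps depends on the cache sizes of the CPU in question, so the chunk size is best treated as a parameter to measure rather than a fixed value.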

Post by traumatic »

Max.² wrote:Hm, why is traumatic's XP 1600 faster except for the 100kb?
You really don't know? Because it's MY computer! :twisted:
LuCiFeR[SD]
666
Posts: 1033
Joined: Mon Sep 01, 2003 2:33 pm

Post by LuCiFeR[SD] »

Hell, that's quite impressive :) It is damn fast on this AMD Duron 1.3 GHz. This is with a compiled exe, btw...

64mb 19x faster
1mb 30x faster
100kb 121x faster
GeoTrail
Addict
Posts: 2794
Joined: Fri Feb 13, 2004 12:45 am
Location: Bergen, Norway

Post by GeoTrail »

Here's my result with my homebuilt box running AMD Athlon XP +2400 with 1 GB RAM (Kingston power) ;)
--- 64 MB test ---
AMD function: 125
Pure function: 2016
AMD function is 16 times faster

--- 1 MB test ---
AMD function: 750
Pure function: 30156
AMD function is 40 times faster

--- 100kb MB test ---
AMD function: 16
Pure function: 468
AMD function is 29 times faster
I Stepped On A Cornflake!!! Now I'm A Cereal Killer!
Bonne_den_kule
Addict
Posts: 841
Joined: Mon Jun 07, 2004 7:10 pm

Post by Bonne_den_kule »

GeoTrail wrote:Here's my result with my homebuilt box running AMD Athlon XP +2400 with 1 GB RAM (Kingston power) ;)
--- 64 MB test ---
AMD function: 125
Pure function: 2016
AMD function is 16 times faster

--- 1 MB test ---
AMD function: 750
Pure function: 30156
AMD function is 40 times faster

--- 100kb MB test ---
AMD function: 16
Pure function: 468
AMD function is 29 times faster
Looks like there is a big difference between an Athlon XP 2400 and a 2500. I have only 512 MB of cheap memory. Here are my results:

--- 64 MB tranfer test ---
AMD Function : 93
Pure Function : 1610
AMD Function is 17 times faster.

--- 1 MB tranfer test ---
AMD Function : 610
Pure Function : 25500
AMD Function is 41 times faster.

--- 100kb tranfer test ---
AMD Function : 15
Pure Function : 469
AMD Function is 31 times faster.

Post by Bonne_den_kule »

But don't forget, mates, that it takes more than this test to show how good and fast your CPU is.

Post by GeoTrail »

Damn, think I'm gonna replace that PC2700 module I got :(
Then I would have three PC3200 modules instead ;)
That should speed things up a bit, since the two other modules currently run at the speed of the slowest module.