[Implemented] ClearMemory()
Posted: Mon May 29, 2006 2:29 am
by Joakim Christiansen
ClearMemory(*MemoryBuffer)
Null every byte in the buffer
Re: ClearMemory()
Posted: Mon May 29, 2006 9:49 am
by gnozal
Joakim Christiansen wrote:ClearMemory(*MemoryBuffer)
Null every byte in the buffer
You could use
RtlZeroMemory_(*Buffer, BufferLen.l) (Windows only).
RtlZeroMemory wrote:The RtlZeroMemory routine fills a block of memory with zeros, given a pointer to the block and the length, in bytes, to be filled.
VOID
RtlZeroMemory(
IN VOID UNALIGNED *Destination,
IN SIZE_T Length
);
Parameters
Destination
Pointer to the memory to be filled with zeros.
Length
Specifies the number of bytes to be zeroed.
Return Value
None
Posted: Mon May 29, 2006 10:19 am
by Dare
That is very fast! Thanks gnozal.
I was doing this, which is quite fast:
FreeMemory(mem) : mem=AllocateMemory(size)
Faster than my custom-built asm memfiller anyway.
Posted: Mon May 29, 2006 10:32 am
by travismcgee
You can even fill the memory with a value other than zero using RtlFillMemory_(*buffer, length, bytevalue).
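For anyone reading along outside Windows, here is a minimal C sketch of the two calls mentioned so far. The names `fill_memory` and `clear_memory` are made up for illustration; the only library call used is the standard `memset`. One thing worth noting is the argument order: RtlFillMemory takes (destination, length, value) while `memset` takes (destination, value, length).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: set every byte of buf[0..len) to `value`.
   Argument order follows RtlFillMemory (dest, length, value),
   which differs from memset (dest, value, length). */
void fill_memory(void *buf, size_t len, unsigned char value)
{
    if (len == 0) {
        return;               /* nothing to do, and buf may be NULL */
    }
    assert(buf != NULL);
    memset(buf, value, len);
}

/* Hypothetical helper: the ClearMemory() asked for at the top of the
   thread, expressed as a fill with zero. */
void clear_memory(void *buf, size_t len)
{
    fill_memory(buf, len, 0);
}
```

Unlike the requested ClearMemory(*MemoryBuffer), this sketch needs the length passed in, because a raw C pointer does not carry its allocation size.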
Posted: Mon May 29, 2006 11:03 am
by Dare
lol. nearly 8 times faster than my best with rep movsb.

Thanks travismcgee.
Posted: Mon May 29, 2006 9:32 pm
by Trond
Dare wrote:lol. nearly 8 times faster than my best with rep movsb.

Thanks travismcgee.
I'm almost beating it already, and I've got a kickass optimization up my sleeve for tomorrow.

Here's my current code:
Code:
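!jmp ZeroMemoryEnd
!ZeroMemory:
; eax = pointer to the buffer
; ecx = length in bytes (must be a non-zero multiple of 4)
; after the add, ecx points one byte past the end of the buffer
!add ecx, eax
!ZeroMemoryLoop:
; step back first so the first store stays inside the buffer
!sub ecx, 4
!mov dword [ecx], 0
!cmp ecx, eax
!jnz ZeroMemoryLoop
!ret
!ZeroMemoryEnd: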
#Len = 1*1024*1024
Define Memory = AllocateMemory(#Len)
#Tries = 500
time = GetTickCount_()
For I = 0 To #Tries
RtlZeroMemory_(Memory, #Len)
Next
MessageRequester("", Str(GetTickCount_()-time))
time = GetTickCount_()
For I = 0 To #Tries
!mov eax, [v_Memory]
!mov ecx, 1*1024*1024
!call ZeroMemory
Next
MessageRequester("", Str(GetTickCount_()-time))
Posted: Mon May 29, 2006 10:24 pm
by dioxin
For clearing (or filling or moving) large blocks of memory, the limiting factor is not CPU speed, it's memory bandwidth.
To make the fastest memory filler you need to use as few memory accesses as possible. That means using large registers, so you should consider using MMX or SSE.
You also need to write your code to take advantage of the hardware by keeping things in large, contiguous blocks. This doesn't apply to a memory fill which is inherently done that way, but does to memory copy where you should read-a-large-block then write-a-large-block rather than read a word then write a word.
Then, use the CPU instructions designed for the purpose. Shifting large blocks of data in slow RAM using fast CPUs has been a problem for years so the CPU designers improved the instruction set to help.
This can cause problems if the code is to be run on older processors which may not have the instructions, but any modern CPU will have the MOVNTQ and PREFETCH instructions which can significantly speed up storing and loading of data respectively.
Paul.
Posted: Tue May 30, 2006 12:33 am
by Dare
Trond wrote:Dare wrote:lol. nearly 8 times faster than my best with rep movsb.

Thanks travismcgee.
I'm almost beating it already, and I've got a kickass optimization up my sleeve for tomorrow.

Thanks for the code snip, Trond.
And yeah, my fill code sucked big time. So does my pattern fill which has two limitations: It only fills the destination in increments of the fill pattern size (no partial fill) and it takes even longer than my filler.
Hi dioxin,
Thanks for the info. (Now to try to understand it.)
Posted: Tue May 30, 2006 1:25 am
by dioxin
Dare,
(Now to try to understand it )
Rather than try to explain it, just look at the following code.
I set the size of the data block to fill to 100,000,000 bytes to allow for reasonable timing.
Trond:
Code:
!mov ecx,100000000
!mov eax,StartAddress
ZeroMemory:
'; eax = ptr
'; ecx = counter
!add ecx, eax
ZeroMemoryLoop:
'; step back first so the first store stays inside the block
!sub ecx, 4
!mov dword [ecx], 0
!cmp ecx, eax
!jnz ZeroMemoryLoop
takes 102ms on my Athlon 2600+
But try the following changes,
1) use MMX register mm0 as the source of the store instead of a 32-bit immediate zero
2) since mm0 is twice as big (64 bits) we double the loop increment from 4 to 8 bytes
3) use MOVNTQ instead of MOV or MOVQ to skip the cache which otherwise slows things down
Code:
zero=0 'a 64 bit variable, set to zero.
!mov eax,StartAddress
!mov ecx,100000000
!movq mm0,zero
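lp2:
!sub ecx,8 'step back first so the first store stays inside the block
!MOVNTQ [eax+ecx],mm0
!jnz lp2 'MOVNTQ leaves the flags alone, so this tests the result of SUB
!emms 'since we used mmx we have to switch back to FPU or it'll cause confusion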
...and it now runs in 38ms, almost 3 times faster.
In this case we're moving 100,000,000 bytes in 38ms = 2.63GB/sec which is near the maximum possible memory throughput on my PC.
Paul.
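For readers who would rather not write the assembly by hand, the same idea can be sketched in C with compiler intrinsics. This is only an illustration under stated assumptions, not the code measured above: it swaps the MMX MOVNTQ for its SSE2 sibling MOVNTDQ (`_mm_stream_si128`), the function name `zero_streaming` is made up, and it assumes the block is 16-byte aligned with a length that is a multiple of 16. On targets without SSE2 it just falls back to `memset`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>            /* SSE2 intrinsics, also pulls in _mm_sfence */
#define HAVE_STREAMING_STORES 1
#endif

/* Hypothetical helper: zero `len` bytes starting at `dst`.
   Assumes dst is 16-byte aligned and len is a multiple of 16. */
void zero_streaming(void *dst, size_t len)
{
    assert(((uintptr_t)dst % 16) == 0);
    assert((len % 16) == 0);
#ifdef HAVE_STREAMING_STORES
    __m128i zero = _mm_setzero_si128();
    __m128i *p = (__m128i *)dst;
    for (size_t i = 0; i < len / 16; i++) {
        /* MOVNTDQ: a 16-byte store that bypasses the cache */
        _mm_stream_si128(p + i, zero);
    }
    /* Order the non-temporal stores before anything that follows */
    _mm_sfence();
#else
    memset(dst, 0, len);          /* fallback when SSE2 is not available */
#endif
}
```

Whether this beats a plain `memset` depends on the block size and the machine, so it is worth timing on the target hardware before relying on it.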
Posted: Tue May 30, 2006 1:58 am
by Joakim Christiansen
Now Fred only needs to copy that asm code to make the new command!

Posted: Tue May 30, 2006 10:54 am
by Trond
dioxin wrote:
...and it now runs in 38ms, almost 3 times faster.
In this case we're moving 100,000,000 bytes in 38ms = 2.63GB/sec which is near the maximum possible memory throughput on my PC.
Paul.
There's a small problem with our code: it will write outside the allocated memory if the size of the allocated memory % 4 (or 8 in your case) is not zero. I think.
Posted: Tue May 30, 2006 11:39 am
by dioxin
Trond,
the code as posted was not meant to be complete. The intention was to show how to speed up the main part of the loop on a modern CPU.
Complete code would need to check for non-aligned bytes at the start of the block, then fill the main block as shown, then check for extra bytes at the end of the block.
Paul.
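The three stages described above (unaligned head, main block, leftover tail) can be sketched in C roughly as follows. This is a plain illustration of the structure, not the MMX routine from earlier in the thread: the name `zero_any` is made up, and the main block uses ordinary 8-byte stores (written through `memcpy` so the compiler can emit a single store) rather than MOVNTQ.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: zero `len` bytes at `dst`, for any alignment
   and any length, without writing outside the block. */
void zero_any(void *dst, size_t len)
{
    unsigned char *p = (unsigned char *)dst;
    const uint64_t zero = 0;

    assert(p != NULL || len == 0);

    /* 1) Head: single bytes until p is 8-byte aligned or len runs out */
    while (len > 0 && ((uintptr_t)p % 8) != 0) {
        *p++ = 0;
        len--;
    }

    /* 2) Main block: aligned 8-byte stores */
    while (len >= sizeof zero) {
        memcpy(p, &zero, sizeof zero);
        p += sizeof zero;
        len -= sizeof zero;
    }

    /* 3) Tail: the remaining 0 to 7 bytes, one at a time */
    while (len > 0) {
        *p++ = 0;
        len--;
    }
}
```

The head and tail loops cost at most 7 byte stores each, so for large blocks nearly all of the time is still spent in the main loop.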
Posted: Tue May 30, 2006 1:18 pm
by dioxin
Trond,
I'm almost beating it already, and I've got a kickass optimization up my sleeve for tomorrow.
it's tomorrow. Are we going to get to see what's up your sleeve?
Paul.
Posted: Tue May 30, 2006 2:57 pm
by Trond
dioxin wrote:Trond,
I'm almost beating it already, and I've got a kickass optimization up my sleeve for tomorrow.
it's tomorrow. Are we going to get to see what's up your sleeve?
Paul.
It won't be close to beating yours since you're using mmx, but I'm working on it.
Edit: It seems to not make any difference at all.

Posted: Tue May 30, 2006 4:18 pm
by dioxin
Trond,
the likely reason you couldn't get it any faster is that it was probably running as fast as it could go anyway.
Improvements to the code will just leave the CPU idling longer between memory accesses while it waits for main RAM to catch up.
You need to use the MOVNTxxx instructions to go significantly faster as they access the memory in a different, more efficient, way when using large blocks of memory.
Paul.