[Implemented] ClearMemory()

Joakim Christiansen · Post by **Joakim Christiansen** » Mon May 29, 2006 2:29 am

ClearMemory(*MemoryBuffer)
Null every byte in the buffer

gnozal · Post by **gnozal** » Mon May 29, 2006 9:49 am

Joakim Christiansen wrote:ClearMemory(*MemoryBuffer)
Null every byte in the buffer

You could use RtlZeroMemory_(*Buffer, BufferLen.l) (Windows only).

RtlZeroMemory wrote:The RtlZeroMemory routine fills a block of memory with zeros, given a pointer to the block and the length, in bytes, to be filled.

VOID
RtlZeroMemory(
IN VOID UNALIGNED *Destination,
IN SIZE_T Length
);

Parameters
Destination
Pointer to the memory to be filled with zeros.
Length
Specifies the number of bytes to be zeroed.

Return Value

None
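For readers outside Windows, the same contract (a pointer plus a length in bytes, no return value) can be sketched in portable C with `memset`. The name `clear_memory` is hypothetical and only mirrors the requested ClearMemory(); it is not the PureBasic or Windows implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper mirroring the requested ClearMemory():
   sets every byte of the buffer to zero. */
static void clear_memory(void *buffer, size_t length)
{
    assert(buffer != NULL || length == 0);
    if (length > 0) {
        memset(buffer, 0, length);
    }
}
```

As with RtlZeroMemory, the caller has to supply the length, since a bare pointer carries no size information.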

Dare · Post by **Dare** » Mon May 29, 2006 10:19 am

That is very fast! Thanks gnozal.

I was doing this, which is quite fast:
FreeMemory(mem) : mem=AllocateMemory(size)
Faster than my custom-built asm memfiller anyway.

travismcgee · Post by **travismcgee** » Mon May 29, 2006 10:32 am

You can even fill the memory with a value other than zero using RtlFillMemory_(*buffer, length, bytevalue).
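A rough portable equivalent in C, assuming a hypothetical wrapper named `fill_memory` that keeps the (buffer, length, value) argument order from the post. Note that `memset` itself takes the value before the count.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical wrapper: fills `length` bytes at `buffer` with `value`,
   using the (buffer, length, value) order shown in the post. */
static void fill_memory(void *buffer, size_t length, unsigned char value)
{
    assert(buffer != NULL || length == 0);
    if (length > 0) {
        memset(buffer, value, length);  /* memset order is (dest, value, count) */
    }
}
```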

Dare · Post by **Dare** » Mon May 29, 2006 11:03 am

lol. nearly 8 times faster than my best with rep movsb.

Thanks travismcgee.

Trond · Post by **Trond** » Mon May 29, 2006 9:32 pm

Dare wrote:lol. nearly 8 times faster than my best with rep movsb. Thanks travismcgee.

I'm almost beating it already, and I've got a kickass optimization up my sleeve for tomorrow.

Here's my current code:

Code:

!jmp ZeroMemoryEnd
!ZeroMemory:
; eax = ptr
; ecx = length in bytes (nonzero multiple of 4)
!add ecx, eax
; ecx now points one past the end, so step back before each store
!ZeroMemoryLoop:
!sub ecx, 4
!mov dword [ecx], 0
!cmp ecx, eax
!jnz ZeroMemoryLoop
!ret
!ZeroMemoryEnd:


#Len = 1*1024*1024
Define Memory = AllocateMemory(#Len)

#Tries = 500
time = GetTickCount_()
For I = 0 To #Tries
  RtlZeroMemory_(Memory, #Len)
Next
MessageRequester("", Str(GetTickCount_()-time))
time = GetTickCount_()
For I = 0 To #Tries
  !mov eax, [v_Memory]
  !mov ecx, 1*1024*1024
  !call ZeroMemory
Next
MessageRequester("", Str(GetTickCount_()-time))

dioxin · Post by **dioxin** » Mon May 29, 2006 10:24 pm

For clearing (or filling or moving) large blocks of memory, the limiting factor is not CPU speed, it's memory bandwidth.
To make the fastest memory filler you need to use as few memory accesses as possible, which means using large registers, so you should consider using MMX or SSE.

You also need to write your code to take advantage of the hardware by keeping things in large, contiguous blocks. This doesn't matter for a memory fill, which is inherently done that way, but it does for a memory copy, where you should read a large block and then write a large block rather than read a word and then write a word.

Then, use the CPU instructions designed for the purpose. Shifting large blocks of data in slow RAM using fast CPUs has been a problem for years so the CPU designers improved the instruction set to help.
This can cause problems if the code is to be run on older processors which may not have the instructions, but any modern CPU will have the MOVNTQ and PREFETCH instructions which can significantly speed up storing and loading of data respectively.

Paul.

Dare · Post by **Dare** » Tue May 30, 2006 12:33 am

Trond wrote:
Dare wrote:lol. nearly 8 times faster than my best with rep movsb. Thanks travismcgee.

I'm almost beating it already, and I've got a kickass optimization up my sleeve for tomorrow.

Thanks for the code snip, Trond.

And yeah, my fill code sucked big time. So does my pattern fill which has two limitations: It only fills the destination in increments of the fill pattern size (no partial fill) and it takes even longer than my filler.

Hi dioxin,

Thanks for the info. (Now to try to understand it.)

dioxin · Post by **dioxin** » Tue May 30, 2006 1:25 am

Dare,

(Now to try to understand it)

Rather than try to explain it, just look at the following code.
I set the size of the data block to fill to 100,000,000 bytes to allow for reasonable timing.

Trond:

Code:

!mov ecx,100000000
!mov eax,StartAddress
ZeroMemory:
'; eax = ptr
'; ecx = counter (bytes, nonzero multiple of 4)
!add ecx, eax
ZeroMemoryLoop:
'; step back first so no store lands past the end of the block
!sub ecx, 4
!mov dword [ecx], 0
!cmp ecx, eax
!jnz ZeroMemoryLoop

takes 102ms on my Athlon 2600+
But try the following changes,
1) store from MMX register 0 (mm0) instead of writing a 32-bit immediate zero
2) since mm0 is twice as big (64 bits) we double the loop increment from 4 to 8 bytes
3) use MOVNTQ instead of MOV or MOVQ to skip the cache which otherwise slows things down

Code:

zero=0  'a 64 bit variable, set to zero.

!mov eax,StartAddress
!mov ecx,99999992    '100000000-8: offset of the last 8-byte block, so the first store stays inside the buffer
!movq mm0,zero
lp2:
!MOVNTQ [eax+ecx],mm0
!sub ecx,8
!jns lp2
!emms    'since we used mmx we have to switch back to FPU or it'll cause confusion

...and it now runs in 38ms, almost 3 times faster.

In this case we're moving 100,000,000 bytes in 38ms = 2.63GB/sec which is near the maximum possible memory throughput on my PC.

Paul.
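The throughput figure above is just bytes divided by elapsed time. A minimal sketch of that arithmetic in C, with a hypothetical helper name and GB taken as 10^9 bytes (which is the unit that reproduces the 2.63 figure):

```c
#include <assert.h>

/* Hypothetical helper: throughput in GB/s (1 GB = 10^9 bytes)
   for a byte count and an elapsed time in milliseconds. */
static double throughput_gb_per_s(double bytes, double milliseconds)
{
    assert(milliseconds > 0.0);
    return bytes / (milliseconds / 1000.0) / 1e9;
}
```

With the numbers from the post, `throughput_gb_per_s(100000000.0, 38.0)` comes out at about 2.63, and the 102ms dword loop at about 0.98.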

Joakim Christiansen · Post by **Joakim Christiansen** » Tue May 30, 2006 1:58 am

Now Fred only needs to copy that asm code to make the new command!

Trond · Post by **Trond** » Tue May 30, 2006 10:54 am

dioxin wrote: ...and it now runs in 38ms, almost 3 times faster.

In this case we're moving 100,000,000 bytes in 38ms = 2.63GB/sec which is near the maximum possible memory throughput on my PC.

Paul.

There's a small problem with both our code snippets: they will write outside the allocated memory if the size of the allocated memory % 4 (or 8 in your case) is not zero. I think.

dioxin · Post by **dioxin** » Tue May 30, 2006 11:39 am

Trond,
the code as posted was not meant to be complete. The intention was to show how to speed up the main part of the loop on a modern CPU.
Complete code would need to check for non-aligned bytes at the start of the block, then fill the main block as shown, then check for extra bytes at the end of the block.

Paul.
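The three phases described above (unaligned head, aligned main block, leftover tail) could look roughly like this in C. This is a sketch under assumptions, not the code from the thread: the name `zero_fill` is hypothetical, and it substitutes the 16-byte SSE2 streaming store `_mm_stream_si128` for the 8-byte MOVNTQ, falling back to `memset` when SSE2 is not available at compile time.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__SSE2__)
#include <emmintrin.h>   /* _mm_stream_si128, _mm_setzero_si128, _mm_sfence */
#endif

/* Hypothetical zero_fill(): head bytes, aligned main block, tail bytes. */
static void zero_fill(void *buffer, size_t length)
{
    unsigned char *p = (unsigned char *)buffer;
    assert(buffer != NULL || length == 0);
    if (length == 0) {
        return;
    }

    /* 1) Head: byte stores until p is 16-byte aligned (or the buffer ends). */
    while (length > 0 && ((uintptr_t)p & 15u) != 0) {
        *p++ = 0;
        length--;
    }

    /* 2) Main block: 16 bytes per store, only while a whole block fits. */
#if defined(__SSE2__)
    {
        const __m128i zero = _mm_setzero_si128();
        while (length >= 16) {
            _mm_stream_si128((__m128i *)p, zero);  /* non-temporal store */
            p += 16;
            length -= 16;
        }
        _mm_sfence();  /* order the streaming stores before later writes */
    }
#else
    {
        /* Assumption: no SSE2 at compile time, so use plain memset. */
        size_t main_bytes = length & ~(size_t)15;
        memset(p, 0, main_bytes);
        p += main_bytes;
        length -= main_bytes;
    }
#endif

    /* 3) Tail: the remaining 0..15 bytes. */
    while (length > 0) {
        *p++ = 0;
        length--;
    }
}
```

Because the main loop only runs while at least 16 bytes remain, a length that is not a multiple of the store size no longer causes a write past the end of the allocation. Whether this beats the platform's own fill routine would have to be measured on the target machine.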

dioxin · Post by **dioxin** » Tue May 30, 2006 1:18 pm

Trond,

I'm almost beating it already, and I've got a kickass optimization up my sleeve for tomorrow.

it's tomorrow. Are we going to get to see what's up your sleeve?

Paul.

Trond · Post by **Trond** » Tue May 30, 2006 2:57 pm

dioxin wrote:Trond,
I'm almost beating it already, and I've got a kickass optimization up my sleeve for tomorrow.
it's tomorrow. Are we going to get to see what's up your sleeve?

Paul.

It won't be close to beating yours since you're using mmx, but I'm working on it.

Edit: It seems to not make any difference at all.

dioxin · Post by **dioxin** » Tue May 30, 2006 4:18 pm

Trond,
the likely reason you couldn't get it any faster is that it was probably running as fast as it could go anyway.
Improvements to the code will just leave the CPU idling longer between memory accesses while it waits for main RAM to catch up.

You need to use the MOVNTxxx instructions to go significantly faster as they access the memory in a different, more efficient, way when using large blocks of memory.

Paul.

PureBasic Forums - English
