[Implemented] ClearMemory()

Joakim Christiansen
Addict
Posts: 2452
Joined: Wed Dec 22, 2004 4:12 pm
Location: Norway

Post by Joakim Christiansen »

ClearMemory(*MemoryBuffer)
Null every byte in the buffer
I like logic, hence I dislike humans but love computers.
gnozal
PureBasic Expert
Posts: 4229
Joined: Sat Apr 26, 2003 8:27 am
Location: Strasbourg / France

Re: ClearMemory()

Post by gnozal »

Joakim Christiansen wrote: ClearMemory(*MemoryBuffer)
Null every byte in the buffer
You could use RtlZeroMemory_(*Buffer, BufferLen.l) (Windows only).
RtlZeroMemory wrote: The RtlZeroMemory routine fills a block of memory with zeros, given a pointer to the block and the length, in bytes, to be filled.

VOID
RtlZeroMemory(
IN VOID UNALIGNED *Destination,
IN SIZE_T Length
);

Parameters
Destination
Pointer to the memory to be filled with zeros.
Length
Specifies the number of bytes to be zeroed.

Return Value

None
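
Since RtlZeroMemory is Windows only, a rough portable analogue of the call above is the standard C `memset`. The sketch below is only an illustration of the (pointer, length) shape; `clear_memory` is a hypothetical helper name, not a PureBasic or Windows function.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: zero `length` bytes starting at `buffer`.
   Same (pointer, length) argument shape as RtlZeroMemory, built on standard C. */
void clear_memory(void *buffer, size_t length)
{
    /* A NULL pointer is only acceptable for an empty range. */
    assert(buffer != NULL || length == 0);
    if (length > 0) {
        memset(buffer, 0, length);
    }
}
```

Unlike the requested ClearMemory(*MemoryBuffer), this form needs the length passed in, because a raw pointer carries no size information.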
For free libraries and tools, visit my web site (also home of jaPBe V3 and PureFORM).
Dare
Addict
Posts: 1965
Joined: Mon May 29, 2006 1:01 am
Location: Outback

Post by Dare »

That is very fast! Thanks gnozal.

I was doing this, which is quite fast:
FreeMemory(mem) : mem=AllocateMemory(size)
Faster than my custom-built asm memfiller anyway.
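
The free-and-reallocate trick only clears anything if a freshly allocated block comes back zero-filled, which is assumed here to be how AllocateMemory() behaves. A minimal C sketch of the same idea, with `calloc` standing in for AllocateMemory() and `reallocate_zeroed` as a hypothetical helper name:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper: discard the old block and hand back a new,
   zero-filled block of `size` bytes. Returns NULL if allocation fails. */
void *reallocate_zeroed(void *old_block, size_t size)
{
    assert(size > 0);
    free(old_block);          /* free(NULL) is a harmless no-op */
    return calloc(1, size);   /* calloc zero-initializes the new block */
}
```

The new block may live at a different address, so any pointer saved from the old block must be treated as invalid afterwards.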
Dare2 cut down to size
travismcgee
New User
Posts: 9
Joined: Mon May 29, 2006 2:16 am

Post by travismcgee »

You can even fill the memory with a value other than zero with RtlFillMemory_(*buffer, length, bytevalue)
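
For comparison, the standard C counterpart is `memset`, which takes its arguments in a different order (pointer, value, count). The sketch below wraps it in a hypothetical `fill_memory` helper that keeps the (buffer, length, byte value) order used in the call above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper with the argument order used by RtlFillMemory:
   destination, length in bytes, then the fill byte. */
void fill_memory(void *buffer, size_t length, unsigned char value)
{
    assert(buffer != NULL || length == 0);
    if (length > 0) {
        /* memset takes (pointer, value, count), so the last two swap. */
        memset(buffer, value, length);
    }
}
```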
Dare

Post by Dare »

lol. nearly 8 times faster than my best with rep movsb. :) Thanks travismcgee.
Trond
Always Here
Posts: 7446
Joined: Mon Sep 22, 2003 6:45 pm
Location: Norway

Post by Trond »

Dare wrote: lol. nearly 8 times faster than my best with rep movsb. :) Thanks travismcgee.
:shock:

I'm almost beating it already, and I've got a kickass optimization up my sleeve for tomorrow. :wink: Here's my current code:

Code:

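!jmp ZeroMemoryEnd      ; skip over the routine so it only runs when called
!ZeroMemory:
; eax = ptr
; ecx = counter (byte count, assumed to be a multiple of 4)
!add ecx, eax           ; ecx = ptr + count
!ZeroMemoryLoop:
!mov dword [ecx], 0     ; zero one dword at the current position
!sub ecx, 4             ; step back 4 bytes
!cmp ecx, eax
!jnz ZeroMemoryLoop     ; loop until ecx is back at ptr
!mov dword [ecx], 0     ; zero the first dword at ptr
!ret
!ZeroMemoryEnd: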


#Len = 1*1024*1024
Define Memory = AllocateMemory(#Len)

#Tries = 500
time = GetTickCount_()
For I = 0 To #Tries
  RtlZeroMemory_(Memory, #Len)
Next
MessageRequester("", Str(GetTickCount_()-time))
time = GetTickCount_()
For I = 0 To #Tries
  !mov eax, [v_Memory]
  !mov ecx, 1*1024*1024
  !call ZeroMemory
Next
MessageRequester("", Str(GetTickCount_()-time))
dioxin
User
Posts: 97
Joined: Thu May 11, 2006 9:53 pm

Post by dioxin »

For clearing (or filling or moving) large blocks of memory, the limiting factor is not CPU speed, it's memory bandwidth.
To make the fastest memory filler, you need to use as few memory accesses as possible, which means using wide registers, so you should consider using MMX or SSE.

You also need to write your code to take advantage of the hardware by keeping things in large, contiguous blocks. This doesn't apply to a memory fill, which is inherently done that way, but it does apply to a memory copy, where you should read a large block and then write a large block rather than read a word, then write a word.

Then, use the CPU instructions designed for the purpose. Shifting large blocks of data in slow RAM using fast CPUs has been a problem for years so the CPU designers improved the instruction set to help.
This can cause problems if the code is to be run on older processors which may not have the instructions, but any modern CPU will have the MOVNTQ and PREFETCH instructions which can significantly speed up storing and loading of data respectively.

Paul.
Dare

Post by Dare »

Trond wrote:
Dare wrote: lol. nearly 8 times faster than my best with rep movsb. :) Thanks travismcgee.
:shock:

I'm almost beating it already, and I've got a kickass optimization up my sleeve for tomorrow. :wink:
:D

Thanks for the code snip, Trond.

And yeah, my fill code sucked big time. So does my pattern fill which has two limitations: It only fills the destination in increments of the fill pattern size (no partial fill) and it takes even longer than my filler. :)
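
The "no partial fill" limitation can be avoided by truncating the last copy of the pattern. A minimal C sketch of that idea follows; `pattern_fill` and its parameters are hypothetical names, and this is not the pattern filler described above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: repeat `pattern` across `dest`, cutting the final
   copy short when `dest_len` is not a multiple of `pattern_len`. */
void pattern_fill(void *dest, size_t dest_len,
                  const void *pattern, size_t pattern_len)
{
    unsigned char *out = dest;
    size_t offset = 0;

    assert(pattern != NULL && pattern_len > 0);
    assert(dest != NULL || dest_len == 0);

    while (offset < dest_len) {
        size_t remaining = dest_len - offset;
        /* Copy a whole pattern, or only what still fits at the end. */
        size_t chunk = remaining < pattern_len ? remaining : pattern_len;
        memcpy(out + offset, pattern, chunk);
        offset += chunk;
    }
}
```

For example, filling 7 bytes with the 3-byte pattern "abc" leaves "abcabca" in the buffer.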


Hi dioxin,

Thanks for the info. (Now to try to understand it :))
dioxin

Post by dioxin »

Dare,
Dare wrote: (Now to try to understand it :))
Rather than try to explain it, just look at the following code.
I set the size of the data block to fill to 100,000,000 bytes to allow for reasonable timing.

Trond:

Code:

!mov ecx,100000000
!mov eax,StartAddress
ZeroMemory:
'; eax = ptr
'; ecx = counter
!add ecx, eax
ZeroMemoryLoop:
!mov dword [ecx], 0
!sub ecx, 4
!cmp ecx, eax
!jnz ZeroMemoryLoop
!mov dword [ecx], 0
This takes 102ms on my Athlon 2600+.
But try the following changes:
1) store from MMX register mm0 instead of writing a 32-bit immediate zero (eax still holds the address)
2) since mm0 is twice as big (64 bits) we double the loop increment from 4 to 8 bytes
3) use MOVNTQ instead of MOV or MOVQ to skip the cache which otherwise slows things down

Code:

zero=0  'a 64 bit variable, set to zero.

!mov eax,StartAddress
!mov ecx,100000000
!movq mm0,zero
lp2:
!MOVNTQ [eax+ecx],mm0
!sub ecx,8
!jns lp2
!emms    'since we used mmx we have to switch back to FPU or it'll cause confusion
...and it now runs in 38ms, almost 3 times faster.

In this case we're moving 100,000,000 bytes in 38ms = 2.63GB/sec which is near the maximum possible memory throughput on my PC.
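
The quoted figures can be checked with simple arithmetic. A small C sketch, with hypothetical helper names and GB taken as 10^9 bytes:

```c
#include <assert.h>

/* Bytes moved per second, given a byte count and an elapsed time in ms. */
double bytes_per_second(double bytes, double milliseconds)
{
    assert(milliseconds > 0.0);
    return bytes / (milliseconds / 1000.0);
}

/* How many times faster the new timing is than the old one. */
double speedup(double old_ms, double new_ms)
{
    assert(new_ms > 0.0);
    return old_ms / new_ms;
}
```

With the numbers from the post, bytes_per_second(100000000, 38) is roughly 2.63e9 bytes per second, and speedup(102, 38) is roughly 2.68, which matches "almost 3 times faster".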

Paul.
Joakim Christiansen

Post by Joakim Christiansen »

Now Fred only needs to copy that asm code to make the new command! :D
Trond

Post by Trond »

dioxin wrote: ...and it now runs in 38ms, almost 3 times faster.

In this case we're moving 100,000,000 bytes in 38ms = 2.63GB/sec which is near the maximum possible memory throughput on my PC.

Paul.
There's a small problem with our code: it will write outside the allocated memory if the size of the allocated memory % 4 (or % 8 in your case) is not zero. I think.
dioxin

Post by dioxin »

Trond,
the code as posted was not meant to be complete. The intention was to show how to speed up the main part of the loop on a modern CPU.
Complete code would need to check for non-aligned bytes at the start of the block, then fill the main block as shown, then check for extra bytes at the end of the block.
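
The three phases just described (unaligned head, main block, leftover tail) can be sketched in C. This is an illustration under stated assumptions, not the code from this thread: it uses the SSE2 intrinsic `_mm_stream_si128` (a 16-byte non-temporal store) in place of the 8-byte MMX MOVNTQ, falls back to plain `memset` when SSE2 is not available, and `zero_memory_streaming` is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h>
#define HAVE_SSE2_STREAM 1
#else
#define HAVE_SSE2_STREAM 0
#endif

/* Hypothetical helper: zero `length` bytes at `buffer` without writing
   outside the range, whatever its alignment or size. */
void zero_memory_streaming(void *buffer, size_t length)
{
    unsigned char *p = buffer;
    assert(buffer != NULL || length == 0);

#if HAVE_SSE2_STREAM
    /* Head: single-byte stores until p is 16-byte aligned (or we run out). */
    while (length > 0 && ((uintptr_t)p & 15u) != 0) {
        *p++ = 0;
        length--;
    }

    /* Body: one cache-bypassing 16-byte store per iteration. */
    const __m128i zero = _mm_setzero_si128();
    while (length >= 16) {
        _mm_stream_si128((__m128i *)p, zero);
        p += 16;
        length -= 16;
    }
    _mm_sfence();   /* order the streaming stores before later stores */
#endif

    /* Tail: the last 0..15 bytes (or everything when SSE2 is missing). */
    if (length > 0) {
        memset(p, 0, length);
    }
}
```

Because the intrinsics take no MMX state, no EMMS is needed here; whether this beats a plain `memset` depends on the buffer size and the machine, so it is worth timing both.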

Paul.
dioxin

Post by dioxin »

Trond,
Trond wrote: I'm almost beating it already, and I've got a kickass optimization up my sleeve for tomorrow.
It's tomorrow. Are we going to get to see what's up your sleeve?

Paul.
Trond

Post by Trond »

dioxin wrote: Trond,
I'm almost beating it already, and I've got a kickass optimization up my sleeve for tomorrow.
it's tomorrow. Are we going to get to see what's up your sleeve?

Paul.
It won't be close to beating yours since you're using mmx, but I'm working on it.

Edit: It seems to not make any difference at all. :oops:
dioxin

Post by dioxin »

Trond,
the likely reason you couldn't get it any faster is that it was probably running as fast as it could go anyway.
Improvements to the code will just leave the CPU idling longer between memory accesses while it waits for main RAM to catch up.

You need to use the MOVNTxxx instructions to go significantly faster as they access the memory in a different, more efficient, way when using large blocks of memory.

Paul.