
Extremal fast memory block copy

Posted: Sat Apr 30, 2011 8:30 pm
by 4RESTER
Uses SSE2.
Parts of the code are (c) Advanced Micro Devices, Inc. (NYSE: AMD).

Parts of the code are Copyright (C) 2009 Jan Boon (Kaetemi); see here: http://blog.kaetemi.be/post/2009/10/25/SSE2-memcpy

Code:

Procedure memcopy_sse2(*src, *dst, nBytes.L)
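; Copies nBytes from *src to *dst (the blocks must not overlap).
; Blocks under 512 bytes use a simple byte loop; larger blocks are copied
; 128 bytes at a time: the destination is 128-byte aligned first, the source
; is prefetched in 4 KiB chunks, and the data is written with non-temporal
; (movntdq) stores, using movdqa or movdqu loads depending on whether the
; source ends up 128-byte aligned as well.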
!local  l_slow
!local  l_fast
!local  l_first
!local  l_more
!local  l_aligned4k
!local  l_aligned4kinp
!local  l_aligned4kout
!local  l_alignedlast
!local  l_alignedlastinp
!local  l_alignedlastout
!local  l_unaligned4k
!local  l_unaligned4kinp
!local  l_unaligned4kout
!local  l_unalignedlast
!local  l_unalignedlastinp
!local  l_unalignedlastout
!local  l_last
!local  l_end
!                       MOV         ecx, [p.v_nBytes]
!                       MOV         edi, [p.p_dst]
!                       MOV         esi, [p.p_src]
!                       ADD         ecx, edi
!                       prefetchnta [esi+0]
!                       prefetchnta [esi+32]
!                       prefetchnta [esi+64]
!                       prefetchnta [esi+96]
!                       CMP         dword [p.v_nBytes], 512
!                       JGE         l_fast
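; slow path: fewer than 512 bytes, copy one byte at a time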
!l_slow:                MOV         bl, [esi]
!                       MOV         [edi], bl
!                       INC         edi
!                       INC         esi
!                       CMP         ecx, edi
!                       JNZ         l_slow
!                       JMP         l_end
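; fast path: ecx = destination end rounded down to a 128-byte boundary,
; ebx = the matching position in the source; if the destination start is
; not 128-byte aligned, fix that up first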
!l_fast:                AND         ecx, $FFFFFF80
!                       MOV         ebx, esi
!                       SUB         ebx, edi
!                       ADD         ebx, ecx
!                       MOV         eax, edi
!                       AND         edi, $FFFFFF80
!                       CMP         eax, edi
!                       JNE         l_first
!                       JMP         l_more
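; destination start is unaligned: copy 128 bytes with unaligned moves, then
; advance both pointers to the first 128-byte boundary of the destination
; (the overlapping bytes are simply copied again by the main loop)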
!l_first:               movdqu      xmm0, [esi+0]
!                       movdqu      xmm1, [esi+16]
!                       movdqu      xmm2, [esi+32]
!                       movdqu      xmm3, [esi+48]
!                       movdqu      xmm4, [esi+64]
!                       movdqu      xmm5, [esi+80]
!                       movdqu      xmm6, [esi+96]
!                       movdqu      xmm7, [esi+112]
!                       movdqu      [eax+0],   xmm0
!                       movdqu      [eax+16],  xmm1
!                       movdqu      [eax+32],  xmm2
!                       movdqu      [eax+48],  xmm3
!                       movdqu      [eax+64],  xmm4
!                       movdqu      [eax+80],  xmm5
!                       movdqu      [eax+96],  xmm6
!                       movdqu      [eax+112], xmm7
!                       ADD         edi, 128
!                       SUB         eax, edi
!                       SUB         esi, eax
!                       CMP         ecx, edi
!                       JNZ         l_more
!                       JMP         l_last
!l_more:                MOV         eax, esi
!                       AND         eax, $FFFFFF80
!                       CMP         eax, esi
!                       JNE         l_unaligned4k
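; aligned path: the source is also 128-byte aligned, so movdqa loads can be
; used; work through the data in 4 KiB chunks: prefetch a whole chunk, then
; stream it to the destination with non-temporal stores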
!l_aligned4k:           MOV         eax, esi
!                       ADD         eax, 4096
!                       CMP         eax, ebx
!                       JLE         l_aligned4kinp
!                       CMP         ecx, edi
!                       JNE         l_alignedlast
!                       JMP         l_last
!l_aligned4kinp:        prefetchnta [esi+0]
!                       prefetchnta [esi+32]
!                       prefetchnta [esi+64]
!                       prefetchnta [esi+96]
!                       ADD         esi, 128              
!                       CMP         eax, esi
!                       JNE         l_aligned4kinp
!                       SUB         esi, 4096
!l_aligned4kout:        movdqa      xmm0, [esi+0]
!                       movdqa      xmm1, [esi+16]
!                       movdqa      xmm2, [esi+32]
!                       movdqa      xmm3, [esi+48]
!                       movdqa      xmm4, [esi+64]
!                       movdqa      xmm5, [esi+80]
!                       movdqa      xmm6, [esi+96]
!                       movdqa      xmm7, [esi+112]
!                       movntdq     [edi+0],   xmm0
!                       movntdq     [edi+16],  xmm1
!                       movntdq     [edi+32],  xmm2
!                       movntdq     [edi+48],  xmm3
!                       movntdq     [edi+64],  xmm4
!                       movntdq     [edi+80],  xmm5
!                       movntdq     [edi+96],  xmm6
!                       movntdq     [edi+112], xmm7
!                       ADD         esi, 128
!                       ADD         edi, 128
!                       CMP         eax, esi
!                       JNE         l_aligned4kout
!                       JMP         l_aligned4k
!l_alignedlast:         MOV         eax, esi
!l_alignedlastinp:      prefetchnta [esi+0]
!                       prefetchnta [esi+32]
!                       prefetchnta [esi+64]
!                       prefetchnta [esi+96]
!                       ADD         esi, 128
!                       CMP         ebx, esi
!                       JNE         l_alignedlastinp
!                       MOV         esi, eax
!l_alignedlastout:      movdqa      xmm0, [esi+0]
!                       movdqa      xmm1, [esi+16]
!                       movdqa      xmm2, [esi+32]
!                       movdqa      xmm3, [esi+48]
!                       movdqa      xmm4, [esi+64]
!                       movdqa      xmm5, [esi+80]
!                       movdqa      xmm6, [esi+96]
!                       movdqa      xmm7, [esi+112]
!                       movntdq     [edi+0],  xmm0
!                       movntdq     [edi+16], xmm1
!                       movntdq     [edi+32], xmm2
!                       movntdq     [edi+48], xmm3
!                       movntdq     [edi+64], xmm4
!                       movntdq     [edi+80], xmm5
!                       movntdq     [edi+96], xmm6
!                       movntdq     [edi+112], xmm7
!                       ADD         esi, 128
!                       ADD         edi, 128
!                       CMP         ecx, edi
!                       JNE         l_alignedlastout
!                       JMP         l_last
!l_unaligned4k:         MOV         eax, esi
!                       ADD         eax, 4096
!                       CMP         eax, ebx
!                       JLE         l_unaligned4kinp
!                       CMP         ecx, edi
!                       JNE         l_unalignedlast
!                       JMP         l_last
!l_unaligned4kinp:      prefetchnta [esi+0]
!                       prefetchnta [esi+32]
!                       prefetchnta [esi+64]
!                       prefetchnta [esi+96]
!                       ADD         esi, 128
!                       CMP         eax, esi
!                       JNE         l_unaligned4kinp
!                       SUB         esi, 4096
!l_unaligned4kout:      movdqu      xmm0, [esi+0]
!                       movdqu      xmm1, [esi+16]
!                       movdqu      xmm2, [esi+32]
!                       movdqu      xmm3, [esi+48]
!                       movdqu      xmm4, [esi+64]
!                       movdqu      xmm5, [esi+80]
!                       movdqu      xmm6, [esi+96]
!                       movdqu      xmm7, [esi+112]
!                       movntdq     [edi+0],   xmm0
!                       movntdq     [edi+16],  xmm1
!                       movntdq     [edi+32],  xmm2
!                       movntdq     [edi+48],  xmm3
!                       movntdq     [edi+64],  xmm4
!                       movntdq     [edi+80],  xmm5
!                       movntdq     [edi+96],  xmm6
!                       movntdq     [edi+112], xmm7
!                       ADD         esi, 128
!                       ADD         edi, 128
!                       CMP         eax, esi
!                       JNE         l_unaligned4kout
!                       JMP         l_unaligned4k
!l_unalignedlast:       MOV         eax, esi
!l_unalignedlastinp:    prefetchnta [esi+0]
!                       prefetchnta [esi+32]
!                       prefetchnta [esi+64]
!                       prefetchnta [esi+96]
!                       ADD         esi, 128
!                       CMP         ebx, esi
!                       JNE         l_unalignedlastinp
!                       MOV         esi, eax
!l_unalignedlastout:    movdqu      xmm0, [esi+0]
!                       movdqu      xmm1, [esi+16]
!                       movdqu      xmm2, [esi+32]
!                       movdqu      xmm3, [esi+48]
!                       movdqu      xmm4, [esi+64]
!                       movdqu      xmm5, [esi+80]
!                       movdqu      xmm6, [esi+96]
!                       movdqu      xmm7, [esi+112]
!                       movntdq     [edi+0],   xmm0
!                       movntdq     [edi+16],  xmm1
!                       movntdq     [edi+32],  xmm2
!                       movntdq     [edi+48],  xmm3
!                       movntdq     [edi+64],  xmm4
!                       movntdq     [edi+80],  xmm5
!                       movntdq     [edi+96],  xmm6
!                       movntdq     [edi+112], xmm7
!                       ADD         esi, 128
!                       ADD         edi, 128
!                       CMP         ecx, edi
!                       JNE         l_unalignedlastout
!                       JMP         l_last
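; tail: recopy the final 128 bytes, positioned relative to the exact end of
; the buffer, with unaligned moves (overlap with data already copied by the
; main loop is harmless)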
!l_last:                MOV         ecx, [p.v_nBytes]
!                       MOV         edi, [p.p_dst]
!                       MOV         esi, [p.p_src]
!                       ADD         edi, ecx
!                       ADD         esi, ecx
!                       SUB         edi, 128
!                       SUB         esi, 128
!                       movdqu      xmm0, [esi+0]
!                       movdqu      xmm1, [esi+16]
!                       movdqu      xmm2, [esi+32]
!                       movdqu      xmm3, [esi+48]
!                       movdqu      xmm4, [esi+64]
!                       movdqu      xmm5, [esi+80]
!                       movdqu      xmm6, [esi+96]
!                       movdqu      xmm7, [esi+112]
!                       movdqu      [edi+0],   xmm0
!                       movdqu      [edi+16],  xmm1
!                       movdqu      [edi+32],  xmm2
!                       movdqu      [edi+48],  xmm3
!                       movdqu      [edi+64],  xmm4
!                       movdqu      [edi+80],  xmm5
!                       movdqu      [edi+96],  xmm6
!                       movdqu      [edi+112], xmm7
!l_end:
EndProcedure
How to use (same as CopyMemory):

Code:

Global  *mem1 = AllocateMemory($10000)
RandomData(*mem1, $10000)
Global  *mem2 = AllocateMemory($1)


For testsize=1 To $10000
    *mem2 = ReAllocateMemory(*mem2, testsize)
    memcopy_sse2(*mem1, *mem2, testsize)
    ; CopyMemory(*mem1, *mem2, testsize)
    If CompareMemory(*mem1, *mem2, testsize) = 0
        Debug "Error with the testsize = " + Hex(testsize)
    EndIf    
Next

Re: Extremal fast memory block copy

Posted: Sun May 01, 2011 9:37 am
by remi_meier
Quote from http://blog.kaetemi.be/post/2009/10/25/SSE2-memcpy:
"Code is available below, ask before using."
It is nice to have fast code available, though.

Re: Extremal fast memory block copy

Posted: Sun May 01, 2011 4:33 pm
by 4RESTER
remi_meier wrote: Quote from http://blog.kaetemi.be/post/2009/10/25/SSE2-memcpy: "Code is available below, ask before using." It is nice to have fast code available, though.
As testing demonstrates, this procedure is faster only on Core 2 and newer; otherwise it is slower. Well, let's try to make something similar using only MMX; that would be more useful.

Code:

;=================================================================================
Procedure.Q   Get_TSC()
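; reads the CPU time-stamp counter into a 64-bit quad
; (RDTSC leaves the low dword in EAX, the high dword in EDX)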
Protected   TSC.Q
!    RDTSC
!    MOV     dword [p.v_TSC+0],EAX
!    MOV     dword [p.v_TSC+4],EDX
ProcedureReturn TSC
EndProcedure
;=================================================================================

Global  MAX_BLOCK = $10000
Global  *mem1 = AllocateMemory(MAX_BLOCK)
RandomData(*mem1, MAX_BLOCK)
Global  *mem2 = AllocateMemory($1)
Global  TSC_temp1.Q
Global  TSC_temp2.Q
Dim TSC.Q(MAX_BLOCK+1)
Global  TSC_result1.Q
Global  TSC_result2.Q

OpenConsole()
;=================================================================================
PrintN("Start FastCopy")
;testing with FastCopy
For testsize=1 To MAX_BLOCK
    *mem2 = ReAllocateMemory(*mem2, testsize)
    TSC_temp1 = Get_TSC()
    memcopy_sse2(*mem1, *mem2, testsize)
    TSC_temp2 = Get_TSC()
    TSC.Q(testsize) = TSC_temp2 - TSC_temp1
Next

;compute the average execution time (in TSC cycles)
TSC_result1 = 0
For i=1 To MAX_BLOCK
    TSC_result1 = TSC_result1 + TSC.Q(i)
Next
TSC_result1 = TSC_result1 / MAX_BLOCK
;=================================================================================


PrintN("Start StandardCopy")
;testing with standard copy
For testsize=1 To MAX_BLOCK
    *mem2 = ReAllocateMemory(*mem2, testsize)
    TSC_temp1 = Get_TSC()
    CopyMemory(*mem1, *mem2, testsize)
    TSC_temp2 = Get_TSC()
    TSC.Q(testsize) = TSC_temp2 - TSC_temp1    
Next

;compute the average execution time (in TSC cycles)
TSC_result2 = 0
For i=1 To MAX_BLOCK
    TSC_result2 = TSC_result2 + TSC.Q(i)
Next
TSC_result2 = TSC_result2 / MAX_BLOCK
;=================================================================================
PrintN("    FastCopy = " +Hex(TSC_result1))
PrintN("StandardCopy = " +Hex(TSC_result2))

Delay(10000)

Re: Extremal fast memory block copy

Posted: Sun May 01, 2011 11:30 pm
by 4RESTER
This routine gives a much better result, using MMX and DWORD alignment.
It is faster than the standard PureBasic CopyMemory (which uses MSVCRT memcpy) when the memory block size is >= 128 KiB:

Code:

Procedure   CopyMemoryMMX(*src, *dst, nBytes.L)
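; Strategy (after AMD's memcpy_amd): blocks under 64 bytes use string moves;
; up to ~64 KiB the data is copied with in-cache MMX moves; above that the
; stores become non-temporal (MOVNTQ); and above ~197 KiB each 8 KiB chunk of
; the source is block-prefetched by touch-reads before being streamed out.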
!local l_memcpy_do_align    
!local l_memcpy_align_done
!local l_memcpy_ic_1
!local l_memcpy_ic_2
!local l_memcpy_ic_3
!local l_memcpy_uc_test
!local l_memcpy_64_test
!local l_memcpy_uc_1
!local l_memcpy_bp_1
!local l_memcpy_bp_2
!local l_memcpy_bp_3
!local l_memcpy_last_few
!local l_memcpy_last_few_nxt
!local l_memcpy_final
!align 4
!	                    MOV         EDI, [p.p_dst]
!	                    MOV         ESI, [p.p_src]
!	                    MOV         ECX, [p.v_nBytes]
!	                    MOV		    EBX, ECX
!                       CLD
!               	    CMP         ECX, 64
!                       JB          l_memcpy_ic_3

!                       CMP		    ECX, 32*1024
!	                    JBE         l_memcpy_do_align

!                       CMP         ECX, 64*1024
!                       JBE         l_memcpy_align_done

!l_memcpy_do_align:     MOV		    ECX, 8
!                       SUB         ECX, EDI
!                       AND         ECX, 111b
!                       SUB		    EBX, ECX
!	                    NEG		    ECX
!	                    ADD		    ECX, l_memcpy_align_done
!	                    JMP		    ECX
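; the computed jump above lands partway into this MOVSB ladder, executing
; just enough byte moves (0-7) to align the destination to 8 bytes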
!align 4
!	                    MOVSB
!	                    MOVSB
!	                    MOVSB
!	                    MOVSB
!	                    MOVSB
!	                    MOVSB
!	                    MOVSB
!	                    MOVSB

!l_memcpy_align_done:	MOV		    ECX, EBX
!                       SHR         ECX, 6
!                       JZ          l_memcpy_ic_2
!	                    CMP		    ECX, 1024
!	                    JAE		    l_memcpy_uc_test
;!align 16   ; what is wanted here; PureBasic cannot emit Align 16
!align 8 ; workaround for the Align 16 limitation: align to 8...
!DQ $9090909090909090 ; ...and pad with eight NOP bytes ($90)
!l_memcpy_ic_1:         prefetchnta [esi + (200*64/34+192)]
!	                    MOVQ	    mm0, [esi+0]
!	                    MOVQ	    mm1, [esi+8]
!	                    MOVQ	    [edi+0], mm0
!	                    MOVQ	    [edi+8], mm1
!	                    MOVQ	    mm2, [esi+16]
!	                    MOVQ	    mm3, [esi+24]
!	                    MOVQ	    [edi+16], mm2
!	                    MOVQ	    [edi+24], mm3
!	                    MOVQ	    mm0, [esi+32]
!	                    MOVQ	    mm1, [esi+40]
!	                    MOVQ	    [edi+32], mm0
!	                    MOVQ	    [edi+40], mm1
!	                    MOVQ	    mm2, [esi+48]
!	                    MOVQ	    mm3, [esi+56]
!	                    MOVQ	    [edi+48], mm2
!	                    MOVQ	    [edi+56], mm3
!	                    ADD		    esi, 64
!	                    ADD		    edi, 64
!	                    DEC		    ecx
!	                    JNZ		    l_memcpy_ic_1
!l_memcpy_ic_2:         MOV		    ecx, ebx
!l_memcpy_ic_3:         SHR		    ecx, 2
!	                    AND		    ecx, 1111b
!	                    NEG		    ecx
!	                    ADD		    ecx, l_memcpy_last_few
!	                    JMP		    ecx
!l_memcpy_uc_test:      CMP		    ecx, (197*1024)/64
!	                    JAE		    l_memcpy_bp_1
!l_memcpy_64_test:      OR		    ecx, ecx
!	                    JZ		    l_memcpy_ic_2
;!align 16   ; see the Align 16 workaround note above
!align 8
!DQ $9090909090909090 ; NOP padding
!l_memcpy_uc_1:			prefetchnta [esi + (200*64/34+192)]
!	                    MOVQ	    mm0,[esi+0]
!	                    ADD		    edi,64
!	                    MOVQ	    mm1,[esi+8]
!	                    ADD		    esi,64
!	                    MOVQ	    mm2,[esi-48]
!	                    MOVNTQ	    [edi-64], mm0
!	                    MOVQ	    mm0,[esi-40]
!	                    MOVNTQ	    [edi-56], mm1
!	                    MOVQ	    mm1,[esi-32]
!	                    MOVNTQ	    [edi-48], mm2
!	                    MOVQ	    mm2,[esi-24]
!	                    MOVNTQ	    [edi-40], mm0
!	                    MOVQ	    mm0,[esi-16]
!	                    MOVNTQ	    [edi-32], mm1
!	                    MOVQ	    mm1,[esi-8]
!	                    MOVNTQ	    [edi-24], mm2
!	                    MOVNTQ	    [edi-16], mm0
!	                    DEC		    ecx
!	                    MOVNTQ	    [edi-8], mm1
!	                    JNZ		    l_memcpy_uc_1
!	                    JMP		    l_memcpy_ic_2
!l_memcpy_bp_1:         CMP		    ecx, $0080
!	                    JL		    l_memcpy_64_test
!	                    MOV		    eax, $0080 / 2
!	                    ADD		    esi, $0080 * 64
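; touch-read the next 8 KiB of the source backwards to pull it into the
; cache before streaming it out below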
;!align 16   ; see the Align 16 workaround note above
!align 8
!DQ $9090909090909090 ; NOP padding
!l_memcpy_bp_2:         MOV		    edx, [esi-64]
!	                    MOV		    edx, [esi-128]
!	                    SUB		    esi, 128
!	                    DEC		    eax
!	                    JNZ		    l_memcpy_bp_2
!	                    MOV		    eax, $0080
;!align 16   ; see the Align 16 workaround note above
!align 8
!DQ $9090909090909090 ; NOP padding
!l_memcpy_bp_3:         MOVQ	    mm0, [esi   ]
!	                    MOVQ	    mm1, [esi+ 8]
!	                    MOVQ	    mm2, [esi+16]
!	                    MOVQ	    mm3, [esi+24]
!	                    MOVQ	    mm4, [esi+32]
!	                    MOVQ	    mm5, [esi+40]
!	                    MOVQ	    mm6, [esi+48]
!	                    MOVQ	    mm7, [esi+56]
!	                    ADD		    esi, 64
!	                    MOVNTQ	    [edi   ], mm0
!	                    MOVNTQ	    [edi+ 8], mm1
!	                    MOVNTQ	    [edi+16], mm2
!	                    MOVNTQ	    [edi+24], mm3
!	                    MOVNTQ	    [edi+32], mm4
!	                    MOVNTQ	    [edi+40], mm5
!	                    MOVNTQ	    [edi+48], mm6
!	                    MOVNTQ	    [edi+56], mm7
!	                    ADD		    edi, 64
!	                    DEC		    eax
!	                    JNZ		    l_memcpy_bp_3
!	                    SUB		    ecx, $0080
!	                    JMP		    l_memcpy_bp_1
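; the computed jump from l_memcpy_ic_3 lands partway into this MOVSD ladder,
; copying the remaining 0-15 dwords (each MOVSD is a one-byte instruction)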
!align 4
!	                    MOVSD
!	                    MOVSD
!	                    MOVSD
!	                    MOVSD
!	                    MOVSD
!	                    MOVSD
!	                    MOVSD
!	                    MOVSD
!	                    MOVSD
!	                    MOVSD
!	                    MOVSD
!	                    MOVSD
!	                    MOVSD
!	                    MOVSD
!	                    MOVSD
!	                    MOVSD
!l_memcpy_last_few:     MOV		    ecx, ebx
!	                    AND		    ecx, 11b
!	                    JZ		    l_memcpy_final

!l_memcpy_last_few_nxt: REP         MOVSB
!l_memcpy_final:        EMMS
!	                    SFENCE
EndProcedure
;=================================================================================
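How to use: a minimal dispatch sketch. The wrapper name is hypothetical, and the 128 KiB threshold is just the crossover measured above, not a universal constant:

Code:

Procedure CopyMemoryFast(*src, *dst, nBytes.L)
    ; hypothetical wrapper: use the streaming MMX copy only where it
    ; measured faster; fall back to the built-in CopyMemory otherwise
    If nBytes >= 128 * 1024          ; >= 128 KiB
        CopyMemoryMMX(*src, *dst, nBytes)
    Else
        CopyMemory(*src, *dst, nBytes)
    EndIf
EndProcedure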

Re: Extremal fast memory block copy

Posted: Mon May 02, 2011 12:34 pm
by remi_meier
So now it's from here:
http://www.cs.virginia.edu/stream/FTP/C ... py_amd.asm
; Copyright (c) 2001 Advanced Micro Devices, Inc.

Let me ask you a question: why do you copy code,
remove comments (even copyright statements!), and
post it without any reference to its source?

@Mods: I do not think he asked for permission to copy
the code in the first post. Please remove it until he
makes clear that you have to ask the original author
before you can use it.

Re: Extremal fast memory block copy

Posted: Mon May 02, 2011 1:02 pm
by 4RESTER
remi_meier wrote: So now it's from here:
http://www.cs.virginia.edu/stream/FTP/C ... py_amd.asm
; Copyright (c) 2001 Advanced Micro Devices, Inc.

Let me ask you a question: why do you copy code,
remove comments (even copyright statements!), and
post it without any reference to its source?

@Mods: I do not think he asked for permission to copy
the code in the first post. Please remove it until he
makes clear that you have to ask the original author
before you can use it.
Because this routine is not so hard to understand that it needs tons of comments :-)

And about this supposedly mythical copyright: this source is a collage of canonical fragments, so should I provide links to the 150,000 pieces by their original authors :-)?

But if this is the only thing you are missing in this life, I can make you happy:

(c) Advanced Micro Devices, Inc. (NYSE: AMD)

I hope this will help in developing an efficient algorithm for copying memory blocks.

I think it is worth discussing how to improve PureBasic's capabilities and proposing solutions. Empty criticism carries no payload.

Re: Extremal fast memory block copy

Posted: Mon May 02, 2011 2:13 pm
by remi_meier
4RESTER wrote: Because this routine is not so hard to understand that it needs tons of comments :-)

And about this supposedly mythical copyright: this source is a collage of canonical fragments, so should I provide links to the 150,000 pieces by their original authors :-)?
Sorry, I have difficulty understanding you. Are you saying
that you took fragments from several authors? Both codes
look like 1:1 conversions to me. I understand that you do
not _need_ the comments, but you removed them. And I am
not talking about comments on what the code does, but
about copyright notices. I have nothing against you
posting the second code (although crediting the source is
simply how things are done). But I have everything against
you posting a direct conversion of code whose author
explicitly states that you must not use it without asking
for approval. The same goes for everyone who uses the code
you posted without knowing that the original author does
not want it used without approval.
If the code is that easy to understand, why not write it
yourself? Then I will salute you.
4RESTER wrote: I hope this will help in developing an efficient algorithm for copying memory blocks.

I think it is worth discussing how to improve PureBasic's capabilities and proposing solutions. Empty criticism carries no payload.
We have had threads like this already, and I think it is
a good way to help. It's just the way you are doing it that isn't right.



Edit: Btw, here are two other threads that you might
learn from:
CopyMemory: http://www.purebasic.fr/english/viewtop ... emory+fast
El_Choni posted about the same code from AMD that you
posted. WITH copyright notice ;)
CompareMemory: http://www.purebasic.fr/english/viewtop ... emory+fast

Re: Extremal fast memory block copy

Posted: Mon May 02, 2011 2:26 pm
by Rings
I edited the first code to reflect the original.

CopyMemory is always an interesting part of coding,
so I am leaving this topic as it is and have locked it temporarily.