Faster Mid() with asm

Xombie · Post by **Xombie** » Thu Oct 19, 2006 8:26 pm

One question that's been bothering me: can't subsystems be used for this? Could we possibly make an ASM subsystem with different replacement functions like your Mid() routine and then link that in? Would that then force PB to use the function internally instead of its own Mid() routine? I've been wondering if that's how subsystems work.

Shannara · Post by **Shannara** » Thu Oct 19, 2006 9:15 pm

Now that would be cool. Just drop it into a subsystem, reference the subsystem in the project options, and you're good to go.

technicorn · Post by **technicorn** » Thu Oct 19, 2006 9:24 pm

I don't think that is how subsystems work; to be honest,
I don't know how subsystems work at all :roll:

But you can use a trick to replace any call to PB's Mid() (or anything else)
by intercepting the assembly stage of PB: replace FAsm.exe
with a fake one that scans the assembler listing produced by the
PB compiler and then replaces the call destination.
The replacement function must have the same parameter handling and
the same behavior as the original one, of course.

Trond · Post by **Trond** » Fri Oct 20, 2006 3:54 pm

Code: Select all

Macro Mid2(String, StartPos, Length)
  MidAscii(@String, StartPos, Length)
EndMacro


Procedure.s MidAscii(String, StartPos, Length)
  
  !MOV    eax, [p.v_String]
  !ADD    eax, [p.v_StartPos]
  !DEC    eax
  !MOV    edx, eax
  !ADD    eax, [p.v_Length]
  !MOV    cl, byte [eax]
  !PUSH   ecx
  !MOV    byte [eax], 0
  !MOV    ebx, eax
  !PUSH   dword [_PB_StringBasePosition]
  !CALL  _SYS_CopyString@0
  !POP    eax
  !ADD    eax, [PB_StringBase]
  !POP    ecx
  !MOV    byte [ebx], cl
  
  ProcedureReturn 
  ProcedureReturn ""
EndProcedure

VeryLongString.s = Space(250)

#Tries = 10000000

time = GetTickCount_()
For I = 0 To #Tries
  Mid2(VeryLongString, 200, 4)
Next
MessageRequester("", Str(GetTickCount_()-time))


time = GetTickCount_()
For I = 0 To #Tries
  Mid(VeryLongString, 200, 4)
Next
MessageRequester("", Str(GetTickCount_()-time))

technicorn · Post by **technicorn** » Fri Oct 20, 2006 6:53 pm

@Trond:

Where is the check that start is not greater than the length of the string,
or that length is not longer than the rest after start?
Or that start and/or length are not < 0?

yrret · Post by **yrret** » Sat Oct 21, 2006 7:33 pm

Hey Trond
In your code above in reference to the line:
!PUSH dword [_PB_StringBasePosition]

When I try to compile and run it, I always get

PureBasic.asm[389]:
MPO
PureBasic.asm [124] MPO [21]
PUSH dword [_PB_StringBasePosition]
error: undefined symbol.

Anybody have the same problem or an idea of what's wrong?
In Compiler Options, I do have Enable inline ASM support checked.

EDIT:
After playing around some more with settings. I also had Create
threadsafe executable checked. When I tried to compile and run
with that unchecked, it compiled and ran. So there must be some
conflict during compiling when you use the Create threadsafe
executable option, or there is a bug?

yrret · Post by **yrret** » Sun Oct 22, 2006 10:52 pm

WOW, it is impressive.
Using Technicorn's code in a .pbi file:

Code: Select all

! macro EndMacro {}

; Assembler macro to align code on a given boundary,
; aligning loop starts on 8 byte boundaries can speed up the code
; by about 1.5 to 2 times!

! macro calignjmp value
! {
!   local dest
!
!   if ((value - 1) - ((($ - $$) + value - 1) mod value)) > 3
!     makeDest equ 1
!     jmp dest
!   else
!     makeDest equ 0
!   end if
!
!   rept value
!   \{
!     if ($ - $$) mod value
!       nop
!     end if
!   \}
!   dest:
! } EndMacro

; Uncomment the next 3 lines, if you want to replace
; Mid with FMid2 in all places:
 Macro Mid(string, start, length)
   FMid2(@string,start,length)
 EndMacro

CompilerIf #PB_Compiler_Unicode
Procedure.s FMid2(*srcPtr.Byte, start.l, length.l)
  !MOV    edx,[p.p_srcPtr]
  !MOV    ecx,[p.v_length]
  !TEST   edx,edx
  !JZ     l_fmid2_empty
  !TEST   ecx,ecx
  !JLE    l_fmid2_empty
  !MOV    ecx,[p.v_start]
  !CMP    ecx,1
  !JLE    l_fmid2_return
  !XOR    eax,eax
  !DEC    ecx
  !calignjmp 4
  fmid2_scan1:
  !SUB    ecx,8
  !JL     l_fmid2_scanend1
  !CMP    [edx],ax
  !JE     l_fmid2_scanend2
  !ADD    edx,2
  !CMP    [edx],ax
  !JE     l_fmid2_scanend2
  !ADD    edx,2
  !CMP    [edx],ax
  !JE     l_fmid2_scanend2
  !ADD    edx,2
  !CMP    [edx],ax
  !JE     l_fmid2_scanend2
  !ADD    edx,2
  !CMP    [edx],ax
  !JE     l_fmid2_scanend2
  !ADD    edx,2
  !CMP    [edx],ax
  !JE     l_fmid2_scanend2
  !ADD    edx,2
  !CMP    [edx],ax
  !JE     l_fmid2_scanend2
  !ADD    edx,2
  !CMP    [edx],ax
  !JE     l_fmid2_scanend2
  !ADD    edx,2
  !JMP    l_fmid2_scan1

  fmid2_scanend1:
  !ADD    ecx,8
  !JZ     l_fmid2_scanend2
  fmid2_scan2:
  !CMP    [edx],ax
  !JE     l_fmid2_scanend2
  !ADD    edx,2
  !DEC    ecx
  !JNZ    l_fmid2_scan2
  fmid2_scanend2:
  !MOV   [p.p_srcPtr],edx
  fmid2_return:
  ProcedureReturn PeekS(*srcPtr, length)
  fmid2_empty:
  ProcedureReturn ""
EndProcedure
CompilerElse
Procedure.s FMid2(*srcPtr.Byte, start.l, length.l)
  !MOV    edx,[p.p_srcPtr]
  !MOV    ecx,[p.v_length]
  !TEST   edx,edx
  !JZ     l_fmid2_empty
  !TEST   ecx,ecx
  !JLE    l_fmid2_empty
  !MOV    ecx,[p.v_start]
  !CMP    ecx,1
  !JLE    l_fmid2_return
  !XOR    eax,eax
  !DEC    ecx
  !calignjmp 4
  fmid2_scan1:
  !SUB    ecx,8
  !JL     l_fmid2_scanend1
  !CMP    [edx],al
  !JE     l_fmid2_scanend2
  !INC    edx
  !CMP    [edx],al
  !JE     l_fmid2_scanend2
  !INC    edx
  !CMP    [edx],al
  !JE     l_fmid2_scanend2
  !INC    edx
  !CMP    [edx],al
  !JE     l_fmid2_scanend2
  !INC    edx
  !CMP    [edx],al
  !JE     l_fmid2_scanend2
  !INC    edx
  !CMP    [edx],al
  !JE     l_fmid2_scanend2
  !INC    edx
  !CMP    [edx],al
  !JE     l_fmid2_scanend2
  !INC    edx
  !CMP    [edx],al
  !JE     l_fmid2_scanend2
  !INC    edx
  !JMP    l_fmid2_scan1

  fmid2_scanend1:
  !ADD    ecx,8
  !JZ     l_fmid2_scanend2
  fmid2_scan2:
  !CMP    [edx],al
  !JE     l_fmid2_scanend2
  !INC    edx
  !DEC    ecx
  !JNZ    l_fmid2_scan2
  fmid2_scanend2:
  !MOV   [p.p_srcPtr],edx
  fmid2_return:
  ProcedureReturn PeekS(*srcPtr, length)
  fmid2_empty:
  ProcedureReturn ""
EndProcedure
CompilerEndIf

I then wrote a console program that runs Mid() and FMid2() on two strings of
different sizes; it showed FMid2 consistently 10 times faster.

Code: Select all

EnableExplicit
;IncludePath #PBIncludePath
XIncludeFile "FMid_DL.pbi"

OpenConsole()

Global tStart.l, tStart2.l, tDiff.d
Global str1.s, str1Mid.s
Global str2.s, str2Mid.s
Global i.l, iEnd.l, strCh.l

str1="How To back up, edit, And Restore the registry in Windows XP And Windows Server 2003.htm"


str2="How To back up, edit, And Restore the registry in Windows XP And Windows Server 2003 and How To back up, edit, And Restore the registry in Windows XP And Windows Server 2003.htm"


iEnd = 10000000

tStart2 = ElapsedMilliseconds(): Repeat: tStart = ElapsedMilliseconds(): Until tStart <> tStart2
For i = 1 To iEnd
  str1Mid = Mid(str1, 20, 10)
Next i
tDiff = (ElapsedMilliseconds() - tStart) / 1000.0
If tDiff >= 1: iEnd / tDiff: EndIf
PrintN(StrD(tDiff, 3) + " sec., Ops per sec.: " + StrD(iEnd / tDiff, 3) + "    " + str1Mid)
PrintN("")

iEnd = 1000000
tStart2 = ElapsedMilliseconds(): Repeat: tStart = ElapsedMilliseconds(): Until tStart <> tStart2
For i = 1 To iEnd
  str1Mid = FMid2(@str1, 20, 10)  ;note: need to add @ to str1 string
Next i
tDiff = (ElapsedMilliseconds() - tStart) / 1000.0
If tDiff >= 1: iEnd / tDiff: EndIf
PrintN(StrD(tDiff, 3) + " sec., Ops per sec.: " + StrD(iEnd / tDiff, 3) + "    " + str1Mid)

PrintN("")

;=========================================================================================

iEnd = 10000000

tStart2 = ElapsedMilliseconds(): Repeat: tStart = ElapsedMilliseconds(): Until tStart <> tStart2
For i = 1 To iEnd
  str2Mid = Mid(str2, 80, 30)
Next i
tDiff = (ElapsedMilliseconds() - tStart) / 1000.0
If tDiff >= 1: iEnd / tDiff: EndIf
PrintN(StrD(tDiff, 3) + " sec., Ops per sec.: " + StrD(iEnd / tDiff, 3) + "    " + str2Mid)
PrintN("")

iEnd = 1000000
tStart2 = ElapsedMilliseconds(): Repeat: tStart = ElapsedMilliseconds(): Until tStart <> tStart2
For i = 1 To iEnd
  str2Mid = FMid2(@str2, 80, 30)  ;note: need to add @ to str2 string
Next i
tDiff = (ElapsedMilliseconds() - tStart) / 1000.0
If tDiff >= 1: iEnd / tDiff: EndIf
PrintN(StrD(tDiff, 3) + " sec., Ops per sec.: " + StrD(iEnd / tDiff, 3) + "    " + str2Mid)

PrintN("")

;=========================================================================================

PrintN(#CRLF$ + "Done")
While Inkey() = "": Delay(100): Wend

My results were:
11.812 sec., Ops per sec.: 71672.621 t, And Res

1.172 sec., Ops per sec.: 728022.184 t, And Res

13.594 sec., Ops per sec.: 54113.506 2003 and How To back up, edit

1.343 sec., Ops per sec.: 554431.869 2003 and How To back up, edit

Done

technicorn · Post by **technicorn** » Mon Oct 23, 2006 5:28 am

Hi yrret,
thanks for the test result.

Could you tell me what CPU you had this running on?
I only get a speedup of about 3-5 times on an AMD Athlon XP 1600+.
And you should try the code with the lines
!calignjmp 4
changed to
!calignjmp 8

sorry, I left the 4 in from a test.

Greetings,
technicorn

yrret · Post by **yrret** » Mon Oct 23, 2006 12:21 pm

Hi Technicorn,

I am running on an AMD Athlon XP 1600+ CPU, with 512k memory. I use an MSI KT266A motherboard with a
Radeon 7500 video card. I changed both occurrences of !calignjmp 4 to !calignjmp 8, and ran the program several times.
The speed and output always averaged 10 times faster. Did you get your results running my test code?
As it appears we have the same CPU, I don't understand the speed difference either. I will say though that it sure is a great
improvement over Mid's speed, even if you're only getting 3 to 5 times faster, and it's much appreciated.

Thanks,
yrret

technicorn · Post by **technicorn** » Mon Oct 23, 2006 3:08 pm

Hello Yrret,

It took me a while to find the errors.

1.
This was my fault in the original test program: the line

Code: Select all

If tDiff >= 1: iEnd / tDiff: EndIf

should come after the PrintN().
It was meant to keep the loop time under two seconds,
but placed before the print it gives a wrong result, because it
changes iEnd before the operations per second are calculated.

2.
You use 10 million iterations in the test of Mid() and 1 million in the FMid2() test.
Besides that, I posted the code with the Mid() macro already uncommented,
so both tests used FMid2().

With this corrected, the test gives a result of FMid2() being about 1.4 times
faster for the first test,
and 1.7 times faster for the second test.
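
For reference, a corrected measurement loop might look like the sketch below. It assumes the Mid() macro in the include file is commented out again, so that Mid() really calls the built-in function. It uses the same iteration count for every test and adjusts iEnd only after the result is printed. The variable names follow the test program above; the test string is shortened here.

```purebasic
EnableExplicit
OpenConsole()

Define tStart.l, tStart2.l, tDiff.d
Define i.l, iEnd.l
Define str1.s, str1Mid.s

str1 = "How To back up, edit, And Restore the registry in Windows XP"
iEnd = 10000000                      ; use the same count for the Mid() and FMid2() runs

; Wait for the millisecond counter to change, so the loop starts on a fresh tick
tStart2 = ElapsedMilliseconds()
Repeat
  tStart = ElapsedMilliseconds()
Until tStart <> tStart2

For i = 1 To iEnd
  str1Mid = Mid(str1, 20, 10)
Next i

tDiff = (ElapsedMilliseconds() - tStart) / 1000.0
PrintN(StrD(tDiff, 3) + " sec., Ops per sec.: " + StrD(iEnd / tDiff, 3) + "    " + str1Mid)

; Scale iEnd only AFTER printing, so the ops/sec figure uses the count that was actually run
If tDiff >= 1 : iEnd / tDiff : EndIf

While Inkey() = "" : Delay(100) : Wend
```

The FMid2() run would repeat the same block with only the call inside the loop changed.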

My claim that FMid2() is 3-5 times faster is correct, but only for extracting short strings from the end of longer strings (length over 1000).

If you don't mind the bigger code size, you can use the following procedure.
It only improves the speed for your first example to 1.41 times
and to 1.8 times for the second, but it gives a better speedup for longer strings.

That's all one can do if the string length isn't known in advance. It's sad that PB doesn't have string descriptors with the length of the string;
then it would be possible to speed up string functions by about 10 times even
for short strings. For longer strings (over 1000) the gain would be immeasurable,
some 1000 times, because you could omit the silly scanning of the string from the start each time.
I can't understand why it is done this way; it seems to be the 'C' way.
It would be the same as scanning an array each time to see if you are past the end.

Nuf rambling, here's the code:

Code: Select all

! macro EndMacro {} ; Just to keep the IDE folding happy

; Assembler macro to align code on a given boundary, 
; aligning loop starts on 8 byte boundaries can speed up the code 
; by about 1.5 to 2 times! 

! macro calignjmp value 
! { 
!   local dest 
! 
!   if ((value - 1) - ((($ - $$) + value - 1) mod value)) > 3 
!     makeDest equ 1 
!     jmp dest 
!   else 
!     makeDest equ 0 
!   end if 
! 
!   rept value 
!   \{ 
!     if ($ - $$) mod value 
!       nop 
!     end if 
!   \} 
!   dest: 
! } EndMacro 

; Uncomment to use FMid() instead of Mid() as standard
; Macro Mid(string, start, length) 
;   FMid(@string,start,length) 
; EndMacro 
CompilerIf #PB_Compiler_Unicode
Procedure.s FMid(*srcPtr.Byte, start.l, length.l) 
  !MOV    edx,[p.p_srcPtr] 
  !MOV    ecx,[p.v_length] 
  !TEST   edx,edx 
  !JZ     l_fmid_empty 
  !TEST   ecx,ecx 
  !JLE    l_fmid_empty 
  !MOV    ecx,[p.v_start] 
  !CMP    ecx,1 
  !JLE    l_fmid_return
  !XOR    eax,eax 
  !calignjmp 8
  fmid_scan1:
  !SUB    ecx,8
  !JL     l_fmid_scanend1_0
  !CMP    [edx],ax
  !JE     l_fmid_scanend2_1
  !CMP    [edx+2],ax 
  !JE     l_fmid_scanend2_2
  !CMP    [edx+4],ax 
  !JE     l_fmid_scanend2_3
  !CMP    [edx+6],ax 
  !JE     l_fmid_scanend2_4
  !CMP    [edx+8],ax 
  !JE     l_fmid_scanend2_5
  !CMP    [edx+10],ax 
  !JE     l_fmid_scanend2_6
  !CMP    [edx+12],ax 
  !JE     l_fmid_scanend2_7
  !CMP    [edx+14],ax 
  !JE     l_fmid_scanend2_8
  !ADD    edx,16
  !JMP    l_fmid_scan1

  fmid_scanend1_0:
  !ADD    ecx,8
  !JNZ    l_fmid_scan2
  !SUB    edx,2
  !JMP    l_fmid_scanend2_1

  fmid_scan2:  
  !DEC    ecx
  !JZ     l_fmid_scanend2_1
  !CMP    [edx],ax
  !JZ     l_fmid_scanend2_1

  !DEC    ecx
  !JZ     l_fmid_scanend2_2
  !CMP    [edx+2],ax
  !JZ     l_fmid_scanend2_2

  !DEC    ecx
  !JZ     l_fmid_scanend2_3
  !CMP    [edx+4],ax
  !JZ     l_fmid_scanend2_3

  !DEC    ecx
  !JZ     l_fmid_scanend2_4
  !CMP    [edx+6],ax
  !JZ     l_fmid_scanend2_4

  !DEC    ecx
  !JZ     l_fmid_scanend2_5
  !CMP    [edx+8],ax
  !JZ     l_fmid_scanend2_5

  !DEC    ecx
  !JZ     l_fmid_scanend2_6
  !CMP    [edx+10],ax
  !JZ     l_fmid_scanend2_6

  !JMP    l_fmid_scanend2_7
  

  fmid_scanend2_8:
  !ADD    edx,14
  !JMP    l_fmid_scanend2_1
  fmid_scanend2_7:
  !ADD    edx,12
  !JMP    l_fmid_scanend2_1
  fmid_scanend2_6:
  !ADD    edx,10
  !JMP    l_fmid_scanend2_1
  fmid_scanend2_5:
  !ADD    edx,8
  !JMP    l_fmid_scanend2_1
  fmid_scanend2_4:
  !ADD    edx,6
  !JMP    l_fmid_scanend2_1
  fmid_scanend2_3:
  !ADD    edx,4
  !JMP    l_fmid_scanend2_1
  fmid_scanend2_2:
  !ADD    edx,2
  fmid_scanend2_1: 
  !MOV   [p.p_srcPtr],edx 
  fmid_return: 
  ProcedureReturn PeekS(*srcPtr, length) 
  fmid_empty: 
  ProcedureReturn "" 
EndProcedure 
CompilerElse
Procedure.s FMid(*srcPtr.Byte, start.l, length.l) 
  !MOV    edx,[p.p_srcPtr] 
  !MOV    ecx,[p.v_length] 
  !TEST   edx,edx 
  !JZ     l_fmid_empty 
  !TEST   ecx,ecx 
  !JLE    l_fmid_empty 
  !MOV    ecx,[p.v_start] 
  !CMP    ecx,1 
  !JLE    l_fmid_return
  !XOR    eax,eax 
  !calignjmp 8
  fmid_scan1:
  !SUB    ecx,8
  !JL     l_fmid_scanend1_0
  !CMP    [edx],al
  !JE     l_fmid_scanend2_1
  !CMP    [edx+1],al 
  !JE     l_fmid_scanend2_2
  !CMP    [edx+2],al 
  !JE     l_fmid_scanend2_3
  !CMP    [edx+3],al 
  !JE     l_fmid_scanend2_4
  !CMP    [edx+4],al 
  !JE     l_fmid_scanend2_5
  !CMP    [edx+5],al 
  !JE     l_fmid_scanend2_6
  !CMP    [edx+6],al 
  !JE     l_fmid_scanend2_7
  !CMP    [edx+7],al 
  !JE     l_fmid_scanend2_8
  !ADD    edx,8
  !JMP    l_fmid_scan1

  fmid_scanend1_0:
  !ADD    ecx,8
  !JNZ    l_fmid_scan2
  !DEC    edx
  !JMP    l_fmid_scanend2_1

  fmid_scan2:  
  !DEC    ecx
  !JZ     l_fmid_scanend2_1
  !CMP    [edx],al
  !JZ     l_fmid_scanend2_1

  !DEC    ecx
  !JZ     l_fmid_scanend2_2
  !CMP    [edx+1],al
  !JZ     l_fmid_scanend2_2

  !DEC    ecx
  !JZ     l_fmid_scanend2_3
  !CMP    [edx+2],al
  !JZ     l_fmid_scanend2_3

  !DEC    ecx
  !JZ     l_fmid_scanend2_4
  !CMP    [edx+3],al
  !JZ     l_fmid_scanend2_4

  !DEC    ecx
  !JZ     l_fmid_scanend2_5
  !CMP    [edx+4],al
  !JZ     l_fmid_scanend2_5

  !DEC    ecx
  !JZ     l_fmid_scanend2_6
  !CMP    [edx+5],al
  !JZ     l_fmid_scanend2_6

  !JMP    l_fmid_scanend2_7
  

  fmid_scanend2_8:
  !ADD    edx,7
  !JMP    l_fmid_scanend2_1
  fmid_scanend2_7:
  !ADD    edx,6
  !JMP    l_fmid_scanend2_1
  fmid_scanend2_6:
  !ADD    edx,5
  !JMP    l_fmid_scanend2_1
  fmid_scanend2_5:
  !ADD    edx,4
  !JMP    l_fmid_scanend2_1
  fmid_scanend2_4:
  !ADD    edx,3
  !JMP    l_fmid_scanend2_1
  fmid_scanend2_3:
  !ADD    edx,2
  !JMP    l_fmid_scanend2_1
  fmid_scanend2_2:
  !ADD    edx,1
  fmid_scanend2_1: 
  !MOV   [p.p_srcPtr],edx 
  fmid_return: 
  ProcedureReturn PeekS(*srcPtr, length) 
  fmid_empty: 
  ProcedureReturn "" 
EndProcedure 
CompilerEndIf

yrret · Post by **yrret** » Tue Oct 24, 2006 3:29 am

Hi Technicorn,

technicorn wrote:You use 10 million iterations in the test of Mid() and 1 million in the FMid2() test.
Besides that, I posted the code with the Mid() macro already uncommented,
so both tests used FMid2().

I guess I need to be more 'awake' next time I cut and paste stuff. I totally thought I pasted the same lines,
and totally forgot about the macro part in the .pbi file.

I was so excited at the time, but now I'm sorry to have misled you with the wrong results.

Thanks for your help,
yrret

technicorn · Post by **technicorn** » Tue Oct 24, 2006 7:19 am

Hi Yrret,

no problem.

It's only sad that there isn't anything that can be done to speed this up even more.
Using something like MMX or SSE could lead to problems, because you could accidentally read past the end of the string.
If the string happens to be at the end of a logical page and the next page isn't part of your data,
you would get sporadic page fault errors that crash your program.
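
To make the page-boundary concern concrete: a 16-byte SSE load starting at address A touches a second page whenever (A mod page size) + 16 is larger than the page size. A minimal sketch of that check follows. The 4096-byte page size is an assumption (typical on x86), and CrossesPage() is a hypothetical helper name, not a PB function.

```purebasic
#PageSize = 4096   ; assumed page size, not queried from the OS

; Returns #True if reading 'size' bytes at *addr would touch the following page
Procedure.l CrossesPage(*addr, size.l)
  If (*addr & (#PageSize - 1)) + size > #PageSize
    ProcedureReturn #True
  EndIf
  ProcedureReturn #False
EndProcedure

Debug CrossesPage(4090, 16)   ; offset 4090 + 16 bytes spills into the next page
Debug CrossesPage(4000, 16)   ; offset 4000 + 16 bytes stays inside one page
```

A wide-read routine would have to fall back to byte-wise scanning whenever this check is true, which eats into the gain.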

wilbert · Post by **wilbert** » Tue Oct 24, 2006 10:36 am

technicorn wrote:It's only sad, that there isn't anything that can be done to speed this up even more.

I think there can still be some improvements if you want. When PB allocates memory for a string, it allocates a few bytes extra (5 if I'm correct). Reading 32 bits at a time should be safe, I think.

If you used

Code: Select all

!mov eax,[edx]
!test eax,$000000ff
!jz end_found
!test eax,$0000ff00
!jz end_found
!test eax,$00ff0000
!jz end_found
!test eax,$ff000000
!jz end_found

It should take a few clock cycles less, if I'm correct.
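
The same idea written in plain PB, as a rough sketch only. ScanLen4() is a hypothetical name, the routine is for ASCII (one byte per character) data, and it relies on the unverified assumption above that at least 3 readable bytes follow the terminating zero.

```purebasic
; Returns the number of bytes before the terminating zero of an ASCII string.
; Reads 4 bytes per step, so it may read up to 3 bytes past the zero.
Procedure.l ScanLen4(*p)
  Protected v.l, n.l
  Repeat
    v = PeekL(*p + n)                                  ; one 32-bit read covers 4 characters
    If (v & $000000FF) = 0 : ProcedureReturn n     : EndIf
    If (v & $0000FF00) = 0 : ProcedureReturn n + 1 : EndIf
    If (v & $00FF0000) = 0 : ProcedureReturn n + 2 : EndIf
    If (v & $FF000000) = 0 : ProcedureReturn n + 3 : EndIf
    n + 4
  ForEver
EndProcedure

Define *buf = AllocateMemory(64)        ; zero-filled, so the padding assumption holds here
PokeS(*buf, "Hello", -1, #PB_Ascii)     ; store the text as ASCII bytes
Debug ScanLen4(*buf)
FreeMemory(*buf)
```

Unlike the asm fragment, this version also has to work out which of the four bytes was zero, so the real saving depends on how often the early exits are taken.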

Another improvement could be to preallocate an output buffer of length + 5 bytes and store the 4 bytes you read for the end-of-string test at once.
The few extra bytes would avoid errors, and you wouldn't have to use PeekS at the end.

Edit:
The PB docs mention SYS_GetOutputBuffer for C but I don't know if that can also be used for an ASM routine like this.

Amundo · Post by **Amundo** » Thu Oct 26, 2006 2:54 am

technicorn wrote:That's all one can do if the string length isn't known in advance. It's sad that PB doesn't have string descriptors with the length of the string;
then it would be possible to speed up string functions by about 10 times even
for short strings. For longer strings (over 1000) the gain would be immeasurable,
some 1000 times, because you could omit the silly scanning of the string from the start each time.
I can't understand why it is done this way; it seems to be the 'C' way.
It would be the same as scanning an array each time to see if you are past the end.

Hi technicorn,

Nice work with this and thanks for sharing. The sort of people who use PB are the type who don't settle for "it compiles really small, fast EXEs". They look to tweak and optimize even more!

I think the point you raise about strings is valid, up to a point. Past that point you would be looking to use other data structures (character arrays?) if you knew in advance the possible maximum data sizes your program would be asked to deal with. I think what you ask for is for the string data type to be as efficient as an array, when they are two different things, used for two different purposes.

My 10 cents' worth.

technicorn · Post by **technicorn** » Thu Oct 26, 2006 9:25 am

Thanks Wilbert for that info, will have a look at that.

Hi Amundo,
it's not that I asked for string descriptors in order to use strings as a sort of array of bytes/words (if unicode).

It's just for programs that have to scan large amounts of text,
for example preprocessors, text converters, syntax highlighting...
where you have to scan a string character by character.
I know I could use pointers to characters, but I don't think that's something you should be forced into
just because the built-in functions are slower than they need to be,
when a little extra work for the compiler would make them faster by a huge amount.
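
To show what I mean by a descriptor, here is a minimal sketch done by hand in PB. The LStr structure and LMid() are hypothetical names for illustration, not anything PB provides; the point is only that the start offset is computed from the stored length, without scanning for the terminating zero.

```purebasic
; Hypothetical descriptor: the length is stored next to a pointer to the text
Structure LStr
  len.l        ; number of characters, stored once
  *text        ; address of the character data
EndStructure

; Mid() variant that never scans the string from the start
Procedure.s LMid(*s.LStr, start.l, length.l)
  If start < 1 : start = 1 : EndIf
  If length <= 0 Or start > *s\len
    ProcedureReturn ""
  EndIf
  If length > *s\len - start + 1
    length = *s\len - start + 1          ; clamp to the remaining characters
  EndIf
  ProcedureReturn PeekS(*s\text + (start - 1) * SizeOf(Character), length)
EndProcedure

Define src.s = "How To back up, edit, And Restore the registry"
Define d.LStr
d\text = @src
d\len  = Len(src)                        ; the only scan, done when the descriptor is set up
Debug LMid(@d, 5, 2)
```

The descriptor has to be refreshed whenever src is reassigned, which is exactly the bookkeeping the compiler could do for us.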

And I like strings, just because you don't have to care about buffer overflow.

Greetings,
technicorn