In a couple of little projects I've encountered performance issues with strings, but this one takes the cake. While making a design change to cope with a performance bottleneck, I stumbled over something quite surprising.
I think many people have been in the situation where they need to process a large amount of text, like a log file, and split it up on the #CRLF$. I hit a big problem because I was taking a shortcut: replace the CRLF with a single char, then use the PB ParseString() function to extract the bits. For smallish files this is okay, but as you go over a couple of thousand lines you notice a slowdown as it progresses, and over tens of thousands of lines it drops to a crawl.
Simple design problem on my part, of course: ParseString() has to count the CRs from the start of the string on every call, so when the line count gets to, say, 20,000 it has to scan past 20,000 delimiters just to fetch one more line.
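The effect is easy to reproduce outside PB. A field-by-index parser has to re-scan from the start on every call, so pulling out all N lines costs O(N²), while a single forward pass that remembers where it stopped is O(N). A rough sketch in Python (function names are mine, just for illustration of the two access patterns):

```python
def get_field(text, index, sep="\r\n"):
    """Field-by-index access: re-scans from the start on every call."""
    start = 0
    for _ in range(index):              # skip 'index' separators each time
        start = text.index(sep, start) + len(sep)
    end = text.find(sep, start)
    return text[start:] if end == -1 else text[start:end]

def split_once(text, sep="\r\n"):
    """Single pass: remembers where the last separator was."""
    fields, start = [], 0
    while True:
        end = text.find(sep, start)
        if end == -1:
            fields.append(text[start:])
            return fields
        fields.append(text[start:end])
        start = end + len(sep)
```

Calling get_field() once per line walks over line i's worth of separators again on call i; split_once() touches each character once, which is exactly the design change described above.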
The solution is a Split() function that extracts all the pieces in one pass rather than rescanning for each line. It can handle the two-byte CRLF, so you get rid of the Replace step too. Hardly rocket science.
Looking through the forums, there are a few split functions. Some use ParseString() internally, so they wouldn't solve my perf problem; others don't, but weren't quite what I wanted, so I wrote my own. I based it on the quicksearch algorithm since (unlike FindString()) it can keep going after it finds a match and make full use of the BadChar() table it builds at the beginning. It's also faster the larger the delimiter, so CRLF will search faster than just CR.
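For anyone unfamiliar with quicksearch (Sunday's variant of Boyer-Moore), here's the idea in a short Python sketch (my own illustration, not the PB code below): the bad-character table says how far the search window can jump based on the character just past the window, and characters not in the pattern allow a full len(pattern)+1 jump, which is why a two-byte CRLF delimiter searches faster than a lone CR.

```python
def quicksearch_all(text, pat):
    """Find every occurrence of pat in text with Sunday's quicksearch,
    continuing after each match (unlike a single find-first call)."""
    m = len(pat)
    # distance from each pattern char to the end of the pattern;
    # anything not in the pattern gets the default shift m + 1
    bad = {}
    for i, c in enumerate(pat):
        bad[c] = m - i
    hits, pos = [], 0
    while pos <= len(text) - m:
        if text[pos:pos + m] == pat:
            hits.append(pos)            # record the match and keep going
        if pos + m >= len(text):
            break                       # no character past the window left
        pos += bad.get(text[pos + m], m + 1)
    return hits
```

Longer patterns give larger default shifts, so the scan skips more of the text per step.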
Strangely enough, using the Split function was not faster unless there were over about 15,000 lines. I thought excessive ReDim'ing to grow the array was at fault, but Dim'ing a huge array beforehand made no difference. In the end I found it was the PB Mid() function on large strings that was slow. This seemed odd, since the size parameters are passed in, so it shouldn't suffer from the slow PB Len() function. Anyway, I replaced the PB Mid() function with this:
Code: Select all
Procedure.s MidMem(*MainMem, StartPos.l, GetLen.l)
  Protected *RetMem, Result.s
  If *MainMem = 0 Or GetLen = 0
    ProcedureReturn ""
  EndIf
  *RetMem = AllocateMemory(GetLen)
  CopyMemory(*MainMem + StartPos - 1, *RetMem, GetLen) ; StartPos is 1-based, like Mid()
  Result = PeekS(*RetMem, GetLen)
  FreeMemory(*RetMem) ; don't leak the temporary buffer
  ProcedureReturn Result
EndProcedure

I decided to run a test just to see what the perf difference of the Mid() function alone was, since the design change was part of the reason the speedup was so great, so I added this:
Code: Select all
DisableDebugger
TestString.s = Space(20000000)
Output.s
Start1.l = ElapsedMilliseconds()
For i = 1 To 100000
  Output = "!" + MidMem(@TestString, i*100, 10) + "!"
Next
Start2.l = ElapsedMilliseconds()
For i = 1 To 100000
  Output = "!" + Mid(TestString, i*100, 10) + "!"
Next
Start3.l = ElapsedMilliseconds()
EnableDebugger
MessageRequester("Mid vs MidMem", "Mid() = " + Str(Start3-Start2) + #CRLF$ + "MidMem() = " + Str(Start2-Start1))
---------------------------
Mid vs MidMem
---------------------------
Mid() = 902562
MidMem() = 31
---------------------------
OK
---------------------------
This is admittedly a very large string. Previous versions of PB with the 64k string limit would have made people work around it in other ways, so this would have been a much smaller issue; I'm not sure how PB handled getting a 1 MB text blob through the clipboard, though.
I'm not meaning this as a bash on PB. Its memory functions and pointer arithmetic really fly and make these workarounds possible, but I think we need to spend some time flushing these things out into the open so we can see the benefits. In a larger function where something like MidMem() and its inputs are controlled, the risk of memory overruns is quite small, although accepting user input into it would be quite dangerous.
The Split function currently looks like this (but I want to clean it up, merge MidMem() into it, and get rid of the function call):
Code: Select all
Structure MemoryArray
  Byte.c[0]
  Word.w[0]
EndStructure
Procedure.l Split(StringArray.s(1), Text2Split.s, Delim.s) ; returns the field count
  FindLen.l = Len(Delim)
  MainLen.l = Len(Text2Split)
  Dim StringArray.s(1000)
  StringCount.l = 0
  *MainByteArray.MemoryArray = @Text2Split
  *FindByteArray.MemoryArray = @Delim
  PrevPos.l = 1
  ; Build the BadChar table
  Dim BadChar.l(255)
  ; Set the whole alphabet to the maximum shift (length of find string plus 1)
  For i = 0 To 255
    BadChar(i) = FindLen + 1
  Next
  ; Update chars that are in the find string to their distance from the end
  For i = 0 To FindLen - 1
    BadChar(*FindByteArray\Byte[i]) = FindLen - i
  Next
  MainArrayLoop.l = 0 ; zero-based offset; starting at 1 would miss a delimiter at the very start
  EndSearchPos.l = MainLen - FindLen
  While MainArrayLoop <= EndSearchPos
    If CompareMemory(@Text2Split + MainArrayLoop, @Delim, FindLen) = 1
      FoundPos.l = MainArrayLoop + 1 ; convert to 1-based position
      If StringCount % 1000 = 0 ; grow in chunks; not much of a speed increase here, but it used to do a lot in the old VB days
        ReDim StringArray.s(StringCount + 1000)
      EndIf
      StringArray(StringCount) = MidMem(@Text2Split, PrevPos, FoundPos - PrevPos)
      StringCount = StringCount + 1
      PrevPos = FoundPos + FindLen
    EndIf
    ; Shift the window as per the table
    MainArrayLoop + BadChar(*MainByteArray\Byte[MainArrayLoop + FindLen])
  Wend
  ; Catch the end
  ReDim StringArray.s(StringCount)
  StringArray(StringCount) = Mid(Text2Split, PrevPos, MainLen - PrevPos + 1)
  StringCount = StringCount + 1
  ReDim StringArray.s(StringCount)
  ProcedureReturn StringCount
EndProcedure
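For anyone who wants to sanity-check the contract, here is what the procedure computes, expressed in Python (my own sketch of the behaviour, not a port of the PB internals): every field between delimiters, plus the trailing field, which may be empty.

```python
def split_fields(text, delim):
    """Return all fields between delimiters, including the (possibly
    empty) tail -- the same contract as the Split procedure above."""
    fields, prev = [], 0
    while True:
        found = text.find(delim, prev)
        if found == -1:
            fields.append(text[prev:])  # the "catch end" step
            return fields
        fields.append(text[prev:found])
        prev = found + len(delim)
```

So "one" + CRLF produces two fields ("one" and an empty tail), and an empty input still produces one empty field, which is what the final ReDim/assign in the PB version does as well.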