Fastest way to modify a large string, byte-by-byte?

Just starting out? Need help? Post your questions and find answers here.
User avatar
Demivec
Addict
Addict
Posts: 4260
Joined: Mon Jul 25, 2005 3:51 pm
Location: Utah, USA

Re: Fastest way to modify a large string, byte-by-byte?

Post by Demivec »

MachineCode wrote:Anyway, I tried your new code against breeze4me and it seems just as fast, but yours ends up having lots of blank lines at the top, because they're obviously being sorted too from all the carriage-returns in my text. Breeze's doesn't include them for some reason? If you can omit them, that'd be better because I hate "Import" in my code because I have no idea what it's doing and I hate having code that I can't understand (even if it works).
I tested the two working versions (mine and the one by breeze4me) and they produced identical results with the test data used by breeze4me's code, which didn't include short lines having only a #CRLF$.

Here is the code with the changes to remove lines containing only a #CRLF$:

Code: Select all

#soc = SizeOf(character) 
;---------

Procedure.s sortText(text$)
  NewList *textLines.string()

  otext$ = text$ ;This is simply to reserve an output buffer of the same size as the original string
  *start = @text$
  *end.character = *start
  
  ;setup pointers to each substring
  While *end\c
    If *end\c = #LF  ;assumes that #LF is the last character of each substring
      *end\c = #Null ;replace #LF temporarily to mark it as a substring
      *end + #soc
      If (*end - *start) / #soc > Len(#CRLF$)
        AddElement(*textLines())
        *textLines() = *start ;simply point to the string segment without copying it
      EndIf
      *start = *end
    Else
      *end + #soc
    EndIf
  Wend
  
  SortStructuredList(*textLines(), #PB_Sort_Ascending, 0, #PB_String)
  
  ;assemble the output string from the pointers to the sorted substrings
  *end = @otext$
  CopyMemoryString("", @*end) ;prime the copy function
  ForEach *textLines()
    CopyMemoryString(*textLines())
    *end\c = #LF ;restore the #LF and 'unmark' the substring
    *end + 1
  Next
  *end\c = #Null ;mark end of string in case output is smaller than input
    
  ProcedureReturn otext$
EndProcedure
MachineCode
Addict
Addict
Posts: 1482
Joined: Tue Feb 22, 2011 1:16 pm

Re: Fastest way to modify a large string, byte-by-byte?

Post by MachineCode »

Hi Demivec, yep, your edited code now produces the same results and seems just as fast. Thanks! :)
I noticed this: Len(#CRLF$). Why not just hard-code "2" there instead of the overhead of calling Len()?
Also, shouldn't we FreeList() before exiting the procedure?

BTW, here's a zip of the Win32api.txt file that I use for all my speed tests -> http://www.sendspace.com/file/rw2zfq
Basically I open that file in Notepad, copy all, and use the copied text as my test base for all my string operations.

PS. Haven't forgotten about your donation; PayPal is giving me grief so I may cancel that account and re-open a different one. It'll happen, though. ;)
Microsoft Visual Basic only lasted 7 short years: 1991 to 1998.
PureBasic: Born in 1998 and still going strong to this very day!
User avatar
Demivec
Addict
Addict
Posts: 4260
Joined: Mon Jul 25, 2005 3:51 pm
Location: Utah, USA

Re: Fastest way to modify a large string, byte-by-byte?

Post by Demivec »

MachineCode wrote:Hi Demivec, yep, your edited code now produces the same results and seems just as fast. Thanks! :)
I noticed this: Len(#CRLF$). Why not just hard-code "2" there instead of the overhead of calling Len()?
Also, shouldn't we FreeList() before exiting the procedure?
In my own tests of timing my procedure takes about half as long as breeze4me's. I attribute this to the fact that it only copies the strings once to accomplish the sort and not twice.

I used Len(#CRLF$) instead of a hard-coded 2 for the sake of clarity. I mean, in a week what will that 2 stand for anyway? :wink: I did some speed tests and it only took at most 2% longer. It makes the most sense to replace it with an appropriately named constant (i.e. #LengthOfCRLF) if you can come up with a decent name. The problem is that you can't use Len() in a constant definition so to document it well you would have to give it a descriptive name or something similar (like a comment). You could also define it at the beginning of the procedure with 'lengthOfCRLF = len(#CRLF$)' and use that variable in its place also. Pick your favorite.

The list used in the procedure is freed automatically at the end of the procedure.
MachineCode
Addict
Addict
Posts: 1482
Joined: Tue Feb 22, 2011 1:16 pm

Re: Fastest way to modify a large string, byte-by-byte?

Post by MachineCode »

Thanks for the feedback. For your procedure, here's what I did for the "2" issue:

Code: Select all

If (*end - *start) / #soc > 2 ; "2" was Len(#CRLF$).
I do this sort of thing quite often in my apps, so I know why a hard-coded value was used. :)

In your code for sorting, you saved all lines to a linked list, but the manual says arrays are faster than those. Should we be storing in an array instead, then? Your code is fast enough anyway... but I was thinking maybe arrays would give a better speed gain, based on the manual's comment?

Another question (hope you don't mind, sensei). :) So that I can process my own lines one-by-one and not hassle you anymore for help, how would your code be changed so that I can iterate every line of text in turn and do something to each on the fly? I know it's to do with the code before the SortStructuredArray() line, but when I try to PeekS() the lines to get them, it fails (the debug output shows the same line twice or more). Feel up to another lesson? Let's assume I just want to sort through that big Win32api.txt file and put "pure" at the start of each line and "basic" at the end, or basically just do anything to each line as fast as possible. Can you show me a skeleton code that grabs each line, stores it in s$, then adds s$ to a new output? That way, I can modify each s$ on the fly, before it's built into the new final output (with CopyMemoryString?).
Microsoft Visual Basic only lasted 7 short years: 1991 to 1998.
PureBasic: Born in 1998 and still going strong to this very day!
wilbert
PureBasic Expert
PureBasic Expert
Posts: 3942
Joined: Sun Aug 08, 2004 5:21 am
Location: Netherlands

Re: Fastest way to modify a large string, byte-by-byte?

Post by wilbert »

There's no need to do a division each time
You can define a constant

#crlf_bytes = #soc * 2

Code: Select all

If *end - *start > #crlf_bytes
Windows (x64)
Raspberry Pi OS (Arm64)
User avatar
Demivec
Addict
Addict
Posts: 4260
Joined: Mon Jul 25, 2005 3:51 pm
Location: Utah, USA

Re: Fastest way to modify a large string, byte-by-byte?

Post by Demivec »

MachineCode wrote:In your code for sorting, you saved all lines to a linked list, but the manual says arrays are faster than those. Should we be storing in an array instead, then? Your code is fast enough anyway... but I was thinking maybe arrays would give a better speed gain, based on the manual's comment?
I would first like to say that when I benchmarked the different procedures I would get varying results. I would change one sort procedure and it would affect the performance of other sort procedures that were not changed (they each acted on the same non-random data). So you will just have to evaluate things for yourself to make the final say.

I don't think there is a significant (>10%) speed difference between an array version and a list version. In my tests the array version differed from the list version in speed by +/- 7%. Here is what the array version looks like if you are interested in the way in which the code differs:

Code: Select all

Procedure.s sortText(text$)
  numLines = CountString(text$, #CRLF$)
  Dim *textLines.string(numlines)
  
  otext$ = text$ ;This is simply to reserve an output buffer of the same size as the original string
  *start = @text$
  *end.character = *start
  
  currentIndex = 0
  ;setup pointers to each substring
  While *end\c
    If *end\c = #LF  ;assumes that #LF is the last character of each substring
      *end\c = #Null ;replace #LF temporarily to mark it as a substring
      *end + #soc
      If *end - *start > #crlf_bytes
        *textLines(currentIndex) = *start ;simply point to the string segment without copying it
        currentIndex + 1
      EndIf
      *start = *end
    Else
      *end + #soc
    EndIf
  Wend
  
  currentIndex - 1
  SortStructuredArray(*textLines(), #PB_Sort_Ascending, 0, #PB_String, 0, currentIndex)
  
  ;assemble the output string from the pointers to the sorted substrings
  *end = @otext$
  CopyMemoryString("", @*end) ;prime the copy function
  For i = 0 To currentIndex
    CopyMemoryString(*textLines(i))
    *end\c = #LF ;restore the #LF and 'unmark' the substring
    *end + 1
  Next
  *end\c = #Null ;mark end of string in case output is smaller than input
    
  ProcedureReturn otext$
EndProcedure
MachineCode wrote:Another question (hope you don't mind, sensei). :) So that I can process my own lines one-by-one and not hassle you anymore for help, how would your code be changed so that I can iterate every line of text in turn and do something to each on the fly? I know it's to do with the code before the SortStructuredArray() line, but when I try to PeekS() the lines to get them, it fails (the debug output shows the same line twice or more). Feel up to another lesson? Let's assume I just want to sort through that big Win32api.txt file and put "pure" at the start of each line and "basic" at the end, or basically just do anything to each line as fast as possible. Can you show me a skeleton code that grabs each line, stores it in s$, then adds s$ to a new output? That way, I can modify each s$ on the fly, before it's built into the new final output (with CopyMemoryString?).
Because you want to modify the lines and not simply sort the procedure will make copies of them instead of simply using pointers to them. The extra copying will take an additional 92% running time. You will notice that the following skeleton code is almost exactly the same. A major change was not modifying the input string. Another major change was that the output string could exceed the input string in length depending on how the lines were modified just prior to the sort.

You should be able to modify this to use an array instead of a list by using the earlier example if you wanted to go that route. With this you should be ready to knock many challenges down to size, grasshopper. :wink:

Code: Select all

#soc = SizeOf(character) 
;---------

Procedure.s sortText(text$)
  NewList textLines.String()
  
  *start = @text$
  *end.character = *start

  charCount = 1
  outputLength = 0
  ;split text into separate lines ending in #CRLF$, remove lines containing only #CRLF$
  While *end\c
    If *end\c = #LF  ;assumes that #LF is the last character of each substring
      If  charCount > 2 ;2 = Len(#CRLF$)
        AddElement(textLines())
        textLines()\s = PeekS(*start, charCount)  ;any changes to the line can be performed here, such as additions, replacements, deletions
        outputLength + charCount ;you would replace this with a calculation for your modifications, i.e. outputLength + Len(textLines()\s)
      EndIf
      
      *end + #soc
      *start = *end
      charCount = 1
    Else
      *end + #soc
      charCount + 1
    EndIf
  Wend
  
  SortStructuredList(textLines(), #PB_Sort_Ascending, 0, #PB_String)
  
  ;assemble the output string from each of the sorted substrings
  otext$ = Space((outputLength + 1) * #soc)
  *end = @otext$
  CopyMemoryString("", @*end) ;prime the copy function
  ForEach textLines()
    CopyMemoryString(textLines()\s)
  Next
      
  ProcedureReturn otext$
EndProcedure
On a side note, you sent me a PM that I am unable to respond to because you currently have user settings that prevent you from receiving PM's from forum members. :(
MachineCode
Addict
Addict
Posts: 1482
Joined: Tue Feb 22, 2011 1:16 pm

Re: Fastest way to modify a large string, byte-by-byte?

Post by MachineCode »

Thanks for the skeleton code. As for PMs, I've enabled it now. :)
Microsoft Visual Basic only lasted 7 short years: 1991 to 1998.
PureBasic: Born in 1998 and still going strong to this very day!
Post Reply