Fastest way to modify a large string, byte-by-byte?

Just starting out? Need help? Post your questions and find answers here.
MachineCode
Addict
Addict
Posts: 1482
Joined: Tue Feb 22, 2011 1:16 pm

Re: Fastest way to modify a large string, byte-by-byte?

Post by MachineCode »

Faster, but still unacceptably slow (about 20 seconds). Thanks for trying, but I really want to see a "Character" version to complete the other two "Character" versions that I've learned.
Microsoft Visual Basic only lasted 7 short years: 1991 to 1998.
PureBasic: Born in 1998 and still going strong to this very day!
User avatar
Danilo
Addict
Addict
Posts: 3036
Joined: Sat Apr 26, 2003 8:26 am
Location: Planet Earth

Re: Fastest way to modify a large string, byte-by-byte?

Post by Danilo »

MachineCode wrote:One final plea for help (and once I get this sorted, I should be set -- because I've already learned how to change same-size strings and longer-to-shorter strings). :)

This last time, I need to learn how to quickly enlarge an existing string with new data. In the particular example below, I want to put line numbers (and a space) in front of each line in the existing string. I think my code is almost there, but I don't know how to get the new larger result.

Please help me, because once this is done, you'll have taught me to fish, instead of feeding me a fish. ;)

Code: Select all

; Output in MessageRequester is to be:

;  8 eight
;  9 nine
; 10 ten

; With the above specific alignment.

text$="eight"+#CRLF$+"nine"+#CRLF$+"ten"+#CRLF$

width=2 : lines=3
soc=SizeOf(Character)

*i.Character=@text$
*o.Character=AllocateMemory(999)

n=8 : s$=RSet(Str(n),width," ")+" "
For a=1 To Len(s$) : *o\c=Asc(Mid(s$,a,1)) : *o+soc : Next

While *i\c
  *o\c=*i\c : *o+soc
  If *i\c=#LF ; End of current line.
    n+1 : s$=RSet(Str(n),width," ")+" "
    For a=1 To Len(s$) : *o\c=Asc(Mid(s$,a,1)) : *o+soc : Next
  EndIf
  *i+soc
Wend
*o\c=0

MessageRequester("Result",text$) ; Shows original string.
MessageRequester("Result",PeekS(*o)) ; Shows null.
Good try! It shows an empty string because you increment *o in your loop. Save *o and use the saved value (original) for PeekS().
sizeOf(Character) is a constant value, so you can change soc to #soc everywhere. Same for width in this case.

Could be replaced with one PokeS():

Code: Select all

For a=1 To Len(s$) : *o\c=Asc(Mid(s$,a,1)) : *o+#soc : Next

Code: Select all

; Output in MessageRequester is to be:

;  8 eight
;  9 nine
; 10 ten

; With the above specific alignment.

text$="eight"+#CRLF$+"nine"+#CRLF$+"ten"+#CRLF$

#width=2 : lines=3
#soc=SizeOf(Character)

*i.Character=@text$
CompilerIf #PB_Compiler_Version > 500
    *o.Character=AllocateMemory(999,#PB_Memory_NoClear) ; faster memory alloc
CompilerElse
    *o.Character=AllocateMemory(999)                    ; slower memory alloc
CompilerEndIf

*o_orig = *o                                            ; save original memory pointer

n=8 : s$=RSet(Str(n),#width," ")+" "
PokeS(*o,s$) : *o+Len(s$)*#soc

While *i\c
  *o\c=*i\c : *o+#soc
  If *i\c=#LF ; End of current line.
    n+1 : s$=RSet(Str(n),#width," ")+" "
    PokeS(*o,s$) : *o+Len(s$)*#soc                      ; use PokeS
  EndIf
  *i+#soc
Wend
*o\c=0

MessageRequester("Result",text$)                        ; Shows original string.
MessageRequester("Result",PeekS(*o_orig))               ; Shows new string

FreeMemory(*o_orig)                                     ; free saved memory pointer
Be aware that RSet with a #width of 2 makes the line numbers max. 2 character in width! '999' and '9999' will get truncated to '99'!
Last edited by Danilo on Sun Mar 03, 2013 6:48 am, edited 1 time in total.
Psych
Enthusiast
Enthusiast
Posts: 239
Joined: Thu Dec 18, 2008 3:35 pm
Location: Wales, UK

Re: Fastest way to modify a large string, byte-by-byte?

Post by Psych »

This is the quickest way I could think of using pointers, I am assuming you are ok with using pbs string concatenation.
You could I guess insert the characters into a memory block on the fly (stopping where you find the #crlf pair and inserting your line text) and then just peeks the memory afterwards, however there is no real way to know how many #crlf pairs there are, so you'd have to guess the memory size (or just allocate based on a formula assuming the file contains ALL #crlf pairs). Other than that this is the best I can do.

Code: Select all

Global line,*p.Character=@text$,*f=*p
While *p\c
  If *p\c=#CR
    *p+SizeOf(Character)
    If *p\c=#LF
      line+1
      *p+SizeOf(Character)
      r$+RSet(Str(line),2,"0")+" "+PeekS(*f,*p-*f)
      *f=*p
    EndIf
  Else
    *p+SizeOf(Character)
  EndIf
Wend
MessageRequester("Test",r$)
----------------------------------------------------------------------------
Commenting your own code is admitting you don't understand it.
----------------------------------------------------------------------------
Psych
Enthusiast
Enthusiast
Posts: 239
Joined: Thu Dec 18, 2008 3:35 pm
Location: Wales, UK

Re: Fastest way to modify a large string, byte-by-byte?

Post by Psych »

Are you saying you wish the parse the numbers in between the #crlf pairs and output the parsed value as a line number?

So "one"+#crlf$+"twenty"+#crlf$+"thirty"+#crlf$+"forty"+#crlf$+"fifty"

Would output

01 one
20 twenty
30 thirty
40 forty
50 fifty

In your message requester?
----------------------------------------------------------------------------
Commenting your own code is admitting you don't understand it.
----------------------------------------------------------------------------
MachineCode
Addict
Addict
Posts: 1482
Joined: Tue Feb 22, 2011 1:16 pm

Re: Fastest way to modify a large string, byte-by-byte?

Post by MachineCode »

Danilo, that's the stuff! Does what I want in about 30 ms on my 725 KB text. THANK YOU! :D

Also, I knew that about the "width" var; I only set it to 2 for my example. For my 725 KB text string it's actually set to 5 due to the number of lines being over 60000. ;)

[Edit] Psych, I just wanted to insert line numbers in front of every line of text$. All examples (except Danilo's) took over 20 seconds to do it, which is unacceptable. His version does it in less than a quarter of a second. :shock: Amazing. I am very grateful for all the lessons I learned in this thread. It's been a great topic to cover: the fast manipulation of a massive string.
Microsoft Visual Basic only lasted 7 short years: 1991 to 1998.
PureBasic: Born in 1998 and still going strong to this very day!
Psych
Enthusiast
Enthusiast
Posts: 239
Joined: Thu Dec 18, 2008 3:35 pm
Location: Wales, UK

Re: Fastest way to modify a large string, byte-by-byte?

Post by Psych »

It's worth pointing out that danillos example produced an extra line number with empty text when I tested it (unless this behaviour doesnt matter).
I guess PBs string concatenation was the slowing factor after all.
----------------------------------------------------------------------------
Commenting your own code is admitting you don't understand it.
----------------------------------------------------------------------------
User avatar
Danilo
Addict
Addict
Posts: 3036
Joined: Sat Apr 26, 2003 8:26 am
Location: Planet Earth

Re: Fastest way to modify a large string, byte-by-byte?

Post by Danilo »

MachineCode wrote:Danilo, that's the stuff! Does what I want in about 30 ms on my 725 KB text. THANK YOU! :D

Also, I knew that about the "width" var; I only set it to 2 for my example. For my 725 KB text string it's actually set to 5 due to the number of lines being over 60000. ;)
It is your code, you wrote it. The only (common) mistake was you didn't save the original pointer *o. I made the same mistake before, so I always remember it now. ;)

Just remembered you still use PB 5.00, so I changed the code little bit:

Code: Select all

CompilerIf #PB_Compiler_Version > 500
    *o.Character=AllocateMemory(999,#PB_Memory_NoClear) ; faster memory alloc
CompilerElse
    *o.Character=AllocateMemory(999)                    ; slower memory alloc
CompilerEndIf
Memory allocation without setting it to zero is faster, but it was just introduced with PB 5.10.

psych wrote:It's worth pointing out that danillos example produced an extra line number with empty text when I tested it (unless this behaviour doesnt matter).
I guess PBs string concatenation was the slowing factor after all.
It is correct and happens because text$ has an +#CRLF at end. Remove the last #CRLF and there is no extra line number added.
wilbert
PureBasic Expert
PureBasic Expert
Posts: 3942
Joined: Sun Aug 08, 2004 5:21 am
Location: Netherlands

Re: Fastest way to modify a large string, byte-by-byte?

Post by wilbert »

Different approach but I don't know if it is faster or slower

Code: Select all

Procedure.s InsertLineNumbers(text.s)
  
  Protected *in.Character = @text
  Protected out.s, *cont, *out.Character
  Protected in_len, ln_cnt, lnum_w, l, ln = 1
  
  While *in\c
    If *in\c = #LF
      ln_cnt + 1
    EndIf
    *in + SizeOf(Character)
  Wend
  in_len = *in - @text
  If in_len
    *in - SizeOf(character)
    If *in\c > 31
      ln_cnt + 1
    EndIf
  EndIf
  
  If ln_cnt
    lnum_w = 2 + Log10(ln_cnt)
    out = Space(in_len + lnum_w * ln_cnt)
    *in = @text
    *out = @out
    While *in\c
      *cont = *out + lnum_w
      *out = *cont - 2
      l = ln
      While l
        *out\c = $30 | (l % 10)
        *out - 1
        l / 10
      Wend
      *out = *cont
      ln + 1
      Repeat
        *out\c = *in\c
        *in + SizeOf(character)
        *out + SizeOf(character)
      Until *in\c = #LF Or *in\c = 0
      If *in\c = #LF
        *out\c = #LF
        *in + SizeOf(character)
        *out + SizeOf(character)
      EndIf
    Wend
  EndIf
  
  ProcedureReturn out
  
EndProcedure
Windows (x64)
Raspberry Pi OS (Arm64)
MachineCode
Addict
Addict
Posts: 1482
Joined: Tue Feb 22, 2011 1:16 pm

Re: Fastest way to modify a large string, byte-by-byte?

Post by MachineCode »

wilbert wrote:Different approach but I don't know if it is faster or slower
Thanks Wilbert, but I'm pretty happy with what I've got now, so I won't compare yours for speed (sorry).

I do have another question that came up today in my app: I need to alphabetize (sort) all the lines in text$ now. Currently I'm putting each line in an array with StringField(), calling SortArray(), then rebuilding a new sorted text$ with CopyMemoryString() in a loop. For my 725 KB text, it takes around 20 seconds. You guessed it: too long. :) I've looked hard but can't see how to apply the "Character" type to this particular situation. Any tips on how I should approach this one? Thanks. 8)

[Edit] I was wondering maybe there's a way to "PeekS" an array, instead of rebuilding it with CopyMemoryString?
Microsoft Visual Basic only lasted 7 short years: 1991 to 1998.
PureBasic: Born in 1998 and still going strong to this very day!
davido
Addict
Addict
Posts: 1890
Joined: Fri Nov 09, 2012 11:04 pm
Location: Uttoxeter, UK

Re: Fastest way to modify a large string, byte-by-byte?

Post by davido »

Hi MachineCode,

Sorry if my suggestion is daft, but here goes:

Could you make a structure like this where the lengths of the lines are a fixed string array:

Code: Select all

Structure XX
    S$
    StructureUnion
      Array ss${20}(1000)
    EndStructureUnion
  EndStructure
You could then treat the text as a single string or line by line using the array.
If sort still works on the array you won't have to move the text.
DE AA EB
User avatar
Demivec
Addict
Addict
Posts: 4260
Joined: Mon Jul 25, 2005 3:51 pm
Location: Utah, USA

Re: Fastest way to modify a large string, byte-by-byte?

Post by Demivec »

MachineCode wrote:
wilbert wrote:Different approach but I don't know if it is faster or slower
Thanks Wilbert, but I'm pretty happy with what I've got now, so I won't compare yours for speed (sorry).

I do have another question that came up today in my app: I need to alphabetize (sort) all the lines in text$ now. Currently I'm putting each line in an array with StringField(), calling SortArray(), then rebuilding a new sorted text$ with CopyMemoryString() in a loop. For my 725 KB text, it takes around 20 seconds. You guessed it: too long. :) I've looked hard but can't see how to apply the "Character" type to this particular situation. Any tips on how I should approach this one? Thanks. 8)

[Edit] I was wondering maybe there's a way to "PeekS" an array, instead of rebuilding it with CopyMemoryString?
Here is an untested procedure:

Code: Select all

#soc = SizeOf(character) 

Procedure.s sortText(text$)
  
  NewList textLines.s()
  
  *start         = @text$
  *end.character = *start
  
  While *end\c
    If *end\c = #LF ;assumes that #LF is the last character of the string
      *end + #soc
      AddElement(textLines())
      textLines() = Space(*end - *start)
      CopyMemory(*start, @textLines(), *end - *start))
      *start = *end
    Else
      *end + #soc
    EndIf
  Wend
  
  SortList(textLines(), #PB_Sort_Ascending)
  
  *start = @text$
  CopyMemoryString("", @*start) ;prime the copy function
  ForEach textLines()
    CopyMemoryString(@textLines())
  Next
  
  ProcedureReturn text$
EndProcedure
Last edited by Demivec on Mon Mar 04, 2013 8:25 am, edited 1 time in total.
breeze4me
Enthusiast
Enthusiast
Posts: 633
Joined: Thu Mar 09, 2006 9:24 am
Location: S. Kor

Re: Fastest way to modify a large string, byte-by-byte?

Post by breeze4me »

MachineCode wrote:...... alphabetize (sort) all the lines in text$ ......
The following code is not the best one, but it took 47-63ms(Unicode) on my PC.

Windows only.

Code: Select all

ImportC "msvcrt.lib"
  CompilerIf #PB_Compiler_Unicode
    StringField2(*string, *seps) As "_wcstok"
  CompilerElse
    StringField2(*string, *seps) As "_strtok"
  CompilerEndIf
EndImport

CompilerIf #PB_Compiler_Version < 510
  CompilerError "Use PB 5.10+"
CompilerEndIf

Procedure.s SortTextLines(text$)
  Protected NewList Lines.s()
  Protected Seps$ = #CRLF$
  Protected Token, Size, *Buffer, *Pointer
  
  If text$
    Token = StringField2(@text$, @Seps$)
    While Token
      If AddElement(Lines())
        Lines() = PeekS(Token)
        Size + StringByteLength(Lines()) + 2 * SizeOf(Character)
      EndIf
      Token = StringField2(0, @Seps$)
    Wend
    
    SortList(Lines(), #PB_Sort_Ascending)
    
    ;line numbering.
;     Protected Num, NumLength
;     NumLength = Int(Log10(ListSize(Lines()))) + 1
;     ForEach Lines()
;       Num + 1
;       Lines() = RSet(Str(Num), NumLength) + " " + Lines()
;       Size + StringByteLength(Lines()) + 2 * SizeOf(Character)
;     Next
    
    *Buffer = AllocateMemory(Size + SizeOf(Character), #PB_Memory_NoClear)
    If *Buffer
      *Pointer = *Buffer
      CopyMemoryString("", @*Pointer)
      ForEach Lines()
        CopyMemoryString(Lines() + #CRLF$)
      Next
      text$ = PeekS(*Buffer)
      FreeMemory(*Buffer)
    EndIf
  EndIf
  ProcedureReturn text$
EndProcedure


*Buffer = AllocateMemory(6000000)
*Pointer = *Buffer
CopyMemoryString("", @*Pointer)

For i = 1 To 60000
  CopyMemoryString(Chr(Random(126, 33)) + Chr(Random(126, 33)) + RSet(Str(Random(999999)), 6) + " - test string - " + Chr(Random(126, 33)) + #CRLF$)
Next

L1 = MemoryStringLength(*Buffer)
t$ = PeekS(*Buffer)
MessageRequester("Original", PeekS(@t$, 500))

t = ElapsedMilliseconds()
t$ = SortTextLines(t$)
t = ElapsedMilliseconds() - t

L2 = Len(t$)

MessageRequester("Sorted", PeekS(@t$, 500))

MessageRequester("Result", Str(t) + " ms" + #CR$ + #CR$ + "Length:" + #CR$ + "ori " + Str(L1) + #CR$ + "sorted " + Str(L2))
---------------------------
Result
---------------------------
47 ms

Length:
ori 1680000
sorted 1680000
---------------------------
MachineCode
Addict
Addict
Posts: 1482
Joined: Tue Feb 22, 2011 1:16 pm

Re: Fastest way to modify a large string, byte-by-byte?

Post by MachineCode »

I'm now using PureBasic 5.10 (because FindString is twice as fast, and you all know I need speed for my text app).

Demivec, your code crashes with IMA at: "SortList(textLines(), #PB_Sort_Ascending)"

Breeze4me, your code is perfect! Sorts my 725 KB text in around 30 ms. :D Damn I love these forums!
Microsoft Visual Basic only lasted 7 short years: 1991 to 1998.
PureBasic: Born in 1998 and still going strong to this very day!
User avatar
Demivec
Addict
Addict
Posts: 4260
Joined: Mon Jul 25, 2005 3:51 pm
Location: Utah, USA

Re: Fastest way to modify a large string, byte-by-byte?

Post by Demivec »

MachineCode wrote:Demivec, your code crashes with IMA at: "SortList(textLines(), #PB_Sort_Ascending)"
At the moment I'm trying to determine if this reveals a bug or a misunderstanding. :wink:

Because I like the puzzle though, here is a slightly altered version that will at least give breeze4me a run for his money. Please test it with your actual 725 KB text to see how it does on real-world data.

Code: Select all

#soc = SizeOf(character) 
;---------

Procedure.s sortText(text$)
  NewList *textLines.string()

  otext$ = text$ ;This is simply to reserve an output buffer of the same size as the original string
  *start.character = @text$
  *end.character = *start
  
  ;setup pointers to each substring
  While *end\c
    If *end\c = #LF ;assumes that #LF is the last character of each substring
      *end\c = #Null ;replace #LF temporarily to mark it as a substring
      *end + #soc
      AddElement(*textLines())
      *textLines() = *start ;simply point to the string segment without copying it
      *start = *end
    Else
      *end + #soc
    EndIf
  Wend
  
  SortStructuredList(*textLines(), #PB_Sort_Ascending, 0, #PB_String)
  
  ;assemble the output string from the pointers to the sorted substrings
  *start = @otext$
  CopyMemoryString("", @*start) ;prime the copy function
  ForEach *textLines()
    CopyMemoryString(*textLines())
    *start\c = #LF ;restore the #LF and 'unmark' the substring
    *start + 1
  Next
    
  ProcedureReturn otext$
EndProcedure
MachineCode
Addict
Addict
Posts: 1482
Joined: Tue Feb 22, 2011 1:16 pm

Re: Fastest way to modify a large string, byte-by-byte?

Post by MachineCode »

Demivec wrote:At the moment I'm trying to determine if this reveals a bug or a misunderstanding. :wink:

Because I like the puzzle though, here is a slightly altered version that will at least give breeze4me a run for his money. Please test it with your actual 725 KB text to see how it does on real-world data.
Misunderstanding? No, it was a direct copy-paste of your original code and when run, it crashed with that line. This was after I removed an extra closing bracket from your original code too. ;) Yes, you noticed that after. :)

Anyway, I tried your new code against breeze4me and it seems just as fast, but yours ends up having lots of blank lines at the top, because they're obviously being sorted too from all the carriage-returns in my text. Breeze's doesn't include them for some reason? If you can omit them, that'd be better because I hate "Import" in my code because I have no idea what it's doing and I hate having code that I can't understand (even if it works).
Microsoft Visual Basic only lasted 7 short years: 1991 to 1998.
PureBasic: Born in 1998 and still going strong to this very day!
Post Reply