Faster string concatenation: Alternative method

Share your advanced PureBasic knowledge/code with the community.
Karig
New User
New User
Posts: 7
Joined: Mon Jul 11, 2016 10:20 pm

Faster string concatenation: Alternative method

Post by Karig »

I see someone else has posted his solution for speeding up string concatenation, but I cobbled together this code today, and I thought I should share it. My system uses one structure definition and two procedures, so it should be simple to use. (It also uses Space() instead of AllocateMemory(), so if this leaks memory, it's probably PureBasic's fault. :mrgreen: )

Code: Select all

; cat.pbi --- for fast string concatenation

Structure CatBuffer
  capacity.i
  length.i
  text.s
EndStructure

Procedure EmptyCatBuffer(*cat.CatBuffer)
  Protected blank$ = ""
  
  *cat\capacity = 0
  *cat\length = 0
  PokeS(@*cat\text, blank$, 0)
EndProcedure

Procedure AddToCatBuffer(*cat.CatBuffer, string$)
  ; Appends string$ to the buffer. If the buffer is too small, this procedure
  ; will double the buffer's size before appending the string. Simply doubling
  ; the size on overflow uses more memory but necessitates fewer reallocations
  ; of memory as the buffer grows.
  
  Protected new_length = *cat\length + StringByteLength(string$)
  Protected new_capacity, old_text$, *address
  
  If new_length > *cat\capacity
    new_capacity = *cat\capacity
    If new_capacity = 0
      new_capacity = 1024 * 4
    EndIf
    
    While new_length > new_capacity
      new_capacity * 2
    Wend
  
    old_text$ = *cat\text
    *cat\text = Space(new_capacity)
    *cat\capacity = new_capacity
    PokeS(@*cat\text, old_text$)
  EndIf
  
  *address = @*cat\text + *cat\length
  PokeS(*address, string$)
  *cat\length = new_length
EndProcedure
You declare a variable of type CatBuffer, and you keep calling AddToCatBuffer, passing the variable and the string to append. When you're done (or even before you're done), you can copy the \text member to a string, or even WriteString() it to a file.

Here's some test code I wrote:

Code: Select all

EnableExplicit

XIncludeFile "cat.pbi"

Procedure TestCatBuffer()
  Protected cat.CatBuffer, t$
  
  t$ = "The "
  AddToCatBuffer(cat, t$) : Debug cat\text
  
  t$ = "quick "
  AddToCatBuffer(cat, t$) : Debug cat\text
  
  t$ = "brown fox jumps over the lazy dog."
  AddToCatBuffer(cat, t$) : Debug cat\text
  
  t$ = " Pack my box with five dozen liquor jugs."
  AddToCatBuffer(cat, t$) : Debug cat\text
  
  EmptyCatBuffer(cat) : Debug cat\text
  
  t$ = "A fresh new sentence."
  AddToCatBuffer(cat, t$) : Debug cat\text
EndProcedure

Procedure TestCatBufferSpeed(count.i)
  Protected cat.CatBuffer, t$, i, msec
  
  Debug "SPEED OF CATBUFFER (" + FormatNumber(count, 0) + " iterations):"
  msec = ElapsedMilliseconds()
  For i = 1 To count
    AddToCatBuffer(cat, " add")
  Next
  msec = ElapsedMilliseconds() - msec
  Debug "-- " + FormatNumber(msec/1000, 3) + " seconds."
  Debug "-- " + FormatNumber(Len(cat\text), 0) + " characters in buffer."
  
  Debug "SPEED OF '+' (" + FormatNumber(count, 0) + " iterations):"
  msec = ElapsedMilliseconds()
  For i = 1 To count
    t$ + " add"
  Next
  msec = ElapsedMilliseconds() - msec
  Debug "-- " + FormatNumber(msec/1000, 3) + " seconds."
  Debug "-- " + FormatNumber(Len(t$), 0) + " characters in string."
EndProcedure

TestCatBuffer()
TestCatBufferSpeed(1000)
User avatar
Kwai chang caine
Always Here
Always Here
Posts: 5342
Joined: Sun Nov 05, 2006 11:42 pm
Location: Lyon - France

Re: Faster string concatenation: Alternative method

Post by Kwai chang caine »

Works great here
Can be very usefull, thansk for sharing 8)
ImageThe happiness is a road...
Not a destination
acreis
Enthusiast
Enthusiast
Posts: 182
Joined: Fri Jun 01, 2012 12:20 am

Re: Faster string concatenation: Alternative method

Post by acreis »

I couldn't agree more!
said
Enthusiast
Enthusiast
Posts: 342
Joined: Thu Apr 14, 2011 6:07 pm

Re: Faster string concatenation: Alternative method

Post by said »

Simple, efficient and no risk of memory leak ... 8) 8)
davido
Addict
Addict
Posts: 1890
Joined: Fri Nov 09, 2012 11:04 pm
Location: Uttoxeter, UK

Re: Faster string concatenation: Alternative method

Post by davido »

@Karig,

Great idea. Works well here, too.
Thank you for sharing. :D

The speed difference wasn't too clear for 1000 iterations on my machine so I tried a few more:

Debug wrote: The
The quick
The quick brown fox jumps over the lazy dog.
The quick brown fox jumps over the lazy dog. Pack my box with five dozen liquor jugs.

A fresh new sentence.
SPEED OF CATBUFFER (1,000 iterations):
-- 0.000 seconds.
-- 4,000 characters in buffer.
SPEED OF '+' (1,000 iterations):
-- 0.004 seconds.
-- 4,000 characters in string.

A fresh new sentence.
SPEED OF CATBUFFER (10,000 iterations):
-- 0.010 seconds.
-- 40,000 characters in buffer.
SPEED OF '+' (10,000 iterations):
-- 0.463 seconds.
-- 40,000 characters in string.

A fresh new sentence.
SPEED OF CATBUFFER (100,000 iterations):
-- 0.056 seconds.
-- 400,000 characters in buffer.
SPEED OF '+' (100,000 iterations):
-- 46.914 seconds.
-- 400,000 characters in string.
DE AA EB
User avatar
Lunasole
Addict
Addict
Posts: 1091
Joined: Mon Oct 26, 2015 2:55 am
Location: UA
Contact:

Re: Faster string concatenation: Alternative method

Post by Lunasole »

Interesting, by default it works faster that ReAllocatememory.
I was surprised this one is about 300% faster that my variant with default value of 10240 chars buffer.
Though if I increase buffer size, that my variants become faster from 150% and more (depending on buffer size).

Seems such the difference is in allocation step (and method) ^^
When I tried my variant with buffer increasing * 2 instead of fixed step, my variant became more that 700% faster comparing to this (in version with pointers+fixed len), or 40% faster (regular mode).
Thanks Karig, that idea is surely better for most cases than increasing with fixed step (even though it brings larger memory overhead)
"W̷i̷s̷h̷i̷n̷g o̷n a s̷t̷a̷r"
User avatar
nco2k
Addict
Addict
Posts: 1344
Joined: Mon Sep 15, 2003 5:55 am

Re: Faster string concatenation: Alternative method

Post by nco2k »

@Karig
- your "benchmark" results are useless, because you are running with the debugger enabled.
- dont use string$ as the parameter. use *string instead.
- dont use StringByteLength(string$). use MemoryStringLength(*string) * SizeOf(Character) instead.
- dont use a While loop. calculate the capacity in one go.
- dont use PokeS(). use CopyMemory() instead.

c ya,
nco2k
If OSVersion() = #PB_OS_Windows_ME : End : EndIf
said
Enthusiast
Enthusiast
Posts: 342
Joined: Thu Apr 14, 2011 6:07 pm

Re: Faster string concatenation: Alternative method

Post by said »

Hi nco2k,

Can you pls elaborate more on below points?
nco2k wrote:@Karig
- dont use string$ as the parameter. use *string instead.
- dont use StringByteLength(string$). use MemoryStringLength(*string) * SizeOf(Character) instead.
- dont use PokeS(). use CopyMemory() instead.
thanks :)
Said
User avatar
nco2k
Addict
Addict
Posts: 1344
Joined: Mon Sep 15, 2003 5:55 am

Re: Faster string concatenation: Alternative method

Post by nco2k »

your procedures string parameter is a copy of the string you just passed. you can modify it however you like, without affecting the original string:

Code: Select all

Procedure$ Test(String$)
  Protected *Character.Character = @String$ + SizeOf(Character)
  *Character\c = 0
  ProcedureReturn String$
EndProcedure

Test$ = "ABC"

Debug Test(Test$)
Debug Test$
which is good, and the desired effect, but the string-copy has to be created and it costs time. if you want to write "faster string functions", you will want to use pointers, and work with the original string instead. by doing that, you skip the whole string-copy part.

however, purebasics own functions that have string parameters, get them by reference, if im not mistaken. so StringByteLength() & MemoryStringLength() should be pretty much equal, speed wise. so thats not the issue. the issue is PokeS(). in order for the string to be copied, the function needs to know how long the string is, and therefor has to search for the null terminator first. even if you use the length parameter, the function still has to check for the null terminator, within the specified length, and adjust the parameter accordingly. now this is not purebasics fault. this is, in fact, the correct way. im also pretty sure that purebasic wraps around strcpy/wcscpy anyway.

CopyMemory() on the other hand, doesnt check anything. it simply copies n bytes from *src to *dst. since we already know the total length of the string, we can skip the whole null terminator search.

c ya,
nco2k
If OSVersion() = #PB_OS_Windows_ME : End : EndIf
Karig
New User
New User
Posts: 7
Joined: Mon Jul 11, 2016 10:20 pm

Re: Faster string concatenation: Alternative method

Post by Karig »

I wanted my code to balance speed against ease of use, and I wanted it to be easy to understand. Theoretically I could have fiddled with pointers more to eke out more speed, but I wanted something I could just throw strings into.

I'd be using this for situations when the length of strings is unpredictable, e.g., user input, or lines from a file. So the length of the string has to be calculated whether I use CopyMemory() or not.

The main performance bottleneck with "+" seems to be the number of times that the internal buffer has to be reallocated and the string data copied over into the new buffer, so the key seems to be to reduce the number of times you have to do that. If all I do is double the size of the buffer whenever it overflows, and if I start with a buffer size of 4KB, then enlarging the buffer to 4MB takes no more than ten reallocations.

Even with the debugger running, I could append a million strings into a CatBuffer and have it take a little more than one second, while trying to concatenate the same million strings using the "+" operator took about two hours and forty minutes on my machine. CatBuffer's advantages over "+" are less obvious when you only have a hundred or a thousand strings to concatenate, when the job can be done in milliseconds.

Still, I'm happy with this. If anybody wants to tweak the code to make it more like what they think it should be, have at it. 8)
Karig
New User
New User
Posts: 7
Joined: Mon Jul 11, 2016 10:20 pm

Re: Faster string concatenation: Alternative method

Post by Karig »

Also, correct me if I'm wrong, but doesn't PureBasic use copy-on-write for strings passed to procedures? I did a quick-and-dirty test for creating two strings of different length and passing them to a procedure (which does nothing but return the length of the string). Even though string #1 was only 45 bytes long, and string #2 was almost 3MB long, the process of passing the string to the procedure, calling the procedure, and returning from the procedure took the same amount of time --- 7 milliseconds. (If the procedure had done something with the parameter other than read it, then PureBasic would have stopped to create a copy of the string so the procedure could modify it safely, which should have affected the time elapsed.)

Code: Select all

Procedure PassString(s$)
  ProcedureReturn Len(s$)
EndProcedure

DisableDebugger
s$ = "The quick brown fox jumps over the lazy dog." + #LF$
msec = ElapsedMilliseconds()
For i = 1 To 10000
  size1 = PassString(s$)
Next
msec = ElapsedMilliseconds() - msec

For i = 1 To 16
  s$ + s$
Next
msec2 = ElapsedMilliseconds()
For i = 1 To 10000
  size2 = PassString(s$)
Next
msec2 = ElapsedMilliseconds() - msec2

EnableDebugger
Debug "TEST 1: " + Str(size1) + "-byte string: " + Str(msec) + " msec"
Debug "TEST 2: " + Str(size2) + "-byte string: " + Str(msec) + " msec"
Result:

Code: Select all

TEST 1: 45-byte string: 7 msec
TEST 2: 2949120-byte string: 7 msec
User avatar
nco2k
Addict
Addict
Posts: 1344
Joined: Mon Sep 15, 2003 5:55 am

Re: Faster string concatenation: Alternative method

Post by nco2k »

if you want to do benchmarking, you have to physically disable the debugger. using DisableDebugger is not enough! you wrote Str(msec) twice. thats why you get the wrong results for your second run. i also dont really understand what your test is supposed to show?

if you want to measure the difference, then compare a string parameter vs a pointer parameter:

Code: Select all

Procedure p1(str$)
EndProcedure

Procedure p2(*str)
EndProcedure

txt$ = Space(1000)

t1.q = ElapsedMilliseconds()
For i1 = 1 To 1000000
  p1(txt$)
Next
t1 = ElapsedMilliseconds() - t1

t2.q = ElapsedMilliseconds()
For i2 = 1 To 1000000
  p2(@txt$)
Next
t2 = ElapsedMilliseconds() - t2

MessageRequester("Results", "str: "+Str(t1)+" ms"+#CRLF$+"ptr: "+Str(t2)+" ms")
c ya,
nco2k
If OSVersion() = #PB_OS_Windows_ME : End : EndIf
said
Enthusiast
Enthusiast
Posts: 342
Joined: Thu Apr 14, 2011 6:07 pm

Re: Faster string concatenation: Alternative method

Post by said »

nco2k wrote:your procedures string parameter is a copy of the string you just passed. you can modify it however you like, without affecting the original string:

Code: Select all

Procedure$ Test(String$)
  Protected *Character.Character = @String$ + SizeOf(Character)
  *Character\c = 0
  ProcedureReturn String$
EndProcedure

Test$ = "ABC"

Debug Test(Test$)
Debug Test$
which is good, and the desired effect, but the string-copy has to be created and it costs time. if you want to write "faster string functions", you will want to use pointers, and work with the original string instead. by doing that, you skip the whole string-copy part.

however, purebasics own functions that have string parameters, get them by reference, if im not mistaken. so StringByteLength() & MemoryStringLength() should be pretty much equal, speed wise. so thats not the issue. the issue is PokeS(). in order for the string to be copied, the function needs to know how long the string is, and therefor has to search for the null terminator first. even if you use the length parameter, the function still has to check for the null terminator, within the specified length, and adjust the parameter accordingly. now this is not purebasics fault. this is, in fact, the correct way. im also pretty sure that purebasic wraps around strcpy/wcscpy anyway.

CopyMemory() on the other hand, doesnt check anything. it simply copies n bytes from *src to *dst. since we already know the total length of the string, we can skip the whole null terminator search.

c ya,
nco2k
@nco2k, first these are not my procedures :!: and thanks for the explanation, makes sense :)

@Karig, having Debug instructions will enable the debugger and then no benchmarks are realistic enough when the debugger is on, the benchmarking as shown by nco2k using MessageRequester() is valid and thanks for sharing your simple and nice trick :D

Said
Steving
User
User
Posts: 11
Joined: Tue Sep 12, 2017 11:40 pm
Location: San Francisco Bay Area
Contact:

Re: Faster string concatenation: Alternative method

Post by Steving »

I can't believe how much faster it is than standard string concatenation. I measured the performance in a loop with 100,000 iterations adding 4 items per iteration, yielding a total string length of 2,188,895 characters with an elapsed milliseconds of 311. Standard "" + "" string concatenation took (yawn!) 555,746 Milliseconds with an elapsed wall time of 17 minutes. Thank you so much for your contribution. I'm going to make multiple copies so that I can use it to work on multiple strings at at time. I'll have a herd of Cat's when I'm done :D :D :D. This is going to save my bacon. Unfortunately with the API that I'm using I need strings... lots of strings...

Thanks again!

Stevinga
User avatar
skywalk
Addict
Addict
Posts: 3972
Joined: Wed Dec 23, 2009 10:14 pm
Location: Boston, MA

Re: Faster string concatenation: Alternative method

Post by skywalk »

The nice thing about standards is there are so many to choose from. ~ Andrew Tanenbaum
Post Reply