Strip/replace double or multiple spaces with single

Share your advanced PureBasic knowledge/code with the community.
User avatar
Keya
Addict
Addict
Posts: 1890
Joined: Thu Jun 04, 2015 7:10 am

Re: Strip/replace double or multiple spaces with single

Post by Keya »

Hi Gadget thankyou, thats fairly similar to the original PB one in the first post which didnt perform too badly, but you're using CountString() ... in the original im using FindString(), which im guessing is faster because it stops at the first match whereas Count does the entire string? :)
Also "If NewText <> """ might be redundant because CountString/FindString should pretty much return immediately on empty string?

wilbert,
ahhhh i think i see now, so after movzx youre able to do things like cmp eax instead of cmp al. i like!!!
Last edited by Keya on Fri Nov 13, 2015 7:42 pm, edited 2 times in total.
Gadget
User
User
Posts: 39
Joined: Tue Mar 11, 2014 8:11 pm
Location: UK

Re: Strip/replace double or multiple spaces with single

Post by Gadget »

Yep, I think you're right on both counts. I think I'll investigate further and update my routine.

As always, I'm really pleased to be able to improve my coding from the feedback of more experience folks such as yourself. Thanks for letting me know your thoughts.

Cheers,

Gadget
Windows 10 and PB 5.73 (both x64)
Little John
Addict
Addict
Posts: 4777
Joined: Thu Jun 07, 2007 3:25 pm
Location: Berlin, Germany

Re: Strip/replace double or multiple spaces with single

Post by Little John »

Hello Keya,

thanks for your comprehensive reply.
I can't really comment on it because I don't know enough about this topic.
However, what you write sounds reasonable. :-)
IdeasVacuum
Always Here
Always Here
Posts: 6426
Joined: Fri Oct 23, 2009 2:33 am
Location: Wales, UK
Contact:

Re: Strip/replace double or multiple spaces with single

Post by IdeasVacuum »

Is StringField really so bad? In your main app, the procedure that replaces multi spaces could be run in a thread.

Another angle - remove all the spaces, then replace them with a single space:

Code: Select all

;DisableDebugger

Procedure.s RemoveMultiSpace(sText.s)
;#-----------------------------------
     sText = LTrim(sText)
   sTemp.s = ""
    iCnt.i = 1
    iSub.i = 1
  iTotal.i = CountString(sText, Chr(32))
Dim sSub.s(iTotal)

Repeat
                  sTemp = Trim(StringField(sText, iCnt, Chr(32)))
           If(Len(sTemp) > 0)
             sSub(iSub) = sTemp
                   iSub = iSub + 1
           EndIf

           iCnt = iCnt + 1

Until iCnt = iTotal

sText = ""
 iCnt = 1

Repeat
             sText = sText + sSub(iCnt) + Chr(32)
              iCnt = iCnt + 1

Until iCnt = iSub

ProcedureReturn(Trim(sText))

EndProcedure

  iStartTime.i = ElapsedMilliseconds()
    sTrimmed.s = RemoveMultiSpace("  Hello        I     am   a   splitted string    Hello      I   am   a   splitted string    Hello      I   am   a   splitted string  ")
iElapsedTime.i = (ElapsedMilliseconds() - iStartTime)
;Debug sTrimmed
;Debug iElapsedTime
End
[/size]
IdeasVacuum
If it sounds simple, you have not grasped the complexity.
User avatar
Keya
Addict
Addict
Posts: 1890
Joined: Thu Jun 04, 2015 7:10 am

Re: Strip/replace double or multiple spaces with single

Post by Keya »

John,
just an update, I dont know about Linux or Mac but i found out today that PB-Windows uses RtlAllocateHeap for strings, giving me even more confidence that it doesnt matter what data is in there (null-filled-binary or string) or whether we truncate with a floating nullchar - and we don't need to ever restore the nullchar to its original location either... all that should matter is staying within the size/boundary of the original allocation, so it's looking good without 100% confirmation heehee :)

--

IdeasVacuum,
IdeasVacuum wrote:In your main app, the procedure that replaces multi spaces could be run in a thread.
mm but it doesn't matter if you launch Secondary Thread and have Primary Thread doing nothing waiting for it, the timing will be the same, actually slightly worse by a few ms due to extra overhead, so just for the context of this exercise we probably dont need threads :)

I'll try to add timings for your algorithm shortly! Thankyou :)
IdeasVacuum wrote:Is StringField really so bad?
I havent thoroughly tested StringField on its own (but seeing as i had to stop the previous StringField test after 60 minutes in a test that took the other solutions ~30 secs i'd say yes perhaps it is really bad!), but another problem that compounds it is the string concatenation in the loop thats essentially required for StringField when used in that way. Because they're null-terminated strings, simply to append 1 byte requires scanning through the entire string buffer first, just like Len() would do :(

Code: Select all

sTest.s = Space(2000000)
Time1 = ElapsedMilliseconds()
For i = 1 To 10000
  ilen.i = Len(sTest)
Next i
Time2 = ElapsedMilliseconds()
Debug(StrF((Time2-Time1)/10000,3) + " per Len() call")
On my system doing a loop 65535 times to test Len average
Len(Space(50,000)) takes 0.015
Len(Space(100,000)) takes 0.030
Len(Space(500,000)) takes 0.150
Len(Space(1,000,000)) takes 0.302
Len(Space(2,000,000)) takes 0.613
(That's just the timings for the Len(), the building with Space() is actually done prior)

Reallocations are also slow!...

The following test simply starts with an empty "" string buffer, and keeps appending one "x" byte to the string in a loop up to 3,000,000 bytes ... the output shows you when the address change (when the entire buffer has been reallocated):

Code: Select all

s1.s = "BEFORE"
sTest.s = ""
s3.s = "AFTER"

NewLen = Len(sTest)
Debug( FormatDate("%hh:%ii:%ss", Date()) + " @ 0x"+Hex(@sTest) + " Len=" + Str(NewLen) )
LastAddr = @sTest
LastSize = NewLen    

For i = 1 To 2100000
  If Mod(i,1000000) = 0
    Debug( FormatDate("%hh:%ii:%ss", Date()) + " - " + Str(i) + "..." )
  EndIf
  sTest + "x"   ;append 1 byte each loop
  If LastAddr.i <> @sTest
    NewLen = Len(sTest)
    Debug( FormatDate("%hh:%ii:%ss", Date()) + " @ 0x"+Hex(@sTest) + " Len=" + Str(NewLen) + " SizeDiff=" + Str(NewLen-LastSize) )
    LastAddr = @sTest:   LastSize = NewLen    
  EndIf
Next i
Im testing with PB 5.40, Win32... something interesting happens at exactly half a megabyte, where the entire buffer is being reallocated after every 4096 bytes appended:
; 1:26:12 @ 0x391EC0 Len=0
; 1:26:12 @ 0x391E90 Len=4 SizeDiff=4
; 1:26:12 @ 0x396EF8 Len=12 SizeDiff=8
; 1:26:12 @ 0x1F76060 Len=37124 SizeDiff=37112
; 1:26:13 @ 0x391EE8 Len=53236 SizeDiff=16112
; 1:26:13 @ 0x1F60048 Len=57620 SizeDiff=4384
; 1:26:15 @ 0x1F8E040 Len=85996 SizeDiff=28376
; 1:26:17 @ 0x1F60048 Len=102380 SizeDiff=16384
; 1:26:18 @ 0x1F9E030 Len=118756 SizeDiff=16376
; 1:26:20 @ 0x1F60048 Len=135140 SizeDiff=16384
; 1:26:23 @ 0x1FAE020 Len=151516 SizeDiff=16376
; 1:26:25 @ 0x1F60048 Len=167900 SizeDiff=16384
; 1:26:28 @ 0x1FBE010 Len=184276 SizeDiff=16376
; 1:26:31 @ 0x1F60048 Len=200660 SizeDiff=16384
; 1:26:35 @ 0x1FCE000 Len=217036 SizeDiff=16376
; 1:26:38 @ 0x1F60048 Len=233420 SizeDiff=16384
; 1:26:42 @ 0x1FDDFF0 Len=249796 SizeDiff=16376
; 1:28:24 @ 0x22F0020 Len=520180 SizeDiff=270384 ;520180... hmm, half a mb is 524288 so thats 4108 bytes or 4096+12 (boundary tags) short :) perhaps a hardcoded PB or Windowsmemory constant? PB-Win uses RtlHeapAllocate() for both strings + AllocateMemory()
; 1:28:26 @ 0x2370020 Len=524244 SizeDiff=4064 ;every time from here on that we append 4096 bytes the full buffer is reallocated to new address
; 1:28:28 @ 0x2260020 Len=528340 SizeDiff=4096
;...every 4096...
; 1:29:06 @ 0x2260020 Len=597972 SizeDiff=4096
; 1:29:09 @ 0x2300020 Len=602068 SizeDiff=4096
; 1:29:11 @ 0x2260020 Len=606164 SizeDiff=4096
; 1:29:14 @ 0x23A0020 Len=610260 SizeDiff=4096
; 1:29:16 @ 0x2260020 Len=614356 SizeDiff=4096
; 1:29:19 @ 0x23A0020 Len=618452 SizeDiff=4096
; 1:29:21 @ 0x2260020 Len=622548 SizeDiff=4096
; 1:29:24 @ 0x2300020 Len=626644 SizeDiff=4096
; 1:29:26 @ 0x2260020 Len=630740 SizeDiff=4096 ;notice how it often changes back from one address to the next, then back again for a few times, recycling the region it previously freed - this appears to bethe coalescing boundary tags at work...
; 1:29:29 @ 0x2300020 Len=634836 SizeDiff=4096
; 1:29:31 @ 0x2260020 Len=638932 SizeDiff=4096
; 1:29:34 @ 0x2300020 Len=643028 SizeDiff=4096
; 1:29:37 @ 0x2260020 Len=647124 SizeDiff=4096
;... (it's consistently reallocating the full buffer every 4096 bytes from here, but just showing every 100000 bytes for brevity) ...
; 1:31:27 @ 0x2400020 Len=802772 SizeDiff=4096
;...
; 1:32:50 @ 0x2260020 Len=901076 SizeDiff=4096
;...
; 1:34:22 @ 0x2260020 Len=999380 SizeDiff=4096
; 1:34:22 - 1000000...
; 1:34:26 @ 0x2460020 Len=1003476 SizeDiff=4096
;...
; 1:37:59 @ 0x2260020 Len=1200084 SizeDiff=4096
;...
; 1:40:06 @ 0x2260020 Len=1302484 SizeDiff=4096
;...
; 1:42:18 @ 0x2260020 Len=1400788 SizeDiff=4096
;...
; 1:44:46 @ 0x2260020 Len=1503188 SizeDiff=4096
;...
; 1:59:12 @ 0x2260020 Len=1994708 SizeDiff=4096
; 1:59:20 @ 0x2640020 Len=1998804 SizeDiff=4096
; 1:59:22 - 2000000...
; 1:59:28 @ 0x2450020 Len=2002900 SizeDiff=4096
; 1:59:37 @ 0x2640020 Len=2006996 SizeDiff=4096
...
By the stage the string is 2,000,000 in size it's taking 9 seconds just to append 4096 single bytes (debugger is enabled though)

LINUX 64, somewhat similar with the every-4096-bytes reallocations...
1:04:16 @ 0x1EFC970 Len=0
1:04:16 @ 0x1EFCA20 Len=20 SizeDiff=20
1:04:16 @ 0x1EFCAB0 Len=36 SizeDiff=16
1:04:16 @ 0x1F022C0 Len=84 SizeDiff=48
1:04:16 @ 0x1F102E0 Len=20484 SizeDiff=20400
1:04:16 @ 0x1EFD2B0 Len=36868 SizeDiff=16384
1:04:18 @ 0x1F362F0 Len=114612 SizeDiff=77744
1:04:23 @ 0x7F4ECD27B010 Len=249076 SizeDiff=134464
1:04:23 @ 0x7F4ECD23D010 Len=249828 SizeDiff=752 ;now a full minute goes by before the next allocation as its found a nice chunk of free memory half a megabyte in size ...
1:05:25 @ 0x7F4EC9187010 Len=794596 SizeDiff=544768
1:05:26 @ 0x7F4ECD23B010 Len=798692 SizeDiff=4096
... now every 4096 ...
1:06:05 @ 0x7F4EC9155010 Len=999396 SizeDiff=4096
1:06:05 - 1000000...
1:06:06 @ 0x7F4ECD209010 Len=1003492 SizeDiff=4096
...still every 4096...
1:06:59 @ 0x7F4ECD1D3010 Len=1224676 SizeDiff=4096
1:07:00 @ 0x7F4EC911D010 Len=1228772 SizeDiff=4096 ;but here it's found a large chunk of memory and goes from 1228772 to 2000000+ without reallocation
1:11:33 - 2000000...
1:15:29 @ 0x7F4EC8C03010 Len=2469860 SizeDiff=1241088 ;so it was in a 1mb free buffer, but now back to reallocating every 4096 bytes...
1:15:31 @ 0x7F4EC911B010 Len=2473956 SizeDiff=4096
1:15:33 @ 0x7F4EC8C01010 Len=2478052 SizeDiff=4096
1:15:36 @ 0x7F4EC9119010 Len=2482148 SizeDiff=4096
...
wilbert
PureBasic Expert
PureBasic Expert
Posts: 3942
Joined: Sun Aug 08, 2004 5:21 am
Location: Netherlands

Re: Strip/replace double or multiple spaces with single

Post by wilbert »

Keya wrote:I havent thoroughly tested StringField on its own (but seeing as i had to stop the previous StringField test after 60 minutes in a test that took the other solutions ~30 secs i'd say yes perhaps it is really bad!)
The problem is, that it keeps starting at the beginning every time.
The asm procedures count like this ...
1, 2, 3, 4, 5
The StringField based procedure counts like this ...
1, 12, 123, 1234, 12345
So the function gets slower for every additional field. If you have a very long string and need to process all fields, it can take some time. In most practical cases however it might be fast enough.
Most modern languages have a function to split a string into an array of strings and a function to combine an array of strings into one large string which makes things faster but also needs more memory.
Windows (x64)
Raspberry Pi OS (Arm64)
User avatar
Keya
Addict
Addict
Posts: 1890
Joined: Thu Jun 04, 2015 7:10 am

Re: Strip/replace double or multiple spaces with single

Post by Keya »

Updated timings with IdeasVacuum's algorithm :) unfortunately it suffers from the same problem as Rashad's with long strings, but is fine for short strings
IdeasVacuum
Always Here
Always Here
Posts: 6426
Joined: Fri Oct 23, 2009 2:33 am
Location: Wales, UK
Contact:

Re: Strip/replace double or multiple spaces with single

Post by IdeasVacuum »

Most modern languages have a function to split a string into an array of strings
That sounds like an enhancement request.......
IdeasVacuum
If it sounds simple, you have not grasped the complexity.
IdeasVacuum
Always Here
Always Here
Posts: 6426
Joined: Fri Oct 23, 2009 2:33 am
Location: Wales, UK
Contact:

Re: Strip/replace double or multiple spaces with single

Post by IdeasVacuum »

Is Mid() faster or slower than StringField()?

Code: Select all

;DisableDebugger

Procedure.s RemoveMultiSpace(sText.s)
;#-----------------------------------
Protected sText2.s = LTrim(sText), sTemp.s = ""
Protected  iPosn.i = 1, iSub.i = 1, iCnt.i = 1, iSubInc.i = #False
Protected iTotal.i = Len(sText2)
Protected Dim sSub.s(iTotal)

Repeat
                  sTemp = Trim(Mid(sText2, iPosn, 1))
           If(Len(sTemp) > 0)

                sSub(iSub) = sSub(iSub) + sTemp
                   iSubInc = #True
           Else
                  If(iSubInc = #True)

                            iSub = iSub + 1
                         iSubInc = #False
                  EndIf
           EndIf

           iPosn = iPosn + 1

Until iPosn = iTotal

sText = ""

Repeat
             sText = sText + sSub(iCnt) + Chr(32)
              iCnt = iCnt + 1

Until iCnt = iSub

ProcedureReturn(Trim(sText))

EndProcedure

      sTrimmed.s = RemoveMultiSpace("  Hello        I     am   a   splitted string    Hello      I   am   a   splitted string    Hello      I   am   a   splitted string  ")
Debug sTrimmed

End
[/size]
IdeasVacuum
If it sounds simple, you have not grasped the complexity.
Post Reply