improving string lengths

Tenaja
Addict
Posts: 1959
Joined: Tue Nov 09, 2010 10:15 pm

Post by Tenaja »

A quick test makes it appear that strings consume 4 whole bytes per character:

Code:

Global stringstart.s
stringstart = "1"

Dim astring.s (9)		; String lengths are totally managed by PB. Strings only grow...
						; ...the ram use doesn't seem to shrink.
astring(1) = "2"
						
Global bstring.s
bstring = "3"		
Global endstring.s
endstring = "4"

Debug "Start at " + Str(@stringstart)
Debug "Array at " + Str(@astring)
Debug "next at " + Str(@bstring)
Debug "end at " + Str(@endstring)

Debug StringByteLength(bstring)

Debug "1:"

Debug "astring(1) Length " + Str(StringByteLength(astring(1)))
Debug "element 0 bytes " + Str(@astring(1) - @astring(0))		;no initializing...size = 0
Debug @astring(0)
Debug @astring(1)
Debug @astring(2)
Debug @astring(3)

Debug ""
Debug "2"

astring(0) = "1"	;changing does NOT affect size.
astring(1) = "a"
astring(2) = "b"
astring(3) = "c"

Debug "astring(1) Length " + Str(StringByteLength(astring(1)))
Debug "element 0 bytes " + Str(@astring(1) - @astring(0))		;no initializing...size = 0
Debug @astring(0)		;first init; size = 16 bytes
Debug @astring(1)		;first init; size = 16 bytes
Debug @astring(2)
Debug @astring(3)

Debug ""
Debug "3"

astring(0) = "d"
astring(1) = "e"
astring(2) = "f"
astring(3) = "g"

Debug "astring(1) Length " + Str(StringByteLength(astring(1)))
Debug "element 0 bytes " + Str(@astring(1) - @astring(0))		;no initializing...size = 0
Debug @astring(0)		;first change...still 16 bytes at original location
Debug @astring(1)		;first change...still 16 bytes at original location
Debug @astring(2)
Debug @astring(3)

Debug ""
Debug "4"

astring(0) = "d234"
astring(1) = "e234"			;4 causes increase (3 does not)
astring(2) = "f234"
astring(3) = "g234"

Debug "astring(1) Length " + Str(StringByteLength(astring(1)))
Debug "element 1 bytes " + Str(@astring(2) - @astring(1))		;no initializing...size = 0 (Use 1 because 0 is "managed" into previous location.)
Debug @astring(0)		; now size is 24
Debug @astring(1)		; now size is 24
Debug @astring(2)
Debug @astring(3)

Debug ""
Debug "5"

astring(0) = "d12345678902"
astring(1) = "a12345678902"	;more than 11 and it increases
astring(2) = "b12345678902"
astring(3) = "c12345678902"

Debug "astring(1) Length " + Str(StringByteLength(astring(1)))
Debug "element 0 bytes " + Str(@astring(1) - @astring(0))		;no initializing...size = 0
Debug @astring(0)			;...size is 32
Debug @astring(1)			;...size is 32
Debug @astring(2)
Debug @astring(3)
So, 3 characters take up 16 bytes, but 4 characters consume 24 bytes. That means 4 bytes per character, plus 4 bytes for the zero.
I'm not set to use Unicode (I'm on Windows). Why does each character take more than one byte? (Heck, why more than TWO???)
How do I set it to one byte/char as default? It seems grossly wasteful to use a full 64 bits for an 8-bit character. I could even sort of understand 32 bits, if you want to make it Unicode compatible...

But! The manual says :
Name           Extension     Memory consumption
String         .s            string length + 1
Fixed String   .s{Length}    string length
This certainly does not hold true.

The biggest reason people boast about PB is its small compact size, but if you are dealing with very large strings, 25% memory efficiency is intolerable. Any help would be appreciated.

Thanks.
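
One way to check the manual's "string length + 1" figure without relying on addresses is to ask PB for the byte length directly. This is only a minimal sketch, assuming nothing beyond the built-in Len, StringByteLength and SizeOf(Character); the variable name sample is just for illustration.

```purebasic
; Sketch: measure a string's payload directly instead of comparing addresses.
sample.s = "abc"

Debug Len(sample)                                   ; number of characters
Debug StringByteLength(sample, #PB_Ascii)           ; bytes if stored as ASCII, terminator not included
Debug StringByteLength(sample, #PB_Unicode)         ; bytes if stored as 2-byte Unicode, terminator not included
Debug SizeOf(Character)                             ; bytes per character in this build (1 for ASCII, 2 for Unicode)
Debug StringByteLength(sample) + SizeOf(Character)  ; payload in the build's own format plus the terminating null
```

This reports only what the string itself needs. Whatever the allocator adds on top is not visible through these functions.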
blueznl
PureBasic Expert
Posts: 6166
Joined: Sat May 17, 2003 11:31 am

Re: improving string lengths

Post by blueznl »

You got it pretty much wrong.

Try this:

Code:

For n = 1 To 1000
  x.s = Chr('a'+Random(25))+x.s   ; Random(25) returns 0..25, so only 'a' to 'z' are produced
Next n
a.s = ""
b.s = x.s
c.s = ""
Debug @a
Debug @b
Debug @c
Debug @c-@b
Check how much data the string uses: 1000 bytes plus a little overhead. I dunno what causes the overhead; it could be PB internals or the way memory is allocated. A check with a short string is definitely not the right way to go.

But it definitely is NOT 16 bytes per character.
luis
Addict
Posts: 3895
Joined: Wed Aug 31, 2005 11:09 pm
Location: Italy

Re: improving string lengths

Post by luis »

I don't understand how you can say an ASCII string needs 4 bytes per char.
I don't understand the meaning of all that code. Why do you subtract the start address of one array string element from that of the one before? Who told you the strings in an array must all sit one after another in memory (especially in a dynamic array)?
I don't understand your reasoning, but you are probably assuming something that is not true along the way.


EDIT: Removed "Sorry but" at the start of the first sentence.
Last edited by luis on Thu Oct 20, 2011 3:27 pm, edited 1 time in total.
kenmo
Addict
Posts: 2047
Joined: Tue Dec 23, 2003 3:54 am

Re: improving string lengths

Post by kenmo »

Yeah, your code is sort of misleading, because:

1. Strings are never guaranteed to be allocated contiguously, so subtracting addresses does not give accurate memory usage.
2. I believe the OS handles all allocated memory locations, not PB itself. PB just provides lengths. (Related to point 1.)
3. There DOES seem to be some overhead (usually 16 bytes on my system) as shown by this code below...

Code:

Dim String.s(4)

For i.i = 0 To 4
  String(i) = RSet("", 5000, Chr('A' + i))
Next i

For i = 0 To 4
  Debug "String " + Str(i)
  Debug "Byte Len: " + Str(StringByteLength(String(i)))
  If i > 0
    Debug "Diff: " + Str(@String(i) - @String(i-1))
  EndIf
  Debug ""
Next i
BUT this small overhead is negligible for very large strings. (It's actually more of a concern with many small strings.) And it is probably there for:

Memory alignment / speed reasons?
Internally-used PB header / bookkeeping?
Allow for slight string changes without reallocation?

Those are just guesses.
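
To put the "negligible for very large strings" point in numbers: with a fixed overhead of 16 bytes (the figure kenmo reports on his system, taken here purely as an assumption), a 5000-byte string carries 16/5000 = 0.32% overhead, while a 4-byte string carries 16/4 = 400%. A small sketch of that arithmetic, where OverheadPercent is a hypothetical helper and not a PureBasic function:

```purebasic
; Hypothetical helper: overhead expressed as a percentage of the payload size.
Procedure.d OverheadPercent(payloadBytes.i, overheadBytes.i)
  ProcedureReturn 100.0 * overheadBytes / payloadBytes
EndProcedure

; 16 bytes of overhead is an assumed figure, not a guaranteed constant.
Debug StrD(OverheadPercent(5000, 16), 2)   ; large string: a fraction of a percent
Debug StrD(OverheadPercent(4, 16), 2)      ; tiny string: the overhead dwarfs the payload
```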
Tenaja
Addict
Posts: 1959
Joined: Tue Nov 09, 2010 10:15 pm

Re: improving string lengths

Post by Tenaja »

blueznl wrote: You got it pretty much wrong.
No, I didn't, unless the Debug Output got it wrong. However, you gave me an intelligent response, so...I "tried this", as you suggested...
blueznl wrote: Try this:

Code:

For n = 1 To 1000
  x.s = Chr('a'+Random(26))+x.s
Next n
a.s = ""
b.s = x.s
c.s = ""
Debug @a
Debug @b
Debug @c
Debug @c-@b
blueznl wrote: Check how much data the string uses: 1000 bytes plus a little overhead. I dunno what causes the overhead; it could be PB internals or the way memory is allocated. A check with a short string is definitely not the right way to go.
On MY system, I get a result of 33kB! (ok, 33,112 bytes, to be exact.) So, maybe with large strings it's 20% more efficient, but still pretty poor, and unacceptable.

As it turns out with this project I am working with large quantities of small strings.
blueznl wrote: But it definitely is NOT 16 bytes per character.
BTW, I did not say 16 bytes per char...I said 4.
wilbert
PureBasic Expert
Posts: 3943
Joined: Sun Aug 08, 2004 5:21 am
Location: Netherlands

Re: improving string lengths

Post by wilbert »

As said before, the whole idea that you can check memory usage this way makes no sense at all.
From the code mentioned before

Code:

For n = 1 To 1000
  x.s = Chr('a'+Random(26))+x.s
Next n
a.s = ""
b.s = x.s
c.s = ""
Debug @a
Debug @b
Debug @c
Debug @c-@b
On OSX, the output of the last line is different each time.
Here's one, for example:
-8228512
Seems pretty efficient to me, it uses a negative amount of memory.
Fred
Administrator
Posts: 18279
Joined: Fri May 17, 2002 4:39 pm
Location: France

Re: improving string lengths

Post by Fred »

To allocate a string, PB uses the standard system allocator (a private heap on Windows, malloc() on Linux/OS X). Memory returned by these allocators is often 16-byte aligned, which means that even a string with 1 character will take 16 bytes in reality. For bigger strings, the overhead will be much less: for a 4k string, at most 15 bytes could be wasted, which is OK. This can't be avoided, except by writing your own heap, which is probably not a good idea, as the OS is optimized for this kind of service.
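
As a rough illustration of the rounding described here, the padding can be modelled by rounding each request up to the next multiple of 16. This is only a sketch: it assumes a 16-byte granularity, ignores any bookkeeping header the allocator may keep, and RoundUp16 is a hypothetical helper rather than anything PB or the OS provides.

```purebasic
; Hypothetical model: round a requested size up to the next multiple of 16.
Procedure.i RoundUp16(bytes.i)
  ProcedureReturn (bytes + 15) & ~15
EndProcedure

Debug RoundUp16(2)      ; 1 ASCII character + terminator: 2 bytes requested, 16 reserved
Debug RoundUp16(4097)   ; 4096 characters + terminator: 4112 reserved, 15 bytes of padding
```

Under this model the padding never exceeds 15 bytes, which is why it matters for one-character strings and hardly at all for a 4k string.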
buddymatkona
Enthusiast
Posts: 252
Joined: Mon Aug 16, 2010 4:29 am

Re: improving string lengths

Post by buddymatkona »

Into a PB program with a working memory set of 40 MB, I read a file of equal size.

Code:

; Reads a whole file into Text$ through a temporary buffer.
; The buffer gets two extra (zero-filled) bytes so PeekS always finds a terminator.
Macro FileToString(FileSpec)
  StreamIn = ReadFile(#PB_Any, FileSpec)
  If StreamIn
    Length = Lof(StreamIn)
    *MemFile = AllocateMemory(Length + 2)
    If Not *MemFile : Debug "AllocateMemory failed" : End : EndIf
    ReadData(StreamIn, *MemFile, Length)
    CloseFile(StreamIn)
    Text$ = PeekS(*MemFile)
    FreeMemory(*MemFile)
  Else
    Debug "Problem reading " + FileSpec : End
  EndIf
EndMacro

SourceDir$ = "D:\PBTest\"
SourceName$ = "Unicode39.txt" ; 39 MB unicode file 
FileToString(SourceDir$ + SourceName$) ;----Read File Into Text$

Results:

Step                            Memory working set (MB)
No file read                     40
File to *MemFile                 78
File to *MemFile plus Text$     156
Keep Text$ but free *MemFile    118
Tenaja
Addict
Posts: 1959
Joined: Tue Nov 09, 2010 10:15 pm

Re: improving string lengths

Post by Tenaja »

Fred wrote: To allocate a string, PB uses the standard system allocator (a private heap on Windows, malloc() on Linux/OS X). Memory returned by these allocators is often 16-byte aligned, which means that even a string with 1 character will take 16 bytes in reality. For bigger strings, the overhead will be much less: for a 4k string, at most 15 bytes could be wasted, which is OK. This can't be avoided, except by writing your own heap, which is probably not a good idea, as the OS is optimized for this kind of service.
It appears that the best way to have efficient memory use with short strings is to use fixed-length strings:

dim astring.s{16} (1000)

...since this 16-byte string array consumes less memory than a managed-size array holding 4 or more bytes, given the managing overhead.

Thankfully, you fixed the bug that prevented their use with one of the recent releases.
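
A rough sketch of the fixed-length approach, written here with a structure so that SizeOf can report the per-element size. The names ShortText and items are made up for illustration, and the per-element size is expected (not verified here) to be 16 * SizeOf(Character), so it doubles in a Unicode build.

```purebasic
; Sketch: a record holding a fixed-length string that is stored inline in the array,
; so there is no separate heap block (and no allocator padding) per element.
Structure ShortText
  text.s{16}          ; room for 16 characters
EndStructure

Dim items.ShortText(1000)        ; 1001 elements, indices 0 to 1000
items(0)\text = "hello"

Debug SizeOf(ShortText)                              ; bytes per element
Debug SizeOf(ShortText) * (ArraySize(items()) + 1)   ; bytes for all elements together
Debug items(0)\text
```

The trade-off is that every element reserves the full 16 characters, so this only pays off when most strings are short but not tiny.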
Tenaja
Addict
Posts: 1959
Joined: Tue Nov 09, 2010 10:15 pm

Re: improving string lengths

Post by Tenaja »

BTW, Fred, would you be so kind as to share what the "15 bytes (at most)" of overhead contains?

thanks.
Shield
Addict
Posts: 1021
Joined: Fri Jan 21, 2011 8:25 am
Location: 'stralia!

Re: improving string lengths

Post by Shield »

Probably nothing, as the OS just aligns memory blocks to addresses divisible by 16 for faster access. :)
Fred
Administrator
Posts: 18279
Joined: Fri May 17, 2002 4:39 pm
Location: France

Re: improving string lengths

Post by Fred »

yes, just OS overhead
wilbert
PureBasic Expert
PureBasic Expert
Posts: 3943
Joined: Sun Aug 08, 2004 5:21 am
Location: Netherlands

Re: improving string lengths

Post by wilbert »

It even makes a difference for the cpu.
For example, the x86 instruction movdqa, which moves 16 bytes to or from an SSE register using a 16-byte-aligned address, is faster than the movdqu instruction, which does the same for unaligned memory.
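
Whether a given block actually lands on a 16-byte boundary can be checked from the address itself. A minimal sketch, where IsAligned16 is a hypothetical helper; the results depend on the OS allocator and the PB version, so no particular output is implied.

```purebasic
; Hypothetical helper: returns #True when the address is a multiple of 16.
Procedure.i IsAligned16(*address)
  If *address & 15
    ProcedureReturn #False
  EndIf
  ProcedureReturn #True
EndProcedure

text.s = "hello"
*block = AllocateMemory(100)

Debug IsAligned16(@text)       ; address of the string's character data
If *block
  Debug IsAligned16(*block)    ; address of a block from AllocateMemory
  FreeMemory(*block)
EndIf
```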
Tenaja
Addict
Posts: 1959
Joined: Tue Nov 09, 2010 10:15 pm

Re: improving string lengths

Post by Tenaja »

wilbert wrote: It even makes a difference for the cpu.
For example, the x86 instruction movdqa, which moves 16 bytes to or from an SSE register using a 16-byte-aligned address, is faster than the movdqu instruction, which does the same for unaligned memory.
That explains the minimum allocation size (which is already obvious) but not the overhead contents.
Zach
Addict
Posts: 1676
Joined: Sun Dec 12, 2010 12:36 am
Location: Somewhere in the midwest

Re: improving string lengths

Post by Zach »

I interpreted Fred's reply as a confirmation of "Yes, it is OS overhead. There is nothing actually there".