
improving string lengths

Posted: Tue Oct 18, 2011 6:30 pm
by Tenaja
A quick test makes it appear that strings consume 4 whole bytes per character:

Code:

Global stringstart.s
stringstart = "1"

Dim astring.s (9)		; String lengths are totally managed by PB. Strings only grow...
						; ...the ram use doesn't seem to shrink.
astring(1) = "2"
						
Global bstring.s
bstring = "3"		
Global endstring.s
endstring = "4"

Debug "Start at " + Str(@stringstart)
Debug "Array at " + Str(@astring)
Debug "next at " + Str(@bstring)
Debug "end at " + Str(@endstring)

Debug StringByteLength(bstring)

Debug "1:"

Debug "astring(1) Length " + Str(StringByteLength(astring(1)))
Debug "element 0 bytes " + Str(@astring(1) - @astring(0))		;no initializing...size = 0
Debug @astring(0)
Debug @astring(1)
Debug @astring(2)
Debug @astring(3)

Debug ""
Debug "2"

astring(0) = "1"	;changing does NOT affect size.
astring(1) = "a"
astring(2) = "b"
astring(3) = "c"

Debug "astring(1) Length " + Str(StringByteLength(astring(1)))
Debug "element 0 bytes " + Str(@astring(1) - @astring(0))		;no initializing...size = 0
Debug @astring(0)		;first init; size = 16 bytes
Debug @astring(1)		;first init; size = 16 bytes
Debug @astring(2)
Debug @astring(3)

Debug ""
Debug "3"

astring(0) = "d"
astring(1) = "e"
astring(2) = "f"
astring(3) = "g"

Debug "astring(1) Length " + Str(StringByteLength(astring(1)))
Debug "element 0 bytes " + Str(@astring(1) - @astring(0))		;no initializing...size = 0
Debug @astring(0)		;first change...still 16 bytes at original location
Debug @astring(1)		;first change...still 16 bytes at original location
Debug @astring(2)
Debug @astring(3)

Debug ""
Debug "4"

astring(0) = "d234"
astring(1) = "e234"			;4 causes increase (3 does not)
astring(2) = "f234"
astring(3) = "g234"

Debug "astring(1) Length " + Str(StringByteLength(astring(1)))
Debug "element 1 bytes " + Str(@astring(2) - @astring(1))		;no initializing...size = 0 (Use 1 because 0 is "managed" into previous location.)
Debug @astring(0)		; now size is 24
Debug @astring(1)		; now size is 24
Debug @astring(2)
Debug @astring(3)

Debug ""
Debug "5"

astring(0) = "d12345678902"
astring(1) = "a12345678902"	;more than 11 and it increases
astring(2) = "b12345678902"
astring(3) = "c12345678902"

Debug "astring(1) Length " + Str(StringByteLength(astring(1)))
Debug "element 0 bytes " + Str(@astring(1) - @astring(0))		;no initializing...size = 0
Debug @astring(0)			;...size is 32
Debug @astring(1)			;...size is 32
Debug @astring(2)
Debug @astring(3)
So, 3 characters take up 16 bytes, but 4 characters consume 24 bytes. That means 4 bytes per character, plus 4 bytes for the zero.
I'm not set to use Unicode (I'm on Windows). Why does each character take more than one byte? (Heck, why more than TWO???)
How do I set it to one byte per character by default? It seems grossly wasteful to use a full 64 bits for an 8-bit character. I could even sort of understand 32 bits, if you want to make it Unicode compatible...

But! The manual says:

Name          Extension    Memory consumption
String        .s           string length + 1
Fixed String  .s{Length}   string length

This certainly does not hold true.

The biggest reason people boast about PB is its small compact size, but if you are dealing with very large strings, 25% memory efficiency is intolerable. Any help would be appreciated.

Thanks.

Re: improving string lengths

Posted: Tue Oct 18, 2011 7:18 pm
by blueznl
You got it pretty much wrong.

Try this:

Code:

For n = 1 To 1000
  x.s = Chr('a'+Random(25))+x.s ; Random(25) yields 0..25, i.e. 'a'..'z'
Next n
a.s = ""
b.s = x.s
c.s = ""
Debug @a
Debug @b
Debug @c
Debug @c-@b
Check how much data is used by the string? 1000 bytes plus a little overhead. I dunno what causes the overhead, but it could be either PB internals, the way memory is allocated, I dunno. A check with a short string is definitely not the right way to go.

But it definitely is NOT 16 bytes per character.

Re: improving string lengths

Posted: Tue Oct 18, 2011 7:20 pm
by luis
I don't understand how you can say an ASCII string needs 4 bytes per char.
I don't understand the meaning of all that code. Why do you subtract the start address of an array string element from the one before? Who told you the strings in an array must all sit one after another in memory (especially in a dynamic array)?
I don't understand your reasoning, but you are probably assuming something that is not true along the way.


EDIT: Removed "Sorry but" at the start of the first sentence.

Re: improving string lengths

Posted: Tue Oct 18, 2011 7:34 pm
by kenmo
Yeah, your code is sort of misleading, because:

1. Strings are never guaranteed to be allocated contiguously, so subtracting addresses does not give accurate memory usage.
2. I believe the OS handles all allocated memory locations, not PB itself. PB just provides lengths. (Related to point 1.)
3. There DOES seem to be some overhead (usually 16 bytes on my system) as shown by this code below...

Code:

Dim String.s(4)

For i.i = 0 To 4
  String(i) = RSet("", 5000, Chr('A' + i))
Next i

For i = 0 To 4
  Debug "String " + Str(i)
  Debug "Byte Len: " + Str(StringByteLength(String(i)))
  If i > 0
    Debug "Diff: " + Str(@String(i) - @String(i-1))
  EndIf
  Debug ""
Next i
BUT this small overhead is negligible for very large strings. (It's actually more of a concern with many small strings.) And it is probably there for:

Memory alignment / speed reasons?
Internally-used PB header / bookkeeping?
Allow for slight string changes without reallocation?

Those are just guesses.

Re: improving string lengths

Posted: Wed Oct 19, 2011 8:07 am
by Tenaja
blueznl wrote:You got it pretty much wrong.
No, I didn't, unless the Debug Output got it wrong. However, you gave me an intelligent response, so...I "tried this", as you suggested...
blueznl wrote:Try this:

Code:

For n = 1 To 1000
  x.s = Chr('a'+Random(25))+x.s ; Random(25) yields 0..25, i.e. 'a'..'z'
Next n
a.s = ""
b.s = x.s
c.s = ""
Debug @a
Debug @b
Debug @c
Debug @c-@b
blueznl wrote:Check how much data is used by the string? 1000 bytes plus a little overhead. I dunno what causes the overhead, but it could be either PB internals, the way memory is allocated, I dunno. A check with a short string is definitely not the right way to go.
On MY system, I get a result of 33kB! (ok, 33,112 bytes, to be exact.) So, maybe with large strings it's 20% more efficient, but still pretty poor, and unacceptable.

As it turns out with this project I am working with large quantities of small strings.
blueznl wrote:But it definitely is NOT 16 bytes per character.
BTW, I did not say 16 bytes per char...I said 4.

Re: improving string lengths

Posted: Wed Oct 19, 2011 9:04 am
by wilbert
As said before, the whole idea that you can check memory usage this way makes no sense at all.
From the code mentioned before

Code:

For n = 1 To 1000
  x.s = Chr('a'+Random(25))+x.s ; Random(25) yields 0..25, i.e. 'a'..'z'
Next n
a.s = ""
b.s = x.s
c.s = ""
Debug @a
Debug @b
Debug @c
Debug @c-@b
On OSX, the output of the last line is different each time.
Here's one for example
-8228512
Seems pretty efficient to me, it uses a negative amount of memory :shock:

Re: improving string lengths

Posted: Wed Oct 19, 2011 10:53 am
by Fred
To allocate a string, PB uses the standard system allocator (private heap on Windows, malloc() on Linux/OS X). Memory returned by the allocators is often 16-byte aligned, which means that even a string with 1 character will take 16 bytes in reality. For bigger strings, the overhead will be much less: for a 4k string, only 15 bytes could be wasted (at most), which is OK. This can't be avoided, except by writing your own heap, which is probably not a good idea, as the OS is optimized for this kind of service.
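
The rounding described above can be sketched in a few lines. This is only an illustration: it assumes the allocator rounds every request up to the next multiple of 16 and ignores any per-block bookkeeping the OS heap may add. RoundUp16 is a hypothetical helper, not a PB function.

```purebasic
; Sketch only: assumes each request is rounded up to a multiple of 16 bytes.
; RoundUp16 is a hypothetical helper name, not part of PureBasic.
Procedure.i RoundUp16(bytes.i)
  ProcedureReturn (bytes + 15) & ~15
EndProcedure

Debug RoundUp16(2)            ; 1 ASCII character + terminating zero -> 16
Debug RoundUp16(17)           ; 16 characters + terminating zero -> 32
Debug RoundUp16(4097) - 4097  ; 4096 characters + terminating zero -> 15 bytes of padding
```

Under that assumption the padding is never more than 15 bytes per string, so it only matters when there are many very short strings.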

Re: improving string lengths

Posted: Wed Oct 19, 2011 12:39 pm
by buddymatkona
Into a PB program with a working memory set of 40 MB, I read a file of equal size.

Code:

; Reads the whole file into memory, then copies it into Text$.
; The buffer gets two extra zero bytes so PeekS() always finds a terminator
; (AllocateMemory returns zero-filled memory).
Macro FileToString(FileSpec)
  StreamIn = ReadFile(#PB_Any, FileSpec)
  If StreamIn
    Length = Lof(StreamIn)
    *MemFile = AllocateMemory(Length + 2)
    If Not *MemFile : Debug "AllocateMemory failed" : End : EndIf
    ReadData(StreamIn, *MemFile, Length)
    CloseFile(StreamIn)
    Text$ = PeekS(*MemFile)
    FreeMemory(*MemFile)
  Else
    Debug "Problem reading " + FileSpec
    End
  EndIf
EndMacro

SourceDir$ = "D:\PBTest\"
SourceName$ = "Unicode39.txt" ; 39 MB unicode file
FileToString(SourceDir$ + SourceName$) ;----Read File Into Text$

Code:

            Memory Working Set  (MB)
No File Read                     40
File to *MemFile                 78
File to *MemFile Plus Text$     156
Keep Text$ but Free *MemFile    118

Re: improving string lengths

Posted: Wed Oct 19, 2011 3:49 pm
by Tenaja
Fred wrote:To allocate a string, PB uses the standard system allocator (private heap on Windows, malloc() on Linux/OS X). Memory returned by the allocators is often 16-byte aligned, which means that even a string with 1 character will take 16 bytes in reality. For bigger strings, the overhead will be much less: for a 4k string, only 15 bytes could be wasted (at most), which is OK. This can't be avoided, except by writing your own heap, which is probably not a good idea, as the OS is optimized for this kind of service.
It appears that the best way to have efficient memory use with short strings is to use fixed-length strings:

dim astring.s{16} (1000)

...since this 16-byte string array consumes less memory than a managed-size array holding 4 or more bytes, given the managing overhead.

Thankfully, you fixed the bug that prevented their use with one of the recent releases.
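
A rough way to check the per-element cost of fixed-length strings is sketched below. It swaps in a structure (the hypothetical ShortText) rather than the plain fixed-string array above, because elements of a structured array are stored back to back, so subtracting their addresses is meaningful here, unlike with managed strings. I would expect both values to match and to equal 16 * SizeOf(Character), but treat that as an assumption to verify on your own build.

```purebasic
; Sketch only: ShortText and items are illustrative names.
Structure ShortText
  text.s{16}                 ; fixed-length string stored inline in the structure
EndStructure

Dim items.ShortText(1000)    ; indexes 0..1000
items(0)\text = "abc"
items(1)\text = "defg"

Debug SizeOf(ShortText)      ; size of one element in bytes
Debug @items(1) - @items(0)  ; distance between neighbouring elements
```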

Re: improving string lengths

Posted: Wed Oct 19, 2011 4:07 pm
by Tenaja
BTW, Fred, would you be so kind as to share what the "15 bytes (at most)" of overhead contains?

thanks.

Re: improving string lengths

Posted: Wed Oct 19, 2011 4:19 pm
by Shield
Probably nothing, as the OS just aligns memory blocks to addresses divisible by 16 for faster access. :)

Re: improving string lengths

Posted: Wed Oct 19, 2011 4:46 pm
by Fred
yes, just OS overhead

Re: improving string lengths

Posted: Wed Oct 19, 2011 4:51 pm
by wilbert
It even makes a difference for the cpu.
For example, the x86 instruction movdqa, which moves 16 bytes to or from an SSE register at a 16-byte-aligned address, is faster than the movdqu instruction, which does the same for unaligned memory.

Re: improving string lengths

Posted: Wed Oct 19, 2011 5:51 pm
by Tenaja
wilbert wrote:It even makes a difference for the cpu.
For example, the x86 instruction movdqa, which moves 16 bytes to or from an SSE register at a 16-byte-aligned address, is faster than the movdqu instruction, which does the same for unaligned memory.
That explains the minimum allocation size (which is already obvious) but not the overhead contents.

Re: improving string lengths

Posted: Wed Oct 19, 2011 5:55 pm
by Zach
I interpreted Fred's reply as a confirmation of "Yes, it is OS overhead. There is nothing actually there".