improving string lengths
Re: improving string lengths
@Tenaja:
The OS assigns 16 byte minimum.
So, if you have a string length of 1 you
got 15 bytes "overhead" (not used).
If you have a string length of 2 you got
14 Bytes "overhead" (not used).
.
.
If you have a string length of 15 you
got 1 Byte "overhead" (not used).
If you have a string length of 16 you
got no "overhead".
Overhead doesn't mean that PureBasic
uses this "Space" for any information.
It's just not used and it is undefiend.
The OS assigns 16 byte minimum.
So, if you have a string length of 1 you
got 15 bytes "overhead" (not used).
If you have a string length of 2 you got
14 Bytes "overhead" (not used).
.
.
If you have a string length of 15 you
got 1 Byte "overhead" (not used).
If you have a string length of 16 you
got no "overhead".
Overhead doesn't mean that PureBasic
uses this "Space" for any information.
It's just not used and it is undefiend.

Re: improving string lengths
That's not entirely correct, Lord, the OS doesn't just allocate a minimum of 16 bytes.
While the OS stores some status information for each memory block, it's more important
to understand that it does align the requested memory to an address that is dividable by 16,
which can cause 15 bytes overhead at most. That's quite a difference.
The values stored at such addresses can be read faster by the CPU because of it's design.
CPUs don't look at memory like we do. If we talk about memory blocks, we think of them as being
an array of bytes, like this:
But a CPU (x86) looks at the same memory like this:
That's why the OS aligns the memory to an address dividable by 16, because
the computer always reads in 4 byte blocks. If the memory wasn't aligned,
there would be performance losses.
While the OS stores some status information for each memory block, it's more important
to understand that it does align the requested memory to an address that is dividable by 16,
which can cause 15 bytes overhead at most. That's quite a difference.
The values stored at such addresses can be read faster by the CPU because of it's design.
CPUs don't look at memory like we do. If we talk about memory blocks, we think of them as being
an array of bytes, like this:
Code: Select all
[1] [2] [3] [4] [5] [6] [7] [8] [9] [10] [11] [12] [13] [14] [15] [16]
Code: Select all
[1 2 3 4] [5 6 7 8] [9 10 11 12] [13 14 15 16]
the computer always reads in 4 byte blocks. If the memory wasn't aligned,
there would be performance losses.

Blog: Why Does It Suck? (http://whydoesitsuck.com/)
"You can disagree with me as much as you want, but during this talk, by definition, anybody who disagrees is stupid and ugly."
- Linus Torvalds
Re: improving string lengths
I am fully aware of the 16-byte allocation size, and that it takes 16 bytes even if you only have one character.
But that does NOT explain why it jumps from 16 bytes to 24...a number that is NOT divisible by 16. Nor does it account for the extra 12 bytes, since the allocation size increases after 3 characters (plus a "zero")! Clearly, you mean divisible by 4, or 8, otherwise the allocation would be 16, 32, 48, etc., and not the 16, 24, 32, 40 that it currently does.
These are the memory allocation steps for the shown string lengths (inclusive of the zero):
16 bytes: up to 4 characters
24 bytes: up to 12 characters
32 bytes: up to 20 characters
40 bytes: up to 28 characters
...so what happens to the rest of the 12 bytes? Clearly, that is Fred's overhead, since it is not required if the string is fixed-length. This may not be an issue for programs with only a couple hundred strings, or with very long strings...but if the program is working with 100's of thousands of 4-byte strings, then 75% of the memory usage is useless overhead unless you use fixed-length strings.
But that does NOT explain why it jumps from 16 bytes to 24...a number that is NOT divisible by 16. Nor does it account for the extra 12 bytes, since the allocation size increases after 3 characters (plus a "zero")! Clearly, you mean divisible by 4, or 8, otherwise the allocation would be 16, 32, 48, etc., and not the 16, 24, 32, 40 that it currently does.
These are the memory allocation steps for the shown string lengths (inclusive of the zero):
16 bytes: up to 4 characters
24 bytes: up to 12 characters
32 bytes: up to 20 characters
40 bytes: up to 28 characters
...so what happens to the rest of the 12 bytes? Clearly, that is Fred's overhead, since it is not required if the string is fixed-length. This may not be an issue for programs with only a couple hundred strings, or with very long strings...but if the program is working with 100's of thousands of 4-byte strings, then 75% of the memory usage is useless overhead unless you use fixed-length strings.
Re: improving string lengths
Where did you get these numbers ?Tenaja wrote:16 bytes: up to 4 characters
24 bytes: up to 12 characters
32 bytes: up to 20 characters
40 bytes: up to 28 characters
If it has to do with your initial post, that method isn't reliable.
Here's the debug output I get from that source (OS X)
Code: Select all
Start at 3578912
Array at 250596
next at 3581248
end at 3581264
1
1:
astring(0) Length1
element 0 bytes 3581232
0
3581232
0
0
2
astring(0) Length 1
element 0 bytes -48
3581280
3581232
3581296
3581312
3
astring(0) Length 1
element 0 bytes -48
3581280
3581232
3581296
3581312
4
astring(0) Length 4
element 1 bytes 64
3581280
3581232
3581296
3581312
5
astring(0) Length 12
element 0 bytes 32
3581328
3581360
3581392
3581424
Re: improving string lengths
Tenaja, if you have a C compiler, run this code. (If you don't, google "Tiny C Compiler", it's about 100 kB download...)
[/size]
It is similar to your original PB code. Here it allocates arrays of 10 strings, of lengths 1, 4, 12, and 50 characters. And it subtracts the consecutive pointers, like your code.
On my system, ignoring some huge numbers and some negative numbers, I get values of 16, 16, 24, and 64 similar to PB's results. This just demonstrates: the OS aligns the memory however it wants, not consecutively! So subtracting addresses does not equate to memory allocated! PB is doing nothing wrong here, and it is definitely not allocating 75% unnecessary space.
* Edit: Or, maybe more accurately: there might be some bytes of overhead (let's say 3 empty bytes at the end, that the OS surely won't allocate to any other program) but this is OS-enforced overhead, not PB overhead.
Code: Select all
void main() {
char * astring[10];
int i;
// ======= String of 1 character (2 bytes total)
for (i = 0; i < 10; i++) {
astring[i] = (char *)malloc(2);
sprintf(astring[i], "%c", 'A' + i);
}
// Print out difference of pointers
for (i = 0; i < 10; i++) {
if (i > 0)
printf("%d\n", astring[i] - astring[i-1]);
free(astring[i]);
}
printf("\n");
// ======== String of 4 character (5 bytes total)
for (i = 0; i < 10; i++) {
astring[i] = (char *)malloc(5);
sprintf(astring[i], "1..%c", 'A' + i);
}
for (i = 0; i < 10; i++) {
if (i > 0)
printf("%d\n", astring[i] - astring[i-1]);
free(astring[i]);
}
printf("\n");
// ======== String of 12 character (13 bytes total)
for (i = 0; i < 10; i++) {
astring[i] = (char *)malloc(13);
sprintf(astring[i], "d1234567890%d", i);
}
for (i = 0; i < 10; i++) {
if (i > 0)
printf("%d\n", astring[i] - astring[i-1]);
free(astring[i]);
}
printf("\n");
// ======== String of 50 character (51 bytes total)
for (i = 0; i < 10; i++) {
astring[i] = (char *)malloc(51);
sprintf(astring[i], "%050d", i);
}
for (i = 0; i < 10; i++) {
if (i > 0)
printf("%d\n", astring[i] - astring[i-1]);
free(astring[i]);
}
printf("\n");
}
It is similar to your original PB code. Here it allocates arrays of 10 strings, of lengths 1, 4, 12, and 50 characters. And it subtracts the consecutive pointers, like your code.
On my system, ignoring some huge numbers and some negative numbers, I get values of 16, 16, 24, and 64 similar to PB's results. This just demonstrates: the OS aligns the memory however it wants, not consecutively! So subtracting addresses does not equate to memory allocated! PB is doing nothing wrong here, and it is definitely not allocating 75% unnecessary space.
* Edit: Or, maybe more accurately: there might be some bytes of overhead (let's say 3 empty bytes at the end, that the OS surely won't allocate to any other program) but this is OS-enforced overhead, not PB overhead.
Re: improving string lengths
An advanced memory manager treats small memory segments differently and has a seperate pool for them. Conventional memory managers use linked lists plus they may add one or two additional values like flags in front of the returned buffer. You can find some bugs like memory over/underflow by doing a heap walk once in a while, without needing additional help like cookies.
If your application has to deal with many thousands of small strings in your program, the easiest way is to just write your own heap manager and use that.
For windows check: http://msdn.microsoft.com/en-us/library ... 85%29.aspx for example.
If your application has to deal with many thousands of small strings in your program, the easiest way is to just write your own heap manager and use that.
For windows check: http://msdn.microsoft.com/en-us/library ... 85%29.aspx for example.
Re: improving string lengths
The way I got those numbers was with a fresh reboot, and running the code repeatedly until I got the smallest numbers it could give...but they are always small after a reboot. The values are VERY consistent. It's obvious a 3-byte string won't consume 16k, so those values were ignored.
When using fixed-length strings of 16 characters, the results are always exactly 16. When using pb-managed string lengths, it was impossible to get it smaller than I posted. When you resize the strings, it's easy to tell how the memory is managed; there is a pattern.
The difference with C code is the string arrays are always exactly the same size, and always consecutive...just like fixed-length pb string arrays. That's comparing apples to apples.
Remember, I am not concerned with individual strings. I am talking about very large arrays of strings...strings which are typically small. With the fixed-length, pb sets the size ahead and allocates the full array, so they are small and compact on a string level. With managed length strings, they are all over the place with random sizes.
Ramihyn_, I appreciate the suggestion. However, using fixed-length strings, as I mentioned, would likely be more efficient than a memory manager. It certainly codes faster!
When using fixed-length strings of 16 characters, the results are always exactly 16. When using pb-managed string lengths, it was impossible to get it smaller than I posted. When you resize the strings, it's easy to tell how the memory is managed; there is a pattern.
The difference with C code is the string arrays are always exactly the same size, and always consecutive...just like fixed-length pb string arrays. That's comparing apples to apples.
Remember, I am not concerned with individual strings. I am talking about very large arrays of strings...strings which are typically small. With the fixed-length, pb sets the size ahead and allocates the full array, so they are small and compact on a string level. With managed length strings, they are all over the place with random sizes.
Ramihyn_, I appreciate the suggestion. However, using fixed-length strings, as I mentioned, would likely be more efficient than a memory manager. It certainly codes faster!
Re: improving string lengths
It really depends on what you needTenaja wrote:Ramihyn_, I appreciate the suggestion. However, using fixed-length strings, as I mentioned, would likely be more efficient than a memory manager. It certainly codes faster!

Btw. instead of the guessing in this thread, you could just use the GetProcessHeap(s) functions and do a heap walker in PureBasic equal to this http://www.codeproject.com/KB/cpp/HeapWalker.aspx or others you can find on the net.
GetProcessHeap
Getting process heaps example
Re: improving string lengths
Hmm? In my C example, the pointers to the strings are always consecutive. The string locations themselves are dynamically malloc'ed and can spread all over the place, just like an array of normal strings in PB.Tenaja wrote:The difference with C code is the string arrays are always exactly the same size, and always consecutive...just like fixed-length pb string arrays.
That is unlike a fixed length string array in PB, in which pointers are not even used, and all the strings/characters are allocated at once and actually consecutive.
This sort of shows what I mean:
Code: Select all
; Array of dynamic strings
Dim DString.s(4)
DString(0) = "One"
DString(1) = "Two"
DString(2) = "Three"
DString(3) = "Four"
DString(4) = "Five"
Debug "Array of dynamic (regular) strings"
Debug " ( Probably non-constant jump between pointers )"
Debug "0 : " + Str(@DString(0)) + " : " + DString(0)
Debug "1 : " + Str(@DString(1)) + " : " + DString(1)
Debug "2 : " + Str(@DString(2)) + " : " + DString(2)
Debug "3 : " + Str(@DString(3)) + " : " + DString(3)
Debug "4 : " + Str(@DString(4)) + " : " + DString(4)
Debug ""
; Array of fixed strings
Dim FString.s{5}(4)
FString(0) = "One"
FString(1) = "Two"
FString(2) = "Three"
FString(3) = "Four"
FString(4) = "Five"
Debug "Array of fixed-length strings"
Debug " ( Always constant jump between pointers )"
Debug "0 : " + Str(@FString(0)) + " : " + FString(0)
Debug "1 : " + Str(@FString(1)) + " : " + FString(1)
Debug "2 : " + Str(@FString(2)) + " : " + FString(2)
Debug "3 : " + Str(@FString(3)) + " : " + FString(3)
Debug "4 : " + Str(@FString(4)) + " : " + FString(4)
So yes, for your purposes of a large quantity of very small strings, fixed-length arrays would be a good idea since they are allocated as one large, fast, guaranteed-consecutive block. You could even allocate large memory blocks and manage the strings/chars within them yourself, if you want total control.
Re: improving string lengths
Yeah... but if I wanted to use strcat(), then I'd use C...I've grown spoiled with String1 = String2 + String3!kenmo wrote:...You could even allocate large memory blocks and manage the strings/chars within them yourself, if you want total control.
I saw you used coder-managed string allocations in your sample, rather than defined arrays...that might be useful in some cases, not really for me. I don't want to be adding the string arrays as I go along.
Re: improving string lengths
Of course, just tossing ideas out there.
Re: improving string lengths
Tenaja, the method you're using to measure the size of the memory allocated for strings is simply wrong, no matter how you interpret the results or how much you reboot your computer. It's a nonsense calculation.
This is how to measure the size (without OS overhead) of a normal (not fixed) PB string:
There is an unexplained overhead of 4 bytes per string (not per character) regardless of the length of the string. In unicode mode the overhead is 8 bytes per string.
There is a chance this is a bug in PB. Fred will have to say.
This is how to measure the size (without OS overhead) of a normal (not fixed) PB string:
Code: Select all
Procedure ShouldBeStringSize(S.s)
ProcedureReturn (Len(S)+1) * (#PB_Compiler_Unicode+1)
EndProcedure
Import ""
PB_StringHeap
EndImport
If HeapSize_(PB_StringHeap, 0, HeapAlloc_(PB_StringHeap, 0, 17)) = 17
Debug "OK: Confirmed that HeapSize hides the OS overhead"
Else
Debug "NOT OK!! HeapSize is unreliable for this test"
EndIf
Debug "--"
For I = 0 To 70 Step 3
a.s = ReplaceString(Space(I), " ", "1")
pb = HeapSize_(PB_StringHeap, 0, @a.s)
ok = ShouldBeStringSize(a)
; Debug "PB: " + Str(pb)
; Debug "OK: " + Str(ok)
Debug "String length: " + Str(I)
Debug "Overhead: " + Str(pb-ok)
Debug "---"
Next
There is a chance this is a bug in PB. Fred will have to say.
Last edited by Trond on Sun Oct 23, 2011 4:16 pm, edited 1 time in total.
Re: improving string lengths
Trond,
My linker chokes on _PB_StringHeap...do you have a library I need to install?
Thanks.
My linker chokes on _PB_StringHeap...do you have a library I need to install?
Thanks.
Re: improving string lengths
Sorry, I didn't test it on a 32-bit OS. I edited it, maybe it works now?Tenaja wrote:Trond,
My linker chokes on _PB_StringHeap...do you have a library I need to install?
Thanks.
Re: improving string lengths
The problem with your calculation is that even IF there is OS overhead, then that memory is USED within the managing scheme of the strings. Therefore, maybe the string itself isn't in that overhead, but the string requires it anyway. As a result, managed strings "require" that overhead, whatever it is. The bottom line is that I care about overall memory consumption, not individual string use.Trond wrote:...
This is how to measure the size (without OS overhead) of a normal (not fixed) PB string:
....