improving string lengths

Just starting out? Need help? Post your questions and find answers here.
User avatar
Lord
Addict
Addict
Posts: 907
Joined: Tue May 26, 2009 2:11 pm

Re: improving string lengths

Post by Lord »

@Tenaja:

The OS assigns 16 byte minimum.
So, if you have a string length of 1 you
got 15 bytes "overhead" (not used).
If you have a string length of 2 you got
14 Bytes "overhead" (not used).
.
.
If you have a string length of 15 you
got 1 Byte "overhead" (not used).
If you have a string length of 16 you
got no "overhead".

Overhead doesn't mean that PureBasic
uses this "Space" for any information.
It's just not used and it is undefiend.
Image
User avatar
Shield
Addict
Addict
Posts: 1021
Joined: Fri Jan 21, 2011 8:25 am
Location: 'stralia!
Contact:

Re: improving string lengths

Post by Shield »

That's not entirely correct, Lord, the OS doesn't just allocate a minimum of 16 bytes.
While the OS stores some status information for each memory block, it's more important
to understand that it does align the requested memory to an address that is dividable by 16,
which can cause 15 bytes overhead at most. That's quite a difference.

The values stored at such addresses can be read faster by the CPU because of it's design.
CPUs don't look at memory like we do. If we talk about memory blocks, we think of them as being
an array of bytes, like this:

Code: Select all

[1] [2] [3] [4] [5] [6] [7] [8] [9] [10] [11] [12] [13] [14] [15] [16]
But a CPU (x86) looks at the same memory like this:

Code: Select all

[1 2 3 4] [5 6 7 8] [9 10 11 12] [13 14 15 16]
That's why the OS aligns the memory to an address dividable by 16, because
the computer always reads in 4 byte blocks. If the memory wasn't aligned,
there would be performance losses. :)
Image
Blog: Why Does It Suck? (http://whydoesitsuck.com/)
"You can disagree with me as much as you want, but during this talk, by definition, anybody who disagrees is stupid and ugly."
- Linus Torvalds
User avatar
Tenaja
Addict
Addict
Posts: 1959
Joined: Tue Nov 09, 2010 10:15 pm

Re: improving string lengths

Post by Tenaja »

I am fully aware of the 16-byte allocation size, and that it takes 16 bytes even if you only have one character.

But that does NOT explain why it jumps from 16 bytes to 24...a number that is NOT divisible by 16. Nor does it account for the extra 12 bytes, since the allocation size increases after 3 characters (plus a "zero")! Clearly, you mean divisible by 4, or 8, otherwise the allocation would be 16, 32, 48, etc., and not the 16, 24, 32, 40 that it currently does.

These are the memory allocation steps for the shown string lengths (inclusive of the zero):

16 bytes: up to 4 characters
24 bytes: up to 12 characters
32 bytes: up to 20 characters
40 bytes: up to 28 characters

...so what happens to the rest of the 12 bytes? Clearly, that is Fred's overhead, since it is not required if the string is fixed-length. This may not be an issue for programs with only a couple hundred strings, or with very long strings...but if the program is working with 100's of thousands of 4-byte strings, then 75% of the memory usage is useless overhead unless you use fixed-length strings.
wilbert
PureBasic Expert
PureBasic Expert
Posts: 3943
Joined: Sun Aug 08, 2004 5:21 am
Location: Netherlands

Re: improving string lengths

Post by wilbert »

Tenaja wrote:16 bytes: up to 4 characters
24 bytes: up to 12 characters
32 bytes: up to 20 characters
40 bytes: up to 28 characters
Where did you get these numbers ?
If it has to do with your initial post, that method isn't reliable.
Here's the debug output I get from that source (OS X)

Code: Select all

Start at 3578912
Array at 250596
next at 3581248
end at 3581264
1
1:
astring(0) Length1
element 0 bytes 3581232
0
3581232
0
0

2
astring(0) Length 1
element 0 bytes -48
3581280
3581232
3581296
3581312

3
astring(0) Length 1
element 0 bytes -48
3581280
3581232
3581296
3581312

4
astring(0) Length 4
element 1 bytes 64
3581280
3581232
3581296
3581312

5
astring(0) Length 12
element 0 bytes 32
3581328
3581360
3581392
3581424
User avatar
kenmo
Addict
Addict
Posts: 2047
Joined: Tue Dec 23, 2003 3:54 am

Re: improving string lengths

Post by kenmo »

Tenaja, if you have a C compiler, run this code. (If you don't, google "Tiny C Compiler", it's about 100 kB download...)

Code: Select all

void main() {
	char * astring[10];
	int i;
	
	// ======= String of 1 character (2 bytes total)
	for (i = 0; i < 10; i++) {
		astring[i] = (char *)malloc(2);
		sprintf(astring[i], "%c", 'A' + i);
	}
	// Print out difference of pointers
	for (i = 0; i < 10; i++) {
		if (i > 0)
			printf("%d\n", astring[i] - astring[i-1]);
		free(astring[i]);
	}
	printf("\n");
	
	// ======== String of 4 character (5 bytes total)
	for (i = 0; i < 10; i++) {
		astring[i] = (char *)malloc(5);
		sprintf(astring[i], "1..%c", 'A' + i);
	}
	for (i = 0; i < 10; i++) {
		if (i > 0)
			printf("%d\n", astring[i] - astring[i-1]);
		free(astring[i]);
	}
	printf("\n");
	
	// ======== String of 12 character (13 bytes total)
	for (i = 0; i < 10; i++) {
		astring[i] = (char *)malloc(13);
		sprintf(astring[i], "d1234567890%d", i);
	}
	for (i = 0; i < 10; i++) {
		if (i > 0)
			printf("%d\n", astring[i] - astring[i-1]);
		free(astring[i]);
	}
	printf("\n");
	
	// ======== String of 50 character (51 bytes total)
	for (i = 0; i < 10; i++) {
		astring[i] = (char *)malloc(51);
		sprintf(astring[i], "%050d", i);
	}
	for (i = 0; i < 10; i++) {
		if (i > 0)
			printf("%d\n", astring[i] - astring[i-1]);
		free(astring[i]);
	}
	printf("\n");

}
[/size]

It is similar to your original PB code. Here it allocates arrays of 10 strings, of lengths 1, 4, 12, and 50 characters. And it subtracts the consecutive pointers, like your code.

On my system, ignoring some huge numbers and some negative numbers, I get values of 16, 16, 24, and 64 similar to PB's results. This just demonstrates: the OS aligns the memory however it wants, not consecutively! So subtracting addresses does not equate to memory allocated! PB is doing nothing wrong here, and it is definitely not allocating 75% unnecessary space.

* Edit: Or, maybe more accurately: there might be some bytes of overhead (let's say 3 empty bytes at the end, that the OS surely won't allocate to any other program) but this is OS-enforced overhead, not PB overhead.
Ramihyn_
Enthusiast
Enthusiast
Posts: 314
Joined: Fri Feb 24, 2006 9:40 am

Re: improving string lengths

Post by Ramihyn_ »

An advanced memory manager treats small memory segments differently and has a seperate pool for them. Conventional memory managers use linked lists plus they may add one or two additional values like flags in front of the returned buffer. You can find some bugs like memory over/underflow by doing a heap walk once in a while, without needing additional help like cookies.

If your application has to deal with many thousands of small strings in your program, the easiest way is to just write your own heap manager and use that.

For windows check: http://msdn.microsoft.com/en-us/library ... 85%29.aspx for example.
User avatar
Tenaja
Addict
Addict
Posts: 1959
Joined: Tue Nov 09, 2010 10:15 pm

Re: improving string lengths

Post by Tenaja »

The way I got those numbers was with a fresh reboot, and running the code repeatedly until I got the smallest numbers it could give...but they are always small after a reboot. The values are VERY consistent. It's obvious a 3-byte string won't consume 16k, so those values were ignored.

When using fixed-length strings of 16 characters, the results are always exactly 16. When using pb-managed string lengths, it was impossible to get it smaller than I posted. When you resize the strings, it's easy to tell how the memory is managed; there is a pattern.

The difference with C code is the string arrays are always exactly the same size, and always consecutive...just like fixed-length pb string arrays. That's comparing apples to apples.

Remember, I am not concerned with individual strings. I am talking about very large arrays of strings...strings which are typically small. With the fixed-length, pb sets the size ahead and allocates the full array, so they are small and compact on a string level. With managed length strings, they are all over the place with random sizes.

Ramihyn_, I appreciate the suggestion. However, using fixed-length strings, as I mentioned, would likely be more efficient than a memory manager. It certainly codes faster!
Ramihyn_
Enthusiast
Enthusiast
Posts: 314
Joined: Fri Feb 24, 2006 9:40 am

Re: improving string lengths

Post by Ramihyn_ »

Tenaja wrote:Ramihyn_, I appreciate the suggestion. However, using fixed-length strings, as I mentioned, would likely be more efficient than a memory manager. It certainly codes faster!
It really depends on what you need :)

Btw. instead of the guessing in this thread, you could just use the GetProcessHeap(s) functions and do a heap walker in PureBasic equal to this http://www.codeproject.com/KB/cpp/HeapWalker.aspx or others you can find on the net.

GetProcessHeap
Getting process heaps example
User avatar
kenmo
Addict
Addict
Posts: 2047
Joined: Tue Dec 23, 2003 3:54 am

Re: improving string lengths

Post by kenmo »

Tenaja wrote:The difference with C code is the string arrays are always exactly the same size, and always consecutive...just like fixed-length pb string arrays.
Hmm? In my C example, the pointers to the strings are always consecutive. The string locations themselves are dynamically malloc'ed and can spread all over the place, just like an array of normal strings in PB.

That is unlike a fixed length string array in PB, in which pointers are not even used, and all the strings/characters are allocated at once and actually consecutive.

This sort of shows what I mean:

Code: Select all

; Array of dynamic strings
Dim DString.s(4)
DString(0) = "One"
DString(1) = "Two"
DString(2) = "Three"
DString(3) = "Four"
DString(4) = "Five"
Debug "Array of dynamic (regular) strings"
Debug "   ( Probably non-constant jump between pointers )"
Debug "0 : " + Str(@DString(0)) + " : " + DString(0)
Debug "1 : " + Str(@DString(1)) + " : " + DString(1)
Debug "2 : " + Str(@DString(2)) + " : " + DString(2)
Debug "3 : " + Str(@DString(3)) + " : " + DString(3)
Debug "4 : " + Str(@DString(4)) + " : " + DString(4)
Debug ""

; Array of fixed strings
Dim FString.s{5}(4)
FString(0) = "One"
FString(1) = "Two"
FString(2) = "Three"
FString(3) = "Four"
FString(4) = "Five"
Debug "Array of fixed-length strings"
Debug "   ( Always constant jump between pointers )"
Debug "0 : " + Str(@FString(0)) + " : " + FString(0)
Debug "1 : " + Str(@FString(1)) + " : " + FString(1)
Debug "2 : " + Str(@FString(2)) + " : " + FString(2)
Debug "3 : " + Str(@FString(3)) + " : " + FString(3)
Debug "4 : " + Str(@FString(4)) + " : " + FString(4)
[/size]

So yes, for your purposes of a large quantity of very small strings, fixed-length arrays would be a good idea since they are allocated as one large, fast, guaranteed-consecutive block. You could even allocate large memory blocks and manage the strings/chars within them yourself, if you want total control.
User avatar
Tenaja
Addict
Addict
Posts: 1959
Joined: Tue Nov 09, 2010 10:15 pm

Re: improving string lengths

Post by Tenaja »

kenmo wrote:...You could even allocate large memory blocks and manage the strings/chars within them yourself, if you want total control.
Yeah... but if I wanted to use strcat(), then I'd use C...I've grown spoiled with String1 = String2 + String3!

I saw you used coder-managed string allocations in your sample, rather than defined arrays...that might be useful in some cases, not really for me. I don't want to be adding the string arrays as I go along.
User avatar
kenmo
Addict
Addict
Posts: 2047
Joined: Tue Dec 23, 2003 3:54 am

Re: improving string lengths

Post by kenmo »

Of course, just tossing ideas out there.
Trond
Always Here
Always Here
Posts: 7446
Joined: Mon Sep 22, 2003 6:45 pm
Location: Norway

Re: improving string lengths

Post by Trond »

Tenaja, the method you're using to measure the size of the memory allocated for strings is simply wrong, no matter how you interpret the results or how much you reboot your computer. It's a nonsense calculation.

This is how to measure the size (without OS overhead) of a normal (not fixed) PB string:

Code: Select all

Procedure ShouldBeStringSize(S.s)
  ProcedureReturn (Len(S)+1) * (#PB_Compiler_Unicode+1)
EndProcedure


Import ""
    PB_StringHeap
EndImport

If HeapSize_(PB_StringHeap, 0, HeapAlloc_(PB_StringHeap, 0, 17)) = 17
  Debug "OK: Confirmed that HeapSize hides the OS overhead"
Else
  Debug "NOT OK!! HeapSize is unreliable for this test"
EndIf
Debug "--"


For I = 0 To 70 Step 3
  a.s = ReplaceString(Space(I), " ", "1")
  pb = HeapSize_(PB_StringHeap, 0, @a.s)
  ok = ShouldBeStringSize(a)
  ; Debug "PB: " + Str(pb)
  ; Debug "OK: " + Str(ok)
  Debug "String length: " + Str(I)
  Debug "Overhead: " + Str(pb-ok)
  Debug "---"
Next
There is an unexplained overhead of 4 bytes per string (not per character) regardless of the length of the string. In unicode mode the overhead is 8 bytes per string.

There is a chance this is a bug in PB. Fred will have to say.
Last edited by Trond on Sun Oct 23, 2011 4:16 pm, edited 1 time in total.
User avatar
Tenaja
Addict
Addict
Posts: 1959
Joined: Tue Nov 09, 2010 10:15 pm

Re: improving string lengths

Post by Tenaja »

Trond,
My linker chokes on _PB_StringHeap...do you have a library I need to install?

Thanks.
Trond
Always Here
Always Here
Posts: 7446
Joined: Mon Sep 22, 2003 6:45 pm
Location: Norway

Re: improving string lengths

Post by Trond »

Tenaja wrote:Trond,
My linker chokes on _PB_StringHeap...do you have a library I need to install?

Thanks.
Sorry, I didn't test it on a 32-bit OS. I edited it, maybe it works now?
User avatar
Tenaja
Addict
Addict
Posts: 1959
Joined: Tue Nov 09, 2010 10:15 pm

Re: improving string lengths

Post by Tenaja »

Trond wrote:...
This is how to measure the size (without OS overhead) of a normal (not fixed) PB string:
....
The problem with your calculation is that even IF there is OS overhead, then that memory is USED within the managing scheme of the strings. Therefore, maybe the string itself isn't in that overhead, but the string requires it anyway. As a result, managed strings "require" that overhead, whatever it is. The bottom line is that I care about overall memory consumption, not individual string use.
Post Reply