Re: String length should be stored for string variables

Posted: Sun Jul 19, 2020 12:19 am
by Tenaja
There are a million string libraries & routines for C... Instead of asking Fred to break our existing code, why not just implement something on GitHub? Everybody knows it's not difficult to write a module.

Re: String length should be stored for string variables

Posted: Sun Jul 19, 2020 8:15 am
by Rinzwind
Tenaja wrote:There are a million string libraries & routines for C... Instead of asking Fred to break our existing code, why not just implement something on GitHub? Everybody knows it's not difficult to write a module.
Because strings are a language feature of PB with special treatment. Otherwise all string operators would need to be replaced by procedures. This change can be mostly backward compatible because the string stays null-terminated. It is quite silly that the C world kept using null termination despite its obvious performance design flaw, and sillier still that PB copied the broken concept. The same blindness probably also kept internationalization support problematic in the C world for so long, even now. OK, C technically only has arrays, and an array of char is not the same thing as string functionality.

But sadly you can indeed forget any language/syntax improvements, even obviously missing things like inline array and structure initialization. It's painful, verbose and ugly to have to create and fill arrays line by line just to pass them once to a procedure, for example. Split and Join are also obvious candidates for inclusion. They are part of all decent BASIC languages.

Inspiration? https://github.com/antirez/sds

Writing a usable, stable and fast set of string functions is not easy in any way. Especially when throwing in internationalization.

I worked around it by using lists as a string builder.
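For illustration, here is a minimal C sketch of the layout the sds link points to: a length header stored directly in front of character data that is still null-terminated. The names `lstr_hdr`, `lstr_new`, `lstr_len` and `lstr_free` are hypothetical, and nothing here reflects how PB's runtime is actually implemented.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical header placed directly in front of the characters. */
typedef struct {
    size_t len;    /* bytes in use, not counting the terminator */
    char   buf[];  /* character data, always followed by '\0' */
} lstr_hdr;

/* Copy `len` bytes from `src`; returns a pointer to the characters,
   so the result can still be handed to code expecting a C string. */
char *lstr_new(const char *src, size_t len) {
    assert(src != NULL);
    lstr_hdr *h = malloc(sizeof *h + len + 1);
    if (h == NULL) return NULL;
    h->len = len;
    memcpy(h->buf, src, len);
    h->buf[len] = '\0';
    return h->buf;
}

/* O(1) length: step back from the character pointer to the header. */
size_t lstr_len(const char *s) {
    assert(s != NULL);
    const void *hdr = s - offsetof(lstr_hdr, buf);
    return ((const lstr_hdr *)hdr)->len;
}

/* Release the whole allocation, header included. */
void lstr_free(char *s) {
    if (s != NULL) free(s - offsetof(lstr_hdr, buf));
}
```

Because callers only ever see the pointer to the characters, existing code that scans for the terminator keeps working, while code that knows about the header can skip the scan.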

Re: String length should be stored for string variables

Posted: Sun Jul 19, 2020 9:05 am
by STARGÅTE
User_Russian wrote:Most of the time is spent on calculating the length of the string. The longer the string the more time is needed.
I understand. Because with s+"x" the length of s must first be determined, every time.
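The cost described here can be made concrete with a small C sketch. It assumes, as the posts above suggest, that each append first scans for the terminator; whether PB's runtime concatenates exactly this way is an assumption. `build_by_scanning` and `build_with_length` are hypothetical names, and `buf` must hold at least n + 1 bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Build a string of n 'x' characters by repeated append, looking for
   the terminator before each append. Returns how many characters were
   scanned in total. */
size_t build_by_scanning(char *buf, size_t n) {
    assert(buf != NULL);
    size_t scanned = 0;
    buf[0] = '\0';
    for (size_t i = 0; i < n; i++) {
        size_t len = strlen(buf);   /* walks the whole string so far */
        scanned += len;
        buf[len] = 'x';
        buf[len + 1] = '\0';
    }
    return scanned;
}

/* Same result, but the length is tracked, so nothing is rescanned.
   Returns the final length. */
size_t build_with_length(char *buf, size_t n) {
    assert(buf != NULL);
    size_t len = 0;
    buf[0] = '\0';
    for (size_t i = 0; i < n; i++) {
        buf[len++] = 'x';
        buf[len] = '\0';
    }
    return len;
}
```

For n appends the scanning version examines n(n-1)/2 characters in total, which is why the time grows quadratically with the string length, while the tracked version does constant work per append.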

Re: String length should be stored for string variables

Posted: Sun Jul 19, 2020 9:26 am
by Saki
PB's string processing is problematic.
There is still a leak when very large strings are allocated: for every 100 MB requested, about 200 MB is never released.
So at 500 MB, a full GB stays occupied.
That is an enormous amount, and it means the string is never actually released.
Furthermore, strings passed to procedures can still be overwritten.
I noticed this with Val(), which fails depending on the memory usage.
These things can be very difficult to track down.

Code:

x$=Space(5e8)
x$=#Null$
Repeat : ForEver

Re: String length should be stored for string variables

Posted: Sun Jul 19, 2020 9:56 am
by BarryG
Don't set it to #NULL$ to release it. Use this workaround instead and look at Task Manager:

Code:

Macro FreeString(name)
  name=Left(name,1)
  name=""
EndMacro

x$=Space(5e8)
FreeString(x$)

Repeat : Delay(1) : ForEver
See this thread too -> viewtopic.php?f=7&t=30684

Re: String length should be stored for string variables

Posted: Sun Jul 19, 2020 10:14 am
by Saki
Yes, but then the manual should say that you need to use a workaround :D

I don't know how many workarounds are in my software,
but there are already quite a few.
I don't even know if they are still necessary or not.

Why do I need #Null$ or "" when it doesn't work :?:

Re: String length should be stored for string variables

Posted: Sun Jul 19, 2020 12:09 pm
by Mijikai
I think the current string handling is OK, but appending the size would be an improvement.
mk-soft wrote:... But then there is no static string anymore and Array of Chars won't work anymore.
A UTF8 character can take from 1 byte to 6 bytes in memory.
This leads to more problems than advantages.
I agree, UTF8 would make no sense at all.
BarryG wrote:Don't set it to #NULL$ to release it. Use this workaround instead and look at Task Manager:
...
Resizing memory may look good but...

General Rules:
There are two special constants for strings:
#Empty$: represents an empty string (exactly the same as "")
#Null$ : represents a null string. This can be used for API functions requiring a null pointer to a string, or to really free a string.

Re: String length should be stored for string variables

Posted: Sun Jul 19, 2020 6:23 pm
by mk-soft
FreeString is not required: assigning the constant #Null$ to a string was fixed in v5.72, so it once again assigns Nothing to the string.
(Edit: in v5.70 and v5.71 an empty string was assigned instead of Nothing.)

The large memory requirement comes from the internal helper string (PB_StringBasePosition), which is reduced again when necessary.

Code:

; Set a breakpoint here (F9) and step through (F8) while watching memory usage

x$ = Space(5e8)
;
x$ = ""

; Force free StringBasePointer
dummy$ = LSet("", 1)
;

x$ = Space(5e8)
;
x$ = #Null$

; Force free StringBasePointer
dummy$ = LSet("", 1)
;
End

Re: String length should be stored for string variables

Posted: Sun Jul 19, 2020 7:58 pm
by kenmo
I don't want to get off topic, but since I brought up UTF-8, I have some responses:

- PureBasic's "Unicode" strings are a sort of incomplete implementation of Unicode... sometimes PB treats strings as UCS-2 (fixed 2-byte) and other times as UTF-16 (2-4 byte), requiring workarounds for handling chars > $FFFF

- Windows API has used UTF-16 since Windows 2000... which is a variable 2-4 byte encoding http://zuga.net/articles/text-does-wind ... -or-ucs-2/

- I believe Linux and MacOS use UTF-8 for their APIs, so conversions are always happening. Using UTF-8 internally would benefit PB on these platforms
A UTF8 character can take from 1 byte to 6 bytes in memory
- It is actually 1 to 4 bytes to cover all Unicode, but that doesn't change your point :) https://en.wikipedia.org/wiki/UTF-8
But then there is no static string anymore and Array of Chars won't work anymore.
- Since PB is using UTF-16, which is variable length, an Array of Chars already fails if you support any Unicode chars > $FFFF
Moreover, UTF8 is quite slow due to the variable length and many things become more complicated.
- The parsing of strings in RAM is dominated by the slowness of what you do with those strings: draw to the screen, file I/O, network I/O... The site http://utf8everywhere.org/ discusses all these pros and cons.
Q: Won’t the conversions between UTF-8 and UTF-16 when passing strings to Windows slow down my application?

A: First, you will do some conversion either way. It’s either when calling the system, or when interacting with the rest of the world, e.g. when sending a text string over TCP. Also, those of OS APIs which accept strings often perform tasks which are inherently slow, such as UI or file system operations.
From Wikipedia:
Microsoft now recommends UTF-8 for Windows programs, while previously they emphasized "Unicode" (meaning UTF-16) Win32 API, this may mean internal use of UTF-8 will increase in the future.
Microsoft link:
https://docs.microsoft.com/en-us/window ... -code-page
Use UTF-8 character encoding for optimal compatibility between web apps and other *nix-based platforms (Unix, Linux, and variants), minimize localization bugs, and reduce testing overhead.

UTF-8 is the universal code page for internationalization and is able to encode the entire Unicode character set. It is used pervasively on the web, and is the default for *nix-based platforms.

An encoded character takes between 1 and 4 bytes. UTF-8 encoding supports longer byte sequences, up to 6 bytes, but the biggest code point of Unicode 6.0 (U+10FFFF) only takes 4 bytes.

Just some things to consider :D I think UTF-8 is the closest thing we have to a universal encoding. Now anyway, back to storing-string-length-with-strings....
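The byte counts quoted above can be checked with a short C sketch. It follows the standard UTF-8 and UTF-16 encoding rules; the function names `utf8_len`, `utf16_units` and `utf16_split` are hypothetical helpers, not part of PB or any library.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes needed to encode one code point in UTF-8 (0 if not encodable). */
int utf8_len(uint32_t cp) {
    if (cp >= 0xD800 && cp <= 0xDFFF) return 0;  /* surrogates are not characters */
    if (cp <= 0x7F)     return 1;
    if (cp <= 0x7FF)    return 2;
    if (cp <= 0xFFFF)   return 3;
    if (cp <= 0x10FFFF) return 4;
    return 0;                                    /* beyond the Unicode range */
}

/* 16-bit units needed to encode one code point in UTF-16 (0 if not encodable). */
int utf16_units(uint32_t cp) {
    if (cp >= 0xD800 && cp <= 0xDFFF) return 0;
    if (cp <= 0xFFFF)   return 1;
    if (cp <= 0x10FFFF) return 2;                /* needs a surrogate pair */
    return 0;
}

/* Split a code point above $FFFF into its UTF-16 surrogate pair. */
void utf16_split(uint32_t cp, uint16_t *hi, uint16_t *lo) {
    assert(cp > 0xFFFF && cp <= 0x10FFFF);
    assert(hi != 0 && lo != 0);
    uint32_t v = cp - 0x10000;
    *hi = (uint16_t)(0xD800 + (v >> 10));
    *lo = (uint16_t)(0xDC00 + (v & 0x3FF));
}
```

Any character above $FFFF takes 4 bytes in UTF-8 and two 16-bit units in UTF-16, so neither encoding gives a fixed width per character once such characters are allowed.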

Re: String length should be stored for string variables

Posted: Mon Jul 20, 2020 4:37 am
by Rinzwind
Also, storing the length with the string makes it possible to use strings for arbitrary binary storage (since null no longer has to have a special meaning, if you so choose). That can be quite convenient.

Re: String length should be stored for string variables

Posted: Mon Jul 20, 2020 6:11 am
by wilbert
kenmo wrote:- I believe Linux and MacOS use UTF-8 for their APIs, so conversions are always happening. Using UTF-8 internally would benefit PB on these platforms
Most MacOS APIs require NSString / CFString type for strings.

Storing string lengths would make things faster but it's also true that for most cases the current string functions are fast enough.
Instead of adding the length to every string, caching the length of the last accessed strings could also improve things.
And some functions like Split and Join would be a welcome addition.
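The caching idea could look roughly like the following C sketch: a tiny table that remembers the length of the most recently measured strings, keyed by address. The slot count, the round-robin replacement and the names `cached_len` and `cache_invalidate` are all assumptions made for illustration, not a description of anything in PB.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define CACHE_SLOTS 4   /* arbitrary size chosen for this sketch */

typedef struct {
    const char *ptr;    /* address of the string that was measured */
    size_t      len;    /* its length at the time of measuring */
} len_entry;

static len_entry cache[CACHE_SLOTS];
static size_t next_slot;

/* Return the length of s, scanning only on a cache miss. */
size_t cached_len(const char *s) {
    assert(s != NULL);
    for (size_t i = 0; i < CACHE_SLOTS; i++)
        if (cache[i].ptr == s) return cache[i].len;
    size_t len = strlen(s);
    cache[next_slot].ptr = s;
    cache[next_slot].len = len;
    next_slot = (next_slot + 1) % CACHE_SLOTS;   /* round-robin replacement */
    return len;
}

/* Must be called whenever the string at s is modified or freed. */
void cache_invalidate(const char *s) {
    for (size_t i = 0; i < CACHE_SLOTS; i++)
        if (cache[i].ptr == s) cache[i].ptr = NULL;
}
```

The design cost is the invalidation: every operation that changes or frees a string has to clear its entry, otherwise the cache returns an outdated length.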

It would be nice if the PB string library would be open source like the IDE so we could contribute to it.

Re: String length should be stored for string variables

Posted: Mon Jul 20, 2020 9:05 am
by helpy
-1

If the string length were stored internally with each PB string, a new problem would arise ;-)
This problem would occur if you manipulate a PB string using pointers or memory functions, writing directly to the string memory with Poke or some other *PointerToCharacter. Manipulating a string this way would not update the internal string length, and PB functions would no longer work correctly ... :-(
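A small C sketch of this failure mode, using a hypothetical `counted` struct that stores the length next to a fixed-size buffer. The names and the 32-byte buffer are assumptions for illustration only.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical string that carries its own length. */
typedef struct {
    size_t len;       /* stored length, maintained by counted_set() */
    char   buf[32];   /* null-terminated character data */
} counted;

/* Normal assignment: copies the text and updates the stored length. */
void counted_set(counted *c, const char *src) {
    assert(c != NULL && src != NULL);
    size_t n = strlen(src);
    assert(n < sizeof c->buf);
    memcpy(c->buf, src, n + 1);
    c->len = n;
}

/* Repair step: recompute the stored length after a direct memory write. */
void counted_resync(counted *c) {
    assert(c != NULL);
    c->len = strlen(c->buf);
}
```

If code writes a terminator straight into `buf` (the equivalent of a Poke), `len` still holds the old value until something like `counted_resync` is called, so any function that trusts the stored length would read past the new end of the text.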

Re: String length should be stored for string variables

Posted: Mon Jul 20, 2020 10:11 am
by Saki
@Wilbert
Yeah, my thought was: why not just cache the string lengths, keyed by their addresses?
This would also eliminate overflows if the string end is not found or damaged.
I think you could easily do that with your knowledge.

Re: String length should be stored for string variables

Posted: Mon Jul 20, 2020 10:15 am
by NicTheQuick
Isn't there something like MemorySize() for a string's buffer? How does this usually work together with AllocateMemory()?
If the operating system already knows how big the memory buffer to a given pointer is, this could be another idea.

Re: String length should be stored for string variables

Posted: Mon Jul 20, 2020 11:11 am
by Saki
@NicTheQuick
Yes, as a basic idea, that is the most sensible approach.
The main problem is probably that compatibility with older software or special procedures always ends up a bit broken.
But in principle, it should be like Win 95.
The old has to give way to the better new.
Or it will all be as Gorbachev said: "He who is late will be punished by life"