UTF-8 support for strings
Although placing UTF-8 strings inside the PB string lib works fine, most of the support functions do not work with UTF-8. For example:
lcase(String) and ucase(String) return wrong characters when given a UTF-8 string. For these functions a flag would be nice:
lcase(String [,StringFormat])
The same is needed for:
left, right, mid etc.
Under Windows you can use CharLower_(), for example, but as it is part of the Windows library, it will not work on Linux.
- Kaeru Gaman
- Addict
- Posts: 4826
- Joined: Sun Mar 19, 2006 1:57 pm
- Location: Germany
Kaeru Gaman wrote:first time I hear about such a problem...
did you switch BOTH compileroptions (exe and source) to UTF-8?
There is no compiler option for UTF-8, only for the source.

PureBasic 5.73 | SpiderBasic 2.30 | Windows 10 Pro (x64) | Linux Mint 20.1 (x64)
Old bugs good, new bugs bad! Updates are evil: might fix old bugs and introduce no new ones.

PB uses Unicode or ASCII for strings, not UTF-8.
UTF-8 is only required for some editor-like controls, so you can load and write such text,
but PB handles the string in the format enabled in the compiler options.

I don't understand. PB always uses Unicode or ASCII.
You can, for example, read a UTF-8 string from a file, so you have a
Unicode or ASCII string in your variable (not UTF-8), manipulate it and save it back as UTF-8.
UTF-8 is only for import or export to a file, an interface and so on.

UTF-8 is not supported natively by Windows (and thus not by the API). As ts-soft said, this format is really not for native string handling, but for media designed for transmission etc. E.g. string storage in files is often best done using UTF-8 for various reasons.
Now, using the built in memory functions (PeekS, PokeS etc.) there is nothing which you cannot do with strings held in utf-8 format. You can change case, search for substrings... there's no limit.
All you do is grab the memory buffer holding your utf-8 string, convert it to the native format (using PeekS(..., ..., #PB_UTF8)) - this format will either be Ascii or Unicode depending on your compiler settings. When done you can write the modified string back to a buffer in utf-8 format (if you require) using PokeS().
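A minimal sketch of that round trip, in case it helps. It assumes a Unicode build; the variable names (*utf8, native$, lower$) are only illustrative, and the first PokeS() merely stands in for UTF-8 data that would normally come from a file or similar.

```purebasic
; Sketch only: lower-case text that is held in a UTF-8 memory buffer.
; All names here are illustrative, not taken from the posts above.

Define source$ = "GR" + Chr(220) + "SSE"   ; Chr(220) is U-umlaut in a Unicode build
Define native$, lower$
Define *utf8

; Stand-in for UTF-8 data received from a file, socket etc.
*utf8 = AllocateMemory(StringByteLength(source$, #PB_UTF8) + 1)   ; +1 for the null
If *utf8
  PokeS(*utf8, source$, -1, #PB_UTF8)

  ; 1. Decode the UTF-8 bytes into a native PB string.
  native$ = PeekS(*utf8, -1, #PB_UTF8)
  FreeMemory(*utf8)

  ; 2. Use the ordinary string functions on the native string.
  lower$ = LCase(native$)

  ; 3. Re-encode to UTF-8; the byte size may differ, so measure it again.
  *utf8 = AllocateMemory(StringByteLength(lower$, #PB_UTF8) + 1)
  If *utf8
    PokeS(*utf8, lower$, -1, #PB_UTF8)
    Debug PeekS(*utf8, -1, #PB_UTF8)   ; decode once more just to display it
    FreeMemory(*utf8)
  EndIf
EndIf
```

The point is that LCase() only ever sees a native string; the UTF-8 form exists only in the memory buffer.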
I may look like a mule, but I'm not a complete ass.
- Michael Vogel
- Addict
- Posts: 2797
- Joined: Thu Feb 09, 2006 11:27 pm
- Contact:
srod wrote:Now, using the built in memory functions (PeekS, PokeS etc.) there is nothing which you cannot do with strings held in utf-8 format. You can change case, search for substrings... there's no limit.
What do you think is the best way to check the amount of memory needed for the string being created when using PokeS(text.s,-1,#PB_UTF8)?
I could allocate twice the length of the original string, but it would be nice to find a way to take only the memory that is really needed.
Michael
Code: Select all
StringByteLength(string$, #PB_UTF8) + 1

Twice as many bytes as the number of characters wouldn't necessarily be enough because utf-8 is a variable length encoding with some characters requiring 4 bytes etc.
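A rough sketch of the sizing, assuming a Unicode build; text$, chars, bytes and *buf are just illustrative names. The euro sign is used because it needs three bytes in UTF-8, so doubling the character count would fall short.

```purebasic
; Sketch only: measure the UTF-8 size before allocating the buffer.

Define text$ = Chr(8364) + Chr(8364)               ; two euro signs (Unicode build assumed)
Define chars = Len(text$)                          ; number of characters
Define bytes = StringByteLength(text$, #PB_UTF8)   ; UTF-8 bytes, without the null
Define *buf

Debug "characters: " + Str(chars)
Debug "UTF-8 bytes: " + Str(bytes)

*buf = AllocateMemory(bytes + 1)                   ; +1 for the terminating null
If *buf
  PokeS(*buf, text$, -1, #PB_UTF8)
  ; ... hand *buf to whatever expects UTF-8 ...
  FreeMemory(*buf)
EndIf
```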
http://www.xs4all.nl/~bluez/datatalk/pu ... bytelength
Edit: that's what I get for walking away from the keyboard, I'm a half hour behind Srod...
Well, I'm always a half hour behind anything, pretty much, come to think of it

( PB6.00 LTS Win11 x64 Asrock AB350 Pro4 Ryzen 5 3600 32GB GTX1060 6GB)
( The path to enlightenment and the PureBasic Survival Guide right here... )
- Michael Vogel
- Addict
- Posts: 2797
- Joined: Thu Feb 09, 2006 11:27 pm
- Contact:
I need some string conversion functions for UTF and ASCII strings, so I wrote the following procedures.
The positive point is that they work and are fast enough for normal things.
There are still some points to be careful about: the string given to the StringToASCII procedure has to be in UTF-8 format; if it is already an ASCII string it may be cut off.
Code: Select all
Procedure.s StringToUTF(s.s)
  #AutoLength=-1
  Protected buffer.s
  buffer=Space(StringByteLength(s,#PB_UTF8))
  PokeS(@buffer,s,#AutoLength,#PB_UTF8)
  ProcedureReturn buffer
EndProcedure

Procedure.s StringToASCII(s.s)
  ; in the current version the string MUST be in UTF-8 format ('Weinstraßenlauf' >> 'Weinstraßenlauf'),
  ; otherwise the string is truncated ('Weinstraßenlauf' >> 'Weinstra') !
  Protected buffer.s
  #AutoLength=-1
  s=PeekS(@s,#AutoLength,#PB_UTF8)
  buffer=Space(StringByteLength(s,#PB_Ascii))
  PokeS(@buffer,s,#AutoLength,#PB_Ascii)
  ProcedureReturn buffer
EndProcedure

Procedure.s StringToFilename(s.s)
  ; replace characters that are not allowed in file names with a space
  Protected z=Len(s)
  While z
    If FindString("\:/<*|?>"+#DQUOTE$,Mid(s,z,1),1)
      PokeB(@s+z-1,32)
    EndIf
    z-1
  Wend
  ProcedureReturn s
EndProcedure
Michael Vogel wrote:I need some string conversion functions for UTF and ASCII strings, so I wrote the following procedures.
The positive point is that they work and are fast enough for normal things.
There are still some points to be careful about: the string given to the StringToASCII procedure has to be in UTF-8 format; if it is already an ASCII string it may be cut off.
You have to add one byte for the Null when you reserve buffer space. Try your code with this slight modification:
Code: Select all
Procedure.s StringToUTF(S.s)
  #AutoLength = -1
  Protected Buffer.s
  Buffer = Space(StringByteLength(S,#PB_UTF8) + 1) ;<== add a byte for the Null
  PokeS(@Buffer,S,#AutoLength,#PB_UTF8)
  ProcedureReturn Buffer
EndProcedure

Procedure.s StringToASCII(S.s)
  ; in the current version the string MUST be in UTF-8 format ('Weinstraßenlauf' >> 'Weinstraßenlauf'),
  ; otherwise the string is truncated ('Weinstraßenlauf' >> 'Weinstra') !
  #AutoLength = -1
  Protected Buffer.s
  S = PeekS(@S,#AutoLength,#PB_UTF8)
  Buffer = Space(StringByteLength(S,#PB_Ascii) + 1) ;<== add a byte for the Null
  PokeS(@Buffer,S,#AutoLength,#PB_Ascii)
  ProcedureReturn Buffer
EndProcedure

Procedure.s StringToFilename(S.s)
  ; replace characters that are not allowed in file names with a space
  Protected Z = Len(S)
  While Z
    If FindString("\:/<*|?>"+#DQUOTE$,Mid(S,Z,1),1)
      ; step by character size so this also works in a Unicode build
      PokeC(@S + (Z - 1) * SizeOf(Character),32)
    EndIf
    Z - 1
  Wend
  ProcedureReturn S
EndProcedure