UTF-8 support for strings

Motu
Enthusiast
Posts: 160
Joined: Tue Oct 19, 2004 12:24 pm

Post by Motu »

Although it works fine to store UTF-8 strings in the PB string types, most of the string functions do not work with UTF-8. For example:

LCase(String) and UCase(String) return wrong characters when given a UTF-8 string. An optional flag would be nice for these functions:

LCase(String [, StringFormat])

The same is needed for:
Left, Right, Mid etc.

Under Windows you can use CharLower_(), for example, but as it is part of the Windows library it will not work on Linux.
Kaeru Gaman
Addict
Posts: 4826
Joined: Sun Mar 19, 2006 1:57 pm
Location: Germany

Post by Kaeru Gaman »

This is the first time I've heard of such a problem...
Did you switch BOTH compiler options (exe and source) to UTF-8?
Oh... and have a nice day.

option

Post by Motu »

hi Kaeru,
as far as I know there is no compiler option for UTF-8, only for Unicode, which is something different.
ts-soft
Always Here
Posts: 5756
Joined: Thu Jun 24, 2004 2:44 pm
Location: Berlin - Germany

Post by ts-soft »

Kaeru Gaman wrote:
This is the first time I've heard of such a problem...
Did you switch BOTH compiler options (exe and source) to UTF-8?

There is no compiler option for UTF-8 for the exe, only for the source :wink:
PureBasic 5.73 | SpiderBasic 2.30 | Windows 10 Pro (x64) | Linux Mint 20.1 (x64)
Old bugs good, new bugs bad! Updates are evil: might fix old bugs and introduce no new ones.

solution

Post by Motu »

So, is there a solution for this problem? If so, please post it :)

Post by ts-soft »

PB uses Unicode or ASCII for strings, not UTF-8.
UTF-8 is only required for some editor-like controls, so you can load and
write such text, but PB holds the string in the format enabled in the compiler options.

Post by Motu »

So, the main point is: I can store UTF-8 strings in a PB string field, but none of the manipulation functions work correctly.
Isn't there any solution for this that works under Linux as well (like the Windows API function, but for Linux)?

Post by ts-soft »

I don't understand. PB always uses Unicode or ASCII.
You can, for example, read a UTF-8 string from a file; you then have a
Unicode or ASCII string in your variable (not UTF-8), which you can manipulate and save back as UTF-8.
UTF-8 is only for import or export to a file, an interface and so on.
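
A minimal sketch of that read, manipulate, write-back cycle might look like the following. The file names "utf8_demo.txt" and "utf8_demo_out.txt" and the variable names are made up for illustration, and the flag form of ReadString() and WriteStringN() is assumed to be available in your PB version. No BOM handling is shown.

```purebasic
; Sketch only: file names and variable names are illustrative assumptions.
EnableExplicit

Define line$, result$

; Create a small UTF-8 sample file so the sketch is self-contained.
If CreateFile(0, "utf8_demo.txt")
  WriteStringN(0, "GR" + Chr(220) + "SSE AUS BERLIN", #PB_UTF8)   ; Chr(220) = capital U umlaut
  CloseFile(0)
EndIf

; Import: the UTF-8 bytes on disk become a native (Unicode or ASCII) string.
If ReadFile(0, "utf8_demo.txt")
  line$ = ReadString(0, #PB_UTF8)
  CloseFile(0)
EndIf

; Manipulate with the ordinary string functions.
result$ = LCase(line$)

; Export: the native string is written back to disk as UTF-8.
If CreateFile(1, "utf8_demo_out.txt")
  WriteStringN(1, result$, #PB_UTF8)
  CloseFile(1)
EndIf

Debug result$
```

The string is only UTF-8 while it sits in the file; between ReadString() and WriteStringN() it is in whatever format the compiler options select.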
srod
PureBasic Expert
Posts: 10589
Joined: Wed Oct 29, 2003 4:35 pm
Location: Beyond the pale...

Post by srod »

UTF-8 is not supported natively by Windows (and thus not by the API). As ts-soft said, this format is really not for native string handling, but for media designed for transmission etc. E.g. string storage in files is often best done using UTF-8 for various reasons.

Now, using the built in memory functions (PeekS, PokeS etc.) there is nothing which you cannot do with strings held in utf-8 format. You can change case, search for substrings... there's no limit.

All you do is grab the memory buffer holding your utf-8 string, convert it to the native format (using PeekS(..., ..., #PB_UTF8)) - this format will either be Ascii or Unicode depending on your compiler settings. When done you can write the modified string back to a buffer in utf-8 format (if you require) using PokeS().
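
As a rough sketch of those steps (the buffer names *utf8 and *out and the sample text are invented for illustration; the first PokeS() merely simulates UTF-8 data that would normally arrive from a file or socket):

```purebasic
; Sketch only: variable names and sample text are illustrative assumptions.
EnableExplicit

Define original$ = "weinstra" + Chr(223) + "enlauf"   ; Chr(223) = sharp s
Define native$
Define *utf8, *out

; Simulate a UTF-8 buffer obtained from elsewhere (+1 byte for the terminating Null).
*utf8 = AllocateMemory(StringByteLength(original$, #PB_UTF8) + 1)
If *utf8
  PokeS(*utf8, original$, -1, #PB_UTF8)

  ; Step 1: UTF-8 buffer -> native string (Ascii or Unicode, per compiler settings).
  native$ = PeekS(*utf8, -1, #PB_UTF8)

  ; Step 2: the ordinary string functions now apply.
  native$ = UCase(native$)

  ; Step 3: native string -> fresh UTF-8 buffer, sized for the modified text.
  *out = AllocateMemory(StringByteLength(native$, #PB_UTF8) + 1)
  If *out
    PokeS(*out, native$, -1, #PB_UTF8)
    Debug PeekS(*out, -1, #PB_UTF8)
    FreeMemory(*out)
  EndIf

  FreeMemory(*utf8)
EndIf
```

The output buffer is sized from the modified string rather than reusing the input buffer, since a case change is not guaranteed to keep the same UTF-8 byte length.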
I may look like a mule, but I'm not a complete ass.
Michael Vogel
Addict
Posts: 2797
Joined: Thu Feb 09, 2006 11:27 pm

Post by Michael Vogel »

srod wrote:
Now, using the built in memory functions (PeekS, PokeS etc.) there is nothing which you cannot do with strings held in utf-8 format. You can change case, search for substrings... there's no limit.

What do you think is the best way to determine the amount of memory needed for the string being created when using PokeS(*buffer,text.s,-1,#PB_UTF8)?

I could allocate twice the length of the original string, but it would be nice to find a way to take only the memory that is really needed.

Michael

Post by srod »

Code: Select all

StringByteLength(string$, #PB_UTF8) + 1
:wink:

Twice as many bytes as the number of characters wouldn't necessarily be enough because utf-8 is a variable length encoding with some characters requiring 4 bytes etc.
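
To illustrate with a small sketch (it assumes a Unicode build, where Chr() accepts values above 255; the variable names are made up):

```purebasic
; Sketch only: byte counts assume a Unicode build.
EnableExplicit

Define text$ = "a" + Chr(223) + Chr($20AC)   ; 'a', sharp s (U+00DF), euro sign (U+20AC)
Define needed
Define *buffer

Debug Len(text$)                             ; 3 characters
Debug StringByteLength(text$, #PB_UTF8)      ; 6 bytes: 1 + 2 + 3

; Allocate exactly what PokeS() will write, including the terminating Null.
needed = StringByteLength(text$, #PB_UTF8) + 1
*buffer = AllocateMemory(needed)
If *buffer
  PokeS(*buffer, text$, -1, #PB_UTF8)
  Debug MemorySize(*buffer)                  ; 7
  FreeMemory(*buffer)
EndIf
```

Here a buffer of Len(text$) * 2 = 6 bytes would already be one byte too small once the Null is counted.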
blueznl
PureBasic Expert
Posts: 6166
Joined: Sat May 17, 2003 11:31 am

Post by blueznl »

http://www.xs4all.nl/~bluez/datatalk/pu ... bytelength


Edit: that's what I get for walking away from the keyboard, I'm a half hour behind Srod...

Well, I'm always a half hour behind anything, pretty much, come to think of it :oops:
( PB6.00 LTS Win11 x64 Asrock AB350 Pro4 Ryzen 5 3600 32GB GTX1060 6GB)
( The path to enlightenment and the PureBasic Survival Guide right here... )

Post by Michael Vogel »

srod & blueznl, you're both fast enough :wink:

I just went for a short run in the late summer sun, and by the time I got back you had (once again) given me the right answers.

Thanks to you (and all others) in this forum, I love you :D

Post by Michael Vogel »

I need some string conversion functions for UTF-8 and ASCII strings, so I wrote the following procedures.

The good news is that they work and are fast enough for normal use. :)

There are still some points to be careful about: the string passed to the StringToASCII procedure has to be in UTF-8 format. If it is already an ASCII string, it may be truncated :(

Code: Select all

Procedure.s StringToUTF(s.s)

	#AutoLength=-1
	Protected buffer.s
	buffer=Space(StringByteLength(s,#PB_UTF8))
	PokeS(@buffer,s,#AutoLength,#PB_UTF8)

	ProcedureReturn buffer

EndProcedure
Procedure.s StringToASCII(s.s)
	; in the current version the string MUST be in UTF-8 format ('Weinstraßenlauf' >> 'Weinstraßenlauf'),
	; otherwise the string is truncated ('Weinstraßenlauf' >> 'Weinstra') !
	
	Protected buffer.s
	
	#AutoLength=-1
	s=PeekS(@s,#AutoLength,#PB_UTF8)
	buffer=Space(StringByteLength(s,#PB_Ascii))
	PokeS(@buffer,s,#AutoLength,#PB_Ascii)

	ProcedureReturn buffer

EndProcedure
Procedure.s StringToFilename(s.s)

	Protected z=Len(s)
	
	While z
		If FindString("\:/<*|?>"+#DQUOTE$,Mid(s,z,1),1)
			PokeB(@s+z-1,32)
		EndIf
		z-1
	Wend
	
	ProcedureReturn s

EndProcedure
Demivec
Addict
Posts: 4260
Joined: Mon Jul 25, 2005 3:51 pm
Location: Utah, USA

Post by Demivec »

Michael Vogel wrote:
I need some string conversion functions for UTF-8 and ASCII strings, so I wrote the following procedures.

The good news is that they work and are fast enough for normal use. :)

There are still some points to be careful about: the string passed to the StringToASCII procedure has to be in UTF-8 format. If it is already an ASCII string, it may be truncated :(

You have to add one byte for the Null when you reserve buffer space. Try your code with this slight modification:

Code: Select all

Procedure.s StringToUTF(S.s)
  #AutoLength = -1
  Protected Buffer.s
  
  Buffer = Space(StringByteLength(S,#PB_UTF8) + 1) ;<== add a byte for the Null
  PokeS(@Buffer,S,#AutoLength,#PB_UTF8)
  
  ProcedureReturn Buffer
EndProcedure

Procedure.s StringToASCII(S.s)
  ; in the current version the string MUST be in UTF-8 format ('Weinstraßenlauf' >> 'Weinstraßenlauf'),
  ; otherwise the string is truncated ('Weinstraßenlauf' >> 'Weinstra') !
  #AutoLength = -1
  Protected Buffer.s

  S = PeekS(@S,#AutoLength,#PB_UTF8)
  Buffer = Space(StringByteLength(S,#PB_Ascii) + 1) ;<== add a byte for the Null
  PokeS(@Buffer,S,#AutoLength,#PB_Ascii)
  
  ProcedureReturn Buffer
EndProcedure

Procedure.s StringToFilename(S.s)
  Protected Z = Len(S)
  
  While Z
    If FindString("\:/<*|?>"+#DQUOTE$,Mid(S,Z,1),1)
      PokeC(@S + (Z - 1) * SizeOf(Character), 32) ; character-sized write, so it also works in Unicode builds
    EndIf
    Z - 1
  Wend
  
  ProcedureReturn S
EndProcedure