UTF-8 support for strings

Motu · Post by **Motu** » Mon Aug 25, 2008 3:27 pm

Although it works fine to store UTF-8 strings in PB string variables, most of the string functions do not work with UTF-8. For example:

LCase(String) and UCase(String) will return wrong characters when given a UTF-8 string. For these functions a flag would be nice:

lcase(String [,StringFormat])

The same is needed for:
Left, Right, Mid etc.

Under Windows you can use CharLower_(), for example, but as it is part of the Windows API, it will not work on Linux.

Kaeru Gaman · Post by **Kaeru Gaman** » Mon Aug 25, 2008 3:29 pm

That's the first time I've heard of such a problem...
Did you switch BOTH compiler options (exe and source) to UTF-8?

Motu · Post by **Motu** » Mon Aug 25, 2008 3:36 pm

Hi Kaeru,
as far as I know there is no compiler option for UTF-8, only for Unicode, which is something different.

ts-soft · Post by **ts-soft** » Mon Aug 25, 2008 3:37 pm

Kaeru Gaman wrote: That's the first time I've heard of such a problem...
Did you switch BOTH compiler options (exe and source) to UTF-8?

There is no UTF-8 compiler option for the executable, only for the source.

Motu · Post by **Motu** » Mon Aug 25, 2008 3:39 pm

So, is there a solution for this problem? If so, please post it.

ts-soft · Post by **ts-soft** » Mon Aug 25, 2008 3:43 pm

PB uses Unicode or ASCII for strings, not UTF-8.
UTF-8 is only required for some editor-like controls, so you can load and write such text,
but PB holds the string in the format enabled in the compiler options.

Motu · Post by **Motu** » Mon Aug 25, 2008 3:57 pm

So the main point is: I can store UTF-8 strings in a PB string variable, but none of the manipulation functions work correctly.
Isn't there any solution for this that works under Linux as well (like the Windows API function, just for Linux)?

ts-soft · Post by **ts-soft** » Mon Aug 25, 2008 4:23 pm

I don't understand. PB always uses Unicode or ASCII.
You can, for example, read a UTF-8 string from a file; you then have a
Unicode or ASCII string in your variable (not UTF-8), which you can manipulate and save back as UTF-8.
UTF-8 is only for import or export to a file, an interface and so on.

srod · Post by **srod** » Mon Aug 25, 2008 4:26 pm

UTF-8 is not supported natively by Windows (and thus not by the API). As ts-soft said, this format is really not meant for native string handling, but for media designed for transmission etc. E.g. string storage in files is often best done using UTF-8 for various reasons.

Now, using the built-in memory functions (PeekS, PokeS etc.) there is nothing you cannot do with strings held in UTF-8 format. You can change case, search for substrings... there's no limit.

All you do is grab the memory buffer holding your UTF-8 string and convert it to the native format (using PeekS(..., ..., #PB_UTF8)); this format will be either ASCII or Unicode depending on your compiler settings. When done, you can write the modified string back to a buffer in UTF-8 format (if required) using PokeS() with the #PB_UTF8 flag.
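
The round trip described above could look roughly like the following. This is only a sketch: the procedure name `UCaseUTF8` and the `bufferSize` parameter are made up for illustration, and it assumes the buffer holds a null-terminated UTF-8 string. In an ASCII build, characters outside the current code page may be lost during the conversion.

```purebasic
; Hypothetical helper (not part of PB): upper-case a null-terminated
; UTF-8 string held in a writable memory buffer of bufferSize bytes.
Procedure.i UCaseUTF8(*utf8, bufferSize.i)
  Protected native.s, needed.i

  ; Decode the UTF-8 bytes into a native PB string
  ; (ASCII or Unicode, depending on the compiler settings).
  native = PeekS(*utf8, -1, #PB_UTF8)

  ; The normal string functions now operate on whole characters.
  native = UCase(native)

  ; Re-encode only if the result plus terminating null still fits.
  needed = StringByteLength(native, #PB_UTF8) + 1
  If needed > bufferSize
    ProcedureReturn #False
  EndIf

  PokeS(*utf8, native, -1, #PB_UTF8)
  ProcedureReturn #True
EndProcedure

; Usage: fill a buffer with UTF-8 text, convert it, read it back.
*buf = AllocateMemory(64)
If *buf
  PokeS(*buf, "weinstraßenlauf", -1, #PB_UTF8)
  If UCaseUTF8(*buf, 64)
    Debug PeekS(*buf, -1, #PB_UTF8)
  EndIf
  FreeMemory(*buf)
EndIf
```

The size check is there because changing case can, in principle, change the number of bytes a character needs in UTF-8.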

Michael Vogel · Post by **Michael Vogel** » Sun Aug 31, 2008 8:21 am

srod wrote: Now, using the built-in memory functions (PeekS, PokeS etc.) there is nothing you cannot do with strings held in UTF-8 format. You can change case, search for substrings... there's no limit.

What do you think is the best way to determine the amount of memory needed for the string being created when using PokeS(text.s,-1,#PB_UTF8)?

I could allocate twice the length of the original string, but it would be nice to find a way to take only the memory that is really needed.

Michael

srod · Post by **srod** » Sun Aug 31, 2008 10:33 am

Code: Select all

StringByteLength(string$, #PB_UTF8) + 1

The + 1 is for the terminating null byte. Twice as many bytes as the number of characters wouldn't necessarily be enough, because UTF-8 is a variable-length encoding, with some characters requiring 4 bytes.
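
As a rough sketch of how that size can be used to allocate exactly what is needed (the variable names and the sample text are just for illustration):

```purebasic
text.s = "Weinstraßenlauf"

; Bytes needed for the UTF-8 form, plus one for the terminating null.
size = StringByteLength(text, #PB_UTF8) + 1

*buffer = AllocateMemory(size)
If *buffer
  PokeS(*buffer, text, -1, #PB_UTF8)

  Debug Len(text)   ; number of characters
  Debug size - 1    ; number of UTF-8 bytes (larger when non-ASCII characters are present)

  FreeMemory(*buffer)
EndIf
```

Using AllocateMemory() instead of a string variable as the buffer keeps the UTF-8 bytes separate from PB's native strings, so nothing tries to interpret them as ASCII or Unicode text.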

blueznl · Post by **blueznl** » Sun Aug 31, 2008 10:52 am

http://www.xs4all.nl/~bluez/datatalk/pu ... bytelength

Edit: that's what I get for walking away from the keyboard, I'm a half hour behind Srod...

Well, I'm always a half hour behind anything, pretty much, come to think of it

Michael Vogel · Post by **Michael Vogel** » Sun Aug 31, 2008 3:02 pm

srod & blueznl, you're both fast enough.

I just went for a short run in the late summer sun, and by the time I got back you had (once again) come up with the right answers for me.

Thanks to you (and all others) in this forum, I love you

Michael Vogel · Post by **Michael Vogel** » Sat Sep 06, 2008 12:14 pm

I need some string conversion functions for UTF and ASCII strings, so I wrote the following procedures:

The good news is that they work and are fast enough for normal use.

There are still some points to be careful about: the string passed to the StringToASCII procedure has to be in UTF-8 format; if it is already an ASCII string it may be truncated.

Code: Select all

Procedure.s StringToUTF(s.s)

	#AutoLength=-1
	Protected buffer.s
	buffer=Space(StringByteLength(s,#PB_UTF8))
	PokeS(@buffer,s,#AutoLength,#PB_UTF8)

	ProcedureReturn buffer

EndProcedure
Procedure.s StringToASCII(s.s)
	; in the current version the string MUST be in UTF-8 format ('WeinstraÃŸenlauf' >> 'Weinstraßenlauf'),
	; otherwise the string is truncated ('Weinstraßenlauf' >> 'Weinstra') !
	
	Protected buffer.s
	
	#AutoLength=-1
	s=PeekS(@s,#AutoLength,#PB_UTF8)
	buffer=Space(StringByteLength(s,#PB_Ascii))
	PokeS(@buffer,s,#AutoLength,#PB_Ascii)

	ProcedureReturn buffer

EndProcedure
Procedure.s StringToFilename(s.s)

	Protected z=Len(s)
	
	While z
		If FindString("\:/<*|?>"+#DQUOTE$,Mid(s,z,1),1)
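			; replace the character with a space; PokeC with SizeOf(Character) also works in Unicode builds
			PokeC(@s+(z-1)*SizeOf(Character),32)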
		EndIf
		z-1
	Wend
	
	ProcedureReturn s

EndProcedure

Demivec · Post by **Demivec** » Sat Sep 06, 2008 8:39 pm

Michael Vogel wrote: I need some string conversion functions for UTF and ASCII strings, so I wrote the following procedures:

The good news is that they work and are fast enough for normal use.

There are still some points to be careful about: the string passed to the StringToASCII procedure has to be in UTF-8 format; if it is already an ASCII string it may be truncated.

You have to add one byte for the Null when you reserve buffer space. Try your code with this slight modification:

Code: Select all

Procedure.s StringToUTF(S.s)
  #AutoLength = -1
  Protected Buffer.s
  
  Buffer = Space(StringByteLength(S,#PB_UTF8) + 1) ;<== add a byte for the Null
  PokeS(@Buffer,S,#AutoLength,#PB_UTF8)
  
  ProcedureReturn Buffer
EndProcedure

Procedure.s StringToASCII(S.s)
  ; in the current version the string MUST be in UTF-8 format ('WeinstraÃŸenlauf' >> 'Weinstraßenlauf'),
  ; otherwise the string is truncated ('Weinstraßenlauf' >> 'Weinstra') !
  #AutoLength = -1
  Protected Buffer.s

  S = PeekS(@S,#AutoLength,#PB_UTF8)
  Buffer = Space(StringByteLength(S,#PB_Ascii) + 1) ;<== add a byte for the Null
  PokeS(@Buffer,S,#AutoLength,#PB_Ascii)
  
  ProcedureReturn Buffer
EndProcedure

Procedure.s StringToFilename(S.s)
  Protected Z = Len(S)
  
  While Z
    If FindString("\:/<*|?>"+#DQUOTE$,Mid(S,Z,1),1)
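      ; replace the character with a space; PokeC with SizeOf(Character) also works in Unicode builds
      PokeC(@S + (Z - 1) * SizeOf(Character),32)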
    EndIf
    Z - 1
  Wend
  
  ProcedureReturn S
EndProcedure