UTF-8 support for strings

ts-soft
Always Here
Posts: 5756
Joined: Thu Jun 24, 2004 2:44 pm
Location: Berlin - Germany

Post by ts-soft »

It is not safe to keep other formats in a PB string variable; use a memory buffer for this:

Code: Select all

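; Encode a PB string as UTF-8 into a newly allocated buffer.
; The caller owns the buffer and must release it with FreeMemory().
Procedure StringToUTF(S.s)
  Protected *Buffer

  ; StringByteLength() does not count the terminator, so add one byte for the null.
  *Buffer = AllocateMemory(StringByteLength(S, #PB_UTF8) + 1)
  If *Buffer
    PokeS(*Buffer, S, -1, #PB_UTF8) ; -1 = write the whole string
  EndIf

  ProcedureReturn *Buffer ; 0 if the allocation failed
EndProcedure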
PureBasic 5.73 | SpiderBasic 2.30 | Windows 10 Pro (x64) | Linux Mint 20.1 (x64)
Old bugs good, new bugs bad! Updates are evil: might fix old bugs and introduce no new ones.
blueznl
PureBasic Expert
Posts: 6166
Joined: Sat May 17, 2003 11:31 am

Post by blueznl »

Demivec wrote: You have to add one byte for the Null when you reserve buffer space.
Actually TWO when you work in Windows Unicode / UTF-16: even the null terminator is two bytes wide there.
( PB6.00 LTS Win11 x64 Asrock AB350 Pro4 Ryzen 5 3600 32GB GTX1060 6GB)
( The path to enlightenment and the PureBasic Survival Guide right here... )
Demivec
Addict
Posts: 4260
Joined: Mon Jul 25, 2005 3:51 pm
Location: Utah, USA

Post by Demivec »

blueznl wrote:
Demivec wrote: You have to add one byte for the Null when you reserve buffer space.
Actually TWO when you work in Windows Unicode / UTF16.... even the null comes double...
True, but the procedures are designed specifically for Ascii-to-UTF8 and UTF8-to-Ascii.
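
To make the one-byte versus two-byte terminator point concrete, here is a minimal sketch. `BufferSizeFor()` is a hypothetical helper name, not something from the procedures discussed here; it only relies on `StringByteLength()`, which does not count the terminator.

```purebasic
; Hypothetical helper: bytes needed to store Text$ in the given format,
; including the null terminator.
Procedure.i BufferSizeFor(Text$, Format)
  Protected Terminator = 1   ; ASCII and UTF-8 end with a single zero byte

  If Format = #PB_Unicode
    Terminator = 2           ; UTF-16 ends with a two-byte zero character
  EndIf

  ProcedureReturn StringByteLength(Text$, Format) + Terminator
EndProcedure

Debug BufferSizeFor("abc", #PB_Ascii)   ; 3 + 1 = 4
Debug BufferSizeFor("abc", #PB_UTF8)    ; 3 + 1 = 4
Debug BufferSizeFor("abc", #PB_Unicode) ; 6 + 2 = 8
```

Reserving two bytes in every case, as some of the code in this thread does, wastes at most one byte and is always large enough.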
spacefractal
User
Posts: 17
Joined: Tue Jan 24, 2006 7:05 pm

Post by spacefractal »

ts-soft's version didn't work as it should when I tested it, so here is a modified version that worked for me (these functions convert in both directions):

Code: Select all

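; UTF-8 to Unicode/ASCII (depending on whether the app is compiled as Unicode or not).
; Expects a string whose characters each hold one raw UTF-8 byte.
Procedure.s Unicode(s.s)
  Protected *Buffer, Result$
  
  ; +2 leaves room for the null (1 byte is enough for ASCII/UTF-8; 2 is a safe upper bound).
  *Buffer = AllocateMemory(StringByteLength(s, #PB_UTF8) + 2)
  If *Buffer
    PokeS(*Buffer, s, -1, #PB_Ascii)        ; write the raw bytes back to memory
    Result$ = PeekS(*Buffer, -1, #PB_UTF8)  ; decode them as UTF-8
    FreeMemory(*Buffer)
  EndIf
  ProcedureReturn Result$
EndProcedure
 
; Unicode/ASCII (depending on whether the app is compiled as Unicode or not) to UTF-8.
; Returns a string whose characters each hold one raw UTF-8 byte.
; Note: newer PureBasic versions ship a built-in UTF8() function, so this name may need changing there.
Procedure.s UTF8(s.s)
  Protected *Buffer, Result$
  
  *Buffer = AllocateMemory(StringByteLength(s, #PB_UTF8) + 2) ; same +2 margin as above
  If *Buffer
    PokeS(*Buffer, s, -1, #PB_UTF8)          ; encode as UTF-8
    Result$ = PeekS(*Buffer, -1, #PB_Ascii)  ; read the bytes back, one per character
    FreeMemory(*Buffer)
  EndIf
  ProcedureReturn Result$
EndProcedure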
 
UTF-8 is stored as a sequence of single bytes, like ASCII, but uses a variable number of bytes to encode each character. Hence the data needs to be "saved" as ASCII (one byte per character) and then converted to a string using #PB_UTF8.

Post by ts-soft »

PB's string manager supports only Unicode in Unicode applications and ASCII
in ASCII applications. Returning a string variable that holds UTF-8 data in
its buffer is not safe. UTF-8 is only safe in allocated memory, never in a
string variable.
UTF-8 is never required in a string variable.
If a lib requires UTF-8, you can use a pseudotype or a pointer to memory.
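
A minimal sketch of both options, assuming a Unicode build. `ExpectsUTF8()` is a hypothetical stand-in for a library function that takes a pointer to null-terminated UTF-8 text, and `ExpectsUTF8Proto` and `CallLib` are illustrative names as well. The `p-utf8` pseudotype is only available in `Prototype` and `Import` declarations.

```purebasic
; Hypothetical stand-in for a library function that expects a pointer to
; null-terminated UTF-8 text. It returns the number of bytes before the null.
Procedure.i ExpectsUTF8(*Text)
  Protected Count
  While PeekA(*Text + Count)
    Count + 1
  Wend
  ProcedureReturn Count
EndProcedure

Define Text$ = "gr" + Chr(252) + "n"   ; Chr(252) is a u-umlaut in a Unicode build

; Option 1: pointer to memory. Encode into a buffer and pass its address.
Define *Utf8 = AllocateMemory(StringByteLength(Text$, #PB_UTF8) + 1)
If *Utf8
  PokeS(*Utf8, Text$, -1, #PB_UTF8)
  Debug ExpectsUTF8(*Utf8)             ; 5 in a Unicode build: the umlaut takes two bytes
  FreeMemory(*Utf8)
EndIf

; Option 2: pseudotype. PB converts the string to UTF-8 for the duration of the call.
Prototype.i ExpectsUTF8Proto(Text.p-utf8)
Define CallLib.ExpectsUTF8Proto = @ExpectsUTF8()
Debug CallLib(Text$)                   ; same byte count as option 1
```

In both cases the UTF-8 bytes live only in memory owned by the buffer or by the call, never in a PB string variable.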

Post by blueznl »

You can store a UTF-8 string in a string variable, as a UTF-8 string will never contain a zero byte. Of course, PB's string handling commands will all be thrown off track...

Hmm.

Except for Linux, I suppose. Does PB on Linux keep Unicode strings in memory as UTF-16 or as UTF-8?
Michael Vogel
Addict
Posts: 2797
Joined: Thu Feb 09, 2006 11:27 pm

Post by Michael Vogel »

The reason for using such routines is simple: sometimes it is necessary to handle different kinds of files within one program (preferences, database etc.), so both text representations must be handled as well.

In my case, I have to handle (in addition to a simple INI file) GPX, HST and TCX files for GPS data. Normally these files contain UTF-8 text, but sometimes there is also plain ASCII content.

In such cases, my routines above can help. Maybe a fast WhatStringTypeIs() function would be useful to check whether a string is ASCII or UTF-8 formatted.
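
One possible shape for such a check, as a rough sketch rather than a finished routine: it scans a raw byte buffer (for example, file content read into memory) instead of a PB string, and the `#Text_...` constants and the `*Buffer, Size` parameters are assumptions made for illustration. It only checks lead and continuation byte patterns, so it does not reject every overlong or surrogate encoding.

```purebasic
Enumeration
  #Text_Ascii   ; only bytes below 128
  #Text_UTF8    ; at least one multi-byte UTF-8 sequence, none malformed
  #Text_Other   ; bytes above 127 that do not form UTF-8 (e.g. Latin-1/ANSI)
EndEnumeration

; Hypothetical helper: classify Size bytes starting at *Buffer.
Procedure.i WhatStringTypeIs(*Buffer, Size)
  Protected *p.Ascii = *Buffer
  Protected *Limit = *Buffer + Size
  Protected Follow, Result = #Text_Ascii

  While *p < *Limit
    If *p\a < $80
      Follow = 0                     ; plain ASCII byte
    ElseIf *p\a >= $C2 And *p\a <= $DF
      Follow = 1                     ; lead byte of a 2-byte sequence
    ElseIf *p\a >= $E0 And *p\a <= $EF
      Follow = 2                     ; lead byte of a 3-byte sequence
    ElseIf *p\a >= $F0 And *p\a <= $F4
      Follow = 3                     ; lead byte of a 4-byte sequence
    Else
      ProcedureReturn #Text_Other    ; stray continuation or invalid lead byte
    EndIf

    If Follow
      Result = #Text_UTF8
      If *p + Follow >= *Limit
        ProcedureReturn #Text_Other  ; sequence is cut off at the buffer end
      EndIf
    EndIf

    *p + 1
    While Follow > 0
      If (*p\a & $C0) <> $80
        ProcedureReturn #Text_Other  ; continuation bytes must look like 10xxxxxx
      EndIf
      *p + 1
      Follow - 1
    Wend
  Wend

  ProcedureReturn Result
EndProcedure

Define *Test = AllocateMemory(4)
If *Test
  PokeS(*Test, "abc", -1, #PB_Ascii)             ; 61 62 63
  Debug WhatStringTypeIs(*Test, 3)               ; 0 = #Text_Ascii
  PokeA(*Test + 1, $C3) : PokeA(*Test + 2, $BC)  ; 61 C3 BC, "a" plus a UTF-8 u-umlaut
  Debug WhatStringTypeIs(*Test, 3)               ; 1 = #Text_UTF8
  PokeA(*Test + 1, $FC) : PokeA(*Test + 2, $62)  ; 61 FC 62, a Latin-1 u-umlaut
  Debug WhatStringTypeIs(*Test, 3)               ; 2 = #Text_Other
  FreeMemory(*Test)
EndIf
```

A result of `#Text_Ascii` means the content can be read with either format flag; only `#Text_UTF8` and `#Text_Other` need different handling.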