Understanding strings in Unicode


Post by Oso »

I understand from the documentation at https://www.purebasic.com/documentation ... index.html that strings are stored as Unicode. Therefore, if I want to see ASCII strings in memory, presumably I need to step over alternate bytes. But if I don't do that, why do I see the strange values 30976 and 8192 at a single memory location? Is it because PeekC reads two-byte pairs?

Code: Select all

st.s = "My String"
ptr = @st.s
lg = Len(st.s) * 2
For mem=ptr To ptr+lg
  Debug PeekC(mem)
Next mem
Output:

77
30976
121
8192
32
21248
[...]
105
28160
110
26368
103
0
0

If I change my code to use PeekA, as below, it shows me 0 for the alternate bytes. Therefore, I'm not sure I understand why I get 30976, 8192 for PeekC.

Code: Select all

st.s = "My String"
ptr = @st.s
lg = Len(st.s) * 2
For mem=ptr To ptr+lg
  Debug PeekA(mem)
Next mem

Post by STARGÅTE »

You have to step through in 2-byte increments.

Code: Select all

st.s = "My String"
ptr = @st.s
lg = Len(st.s) * 2
For mem=ptr To ptr+lg Step SizeOf(Character)
  Debug PeekC(mem)
Next mem

Post by Oso »

STARGÅTE wrote: Thu Sep 29, 2022 11:12 pm You have to go 2 byte steps.
Thanks STARGÅTE, yes I found that I needed to skip 1, but if I don't, where is the 30976 coming from? That's what I didn't follow.

Post by idle »

You were stepping through one byte at a time but reading 2 bytes each time, so you were getting 77 (a normal read), then 121<<8 = 30976 (the low byte of the next character landing in the high byte of the read), then 121, and so on.
The PB type Character is legacy now. It was used so that the same code could be compiled as ASCII or Unicode, so its size depended on the compiler mode: 1 byte in ASCII mode, 2 bytes in Unicode mode.
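
A minimal sketch of that arithmetic, assuming a little-endian machine and the two-byte string storage discussed in this thread. The variable names are illustrative only, and PeekU is used in place of PeekC simply to make the 16-bit read explicit:

```purebasic
; Sketch only: variable names are illustrative, little-endian layout assumed.
st.s = "My String"
*p = @st

Debug PeekA(*p)          ; 77  -> low byte of "M"
Debug PeekA(*p + 1)      ; 0   -> high byte of "M"
Debug PeekA(*p + 2)      ; 121 -> low byte of "y"

Debug PeekU(*p)          ; 77    (aligned read: bytes 77, 0)
Debug PeekU(*p + 1)      ; 30976 (misaligned read: bytes 0, 121)

; The misaligned value is the high byte of "M" plus the low byte of "y" shifted up:
Debug PeekA(*p + 1) + (PeekA(*p + 2) << 8)   ; 0 + 121 * 256 = 30976
```

The same reasoning gives 8192 for the next odd offset: the space character (32) shifted left by 8 bits.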

The most efficient way of parsing a string in PB is with a While loop, which is fine because PB strings are null-terminated,
using the Unicode or Ascii structure directly on a pointer:

Code: Select all

st.s = "My String"
*ptr.unicode = @st.s

; Walk the UTF-16 string until the terminating 0
While *ptr\u ; <> 0
  Debug *ptr\u
  *ptr+SizeOf(unicode) ; +2
Wend

*utf8 = UTF8(st)  ; convert the string st to UTF-8

Structure ar     ; a dummy array structure of type .a
  a.a[0]
EndStructure

*pa.ar = *utf8    ; point the structure at the UTF-8 buffer

; Buffer size minus the terminating 0. The size is stored by the OS on Windows
; and before the pointer on Linux and OSX.
sz = MemorySize(*utf8)-1
While a < sz
  Debug *pa\a[a]
  a + SizeOf(Ascii) ; will get replaced with 1
Wend

FreeMemory(*utf8)


Additionally, if you use the functions UTF8() or Ascii() to convert a string, the byte length is available with a call to MemorySize(). On Windows it's provided by the C runtime, and on Linux and OSX it's stored before the pointer, so it's quick, and safer if you don't want to risk overflows.
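
A small sketch of that bounded approach. The text and variable names are illustrative, the source file is assumed to be saved as UTF-8, and the relationship between MemorySize() and the payload length is taken from the description above rather than verified independently:

```purebasic
; Sketch only: the text and variable names are illustrative.
text.s = "aou äöü"
*buf = UTF8(text)                      ; newly allocated, null-terminated UTF-8 copy

If *buf
  payload = StringByteLength(text, #PB_UTF8)  ; bytes without the terminating 0
  Debug payload
  Debug MemorySize(*buf)               ; per the post above: payload plus terminator

  ; Walk the bytes with an explicit bound instead of relying on the terminator.
  For i = 0 To payload - 1
    Debug Hex(PeekA(*buf + i))
  Next

  FreeMemory(*buf)
EndIf
```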

Post by Oso »

idle wrote: Fri Sep 30, 2022 12:08 am You were stepping through by a byte but reading 2 bytes so you were getting low 77, high 121<<8, low 121
Many thanks Idle, I've got it now.
idle wrote: Fri Sep 30, 2022 12:08 am The most efficient way of passing a string in pb is with a while loop which is ok as PB strings are null terminated and use unicode or ascii directly on pointers

Code: Select all

While *ptr\u ; <> 0  
  Debug *ptr\u 
I take it that \u means "return two bytes, for Unicode"? Is this documented? I struggle to find things like this in the help, since it isn't an easily searched verb/statement.

Post by idle »

Oso wrote: Fri Sep 30, 2022 12:39 am I take it that \u signifies to return two bytes for Unicode? Is this documented? I struggle to find things like this in the help, with it not being an easily-searched verb./statement.
Yes, it's the Unicode type, which is an unsigned 2-byte value.

You can find the type documentation here:
https://www.purebasic.com/documentation ... ables.html
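
A short sketch of how the built-in Ascii, Unicode and Character structures relate to their \a, \u and \c fields. It assumes a Unicode-only compiler, and the variable names are illustrative:

```purebasic
; Sketch only: assumes a Unicode-only compiler (the sizes differ in old ASCII mode).
Debug SizeOf(Ascii)       ; 1 byte,  field \a
Debug SizeOf(Unicode)     ; 2 bytes, field \u
Debug SizeOf(Character)   ; 2 bytes here, field \c

s.s = "Hi"
*u.Unicode = @s
Debug *u\u                ; 72, the code of "H"
Debug Chr(*u\u)           ; H
```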

Post by plouf »

But it is not that simple, since UTF-8 is variable length: from 1 to 4 bytes per character.

So the assumption in some of the solutions here that size = length * 2 gives partly or totally wrong results (called a bug ;-)),
and the error would only show up in specific situations.

For example:
English characters: always 1 byte
Greek characters: always 2 bytes
Chinese characters: 3 bytes, etc.

Post by STARGÅTE »

plouf wrote: Fri Sep 30, 2022 6:05 am But it is not that simple, since UTF-8 is variable length: from 1 to 4 bytes per character.

So the assumption in some of the solutions here that size = length * 2 gives partly or totally wrong results (called a bug ;-)),
and the error would only show up in specific situations.

For example:
English characters: always 1 byte
Greek characters: always 2 bytes
Chinese characters: 3 bytes, etc.
It is true that UTF-8 has a variable length. But we are talking here about the internal representation of a string in PureBasic, which is "Unicode" or, more correctly, UTF-16, using exactly 2 bytes per character within the BMP (the Basic Multilingual Plane of Unicode).
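
To put numbers on both points, here is a minimal sketch that compares the UTF-8 and in-memory (UTF-16) byte lengths of one English, one Greek and one Chinese character. The sample characters are my own choice, and the source file is assumed to be saved as UTF-8:

```purebasic
; Sketch only: assumes the source file is saved as UTF-8 so the literals survive.
Debug StringByteLength("A", #PB_UTF8)      ; 1
Debug StringByteLength("α", #PB_UTF8)      ; 2
Debug StringByteLength("中", #PB_UTF8)     ; 3

Debug StringByteLength("A", #PB_Unicode)   ; 2
Debug StringByteLength("α", #PB_Unicode)   ; 2
Debug StringByteLength("中", #PB_Unicode)  ; 2
```

All three characters are inside the BMP, so the in-memory size is a constant 2 bytes each, while the UTF-8 size varies.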

Post by juergenkulow »

Code: Select all

; DrawText() and DrawVectorText(Text$) draw Unicode text.
If OpenWindow(0, 0, 0, 400, 200, "VectorDrawing", #PB_Window_SystemMenu | #PB_Window_ScreenCentered)
    CanvasGadget(0, 0, 0, 400, 200)
    LoadFont(0, "Arial", 20, #PB_Font_Bold)
    
    If StartVectorDrawing(CanvasVectorOutput(0))
    
      VectorFont(FontID(0), 25)
      VectorSourceColor(RGBA(0, 0, 0, 80))
      Text$ = "U+1D56C 𝕬          𝕭𝕮𝕯𝕰𝕱𝕲𝕳𝕴𝕵𝕶"
      For i = 1 To 6
        MovePathCursor(200 - VectorTextWidth(Text$)/2, 100 - VectorTextHeight(Text$)/2)
        DrawVectorText(Text$)
        RotateCoordinates(200, 100, 30)
      Next i

      StopVectorDrawing()
    EndIf
    ShowMemoryViewer(@Text$,Len(Text$)*2)
    Repeat
      Event = WaitWindowEvent()
    Until Event = #PB_Event_CloseWindow
  EndIf

Post by Oso »

Thanks to all for the contributions on this. The main question was really about what happened to plain ASCII text when it was stored in memory. The fact that it becomes Unicode as a two-byte pair explains why I sometimes see text l i k e__t h i s__i n__s o m e__s o f t w a r e__a p p l i c a t i o n files. :)

The other thing that's relevant here is that when we write ASCII text to a file, we specifically have to tell WriteString to use #PB_Ascii; otherwise, presumably, the Unicode that's in memory gets written, even though there is a specific #PB_Unicode option anyway.

Post by mk-soft »

Somewhat hidden in the PB help (see CreateFile).
Default is UTF8
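
One way to check this on your own system is a sketch like the following. The file name and text are illustrative, and the source file is assumed to be saved as UTF-8. If the default really is UTF-8, the two values printed should match:

```purebasic
; Sketch only: file name and text are illustrative.
text.s = "aou äöü"
file$ = GetTemporaryDirectory() + "pb_default_format.txt"

If CreateFile(0, file$)          ; no format flag given
  WriteString(0, text)           ; no format flag given either
  CloseFile(0)
  Debug FileSize(file$)                    ; size actually written to disk
  Debug StringByteLength(text, #PB_UTF8)   ; 10
  DeleteFile(file$)
EndIf
```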

Post by #NULL »

STARGÅTE wrote: Fri Sep 30, 2022 7:06 am It is true that UTF8 has a variable length. But we talk here about the internal representation of a string in Pure Basic, which is "Unicode" or more correct UTF-16, using exact 2 Bytes per each character for the BMP (Basic multilingual plane of unicode).
UTF-16 has surrogate pairs too, for code points outside the BMP (see the German wiki link).

Code: Select all

s.s = "AA𝄞BB"
ShowMemoryViewer(@s, 12) ; 41 00 41 00 34 D8 1E DD 42 00 42 00
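
The pair 34 D8 1E DD in that dump can be decoded by hand. A minimal sketch, using the standard UTF-16 surrogate formula, with illustrative variable names and a source file assumed to be saved as UTF-8:

```purebasic
; Sketch only: decodes the pair by hand; names are illustrative.
s.s = "𝄞"                    ; U+1D11E, outside the BMP
hi = PeekU(@s)               ; $D834, high surrogate
lo = PeekU(@s + 2)           ; $DD1E, low surrogate

If hi >= $D800 And hi <= $DBFF And lo >= $DC00 And lo <= $DFFF
  codepoint = $10000 + ((hi - $D800) << 10) + (lo - $DC00)
  Debug Hex(codepoint)       ; 1D11E
EndIf

Debug StringByteLength(s, #PB_Unicode)   ; 4: one visible character, two 16-bit units
```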

Post by Oso »

mk-soft wrote: Fri Sep 30, 2022 3:08 pm Somewhat hidden in the PB help (see CreateFile).
Default is UTF8
In practice, for ASCII strings, the default UTF-8 doesn't appear to make any difference:

Code: Select all

OpenConsole()
CreateFile(0,"unicode.txt", #PB_Unicode)
mystring.s = "Somewhat hidden in the PB help (see CreateFile)."
WriteString(0, mystring.s, #PB_Unicode)
CloseFile(0)
I created three files using the code above, one for each of the flags #PB_Ascii, #PB_Unicode and #PB_UTF8:

dir
30/09/2022 18:17 48 ascii.txt
30/09/2022 18:19 96 unicode.txt
30/09/2022 18:18 48 utf8.txt

type ascii.txt
Somewhat hidden in the PB help (see CreateFile).

type unicode.txt
S o m e w h a t h i d d e n i n t h e P B h e l p ( s e e C r e a t e F i l e ) .

type utf8.txt
Somewhat hidden in the PB help (see CreateFile).

comp ascii.txt utf8.txt
Comparing ascii.txt and utf8.txt...
Files compare OK

Post by mk-soft »

You cannot treat ASCII and UTF-8 as the same thing; see the wiki.

Code: Select all

Structure ArrayOfByte
  b.b[0]
EndStructure

Define s1.s = "aou äöü"
Define s2.s

l1 = StringByteLength(s1, #PB_Ascii)
l2 = StringByteLength(s1, #PB_UTF8)
l3 = StringByteLength(s1, #PB_Unicode)

Debug "Len ASCII = " + l1
Debug "Len UTF8 = " + l2
Debug "Len UNICODE = " + l3

*t1 = Ascii(s1)
*t2 = UTF8(s1)
*t3 = @s1

Debug "ASCII"
*addr.ArrayOfByte = *t1
s2 = ""
uBound = l1 - 1
For i = 0 To uBound
  s2.s + RSet(Hex(*addr\b[i], #PB_Byte), 2, "0") + ","
Next
Debug s2

Debug "UTF8"
*addr.ArrayOfByte = *t2
s2 = ""
uBound = l2 - 1
For i = 0 To uBound
  s2.s + RSet(Hex(*addr\b[i], #PB_Byte), 2, "0") + ","
Next
Debug s2

Debug "UNICODE"
*addr.ArrayOfByte = *t3
s2 = ""
uBound = l3 - 1
For i = 0 To uBound
  s2.s + RSet(Hex(*addr\b[i], #PB_Byte), 2, "0") + ","
Next
Debug s2
FreeMemory(*t1)
FreeMemory(*t2)

Post by Oso »

mk-soft wrote: Fri Sep 30, 2022 10:37 pm You can not compare ascii and utf8, see wiki

Code: Select all

Define s1.s = "aou äöü"
Okay, thanks for that, mk-soft. I understand what you mean. In my example text, all the characters were below code 128, so the ASCII and UTF-8 results were identical. Good to make that distinction :oops:
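
A minimal sketch of that distinction. SameBytes() is a hypothetical helper written for this illustration, not a PureBasic function, and the source file is assumed to be saved as UTF-8. It converts a string both ways and compares the resulting bytes:

```purebasic
; Sketch only: SameBytes() is a hypothetical helper, not a PureBasic function.
Procedure.i SameBytes(text.s)
  Protected same = #False
  Protected lenA = StringByteLength(text, #PB_Ascii)
  Protected lenU = StringByteLength(text, #PB_UTF8)
  Protected *a = Ascii(text)
  Protected *u = UTF8(text)

  ; Only compare when both conversions succeeded and the lengths agree.
  If *a And *u And lenA = lenU And lenA > 0
    same = Bool(CompareMemory(*a, *u, lenA) <> 0)
  EndIf

  If *a : FreeMemory(*a) : EndIf
  If *u : FreeMemory(*u) : EndIf
  ProcedureReturn same
EndProcedure

Debug SameBytes("aou")       ; 1: every character is below 128
Debug SameBytes("aou äöü")   ; 0: the umlauts take two bytes each in UTF-8
```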