Unicode question

boddhi
Enthusiast
Posts: 524
Joined: Mon Nov 15, 2010 9:53 pm

Post by boddhi »

Hi,

Beginner's question: is there a quick way to read a string of Unicode characters, each encoded on 2 bytes, from a file or from memory when the bytes are stored in YY XX order instead of XX YY (i.e. with the null byte before the character code), without having to read each two-byte character one by one?
Note: the data in the file is in Motorola (big-endian) format.

Example of a hexa string in files: 00 53 00 74 00 72 00 69 00 6E 00 67

Thanks.
If my English syntax and lexicon are incorrect, please bear with Google translate and DeepL. They rarely agree with each other!
Except on this sentence...
idle
Always Here
Posts: 6035
Joined: Fri Sep 21, 2007 5:52 am
Location: New Zealand

Re: Unicode question

Post by idle »

use bswap16
https://www.purebasic.fr/english/viewtopic.php?t=77563
Read 2 bytes at a time and call bswap16 on them.
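
For readers who do not want to pull in the routine from the linked topic, the same idea can be sketched in plain PureBasic. This is only a sketch under assumptions: SwapWords() is a hypothetical helper name, not the bswap16 from that topic, and the buffer is assumed to hold an even number of bytes made of 2-byte characters.

```purebasic
; Hypothetical helper (not the linked bswap16): swap the two bytes of every
; 16-bit word in a buffer, in place. ByteCount is assumed to be even.
Procedure SwapWords(*Mem, ByteCount)
  Protected *p.Unicode = *Mem
  Protected *End = *Mem + ByteCount
  While *p < *End
    *p\u = ((*p\u & $FF) << 8) | (*p\u >> 8)
    *p + 2 ; pointer arithmetic is byte based, so this moves to the next word
  Wend
EndProcedure

Define Size = ?DataEnd - ?DataStart
Define *Buf = AllocateMemory(Size)
If *Buf
  CopyMemory(?DataStart, *Buf, Size)
  SwapWords(*Buf, Size)
  ; PeekS takes a length in characters, hence the division by 2
  Debug PeekS(*Buf, Size / 2, #PB_Unicode)
  FreeMemory(*Buf)
EndIf

DataSection
  DataStart:
  Data.a $00, $53, $00, $74, $00, $72, $00, $69, $00, $6E, $00, $67
  DataEnd:
EndDataSection
```

The whole buffer is converted in one pass and read with a single PeekS(), so no string concatenation is needed.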

Post by boddhi »

@Idle,
idle wrote: Wed Jan 10, 2024 10:06 am use bswap16
Thanks for your response and your link.

But it isn't what I'm looking for, or I didn't understand the topic (which is quite possible, I'm not familiar with assembler and C :mrgreen: ).
I would like to process variable-length strings (more than 8 characters) in one shot.
For now, I read the words one by one, swap their bytes and concatenate them to obtain the final string...
AZJIO
Addict
Posts: 2226
Joined: Sun May 14, 2017 1:48 am

Post by AZJIO »

I thought there was a native WinAPI function that could read UTF-16BE directly, but I did not find one. I now think the file has to be read as binary and the bytes then swapped to get UTF-16LE.

Here is an example that did not produce the expected result. WideCharToMultiByte treats its input as UTF-16LE, so it has a problem with the initial letters, which is why the first character is always corrupted.

Code: Select all

EnableExplicit

Global hFile
Global *Buffer
Global *Buffer2
Global FSize
Global Filename$
Global Count
Global Count2

Procedure _OpenFile(Filename$)
	Protected Pos, Result
	FSize = FileSize(Filename$)
	If FSize > -1
		hFile = CreateFile_(Filename$, #GENERIC_READ | #GENERIC_WRITE , #FILE_SHARE_READ | #FILE_SHARE_WRITE, #Null, #OPEN_EXISTING, #FILE_ATTRIBUTE_NORMAL, 0)
		If hFile <> #INVALID_HANDLE_VALUE
			*Buffer = AllocateMemory(FSize)
			Pos = 3 ; skip the first 3 bytes of the file
			If *Buffer And (SetFilePointer_(hFile, Pos, 0, #FILE_BEGIN) <> #INVALID_SET_FILE_POINTER)
				If ReadFile_(hFile, *Buffer, FSize, @Count, 0)
					Result = Count
				EndIf
			EndIf
			CloseHandle_(hFile) ; close the handle after a successful read as well
		EndIf
	EndIf
	ProcedureReturn Result
EndProcedure

Filename$ = "C:\PB\Source\Current\16BE.txt"
; Filename$ = "C:\PB\Source\Current\1.pb"
Debug _OpenFile(Filename$)
; ShowMemoryViewer(*Buffer, 100)
Debug Count
If Count
	; WideCharToMultiByte expects the input length in characters, not in bytes
	Count2 = WideCharToMultiByte_(#CP_ACP, 0, *Buffer, Count / 2, 0, 0, 0, 0)
	*Buffer2 = AllocateMemory(Count2)
	WideCharToMultiByte_(#CP_ACP, 0, *Buffer, Count / 2, *Buffer2, Count2, 0, 0)
	Debug PeekS(*Buffer2, Count2, #PB_Ascii)
Else
	Debug "0 bytes read"
EndIf

If *Buffer : FreeMemory(*Buffer) : EndIf
If *Buffer2 : FreeMemory(*Buffer2) : EndIf
juergenkulow
Enthusiast
Posts: 581
Joined: Wed Sep 25, 2019 10:18 am

Post by juergenkulow »

Code: Select all

; Motorolaformat
Procedure.s Motorolaformat(*ptr.Ascii,l.i)
  Protected s$=Space(l/2)
  Protected *p.Ascii=@s$
  Protected *e=*ptr+l
  Protected a1.Ascii
  Protected a2.Ascii
  While *ptr<*e
    a1\a=*ptr\a
    *ptr+1
    a2\a=*ptr\a
    *ptr+1
    *p\a=a2\a
    *p+1
    *p\a=a1\a
    *p+1
  Wend 
  ShowMemoryViewer(@s$,?E-?L)
  ProcedureReturn s$
EndProcedure

s$=Space((?E-?L)/2)
CopyMemory(?L,@s$,?E-?L)
Debug Motorolaformat(@s$,Len(s$)*2)

DataSection
  L:
  Data.a $00, $53, $00, $74, $00, $72, $00, $69, $00, $6E, $00, $67
  E:
EndDataSection
; String
; 00000000006FB730  53 00 74 00 72 00 69 00 6E 00 67 00              S.t.r.i.n.g.
I find bswap16 more elegant and faster.

Post by AZJIO »

Code: Select all

EnableExplicit

Global *Buffer
Global FSize
Global Filename$
Global Count
Global Format
Global file_id


Procedure _ReadFile(Filename$)
	If Filename$
		file_id = ReadFile(#PB_Any, Filename$, #PB_Unicode)
		If file_id
			Format = ReadStringFormat(file_id)
			FSize = Lof(file_id)
			*Buffer = AllocateMemory(FSize)
			If *Buffer
				Count = ReadData(file_id, *Buffer, FSize)
			EndIf
			CloseFile(file_id)
		EndIf
	EndIf
; 	ProcedureReturn *Buffer
EndProcedure

Procedure BEtoLE(*b, n)
    Protected i, *c.Byte, *x.Byte

    If *b = 0 Or n = 0
        ProcedureReturn 0
    EndIf

    For i = 0 To n - 2 Step 2 ; stop at the last complete byte pair, also when n is odd
    	;             *c\b = *b.b
    	*x = *b + i
    	*c = *b + i + 1
    	Swap *x\b, *c\b
    Next
EndProcedure

Filename$ = "C:\PB\Source\Current\16BE.txt"
; Filename$ = "C:\PB\Source\Current\1.pb"
_ReadFile(Filename$)
If Count
; 	ShowMemoryViewer(*Buffer, 100)
	BEtoLE(*Buffer, Count)
; 	ShowMemoryViewer(*Buffer, 100)
	Debug PeekS(*Buffer, Count / 2, #PB_Unicode) ; PeekS length is in characters, not bytes
EndIf

If *Buffer : FreeMemory(*Buffer) : EndIf
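
The code above stores the result of ReadStringFormat() in Format but never checks it. As a hedged sketch, the BOM could be used to decide whether the bytes need swapping at all. NeedsByteSwap() is a hypothetical helper and "16BE.txt" a placeholder file name; the sketch assumes the file really starts with a BOM, since without one ReadStringFormat() has nothing to detect.

```purebasic
; Hypothetical helper: returns #True when the BOM says the file is UTF-16 big endian.
; ReadStringFormat() also moves the file pointer past a detected BOM, so a
; following ReadData() starts at the first character.
Procedure NeedsByteSwap(File)
  If ReadStringFormat(File) = #PB_UTF16BE
    ProcedureReturn #True
  EndIf
  ProcedureReturn #False
EndProcedure

Define File = ReadFile(#PB_Any, "16BE.txt") ; placeholder file name
If File
  If NeedsByteSwap(File)
    Debug "UTF-16BE BOM found: swap the bytes before PeekS(..., #PB_Unicode)"
  Else
    Debug "No UTF-16BE BOM found: do not swap blindly"
  EndIf
  CloseFile(File)
EndIf
```

If the data has no BOM, as in the hex example from the first post, the byte order has to be known in advance.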

Post by boddhi »

Thanks AZJIO & juergenkulow for your help.

For the moment, I use an old technique (I don't know if it is faster, but it works :mrgreen: ):
Note: I always know in advance the number of characters to be retrieved.

Code: Select all

Length.u=(?E-?L)
String.s=Space(Length/2)
CopyMemory(?L,@String,?E-?L)

Length-2
For Count.u=0 To Length Step 2
  Value.u=PeekW(@String+Count)
  String2.s+Chr((Value&$FF)<<8+(Value>>8)&$FF)
Next
Debug String2

DataSection
  L:
  Data.a $00, $53, $00, $74, $00, $72, $00, $69, $00, $6E, $00, $67
  E:
EndDataSection
mk-soft
Always Here
Posts: 6321
Joined: Fri May 12, 2006 6:51 pm
Location: Germany

Post by mk-soft »

Use the correct byte order inside the data section:
Intel notation is low byte first, then high byte.

Code: Select all

Length.u=(?E-?L)
String.s=Space(Length/2)
CopyMemory(?L,@String,?E-?L)

Debug String

string2.s = PeekS(?L)
Debug string2

DataSection
  L:
  Data.a $53, $00, $74, $00, $72, $00, $69, $00, $6E, $00, $67, $00
  Data.u 0
  E:
EndDataSection
My Projects ThreadToGUI / OOP-BaseClass / EventDesigner V3
PB v3.30 / v5.75 - OS Mac Mini OSX 10.xx - VM Window Pro / Linux Ubuntu
Downloads on my Webspace / OneDrive

Post by AZJIO »

Code: Select all

Debug PeekS(?L)
Length.u=(?E-?L)
String.s=Space(Length/2 + 1)
CopyMemory(@"",@String,1)
CopyMemory(?L,@String + 1,?E-?L)

Debug PeekS(@String + 2)

DataSection
  L:
  Data.a $00, $53, $00, $74, $00, $72, $00, $69, $00, $6E, $00, $67
  E:
EndDataSection

Post by boddhi »

Hi mk-soft
mk-soft wrote: Use correct byte order inside data section
boddhi wrote: Example of a hexa string in files: 00 53 00 74 00 72 00 69 00 6E 00 67
:wink:

Post by boddhi »

@AZJIO

Eureka: simple and efficient!
An offset of one byte... why didn't I think of that? :oops: :oops: :oops:

I read the string from the file into memory and offset it by one byte.

Thanks :wink:

Post by mk-soft »

Simply shifting the data by one byte can leave you with Unicode characters that are no longer correct.
It is best to enter them as Unicode in the DataSection right away.

Code: Select all

Length.u=(?E-?L)
String.s=Space(Length/2)
CopyMemory(?L,@String,?E-?L)

Debug String

string2.s = PeekS(?L)
Debug string2

MessageRequester("Title", string2)

DataSection
  L:
  Data.u $53, $74, $72, $69, $6E, $67, $2614, $26D4
  Data.u 0
  E:
EndDataSection
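
To make the warning concrete: the one-byte offset pairs the low byte of each character with the high byte of the next one, so it only gives the right result while all high bytes are identical (all $00 for plain ASCII text, for example). Here is a small sketch comparing the offset trick with a real byte swap. The test data, "A" followed by "Κ" in big-endian order, is an assumption chosen for illustration.

```purebasic
; Test data (chosen for illustration): A = $0041, Κ = $039A, stored big endian
Define Size = ?BE_End - ?BE_Start
Define *Shifted = AllocateMemory(Size + 3) ; zero-filled: 1 offset byte + data + terminator
Define *Swapped = AllocateMemory(Size + 2) ; data + 2-byte terminator
Define i

If *Shifted And *Swapped
  ; Offset trick: copy the data one byte further, then read from offset 2
  CopyMemory(?BE_Start, *Shifted + 1, Size)
  Debug "Offset trick:"
  For i = 0 To Size / 2 - 1
    Debug Hex(PeekU(*Shifted + 2 + i * 2))
  Next

  ; Real swap: exchange the two bytes of every character
  For i = 0 To Size - 2 Step 2
    PokeA(*Swapped + i, PeekA(?BE_Start + i + 1))
    PokeA(*Swapped + i + 1, PeekA(?BE_Start + i))
  Next
  Debug "Byte swap:"
  For i = 0 To Size / 2 - 1
    Debug Hex(PeekU(*Swapped + i * 2))
  Next
EndIf

If *Shifted : FreeMemory(*Shifted) : EndIf
If *Swapped : FreeMemory(*Swapped) : EndIf

DataSection
  BE_Start:
  Data.a $00, $41, $03, $9A
  BE_End:
EndDataSection
```

With this data the offset trick yields $0341 and $009A, while the byte swap yields the intended $0041 and $039A.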

Post by juergenkulow »

Let me give it a second try:

Code: Select all

; Motorolaformat with Pointer to Unicode and Ascii without PeekW, Chr
Structure aatype
  a1.a
  a2.a
EndStructure

Structure uaatype
  StructureUnion
    u.u
    aa.aatype
  EndStructureUnion
EndStructure
  
Length=(?E-?L)
String.s=Space(Length/2)
CopyMemory(?L,@String,?E-?L)
*ptr.uaatype=@String 
*pend=*ptr+Length-2 ; address of the last character, because For ... To is inclusive
For *ptr.uaatype=@String To *pend Step 2
  *ptr\u=*ptr\aa\a1<<8+*ptr\aa\a2
Next
Debug String

DataSection
  L:
  Data.a $00, $53, $00, $74, $00, $72, $00, $69, $00, $6E, $00, $67
  E:
EndDataSection

Post by boddhi »

Further to my question, I have discovered that my problem is more complex to solve, as I could potentially encounter the situation described below.

Strings are encoded in Windows Unicode (BMP) format, which encodes each character value on 2 bytes (a WORD). To retrieve the string from memory or from a file, each character has to be converted and appended with Chr().
The catch is that in memory or in the file an 'A', for example, is encoded as $00 41 and not $41 00. As a result, I can't use PeekS() with the #PB_Unicode option.

In short, is it possible to retrieve in one shot a string coded $00 41 instead of $41 00, or $03 9A instead of $9A 03 (without having to loop over each character)?
I've also tried WideCharToMultiByte but didn't succeed (maybe I didn't use the right parameters).

To better understand the situation, here's an example (valid only with Intel byte order). Click on Start after each CallDebugger.

Code: Select all

Procedure.s MemoryRead(*ArgBuffer,ArgLength.a)
  Protected.a Count
  Protected.u Value
  Protected.s String
  
  ArgLength-2
  For Count=0 To ArgLength Step 2
    Value=PeekW(*ArgBuffer+Count)
    Value=(Value&$FF)<<8+(Value>>8)&$FF
    String+Chr(Value)
  Next
  ProcedureReturn String
EndProcedure

String.s="Normal"
Debug "Memory block is equal to (with Intel alignment) : 4E 00 6F 00 72 00 6D 00 61 00 6C 00"
Debug "Reading memory (#PB_Unicode) : "+PeekS(@string,12,#PB_Unicode)
ShowMemoryViewer(@String,Len(String)*2)
CallDebugger
*Buffer=AllocateMemory(13)
CopyMemory(@String,*Buffer+1,12)
Debug "Memory block is equal to (with Intel alignment) : 00 4E 00 6F 00 72 00 6D 00 61 00 6C"
Debug "Reading memory (MemoryRead procedure) : "+MemoryRead(*Buffer,12)
Debug "Reading memory (#PB_Ascii) : "+PeekS(*Buffer,12,#PB_Ascii)
Debug "Reading memory (#PB_UTF8) : "+PeekS(*Buffer,12,#PB_UTF8)
Debug "Reading memory (#PB_Unicode) : "+PeekS(*Buffer,12,#PB_Unicode)
String2.s=Space(6)
*buffer2=AllocateMemory(6)
WideCharToMultiByte_(#CP_ACP,0,@String,12,*buffer2,6,0,0)
Debug "WideCharToMultiByte : "+PeekS(*buffer2,6,#PB_Ascii)
; String2.s=Space(20)
; Longueur=WideCharToMultiByte_(#CP_ACP,0,@String,12,@String2,6,0,0)
; Debug "String2 : "+String2
; ShowMemoryViewer(@String2,6)
;CallDebugger

Debug "----"
String="Κανονικά"
ReAllocateMemory(*Buffer,17)
CopyMemory(@String,*Buffer+1,16)
PokeB(*Buffer,3)
Debug "Memory block is now equal to (with Intel alignment) : 03 9A 03 B1 03 BD 03 BF 03 BD 03 B9 03 BA 03 AC"
Debug "Reading memory (MemoryRead procedure) : "+MemoryRead(*Buffer,16)
Debug "Reading memory (#PB_Ascii) : "+PeekS(*Buffer,16,#PB_Ascii)
Debug "Reading memory (#PB_UTF8) : "+PeekS(*Buffer,16,#PB_UTF8)
Debug "Reading memory (#PB_Unicode) : "+PeekS(*Buffer,16,#PB_Unicode)
String2=Space(16)
ReAllocateMemory(*buffer2,16)
WideCharToMultiByte_(#CP_ACP,0,@String,16,@String2,8,0,0)
Debug "WideCharToMultiByte : "+String2
WideCharToMultiByte_(#CP_ACP,0,@String,16,*buffer2,8,0,0)
ShowMemoryViewer(*Buffer2,16)
CallDebugger

Post by juergenkulow »

Code: Select all

; to Motorola
Procedure.s MemoryRead(*ArgBuffer,ArgLength.a)
  Protected.a Count
  Protected.u Value
  Protected.s String
  
  ArgLength-2
  For Count=0 To ArgLength Step 2
    Value=PeekW(*ArgBuffer+Count)
    Value=(Value&$FF)<<8+(Value>>8)&$FF
    ; Debug Hex(Value)
    String+Chr(Value)
  Next
  ProcedureReturn String
EndProcedure

Structure aatype
  a1.a
  a2.a
EndStructure

Structure uaatype
  StructureUnion
    u.u
    aa.aatype
  EndStructureUnion
EndStructure

Procedure toMotorola(*ptr.uaatype,l.i)
  Protected *pend=*ptr+l-2 ; address of the last character, because For ... To is inclusive
  For *ptr=*ptr To *pend Step 2
    *ptr\u=*ptr\aa\a1<<8+*ptr\aa\a2
  Next
EndProcedure  
  
Debug MemoryRead(?L,?E-?L)
s.s="Κ α ν ο ν ι κ ά"
toMotorola(@s,Len(s)*2)
ShowMemoryViewer(@s,Len(s)*2)
Debug MemoryRead(@s,Len(s)*2)
End 

DataSection
  L:
  Data.a $03, $9A, $03, $B1, $03, $BD, $03, $BF, $03, $BD, $03, $B9, $03, $BA, $03, $AC
  E:
EndDataSection
; Κανονικά
; Κ α ν ο ν ι κ ά
; 0000000001E112F0  03 9A 00 20 03 B1 00 20 03 BD 00 20 03 BF 00 20  .š. .±. .½. .¿. 
; 0000000001E11300  03 BD 00 20 03 B9 00 20 03 BA 00 20 03 AC        .½. .¹. .º. .¬
Your procedure is fine, but unfortunately your test program is not.