Unicode question

Posted: Wed Jan 10, 2024 8:33 am
by boddhi
Hi,

Beginner's question: Is there a quick way to read a string of Unicode characters encoded in 2 bytes each, from a file or from memory, when the bytes are stored as YY XX instead of XX YY (i.e. with the null byte before the character code), without having to read each two-byte character one by one?
Note: The data in the file is in Motorola (big-endian) format.

Example of a hexa string in files: 00 53 00 74 00 72 00 69 00 6E 00 67

Thanks.

Re: Unicode question

Posted: Wed Jan 10, 2024 10:06 am
by idle
use bswap16
https://www.purebasic.fr/english/viewtopic.php?t=77563
read in 2 bytes and call bswap16

Re: Unicode question

Posted: Wed Jan 10, 2024 11:30 am
by boddhi
@Idle,
idle wrote: Wed Jan 10, 2024 10:06 am use bswap16
Thanks for your response and your link.

But it isn't what I'm looking for, or I didn't understand the topic (which is quite possible, as I'm not familiar with assembler and C :mrgreen: )
I would like to process variable-length strings (more than 8 characters) in one shot.
For now, I read the words one at a time, manipulate them and concatenate them to obtain the final string...

Re: Unicode question

Posted: Wed Jan 10, 2024 12:00 pm
by AZJIO
I thought there was a native WinAPI function that could read UTF-16BE, but I couldn't find one. So I think the file has to be read as binary and the bytes rearranged to get UTF-16LE.


Here is an example that did not work. The WideCharToMultiByte function has a problem with the initial letters, since it treats them as UTF-16LE, which is why the first character is always corrupted.

Code: Select all

EnableExplicit

Global hFile
Global *Buffer
Global *Buffer2
Global FSize
Global Filename$
Global Count
Global Count2

Procedure _OpenFile(Filename$)
	Protected Pos
	FSize = FileSize(Filename$)
	If FSize > -1
		hFile = CreateFile_(Filename$, #GENERIC_READ | #GENERIC_WRITE , #FILE_SHARE_READ | #FILE_SHARE_WRITE, #Null, #OPEN_EXISTING, #FILE_ATTRIBUTE_NORMAL, 0)
		If hFile <> #INVALID_HANDLE_VALUE
			*Buffer = AllocateMemory(FSize)
			Pos = 3
			If SetFilePointer_(hFile, Pos, 0, #FILE_BEGIN) <>  #INVALID_SET_FILE_POINTER
				If ReadFile_(hFile, *Buffer, FSize, @Count, 0)
					CloseHandle_(hFile) ; close the handle on the success path too
					ProcedureReturn Count
				EndIf
			EndIf
			CloseHandle_(hFile)
		EndIf
	EndIf
	ProcedureReturn 0
EndProcedure

Filename$ = "C:\PB\Source\Current\16BE.txt"
; Filename$ = "C:\PB\Source\Current\1.pb"
Debug _OpenFile(Filename$)
; ShowMemoryViewer(*Buffer, 100)
Debug Count
If Count
	Count2 = WideCharToMultiByte_(#CP_ACP, 0, *Buffer, Count, 0, 0, 0, 0)
	*Buffer2 = AllocateMemory(Count2)
	WideCharToMultiByte_(#CP_ACP, 0, *Buffer, Count, *Buffer2, Count2, 0, 0)
	Debug PeekS(*Buffer2, Count2, #PB_Ascii)
Else
	Debug "0 bytes read"
EndIf

If *Buffer : FreeMemory(*Buffer) : EndIf
If *Buffer2 : FreeMemory(*Buffer2) : EndIf

Re: Unicode question

Posted: Wed Jan 10, 2024 12:03 pm
by juergenkulow

Code: Select all

; Motorolaformat
Procedure.s Motorolaformat(*ptr.Ascii,l.i)
  Protected s$=Space(l/2)
  Protected *p.Ascii=@s$
  Protected *e=*ptr+l
  Protected a1.Ascii
  Protected a2.Ascii
  While *ptr<*e
    a1\a=*ptr\a
    *ptr+1
    a2\a=*ptr\a
    *ptr+1
    *p\a=a2\a
    *p+1
    *p\a=a1\a
    *p+1
  Wend 
  ShowMemoryViewer(@s$,l) ; l is the byte length passed to the procedure
  ProcedureReturn s$
EndProcedure

s$=Space((?E-?L)/2)
CopyMemory(?L,@s$,?E-?L)
Debug Motorolaformat(@s$,Len(s$)*2)

DataSection
  L:
  Data.a $00, $53, $00, $74, $00, $72, $00, $69, $00, $6E, $00, $67
  E:
EndDataSection
; String
; 00000000006FB730  53 00 74 00 72 00 69 00 6E 00 67 00              S.t.r.i.n.g.
I find bswap16 more elegant and faster.

Re: Unicode question

Posted: Wed Jan 10, 2024 12:41 pm
by AZJIO

Code: Select all

EnableExplicit

Global *Buffer
Global FSize
Global Filename$
Global Count
Global Format
Global file_id


Procedure _ReadFile(Filename$)
	If Filename$
		file_id = ReadFile(#PB_Any, Filename$, #PB_Unicode)
		If file_id
			Format = ReadStringFormat(file_id)
			FSize = Lof(file_id)
			*Buffer = AllocateMemory(FSize)
			If *Buffer
				Count = ReadData(file_id, *Buffer, FSize)
			EndIf
			CloseFile(file_id)
		EndIf
	EndIf
; 	ProcedureReturn *Buffer
EndProcedure

Procedure BEtoLE(*b, n)
    Protected i, *c.Byte, *x.Byte

    If *b = 0 Or n = 0
        ProcedureReturn 0
    EndIf

    For i = 0 To n - 2 Step 2 ; stop at the last full byte pair, even if n is odd
    	;             *c\b = *b.b
    	*x = *b + i
    	*c = *b + i + 1
    	Swap *x\b, *c\b
    Next
EndProcedure

Filename$ = "C:\PB\Source\Current\16BE.txt"
; Filename$ = "C:\PB\Source\Current\1.pb"
_ReadFile(Filename$)
If Count
; 	ShowMemoryViewer(*Buffer, 100)
	BEtoLE(*Buffer, Count)
; 	ShowMemoryViewer(*Buffer, 100)
	Debug PeekS(*Buffer, Count / 2, #PB_Unicode) ; PeekS length is in characters, not bytes
EndIf

If *Buffer : FreeMemory(*Buffer) : EndIf

Re: Unicode question

Posted: Wed Jan 10, 2024 1:01 pm
by boddhi
Thanks AZJIO & juergenkulow for your help.

For the moment, I use an old technique (I don't know if it's any faster, but it works :mrgreen: ):
Note: I always know in advance the number of characters to retrieve.

Code: Select all

Length.u=(?E-?L)
String.s=Space(Length/2)
CopyMemory(?L,@String,?E-?L)

Length-2
For Count.u=0 To Length Step 2
  Value.u=PeekW(@String+Count)
  String2.s+Chr((Value&$FF)<<8+(Value>>8)&$FF)
Next
Debug String2

DataSection
  L:
  Data.a $00, $53, $00, $74, $00, $72, $00, $69, $00, $6E, $00, $67
  E:
EndDataSection

Re: Unicode question

Posted: Wed Jan 10, 2024 3:07 pm
by mk-soft
Use the correct byte order inside the data section:
Intel notation is low byte first, then high byte.

Code: Select all

Length.u=(?E-?L)
String.s=Space(Length/2)
CopyMemory(?L,@String,?E-?L)

Debug String

string2.s = PeekS(?L)
Debug string2

DataSection
  L:
  Data.a $53, $00, $74, $00, $72, $00, $69, $00, $6E, $00, $67, $00
  Data.u 0
  E:
EndDataSection
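
For reference, a small sketch of what "low byte first" means in memory. It assumes a little-endian CPU (x86/x64), and the variable name Value is only illustrative:

```purebasic
; Assumption: little-endian CPU (Intel/AMD). Value is an illustrative name.
Define Value.u = $0053          ; code of "S" as a 16-bit unit

Debug Hex(PeekA(@Value))        ; prints 53 (low byte is stored first)
Debug Hex(PeekA(@Value + 1))    ; prints 0  (high byte is stored second)
```

A file in Motorola format stores the same two bytes the other way round (00 53), which is why PeekS with #PB_Unicode cannot read it directly.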

Re: Unicode question

Posted: Wed Jan 10, 2024 3:18 pm
by AZJIO

Code: Select all

Debug PeekS(?L)
Length.u=(?E-?L)
String.s=Space(Length/2 + 1)
CopyMemory(@"",@String,1)
CopyMemory(?L,@String + 1,?E-?L)

Debug PeekS(@String + 2)

DataSection
  L:
  Data.a $00, $53, $00, $74, $00, $72, $00, $69, $00, $6E, $00, $67
  E:
EndDataSection

Re: Unicode question

Posted: Wed Jan 10, 2024 4:15 pm
by boddhi
Hi mk-soft,
mk-soft wrote: Use correct byte order inside data section
boddhi wrote: Example of a hexa string in files: 00 53 00 74 00 72 00 69 00 6E 00 67
:wink:

Re: Unicode question

Posted: Wed Jan 10, 2024 5:26 pm
by boddhi
@AZJIO

Eureka: Simple and efficient!
Offset of one byte... why didn't I think of that? :oops: :oops: :oops:

I retrieve the string from the file into memory and offset it by one byte.

Thanks :wink:

Re: Unicode question

Posted: Wed Jan 10, 2024 7:16 pm
by mk-soft
Simply shifting by one byte may mean that the Unicode characters are no longer correct.
It is best to enter them as Unicode in the DataSection right away.

Code: Select all

Length.u=(?E-?L)
String.s=Space(Length/2)
CopyMemory(?L,@String,?E-?L)

Debug String

string2.s = PeekS(?L)
Debug string2

MessageRequester("Title", string2)

DataSection
  L:
  Data.u $53, $74, $72, $69, $6E, $67, $2614, $26D4
  Data.u 0
  E:
EndDataSection
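
To make the warning about the one-byte shift concrete, here is a minimal sketch. It assumes a little-endian CPU, and the label BEData and the pointer *Mem are illustrative names. Reading one byte later pairs each low byte with the high byte of the next character, so the result is only right when all high bytes are identical (as in pure ASCII text):

```purebasic
; "A" followed by Greek small lambda in big-endian order: 00 41 03 BB
; BEData and *Mem are illustrative names, not from the thread.
*Mem = AllocateMemory(6)        ; zero-filled by default
If *Mem
  CopyMemory(?BEData, *Mem, 4)
  ; Read 16-bit units starting one byte in: 41 03, then BB 00
  Debug Hex(PeekU(*Mem + 1))    ; prints 341, but "A" is $0041
  Debug Hex(PeekU(*Mem + 3))    ; prints BB, but lambda is $03BB
  FreeMemory(*Mem)
EndIf

DataSection
  BEData:
  Data.a $00, $41, $03, $BB
EndDataSection
```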

Re: Unicode question

Posted: Fri Jan 12, 2024 10:52 am
by juergenkulow
Let me have a second try:

Code: Select all

; Motorolaformat with Pointer to Unicode and Ascii without PeekW, Chr
Structure aatype
  a1.a
  a2.a
EndStructure

Structure uaatype
  StructureUnion
    u.u
    aa.aatype
  EndStructureUnion
EndStructure
  
Length=(?E-?L)
String.s=Space(Length/2)
CopyMemory(?L,@String,?E-?L)
*ptr.uaatype=@String 
*pend=*ptr+Length
For *ptr.uaatype=@String To *pend Step 2
  *ptr\u=*ptr\aa\a1<<8+*ptr\aa\a2
Next
Debug String

DataSection
  L:
  Data.a $00, $53, $00, $74, $00, $72, $00, $69, $00, $6E, $00, $67
  E:
EndDataSection

Re: Unicode question

Posted: Sat Jan 13, 2024 10:19 am
by boddhi
Following up on my question, I've discovered that my problem is more complex to solve, as I could potentially run into the situation described below.

Strings are encoded in Windows Unicode BMP format, which stores each character value in 2 bytes (a WORD). To retrieve the string from memory or a file, each character currently has to be concatenated with Chr().
The catch is that in memory or in the file, an 'A', for example, is encoded as $00 41 and not $41 00. As a result, I can't use PeekS() with the #PB_Unicode option.

In short, is it possible to retrieve in one shot a string encoded as $00 41 instead of $41 00, or $03 9A instead of $9A 03 (without having to loop over each character)?
I've also tried WideCharToMultiByte, but without success (maybe I didn't use the right parameters).

To better illustrate the situation, here's an example (valid only with Intel byte order). Click on Start after each CallDebugger.

Code: Select all

Procedure.s MemoryRead(*ArgBuffer,ArgLength.a)
  Protected.a Count
  Protected.u Value
  Protected.s String
  
  ArgLength-2
  For Count=0 To ArgLength Step 2
    Value=PeekW(*ArgBuffer+Count)
    Value=(Value&$FF)<<8+(Value>>8)&$FF
    String+Chr(Value)
  Next
  ProcedureReturn String
EndProcedure

String.s="Normal"
Debug "Memory bloc is equal to (with Intel alignment) : 4E 00 6F 00 72 00 6D 00 61 00 6C 00"
Debug "Reading memory (#PB_Unicode) : "+PeekS(@string,12,#PB_Unicode)
ShowMemoryViewer(@String,Len(String)*2)
CallDebugger
*Buffer=AllocateMemory(13)
CopyMemory(@String,*Buffer+1,12)
Debug "Memory bloc is equal to (with Intel alignment) : 00 4E 00 6F 00 72 00 6D 00 61 00 6C"
Debug "Reading memory (MemoryRead procedure) : "+MemoryRead(*Buffer,12)
Debug "Reading memory (#PB_Ascii) : "+PeekS(*Buffer,12,#PB_Ascii)
Debug "Reading memory (#PB_UTF8) : "+PeekS(*Buffer,12,#PB_UTF8)
Debug "Reading memory (#PB_Unicode) : "+PeekS(*Buffer,12,#PB_Unicode)
String2.s=Space(6)
*buffer2=AllocateMemory(6)
WideCharToMultiByte_(#CP_ACP,0,@String,12,*buffer2,6,0,0)
Debug "WideCharToMultiByte : "+PeekS(*buffer2,6,#PB_Ascii)
; String2.s=Space(20)
; Longueur=WideCharToMultiByte_(#CP_ACP,0,@String,12,@String2,6,0,0)
; Debug "String2 : "+String2
; ShowMemoryViewer(@String2,6)
;CallDebugger

Debug "----"
String="Κανονικά"
*Buffer=ReAllocateMemory(*Buffer,17) ; the block may move, so keep the returned pointer
CopyMemory(@String,*Buffer+1,16)
PokeB(*Buffer,3)
Debug "Memory bloc is now equal to (with Intel alignment) : 03 9A 03 B1 03 BD 03 BF 03 BD 03 B9 03 BA 03 AC"
Debug "Reading memory (MemoryRead procedure) : "+MemoryRead(*Buffer,16)
Debug "Reading memory (#PB_Ascii) : "+PeekS(*Buffer,16,#PB_Ascii)
Debug "Reading memory (#PB_UTF8) : "+PeekS(*Buffer,16,#PB_UTF8)
Debug "Reading memory (#PB_Unicode) : "+PeekS(*Buffer,16,#PB_Unicode)
String2=Space(16)
*buffer2=ReAllocateMemory(*buffer2,16) ; the block may move, so keep the returned pointer
WideCharToMultiByte_(#CP_ACP,0,@String,16,@String2,8,0,0)
Debug "WideCharToMultiByte : "+String2
WideCharToMultiByte_(#CP_ACP,0,@String,16,*buffer2,8,0,0)
ShowMemoryViewer(*Buffer2,16)
CallDebugger
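
One possible way to avoid the per-character Chr() concatenation, sketched under some assumptions: the helper name BEBufferToString and the labels BEStart/BEEnd are hypothetical, and the code targets a little-endian CPU. It still loops over the bytes once, but the string itself is built with a single PeekS call on a byte-swapped copy:

```purebasic
; Hypothetical helper (not from the thread): turns a big-endian 16-bit
; buffer into a PureBasic string with one byte-swap pass and one PeekS.
Procedure.s BEBufferToString(*Source, ByteLength.i)
  Protected Result.s, *Copy, i.i, CharCount.i = ByteLength / 2

  If *Source = 0 Or CharCount <= 0
    ProcedureReturn ""
  EndIf

  ; Work on a copy so the caller's buffer stays untouched.
  *Copy = AllocateMemory(CharCount * 2)
  If *Copy
    For i = 0 To CharCount * 2 - 2 Step 2
      ; Swap the two bytes of each 16-bit unit while copying.
      PokeA(*Copy + i, PeekA(*Source + i + 1))
      PokeA(*Copy + i + 1, PeekA(*Source + i))
    Next
    ; PeekS takes its length in characters, not bytes.
    Result = PeekS(*Copy, CharCount, #PB_Unicode)
    FreeMemory(*Copy)
  EndIf

  ProcedureReturn Result
EndProcedure

; "String" followed by a Greek capital kappa ($039A), all big-endian.
Debug BEBufferToString(?BEStart, ?BEEnd - ?BEStart)

DataSection
  BEStart:
  Data.a $00, $53, $00, $74, $00, $72, $00, $69, $00, $6E, $00, $67
  Data.a $03, $9A
  BEEnd:
EndDataSection
```

Because every 16-bit unit is swapped individually, this does not depend on all characters sharing the same high byte, unlike the one-byte offset trick.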

Re: Unicode question

Posted: Sat Jan 13, 2024 12:47 pm
by juergenkulow

Code: Select all

; to Motorola
Procedure.s MemoryRead(*ArgBuffer,ArgLength.a)
  Protected.a Count
  Protected.u Value
  Protected.s String
  
  ArgLength-2
  For Count=0 To ArgLength Step 2
    Value=PeekW(*ArgBuffer+Count)
    Value=(Value&$FF)<<8+(Value>>8)&$FF
    ; Debug Hex(Value)
    String+Chr(Value)
  Next
  ProcedureReturn String
EndProcedure

Structure aatype
  a1.a
  a2.a
EndStructure

Structure uaatype
  StructureUnion
    u.u
    aa.aatype
  EndStructureUnion
EndStructure

Procedure toMotorola(*ptr.uaatype,l.i)
  Protected *pend=*ptr+l
  For *ptr=*ptr To *pend Step 2
    *ptr\u=*ptr\aa\a1<<8+*ptr\aa\a2
  Next
EndProcedure  
  
Debug MemoryRead(?L,?E-?L)
s.s="Κ α ν ο ν ι κ ά"
toMotorola(@s,Len(s)*2)
ShowMemoryViewer(@s,Len(s)*2)
Debug MemoryRead(@s,Len(s)*2)
End 

DataSection
  L:
  Data.a $03, $9A, $03, $B1, $03, $BD, $03, $BF, $03, $BD, $03, $B9, $03, $BA, $03, $AC
  E:
EndDataSection
; Κανονικά
; Κ α ν ο ν ι κ ά
; 0000000001E112F0  03 9A 00 20 03 B1 00 20 03 BD 00 20 03 BF 00 20  .š. .±. .½. .¿. 
; 0000000001E11300  03 BD 00 20 03 B9 00 20 03 BA 00 20 03 AC        .½. .¹. .º. .¬
Your procedure is fine, but unfortunately your test program is not.