Unicode question
Hi,
Beginner's question: Is there a quick way to read a string of Unicode characters encoded on 2 bytes each, from a file or from memory, when the bytes are stored as YY XX instead of XX YY (i.e. with the null byte before the character code), without having to read each two-byte character one by one?
Note: The data in the file is in Motorola (big-endian) format.
Example of a hex string in files: 00 53 00 74 00 72 00 69 00 6E 00 67
Thanks.
If my English syntax and lexicon are incorrect, please bear with Google translate and DeepL. They rarely agree with each other!
Except on this sentence...
Re: Unicode question
@Idle,
Thanks for your response and your link.
But it isn't what I'm looking for, or I didn't understand the topic (quite possible, as I'm not familiar with assembler and C).
I would like to operate on variable-length strings (more than 8 characters) in one shot.
For now, I read the words one by one, manipulate them, and concatenate them to obtain the final string...
Re: Unicode question
I thought there was a native WinAPI function that would read UTF-16BE, but that turned out not to be the case. Now I think I need to read the file as binary and then swap the bytes to get UTF-16LE.
Here is an example that did not produce results. The WideCharToMultiByte function has a problem with the initial letters, since it treats the input as UTF-16LE, which is why the first character is always corrupted.
Code: Select all
EnableExplicit
Global hFile
Global *Buffer
Global *Buffer2
Global FSize
Global Filename$
Global Count
Global Count2
Procedure _OpenFile(Filename$)
  Protected Pos
  FSize = FileSize(Filename$)
  If FSize > -1
    hFile = CreateFile_(Filename$, #GENERIC_READ | #GENERIC_WRITE , #FILE_SHARE_READ | #FILE_SHARE_WRITE, #Null, #OPEN_EXISTING, #FILE_ATTRIBUTE_NORMAL, 0)
    If hFile <> #INVALID_HANDLE_VALUE
      *Buffer = AllocateMemory(FSize)
      Pos = 3
      If SetFilePointer_(hFile, Pos, 0, #FILE_BEGIN) <> #INVALID_SET_FILE_POINTER
        If ReadFile_(hFile, *Buffer, FSize, @Count, 0)
          ProcedureReturn Count
        EndIf
      EndIf
      CloseHandle_(hFile)
    EndIf
  EndIf
  ProcedureReturn 0
EndProcedure
Filename$ = "C:\PB\Source\Current\16BE.txt"
; Filename$ = "C:\PB\Source\Current\1.pb"
Debug _OpenFile(Filename$)
; ShowMemoryViewer(*Buffer, 100)
Debug Count
If Count
  Count2 = WideCharToMultiByte_(#CP_ACP, 0, *Buffer, Count, 0, 0, 0, 0)
  *Buffer2 = AllocateMemory(Count2)
  WideCharToMultiByte_(#CP_ACP, 0, *Buffer, Count, *Buffer2, Count2, 0, 0)
  Debug PeekS(*Buffer2, Count2, #PB_Ascii)
Else
  Debug "0 bytes read"
EndIf
If *Buffer : FreeMemory(*Buffer) : EndIf
If *Buffer2 : FreeMemory(*Buffer2) : EndIf
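The "read binary, then rearrange the bytes" idea from this post can be sketched in a few lines. The sketch is in Python rather than PureBasic, purely to illustrate the logic; the helper name `swap_pairs` is hypothetical, and the sample bytes are the ones from the first post.

```python
def swap_pairs(data: bytes) -> bytes:
    """Swap the two bytes of every 16-bit unit (big-endian <-> little-endian)."""
    if len(data) % 2:
        raise ValueError("expected an even number of bytes")
    out = bytearray(len(data))
    out[0::2] = data[1::2]  # second byte of each pair moves to the first position
    out[1::2] = data[0::2]  # first byte of each pair moves to the second position
    return bytes(out)


# "String" stored big-endian, as in the first post: 00 53 00 74 ...
raw = bytes.fromhex("00 53 00 74 00 72 00 69 00 6E 00 67")
text = swap_pairs(raw).decode("utf-16-le")
print(text)
```

The swap is done once over the whole buffer, so the result can then be read as an ordinary little-endian string in a single call.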
juergenkulow
- Enthusiast

- Posts: 581
- Joined: Wed Sep 25, 2019 10:18 am
Re: Unicode question
Code: Select all
; Motorolaformat
Procedure.s Motorolaformat(*ptr.Ascii,l.i)
  Protected s$=Space(l/2)
  Protected *p.Ascii=@s$
  Protected *e=*ptr+l
  Protected a1.Ascii
  Protected a2.Ascii
  While *ptr<*e
    a1\a=*ptr\a
    *ptr+1
    a2\a=*ptr\a
    *ptr+1
    *p\a=a2\a
    *p+1
    *p\a=a1\a
    *p+1
  Wend
  ShowMemoryViewer(@s$,?E-?L)
  ProcedureReturn s$
EndProcedure
s$=Space((?E-?L)/2)
CopyMemory(?L,@s$,?E-?L)
Debug Motorolaformat(@s$,Len(s$)*2)
DataSection
  L:
  Data.a $00, $53, $00, $74, $00, $72, $00, $69, $00, $6E, $00, $67
  E:
EndDataSection
; String
; 00000000006FB730 53 00 74 00 72 00 69 00 6E 00 67 00 S.t.r.i.n.g.
Re: Unicode question
Code: Select all
EnableExplicit
Global *Buffer
Global FSize
Global Filename$
Global Count
Global Format
Global file_id
Procedure _ReadFile(Filename$)
  If Filename$
    file_id = ReadFile(#PB_Any, Filename$, #PB_Unicode)
    If file_id
      Format = ReadStringFormat(file_id)
      FSize = Lof(file_id)
      *Buffer = AllocateMemory(FSize)
      If *Buffer
        Count = ReadData(file_id, *Buffer, FSize)
      EndIf
      CloseFile(file_id)
    EndIf
  EndIf
  ; ProcedureReturn *Buffer
EndProcedure
Procedure BEtoLE(*b, n)
  Protected i, *c.Byte, *x.Byte
  If *b = 0 Or n = 0
    ProcedureReturn 0
  EndIf
  ; Swap the two bytes of each 16-bit unit in place.
  ; Stopping at n - 2 keeps the second byte inside the buffer when n is odd.
  For i = 0 To n - 2 Step 2
    ; *c\b = *b.b
    *x = *b + i
    *c = *b + i + 1
    Swap *x\b, *c\b
  Next
EndProcedure
Filename$ = "C:\PB\Source\Current\16BE.txt"
; Filename$ = "C:\PB\Source\Current\1.pb"
_ReadFile(Filename$)
If Count
  ; ShowMemoryViewer(*Buffer, 100)
  BEtoLE(*Buffer, Count)
  ; ShowMemoryViewer(*Buffer, 100)
  ; PeekS takes its length in characters, so the byte count is halved
  Debug PeekS(*Buffer, Count / 2, #PB_Unicode)
EndIf
If *Buffer : FreeMemory(*Buffer) : EndIf
Re: Unicode question
Thanks AZJIO & juergenkulow for your help.
For the moment, I use an old technique (I don't know if it is faster, but it works).
Note: I always know in advance the number of characters to be recovered.
Code: Select all
Length.u=(?E-?L)
String.s=Space(Length/2)
CopyMemory(?L,@String,?E-?L)
Length-2
For Count.u=0 To Length Step 2
  Value.u=PeekW(@String+Count)
  String2.s+Chr((Value&$FF)<<8+(Value>>8)&$FF)
Next
Debug String2
DataSection
  L:
  Data.a $00, $53, $00, $74, $00, $72, $00, $69, $00, $6E, $00, $67
  E:
EndDataSection
Re: Unicode question
Use correct byte order inside data section
Intel lowByte hiByte notation.
Code: Select all
Length.u=(?E-?L)
String.s=Space(Length/2)
CopyMemory(?L,@String,?E-?L)
Debug String
string2.s = PeekS(?L)
Debug string2
DataSection
  L:
  Data.a $53, $00, $74, $00, $72, $00, $69, $00, $6E, $00, $67, $00
  Data.u 0
  E:
EndDataSection
My Projects ThreadToGUI / OOP-BaseClass / EventDesigner V3
PB v3.30 / v5.75 - OS Mac Mini OSX 10.xx - VM Window Pro / Linux Ubuntu
Downloads on my Webspace / OneDrive
Re: Unicode question
Code: Select all
Debug PeekS(?L)
Length.u=(?E-?L)
String.s=Space(Length/2 + 1)
CopyMemory(@"",@String,1)
CopyMemory(?L,@String + 1,?E-?L)
Debug PeekS(@String + 2)
DataSection
  L:
  Data.a $00, $53, $00, $74, $00, $72, $00, $69, $00, $6E, $00, $67
  E:
EndDataSection
Re: Unicode question
Hi mk-soft,

mk-soft wrote: Use correct byte order inside data section
boddhi wrote: Example of a hex string in files: 00 53 00 74 00 72 00 69 00 6E 00 67
Re: Unicode question
@AZJIO
Eureka: simple and efficient!
An offset of one byte... why didn't I think of that?
I retrieve the string from the file into memory and offset it by one byte.
Thanks
Re: Unicode question
Simply shifting by one byte may mean that the Unicode characters are no longer correct.
It is best to enter them as Unicode in the DataSection right away.
Code: Select all
Length.u=(?E-?L)
String.s=Space(Length/2)
CopyMemory(?L,@String,?E-?L)
Debug String
string2.s = PeekS(?L)
Debug string2
MessageRequester("Title", string2)
DataSection
  L:
  Data.u $53, $74, $72, $69, $6E, $67, $2614, $26D4
  Data.u 0
  E:
EndDataSection
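To make the warning about the one-byte shift concrete, here is a small sketch, in Python rather than PureBasic and only for illustration. The helper name `shift_one_byte` is hypothetical; it mimics the offset trick by dropping the first byte, padding with a zero byte, and reading the result as little-endian. The trick pairs each low byte with the high byte of the next character, so it only gives the right text when every high byte is 00.

```python
def shift_one_byte(data: bytes) -> str:
    """Drop the first byte, pad with a zero byte, and read the rest as UTF-16LE."""
    return (data[1:] + b"\x00").decode("utf-16-le")


ascii_be = "String".encode("utf-16-be")   # 00 53 00 74 ... every high byte is 00
mixed_be = "A\u03b1".encode("utf-16-be")  # 00 41 03 B1: "A" followed by Greek alpha

print(shift_one_byte(ascii_be) == "String")       # True: the shift happens to work
print(shift_one_byte(mixed_be) == "A\u03b1")      # False: high and low bytes get mixed up
print(mixed_be.decode("utf-16-be") == "A\u03b1")  # True: a real big-endian decode is correct
```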
Re: Unicode question
Give me a second try:
Code: Select all
; Motorolaformat with Pointer to Unicode and Ascii without PeekW, Chr
Structure aatype
  a1.a
  a2.a
EndStructure
Structure uaatype
  StructureUnion
    u.u
    aa.aatype
  EndStructureUnion
EndStructure
Length=(?E-?L)
String.s=Space(Length/2)
CopyMemory(?L,@String,?E-?L)
*ptr.uaatype=@String
*pend=*ptr+Length
; Stop at the last character: "To *pend" would also visit the terminating null
For *ptr.uaatype=@String To *pend-2 Step 2
  *ptr\u=*ptr\aa\a1<<8+*ptr\aa\a2
Next
Debug String
DataSection
  L:
  Data.a $00, $53, $00, $74, $00, $72, $00, $69, $00, $6E, $00, $67
  E:
EndDataSection
Re: Unicode question
Further to my question, I have discovered that my problem is more complex to solve, as I could potentially encounter the situation described below.
Strings are encoded in Windows Unicode BMP format, which encodes each character value on 2 bytes (a WORD). To retrieve the string from memory or from a file, each character must be concatenated with Chr().
Except that in memory or in the file, an 'A', for example, is encoded as $00 41 and not $41 00. As a result, I can't use PeekS with the #PB_Unicode option.
In short, is it possible to retrieve in one shot a string coded $00 41 instead of $41 00, or $03 9A instead of $9A 03 (without having to use a loop for each character)?
I've also tried WideCharToMultiByte but didn't succeed (maybe I didn't use the right parameters).
To better understand the situation, here's an example (valid only with Intel byte order). Click on Start after each CallDebugger.
Code: Select all
Procedure.s MemoryRead(*ArgBuffer,ArgLength.a)
  Protected.a Count
  Protected.u Value
  Protected.s String
  ArgLength-2
  For Count=0 To ArgLength Step 2
    Value=PeekW(*ArgBuffer+Count)
    Value=(Value&$FF)<<8+(Value>>8)&$FF
    String+Chr(Value)
  Next
  ProcedureReturn String
EndProcedure
String.s="Normal"
Debug "Memory block is equal to (with Intel byte order) : 4E 00 6F 00 72 00 6D 00 61 00 6C 00"
Debug "Reading memory (#PB_Unicode) : "+PeekS(@string,12,#PB_Unicode)
ShowMemoryViewer(@String,Len(String)*2)
CallDebugger
*Buffer=AllocateMemory(13)
CopyMemory(@String,*Buffer+1,12)
Debug "Memory block is equal to (with Intel byte order) : 00 4E 00 6F 00 72 00 6D 00 61 00 6C"
Debug "Reading memory (MemoryRead procedure) : "+MemoryRead(*Buffer,12)
Debug "Reading memory (#PB_Ascii) : "+PeekS(*Buffer,12,#PB_Ascii)
Debug "Reading memory (#PB_UTF8) : "+PeekS(*Buffer,12,#PB_UTF8)
Debug "Reading memory (#PB_Unicode) : "+PeekS(*Buffer,12,#PB_Unicode)
String2.s=Space(6)
*buffer2=AllocateMemory(6)
WideCharToMultiByte_(#CP_ACP,0,@String,12,*buffer2,6,0,0)
Debug "WideCharToMultiByte : "+PeekS(*buffer2,6,#PB_Ascii)
; String2.s=Space(20)
; Longueur=WideCharToMultiByte_(#CP_ACP,0,@String,12,@String2,6,0,0)
; Debug "String2 : "+String2
; ShowMemoryViewer(@String2,6)
;CallDebugger
Debug "----"
String="Κανονικά"
; ReAllocateMemory may move the block, so keep the returned pointer
*Buffer=ReAllocateMemory(*Buffer,17)
CopyMemory(@String,*Buffer+1,16)
PokeB(*Buffer,3)
Debug "Memory block is now equal to (with Intel byte order) : 03 9A 03 B1 03 BD 03 BF 03 BD 03 B9 03 BA 03 AC"
Debug "Reading memory (MemoryRead procedure) : "+MemoryRead(*Buffer,16)
Debug "Reading memory (#PB_Ascii) : "+PeekS(*Buffer,16,#PB_Ascii)
Debug "Reading memory (#PB_UTF8) : "+PeekS(*Buffer,16,#PB_UTF8)
Debug "Reading memory (#PB_Unicode) : "+PeekS(*Buffer,16,#PB_Unicode)
String2=Space(16)
*buffer2=ReAllocateMemory(*buffer2,16)
WideCharToMultiByte_(#CP_ACP,0,@String,16,@String2,8,0,0)
Debug "WideCharToMultiByte : "+String2
WideCharToMultiByte_(#CP_ACP,0,@String,16,*buffer2,8,0,0)
ShowMemoryViewer(*Buffer2,16)
CallDebugger
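For comparison, this is what "in one shot" looks like in a language that ships a big-endian UTF-16 codec. The sketch uses Python's built-in `utf-16-be` codec and the two byte sequences quoted in the Debug lines above; it is an illustration of the goal only and makes no assumption about an equivalent existing in PureBasic.

```python
# Byte sequences taken from the Debug lines of the example above
raw_latin = bytes.fromhex("00 4E 00 6F 00 72 00 6D 00 61 00 6C")
raw_greek = bytes.fromhex("03 9A 03 B1 03 BD 03 BF 03 BD 03 B9 03 BA 03 AC")

# One decode call per buffer, no per-character loop in user code
latin = raw_latin.decode("utf-16-be")
greek = raw_greek.decode("utf-16-be")

print(latin)
print(greek)
```

Unlike the one-byte offset, this handles characters whose high byte is not 00, because each pair of bytes is interpreted as one big-endian code unit.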
Re: Unicode question
Code: Select all
; to Motorola
Procedure.s MemoryRead(*ArgBuffer,ArgLength.a)
  Protected.a Count
  Protected.u Value
  Protected.s String
  ArgLength-2
  For Count=0 To ArgLength Step 2
    Value=PeekW(*ArgBuffer+Count)
    Value=(Value&$FF)<<8+(Value>>8)&$FF
    ; Debug Hex(Value)
    String+Chr(Value)
  Next
  ProcedureReturn String
EndProcedure
Structure aatype
  a1.a
  a2.a
EndStructure
Structure uaatype
  StructureUnion
    u.u
    aa.aatype
  EndStructureUnion
EndStructure
Procedure toMotorola(*ptr.uaatype,l.i)
  Protected *pend=*ptr+l
  ; Stop at the last character: "To *pend" would also visit the terminating null
  For *ptr=*ptr To *pend-2 Step 2
    *ptr\u=*ptr\aa\a1<<8+*ptr\aa\a2
  Next
EndProcedure
Debug MemoryRead(?L,?E-?L)
s.s="Κ α ν ο ν ι κ ά"
toMotorola(@s,Len(s)*2)
ShowMemoryViewer(@s,Len(s)*2)
Debug MemoryRead(@s,Len(s)*2)
End
DataSection
  L:
  Data.a $03, $9A, $03, $B1, $03, $BD, $03, $BF, $03, $BD, $03, $B9, $03, $BA, $03, $AC
  E:
EndDataSection
; Κανονικά
; Κ α ν ο ν ι κ ά
; 0000000001E112F0 03 9A 00 20 03 B1 00 20 03 BD 00 20 03 BF 00 20 .. .±. .½. .¿.
; 0000000001E11300 03 BD 00 20 03 B9 00 20 03 BA 00 20 03 AC .½. .¹. .º. .¬
