Unicode question


Re: Unicode question

Post by mk-soft »

There is something wrong with your data. It is not normal for Unicode characters in Windows to be in high-low byte notation. Therefore your start pointer is wrong.

Re: Unicode question

Post by DarkDragon »

mk-soft wrote: Sat Jan 13, 2024 12:57 pm There is something wrong with your data. It is not normal for Unicode characters in Windows to be in high-low byte notation. Therefore your start pointer is wrong.
Endianness isn't OS-specific. Several protocols and file formats require a specific byte order. A BOM usually indicates which endianness is used, but not always: sometimes the protocol mandates big-endian, and then a BOM is unnecessary.
bye,
Daniel
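
A minimal sketch of the BOM check described above, assuming the data is already in a memory buffer. The procedure name DetectUtf16Bom and the #Utf16_... constants are hypothetical and not part of PureBasic or of this thread. When no BOM is present, the byte order has to come from the protocol or file format specification instead.

```purebasic
EnableExplicit

; Hypothetical result codes for this sketch
Enumeration
  #Utf16_NoBom
  #Utf16_LittleEndian
  #Utf16_BigEndian
EndEnumeration

; Looks at the first two bytes of a buffer and reports the UTF-16 byte
; order announced by a BOM, if there is one.
; Note: FF FE followed by 00 00 could also be a UTF-32 LE BOM; this
; sketch does not distinguish that case.
Procedure.i DetectUtf16Bom(*Buffer, Size.i)
  Protected b0.a, b1.a
  
  If *Buffer = 0 Or Size < 2
    ProcedureReturn #Utf16_NoBom
  EndIf
  
  b0 = PeekA(*Buffer)
  b1 = PeekA(*Buffer + 1)
  
  If b0 = $FF And b1 = $FE
    ProcedureReturn #Utf16_LittleEndian
  ElseIf b0 = $FE And b1 = $FF
    ProcedureReturn #Utf16_BigEndian
  EndIf
  
  ProcedureReturn #Utf16_NoBom
EndProcedure

Debug DetectUtf16Bom(?WithBigEndianBom, 4)
Debug DetectUtf16Bom(?WithoutBom, 4)

DataSection
  WithBigEndianBom:
  Data.a $FE, $FF, $00, $53
  
  WithoutBom:
  Data.a $00, $53, $00, $74
EndDataSection
```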

Re: Unicode question

Post by boddhi »

@juergenkulow
juergenkulow wrote: Sat Jan 13, 2024 12:47 pm Your procedure is fine, but unfortunately your test program is not.
I don't quite understand why my example isn't a good one. Maybe I'm missing something...

@mk-soft
Below, a hexadecimal view of my file.
I retrieve a bigger block of data with ReadData(), so it retains the same structure once in memory...
[Image: hexadecimal view of the file]
If my English syntax and lexicon are incorrect, please bear with Google translate and DeepL. They rarely agree with each other!
Except on this sentence...

Re: Unicode question

Post by SMaag »

Some more versions of how to change string endianness:

EnableExplicit

; Version 1  
Procedure ToggleStringEndianess(*Char.Character)
  Protected *a1.Ascii = *Char
  Protected *a2.Ascii = *a1 + 1
  
  While *Char\c 
    Swap *a1\a, *a2\a
    *Char + SizeOf(Character)
    *a1 = *Char
    *a2 = *a1 + 1    
  Wend
EndProcedure

; -------------------------------------------------------------
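; Version 2 with Pointer Structure 
; Both zero-sized arrays start at offset 0, so c[0] reads the whole
; character while a[0] and a[1] address its two bytes.
Structure pChar
  c.c[0]  
  a.a[0]  
EndStructure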

Procedure ToggleStringEndianess2(*Char.pChar)   
  While *Char\c[0] 
    Swap *Char\a[0], *Char\a[1]
    *Char + SizeOf(Character)
  Wend
EndProcedure

; -------------------------------------------------------------
; Version 3 Assembler (inline FASM syntax: needs the ASM backend, not the C backend)
Procedure ToggleStringEndianess3(*Char.Character)
  CompilerIf #PB_Compiler_64Bit
    While *Char\c
      !MOV RAX, [p.p_Char]
      !MOV DX, WORD[RAX]
      !XCHG DL, DH  ; for 16 Bit ByteSwap it's the Exchange command 
      !MOV  WORD[RAX], DX
      *Char + SizeOf(Character)
    Wend
  CompilerElse ; #PB_Compiler_32Bit
    While *Char\c
      !MOV EAX, [p.p_Char]
      !MOV DX, WORD[EAX]
      !XCHG DL, DH  ; for 16 Bit ByteSwap it's the Exchange command 
      !MOV  WORD[EAX], DX
      *Char + SizeOf(Character)
    Wend   
  CompilerEndIf
    
EndProcedure

; -------------------------------------------------------------
; Testcode
Define.s sTest

; Version 1
Debug "Version 1"
Debug ""
sTest = PeekS(?MotorolaString)
; sTest = PeekS(?IntelString)
Debug sTest
ToggleStringEndianess(@sTest)
Debug sTest
ToggleStringEndianess(@sTest)
Debug sTest

; Version 2
Debug ""
Debug "Version 2"
Debug ""
sTest = PeekS(?MotorolaString)
; sTest = PeekS(?IntelString)
Debug sTest
ToggleStringEndianess2(@sTest)
Debug sTest
ToggleStringEndianess2(@sTest)
Debug sTest

; Version 3
Debug ""
Debug "Version 3 Assembler"
Debug ""
sTest = PeekS(?MotorolaString)
; sTest = PeekS(?IntelString)
Debug sTest
ToggleStringEndianess3(@sTest)
Debug sTest
ToggleStringEndianess3(@sTest)
Debug sTest

DataSection
  ; "String" in Motorola notation Big Endian and Intel notation Little Endian
  MotorolaString:
  Data.a $00, $53, $00, $74, $00, $72, $00, $69, $00, $6E, $00, $67, $0, $0
  
  IntelString:
  Data.a $53, $00, $74, $00, $72, $00, $69, $00, $6E, $00, $67, $0, $0, $0
EndDataSection

Re: Unicode question

Post by boddhi »

SMaag wrote: Sat Jan 13, 2024 3:23 pm some more Versions how to change String Endianess
Thanks for your suggestions. :wink:
I already have my own. See here

My goal is to know if I can do that without a loop (For...Next, While...Wend) and read the string from memory as PeekS() can do it.
Maybe a windows API ? or else...

Re: Unicode question

Post by wilbert »

boddhi wrote: Sat Jan 13, 2024 4:18 pm My goal is to know if I can do that without a loop (For...Next, While...Wend) and read the string from memory as PeekS() can do it.
Maybe a windows API ? or else...
Why are you looking for something without a loop ?
A function like PeekS or WideCharToMultiByte also uses a loop internally.
The difference is that the procedure is already compiled so you don't see it.
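
One Windows call that appears to fit the request is LCMapStringW with the LCMAP_BYTEREV flag, which exchanges the two bytes of every 16-bit character. It still loops internally, as described above. The sketch below is untested and rests on several assumptions: that PureBasic exposes the call as LCMapString_(), that the flag may be used on its own, and that source and destination must be separate buffers. The wrapper name ByteRevString and the #Sketch_... constant names are hypothetical; the values are the ones from the Windows SDK headers.

```purebasic
EnableExplicit

; Values from the Windows SDK headers, declared under private names
#Sketch_LCMAP_BYTEREV       = $00000800
#Sketch_LOCALE_USER_DEFAULT = $0400

; Hypothetical wrapper: returns the text at *Memory with the two bytes of
; every 16-bit code unit exchanged. Length is given in characters.
; Returns an empty string on failure or on non-Windows systems.
Procedure.s ByteRevString(*Memory, Length.i)
  Protected Result.s
  
  CompilerIf #PB_Compiler_OS = #PB_OS_Windows
    If *Memory And Length > 0
      Result = Space(Length) ; separate destination buffer
      ; Assumption: LCMapString_() maps to LCMapStringW in Unicode builds
      If LCMapString_(#Sketch_LOCALE_USER_DEFAULT, #Sketch_LCMAP_BYTEREV, *Memory, Length, @Result, Length) = 0
        Result = "" ; the call failed
      EndIf
    EndIf
  CompilerEndIf
  
  ProcedureReturn Result
EndProcedure

CompilerIf #PB_Compiler_OS = #PB_OS_Windows
  ; "String" stored big-endian, six characters
  Debug ByteRevString(?BigEndianText, 6)
CompilerEndIf

DataSection
  BigEndianText:
  Data.a $00, $53, $00, $74, $00, $72, $00, $69, $00, $6E, $00, $67, $00, $00
EndDataSection
```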

Re: Unicode question

Post by boddhi »

wilbert wrote: Why are you looking for something without a loop ?
A function like PeekS or WideCharToMultiByte also uses a loop internally.
The difference is that the procedure is already compiled so you don't see it.
My answer would be that these functions were created to simplify the programmer's life. So if a function I don't know about (e.g. an API) already exists, why reinvent the wheel? :wink:
Using Unicode on Windows isn't that simple, and there may be techniques or information I don't know about despite my research.
I'm not a programming pro. I'm making a request that may not have a positive answer, in which case (as seems likely) I'll work around the problem.

Re: Unicode question

Post by DarkDragon »

wilbert wrote: Sat Jan 13, 2024 6:38 pm
boddhi wrote: Sat Jan 13, 2024 4:18 pm My goal is to know if I can do that without a loop (For...Next, While...Wend) and read the string from memory as PeekS() can do it.
Maybe a windows API ? or else...
Why are you looking for something without a loop ?
A function like PeekS or WideCharToMultiByte also uses a loop internally.
The difference is that the procedure is already compiled so you don't see it.
We have Read-/WriteStringFormat, but we can only handle one format with Read-/WriteString, which seems a bit incomplete. This should be a feature request: extend Read-/WriteString and PeekS/PokeS with endianness flags.
bye,
Daniel

Re: Unicode question

Post by boddhi »

DarkDragon wrote: We have Read-/WriteStringFormat
One thing I may have omitted :oops: : my files are binary files, so it's impossible to determine the encoding with ReadStringFormat().
Last edited by boddhi on Sat Jan 13, 2024 9:32 pm, edited 1 time in total.

Re: Unicode question

Post by DarkDragon »

boddhi wrote: Sat Jan 13, 2024 9:23 pm
DarkDragon wrote: We have Read-/WriteStringFormat
Note that I may have omitted :oops: : My file is a binary file, so it's impossible to determine the encoding with ReadStringFormat().
Of course, that's not what I've meant. The presence of these functions implies PureBasic can handle different endianness when reading/writing strings from/to files/memory without further additions. Unfortunately it cannot.
That can happen and is totally OK; the best you can do is create a feature request.
bye,
Daniel
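
Until such a flag exists, a PeekS-like wrapper can hide the loop from the calling code. This is only a sketch: the name PeekSBE and its parameters are hypothetical, and it assumes PureBasic 5.50 or later, where strings are stored as 16-bit little-endian characters. The test data is the big-endian "String" from the DataSection posted earlier in the thread.

```purebasic
EnableExplicit

; Hypothetical PeekS-like helper for big-endian 16-bit text.
; Length is in characters; -1 means "read up to the 16-bit zero terminator".
Procedure.s PeekSBE(*Memory, Length.i = -1)
  Protected Result.s
  Protected *Src.Unicode = *Memory
  Protected *Dst.Unicode
  Protected i.i
  
  If *Memory = 0
    ProcedureReturn ""
  EndIf
  
  If Length < 0
    ; Count the characters; a zero terminator is zero in either byte order
    Length = 0
    While *Src\u
      Length + 1
      *Src + SizeOf(Unicode)
    Wend
    *Src = *Memory
  EndIf
  
  Result = Space(Length)   ; destination buffer of the right size
  *Dst = @Result
  
  For i = 1 To Length
    ; Exchange the low and the high byte of each 16-bit unit
    *Dst\u = ((*Src\u & $FF) << 8) | (*Src\u >> 8)
    *Src + SizeOf(Unicode)
    *Dst + SizeOf(Unicode)
  Next
  
  ProcedureReturn Result
EndProcedure

Debug PeekSBE(?MotorolaText)      ; whole string
Debug PeekSBE(?MotorolaText, 3)   ; first three characters only

DataSection
  MotorolaText:
  Data.a $00, $53, $00, $74, $00, $72, $00, $69, $00, $6E, $00, $67, $00, $00
EndDataSection
```

Unlike the Toggle... procedures above, this version leaves the source buffer unchanged, so it can be called directly on a block loaded with ReadData().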

Re: Unicode question

Post by boddhi »

DarkDragon wrote: Of course, that's not what I've meant.
I understood that :wink: :)