
Re: Unicode question

Posted: Sat Jan 13, 2024 12:57 pm
by mk-soft
There is something wrong with your data. It is not normal for Unicode characters in Windows to be in high-low byte notation. Therefore your start pointer is wrong.

Re: Unicode question

Posted: Sat Jan 13, 2024 1:23 pm
by DarkDragon
mk-soft wrote: Sat Jan 13, 2024 12:57 pm There is something wrong with your data. It is not normal for Unicode characters in Windows to be in high-low byte notation. Therefore your start pointer is wrong.
Endianness isn't OS-specific. Several protocols and file formats require a specific endianness. A BOM usually indicates which endianness is used, but not always: sometimes the protocol mandates big endian, and then a BOM is unnecessary.
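
To illustrate the BOM point, here is a minimal sketch that inspects the first two bytes of a memory buffer. The helper name DetectUtf16Bom, the #Bom_... constants and the sample data are made up for this example and are not PureBasic built-ins. It only distinguishes the two UTF-16 BOMs and does not try to recognise UTF-8 or UTF-32.

```purebasic
EnableExplicit

; Hypothetical result codes for this sketch
Enumeration
  #Bom_None
  #Bom_Utf16LE
  #Bom_Utf16BE
EndEnumeration

; Hypothetical helper: looks at the first two bytes of a buffer
Procedure.i DetectUtf16Bom(*Buffer, Size.i)
  Protected b0.a, b1.a

  If *Buffer = 0 Or Size < 2
    ProcedureReturn #Bom_None
  EndIf

  b0 = PeekA(*Buffer)
  b1 = PeekA(*Buffer + 1)

  If b0 = $FE And b1 = $FF
    ProcedureReturn #Bom_Utf16BE   ; U+FEFF stored high byte first
  ElseIf b0 = $FF And b1 = $FE
    ProcedureReturn #Bom_Utf16LE   ; U+FEFF stored low byte first
  EndIf

  ProcedureReturn #Bom_None
EndProcedure

Select DetectUtf16Bom(?Sample, 4)
  Case #Bom_Utf16BE : Debug "big endian"
  Case #Bom_Utf16LE : Debug "little endian"
  Default           : Debug "no BOM, the format has to specify the byte order"
EndSelect

DataSection
  Sample:
  Data.a $FE, $FF, $00, $53   ; BOM followed by "S" in big endian
EndDataSection
```

If no BOM is found, the byte order has to come from the protocol or file format specification, as described above.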

Re: Unicode question

Posted: Sat Jan 13, 2024 3:18 pm
by boddhi
@juergenkulow
juergenkulow wrote: Sat Jan 13, 2024 12:47 pm Your procedure is fine, but unfortunately your test program is not.
I don't quite understand why my example isn't a good one. Maybe I'm missing something...

@mk-soft
Below is a hexadecimal view of my file.
I retrieve a bigger block of data with ReadData(), so it keeps the same structure once in memory...
[Image: hexadecimal view of the file]

Re: Unicode question

Posted: Sat Jan 13, 2024 3:23 pm
by SMaag
Here are a few more versions showing how to change string endianness:

Code:

EnableExplicit

; Version 1: swap the two bytes of each 16-bit character in place, up to the terminating zero
Procedure ToggleStringEndianess(*Char.Character)
  Protected *a1.Ascii = *Char
  Protected *a2.Ascii = *a1 + 1
  
  While *Char\c 
    Swap *a1\a, *a2\a
    *Char + SizeOf(Character)
    *a1 = *Char
    *a2 = *a1 + 1    
  Wend
EndProcedure

; -------------------------------------------------------------
; Version 2 with Pointer Structure 
Structure pChar
  c.c[0]  
  a.a[0]  
EndStructure

Procedure ToggleStringEndianess2(*Char.pChar)   
  While *Char\c[0] 
    Swap *Char\a[0], *Char\a[1]
    *Char + SizeOf(Character)
  Wend
EndProcedure

; -------------------------------------------------------------
; Version 3 Assembler (x86/x64 only; the inline ASM needs the assembler backend, not the C backend)
Procedure ToggleStringEndianess3(*Char.Character)
  CompilerIf #PB_Compiler_64Bit
    While *Char\c
      !MOV RAX, [p.p_Char]
      !MOV DX, WORD[RAX]
      !XCHG DL, DH  ; for 16 Bit ByteSwap it's the Exchange command 
      !MOV  WORD[RAX], DX
      *Char + SizeOf(Character)
    Wend
  CompilerElse ; #PB_Compiler_32Bit
    While *Char\c
      !MOV EAX, [p.p_Char]
      !MOV DX, WORD[EAX]
      !XCHG DL, DH  ; for 16 Bit ByteSwap it's the Exchange command 
      !MOV  WORD[EAX], DX
      *Char + SizeOf(Character)
    Wend   
  CompilerEndIf
    
EndProcedure

; -------------------------------------------------------------
; Testcode
Define.s sTest

; Version 1
Debug "Version 1"
Debug ""
sTest = PeekS(?MotorolaString)
; sTest = PeekS(?IntelString)
Debug sTest
ToggleStringEndianess(@sTest)
Debug sTest
ToggleStringEndianess(@sTest)
Debug sTest

; Version 2
Debug ""
Debug "Version 2"
Debug ""
sTest = PeekS(?MotorolaString)
; sTest = PeekS(?IntelString)
Debug sTest
ToggleStringEndianess2(@sTest)
Debug sTest
ToggleStringEndianess2(@sTest)
Debug sTest

; Version 3
Debug ""
Debug "Version 3 Assembler"
Debug ""
sTest = PeekS(?MotorolaString)
; sTest = PeekS(?IntelString)
Debug sTest
ToggleStringEndianess3(@sTest)
Debug sTest
ToggleStringEndianess3(@sTest)
Debug sTest

DataSection
  ; "String" in Motorola notation Big Endian and Intel notation Little Endian
  MotorolaString:
  Data.a $00, $53, $00, $74, $00, $72, $00, $69, $00, $6E, $00, $67, $0, $0
  
  IntelString:
  Data.a $53, $00, $74, $00, $72, $00, $69, $00, $6E, $00, $67, $0, $0, $0
EndDataSection

Re: Unicode question

Posted: Sat Jan 13, 2024 4:18 pm
by boddhi
SMaag wrote: Sat Jan 13, 2024 3:23 pm Here are a few more versions showing how to change string endianness:
Thanks for your suggestions. :wink:
I already have my own. See here

What I want to know is whether I can do this without a loop (For...Next, While...Wend) and read the string from memory the way PeekS() does.
Maybe with a Windows API, or something else?

Re: Unicode question

Posted: Sat Jan 13, 2024 6:38 pm
by wilbert
boddhi wrote: Sat Jan 13, 2024 4:18 pm What I want to know is whether I can do this without a loop (For...Next, While...Wend) and read the string from memory the way PeekS() does.
Maybe with a Windows API, or something else?
Why are you looking for something without a loop?
A function like PeekS or WideCharToMultiByte also uses a loop internally.
The difference is that the procedure is already compiled, so you don't see it.
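
As an example of a precompiled function that hides the loop, here is an untested sketch using the Windows function LCMapStringW with the LCMAP_BYTEREV flag, which Microsoft documents as byte reversal of each 16-bit value. The #Sketch_... constant names, the prototype name and the sample data are assumptions made for this example. The function is loaded at runtime with OpenLibrary()/GetFunction(), so the sketch does not depend on whether PureBasic already imports it. It is Windows-only, and internally it still has to walk over every character.

```purebasic
EnableExplicit

CompilerIf #PB_Compiler_OS = #PB_OS_Windows

  ; Values taken from the Windows SDK headers; the names are local to this sketch
  #Sketch_LOCALE_INVARIANT = $007F
  #Sketch_LCMAP_BYTEREV    = $0800

  ; int LCMapStringW(LCID, DWORD, LPCWSTR, int, LPWSTR, int)
  Prototype.l ProtoLCMapStringW(Locale.l, Flags.l, *Src, SrcLen.l, *Dst, DstLen.l)

  Define Kernel32.i, Length.l, Result.s
  Define LCMapStringW.ProtoLCMapStringW

  Kernel32 = OpenLibrary(#PB_Any, "kernel32.dll")
  If Kernel32
    LCMapStringW = GetFunction(Kernel32, "LCMapStringW")
    If LCMapStringW
      Length = 6               ; number of 16-bit characters to convert
      Result = Space(Length)   ; destination buffer with room for Length characters
      If LCMapStringW(#Sketch_LOCALE_INVARIANT, #Sketch_LCMAP_BYTEREV, ?BigEndianText, Length, @Result, Length)
        Debug Result
      EndIf
    EndIf
    CloseLibrary(Kernel32)
  EndIf

  DataSection
    BigEndianText:
    ; "String" stored high byte first, without a terminating zero
    Data.a $00, $53, $00, $74, $00, $72, $00, $69, $00, $6E, $00, $67
  EndDataSection

CompilerEndIf
```

The length is passed explicitly here because a binary file does not necessarily store a terminating zero after the text.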

Re: Unicode question

Posted: Sat Jan 13, 2024 9:03 pm
by boddhi
wilbert wrote: Why are you looking for something without a loop?
A function like PeekS or WideCharToMultiByte also uses a loop internally.
The difference is that the procedure is already compiled, so you don't see it.
My answer would be that these functions were created to make the programmer's life easier. So if a function (e.g. an API) that I don't know about already exists, why reinvent the wheel? :wink:
Using Unicode with Windows isn't that simple, and there may be techniques or information I don't know about despite my research.
I'm not a programming pro. I'm asking a question that may not have a positive answer, in which case (as seems likely) I'll work around the problem.

Re: Unicode question

Posted: Sat Jan 13, 2024 9:12 pm
by DarkDragon
wilbert wrote: Sat Jan 13, 2024 6:38 pm
boddhi wrote: Sat Jan 13, 2024 4:18 pm What I want to know is whether I can do this without a loop (For...Next, While...Wend) and read the string from memory the way PeekS() does.
Maybe with a Windows API, or something else?
Why are you looking for something without a loop?
A function like PeekS or WideCharToMultiByte also uses a loop internally.
The difference is that the procedure is already compiled, so you don't see it.
We have Read-/WriteStringFormat, but Read-/WriteString can only handle one format, which seems a bit incomplete. This should be a feature request: extend Read-/WriteString and PeekS/PokeS with endianness flags.
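
Until something like that exists, a small wrapper can stand in for it. The sketch below is only an illustration of what a big-endian PeekS could look like: the name PeekSBE and its parameters are hypothetical and not part of PureBasic. It expects the length in 16-bit characters and builds each character from a high byte followed by a low byte.

```purebasic
EnableExplicit

; Hypothetical helper (not a PureBasic built-in): reads Length UTF-16 big-endian
; characters from *Source and returns them as a native PureBasic string.
Procedure.s PeekSBE(*Source, Length.i)
  Protected Result.s, *Src.Ascii, *Dst.Unicode, i.i

  If *Source = 0 Or Length <= 0
    ProcedureReturn ""
  EndIf

  Result = Space(Length)   ; destination buffer with room for Length characters
  *Src = *Source
  *Dst = @Result

  For i = 1 To Length
    ; high byte comes first in the source, low byte second
    *Dst\u = (*Src\a << 8) | PeekA(*Src + 1)
    *Src + 2
    *Dst + SizeOf(Unicode)
  Next

  ProcedureReturn Result
EndProcedure

Debug PeekSBE(?BigEndianText, 6)   ; shows "String"

DataSection
  BigEndianText:
  Data.a $00, $53, $00, $74, $00, $72, $00, $69, $00, $6E, $00, $67
EndDataSection
```

Unlike the in-place toggle procedures, this leaves the source buffer untouched, which matters when the same block read with ReadData() also contains non-text fields.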

Re: Unicode question

Posted: Sat Jan 13, 2024 9:23 pm
by boddhi
DarkDragon wrote: We have Read-/WriteStringFormat
One detail I may have left out :oops: : my files are binary files, so it's impossible to determine the encoding with ReadStringFormat().

Re: Unicode question

Posted: Sat Jan 13, 2024 9:29 pm
by DarkDragon
boddhi wrote: Sat Jan 13, 2024 9:23 pm
DarkDragon wrote: We have Read-/WriteStringFormat
One detail I may have left out :oops: : my file is a binary file, so it's impossible to determine the encoding with ReadStringFormat().
Of course, that's not what I meant. The presence of these functions suggests that PureBasic can handle different endianness when reading/writing strings from/to files or memory without further additions. Unfortunately, it cannot.
That can happen and is totally OK; the best you can do is create a feature request.

Re: Unicode question

Posted: Sat Jan 13, 2024 9:34 pm
by boddhi
DarkDragon wrote: Of course, that's not what I meant.
I understood that :wink: :)