Determining characterset of memory area


Post by Kukulkan »

Hi,

I just posted this code in the German forum as an answer and thought it might be useful for others here, too.


; Tries to determine the character set of the text in the given memory area.
; If you set ByteLength.i = 0, it assumes a null-terminated string.
; Possible return-codes:
; "utf-16LE-bom" -> contains utf-16LE encoded text including a bom
; "utf-16BE-bom" -> contains utf-16BE encoded text including a bom
; "utf-8bom" -> contains utf-8 encoded text including a bom
; "utf-8" -> contains utf-8 encoded text
; "ascii" -> only 7 bit chars
; "iso"   -> contains chars > 7 bit, but not utf-8
Procedure.s GetCharactersetMemory(MemPointer.i, ByteLength.i = 0)

  If PeekA(MemPointer.i + 0) = 239 And PeekA(MemPointer.i + 1) = 187 And PeekA(MemPointer.i + 2) = 191
    ; correct UTF8 BOM
    ProcedureReturn "utf-8bom"
  EndIf
  If PeekA(MemPointer.i + 0) = $FF And PeekA(MemPointer.i + 1) = $FE
    ; correct UTF16LE BOM
    ProcedureReturn "utf-16LE-bom"
  EndIf
  If PeekA(MemPointer.i + 0) = $FE And PeekA(MemPointer.i + 1) = $FF
    ; correct UTF16BE BOM
    ProcedureReturn "utf-16BE-bom"
  EndIf
  
  Protected Code.a = 0, AddBytes.i = 0
  Protected x = 0, a = 0
  Protected ValidUTF8 = #True ; init (to negotiate)
  Protected IsASC = #True     ; init (to negotiate)
  
  Repeat
    Code = PeekA(MemPointer.i + x)
    If Code > 127 And ValidUTF8 = #True
      IsASC = #False
      ; This may be the beginning of a UTF8 char
      If     Code & %11100000 = %11000000 ; 1 additional byte
        AddBytes = 1
      ElseIf Code & %11110000 = %11100000 ; 2 additional bytes
        AddBytes = 2
      ElseIf Code & %11111000 = %11110000 ; 3 additional bytes
        AddBytes = 3
      ElseIf Code & %11111100 = %11111000 ; 4 additional bytes (obsolete form, not valid under RFC 3629)
        AddBytes = 4
      ElseIf Code & %11111110 = %11111100 ; 5 additional bytes (obsolete form, not valid under RFC 3629)
        AddBytes = 5
      Else
        ValidUTF8 = #False
        Break ; no utf8, because it does not fit the standard
      EndIf
      ; validate utf8 characters
      For a = 1 To AddBytes
        x = x + 1
        Code = PeekA(MemPointer.i + x)
        If Code & %11000000 <> %10000000
          ValidUTF8 = #False
          Break; no utf8, because following bytes do not match "10xxxxxx"
        EndIf
      Next
    EndIf
    x = x + 1
  Until (x >= ByteLength.i And ByteLength.i > 0) Or Code = 0
  
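  If IsASC = #True
    ; no byte above 127 was seen, so the text is plain 7-bit ASCII
    ; (checked first, because ValidUTF8 is also still #True in this case)
    ProcedureReturn "ascii"
  EndIf
  
  If ValidUTF8 = #True
    ; every byte above 127 belonged to a well-formed UTF-8 sequence
    ProcedureReturn "utf-8"
  EndIf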
  
  ProcedureReturn "iso"

EndProcedure

; Reads a string from a pointer to raw bytes in memory.
; It detects the encoding first and decodes the bytes accordingly (single-byte or multibyte).
; Also works with unicode (UTF-16LE with BOM) memory areas.
Procedure.s PeekSSmart(Memory.i, Length.i)
  Protected CharSet.s = GetCharactersetMemory(Memory.i, Length.i)
  Select CharSet.s
    Case "utf-16LE-bom"
      ProcedureReturn PeekS(Memory.i+2, (Length.i-2)/2, #PB_Unicode) ; peek as UTF16LE ignoring BOM
    Case "utf-16BE-bom"
      Debug "utf-16BE-bom not supported by PeekSSmart()"
      ProcedureReturn ""
      ;  ProcedureReturn PeekS(Memory.i+2, (Length.i-2)/2, #PB_UTF16BE) ; peek as UTF16BE ignoring BOM
    Case "utf-8bom"
      ProcedureReturn PeekS(Memory.i+3, Length.i-3, #PB_UTF8) ; peek as utf8 ignoring BOM
    Case "utf-8"
      ProcedureReturn PeekS(Memory.i, Length.i, #PB_UTF8) ; peek as utf8
    Case "iso"
      ProcedureReturn PeekS(Memory.i, Length.i, #PB_Ascii) ; peek as single byte
    Case "ascii"
      ProcedureReturn PeekS(Memory.i, Length.i, #PB_Ascii) ; peek as single byte
  EndSelect
EndProcedure
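A short usage sketch may help here. It assumes that both procedures from the listing above are defined earlier in the same source file, and that the source file is saved as UTF-8 so the literal "ö" survives. The variable names are only illustrative.

```purebasic
; Usage sketch: assumes GetCharactersetMemory() and PeekSSmart()
; from the listing above are defined earlier in this file.
Define Text.s = "Smörebröd"
Define ByteLen.i = StringByteLength(Text, #PB_UTF8)  ; payload size in bytes, without terminator
Define *Buffer = AllocateMemory(ByteLen + 1)         ; +1 byte for the terminating null

If *Buffer
  ; Simulate raw bytes as they would arrive from a file or a callback
  PokeS(*Buffer, Text, -1, #PB_UTF8)

  Debug GetCharactersetMemory(*Buffer, ByteLen)      ; name of the detected character set
  Debug PeekSSmart(*Buffer, ByteLen)                 ; the decoded string

  FreeMemory(*Buffer)
EndIf
```

The buffer is deliberately null-terminated, because the length argument of PeekS() counts characters while ByteLen counts bytes; the terminator keeps PeekS() from reading past the text.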
Any enhancements are welcome.

Kukulkan
Last edited by Kukulkan on Thu Dec 13, 2012 6:17 pm, edited 1 time in total.

Re: Determining characterset of memory area

Post by IdeasVacuum »

Looks interesting! You are using 'Memory.i' where it should simply be 'Memory'.
Does pointer arithmetic like Memory + 2 work on both 32-bit and 64-bit?

Re: Determining characterset of memory area

Post by luis »

Hi, thanks, looks very interesting but:

1) Where do the #PB_UTF16 and #PB_UTF16BE constants for PeekS() come from?
The help lists #PB_Ascii, #PB_UTF8 and #PB_Unicode as valid flags for PeekS().

2) If I put a string into the buffer to be analyzed using PokeS(*buffer, "unicode text", -1, #PB_Unicode), is the buffer then not identified correctly?

Re: Determining characterset of memory area

Post by ts-soft »

IdeasVacuum wrote: Does pointer arithmetic like Memory + 2 work on both 32bit and 64bit?
Adding 2 bytes is always the same, on 32-bit and 64-bit ;)
Only the range of an Integer is bigger.
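As a small illustration of that point, the following sketch writes three bytes and reads one back by offset. Only the size of the pointer itself changes between builds, while the offsets are plain byte counts. The buffer name is just an example.

```purebasic
; Byte offsets do not depend on the pointer width.
Define *Buffer = AllocateMemory(8)

If *Buffer
  PokeA(*Buffer + 0, $FF)    ; first byte
  PokeA(*Buffer + 1, $FE)    ; second byte
  PokeA(*Buffer + 2, 65)     ; third byte: 'A'

  Debug SizeOf(Integer)      ; 4 in a 32-bit build, 8 in a 64-bit build
  Debug PeekA(*Buffer + 2)   ; 65 in both builds: two bytes after the start

  FreeMemory(*Buffer)
EndIf
```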

Re: Determining characterset of memory area

Post by Kukulkan »

Hi,
IdeasVacuum wrote: Does pointer arithmetic like Memory + 2 work on both 32bit and 64bit?
Yes, it works on both, because I work byte-wise and the offset passed to PeekS() is always in bytes. The main intent is to handle files read into memory, data received by some routine (cURL callbacks etc.), base64-decoded text, or data unzipped with zlib. In those cases you have raw bytes in memory, and PeekSSmart() can turn them into a correct string (in both single-byte and unicode mode).
luis wrote: 1) Where are the #PB_UTF16, #PB_UTF16BE peeks() constants coming from ?
You are right. I never had UTF-16 files to test with, so I wrote it the way I thought it would work. The constants come from the ReadStringFormat() function. I assume that #PB_UTF16 must be replaced with #PB_Unicode and that #PB_UTF16BE does not work as expected. Sorry about that. Maybe someone can handle this with a loop that correctly reads UTF16BE strings? Sadly, I have no examples.
[edit]#PB_UTF16 has the same value as #PB_Unicode (9). It should work even without modification, but I modified the initial post anyway.[/edit]
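One way such a loop could look is sketched below. This is not part of the original code: PeekSUTF16BE is a hypothetical helper name, and the sketch assumes a little-endian machine, so that swapping each byte pair produces data PeekS() can read with #PB_Unicode. It treats every 16-bit unit the same way, which should also keep surrogate pairs intact.

```purebasic
; Hypothetical helper (not from the original post): decodes UTF-16BE bytes by
; swapping each byte pair into a temporary buffer and reading it as UTF-16LE.
; Assumption: the program runs on a little-endian machine.
Procedure.s PeekSUTF16BE(*Source, ByteLength.i)
  Protected Result.s = ""
  Protected Units.i = ByteLength >> 1        ; number of complete 16-bit code units
  Protected i.i
  Protected *Swapped

  If *Source = 0 Or Units <= 0
    ProcedureReturn ""
  EndIf

  ; AllocateMemory() zero-fills, so the two extra bytes act as terminator
  *Swapped = AllocateMemory(Units * 2 + 2)
  If *Swapped
    For i = 0 To Units - 1
      PokeA(*Swapped + i * 2,     PeekA(*Source + i * 2 + 1)) ; low byte first
      PokeA(*Swapped + i * 2 + 1, PeekA(*Source + i * 2))     ; high byte second
    Next
    Result = PeekS(*Swapped, Units, #PB_Unicode)
    FreeMemory(*Swapped)
  EndIf

  ProcedureReturn Result
EndProcedure

; Usage: "Hi" encoded as UTF-16BE is 00 48 00 69
Define *Bytes = AllocateMemory(4)
If *Bytes
  PokeA(*Bytes + 0, $00) : PokeA(*Bytes + 1, $48)
  PokeA(*Bytes + 2, $00) : PokeA(*Bytes + 3, $69)
  Debug PeekSUTF16BE(*Bytes, 4)              ; shows "Hi"
  FreeMemory(*Bytes)
EndIf
```

In PeekSSmart() the "utf-16BE-bom" case could then call such a helper with Memory.i+2 and Length.i-2 to skip the BOM, but that wiring is untested here.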
luis wrote: 2) If I stuff a string in the buffer to be analyzed using PokeS(*buffer, "unicode text", -1, #PB_Unicode) the buffer is not correctly identified then ?
Yes, because "unicode text" does not contain a single non-ASCII character. It therefore also qualifies as "ascii", which is correct. Such a string needs no conversion because it fits in the ASCII character set. Try some text with non-ASCII characters, such as "Smörebröd".
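To see what actually ends up in the buffer, a byte dump can help. The sketch below is an illustration only; HexDump is a hypothetical helper, and the source file is assumed to be saved as UTF-8. It prints the bytes PokeS() writes for ASCII-only text in UTF-8, for "Smörebröd" in UTF-8, and for ASCII-only text written with #PB_Unicode.

```purebasic
; Hypothetical helper: returns the bytes of a memory area as space-separated hex.
Procedure.s HexDump(*Mem, ByteLength.i)
  Protected i.i
  Protected Out.s = ""
  For i = 0 To ByteLength - 1
    Out = Out + RSet(Hex(PeekA(*Mem + i)), 2, "0") + " "
  Next
  ProcedureReturn Out
EndProcedure

Define *Buffer = AllocateMemory(64)
If *Buffer
  ; ASCII-only text as UTF-8: every byte is below $80
  PokeS(*Buffer, "unicode", -1, #PB_UTF8)
  Debug HexDump(*Buffer, StringByteLength("unicode", #PB_UTF8))

  ; Non-ASCII text as UTF-8: each "ö" becomes the two bytes C3 B6
  PokeS(*Buffer, "Smörebröd", -1, #PB_UTF8)
  Debug HexDump(*Buffer, StringByteLength("Smörebröd", #PB_UTF8))

  ; ASCII-only text as UTF-16: PokeS() writes no BOM, every second byte is 00
  PokeS(*Buffer, "unicode", -1, #PB_Unicode)
  Debug HexDump(*Buffer, StringByteLength("unicode", #PB_Unicode))

  FreeMemory(*Buffer)
EndIf
```

The last dump is worth a look: PokeS() with #PB_Unicode writes no BOM, so none of the BOM checks in GetCharactersetMemory() can match, and the scan loop ends at the first zero byte, which for Latin text is already the second byte of the buffer.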

I commented out some lines in the original post to follow the suggestions.

Kukulkan