Array of UTF8 strings in a Unicode program?

Oma · Post by **Oma** » Tue Dec 08, 2015 6:53 pm

Thank you, wilbert!
A very ingenious method, and it works fine on 64-bit Linux.
But on 32-bit I get an invalid memory access on the
Test(0) = UTF8String("Meier") line.

If I change *str to a quad (str.q) as a test, the invalid memory access goes away, but the Debug strings stay empty.

At the moment I have no idea what the problem is. It seems that the address must be 64-bit on 32-bit systems too.

Best Regards, Charly

ElementE · Post by **ElementE** » Tue Dec 08, 2015 7:18 pm

Hi wilbert.
When I run your code in unicode mode, I get the following debug output:

썍沼敬r
Müller

Original code:

wilbert wrote:Here a utf8 conversion using prototype working in both ascii and unicode mode.

Code:

Prototype.s ProtoUTF8String(str.p-utf8)

Procedure.s UTF8String_(*str)
  CompilerIf #PB_Compiler_Unicode
    ProcedureReturn PeekS(*str, (MemoryStringLength(*str, #PB_Ascii) + 1) >> 1)
  CompilerElse
    ProcedureReturn PeekS(*str)
  CompilerEndIf
EndProcedure

Global UTF8String.ProtoUTF8String = @UTF8String_()


Dim Test.s(2)
Test(0) = UTF8String("Meier")
Test(1) = UTF8String("Müller")
Test(2) = UTF8String("Schmidt")

Debug Test(1)
Debug PeekS(@Test(1), -1, #PB_UTF8)

wilbert · Post by **wilbert** » Tue Dec 08, 2015 8:05 pm

Oma wrote:on 32-Bit i get an Invalid memory access in the
Test(0) = UTF8String("Meier")-line.

That's strange. On OSX it works fine with both the x86 and x64 version of PB.
Does it fail in both ascii and unicode mode ?

ElementE wrote:When I run your code in unicode mode, I get the following debug output:
썍沼敬r
Müller

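It's supposed to output that in unicode mode. Test(1) holds the raw UTF-8 bytes, so the first Debug shows those bytes misread as UTF-16, and the second Debug decodes them with #PB_UTF8.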
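
The garbled first line can be reproduced in isolation. The following is only a minimal sketch, assuming a Unicode build and a source file saved as UTF-8; the buffer name *buf is made up for illustration and is not part of wilbert's code. It writes "Müller" as UTF-8 into a zero-filled buffer and then reads the very same bytes back twice, once as UTF-16 and once as UTF-8.

```purebasic
; Sketch only: view the UTF-8 bytes of "Müller" as UTF-16 code units.
; Assumes a Unicode build; *buf is a hypothetical name.
*buf = AllocateMemory(16)                     ; zero-filled by default
If *buf
  PokeS(*buf, "Müller", -1, #PB_UTF8)         ; 7 UTF-8 bytes plus a null byte
  Debug MemoryStringLength(*buf, #PB_Ascii)   ; bytes before the first null byte
  Debug PeekS(*buf, -1, #PB_Unicode)          ; same bytes misread as UTF-16
  Debug PeekS(*buf, -1, #PB_UTF8)             ; same bytes decoded as UTF-8
  FreeMemory(*buf)
EndIf
```

The seven bytes plus the terminating null pair up into four 16-bit units, which is why the prototype version reads (length + 1) >> 1 characters. The second Debug line should match the garbled string quoted above, and the third should show the name correctly.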

skywalk · Post by **skywalk** » Tue Dec 08, 2015 9:32 pm

A different approach. Convert native string array to string buffer of user-defined encoding.

Code:

CompilerIf #PB_Compiler_Unicode = 0
  MessageRequester("try-uni-Array-utf8-mem", "Requires #PB_Compiler_Unicode." + #CRLF$ +
                   "PB v5.4+ is dropping Ascii compiler switch." + #CRLF$, #MB_ICONWARNING)
  End
CompilerEndIf
EnableExplicit
Procedure.i JoinToMem(*nBytes.Integer, Array A$(1), Delm$=#Empty$, iStart.i=0, iStop.i=-1, Enc.i=#PB_Ascii)
  ; REV:  151207, skywalk
  ;       Spinoff of Join() which returns a concatenated native PB string(Unicode only).
  ; RETURN:
  ;  *s     = String buffer concatenated from A$() in user-defined encoding(Enc).
  ;  nBytes = num bytes within string buffer. *s contains Chr(0)'s so cannot use Len(PeekS(*s,-1,Enc))
  ; NOTES:
  ;   Delm$       Action
  ;   #Empty$     OK to write Chr(0) after each element.
  ;   #Null$      Set #PB_String_NoZero and add nothing between elements.
  ;   > 0         Set #PB_String_NoZero and only add Delm$ between elements.
  Protected.i i, k, npb, nBytes, lenbEOS, *p, *s
  Protected.i PokeNoZero = #PB_String_NoZero
  Protected.s r$
  If iStart < 0
    iStart = 0
  EndIf
  If iStop < 0
    iStop = ArraySize(A$())     ; Default: run to the last element
  EndIf
  If iStop < iStart
    iStop = iStart              ; Clamp so at least one element is joined
  EndIf
  k = iStop - iStart            ; Number of separators between elements
  ; Determine nBytes required to hold string array contents
  For i = iStart To iStop
    nBytes + StringByteLength(A$(i), Enc)
  Next i
  If Delm$
    nBytes + k * StringByteLength(Delm$, Enc) ; Add room for delimiters
  EndIf
  If @Delm$ And Len(Delm$) < 1  ; Delm$ has an address so allow PokeZero
    PokeNoZero = 0
    If Enc <> #PB_Unicode       ; Set Size(bytes) of trailing nullchar in user defined encoding.
      lenbEOS = 1
    Else
      lenbEOS = 2
    EndIf
    nBytes + k * lenbEOS        ; Account for nullchar delimiters + 1 trailer.
  EndIf
  *s = AllocateMemory(nBytes+lenbEOS)
  If *s
    If nBytes <= MemorySize(*s) ; Verify enough memory created
      *p = *s                   ; Create tracking pointer for concatenating memory
      If Delm$
        npb = PokeS(*p, A$(iStart), -1, Enc | PokeNoZero)
        *p + npb + lenbEOS
        If k > 0
          npb = PokeS(*p, Delm$, -1, Enc | PokeNoZero)
          *p + npb + lenbEOS
          k = iStop - 1         ; Avoid recalculating k-1 in For-Next loop
          For i = iStart + 1 To k
            npb = PokeS(*p, A$(i), -1, Enc | PokeNoZero)
            *p + npb + lenbEOS
            npb = PokeS(*p, Delm$, -1, Enc | PokeNoZero)
            *p + npb + lenbEOS
          Next i
          npb = PokeS(*p, A$(i), -1, Enc | PokeNoZero)
        EndIf
      Else
        npb = PokeS(*p, A$(iStart), -1, Enc | PokeNoZero)
        *p + npb + lenbEOS
        If k > 0
          For i = iStart + 1 To iStop
            npb = PokeS(*p, A$(i), -1, Enc | PokeNoZero)
            *p + npb + lenbEOS
          Next i
        EndIf
      EndIf
      ;FreeMemory(*s)           ; Not now, but remember to free memory when done.
    EndIf
  EndIf
  If *nBytes                    ; Avoid null pointer.
    *nBytes\i = nBytes          ; Buffer contains 0's so <> Len(PeekS(*s,-1,Enc))
  EndIf
  ProcedureReturn *s
EndProcedure
;-{ TEST
#NUL$ = #Empty$
#SP$  = " "
#CMA$ = ","
Define.i i, *s, nBytes, Enc, nPts = 5
Define.s s$
Dim a$(nPts-1)
a$(0) = "000011112222333344445555666677778888"
a$(1) = "Huber"
a$(2) = "Völler"
a$(3) = "Müller"
a$(4) = "Šimûnek"
Enc = #PB_UTF8
;Enc = #PB_Ascii
;Enc = #PB_Unicode
*s = JoinToMem(@nBytes, a$(), #CMA$, 0, -1, Enc)
ShowMemoryViewer(*s, nBytes)
Debug PeekS(*s+nBytes-2, 1, Enc)
FreeMemory(*s)
*s = JoinToMem(@nBytes, a$(), #Null$, 0, -1, Enc)
ShowMemoryViewer(*s, nBytes)
Debug PeekS(*s+nBytes-2, 1, Enc)
FreeMemory(*s)
*s = JoinToMem(@nBytes, a$(), #NUL$, 0, -1, Enc)
ShowMemoryViewer(*s, nBytes)
Debug PeekS(*s+nBytes-2, 1, Enc)
FreeMemory(*s)
;-} TEST

Demivec · Post by **Demivec** » Wed Dec 09, 2015 1:53 am

kenmo wrote:
Demivec wrote:After the value is assigned to Test(1) the value can't be retrieved again. It is as if it is a Null$. The debug output from your demo shows nothing for that index, the debug output shows only 4 lines because the 2nd line is blank.

...

I compiled it as Unicode with Windows 8.1 x64 PB v5.40.
Very weird. Maybe cut out code until you find exactly what ruins the string (might be a 64-bit PB bug?). I can't test the 64-bit version right now.

My tests today show that there are no anomalies with the output, or anything else. Can't explain the difference or the initial occurrence. Hopefully the space-time continuum is also all back as it should be.

@Skywalk: Your code only outputs 3 e's and shows strings via the memory viewer. Given what happened when I tested kenmo's code, I'll wait a day and try it again.

skywalk · Post by **skywalk** » Wed Dec 09, 2015 2:47 am

Demivec wrote:@Skywalk: Your code only outputs 3 e's and shows strings via the memory viewer. Given what happen when I tested kenmo's code I'll wait a day and try it again.

Haha, yes, that is the correct output. The string buffer is intended to be passed to a dll as a pointer. I only provided a few debug commands to show the memory and a common character in the buffer. The difference with this approach is that the strings are immediately terminated with 0's and/or delimiters per user setting, and written in a user-defined encoding. I have no immediate use for the array approach.
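
To see more than a single character from such a buffer, the null-terminated layout can be walked element by element. This is a small self-contained sketch rather than a use of JoinToMem itself: the array name$() and its contents are assumptions chosen for the demo, and the encoding is fixed to UTF-8.

```purebasic
; Sketch only: build a buffer of null-terminated UTF-8 strings and walk it.
; name$() and its contents are hypothetical demo data.
Dim name$(2)
name$(0) = "Huber" : name$(1) = "Völler" : name$(2) = "Müller"

; Pass 1: total size, one null terminator per element.
nBytes = 0
For i = 0 To 2
  nBytes + StringByteLength(name$(i), #PB_UTF8) + 1
Next

*buf = AllocateMemory(nBytes)
If *buf
  ; Pass 2: write each element; PokeS appends the null byte itself
  ; and returns the byte count without it.
  *p = *buf
  For i = 0 To 2
    *p + PokeS(*p, name$(i), -1, #PB_UTF8) + 1
  Next

  ; Read back one null-terminated string at a time.
  *p = *buf
  While *p < *buf + nBytes
    Debug PeekS(*p, -1, #PB_UTF8)
    *p + MemoryStringLength(*p, #PB_UTF8 | #PB_ByteLength) + 1
  Wend
  FreeMemory(*buf)
EndIf
```

The walk relies on knowing the total byte count, which is the reason JoinToMem hands nBytes back to the caller: a plain PeekS on the start of the buffer stops at the first null.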

kenmo · Post by **kenmo** » Wed Dec 09, 2015 4:18 am

Demivec wrote: My tests today show that there are no anomalies with the output, or anything else. Can't explain the difference or the initial occurrence. Hopefully the space-time continuum is also all back as it should be.

Sounds like you are the third person to post about an intermittent null-string issue...
http://www.purebasic.fr/english/viewtop ... 13&t=63910

Oma · Post by **Oma** » Wed Dec 09, 2015 6:26 am

It seems to be a big topic.

Hello Wilbert,

wilbert wrote:Does it fail in both ascii and unicode mode ?

Yes, it happens in both Ascii and Unicode mode.

If I have the time, I'll try it again in the evening. (But I have no experience with prototypes.)

Best Regards,
Charly

mikejs · Post by **mikejs** » Wed Dec 09, 2015 12:51 pm

mikejs wrote:
wilbert wrote:Here a utf8 conversion using prototype working in both ascii and unicode mode.
That works for me

... correction. It works for me on 64bit, but does some very strange things on 32bit. It's probably going to be easier to just construct a block of memory to pass to the function in this specific case (the strings I need to pass are known at compile-time and do not need to vary at run-time), but I think there needs to be a better native solution to this.

I'll post something in Feature Requests...
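
One way to build such memory blocks by hand, without the prototype trick, is to keep an array of pointers and give each element its own UTF-8 buffer. The sketch below is only an illustration under that assumption; the helper name ToUTF8 and the array *Name() are hypothetical and do not come from the thread.

```purebasic
; Sketch only: an array of pointers to individually allocated UTF-8 strings.
; ToUTF8() and *Name() are hypothetical names.
Procedure.i ToUTF8(Text$)
  Protected *mem
  ; Room for the encoded bytes plus the terminating null byte.
  *mem = AllocateMemory(StringByteLength(Text$, #PB_UTF8) + 1)
  If *mem
    PokeS(*mem, Text$, -1, #PB_UTF8)
  EndIf
  ProcedureReturn *mem          ; 0 if the allocation failed
EndProcedure

Dim *Name(2)
*Name(0) = ToUTF8("Meier")
*Name(1) = ToUTF8("Müller")
*Name(2) = ToUTF8("Schmidt")

For i = 0 To 2
  If *Name(i)
    Debug PeekS(*Name(i), -1, #PB_UTF8)   ; decode again just for display
    FreeMemory(*Name(i))                  ; each buffer must be freed by hand
  EndIf
Next
```

Each *Name(i) can be handed to an external function that expects a pointer to a UTF-8 string. The price compared with a string array is manual memory management.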
