URLEn/Decoder with ASCII and UTF8 flag

Got an idea for enhancing PureBasic? New command(s) you'd like to see?
mback2k
Enthusiast
Posts: 257
Joined: Sun Dec 02, 2007 12:11 pm
Location: Germany

Post by mback2k »

Hello everyone,

I would really like to see the URLDecoder function support different character sets, because of the following problem:

Source code:

Code: Select all

Procedure.s UTF8ToString(UTF8$)
  Protected String$ = Space(StringByteLength(UTF8$, #PB_Ascii)+1)
  PokeS(@String$, UTF8$, -1, #PB_Ascii)
  ProcedureReturn PeekS(@String$, -1, #PB_UTF8)
EndProcedure

Test$ = "H:/Musik/Die%20%C3%84rzte/13/15%20Nie%20gesagt.mp3"

Debug Test$
Debug URLDecoder(Test$)
Debug UTF8ToString(URLDecoder(Test$))
Correct result in ASCII mode:

Code: Select all

H:/Musik/Die%20%C3%84rzte/13/15%20Nie%20gesagt.mp3
[removed this line - please look at the image]
H:/Musik/Die Ärzte/13/15 Nie gesagt.mp3
  • Image
Invalid result in Unicode mode:

Code: Select all

H:/Musik/Die%20%C3%84rzte/13/15%20Nie%20gesagt.mp3
[removed this line - please look at the image]
H:/Musik/Die ?rzte/13/15 Nie gesagt.mp3
  • Image
The difference is in the last line: in Unicode mode the "Ä" is replaced with a "?". I tried various PeekS/PokeS combinations, but I cannot get it to work, because the string returned by URLDecoder (2nd line) is already invalid.

Thanks in advance!
Last edited by mback2k on Sat Jan 17, 2009 6:03 pm, edited 2 times in total.
mback2k
Enthusiast
Posts: 257
Joined: Sun Dec 02, 2007 12:11 pm
Location: Germany

Post by mback2k »

Appendix to my first post:

Basically, the URLEncoder and URLDecoder functions need to be able to produce and read %-encoded URLs whose bytes are UTF8, not only ASCII.

ASCII: URLEncoder("Ä") = %C4
UTF8: URLEncoder("Ä") = %C3%84

ASCII: URLDecoder("%C4") = Ä
UTF8: URLDecoder("%C3%84") = Ä

The executable's Unicode/ASCII compiler switch should not have any effect on this.
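
To make the two byte sequences above visible, here is a minimal sketch. PercentBytes is a hypothetical helper (it is not a PureBasic command), and it escapes every byte purely for illustration, which a real encoder would not do for unreserved characters. Chr(196) is used instead of a literal "Ä" so the result does not depend on the source file encoding.

```purebasic
; Hypothetical helper, for illustration only: converts Text$ to the given
; byte format (#PB_Ascii or #PB_UTF8) and writes every byte as %XX.
Procedure.s PercentBytes(Text$, Format)
  Protected Result$, Index, Size, *Buffer
  Size = StringByteLength(Text$, Format)   ; bytes needed, without the terminating null
  *Buffer = AllocateMemory(Size + 1)       ; +1 for the null byte that PokeS() writes
  If *Buffer
    PokeS(*Buffer, Text$, -1, Format)      ; convert the PB string to raw bytes
    For Index = 0 To Size - 1
      Result$ + "%" + RSet(Hex(PeekB(*Buffer + Index) & $FF), 2, "0")
    Next
    FreeMemory(*Buffer)
  EndIf
  ProcedureReturn Result$
EndProcedure

Debug PercentBytes(Chr(196), #PB_Ascii)  ; the single-byte case (%C4 in the list above)
Debug PercentBytes(Chr(196), #PB_UTF8)   ; the two-byte case (%C3%84 in the list above)
```

The conversion always goes through a memory buffer, so the byte sequence is chosen by the Format argument rather than by the compile mode.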
mback2k
Enthusiast
Posts: 257
Joined: Sun Dec 02, 2007 12:11 pm
Location: Germany

Post by mback2k »

Sorry for multiple postings, but I want to keep you guys up to date on the problem.

I managed to write working functions that behave the same in ASCII and Unicode mode:

Code: Select all

Procedure.s URLEncode(URL$)
  Protected Result$, Char$, Char, Index, *Buffer
  For Index = 1 To Len(URL$)
    Char$ = Mid(URL$, Index, 1)
    Char = Asc(Char$)
    If Char < 128
      ; ASCII range: keep the characters that pass this check, escape the rest as %XX
      ; (ASCII Test 1 below compares the result with the built-in URLEncoder())
      If (Char > 32 And Char <> 37 And Char <> 34 And Char <> 60 And Char < 62) Or (Char > 62 And Char < 91) Or (Char > 94 And Char <> 96 And Char < 123) Or (Char = 126)
        Result$ + Char$
      Else
        Result$ + "%" + RSet(Hex(Char), 2, "0")
      EndIf
    Else
      ; Non-ASCII: convert the character to UTF-8 in a buffer and escape its bytes.
      ; Only two bytes are written out, so this covers 2-byte sequences (U+0080 to U+07FF) only.
      *Buffer = AllocateMemory(4)
      If *Buffer
        PokeS(*Buffer, Char$, -1, #PB_UTF8)
        Result$ + "%" + RSet(Hex(PeekB(*Buffer)&$FF), 2, "0") + "%" + RSet(Hex(PeekB(*Buffer+1)&$FF), 2, "0")
        FreeMemory(*Buffer)
      EndIf
    EndIf
  Next
  ProcedureReturn Result$
EndProcedure

Procedure.s URLDecode(URL$)
  Protected Result$, Char$, Char, Index, *Buffer
  Result$ = URL$
  Index = FindString(URL$, "%", 1)
  If Index
    Repeat
      Char$ = Mid(Result$, Index+1, 2)   ; the two hex digits after "%"
      Char = Val("$"+Char$)
      If Char < 128
        ; Single ASCII byte. Note: ReplaceString() replaces every occurrence of this escape.
        Result$ = ReplaceString(Result$, "%"+Char$, Chr(Char))
      ElseIf Mid(Result$, Index+3, 1) = "%"
        ; First byte of a UTF-8 sequence: decode it together with the next escape.
        ; Only 2-byte sequences (%XX%XX) are handled here.
        *Buffer = AllocateMemory(4)
        If *Buffer
          PokeB(*Buffer, Char)
          PokeB(*Buffer+1, Val("$"+Mid(Result$, Index+4, 2)))
          Result$ = ReplaceString(Result$, Mid(Result$, Index, 6), PeekS(*Buffer, -1, #PB_UTF8))
          FreeMemory(*Buffer)
        EndIf
      EndIf
      Index = FindString(Result$, "%", Index+1)
    Until Not Index
  EndIf
  ProcedureReturn Result$
EndProcedure

Procedure.s DummyString1()
  Protected Result$, Char
  For Char = 32 To 126
    Result$ + Chr(Char)
  Next
  ProcedureReturn Result$
EndProcedure

Procedure.s DummyString2()
  Protected Result$, Char
  For Char = 128 To 255
    Result$ + Chr(Char)
  Next
  ProcedureReturn Result$
EndProcedure

DummyString1$ = DummyString1()
DummyString2$ = DummyString2()

If URLEncoder(DummyString1$) = URLEncode(DummyString1$)
  Debug "ASCII Test 1 passed!"
Else
  Debug "ASCII Test 1 failed!"
EndIf

If URLDecode(URLEncode(DummyString1$)) = DummyString1$
  Debug "ASCII Test 2 passed!"
Else
  Debug "ASCII Test 2 failed!"
EndIf

If URLDecode(URLEncode(DummyString2$)) = DummyString2$
  Debug "UTF8 Test passed!"
Else
  Debug "UTF8 Test failed!"
EndIf
Fred, any chance to have PB's functions updated?

The only problem that still exists is that you can't use LCase or UCase with the returned string in Unicode mode :(
Trond
Always Here
Posts: 7446
Joined: Mon Sep 22, 2003 6:45 pm
Location: Norway

Post by Trond »

1. You don't describe what output is the correct one, so we can't know what you want.
2. In the code from the first post you treat a PB string as a UTF-8 string. That's not possible. PB strings are ASCII in ASCII mode and UCS-2 in Unicode mode; trying to treat them as something different will make things break in unexpected ways. If you need to store a string in a specific format, store it in a memory buffer. I can see from your code that it will definitely break in Unicode mode; maybe it won't in ASCII mode.
3. The internal debug output doesn't support Unicode characters. Even if your functions returned the string correctly, it would show up as ???? if it contained Unicode characters. Use the external debugger to see what you're doing.
4. I don't have the problem here. To my surprise the last line shows an Ä in both Unicode and ASCII mode. I have no good explanation for this; according to my theory, it should break!
5. If URLDecoder() doesn't support Unicode in ASCII mode, then that's your own problem and what we have Unicode mode for.
6. If URLDecoder() doesn't support Unicode in Unicode mode, then it sounds like a bug and you should post a simple piece of code to show it in the bug report section. If it's following RFC 3986 then the current behaviour is buggy, but I don't know which RFC it's following.
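
The memory-buffer approach from point 2 could look roughly like the sketch below. URLDecodeUTF8 is a hypothetical name, not a built-in command, and the sketch assumes the input contains only ASCII characters plus %XX escapes, as a percent-encoded URL should.

```purebasic
; Hypothetical helper: decodes %XX escapes into raw bytes in a buffer and
; only then converts the whole byte sequence from UTF-8 to a PB string.
; Assumption: URL$ contains ASCII characters only (non-ASCII text is escaped).
Procedure.s URLDecodeUTF8(URL$)
  Protected Result$, *Buffer
  Protected Index = 1, Pos = 0, Length = Len(URL$)
  *Buffer = AllocateMemory(Length + 1)   ; decoded bytes never outnumber input characters; +1 keeps a null terminator
  If *Buffer
    While Index <= Length
      If Mid(URL$, Index, 1) = "%" And Index + 2 <= Length
        PokeB(*Buffer + Pos, Val("$" + Mid(URL$, Index + 1, 2)))   ; escaped byte
        Index + 3
      Else
        PokeB(*Buffer + Pos, Asc(Mid(URL$, Index, 1)))             ; literal ASCII byte
        Index + 1
      EndIf
      Pos + 1
    Wend
    Result$ = PeekS(*Buffer, -1, #PB_UTF8)   ; interpret the bytes as UTF-8
    FreeMemory(*Buffer)
  EndIf
  ProcedureReturn Result$
EndProcedure

Debug URLDecodeUTF8("H:/Musik/Die%20%C3%84rzte/13/15%20Nie%20gesagt.mp3")
```

Because the UTF-8 bytes never pass through a PB string, the conversion to ASCII or UCS-2 happens exactly once, in PeekS(), whatever the compile mode. An escaped null byte (%00) would end the result early; that case is not handled here.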
mback2k
Enthusiast
Posts: 257
Joined: Sun Dec 02, 2007 12:11 pm
Location: Germany

Post by mback2k »

1. I thought that is clear. The ASCII screen is correct.
2. But I don't want that. I want to decode the string as UTF8 and store it in ASCII or Unicode, depending on the compilation switch.
3. I am using the external debugger, because I am using jaPBe. At least it's running PBDebuggerUnicode.exe.
4. It does show a "?" in Unicode, and that is wrong.
5. I am not an idiot, thanks for treating me like one.
6. It's not following the RFCs, because %-encoded URLs can contain both ASCII and UTF8 characters in %-form. If the first %xx is higher than 127 (outside the ASCII range), it starts a multi-byte UTF8 character; for a character like "Ä" that is a 2-byte sequence represented by %xx%xx.

I provided 2 functions that do follow the RFCs and it works. The only thing that is still not working for me is the transfer from UTF8 encoding to Unicode.
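
To illustrate point 6: in UTF-8 the first byte of a character tells how many bytes belong to it (2 to 4 for anything outside ASCII, as defined in RFC 3629), so a decoder can read that many %xx groups before converting. A minimal sketch; UTF8SequenceLength is a hypothetical helper name, not a PureBasic command, and it expects a value from 0 to 255:

```purebasic
; Hypothetical helper: number of bytes in a UTF-8 sequence, judged from its first byte.
; Returns 0 for a continuation byte (10xxxxxx) or an invalid lead byte.
Procedure UTF8SequenceLength(LeadByte)
  If LeadByte < $80
    ProcedureReturn 1                 ; 0xxxxxxx: plain ASCII
  ElseIf (LeadByte & $E0) = $C0
    ProcedureReturn 2                 ; 110xxxxx: e.g. $C3 for "Ä" ($C3 $84)
  ElseIf (LeadByte & $F0) = $E0
    ProcedureReturn 3                 ; 1110xxxx
  ElseIf (LeadByte & $F8) = $F0
    ProcedureReturn 4                 ; 11110xxx
  EndIf
  ProcedureReturn 0
EndProcedure

Debug UTF8SequenceLength($41)   ; 1
Debug UTF8SequenceLength($C3)   ; 2
Debug UTF8SequenceLength($E2)   ; 3
```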

I am trying to explain this as well as possible, but English is not my native language, so please excuse any unintended rudeness.
dige
Addict
Posts: 1391
Joined: Wed Apr 30, 2003 8:15 am
Location: Germany

Re: URLEn/Decoder with ASCII and UTF8 flag

Post by dige »

UTF-8 is recommended by RFC 3986.

http://tools.ietf.org/html/rfc3986
http://en.wikipedia.org/wiki/URL_Encoding

and should also be supported :)
+1 for URLEncoder (url.s, [flags])
flags: #PB_UTF8, #PB_Ascii
Little John
Addict
Posts: 4779
Joined: Thu Jun 07, 2007 3:25 pm
Location: Berlin, Germany

Re: URLEn/Decoder with ASCII and UTF8 flag

Post by Little John »

After several years, this is still a "hot topic".
+1 from me for this wish.

Until this feature is built into PB, we can use two small procedures which I posted here.
davido
Addict
Posts: 1890
Joined: Fri Nov 09, 2012 11:04 pm
Location: Uttoxeter, UK

Re: URLEn/Decoder with ASCII and UTF8 flag

Post by davido »

+1

Odd omission. I assumed PureBasic was moving into the future with the impending switch to Unicode.
eddy
Addict
Posts: 1479
Joined: Mon May 26, 2003 3:07 pm
Location: Nantes

Re: URLEn/Decoder with ASCII and UTF8 flag

Post by eddy »

+1
Little John
Addict
Posts: 4779
Joined: Thu Jun 07, 2007 3:25 pm
Location: Berlin, Germany

Re: URLEn/Decoder with ASCII and UTF8 flag

Post by Little John »

Implemented in PB 5.40. :-)
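
For reference, a short usage sketch, assuming PureBasic 5.40 or later and assuming, as far as I know from the documentation, that URLEncoder() and URLDecoder() accept an optional format argument of #PB_UTF8 or #PB_Ascii. Please check the help file of your version before relying on it.

```purebasic
; Assumes PB 5.40+ with an optional format argument on both functions.
; Chr(196) is "Ä"; using the code point avoids source-file encoding issues.
Debug URLEncoder(Chr(196), #PB_UTF8)    ; UTF-8 form of the escape
Debug URLEncoder(Chr(196), #PB_Ascii)   ; single-byte form of the escape

; The path from the first post, decoded as UTF-8
Debug URLDecoder("H:/Musik/Die%20%C3%84rzte/13/15%20Nie%20gesagt.mp3", #PB_UTF8)
```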