[Implemented] UTF-8 support for encoding and decoding URLs

Little John
Addict
Posts: 4527
Joined: Thu Jun 07, 2007 3:25 pm
Location: Berlin, Germany

Post by Little John »

Hi all,

in northern Germany, there is a nice old town called "Lübeck".
In the URL of its entry in the English Wikipedia, the name appears as "Lübeck".
After saving this URL to a bookmark (using Firefox 32.0.3 on Windows 7 x64), the name appears as "L%C3%BCbeck" instead.
My problem was that I could not convert "Lübeck" to "L%C3%BCbeck" or vice versa by using PB's built-in URLEncoder() and URLDecoder() (with PB 5.31 beta 1):

Code:

Debug URLEncoder("Lübeck")       ; -> L%FCbeck
Debug URLDecoder("L%C3%BCbeck")  ; -> LÃ¼beck
Obviously, Firefox's URL encoding is UTF-8 based, while PB's URL encoding and decoding are ASCII based. This is a regrettable limitation of PB, especially since UTF-8 is the de facto standard character encoding on the internet.
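
To make the difference concrete, here is a minimal sketch that computes both percent-encodings of "ü" by hand. It assumes Unicode mode, and the variable names (code, byte1, byte2) are only illustrative; the code point is written numerically so that the result does not depend on the encoding of the source file.

```purebasic
EnableExplicit

Define code.i = $FC   ; code point of "ü" (U+00FC)

; Single-byte encoding: the code point itself is the byte value.
Debug "%" + RSet(Hex(code), 2, "0")   ; -> %FC

; UTF-8: code points 128..2047 take two bytes of the form 110xxxxx 10xxxxxx.
Define byte1.i = $C0 | (code >> 6)    ; upper bits of the code point
Define byte2.i = $80 | (code & $3F)   ; lower six bits of the code point
Debug "%" + RSet(Hex(byte1), 2, "0") + "%" + RSet(Hex(byte2), 2, "0")   ; -> %C3%BC
```

The first result matches what URLEncoder() returned above, the second one matches the bookmark written by Firefox.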

The following procedures add UTF-8 support for encoding and decoding URLs. They work correctly only when compiled in Unicode mode.
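
If you want the compiler to enforce that requirement, a guard like the following could be placed at the top of the source file. This is only a suggestion and not part of the procedures below; the error message text is my own.

```purebasic
; Stop compilation with a clear message when Unicode mode is switched off.
CompilerIf #PB_Compiler_Unicode = 0
   CompilerError "This code must be compiled in Unicode mode."
CompilerEndIf
```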

//edit 2014-12-20:
  • renamed both functions
  • removed the option for ASCII encoding in both functions
  • EncodeURLComponent_UTF8() fixed, so that it now works correctly according to RFC 3986
    (does not utilize PB's built-in function URLEncoder() anymore)
  • DecodeURLComponent_UTF8() can now handle '+' in encoded URLs
  • added function FindBadURLCharacter()
  • improved documentation
  • slightly improved demo code

Code:

; PB 5.31

EnableExplicit


#URLUnreservedCharacters$ = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~"

Procedure.s EncodeURLComponent_UTF8 (urlComponent$)
   ; in : urlComponent$: separate component of a URL, which is to be encoded
   ;                     (Do not encode a complete URL.)
   ; out: return value : percent-encoded URL component in UTF-8 format
   ;
   ; Under normal circumstances, the only time when octets within a URI
   ; are percent-encoded is during the process of producing the URI from
   ; its component parts. This is when an implementation determines which
   ; of the reserved characters are to be used as subcomponent delimiters
   ; and which can be safely used as data.
   ; Implementations must not percent-encode the same string more than
   ; once.
   ; [RFC 3986, section 2.4]
   Protected *buffer, *char.Character, *a.Ascii, ret$
   Protected numBytes.i, *fin
   
   *buffer = AllocateMemory(5)  ; in UTF-8, 1 character can take up to 4 bytes
   If *buffer
      *char = @urlComponent$
      While *char\c
         If FindString(#URLUnreservedCharacters$, Chr(*char\c))
            ret$ + Chr(*char\c)
         Else
            numBytes = PokeS(*buffer, Chr(*char\c), 1, #PB_UTF8)
            *fin = *buffer + numBytes - 1
            For *a = *buffer To *fin
               ret$ + "%" + RSet(Hex(*a\a), 2, "0")
            Next
         EndIf
         *char + SizeOf(Character)
      Wend
      FreeMemory(*buffer)
   EndIf
   
   ProcedureReturn ret$
EndProcedure


Procedure.s DecodeURLComponent_UTF8 (urlComponent$)
   ; in : urlComponent$: separate component of a URL, which may contain
   ;                     percent-encoded characters in UTF-8 format
   ;                     (Do not decode a complete URL.)
   ; out: return value : decoded URL component
   ;
   ; When a URI is dereferenced, the components and subcomponents
   ; significant to the scheme-specific dereferencing process (if any)
   ; must be parsed and separated before the percent-encoded octets within
   ; those components can be safely decoded, as otherwise the data may be
   ; mistaken for component delimiters.
   ; Implementations must not percent-decode the same string more than
   ; once.
   ; [RFC 3986, section 2.4]
   Protected *buffer, *char.Character, *a.Ascii, ret$
   Protected length.i=Len(urlComponent$)
   
   If length > 0
      urlComponent$ = ReplaceString(urlComponent$, "+", "%20")
      
      *buffer = AllocateMemory(length)
      If *buffer
         *char = @urlComponent$
         *a = *buffer
         While *char\c
            If *char\c = '%'
               *char\c = '$'
               *a\a = Val(PeekS(*char, 3))
               *char + 3*SizeOf(Character)
            Else
               *a\a = *char\c
               *char + SizeOf(Character)
            EndIf
            *a + 1
         Wend
         ret$ = PeekS(*buffer, *a-*buffer, #PB_UTF8)
         FreeMemory(*buffer)
      EndIf
   EndIf
   
   ProcedureReturn ret$
EndProcedure


#URLBadCharacters$ = #DQUOTE$ + "<>\^`{|}"

Procedure.i FindBadURLCharacter (url$)
   ; in : URL to be checked
   ; out: leftmost position in 'url$' of any character that is *never* allowed in a URL;
   ;      0 if there is no such "bad" character in 'url$'
   ;
   ; [according to RFC 3986]
   Protected *char.Character, posn.i
   
   *char = @url$
   posn = 1
   While *char\c <> 0
      If (*char\c < 33) Or (*char\c > 126) Or FindString(#URLBadCharacters$, Chr(*char\c))
         ProcedureReturn posn
      EndIf
      *char + SizeOf(Character)
      posn + 1
   Wend
   
   ProcedureReturn 0
EndProcedure


; -- Demo
Define c, a$, b$, c$

For c = 1 To 10000
   a$ = EncodeURLComponent_UTF8(Chr(c))
   If DecodeURLComponent_UTF8(a$) <> Chr(c)
      Debug "Error: " + Hex(c) + " " + Chr(c) + "     " + a$ + " " + DecodeURLComponent_UTF8(a$)
   EndIf
Next

a$ = EncodeURLComponent_UTF8("Lübeck")    ; -> L%C3%BCbeck
b$ = EncodeURLComponent_UTF8("cœur")      ; -> c%C5%93ur
c$ = EncodeURLComponent_UTF8("Œ-æ-Æ-€")   ; -> %C5%92-%C3%A6-%C3%86-%E2%82%AC
Debug a$
Debug b$
Debug c$
Debug ""
Debug DecodeURLComponent_UTF8(a$)         ; -> Lübeck
Debug DecodeURLComponent_UTF8(b$)         ; -> cœur
Debug DecodeURLComponent_UTF8(c$)         ; -> Œ-æ-Æ-€
Debug ""

For c = 1 To 10000
   a$ = EncodeURLComponent_UTF8(Chr(c))
   If FindBadURLCharacter(a$)
      Debug "Error: " + Hex(c) + " " + Chr(c) + "     " + a$
   EndIf   
Next

Debug FindBadURLCharacter("A test")
Debug FindBadURLCharacter("A%20test")
Debug FindBadURLCharacter("A+test")
Debug FindBadURLCharacter("http://en.wikipedia.org/wiki/Lübeck")
Debug FindBadURLCharacter("http://en.wikipedia.org/wiki/L%C3%BCbeck")
PS: See here for a related feature request.
Last edited by Little John on Sat Dec 20, 2014 10:54 pm, edited 4 times in total.
CalamityJames
User
Posts: 78
Joined: Sat Mar 13, 2010 4:50 pm

Re: UTF-8 support for URLEncoder() and URLDecoder()

Post by CalamityJames »

I would add that I would also like to see this function added to PB.

This elegant method is unfortunately not perfect: it doesn't work with "cœur" or any other word containing "œ", "Œ", "æ" or "Æ".
davido
Addict
Posts: 1890
Joined: Fri Nov 09, 2012 11:04 pm
Location: Uttoxeter, UK

Re: UTF-8 support for URLEncoder() and URLDecoder()

Post by davido »

+1

Re: UTF-8 support for URLEncoder() and URLDecoder()

Post by Little John »

CalamityJames wrote: it doesn't work with "cœur" or any other word containing "œ", "Œ", "æ" or "Æ".
Thank you very much for pointing that out!
I have rewritten the code in the first post, so hopefully it will now work as expected.

Re: UTF-8 support for URLEncoder() and URLDecoder()

Post by CalamityJames »

I happened to have a string with most of the odd characters lurking in one of my own programs (áäâåãæàÁÄÂÅÃÆÀç¢Ç©ÐéëêèÉË€ÊÈíïîìÍÏÎÌI£ñÑóöôøõœòÓÖÔØÕҮߚŠ™úüûùÚÜÛÙÿýÝŸ“”«»¿¡…~†‡) so I used it to test your new version and there were no problems. It's a pity that these odd characters (æ,Æ,œ,Œ) - which are not treated logically in UTF-8 - make the code so much longer, but at least we have something which works. Thanks very much.

Re: UTF-8 support for URLEncoder() and URLDecoder()

Post by CalamityJames »

A bit more checking shows that it still isn't right: the € and quite a lot of minor characters are still encoded wrongly. I found a website (http://www.w3schools.com/tags/ref_urlencode.asp) which appears to examine what your browser sends. I created a reference text for all character codes from 32 to 255 using Firefox on this website and wrote some code to create the same text. I hope it's right…

Code:

EnableExplicit
Global ReferenceText.s = "%20!%22%23%24%25%26'()*%2B%2C-.%2F0123456789%3A%3B%3C%3D%3E%3F%40ABCDEFGHIJKLMNOPQRSTUVWXYZ%5B%5C%5D%5E_%60abcdefghijklmnopqrstuvwxyz%7B%7C%7D~%7F%E2%82%AC%C2%81%E2%80%9A%C6%92%E2%80%9E%E2%80%A6%E2%80%A0%E2%80%A1%CB%86%E2%80%B0%C5%A0%E2%80%B9%C5%92%C2%8D%C5%BD%C2%8F%C2%90%E2%80%98%E2%80%99%E2%80%9C%E2%80%9D%E2%80%A2%E2%80%93%E2%80%94%CB%9C%E2%84%A2%C5%A1%E2%80%BA%C5%93%C2%9D%C5%BE%C5%B8%C2%A0%C2%A1%C2%A2%C2%A3%C2%A4%C2%A5%C2%A6%C2%A7%C2%A8%C2%A9%C2%AA%C2%AB%C2%AC%C2%AD%C2%AE%C2%AF%C2%B0%C2%B1%C2%B2%C2%B3%C2%B4%C2%B5%C2%B6%C2%B7%C2%B8%C2%B9%C2%BA%C2%BB%C2%BC%C2%BD%C2%BE%C2%BF%C3%80%C3%81%C3%82%C3%83%C3%84%C3%85%C3%86%C3%87%C3%88%C3%89%C3%8A%C3%8B%C3%8C%C3%8D%C3%8E%C3%8F%C3%90%C3%91%C3%92%C3%93%C3%94%C3%95%C3%96%C3%97%C3%98%C3%99%C3%9A%C3%9B%C3%9C%C3%9D%C3%9E%C3%9F%C3%A0%C3%A1%C3%A2%C3%A3%C3%A4%C3%A5%C3%A6%C3%A7%C3%A8%C3%A9%C3%AA%C3%AB%C3%AC%C3%AD%C3%AE%C3%AF%C3%B0%C3%B1%C3%B2%C3%B3%C3%B4%C3%B5%C3%B6%C3%B7%C3%B8%C3%B9%C3%BA%C3%BB%C3%BC%C3%BD%C3%BE%C3%BF"
; reference text obtained from http://www.w3schools.com/tags/ref_urlencode.asp

Global i.i, a.s, b.s

Procedure.s ChangeWordToUTF8(LookupWord.s)
  Protected Inc.l, TempStr.s, Char.l
  Protected NewChar.s, LenLookUpWord.i
  LenLookUpWord = Len(LookupWord)
  For Inc = 1 To LenLookUpWord
    Char = Asc(Mid(LookupWord, Inc, 1))
    Select Char
      Case 192 To 255  ; $C0 (À)
        TempStr + "%C3%" + Hex(Char - 64)
      Case 160 To 191
        TempStr + "%C2%" + Hex(Char)
      Case 32, 34 To 38, 43, 44, 47, 58 To 64, 91, 92, 93, 94, 96, 123, 124, 125, 127
        TempStr + "%" + Hex(char)
      Case 128 ; €
        TempStr + "%E2%82%AC"
      Case 129
        TempStr + "%C2%81"
      Case 130
        TempStr + "%E2%80%9A"
      Case 131
        TempStr + "%C6%92"
      Case 132
        TempStr + "%E2%80%9E"
      Case 133
        TempStr + "%E2%80%A6"
      Case 134
        TempStr + "%E2%80%A0"
      Case 135
        TempStr + "%E2%80%A1"
      Case 136
        TempStr + "%CB%86"
      Case 137
        TempStr + "%E2%80%B0"
      Case 138
        TempStr + "%C5%A0"
      Case 139
        TempStr + "%E2%80%B9"
      Case 140 ; Œ
        TempStr + "%C5%92" 
      Case 141
        TempStr + "%C2%8D"
      Case 142 ; Ž
        TempStr + "%C5%BD"
      Case 143 ;
        TempStr + "%C2%8F"
      Case 144
        TempStr + "%C2%90"
      Case 145
        TempStr + "%E2%80%98"
      Case 146
        TempStr + "%E2%80%99"
      Case 147
        TempStr + "%E2%80%9C"
      Case 148
        TempStr + "%E2%80%9D"
      Case 149
        TempStr + "%E2%80%A2"
      Case 150
        TempStr + "%E2%80%93"
      Case 151
        TempStr + "%E2%80%94"
      Case 152
        TempStr + "%CB%9C"
      Case 153
        TempStr + "%E2%84%A2"
      Case 154
        TempStr + "%C5%A1"
      Case 155
        TempStr + "%E2%80%BA"
      Case 156  ; œ
        TempStr + "%C5%93"
      Case 157
        TempStr + "%C2%9D"
      Case 158
        TempStr + "%C5%BE"
      Case 159
        TempStr + "%C5%B8"
      Default
        TempStr + Chr(Char)
    EndSelect
  Next
  ProcedureReturn TempStr
EndProcedure

For i = 32 To 255
  a + Chr(i)
Next
b = ChangeWordToUTF8(a)

If b = ReferenceText
  Debug "Created text matches reference text"
EndIf

Debug "€ = " + ChangeWordToUTF8("€")

Re: UTF-8 support for URLEncoder() and URLDecoder()

Post by Little John »

Hi CalamityJames,

again many thanks!
I think I'll have to do some more investigation.
I hope that I'll have the time to do so soon, maybe next week.

Thanks, again!
Nico
Enthusiast
Posts: 274
Joined: Sun Jan 11, 2004 11:34 am
Location: France

Re: UTF-8 support for URLEncoder() and URLDecoder()

Post by Nico »

This is a translation of the JavaScript code here: http://www.hypergurl.com/urlencode.html

but there are not many ways to do that!

Explanation here:
http://en.wikipedia.org/wiki/Percent-encoding
http://tools.ietf.org/html/rfc3986

Code:

Procedure.s EncodeURL_UTF8(Texte.s)
  Protected reserved.s = "!*'();:@&=+$,/?%#[]"
  Protected unreserved.s = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz-_.~"
  
  Protected charcode.l, encoded.s
  
  For i = 1  To Len(texte)
    ch.s = Mid(texte,i,1);
    ;// Check If character is an unreserved character:
    If FindString( unreserved, ch)
      encoded.s = encoded + ch;
    Else 
      
      ;// The position in the Unicode table tells us how many bytes are needed.
      ;// Note that If we talk about first, second, etc. in the following, we are
      ;// counting from left To right:
      ;//
      ;//   Position in   |  Bytes needed   | Binary representation
      ;//  Unicode table  |   For UTF-8     |       of UTF-8
      ;// ----------------------------------------------------------
      ;//     0 -     127 |    1 byte       | 0XXX.XXXX
      ;//   128 -    2047 |    2 bytes      | 110X.XXXX 10XX.XXXX
      ;//  2048 -   65535 |    3 bytes      | 1110.XXXX 10XX.XXXX 10XX.XXXX
      ;// 65536 - 2097151 |    4 bytes      | 1111.0XXX 10XX.XXXX 10XX.XXXX 10XX.XXXX
      
      charcode.l = Asc(ch)
      
      ;// Position 0 - 127 is equal To percent-encoding With an ASCII character encoding:
      If (charcode < 128) 
        encoded.s = encoded + "%"+ Hex(charcode,#PB_Ascii);
      EndIf 
      
      ;// Position 128 - 2047: two bytes For UTF-8 character encoding.
      If (charcode > 127 And charcode < 2048) 
        ;// First UTF byte: Mask the first five bits of charcode With binary 110X.XXXX:
        encoded = encoded + "%"+ Hex((charcode >> 6) | $C0,#PB_Ascii);
        ;// Second UTF byte: Get last six bits of charcode And mask them With binary 10XX.XXXX:
        encoded = encoded + "%"+ Hex((charcode & $3F) | $80,#PB_Ascii);
      EndIf 
      
      ;// Position 2048 - 65535: three bytes For UTF-8 character encoding.
      If (charcode > 2047 And charcode < 65536) 
        ;// First UTF byte: Mask the first four bits of charcode With binary 1110.XXXX:
        encoded = encoded + "%"+ Hex((charcode >> 12) | $E0,#PB_Ascii);
        ;// Second UTF byte: Get the Next six bits of charcode And mask them binary 10XX.XXXX:
        encoded = encoded + "%"+ Hex(((charcode >> 6) & $3F) | $80,#PB_Ascii);
        ;// Third UTF byte: Get the last six bits of charcode And mask them binary 10XX.XXXX:
        encoded = encoded + "%"+ Hex((charcode & $3F) | $80,#PB_Ascii);
      EndIf 
      
      ;// Position 65536 - : four bytes For UTF-8 character encoding.
      If (charcode > 65535) 
        ;// First UTF byte: Mask the first three bits of charcode With binary 1111.0XXX:
        encoded = encoded + "%"+ Hex((charcode >> 18) | $F0,#PB_Ascii);
        ;// Second UTF byte: Get the Next six bits of charcode And mask them binary 10XX.XXXX:
        encoded = encoded + "%"+ Hex(((charcode >> 12) & $3F) | $80,#PB_Ascii);
        ;// Third UTF byte: Get the Next six bits of charcode And mask them binary 10XX.XXXX:
        encoded = encoded + "%"+ Hex(((charcode >> 6) & $3F) | $80,#PB_Ascii);
        ;// Fourth UTF byte: Get the last six bits of charcode And mask them binary 10XX.XXXX:
        encoded = encoded + "%"+ Hex((charcode & $3F) | $80,#PB_Ascii);
      EndIf 
      
    EndIf 
    
  Next  ;// End of For ...
  
  ProcedureReturn encoded
EndProcedure


Procedure.s DecodeURL_UTF8(Texte.s)
  Protected reserved.s = "!*'();:@&=+$,/?%#[]"
  Protected unreserved.s = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz-_.~"
  Protected allowed.s=reserved + unreserved
  
  ;// UTF-8 bytes from left To right:
  Protected  byte1.c, byte2.c, byte3.c, byte4.c
  Protected i.l, decoded.s, illegalencoding.s, notallowed.s, warning.s = "";
  
  ;application/x-www-form-urlencoded
  Texte=ReplaceString(Texte, "+", "%20")
  
  
  Texte=ReplaceString(Texte,"%","$")
  
  i = 1
  While (i < Len(Texte)+1) 
    ch.s = Mid(Texte,i,1);
    ;// Check For percent-encoded string:
    If (ch = "$") 
      
      ;// Check For legal percent-encoding of first byte:
      If  Val(Mid(Texte,i,3)) < 255
        
        ;// Get the decimal values of all (potential) UTF-bytes:
        byte1 = Val(Mid(Texte,i,3)) ;getdec(encoded.substr(i,3));
        byte2 = Val(Mid(Texte,i+3,3)) ;getdec(encoded.substr(i+3,3))
        byte3 = Val(Mid(Texte,i+6,3)) ;getdec(encoded.substr(i+6,3))
        byte4 = Val(Mid(Texte,i+9,3)) ;getdec(encoded.substr(i+9,3))
        
        ;// Check For one byte UTF-8 character encoding:
        If (byte1 < 128) 
          decoded = decoded + Chr(byte1);
          i = i + 3;
        EndIf 
        
        ;// Check For illegal one byte UTF-8 character encoding:
        If (byte1 > 127 And  byte1 < 192) 
          decoded = decoded + Mid(Texte,i,3);
          illegalencoding = illegalencoding + Mid(Texte,i,3) + " ";
          i = i + 3;
        EndIf 
        
        ;// Check For two byte UTF-8 character encoding:
        If (byte1 > 191 And byte1 < 224) 
          If (byte2 > 127 And byte2 < 192) 
            decoded = decoded + Chr(((byte1 & $1F) << 6) | (byte2 & $3F));
          Else 
            decoded = decoded + Mid(Texte,i,6);
            illegalencoding = illegalencoding + Mid(Texte,i,6) + " ";
          EndIf 
          i = i + 6;
        EndIf 
        
        ;// Check For three byte UTF-8 character encoding:
        If (byte1 > 223 And byte1 < 240) 
          If (byte2 > 127 And byte2 < 192) 
            If (byte3 > 127 And byte3 < 192) 
              decoded = decoded + Chr(((byte1 & $F) << 12) | ((byte2 & $3F) << 6) | (byte3 & $3F));
            Else 
              decoded = decoded + Mid(Texte,i,9);
              illegalencoding = illegalencoding + Mid(Texte,i,9) + " ";
            EndIf 
          Else 
            decoded = decoded + Mid(Texte,i,9);
            illegalencoding = illegalencoding + Mid(Texte,i,9) + " ";
          EndIf 
          i = i + 9;
        EndIf 
        
        ;// Check For four byte UTF-8 character encoding:
        If (byte1 > 239) 
          If (byte2 > 127 And byte2 < 192) 
            If (byte3 > 127 And byte3 < 192) 
              If (byte4 > 127 And byte4 < 192) 
                decoded = decoded + Chr(((byte1 & $7) << 18) | ((byte2 & $3F) << 12) | ((byte3 & $3F) << 6) | (byte4 & $3F));
              Else 
                decoded = decoded + Mid(Texte,i,12);
                illegalencoding = illegalencoding + Mid(Texte,i,12) + " ";
              EndIf 
            Else 
              decoded = decoded + Mid(Texte,i,12);
              illegalencoding = illegalencoding + Mid(Texte,i,12) + " ";
            EndIf 
          Else 
            decoded = decoded + Mid(Texte,i,12);
            illegalencoding = illegalencoding + Mid(Texte,i,12) + " ";
          EndIf 
          i = i + 12;
        EndIf 
        
      Else   ;// the first byte is Not legally percent-encoded
        decoded = decoded + Mid(Texte,i,3);
        illegalencoding = illegalencoding + Mid(Texte,i,3) + " ";
        i = i + 3;
      EndIf 
      
    Else ;  // the string is Not percent encoded
      ;// Check If character is an allowed character:
      If FindString(allowed, ch)=0 : notallowed = notallowed + ch + " " : EndIf ;
      decoded = decoded + ch;
      i=i+1
    EndIf 
  Wend  ;// End of While ...
  
  ;// Display warning message If necessary:
  
  If (notallowed <> ""): warning = warning + "Characters not allowed in a URL:"+ Chr(13) + notallowed +Chr(13)+Chr(13) : EndIf  ;
  If (illegalencoding <> ""): warning = warning + "Illegal percent-encoding (for UTF-8):" +Chr(13) + illegalencoding +Chr(13)+Chr(13) : EndIf  ;
  If (warning <> "") : MessageRequester("Error","Warning: Illegal characters/strings in encoded text!" +Chr(13)+Chr(13) + warning) : EndIf ;
  
  ProcedureReturn decoded
EndProcedure 

Ret.s =  EncodeURL_UTF8("Lübeck")
Debug Ret
Debug DecodeURL_UTF8(Ret)

Re: UTF-8 support for URLEncoder() and URLDecoder()

Post by Little John »

Hi again,

I'm sorry for the delay in replying!
CalamityJames wrote: A bit more checking shows that it still isn't right - the € and quite a lot of minor characters are still encoded wrongly. I found a website (http://www.w3schools.com/tags/ref_urlencode.asp) which appears to examine what your browser sends.

Code:

Debug URLEncoderEx("€", #PB_UTF8)
yields %E2%82%AC here (source code in the IDE in UTF-8 format, and compiled in Unicode mode!).

Both forms on the webpage you referenced above give the same result.
I currently can't see any problem with my function URLEncoderEx(). :-)


@Nico:
Many thanks for your code!

I think in EncodeURL_UTF8() there is a small glitch with the first characters:

Code:

For c = 1 To $F
   Debug EncodeURL_UTF8(Chr(c))
Next
It returns
%1
%2
etc.
That should IMHO be
%01
%02
etc.
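
One possible way to get the two-digit form is to pad the result of Hex() with RSet(). The following is just a sketch; PercentByte() is a hypothetical helper name and is not part of Nico's code.

```purebasic
EnableExplicit

Procedure.s PercentByte (value.i)
   ; hypothetical helper: formats one byte (0..255) as "%XX" with exactly two hex digits
   ProcedureReturn "%" + RSet(Hex(value & $FF), 2, "0")
EndProcedure

Define c
For c = 1 To $F
   Debug PercentByte(c)   ; -> %01, %02, ... %0F
Next
```

Values of $10 and above already have two hex digits, so the padding does not change them.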
Apart from this, your encoding function and mine yield pretty similar results, with only the following few exceptions:

Code:

Define c, a$, b$

For c = 16 To 10000
   a$ = EncodeURL_UTF8(Chr(c))
   b$ = URLEncoderEx(Chr(c), #PB_UTF8)
   If a$ <> b$
      Debug Hex(c) + " " + Chr(c) + "     " + a$ + "  " + b$
   EndIf   
Next
Output:

Code:

21 !     %21  !
23 #     %23  #
24 $     %24  $
26 &     %26  &
27 '     %27  '
28 (     %28  (
29 )     %29  )
2A *     %2A  *
2B +     %2B  +
2C ,     %2C  ,
2F /     %2F  /
3A :     %3A  :
3B ;     %3B  ;
3D =     %3D  =
3F ?     %3F  ?
40 @     %40  @
All the characters that are handled differently by both functions are reserved characters.

The way my encoding function (as well as PB's built-in URLEncoder()) works can be useful when encoding complete URIs: the characters : and / are then left unencoded, as expected.
However, if e.g. the characters : and / are not meant as component delimiters but as data, then they have to be encoded.

So how to handle this?
RFC 3986 (http://www.faqs.org/rfcs/rfc3986.html), section 2.4 wrote: Under normal circumstances, the only time when octets within a URI
are percent-encoded is during the process of producing the URI from
its component parts. This is when an implementation determines which
of the reserved characters are to be used as subcomponent delimiters
and which can be safely used as data.
That means an encoding function should encode those reserved characters (not in a whole URI, but in each of its components).
I'll change my above encoding function accordingly.
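
To illustrate the idea of encoding each component separately, here is a rough sketch. EncodeComponent() is a simplified, hypothetical stand-in and not the final function from the first post; the host example.com and the sample data are made up as well. It assumes Unicode mode.

```purebasic
EnableExplicit

#Unreserved$ = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~"

Procedure.s EncodeComponent (component$)
   ; hypothetical helper: percent-encodes everything except unreserved characters
   Protected i, k, numBytes, ch$, ret$, *buffer
   *buffer = AllocateMemory(8)   ; room for one UTF-8 character plus the terminating null
   If *buffer = 0 : ProcedureReturn "" : EndIf
   For i = 1 To Len(component$)
      ch$ = Mid(component$, i, 1)
      If FindString(#Unreserved$, ch$)
         ret$ + ch$
      Else
         numBytes = StringByteLength(ch$, #PB_UTF8)
         PokeS(*buffer, ch$, -1, #PB_UTF8)
         For k = 0 To numBytes - 1
            ret$ + "%" + RSet(Hex(PeekA(*buffer + k)), 2, "0")
         Next
      EndIf
   Next
   FreeMemory(*buffer)
   ProcedureReturn ret$
EndProcedure

; Encode the data first, then add the delimiters while assembling the URL.
Define segment$ = "a/b c"   ; here "/" is data, not a path delimiter
Define value$   = "x&y=z"   ; here "&" and "=" are data, not query delimiters
Define url$ = "http://example.com/" + EncodeComponent(segment$) + "?q=" + EncodeComponent(value$)
Debug url$   ; -> http://example.com/a%2Fb%20c?q=x%26y%3Dz
```

The delimiters that give the URL its structure are added unencoded by the caller, while the same characters inside the data are percent-encoded.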

Re: UTF-8 support for encoding and decoding URLs

Post by Little John »

Code in the first post revised.
For details see remarks in that post.