In northern Germany, there is a nice old town called "Lübeck".
The URL of its entry in the English Wikipedia is
After saving this URL to a bookmark (using Firefox 32.0.3 on Windows 7 x64), it looks like this:
My problem was that I could not convert "Lübeck" to "L%C3%BCbeck", or vice versa, using PB's built-in URLEncoder() and URLDecoder() (with PB 5.31 beta 1):
Code: Select all
Debug URLEncoder("Lübeck") ; -> L%FCbeck
Debug URLDecoder("L%C3%BCbeck") ; -> LÃ¼beck
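; Sketch (not part of the original post): the reason for the mismatch.
; In UTF-8, "ü" (U+00FC) is stored as the two bytes C3 BC, whereas in
; Latin-1 it is the single byte FC. The variable names are made up.
Define *mem, n.i, i.i
*mem = AllocateMemory(8)
If *mem
   n = PokeS(*mem, Chr($FC), -1, #PB_UTF8) ; bytes written, without the terminating null
   For i = 0 To n - 1
      Debug RSet(Hex(PeekA(*mem + i)), 2, "0") ; -> C3, then BC
   Next
   FreeMemory(*mem)
EndIf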
The following procedures add UTF-8 support for encoding and decoding URLs. They work correctly only when compiled in Unicode mode.
//edit 2014-12-20:
- renamed both functions
- removed the option for ASCII encoding in both functions
- EncodeURLComponent_UTF8() fixed, so that it now works correctly according to RFC 3986
  (does not utilize PB's built-in function URLEncoder() anymore)
- DecodeURLComponent_UTF8() can now handle '+' in encoded URLs
- added function FindBadURLCharacter()
- improved documentation
- slightly improved demo code
Code: Select all
; PB 5.31

EnableExplicit

#URLUnreservedCharacters$ = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~"

Procedure.s EncodeURLComponent_UTF8 (urlComponent$)
   ; in : urlComponent$: separate component of a URL, which is to be encoded
   ;                     (Do not encode a complete URL.)
   ; out: return value : percent-encoded URL component in UTF-8 format
   ;
   ;    Under normal circumstances, the only time when octets within a URI
   ;    are percent-encoded is during the process of producing the URI from
   ;    its component parts. This is when an implementation determines which
   ;    of the reserved characters are to be used as subcomponent delimiters
   ;    and which can be safely used as data.
   ;    Implementations must not percent-encode the same string more than
   ;    once.
   ;    [RFC 3986, section 2.4]
   Protected *buffer, *char.Character, *a.Ascii, ret$
   Protected numBytes.i, *fin

   *buffer = AllocateMemory(5)  ; in UTF-8, 1 character can take up to 4 bytes
   If *buffer
      *char = @urlComponent$
      While *char\c
         If FindString(#URLUnreservedCharacters$, Chr(*char\c))
            ret$ + Chr(*char\c)
         Else
            numBytes = PokeS(*buffer, Chr(*char\c), 1, #PB_UTF8)
            *fin = *buffer + numBytes - 1
            For *a = *buffer To *fin
               ret$ + "%" + RSet(Hex(*a\a), 2, "0")
            Next
         EndIf
         *char + SizeOf(Character)
      Wend
      FreeMemory(*buffer)
   EndIf

   ProcedureReturn ret$
EndProcedure
Procedure.s DecodeURLComponent_UTF8 (urlComponent$)
   ; in : urlComponent$: separate component of a URL, which may contain
   ;                     percent-encoded characters in UTF-8 format
   ;                     (Do not decode a complete URL.)
   ; out: return value : decoded URL component
   ;
   ;    When a URI is dereferenced, the components and subcomponents
   ;    significant to the scheme-specific dereferencing process (if any)
   ;    must be parsed and separated before the percent-encoded octets within
   ;    those components can be safely decoded, as otherwise the data may be
   ;    mistaken for component delimiters.
   ;    Implementations must not percent-decode the same string more than
   ;    once.
   ;    [RFC 3986, section 2.4]
   Protected *buffer, *char.Character, *a.Ascii, ret$
   Protected length.i=Len(urlComponent$)

   If length > 0
      urlComponent$ = ReplaceString(urlComponent$, "+", "%20")  ; treat '+' as an encoded space
      *buffer = AllocateMemory(length)  ; each input character yields at most one byte
      If *buffer
         *char = @urlComponent$
         *a = *buffer
         While *char\c
            If *char\c = '%'
               *char\c = '$'                  ; turn "%XX" into "$XX", which Val() reads as hexadecimal
               *a\a = Val(PeekS(*char, 3))
               *char + 3*SizeOf(Character)
            Else
               *a\a = *char\c
               *char + SizeOf(Character)
            EndIf
            *a + 1
         Wend
         ret$ = PeekS(*buffer, *a-*buffer, #PB_UTF8)  ; interpret the collected bytes as UTF-8
         FreeMemory(*buffer)
      EndIf
   EndIf

   ProcedureReturn ret$
EndProcedure
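
; -- Sketch (not part of the original post) --
; DecodeURLComponent_UTF8() assumes that every '%' is followed by two
; hexadecimal digits. For input from an untrusted source, a check like
; the following could be run first. The names #HexDigits$ and
; IsWellFormedPercentEncoding() are hypothetical.

#HexDigits$ = "0123456789ABCDEFabcdef"

Procedure.i IsWellFormedPercentEncoding (urlComponent$)
   ; in : URL component to be checked
   ; out: #True if each '%' is followed by two hexadecimal digits,
   ;      #False otherwise
   Protected i.i = 1, length.i = Len(urlComponent$)

   While i <= length
      If Mid(urlComponent$, i, 1) = "%"
         If i + 2 > length
            ProcedureReturn #False   ; '%' too close to the end of the string
         EndIf
         If FindString(#HexDigits$, Mid(urlComponent$, i+1, 1)) = 0
            ProcedureReturn #False
         EndIf
         If FindString(#HexDigits$, Mid(urlComponent$, i+2, 1)) = 0
            ProcedureReturn #False
         EndIf
         i + 3
      Else
         i + 1
      EndIf
   Wend

   ProcedureReturn #True
EndProcedure

; Possible usage:
;    If IsWellFormedPercentEncoding(a$)
;       Debug DecodeURLComponent_UTF8(a$)
;    EndIf
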
#URLBadCharacters$ = #DQUOTE$ + "<>\^`{|}"

Procedure.i FindBadURLCharacter (url$)
   ; in : URL to be checked
   ; out: leftmost position in 'url$' of any character that is *never* allowed in a URL;
   ;      0 if there is no such "bad" character in 'url$'
   ;
   ; [according to RFC 3986]
   Protected *char.Character, posn.i

   *char = @url$
   posn = 1
   While *char\c <> 0
      If (*char\c < 33) Or (*char\c > 126) Or FindString(#URLBadCharacters$, Chr(*char\c))
         ProcedureReturn posn
      EndIf
      *char + SizeOf(Character)
      posn + 1
   Wend

   ProcedureReturn 0
EndProcedure
; -- Demo
Define c, a$, b$, c$

; Round trip: decoding an encoded character must yield the character again
For c = 1 To 10000
   a$ = EncodeURLComponent_UTF8(Chr(c))
   If DecodeURLComponent_UTF8(a$) <> Chr(c)
      Debug "Error: " + Hex(c) + " " + Chr(c) + " " + a$ + " " + DecodeURLComponent_UTF8(a$)
   EndIf
Next

a$ = EncodeURLComponent_UTF8("Lübeck")   ; -> L%C3%BCbeck
b$ = EncodeURLComponent_UTF8("cœur")     ; -> c%C5%93ur
c$ = EncodeURLComponent_UTF8("Œ-æ-Æ-€")  ; -> %C5%92-%C3%A6-%C3%86-%E2%82%AC
Debug a$
Debug b$
Debug c$
Debug ""
Debug DecodeURLComponent_UTF8(a$)  ; -> Lübeck
Debug DecodeURLComponent_UTF8(b$)  ; -> cœur
Debug DecodeURLComponent_UTF8(c$)  ; -> Œ-æ-Æ-€
Debug ""

; An encoded component must never contain a "bad" character
For c = 1 To 10000
   a$ = EncodeURLComponent_UTF8(Chr(c))
   If FindBadURLCharacter(a$)
      Debug "Error: " + Hex(c) + " " + Chr(c) + " " + a$
   EndIf
Next

Debug FindBadURLCharacter("A test")                                    ; -> 2 (the space)
Debug FindBadURLCharacter("A%20test")                                  ; -> 0
Debug FindBadURLCharacter("A+test")                                    ; -> 0
Debug FindBadURLCharacter("http://en.wikipedia.org/wiki/Lübeck")       ; -> 31 (the "ü")
Debug FindBadURLCharacter("http://en.wikipedia.org/wiki/L%C3%BCbeck")  ; -> 0
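
; -- Sketch (not part of the original demo): assembling URLs from components.
; Only the individual components are encoded, never the complete URL, so
; that the delimiters "/", "?", "&" and "=" of the URL keep their meaning.
; The variable names and the example.com address are made up for illustration.
Define page$, query$
page$  = "http://en.wikipedia.org/wiki/" + EncodeURLComponent_UTF8("Lübeck")
query$ = "http://example.com/search?q=" + EncodeURLComponent_UTF8("a&b=c")
Debug page$                        ; -> http://en.wikipedia.org/wiki/L%C3%BCbeck
Debug query$                       ; -> http://example.com/search?q=a%26b%3Dc
Debug FindBadURLCharacter(page$)   ; -> 0
Debug FindBadURLCharacter(query$)  ; -> 0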