HTMLEncoder / HTMLDecoder

Posted: Thu Feb 05, 2009 9:03 pm
by luis
To complement UrlDecoder / UrlEncoder.

Encode / decode special chars from / to HTML.

Code:

EnableExplicit

DataSection
 HTML_DECODER_DATA:
    
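 ; "&amp;" is listed first so HTMLEncoder does not re-encode the "&" of entities it has just written
 Data.s  "&amp;"     : Data.i 38
 Data.s  "&quot;"    : Data.i 34
 Data.s  "&lt;"      : Data.i 60 
 Data.s  "&gt;"      : Data.i 62    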
    
 Data.s  "&nbsp;"    : Data.i 160
 Data.s  "&iexcl;"   : Data.i 161
 Data.s  "&cent;"    : Data.i 162
 Data.s  "&pound;"   : Data.i 163
 Data.s  "&curren;"  : Data.i 164
 Data.s  "&yen;"     : Data.i 165
 Data.s  "&brvbar;"  : Data.i 166
 Data.s  "&sect;"    : Data.i 167
 Data.s  "&uml;"     : Data.i 168
 Data.s  "&copy;"    : Data.i 169
 Data.s  "&ordf;"    : Data.i 170
 Data.s  "&laquo;"   : Data.i 171
 Data.s  "&not;"     : Data.i 172
 Data.s  "&shy;"     : Data.i 173
 Data.s  "&reg;"     : Data.i 174
 Data.s  "&macr;"    : Data.i 175
 Data.s  "&deg;"     : Data.i 176
 Data.s  "&plusmn;"  : Data.i 177
 Data.s  "&sup2;"    : Data.i 178
 Data.s  "&sup3;"    : Data.i 179
 Data.s  "&acute;"   : Data.i 180
 Data.s  "&micro;"   : Data.i 181
 Data.s  "&para;"    : Data.i 182
 Data.s  "&middot;"  : Data.i 183
 Data.s  "&cedil;"   : Data.i 184
 Data.s  "&sup1;"    : Data.i 185
 Data.s  "&ordm;"    : Data.i 186
 Data.s  "&raquo;"   : Data.i 187
 Data.s  "&frac14;"  : Data.i 188
 Data.s  "&frac12;"  : Data.i 189
 Data.s  "&frac34;"  : Data.i 190
 Data.s  "&iquest;"  : Data.i 191
 Data.s  "&Agrave;"  : Data.i 192
 Data.s  "&Aacute;"  : Data.i 193
 Data.s  "&Acirc;"   : Data.i 194
 Data.s  "&Atilde;"  : Data.i 195
 Data.s  "&Auml;"    : Data.i 196
 Data.s  "&Aring;"   : Data.i 197
 Data.s  "&AElig;"   : Data.i 198
 Data.s  "&Ccedil;"  : Data.i 199
 Data.s  "&Egrave;"  : Data.i 200
 Data.s  "&Eacute;"  : Data.i 201
 Data.s  "&Ecirc;"   : Data.i 202
 Data.s  "&Euml;"    : Data.i 203
 Data.s  "&Igrave;"  : Data.i 204
 Data.s  "&Iacute;"  : Data.i 205
 Data.s  "&Icirc;"   : Data.i 206
 Data.s  "&Iuml;"    : Data.i 207
 Data.s  "&ETH;"     : Data.i 208
 Data.s  "&Ntilde;"  : Data.i 209
 Data.s  "&Ograve;"  : Data.i 210
 Data.s  "&Oacute;"  : Data.i 211
 Data.s  "&Ocirc;"   : Data.i 212
 Data.s  "&Otilde;"  : Data.i 213
 Data.s  "&Ouml;"    : Data.i 214
 Data.s  "&times;"   : Data.i 215
 Data.s  "&Oslash;"  : Data.i 216
 Data.s  "&Ugrave;"  : Data.i 217
 Data.s  "&Uacute;"  : Data.i 218
 Data.s  "&Ucirc;"   : Data.i 219
 Data.s  "&Uuml;"    : Data.i 220
 Data.s  "&Yacute;"  : Data.i 221
 Data.s  "&THORN;"   : Data.i 222
 Data.s  "&szlig;"   : Data.i 223
 Data.s  "&agrave;"  : Data.i 224
 Data.s  "&aacute;"  : Data.i 225
 Data.s  "&acirc;"   : Data.i 226
 Data.s  "&atilde;"  : Data.i 227
 Data.s  "&auml;"    : Data.i 228
 Data.s  "&aring;"   : Data.i 229
 Data.s  "&aelig;"   : Data.i 230
 Data.s  "&ccedil;"  : Data.i 231
 Data.s  "&egrave;"  : Data.i 232
 Data.s  "&eacute;"  : Data.i 233
 Data.s  "&ecirc;"   : Data.i 234
 Data.s  "&euml;"    : Data.i 235
 Data.s  "&igrave;"  : Data.i 236
 Data.s  "&iacute;"  : Data.i 237
 Data.s  "&icirc;"   : Data.i 238
 Data.s  "&iuml;"    : Data.i 239    
 Data.s  "&eth;"     : Data.i 240
 Data.s  "&ntilde;"  : Data.i 241
 Data.s  "&ograve;"  : Data.i 242
 Data.s  "&oacute;"  : Data.i 243
 Data.s  "&ocirc;"   : Data.i 244
 Data.s  "&otilde;"  : Data.i 245
 Data.s  "&ouml;"    : Data.i 246
 Data.s  "&divide;"  : Data.i 247
 Data.s  "&oslash;"  : Data.i 248
 Data.s  "&ugrave;"  : Data.i 249
 Data.s  "&uacute;"  : Data.i 250
 Data.s  "&ucirc;"   : Data.i 251
 Data.s  "&uuml;"    : Data.i 252
 Data.s  "&yacute;"  : Data.i 253
 Data.s  "&thorn;"   : Data.i 254
 Data.s  "&yuml;"    : Data.i 255
    
 Data.s  ""          : Data.i 0
            
EndDataSection
 
Procedure.i IsDigit (sChar.s)
 Protected flgIsDigit = #False
 Protected iChar = Asc(sChar)
 
 If iChar >= '0' And iChar <= '9' : flgIsDigit = #True : EndIf
 
 ProcedureReturn flgIsDigit 
EndProcedure

Procedure.i IsAllDigit (sString.s)
 Protected flgIsAllDigit = #False
 Protected iLen = Len(sString)
 Protected k
 
 If iLen
    flgIsAllDigit = #True
    For k = 1 To iLen
        If IsDigit(Mid(sString, k, 1)) = #False 
            flgIsAllDigit = #False
            Break
        EndIf
    Next    
 EndIf
 
 ProcedureReturn flgIsAllDigit
EndProcedure
 
 
Procedure.s HTMLDecoder (sEncodedString.s)

 Protected sHTMLEntity.s, iHTMLChar, sOutString.s = sEncodedString
 Protected sTemp.s, iStart, iEnd 
 
 Restore HTML_DECODER_DATA:  Read.s sHTMLEntity: Read.i iHTMLChar 
 
 Repeat 
    sOutString = ReplaceString(sOutString, sHTMLEntity, Chr(iHTMLChar))    
    Read.s sHTMLEntity: Read.i iHTMLChar
 Until iHTMLChar = 0
 
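 ; decode every decimal reference such as "&#39;" or "&#039;", not just the first one
 
 iStart = FindString(sOutString, "&#", 1) 
 
 While iStart > 0
    iEnd = FindString(sOutString, ";", iStart)
    If iEnd = 0 : Break : EndIf ; no terminating ";" left in the string
    sTemp = Mid(sOutString, iStart + 2, iEnd - iStart - 2)
    If IsAllDigit(sTemp)
        ; splice in the decoded character at this position only
        sOutString = Left(sOutString, iStart - 1) + Chr(Val(sTemp)) + Mid(sOutString, iEnd + 1)
    EndIf
    ; continue after the current position so decoded text is never decoded twice
    iStart = FindString(sOutString, "&#", iStart + 1)
 Wend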
 
 ProcedureReturn sOutString
  
EndProcedure


Procedure.s HTMLEncoder (sString.s)

 Protected sHTMLEntity.s, iHTMLChar, sOutString.s = sString

 Restore HTML_DECODER_DATA:  Read.s sHTMLEntity: Read.i iHTMLChar 
 
 Repeat 
    sOutString = ReplaceString(sOutString, Chr(iHTMLChar), sHTMLEntity)    
    Read.s sHTMLEntity: Read.i iHTMLChar     
 Until iHTMLChar = 0
   
 ProcedureReturn sOutString
  
EndProcedure


Posted: Thu Feb 05, 2009 9:16 pm
by Joakim Christiansen
That is kinda sweet, will sure come in handy one day :D

Posted: Thu Feb 05, 2009 10:17 pm
by kenmo
Very useful, I'll need something like this soon!


One glaring optimization stands out to me, though:

You're scanning through the entire HTML string (via ReplaceString()) for EVERY eligible character (almost 100!).

Seems like a single manual pass, de/encoding characters as you go, would perform much better. Might be plenty fast as it is, however.
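
For illustration, a single-pass encoder might look roughly like the sketch below. It is not luis' code: the name HTMLEncodeSinglePass is made up, and writing characters above 127 as decimal references (&#233; and so on) instead of looking up their named entities is an assumption made only to keep the sketch self-contained. Since every input character is visited exactly once, the "&" of an entity that was just written is never encoded a second time.

```purebasic
EnableExplicit

; Hypothetical helper, not part of the original post.
Procedure.s HTMLEncodeSinglePass(sString.s)
  Protected sOut.s, sChar.s, iChar, k
  Protected iLen = Len(sString)

  For k = 1 To iLen
    sChar = Mid(sString, k, 1)
    iChar = Asc(sChar)
    Select iChar
      Case 34 : sOut + "&quot;"   ; double quote
      Case 38 : sOut + "&amp;"    ; ampersand
      Case 60 : sOut + "&lt;"     ; less-than
      Case 62 : sOut + "&gt;"     ; greater-than
      Default
        If iChar > 127
          ; assumption: a decimal reference instead of a named entity
          sOut + "&#" + Str(iChar) + ";"
        Else
          sOut + sChar            ; plain ASCII is copied unchanged
        EndIf
    EndSelect
  Next

  ProcedureReturn sOut
EndProcedure

Debug HTMLEncodeSinglePass("a < b & c") ; a &lt; b &amp; c
```

Building the result by repeated concatenation may still be slow for very long strings, so this mainly shows the control flow rather than a tuned implementation.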

Posted: Thu Feb 05, 2009 10:38 pm
by luis
kenmo wrote: Seems like a single manual pass, de/encoding characters as you go, would perform much better. Might be plenty fast as it is, however.
Yes, you are right.

I followed this path because in a real-world scenario (for my uses anyway) I felt it was simply irrelevant. This was easy to code and quite fast.

Very often I prefer simple code over fast code.

For example, if some disk access is involved, optimizing this for speed is probably useless.

If your program interacts with the user, again, optimizing this for speed is probably useless.

And so on.

It all depends on what your program does and how much time it would spend inside this code.

Posted: Thu Feb 05, 2009 10:47 pm
by idle
thanks luis

Posted: Thu Feb 05, 2009 11:17 pm
by Joakim Christiansen
luis wrote: It all depends on what your program does and how much time it would spend inside this code.
I agree. If you don't need a function to be ultra fast then there is no need to spend time optimizing it. But if you were to parse a lot of files and it needed to be very fast THEN you could try to spend an hour (or whatever) to make it faster.

But with that said don't get me wrong either, if someone posts an optimized version I would most likely use it (if it ain't >100 lines of code).
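
As a rough idea of what such an optimized decoder could look like, here is a single-pass sketch that stays well under 100 lines. Everything in it is an assumption for illustration: the names HtmlEntityCode, IsDecimal and HTMLDecodeSinglePass do not appear in the thread, and the map holds only five entities for brevity (the rest of the table would be added the same way). It scans the input once, copies plain text as it goes, and resolves each "&...;" reference where it is found, so text that has already been decoded is never scanned again.

```purebasic
EnableExplicit

; Hypothetical lookup table: entity name (without "&" and ";") -> character code.
Global NewMap HtmlEntityCode.i()
HtmlEntityCode("quot") = 34
HtmlEntityCode("amp")  = 38
HtmlEntityCode("lt")   = 60
HtmlEntityCode("gt")   = 62
HtmlEntityCode("nbsp") = 160

; True only if the string is non-empty and consists of decimal digits.
Procedure.i IsDecimal(s.s)
  Protected k, c
  If s = "" : ProcedureReturn #False : EndIf
  For k = 1 To Len(s)
    c = Asc(Mid(s, k, 1))
    If c < '0' Or c > '9' : ProcedureReturn #False : EndIf
  Next
  ProcedureReturn #True
EndProcedure

Procedure.s HTMLDecodeSinglePass(sIn.s)
  Protected sOut.s, sName.s, iAmp, iSemi, iCode
  Protected iPos = 1, iLen = Len(sIn)

  While iPos <= iLen
    iAmp = FindString(sIn, "&", iPos)
    If iAmp = 0
      sOut + Mid(sIn, iPos)                  ; no more references: copy the rest
      Break
    EndIf
    If iAmp > iPos
      sOut + Mid(sIn, iPos, iAmp - iPos)     ; copy the text before the "&"
    EndIf

    iCode = 0
    iSemi = FindString(sIn, ";", iAmp + 1)
    If iSemi > iAmp + 1
      sName = Mid(sIn, iAmp + 1, iSemi - iAmp - 1)
      If Left(sName, 1) = "#"
        If IsDecimal(Mid(sName, 2)) : iCode = Val(Mid(sName, 2)) : EndIf
      ElseIf FindMapElement(HtmlEntityCode(), sName)
        iCode = HtmlEntityCode()             ; value of the element just found
      EndIf
    EndIf

    If iCode > 0
      sOut + Chr(iCode)                      ; recognized reference
      iPos = iSemi + 1
    Else
      sOut + "&"                             ; not a reference: keep the "&" as text
      iPos = iAmp + 1
    EndIf
  Wend

  ProcedureReturn sOut
EndProcedure

Debug HTMLDecodeSinglePass("&lt;b&gt; &amp;amp; &#39;x&#039;") ; <b> &amp; 'x'
```

Map keys are case-sensitive, which matters here because names such as Agrave and agrave stand for different characters.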

Re: HTMLEncoder / HTMLDecoder

Posted: Wed Jun 25, 2014 10:56 am
by es_91
I know this thread is pretty old, but I've got to say it:

Th- T-- come on, all together, Th-

THANX LUIS !!!

Re: HTMLEncoder / HTMLDecoder

Posted: Sat Jun 13, 2015 9:58 am
by c4s
Just needed something like this, thanks luis!

Honestly I wonder why it's not officially implemented yet as it's rather important when dealing with HTML. So I created a feature request: HTML Entities Encoder/Decoder

Re: HTMLEncoder / HTMLDecoder

Posted: Mon Aug 14, 2023 6:05 am
by BarryG
Luis' code fails with this:

Code:

Debug HTMLDecoder("one two (three  <img src="+Chr(34)+"images/four.gif"+Chr(34)+" alt="+Chr(34)+":D"+Chr(34)+" /> five )")
Any updates?

Re: HTMLEncoder / HTMLDecoder

Posted: Mon Aug 14, 2023 6:26 am
by Little John
BarryG wrote: Mon Aug 14, 2023 6:05 am Luis' code fails with this:

Code:

Debug HTMLDecoder("one two (three  <img src="+Chr(34)+"images/four.gif"+Chr(34)+" alt="+Chr(34)+":D"+Chr(34)+" /> five )")
Any updates?
Using PB 6.03 beta 4, Luis' above code (with a tiny syntax correction) gives the following result:
one two (three <img src="images/four.gif" alt=":D" /> five )
This is exactly what I expected. Your example text does not contain any HTML entity that needs to be decoded.

Re: HTMLEncoder / HTMLDecoder

Posted: Mon Aug 14, 2023 6:34 am
by BarryG
It's meant to remove the image data. There's an opening <img> tag, but no closing </img> one, so it fails. It closes with "/>" instead, which is valid HTML.

But never mind, I'm doing it like this now, which works for my use case:

Code:

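Procedure.s StripTags(text$)
  Protected s, e
  Repeat
    s=FindString(text$,"<")
    If s
      e=FindString(text$,">",s+1)
      If e
        ; cut everything from "<" up to and including the matching ">"
        text$=Left(text$,s-1)+Mid(text$,e+1)
      Else
        Break ; unmatched "<": stop here instead of looping forever
      EndIf
    EndIf
  Until s=0
  ProcedureReturn text$
EndProcedure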

Debug StripTags("one two (three  <img src="+Chr(34)+"images/four.gif"+Chr(34)+" alt="+Chr(34)+":D"+Chr(34)+" /> five )")

Re: HTMLEncoder / HTMLDecoder

Posted: Mon Aug 14, 2023 6:37 am
by Little John
BarryG wrote: Mon Aug 14, 2023 6:34 am It's meant to remove the image data. There's an opening <img> tag, but no closing </img> one, so it fails. It closes with "/>" instead, which is valid HTML.
You misunderstand what the purpose of Luis' code is.
Your original question, which I answered, was actually about decoding HTML entities.
Now, in this thread, you are asking for code that removes HTML tags. That's something completely different.

Re: HTMLEncoder / HTMLDecoder

Posted: Mon Aug 14, 2023 8:29 am
by Little John
luis wrote:

Code:

DataSection
 HTML_DECODER_DATA:
    
 Data.s  """    : Data.i 34
Hi Luis,

did that ever work with older PB versions? :o
Now, e.g. with PB 6.03 beta 4, we have to write it differently, for instance

Code:

Data.s ~"\""

Re: HTMLEncoder / HTMLDecoder

Posted: Mon Aug 14, 2023 9:43 am
by BarryG
@Little John: Yeah, I had two sets of HTML-related decoding issues that I needed to fix. But both are solved now with some quick hacky code.

Re: HTMLEncoder / HTMLDecoder

Posted: Mon Aug 14, 2023 10:59 am
by luis
Little John wrote: Mon Aug 14, 2023 8:29 am did that ever work with older PB versions? :o
Hi, in 2009 it must have worked since now it's not even compilable and at the time obviously I compiled it :?
I've updated it with your patch, thanks.