HTMLEncoder / HTMLDecoder

Post by luis »

To complement UrlDecoder / UrlEncoder.

Encode / decode special chars from / to HTML.

Code: Select all

EnableExplicit

DataSection
 HTML_DECODER_DATA:
    
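 ; "&amp;" must come first: HTMLEncoder replaces in table order, and encoding "&"
 ; later would re-encode the "&" of entities inserted by earlier rows.
 Data.s  "&amp;"     : Data.i 38
 Data.s  "&quot;"    : Data.i 34
 Data.s  "&lt;"      : Data.i 60
 Data.s  "&gt;"      : Data.i 62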
    
 Data.s  "&nbsp;"    : Data.i 160
 Data.s  "&iexcl;"   : Data.i 161
 Data.s  "&cent;"    : Data.i 162
 Data.s  "&pound;"   : Data.i 163
 Data.s  "&curren;"  : Data.i 164
 Data.s  "&yen;"     : Data.i 165
 Data.s  "&brvbar;"  : Data.i 166
 Data.s  "&sect;"    : Data.i 167
 Data.s  "&uml;"     : Data.i 168
 Data.s  "&copy;"    : Data.i 169
 Data.s  "&ordf;"    : Data.i 170
 Data.s  "&laquo;"   : Data.i 171
 Data.s  "&not;"     : Data.i 172
 Data.s  "&shy;"     : Data.i 173
 Data.s  "&reg;"     : Data.i 174
 Data.s  "&macr;"    : Data.i 175
 Data.s  "&deg;"     : Data.i 176
 Data.s  "&plusmn;"  : Data.i 177
 Data.s  "&sup2;"    : Data.i 178
 Data.s  "&sup3;"    : Data.i 179
 Data.s  "&acute;"   : Data.i 180
 Data.s  "&micro;"   : Data.i 181
 Data.s  "&para;"    : Data.i 182
 Data.s  "&middot;"  : Data.i 183
 Data.s  "&cedil;"   : Data.i 184
 Data.s  "&sup1;"    : Data.i 185
 Data.s  "&ordm;"    : Data.i 186
 Data.s  "&raquo;"   : Data.i 187
 Data.s  "&frac14;"  : Data.i 188
 Data.s  "&frac12;"  : Data.i 189
 Data.s  "&frac34;"  : Data.i 190
 Data.s  "&iquest;"  : Data.i 191
 Data.s  "&Agrave;"  : Data.i 192
 Data.s  "&Aacute;"  : Data.i 193
 Data.s  "&Acirc;"   : Data.i 194
 Data.s  "&Atilde;"  : Data.i 195
 Data.s  "&Auml;"    : Data.i 196
 Data.s  "&Aring;"   : Data.i 197
 Data.s  "&AElig;"   : Data.i 198
 Data.s  "&Ccedil;"  : Data.i 199
 Data.s  "&Egrave;"  : Data.i 200
 Data.s  "&Eacute;"  : Data.i 201
 Data.s  "&Ecirc;"   : Data.i 202
 Data.s  "&Euml;"    : Data.i 203
 Data.s  "&Igrave;"  : Data.i 204
 Data.s  "&Iacute;"  : Data.i 205
 Data.s  "&Icirc;"   : Data.i 206
 Data.s  "&Iuml;"    : Data.i 207
 Data.s  "&ETH;"     : Data.i 208
 Data.s  "&Ntilde;"  : Data.i 209
 Data.s  "&Ograve;"  : Data.i 210
 Data.s  "&Oacute;"  : Data.i 211
 Data.s  "&Ocirc;"   : Data.i 212
 Data.s  "&Otilde;"  : Data.i 213
 Data.s  "&Ouml;"    : Data.i 214
 Data.s  "&times;"   : Data.i 215
 Data.s  "&Oslash;"  : Data.i 216
 Data.s  "&Ugrave;"  : Data.i 217
 Data.s  "&Uacute;"  : Data.i 218
 Data.s  "&Ucirc;"   : Data.i 219
 Data.s  "&Uuml;"    : Data.i 220
 Data.s  "&Yacute;"  : Data.i 221
 Data.s  "&THORN;"   : Data.i 222
 Data.s  "&szlig;"   : Data.i 223
 Data.s  "&agrave;"  : Data.i 224
 Data.s  "&aacute;"  : Data.i 225
 Data.s  "&acirc;"   : Data.i 226
 Data.s  "&atilde;"  : Data.i 227
 Data.s  "&auml;"    : Data.i 228
 Data.s  "&aring;"   : Data.i 229
 Data.s  "&aelig;"   : Data.i 230
 Data.s  "&ccedil;"  : Data.i 231
 Data.s  "&egrave;"  : Data.i 232
 Data.s  "&eacute;"  : Data.i 233
 Data.s  "&ecirc;"   : Data.i 234
 Data.s  "&euml;"    : Data.i 235
 Data.s  "&igrave;"  : Data.i 236
 Data.s  "&iacute;"  : Data.i 237
 Data.s  "&icirc;"   : Data.i 238
 Data.s  "&iuml;"    : Data.i 239    
 Data.s  "&eth;"     : Data.i 240
 Data.s  "&ntilde;"  : Data.i 241
 Data.s  "&ograve;"  : Data.i 242
 Data.s  "&oacute;"  : Data.i 243
 Data.s  "&ocirc;"   : Data.i 244
 Data.s  "&otilde;"  : Data.i 245
 Data.s  "&ouml;"    : Data.i 246
 Data.s  "&divide;"  : Data.i 247
 Data.s  "&oslash;"  : Data.i 248
 Data.s  "&ugrave;"  : Data.i 249
 Data.s  "&uacute;"  : Data.i 250
 Data.s  "&ucirc;"   : Data.i 251
 Data.s  "&uuml;"    : Data.i 252
 Data.s  "&yacute;"  : Data.i 253
 Data.s  "&thorn;"   : Data.i 254
 Data.s  "&yuml;"    : Data.i 255
    
 Data.s  ""          : Data.i 0
            
EndDataSection
 
Procedure.i IsDigit (sChar.s)
 Protected flgIsDigit = #False
 Protected iChar = Asc(sChar)
 
 If iChar >= '0' And iChar <= '9' : flgIsDigit = #True : EndIf
 
 ProcedureReturn flgIsDigit 
EndProcedure

Procedure.i IsAllDigit (sString.s)
 Protected flgIsAllDigit = #False
 Protected iLen = Len(sString)
 Protected k
 
 If iLen
    flgIsAllDigit = #True
    For k = 1 To iLen
        If IsDigit(Mid(sString, k, 1)) = #False 
            flgIsAllDigit = #False
            Break
        EndIf
    Next    
 EndIf
 
 ProcedureReturn flgIsAllDigit
EndProcedure
 
 
Procedure.s HTMLDecoder (sEncodedString.s)

 Protected sHTMLEntity.s, iHTMLChar, sOutString.s = sEncodedString
 Protected sTemp.s, iStart, iEnd 
 
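 Restore HTML_DECODER_DATA:  Read.s sHTMLEntity: Read.i iHTMLChar 
 
 Repeat 
    ; "&amp;" is decoded last (see below) so that "&amp;lt;" yields "&lt;", not "<"
    If sHTMLEntity <> "&amp;"
        sOutString = ReplaceString(sOutString, sHTMLEntity, Chr(iHTMLChar))
    EndIf
    Read.s sHTMLEntity: Read.i iHTMLChar
 Until iHTMLChar = 0
 
 ; decode every decimal numeric entity, e.g. "&#39;" or "&#039;"
 
 iStart = FindString(sOutString, "&#", 1) 
 
 While iStart > 0
    iEnd = FindString(sOutString, ";", iStart)
    If iEnd = 0 : Break : EndIf ; no terminating ";" left in the string
    sTemp = Mid(sOutString, iStart + 2, iEnd - iStart - 2)
    If IsAllDigit(sTemp)
        ; replace only this occurrence
        sOutString = Left(sOutString, iStart - 1) + Chr(Val(sTemp)) + Mid(sOutString, iEnd + 1)
    EndIf
    iStart = FindString(sOutString, "&#", iStart + 1) ; continue after the current position
 Wend
 
 sOutString = ReplaceString(sOutString, "&amp;", "&")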
 
 ProcedureReturn sOutString
  
EndProcedure


Procedure.s HTMLEncoder (sString.s)

 Protected sHTMLEntity.s, iHTMLChar, sOutString.s = sString

 Restore HTML_DECODER_DATA:  Read.s sHTMLEntity: Read.i iHTMLChar 
 
 Repeat 
    sOutString = ReplaceString(sOutString, Chr(iHTMLChar), sHTMLEntity)    
    Read.s sHTMLEntity: Read.i iHTMLChar     
 Until iHTMLChar = 0
   
 ProcedureReturn sOutString
  
EndProcedure

Last edited by luis on Mon Aug 14, 2023 10:54 am, edited 1 time in total.

Post by Joakim Christiansen »

That is kinda sweet, will sure come in handy one day :D

Post by kenmo »

Very useful, I'll need something like this soon!


One glaring optimization stands out to me though:

You're scanning through the entire HTML string (via ReplaceString()) for EVERY eligible character (almost 100!).

Seems like a single manual pass, de/encoding characters as you go, would perform much better. Might be plenty fast as it is, however.
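
A minimal sketch of what such a single pass could look like for the encoding direction. The name HTMLEncodeOnePass is hypothetical and not part of luis' code, and to stay short it emits numeric references for characters 160 to 255 instead of the named entities from the data table. Whether it is actually faster would have to be measured, since appending to a string in a loop has its own cost.

```purebasic
; Sketch only: HTMLEncodeOnePass is a hypothetical name, not part of luis' code.
Procedure.s HTMLEncodeOnePass(sString.s)
  Protected sOut.s
  Protected *c.Character = @sString

  If *c = 0 : ProcedureReturn "" : EndIf  ; guard against a null string pointer

  ; walk the string once, character by character
  While *c\c
    Select *c\c
      Case '&'        : sOut + "&amp;"
      Case '<'        : sOut + "&lt;"
      Case '>'        : sOut + "&gt;"
      Case 34         : sOut + "&quot;"                ; double quote
      Case 160 To 255 : sOut + "&#" + Str(*c\c) + ";"  ; numeric reference, no name table
      Default         : sOut + Chr(*c\c)
    EndSelect
    *c + SizeOf(Character)
  Wend

  ProcedureReturn sOut
EndProcedure

Debug HTMLEncodeOnePass("a<b & c") ; a&lt;b &amp; c
```

Because every input character is looked at exactly once, the "&" written as part of an entity is never re-encoded, so the order of the cases does not matter here.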

Post by luis »

kenmo wrote: Seems like a single manual pass, de/encoding characters as you go, would perform much better. Might be plenty fast as it is, however.
Yes, you are right.

I followed this path because in a real-world scenario (for my uses anyway) I felt it was simply irrelevant. This was easy to code and quite fast.

Very often I prefer simple code over fast code.

For example, if some disk access is involved, optimizing this for speed is probably useless.

If your program interacts with the user, again optimizing this for speed is probably useless.

And so on.

It all depends on what your program does and how much time it would spend inside this code.

Post by idle »

thanks luis

Post by Joakim Christiansen »

luis wrote: It all depends on what your program does and how much time it would spend inside this code.
I agree. If you don't need a function to be ultra fast then there is no need to spend time optimizing it. But if you were to parse a lot of files and it needed to be very fast THEN you could try to spend an hour (or whatever) to make it faster.

But with that said don't get me wrong either, if someone posts an optimized version I would most likely use it (if it ain't >100 lines of code).

Post by es_91 »

I know this thread is pretty old, but I've got to say it:

Th- T-- come on, all together, Th-

THANX LUIS !!!
:mrgreen:

Post by c4s »

Just needed something like this, thanks luis!

Honestly I wonder why it's not officially implemented yet as it's rather important when dealing with HTML. So I created a feature request: HTML Entities Encoder/Decoder

Post by BarryG »

Luis' code fails with this:

Code: Select all

Debug HTMLDecoder("one two (three  <img src="+Chr(34)+"images/four.gif"+Chr(34)+" alt="+Chr(34)+":D"+Chr(34)+" /> five )")
Any updates?

Post by Little John »

BarryG wrote: Mon Aug 14, 2023 6:05 am Luis' code fails with this:

Code: Select all

Debug HTMLDecoder("one two (three  <img src="+Chr(34)+"images/four.gif"+Chr(34)+" alt="+Chr(34)+":D"+Chr(34)+" /> five )")
Any updates?
Using PB 6.03 beta 4, Luis' above code (with a tiny syntax correction) gives the following result:
one two (three <img src="images/four.gif" alt=":D" /> five )
This is exactly what I expected. Your example text does not contain any HTML entity that needs to be decoded.

Post by BarryG »

It's meant to remove the image data. There's an opening <img> tag, but no closing </img> one, so it fails. It closes with "/>" instead, which is valid HTML.

But never mind, I'm doing it like this now, which works for my use case:

Code: Select all

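Procedure.s StripTags(text$)
  Protected s, e ; declared so the procedure also compiles under EnableExplicit
  Repeat
    s=FindString(text$,"<")
    If s
      e=FindString(text$,">",s+1)
      If e
        ; cut everything from "<" up to and including the matching ">"
        text$=Left(text$,s-1)+Mid(text$,e+1)
      Else
        Break ; unterminated "<": stop instead of looping forever
      EndIf
    EndIf
  Until s=0
  ProcedureReturn text$
EndProcedure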

Debug StripTags("one two (three  <img src="+Chr(34)+"images/four.gif"+Chr(34)+" alt="+Chr(34)+":D"+Chr(34)+" /> five )")
Last edited by BarryG on Mon Aug 14, 2023 6:37 am, edited 1 time in total.

Post by Little John »

BarryG wrote: Mon Aug 14, 2023 6:34 am It's meant to remove the image data. There's an opening <img> tag, but no closing </img> one, so it fails. It closes with "/>" instead, which is valid HTML.
You misunderstand the purpose of Luis' code.
Your original question, which I answered, was actually about decoding HTML entities.
In this thread, you are now asking for code that removes HTML tags. That's something completely different.
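
To illustrate the difference, here is a small self-contained sketch. DecodeBasicEntities is a hypothetical stand-in for the HTMLDecoder from the first post, limited to four entities so that it runs on its own. It turns entities into characters and leaves real tags untouched.

```purebasic
; Illustrative only: DecodeBasicEntities is a hypothetical stand-in for
; HTMLDecoder, limited to the four basic entities.
Procedure.s DecodeBasicEntities(s.s)
  s = ReplaceString(s, "&lt;", "<")
  s = ReplaceString(s, "&gt;", ">")
  s = ReplaceString(s, "&quot;", Chr(34))
  s = ReplaceString(s, "&amp;", "&") ; last, so "&amp;lt;" becomes "&lt;" and not "<"
  ProcedureReturn s
EndProcedure

Debug DecodeBasicEntities("&lt;b&gt;bold&lt;/b&gt;") ; entities become markup characters
Debug DecodeBasicEntities("<b>bold</b>")             ; no entities, returned unchanged
```

Removing the tags themselves is a separate job for a tag stripper.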

Post by Little John »

luis wrote:

Code: Select all

DataSection
 HTML_DECODER_DATA:
    
 Data.s  """    : Data.i 34
Hi Luis,

did that ever work with older PB versions? :o
Now, e.g. with PB 6.03 beta 4, we have to write it differently, for instance

Code: Select all

Data.s ~"\""
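
For reference, a short sketch of three ways to write a one-character string holding a double quote in recent PureBasic versions. The variable names are only illustrative. As far as I know, the first two are literals/constants and so can be used after Data.s, while Chr(34) is a function call and cannot.

```purebasic
Define sEscaped.s  = ~"\""     ; escape-sequence string literal (the form used above)
Define sConstant.s = #DQUOTE$  ; built-in string constant
Define sFromCode.s = Chr(34)   ; built from the character code

If sEscaped = sConstant And sConstant = sFromCode
  Debug "all three hold the same one-character string"
EndIf
```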

Post by BarryG »

@Little John: Yeah, I had two sets of HTML-related decoding issues that I needed to fix. But both are solved now with some quick hacky code.

Post by luis »

Little John wrote: Mon Aug 14, 2023 8:29 am did that ever work with older PB versions? :o
Hi, it must have worked in 2009: it doesn't even compile now, and at the time I obviously compiled it :?
I've updated it with your patch, thanks.