luis
Addict
Posts: 3895 Joined: Wed Aug 31, 2005 11:09 pm
Location: Italy
by luis » Thu Feb 05, 2009 9:03 pm
To complement UrlDecoder / UrlEncoder.
Encode / decode special chars from / to HTML.
Code: Select all
EnableExplicit
DataSection
  ; entity name : character code (list terminated by an empty string and 0)
  HTML_DECODER_DATA:
  Data.s "&quot;" : Data.i 34
  Data.s "&amp;" : Data.i 38
  Data.s "&lt;" : Data.i 60
  Data.s "&gt;" : Data.i 62
  Data.s "&nbsp;" : Data.i 160
  Data.s "&iexcl;" : Data.i 161
  Data.s "&cent;" : Data.i 162
  Data.s "&pound;" : Data.i 163
  Data.s "&curren;" : Data.i 164
  Data.s "&yen;" : Data.i 165
  Data.s "&brvbar;" : Data.i 166
  Data.s "&sect;" : Data.i 167
  Data.s "&uml;" : Data.i 168
  Data.s "&copy;" : Data.i 169
  Data.s "&ordf;" : Data.i 170
  Data.s "&laquo;" : Data.i 171
  Data.s "&not;" : Data.i 172
  Data.s "&shy;" : Data.i 173
  Data.s "&reg;" : Data.i 174
  Data.s "&macr;" : Data.i 175
  Data.s "&deg;" : Data.i 176
  Data.s "&plusmn;" : Data.i 177
  Data.s "&sup2;" : Data.i 178
  Data.s "&sup3;" : Data.i 179
  Data.s "&acute;" : Data.i 180
  Data.s "&micro;" : Data.i 181
  Data.s "&para;" : Data.i 182
  Data.s "&middot;" : Data.i 183
  Data.s "&cedil;" : Data.i 184
  Data.s "&sup1;" : Data.i 185
  Data.s "&ordm;" : Data.i 186
  Data.s "&raquo;" : Data.i 187
  Data.s "&frac14;" : Data.i 188
  Data.s "&frac12;" : Data.i 189
  Data.s "&frac34;" : Data.i 190
  Data.s "&iquest;" : Data.i 191
  Data.s "&Agrave;" : Data.i 192
  Data.s "&Aacute;" : Data.i 193
  Data.s "&Acirc;" : Data.i 194
  Data.s "&Atilde;" : Data.i 195
  Data.s "&Auml;" : Data.i 196
  Data.s "&Aring;" : Data.i 197
  Data.s "&AElig;" : Data.i 198
  Data.s "&Ccedil;" : Data.i 199
  Data.s "&Egrave;" : Data.i 200
  Data.s "&Eacute;" : Data.i 201
  Data.s "&Ecirc;" : Data.i 202
  Data.s "&Euml;" : Data.i 203
  Data.s "&Igrave;" : Data.i 204
  Data.s "&Iacute;" : Data.i 205
  Data.s "&Icirc;" : Data.i 206
  Data.s "&Iuml;" : Data.i 207
  Data.s "&ETH;" : Data.i 208
  Data.s "&Ntilde;" : Data.i 209
  Data.s "&Ograve;" : Data.i 210
  Data.s "&Oacute;" : Data.i 211
  Data.s "&Ocirc;" : Data.i 212
  Data.s "&Otilde;" : Data.i 213
  Data.s "&Ouml;" : Data.i 214
  Data.s "&times;" : Data.i 215
  Data.s "&Oslash;" : Data.i 216
  Data.s "&Ugrave;" : Data.i 217
  Data.s "&Uacute;" : Data.i 218
  Data.s "&Ucirc;" : Data.i 219
  Data.s "&Uuml;" : Data.i 220
  Data.s "&Yacute;" : Data.i 221
  Data.s "&THORN;" : Data.i 222
  Data.s "&szlig;" : Data.i 223
  Data.s "&agrave;" : Data.i 224
  Data.s "&aacute;" : Data.i 225
  Data.s "&acirc;" : Data.i 226
  Data.s "&atilde;" : Data.i 227
  Data.s "&auml;" : Data.i 228
  Data.s "&aring;" : Data.i 229
  Data.s "&aelig;" : Data.i 230
  Data.s "&ccedil;" : Data.i 231
  Data.s "&egrave;" : Data.i 232
  Data.s "&eacute;" : Data.i 233
  Data.s "&ecirc;" : Data.i 234
  Data.s "&euml;" : Data.i 235
  Data.s "&igrave;" : Data.i 236
  Data.s "&iacute;" : Data.i 237
  Data.s "&icirc;" : Data.i 238
  Data.s "&iuml;" : Data.i 239
  Data.s "&eth;" : Data.i 240
  Data.s "&ntilde;" : Data.i 241
  Data.s "&ograve;" : Data.i 242
  Data.s "&oacute;" : Data.i 243
  Data.s "&ocirc;" : Data.i 244
  Data.s "&otilde;" : Data.i 245
  Data.s "&ouml;" : Data.i 246
  Data.s "&divide;" : Data.i 247
  Data.s "&oslash;" : Data.i 248
  Data.s "&ugrave;" : Data.i 249
  Data.s "&uacute;" : Data.i 250
  Data.s "&ucirc;" : Data.i 251
  Data.s "&uuml;" : Data.i 252
  Data.s "&yacute;" : Data.i 253
  Data.s "&thorn;" : Data.i 254
  Data.s "&yuml;" : Data.i 255
  Data.s "" : Data.i 0
EndDataSection

Procedure.i IsDigit (sChar.s)
  Protected flgIsDigit = #False
  Protected iChar = Asc(sChar)

  If iChar >= '0' And iChar <= '9' : flgIsDigit = #True : EndIf

  ProcedureReturn flgIsDigit
EndProcedure

Procedure.i IsAllDigit (sString.s)
  Protected flgIsAllDigit = #False
  Protected iLen = Len(sString)
  Protected k

  If iLen
    flgIsAllDigit = #True
    For k = 1 To iLen
      If IsDigit(Mid(sString, k, 1)) = #False
        flgIsAllDigit = #False
        Break
      EndIf
    Next
  EndIf

  ProcedureReturn flgIsAllDigit
EndProcedure

Procedure.s HTMLDecoder (sEncodedString.s)
  Protected sHTMLEntity.s, iHTMLChar, sOutString.s = sEncodedString
  Protected sTemp.s, iStart, iEnd

  ; replace every named entity from the table with its character
  Restore HTML_DECODER_DATA: Read.s sHTMLEntity: Read.i iHTMLChar
  Repeat
    sOutString = ReplaceString(sOutString, sHTMLEntity, Chr(iHTMLChar))
    Read.s sHTMLEntity: Read.i iHTMLChar
  Until iHTMLChar = 0

  ; look for a numeric entity such as "&#39;" or "&#039;"
  iStart = FindString(sOutString, "&#", 1)
  iEnd = FindString(sOutString, ";", iStart)

  If (iStart > 0) And (iEnd > 0) ; found something
    sTemp = Mid(sOutString, iStart + 2, iEnd - iStart - 2)
    If IsAllDigit(sTemp)
      sOutString = ReplaceString(sOutString, "&#" + sTemp + ";", Chr(Val(sTemp)))
    EndIf
  EndIf

  ProcedureReturn sOutString
EndProcedure

Procedure.s HTMLEncoder (sString.s)
  Protected sHTMLEntity.s, iHTMLChar, sOutString.s = sString

  ; replace every character from the table with its named entity
  Restore HTML_DECODER_DATA: Read.s sHTMLEntity: Read.i iHTMLChar
  Repeat
    sOutString = ReplaceString(sOutString, Chr(iHTMLChar), sHTMLEntity)
    Read.s sHTMLEntity: Read.i iHTMLChar
  Until iHTMLChar = 0

  ProcedureReturn sOutString
EndProcedure
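
The decoder above only handles the first numeric entity it finds (plus any repeats of that same entity). A minimal sketch of one way to decode every decimal numeric entity in a single scan is shown below. `DecodeNumericEntities` is a hypothetical helper name, not part of the original code, and it assumes the decoded values are within the range that `Chr()` supports in your PB version.

```purebasic
EnableExplicit

; Hypothetical helper (not in the original post): decodes all decimal
; numeric entities such as "&#39;" or "&#039;" in one left-to-right scan.
Procedure.s DecodeNumericEntities(sText.s)
  Protected sOut.s, sDigits.s
  Protected iPos = 1, iStart, iEnd, k, iChar, flgDigits

  Repeat
    iStart = FindString(sText, "&#", iPos)
    If iStart = 0 : Break : EndIf
    iEnd = FindString(sText, ";", iStart + 2)
    If iEnd = 0 : Break : EndIf

    ; the candidate digits sit between "&#" and ";"
    sDigits = Mid(sText, iStart + 2, iEnd - iStart - 2)
    flgDigits = #True
    If Len(sDigits) = 0 : flgDigits = #False : EndIf
    For k = 1 To Len(sDigits)
      iChar = Asc(Mid(sDigits, k, 1))
      If iChar < '0' Or iChar > '9'
        flgDigits = #False
        Break
      EndIf
    Next

    ; copy the plain text that precedes the "&#"
    If iStart > iPos
      sOut + Mid(sText, iPos, iStart - iPos)
    EndIf

    If flgDigits
      sOut + Chr(Val(sDigits))
      iPos = iEnd + 1
    Else
      ; not a decimal entity: keep "&#" literally and continue after it
      sOut + "&#"
      iPos = iStart + 2
    EndIf
  ForEver

  ; append whatever is left after the last entity
  ProcedureReturn sOut + Mid(sText, iPos)
EndProcedure

Debug DecodeNumericEntities("it&#39;s &#60;b&#62;")
```

Hexadecimal entities such as "&#x27;" are left untouched by this sketch.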
Last edited by luis on Mon Aug 14, 2023 10:54 am, edited 1 time in total.
Joakim Christiansen
Addict
Posts: 2452 Joined: Wed Dec 22, 2004 4:12 pm
Location: Norway
by Joakim Christiansen » Thu Feb 05, 2009 9:16 pm
That is kinda sweet, will sure come in handy one day
I like logic, hence I dislike humans but love computers.
kenmo
Addict
Posts: 2045 Joined: Tue Dec 23, 2003 3:54 am
by kenmo » Thu Feb 05, 2009 10:17 pm
Very useful, I'll need something like this soon!
One glaring optimization stands out to me though:
You're scanning through the entire HTML string (via ReplaceString()) for EVERY eligible character (almost 100!).
Seems like a single manual pass, de/encoding characters as you go, would perform much better. Might be plenty fast as it is, however.
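
A minimal sketch of what such a single-pass encoder could look like, assuming only the four markup-significant characters need named entities and the remaining Latin-1 characters (160 to 255) can be written as numeric entities. `HTMLEncodeSinglePass` is a hypothetical name, not something from luis' code.

```purebasic
EnableExplicit

; Hypothetical single-pass encoder: visits each character once and appends
; either the character itself or its entity to the output string.
Procedure.s HTMLEncodeSinglePass(sString.s)
  Protected sOut.s, iChar, k
  Protected iLen = Len(sString)

  For k = 1 To iLen
    iChar = Asc(Mid(sString, k, 1))
    Select iChar
      Case 34 : sOut + "&quot;"
      Case 38 : sOut + "&amp;"
      Case 60 : sOut + "&lt;"
      Case 62 : sOut + "&gt;"
      Case 160 To 255 : sOut + "&#" + Str(iChar) + ";" ; numeric form instead of a name table
      Default : sOut + Chr(iChar)
    EndSelect
  Next

  ProcedureReturn sOut
EndProcedure

Debug HTMLEncodeSinglePass("a < b & c") ; shows: a &lt; b &amp; c
```

Because each character is looked at exactly once, an "&" produced by the encoder is never encoded a second time. The repeated string appends are still not free, so for very large inputs writing into a preallocated buffer would likely be the next step.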
by luis » Thu Feb 05, 2009 10:38 pm
kenmo wrote:
Seems like a single manual pass, de/encoding characters as you go, would perform much better. Might be plenty fast as it is, however.
Yes, you are right.
I followed this path because in a real-world scenario (for my uses anyway) I felt it was simply irrelevant. This was easy to code and quite fast.
Very often I prefer simple code over fast code.
For example, if some disk access is involved, optimizing this for speed is probably useless.
If your program interacts with the user, again optimizing this for speed is probably useless.
And so on.
It all depends on what your program does and how much time it would spend inside this code.
idle
Always Here
Posts: 5901 Joined: Fri Sep 21, 2007 5:52 am
Location: New Zealand
by idle » Thu Feb 05, 2009 10:47 pm
thanks luis
by Joakim Christiansen » Thu Feb 05, 2009 11:17 pm
luis wrote:
It all depends on what your program does and how much time it would spend inside this code.
I agree. If you don't need a function to be ultra fast then there is no need to spend time optimizing it. But if you were to parse a lot of files and it needed to be very fast THEN you could try to spend an hour (or whatever) to make it faster.
But with that said don't get me wrong either, if someone posts an optimized version I would most likely use it (if it ain't >100 lines of code).
es_91
Enthusiast
Posts: 298 Joined: Thu Jan 27, 2011 12:00 pm
Location: DE
by es_91 » Wed Jun 25, 2014 10:56 am
I know this thread is pretty old, but I've got to say it:
Th- T-- come on, all together, Th-
THANX LUIS !!!
c4s
Addict
Posts: 1981 Joined: Thu Nov 01, 2007 5:37 pm
Location: Germany
by c4s » Sat Jun 13, 2015 9:58 am
Just needed something like this, thanks luis!
Honestly I wonder why it's not officially implemented yet as it's rather important when dealing with HTML. So I created a feature request:
HTML Entities Encoder/Decoder
If any of you native English speakers have any suggestions for the above text, please let me know (via PM). Thanks!
BarryG
Addict
Posts: 4174 Joined: Thu Apr 18, 2019 8:17 am
by BarryG » Mon Aug 14, 2023 6:05 am
Luis' code fails with this:
Code: Select all
Debug HTMLDecoder("one two (three <img src="+Chr(34)+"images/four.gif"+Chr(34)+" alt="+Chr(34)+":D"+Chr(34)+" /> five )")
Any updates?
Little John
Addict
Posts: 4789 Joined: Thu Jun 07, 2007 3:25 pm
Location: Berlin, Germany
by Little John » Mon Aug 14, 2023 6:26 am
BarryG wrote: Mon Aug 14, 2023 6:05 am
Luis' code fails with this:
Code: Select all
Debug HTMLDecoder("one two (three <img src="+Chr(34)+"images/four.gif"+Chr(34)+" alt="+Chr(34)+":D"+Chr(34)+" /> five )")
Any updates?
Using PB 6.03 beta 4, Luis' above code (with a tiny syntax correction) gives the following result:
one two (three <img src="images/four.gif" alt=":D" /> five )
This is exactly what I expected. Your example text does not contain any HTML entity that needs to be decoded.
by BarryG » Mon Aug 14, 2023 6:34 am
It's meant to remove the image data. There's an opening <img> tag, but no closing </img> one, so it fails. It closes with "/>" instead, which is valid HTML.
But never mind, I'm doing it like this now, which works for my use case:
Code: Select all
Procedure.s StripTags(text$)
  Protected s, e
  Repeat
    s = FindString(text$, "<")
    If s
      e = FindString(text$, ">", s + 1)
      If e
        ; cut out everything from "<" up to and including ">"
        text$ = Left(text$, s - 1) + Mid(text$, e + 1)
      Else
        Break ; unmatched "<": stop instead of looping forever
      EndIf
    EndIf
  Until s = 0
  ProcedureReturn text$
EndProcedure
Debug StripTags("one two (three <img src="+Chr(34)+"images/four.gif"+Chr(34)+" alt="+Chr(34)+":D"+Chr(34)+" /> five )")
Last edited by BarryG on Mon Aug 14, 2023 6:37 am, edited 1 time in total.
by Little John » Mon Aug 14, 2023 6:37 am
BarryG wrote: Mon Aug 14, 2023 6:34 am
It's meant to remove the image data. There's an opening <img> tag, but no closing </img> one, so it fails. It closes with "/>" instead, which is valid HTML.
You misunderstand what the purpose of Luis' code is.
Your original question, which I answered, was actually asking about decoding HTML entities.
Now you are asking in this thread for code that removes HTML tags. That's something completely different.
by Little John » Mon Aug 14, 2023 8:29 am
luis wrote:
Code: Select all
DataSection
HTML_DECODER_DATA:
Data.s """ : Data.i 34
Hi Luis,
did that ever work with older PB versions?
Now, e.g. with PB 6.03 beta 4, we have to write it differently, for instance
by BarryG » Mon Aug 14, 2023 9:43 am
@Little John: Yeah, I had two sets of HTML-related decoding issues that I needed to fix. But both are solved now with some quick hacky code.
by luis » Mon Aug 14, 2023 10:59 am
Little John wrote: Mon Aug 14, 2023 8:29 am
did that ever work with older PB versions?
Hi, in 2009 it must have worked: now it's not even compilable, and at the time I obviously compiled it.
I've updated it with your patch, thanks.
"Have you tried turning it off and on again ?"