Question about HTML
Posted: Sun Apr 21, 2013 7:10 pm
by J@ckWhiteIII
Hello,
I have a question about HTML. I'd like to know if ALL tags have ALL the attributes that exist (height, margin, bgcolor, frontcolor).
I want to know this because I'd like to read HTML text using PureBasic. I thought I'd need a tree structure to handle all that, so my idea was to make a structure for tags. This structure would contain things like width, height, x, y, margin etc. My approach so far:
Code:
Global inputt.s = "<html><head><title>Titel</title></head><body><div><p>paragraph</p></div></body></html>"
Global i.i = -1
Global output.s = ""
Global *start.Character
Enumeration
#html
#body
#head
#footer
#p
#div
#link
#br
#script
#meta
#title
EndEnumeration
Structure tag
frontColor.i
backColor.i
type.i
EndStructure
Structure br
text.s
EndStructure
Structure p
text.s
List brs.br()
EndStructure
Structure link
text.s
href.s
EndStructure
Structure div
List ps.p()
List links.link()
EndStructure
Structure footer
List ps.p()
List links.link()
EndStructure
Structure body
List divs.div()
List footers.footer()
EndStructure
Structure meta
content.s
EndStructure
Structure title
text.s
EndStructure
Structure script
type.s
EndStructure
Structure head
List titles.title()
List metas.meta()
List scripts.script()
EndStructure
Structure html
List bodies.body()
List heads.head()
EndStructure
Procedure.i examineTag(output$,lookuponly.l = 0)
old_i.i = i
Select *start\c ;just a start
Case 'a' To 'z','A' To 'Z' ;etc
While *start\c >= '0' And *start\c <= '9'
output$ + Chr(*start\c)
*start + SizeOf(Character)
Wend
EndSelect
If lookuponly
i = old_i
EndIf
ProcedureReturn
EndProcedure
Procedure findTag(input$)
intag = #False
While i<Len(input$)
i=i+1
If Not intag And Mid(input$,i,1) = "<"
intag = #True
Continue
EndIf
If intag And Mid(input$,i,1) = ">"
intag = #False
Continue
EndIf
If Not intag
text.s = ""
text = text + Mid(input$,i,1)
output = output + text+#CRLF$
EndIf
If intag
text.s = ""
If Mid(input$,i,1) <> "<"
text = text + Mid(input$,i,1)
EndIf
output = output + text
EndIf
Wend
EndProcedure
findTag(inputt)
Debug output
I really don't know if what I did there makes sense at all, or if there is a much easier way.
Maybe someone has an answer to my question and/or could help with the structure.
Looking forward to hearing from you all.
Re: Question about HTML
Posted: Mon Apr 22, 2013 10:03 am
by Mohawk70
J@ckWhiteIII wrote:Hello,
I have a question about HTML. I'd like to know if ALL tags have ALL the attributes that exist (height, margin, bgcolor, frontcolor).
I want to know this because I'd like to read HTML text using PureBasic. I thought I'd need a tree structure to handle all that, so my idea was to make a structure for tags. This structure would contain things like width, height, x, y, margin etc. My approach so far:
See http://www.w3schools.com/tags/ref_stand ... ibutes.asp for a list of global attributes shared by all tags. You will also find the additional attributes for each tag.
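Since only the global attributes are shared by every tag, one possible approach is a single generic structure that stores whatever attributes a tag actually carries, instead of one structure per tag. The following is only a minimal sketch: the names HtmlNode, nodes() and the parent index are assumptions made up for illustration, and the tree is kept as a flat array in which each node remembers the index of its parent.

```purebasic
EnableExplicit

; Hypothetical generic node: every tag uses the same structure,
; and only the attributes that are actually present get stored.
Structure HtmlNode
  name.s          ; tag name, e.g. "div"
  text.s          ; text found directly inside the tag
  parent.i        ; index of the parent node in nodes(), -1 for the root
  Map attr.s()    ; attribute name -> attribute value
EndStructure

Dim nodes.HtmlNode(1)

nodes(0)\name   = "body"
nodes(0)\parent = -1
nodes(0)\attr("bgcolor") = "#ffffff"

nodes(1)\name   = "p"
nodes(1)\text   = "paragraph"
nodes(1)\parent = 0
nodes(1)\attr("class") = "intro"

Define i
For i = 0 To ArraySize(nodes())
  Debug nodes(i)\name + " (parent index: " + Str(nodes(i)\parent) + ")"
  ForEach nodes(i)\attr()
    Debug "  " + MapKey(nodes(i)\attr()) + " = " + nodes(i)\attr()
  Next
Next
```

With a map, a tag that has no "href" simply has no "href" entry, so the structure does not need a field for every attribute that exists.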
Re: Question about HTML
Posted: Mon Apr 22, 2013 3:35 pm
by helpy
Re: Question about HTML
Posted: Mon Apr 22, 2013 4:06 pm
by J@ckWhiteIII
Thank you a lot for those links, they gave me information I need.
But I have another question, this time about the PureBasic code. Does my first attempt make sense or is it a "fail" already? I don't really know if what I did makes sense or whether it is appropriate or not. I'd appreciate information and/or suggestions to improve that little code snippet.
Once again, thank you!
Re: Question about HTML
Posted: Mon Apr 22, 2013 9:02 pm
by Deluxe0321
I would do it like this:
Code:
InitNetwork()
;// html tag expression --> http://haacked.com/archive/2004/10/25/usingregularexpressionstomatchhtml.aspx
Expression.s = "</?\w+((\s+\w+(\s*=\s*(?:"+#DQUOTE$+".*?"+#DQUOTE$+"|'.*?'|[^'"+#DQUOTE$+">\s]+))?)+\s*|\s*)/?>"
;// html attribute expression --> http://stackoverflow.com/a/317081
AExpression.s = "(\S+)=["+#DQUOTE$+"']?((?:.(?!["+#DQUOTE$+"']?\s+(?:\S+)=|[>"+#DQUOTE$+"']))+.)["+#DQUOTE$+"']?"
RegEx.i = CreateRegularExpression(#PB_Any,Expression.s)
ARegEx.i = CreateRegularExpression(#PB_Any,AExpression.s)
If RegEx.i and ARegEx.i
If ReceiveHTTPFile("http://www.purebasic.fr/english/", GetHomeDirectory()+"tmp.html")
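If OpenFile(0,GetHomeDirectory()+"tmp.html")
*mem = AllocateMemory(Lof(0)+1) ;// one extra (zeroed) byte, so PeekS() always finds a terminator
If *mem
ReadData(0,*mem,Lof(0))
Content.s = PeekS(*mem,-1,#PB_Ascii) ;// only read the buffer if it was actually allocated
FreeMemory(*mem)
EndIf
CloseFile(0) ;// close the file even if the allocation failed
EndIf
DeleteFile(GetHomeDirectory()+"tmp.html")
Debug Content.s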
Dim Arr.s(0)
Dim AArr.s(0)
ArrSize.i = ExtractRegularExpression(RegEx.i,Content.s,Arr())
If ArrSize.i
Debug "Found "+Str(ArrSize.i)+" HTML Tags:"
For i=0 To ArrSize.i - 1
ReDim AArr(0)
AArrSize.i = ExtractRegularExpression(ARegEx.i,Arr(i),AArr())
Debug Space(4) + "HTML-Tag:"
Debug Space(8) + Arr(i)
If AArrSize
Debug Space(4) + "Attributes: "
For i2 = 0 To AArrSize.i - 1
Debug Space(8) + AArr(i2)
Next
EndIf
Next
EndIf
EndIf
EndIf
Re: Question about HTML
Posted: Tue Apr 23, 2013 4:30 pm
by J@ckWhiteIII
Wow, thank you a lot for that demonstration! I'll have to do some research on regular expressions to understand it, though.
Those links you added are full of useful information to me, thanks a lot.
But I must still ask: how do people come up with these RegExes? Looking at them, I don't understand a word. Why exactly that combination of characters?
As I said, I'm going to keep working on my project for now and (hopefully) come to understand regular expressions over time.
Thank you all.
Re: Question about HTML
Posted: Tue Apr 23, 2013 9:40 pm
by Deluxe0321
Understanding regular expressions is not that hard. Simply use Google to find good tutorials or, even easier, search for expressions that others have already written.
Usually a search query like "regex html tag" or "regex XYZ" is enough to get the right expression string.
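To make such an expression less cryptic, it can help to read a short one piece by piece. The pattern below is a deliberately simple one made up for illustration (it is not one of the expressions from this thread), and it only finds tag names:

```purebasic
EnableExplicit

; The pattern, read from left to right:
;   <       a literal opening angle bracket
;   /?      an optional slash (present in closing tags)
;   (\w+)   group 1: one or more word characters, i.e. the tag name
Define rex = CreateRegularExpression(#PB_Any, "</?(\w+)")

If rex
  If ExamineRegularExpression(rex, "<div><p>text</p></div>")
    While NextRegularExpressionMatch(rex)
      ; whole match on the left, captured tag name on the right
      Debug RegularExpressionMatchString(rex) + " -> " + RegularExpressionGroup(rex, 1)
    Wend
  EndIf
  FreeRegularExpression(rex)
Else
  Debug RegularExpressionError()
EndIf
```

The longer expressions are built from the same kinds of pieces (literals, quantifiers, groups, character classes), just with more alternatives.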
An awesome (cheat) sheet, not only for RegEx, can be found here:
http://overapi.com/regex/
Have fun with your project,
Deluxe0321
Re: Question about HTML
Posted: Thu Jul 05, 2018 3:35 pm
by Kwai chang caine
I know this thread is several years old, but Deluxe0321 wrote some really nice code.
Does someone know how to obtain the text between the tags instead of the attributes?
Code:
AExpression.s = "(\S+)=["+#DQUOTE$+"']?((?:.(?!["+#DQUOTE$+"']?\s+(?:\S+)=|[>"+#DQUOTE$+"']))+.)["+#DQUOTE$+"']?"
Code:
<TAG>I search to keep this text</TAG>
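One possible way to get at that text is a capture group read with RegularExpressionGroup(). This is only a sketch under assumptions: the pattern is made up for illustration, and it does not handle a tag nested inside another tag of the same name.

```purebasic
EnableExplicit

; Hypothetical pattern, not from this thread:
;   <(\w+)[^>]*>   opening tag, tag name captured as group 1
;   (.*?)          group 2: the text, as short as possible
;   </\1>          closing tag with the same name as group 1
Define rex = CreateRegularExpression(#PB_Any, "<(\w+)[^>]*>(.*?)</\1>", #PB_RegularExpression_DotAll)

If rex
  If ExamineRegularExpression(rex, "<TAG>I search to keep this text</TAG>")
    While NextRegularExpressionMatch(rex)
      ; tag name, then the text between the opening and closing tag
      Debug RegularExpressionGroup(rex, 1) + ": " + RegularExpressionGroup(rex, 2)
    Wend
  EndIf
  FreeRegularExpression(rex)
Else
  Debug RegularExpressionError()
EndIf
```

The #PB_RegularExpression_DotAll flag lets the dot match line breaks as well, so text that spans several lines is still captured.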
Re: Question about HTML
Posted: Thu Jul 05, 2018 9:17 pm
by Bitblazer
Just include and use the HTML Agility Pack.
PS: if a mod thinks this is inappropriate due to the age of the original thread, feel free to erase this post.
Re: Question about HTML
Posted: Fri Jul 20, 2018 7:46 pm
by Little John
Thanks to Deluxe0321 for these two interesting Regular Expressions, and thanks to KCC for digging out this old thread!
Deluxe0321's code does three different things:
- Create two Regular expressions.
- Fill a string with HTML code from a web page.
- Parse that string, using the Regular Expressions.
For more clarity, I separated the parts (and changed some other small things).
My version looks like this:
Code:
EnableExplicit
; html tag expression --> http://haacked.com/archive/2004/10/25/usingregularexpressionstomatchhtml.aspx
#Rex_HTML_Tags$ = "</?\w+((\s+\w+(\s*=\s*(?:" + #DQUOTE$ + ".*?" + #DQUOTE$ + "|'.*?'|[^'" + #DQUOTE$+">\s]+))?)+\s*|\s*)/?>"
; html attribute expression --> http://stackoverflow.com/a/317081
#Rex_HTML_Attributes$ = "(\S+)=[" + #DQUOTE$ + "']?((?:.(?![" + #DQUOTE$+"']?\s+(?:\S+)=|[>" + #DQUOTE$+"']))+.)[" + #DQUOTE$+"']?"
Define.i s_HTMLTags, s_HTML_Attr
Procedure InitParseHTML ()
Shared s_HTMLTags, s_HTML_Attr
s_HTMLTags = CreateRegularExpression(#PB_Any, #Rex_HTML_Tags$)
s_HTML_Attr = CreateRegularExpression(#PB_Any, #Rex_HTML_Attributes$)
EndProcedure
Procedure ParseHTMLString (html$)
Shared s_HTMLTags, s_HTML_Attr
Protected.i numTags, numAttr, t, a
Protected Dim tags$(0)
Protected Dim attr$(0)
If s_HTMLTags And s_HTML_Attr
numTags = ExtractRegularExpression(s_HTMLTags, html$, tags$())
Debug "Found " + numTags + " HTML tags:"
For t = 0 To numTags - 1
Debug Space(4) + "HTML tag:"
Debug Space(8) + tags$(t)
numAttr = ExtractRegularExpression(s_HTML_Attr, tags$(t), attr$())
If numAttr
Debug Space(4) + "Attributes: "
For a = 0 To numAttr - 1
Debug Space(8) + attr$(a)
Next
EndIf
Next
EndIf
EndProcedure
Procedure.s DownloadedHTMLPage (url$)
Protected *mem, ret$=""
If ReceiveHTTPFile(url$, GetHomeDirectory() + "tmp.html")
If OpenFile(0, GetHomeDirectory() + "tmp.html")
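*mem = AllocateMemory(Lof(0) + 1) ; one extra (zeroed) byte, so PeekS() always finds a terminator
If *mem
ReadData(0, *mem, Lof(0))
ret$ = PeekS(*mem, -1, #PB_UTF8)
FreeMemory(*mem)
EndIf
CloseFile(0) ; close the file even if the allocation failed
DeleteFile(GetHomeDirectory() + "tmp.html")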
EndIf
EndIf
ProcedureReturn ret$
EndProcedure
; -- Demo
Define page$
InitNetwork()
InitParseHTML()
page$ = DownloadedHTMLPage("http://www.purebasic.fr/english/")
Debug page$
Debug "----------------------------------------------------------------------"
If page$ <> ""
ParseHTMLString(page$)
EndIf
Re: Question about HTML
Posted: Fri Jul 20, 2018 7:51 pm
by Little John
@KCC:
In another thread, Little John wrote:Though PureBasic has the built-in function ExtractRegularExpression(), this will only retrieve the parts of the source string which match the Regular Expression, while the parts of the source string which don't match won't be retrieved.
That's why I wrote the procedure SplitByRegEx(), which retrieves the matching parts AND the non-matching parts.
Using it in the context here, it looks like this:
Code:
EnableExplicit
Structure SplitString
s$
match.i
EndStructure
Procedure.i SplitByRegEx (regEx.i, source$, List part.SplitString())
; -- split a string into parts, that match or don't match a Regular Expression
; in : regEx : number of a Regular Expression generated by CreateRegularExpression()
; source$: string to be split into parts
; out: part() : resulting list of parts
; return value: number of elements in part():
; 0 if source$ = "", > 0 otherwise;
; -1 on error
Protected.i left, right
If ExamineRegularExpression(regEx, source$) = 0
ProcedureReturn -1 ; error
EndIf
ClearList(part())
left = 1
While NextRegularExpressionMatch(regEx)
right = RegularExpressionMatchPosition(regEx)
If left < right
AddElement(part())
part()\s$ = Mid(source$, left, right-left)
part()\match = #False
EndIf
AddElement(part())
part()\s$ = RegularExpressionMatchString(regEx)
part()\match = #True
left = right + RegularExpressionMatchLength(regEx)
Wend
If left <= Len(source$)
AddElement(part())
part()\s$ = Mid(source$, left)
part()\match = #False
EndIf
ProcedureReturn ListSize(part()) ; success
EndProcedure
#WhiteSpace$ = ~" \t\r\n"
Procedure.i IsSolelyWhiteSpace (s$)
Protected.i last, i
last = Len(s$)
For i = 1 To last
If FindString(#WhiteSpace$, Mid(s$, i, 1)) = 0
ProcedureReturn #False
EndIf
Next
ProcedureReturn #True
EndProcedure
; html tag expression --> http://haacked.com/archive/2004/10/25/usingregularexpressionstomatchhtml.aspx
#Rex_HTML_Tags$ = "</?\w+((\s+\w+(\s*=\s*(?:" + #DQUOTE$ + ".*?" + #DQUOTE$ + "|'.*?'|[^'" + #DQUOTE$+">\s]+))?)+\s*|\s*)/?>"
; html attribute expression --> http://stackoverflow.com/a/317081
#Rex_HTML_Attributes$ = "(\S+)=[" + #DQUOTE$ + "']?((?:.(?![" + #DQUOTE$+"']?\s+(?:\S+)=|[>" + #DQUOTE$+"']))+.)[" + #DQUOTE$+"']?"
Define.i s_HTMLTags, s_HTML_Attr
Procedure InitParseHTML ()
Shared s_HTMLTags, s_HTML_Attr
s_HTMLTags = CreateRegularExpression(#PB_Any, #Rex_HTML_Tags$)
s_HTML_Attr = CreateRegularExpression(#PB_Any, #Rex_HTML_Attributes$)
EndProcedure
Procedure ParseHTMLString (html$)
Shared s_HTMLTags, s_HTML_Attr
Protected.i numPieces, numAttr, a
Protected NewList piece.SplitString()
Protected Dim attr$(0)
If s_HTMLTags And s_HTML_Attr
numPieces = SplitByRegEx(s_HTMLTags, html$, piece())
Debug "Found " + numPieces + " pieces (tags + stuff outside of tags):"
ForEach piece()
If piece()\match
Debug Space(4) + "HTML tag:"
Debug Space(8) + piece()\s$
numAttr = ExtractRegularExpression(s_HTML_Attr, piece()\s$, attr$())
If numAttr
Debug Space(4) + "Attributes: "
For a = 0 To numAttr - 1
Debug Space(8) + attr$(a)
Next
EndIf
ElseIf Not IsSolelyWhiteSpace(piece()\s$)
Debug Space(4) + "Outside of tags:"
Debug Space(8) + piece()\s$
EndIf
Next
EndIf
EndProcedure
Procedure.s DownloadedHTMLPage (url$)
Protected *mem, ret$=""
*mem = ReceiveHTTPMemory(url$)
If *mem
ret$ = PeekS(*mem, MemorySize(*mem), #PB_UTF8 | #PB_ByteLength) ; the buffer is not null-terminated, so pass its size in bytes
FreeMemory(*mem)
EndIf
ProcedureReturn ret$
EndProcedure
; -- Demo
Define page$
InitNetwork()
InitParseHTML()
page$ = DownloadedHTMLPage("http://www.purebasic.fr/english/")
Debug page$
Debug "----------------------------------------------------------------------"
If page$ <> ""
ParseHTMLString(page$)
EndIf
Debug "=========================================================================="
page$ = ~"<!DOCTYPE html>\n" +
~"<html lang='de'>\n" +
~"<head>\n" +
~" <meta charset='utf-8'>\n" +
~" <title>Cool title</title>\n" +
~"</head>\n" +
~"<body>\n" +
~" <p>Uuuh, a parapgraph!</p>\n" +
~" <!--\n" +
~" Comment\n" +
~" -->\n" +
~"</body>\n" +
~"</html>"
Debug page$
Debug "----------------------------------------------------------------------"
ParseHTMLString(page$)
I also simplified the procedure DownloadedHTMLPage(), but that's not the point here.
The second example in this code shows that the HTML comment tag <!-- ... --> is not recognized as a tag, but as text outside of tags. So there is some room for improvement in the #Rex_HTML_Tags$ regex pattern.
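One conceivable improvement is to put an extra alternative for comments in front of the existing pattern. The sketch below is an assumption rather than a tested fix for real-world pages: the constant name #Rex_HTML_TagsAndComments$ is made up, and [\s\S]*? is used so that a comment may span several lines.

```purebasic
EnableExplicit

; Original tag pattern, as used in the code above
#Rex_HTML_Tags$ = "</?\w+((\s+\w+(\s*=\s*(?:" + #DQUOTE$ + ".*?" + #DQUOTE$ + "|'.*?'|[^'" + #DQUOTE$+">\s]+))?)+\s*|\s*)/?>"

; Assumption: try "<!-- ... -->" first, then fall back to the tag pattern
#Rex_HTML_TagsAndComments$ = "<!--[\s\S]*?-->|" + #Rex_HTML_Tags$

Define rex = CreateRegularExpression(#PB_Any, #Rex_HTML_TagsAndComments$)
Define n, i
Dim hit$(0)

If rex
  n = ExtractRegularExpression(rex, ~"<p>Text</p>\n<!--\n  Comment\n-->", hit$())
  For i = 0 To n - 1
    Debug hit$(i)   ; each tag, and the comment as one single match
  Next
  FreeRegularExpression(rex)
Else
  Debug RegularExpressionError()
EndIf
```

The same limitation applies to <!DOCTYPE html>, because the original pattern requires a word character right after the opening bracket; it would need its own alternative as well.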
Re: Question about HTML
Posted: Sat Jul 21, 2018 10:02 am
by Marc56us
Though PureBasic has the built-in function ExtractRegularExpression(), this will only retrieve the parts of the source string which match the Regular Expression, while the parts of the source string which don't match won't be retrieved.
Replaces the RegEx result with nothing and keeps the rest of the initial character string
ReplaceRegularExpression()
Re: Question about HTML
Posted: Sat Jul 21, 2018 10:40 am
by Little John
Marc56us wrote:Replaces the RegEx result with nothing and keeps the rest of the initial character string
I don't see how this will give the same result as my code above. Can you please give some working example code?
Re: Question about HTML
Posted: Sat Jul 21, 2018 2:31 pm
by Marc56us
I'm not trying to get the same result, I'm just indicating that if you want to retrieve the text that doesn't match the expression, just replace what matches with nothing (in other words, delete the text that matches the regex).
So what's left is what doesn't match.
Though PureBasic has the built-in function ExtractRegularExpression(), this will only retrieve the parts of the source string which match the Regular Expression, while the parts of the source string which don't match won't be retrieved.
Code:
If CreateRegularExpression(0, <SomeThingToRemove>) ; placeholder: the pattern to remove
Text_left$ = ReplaceRegularExpression(0, Source_Text$, "")
FreeRegularExpression(0)
EndIf
Re: Question about HTML
Posted: Sat Jul 21, 2018 3:12 pm
by Little John
Marc56us wrote:I'm not trying to get the same result
Aha.
Marc56us wrote:I'm just indicating that if you want to retrieve the text that doesn't match the expression, just replace what matches with nothing
That does not address the problem at hand and thus does not yield the desired result.
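A small sketch of how the two approaches differ, assuming a simplified tag pattern (<[^>]+> is made up for illustration and is not the pattern used earlier in this thread): removing the matches leaves one merged string, so the information about where one piece of text ends and the next begins is gone.

```purebasic
EnableExplicit

; Simplified, hypothetical tag pattern: "<", anything except ">", then ">"
Define rex = CreateRegularExpression(#PB_Any, "<[^>]+>")

If rex
  ; Both texts end up glued together in one string:
  Debug ReplaceRegularExpression(rex, "<title>Cool title</title><p>A paragraph</p>", "")
  FreeRegularExpression(rex)
Else
  Debug RegularExpressionError()
EndIf
```

Splitting the source at the matches keeps every non-matching part as a separate element, which is presumably what is needed to tell the text of one tag from the text of the next.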