					
				Question about HTML
				Posted: Sun Apr 21, 2013 7:10 pm
				by J@ckWhiteIII
				Hello,
I have a question about HTML. I'd like to know whether ALL tags have ALL the attributes that exist (height, margin, bgcolor, frontcolor).
I want to know this because I'd like to read HTML text using PureBasic. I thought I'd need a tree structure to be able to handle all that, so I started making a structure for tags. This structure would contain things like width, height, x, y, margin etc. My approach so far:
Code: Select all
Global inputt.s = "<html><head><title>Titel</title></head><body><div><p>paragraph</p></div></body></html>"
Global i.i = -1
Global output.s = ""
Global *start.Character
Enumeration
  #html
  #body
  #head
  #footer
  #p
  #div
  #link
  #br
  #script
  #meta
  #title
EndEnumeration
Structure tag
  frontColor.i
  backColor.i
  type.i
EndStructure
Structure br
  text.s
EndStructure
Structure p
  text.s
  List brs.br()
EndStructure
Structure link
  text.s
  href.s
EndStructure
Structure div
  List ps.p()
  List links.link()
EndStructure
Structure footer
  List ps.p()
  List links.link()
EndStructure
  
Structure body
  List divs.div()
  List footers.footer()
EndStructure
Structure meta
  content.s
EndStructure
Structure title
  text.s
EndStructure
Structure script
  type.s
EndStructure
Structure head
  List titles.title()
  List metas.meta()
  List scripts.script()
EndStructure
Structure html
  List bodies.body()
  List heads.head()
EndStructure
Procedure.i examineTag(output$,lookuponly.l = 0)
  old_i.i = i
  
  Select *start\c                                       ;just a start
    Case 'a' To 'z','A' To 'Z' ;etc
      While *start\c >= '0' And *start\c <= '9'
        output$ + Chr(*start\c)
        *start + SizeOf(Character)
      Wend
  EndSelect
  
  If lookuponly
    i = old_i
  EndIf
  ProcedureReturn 
EndProcedure
Procedure findTag(input$)
  intag = #False
  While i<Len(input$)
    i=i+1
    If Not intag And Mid(input$,i,1) = "<"
      intag = #True
      Continue
    EndIf
    If intag And Mid(input$,i,1) = ">"
      intag = #False
      Continue
    EndIf
    If Not intag
      text.s = ""
      text = text + Mid(input$,i,1)
      output = output + text+#CRLF$
    EndIf
    If intag
      text.s = ""
      If Mid(input$,i,1) <> "<"
        text = text + Mid(input$,i,1)
      EndIf
      output = output + text
    EndIf
  Wend
EndProcedure
findTag(inputt)
Debug output
I really don't know if what I did there actually makes sense at all or if there'd be a much easier way. 
Maybe someone has an answer to my question and/or could help doing the structure.
Looking forward to hearing from you all
 
			 
			
					
				Re: Question about HTML
				Posted: Mon Apr 22, 2013 10:03 am
				by Mohawk70
				J@ckWhiteIII wrote:Hello,
I have a question about HTML. I'd like to know whether ALL tags have ALL the attributes that exist (height, margin, bgcolor, frontcolor).
I want to know this because I'd like to read HTML text using PureBasic. I thought I'd need a tree structure to be able to handle all that, so I started making a structure for tags. This structure would contain things like width, height, x, y, margin etc. My approach so far:
See http://www.w3schools.com/tags/ref_stand ... ibutes.asp for a list of global attributes shared by all tags. You will also find the additional attributes for each tag.
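Because only some attributes are shared by all tags and the rest differ per tag, one generic structure with a map of attributes may be easier to handle than one structure per tag. The following is only a minimal sketch of that idea: the names HtmlNode, NewNode and DumpNode are made up for illustration, and nothing here parses real HTML.

```purebasic
EnableExplicit

; Hypothetical generic node: one structure for every tag.
; Attributes are stored by name instead of as fixed fields.
Structure HtmlNode
   name$                    ; tag name, e.g. "div"
   text$                    ; text directly inside the tag
   Map attr.s()             ; attribute name -> attribute value
   List *child.HtmlNode()   ; pointers to the child nodes
EndStructure

Procedure.i NewNode (name$, *parent.HtmlNode = 0)
   ; Allocates a node and appends it to *parent, if one is given.
   ; Note: nodes are never freed in this sketch.
   Protected *node.HtmlNode = AllocateStructure(HtmlNode)

   If *node
      *node\name$ = name$
      If *parent
         AddElement(*parent\child())
         *parent\child() = *node
      EndIf
   EndIf

   ProcedureReturn *node
EndProcedure

Procedure DumpNode (*node.HtmlNode, indent.i = 0)
   ; Prints the node, its attributes and, recursively, its children.
   Debug Space(indent) + "<" + *node\name$ + ">"
   ForEach *node\attr()
      Debug Space(indent + 2) + MapKey(*node\attr()) + " = " + *node\attr()
   Next
   ForEach *node\child()
      DumpNode(*node\child(), indent + 2)
   Next
EndProcedure

; -- Demo
Define *html.HtmlNode = NewNode("html")
Define *body.HtmlNode = NewNode("body", *html)
Define *div.HtmlNode  = NewNode("div", *body)

*div\attr("id")    = "main"
*div\attr("style") = "margin:4px"
DumpNode(*html)
```

With a map, a tag only carries the attributes that were actually present in the source, so the structure does not need to know in advance which attributes a given tag supports.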
 
			 
			
					
				Re: Question about HTML
				Posted: Mon Apr 22, 2013 3:35 pm
				by helpy
				
			 
			
					
				Re: Question about HTML
				Posted: Mon Apr 22, 2013 4:06 pm
				by J@ckWhiteIII
				Thank you a lot for those links, they gave me information I need.
But I have another question, this time about the PureBasic code. Does my first attempt make sense or is it a "fail" already? I don't really know if what I did makes sense or whether it is appropriate or not. I'd appreciate information and/or suggestions to improve that little code snippet.
Once again, thank you!
			 
			
					
				Re: Question about HTML
				Posted: Mon Apr 22, 2013 9:02 pm
				by Deluxe0321
				I would do it like this:
Code: Select all
InitNetwork()
;// html tag expression --> http://haacked.com/archive/2004/10/25/usingregularexpressionstomatchhtml.aspx
Expression.s = "</?\w+((\s+\w+(\s*=\s*(?:"+#DQUOTE$+".*?"+#DQUOTE$+"|'.*?'|[^'"+#DQUOTE$+">\s]+))?)+\s*|\s*)/?>"
;// html attribute expression -->  http://stackoverflow.com/a/317081
AExpression.s = "(\S+)=["+#DQUOTE$+"']?((?:.(?!["+#DQUOTE$+"']?\s+(?:\S+)=|[>"+#DQUOTE$+"']))+.)["+#DQUOTE$+"']?"
RegEx.i  = CreateRegularExpression(#PB_Any,Expression.s)
ARegEx.i = CreateRegularExpression(#PB_Any,AExpression.s) 
If RegEx.i and ARegEx.i
  
  If ReceiveHTTPFile("http://www.purebasic.fr/english/", GetHomeDirectory()+"tmp.html")
    
    Content.s = ""
    If OpenFile(0,GetHomeDirectory()+"tmp.html")
      *mem = AllocateMemory(Lof(0))
      If *mem
        ReadData(0,*mem,Lof(0))
        ; the buffer is not null-terminated, so pass its length explicitly
        Content.s = PeekS(*mem,MemorySize(*mem),#PB_Ascii)
        FreeMemory(*mem)
      EndIf
      CloseFile(0)
      DeleteFile(GetHomeDirectory()+"tmp.html")
    EndIf
    
    Debug Content.s
    
    Dim Arr.s(0)
    Dim AArr.s(0)
    
    ArrSize.i = ExtractRegularExpression(RegEx.i,Content.s,Arr())
    If ArrSize.i
      Debug "Found "+Str(ArrSize.i)+" HTML Tags:"
      
      For i=0 To ArrSize.i - 1
        
        ReDim AArr(0)
        
        AArrSize.i = ExtractRegularExpression(ARegEx.i,Arr(i),AArr())
        
        Debug Space(4) + "HTML-Tag:"
        Debug Space(8) + Arr(i)
        
        If AArrSize
          Debug Space(4) + "Attributes: "
          For i2 = 0 To AArrSize.i - 1
            Debug Space(8) + AArr(i2)  
          Next
        EndIf
        
      Next
    EndIf
  EndIf
EndIf
 
			 
			
					
				Re: Question about HTML
				Posted: Tue Apr 23, 2013 4:30 pm
				by J@ckWhiteIII
				Wow, thank you a lot for that demonstration! I'll have to do some research on Regular Expression to understand it, though.
Those links you added are full of useful information to me, thanks a lot.
But I must still ask: How did people find the RegEx's? I mean, looking at that I don't understand a word. Why is it exactly that combination of letters?
As I said, I'm gonna keep working on my project for now and (hopefully) understand Regular Expressions by time.
Thank you all.
			 
			
					
				Re: Question about HTML
				Posted: Tue Apr 23, 2013 9:40 pm
				by Deluxe0321
Understanding Regular Expressions is not that hard: simply use Google to find good tutorials or, even easier, search for expressions that others have already written.
Usually a search query like "regex html tag" or "regex XYZ" is enough to find the right expression string.
An awesome (cheat) sheet, not only for RegEx, can be found here: http://overapi.com/regex/
Have fun with your project,
Deluxe0321
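To illustrate why a pattern is "exactly that combination of letters", here is a much smaller pattern than the ones above, read piece by piece. It is only a sketch: the pattern "</?\w+>" is a simplified example chosen for this explanation and only matches tags without attributes.

```purebasic
EnableExplicit

; Reading the pattern "</?\w+>" piece by piece:
;   <     a literal "<"
;   /?    an optional "/" (so closing tags match, too)
;   \w+   one or more word characters (the tag name)
;   >     a literal ">"
Define.i n, i
Dim tag$(0)

If CreateRegularExpression(0, "</?\w+>")
   n = ExtractRegularExpression(0, "<p>Hello <b>world</b></p>", tag$())
   For i = 0 To n - 1
      Debug tag$(i)   ; <p>, <b>, </b>, </p>
   Next
   FreeRegularExpression(0)
Else
   Debug RegularExpressionError()
EndIf
```

The longer expressions in this thread are built from the same kind of pieces, just with more alternatives for whitespace, quotes and attribute values.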
 
			 
			
					
				Re: Question about HTML
				Posted: Thu Jul 05, 2018 3:35 pm
				by Kwai chang caine
I know this thread is several years old, but Deluxe0321 wrote some really nice code.
Does anyone know how to obtain the text inside the tags instead of the attributes?
Code: Select all
AExpression.s = "(\S+)=["+#DQUOTE$+"']?((?:.(?!["+#DQUOTE$+"']?\s+(?:\S+)=|[>"+#DQUOTE$+"']))+.)["+#DQUOTE$+"']?"
Code: Select all
<TAG>I search to keep this text</TAG>
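One possible approach, shown only as a sketch, is to capture the text in a group and read it with RegularExpressionGroup(). The pattern below is an assumption made for this example and not one of the expressions from this thread; it does not handle nested tags of the same name, and it needs the #PB_RegularExpression_DotAll flag if the text spans several lines.

```purebasic
EnableExplicit

Define html$ = "<TAG>I search to keep this text</TAG>"

; Group 1: the tag name, group 2: the text between the tags.
; "\1" requires the closing tag to have the same name as the opening tag.
If CreateRegularExpression(0, "<(\w+)[^>]*>(.*?)</\1>")
   If ExamineRegularExpression(0, html$)
      While NextRegularExpressionMatch(0)
         Debug RegularExpressionGroup(0, 1) + ": " + RegularExpressionGroup(0, 2)
      Wend
   EndIf
   FreeRegularExpression(0)
Else
   Debug RegularExpressionError()
EndIf
```

For the sample line this shows "TAG: I search to keep this text".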
 
			 
			
					
				Re: Question about HTML
				Posted: Thu Jul 05, 2018 9:17 pm
				by Bitblazer
Just include and use the HTML Agility Pack.
PS: If a mod thinks it's inappropriate due to the age of the original thread, feel free to erase this post.
			 
			
					
				Re: Question about HTML
				Posted: Fri Jul 20, 2018 7:46 pm
				by Little John
				Thanks to Deluxe0321 for these two interesting Regular Expressions, and thanks to KCC for digging out this old thread! 
 
Deluxe0321's code does three different things:
- Create two Regular Expressions.
- Fill a string with HTML code from a web page.
- Parse that string, using the Regular Expressions.
 
For more clarity, I separated the parts (and changed some other small things).
My version looks like this:
Code: Select all
EnableExplicit
; html tag expression --> http://haacked.com/archive/2004/10/25/usingregularexpressionstomatchhtml.aspx
#Rex_HTML_Tags$ = "</?\w+((\s+\w+(\s*=\s*(?:" + #DQUOTE$ + ".*?" + #DQUOTE$ + "|'.*?'|[^'" + #DQUOTE$+">\s]+))?)+\s*|\s*)/?>"
; html attribute expression -->  http://stackoverflow.com/a/317081
#Rex_HTML_Attributes$ = "(\S+)=[" + #DQUOTE$ + "']?((?:.(?![" + #DQUOTE$+"']?\s+(?:\S+)=|[>" + #DQUOTE$+"']))+.)[" + #DQUOTE$+"']?"
Define.i s_HTMLTags, s_HTML_Attr
Procedure InitParseHTML ()
   Shared s_HTMLTags, s_HTML_Attr
   
   s_HTMLTags  = CreateRegularExpression(#PB_Any, #Rex_HTML_Tags$)
   s_HTML_Attr = CreateRegularExpression(#PB_Any, #Rex_HTML_Attributes$)
EndProcedure
Procedure ParseHTMLString (html$)
   Shared s_HTMLTags, s_HTML_Attr
   Protected.i numTags, numAttr, t, a
   Protected Dim tags$(0)
   Protected Dim attr$(0)
   
   If s_HTMLTags And s_HTML_Attr
      numTags = ExtractRegularExpression(s_HTMLTags, html$, tags$())
      Debug "Found " + numTags + " HTML tags:"
      
      For t = 0 To numTags - 1
         Debug Space(4) + "HTML tag:"
         Debug Space(8) + tags$(t)
         
         numAttr = ExtractRegularExpression(s_HTML_Attr, tags$(t), attr$())
         If numAttr
            Debug Space(4) + "Attributes: "
            For a = 0 To numAttr - 1
               Debug Space(8) + attr$(a) 
            Next
         EndIf
      Next
      
   EndIf
EndProcedure
Procedure.s DownloadedHTMLPage (url$)
   Protected *mem, ret$=""
   
   If ReceiveHTTPFile(url$, GetHomeDirectory() + "tmp.html")
      If OpenFile(0, GetHomeDirectory() + "tmp.html")
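         *mem = AllocateMemory(Lof(0))
         If *mem
            ReadData(0, *mem, Lof(0))
            ; the buffer is not null-terminated, so pass its size in bytes
            ret$ = PeekS(*mem, MemorySize(*mem), #PB_UTF8|#PB_ByteLength)
            FreeMemory(*mem)
         EndIf
         CloseFile(0)   ; close the file even if the allocation failed
         DeleteFile(GetHomeDirectory() + "tmp.html")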
      EndIf
   EndIf
   
   ProcedureReturn ret$   
EndProcedure
; -- Demo
Define page$
InitNetwork()
InitParseHTML()
page$ = DownloadedHTMLPage("http://www.purebasic.fr/english/")
Debug page$
Debug "----------------------------------------------------------------------"
If page$ <> ""
   ParseHTMLString(page$)
EndIf
 
			 
			
					
				Re: Question about HTML
				Posted: Fri Jul 20, 2018 7:51 pm
				by Little John
				@KCC:
In another thread, Little John wrote:Though PureBasic has the built-in function ExtractRegularExpression(), this will only retrieve the parts of the source string which match the Regular Expression, while the parts of the source string which don't match won't be retrieved.
That's why I wrote the procedure SplitByRegEx(), which retrieves the matching parts AND the non-matching parts.
Using it in the context here, it looks like this:
Code: Select all
EnableExplicit
Structure SplitString
   s$
   match.i
EndStructure
Procedure.i SplitByRegEx (regEx.i, source$, List part.SplitString())
   ; -- split a string into parts, that match or don't match a Regular Expression
   ; in : regEx  : number of a Regular Expression generated by CreateRegularExpression()
   ;      source$: string to be split into parts
   ; out: part()      : resulting list of parts
   ;      return value: number of elements in part():
   ;                    0 if source$ = "", > 0 otherwise;
   ;                   -1 on error
   Protected.i left, right
   
   If ExamineRegularExpression(regEx, source$) = 0
      ProcedureReturn -1              ; error
   EndIf
   
   ClearList(part())
   
   left = 1
   While NextRegularExpressionMatch(regEx)
      right = RegularExpressionMatchPosition(regEx)
      If left < right
         AddElement(part())
         part()\s$ = Mid(source$, left, right-left)
         part()\match = #False
      EndIf
      AddElement(part())
      part()\s$ = RegularExpressionMatchString(regEx)
      part()\match = #True
      left = right + RegularExpressionMatchLength(regEx)
   Wend
   
   If left <= Len(source$)
      AddElement(part())
      part()\s$ = Mid(source$, left)
      part()\match = #False
   EndIf
   
   ProcedureReturn ListSize(part())   ; success
EndProcedure
#WhiteSpace$ = ~" \t\r\n"
Procedure.i IsSolelyWhiteSpace (s$)
   Protected.i last, i
   
   last = Len(s$)
   For i = 1 To last
      If FindString(#WhiteSpace$, Mid(s$, i, 1)) = 0
         ProcedureReturn #False
      EndIf   
   Next   
   
   ProcedureReturn #True
EndProcedure
; html tag expression --> http://haacked.com/archive/2004/10/25/usingregularexpressionstomatchhtml.aspx
#Rex_HTML_Tags$ = "</?\w+((\s+\w+(\s*=\s*(?:" + #DQUOTE$ + ".*?" + #DQUOTE$ + "|'.*?'|[^'" + #DQUOTE$+">\s]+))?)+\s*|\s*)/?>"
; html attribute expression -->  http://stackoverflow.com/a/317081
#Rex_HTML_Attributes$ = "(\S+)=[" + #DQUOTE$ + "']?((?:.(?![" + #DQUOTE$+"']?\s+(?:\S+)=|[>" + #DQUOTE$+"']))+.)[" + #DQUOTE$+"']?"
Define.i s_HTMLTags, s_HTML_Attr
Procedure InitParseHTML ()
   Shared s_HTMLTags, s_HTML_Attr
   
   s_HTMLTags  = CreateRegularExpression(#PB_Any, #Rex_HTML_Tags$)
   s_HTML_Attr = CreateRegularExpression(#PB_Any, #Rex_HTML_Attributes$)
EndProcedure
Procedure ParseHTMLString (html$)
   Shared s_HTMLTags, s_HTML_Attr
   Protected.i numPieces, numAttr, a
   Protected NewList piece.SplitString()
   Protected Dim attr$(0)
   
   If s_HTMLTags And s_HTML_Attr
      numPieces = SplitByRegEx(s_HTMLTags, html$, piece())
      Debug "Found " + numPieces + " pieces (tags + stuff outside of tags):"
      
      ForEach piece()
         If piece()\match
            Debug Space(4) + "HTML tag:"
            Debug Space(8) + piece()\s$
            
            numAttr = ExtractRegularExpression(s_HTML_Attr, piece()\s$, attr$())
            If numAttr
               Debug Space(4) + "Attributes: "
               For a = 0 To numAttr - 1
                  Debug Space(8) + attr$(a) 
               Next
            EndIf
            
         ElseIf Not IsSolelyWhiteSpace(piece()\s$)
            Debug Space(4) + "Outside of tags:"
            Debug Space(8) + piece()\s$
         EndIf
      Next
      
   EndIf
EndProcedure
Procedure.s DownloadedHTMLPage (url$)
   Protected *mem, ret$=""
   
   *mem = ReceiveHTTPMemory(url$)
   If *mem
      ret$ = PeekS(*mem, MemorySize(*mem), #PB_UTF8|#PB_ByteLength)   ; buffer is not null-terminated
      FreeMemory(*mem)
   EndIf
   
   ProcedureReturn ret$   
EndProcedure
; -- Demo
Define page$
InitNetwork()
InitParseHTML()
page$ = DownloadedHTMLPage("http://www.purebasic.fr/english/")
Debug page$
Debug "----------------------------------------------------------------------"
If page$ <> ""
   ParseHTMLString(page$)
EndIf
Debug "=========================================================================="
page$ = ~"<!DOCTYPE html>\n" +
        ~"<html lang='de'>\n" +
        ~"<head>\n" +
        ~"  <meta charset='utf-8'>\n" +
        ~"  <title>Cool title</title>\n" +
        ~"</head>\n" +
        ~"<body>\n" +
        ~"  <p>Uuuh, a paragraph!</p>\n" +
        ~"  <!--\n" +
        ~"  Comment\n" +
        ~"  -->\n" +
        ~"</body>\n" +
        ~"</html>"
Debug page$
Debug "----------------------------------------------------------------------"
ParseHTMLString(page$)
I also simplified the procedure DownloadedHTMLPage(), but that's not the point here.
The second example in this code shows that the HTML comment tag <!-- ... --> is not recognized as a tag, but as text outside of tags. So there is some room for improvement in the #Rex_HTML_Tags$ regex pattern.
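One way this could be improved, sketched here under assumptions, is to put a comment pattern in front of the tag pattern as an alternative. To keep the example short, #Rex_SimpleTag$ is a simplified stand-in for #Rex_HTML_Tags$, and #Rex_Comment$ is a pattern made up for this sketch; neither is tested against real-world pages.

```purebasic
EnableExplicit

; Simplified stand-in for #Rex_HTML_Tags$ (assumption, not the pattern from above)
#Rex_SimpleTag$ = "</?\w+[^>]*>"
; Comments: "<!--", then as little as possible, then "-->"
#Rex_Comment$ = "<!--.*?-->"

Define.i rex, n, i
Define html$ = ~"<p>Text</p>\n<!--\n  Comment\n-->"
Dim match$(0)

; DotAll lets "." match line breaks, so multi-line comments are found, too
rex = CreateRegularExpression(#PB_Any, #Rex_Comment$ + "|" + #Rex_SimpleTag$, #PB_RegularExpression_DotAll)
If rex
   n = ExtractRegularExpression(rex, html$, match$())
   For i = 0 To n - 1
      Debug "[" + match$(i) + "]"
   Next
   FreeRegularExpression(rex)
Else
   Debug RegularExpressionError()
EndIf
```

Here the comment is returned as one match, next to <p> and </p>. Whether the same flag is safe to combine with the full tag pattern would still need to be checked.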

 
			 
			
					
				Re: Question about HTML
				Posted: Sat Jul 21, 2018 10:02 am
				by Marc56us
Little John wrote:Though PureBasic has the built-in function ExtractRegularExpression(), this will only retrieve the parts of the source string which match the Regular Expression, while the parts of the source string which don't match won't be retrieved.
Replaces the RegEx result with nothing and keeps the rest of the initial character string
ReplaceRegularExpression()
 
 
			 
			
					
				Re: Question about HTML
				Posted: Sat Jul 21, 2018 10:40 am
				by Little John
				Marc56us wrote:Replaces the RegEx result with nothing and keeps the rest of the initial character string
I don't see how this will give the same result as my code above. Can you please give some working example code?
 
			 
			
					
				Re: Question about HTML
				Posted: Sat Jul 21, 2018 2:31 pm
				by Marc56us
I'm not trying to get the same result, I'm just indicating that if you want to retrieve the text that doesn't match the expression, just replace what matches with nothing (in other words, delete the text that matches the regex).
So what's left is what doesn't match.
Though PureBasic has the built-in function ExtractRegularExpression(), this will only retrieve the parts of the source string which match the Regular Expression, while the parts of the source string which don't match won't be retrieved.
Code: Select all
; <SomeThingToRemove> is a placeholder for the pattern string
If CreateRegularExpression(0, <SomeThingToRemove>)
    Text_left$ = ReplaceRegularExpression(0, Source_Text$, "")
    FreeRegularExpression(0)
EndIf
 
 
 
			 
			
					
				Re: Question about HTML
				Posted: Sat Jul 21, 2018 3:12 pm
				by Little John
				Marc56us wrote:I'm not trying to get the same result
Aha.
Marc56us wrote:I'm just indicating that if you want to retrieve the text that doesn't match the expression, just replace what matches with nothing
That does not address the problem at hand and thus does not yield the desired result.
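A small sketch may show the difference. The tag pattern used here is a simplified assumption, not the one from the code above.

```purebasic
EnableExplicit

Define html$ = "<h1>Title</h1><p>Body</p>"

; Simplified tag pattern, assumed only for this example
If CreateRegularExpression(0, "</?\w+[^>]*>")
   ; All matches are removed and the remaining text is joined into ONE string
   Debug ReplaceRegularExpression(0, html$, "")   ; TitleBody
   FreeRegularExpression(0)
Else
   Debug RegularExpressionError()
EndIf
```

The result no longer tells where one piece of text ends and the next begins, or which tag it belonged to. SplitByRegEx() keeps each piece as a separate list element together with its match flag.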