Question about HTML

Post by J@ckWhiteIII »

Hello,
I have a question about HTML. I'd like to know whether ALL tags have ALL the attributes that exist (height, margin, bgcolor, frontcolor).
I want to know this because I'd like to read HTML text using PureBasic. I think I'll need a tree structure to handle all that, so I thought I could make a structure for tags. This structure would contain things like width, height, x, y, margin, etc. My approach so far:

Code:

Global inputt.s = "<html><head><title>Titel</title></head><body><div><p>paragraph</p></div></body></html>"
Global i.i = -1
Global output.s = ""
Global *start.Character

Enumeration
  #html
  #body
  #head
  #footer
  #p
  #div
  #link
  #br
  #script
  #meta
  #title
EndEnumeration

Structure tag
  frontColor.i
  backColor.i
  type.i
EndStructure

Structure br
  text.s
EndStructure

Structure p
  text.s
  List brs.br()
EndStructure

Structure link
  text.s
  href.s
EndStructure

Structure div
  List ps.p()
  List links.link()
EndStructure

Structure footer
  List ps.p()
  List links.link()
EndStructure
  
Structure body
  List divs.div()
  List footers.footer()
EndStructure

Structure meta
  content.s
EndStructure

Structure title
  text.s
EndStructure

Structure script
  type.s
EndStructure

Structure head
  List titles.title()
  List metas.meta()
  List scripts.script()
EndStructure

Structure html
  List bodies.body()
  List heads.head()
EndStructure

Procedure.i examineTag(output$,lookuponly.l = 0)
  old_i.i = i
  
  Select *start\c                                       ;just a start
    Case 'a' To 'z','A' To 'Z' ;etc
      While *start\c >= '0' And *start\c <= '9'
        output$ + Chr(*start\c)
        *start + SizeOf(Character)
      Wend
  EndSelect
  
  If lookuponly
    i = old_i
  EndIf
  ProcedureReturn 
EndProcedure

Procedure findTag(input$)
  intag = #False
  While i<Len(input$)
    i=i+1
    If Not intag And Mid(input$,i,1) = "<"
      intag = #True
      Continue
    EndIf
    If intag And Mid(input$,i,1) = ">"
      intag = #False
      Continue
    EndIf
    If Not intag
      text.s = ""
      text = text + Mid(input$,i,1)
      output = output + text+#CRLF$
    EndIf
    If intag
      text.s = ""
      If Mid(input$,i,1) <> "<"
        text = text + Mid(input$,i,1)
      EndIf
      output = output + text
    EndIf
  Wend
EndProcedure

findTag(inputt)
Debug output
I really don't know if what I did there actually makes sense at all or if there'd be a much easier way.
Maybe someone has an answer to my question and/or could help doing the structure.
Looking forward to hearing from you all

Post by Mohawk70 »

J@ckWhiteIII wrote: Hello,
I have a question about HTML. I'd like to know whether ALL tags have ALL the attributes that exist (height, margin, bgcolor, frontcolor).
I want to know this because I'd like to read HTML text using PureBasic. I think I'll need a tree structure to handle all that, so I thought I could make a structure for tags. This structure would contain things like width, height, x, y, margin, etc. My approach so far:
See http://www.w3schools.com/tags/ref_stand ... ibutes.asp for a list of global attributes shared by all tags. You will also find the additional attributes for each tag.
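To illustrate what this means for the planned structure: since some attributes are global and others belong only to specific tags, one option is a single generic node structure with a map of attributes, rather than one structure per tag. This is only a minimal sketch; the names HtmlNode, attr$() and child() are hypothetical and do not appear in the original code.

```purebasic
; Hypothetical generic node: one structure for every tag,
; with attributes kept in a map instead of fixed fields.
Structure HtmlNode
  name$                    ; tag name, e.g. "div"
  text$                    ; text directly inside the tag
  Map attr$()              ; attribute name -> attribute value
  List child.HtmlNode()    ; nested tags
EndStructure

Define root.HtmlNode
root\name$ = "body"

; <div id="main"> as a child of <body>
AddElement(root\child())
root\child()\name$ = "div"
root\child()\attr$("id") = "main"

; <p>paragraph</p> as a child of the <div>
AddElement(root\child()\child())
root\child()\child()\name$ = "p"
root\child()\child()\text$ = "paragraph"

; Walk two levels of the tree
ForEach root\child()
  Debug root\child()\name$ + " id=" + root\child()\attr$("id")
  ForEach root\child()\child()
    Debug "  " + root\child()\child()\name$ + ": " + root\child()\child()\text$
  Next
Next
```

With a map, a tag simply has no entry for an attribute it does not carry, so the structure does not need to know the full attribute list in advance.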

Post by helpy »

It also depends on the doctype (HTML5, HTML 4.01, XHTML 1.0, XHTML 1.1, ...).

==> http://www.w3schools.com/tags/ref_html_dtd.asp
==> http://www.w3schools.com/tags/ref_stand ... ibutes.asp

Post by J@ckWhiteIII »

Thank you a lot for those links, they gave me information I need.

But I have another question, this time about the PureBasic code. Does my first attempt make sense or is it a "fail" already? I don't really know if what I did makes sense or whether it is appropriate or not. I'd appreciate information and/or suggestions to improve that little code snippet.
Once again, thank you!

Post by Deluxe0321 »

I would do it like this:

Code:

InitNetwork()


;// html tag expression --> http://haacked.com/archive/2004/10/25/usingregularexpressionstomatchhtml.aspx
Expression.s = "</?\w+((\s+\w+(\s*=\s*(?:"+#DQUOTE$+".*?"+#DQUOTE$+"|'.*?'|[^'"+#DQUOTE$+">\s]+))?)+\s*|\s*)/?>"

;// html attribute expression -->  http://stackoverflow.com/a/317081
AExpression.s = "(\S+)=["+#DQUOTE$+"']?((?:.(?!["+#DQUOTE$+"']?\s+(?:\S+)=|[>"+#DQUOTE$+"']))+.)["+#DQUOTE$+"']?"

RegEx.i  = CreateRegularExpression(#PB_Any,Expression.s)
ARegEx.i = CreateRegularExpression(#PB_Any,AExpression.s) 

If RegEx.i and ARegEx.i
  
  If ReceiveHTTPFile("http://www.purebasic.fr/english/", GetHomeDirectory()+"tmp.html")
    
    If ReadFile(0,GetHomeDirectory()+"tmp.html")
      *mem = AllocateMemory(Lof(0) + 1) ; the extra zero byte terminates the string for PeekS()
      If *mem
        ReadData(0,*mem,Lof(0))
        Content.s = PeekS(*mem,-1,#PB_Ascii)
        FreeMemory(*mem)
      EndIf
      CloseFile(0)                      ; close the file even if the allocation failed
    EndIf
    DeleteFile(GetHomeDirectory()+"tmp.html")
    
    Debug Content.s
    
    Dim Arr.s(0)
    Dim AArr.s(0)
    
    ArrSize.i = ExtractRegularExpression(RegEx.i,Content.s,Arr())
    If ArrSize.i
      Debug "Found "+Str(ArrSize.i)+" HTML Tags:"
      
      For i=0 To ArrSize.i - 1
        
        ReDim AArr(0)
        
        AArrSize.i = ExtractRegularExpression(ARegEx.i,Arr(i),AArr())
        
        Debug Space(4) + "HTML-Tag:"
        Debug Space(8) + Arr(i)
        
        If AArrSize
          Debug Space(4) + "Attributes: "
          For i2 = 0 To AArrSize.i - 1
            Debug Space(8) + AArr(i2)  
          Next
        EndIf
        
      Next
    EndIf
  EndIf
EndIf
Last edited by Deluxe0321 on Fri Nov 28, 2014 6:21 am, edited 1 time in total.

Post by J@ckWhiteIII »

Wow, thank you a lot for that demonstration! I'll have to do some research on Regular Expressions to understand it, though.
Those links you added are full of useful information for me, thanks a lot.

But I still have to ask: how did people come up with these RegExes? Looking at them, I don't understand a word. Why exactly that combination of characters?
As I said, I'm going to keep working on my project for now and (hopefully) understand Regular Expressions over time.
Thank you all.

Post by Deluxe0321 »

Understanding Regular Expressions is not that hard. Simply use Google to find good tutorials or, even easier, search for finished expressions written by others.
Usually a search query like "regex html tag" or "regex XYZ" is enough to find a suitable expression string. ;)

An awesome (cheat) sheet, not only for RegEx, can be found here: http://overapi.com/regex/
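As a small illustration of how such an expression is put together, here is a much simpler tag pattern, read piece by piece. It is an assumed, simplified pattern for learning purposes, not the one used in the code above, and it will not cope with every kind of markup.

```purebasic
; Simplified, assumed pattern (not the one from the code above):
;   <       literal opening bracket
;   /?      optional slash (for closing tags)
;   \w+     the tag name: one or more word characters
;   [^>]*   anything that is not a closing bracket (attributes etc.)
;   >       literal closing bracket
If CreateRegularExpression(0, "</?\w+[^>]*>")
  Dim tag$(0)
  n = ExtractRegularExpression(0, "<p class='x'>Hi</p>", tag$())
  For k = 0 To n - 1
    Debug tag$(k)          ; one matched tag per line
  Next
  FreeRegularExpression(0)
Else
  Debug RegularExpressionError()   ; shows why the pattern was rejected
EndIf
```

Longer expressions are usually built the same way: small pieces that each describe one part of the text, joined together.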

Have fun with your project,
Deluxe0321

Post by Kwai chang caine »

I know this thread is several years old :oops: , but Deluxe0321 wrote really nice code 8)

Does anyone know how to obtain the text between the tags instead of the attributes?

Code:

AExpression.s = "(\S+)=["+#DQUOTE$+"']?((?:.(?!["+#DQUOTE$+"']?\s+(?:\S+)=|[>"+#DQUOTE$+"']))+.)["+#DQUOTE$+"']?"

Code:

<TAG>I search to keep this text</TAG>
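One possible approach, sketched here as an assumption rather than a tested solution for real pages: put the text between the tags into a capture group and read it with RegularExpressionGroup(). The pattern below is a simplified illustration and only handles simple, non-nested tags on a single line.

```purebasic
; Assumed pattern: group 1 = tag name, group 2 = text between the tags.
; \1 requires the closing tag to have the same name as the opening tag.
If CreateRegularExpression(0, "<(\w+)[^>]*>(.*?)</\1>")
  If ExamineRegularExpression(0, "<TAG>I search to keep this text</TAG>")
    While NextRegularExpressionMatch(0)
      Debug RegularExpressionGroup(0, 2)   ; the text between the tags
    Wend
  EndIf
  FreeRegularExpression(0)
Else
  Debug RegularExpressionError()
EndIf
```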

Post by Bitblazer »

Just include and use the HTML Agility pack.

PS: If a mod thinks this is inappropriate due to the age of the original thread, feel free to erase this post.

Post by Little John »

Thanks to Deluxe0321 for these two interesting Regular Expressions, and thanks to KCC for digging out this old thread! :D

Deluxe0321's code does three different things:
  • Create two Regular expressions.
  • Fill a string with HTML code from a web page.
  • Parse that string, using the Regular Expressions.
For more clarity, I separated the parts (and changed some other small things).
My version looks like this:

Code:

EnableExplicit

; html tag expression --> http://haacked.com/archive/2004/10/25/usingregularexpressionstomatchhtml.aspx
#Rex_HTML_Tags$ = "</?\w+((\s+\w+(\s*=\s*(?:" + #DQUOTE$ + ".*?" + #DQUOTE$ + "|'.*?'|[^'" + #DQUOTE$+">\s]+))?)+\s*|\s*)/?>"

; html attribute expression -->  http://stackoverflow.com/a/317081
#Rex_HTML_Attributes$ = "(\S+)=[" + #DQUOTE$ + "']?((?:.(?![" + #DQUOTE$+"']?\s+(?:\S+)=|[>" + #DQUOTE$+"']))+.)[" + #DQUOTE$+"']?"

Define.i s_HTMLTags, s_HTML_Attr

Procedure InitParseHTML ()
   Shared s_HTMLTags, s_HTML_Attr
   
   s_HTMLTags  = CreateRegularExpression(#PB_Any, #Rex_HTML_Tags$)
   s_HTML_Attr = CreateRegularExpression(#PB_Any, #Rex_HTML_Attributes$)
EndProcedure


Procedure ParseHTMLString (html$)
   Shared s_HTMLTags, s_HTML_Attr
   Protected.i numTags, numAttr, t, a
   Protected Dim tags$(0)
   Protected Dim attr$(0)
   
   If s_HTMLTags And s_HTML_Attr
      numTags = ExtractRegularExpression(s_HTMLTags, html$, tags$())
      Debug "Found " + numTags + " HTML tags:"
      
      For t = 0 To numTags - 1
         Debug Space(4) + "HTML tag:"
         Debug Space(8) + tags$(t)
         
         numAttr = ExtractRegularExpression(s_HTML_Attr, tags$(t), attr$())
         If numAttr
            Debug Space(4) + "Attributes: "
            For a = 0 To numAttr - 1
               Debug Space(8) + attr$(a) 
            Next
         EndIf
      Next
      
   EndIf
EndProcedure


Procedure.s DownloadedHTMLPage (url$)
   Protected *mem, ret$=""
   
   If ReceiveHTTPFile(url$, GetHomeDirectory() + "tmp.html")
      If ReadFile(0, GetHomeDirectory() + "tmp.html")
         *mem = AllocateMemory(Lof(0) + 1)   ; the extra zero byte terminates the string for PeekS()
         If *mem
            ReadData(0, *mem, Lof(0))
            ret$ = PeekS(*mem, -1, #PB_UTF8)
            FreeMemory(*mem)
         EndIf
         CloseFile(0)                        ; close the file even if the allocation failed
      EndIf
      DeleteFile(GetHomeDirectory() + "tmp.html")
   EndIf
   
   ProcedureReturn ret$   
EndProcedure


; -- Demo
Define page$

InitNetwork()
InitParseHTML()

page$ = DownloadedHTMLPage("http://www.purebasic.fr/english/")

Debug page$
Debug "----------------------------------------------------------------------"

If page$ <> ""
   ParseHTMLString(page$)
EndIf

Post by Little John »

@KCC:
In another thread, Little John wrote: Though PureBasic has the built-in function ExtractRegularExpression(), this will only retrieve the parts of the source string which match the Regular Expression, while the parts of the source string which don't match won't be retrieved.
That's why I wrote the procedure SplitByRegEx(), which retrieves the matching parts AND the non-matching parts.

Using it in the context here, it looks like this:

Code:

EnableExplicit

Structure SplitString
   s$
   match.i
EndStructure

Procedure.i SplitByRegEx (regEx.i, source$, List part.SplitString())
   ; -- split a string into parts, that match or don't match a Regular Expression
   ; in : regEx  : number of a Regular Expression generated by CreateRegularExpression()
   ;      source$: string to be split into parts
   ; out: part()      : resulting list of parts
   ;      return value: number of elements in part():
   ;                    0 if source$ = "", > 0 otherwise;
   ;                   -1 on error
   Protected.i left, right
   
   If ExamineRegularExpression(regEx, source$) = 0
      ProcedureReturn -1              ; error
   EndIf
   
   ClearList(part())
   
   left = 1
   While NextRegularExpressionMatch(regEx)
      right = RegularExpressionMatchPosition(regEx)
      If left < right
         AddElement(part())
         part()\s$ = Mid(source$, left, right-left)
         part()\match = #False
      EndIf
      AddElement(part())
      part()\s$ = RegularExpressionMatchString(regEx)
      part()\match = #True
      left = right + RegularExpressionMatchLength(regEx)
   Wend
   
   If left <= Len(source$)
      AddElement(part())
      part()\s$ = Mid(source$, left)
      part()\match = #False
   EndIf
   
   ProcedureReturn ListSize(part())   ; success
EndProcedure


#WhiteSpace$ = ~" \t\r\n"

Procedure.i IsSolelyWhiteSpace (s$)
   Protected.i last, i
   
   last = Len(s$)
   For i = 1 To last
      If FindString(#WhiteSpace$, Mid(s$, i, 1)) = 0
         ProcedureReturn #False
      EndIf   
   Next   
   
   ProcedureReturn #True
EndProcedure


; html tag expression --> http://haacked.com/archive/2004/10/25/usingregularexpressionstomatchhtml.aspx
#Rex_HTML_Tags$ = "</?\w+((\s+\w+(\s*=\s*(?:" + #DQUOTE$ + ".*?" + #DQUOTE$ + "|'.*?'|[^'" + #DQUOTE$+">\s]+))?)+\s*|\s*)/?>"

; html attribute expression -->  http://stackoverflow.com/a/317081
#Rex_HTML_Attributes$ = "(\S+)=[" + #DQUOTE$ + "']?((?:.(?![" + #DQUOTE$+"']?\s+(?:\S+)=|[>" + #DQUOTE$+"']))+.)[" + #DQUOTE$+"']?"

Define.i s_HTMLTags, s_HTML_Attr

Procedure InitParseHTML ()
   Shared s_HTMLTags, s_HTML_Attr
   
   s_HTMLTags  = CreateRegularExpression(#PB_Any, #Rex_HTML_Tags$)
   s_HTML_Attr = CreateRegularExpression(#PB_Any, #Rex_HTML_Attributes$)
EndProcedure


Procedure ParseHTMLString (html$)
   Shared s_HTMLTags, s_HTML_Attr
   Protected.i numPieces, numAttr, a
   Protected NewList piece.SplitString()
   Protected Dim attr$(0)
   
   If s_HTMLTags And s_HTML_Attr
      numPieces = SplitByRegEx(s_HTMLTags, html$, piece())
      Debug "Found " + numPieces + " pieces (tags + stuff outside of tags):"
      
      ForEach piece()
         If piece()\match
            Debug Space(4) + "HTML tag:"
            Debug Space(8) + piece()\s$
            
            numAttr = ExtractRegularExpression(s_HTML_Attr, piece()\s$, attr$())
            If numAttr
               Debug Space(4) + "Attributes: "
               For a = 0 To numAttr - 1
                  Debug Space(8) + attr$(a) 
               Next
            EndIf
            
         ElseIf Not IsSolelyWhiteSpace(piece()\s$)
            Debug Space(4) + "Outside of tags:"
            Debug Space(8) + piece()\s$
         EndIf
      Next
      
   EndIf
EndProcedure


Procedure.s DownloadedHTMLPage (url$)
   Protected *mem, ret$=""
   
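   *mem = ReceiveHTTPMemory(url$)
   If *mem
      ; the downloaded buffer is not null-terminated, so pass its size in bytes explicitly
      ret$ = PeekS(*mem, MemorySize(*mem), #PB_UTF8 | #PB_ByteLength)
      FreeMemory(*mem)
   EndIf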
   
   ProcedureReturn ret$   
EndProcedure


; -- Demo
Define page$

InitNetwork()
InitParseHTML()

page$ = DownloadedHTMLPage("http://www.purebasic.fr/english/")

Debug page$
Debug "----------------------------------------------------------------------"

If page$ <> ""
   ParseHTMLString(page$)
EndIf

Debug "=========================================================================="

page$ = ~"<!DOCTYPE html>\n" +
        ~"<html lang='de'>\n" +
        ~"<head>\n" +
        ~"  <meta charset='utf-8'>\n" +
        ~"  <title>Cool title</title>\n" +
        ~"</head>\n" +
        ~"<body>\n" +
        ~"  <p>Uuuh, a parapgraph!</p>\n" +
        ~"  <!--\n" +
        ~"  Comment\n" +
        ~"  -->\n" +
        ~"</body>\n" +
        ~"</html>"

Debug page$
Debug "----------------------------------------------------------------------"
ParseHTMLString(page$)
I also simplified the procedure DownloadedHTMLPage(), but that's not the point here.

The second example in this code shows that the HTML comment tag <!-- ... --> is not recognized as a tag, but as text outside of tags. So there is some room for improvement in the #Rex_HTML_Tags$ regex pattern. ;-)
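One way this could be approached, as a sketch only: add a separate alternative for comments in front of the tag pattern. The constant #Rex_Comment$ and the shortened tag pattern below are assumptions made for illustration; the full #Rex_HTML_Tags$ pattern might be combined with it in the same way.

```purebasic
; Assumed extension: match HTML comments as a separate alternative.
; [\s\S] is used instead of "." so that a comment may span several lines.
#Rex_Comment$ = "<!--[\s\S]*?-->"

; Shortened tag pattern for the demo (an assumption, not #Rex_HTML_Tags$)
If CreateRegularExpression(0, #Rex_Comment$ + "|</?\w+[^>]*>")
  If ExamineRegularExpression(0, ~"<p>Hi</p>\n<!--\n Comment\n-->")
    While NextRegularExpressionMatch(0)
      Debug RegularExpressionMatchString(0)   ; tags and the comment are reported as matches
    Wend
  EndIf
  FreeRegularExpression(0)
Else
  Debug RegularExpressionError()
EndIf
```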

Post by Marc56us »

Little John wrote: Though PureBasic has the built-in function ExtractRegularExpression(), this will only retrieve the parts of the source string which match the Regular Expression, while the parts of the source string which don't match won't be retrieved.
Replaces the RegEx result with nothing and keeps the rest of the initial character string
ReplaceRegularExpression()

:wink:

Post by Little John »

Marc56us wrote: Replaces the RegEx result with nothing and keeps the rest of the initial character string
I don't see how this will give the same result as my code above. Can you please give some working example code?

Post by Marc56us »

I'm not trying to get the same result. I'm just indicating that if you want to retrieve the text that doesn't match the expression, just replace what matches with nothing (in other words, delete the text that matches the regex).

So what's left is what doesn't match.
Little John wrote: Though PureBasic has the built-in function ExtractRegularExpression(), this will only retrieve the parts of the source string which match the Regular Expression, while the parts of the source string which don't match won't be retrieved.

Code:

If CreateRegularExpression(0, <SomeThingToRemove>)   ; placeholder: the pattern to delete
    Text_left$ = ReplaceRegularExpression(0, Source_Text$, "")
EndIf
:wink:
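A complete version of this idea might look as follows. The tag pattern is a simplified assumption chosen for the example. Note that the result is one single string with all matches removed; the non-matching parts are not returned as separate pieces.

```purebasic
; Assumed, simplified tag pattern; every match is replaced with an empty string.
If CreateRegularExpression(0, "</?\w+[^>]*>")
  Source_Text$ = "<TAG>I search to keep this text</TAG>"
  Text_left$ = ReplaceRegularExpression(0, Source_Text$, "")
  Debug Text_left$          ; the text with the tags stripped out
  FreeRegularExpression(0)
Else
  Debug RegularExpressionError()
EndIf
```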

Post by Little John »

Marc56us wrote: I'm not trying to get the same result
Aha.
Marc56us wrote: I'm just indicating that if you want to retrieve the text that doesn't match the expression, just replace what matches with nothing
That does not address the problem at hand and thus does not yield the desired result.