PureBasic Forums - English

Posted: **Sun Apr 21, 2013 7:10 pm**

Hello,
I have a question about HTML. I'd like to know if ALL tags have ALL the attributes that exist (height, margin, bgcolor, frontcolor).
I want to know this because I'd like to read HTML text using Purebasic. So I thought I'll need a tree structure to be able to handle all that. Now I thought I could make a structure for tags. This structure would contain things like width,height, x, y, margin etc. My approach so far:

Code: Select all

Global inputt.s = "<html><head><title>Titel</title></head><body><div><p>paragraph</p></div></body></html>"
Global i.i = -1
Global output.s = ""
Global *start.Character

Enumeration
  #html
  #body
  #head
  #footer
  #p
  #div
  #link
  #br
  #script
  #meta
  #title
EndEnumeration

Structure tag
  frontColor.i
  backColor.i
  type.i
EndStructure

Structure br
  text.s
EndStructure

Structure p
  text.s
  List brs.br()
EndStructure

Structure link
  text.s
  href.s
EndStructure

Structure div
  List ps.p()
  List links.link()
EndStructure

Structure footer
  List ps.p()
  List links.link()
EndStructure
  
Structure body
  List divs.div()
  List footers.footer()
EndStructure

Structure meta
  content.s
EndStructure

Structure title
  text.s
EndStructure

Structure script
  type.s
EndStructure

Structure head
  List titles.title()
  List metas.meta()
  List scripts.script()
EndStructure

Structure html
  List bodies.body()
  List heads.head()
EndStructure

Procedure.i examineTag(output$,lookuponly.l = 0)
  old_i.i = i
  
  Select *start\c                                       ;just a start
    Case 'a' To 'z','A' To 'Z' ;etc
      While *start\c >= '0' And *start\c <= '9'
        output$ + Chr(*start\c)
        *start + SizeOf(Character)
      Wend
  EndSelect
  
  If lookuponly
    i = old_i
  EndIf
  ProcedureReturn 
EndProcedure

Procedure findTag(input$)
  intag = #False
  While i<Len(input$)
    i=i+1
    If Not intag And Mid(input$,i,1) = "<"
      intag = #True
      Continue
    EndIf
    If intag And Mid(input$,i,1) = ">"
      intag = #False
      Continue
    EndIf
    If Not intag
      text.s = ""
      text = text + Mid(input$,i,1)
      output = output + text+#CRLF$
    EndIf
    If intag
      text.s = ""
      If Mid(input$,i,1) <> "<"
        text = text + Mid(input$,i,1)
      EndIf
      output = output + text
    EndIf
  Wend
EndProcedure

findTag(inputt)
Debug output

I really don't know if what I did there actually makes sense at all or if there'd be a much easier way.
Maybe someone has an answer to my question and/or could help doing the structure.
Looking forward to hearing from you all

Posted: **Mon Apr 22, 2013 10:03 am**

J@ckWhiteIII wrote: Hello,
I have a question about HTML. I'd like to know if ALL tags have ALL the attributes that exist (height, margin, bgcolor, frontcolor).
I want to know this because I'd like to read HTML text using Purebasic. So I thought I'll need a tree structure to be able to handle all that. Now I thought I could make a structure for tags. This structure would contain things like width,height, x, y, margin etc. My approach so far:

See http://www.w3schools.com/tags/ref_stand ... ibutes.asp for a list of global attributes shared by all tags. You will also find the additional attributes for each tag.
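Since only the global attributes are shared and every tag adds its own, one structure per tag type with fixed fields gets unwieldy quickly. A minimal sketch of an alternative, assuming a single generic node type that stores its attributes in a map and refers to its parent by list index (the names HtmlNode, Nodes() and AddNode() are made up for illustration and do not come from this thread):

```purebasic
EnableExplicit

; Hypothetical layout: one generic node type for all tags.
Structure HtmlNode
  tag$           ; tag name, e.g. "div"
  text$          ; text directly inside the tag
  parent.i       ; index of the parent node in Nodes(), -1 for the root
  Map attr.s()   ; only the attributes that are actually present on this tag
EndStructure

Global NewList Nodes.HtmlNode()

; Appends a node at the end of the list and returns its index.
Procedure.i AddNode(tag$, parent.i)
  LastElement(Nodes())
  AddElement(Nodes())
  Nodes()\tag$ = tag$
  Nodes()\parent = parent
  ProcedureReturn ListIndex(Nodes())
EndProcedure

Define html.i, body.i, p.i

html = AddNode("html", -1)
body = AddNode("body", html)
p    = AddNode("p", body)

; The current list element is still the <p> node that was just added.
Nodes()\attr("class") = "intro"
Nodes()\text$ = "paragraph"

ForEach Nodes()
  Debug Nodes()\tag$ + " (parent index " + Str(Nodes()\parent) + ")"
  ForEach Nodes()\attr()
    Debug "  " + MapKey(Nodes()\attr()) + " = " + Nodes()\attr()
  Next
Next
```

With this layout a tag that has no bgcolor simply has no "bgcolor" key in its map, so the structure does not need to know every attribute in advance.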

Posted: **Mon Apr 22, 2013 3:35 pm**

It also depends on the doctype (HTML5, HTML 4.01, XHTML 1.0, XHTML 1.1, ...).

==> http://www.w3schools.com/tags/ref_html_dtd.asp
==> http://www.w3schools.com/tags/ref_stand ... ibutes.asp

Posted: **Mon Apr 22, 2013 4:06 pm**

Thank you a lot for those links, they gave me information I need.

But I have another question, this time about the PureBasic code. Does my first attempt make sense or is it a "fail" already? I don't really know if what I did makes sense or whether it is appropriate or not. I'd appreciate information and/or suggestions to improve that little code snippet.
Once again, thank you!

Posted: **Mon Apr 22, 2013 9:02 pm**

I would do it like this:

Code: Select all

InitNetwork()


;// html tag expression --> http://haacked.com/archive/2004/10/25/usingregularexpressionstomatchhtml.aspx
Expression.s = "</?\w+((\s+\w+(\s*=\s*(?:"+#DQUOTE$+".*?"+#DQUOTE$+"|'.*?'|[^'"+#DQUOTE$+">\s]+))?)+\s*|\s*)/?>"

;// html attribute expression -->  http://stackoverflow.com/a/317081
AExpression.s = "(\S+)=["+#DQUOTE$+"']?((?:.(?!["+#DQUOTE$+"']?\s+(?:\S+)=|[>"+#DQUOTE$+"']))+.)["+#DQUOTE$+"']?"

RegEx.i  = CreateRegularExpression(#PB_Any,Expression.s)
ARegEx.i = CreateRegularExpression(#PB_Any,AExpression.s) 

If RegEx.i and ARegEx.i
  
  If ReceiveHTTPFile("http://www.purebasic.fr/english/", GetHomeDirectory()+"tmp.html")
    
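    Content.s = ""
    If ReadFile(0,GetHomeDirectory()+"tmp.html")
      Size.i = Lof(0)
      If Size.i > 0
        *mem = AllocateMemory(Size.i)
        If *mem
          ReadData(0,*mem,Size.i)
          ; the buffer is not zero-terminated, so pass the length explicitly
          Content.s = PeekS(*mem,Size.i,#PB_Ascii)
          FreeMemory(*mem)
        EndIf
      EndIf
      CloseFile(0)   ; close before deleting, even if the allocation failed
      DeleteFile(GetHomeDirectory()+"tmp.html")
    EndIf
    
    Debug Content.s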
    
    Dim Arr.s(0)
    Dim AArr.s(0)
    
    ArrSize.i = ExtractRegularExpression(RegEx.i,Content.s,Arr())
    If ArrSize.i
      Debug "Found "+Str(ArrSize.i)+" HTML Tags:"
      
      For i=0 To ArrSize.i - 1
        
        ReDim AArr(0)
        
        AArrSize.i = ExtractRegularExpression(ARegEx.i,Arr(i),AArr())
        
        Debug Space(4) + "HTML-Tag:"
        Debug Space(8) + Arr(i)
        
        If AArrSize
          Debug Space(4) + "Attributes: "
          For i2 = 0 To AArrSize.i - 1
            Debug Space(8) + AArr(i2)  
          Next
        EndIf
        
      Next
    EndIf
  EndIf
EndIf

Posted: **Tue Apr 23, 2013 4:30 pm**

Wow, thank you a lot for that demonstration! I'll have to do some research on Regular Expression to understand it, though.
Those links you added are full of useful information to me, thanks a lot.

But I must still ask: How did people find the RegEx's? I mean, looking at that I don't understand a word. Why is it exactly that combination of letters?
As I said, I'm gonna keep working on my project for now and (hopefully) understand Regular Expressions by time.
Thank you all.

Posted: **Tue Apr 23, 2013 9:40 pm**

Understanding Regular Expressions is not that hard. Simply use Google to find good tutorials or, even easier, search for finished expressions written by others.
Usually a search query like "regex html tag" or "regex XYZ" is enough to get the right expression string.

An awesome (cheat) sheet, not only for RegEx, can be found here: http://overapi.com/regex/

Have fun with your project,
Deluxe0321
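To show that such a pattern is built piece by piece rather than found by chance, here is a much smaller expression than the two above, with each part commented. It is only a sketch for illustration: the constant #TagName$ and the test string are assumptions, and the pattern merely captures tag names.

```purebasic
EnableExplicit

; "<"      a literal opening angle bracket
; "/?"     an optional slash, so closing tags match as well
; "(\w+)"  one or more word characters, captured as group 1 (the tag name)
#TagName$ = "</?(\w+)"

Define rx.i

rx = CreateRegularExpression(#PB_Any, #TagName$)
If rx
  If ExamineRegularExpression(rx, "<div><p>paragraph</p></div>")
    While NextRegularExpressionMatch(rx)
      ; whole match on the left, captured tag name on the right
      Debug RegularExpressionMatchString(rx) + " -> " + RegularExpressionGroup(rx, 1)
    Wend
  EndIf
  FreeRegularExpression(rx)
Else
  Debug RegularExpressionError()
EndIf
```

The longer expressions in this thread follow the same idea; they just add more alternatives for whitespace, quoted values and attribute names.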

Posted: **Thu Jul 05, 2018 3:35 pm**

I know this thread is several years old, but Deluxe0321 wrote really nice code.

Does anyone know how to obtain the text between the tags instead of the attributes?

Code: Select all

AExpression.s = "(\S+)=["+#DQUOTE$+"']?((?:.(?!["+#DQUOTE$+"']?\s+(?:\S+)=|[>"+#DQUOTE$+"']))+.)["+#DQUOTE$+"']?"

Code: Select all

<TAG>I search to keep this text</TAG>

Posted: **Thu Jul 05, 2018 9:17 pm**

Just include and use the HTML Agility pack.

PS: If a mod thinks it's inappropriate due to the age of the original thread, feel free to erase this post.
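For the simple case asked about above, a capture group may already be enough. The following is a sketch only: InnerText() is a hypothetical helper, it assumes tag$ is a plain tag name without regex metacharacters, and it does not handle nested tags of the same name.

```purebasic
EnableExplicit

; Hypothetical helper: returns the text between the first <tag$ ...> and </tag$>.
Procedure.s InnerText(html$, tag$)
  Protected rx.i, result$ = ""

  ; (.*?) is group 1: as few characters as possible up to the closing tag
  rx = CreateRegularExpression(#PB_Any, "<" + tag$ + "\b[^>]*>(.*?)</" + tag$ + ">",
                               #PB_RegularExpression_DotAll | #PB_RegularExpression_NoCase)
  If rx
    If ExamineRegularExpression(rx, html$) And NextRegularExpressionMatch(rx)
      result$ = RegularExpressionGroup(rx, 1)
    EndIf
    FreeRegularExpression(rx)
  EndIf

  ProcedureReturn result$
EndProcedure

Debug InnerText("<TAG>I search to keep this text</TAG>", "tag")
```

For real-world pages with nesting, comments and scripts, a proper HTML parser is the more robust choice.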

Posted: **Fri Jul 20, 2018 7:46 pm**

Thanks to Deluxe0321 for these two interesting Regular Expressions, and thanks to KCC for digging out this old thread!

Deluxe0321's code does three different things:

Create two Regular expressions.
Fill a string with HTML code from a web page.
Parse that string, using the Regular Expressions.

For more clarity, I separated the parts (and changed some other small things).
My version looks like this:

Code: Select all

EnableExplicit

; html tag expression --> http://haacked.com/archive/2004/10/25/usingregularexpressionstomatchhtml.aspx
#Rex_HTML_Tags$ = "</?\w+((\s+\w+(\s*=\s*(?:" + #DQUOTE$ + ".*?" + #DQUOTE$ + "|'.*?'|[^'" + #DQUOTE$+">\s]+))?)+\s*|\s*)/?>"

; html attribute expression -->  http://stackoverflow.com/a/317081
#Rex_HTML_Attributes$ = "(\S+)=[" + #DQUOTE$ + "']?((?:.(?![" + #DQUOTE$+"']?\s+(?:\S+)=|[>" + #DQUOTE$+"']))+.)[" + #DQUOTE$+"']?"

Define.i s_HTMLTags, s_HTML_Attr

Procedure InitParseHTML ()
   Shared s_HTMLTags, s_HTML_Attr
   
   s_HTMLTags  = CreateRegularExpression(#PB_Any, #Rex_HTML_Tags$)
   s_HTML_Attr = CreateRegularExpression(#PB_Any, #Rex_HTML_Attributes$)
EndProcedure


Procedure ParseHTMLString (html$)
   Shared s_HTMLTags, s_HTML_Attr
   Protected.i numTags, numAttr, t, a
   Protected Dim tags$(0)
   Protected Dim attr$(0)
   
   If s_HTMLTags And s_HTML_Attr
      numTags = ExtractRegularExpression(s_HTMLTags, html$, tags$())
      Debug "Found " + numTags + " HTML tags:"
      
      For t = 0 To numTags - 1
         Debug Space(4) + "HTML tag:"
         Debug Space(8) + tags$(t)
         
         numAttr = ExtractRegularExpression(s_HTML_Attr, tags$(t), attr$())
         If numAttr
            Debug Space(4) + "Attributes: "
            For a = 0 To numAttr - 1
               Debug Space(8) + attr$(a) 
            Next
         EndIf
      Next
      
   EndIf
EndProcedure


Procedure.s DownloadedHTMLPage (url$)
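   Protected *mem, size.i, ret$=""
   
   If ReceiveHTTPFile(url$, GetHomeDirectory() + "tmp.html")
      If ReadFile(0, GetHomeDirectory() + "tmp.html")
         size = Lof(0)
         If size > 0
            *mem = AllocateMemory(size)
            If *mem
               ReadData(0, *mem, size)
               ; the buffer is not zero-terminated, so pass its length in bytes
               ret$ = PeekS(*mem, size, #PB_UTF8 | #PB_ByteLength)
               FreeMemory(*mem)
            EndIf
         EndIf
         CloseFile(0)   ; close before deleting, even if the allocation failed
      EndIf
      DeleteFile(GetHomeDirectory() + "tmp.html")
   EndIf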
   EndIf
   
   ProcedureReturn ret$   
EndProcedure


; -- Demo
Define page$

InitNetwork()
InitParseHTML()

page$ = DownloadedHTMLPage("http://www.purebasic.fr/english/")

Debug page$
Debug "----------------------------------------------------------------------"

If page$ <> ""
   ParseHTMLString(page$)
EndIf

Posted: **Fri Jul 20, 2018 7:51 pm**

@KCC:

In another thread, Little John wrote: Though PureBasic has the built-in function ExtractRegularExpression(), this will only retrieve the parts of the source string which match the Regular Expression, while the parts of the source string which don't match won't be retrieved.

That's why I wrote the procedure SplitByRegEx(), which retrieves the matching parts AND the non-matching parts.

Using it in the context here, it looks like this:

Code: Select all

EnableExplicit

Structure SplitString
   s$
   match.i
EndStructure

Procedure.i SplitByRegEx (regEx.i, source$, List part.SplitString())
   ; -- split a string into parts, that match or don't match a Regular Expression
   ; in : regEx  : number of a Regular Expression generated by CreateRegularExpression()
   ;      source$: string to be split into parts
   ; out: part()      : resulting list of parts
   ;      return value: number of elements in part():
   ;                    0 if source$ = "", > 0 otherwise;
   ;                   -1 on error
   Protected.i left, right
   
   If ExamineRegularExpression(regEx, source$) = 0
      ProcedureReturn -1              ; error
   EndIf
   
   ClearList(part())
   
   left = 1
   While NextRegularExpressionMatch(regEx)
      right = RegularExpressionMatchPosition(regEx)
      If left < right
         AddElement(part())
         part()\s$ = Mid(source$, left, right-left)
         part()\match = #False
      EndIf
      AddElement(part())
      part()\s$ = RegularExpressionMatchString(regEx)
      part()\match = #True
      left = right + RegularExpressionMatchLength(regEx)
   Wend
   
   If left <= Len(source$)
      AddElement(part())
      part()\s$ = Mid(source$, left)
      part()\match = #False
   EndIf
   
   ProcedureReturn ListSize(part())   ; success
EndProcedure


#WhiteSpace$ = ~" \t\r\n"

Procedure.i IsSolelyWhiteSpace (s$)
   Protected.i last, i
   
   last = Len(s$)
   For i = 1 To last
      If FindString(#WhiteSpace$, Mid(s$, i, 1)) = 0
         ProcedureReturn #False
      EndIf   
   Next   
   
   ProcedureReturn #True
EndProcedure


; html tag expression --> http://haacked.com/archive/2004/10/25/usingregularexpressionstomatchhtml.aspx
#Rex_HTML_Tags$ = "</?\w+((\s+\w+(\s*=\s*(?:" + #DQUOTE$ + ".*?" + #DQUOTE$ + "|'.*?'|[^'" + #DQUOTE$+">\s]+))?)+\s*|\s*)/?>"

; html attribute expression -->  http://stackoverflow.com/a/317081
#Rex_HTML_Attributes$ = "(\S+)=[" + #DQUOTE$ + "']?((?:.(?![" + #DQUOTE$+"']?\s+(?:\S+)=|[>" + #DQUOTE$+"']))+.)[" + #DQUOTE$+"']?"

Define.i s_HTMLTags, s_HTML_Attr

Procedure InitParseHTML ()
   Shared s_HTMLTags, s_HTML_Attr
   
   s_HTMLTags  = CreateRegularExpression(#PB_Any, #Rex_HTML_Tags$)
   s_HTML_Attr = CreateRegularExpression(#PB_Any, #Rex_HTML_Attributes$)
EndProcedure


Procedure ParseHTMLString (html$)
   Shared s_HTMLTags, s_HTML_Attr
   Protected.i numPieces, numAttr, a
   Protected NewList piece.SplitString()
   Protected Dim attr$(0)
   
   If s_HTMLTags And s_HTML_Attr
      numPieces = SplitByRegEx(s_HTMLTags, html$, piece())
      Debug "Found " + numPieces + " pieces (tags + stuff outside of tags):"
      
      ForEach piece()
         If piece()\match
            Debug Space(4) + "HTML tag:"
            Debug Space(8) + piece()\s$
            
            numAttr = ExtractRegularExpression(s_HTML_Attr, piece()\s$, attr$())
            If numAttr
               Debug Space(4) + "Attributes: "
               For a = 0 To numAttr - 1
                  Debug Space(8) + attr$(a) 
               Next
            EndIf
            
         ElseIf Not IsSolelyWhiteSpace(piece()\s$)
            Debug Space(4) + "Outside of tags:"
            Debug Space(8) + piece()\s$
         EndIf
      Next
      
   EndIf
EndProcedure


Procedure.s DownloadedHTMLPage (url$)
   Protected *mem, ret$=""
   
   *mem = ReceiveHTTPMemory(url$)
   If *mem
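      ; the downloaded buffer is not zero-terminated, so pass its size in bytes
      ret$ = PeekS(*mem, MemorySize(*mem), #PB_UTF8 | #PB_ByteLength)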
      FreeMemory(*mem)
   EndIf
   
   ProcedureReturn ret$   
EndProcedure


; -- Demo
Define page$

InitNetwork()
InitParseHTML()

page$ = DownloadedHTMLPage("http://www.purebasic.fr/english/")

Debug page$
Debug "----------------------------------------------------------------------"

If page$ <> ""
   ParseHTMLString(page$)
EndIf

Debug "=========================================================================="

page$ = ~"<!DOCTYPE html>\n" +
        ~"<html lang='de'>\n" +
        ~"<head>\n" +
        ~"  <meta charset='utf-8'>\n" +
        ~"  <title>Cool title</title>\n" +
        ~"</head>\n" +
        ~"<body>\n" +
        ~"  <p>Uuuh, a paragraph!</p>\n" +
        ~"  <!--\n" +
        ~"  Comment\n" +
        ~"  -->\n" +
        ~"</body>\n" +
        ~"</html>"

Debug page$
Debug "----------------------------------------------------------------------"
ParseHTMLString(page$)

I also simplified the procedure DownloadedHTMLPage(), but that's not the point here.

The second example in this code shows that the HTML comment (<!-- ... -->) is not recognized as a tag, but as text outside of tags. So there is some room for improvement in the #Rex_HTML_Tags$ regex pattern.
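One possible direction, shown only as a sketch: put an alternative for comments in front of the tag pattern and let the dot match line breaks. The constant #Rex_Comment_Or_Tag$ is an assumption, and it uses a simplified tag pattern instead of #Rex_HTML_Tags$ to keep the example short.

```purebasic
EnableExplicit

; Assumption: "<!--.*?-->" matches a whole comment, the part after "|" a simplified tag.
#Rex_Comment_Or_Tag$ = "<!--.*?-->|</?\w+[^>]*>"

Define rx.i, n.i, k.i
Dim found$(0)

; DotAll lets "." match line breaks too, so multi-line comments are covered.
rx = CreateRegularExpression(#PB_Any, #Rex_Comment_Or_Tag$, #PB_RegularExpression_DotAll)
If rx
  n = ExtractRegularExpression(rx, ~"<p>text</p>\n<!--\n Comment\n-->", found$())
  For k = 0 To n - 1
    Debug found$(k)
  Next
  FreeRegularExpression(rx)
Else
  Debug RegularExpressionError()
EndIf
```

The same prefix could be tried with the full tag pattern; whether it interacts well with the attribute expression would need testing.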

Posted: **Sat Jul 21, 2018 10:02 am**

Though PureBasic has the built-in function ExtractRegularExpression(), this will only retrieve the parts of the source string which match the Regular Expression, while the parts of the source string which don't match won't be retrieved.

ReplaceRegularExpression() replaces the RegEx result with nothing and keeps the rest of the initial character string.

Posted: **Sat Jul 21, 2018 10:40 am**

Marc56us wrote:Replaces the RegEx result with nothing and keeps the rest of the initial character string

I don't see how this will give the same result as my code above. Can you please give some working example code?

Posted: **Sat Jul 21, 2018 2:31 pm**

I'm not trying to get the same result, I'm just indicating that if you want to retrieve the text that doesn't match the expression, just replace what matches with nothing (in other words, delete the text that matches the regex).

So what's left is what doesn't match.

Though PureBasic has the built-in function ExtractRegularExpression(), this will only retrieve the parts of the source string which match the Regular Expression, while the parts of the source string which don't match won't be retrieved.

Code: Select all

; Pattern$ holds the expression for <SomeThingToRemove>
If CreateRegularExpression(0, Pattern$)
    Text_left$ = ReplaceRegularExpression(0, Source_Text$, "")
EndIf

Posted: **Sat Jul 21, 2018 3:12 pm**

Marc56us wrote:I'm not trying to get the same result

Aha.

Marc56us wrote:I'm just indicating that if you want to retrieve the text that doesn't match the expression, just replace what matches with nothing

That does not address the problem at hand and thus does not yield the desired result.
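The difference between the two approaches can be seen in a small sketch. StripTags() is a hypothetical helper using a simplified tag pattern, not code from either poster: once the matches are replaced with nothing, the remaining pieces are joined into one string and their boundaries are lost.

```purebasic
EnableExplicit

; Hypothetical helper: deletes everything that looks like a tag.
Procedure.s StripTags(source$)
  Protected rx.i, result$ = source$

  rx = CreateRegularExpression(#PB_Any, "</?\w+[^>]*>")
  If rx
    result$ = ReplaceRegularExpression(rx, source$, "")
    FreeRegularExpression(rx)
  EndIf

  ProcedureReturn result$
EndProcedure

; Two separate paragraphs come back as one undivided string: "onetwo"
Debug StripTags("<p>one</p><p>two</p>")
```

SplitByRegEx() above keeps "one" and "two" as separate list elements, which matters when each piece of text has to be assigned to the tag it belongs to.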

Question about HTML