All times are UTC + 1 hour




 Post subject: Question about HTML
PostPosted: Sun Apr 21, 2013 7:10 pm 
Joined: Fri May 25, 2012 7:39 pm
Posts: 183
Hello,
I have a question about HTML. I'd like to know whether ALL tags support ALL the attributes that exist (height, margin, bgcolor, frontcolor).
I'm asking because I'd like to read HTML text using PureBasic, and I think I'll need a tree structure to handle it. My idea is a structure for tags that contains things like width, height, x, y, margin, etc. My approach so far:

Code:
Global inputt.s = "<html><head><title>Titel</title></head><body><div><p>paragraph</p></div></body></html>"
Global i.i = -1
Global output.s = ""
Global *start.Character

Enumeration
  #html
  #body
  #head
  #footer
  #p
  #div
  #link
  #br
  #script
  #meta
  #title
EndEnumeration

Structure tag
  frontColor.i
  backColor.i
  type.i
EndStructure

Structure br
  text.s
EndStructure

Structure p
  text.s
  List brs.br()
EndStructure

Structure link
  text.s
  href.s
EndStructure

Structure div
  List ps.p()
  List links.link()
EndStructure

Structure footer
  List ps.p()
  List links.link()
EndStructure
 
Structure body
  List divs.div()
  List footers.footer()
EndStructure

Structure meta
  content.s
EndStructure

Structure title
  text.s
EndStructure

Structure script
  type.s
EndStructure

Structure head
  List titles.title()
  List metas.meta()
  List scripts.script()
EndStructure

Structure html
  List bodies.body()
  List heads.head()
EndStructure

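Procedure.s examineTag(lookuponly.l = 0)
  ; Reads a tag name (letters and digits) starting at *start.
  ; *start must point into the input string before this is called.
  Protected name.s = ""
  Protected *old.Character = *start
 
  While (*start\c >= 'a' And *start\c <= 'z') Or (*start\c >= 'A' And *start\c <= 'Z') Or (*start\c >= '0' And *start\c <= '9')
    name + Chr(*start\c)
    *start + SizeOf(Character)
  Wend
 
  If lookuponly
    *start = *old            ;peek only: restore the read position
  EndIf
  ProcedureReturn name
EndProcedure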

Procedure findTag(input$)
  Protected pos.i, intag.i = #False
  Protected c.s, text.s = ""
  
  For pos = 1 To Len(input$)
    c = Mid(input$,pos,1)
    If intag = #False And c = "<"
      ; a tag starts: emit the text collected before it
      If text <> ""
        output = output + "text: " + text + #CRLF$
      EndIf
      text = ""
      intag = #True
    ElseIf intag = #True And c = ">"
      ; the tag ends: emit its name (and attributes, if any)
      output = output + "tag: " + text + #CRLF$
      text = ""
      intag = #False
    Else
      text = text + c
    EndIf
  Next
  
  ; text after the last tag, if any
  If intag = #False And text <> ""
    output = output + "text: " + text + #CRLF$
  EndIf
EndProcedure

findTag(inputt)
Debug output


I really don't know if what I did there actually makes sense at all or if there'd be a much easier way.
Maybe someone has an answer to my question and/or could help doing the structure.
Looking forward to hearing from you all


 Post subject: Re: Question about HTML
PostPosted: Mon Apr 22, 2013 10:03 am 
Joined: Thu May 11, 2006 1:04 am
Posts: 388
Location: Mullica Hill, NJ USA
J@ckWhiteIII wrote:
Hello,
I have a question about HTML. I'd like to know whether ALL tags support ALL the attributes that exist (height, margin, bgcolor, frontcolor).
I'm asking because I'd like to read HTML text using PureBasic, and I think I'll need a tree structure to handle it. My idea is a structure for tags that contains things like width, height, x, y, margin, etc. My approach so far:


See http://www.w3schools.com/tags/ref_standardattributes.asp for a list of global attributes shared by all tags. You will also find the additional attributes for each tag.
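
Since most attributes are shared or optional, one structure per tag may not be needed. A possible alternative is a single generic node that keeps its attributes in a map, so any tag can carry any subset of attributes. This is only a sketch: the HtmlNode structure and its field names are made up for illustration, and it assumes a PureBasic version that allows lists and maps inside structures, including a list of the structure's own type.

```purebasic
EnableExplicit

; Hypothetical generic node: one structure for every tag, with the
; attributes kept in a map instead of one field per attribute.
Structure HtmlNode
  name.s                  ; tag name, e.g. "div"
  text.s                  ; text directly inside the tag
  Map attr.s()            ; attribute name -> attribute value
  List child.HtmlNode()   ; nested tags
EndStructure

Define root.HtmlNode
root\name = "body"
root\attr("id") = "main"

; add one child tag below the root
AddElement(root\child())
root\child()\name = "p"
root\child()\text = "paragraph"
root\child()\attr("class") = "intro"

Debug root\name + " id=" + root\attr("id")
ForEach root\child()
  Debug "  " + root\child()\name + " class=" + root\child()\attr("class") + ": " + root\child()\text
Next
```

With this layout an unknown attribute is simply another map entry, so the structure does not have to change when a tag uses an attribute you did not plan for.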

_________________
There's no such thing as free time.
There's no such thing as spare time.
There's no such thing as down time.
All you have is life time.
Go!
- Henry Rollins

Tupponce Hosting http://www.TupponceHosting.net


 Post subject: Re: Question about HTML
PostPosted: Mon Apr 22, 2013 3:35 pm 
Joined: Sat Jun 28, 2003 12:01 am
Posts: 490
It also depends on the doctype (HTML5, HTML 4.01, XHTML 1.0, XHTML 1.1, ...).

==> http://www.w3schools.com/tags/ref_html_dtd.asp
==> http://www.w3schools.com/tags/ref_stand ... ibutes.asp

_________________
Windows 10 / Windows 7
PB Last Final / Last Beta Testing


 Post subject: Re: Question about HTML
PostPosted: Mon Apr 22, 2013 4:06 pm 
Joined: Fri May 25, 2012 7:39 pm
Posts: 183
Thank you a lot for those links, they gave me information I need.

But I have another question, this time about the PureBasic code. Does my first attempt make sense or is it a "fail" already? I don't really know if what I did makes sense or whether it is appropriate or not. I'd appreciate information and/or suggestions to improve that little code snippet.
Once again, thank you!


 Post subject: Re: Question about HTML
PostPosted: Mon Apr 22, 2013 9:02 pm 
Joined: Tue Sep 16, 2008 6:11 am
Posts: 69
Location: ger
I would do it like this:
Code:
InitNetwork()


;// html tag expression --> http://haacked.com/archive/2004/10/25/usingregularexpressionstomatchhtml.aspx
Expression.s = "</?\w+((\s+\w+(\s*=\s*(?:"+#DQUOTE$+".*?"+#DQUOTE$+"|'.*?'|[^'"+#DQUOTE$+">\s]+))?)+\s*|\s*)/?>"

;// html attribute expression -->  http://stackoverflow.com/a/317081
AExpression.s = "(\S+)=["+#DQUOTE$+"']?((?:.(?!["+#DQUOTE$+"']?\s+(?:\S+)=|[>"+#DQUOTE$+"']))+.)["+#DQUOTE$+"']?"

RegEx.i  = CreateRegularExpression(#PB_Any,Expression.s)
ARegEx.i = CreateRegularExpression(#PB_Any,AExpression.s)

If RegEx.i and ARegEx.i
 
  If ReceiveHTTPFile("http://www.purebasic.fr/english/", GetHomeDirectory()+"tmp.html")
   
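    If ReadFile(0,GetHomeDirectory()+"tmp.html")
      Size.i = Lof(0)
      *mem = AllocateMemory(Size.i + 1)  ; spare zero byte, also avoids a zero-size allocation
      If *mem
        ReadData(0,*mem,Size.i)
        ; pass the length explicitly instead of relying on a terminating zero
        Content.s = PeekS(*mem,Size.i,#PB_Ascii)
        FreeMemory(*mem)
      EndIf
      CloseFile(0)                       ; close the file even if the allocation failed
      DeleteFile(GetHomeDirectory()+"tmp.html")
    EndIf
   
    Debug Content.s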
   
    Dim Arr.s(0)
    Dim AArr.s(0)
   
    ArrSize.i = ExtractRegularExpression(RegEx.i,Content.s,Arr())
    If ArrSize.i
      Debug "Found "+Str(ArrSize.i)+" HTML Tags:"
     
      For i=0 To ArrSize.i - 1
       
        ReDim AArr(0)
       
        AArrSize.i = ExtractRegularExpression(ARegEx.i,Arr(i),AArr())
       
        Debug Space(4) + "HTML-Tag:"
        Debug Space(8) + Arr(i)
       
        If AArrSize
          Debug Space(4) + "Attributes: "
          For i2 = 0 To AArrSize.i - 1
            Debug Space(8) + AArr(i2) 
          Next
        EndIf
       
      Next
    EndIf
  EndIf
EndIf


Last edited by Deluxe0321 on Fri Nov 28, 2014 6:21 am, edited 1 time in total.

 Post subject: Re: Question about HTML
PostPosted: Tue Apr 23, 2013 4:30 pm 
Joined: Fri May 25, 2012 7:39 pm
Posts: 183
Wow, thank you a lot for that demonstration! I'll have to do some research on regular expressions to understand it, though.
Those links you added are full of useful information for me, thanks a lot.

But I still have to ask: how do people come up with these regular expressions? Looking at them, I don't understand a thing. Why exactly that combination of characters?
As I said, I'm going to keep working on my project for now and (hopefully) understand regular expressions over time.
Thank you all.


 Post subject: Re: Question about HTML
PostPosted: Tue Apr 23, 2013 9:40 pm 
Joined: Tue Sep 16, 2008 6:11 am
Posts: 69
Location: ger
Understanding regular expressions is not that hard. Simply use Google to find good tutorials or, even easier, search for expressions that others have already written.
Usually a search query like "regex html tag" or "regex XYZ" is enough to get the right expression string. ;)

An awesome (cheat) sheet, not only for RegEx, can be found here: http://overapi.com/regex/
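
To show that such an expression is just small pieces put together, here is a much simpler pattern taken apart piece by piece. It is only a sketch, not the pattern from my code above: it matches opening tags and captures the tag name, and it assumes a PureBasic version that provides ExamineRegularExpression() and RegularExpressionGroup(). The sample string and the variable names are made up.

```purebasic
; Pattern pieces:
;   <        a literal "<"
;   (\w+)    group 1: one or more word characters = the tag name
;   [^>]*    anything that is not ">" (the attributes, if any)
;   >        a literal ">"
Define rex.i, matches.i

rex = CreateRegularExpression(#PB_Any, "<(\w+)[^>]*>")

If rex
  If ExamineRegularExpression(rex, "<p class='x'>Hi</p><br>")
    While NextRegularExpressionMatch(rex)
      matches + 1
      ; whole match, then only the captured tag name
      Debug RegularExpressionMatchString(rex) + " -> " + RegularExpressionGroup(rex, 1)
    Wend
  EndIf
  FreeRegularExpression(rex)
EndIf
```

The closing tag is skipped because "/" is not a word character, so (\w+) cannot match right after the "<".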

Have fun with your project,
Deluxe0321


 Post subject: Re: Question about HTML
PostPosted: Thu Jul 05, 2018 3:35 pm 
Joined: Sun Nov 05, 2006 11:42 pm
Posts: 4520
Location: Lyon - France
I know it has been several years :oops: , but Deluxe0321 wrote really nice code 8)

Does someone know how to obtain the text between the tags instead of the attributes?
Code:
AExpression.s = "(\S+)=["+#DQUOTE$+"']?((?:.(?!["+#DQUOTE$+"']?\s+(?:\S+)=|[>"+#DQUOTE$+"']))+.)["+#DQUOTE$+"']?"
Code:
<TAG>I search to keep this text</TAG>

_________________
The happiness is a road...
Not a destination


 Post subject: Re: Question about HTML
PostPosted: Thu Jul 05, 2018 9:17 pm 
Joined: Mon Apr 10, 2017 6:17 pm
Posts: 285
Location: Germany
Just include and use the HTML Agility pack.

PS: If a mod thinks it's inappropriate due to the age of the original thread, feel free to erase this post.

_________________
webpage


 Post subject: Re: Question about HTML
PostPosted: Fri Jul 20, 2018 7:46 pm 
Joined: Thu Jun 07, 2007 3:25 pm
Posts: 3698
Location: Berlin, Germany
Thanks to Deluxe0321 for these two interesting Regular Expressions, and thanks to KCC for digging out this old thread! :D

Deluxe0321's code does three different things:
  • Create two Regular expressions.
  • Fill a string with HTML code from a web page.
  • Parse that string, using the Regular Expressions.

For more clarity, I separated the parts (and changed some other small things).
My version looks like this:
Code:
EnableExplicit

; html tag expression --> http://haacked.com/archive/2004/10/25/usingregularexpressionstomatchhtml.aspx
#Rex_HTML_Tags$ = "</?\w+((\s+\w+(\s*=\s*(?:" + #DQUOTE$ + ".*?" + #DQUOTE$ + "|'.*?'|[^'" + #DQUOTE$+">\s]+))?)+\s*|\s*)/?>"

; html attribute expression -->  http://stackoverflow.com/a/317081
#Rex_HTML_Attributes$ = "(\S+)=[" + #DQUOTE$ + "']?((?:.(?![" + #DQUOTE$+"']?\s+(?:\S+)=|[>" + #DQUOTE$+"']))+.)[" + #DQUOTE$+"']?"

Define.i s_HTMLTags, s_HTML_Attr

Procedure InitParseHTML ()
   Shared s_HTMLTags, s_HTML_Attr
   
   s_HTMLTags  = CreateRegularExpression(#PB_Any, #Rex_HTML_Tags$)
   s_HTML_Attr = CreateRegularExpression(#PB_Any, #Rex_HTML_Attributes$)
EndProcedure


Procedure ParseHTMLString (html$)
   Shared s_HTMLTags, s_HTML_Attr
   Protected.i numTags, numAttr, t, a
   Protected Dim tags$(0)
   Protected Dim attr$(0)
   
   If s_HTMLTags And s_HTML_Attr
      numTags = ExtractRegularExpression(s_HTMLTags, html$, tags$())
      Debug "Found " + numTags + " HTML tags:"
     
      For t = 0 To numTags - 1
         Debug Space(4) + "HTML tag:"
         Debug Space(8) + tags$(t)
         
         numAttr = ExtractRegularExpression(s_HTML_Attr, tags$(t), attr$())
         If numAttr
            Debug Space(4) + "Attributes: "
            For a = 0 To numAttr - 1
               Debug Space(8) + attr$(a)
            Next
         EndIf
      Next
     
   EndIf
EndProcedure


Procedure.s DownloadedHTMLPage (url$)
   Protected *mem, size.i, ret$=""
   Protected file$ = GetHomeDirectory() + "tmp.html"
   
   If ReceiveHTTPFile(url$, file$)
      If ReadFile(0, file$)
         size = Lof(0)
         If size > 0
            *mem = AllocateMemory(size)
            If *mem
               ReadData(0, *mem, size)
               ; pass the byte count: the buffer is not null-terminated
               ret$ = PeekS(*mem, size, #PB_UTF8 | #PB_ByteLength)
               FreeMemory(*mem)
            EndIf
         EndIf
         CloseFile(0)      ; close the file even if the allocation failed
      EndIf
      DeleteFile(file$)
   EndIf
   
   ProcedureReturn ret$   
EndProcedure


; -- Demo
Define page$

InitNetwork()
InitParseHTML()

page$ = DownloadedHTMLPage("http://www.purebasic.fr/english/")

Debug page$
Debug "----------------------------------------------------------------------"

If page$ <> ""
   ParseHTMLString(page$)
EndIf

_________________
Please excuse my flawed English. My native language is PureBasic.
Search
RSBasic's backups


 Post subject: Re: Question about HTML
PostPosted: Fri Jul 20, 2018 7:51 pm 
Joined: Thu Jun 07, 2007 3:25 pm
Posts: 3698
Location: Berlin, Germany
@KCC:

In another thread, Little John wrote:
Though PureBasic has the built-in function ExtractRegularExpression(), this will only retrieve the parts of the source string which match the Regular Expression, while the parts of the source string which don't match won't be retrieved.
That's why I wrote the procedure SplitByRegEx(), which retrieves the matching parts AND the non-matching parts.

Using it in the context here, it looks like this:
Code:
EnableExplicit

Structure SplitString
   s$
   match.i
EndStructure

Procedure.i SplitByRegEx (regEx.i, source$, List part.SplitString())
   ; -- split a string into parts, that match or don't match a Regular Expression
   ; in : regEx  : number of a Regular Expression generated by CreateRegularExpression()
   ;      source$: string to be split into parts
   ; out: part()      : resulting list of parts
   ;      return value: number of elements in part():
   ;                    0 if source$ = "", > 0 otherwise;
   ;                   -1 on error
   Protected.i left, right
   
   If ExamineRegularExpression(regEx, source$) = 0
      ProcedureReturn -1              ; error
   EndIf
   
   ClearList(part())
   
   left = 1
   While NextRegularExpressionMatch(regEx)
      right = RegularExpressionMatchPosition(regEx)
      If left < right
         AddElement(part())
         part()\s$ = Mid(source$, left, right-left)
         part()\match = #False
      EndIf
      AddElement(part())
      part()\s$ = RegularExpressionMatchString(regEx)
      part()\match = #True
      left = right + RegularExpressionMatchLength(regEx)
   Wend
   
   If left <= Len(source$)
      AddElement(part())
      part()\s$ = Mid(source$, left)
      part()\match = #False
   EndIf
   
   ProcedureReturn ListSize(part())   ; success
EndProcedure


#WhiteSpace$ = ~" \t\r\n"

Procedure.i IsSolelyWhiteSpace (s$)
   Protected.i last, i
   
   last = Len(s$)
   For i = 1 To last
      If FindString(#WhiteSpace$, Mid(s$, i, 1)) = 0
         ProcedureReturn #False
      EndIf   
   Next   
   
   ProcedureReturn #True
EndProcedure


; html tag expression --> http://haacked.com/archive/2004/10/25/usingregularexpressionstomatchhtml.aspx
#Rex_HTML_Tags$ = "</?\w+((\s+\w+(\s*=\s*(?:" + #DQUOTE$ + ".*?" + #DQUOTE$ + "|'.*?'|[^'" + #DQUOTE$+">\s]+))?)+\s*|\s*)/?>"

; html attribute expression -->  http://stackoverflow.com/a/317081
#Rex_HTML_Attributes$ = "(\S+)=[" + #DQUOTE$ + "']?((?:.(?![" + #DQUOTE$+"']?\s+(?:\S+)=|[>" + #DQUOTE$+"']))+.)[" + #DQUOTE$+"']?"

Define.i s_HTMLTags, s_HTML_Attr

Procedure InitParseHTML ()
   Shared s_HTMLTags, s_HTML_Attr
   
   s_HTMLTags  = CreateRegularExpression(#PB_Any, #Rex_HTML_Tags$)
   s_HTML_Attr = CreateRegularExpression(#PB_Any, #Rex_HTML_Attributes$)
EndProcedure


Procedure ParseHTMLString (html$)
   Shared s_HTMLTags, s_HTML_Attr
   Protected.i numPieces, numAttr, a
   Protected NewList piece.SplitString()
   Protected Dim attr$(0)
   
   If s_HTMLTags And s_HTML_Attr
      numPieces = SplitByRegEx(s_HTMLTags, html$, piece())
      Debug "Found " + numPieces + " pieces (tags + stuff outside of tags):"
     
      ForEach piece()
         If piece()\match
            Debug Space(4) + "HTML tag:"
            Debug Space(8) + piece()\s$
           
            numAttr = ExtractRegularExpression(s_HTML_Attr, piece()\s$, attr$())
            If numAttr
               Debug Space(4) + "Attributes: "
               For a = 0 To numAttr - 1
                  Debug Space(8) + attr$(a)
               Next
            EndIf
           
         ElseIf Not IsSolelyWhiteSpace(piece()\s$)
            Debug Space(4) + "Outside of tags:"
            Debug Space(8) + piece()\s$
         EndIf
      Next
     
   EndIf
EndProcedure


Procedure.s DownloadedHTMLPage (url$)
   Protected *mem, ret$=""
   
   *mem = ReceiveHTTPMemory(url$)
   If *mem
      ; the buffer is not null-terminated, so pass its size in bytes
      ret$ = PeekS(*mem, MemorySize(*mem), #PB_UTF8 | #PB_ByteLength)
      FreeMemory(*mem)
   EndIf
   
   ProcedureReturn ret$   
EndProcedure


; -- Demo
Define page$

InitNetwork()
InitParseHTML()

page$ = DownloadedHTMLPage("http://www.purebasic.fr/english/")

Debug page$
Debug "----------------------------------------------------------------------"

If page$ <> ""
   ParseHTMLString(page$)
EndIf

Debug "=========================================================================="

page$ = ~"<!DOCTYPE html>\n" +
        ~"<html lang='de'>\n" +
        ~"<head>\n" +
        ~"  <meta charset='utf-8'>\n" +
        ~"  <title>Cool title</title>\n" +
        ~"</head>\n" +
        ~"<body>\n" +
        ~"  <p>Uuuh, a paragraph!</p>\n" +
        ~"  <!--\n" +
        ~"  Comment\n" +
        ~"  -->\n" +
        ~"</body>\n" +
        ~"</html>"

Debug page$
Debug "----------------------------------------------------------------------"
ParseHTMLString(page$)
I also simplified the procedure DownloadedHTMLPage(), but that's not the point here.

The second example in this code shows that the HTML comment tag <!-- ... --> is not recognized as a tag, but as text outside of tags. So there is some room for improvement in the #Rex_HTML_Tags$ regex pattern. ;-)
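
One possible direction for that improvement, as a rough sketch only: put an alternative for comments in front of the tag pattern and create the expression with #PB_RegularExpression_DotAll, so that "." also matches the line breaks inside a comment. The constant #Rex_CommentOrTag$ and the variable names are made up for this example, and the tag part is deliberately simplified (it would break on a ">" inside a quoted attribute value), so it is not a drop-in replacement for #Rex_HTML_Tags$.

```purebasic
; Hypothetical pattern: try an HTML comment first, then a simplified tag.
#Rex_CommentOrTag$ = "<!--.*?-->|</?\w+[^>]*>"

Define rex.i, n.i, k.i
Dim found$(0)

; DotAll lets ".*?" run across the line breaks inside the comment
rex = CreateRegularExpression(#PB_Any, #Rex_CommentOrTag$, #PB_RegularExpression_DotAll)

If rex
  n = ExtractRegularExpression(rex, ~"<p>Text</p>\n<!--\nComment\n-->", found$())
  For k = 0 To n - 1
    Debug found$(k)      ; the comment is now reported as one match
  Next
  FreeRegularExpression(rex)
EndIf
```

The order of the alternatives matters here: the comment branch has to come first, because the tag branch cannot match "<!--" at all and would otherwise leave the comment to be reported as text.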

 Post subject: Re: Question about HTML
PostPosted: Sat Jul 21, 2018 10:02 am 
Joined: Sat Feb 08, 2014 3:26 pm
Posts: 685
Quote:
Though PureBasic has the built-in function ExtractRegularExpression(), this will only retrieve the parts of the source string which match the Regular Expression, while the parts of the source string which don't match won't be retrieved.

ReplaceRegularExpression() replaces the RegEx result with nothing and keeps the rest of the initial character string.

:wink:

_________________
(English is not my native language, I use an online translator)
Windows 10 Family x64 + Linux (Slackware, Debian on Oracle VirtualBox 6.0) + Raspberry Pi


 Post subject: Re: Question about HTML
PostPosted: Sat Jul 21, 2018 10:40 am 
Joined: Thu Jun 07, 2007 3:25 pm
Posts: 3698
Location: Berlin, Germany
Marc56us wrote:
Replaces the RegEx result with nothing and keeps the rest of the initial character string

I don't see how this will give the same result as my code above. Can you please give some working example code?

 Post subject: Re: Question about HTML
PostPosted: Sat Jul 21, 2018 2:31 pm 
Joined: Sat Feb 08, 2014 3:26 pm
Posts: 685
I'm not trying to get the same result. I'm just indicating that if you want to retrieve the text that doesn't match the expression, just replace what matches with nothing (in other words, delete the text that matches the regex).

What's left is what doesn't match.

Quote:
Though PureBasic has the built-in function ExtractRegularExpression(), this will only retrieve the parts of the source string which match the Regular Expression, while the parts of the source string which don't match won't be retrieved.

Code:
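; Placeholder pattern (an assumption for this example): any tag
#SomeThingToRemove$ = "<[^>]+>"

If CreateRegularExpression(0, #SomeThingToRemove$)
    Text_left$ = ReplaceRegularExpression(0, Source_Text$, "")
    FreeRegularExpression(0)
EndIf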

:wink:

 Post subject: Re: Question about HTML
PostPosted: Sat Jul 21, 2018 3:12 pm 
Joined: Thu Jun 07, 2007 3:25 pm
Posts: 3698
Location: Berlin, Germany
Marc56us wrote:
I'm not trying to get the same result

Aha.

Marc56us wrote:
I'm just indicating that if you want to retrieve the text that doesn't match the expression, just replace what matches with nothing

That does not address the problem at hand and thus does not yield the desired result.
