I need to do a fast string search.
Re: I need to do a fast string search.
Just to throw a spanner in the works: don't forget that closing tags
sometimes (very often) have a space in them. So searching solely
for "</a>" and "</p>" will fail when they're "< /a>" and "< /p>".
In other words, you've always got to take that space into account.
sometimes (very often) have a space in them. So searching solely
for "</a>" and "</p>" will fail when they're "< /a>" and "< /p>".
In other words, you've always got to take that space into account.
I compile using 5.31 (x86) on Win 7 Ultimate (64-bit).
"PureBasic won't be object oriented, period" - Fred.
"PureBasic won't be object oriented, period" - Fred.
Re: I need to do a fast string search.
Great, i will take a loook tonight.
One thing i found is that somtimes a tag contains the same tag inside, so i need to be sure that im finding the closing tag and not some tag that is part of the contnt, per example
I learn to be carefull to find, in this case, the second </div> as the closing tag and not the first one. That makes me add more code that make a bit slower my code, also i need to strip some code like scripts, strip th opening, the content, and the closing tag.
One thing i found is that somtimes a tag contains the same tag inside, so i need to be sure that im finding the closing tag and not some tag that is part of the contnt, per example
Code: Select all
<div id="1">This is the <div id=2>content</div>of the tag</div>
ARGENTINA WORLD CHAMPION
Re: I need to do a fast string search.
Don't forget there's tips in this forum that converts raw HTML
to plain text, so that might save some time? I don't know how
that affects your searches, though; but at least it bulk-removes
all the tags for you.
to plain text, so that might save some time? I don't know how
that affects your searches, though; but at least it bulk-removes
all the tags for you.
I compile using 5.31 (x86) on Win 7 Ultimate (64-bit).
"PureBasic won't be object oriented, period" - Fred.
"PureBasic won't be object oriented, period" - Fred.
Re: I need to do a fast string search.
What result would you want for the following example, and what do you want to do with the result(s)?
The question is, how do you want to handle/extract/replace nested tags?
Code: Select all
<div id="1">This is <b><u>the</u></b> <div id=2>content</div>of <i>the</i> tag</div>
Re: I need to do a fast string search.
Desired result if we are searching DIV innerText: This is the content of the tag.Danilo wrote:What result would you want for the following example, and what do you want to do with the result(s)?The question is, how do you want to handle/extract/replace nested tags?Code: Select all
<div id="1">This is <b><u>the</u></b> <div id=2>content</div>of <i>the</i> tag</div>
If we search P innerText and the text was:
Code: Select all
<p><div id="1">This is <b><u>the</u></b> <div id=2>content</div>of <i>the</i> tag</div></p>
But the case could be different:
Code: Select all
<p><div id="1">This is <b><u>the</u></b> <div id=2>content</div>of <i>the</i> tag.</div><br />
<div id="1">This is <b><u>the</u></b> <div id=2>second content</div>of <i>the</i> tag</div></p>
This is the content of the tag.
This is the second content of the tag.
Lets say: we ignore the inner tags, even nested tags. Just extract innerText (and not innerHTML).
Last edited by ricardo on Mon Aug 11, 2014 3:13 am, edited 1 time in total.
ARGENTINA WORLD CHAMPION
Re: I need to do a fast string search.
So you just want to remove all <tag> and </tag> from the text, including <tag with more stuff>, right?
Re: I need to do a fast string search.
Danilo wrote:So you just want to remove all <tag> and </tag> from the text, including <tag with more stuff>, right?
I want to find the content inside any tag i want an d remove the tags, even the inner tags. Including tags with attributes or more stuff.
I want the innerText that is exactly that. The text without tags contained inside some tags.
Its important to ignore included tags, because just searching opening and closing tags could drive to mess and think that the closing tag of the nested tag is the closing tag of the container tag.
Like in this example, some code could detects the blue divs are the opening and closing tags for the same div and its not.
<div>This is<div>the content</div>of the div</div>
My code identify nested tags without confusion, but all this takes so long. Because i run it manytimes in a short time.
Thats why i start looking some optimized way.
As far as i find is very hard to manage nested tags (because there could be 0,1 or 75 nested tags, any number) using RegEx.
The problem is that real world HTML is not very clean and its not XML, and i am scraping real world pages that i dont develope.
ARGENTINA WORLD CHAMPION
Re: I need to do a fast string search.
Maybe a silly question but what do you consider fast and what do you consider slow ?
Do you have any test document and an indication of how long your current routine takes ?
What usually helps to optimize code is something to start with so you can check improvement.
Do you have any test document and an indication of how long your current routine takes ?
What usually helps to optimize code is something to start with so you can check improvement.
Windows (x64)
Raspberry Pi OS (Arm64)
Raspberry Pi OS (Arm64)
Re: I need to do a fast string search.
Small start for strings:
EDIT: changed it, see last example (get all <div> elements)
EDIT2: added some comments
Code: Select all
EnableExplicit
Global NewMap allowedTags()
Macro AddTag(string):allowedTags(string) = 1:EndMacro
AddTag( "h1") : AddTag("h2") : AddTag( "h3")
AddTag( "b" ) : AddTag("i" ) : AddTag( "u" ) : AddTag("p")
AddTag("div") : ;AddTag("br")
#RemoveOnlyAllowedTags = 0
Macro EatWhiteSpace(charPointer)
While charPointer\c And ( charPointer\c = ' ' Or charPointer\c = #TAB ) ; remove SPACE and TAB
charPointer + SizeOf(Character)
Wend
EndMacro
Macro EatEverythingIncludingChar(charPointer, charToSearch)
While charPointer\c And charPointer\c <> charToSearch ; remove any chars until
charPointer + SizeOf(Character) ; charToSearch was found
Wend
If charPointer\c = charToSearch : charPointer + SizeOf(Character) : EndIf ; remove charToSearch, too
EndMacro
Procedure.s GetTagFromPointer(*pInput.Character,*foundSlash.Integer) ; extracts an identifier/name/word
Protected result.s, slash ;
If *pInput ;
If *foundSlash : *foundSlash\i = 0 : EndIf ; reset: *foundSlash\i = 0
EatWhiteSpace(*pInput) ; remove whitespace
If *pInput\c = '/' ;
slash = #True ; check for optional '/' in front
*pInput + SizeOf(Character) ; of the identifier
EndIf
If (*pInput\c >= 'a' And *pInput\c <= 'z') Or (*pInput\c >= 'A' And *pInput\c <= 'Z') ; identifiers start with a-zA-Z,
While *pInput\c And ((*pInput\c >= 'a' And *pInput\c <= 'z') Or ; followed by a-zA-Z0-9
(*pInput\c >= 'A' And *pInput\c <= 'Z') Or ;
(*pInput\c >= '0' And *pInput\c <= '9')) ;
result + Chr( *pInput\c ) ; extract the identifier (tag name)
*pInput + SizeOf(Character)
Wend
result = LCase(result) ; we work internally with lowercase
; tag names
CompilerIf #RemoveOnlyAllowedTags <> 0 ; CompilerIf to check if extracted
If allowedTags(result)=0 ; tag name is within the list of
ProcedureReturn "" ; allowed tags
EndIf
CompilerEndIf
If slash And *foundSlash ; write slash state to output variable
*foundSlash\i = #True
EndIf
ProcedureReturn result ; return the result
EndIf
EndIf
EndProcedure
Procedure.s ExtractTag(inputText.s,tagToExtract.s,*foundAtPos.Integer,startPos=0)
Protected result.s, recording, tag.s, foundSlash, startPosWritten
Protected NewMap tagCount()
Protected *pInput.Character = @inputText
If *foundAtPos : *foundAtPos\i = -1 : EndIf
tagToExtract = LCase(tagToExtract)
If *pInput
*pInput + startPos*SizeOf(Character) ; begin search at 'inputText' + 'startPos' chars
Repeat
If *pInput\c = '<' ; found tag start char '<'
tag = GetTagFromPointer(*pInput+SizeOf(Character),@foundSlash) ; get the identifier/name of the tag
If tag
If foundSlash ; ending tag was found
If FindMapElement(tagCount(),tag)
If tagCount(tag) > 0
tagCount(tag) - 1
EndIf
If tag = tagToExtract And tagCount(tag) = 0
ProcedureReturn result
EndIf
EndIf
Else ; start tag was found
tagCount(tag) + 1
If tag = tagToExtract
If startPosWritten=0 And *foundAtPos
*foundAtPos\i = (*pInput - @inputText)/SizeOf(Character)
startPosWritten = 1
EndIf
recording = #True ; we found our start tag, start recording
EndIf
EndIf
EatEverythingIncludingChar(*pInput,'>') ; ignore all other stuff after the tag name
Else ; no allowed or valid tag name was found:
If recording ;
result + Chr(*pInput\c) ; add chars to the result
EndIf
*pInput + SizeOf(Character)
EndIf
Else
If recording ; we are after the start tag, so
result + Chr(*pInput\c) ; add all chars outside tags to the result
EndIf
*pInput + SizeOf(Character)
EndIf
Until *pInput\c = 0
EndIf
ProcedureReturn result.s
EndProcedure
Define text.s, pos
text = ExtractTag("<div>This is <div>the content</div> of the div</div>","div",@pos)
If text = ""
Debug "nothing found."
Else
Debug "Extracted text (pos:"+pos+") "+text
EndIf
Debug "---------------"
text = ExtractTag("<p><div id='1'>This is <b><u>the</u></b> <div id=2>content</div> of <i>the</i> tag</div></p>","div",@pos)
If pos = -1
Debug "nothing found."
Else
Debug "Extracted text (pos:"+pos+") "+text
EndIf
Debug "---------------"
text = ExtractTag("<p><div id='1'>This is <b><u>the</u></b> <div id=2>content</div> of <i>the</i> tag.</div><br />"+
"<div id='1'>This is <b><u>the</u></b> <div id=2>second content</div> of <i>the</i> tag</div></p>","p",@pos)
If pos = -1
Debug "nothing found."
Else
Debug "Extracted text (pos:"+pos+") "+text
EndIf
Debug "---------------"
text = ExtractTag("<p><div id='1'>This is <b><u>the</u></b> <div id=2>content</div> of <i>the</i> tag.</div><br />"+
"<div id='1'>This is <b><u>the</u></b> <div id=2>second content</div> of <i>the</i> tag</div></p>","br",@pos)
If pos = -1
Debug "nothing found."
Else
Debug "Extracted text (pos:"+pos+") "+text
EndIf
Debug "---------------"
;
; find all <div>
;
Define startpos = 0
Repeat
text = ExtractTag("<p><div id='1'>This is <b><u>the</u></b> <div id=2>content</div> of <i>the</i> tag.</div><br />"+
"<div id='1'>This is <b><u>the</u></b> <div id=2>second content</div> of <i>the</i> tag</div></p>","div",@pos,startpos)
If pos <> -1
Debug "Extracted text (pos:"+pos+") "+text
startpos = pos+1
EndIf
Until pos = -1
EDIT2: added some comments
Last edited by Danilo on Mon Aug 11, 2014 8:01 am, edited 2 times in total.
Re: I need to do a fast string search.
My attempt 
Hopefully fast enough ...
Parser and strip tags procedure
Example 1
Example 2
Edit : My first code contained a bug so I updated it.

Hopefully fast enough ...
Parser and strip tags procedure
Code: Select all
; Simple HTML parser v1.4
#CharSize = SizeOf(Character)
#CharSizeShift = #CharSize - 1
Structure TagValue
q1.q
q2.q
EndStructure
Structure Tag
Name.s ; tag name
NameValue.q ; lowercase ascii representation of first 8 characters
Start.l ; inner text start pos
Length.l ; length of inner text
EndStructure
Global Dim CharacterType.a($ffff)
Procedure.s StripHTMLTags(Source.s)
Protected *src.Character = @Source
Protected *dst.Character = @Source
Protected Ignore.i
If *src\c
Repeat
Select *src\c
Case 60; '<'
Ignore = #True
Case 62; '>'
Ignore = #False
Default
If Ignore = #False
*dst\c = *src\c
*dst + #CharSize
EndIf
EndSelect
*src + #CharSize
Until *src\c = 0
*dst\c = 0
EndIf
ProcedureReturn PeekS(@Source)
EndProcedure
Procedure.q TagNameValue(TagName.s)
Protected Value.q
PokeS(@Value, LCase(TagName), 8, #PB_Ascii)
ProcedureReturn Value
EndProcedure
Procedure InitHTMLParser()
Protected c.c
CharacterType( 9) = 1 ; Whitespace
CharacterType(10) = 1 ; Whitespace
CharacterType(13) = 1 ; Whitespace
CharacterType(32) = 1 ; Whitespace
CharacterType(33) = 2 ; '!'
CharacterType(47) = 3 ; '/'
CharacterType(60) = 4 ; '<'
CharacterType(62) = 5 ; '>'
For c = 48 To 57
CharacterType(c) = 6 ; 0-9
Next
For c = 65 To 90
CharacterType(c) = 7 ; A-Z and a-z
CharacterType(c | 32) = 7
Next
EndProcedure
InitHTMLParser()
Procedure ParseHTML(Source.s, List TagList.Tag())
Protected *Tag.Tag, *TagName.Ascii, TagValue.TagValue
Protected.i ProcessingTagName, ContinueTagName, IsClosingTag, InnerEndPos, Pos
Protected *c.Character = @Source
ClearList(TagList())
If *c\c
Repeat
If ProcessingTagName
Select CharacterType(*c\c)
Case 1 ; Whitespace
ContinueTagName = #False
Case 2 ; '!'
If ContinueTagName
ProcessingTagName = #False
If TagValue\q1 = 0 And Mid(Source, InnerEndPos, 4) = "<!--"
; Skip comment
Pos = FindString (Source, "-->", InnerEndPos + 4)
If Pos
*c = @Source + (Pos + 1) << #CharSizeShift
Else
Break
EndIf
EndIf
EndIf
Case 3 ; '/'
If TagValue\q1 = 0
IsClosingTag = #True
EndIf
Case 5 ; '>'
ProcessingTagName = #False
*TagName\a = 0
; *** Process tag code ***
If IsClosingTag
*Tag = LastElement(TagList())
While *Tag
If *Tag\Length = 0 And *Tag\NameValue = TagValue\q1
*Tag\Length = InnerEndPos - *Tag\Start
If *Tag\Length = 0
*Tag\Length = -1
EndIf
Break
EndIf
*Tag = PreviousElement(TagList())
Wend
LastElement(TagList())
Else
*Tag = AddElement(TagList())
*Tag\Name = PeekS(@TagValue, 16, #PB_Ascii)
*Tag\NameValue = TagValue\q1
*Tag\Start = (*c - @Source) >> #CharSizeShift + 2
If TagValue\q1 = $747069726373
; Handle script tag
Pos = FindString(Source, "</script", InnerEndPos + 8, #PB_String_NoCase)
If Pos
*c = @Source + (Pos - 2) << #CharSizeShift
Else
Break
EndIf
EndIf
EndIf
; *** End of process tag code ***
Case 6 ; 0-9
If TagValue\q1 = 0
ProcessingTagName = #False
ElseIf ContinueTagName And *TagName - @TagValue < 15
*TagName\a = *c\c
*TagName + 1
EndIf
Case 7 ; A-Z a-z
If ContinueTagName And *TagName - @TagValue < 15
*TagName\a = *c\c | $20
*TagName + 1
EndIf
Default
If ContinueTagName
ProcessingTagName = #False
EndIf
EndSelect
ElseIf *c\c = 60
InnerEndPos = (*c - @Source) >> #CharSizeShift + 1
ProcessingTagName = #True
ContinueTagName = #True
IsClosingTag = #False
*TagName = @TagValue
TagValue\q1 = 0
EndIf
*c + #CharSize
Until *c\c = 0
; remove empty tags
ResetList(TagList())
While NextElement(TagList())
If TagList()\Length < 1
DeleteElement(TagList())
EndIf
Wend
EndIf
EndProcedure
Code: Select all
Source.s = "<div>This is<Div align='left'> a</div> test</div>"
NewList TagList.Tag()
ParseHTML(Source, TagList())
ResetList(TagList())
While NextElement(TagList())
*Tag.Tag = TagList()
Debug *Tag\Name
Debug "Start pos : " + *Tag\Start + " length : " + *Tag\Length
Debug Mid(Source, *Tag\Start, *Tag\Length)
Debug StripHTMLTags(Mid(Source, *Tag\Start, *Tag\Length))
Debug "---"
Wend
Code: Select all
H1TagValue.q = TagNameValue("h1")
Source.s = "<h1>This<br> is<div align='left'> a test</div> to parse html</h1>"
NewList TagList.Tag()
ParseHTML(Source, TagList())
ResetList(TagList())
While NextElement(TagList())
*Tag.Tag = TagList()
If *Tag\NameValue = H1TagValue; Show only H1 tags
Debug *Tag\Name
Debug "Start pos : " + *Tag\Start + " length : " + *Tag\Length
Debug Mid(Source, *Tag\Start, *Tag\Length)
Debug StripHTMLTags(Mid(Source, *Tag\Start, *Tag\Length))
Debug "---"
EndIf
Wend
Windows (x64)
Raspberry Pi OS (Arm64)
Raspberry Pi OS (Arm64)
Re: I need to do a fast string search.
Danilo wrote:Small start for strings:EDIT: changed it, see last example (get all <div> elements)Code: Select all
EnableExplicit Global NewMap allowedTags() Macro AddTag(string):allowedTags(string) = 1:EndMacro AddTag( "h1") : AddTag("h2") : AddTag( "h3") AddTag( "b" ) : AddTag("i" ) : AddTag( "u" ) : AddTag("p") AddTag("div") : ;AddTag("br") #RemoveOnlyAllowedTags = 0 Macro EatWhiteSpace(charPointer) While charPointer\c And ( charPointer\c = ' ' Or charPointer\c = #TAB ) ; remove SPACE and TAB charPointer + SizeOf(Character) Wend EndMacro Macro EatEverythingIncludingChar(charPointer, charToSearch) While charPointer\c And charPointer\c <> charToSearch ; remove any chars until charPointer + SizeOf(Character) ; charToSearch was found Wend If charPointer\c = charToSearch : charPointer + SizeOf(Character) : EndIf ; remove charToSearch, too EndMacro Procedure.s GetTagFromPointer(*pInput.Character,*foundSlash.Integer) ; extracts an identifier/name/word Protected result.s, slash ; If *pInput ; If *foundSlash : *foundSlash\i = 0 : EndIf ; reset: *foundSlash\i = 0 EatWhiteSpace(*pInput) ; remove whitespace If *pInput\c = '/' ; slash = #True ; check for optional '/' in front *pInput + SizeOf(Character) ; of the identifier EndIf If (*pInput\c >= 'a' And *pInput\c <= 'z') Or (*pInput\c >= 'A' And *pInput\c <= 'Z') ; identifiers start with a-zA-Z, While *pInput\c And ((*pInput\c >= 'a' And *pInput\c <= 'z') Or ; followed by a-zA-Z0-9 (*pInput\c >= 'A' And *pInput\c <= 'Z') Or ; (*pInput\c >= '0' And *pInput\c <= '9')) ; result + Chr( *pInput\c ) ; extract the identifier (tag name) *pInput + SizeOf(Character) Wend result = LCase(result) ; we work internally with lowercase ; tag names CompilerIf #RemoveOnlyAllowedTags <> 0 ; CompilerIf to check if extracted If allowedTags(result)=0 ; tag name is within the list of ProcedureReturn "" ; allowed tags EndIf CompilerEndIf If slash And *foundSlash ; write slash state to output variable *foundSlash\i = #True EndIf ProcedureReturn result ; return the result EndIf EndIf EndProcedure Procedure.s ExtractTag(inputText.s,tagToExtract.s,*foundAtPos.Integer,startPos=0) Protected result.s, recording, tag.s, foundSlash, startPosWritten Protected NewMap tagCount() Protected *pInput.Character = @inputText If *foundAtPos : *foundAtPos\i = -1 : EndIf tagToExtract = LCase(tagToExtract) If *pInput *pInput + startPos*SizeOf(Character) ; begin search at 'inputText' + 'startPos' chars Repeat If *pInput\c = '<' ; found tag start char '<' tag = GetTagFromPointer(*pInput+SizeOf(Character),@foundSlash) ; get the identifier/name of the tag If tag If foundSlash ; ending tag was found If FindMapElement(tagCount(),tag) If tagCount(tag) > 0 tagCount(tag) - 1 EndIf If tag = tagToExtract And tagCount(tag) = 0 ProcedureReturn result EndIf EndIf Else ; start tag was found tagCount(tag) + 1 If tag = tagToExtract If startPosWritten=0 And *foundAtPos *foundAtPos\i = (*pInput - @inputText)/SizeOf(Character) startPosWritten = 1 EndIf recording = #True ; we found our start tag, start recording EndIf EndIf EatEverythingIncludingChar(*pInput,'>') ; ignore all other stuff after the tag name Else ; no allowed or valid tag name was found: If recording ; result + Chr(*pInput\c) ; add chars to the result EndIf *pInput + SizeOf(Character) EndIf Else If recording ; we are after the start tag, so result + Chr(*pInput\c) ; add all chars outside tags to the result EndIf *pInput + SizeOf(Character) EndIf Until *pInput\c = 0 EndIf ProcedureReturn result.s EndProcedure Define text.s, pos text = ExtractTag("<div>This is <div>the content</div> of the div</div>","div",@pos) If text = "" Debug "nothing found." Else Debug "Extracted text (pos:"+pos+") "+text EndIf Debug "---------------" text = ExtractTag("<p><div id='1'>This is <b><u>the</u></b> <div id=2>content</div> of <i>the</i> tag</div></p>","div",@pos) If pos = -1 Debug "nothing found." Else Debug "Extracted text (pos:"+pos+") "+text EndIf Debug "---------------" text = ExtractTag("<p><div id='1'>This is <b><u>the</u></b> <div id=2>content</div> of <i>the</i> tag.</div><br />"+ "<div id='1'>This is <b><u>the</u></b> <div id=2>second content</div> of <i>the</i> tag</div></p>","p",@pos) If pos = -1 Debug "nothing found." Else Debug "Extracted text (pos:"+pos+") "+text EndIf Debug "---------------" text = ExtractTag("<p><div id='1'>This is <b><u>the</u></b> <div id=2>content</div> of <i>the</i> tag.</div><br />"+ "<div id='1'>This is <b><u>the</u></b> <div id=2>second content</div> of <i>the</i> tag</div></p>","br",@pos) If pos = -1 Debug "nothing found." Else Debug "Extracted text (pos:"+pos+") "+text EndIf Debug "---------------" ; ; find all <div> ; Define startpos = 0 Repeat text = ExtractTag("<p><div id='1'>This is <b><u>the</u></b> <div id=2>content</div> of <i>the</i> tag.</div><br />"+ "<div id='1'>This is <b><u>the</u></b> <div id=2>second content</div> of <i>the</i> tag</div></p>","div",@pos,startpos) If pos <> -1 Debug "Extracted text (pos:"+pos+") "+text startpos = pos+1 EndIf Until pos = -1
EDIT2: added some comments
Excellent!! Thanks Danilo, i will test it right now.
This can be runned in threads (each thread for a different content), right?
Best Regards
ARGENTINA WORLD CHAMPION
Re: I need to do a fast string search.
wilbert wrote:My attempt
Hopefully fast enough ...
Thanks!! I will run it now.
This can run in threads (one thread for each different content), right?
ARGENTINA WORLD CHAMPION
Re: I need to do a fast string search.
It should be safe to let each thread process it's own html document provided you give every thread it's own LinkedList to fill with tag information.ricardo wrote:This can run in threads (one thread for each different content), right?
The parsing itself of a 2KB html file to get all tag information probably takes a millisecond or so. After that you can process the tags that were found as you wish.
Windows (x64)
Raspberry Pi OS (Arm64)
Raspberry Pi OS (Arm64)
Re: I need to do a fast string search.
Works great, thank you.wilbert wrote:My attempt
Hopefully fast enough ...
Now I also would like to process the tag attributes, so I need the whole contents of the tag itself
Code: Select all
Source.s = "<div id='1'>This is<Div align='left' id='2'> a</div> test</div>"
*Tag\Contents -> id='1'
*Tag\Contents -> align='left' id='2'
TIA
Re: I need to do a fast string search.
You can modify the code a bit to extract the tag with all attributes as a single string but you will have to process that further to get the information you need.
Example of extracting tag with attributes
Example of extracting tag with attributes
Code: Select all
#CharSize = SizeOf(Character)
#CharSizeShift = #CharSize - 1
Structure TagValue
q1.q
q2.q
EndStructure
Structure Tag
Name.s ; tag name
NameValue.q ; lowercase ascii representation of first 8 characters
Start.l ; inner text start pos
Length.l ; length of inner text
EndStructure
Global Dim CharacterType.a($ffff)
Procedure.s StripHTMLTags(Source.s)
Protected *src.Character = @Source
Protected *dst.Character = @Source
Protected Ignore.i
If *src\c
Repeat
Select *src\c
Case 60; '<'
Ignore = #True
Case 62; '>'
Ignore = #False
Default
If Ignore = #False
*dst\c = *src\c
*dst + #CharSize
EndIf
EndSelect
*src + #CharSize
Until *src\c = 0
*dst\c = 0
EndIf
ProcedureReturn PeekS(@Source)
EndProcedure
Procedure.q TagNameValue(TagName.s)
Protected Value.q
PokeS(@Value, LCase(TagName), 8, #PB_Ascii)
ProcedureReturn Value
EndProcedure
Procedure InitHTMLParser()
Protected c.c
CharacterType( 9) = 1 ; Whitespace
CharacterType(10) = 1 ; Whitespace
CharacterType(13) = 1 ; Whitespace
CharacterType(32) = 1 ; Whitespace
CharacterType(33) = 2 ; '!'
CharacterType(47) = 3 ; '/'
CharacterType(60) = 4 ; '<'
CharacterType(62) = 5 ; '>'
For c = 48 To 57
CharacterType(c) = 6 ; 0-9
Next
For c = 65 To 90
CharacterType(c) = 7 ; A-Z and a-z
CharacterType(c | 32) = 7
Next
EndProcedure
InitHTMLParser()
Procedure ParseHTML(Source.s, List TagList.Tag())
Protected *Tag.Tag, *TagName.Ascii, TagValue.TagValue
Protected.i ProcessingTagName, ContinueTagName, IsClosingTag, InnerEndPos, Pos
Protected *c.Character = @Source
ClearList(TagList())
If *c\c
Repeat
If ProcessingTagName
Select CharacterType(*c\c)
Case 1 ; Whitespace
ContinueTagName = #False
Case 2 ; '!'
If ContinueTagName
ProcessingTagName = #False
If TagValue\q1 = 0 And Mid(Source, InnerEndPos, 4) = "<!--"
; Skip comment
Pos = FindString (Source, "-->", InnerEndPos + 4)
If Pos
*c = @Source + (Pos + 1) << #CharSizeShift
Else
Break
EndIf
EndIf
EndIf
Case 3 ; '/'
If TagValue\q1 = 0
IsClosingTag = #True
EndIf
Case 5 ; '>'
ProcessingTagName = #False
*TagName\a = 0
; *** Process tag code ***
If IsClosingTag
*Tag = LastElement(TagList())
While *Tag
If *Tag\Length = 0 And *Tag\NameValue = TagValue\q1
*Tag\Length = InnerEndPos - *Tag\Start
If *Tag\Length = 0
*Tag\Length = -1
EndIf
Break
EndIf
*Tag = PreviousElement(TagList())
Wend
LastElement(TagList())
Else
*Tag = AddElement(TagList())
*Tag\NameValue = TagValue\q1
*Tag\Start = (*c - @Source) >> #CharSizeShift + 2
*Tag\Name = Mid(Source, InnerEndPos, *Tag\Start - InnerEndPos)
If TagValue\q1 = $747069726373
; Handle script tag
Pos = FindString(Source, "</script", InnerEndPos + 8, #PB_String_NoCase)
If Pos
*c = @Source + (Pos - 2) << #CharSizeShift
Else
Break
EndIf
EndIf
EndIf
; *** End of process tag code ***
Case 6 ; 0-9
If TagValue\q1 = 0
ProcessingTagName = #False
ElseIf ContinueTagName And *TagName - @TagValue < 15
*TagName\a = *c\c
*TagName + 1
EndIf
Case 7 ; A-Z a-z
If ContinueTagName And *TagName - @TagValue < 15
*TagName\a = *c\c | $20
*TagName + 1
EndIf
Default
If ContinueTagName
ProcessingTagName = #False
EndIf
EndSelect
ElseIf *c\c = 60
InnerEndPos = (*c - @Source) >> #CharSizeShift + 1
ProcessingTagName = #True
ContinueTagName = #True
IsClosingTag = #False
*TagName = @TagValue
TagValue\q1 = 0
EndIf
*c + #CharSize
Until *c\c = 0
; remove empty tags
ResetList(TagList())
While NextElement(TagList())
If TagList()\Length < 1
DeleteElement(TagList())
EndIf
Wend
EndIf
EndProcedure
Windows (x64)
Raspberry Pi OS (Arm64)
Raspberry Pi OS (Arm64)