I need to do a fast string search.

Just starting out? Need help? Post your questions and find answers here.
PB
PureBasic Expert
PureBasic Expert
Posts: 7581
Joined: Fri Apr 25, 2003 5:24 pm

Re: I need to do a fast string search.

Post by PB »

Just to throw a spanner in the works: don't forget that closing tags
sometimes (very often) have a space in them. So searching solely
for "</a>" and "</p>" will fail when they're "< /a>" and "< /p>".
In other words, you've always got to take that space into account.
I compile using 5.31 (x86) on Win 7 Ultimate (64-bit).
"PureBasic won't be object oriented, period" - Fred.
ricardo
Addict
Addict
Posts: 2438
Joined: Fri Apr 25, 2003 7:06 pm
Location: Argentina

Re: I need to do a fast string search.

Post by ricardo »

Great, i will take a loook tonight.
One thing i found is that somtimes a tag contains the same tag inside, so i need to be sure that im finding the closing tag and not some tag that is part of the contnt, per example

Code: Select all

<div id="1">This is the <div id=2>content</div>of the tag</div>
I learn to be carefull to find, in this case, the second </div> as the closing tag and not the first one. That makes me add more code that make a bit slower my code, also i need to strip some code like scripts, strip th opening, the content, and the closing tag.
ARGENTINA WORLD CHAMPION
PB
PureBasic Expert
PureBasic Expert
Posts: 7581
Joined: Fri Apr 25, 2003 5:24 pm

Re: I need to do a fast string search.

Post by PB »

Don't forget there's tips in this forum that converts raw HTML
to plain text, so that might save some time? I don't know how
that affects your searches, though; but at least it bulk-removes
all the tags for you.
I compile using 5.31 (x86) on Win 7 Ultimate (64-bit).
"PureBasic won't be object oriented, period" - Fred.
User avatar
Danilo
Addict
Addict
Posts: 3036
Joined: Sat Apr 26, 2003 8:26 am
Location: Planet Earth

Re: I need to do a fast string search.

Post by Danilo »

What result would you want for the following example, and what do you want to do with the result(s)?

Code: Select all

<div id="1">This is <b><u>the</u></b> <div id=2>content</div>of <i>the</i> tag</div>
The question is, how do you want to handle/extract/replace nested tags?
ricardo
Addict
Addict
Posts: 2438
Joined: Fri Apr 25, 2003 7:06 pm
Location: Argentina

Re: I need to do a fast string search.

Post by ricardo »

Danilo wrote:What result would you want for the following example, and what do you want to do with the result(s)?

Code: Select all

<div id="1">This is <b><u>the</u></b> <div id=2>content</div>of <i>the</i> tag</div>
The question is, how do you want to handle/extract/replace nested tags?
Desired result if we are searching DIV innerText: This is the content of the tag.

If we search P innerText and the text was:

Code: Select all

<p><div id="1">This is <b><u>the</u></b> <div id=2>content</div>of <i>the</i> tag</div></p>
The desire content is the same: This is the content of the tag.
But the case could be different:

Code: Select all

<p><div id="1">This is <b><u>the</u></b> <div id=2>content</div>of <i>the</i> tag.</div><br />
<div id="1">This is <b><u>the</u></b> <div id=2>second content</div>of <i>the</i> tag</div></p>
In this case the desire content for the P is:
This is the content of the tag.
This is the second content of the tag.


Lets say: we ignore the inner tags, even nested tags. Just extract innerText (and not innerHTML).
Last edited by ricardo on Mon Aug 11, 2014 3:13 am, edited 1 time in total.
ARGENTINA WORLD CHAMPION
User avatar
Danilo
Addict
Addict
Posts: 3036
Joined: Sat Apr 26, 2003 8:26 am
Location: Planet Earth

Re: I need to do a fast string search.

Post by Danilo »

So you just want to remove all <tag> and </tag> from the text, including <tag with more stuff>, right?
ricardo
Addict
Addict
Posts: 2438
Joined: Fri Apr 25, 2003 7:06 pm
Location: Argentina

Re: I need to do a fast string search.

Post by ricardo »

Danilo wrote:So you just want to remove all <tag> and </tag> from the text, including <tag with more stuff>, right?

I want to find the content inside any tag i want an d remove the tags, even the inner tags. Including tags with attributes or more stuff.
I want the innerText that is exactly that. The text without tags contained inside some tags.
Its important to ignore included tags, because just searching opening and closing tags could drive to mess and think that the closing tag of the nested tag is the closing tag of the container tag.

Like in this example, some code could detects the blue divs are the opening and closing tags for the same div and its not.

<div>This is<div>the content</div>of the div</div>

My code identify nested tags without confusion, but all this takes so long. Because i run it manytimes in a short time.
Thats why i start looking some optimized way.
As far as i find is very hard to manage nested tags (because there could be 0,1 or 75 nested tags, any number) using RegEx.

The problem is that real world HTML is not very clean and its not XML, and i am scraping real world pages that i dont develope.
ARGENTINA WORLD CHAMPION
wilbert
PureBasic Expert
PureBasic Expert
Posts: 3942
Joined: Sun Aug 08, 2004 5:21 am
Location: Netherlands

Re: I need to do a fast string search.

Post by wilbert »

Maybe a silly question but what do you consider fast and what do you consider slow ?
Do you have any test document and an indication of how long your current routine takes ?
What usually helps to optimize code is something to start with so you can check improvement.
Windows (x64)
Raspberry Pi OS (Arm64)
User avatar
Danilo
Addict
Addict
Posts: 3036
Joined: Sat Apr 26, 2003 8:26 am
Location: Planet Earth

Re: I need to do a fast string search.

Post by Danilo »

Small start for strings:

Code: Select all

EnableExplicit

Global NewMap allowedTags()
Macro AddTag(string):allowedTags(string) = 1:EndMacro

AddTag( "h1") : AddTag("h2") : AddTag( "h3")
AddTag( "b" ) : AddTag("i" ) : AddTag( "u" ) : AddTag("p")
AddTag("div") : ;AddTag("br")

#RemoveOnlyAllowedTags = 0

Macro EatWhiteSpace(charPointer)
        While charPointer\c And ( charPointer\c = ' ' Or charPointer\c = #TAB )   ; remove SPACE and TAB
                charPointer + SizeOf(Character)
        Wend
EndMacro

Macro EatEverythingIncludingChar(charPointer, charToSearch)
        While charPointer\c And charPointer\c <> charToSearch                     ; remove any chars until
                charPointer + SizeOf(Character)                                   ; charToSearch was found
        Wend
        If charPointer\c = charToSearch : charPointer + SizeOf(Character) : EndIf ; remove charToSearch, too
EndMacro

Procedure.s GetTagFromPointer(*pInput.Character,*foundSlash.Integer)              ; extracts an identifier/name/word
    Protected result.s, slash                                                     ;
    If *pInput                                                                    ;
        If *foundSlash : *foundSlash\i = 0 : EndIf                                ; reset: *foundSlash\i = 0
        EatWhiteSpace(*pInput)                                                    ; remove whitespace
        If *pInput\c = '/'                                                        ;
            slash = #True                                                         ; check for optional '/' in front
            *pInput + SizeOf(Character)                                           ; of the identifier
        EndIf
        If (*pInput\c >= 'a' And *pInput\c <= 'z') Or (*pInput\c >= 'A' And *pInput\c <= 'Z') ; identifiers start with a-zA-Z,
            While *pInput\c And ((*pInput\c >= 'a' And *pInput\c <= 'z') Or                   ; followed by a-zA-Z0-9
                  (*pInput\c >= 'A' And *pInput\c <= 'Z') Or                                  ;
                  (*pInput\c >= '0' And *pInput\c <= '9'))                                    ;
                result + Chr( *pInput\c )                                                     ; extract the identifier (tag name)
                *pInput + SizeOf(Character)
            Wend
            result = LCase(result)                                                            ; we work internally with lowercase
                                                                                              ; tag names
            CompilerIf #RemoveOnlyAllowedTags <> 0                                ; CompilerIf to check if extracted
                If allowedTags(result)=0                                          ; tag name is within the list of
                    ProcedureReturn ""                                            ; allowed tags
                EndIf
            CompilerEndIf
            If slash And *foundSlash                                              ; write slash state to output variable
                *foundSlash\i = #True
            EndIf
            ProcedureReturn result                                                ; return the result
        EndIf
    EndIf
EndProcedure

Procedure.s ExtractTag(inputText.s,tagToExtract.s,*foundAtPos.Integer,startPos=0)
    Protected result.s, recording, tag.s, foundSlash, startPosWritten
    Protected NewMap tagCount()
    Protected *pInput.Character = @inputText
    If *foundAtPos : *foundAtPos\i = -1 : EndIf
    tagToExtract = LCase(tagToExtract)
    If *pInput
        *pInput + startPos*SizeOf(Character)                                      ; begin search at 'inputText' + 'startPos' chars
        Repeat
            If *pInput\c = '<'                                                    ; found tag start char '<'
                tag = GetTagFromPointer(*pInput+SizeOf(Character),@foundSlash)    ; get the identifier/name of the tag
                If tag
                    If foundSlash                                                 ; ending tag was found
                        If FindMapElement(tagCount(),tag)
                            If tagCount(tag) > 0
                                tagCount(tag) - 1
                            EndIf
                            If tag = tagToExtract And tagCount(tag) = 0
                                ProcedureReturn result
                            EndIf
                        EndIf
                    Else                                                          ; start tag was found
                        tagCount(tag) + 1
                        If tag = tagToExtract
                            If startPosWritten=0 And *foundAtPos
                                *foundAtPos\i = (*pInput - @inputText)/SizeOf(Character)
                                startPosWritten = 1
                            EndIf
                            recording = #True                                     ; we found our start tag, start recording
                        EndIf
                    EndIf
                    EatEverythingIncludingChar(*pInput,'>')                       ; ignore all other stuff after the tag name
                Else                                                              ; no allowed or valid tag name was found:
                    If recording                                                  ;
                        result + Chr(*pInput\c)                                   ; add chars to the result
                    EndIf
                    *pInput + SizeOf(Character)
                EndIf
            Else
                If recording                                                      ; we are after the start tag, so
                    result + Chr(*pInput\c)                                       ; add all chars outside tags to the result
                EndIf
                *pInput + SizeOf(Character)
            EndIf
        Until *pInput\c = 0
    EndIf
    ProcedureReturn result.s
EndProcedure

Define text.s, pos

text = ExtractTag("<div>This is <div>the content</div> of the div</div>","div",@pos)
If text = ""
    Debug "nothing found."
Else
    Debug "Extracted text (pos:"+pos+") "+text
EndIf

Debug "---------------"

text = ExtractTag("<p><div id='1'>This is <b><u>the</u></b> <div id=2>content</div> of <i>the</i> tag</div></p>","div",@pos)
If pos = -1
    Debug "nothing found."
Else
    Debug "Extracted text (pos:"+pos+") "+text
EndIf

Debug "---------------"

text = ExtractTag("<p><div id='1'>This is <b><u>the</u></b> <div id=2>content</div> of <i>the</i> tag.</div><br />"+
                  "<div id='1'>This is <b><u>the</u></b> <div id=2>second content</div> of <i>the</i> tag</div></p>","p",@pos)
If pos = -1
    Debug "nothing found."
Else
    Debug "Extracted text (pos:"+pos+") "+text
EndIf

Debug "---------------"

text = ExtractTag("<p><div id='1'>This is <b><u>the</u></b> <div id=2>content</div> of <i>the</i> tag.</div><br />"+
                  "<div id='1'>This is <b><u>the</u></b> <div id=2>second content</div> of <i>the</i> tag</div></p>","br",@pos)
If pos = -1
    Debug "nothing found."
Else
    Debug "Extracted text (pos:"+pos+") "+text
EndIf

Debug "---------------"

;
; find all <div>
;
Define startpos = 0

Repeat
    text = ExtractTag("<p><div id='1'>This is <b><u>the</u></b> <div id=2>content</div> of <i>the</i> tag.</div><br />"+
                      "<div id='1'>This is <b><u>the</u></b> <div id=2>second content</div> of <i>the</i> tag</div></p>","div",@pos,startpos)
    If pos <> -1
        Debug "Extracted text (pos:"+pos+") "+text
        startpos = pos+1
    EndIf
Until pos = -1
EDIT: changed it, see last example (get all <div> elements)
EDIT2: added some comments
Last edited by Danilo on Mon Aug 11, 2014 8:01 am, edited 2 times in total.
wilbert
PureBasic Expert
PureBasic Expert
Posts: 3942
Joined: Sun Aug 08, 2004 5:21 am
Location: Netherlands

Re: I need to do a fast string search.

Post by wilbert »

My attempt :D
Hopefully fast enough ...

Parser and strip tags procedure

Code: Select all

; Simple HTML parser v1.4

#CharSize = SizeOf(Character)
#CharSizeShift = #CharSize - 1

Structure TagValue
  q1.q
  q2.q
EndStructure

Structure Tag
  Name.s      ; tag name
  NameValue.q ; lowercase ascii representation of first 8 characters
  Start.l     ; inner text start pos 
  Length.l    ; length of inner text
EndStructure

Global Dim CharacterType.a($ffff)

Procedure.s StripHTMLTags(Source.s)
  Protected *src.Character = @Source
  Protected *dst.Character = @Source
  Protected Ignore.i
  If *src\c
    Repeat
      Select *src\c
        Case 60; '<'
          Ignore = #True
        Case 62; '>'
          Ignore = #False
        Default
          If Ignore = #False
            *dst\c = *src\c
            *dst + #CharSize
          EndIf
      EndSelect
      *src + #CharSize
    Until *src\c = 0
    *dst\c = 0
  EndIf
  ProcedureReturn PeekS(@Source)  
EndProcedure

Procedure.q TagNameValue(TagName.s)
  Protected Value.q
  PokeS(@Value, LCase(TagName), 8, #PB_Ascii)
  ProcedureReturn Value
EndProcedure

Procedure InitHTMLParser()
  Protected c.c
  CharacterType( 9) = 1       ; Whitespace
  CharacterType(10) = 1       ; Whitespace
  CharacterType(13) = 1       ; Whitespace
  CharacterType(32) = 1       ; Whitespace
  CharacterType(33) = 2       ; '!'
  CharacterType(47) = 3       ; '/'
  CharacterType(60) = 4       ; '<'
  CharacterType(62) = 5       ; '>'
  For c = 48 To 57
    CharacterType(c) = 6      ; 0-9
  Next
  For c = 65 To 90
    CharacterType(c) = 7      ; A-Z and a-z
    CharacterType(c | 32) = 7
  Next
EndProcedure

InitHTMLParser()

Procedure ParseHTML(Source.s, List TagList.Tag())
  Protected *Tag.Tag, *TagName.Ascii, TagValue.TagValue
  Protected.i ProcessingTagName, ContinueTagName, IsClosingTag, InnerEndPos, Pos
  Protected *c.Character = @Source
  ClearList(TagList())
  If *c\c
    Repeat
      If ProcessingTagName
        Select CharacterType(*c\c)
          Case 1              ; Whitespace
            ContinueTagName = #False
          Case 2              ; '!'
            If ContinueTagName
              ProcessingTagName = #False
              If TagValue\q1 = 0 And Mid(Source, InnerEndPos, 4) = "<!--"
                ; Skip comment
                Pos = FindString (Source, "-->", InnerEndPos + 4)
                If Pos
                  *c = @Source + (Pos + 1) << #CharSizeShift
                Else
                  Break  
                EndIf
              EndIf
            EndIf
          Case 3              ; '/'
            If TagValue\q1 = 0
              IsClosingTag = #True
            EndIf
          Case 5              ; '>'
            ProcessingTagName = #False
            *TagName\a = 0
            ; *** Process tag code ***
            If IsClosingTag
              *Tag = LastElement(TagList())
              While *Tag
                If *Tag\Length = 0 And *Tag\NameValue = TagValue\q1
                  *Tag\Length = InnerEndPos - *Tag\Start
                  If *Tag\Length = 0
                    *Tag\Length = -1
                  EndIf
                  Break
                EndIf
                *Tag = PreviousElement(TagList())
              Wend
              LastElement(TagList())
            Else
              *Tag = AddElement(TagList())
              *Tag\Name = PeekS(@TagValue, 16, #PB_Ascii)
              *Tag\NameValue = TagValue\q1
              *Tag\Start = (*c - @Source) >> #CharSizeShift + 2
              If TagValue\q1 = $747069726373
                ; Handle script tag
                Pos = FindString(Source, "</script", InnerEndPos + 8, #PB_String_NoCase)
                If Pos
                  *c = @Source + (Pos - 2) << #CharSizeShift
                Else
                  Break  
                EndIf
              EndIf
            EndIf
            ; *** End of process tag code ***
          Case 6              ; 0-9
            If TagValue\q1 = 0
              ProcessingTagName = #False
            ElseIf ContinueTagName And *TagName - @TagValue < 15
              *TagName\a = *c\c
              *TagName + 1
            EndIf
          Case 7              ; A-Z a-z
            If ContinueTagName And *TagName - @TagValue < 15
              *TagName\a = *c\c | $20
              *TagName + 1
            EndIf
          Default
            If ContinueTagName
              ProcessingTagName = #False
            EndIf
        EndSelect
      ElseIf *c\c = 60
        InnerEndPos = (*c - @Source) >> #CharSizeShift + 1
        ProcessingTagName = #True
        ContinueTagName = #True
        IsClosingTag = #False
        *TagName = @TagValue
        TagValue\q1 = 0
      EndIf
      *c + #CharSize
    Until *c\c = 0
    ; remove empty tags
    ResetList(TagList())
    While NextElement(TagList())
      If TagList()\Length < 1
        DeleteElement(TagList())
      EndIf
    Wend
  EndIf
EndProcedure
Example 1

Code: Select all

Source.s = "<div>This is<Div align='left'> a</div> test</div>"
NewList TagList.Tag()

ParseHTML(Source, TagList())

ResetList(TagList())
While NextElement(TagList())
  *Tag.Tag = TagList()
  Debug *Tag\Name
  Debug "Start pos : " + *Tag\Start + " length : " + *Tag\Length
  Debug Mid(Source, *Tag\Start, *Tag\Length)
  Debug StripHTMLTags(Mid(Source, *Tag\Start, *Tag\Length))
  Debug "---"
Wend
Example 2

Code: Select all

H1TagValue.q = TagNameValue("h1") 

Source.s = "<h1>This<br> is<div align='left'> a test</div> to parse html</h1>"
NewList TagList.Tag()

ParseHTML(Source, TagList())

ResetList(TagList())
While NextElement(TagList())
  *Tag.Tag = TagList()
  If *Tag\NameValue = H1TagValue; Show only H1 tags
    Debug *Tag\Name
    Debug "Start pos : " + *Tag\Start + " length : " + *Tag\Length
    Debug Mid(Source, *Tag\Start, *Tag\Length)
    Debug StripHTMLTags(Mid(Source, *Tag\Start, *Tag\Length))
    Debug "---"
  EndIf
Wend
Edit : My first code contained a bug so I updated it.
Windows (x64)
Raspberry Pi OS (Arm64)
ricardo
Addict
Addict
Posts: 2438
Joined: Fri Apr 25, 2003 7:06 pm
Location: Argentina

Re: I need to do a fast string search.

Post by ricardo »

Danilo wrote:Small start for strings:

Code: Select all

EnableExplicit

Global NewMap allowedTags()
Macro AddTag(string):allowedTags(string) = 1:EndMacro

AddTag( "h1") : AddTag("h2") : AddTag( "h3")
AddTag( "b" ) : AddTag("i" ) : AddTag( "u" ) : AddTag("p")
AddTag("div") : ;AddTag("br")

#RemoveOnlyAllowedTags = 0

Macro EatWhiteSpace(charPointer)
        While charPointer\c And ( charPointer\c = ' ' Or charPointer\c = #TAB )   ; remove SPACE and TAB
                charPointer + SizeOf(Character)
        Wend
EndMacro

Macro EatEverythingIncludingChar(charPointer, charToSearch)
        While charPointer\c And charPointer\c <> charToSearch                     ; remove any chars until
                charPointer + SizeOf(Character)                                   ; charToSearch was found
        Wend
        If charPointer\c = charToSearch : charPointer + SizeOf(Character) : EndIf ; remove charToSearch, too
EndMacro

Procedure.s GetTagFromPointer(*pInput.Character,*foundSlash.Integer)              ; extracts an identifier/name/word
    Protected result.s, slash                                                     ;
    If *pInput                                                                    ;
        If *foundSlash : *foundSlash\i = 0 : EndIf                                ; reset: *foundSlash\i = 0
        EatWhiteSpace(*pInput)                                                    ; remove whitespace
        If *pInput\c = '/'                                                        ;
            slash = #True                                                         ; check for optional '/' in front
            *pInput + SizeOf(Character)                                           ; of the identifier
        EndIf
        If (*pInput\c >= 'a' And *pInput\c <= 'z') Or (*pInput\c >= 'A' And *pInput\c <= 'Z') ; identifiers start with a-zA-Z,
            While *pInput\c And ((*pInput\c >= 'a' And *pInput\c <= 'z') Or                   ; followed by a-zA-Z0-9
                  (*pInput\c >= 'A' And *pInput\c <= 'Z') Or                                  ;
                  (*pInput\c >= '0' And *pInput\c <= '9'))                                    ;
                result + Chr( *pInput\c )                                                     ; extract the identifier (tag name)
                *pInput + SizeOf(Character)
            Wend
            result = LCase(result)                                                            ; we work internally with lowercase
                                                                                              ; tag names
            CompilerIf #RemoveOnlyAllowedTags <> 0                                ; CompilerIf to check if extracted
                If allowedTags(result)=0                                          ; tag name is within the list of
                    ProcedureReturn ""                                            ; allowed tags
                EndIf
            CompilerEndIf
            If slash And *foundSlash                                              ; write slash state to output variable
                *foundSlash\i = #True
            EndIf
            ProcedureReturn result                                                ; return the result
        EndIf
    EndIf
EndProcedure

Procedure.s ExtractTag(inputText.s,tagToExtract.s,*foundAtPos.Integer,startPos=0)
    Protected result.s, recording, tag.s, foundSlash, startPosWritten
    Protected NewMap tagCount()
    Protected *pInput.Character = @inputText
    If *foundAtPos : *foundAtPos\i = -1 : EndIf
    tagToExtract = LCase(tagToExtract)
    If *pInput
        *pInput + startPos*SizeOf(Character)                                      ; begin search at 'inputText' + 'startPos' chars
        Repeat
            If *pInput\c = '<'                                                    ; found tag start char '<'
                tag = GetTagFromPointer(*pInput+SizeOf(Character),@foundSlash)    ; get the identifier/name of the tag
                If tag
                    If foundSlash                                                 ; ending tag was found
                        If FindMapElement(tagCount(),tag)
                            If tagCount(tag) > 0
                                tagCount(tag) - 1
                            EndIf
                            If tag = tagToExtract And tagCount(tag) = 0
                                ProcedureReturn result
                            EndIf
                        EndIf
                    Else                                                          ; start tag was found
                        tagCount(tag) + 1
                        If tag = tagToExtract
                            If startPosWritten=0 And *foundAtPos
                                *foundAtPos\i = (*pInput - @inputText)/SizeOf(Character)
                                startPosWritten = 1
                            EndIf
                            recording = #True                                     ; we found our start tag, start recording
                        EndIf
                    EndIf
                    EatEverythingIncludingChar(*pInput,'>')                       ; ignore all other stuff after the tag name
                Else                                                              ; no allowed or valid tag name was found:
                    If recording                                                  ;
                        result + Chr(*pInput\c)                                   ; add chars to the result
                    EndIf
                    *pInput + SizeOf(Character)
                EndIf
            Else
                If recording                                                      ; we are after the start tag, so
                    result + Chr(*pInput\c)                                       ; add all chars outside tags to the result
                EndIf
                *pInput + SizeOf(Character)
            EndIf
        Until *pInput\c = 0
    EndIf
    ProcedureReturn result.s
EndProcedure

Define text.s, pos

text = ExtractTag("<div>This is <div>the content</div> of the div</div>","div",@pos)
If text = ""
    Debug "nothing found."
Else
    Debug "Extracted text (pos:"+pos+") "+text
EndIf

Debug "---------------"

text = ExtractTag("<p><div id='1'>This is <b><u>the</u></b> <div id=2>content</div> of <i>the</i> tag</div></p>","div",@pos)
If pos = -1
    Debug "nothing found."
Else
    Debug "Extracted text (pos:"+pos+") "+text
EndIf

Debug "---------------"

text = ExtractTag("<p><div id='1'>This is <b><u>the</u></b> <div id=2>content</div> of <i>the</i> tag.</div><br />"+
                  "<div id='1'>This is <b><u>the</u></b> <div id=2>second content</div> of <i>the</i> tag</div></p>","p",@pos)
If pos = -1
    Debug "nothing found."
Else
    Debug "Extracted text (pos:"+pos+") "+text
EndIf

Debug "---------------"

text = ExtractTag("<p><div id='1'>This is <b><u>the</u></b> <div id=2>content</div> of <i>the</i> tag.</div><br />"+
                  "<div id='1'>This is <b><u>the</u></b> <div id=2>second content</div> of <i>the</i> tag</div></p>","br",@pos)
If pos = -1
    Debug "nothing found."
Else
    Debug "Extracted text (pos:"+pos+") "+text
EndIf

Debug "---------------"

;
; find all <div>
;
Define startpos = 0

Repeat
    text = ExtractTag("<p><div id='1'>This is <b><u>the</u></b> <div id=2>content</div> of <i>the</i> tag.</div><br />"+
                      "<div id='1'>This is <b><u>the</u></b> <div id=2>second content</div> of <i>the</i> tag</div></p>","div",@pos,startpos)
    If pos <> -1
        Debug "Extracted text (pos:"+pos+") "+text
        startpos = pos+1
    EndIf
Until pos = -1
EDIT: changed it, see last example (get all <div> elements)
EDIT2: added some comments

Excellent!! Thanks Danilo, i will test it right now.
This can be runned in threads (each thread for a different content), right?

Best Regards
ARGENTINA WORLD CHAMPION
ricardo
Addict
Addict
Posts: 2438
Joined: Fri Apr 25, 2003 7:06 pm
Location: Argentina

Re: I need to do a fast string search.

Post by ricardo »

wilbert wrote:My attempt :D
Hopefully fast enough ...

Thanks!! I will run it now.
This can run in threads (one thread for each different content), right?
ARGENTINA WORLD CHAMPION
wilbert
PureBasic Expert
PureBasic Expert
Posts: 3942
Joined: Sun Aug 08, 2004 5:21 am
Location: Netherlands

Re: I need to do a fast string search.

Post by wilbert »

ricardo wrote:This can run in threads (one thread for each different content), right?
It should be safe to let each thread process it's own html document provided you give every thread it's own LinkedList to fill with tag information.
The parsing itself of a 2KB html file to get all tag information probably takes a millisecond or so. After that you can process the tags that were found as you wish.
Windows (x64)
Raspberry Pi OS (Arm64)
Oliver13
User
User
Posts: 90
Joined: Thu Sep 30, 2010 6:40 am

Re: I need to do a fast string search.

Post by Oliver13 »

wilbert wrote:My attempt :D
Hopefully fast enough ...
Works great, thank you.
Now I also would like to process the tag attributes, so I need the whole contents of the tag itself

Code: Select all

Source.s = "<div id='1'>This is<Div align='left' id='2'> a</div> test</div>"
Do you have an idea how to extract this additional tag info ?
*Tag\Contents -> id='1'
*Tag\Contents -> align='left' id='2'

TIA
wilbert
PureBasic Expert
PureBasic Expert
Posts: 3942
Joined: Sun Aug 08, 2004 5:21 am
Location: Netherlands

Re: I need to do a fast string search.

Post by wilbert »

You can modify the code a bit to extract the tag with all attributes as a single string but you will have to process that further to get the information you need.

Example of extracting tag with attributes

Code: Select all

#CharSize = SizeOf(Character)
#CharSizeShift = #CharSize - 1

Structure TagValue
  q1.q
  q2.q
EndStructure

Structure Tag
  Name.s      ; tag name
  NameValue.q ; lowercase ascii representation of first 8 characters
  Start.l     ; inner text start pos 
  Length.l    ; length of inner text
EndStructure

Global Dim CharacterType.a($ffff)

Procedure.s StripHTMLTags(Source.s)
  Protected *src.Character = @Source
  Protected *dst.Character = @Source
  Protected Ignore.i
  If *src\c
    Repeat
      Select *src\c
        Case 60; '<'
          Ignore = #True
        Case 62; '>'
          Ignore = #False
        Default
          If Ignore = #False
            *dst\c = *src\c
            *dst + #CharSize
          EndIf
      EndSelect
      *src + #CharSize
    Until *src\c = 0
    *dst\c = 0
  EndIf
  ProcedureReturn PeekS(@Source)  
EndProcedure

Procedure.q TagNameValue(TagName.s)
  Protected Value.q
  PokeS(@Value, LCase(TagName), 8, #PB_Ascii)
  ProcedureReturn Value
EndProcedure

Procedure InitHTMLParser()
  Protected c.c
  CharacterType( 9) = 1       ; Whitespace
  CharacterType(10) = 1       ; Whitespace
  CharacterType(13) = 1       ; Whitespace
  CharacterType(32) = 1       ; Whitespace
  CharacterType(33) = 2       ; '!'
  CharacterType(47) = 3       ; '/'
  CharacterType(60) = 4       ; '<'
  CharacterType(62) = 5       ; '>'
  For c = 48 To 57
    CharacterType(c) = 6      ; 0-9
  Next
  For c = 65 To 90
    CharacterType(c) = 7      ; A-Z and a-z
    CharacterType(c | 32) = 7
  Next
EndProcedure

InitHTMLParser()

Procedure ParseHTML(Source.s, List TagList.Tag())
  Protected *Tag.Tag, *TagName.Ascii, TagValue.TagValue
  Protected.i ProcessingTagName, ContinueTagName, IsClosingTag, InnerEndPos, Pos
  Protected *c.Character = @Source
  ClearList(TagList())
  If *c\c
    Repeat
      If ProcessingTagName
        Select CharacterType(*c\c)
          Case 1              ; Whitespace
            ContinueTagName = #False
          Case 2              ; '!'
            If ContinueTagName
              ProcessingTagName = #False
              If TagValue\q1 = 0 And Mid(Source, InnerEndPos, 4) = "<!--"
                ; Skip comment
                Pos = FindString (Source, "-->", InnerEndPos + 4)
                If Pos
                  *c = @Source + (Pos + 1) << #CharSizeShift
                Else
                  Break  
                EndIf
              EndIf
            EndIf
          Case 3              ; '/'
            If TagValue\q1 = 0
              IsClosingTag = #True
            EndIf
          Case 5              ; '>'
            ProcessingTagName = #False
            *TagName\a = 0
            ; *** Process tag code ***
            If IsClosingTag
              *Tag = LastElement(TagList())
              While *Tag
                If *Tag\Length = 0 And *Tag\NameValue = TagValue\q1
                  *Tag\Length = InnerEndPos - *Tag\Start
                  If *Tag\Length = 0
                    *Tag\Length = -1
                  EndIf
                  Break
                EndIf
                *Tag = PreviousElement(TagList())
              Wend
              LastElement(TagList())
            Else
              *Tag = AddElement(TagList())
              *Tag\NameValue = TagValue\q1
              *Tag\Start = (*c - @Source) >> #CharSizeShift + 2
              *Tag\Name = Mid(Source, InnerEndPos, *Tag\Start - InnerEndPos)
              If TagValue\q1 = $747069726373
                ; Handle script tag
                Pos = FindString(Source, "</script", InnerEndPos + 8, #PB_String_NoCase)
                If Pos
                  *c = @Source + (Pos - 2) << #CharSizeShift
                Else
                  Break  
                EndIf
              EndIf
            EndIf
            ; *** End of process tag code ***
          Case 6              ; 0-9
            If TagValue\q1 = 0
              ProcessingTagName = #False
            ElseIf ContinueTagName And *TagName - @TagValue < 15
              *TagName\a = *c\c
              *TagName + 1
            EndIf
          Case 7              ; A-Z a-z
            If ContinueTagName And *TagName - @TagValue < 15
              *TagName\a = *c\c | $20
              *TagName + 1
            EndIf
          Default
            If ContinueTagName
              ProcessingTagName = #False
            EndIf
        EndSelect
      ElseIf *c\c = 60
        InnerEndPos = (*c - @Source) >> #CharSizeShift + 1
        ProcessingTagName = #True
        ContinueTagName = #True
        IsClosingTag = #False
        *TagName = @TagValue
        TagValue\q1 = 0
      EndIf
      *c + #CharSize
    Until *c\c = 0
    ; remove empty tags
    ResetList(TagList())
    While NextElement(TagList())
      If TagList()\Length < 1
        DeleteElement(TagList())
      EndIf
    Wend
  EndIf
EndProcedure
Windows (x64)
Raspberry Pi OS (Arm64)
Post Reply