I need to do a fast string search.

ricardo · Post by **ricardo** » Fri Aug 08, 2014 10:53 pm

Hi,

I am using a lot the PB's findstring and its getting to long to make the job.

What is the fastest way to find a string in a 500 word (per example) text? Sometimes in 32 an sometimes using a 64 bit computer.

I remeber i saw once some ASM code here, but i cant find it.

Thanks in advance

ricardo · Post by **ricardo** » Fri Aug 08, 2014 11:14 pm

regex is faster?

skywalk · Post by **skywalk** » Fri Aug 08, 2014 11:18 pm

regex is never faster than customized code.

juror · Post by **juror** » Fri Aug 08, 2014 11:26 pm

You might look here http://www.purebasic.fr/english/viewtop ... it=blitzes and check the referenced post.

I used the asm search until sometime several years back when PB changed/optimized it's findstring afterwhich I could determine little difference in PB native versus the asm so I reverted back to PB findstring.

Danilo · Post by **Danilo** » Fri Aug 08, 2014 11:30 pm

What's the speed requirement for looking for a word within a text/string of 500 words?

I mean, you can parse a 1 MB file in 100ms. How big is your 500 words text? Some more info's required,
and better show some code for the slow search you are using.

IdeasVacuum · Post by **IdeasVacuum** » Sat Aug 09, 2014 12:07 am

FindString() is (should be) fast for a small word total like that. Most of the other methods, such as Map, probably would not be faster for a small word total as initialisation will add a bit of overhead.

ricardo · Post by **ricardo** » Sat Aug 09, 2014 1:16 am

Hi,

I am doing a lot of searches in that 500~1000 words text, in fact im doing 30 to 50 searches in 2k texts and i feel its slower.
I tried to do it in threads but does not found a significant difference. Maybe my code was not the better or maybe i need to find an alternative to findstring.

I am dealing with HTML code and i am searching for the innerText of some tags (the text inside the tag). Thats why regex comes to my mind.
I want to find the faster way, because im doing it some 2k texts and want to be as fastest as possible.

I want to do something like (its not real code, just an idea of what i want to do)

Code: Select all

Pos = findString(Code$,"<a",Pos+1)
EndPos = FindString(Code$,"</a",Pos+1)
Result$ = mid(Code$,Pos,EndPos-Pos)

As i said, its not real code, its just and idea of the kind of search im doing several times per page and doing it in 2k different pages.

Imagine i want to search some tags (title,p,h1, h2, etc) to see if it contain some specific word in 2k web pages.Thats more or less what im looking to do as fast as possible.
.

I want to search in 2k webpages if it contains some specific word in any h1, h2, h3, h4, h5, h6, p,etc and, in case its contained, extract the innerText.
That why any millisecond is good for me.

IdeasVacuum · Post by **IdeasVacuum** » Sat Aug 09, 2014 1:31 am

OK, in summary, you are:

1) Searching a file (HTML/XML) sized 2K.
2) You don't want to just find out if the file contains a specific word or string
3) You want to find specific tags or titles, and then search from that point for a specific word or string.
4) The text associated with/following the tag/title can be around 500 to 1000 words.
5) All of the text in (4) is to be extracted if it contains the word/string searched for.

Is that correct? If not, can you make a similar list of the process you do require? Can you upload some sample files?

ricardo · Post by **ricardo** » Sat Aug 09, 2014 1:49 am

IdeasVacuum wrote:OK, in summary, you are:

1) Searching a file (HTML/XML) sized 2K.
2) You don't want to just find out if the file contains a specific word or string
3) You want to find specific tags or titles, and then search from that point for a specific word or string.
4) The text associated with/following the tag/title can be around 500 to 1000 words.
5) All of the text in (4) is to be extracted if it contains the word/string searched for.

Is that correct? If not, can you make a similar list of the process you do require? Can you upload some sample files?

No, its a little bit different.

1) Im searching in 2k different (HTML/XML) files, each file are around 500 and 1k words.
2) I want to detect if in certain tags (of each file) contain some specific word and if its cointained, then extract the entire content inside that tag (<tag>This text</tag>, obtain "This text"), the innerText.

Like im doing this several times, i want to be as fast as possible.

Ofcourse its a little more complicated because im searching again, once i extract the innerText, if it contains Tags and clean that tags and its content (if are unwanted tags) or extract the content.

To make it shorter, basically im searching all the h1, h2, etc and <p></p>, <li></li>,<div></div> (other tags too but this ones are the most important) an d checking and extrating content (taking out all the unwanted tags). Yes, the tags can contain some attributes, im ignoring them.

My code is doing fine the work, but i want to find a findstring alternative (if possible) that its faster.

IdeasVacuum · Post by **IdeasVacuum** » Sat Aug 09, 2014 3:31 am

OK, well actually your description seems to mostly fit mine....

Even if not, method of search could be similar.

So, to start-off, here is a search that is easy to understand. It's not slow, but only you can judge if it is fast enough for your purposes. It would be a little faster if ReadFile() was replaced by ReadData().

Code: Select all

#FileIO = 0

Structure Tags
sStart.s
sEnd.s
EndStructure

NewList Tag.Tags()
NewList Word.s()
NewList ExtractedText.s()

AddElement(Tag()) : Tag()\sStart = "<h1>" : Tag()\sEnd = "</h1>"
AddElement(Tag()) : Tag()\sStart = "<h2>" : Tag()\sEnd = "</h2>"
AddElement(Tag()) : Tag()\sStart = "<h3>" : Tag()\sEnd = "</h3>"
AddElement(Tag()) : Tag()\sStart = "<h4>" : Tag()\sEnd = "</h4>"
AddElement(Tag()) : Tag()\sStart = "<h5>" : Tag()\sEnd = "</h5>"
AddElement(Tag()) : Tag()\sStart = "<h6>" : Tag()\sEnd = "</h6>"
AddElement(Tag()) : Tag()\sStart = "<p>"  : Tag()\sEnd = "</p>"

AddElement(Word()) : Word() = "### My1stWord ###"
AddElement(Word()) : Word() = "### My2ndWord ###"
AddElement(Word()) : Word() = "### My3rdWord ###"

Define iTagPosnStart.i, iTagPosnEnd.i, iWordPosn.i, iSearchStart.i = 0
Define iThisTagDone.i = #False
Define sWholeFile.s, sTextToSearch.s

If ReadFile(#FileIO, "C:\MyFile.html")

       sWholeFile = ReadString(#FileIO, #PB_UTF8 | #PB_File_IgnoreEOL)
                     CloseFile(#FileIO)
EndIf

ForEach Tag()

       iThisTagDone = #False
       iSearchStart = 0

       Repeat
                 iTagPosnStart = FindString(sWholeFile, Tag()\sStart, iSearchStart, #PB_String_CaseSensitive)
              If(iTagPosnStart > 0) ;Start Tag found

                        iTagPosnEnd = FindString(sWholeFile, Tag()\sEnd, iTagPosnStart, #PB_String_CaseSensitive)
                     If(iTagPosnEnd > iTagPosnStart) ;End Tag found

                            sTextToSearch = Mid(sWholeFile, iTagPosnStart, (iTagPosnEnd - iTagPosnStart))

                            ForEach Word()

                                       iWordPosn = FindString(sTextToSearch, Word(), 0, #PB_String_CaseSensitive)
                                    If(iWordPosn > 0) ;Word found

                                               AddElement(ExtractedText()) : ExtractedText() = sTextToSearch
                                    EndIf
                            Next
                     EndIf

                     iSearchStart = iTagPosnEnd
              Else
                     iThisTagDone = #True
              EndIf

       Until iThisTagDone = #True
Next

ForEach ExtractedText()

        Debug ExtractedText()
Next

ClearList(Tag())
ClearList(Word())
ClearList(ExtractedText())

End

The sample file I used:MyFile.html

wilbert · Post by **wilbert** » Sat Aug 09, 2014 6:04 am

If you have to look for that many tags, you better write your own parser that does everything in one pass.
Having to go over and over the same html code for each tag is what makes it slow.

NicTheQuick · Post by **NicTheQuick** » Sat Aug 09, 2014 8:24 am

What about using a suffix tree?

IdeasVacuum · Post by **IdeasVacuum** » Sat Aug 09, 2014 3:45 pm

...another approach would be to find the word/string, and then verify if it was under a tag of interest.

RASHAD · Post by **RASHAD** » Sat Aug 09, 2014 3:56 pm

Hi ricardo
Search the forum for [Search with Wild Card by jamba]
It might help

Good luck

IdeasVacuum · Post by **IdeasVacuum** » Sat Aug 09, 2014 3:58 pm

A tiny bit faster and extracted text has tags removed:

Code: Select all

#FileIO = 0

Structure Tags
sStart.s
sEnd.s
EndStructure

NewList Tag.Tags()
NewList Word.s()
NewList ExtractedText.s()

AddElement(Tag()) : Tag()\sStart = "<h1>" : Tag()\sEnd = "</h1>"
AddElement(Tag()) : Tag()\sStart = "<h2>" : Tag()\sEnd = "</h2>"
AddElement(Tag()) : Tag()\sStart = "<h3>" : Tag()\sEnd = "</h3>"
AddElement(Tag()) : Tag()\sStart = "<h4>" : Tag()\sEnd = "</h4>"
AddElement(Tag()) : Tag()\sStart = "<h5>" : Tag()\sEnd = "</h5>"
AddElement(Tag()) : Tag()\sStart = "<h6>" : Tag()\sEnd = "</h6>"
AddElement(Tag()) : Tag()\sStart = "<p>"  : Tag()\sEnd = "</p>"

AddElement(Word()) : Word() = "### My1stWord ###"
AddElement(Word()) : Word() = "### My2ndWord ###"
AddElement(Word()) : Word() = "### My3rdWord ###"

Define iTagPosnStart.i, iTagPosnEnd.i, iWordPosn.i, iSearchStart.i = 0
Define iThisTagDone.i = #False
Define *WholeFile
Define sTextToSearch.s
Define qFileLen.q = 0

If ReadFile(#FileIO, "C:\MyFile.html")

                        qFileLen = Lof(#FileIO)
                      *WholeFile = AllocateMemory(qFileLen)

                      ReadData(#FileIO, *WholeFile, qFileLen)
                     CloseFile(#FileIO)
EndIf

ForEach Tag()

       iThisTagDone = #False
       iSearchStart = 0

       Repeat
                 iTagPosnStart = FindString(PeekS(*WholeFile, qFileLen, #PB_UTF8), Tag()\sStart, iSearchStart, #PB_String_CaseSensitive)
              If(iTagPosnStart > 0) ;Start Tag found

                        iTagPosnEnd = FindString(PeekS(*WholeFile, qFileLen, #PB_UTF8), Tag()\sEnd, iTagPosnStart, #PB_String_CaseSensitive)
                     If(iTagPosnEnd > iTagPosnStart) ;End Tag found

                            sTextToSearch = Mid(PeekS(*WholeFile, qFileLen, #PB_UTF8), iTagPosnStart + (Len(Tag()\sStart)), (iTagPosnEnd - iTagPosnStart + (Len(Tag()\sStart))))

                            ForEach Word()

                                       iWordPosn = FindString(sTextToSearch, Word(), 0, #PB_String_CaseSensitive)
                                    If(iWordPosn > 0) ;Word found

                                               AddElement(ExtractedText()) : ExtractedText() = sTextToSearch
                                    EndIf
                            Next
                     EndIf

                     iSearchStart = iTagPosnEnd
              Else
                     iThisTagDone = #True
              EndIf

       Until iThisTagDone = #True
Next

ForEach ExtractedText()

        Debug ExtractedText()
Next

ClearList(Tag())
ClearList(Word())
ClearList(ExtractedText())

End

PureBasic Forums - English

I need to do a fast string search.

I need to do a fast string search.

Re: I need to do a fast string search.

Re: I need to do a fast string search.

Re: I need to do a fast string search.

Re: I need to do a fast string search.

Re: I need to do a fast string search.

Re: I need to do a fast string search.

Re: I need to do a fast string search.

Re: I need to do a fast string search.

Re: I need to do a fast string search.

Re: I need to do a fast string search.

Re: I need to do a fast string search.

Re: I need to do a fast string search.

Re: I need to do a fast string search.

Re: I need to do a fast string search.