I need to do a fast string search.

Just starting out? Need help? Post your questions and find answers here.
ricardo
Addict
Addict
Posts: 2438
Joined: Fri Apr 25, 2003 7:06 pm
Location: Argentina

I need to do a fast string search.

Post by ricardo »

Hi,

I am using a lot the PB's findstring and its getting to long to make the job.

What is the fastest way to find a string in a 500 word (per example) text? Sometimes in 32 an sometimes using a 64 bit computer.

I remeber i saw once some ASM code here, but i cant find it.

Thanks in advance
Last edited by ricardo on Fri Aug 08, 2014 11:16 pm, edited 1 time in total.
ARGENTINA WORLD CHAMPION
ricardo
Addict
Addict
Posts: 2438
Joined: Fri Apr 25, 2003 7:06 pm
Location: Argentina

Re: I need to do a fast string search.

Post by ricardo »

regex is faster?
ARGENTINA WORLD CHAMPION
User avatar
skywalk
Addict
Addict
Posts: 4211
Joined: Wed Dec 23, 2009 10:14 pm
Location: Boston, MA

Re: I need to do a fast string search.

Post by skywalk »

regex is never faster than customized code.
The nice thing about standards is there are so many to choose from. ~ Andrew Tanenbaum
juror
Enthusiast
Enthusiast
Posts: 228
Joined: Mon Jul 09, 2007 4:47 pm
Location: Courthouse

Re: I need to do a fast string search.

Post by juror »

You might look here http://www.purebasic.fr/english/viewtop ... it=blitzes and check the referenced post.

I used the asm search until sometime several years back when PB changed/optimized it's findstring afterwhich I could determine little difference in PB native versus the asm so I reverted back to PB findstring.
User avatar
Danilo
Addict
Addict
Posts: 3036
Joined: Sat Apr 26, 2003 8:26 am
Location: Planet Earth

Re: I need to do a fast string search.

Post by Danilo »

What's the speed requirement for looking for a word within a text/string of 500 words?

I mean, you can parse a 1 MB file in 100ms. How big is your 500 words text? Some more info's required,
and better show some code for the slow search you are using.
IdeasVacuum
Always Here
Always Here
Posts: 6426
Joined: Fri Oct 23, 2009 2:33 am
Location: Wales, UK
Contact:

Re: I need to do a fast string search.

Post by IdeasVacuum »

FindString() is (should be) fast for a small word total like that. Most of the other methods, such as Map, probably would not be faster for a small word total as initialisation will add a bit of overhead.
IdeasVacuum
If it sounds simple, you have not grasped the complexity.
ricardo
Addict
Addict
Posts: 2438
Joined: Fri Apr 25, 2003 7:06 pm
Location: Argentina

Re: I need to do a fast string search.

Post by ricardo »

Hi,

I am doing a lot of searches in that 500~1000 words text, in fact im doing 30 to 50 searches in 2k texts and i feel its slower.
I tried to do it in threads but does not found a significant difference. Maybe my code was not the better or maybe i need to find an alternative to findstring.

I am dealing with HTML code and i am searching for the innerText of some tags (the text inside the tag). Thats why regex comes to my mind.
I want to find the faster way, because im doing it some 2k texts and want to be as fastest as possible.


I want to do something like (its not real code, just an idea of what i want to do)

Code: Select all

Pos = findString(Code$,"<a",Pos+1)
EndPos = FindString(Code$,"</a",Pos+1)
Result$ = mid(Code$,Pos,EndPos-Pos)
As i said, its not real code, its just and idea of the kind of search im doing several times per page and doing it in 2k different pages.

Imagine i want to search some tags (title,p,h1, h2, etc) to see if it contain some specific word in 2k web pages.Thats more or less what im looking to do as fast as possible.
.


I want to search in 2k webpages if it contains some specific word in any h1, h2, h3, h4, h5, h6, p,etc and, in case its contained, extract the innerText.
That why any millisecond is good for me.
ARGENTINA WORLD CHAMPION
IdeasVacuum
Always Here
Always Here
Posts: 6426
Joined: Fri Oct 23, 2009 2:33 am
Location: Wales, UK
Contact:

Re: I need to do a fast string search.

Post by IdeasVacuum »

OK, in summary, you are:

1) Searching a file (HTML/XML) sized 2K.
2) You don't want to just find out if the file contains a specific word or string
3) You want to find specific tags or titles, and then search from that point for a specific word or string.
4) The text associated with/following the tag/title can be around 500 to 1000 words.
5) All of the text in (4) is to be extracted if it contains the word/string searched for.

Is that correct? If not, can you make a similar list of the process you do require? Can you upload some sample files?
IdeasVacuum
If it sounds simple, you have not grasped the complexity.
ricardo
Addict
Addict
Posts: 2438
Joined: Fri Apr 25, 2003 7:06 pm
Location: Argentina

Re: I need to do a fast string search.

Post by ricardo »

IdeasVacuum wrote:OK, in summary, you are:

1) Searching a file (HTML/XML) sized 2K.
2) You don't want to just find out if the file contains a specific word or string
3) You want to find specific tags or titles, and then search from that point for a specific word or string.
4) The text associated with/following the tag/title can be around 500 to 1000 words.
5) All of the text in (4) is to be extracted if it contains the word/string searched for.

Is that correct? If not, can you make a similar list of the process you do require? Can you upload some sample files?

No, its a little bit different.

1) Im searching in 2k different (HTML/XML) files, each file are around 500 and 1k words.
2) I want to detect if in certain tags (of each file) contain some specific word and if its cointained, then extract the entire content inside that tag (<tag>This text</tag>, obtain "This text"), the innerText.

Like im doing this several times, i want to be as fast as possible.

Ofcourse its a little more complicated because im searching again, once i extract the innerText, if it contains Tags and clean that tags and its content (if are unwanted tags) or extract the content.

To make it shorter, basically im searching all the h1, h2, etc and <p></p>, <li></li>,<div></div> (other tags too but this ones are the most important) an d checking and extrating content (taking out all the unwanted tags). Yes, the tags can contain some attributes, im ignoring them.

My code is doing fine the work, but i want to find a findstring alternative (if possible) that its faster.
ARGENTINA WORLD CHAMPION
IdeasVacuum
Always Here
Always Here
Posts: 6426
Joined: Fri Oct 23, 2009 2:33 am
Location: Wales, UK
Contact:

Re: I need to do a fast string search.

Post by IdeasVacuum »

OK, well actually your description seems to mostly fit mine.... :) Even if not, method of search could be similar.

So, to start-off, here is a search that is easy to understand. It's not slow, but only you can judge if it is fast enough for your purposes. It would be a little faster if ReadFile() was replaced by ReadData().

Code: Select all

#FileIO = 0

Structure Tags
sStart.s
sEnd.s
EndStructure

NewList Tag.Tags()
NewList Word.s()
NewList ExtractedText.s()

AddElement(Tag()) : Tag()\sStart = "<h1>" : Tag()\sEnd = "</h1>"
AddElement(Tag()) : Tag()\sStart = "<h2>" : Tag()\sEnd = "</h2>"
AddElement(Tag()) : Tag()\sStart = "<h3>" : Tag()\sEnd = "</h3>"
AddElement(Tag()) : Tag()\sStart = "<h4>" : Tag()\sEnd = "</h4>"
AddElement(Tag()) : Tag()\sStart = "<h5>" : Tag()\sEnd = "</h5>"
AddElement(Tag()) : Tag()\sStart = "<h6>" : Tag()\sEnd = "</h6>"
AddElement(Tag()) : Tag()\sStart = "<p>"  : Tag()\sEnd = "</p>"

AddElement(Word()) : Word() = "### My1stWord ###"
AddElement(Word()) : Word() = "### My2ndWord ###"
AddElement(Word()) : Word() = "### My3rdWord ###"

Define iTagPosnStart.i, iTagPosnEnd.i, iWordPosn.i, iSearchStart.i = 0
Define iThisTagDone.i = #False
Define sWholeFile.s, sTextToSearch.s

If ReadFile(#FileIO, "C:\MyFile.html")

       sWholeFile = ReadString(#FileIO, #PB_UTF8 | #PB_File_IgnoreEOL)
                     CloseFile(#FileIO)
EndIf

ForEach Tag()

       iThisTagDone = #False
       iSearchStart = 0

       Repeat
                 iTagPosnStart = FindString(sWholeFile, Tag()\sStart, iSearchStart, #PB_String_CaseSensitive)
              If(iTagPosnStart > 0) ;Start Tag found

                        iTagPosnEnd = FindString(sWholeFile, Tag()\sEnd, iTagPosnStart, #PB_String_CaseSensitive)
                     If(iTagPosnEnd > iTagPosnStart) ;End Tag found

                            sTextToSearch = Mid(sWholeFile, iTagPosnStart, (iTagPosnEnd - iTagPosnStart))

                            ForEach Word()

                                       iWordPosn = FindString(sTextToSearch, Word(), 0, #PB_String_CaseSensitive)
                                    If(iWordPosn > 0) ;Word found

                                               AddElement(ExtractedText()) : ExtractedText() = sTextToSearch
                                    EndIf
                            Next
                     EndIf

                     iSearchStart = iTagPosnEnd
              Else
                     iThisTagDone = #True
              EndIf

       Until iThisTagDone = #True
Next

ForEach ExtractedText()

        Debug ExtractedText()
Next

ClearList(Tag())
ClearList(Word())
ClearList(ExtractedText())

End
The sample file I used:MyFile.html
IdeasVacuum
If it sounds simple, you have not grasped the complexity.
wilbert
PureBasic Expert
PureBasic Expert
Posts: 3942
Joined: Sun Aug 08, 2004 5:21 am
Location: Netherlands

Re: I need to do a fast string search.

Post by wilbert »

If you have to look for that many tags, you better write your own parser that does everything in one pass.
Having to go over and over the same html code for each tag is what makes it slow.
Windows (x64)
Raspberry Pi OS (Arm64)
User avatar
NicTheQuick
Addict
Addict
Posts: 1504
Joined: Sun Jun 22, 2003 7:43 pm
Location: Germany, Saarbrücken
Contact:

Re: I need to do a fast string search.

Post by NicTheQuick »

What about using a suffix tree?
The english grammar is freeware, you can use it freely - But it's not Open Source, i.e. you can not change it or publish it in altered way.
IdeasVacuum
Always Here
Always Here
Posts: 6426
Joined: Fri Oct 23, 2009 2:33 am
Location: Wales, UK
Contact:

Re: I need to do a fast string search.

Post by IdeasVacuum »

...another approach would be to find the word/string, and then verify if it was under a tag of interest.
IdeasVacuum
If it sounds simple, you have not grasped the complexity.
RASHAD
PureBasic Expert
PureBasic Expert
Posts: 4946
Joined: Sun Apr 12, 2009 6:27 am

Re: I need to do a fast string search.

Post by RASHAD »

Hi ricardo
Search the forum for [Search with Wild Card by jamba]
It might help

Good luck
Egypt my love
IdeasVacuum
Always Here
Always Here
Posts: 6426
Joined: Fri Oct 23, 2009 2:33 am
Location: Wales, UK
Contact:

Re: I need to do a fast string search.

Post by IdeasVacuum »

A tiny bit faster and extracted text has tags removed:

Code: Select all

#FileIO = 0

Structure Tags
sStart.s
sEnd.s
EndStructure

NewList Tag.Tags()
NewList Word.s()
NewList ExtractedText.s()

AddElement(Tag()) : Tag()\sStart = "<h1>" : Tag()\sEnd = "</h1>"
AddElement(Tag()) : Tag()\sStart = "<h2>" : Tag()\sEnd = "</h2>"
AddElement(Tag()) : Tag()\sStart = "<h3>" : Tag()\sEnd = "</h3>"
AddElement(Tag()) : Tag()\sStart = "<h4>" : Tag()\sEnd = "</h4>"
AddElement(Tag()) : Tag()\sStart = "<h5>" : Tag()\sEnd = "</h5>"
AddElement(Tag()) : Tag()\sStart = "<h6>" : Tag()\sEnd = "</h6>"
AddElement(Tag()) : Tag()\sStart = "<p>"  : Tag()\sEnd = "</p>"

AddElement(Word()) : Word() = "### My1stWord ###"
AddElement(Word()) : Word() = "### My2ndWord ###"
AddElement(Word()) : Word() = "### My3rdWord ###"

Define iTagPosnStart.i, iTagPosnEnd.i, iWordPosn.i, iSearchStart.i = 0
Define iThisTagDone.i = #False
Define *WholeFile
Define sTextToSearch.s
Define qFileLen.q = 0

If ReadFile(#FileIO, "C:\MyFile.html")

                        qFileLen = Lof(#FileIO)
                      *WholeFile = AllocateMemory(qFileLen)

                      ReadData(#FileIO, *WholeFile, qFileLen)
                     CloseFile(#FileIO)
EndIf

ForEach Tag()

       iThisTagDone = #False
       iSearchStart = 0

       Repeat
                 iTagPosnStart = FindString(PeekS(*WholeFile, qFileLen, #PB_UTF8), Tag()\sStart, iSearchStart, #PB_String_CaseSensitive)
              If(iTagPosnStart > 0) ;Start Tag found

                        iTagPosnEnd = FindString(PeekS(*WholeFile, qFileLen, #PB_UTF8), Tag()\sEnd, iTagPosnStart, #PB_String_CaseSensitive)
                     If(iTagPosnEnd > iTagPosnStart) ;End Tag found

                            sTextToSearch = Mid(PeekS(*WholeFile, qFileLen, #PB_UTF8), iTagPosnStart + (Len(Tag()\sStart)), (iTagPosnEnd - iTagPosnStart + (Len(Tag()\sStart))))

                            ForEach Word()

                                       iWordPosn = FindString(sTextToSearch, Word(), 0, #PB_String_CaseSensitive)
                                    If(iWordPosn > 0) ;Word found

                                               AddElement(ExtractedText()) : ExtractedText() = sTextToSearch
                                    EndIf
                            Next
                     EndIf

                     iSearchStart = iTagPosnEnd
              Else
                     iThisTagDone = #True
              EndIf

       Until iThisTagDone = #True
Next

ForEach ExtractedText()

        Debug ExtractedText()
Next

ClearList(Tag())
ClearList(Word())
ClearList(ExtractedText())

End
IdeasVacuum
If it sounds simple, you have not grasped the complexity.
Post Reply