Page 1 of 3
I need to do a fast string search.
Posted: Fri Aug 08, 2014 10:53 pm
by ricardo
Hi,
I am using a lot the PB's findstring and its getting to long to make the job.
What is the fastest way to find a string in a 500 word (per example) text? Sometimes in 32 an sometimes using a 64 bit computer.
I remeber i saw once some ASM code here, but i cant find it.
Thanks in advance
Re: I need to do a fast string search.
Posted: Fri Aug 08, 2014 11:14 pm
by ricardo
regex is faster?
Re: I need to do a fast string search.
Posted: Fri Aug 08, 2014 11:18 pm
by skywalk
regex is never faster than customized code.
Re: I need to do a fast string search.
Posted: Fri Aug 08, 2014 11:26 pm
by juror
You might look here
http://www.purebasic.fr/english/viewtop ... it=blitzes and check the referenced post.
I used the asm search until sometime several years back when PB changed/optimized it's findstring afterwhich I could determine little difference in PB native versus the asm so I reverted back to PB findstring.
Re: I need to do a fast string search.
Posted: Fri Aug 08, 2014 11:30 pm
by Danilo
What's the speed requirement for looking for a word within a text/string of 500 words?
I mean, you can parse a 1 MB file in 100ms. How big is your 500 words text? Some more info's required,
and better show some code for the slow search you are using.
Re: I need to do a fast string search.
Posted: Sat Aug 09, 2014 12:07 am
by IdeasVacuum
FindString() is (should be) fast for a small word total like that. Most of the other methods, such as Map, probably would not be faster for a small word total as initialisation will add a bit of overhead.
Re: I need to do a fast string search.
Posted: Sat Aug 09, 2014 1:16 am
by ricardo
Hi,
I am doing a lot of searches in that 500~1000 words text, in fact im doing 30 to 50 searches in 2k texts and i feel its slower.
I tried to do it in threads but does not found a significant difference. Maybe my code was not the better or maybe i need to find an alternative to findstring.
I am dealing with HTML code and i am searching for the innerText of some tags (the text inside the tag). Thats why regex comes to my mind.
I want to find the faster way, because im doing it some 2k texts and want to be as fastest as possible.
I want to do something like (its not real code, just an idea of what i want to do)
Code: Select all
Pos = findString(Code$,"<a",Pos+1)
EndPos = FindString(Code$,"</a",Pos+1)
Result$ = mid(Code$,Pos,EndPos-Pos)
As i said, its not real code, its just and idea of the kind of search
im doing several times per page and doing it in 2k different pages.
Imagine i want to search some tags (title,p,h1, h2, etc) to see if it contain some specific word in 2k web pages.Thats more or less what im looking to do as fast as possible.
.
I want to search in 2k webpages if it contains some specific word in any h1, h2, h3, h4, h5, h6, p,etc and, in case its contained, extract the innerText.
That why any millisecond is good for me.
Re: I need to do a fast string search.
Posted: Sat Aug 09, 2014 1:31 am
by IdeasVacuum
OK, in summary, you are:
1) Searching a file (HTML/XML) sized 2K.
2) You don't want to just find out if the file contains a specific word or string
3) You want to find specific tags or titles, and then search from that point for a specific word or string.
4) The text associated with/following the tag/title can be around 500 to 1000 words.
5) All of the text in (4) is to be extracted if it contains the word/string searched for.
Is that correct? If not, can you make a similar list of the process you do require? Can you upload some sample files?
Re: I need to do a fast string search.
Posted: Sat Aug 09, 2014 1:49 am
by ricardo
IdeasVacuum wrote:OK, in summary, you are:
1) Searching a file (HTML/XML) sized 2K.
2) You don't want to just find out if the file contains a specific word or string
3) You want to find specific tags or titles, and then search from that point for a specific word or string.
4) The text associated with/following the tag/title can be around 500 to 1000 words.
5) All of the text in (4) is to be extracted if it contains the word/string searched for.
Is that correct? If not, can you make a similar list of the process you do require? Can you upload some sample files?
No, its a little bit different.
1) Im searching in 2k different (HTML/XML) files, each file are around 500 and 1k words.
2) I want to detect if in certain tags (of each file) contain some specific word and if its cointained, then extract the entire content inside that tag (<tag>This text</tag>, obtain "This text"), the innerText.
Like im doing this several times,
i want to be as fast as possible.
Ofcourse its a little more complicated because im searching again, once i extract the innerText, if it contains Tags and clean that tags and its content (if are unwanted tags) or extract the content.
To make it shorter, basically im searching all the h1, h2, etc and <p></p>, <li></li>,<div></div> (other tags too but this ones are the most important) an d checking and extrating content (taking out all the unwanted tags). Yes, the tags can contain some attributes, im ignoring them.
My code is doing fine the work, but i want to find a findstring alternative (if possible) that its faster.
Re: I need to do a fast string search.
Posted: Sat Aug 09, 2014 3:31 am
by IdeasVacuum
OK, well actually your description seems to mostly fit mine....

Even if not, method of search could be similar.
So, to start-off, here is a search that is easy to understand. It's not slow, but only you can judge if it is fast enough for your purposes. It would be a little faster if ReadFile() was replaced by ReadData().
Code: Select all
#FileIO = 0
Structure Tags
sStart.s
sEnd.s
EndStructure
NewList Tag.Tags()
NewList Word.s()
NewList ExtractedText.s()
AddElement(Tag()) : Tag()\sStart = "<h1>" : Tag()\sEnd = "</h1>"
AddElement(Tag()) : Tag()\sStart = "<h2>" : Tag()\sEnd = "</h2>"
AddElement(Tag()) : Tag()\sStart = "<h3>" : Tag()\sEnd = "</h3>"
AddElement(Tag()) : Tag()\sStart = "<h4>" : Tag()\sEnd = "</h4>"
AddElement(Tag()) : Tag()\sStart = "<h5>" : Tag()\sEnd = "</h5>"
AddElement(Tag()) : Tag()\sStart = "<h6>" : Tag()\sEnd = "</h6>"
AddElement(Tag()) : Tag()\sStart = "<p>" : Tag()\sEnd = "</p>"
AddElement(Word()) : Word() = "### My1stWord ###"
AddElement(Word()) : Word() = "### My2ndWord ###"
AddElement(Word()) : Word() = "### My3rdWord ###"
Define iTagPosnStart.i, iTagPosnEnd.i, iWordPosn.i, iSearchStart.i = 0
Define iThisTagDone.i = #False
Define sWholeFile.s, sTextToSearch.s
If ReadFile(#FileIO, "C:\MyFile.html")
sWholeFile = ReadString(#FileIO, #PB_UTF8 | #PB_File_IgnoreEOL)
CloseFile(#FileIO)
EndIf
ForEach Tag()
iThisTagDone = #False
iSearchStart = 0
Repeat
iTagPosnStart = FindString(sWholeFile, Tag()\sStart, iSearchStart, #PB_String_CaseSensitive)
If(iTagPosnStart > 0) ;Start Tag found
iTagPosnEnd = FindString(sWholeFile, Tag()\sEnd, iTagPosnStart, #PB_String_CaseSensitive)
If(iTagPosnEnd > iTagPosnStart) ;End Tag found
sTextToSearch = Mid(sWholeFile, iTagPosnStart, (iTagPosnEnd - iTagPosnStart))
ForEach Word()
iWordPosn = FindString(sTextToSearch, Word(), 0, #PB_String_CaseSensitive)
If(iWordPosn > 0) ;Word found
AddElement(ExtractedText()) : ExtractedText() = sTextToSearch
EndIf
Next
EndIf
iSearchStart = iTagPosnEnd
Else
iThisTagDone = #True
EndIf
Until iThisTagDone = #True
Next
ForEach ExtractedText()
Debug ExtractedText()
Next
ClearList(Tag())
ClearList(Word())
ClearList(ExtractedText())
End
The sample file I used:
MyFile.html
Re: I need to do a fast string search.
Posted: Sat Aug 09, 2014 6:04 am
by wilbert
If you have to look for that many tags, you better write your own parser that does everything in one pass.
Having to go over and over the same html code for each tag is what makes it slow.
Re: I need to do a fast string search.
Posted: Sat Aug 09, 2014 8:24 am
by NicTheQuick
What about using a
suffix tree?
Re: I need to do a fast string search.
Posted: Sat Aug 09, 2014 3:45 pm
by IdeasVacuum
...another approach would be to find the word/string, and then verify if it was under a tag of interest.
Re: I need to do a fast string search.
Posted: Sat Aug 09, 2014 3:56 pm
by RASHAD
Hi ricardo
Search the forum for [Search with Wild Card by jamba]
It might help
Good luck
Re: I need to do a fast string search.
Posted: Sat Aug 09, 2014 3:58 pm
by IdeasVacuum
A tiny bit faster and extracted text has tags removed:
Code: Select all
#FileIO = 0
Structure Tags
sStart.s
sEnd.s
EndStructure
NewList Tag.Tags()
NewList Word.s()
NewList ExtractedText.s()
AddElement(Tag()) : Tag()\sStart = "<h1>" : Tag()\sEnd = "</h1>"
AddElement(Tag()) : Tag()\sStart = "<h2>" : Tag()\sEnd = "</h2>"
AddElement(Tag()) : Tag()\sStart = "<h3>" : Tag()\sEnd = "</h3>"
AddElement(Tag()) : Tag()\sStart = "<h4>" : Tag()\sEnd = "</h4>"
AddElement(Tag()) : Tag()\sStart = "<h5>" : Tag()\sEnd = "</h5>"
AddElement(Tag()) : Tag()\sStart = "<h6>" : Tag()\sEnd = "</h6>"
AddElement(Tag()) : Tag()\sStart = "<p>" : Tag()\sEnd = "</p>"
AddElement(Word()) : Word() = "### My1stWord ###"
AddElement(Word()) : Word() = "### My2ndWord ###"
AddElement(Word()) : Word() = "### My3rdWord ###"
Define iTagPosnStart.i, iTagPosnEnd.i, iWordPosn.i, iSearchStart.i = 0
Define iThisTagDone.i = #False
Define *WholeFile
Define sTextToSearch.s
Define qFileLen.q = 0
If ReadFile(#FileIO, "C:\MyFile.html")
qFileLen = Lof(#FileIO)
*WholeFile = AllocateMemory(qFileLen)
ReadData(#FileIO, *WholeFile, qFileLen)
CloseFile(#FileIO)
EndIf
ForEach Tag()
iThisTagDone = #False
iSearchStart = 0
Repeat
iTagPosnStart = FindString(PeekS(*WholeFile, qFileLen, #PB_UTF8), Tag()\sStart, iSearchStart, #PB_String_CaseSensitive)
If(iTagPosnStart > 0) ;Start Tag found
iTagPosnEnd = FindString(PeekS(*WholeFile, qFileLen, #PB_UTF8), Tag()\sEnd, iTagPosnStart, #PB_String_CaseSensitive)
If(iTagPosnEnd > iTagPosnStart) ;End Tag found
sTextToSearch = Mid(PeekS(*WholeFile, qFileLen, #PB_UTF8), iTagPosnStart + (Len(Tag()\sStart)), (iTagPosnEnd - iTagPosnStart + (Len(Tag()\sStart))))
ForEach Word()
iWordPosn = FindString(sTextToSearch, Word(), 0, #PB_String_CaseSensitive)
If(iWordPosn > 0) ;Word found
AddElement(ExtractedText()) : ExtractedText() = sTextToSearch
EndIf
Next
EndIf
iSearchStart = iTagPosnEnd
Else
iThisTagDone = #True
EndIf
Until iThisTagDone = #True
Next
ForEach ExtractedText()
Debug ExtractedText()
Next
ClearList(Tag())
ClearList(Word())
ClearList(ExtractedText())
End