Page 2 of 3
Posted: Wed Oct 26, 2005 8:20 pm
by netmaestro
And finally, the simplest solution.
You are doing excellent work, GedB. But I'm not sure I can agree that this is the simplest solution. The thing is, this approach requires a preparatory step by the program to prepare the data for access. I'd like to avoid that if I can. I want to create a dll file with one function only that is usable by any program in any language. Using an indexed approach requires two functions, one to call after initializing the dll, the prep one that gets the index going, and one for wordsearch. Your approach would be ideal for a program that is including the data into itself, but for a dll distribution I think this might be better:
The main problem is that the file is too large because of trailing spaces in the words. But what if we just took them all out and left only the 0's to delimit the words? The binary search wouldn't work.
As written it wouldn't work. But it's lightning fast right now, capable of searching thousands of words in a few milliseconds. A little fat added to it wouldn't make a noticeable difference at all. So, why not, instead of finding the middle
word in the list, just find the middle of the list bytewise and then back up 'til we hit a null? Then do a PeekS(), call that the middle word and proceed as normal? The performance hit would be finding the starts of the words and, because we aren't dividing the words exactly in half, an extra loop or two might be performed. Then it might take 15 or 20 milliseconds to search 1000 words instead of 10 but optimum size is achieved and there is still just the one single function. To me that seems simplest. What do you think?
Btw, many people are downloading those text files from my site. At least a dozen a day average. I have no idea where they are getting the link because when I google TWL98 or SOWPODS my site doesn't come up. SCRABBLE either. I'm mystified on that one.
Posted: Wed Oct 26, 2005 9:03 pm
by GedB
netmaestro,
There are payoffs, which is why I didn't remove any of the previous examples.
The index can be build dynamically or in advance, I've provided examples of both.
Accessing the word list through an array is by far the best approach. The ASM for dereferencing a structure lookup is very efficient. As you can see below, all the word is done in place, rather than having to call the PeekS procedure.
Code: Select all
; wordchecked.s = *Word\List[current]
MOV ebp,dword [s_wordexists.p_word]
PUSH ebp
MOV eax,dword [esp+16]
SAL eax,2
POP ebp
ADD ebp,eax
MOV edx,dword [ebp]
; PeekS(?lex+((offset-1)*16))
PUSH dword [PB_StringBase]
MOV ebp,l_lex
MOV ebx,ebp
MOV edi,dword [esp+8]
DEC edi
SAL edi,4
ADD ebx,edi
MOV eax,ebx
CALL PB_PeekS
It is also much easier to read.
Code: Select all
*Word\List[current]
PeekS(?lex+((offset-1)*16))
The intention is much clearer.
If you want to stick with the fixed length string solution you can get the same benefits with a stucture like this
Code: Select all
Structure WordBytes
bytes.b[16]
EndStructure
Structure WordList
List.WordBytes[#word_count]
EndStructure
For added efficiency use MemoryCompare() rather than string comparison
Code: Select all
; CompareMemory(@WordToFind, @*Word\List[507], WordLength)
PUSH dword [v_WordLength]
MOV ebp,dword [p_Word]
LEA eax,[ebp+8112]
PUSH eax
MOV eax,dword [v_WordToFind]
PUSH eax
CALL _PB_CompareMemory@12
btw. The only reason I know about the file is because you told me about it here. I tried to google for it the other day, and couldn't find it. I ended up coming back here for it.
Posted: Wed Oct 26, 2005 9:10 pm
by GedB
The idea of just jumping straight into the buffer and backtracking to a null is an interesting one.
If memory were at a premium, I think it would be worth it.
However, as you point out, memory is cheap. Your better off optimising for speed than memory. Keep the download small, but you don't have to be tight with runtime memory.
Also, you might want to grow your dictionary dll later on. Having a nice clean lookup for strings will make that easier.
btw. Are they definately intentional downloads of the text file? Could it be bots crawling the site?
Posted: Thu Oct 27, 2005 1:05 am
by netmaestro
I went ahead and tried it out without the spaces and with the search modified as I described above. The dll lost a megabyte in size and a test program showed that the entire contents of the TWL98 text file could be tested for validity with the dll in under one second. That's lots fast for me.
Posted: Thu Oct 27, 2005 5:18 am
by Paul
I made a DLLs for both the SOWPODS dictionary and the TWL98 dictionary using the dictionary/spell checker engine I created back a year ago.
I use a different approach for lookup and it seems very fast.
You or anyone else are quite welcome to use the DLL's for any project you wish.
http://www.pureproject.net/archive411/d ... aryDLL.zip
Basically if you send a word to the DLL it returns 0 if the work is found. If the word is not found it returns 1. If you send a string of words, it returns the number of words not found.
Posted: Fri Oct 28, 2005 10:34 pm
by netmaestro
I got a listing of 3235 new words to add to TWL98 from the latest version of the Official Scrabble players' Dictionary. So existing dlls are not going to be usable. A new one must be made. Thanks anyway, Paul. Your free stuff and ever-present readiness to help are much appreciated.
This:
http://lloydsplace.com/newtwl98.txt
When processed thus:
Code: Select all
Dim wordin.s(250000)
count.l=0
ReadFile(0,"newtwl98.txt")
While Eof(0) = 0
wordin(count)=ReadString()
count+1
Wend
CloseFile(0)
count=0
OpenFile(0,"wordlist.bin")
WriteByte(0)
For i=0 To 171326
WriteString(wordin(i))
WriteByte(0)
Next i
CloseFile(0)
becomes this:
http://lloydsplace.com/wordlist.bin
which gets included into this:
Code: Select all
; This procedure searches a binary word list made from a text file
; with all newlines replaced by binary zeros to delimit the words.
; The included word list must also begin with a binary zero.
; This version of the dll is written for PureBasic compatibility.
ProcedureDLL.l WordExists(wordtofind.s)
middleword.l
wordtofind = UCase(wordtofind)
wordchecked.s = ""
found.l = 0
lo = ?lexiconstart
hi = ?lexiconend
ccount=0
While ccount <= 18 ; if it isn't found in 18 loops it isn't there
middleword = lo + (hi-lo) >> 1 ; find the middle of the word list bytewise
While PeekB(middleword-1)<>0 ; relocate to the beginning of the word
middleword-1
Wend
wordchecked=PeekS(middleword) ; check it against the search word
If wordtofind=wordchecked
found = 1
ccount=18 ; no more loops are required if search word is found
Else
If wordtofind < wordchecked
hi = middleword ;cut out this word and all words above leaving delimiting 0 at top
Else
lo = middleword + Len(wordchecked)-1 ;cut out this word and all below leaving delimiting 0 at bottom
EndIf
EndIf
ccount+1
Wend
ProcedureReturn found
DataSection
lexiconstart:
IncludeBinary "wordlist.bin"
lexiconend:
EndDataSection
EndProcedure
which gets compiled into this:
http://lloydsplace.com/lexicon_pb.dll
which in turn gets called by this:
Code: Select all
OpenLibrary(0,"lexicon_pb.dll")
OpenConsole()
PrintN("Starting...")
Dim words.s(171327)
ccount=0
ReadFile(0,"newtwl98.txt")
While Eof(0) = 0
words(ccount)=ReadString()
ccount+1
Wend
words(171327)="fragilistic" ;stick a bad one in there to test fail condition
CloseFile(0)
st1=GetTickCount_()
For i=0 To 171327
If CallFunction(0,"WordExists",words(i)) <> 1
PrintN(words(i) + " " + "Was not found")
EndIf
Next i
st2=GetTickCount_()
result = (st2-st1)
PrintN("Validating 171327 words took "+Str(result) + " milliseconds.")
PrintN("Bye")
Input()
CloseConsole()
and it verifies every word in the list in around one second.
Posted: Fri Oct 28, 2005 11:48 pm
by GedB
Netmaestro,
That final version is looking rather nice.
I love watching code evolve like this. Different methods tried, lots of listings demonstrating various ideas. It's like poetry, don't you think?
Posted: Fri Oct 28, 2005 11:58 pm
by netmaestro
Fred Reinfeld wrote:
"Chess, like music, like love, has the power to make men happy."
I think programming has the same possibilities when you're using a finely-crafted tool like PureBasic. It really is a pleasing combination of art and science.
Re: Use a binaryincluded text file
Posted: Fri Oct 25, 2013 6:38 am
by IdeasVacuum
Adapted your excellent method netmaestro, works beautifully.
The code is nicely commented too, but there is one thing - how did you determine the limit of the number of loops? (18)
Re: Use a binaryincluded text file
Posted: Fri Oct 25, 2013 7:12 am
by netmaestro
By the number of words in the list. A list of one million words can be exhaustively searched with 20 tests, half a million takes 19, and however many words are in that list, I forget but it's less than 1/4 million, takes up to 18. Not found in 18 tests, it isn't there.
Re: Use a binaryincluded text file
Posted: Fri Oct 25, 2013 7:33 am
by netmaestro
Updated the links in that posting, the files are available now. networkmaestro.com is now lloydsplace.com since a couple of years.
Re: Use a binaryincluded text file
Posted: Fri Oct 25, 2013 11:29 am
by PB
> I would like to include the text file in my exe and search
> it quickly to see if a word exists in it or not
Not my code, but does what you describe:
Code: Select all
text$=PeekS(?MyTextFile)
offset=FindString(text$,"support@purebasic.com")
Debug offset
End
DataSection
MyTextFile:
IncludeBinary #PB_Compiler_Home+"SDK\Readme.txt"
Data.c 0
EndDataSection
Re: Use a binaryincluded text file
Posted: Fri Oct 25, 2013 12:05 pm
by rsts
Good answer (to an 8 year old question. Wonder if he's still waiting?)

Re: Use a binaryincluded text file
Posted: Fri Oct 25, 2013 12:22 pm
by PB
[Edit] My post was wrong, and seems to have upset someone.
Re: Use a binaryincluded text file
Posted: Fri Oct 25, 2013 12:38 pm
by rsts
Once again, PB has to be correct.
Please show me where he asked the question anywhere besides the eight year old post.