Use a binaryincluded text file

netmaestro · Post by **netmaestro** » Wed Oct 26, 2005 8:20 pm

And finally, the simplest solution.

You are doing excellent work, GedB. But I'm not sure I can agree that this is the simplest solution. The thing is, this approach requires a preparatory step by the program to prepare the data for access. I'd like to avoid that if I can. I want to create a dll file with one function only that is usable by any program in any language. Using an indexed approach requires two functions, one to call after initializing the dll, the prep one that gets the index going, and one for wordsearch. Your approach would be ideal for a program that is including the data into itself, but for a dll distribution I think this might be better:

The main problem is that the file is too large because of trailing spaces in the words. But what if we just took them all out and left only the 0's to delimit the words? The binary search wouldn't work. As written it wouldn't work. But it's lightning fast right now, capable of searching thousands of words in a few milliseconds. A little fat added to it wouldn't make a noticeable difference at all. So, why not, instead of finding the middle word in the list, just find the middle of the list bytewise and then back up 'til we hit a null? Then do a PeekS(), call that the middle word and proceed as normal? The performance hit would be finding the starts of the words and, because we aren't dividing the words exactly in half, an extra loop or two might be performed. Then it might take 15 or 20 milliseconds to search 1000 words instead of 10 but optimum size is achieved and there is still just the one single function. To me that seems simplest. What do you think?

Btw, many people are downloading those text files from my site. At least a dozen a day average. I have no idea where they are getting the link because when I google TWL98 or SOWPODS my site doesn't come up. SCRABBLE either. I'm mystified on that one.

GedB · Post by **GedB** » Wed Oct 26, 2005 9:03 pm

netmaestro,

There are payoffs, which is why I didn't remove any of the previous examples.

The index can be build dynamically or in advance, I've provided examples of both.

Accessing the word list through an array is by far the best approach. The ASM for dereferencing a structure lookup is very efficient. As you can see below, all the word is done in place, rather than having to call the PeekS procedure.

Code: Select all

; wordchecked.s = *Word\List[current]
  MOV    ebp,dword [s_wordexists.p_word]
  PUSH   ebp
  MOV    eax,dword [esp+16]
  SAL    eax,2
  POP    ebp
  ADD    ebp,eax
  MOV    edx,dword [ebp]

; PeekS(?lex+((offset-1)*16))
  PUSH   dword [PB_StringBase]
  MOV    ebp,l_lex
  MOV    ebx,ebp
  MOV    edi,dword [esp+8]
  DEC    edi
  SAL    edi,4
  ADD    ebx,edi
  MOV    eax,ebx
  CALL   PB_PeekS

It is also much easier to read.

Code: Select all

*Word\List[current]

PeekS(?lex+((offset-1)*16))

The intention is much clearer.

If you want to stick with the fixed length string solution you can get the same benefits with a stucture like this

Code: Select all

Structure WordBytes
  bytes.b[16]
EndStructure

Structure WordList
  List.WordBytes[#word_count]
EndStructure

For added efficiency use MemoryCompare() rather than string comparison

Code: Select all

; CompareMemory(@WordToFind, @*Word\List[507], WordLength)
  PUSH   dword [v_WordLength]
  MOV    ebp,dword [p_Word]
  LEA    eax,[ebp+8112]
  PUSH   eax
  MOV    eax,dword [v_WordToFind]
  PUSH   eax
  CALL  _PB_CompareMemory@12

btw. The only reason I know about the file is because you told me about it here. I tried to google for it the other day, and couldn't find it. I ended up coming back here for it.

GedB · Post by **GedB** » Wed Oct 26, 2005 9:10 pm

The idea of just jumping straight into the buffer and backtracking to a null is an interesting one.

If memory were at a premium, I think it would be worth it.

However, as you point out, memory is cheap. Your better off optimising for speed than memory. Keep the download small, but you don't have to be tight with runtime memory.

Also, you might want to grow your dictionary dll later on. Having a nice clean lookup for strings will make that easier.

btw. Are they definately intentional downloads of the text file? Could it be bots crawling the site?

netmaestro · Post by **netmaestro** » Thu Oct 27, 2005 1:05 am

I went ahead and tried it out without the spaces and with the search modified as I described above. The dll lost a megabyte in size and a test program showed that the entire contents of the TWL98 text file could be tested for validity with the dll in under one second. That's lots fast for me.

Paul · Post by **Paul** » Thu Oct 27, 2005 5:18 am

I made a DLLs for both the SOWPODS dictionary and the TWL98 dictionary using the dictionary/spell checker engine I created back a year ago.

I use a different approach for lookup and it seems very fast.
You or anyone else are quite welcome to use the DLL's for any project you wish.

http://www.pureproject.net/archive411/d ... aryDLL.zip

Basically if you send a word to the DLL it returns 0 if the work is found. If the word is not found it returns 1. If you send a string of words, it returns the number of words not found.

netmaestro · Post by **netmaestro** » Fri Oct 28, 2005 10:34 pm

I got a listing of 3235 new words to add to TWL98 from the latest version of the Official Scrabble players' Dictionary. So existing dlls are not going to be usable. A new one must be made. Thanks anyway, Paul. Your free stuff and ever-present readiness to help are much appreciated.

This: http://lloydsplace.com/newtwl98.txt

When processed thus:

Code: Select all

Dim wordin.s(250000)
count.l=0
ReadFile(0,"newtwl98.txt")
While Eof(0) = 0
  wordin(count)=ReadString()
  count+1
Wend
CloseFile(0)
count=0
OpenFile(0,"wordlist.bin")
WriteByte(0)
For i=0 To 171326
  WriteString(wordin(i))
  WriteByte(0)
Next i

CloseFile(0)

becomes this: http://lloydsplace.com/wordlist.bin

which gets included into this:

Code: Select all

; This procedure searches a binary word list made from a text file
; with all newlines replaced by binary zeros to delimit the words.
; The included word list must also begin with a binary zero.
; This version of the dll is written for PureBasic compatibility.

ProcedureDLL.l WordExists(wordtofind.s) 
  middleword.l
  wordtofind = UCase(wordtofind)
  wordchecked.s = ""
  found.l = 0
  lo = ?lexiconstart
  hi = ?lexiconend
  ccount=0
  While ccount <= 18               ; if it isn't found in 18 loops it isn't there
    middleword = lo + (hi-lo) >> 1 ; find the middle of the word list bytewise
    While PeekB(middleword-1)<>0   ; relocate to the beginning of the word
      middleword-1
    Wend
    wordchecked=PeekS(middleword)  ; check it against the search word
    If wordtofind=wordchecked
      found = 1
      ccount=18                    ; no more loops are required if search word is found 
    Else
      If wordtofind < wordchecked
        hi = middleword ;cut out this word and all words above leaving delimiting 0 at top
      Else
        lo = middleword + Len(wordchecked)-1 ;cut out this word and all below leaving delimiting 0 at bottom
      EndIf
    EndIf
    ccount+1
  Wend
  ProcedureReturn found
  DataSection 
    lexiconstart: 
    IncludeBinary "wordlist.bin"
    lexiconend:
  EndDataSection 
EndProcedure

which gets compiled into this: http://lloydsplace.com/lexicon_pb.dll

which in turn gets called by this:

Code: Select all

OpenLibrary(0,"lexicon_pb.dll")

OpenConsole()
PrintN("Starting...")
Dim words.s(171327)
ccount=0
ReadFile(0,"newtwl98.txt")
While Eof(0) = 0
  words(ccount)=ReadString()
  ccount+1
Wend
words(171327)="fragilistic" ;stick a bad one in there to test fail condition
CloseFile(0)
st1=GetTickCount_()
For i=0 To 171327
  If CallFunction(0,"WordExists",words(i)) <> 1
    PrintN(words(i) + " " + "Was not found")
  EndIf
Next i
st2=GetTickCount_()
result = (st2-st1)
PrintN("Validating 171327 words took "+Str(result) + " milliseconds.")
PrintN("Bye")
Input()
CloseConsole()

and it verifies every word in the list in around one second.

GedB · Post by **GedB** » Fri Oct 28, 2005 11:48 pm

Netmaestro,

That final version is looking rather nice.

I love watching code evolve like this. Different methods tried, lots of listings demonstrating various ideas. It's like poetry, don't you think?

netmaestro · Post by **netmaestro** » Fri Oct 28, 2005 11:58 pm

Fred Reinfeld wrote:
"Chess, like music, like love, has the power to make men happy."

I think programming has the same possibilities when you're using a finely-crafted tool like PureBasic. It really is a pleasing combination of art and science.

IdeasVacuum · Post by **IdeasVacuum** » Fri Oct 25, 2013 6:38 am

Adapted your excellent method netmaestro, works beautifully.

The code is nicely commented too, but there is one thing - how did you determine the limit of the number of loops? (18)

netmaestro · Post by **netmaestro** » Fri Oct 25, 2013 7:12 am

By the number of words in the list. A list of one million words can be exhaustively searched with 20 tests, half a million takes 19, and however many words are in that list, I forget but it's less than 1/4 million, takes up to 18. Not found in 18 tests, it isn't there.

netmaestro · Post by **netmaestro** » Fri Oct 25, 2013 7:33 am

Updated the links in that posting, the files are available now. networkmaestro.com is now lloydsplace.com since a couple of years.

PB · Post by PB » Fri Oct 25, 2013 11:29 am

> I would like to include the text file in my exe and search
> it quickly to see if a word exists in it or not

Not my code, but does what you describe:

Code: Select all

text$=PeekS(?MyTextFile)
offset=FindString(text$,"support@purebasic.com")
Debug offset

End

DataSection
  MyTextFile:
  IncludeBinary #PB_Compiler_Home+"SDK\Readme.txt"
  Data.c 0
EndDataSection

rsts · Post by **rsts** » Fri Oct 25, 2013 12:05 pm

Good answer (to an 8 year old question. Wonder if he's still waiting?)

PB · Post by PB » Fri Oct 25, 2013 12:22 pm

[Edit] My post was wrong, and seems to have upset someone.

rsts · Post by **rsts** » Fri Oct 25, 2013 12:38 pm

Once again, PB has to be correct.

Please show me where he asked the question anywhere besides the eight year old post.

PureBasic Forums - English

Use a binaryincluded text file

Re: Use a binaryincluded text file

Re: Use a binaryincluded text file

Re: Use a binaryincluded text file

Re: Use a binaryincluded text file

Re: Use a binaryincluded text file

Re: Use a binaryincluded text file

Re: Use a binaryincluded text file