vwidmer wrote:....... or any direction any one can give me on this
Here's an alternate answer to your question:
This is an extremely complex problem, and there is no simple answer. Your initial efforts will be inadequate and inefficient, but they will give you something to build on.
The first thing you must do, as is the case with any programming problem, is establish the rules for processing the data. This will not be an easy thing to do for your situation, because you are asking the program to make subjective decisions. In other words, what determines if strings of words are similar? What words should not be considered when making decisions? How many consecutive words should be considered when looking for similarities?
The next problem to solve is how you structure the database in order to look for similarities. Presumably, the file name will be the major key, with one or more minor keys to represent the various word sequences.
And lastly, how do you go about actually searching for similarities between two files? If the database is not structured correctly, you could be looking at very long search times.
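To make the structure question a little more concrete, here is a minimal sketch, assuming Python and entirely made-up names, of the kind of key record described above: the file name as the major key, with the word sequence and its block position as minor keys. Keeping the list sorted on the word-sequence field lets a single pass pick out the matches, which is what keeps the search times down.

```python
# Minimal sketch (Python; every name here is illustrative, not a finished design).
# Each key record carries the file name as the major key, plus the sorted word
# sequence and its block position as minor keys.

keys = []  # each entry: (file_name, word_sequence, block_position)

def add_key(file_name, words, block_position):
    """Store one key for a block of words found in a file."""
    keys.append((file_name, tuple(sorted(words)), block_position))

def matching_blocks():
    """Yield pairs of entries from different files that share a word sequence."""
    by_sequence = sorted(keys, key=lambda entry: entry[1])
    for i, (file_a, seq_a, pos_a) in enumerate(by_sequence):
        for file_b, seq_b, pos_b in by_sequence[i + 1:]:
            if seq_b != seq_a:
                break  # list is sorted, so no later entry can match this sequence
            if file_b != file_a:
                yield (file_a, pos_a), (file_b, pos_b)
```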
Here's something to get you started:
In the 1960s, IBM used a simple indexing technique for listing all their technical manuals. It was called a KWIC index (Key Word in Context) and was printed on fan-fold paper. This was in the days before online enquiries. The one we used in Toronto was about 15cm (6") thick.
The concept is simple:
All non-trivial words in the document title were identified and used as a key word. Duplicates were ignored. This key word, plus the document reference number, made a unique key. There were as many keys for each document title as there were non-trivial words. The KWIC report was created by scanning all the document keys in sequence, and printing the full document title for each key encountered. To make the report easy to read, the key words were centred on the middle of the page. The effect of this was to list all "related/similar" documents together, based on the key words in the title.
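In case it helps to see that idea in code, here is a minimal sketch, again assuming Python; the stop list and the document numbers are invented for the example.

```python
# Minimal sketch (Python) of a KWIC-style index over document titles.
# The stop list and the reference numbers are invented for the example.

TRIVIAL = {"a", "an", "and", "the", "of", "for", "to", "in", "on"}

def kwic_entries(documents):
    """documents: dict of reference number -> title.
    Returns (key word, reference, title) tuples, sorted by key word."""
    entries = []
    for ref, title in documents.items():
        seen = set()
        for word in title.lower().split():
            if word in TRIVIAL or word in seen:   # skip trivial words and duplicates
                continue
            seen.add(word)
            entries.append((word, ref, title))    # key word + reference = unique key
    return sorted(entries)  # key-word order lists related titles together

docs = {"DOC-1": "Introduction to Virtual Storage",
        "DOC-2": "Virtual Storage Concepts and Facilities"}
for word, ref, title in kwic_entries(docs):
    print(f"{word:<12} {ref}   {title}")
```

Run it and the two titles end up listed next to each other under "storage" and "virtual", which is exactly the grouping effect the printed index gave.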
You could do something similar, but using blocks of consecutive words in place of the document title. The block reference could be its position in the text file. For example, a 100-word file would give you 93 eight-word blocks. You might then decide to look for matches based on 2, 3, or 4 key words per block, which would give you 279 keys (93 blocks x 3 key sizes) for the 100-word file. You could then compare each key in one file to the keys in the other files, looking for matches on the key-word portion of the keys. And remember, the unique words in the key field should be in alphabetical order.
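Here is a rough sketch of how generating those keys might look, again in Python; the stop list, and the rule of taking the first 2, 3, or 4 significant words in alphabetical order, are my own assumptions, since the post leaves that choice open.

```python
# Rough sketch (Python) of generating block keys. The stop list and the
# rule for picking which significant words go into a key are assumptions.

TRIVIAL = {"a", "an", "and", "the", "of", "for", "to", "in", "on", "is"}
BLOCK_SIZE = 8            # words per block
KEY_SIZES = (2, 3, 4)     # key words per key

def block_keys(words):
    """Return (key_words, block_position) pairs for one file's list of words."""
    entries = []
    for pos in range(len(words) - BLOCK_SIZE + 1):   # 93 blocks in a 100-word file
        block = words[pos:pos + BLOCK_SIZE]
        significant = sorted({w.lower() for w in block if w.lower() not in TRIVIAL})
        for size in KEY_SIZES:                        # up to 3 keys per block
            if len(significant) >= size:
                entries.append((tuple(significant[:size]), pos))
    return entries
```

Each (key_words, position) pair could then be stored against the file name in the kind of sorted key list sketched earlier; matches on the key-word portion between two files point you at the blocks worth comparing in detail.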
Edit:
IzzyB wrote:Have you looked into "Key Word In Context" (aka: KWIC) indexing?
You beat me to it!
