I need to compare lots of text files and find similar sentences or word combinations in them, and then build a db from that. Is there something like this already done, or any direction anyone can give me on this?
Thanks
Compare and Find
WARNING: I don't know what I am doing! I just put stuff here and there and sometimes like magic it works. So please improve on my code and post your changes so I can learn more. TIA
Re: Compare and Find
vwidmer wrote: Is there something like this already done.........
Yes, it's called Google.

For ten years Caesar ruled with an iron hand, then with a wooden foot, and finally with a piece of string.
~ Spike Milligan
Re: Compare and Find
Have you looked into "Key Word In Context" (aka: KWIC) indexing?
If that's of interest, I'd point you to the Wikipedia article. I seem to recall some open-source/free utilities (think GNU and others, e.g. ptx in GNU coreutils) that produce permuted indexes and concordances. There's also an example in the awk book (ISBN: 978-0201079814).
Re: Compare and Find
vwidmer wrote: ....... or any direction any one can give me on this
Here's an alternate answer to your question:
This is an extremely complex problem, and there is no simple answer. Your initial efforts will be inadequate and inefficient, but they will give you something to build on.
The first thing you must do, as is the case with any programming problem, is establish the rules for processing the data. This will not be an easy thing to do for your situation, because you are asking the program to make subjective decisions. In other words, what determines if strings of words are similar? What words should not be considered when making decisions? How many consecutive words should be considered when looking for similarities?
The next problem to solve is how you structure the database in order to look for similarities. Presumably, the file name will be the major key, with one or more minor keys to represent the various word sequences.
And lastly, how do you go about actually searching for similarities between two files? If the database is not structured correctly, you could be looking at very long search times.
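To make the database point a bit more concrete, here is a rough sketch of one possible layout using Python's built-in sqlite3 module. The table name, column names, and helper functions are all made up for illustration. The file name acts as the major key, the word sequence (stored as a single string) as the minor key, and an index on the word sequence keeps the comparison between two files from scanning the whole table.

```python
import sqlite3


def create_db(path=":memory:"):
    """Open the database and create the (hypothetical) word_keys table."""
    conn = sqlite3.connect(path)
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS word_keys (
            file_name TEXT    NOT NULL,  -- major key
            position  INTEGER NOT NULL,  -- where the word sequence starts
            key_words TEXT    NOT NULL,  -- minor key: the word sequence
            PRIMARY KEY (file_name, position, key_words)
        )
        """
    )
    # Index the word sequence so matching does not need a full table scan.
    conn.execute(
        "CREATE INDEX IF NOT EXISTS idx_key_words ON word_keys (key_words)"
    )
    return conn


def add_keys(conn, file_name, keys):
    """Store (position, key_words) pairs for one file."""
    conn.executemany(
        "INSERT OR IGNORE INTO word_keys VALUES (?, ?, ?)",
        [(file_name, pos, key) for pos, key in keys],
    )
    conn.commit()


def find_matches(conn, file_a, file_b):
    """Return (key_words, position_in_a, position_in_b) for shared keys."""
    return conn.execute(
        """
        SELECT a.key_words, a.position, b.position
        FROM word_keys AS a
        JOIN word_keys AS b ON a.key_words = b.key_words
        WHERE a.file_name = ? AND b.file_name = ?
        ORDER BY a.position, b.position
        """,
        (file_a, file_b),
    ).fetchall()
```

How the word sequences are chosen is left open here; the sketch only shows where they would be stored and how two files could be compared once they are.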
Here's something to get you started:
In the 1960s, IBM used a simple indexing technique for listing all their technical manuals. It was called a KWIC index (Key Word in Context) and was printed on fan-fold paper. This was in the days before online enquiries. The one we used in Toronto was about 15cm (6") thick.

The concept is simple:
All non-trivial words in the document title were identified and used as a key word. Duplicates were ignored. This key word, plus the document reference number, made a unique key. There were as many keys for each document title as there were non-trivial words. The KWIC report was created by scanning all the document keys in sequence, and printing the full document title for each key encountered. To make the report easy to read, the key words were centred on the middle of the page. The effect of this was to list all "related/similar" documents together, based on the key words in the title.
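The KWIC concept described above can be sketched in a few lines of Python. This is only a rough illustration, not how IBM did it: the stop-word list, the function names, and the document reference numbers in the sample are all invented, and what counts as a "trivial" word is something you would have to decide for yourself.

```python
import re

# Hypothetical list of "trivial" words; adjust to taste.
STOP_WORDS = {"a", "an", "and", "for", "in", "of", "on", "the", "to", "with"}


def kwic_index(titles):
    """Return sorted (key_word, doc_ref, word_position) entries.

    `titles` maps a document reference number to its title.
    Duplicate key words within one title are ignored.
    """
    entries = []
    for doc_ref, title in titles.items():
        seen = set()
        for pos, word in enumerate(title.split()):
            key = re.sub(r"\W+", "", word).lower()
            if not key or key in STOP_WORDS or key in seen:
                continue
            seen.add(key)
            entries.append((key, doc_ref, pos))
    return sorted(entries)


def format_kwic(titles, width=30):
    """Format the index with every key word starting in the same column."""
    lines = []
    for _key, doc_ref, pos in kwic_index(titles):
        words = titles[doc_ref].split()
        left = " ".join(words[:pos])[-width:]
        right = " ".join(words[pos:])[:width]
        lines.append(f"{left:>{width}}  {right:<{width}}  {doc_ref}")
    return lines


# Made-up sample data.
sample = {
    "D1": "Guide to the Assembler",
    "D2": "Assembler Reference Guide",
}
for line in format_kwic(sample):
    print(line)
```

Because the entries are sorted by key word, titles that share a word end up on adjacent lines, which is the "related documents together" effect of the printed report.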
You could do something similar, but using blocks of consecutive words in place of the document title. The block reference could be its position in the text file. For example, a 100-word file contains 93 overlapping eight-word blocks (100 - 8 + 1). You might decide to look for matches based on 2, 3, or 4 key words per block; at one key of each size per block, that gives 93 × 3 = 279 keys for the 100-word file. You could then compare each key in the file to the keys in other files, looking for matches on the key word portion of the keys. And remember, the unique words in the key field should be in alphabetical order, so that the same words always produce the same key regardless of the order they appear in the text.
Edit:
IzzyB wrote: Have you looked into "Key Word In Context" (aka: KWIC) indexing?
You beat me to it!
