Searching the contents of a large number of PDF files

DevilDog
Enthusiast
Posts: 210
Joined: Thu Aug 04, 2005 9:32 pm
Location: Houston, Tx.

Post by DevilDog »

Hi,
I have a C# .NET application that uses the Lucene library to index and search a large number of PDF files (over 50k) in a second or two. I am interested in converting it to PB, or in writing something in PB that uses another third-party library to do the same.

Is anyone aware of a library that can do this without requiring me to install the Java runtime (Lucene is written in Java) or the .NET runtime? The ideal solution would be a set of third-party DLLs that PB can load and use to accomplish the goal.

Thanks
When all is said and done, more is said than done.
Marc56us
Addict
Posts: 1600
Joined: Sat Feb 08, 2014 3:26 pm

Post by Marc56us »

Hi,

I doubt we'll be able to index 50,000 files in a second or two on our personal machines: just opening, reading and closing that many files will take longer than that, even on an SSD.
I suspect that figure is the time needed to search files that have already been indexed (which Everything, for example, does very well).

There are DLLs for extracting text from PDFs, but I haven't tested them. There are examples on this forum (use the search engine), but many of them have broken links.

To search for text in a PDF, you can also use the Xpdf command-line tools, for example.

Code: Select all

; Extract PDF text (works when the text is not stored as an image, so no OCR is needed)
; Using XpdfReader Copyright 2024 Glyph & Cog, LLC 
; https://www.xpdfreader.com/download.html

EnableExplicit

; pdftotext.exe and test.pdf are in the same directory
; pass that working directory as the first program parameter
SetCurrentDirectory(ProgramParameter(0))

Define Exec = RunProgram("pdftotext.exe", 
                         "-f 1 -l 1 test.pdf -", 
                         "", 
                         #PB_Program_Open | #PB_Program_Read)
If Not Exec : Debug "error" : End : EndIf

Define Output$
While ProgramRunning(Exec)
    If AvailableProgramOutput(Exec)
        Output$ + ReadProgramString(Exec) + #CRLF$ ; ReadProgramString() drops the line ending, so add one back
    EndIf
Wend

Debug Output$

CloseProgram(Exec) 
End
:wink:
Axolotl
Addict
Posts: 842
Joined: Wed Dec 31, 2008 3:36 pm

Post by Axolotl »

I use the same tool as Marc56us from time to time, with different parameters.

Code: Select all

#PDFToolParams_TablePages$     = "-table -clip -nopgbrk"
#PDFToolParams_RawOnePage$     = "-f 1 -l 1 -raw"  
#PDFToolParams_RawAllPages$    = "-raw"  
;#PDFToolParams_None$           = ""  
; ..... 
Parameter$ = #PDFToolParams_RawAllPages$ + " " + #DQUOTE$ + PDF_Filename$ + #DQUOTE$ + " -" 
; the trailing "-" is important: it sends the text to stdout, where ReadProgramString() can capture it
; ..... 
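
Putting the two snippets together, a brute-force search over a folder could look roughly like the sketch below. It is untested against a real collection: the folder path, the search term and the ExtractPdfText() helper are made up for illustration, and pdftotext.exe is assumed to be in the current directory or on the PATH.

```purebasic
EnableExplicit

; Hypothetical helper: runs pdftotext on one file and returns the captured text ("" on failure).
Procedure.s ExtractPdfText(PdfFile$)
  Protected Text$, Exec
  Exec = RunProgram("pdftotext.exe", 
                    "-raw " + #DQUOTE$ + PdfFile$ + #DQUOTE$ + " -", 
                    "", 
                    #PB_Program_Open | #PB_Program_Read | #PB_Program_Hide)
  If Exec
    While ProgramRunning(Exec)
      If AvailableProgramOutput(Exec)
        Text$ + ReadProgramString(Exec) + #LF$
      Else
        Delay(1)                      ; do not spin at 100% CPU while waiting
      EndIf
    Wend
    CloseProgram(Exec)
  EndIf
  ProcedureReturn Text$
EndProcedure

Define Folder$ = "C:\pdfs\"           ; assumed folder, adjust as needed
Define Needle$ = "pump"               ; assumed search term

If ExamineDirectory(0, Folder$, "*.pdf")
  While NextDirectoryEntry(0)
    If DirectoryEntryType(0) = #PB_DirectoryEntry_File
      ; case-insensitive substring test on the extracted text
      If FindString(ExtractPdfText(Folder$ + DirectoryEntryName(0)), Needle$, 1, #PB_String_NoCase)
        Debug DirectoryEntryName(0)
      EndIf
    EndIf
  Wend
  FinishDirectory(0)
EndIf
```

Without an index this re-reads every PDF on every search, so on 50k files it will be nowhere near the speed of Lucene. It is probably only useful as the extraction step that feeds an index.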
Just because it worked doesn't mean it works.
PureBasic 6.04 (x86) and <latest stable version and current alpha/beta> (x64) on Windows 11 Home. Now started with Linux (VM: Ubuntu 22.04).
DarkDragon
Addict
Posts: 2345
Joined: Mon Jun 02, 2003 9:16 am
Location: Germany

Post by DarkDragon »

Even if you have the full text of your PDFs, you still need to index it. Lucene is written in Java; are you actually using the Lucene.NET port? It's very powerful and fast, and I don't think anything comparable is available in PB yet when it comes to full-text indexing. You could check how the PyLucene wrapper works and write a similar wrapper for PB.
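
To give an idea of what "indexing" means at its simplest, here is a minimal inverted-index sketch in PB. It is only an illustration, not what Lucene does internally: the tokenizer just splits on spaces, and the names Posting, IndexDocument() and ShowMatches() are invented for this example.

```purebasic
EnableExplicit

Structure Posting
  List DocIds.i()                     ; ids of the documents that contain the term
EndStructure

Global NewMap Index.Posting()         ; term -> posting list

; Adds every word of Text$ to the index under the given document id.
Procedure IndexDocument(DocId.i, Text$)
  Protected i, Count, Word$, NeedAdd
  Text$ = LCase(Text$)
  Count = CountString(Text$, " ") + 1
  For i = 1 To Count
    Word$ = Trim(StringField(Text$, i, " "))
    If Word$ <> ""
      If FindMapElement(Index(), Word$) = 0
        AddMapElement(Index(), Word$)
      EndIf
      ; documents are indexed one after another, so checking the last id avoids duplicates
      NeedAdd = #True
      If LastElement(Index()\DocIds())
        If Index()\DocIds() = DocId
          NeedAdd = #False
        EndIf
      EndIf
      If NeedAdd
        AddElement(Index()\DocIds())
        Index()\DocIds() = DocId
      EndIf
    EndIf
  Next
EndProcedure

; Prints the ids of all documents that contain Term$.
Procedure ShowMatches(Term$)
  If FindMapElement(Index(), LCase(Term$))
    ForEach Index()\DocIds()
      Debug "doc " + Str(Index()\DocIds())
    Next
  Else
    Debug "no match"
  EndIf
EndProcedure

IndexDocument(1, "invoice for pump repair")
IndexDocument(2, "pump maintenance manual")
ShowMatches("pump")                   ; lists documents 1 and 2
ShowMatches("invoice")                ; lists document 1
```

A lookup is then a single map access instead of a scan over all texts. Stemming, ranking, phrase queries and on-disk storage are the parts that would still be missing compared to Lucene.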
bye,
Daniel
DevilDog
Enthusiast
Posts: 210
Joined: Thu Aug 04, 2005 9:32 pm
Location: Houston, Tx.

Post by DevilDog »

Thanks everyone for the replies!

Sorry I wasn't clear in my description. Searching the indexed files with Lucene is what takes a second or two (actually less). Indexing the files (50k) does take quite some time, but it's a one-time thing.

I am using the .NET port of Lucene in my C# application and it works great. The issue is that to use it I have to install the .NET runtime, and my target system is quite small, so I don't want to take up that much space.

That's what led me to wonder if there was some third-party library I could use along with PB to develop a small-footprint solution.
morosh
Enthusiast
Posts: 329
Joined: Wed Aug 03, 2011 4:52 am
Location: Beirut, Lebanon

Post by morosh »

Have a look at https://www.purebasic.fr/english/viewto ... 10#p553210: you can convert PDF to text using PDFium.
HTH
PureBasic: Surprisingly simple, diabolically powerful
DarkDragon
Addict
Addict
Posts: 2345
Joined: Mon Jun 02, 2003 9:16 am
Location: Germany

Post by DarkDragon »

The OP's main concern is indexing the text; pulling text out of a PDF is probably the smallest issue.

You could use n-grams and tries, for example. However, you will not match the accuracy or flexibility of Lucene, and probably not its speed either.
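
As a rough illustration of the n-gram part, the sketch below collects the character trigrams of a string into a map. The procedure name ExtractTrigrams() is made up, and how you store the result per document is left open.

```purebasic
EnableExplicit

; Collects the character trigrams of Text$ into Grams(), counting how often each occurs.
Procedure ExtractTrigrams(Text$, Map Grams.i())
  Protected i, Gram$
  Text$ = LCase(Text$)
  For i = 1 To Len(Text$) - 2         ; the loop is skipped for strings shorter than 3 characters
    Gram$ = Mid(Text$, i, 3)
    Grams(Gram$) = Grams(Gram$) + 1   ; accessing a missing key creates it with value 0
  Next
EndProcedure

NewMap Grams.i()
ExtractTrigrams("Lucene", Grams())

; prints luc, uce, cen and ene with their counts (map order is not defined)
ForEach Grams()
  Debug MapKey(Grams()) + " x" + Str(Grams())
Next
```

For searching, you would map each trigram to the documents that contain it, split the query into trigrams the same way, and keep only the documents that appear under every query trigram. Those candidates still need a real substring check, since sharing all trigrams does not guarantee a match.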

You could host a Solr instance and use the PB HTTP/JSON commands to query Lucene. That would probably be the easiest way. Or try the retired Lucy project (a C port of Lucene): https://lucy.apache.org/
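
A query against Solr from PB might look something like this sketch. It assumes a Solr instance on localhost:8983 with a core named "pdfs" and a text field named "content"; all three are placeholders, so adjust them to your setup. It only reads the hit count from the standard JSON response.

```purebasic
EnableExplicit

; Assumed endpoint: host, port and core name ("pdfs") are placeholders.
#SolrSelect$ = "http://localhost:8983/solr/pdfs/select"

Define Url$, Body$, Request, Json, Root, Response, NumFound

; "content" is an assumed field name in the Solr schema
Url$ = #SolrSelect$ + "?wt=json&rows=10&q=" + URLEncoder("content:pump")

; On PureBasic versions before 6.00, call InitNetwork() first.
Request = HTTPRequest(#PB_HTTP_Get, Url$)
If Request = 0
  Debug "HTTP request failed"
  End
EndIf
Body$ = HTTPInfo(Request, #PB_HTTP_Response)
FinishHTTP(Request)

Json = ParseJSON(#PB_Any, Body$)
If Json
  Root = JSONValue(Json)
  If JSONType(Root) = #PB_JSON_Object
    Response = GetJSONMember(Root, "response")
    If Response
      If JSONType(Response) = #PB_JSON_Object
        NumFound = GetJSONMember(Response, "numFound")
        If NumFound
          Debug "hits: " + Str(GetJSONInteger(NumFound))
        EndIf
      EndIf
    EndIf
  EndIf
  FreeJSON(Json)
Else
  Debug "response was not valid JSON"
EndIf
```

The matching documents are in the "docs" array of the same "response" object. Note that this moves the Java dependency to whichever machine hosts Solr, so it only helps if the small target system can reach a server.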
bye,
Daniel