Searching contents of large number of PDF files

DevilDog · Post by **DevilDog** » Thu Jun 27, 2024 9:25 pm

Hi,
I have a C# .NET application that uses the Lucene library to index and search a large number of PDF files (over 50k) in a second or two. I am interested in converting it to PB or writing something in PB that uses another third-party library to do the same.

Is anyone aware of a library that can do this without requiring me to install the Java runtime etc. (Lucene is written in Java) or the .NET runtime? The ideal solution would be a third-party set of DLLs that PB can load and use to accomplish the goal.

Thanks

Marc56us · Post by **Marc56us** » Fri Jun 28, 2024 9:36 am

Hi,

I doubt we'll be able to index 50,000 files in a second or two on our personal machines: just opening, reading, and closing the file handles will take longer than that, even on an SSD.
In my opinion, that is the speed of searching files that have already been indexed (which Everything, for example, does very well).

There are DLLs for extracting text from PDFs, but I haven't tested them. There are examples on this forum (use the search engine), but many of them have broken links.

To search for text in a PDF, you can also use the xpdf library command line tools, for example.

Code:

; Extract text from a PDF (works only if the text is stored as text, not as a scanned image that would need OCR)
; Using XpdfReader Copyright 2024 Glyph & Cog, LLC 
; https://www.xpdfreader.com/download.html

EnableExplicit

; pdftotext.exe and test.pdf must be in the same directory
; Pass that directory as the first program parameter
SetCurrentDirectory(ProgramParameter(0))

; "-f 1 -l 1" limits extraction to page 1; the final "-" sends the text to stdout
Define Exec = RunProgram("pdftotext.exe", 
                         "-f 1 -l 1 test.pdf -", 
                         "", 
                         #PB_Program_Open | #PB_Program_Read)
If Not Exec : Debug "error" : End : EndIf

Define Output$
While ProgramRunning(Exec)
    If AvailableProgramOutput(Exec)
        Output$ + ReadProgramString(Exec) + Chr(13)
    EndIf
Wend

Debug Output$

CloseProgram(Exec) 
End

Axolotl · Post by **Axolotl** » Fri Jun 28, 2024 12:06 pm

I use the same tool as Marc56us from time to time, with different parameters.

Code:

#PDFToolParams_TablePages$     = "-table -clip -nopgbrk"
#PDFToolParams_RawOnePage$     = "-f 1 -l 1 -raw"
#PDFToolParams_RawAllPages$    = "-raw"
;#PDFToolParams_None$           = ""
; .....
Parameter$ = #PDFToolParams_RawAllPages$ + " " + #DQUOTE$ + PDF_Filename$ + #DQUOTE$ + " -"
; The trailing "-" is important: it sends the output to the console (captured by ReadProgramString())
; .....
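
The two snippets above could be combined into one reusable procedure and run over a whole folder. This is only a sketch under assumptions: `ExtractPdfText()` is a hypothetical helper name, pdftotext.exe is assumed to be reachable from the current directory or the PATH, and the PDFs are assumed to sit in the current directory.

```purebasic
EnableExplicit

; Hypothetical helper (not from the posts above): runs pdftotext on one file and
; returns the captured text, or "" if the tool could not be started.
Procedure.s ExtractPdfText(Tool$, PDF_Filename$, Options$ = "-raw")
  Protected Text$, Exec
  ; The trailing "-" makes pdftotext write to stdout instead of a .txt file.
  Protected Parameter$ = Options$ + " " + #DQUOTE$ + PDF_Filename$ + #DQUOTE$ + " -"

  Exec = RunProgram(Tool$, Parameter$, "", #PB_Program_Open | #PB_Program_Read | #PB_Program_Hide)
  If Exec = 0
    ProcedureReturn ""
  EndIf

  While ProgramRunning(Exec)
    If AvailableProgramOutput(Exec)
      Text$ + ReadProgramString(Exec) + #LF$
    Else
      Delay(1) ; do not spin at full CPU load while waiting for output
    EndIf
  Wend
  CloseProgram(Exec)
  ProcedureReturn Text$
EndProcedure

; Walk the current directory and report how much text each PDF yields.
Define Folder$ = GetCurrentDirectory()
Define Name$
Define Dir = ExamineDirectory(#PB_Any, Folder$, "*.pdf")
If Dir
  While NextDirectoryEntry(Dir)
    If DirectoryEntryType(Dir) = #PB_DirectoryEntry_File
      Name$ = DirectoryEntryName(Dir)
      Debug Name$ + ": " + Str(Len(ExtractPdfText("pdftotext.exe", Folder$ + Name$))) + " characters"
    EndIf
  Wend
  FinishDirectory(Dir)
EndIf
```

Starting one process per file is slow for tens of thousands of PDFs, so this would fit a one-time indexing pass rather than a live search.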

DarkDragon · Post by **DarkDragon** » Fri Jun 28, 2024 2:24 pm

Even if you have the full text of your PDFs, you still need to index it. Lucene is written in Java; are you actually using the Lucene.NET port? It's very powerful and fast, and I don't think anything comparable is available in PB yet when it comes to full-text indexing. You could look at how the PyLucene wrapper works and write a similar wrapper for PB.

DevilDog · Post by **DevilDog** » Fri Jun 28, 2024 2:56 pm

Thanks everyone for the replies!

Sorry I wasn't clear in my description. Searching the indexed files with Lucene is what takes a second or two (actually less). Indexing the 50k files does take quite some time, but it's a one-time operation.

I am using the .NET port of Lucene in my C# application and it works great. The issue is that to use it I have to install the .NET runtime, and my target system is quite small, so I don't want to take up that much space.

That's what led me to wonder if there was some third-party library I could use along with PB to develop a small-footprint solution.

morosh · Post by **morosh** » Fri Jun 28, 2024 4:06 pm

Have a look at https://www.purebasic.fr/english/viewto ... 10#p553210; you can convert PDF to text using PDFium.
HTH

DarkDragon · Post by **DarkDragon** » Sat Jun 29, 2024 6:55 am

The OP's main problem is indexing the text; pulling text out of a PDF is probably the smallest issue.

You could use n-grams and tries, for example. However, you will not reach the accuracy or flexibility of Lucene, and probably not its speed either.
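
To give an idea of the n-gram approach, here is a minimal sketch of a character trigram index held in PB maps. It is an illustration only: the names `AddDocument()` and `Search()` are made up for this example, it keeps every text in memory, and it has none of Lucene's ranking, stemming, or on-disk storage.

```purebasic
EnableExplicit

#N = 3 ; n-gram length (trigrams)

Structure Posting
  Map Docs.i()                 ; set of document IDs (as string keys) containing this n-gram
EndStructure

Global NewMap Index.Posting()  ; n-gram -> posting set
Global NewMap DocText.s()      ; document ID (as string) -> lower-cased text

; Add one document: register every n-gram of its text in the index.
Procedure AddDocument(DocId, Text$)
  Protected i, Gram$
  Text$ = LCase(Text$)
  DocText(Str(DocId)) = Text$
  For i = 1 To Len(Text$) - #N + 1
    Gram$ = Mid(Text$, i, #N)
    If FindMapElement(Index(), Gram$) = 0
      AddMapElement(Index(), Gram$)
    EndIf
    Index()\Docs(Str(DocId)) = #True
  Next
EndProcedure

; Fill Result() with the IDs of documents containing Query$ and return their count.
; Queries shorter than #N characters are not handled in this sketch.
Procedure Search(Query$, List Result.i())
  Protected i, Gram$, Key$
  Protected NewMap Count.i()
  ClearList(Result())
  Query$ = LCase(Query$)
  Protected Grams = Len(Query$) - #N + 1
  If Grams < 1 : ProcedureReturn 0 : EndIf

  For i = 1 To Grams
    Gram$ = Mid(Query$, i, #N)
    If FindMapElement(Index(), Gram$) = 0
      ProcedureReturn 0        ; no document contains this n-gram, so none can match
    EndIf
    ForEach Index()\Docs()
      Key$ = MapKey(Index()\Docs())
      Count(Key$) = Count(Key$) + 1
    Next
  Next

  ; Candidates contain every n-gram of the query; confirm them with a real substring test.
  ForEach Count()
    Key$ = MapKey(Count())
    If Count() = Grams And FindString(DocText(Key$), Query$)
      AddElement(Result())
      Result() = Val(Key$)
    EndIf
  Next
  ProcedureReturn ListSize(Result())
EndProcedure

AddDocument(1, "Lucene is a full-text search library")
AddDocument(2, "Xpdf extracts text from PDF files")
AddDocument(3, "Full text indexing with n-grams")

NewList Found.i()
Debug "Matches for 'text': " + Str(Search("text", Found()))
ForEach Found() : Debug Found() : Next
```

The index only narrows the search down to candidate documents; the final `FindString()` check is what removes false positives, which is why the texts are kept around here.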

You could host a Solr instance and use the PB HTTP/JSON commands to query Lucene through it. That would probably be the easiest way. Or try the retired Lucy project: https://lucy.apache.org/
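
A rough sketch of the Solr route with the PB HTTP and JSON commands follows. Everything specific in it is an assumption for illustration: a Solr instance on localhost at its default port 8983, a core named "pdfs", and documents with `id` and `content` fields. `BuildSolrUrl()` and `SearchSolr()` are hypothetical names. On PB versions before 6.00 you would probably also need to call InitNetwork() first.

```purebasic
EnableExplicit

; Assumed endpoint: local Solr, default port, hypothetical core "pdfs".
#SolrSelect$ = "http://localhost:8983/solr/pdfs/select"

; Build the select URL for a single search term in the assumed "content" field.
Procedure.s BuildSolrUrl(Term$)
  ProcedureReturn #SolrSelect$ + "?q=" + URLEncoder("content:" + Term$) + "&fl=id&rows=10&wt=json"
EndProcedure

; Fill Ids() with the "id" of each hit; returns the hit count, or -1 on failure.
Procedure SearchSolr(Term$, List Ids.s())
  Protected Request, Body$, Json, Root, Response, Docs, Id, i
  ClearList(Ids())

  Request = HTTPRequest(#PB_HTTP_Get, BuildSolrUrl(Term$))
  If Request = 0 : ProcedureReturn -1 : EndIf
  Body$ = HTTPInfo(Request, #PB_HTTP_Response)
  FinishHTTP(Request)

  Json = ParseJSON(#PB_Any, Body$)
  If Json = 0 : ProcedureReturn -1 : EndIf

  ; Expected shape: {"response": {"docs": [{"id": "..."}, ...]}}
  Root = JSONValue(Json)
  If JSONType(Root) = #PB_JSON_Object
    Response = GetJSONMember(Root, "response")
    If Response And JSONType(Response) = #PB_JSON_Object
      Docs = GetJSONMember(Response, "docs")
      If Docs And JSONType(Docs) = #PB_JSON_Array
        For i = 0 To JSONArraySize(Docs) - 1
          Id = GetJSONMember(GetJSONElement(Docs, i), "id")
          If Id And JSONType(Id) = #PB_JSON_String
            AddElement(Ids())
            Ids() = GetJSONString(Id)
          EndIf
        Next
      EndIf
    EndIf
  EndIf
  FreeJSON(Json)
  ProcedureReturn ListSize(Ids())
EndProcedure

NewList Hits.s()
If SearchSolr("lucene", Hits()) >= 0
  ForEach Hits() : Debug Hits() : Next
Else
  Debug "Solr request failed"
EndIf
```

Note that Solr itself runs on Java, so this only keeps the runtime off the target machine if the Solr instance is hosted somewhere else.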
