Request-PDFtoText
Posted: Wed Feb 12, 2025 6:49 pm
by WilliamL
I see that there is a way to convert a PDF to text using 'PDFtoText', and Axolotl posted code showing how to do it in this thread:
https://www.purebasic.fr/english/viewto ... =pdftotext. It appears that you have to download a 'tool' and then adapt the code to make it work.
I would like to ask someone to build me an app (executable) from that code that will run on a Mac. I would prefer that it convert to .txt rather than .rtf, but both would be even better.
Thanks in advance.
Re: Request-PDFtoText
Posted: Thu Feb 13, 2025 4:59 pm
by Axolotl
Hi WilliamL,
I cannot help you with that, because I have no experience with macOS.
Here are some pointers, though.
1. Download the MacOS version of the xpdf tools.
2. Remove the four Win-API calls (either delete them or wrap them in a compiler directive, as in the example below)
3. Rewrite the Procedure RunPDFToText() for MacOS (the following links can be of help)
Code: Select all
CompilerIf #PB_Compiler_OS = #PB_OS_Windows
CoInitialize_(#Null) ;' for autocomplete in stringgadgets
CompilerEndIf
runprogram command in Mac OS
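For pointer 3, a macOS version of RunPDFToText() might look roughly like the sketch below. This is not the code from the linked thread. The procedure signature, the constant #PDFTOTEXT_PATH$, the install location /usr/local/bin/pdftotext and the file names in the usage lines are all assumptions made for illustration. The -layout and -enc UTF-8 switches are standard xpdf pdftotext options. Adjust the path to wherever you copied the binary.

```purebasic
; Sketch only: a hypothetical macOS replacement for RunPDFToText().
; Assumes the xpdf "pdftotext" binary was copied to /usr/local/bin.
#PDFTOTEXT_PATH$ = "/usr/local/bin/pdftotext"   ; assumed install location

; Returns the exit code of pdftotext (0 = success), or -1 if it could not be started.
Procedure.i RunPDFToText(PdfFile$, TextFile$)
  Protected program, exitCode = -1
  ; Quote both paths so that file names containing spaces survive.
  Protected params$ = "-layout -enc UTF-8 " + #DQUOTE$ + PdfFile$ + #DQUOTE$ + " " + #DQUOTE$ + TextFile$ + #DQUOTE$

  program = RunProgram(#PDFTOTEXT_PATH$, params$, "", #PB_Program_Open)
  If program
    WaitProgram(program)                  ; block until pdftotext has finished
    exitCode = ProgramExitCode(program)
    CloseProgram(program)
  EndIf
  ProcedureReturn exitCode
EndProcedure

; Usage (hypothetical file names):
If RunPDFToText("/Users/me/statement.pdf", "/Users/me/statement.txt") = 0
  Debug "Conversion finished"
Else
  Debug "pdftotext failed or could not be started"
EndIf
```

Since pdftotext writes plain text, this covers the .txt request only. An .rtf output would need a separate conversion step.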
Re: Request-PDFtoText
Posted: Wed Feb 26, 2025 11:05 am
by loulou2522
Warning
PDFTOTEXT cannot convert all PDF files correctly. Conversion depends on how the PDF was created (by software or by scanning or photocopying).
For scanned or photocopied PDFs, it is preferable to run OCR software such as TESSERACT first. Even then, the page images usually need to be cleaned up beforehand with image processing software such as IMAGEMAGICK.
It should be pointed out, however, that even with these three programs you will not achieve perfect character recognition (expect a recognition rate of about 90 to 95%).
Re: Request-PDFtoText
Posted: Wed Feb 26, 2025 6:30 pm
by WilliamL
Thanks for the info, loulou2522.
I am scanning my PDF stock statements to accumulate the information on many stocks for my records. I have been using an OCR app (PDF Converter OCR.app), which worked almost perfectly and was the only one I found that would, but the statement now has some new formatting that gives even that app a hard time. The problem I'm running into is that spaces are missed, which causes the words to run together, and then I can't parse out the information. I seem to need an actual reading of the text, including the spaces, and I suspect an OCR won't give me that. I have a version (1.2) of PDFtoText.app that works perfectly with the new statements, and I was just looking for a later version (3.0?) that would be updated to work with newer systems. I can just use the version I have. Oddly, if I just select the text and copy and paste it into TextEdit.app, it works fine with my parsing routine, so I have another method of retrieving the info. I am not sure why this method works, since what is selected and pasted into TextEdit is much less (the same is true with PDFtoText).
Fortunately, I have two methods to get what I need. I was just hoping someone might have the newer version of PDFtoText (3.0?), since the older version worked OK.
This is not a pressing problem for me right now. Thanks for your input.