Request-PDFtoText

WilliamL
Addict
Posts: 1252
Joined: Mon Aug 04, 2008 10:56 pm
Location: Seattle, USA

Request-PDFtoText

Post by WilliamL »

I see that there is a way to convert a PDF to text using 'PDFtoText', and Axolotl posted code showing how to do it in this thread: https://www.purebasic.fr/english/viewto ... =pdftotext. It appears that you have to download a 'tool' and somehow finesse the code to make it work.

I would like to ask someone to build me an app (executable) from that code that will run on a Mac. I would prefer that it convert to .txt rather than .rtf, but supporting both would be even better.

Thanks in advance.
MacBook Pro-M1 (2021), Sequoia 15.4, PB 6.20
Axolotl
Addict
Posts: 802
Joined: Wed Dec 31, 2008 3:36 pm

Re: Request-PDFtoText

Post by Axolotl »

Hi WilliamL,
I cannot help you with that, because I have no experience with macOS. :oops:
But here are some pointers:
1. Download the macOS version of the xpdf tools.
2. Remove the four Win-API calls (by deleting them, or see the example below).
3. Rewrite the procedure RunPDFToText() for macOS (the following links may help).

Code: Select all

CompilerIf #PB_Compiler_OS = #PB_OS_Windows
  CoInitialize_(#Null) ;' for autocomplete in stringgadgets  
CompilerEndIf
runprogram command in Mac OS
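
For step 3, a macOS version of RunPDFToText() might look roughly like the untested sketch below. It assumes the xpdf pdftotext binary has been copied to /usr/local/bin (a hypothetical location, adjust it to wherever you unpacked the tools), that the two-parameter signature is acceptable (the procedure in the linked thread may take different arguments), and that RunProgram() passes the quoted paths through as single arguments. The file paths in the usage lines are placeholders.

Code: Select all

EnableExplicit

; Assumption: location of the xpdf command line tool (hypothetical path)
#PDFTOTEXT$ = "/usr/local/bin/pdftotext"

; Runs pdftotext on PdfFile$ and writes the text to TxtFile$.
; Returns the exit code of pdftotext (0 = no error), or -1 if it could not be started.
Procedure.i RunPDFToText(PdfFile$, TxtFile$)
  Protected Params$, Program.i, ExitCode.i = -1

  ; -layout asks pdftotext to keep the physical layout, which helps preserve column spacing
  Params$ = "-layout " + #DQUOTE$ + PdfFile$ + #DQUOTE$ + " " + #DQUOTE$ + TxtFile$ + #DQUOTE$

  Program = RunProgram(#PDFTOTEXT$, Params$, "", #PB_Program_Open | #PB_Program_Hide)
  If Program
    WaitProgram(Program)                 ; block until the tool has finished
    ExitCode = ProgramExitCode(Program)
    CloseProgram(Program)
  EndIf

  ProcedureReturn ExitCode
EndProcedure

; Usage (placeholder paths)
If RunPDFToText("/Users/me/statement.pdf", "/Users/me/statement.txt") = 0
  Debug "Conversion finished"
Else
  Debug "pdftotext failed or could not be started"
EndIf

This only covers plain .txt output; pdftotext does not write .rtf, so that would need a separate step.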
Just because it worked doesn't mean it works.
PureBasic 6.04 (x86) and <latest stable version and current alpha/beta> (x64) on Windows 11 Home. Now started with Linux (VM: Ubuntu 22.04).
loulou2522
Enthusiast
Posts: 542
Joined: Tue Oct 14, 2014 12:09 pm

Re: Request-PDFtoText

Post by loulou2522 »

Warning:
PDFTOTEXT cannot convert all PDF files correctly. The result depends on how the PDF was created: a PDF generated by software contains real text, whereas a scanned or photocopied PDF contains only page images, so there is no text to extract.
For scanned PDFs it is preferable to run OCR software such as TESSERACT first. Even then, the pages usually need to be prepared with image processing software such as IMAGEMAGICK.
It should be pointed out, however, that even with these three programs you will not achieve perfect character recognition (expect a recognition rate of 90 to 95%).
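
As a rough, untested illustration of that chain for a scanned PDF, the sketch below renders the first page with ImageMagick and passes the image to Tesseract. It assumes both tools are installed at the Homebrew paths shown (hypothetical for your machine), that ImageMagick can read PDFs (it relies on Ghostscript for that), and that /tmp/scan.pdf is a placeholder file. RunAndWait() is a helper made up for this example.

Code: Select all

EnableExplicit

; Assumptions: tool locations (hypothetical, adjust to your installation)
#MAGICK$    = "/opt/homebrew/bin/magick"
#TESSERACT$ = "/opt/homebrew/bin/tesseract"

; Starts a program, waits for it to end and returns its exit code (-1 if it could not be started)
Procedure.i RunAndWait(Exe$, Params$)
  Protected Program.i, ExitCode.i = -1
  Program = RunProgram(Exe$, Params$, "", #PB_Program_Open | #PB_Program_Hide)
  If Program
    WaitProgram(Program)
    ExitCode = ProgramExitCode(Program)
    CloseProgram(Program)
  EndIf
  ProcedureReturn ExitCode
EndProcedure

; 1. Render page 1 of the scan ([0] = first page) as a 300 dpi grayscale PNG
If RunAndWait(#MAGICK$, "-density 300 /tmp/scan.pdf[0] -colorspace Gray /tmp/page.png") = 0
  ; 2. OCR the image; tesseract appends .txt to the output base name
  If RunAndWait(#TESSERACT$, "/tmp/page.png /tmp/page") = 0
    Debug "OCR text written to /tmp/page.txt"
  EndIf
EndIf

The density and grayscale settings are only a starting point; how much preprocessing is needed depends on the scan.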
WilliamL
Addict
Posts: 1252
Joined: Mon Aug 04, 2008 10:56 pm
Location: Seattle, USA

Re: Request-PDFtoText

Post by WilliamL »

Thanks for the info, loulou2522.

I am scanning my PDF stock statements to accumulate the information on many stocks for my records. I have been using an OCR app (PDF Converter OCR.app), which worked almost perfectly (the only one I found that did), but the statement now has some new formatting that gives even that app a hard time. The problem I'm running into is that spaces are missed, which causes the words to run together, and then I can't parse out the information. I seem to need an actual text reading that includes the spaces, and I suspect OCR won't give me that.

I have a version (1.2) of PDFtoText.app that works perfectly with the new statements, and I was just looking for a later version (3.0?) that would be updated to work with the newer systems. I can just use the version I have. Oddly, if I simply select the text and copy and paste it into TextEdit.app, it works fine with my parsing routine, so I have another method of retrieving the info. I am not sure why this method works, since what gets pasted into TextEdit is much less than what was selected (the same is true with PDFtoText).

Fortunately, I have two methods to get what I need. I was just hoping someone might have the newer version of PDFtoText (3.0?), since the older version worked OK.

This is not a pressing problem for me right now. Thanks for your input.
MacBook Pro-M1 (2021), Sequoia 15.4, PB 6.20