I see that there is a way to convert a PDF to text using 'PDFtoText', and Axolotl posted code showing how to do it in this thread: https://www.purebasic.fr/english/viewto ... =pdftotext. It appears that you have to download a 'tool' and then adapt the code to get it working.
I would like to ask whether someone could build me an app (executable) from that code that will run on a Mac. I would prefer that it convert to .txt rather than .rtf, but supporting both would be even better.
Thanks in advance.
			
			
									
									Request-PDFtoText
MacBook Pro-M1 (2021), Sequoia 15.4, PB 6.20
						Re: Request-PDFtoText
Hi WilliamL,
I cannot help you with that, because I have no experience with macOS.

Maybe some pointers:
1. Download the macOS version of the xpdf tools.
2. Remove the four Win-API calls (either by deleting them or by wrapping them as in the example below).
3. Rewrite the procedure RunPDFToText() for macOS (the following links may help).
runprogram command in Mac OS
			
			
									
Code: Select all
CompilerIf #PB_Compiler_OS = #PB_OS_Windows
  CoInitialize_(#Null) ;' for autocomplete in stringgadgets  
CompilerEndIf
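For pointer 3, a macOS version of RunPDFToText() might look roughly like the sketch below. It is not the procedure from the linked thread, only an illustration built on PureBasic's RunProgram(). The location of the pdftotext binary is an assumption (adjust it to wherever the xpdf tools were unpacked), and the procedure signature is made up for this example. The -layout switch asks pdftotext to keep the physical layout of the page.

```purebasic
; Sketch only, untested on macOS. The path below is a hypothetical install location.
EnableExplicit

#PDFTOTEXT_PATH$ = "/usr/local/bin/pdftotext" ; assumption: change to your xpdf tools folder

; Runs pdftotext on PdfFile$ and writes the result to TextFile$.
; Returns the exit code of pdftotext (0 = no error), or -1 if it could not be started.
Procedure.i RunPDFToText(PdfFile$, TextFile$)
  Protected program, exitCode = -1
  ; Quote both paths so that spaces in file names survive (assumes RunProgram honours the quotes)
  Protected args$ = "-layout " + #DQUOTE$ + PdfFile$ + #DQUOTE$ + " " + #DQUOTE$ + TextFile$ + #DQUOTE$

  program = RunProgram(#PDFTOTEXT_PATH$, args$, "", #PB_Program_Open | #PB_Program_Read)
  If program
    While ProgramRunning(program)
      If AvailableProgramOutput(program)
        Debug ReadProgramString(program) ; show anything pdftotext prints
      Else
        Delay(10)
      EndIf
    Wend
    exitCode = ProgramExitCode(program)
    CloseProgram(program)
  EndIf

  ProcedureReturn exitCode
EndProcedure

; Example call with made-up file names
If RunPDFToText("/Users/me/statement.pdf", "/Users/me/statement.txt") = 0
  Debug "Conversion finished"
Else
  Debug "pdftotext failed or could not be started"
EndIf
```

Because no Win-API calls are involved, the same procedure should in principle also compile on Windows and Linux once the path constant is changed.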
Just because it worked doesn't mean it works.
PureBasic 6.04 (x86) and <latest stable version and current alpha/beta> (x64) on Windows 11 Home. Now started with Linux (VM: Ubuntu 22.04).
loulou2522
Re: Request-PDFtoText
Warning
pdftotext cannot convert all PDF files correctly. The result depends on how the PDF was created (by software, or by scanning or photocopying).
For scanned documents it is preferable to run OCR software such as Tesseract first. Even then, the page images usually need to be cleaned up beforehand with image processing software such as ImageMagick.
It should be pointed out, however, that even with these three programs you will not achieve perfect character recognition; expect a recognition rate of about 90 to 95%.
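The clean-up-then-OCR chain described above could be driven from PureBasic along these lines. This is only a sketch under several assumptions: ImageMagick 7 (the magick command) and Tesseract are installed at the hypothetical Homebrew paths shown, the scanned page is already available as an image file, and the grayscale/normalize step is just one possible clean-up, not a recommended recipe. The helper names are invented for the example.

```purebasic
; Sketch only: tool paths and file names are assumptions, adjust them to your system.
EnableExplicit

#MAGICK_PATH$    = "/opt/homebrew/bin/magick"    ; hypothetical ImageMagick 7 location
#TESSERACT_PATH$ = "/opt/homebrew/bin/tesseract" ; hypothetical Tesseract location

; Wraps a path in double quotes so spaces are kept together
Procedure.s Quote(Text$)
  ProcedureReturn #DQUOTE$ + Text$ + #DQUOTE$
EndProcedure

; Starts a tool, waits for it to finish and returns its exit code (-1 if it could not be started)
Procedure.i RunAndWait(Tool$, Args$)
  Protected exitCode = -1
  Protected program = RunProgram(Tool$, Args$, "", #PB_Program_Open)
  If program
    WaitProgram(program)
    exitCode = ProgramExitCode(program)
    CloseProgram(program)
  EndIf
  ProcedureReturn exitCode
EndProcedure

; Cleans up one scanned page image, then lets Tesseract write OutBase$ + ".txt"
Procedure.i OcrPage(ScanImage$, OutBase$)
  Protected clean$ = OutBase$ + "_clean.png"

  ; Step 1: convert to grayscale and stretch the contrast with ImageMagick
  If RunAndWait(#MAGICK_PATH$, Quote(ScanImage$) + " -colorspace Gray -normalize " + Quote(clean$)) <> 0
    ProcedureReturn #False
  EndIf

  ; Step 2: run Tesseract on the cleaned image (it appends ".txt" to the output base name)
  ProcedureReturn Bool(RunAndWait(#TESSERACT_PATH$, Quote(clean$) + " " + Quote(OutBase$)) = 0)
EndProcedure

If OcrPage("/Users/me/scan_page1.png", "/Users/me/scan_page1")
  Debug "OCR text written to scan_page1.txt"
EndIf
```

For PDFs that were created by software and already contain a text layer, this detour should not be needed, since pdftotext can read the text directly.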
			
			
									
									
Re: Request-PDFtoText
Thanks for the info loulou2522.
I am scanning my PDF stock statements to accumulate information on many stocks for my records. I have been using an OCR app (PDF Converter OCR.app), which worked almost perfectly (it was the only one I found that would), but the statement now has some new formatting that gives even that app a hard time. The problem I'm running into is that spaces are missed, which causes the words to run together, and then I can't parse out the information. I seem to need an actual reading of the text, including the spaces, and I suspect an OCR won't give me that. I have a version (1.2) of PDFtoText.app that works perfectly with the new statements, and I was just looking for a later version (3.0?) that would be updated to work with newer systems. I can just use the version I have. Oddly, if I select the text, copy it, and paste it into TextEdit.app, it works fine with my parsing routine, so I have another method of retrieving the info. I am not sure why this method works since what is selected and what is pasted in TextEdit is much less (same with PDFtoText).
Fortunately, I have two methods to get what I need. I was just hoping someone might have the newer version of PDFtoText (3.0?), since the older version worked OK.
This is not a pressing problem for me right now. Thanks for your input.
			
			
									
						