Tesseract question or OCR recognition

Post by loulou2522 »

Hello,
I would like to recognise text inside a PDF using Tesseract or another OCR engine, but I don't know how to go about it. Does anyone have a ready-made solution that I could adapt? I would also like to know where to get Tesseract and how to install it.
Thanks in advance

Post by Cyllceaux »

Hey there,

I use Tesseract and ocrmypdf. I installed them with winget. You need Python, too.

Installation and examples: https://ocrmypdf.readthedocs.io/en/late ... on-windows


This is the command line I use:

Code:

ocrmypdf --output-type pdf -l deu+eng --redo-ocr --optimize 1 <input> <output>

I have a lot of PDFs to process, so I wrote a PureBasic program that generates the batch files.

Code:

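EnableExplicit

; ocrmypdf command prefixes:
; #opt forces OCR on every page and applies optimization level 1, #norm uses the defaults
#opt="ocrmypdf --output-type pdf -l deu+eng --force-ocr --optimize 1 "
#norm="ocrmypdf --output-type pdf -l deu+eng "

; Folder that contains the PDFs
#pfad="OCR"

Define name.s

; Create both batch files; stop if either one cannot be created
If CreateFile(1,"ocr_norm.bat")=0 Or CreateFile(2,"ocr_opt.bat")=0
  End
EndIf

; Each batch file first changes into the PDF folder
WriteStringN(1,"cd "+#pfad)
WriteStringN(2,"cd "+#pfad)

; Write one ocrmypdf call per PDF; input and output are the same file
If ExamineDirectory(0,#pfad,"*.pdf")
  While NextDirectoryEntry(0)
    ; Quote the file name so that names containing spaces still work
    name=#DQUOTE$+DirectoryEntryName(0)+#DQUOTE$
    WriteStringN(1,#norm+name+" "+name)
    WriteStringN(2,#opt+name+" "+name)
  Wend
  FinishDirectory(0)
EndIf

CloseFile(1)
CloseFile(2)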
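
The same idea can also be sketched in Python, which ocrmypdf needs anyway. This is only a sketch under assumptions: the names `build_command` and `run_folder` are made up for illustration, the flags are copied from the two command prefixes above, and each PDF is overwritten in place just as in the batch files. Passing the command as a list avoids quoting problems with file names that contain spaces.

```python
import subprocess
from pathlib import Path

# Flags copied from the #norm and #opt prefixes in the post above.
NORM = ["ocrmypdf", "--output-type", "pdf", "-l", "deu+eng"]
OPT = NORM + ["--force-ocr", "--optimize", "1"]


def build_command(pdf, optimize=False):
    """Return the ocrmypdf argument list for one PDF (input == output)."""
    base = OPT if optimize else NORM
    return base + [str(pdf), str(pdf)]


def run_folder(folder="OCR", optimize=False):
    """Run ocrmypdf on every *.pdf in `folder`; return the files that failed."""
    failed = []
    for pdf in sorted(Path(folder).glob("*.pdf")):
        # A non-zero exit code means ocrmypdf reported an error for this file.
        result = subprocess.run(build_command(pdf, optimize))
        if result.returncode != 0:
            failed.append(pdf)
    return failed
```

Calling `run_folder("OCR")` would process the folder directly instead of writing a batch file first; `run_folder("OCR", optimize=True)` corresponds to ocr_opt.bat.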

Post by loulou2522 »

I installed everything successfully, including the French language data.
I am working on tables and use the following command to convert them:
ocrmypdf -l fra --output-type pdf --pages 38-41 --oversample 300 --force-ocr output.pdf c56.pdf
But I have trouble with certain characters. For example,
Exportations et livraisons intra communautaires
is recognized as
livraisons fntracomenunautaires
and the output of the conversion is:
Scanning contents ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 67/67 0:00:00
Start processing 16 pages concurrently ocr.py:98
38 page already has text! - rasterizing text and running OCR anyway _pipeline.py:305
39 page already has text! - rasterizing text and running OCR anyway _pipeline.py:305
40 page already has text! - rasterizing text and running OCR anyway _pipeline.py:305
41 page already has text! - rasterizing text and running OCR anyway _pipeline.py:305
OCR ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 67/67 0:00:00
Postprocessing... ocr.py:147
Recompressing JPEGs ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0% 0/0 -:--:--
Deflating JPEGs ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0% 0/0 -:--:--
JBIG2 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0% 0/0 -:--:--
Image optimization did not improve the file - optimizations will not be used optimize.py:717
Image optimization ratio: 1.00 savings: -0.2% _pipeline.py:904
Total file size ratio: 1.03 savings: 2.5%
Is there something wrong?

Post by infratec »

If the text in the PDF file is not handwritten, simply extract the text; no OCR is needed:

https://www.purebasic.fr/english/viewto ... 10#p553210

Post by Marc56us »

"page already has text! - rasterizing text and running OCR anyway"
I have the impression that your PDF already contains text stored as text and not as an image.
In that case, just use a simple text extractor; there is no need for OCR.
Try pdftotext.

"I work on table"
pdftotext.exe -table <file.pdf>
(also try -layout instead of -table)
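
A quick way to test this before running OCR is to look at what pdftotext returns. The sketch below is an illustration only: the function names and the `min_chars` threshold are assumptions, and it uses the -layout, -enc, -f and -l options with "-" as the output file so that the text is written to stdout. A page range that yields nothing but whitespace and form feeds most likely has no text layer.

```python
import subprocess


def has_text_layer(extracted, min_chars=20):
    """True if pdftotext output has at least `min_chars` non-space characters.

    The threshold is an arbitrary assumption; form feeds between pages
    count as whitespace and are ignored.
    """
    visible = "".join(extracted.split())
    return len(visible) >= min_chars


def extract_text(pdf_path, first_page=None, last_page=None):
    """Run pdftotext on an optional page range and return its stdout."""
    cmd = ["pdftotext", "-layout", "-enc", "UTF-8"]
    if first_page is not None:
        cmd += ["-f", str(first_page)]
    if last_page is not None:
        cmd += ["-l", str(last_page)]
    cmd += [str(pdf_path), "-"]  # "-" sends the text to stdout
    result = subprocess.run(cmd, capture_output=True, check=True,
                            encoding="utf-8", errors="replace")
    return result.stdout
```

For the pages mentioned earlier in the thread, `has_text_layer(extract_text("c56.pdf", 38, 41))` would show whether those pages carry real text or only images.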

Post by loulou2522 »

I also tried pdftotext, but I am sure that my PDF contains no text: the table is embedded in the PDF as an image, and with pdfimages I was able to extract it as a JPG image.

Post by Cyllceaux »

I wasn't sure what the task was. I use ocrmypdf to make PDFs searchable, so that Windows and other programs can index them. And since ocrmypdf can also turn images embedded in PDFs into text with the help of Tesseract, I use it extensively.

Post by KosterNET »

I created a tool that puts a text layer over the PDFs from our scanner. It does the following:

1. Search for PDFs in a folder (and its subfolders)
2. Check whether the PDF already has a text layer, using pdftotext
3. If not, create a multipage TIFF file from the PDF using gswin32c.exe
4. Create a text-only PDF file from the TIFF file using Tesseract (tesseract.exe uit.tiff text -l nld -c textonly_pdf=1 pdf)
5. Merge the original PDF file and the text-only PDF file into a new file using pdftk.exe

The reason I do not have Tesseract create the OCR-ed PDF directly is that the resulting PDF gets quite large. For me, Tesseract also works better on black-and-white TIFF files.
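
Steps 3 to 5 could be sketched roughly as follows. This is not the actual tool: the function names are made up, the Ghostscript options (the tiffg4 device for black-and-white output at 300 dpi) and the pdftk multibackground operation are my assumptions about how such a pipeline can be built, and only the Tesseract call is taken from step 4 above.

```python
import subprocess


def pipeline_commands(pdf_in, pdf_out, lang="nld", dpi=300):
    """Return the command lines for steps 3 to 5 as argument lists."""
    tiff = "uit.tiff"  # intermediate multipage TIFF, name taken from step 4
    base = "text"      # tesseract appends ".pdf" to this output base
    return [
        # Step 3 (assumed options): render all pages into one
        # black-and-white (CCITT G4) multipage TIFF.
        ["gswin32c.exe", "-dNOPAUSE", "-dBATCH", "-sDEVICE=tiffg4",
         f"-r{dpi}", f"-sOutputFile={tiff}", pdf_in],
        # Step 4: text-only PDF, command as given in the list above.
        ["tesseract.exe", tiff, base, "-l", lang,
         "-c", "textonly_pdf=1", "pdf"],
        # Step 5 (assumed operation): put each original page behind the
        # matching page of the text-only PDF.
        ["pdftk.exe", base + ".pdf", "multibackground", pdf_in,
         "output", pdf_out],
    ]


def add_text_layer(pdf_in, pdf_out):
    """Run the three steps in order and stop at the first failure."""
    for cmd in pipeline_commands(pdf_in, pdf_out):
        subprocess.run(cmd, check=True)
```

Because the original pages are reused as the background, the images are not re-encoded, which fits the goal of keeping the merged file small.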

Good luck with your journey :)

Post by loulou2522 »

Thanks, KosterNET.
Could you send me your tool? It could be very interesting for me, because the results I get with ocrmypdf are not good.

Post by KosterNET »

I will send you a link via PM. It is internally developed and I have not done extensive testing / do not have time for long support hours ;-)