Tesseract question or OCR recognition
-
- Enthusiast
- Posts: 552
- Joined: Tue Oct 14, 2014 12:09 pm
Tesseract question or OCR recognition
Hello,
I would like to recognise text inside a PDF using Tesseract or another OCR engine, but I don't know how to go about it. Does anyone have a pre-written solution that I could adapt? I would also like to know where to get Tesseract and how to install it.
Thanks in advance
Re: Tesseract question or OCR recognition
Hey there,
I use Tesseract and ocrmypdf. I installed them with winget. You need Python, too.
Installation and examples: https://ocrmypdf.readthedocs.io/en/late ... on-windows
This is my command line for doing this:
Code:
ocrmypdf --output-type pdf -l deu+eng --redo-ocr --optimize 1 <input> <output>
I have a lot of PDFs to process, so I wrote a small PureBasic program that creates a batch file:
Code:
EnableExplicit
; Command prefixes; the input and output file names are appended below
#opt="ocrmypdf --output-type pdf -l deu+eng --force-ocr --optimize 1 "
#norm="ocrmypdf --output-type pdf -l deu+eng "
#pfad="OCR" ; folder that contains the PDFs
Define name.s
; One batch file per variant; both change into the PDF folder first
CreateFile(1,"ocr_norm.bat")
CreateFile(2,"ocr_opt.bat")
WriteStringN(1,"cd "+#pfad)
WriteStringN(2,"cd "+#pfad)
If ExamineDirectory(0,#pfad,"*.pdf")
  While NextDirectoryEntry(0)
    ; Quote the name so that files with spaces work; each PDF is overwritten in place
    name=#DQUOTE$+DirectoryEntryName(0)+#DQUOTE$
    WriteStringN(1,#norm+name+" "+name)
    WriteStringN(2,#opt+name+" "+name)
  Wend
  FinishDirectory(0)
EndIf
CloseFile(1)
CloseFile(2)
Re: Tesseract question or OCR recognition
I installed everything successfully. I also installed the French language.
I am working on tables and use the following command to convert them:
ocrmypdf -l fra --output-type pdf --pages 38-41 --oversample 300 --force-ocr output.pdf c56.pdf
But I have trouble with certain characters. For example,
Exportations et livraisons intra communautaires
is recognized as
livraisons fntracomenunautaires
and the output of the conversion is:
Scanning contents ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 67/67 0:00:00
Start processing 16 pages concurrently ocr.py:98
38 page already has text! - rasterizing text and running OCR anyway _pipeline.py:305
39 page already has text! - rasterizing text and running OCR anyway _pipeline.py:305
40 page already has text! - rasterizing text and running OCR anyway _pipeline.py:305
41 page already has text! - rasterizing text and running OCR anyway _pipeline.py:305
OCR ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 67/67 0:00:00
Postprocessing... ocr.py:147
Recompressing JPEGs ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0% 0/0 -:--:--
Deflating JPEGs ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0% 0/0 -:--:--
JBIG2 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0% 0/0 -:--:--
Image optimization did not improve the file - optimizations will not be used optimize.py:717
Image optimization ratio: 1.00 savings: -0.2% _pipeline.py:904
Total file size ratio: 1.03 savings: 2.5%
Is there something wrong?
Re: Tesseract question or OCR recognition
If the text in the PDF file is not handwritten, simply extract the text; no OCR is needed:
https://www.purebasic.fr/english/viewto ... 10#p553210
Re: Tesseract question or OCR recognition
"page already has text! - rasterizing text and running OCR anyway"
I have the impression that your PDF already contains text stored as text and not as an image?
In this case, just use a simple text extractor: no need for OCR.
Try pdftotext
"I work on table"
pdftotext.exe -table <file.pdf>
(also try -layout instead of -table)
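If you want to script this check instead of looking at the output by hand, here is a minimal sketch in Python. It assumes that pdftotext is on the PATH; the function names and the 20-character threshold are hypothetical choices for illustration, not part of pdftotext.

```python
import subprocess

def extract_text(pdf_path):
    """Return the text that pdftotext finds in the PDF (empty if there is none)."""
    # "-" as the output file makes pdftotext write to stdout instead of a .txt file
    result = subprocess.run(
        ["pdftotext", "-layout", pdf_path, "-"],
        capture_output=True, check=True,
    )
    return result.stdout.decode("utf-8", errors="replace")

def has_text_layer(text, min_chars=20):
    """Heuristic (assumed threshold): enough non-blank characters means real text."""
    # Whitespace includes the form feeds that pdftotext puts between pages
    visible = sum(1 for ch in text if not ch.isspace())
    return visible >= min_chars
```

Calling has_text_layer(extract_text("file.pdf")) then tells you whether OCR is worth running. A PDF can contain some real text and still hold a table as an image, so a positive result only says that there is text somewhere in the file.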

Re: Tesseract question or OCR recognition
I also used pdftotext, but I am sure that there is no text in my PDF, because the table is embedded in the PDF as an image, and with pdfimages I managed to extract the JPG images.
Re: Tesseract question or OCR recognition
I wasn't sure what the task was. I use ocrmypdf to make PDFs searchable, so that they can be indexed by Windows and other programs. And since ocrmypdf can also turn images embedded in PDFs into text with the help of Tesseract, I use it extensively.
Re: Tesseract question or OCR recognition
I created a tool that puts a text layer over the PDFs from our scanner. It does the following:
1. Search for PDFs in a folder (and its subfolders)
2. Check whether a PDF already has a text layer, using pdftotext
3. If not, create a multipage TIFF file from the PDF using gswin32c.exe
4. Create a text-only PDF file from the TIFF file using tesseract (tesseract.exe uit.tiff text -l nld -c textonly_pdf=1 pdf)
5. Merge the original PDF file and the text-only PDF file into a new file using pdftk.exe
The reason I do not have tesseract create the OCR-ed PDF directly is that the resulting PDF gets quite large. For me, Tesseract also works better on black-and-white TIFF files.
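For anyone who wants to script the same steps, here is a rough sketch of steps 3 to 5 in Python. It is not the internal tool itself: the Ghostscript options (tiffg4 device, 300 dpi), the file naming, and the use of pdftk's multistamp operation for the merge are assumptions for illustration, and build_commands and add_text_layer are made-up names. Steps 1 and 2 are left out.

```python
import subprocess
from pathlib import Path

def build_commands(pdf, workdir, lang="nld"):
    """Return the command lines for steps 3 to 5 for one PDF, without running them."""
    pdf = Path(pdf)
    workdir = Path(workdir)
    tiff = workdir / (pdf.stem + ".tiff")
    text_base = workdir / (pdf.stem + "_text")  # tesseract appends ".pdf" itself
    merged = workdir / (pdf.stem + "_ocr.pdf")
    return [
        # Step 3: render all pages into one black-and-white multipage TIFF (300 dpi assumed)
        ["gswin32c.exe", "-dNOPAUSE", "-dBATCH", "-sDEVICE=tiffg4", "-r300",
         "-sOutputFile=" + str(tiff), str(pdf)],
        # Step 4: OCR the TIFF into a PDF that contains only the recognized text
        ["tesseract.exe", str(tiff), str(text_base), "-l", lang,
         "-c", "textonly_pdf=1", "pdf"],
        # Step 5: stamp the text-only pages onto the original pages (assumed pdftk operation)
        ["pdftk.exe", str(pdf), "multistamp", str(text_base) + ".pdf",
         "output", str(merged)],
    ]

def add_text_layer(pdf, workdir, lang="nld"):
    """Run the three commands in order and stop at the first one that fails."""
    for cmd in build_commands(pdf, workdir, lang):
        subprocess.run(cmd, check=True)
```

Keeping the command lists separate from the code that runs them makes it easy to print and check them before anything touches the files. For the merge to line up, the text-only pages need the same page size as the originals, so the TIFF resolution matters.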
Good luck on your journey

Re: Tesseract question or OCR recognition
Thanks Kosternet
Could you send me your tool? It might be very interesting for me, because my results with ocrmypdf are not good.
Re: Tesseract question or OCR recognition
I will send you a link via PM. It was developed internally, and I have not tested it extensively and do not have time for long support hours.
