
Tesseract question or OCR recognition

Posted: Fri Dec 15, 2023 2:06 pm
by loulou2522
Hello,
I would like to recognise text inside a PDF using Tesseract or another OCR engine, but I don't know how to go about it. Does anyone have a ready-made solution that I could adapt? I would also like to know where to get Tesseract and how to install it.
Thanks in advance

Re: Tesseract question or OCR recognition

Posted: Fri Dec 15, 2023 3:11 pm
by Cyllceaux
Hey there,

I use Tesseract and ocrmypdf. I installed them with winget. You need Python, too.

Installation and examples: https://ocrmypdf.readthedocs.io/en/late ... on-windows


This is the command line I use:

Code:

ocrmypdf --output-type pdf -l deu+eng --redo-ocr --optimize 1 <input> <output>
I have a lot of PDFs to process this way, so I wrote a PureBasic program that generates a batch file.

Code:

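EnableExplicit

; Command prefixes written in front of each file name
#opt="ocrmypdf --output-type pdf -l deu+eng --force-ocr --optimize 1 "
#norm="ocrmypdf --output-type pdf -l deu+eng "

; Folder that contains the PDFs (relative to the working directory)
#pfad="OCR"

Define name.s

If CreateFile(1,"ocr_norm.bat") And CreateFile(2,"ocr_opt.bat")

  ; Each batch file first changes into the PDF folder
  WriteStringN(1,"cd "+#pfad)
  WriteStringN(2,"cd "+#pfad)

  If ExamineDirectory(0,#pfad,"*.pdf")
    While NextDirectoryEntry(0)
      If DirectoryEntryType(0)=#PB_DirectoryEntry_File
        ; Quote the name so that files with spaces work.
        ; Input and output are the same file.
        name=Chr(34)+DirectoryEntryName(0)+Chr(34)
        WriteStringN(1,#norm+name+" "+name)
        WriteStringN(2,#opt+name+" "+name)
      EndIf
    Wend
    FinishDirectory(0)
  EndIf

  CloseFile(1)
  CloseFile(2)
EndIf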
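
For readers who would rather skip the batch file, roughly the same idea can be sketched in Python: build one ocrmypdf command per PDF and run it directly. This is only a sketch under assumptions. The helper names build_commands and run_all are made up for illustration, the options are copied from the command lines above, and ocrmypdf has to be on the PATH.

```python
import subprocess
from pathlib import Path

# Options copied from the command lines in the post above.
NORM = ["ocrmypdf", "--output-type", "pdf", "-l", "deu+eng"]
OPT = NORM + ["--force-ocr", "--optimize", "1"]


def build_commands(folder, base=NORM):
    """Return one argument list per PDF in `folder` (hypothetical helper).

    Input and output are the same path, as in the batch file, so each
    PDF is replaced by its OCRed version.
    """
    commands = []
    for pdf in sorted(Path(folder).glob("*.pdf")):
        if pdf.is_file():
            # Passing a list avoids quoting problems with spaces in names.
            commands.append(base + [str(pdf), str(pdf)])
    return commands


def run_all(folder, base=NORM):
    """Run ocrmypdf on every PDF and return the paths that failed."""
    failed = []
    for cmd in build_commands(folder, base):
        result = subprocess.run(cmd)
        if result.returncode != 0:
            failed.append(cmd[-1])
    return failed
```

Calling run_all("OCR", OPT) would correspond to ocr_opt.bat, and run_all("OCR") to ocr_norm.bat.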

Re: Tesseract question or OCR recognition

Posted: Sat Dec 16, 2023 4:24 pm
by loulou2522
I installed everything successfully, including the French language data.
I am working on tables and use the following command for the conversion:
ocrmypdf -l fra --output-type pdf --pages 38-41 --oversample 300 --force-ocr output.pdf c56.pdf
But some characters are not recognised correctly. For example,
Exportations et livraisons intra communautaires
is recognised as
livraisons fntracomenunautaires
and the output of the conversion is:
Scanning contents ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 67/67 0:00:00
Start processing 16 pages concurrently ocr.py:98
38 page already has text! - rasterizing text and running OCR anyway _pipeline.py:305
39 page already has text! - rasterizing text and running OCR anyway _pipeline.py:305
40 page already has text! - rasterizing text and running OCR anyway _pipeline.py:305
41 page already has text! - rasterizing text and running OCR anyway _pipeline.py:305
OCR ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 67/67 0:00:00
Postprocessing... ocr.py:147
Recompressing JPEGs ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0% 0/0 -:--:--
Deflating JPEGs ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0% 0/0 -:--:--
JBIG2 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0% 0/0 -:--:--
Image optimization did not improve the file - optimizations will not be used optimize.py:717
Image optimization ratio: 1.00 savings: -0.2% _pipeline.py:904
Total file size ratio: 1.03 savings: 2.5%
Is something wrong?

Re: Tesseract question or OCR recognition

Posted: Sat Dec 16, 2023 4:46 pm
by infratec
If the text in the PDF file is not handwritten, simply extract the text; no OCR is needed:

https://www.purebasic.fr/english/viewto ... 10#p553210

Re: Tesseract question or OCR recognition

Posted: Sat Dec 16, 2023 4:47 pm
by Marc56us
"page already has text! - rasterizing text and running OCR anyway"
It looks as if your PDF already contains text stored as text rather than as an image.
In that case, just use a simple text extractor; there is no need for OCR.
Try pdftotext.
"I work on table"
pdftotext.exe -table <file.pdf>
(also try -layout instead of -table)
;-)
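
The check suggested above can also be scripted. Here is a minimal sketch in Python, assuming pdftotext is on the PATH. The function names and the 20-character threshold are illustrative assumptions, not anything pdftotext defines.

```python
import subprocess


def looks_like_text(extracted, min_chars=20):
    """Heuristic: True if the extracted text has enough visible characters.

    `min_chars` is an arbitrary threshold (an assumption); scanned pages
    usually yield nothing, or only whitespace and form feeds.
    """
    visible = "".join(extracted.split())
    return len(visible) >= min_chars


def page_has_text(pdf_path, first_page, last_page):
    """Run pdftotext on a page range and test whether real text came out."""
    result = subprocess.run(
        ["pdftotext", "-f", str(first_page), "-l", str(last_page),
         "-layout", pdf_path, "-"],  # "-" sends the text to stdout
        capture_output=True,
        check=True,
    )
    return looks_like_text(result.stdout.decode("utf-8", errors="replace"))
```

For example, page_has_text("input.pdf", 38, 41) would test the page range used in the ocrmypdf command earlier in the thread. If it returns False, the pages are probably images and OCR is the right tool.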

Re: Tesseract question or OCR recognition

Posted: Sun Dec 17, 2023 9:21 am
by loulou2522
I also use pdftotext, but I am sure that there is no text in my PDF, because the table is embedded in the PDF as an image, and with pdfimages I managed to extract it as a JPG image.

Re: Tesseract question or OCR recognition

Posted: Mon Dec 18, 2023 8:57 am
by Cyllceaux
I wasn't sure what the task was. I use ocrmypdf to make PDFs searchable, so that they can be indexed by Windows and other programs. And since ocrmypdf can also turn images embedded in PDFs into text with the help of Tesseract, I use it extensively.

Re: Tesseract question or OCR recognition

Posted: Tue Dec 19, 2023 11:46 am
by KosterNET
I created a tool to put a text-layer over PDFs from our scanner. This does the following:

1. Search for PDFs in a folder (and its subfolders).
2. Check whether each PDF already has a text layer, using pdftotext.
3. If not, create a multipage TIFF file from the PDF using gswin32c.exe.
4. Create a text-only PDF file from the TIFF file using tesseract (tesseract.exe uit.tiff text -l nld -c textonly_pdf=1 pdf).
5. Merge the original PDF file and the text-only PDF file into a new file using pdftk.exe.

The reason I do not have Tesseract create the OCRed PDF directly is that the resulting PDF gets quite large. Also, for me Tesseract works better on black-and-white TIFF files.
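
Steps 3 to 5 could be chained roughly as follows in Python. This is a sketch, not the actual tool. Only the tesseract call is taken from the list above. The Ghostscript options (tiffg4 device, 300 dpi), the pdftk multibackground operation, the helper names and the example file names are all assumptions.

```python
import subprocess


def build_pipeline(original_pdf, merged_pdf, lang="nld"):
    """Return the command lines for steps 3 to 5 (hypothetical helper)."""
    tiff = "uit.tiff"
    text_base = "text"  # tesseract appends ".pdf" to this name itself
    return [
        # Step 3: render the PDF to a black-and-white multipage TIFF.
        # Assumption: tiffg4 at 300 dpi; the post does not give the options.
        ["gswin32c.exe", "-dNOPAUSE", "-dBATCH", "-sDEVICE=tiffg4",
         "-r300", "-sOutputFile=" + tiff, original_pdf],
        # Step 4: text-only PDF, using the call quoted in the list above.
        ["tesseract", tiff, text_base, "-l", lang,
         "-c", "textonly_pdf=1", "pdf"],
        # Step 5: put each original page behind the matching text page.
        # Assumption: the post does not say which pdftk operation is used.
        ["pdftk", text_base + ".pdf", "multibackground", original_pdf,
         "output", merged_pdf],
    ]


def run_pipeline(original_pdf, merged_pdf, lang="nld"):
    """Run the steps in order and stop at the first failure."""
    for cmd in build_pipeline(original_pdf, merged_pdf, lang):
        subprocess.run(cmd, check=True)
```

A call such as run_pipeline("scan.pdf", "scan_ocr.pdf") would leave scan.pdf untouched and write the searchable result to scan_ocr.pdf. The page sizes of the two PDFs have to match for the text to line up, so the rendering resolution may need adjusting.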

Good luck with your journey :)

Re: Tesseract question or OCR recognition

Posted: Wed Dec 20, 2023 9:26 am
by loulou2522
Thanks, KosterNET.
Could you send me your tool? It could be very interesting for me, because the results I get with ocrmypdf are not good.

Re: Tesseract question or OCR recognition

Posted: Thu Dec 21, 2023 10:13 am
by KosterNET
I will send you a link via PM. The tool was developed in-house, I have not tested it extensively, and I do not have time for long support hours ;-)