Tesseract question or OCR recognition

loulou2522 · Post by **loulou2522** » Fri Dec 15, 2023 2:06 pm

Hello,
I would like to recognise text inside a PDF using Tesseract or another OCR tool, but I don't know how to go about it. Does anyone have a pre-written solution that I could adapt? I would also like to know where to get Tesseract and how to install it.
Thanks in advance

Cyllceaux · Post by **Cyllceaux** » Fri Dec 15, 2023 3:11 pm

Hey there,

I use Tesseract and ocrmypdf. I installed them with winget. You need Python, too.

Installation and examples: https://ocrmypdf.readthedocs.io/en/late ... on-windows

This is the command line I use:

Code:

ocrmypdf --output-type pdf -l deu+eng --redo-ocr --optimize 1 <input> <output>

I have a lot of PDFs to process, so I wrote a PB program that generates a bat file.

Code:

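EnableExplicit

; Command prefixes: #opt forces OCR on every page, #norm uses the ocrmypdf defaults
#opt="ocrmypdf --output-type pdf -l deu+eng --force-ocr --optimize 1 "
#norm="ocrmypdf --output-type pdf -l deu+eng "

; Folder that contains the PDFs
#pfad="OCR"

Define name.s

If CreateFile(1,"ocr_norm.bat") And CreateFile(2,"ocr_opt.bat")

  WriteStringN(1,"cd "+#pfad)
  WriteStringN(2,"cd "+#pfad)

  If ExamineDirectory(0,#pfad,"*.pdf")
    While NextDirectoryEntry(0)
      If DirectoryEntryType(0)=#PB_DirectoryEntry_File
        ; Quote the name so files with spaces work; input and output are the same file (in-place)
        name=Chr(34)+DirectoryEntryName(0)+Chr(34)
        WriteStringN(1,#norm+name+" "+name)
        WriteStringN(2,#opt+name+" "+name)
      EndIf
    Wend
    FinishDirectory(0)
  EndIf

  CloseFile(1)
  CloseFile(2)
EndIf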
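
Since Python is installed anyway for ocrmypdf, the same batch job can also be run without an intermediate bat file. This is only a rough sketch, not what I actually use: the function names and the separate output folder are my own assumptions, and the options are copied from the command line above. It writes the results to a second folder so the originals are not overwritten.

```python
import subprocess
from pathlib import Path

# Options copied from the command line above
OPTIONS = ["--output-type", "pdf", "-l", "deu+eng", "--redo-ocr", "--optimize", "1"]

def build_commands(in_dir: Path, out_dir: Path):
    """Return one ocrmypdf argument list per PDF in in_dir, sorted by name."""
    return [["ocrmypdf", *OPTIONS, str(pdf), str(out_dir / pdf.name)]
            for pdf in sorted(in_dir.glob("*.pdf"))]

def run_all(in_dir: Path, out_dir: Path) -> None:
    """Run ocrmypdf for every PDF; a failure on one file does not stop the rest."""
    out_dir.mkdir(parents=True, exist_ok=True)
    for cmd in build_commands(in_dir, out_dir):
        # Passing a list avoids quoting problems with spaces in file names
        completed = subprocess.run(cmd)
        if completed.returncode != 0:
            print("ocrmypdf failed for", cmd[-2])
```

Calling run_all(Path("OCR"), Path("OCR_out")) would process everything in the OCR folder.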

loulou2522 · Post by **loulou2522** » Sat Dec 16, 2023 4:24 pm

I installed everything successfully, including the French language data.
I am working on a table and use the following command to convert it:

ocrmypdf -l fra --output-type pdf --pages 38-41 --oversample 300 --force-ocr output.pdf c56.pdf

But I have trouble with certain characters.
For example:

Exportations et livraisons intra communautaires

is recognized as

livraisons fntracomenunautaires

and the output of the conversion is:

Scanning contents ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 67/67 0:00:00
Start processing 16 pages concurrently ocr.py:98
38 page already has text! - rasterizing text and running OCR anyway _pipeline.py:305
39 page already has text! - rasterizing text and running OCR anyway _pipeline.py:305
40 page already has text! - rasterizing text and running OCR anyway _pipeline.py:305
41 page already has text! - rasterizing text and running OCR anyway _pipeline.py:305
OCR ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 67/67 0:00:00
Postprocessing... ocr.py:147
Recompressing JPEGs ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0% 0/0 -:--:--
Deflating JPEGs ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0% 0/0 -:--:--
JBIG2 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0% 0/0 -:--:--
Image optimization did not improve the file - optimizations will not be used optimize.py:717
Image optimization ratio: 1.00 savings: -0.2% _pipeline.py:904
Total file size ratio: 1.03 savings: 2.5%

Is there something wrong?

infratec · Post by **infratec** » Sat Dec 16, 2023 4:46 pm

If the text in the PDF file is not handwritten, simply extract the text; no OCR is needed:

https://www.purebasic.fr/english/viewto ... 10#p553210

Marc56us · Post by **Marc56us** » Sat Dec 16, 2023 4:47 pm

page already has text! - rasterizing text and running OCR anyway

I have the impression that your PDF already contains text stored as text and not as an image.
In that case, just use a simple text extractor; there is no need for OCR.
Try pdftotext.

I work on table

pdftotext.exe -table <file.pdf>
(try also -layout instead of -table)
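
To find out whether those pages really contain text before deciding between OCR and plain extraction, you can run pdftotext on the page range and look at how much comes back. A minimal Python sketch (Python is already there for ocrmypdf), assuming pdftotext is on the PATH. The function names and the 20-character threshold are made up for illustration, and the right threshold depends on your documents.

```python
import subprocess

def pdftotext_command(pdf_path: str, first_page: int, last_page: int):
    """Build a pdftotext call for a page range; the trailing '-' sends the text to stdout."""
    return ["pdftotext", "-f", str(first_page), "-l", str(last_page),
            "-layout", pdf_path, "-"]

def looks_like_text_layer(extracted: str, min_chars: int = 20) -> bool:
    """True if the extracted text has at least min_chars non-whitespace characters.

    min_chars is an arbitrary threshold (an assumption, tune it for your files).
    """
    visible = "".join(extracted.split())
    return len(visible) >= min_chars

def page_range_has_text(pdf_path: str, first_page: int, last_page: int) -> bool:
    """Run pdftotext and report whether the page range seems to carry real text."""
    result = subprocess.run(pdftotext_command(pdf_path, first_page, last_page),
                            capture_output=True, check=True)
    return looks_like_text_layer(result.stdout.decode("utf-8", errors="replace"))
```

For the pages from the log, page_range_has_text("c56.pdf", 38, 41) would be the call. If it returns False, the pages are most likely images and OCR is the way to go.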

loulou2522 · Post by **loulou2522** » Sun Dec 17, 2023 9:21 am

I also used pdftotext, but I am sure there is no text in my PDF, because the table is embedded in the PDF as an image; with pdfimages I managed to extract it as a JPG image.

Cyllceaux · Post by **Cyllceaux** » Mon Dec 18, 2023 8:57 am

I wasn't sure what the task was. I use ocrmypdf to make PDFs searchable, so that they can be indexed by Windows and other programs. And since ocrmypdf can also turn images embedded in PDFs into text with the help of Tesseract, I use it extensively.

KosterNET · Post by **KosterNET** » Tue Dec 19, 2023 11:46 am

I created a tool to put a text-layer over PDFs from our scanner. This does the following:

1. Search for PDFs in a folder (and subfolders)
2. Check if it has a text-layer already using pdftotext
3. If not, create a multipage TIFF file of the PDF using gswin32c.exe
4. Create a text-only PDF file from the TIFF file using tesseract (tesseract.exe uit.tiff text -l nld -c textonly_pdf=1 pdf)
5. Merge the original PDF-file and the text-only PDF-file into a new file using pdftk.exe

The reason I do not have Tesseract create the OCR-ed PDF directly is that the resulting PDF gets quite large. In my experience Tesseract also works better on black-and-white TIFF files.
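
Steps 3 to 5 can be sketched roughly as follows. This is not the internal tool itself, just a minimal Python illustration that builds the three command lines. The Ghostscript device (tiffg4), the 300 dpi resolution, the pdftk multibackground operation, and all file and function names are my assumptions here; only the tesseract arguments are taken from step 4.

```python
import subprocess
from pathlib import Path

def build_pipeline(pdf: Path, workdir: Path, lang: str = "nld", dpi: int = 300):
    """Return the command lines for steps 3-5 as argument lists."""
    tiff = workdir / (pdf.stem + ".tiff")
    text_base = workdir / (pdf.stem + "_text")   # tesseract appends ".pdf" itself
    text_pdf = Path(str(text_base) + ".pdf")
    result = workdir / (pdf.stem + "_ocr.pdf")

    # Step 3: render all pages into one black-and-white multipage TIFF (assumed device/dpi)
    ghostscript = ["gswin32c.exe", "-dNOPAUSE", "-dBATCH", "-sDEVICE=tiffg4",
                   f"-r{dpi}", f"-sOutputFile={tiff}", str(pdf)]
    # Step 4: text-only PDF, i.e. the recognised text without the page image
    tesseract = ["tesseract.exe", str(tiff), str(text_base),
                 "-l", lang, "-c", "textonly_pdf=1", "pdf"]
    # Step 5: combine the original pages with the text pages (assumed pdftk operation)
    merge = ["pdftk.exe", str(pdf), "multibackground", str(text_pdf),
             "output", str(result)]
    return [ghostscript, tesseract, merge]

def run_pipeline(pdf: Path, workdir: Path) -> None:
    """Run the three steps in order and stop at the first failure."""
    workdir.mkdir(parents=True, exist_ok=True)
    for cmd in build_pipeline(pdf, workdir):
        subprocess.run(cmd, check=True)
```

Because the scanned image stays in the original PDF and only the small text layer is added, the merged file should stay close to the original size.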

Good luck with your journey.

loulou2522 · Post by **loulou2522** » Wed Dec 20, 2023 9:26 am

Thanks, KosterNET.
Could you send me your tool? It could be very interesting for me, because my results with ocrmypdf are not good.

KosterNET · Post by **KosterNET** » Thu Dec 21, 2023 10:13 am

I will send you a link via PM. It was developed internally; I have not tested it extensively and do not have time for extended support.
