Posted: Tue Dec 09, 2008 11:29 pm
by milan1612
Xombie wrote:milan1612 - will you release your converted source code?
I didn't change much; the main problem was the compilation and linking process.
But if you'd really like to see it:
http://rapidshare.com/files/171898788/pdf.cpp.html
And Marco, if Srod doesn't want the money feel free to donate it to Fantaisie Software...
Posted: Wed Dec 10, 2008 2:38 am
by Rook Zimbabwe
Xombie... a possibility also: Could it be Page as Image?
Posted: Wed Dec 10, 2008 8:48 am
by gnozal
I just tried milan1612's release.
It doesn't seem to work correctly with non-English text, i.e. text using characters like éà...
Posted: Wed Dec 10, 2008 12:15 pm
by milan1612
gnozal wrote:I just tried milan1612's release.
It doesn't seem to work correctly with non-English text, i.e. text using characters like éà...
That's because the original C++ code is ASCII-only, with no Unicode support.

Posted: Wed Dec 10, 2008 12:43 pm
by srod
Well, PDF is not Unicode by default; it essentially uses a 7-bit ASCII encoding. My limited understanding of the PDF format is that to support Unicode you essentially have to create tables of character codes (to map character codes to font glyphs etc.), and from what I saw of the C source, the library doesn't seem equipped at all to deal with Unicode character sets. Whether the library uses a wide-character string representation for its variables or not is, I thus think, immaterial.
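To illustrate the kind of table lookup described above, here is a minimal sketch in Python. The names `CODE_TO_UNICODE` and `decode_codes` are made up for this example, and the table holds only two entries (the byte codes that WinAnsi/Latin-1 assign to à and é). A real PDF reader would build such a table from the font's encoding data, which this sketch does not attempt.

```python
# Hypothetical excerpt of a character-code table; a real one would be
# derived from the font's encoding, not hard-coded like this.
CODE_TO_UNICODE = {
    0xE0: "\u00e0",  # à
    0xE9: "\u00e9",  # é
}


def decode_codes(raw, table):
    """Map each byte of a PDF text string to a Unicode character."""
    out = []
    for code in raw:
        if code < 0x80:
            # 7-bit codes are taken as plain ASCII.
            out.append(chr(code))
        else:
            # Codes missing from the table become U+FFFD (replacement char).
            out.append(table.get(code, "\ufffd"))
    return "".join(out)


print(decode_codes(b"caf\xe9", CODE_TO_UNICODE))
```

An ASCII-only extractor has no such table, which would fit the trouble reported with characters like éà.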
Posted: Wed Dec 10, 2008 1:23 pm
by milan1612
Found another possibility to extract text from a pdf:
xpdf (Direct download)
It is a pack of open-source utilities containing, among others, a pdftotext command-line
utility, which I must say works much better than the C++ library.
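For anyone who wants to try the command-line route before a library exists, a rough sketch of calling Xpdf's text extractor (the executable is named `pdftotext`) from Python follows. The helper names `build_pdftotext_command` and `extract_text` are invented for this example, and it assumes `pdftotext` is installed and on the PATH. `-enc` and `-layout` are options of the Xpdf tool; whether `-layout` helps depends on the document.

```python
import subprocess


def build_pdftotext_command(pdf_path, txt_path, encoding="UTF-8"):
    """Return the argument list for one pdftotext run."""
    # -enc selects the output text encoding; -layout asks the tool to
    # keep the physical page layout as far as it can.
    return ["pdftotext", "-enc", encoding, "-layout", pdf_path, txt_path]


def extract_text(pdf_path, txt_path):
    """Run pdftotext; raises CalledProcessError if the tool reports failure."""
    # Assumption: the pdftotext executable can be found on the PATH.
    subprocess.run(build_pdftotext_command(pdf_path, txt_path), check=True)
```

Usage would be something like `extract_text("input.pdf", "output.txt")`, where both file names are placeholders.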
Posted: Wed Dec 10, 2008 7:41 pm
by Marco2007
@Milan: Could you do it with the better one, please?
...as I wrote, there are problems with this one:
There's always a 240 in the txt file:
SCHNITT-240
SPALT
LINSEN-240
BRENNWEITE
DUESEN-240
DURCHMESSER
MAX. LASER-240
LEISTUNG
EINSTELL-240
TEILENUMMER:240
TEILE-ID:240
Any ideas why?
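A guess only, not confirmed anywhere in this thread: PDF literal strings may write a byte as a backslash followed by up to three octal digits, and `\240` is octal for 160 (0xA0), the non-breaking space in Latin-1. If an extractor dropped the backslash without decoding the digits, a stray "240" would be left in the text. The sketch below shows the decoding step in Python; `decode_octal_escapes` is a name made up for this example, and mapping the byte straight to a Latin-1 character is a simplifying assumption.

```python
import re

# A backslash followed by one to three octal digits, as PDF literal
# strings allow.
_OCTAL_ESCAPE = re.compile(r"\\([0-7]{1,3})")


def decode_octal_escapes(literal):
    """Replace \\ddd octal escapes with the character they stand for."""
    # Simplified: an escaped backslash followed by digits is not handled,
    # and each byte value is read as a Latin-1 code point (an assumption).
    return _OCTAL_ESCAPE.sub(
        lambda m: chr(int(m.group(1), 8) & 0xFF), literal
    )


# repr() makes the otherwise invisible non-breaking space visible.
print(repr(decode_octal_escapes("SCHNITT-\\240")))
```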
...and the txt file is empty with this PDF (created with PurePdf):
http://www.free-space.at/elke/Marco.pdf
You decide what happens -> Fantaisie Software will get the donation and we all get something we can use. That's good!
Any chance of the better PDF extractor? The current one is a little problematic...
Thank you!
Posted: Wed Dec 10, 2008 7:49 pm
by milan1612
@Marco
I had another look at the tool mentioned above today. I managed to compile it
from source, but before I even try to make a library out of it I have to know
whether the utility is good enough for you. Would you mind testing whether it works for you?
Posted: Wed Dec 10, 2008 7:55 pm
by Marco2007
Of course! I'll PM you...
Posted: Wed Dec 10, 2008 10:09 pm
by Xombie
I'm watching this thread with an eagle eye. It would be very useful to my work-work project to extract text from PDF files.
Let me know if y'all need any additional testing or help on compiling and such.
Posted: Wed Dec 10, 2008 10:11 pm
by milan1612
Xombie wrote:I'm watching this thread with an eagle eye. It would be very useful to my work-work project to extract text from PDF files.
Let me know if y'all need any additional testing or help on compiling and such.
The library conversion of the tool mentioned above is finished; Marco and I
are currently testing various PDFs. It's working much better than my first
conversion. If you PM me your e-mail address I can send you the library...
The more testers, the better the result.

Posted: Wed Dec 10, 2008 10:20 pm
by Marco2007
It's brilliant. I tested it with PDFs created with PurePDF and with various other PDFs. Milan's work is really great!! ...no problems.
@Fantaisie Software: ...it could take until Monday (I have to reload my Electron prepaid Visa for PayPal).
Thanks a lot to Milan!

Posted: Wed Dec 10, 2008 10:28 pm
by milan1612
No problem Marco, it was fun to refresh my C++ knowledge. Please don't forget that
the major work on this library wasn't done by me but by the original authors of Xpdf.
For all the others here is the link:
http://rapidshare.com/files/172190796/pdf2text.zip
Re:
Posted: Wed Jun 15, 2011 10:44 am
by MachineCode
Has anyone got this pdftext.zip file? The link returns a 404.
Re: Re:
Posted: Wed Jun 15, 2011 11:02 am
by Little John
MachineCode wrote:Anyone got this pdftext.zip file?
Yes.
And many thanks to Milan!
Regards, Little John