
Make a lib (from C-Code) and get Esgrid!

Posted: Tue Dec 09, 2008 7:32 pm
by Marco2007
Hello to everyone,

I need some help from C users. I'm very busy at work right now, so I can't try it myself... and I'm not very good at C anyway :wink: .

I need a lib and DLL for extracting text from a PDF, exposing something like
Procedure (pdf.s, outputtxt.s).

Here's the code: http://www.codeproject.com/KB/cpp/ExtractPDFText.aspx

The first person who does this for me will get EsGrid (I will buy a new licence for them -> srod will send them the key).

Thanks
Marco

Here's the source: http://www.free-space.at/elke/ExtractPDFText_src.zip

Posted: Tue Dec 09, 2008 8:10 pm
by SFSxOI
http://www.rentacoder.com/RentACoder/Do ... fault.aspx

Just kidding :)

Anyway, I was just looking at doing something like this... maybe, if I could do it quickly enough. I have around 3000 .pdf documents that need their text extracted and archived. I think what the boss wants to end up doing is have Adobe do it in some way. If you come up with something, please let the rest of us know.

You know why they call it Adobe Acrobat? Because you have to be an acrobat to use it. Ughhhh...I hate .pdf to begin with.

Posted: Tue Dec 09, 2008 8:15 pm
by Marco2007
The exe on that site works really well.
If someone can make a lib (which of course must work), it should be available to everyone.

Posted: Tue Dec 09, 2008 8:59 pm
by srod
I'd do it, but I already own a copy of EsGRID! :wink:

Posted: Tue Dec 09, 2008 9:06 pm
by Marco2007
@srod: I'd like it if you'd do it! Whatcha want?

Posted: Tue Dec 09, 2008 10:04 pm
by srod
Sorry mate - haven't the time right now. :)

From what I know about the pdf format though I really don't think it would be difficult to code such a routine from scratch. Something I'd be interested in looking at when I get time.
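
As a rough illustration of what "from scratch" could look like (a minimal sketch, not anyone's code from this thread, and written in Python rather than C or PureBasic for brevity): it assumes the page content sits in Flate-compressed streams and that the text appears as literal strings "(...)" inside BT ... ET blocks. The names `inflate` and `extract_text` are made up for this example. Hex strings, nested parentheses, font encodings, other filters and encrypted files are all ignored here.

```python
import re
import zlib

# A stream body sits between the "stream" and "endstream" keywords.
_STREAM_RE = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)
# A text block in a content stream is delimited by the BT and ET operators.
_TEXT_BLOCK_RE = re.compile(rb"\bBT\b(.*?)\bET\b", re.DOTALL)
# A literal string: "(...)" with backslash escapes, no nested parentheses.
_STRING_RE = re.compile(rb"\(((?:\\.|[^\\()])*)\)", re.DOTALL)
# Only the escapes \( \) and \\ are undone; any others are left as-is.
_ESCAPE_RE = re.compile(rb"\\([()\\])")


def inflate(raw: bytes) -> bytes:
    """Return the Flate-decoded bytes, or b"" if raw is not zlib data."""
    try:
        # decompressobj tolerates trailing bytes such as the newline
        # that precedes "endstream".
        return zlib.decompressobj().decompress(raw)
    except zlib.error:
        return b""


def extract_text(pdf_bytes: bytes) -> str:
    """Collect literal strings from BT...ET blocks of Flate streams."""
    lines = []
    for stream in _STREAM_RE.finditer(pdf_bytes):
        content = inflate(stream.group(1))
        for block in _TEXT_BLOCK_RE.findall(content):
            parts = [_ESCAPE_RE.sub(rb"\1", s) for s in _STRING_RE.findall(block)]
            if parts:
                # One output line per text block; pieces are joined as-is.
                lines.append(b"".join(parts).decode("latin-1"))
    return "\n".join(lines)
```

Usage would be along the lines of `extract_text(open("some.pdf", "rb").read())`. Any stream that does not inflate is silently skipped, so a file using a different filter simply yields no text.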

Posted: Tue Dec 09, 2008 10:08 pm
by Marco2007
:(

Anyone else?

Posted: Tue Dec 09, 2008 10:23 pm
by Marco2007
OK! I found a workaround, because the code from CodeProject doesn't work quite the way I want with my PDFs.

My solution: RunProgram the PDF -> select all -> copy and paste it into a text file. Not the best solution, but it works.

Posted: Tue Dec 09, 2008 10:37 pm
by milan1612
http://rapidshare.com/files/171884768/pdftext.zip.html

There you are, I tested it briefly and didn't find any bugs. Let me know if you find one.
As I already have an EsGrid license I want you to donate the money to Srod,
he truly deserves it!

Posted: Tue Dec 09, 2008 10:53 pm
by Xombie
Caught this thread by a happy accident. @milan1612 - I tested your code on two different PDF files and it only wrote a 0 byte text file. Do you have a small PDF file that worked on your system for me to test on mine?

Posted: Tue Dec 09, 2008 10:57 pm
by milan1612
Here is the Call of Duty 4 manual:
http://rapidshare.com/files/171890493/manual.pdf.html
Works quite well here...

Posted: Tue Dec 09, 2008 11:09 pm
by srod
milan1612 wrote:
http://rapidshare.com/files/171884768/pdftext.zip.html

There you are, I tested it briefly and didn't find any bugs. Let me know if you find one.
As I already have an EsGrid license I want you to donate the money to Srod,
he truly deserves it!

Marco, please - if I may: whilst it's a very kind offer and much appreciated, would you mind donating to PureBasic instead? I think that Fred and co are more deserving than I. :)

Posted: Tue Dec 09, 2008 11:10 pm
by Xombie
Can you try the file here: http://www.esri.com/library/whitepapers ... pefile.pdf

So far I've found only one PDF file out of 10 on my system that works.

Posted: Tue Dec 09, 2008 11:14 pm
by srod
Yes, that particular PDF file must be using one of the alternative compression schemes for object streams, rather than the one supported by this C library.
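
One way to check this on a given file is to look at which filters its streams declare. The following is only a rough diagnostic sketch in Python, not part of the C library or of milan1612's conversion; the function name `summarize_pdf` is made up for this example. It scans the raw bytes for `/Filter` entries (names such as FlateDecode, LZWDecode or ASCII85Decode) and also notes whether an `/Encrypt` entry appears anywhere. It assumes the filter names are written directly in the stream dictionaries.

```python
import re
from collections import Counter

# Matches "/Filter /Name" as well as "/Filter [/A /B]" in raw PDF bytes.
_FILTER_RE = re.compile(rb"/Filter\s*(\[[^\]]*\]|/[A-Za-z0-9]+)")
# Pulls the individual names out of the matched value.
_NAME_RE = re.compile(rb"/([A-Za-z0-9]+)")


def summarize_pdf(data: bytes) -> dict:
    """Count declared stream filters and flag a possible /Encrypt entry."""
    filters = Counter()
    for match in _FILTER_RE.finditer(data):
        for name in _NAME_RE.findall(match.group(1)):
            filters[name.decode("ascii")] += 1
    return {"filters": dict(filters), "encrypted": b"/Encrypt" in data}
```

Running `summarize_pdf(open("some.pdf", "rb").read())` on a file that works and on one that produces an empty text file should show whether they differ in the filters they use.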

Posted: Tue Dec 09, 2008 11:15 pm
by Xombie
Or some protection in place?

milan1612 - will you release your converted source code?