It's not like he's going to make millions with it if he accomplishes anything. Seriously, most captchas are not as good as you make them sound; that's the thing... Talking crap (and by that I mean making them seem better than they are) scares some people away, but not me.
First, you'll need a cool interface for bitmap manipulation - so you can forget about file formats and just deal with raw data.
Secondly, you'll need to clean your input data to a state where it becomes usable for you.
Most captchas still put their noise and artifacts at lower gammas, which means that you can effectively cut all of that away with a threshold and end up with a series of "binary images" (because only 2 colors are left).
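To make the "cut it away" step concrete, here is a minimal sketch of a plain threshold in pure Python. It assumes the noise is fainter than the glyph strokes, and the function name `binarize` and the cutoff of 128 are my own choices for illustration; a real image would need the cutoff tuned by hand.

```python
def binarize(pixels, threshold=128):
    """Map a grayscale bitmap (rows of 0-255 ints) to a binary image.

    Pixels darker than `threshold` become 1 (ink), everything else 0.
    The default threshold is an assumption, not a universal value.
    """
    return [[1 if value < threshold else 0 for value in row] for row in pixels]


# Tiny hand-made example: dark strokes (25-40) on a light, noisy background.
gray = [
    [250, 200,  30, 210, 245],
    [240,  25,  35, 190, 250],
    [235, 180,  40, 220, 230],
]
binary = binarize(gray)
for row in binary:
    print("".join("#" if bit else "." for bit in row))
```

Everything lighter than the cutoff, noise included, collapses into the background, which is the whole point of the step.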
Either way, once you have cleaned the data, you'll need to separate it into glyphs. Once that's done, you'll need to register each glyph so its rotation is nulled out. After that, you can use your favorite method to identify each one of the glyphs, or a combination of them all.
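The separation step can be sketched in a few lines. This is only one possible approach (splitting at empty columns), and it assumes the glyphs never touch or overlap horizontally; `split_glyphs` and `crop` are names made up for this example. The rotation registration is left out here.

```python
def split_glyphs(binary):
    """Split a binary image (rows of 0/1) into glyphs at empty columns.

    Returns a list of (start, end) column ranges, end exclusive.
    Assumes glyphs do not touch or overlap horizontally.
    """
    width = len(binary[0]) if binary else 0
    # Number of ink pixels in each column.
    column_ink = [sum(row[x] for row in binary) for x in range(width)]

    spans = []
    start = None
    for x, ink in enumerate(column_ink):
        if ink and start is None:
            start = x                 # a glyph begins
        elif not ink and start is not None:
            spans.append((start, x))  # the glyph ends at the first empty column
            start = None
    if start is not None:
        spans.append((start, width))  # glyph touching the right edge
    return spans


def crop(binary, span):
    """Cut the columns in `span` out of the image."""
    start, end = span
    return [row[start:end] for row in binary]


image = [
    [0, 1, 1, 0, 0, 1, 0, 1],
    [0, 1, 0, 0, 0, 1, 0, 1],
    [0, 1, 1, 0, 0, 1, 0, 1],
]
spans = split_glyphs(image)
glyphs = [crop(image, span) for span in spans]
print(spans)
```

When glyphs do touch, the column counts never drop to zero between them and this simple split fails, so something like connected-component labeling would be needed instead.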
The thing that makes good captchas good is that they don't use public fonts but private font sets. This means that in order to correctly train a neural net, you first need to collect all the glyphs in the shape they will have after the processing is done.
While some people skip this step, doing so brings their results down to 80% positives at most.
But you see, even if I talk about it without giving away the keywords, the guy won't just do anything wrong... I think that no one should hide information from anyone at all.
There's also the common misconception that "neural nets are me god", and that's just bull. You don't need a neural net at all to identify a bunch of glyphs, especially when the captcha uses known fonts.
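As an illustration of the "no neural net needed" point, here is a rough sketch of nearest-template matching by pixel difference. It assumes every glyph has already been cleaned, de-rotated and scaled to the same size as the templates; the 3x3 templates are hand-drawn placeholders, not taken from any real font.

```python
def hamming(a, b):
    """Count the pixels that differ between two equally sized binary glyphs."""
    return sum(pa != pb for row_a, row_b in zip(a, b) for pa, pb in zip(row_a, row_b))


def classify(glyph, templates):
    """Return the label of the template with the fewest differing pixels."""
    return min(templates, key=lambda label: hamming(glyph, templates[label]))


# Hypothetical 3x3 templates, hand-drawn for illustration only.
TEMPLATES = {
    "I": [[0, 1, 0],
          [0, 1, 0],
          [0, 1, 0]],
    "L": [[1, 0, 0],
          [1, 0, 0],
          [1, 1, 1]],
    "T": [[1, 1, 1],
          [0, 1, 0],
          [0, 1, 0]],
}

# An "L" with one pixel of leftover noise.
noisy = [[1, 0, 0],
         [1, 0, 1],
         [1, 1, 1]]
print(classify(noisy, TEMPLATES))
```

With a known font the templates can be rendered straight from it, which is why known fonts make the job so much easier.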
There are several distinct methods for captcha recognition, and some companies actually sell it as a service! But as mentioned, captchas are there for a reason. With recaptchas, however... well, not so much. I don't agree with the creator's point of view on this subject. It's not as noble as he painted it to be.
Funny you should mention OCR... you know, recaptchas exist because current OCRs can't identify those sequences, and thus humans are required to identify them.
I am currently breaking a captcha from a NIC service where they use biased parameters to generate the final image. After analyzing over 5000 samples, I concluded they only use 1 font and the rotation angles are 15° maximum. The noise is easily removed through equalization, and as soon as I finish the segmentation code, I'll be able to identify those glyphs!
Lucky for me, they don't distort the image, but if they did, I assume they would leave the grid visible... knowing how stupid they are.
