Fedora Users — Re: A question on OCR for bad old document?

On Sun, Jun 6, 2010 at 10:01 PM, mike cloaked <mike.cloaked@xxxxxxxxx> wrote:
> I have a scanned pdf of a very old document which was typewritten
> about half a century ago. The scanned copy is noisy and the letters
> are far from clear. The text can be made out (mostly) by eye, but it
> is 19 pages long and I would like to OCR it to get a digitised text to
> save the eye strain and lots of typing.
>
> I have tried various routes to doing this, including converting the
> pdf to jpg, tif and other formats after fiddling with it in GIMP to
> turn it (not very well) from grey scale to monochrome with an indexed
> image before trying to OCR it. I have tried GOCR, OCRAD and gscan2pdf
> but all give pretty awful results with a very low success rate.
>
> Does anyone have any guidance or a URL to point me to that may help
> with turning that scanned old document into something sensible as a
> character file within Fedora?
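
On the greyscale-to-monochrome step: rather than picking a cut-off by
eye, one common approach is Otsu's method, which chooses the threshold
that best separates dark from light pixels. The sketch below is only an
illustration and makes no claim about what GIMP or any OCR tool does
internally. It assumes the page is already available as a flat list of
grey levels from 0 to 255; the function names are my own.

```python
def otsu_threshold(pixels):
    """Return the grey level t that maximises between-class variance.

    pixels is assumed to be a flat sequence of ints in the range 0-255.
    Pixels <= t form the dark class, pixels > t the light class.
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1

    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0    # number of pixels at or below the candidate threshold
    sum0 = 0  # sum of the grey levels of those pixels
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        if w0 == 0:
            continue          # no dark pixels yet
        w1 = total - w0
        if w1 == 0:
            break             # no light pixels left
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        # Between-class variance, up to a constant factor of 1/total**2.
        var = w0 * w1 * (mean0 - mean1) ** 2
        if var > best_var:
            best_t, best_var = t, var
    return best_t


def binarise(pixels, t):
    """Map every pixel to pure black (0) or pure white (255)."""
    return [0 if p <= t else 255 for p in pixels]
```

Whether this helps on a noisy typewritten scan is something to try
rather than assume; uneven background may still need a local threshold
or some despeckling first.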

Have you tried Tesseract? I believe gscan2pdf can use Tesseract as its
OCR engine once it is installed.

(http://code.google.com/p/tesseract-ocr/)

(yum install tesseract)
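
If you would rather drive it by hand than through gscan2pdf, here is a
rough sketch of one way to do it. It assumes pdftoppm (from the
poppler-utils package) is installed alongside tesseract, and that your
Tesseract build reads PNG; older releases, as far as I know, wanted
TIFF input. The function names, the "pages" directory and the 300 dpi
default are my own choices, not anything the tools require.

```python
import glob
import os
import subprocess


def render_cmd(pdf_path, prefix, dpi=300):
    """Build the pdftoppm command that renders each page to a greyscale PNG."""
    # pdftoppm writes one file per page, named <prefix>-<page number>.png.
    return ["pdftoppm", "-r", str(dpi), "-gray", "-png", pdf_path, prefix]


def ocr_cmd(image_path, lang="eng"):
    """Build the tesseract command for one page image.

    tesseract takes an output base name and writes <base>.txt.
    """
    base = os.path.splitext(image_path)[0]
    return ["tesseract", image_path, base, "-l", lang]


def ocr_pdf(pdf_path, workdir="pages", lang="eng"):
    """Render pdf_path into workdir, OCR every page, return the joined text."""
    os.makedirs(workdir, exist_ok=True)
    prefix = os.path.join(workdir, "page")
    subprocess.run(render_cmd(pdf_path, prefix), check=True)

    texts = []
    # Page numbers are zero-padded by pdftoppm, so a plain sort keeps order.
    for image in sorted(glob.glob(prefix + "-*.png")):
        subprocess.run(ocr_cmd(image, lang), check=True)
        with open(os.path.splitext(image)[0] + ".txt", encoding="utf-8") as fh:
            texts.append(fh.read())
    return "\n".join(texts)
```

Calling ocr_pdf("scan.pdf") would then leave one .txt file per page in
"pages" and return all the text as a single string. A higher dpi
sometimes helps with small or faint type, at the cost of speed.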

The best OCR tool I have found so far is a commercial one:
Adobe Acrobat Professional.

Paul