On Sun, Jun 6, 2010 at 10:01 PM, mike cloaked <mike.cloaked@xxxxxxxxx> wrote:
> I have a scanned pdf of a very old document which was typewritten
> about half a century ago. The scanned copy is noisy and the letters
> are far from clear. The text can be made out (mostly) by eye, but it
> is 19 pages long and I would like to OCR it to get a digitised text to
> save the eye strain and lots of typing.
>
> I have tried various routes to doing this, including converting the
> pdf to jpg, tif and other formats after fiddling with it in GIMP to
> turn it (not very well) from grey scale to monochrome with an indexed
> image before trying to OCR it. I have tried GOCR, OCRAD and gscan2pdf
> but all give pretty awful results with a very low success rate.
>
> Does anyone have any guidance or a url to point me to that may help
> with turning that scanned old document into something sensible as a
> character file within Fedora ?

Have you tried Tesseract? I suppose that Tesseract can work from
inside gscan2pdf.

(http://code.google.com/p/tesseract-ocr/)

(yum install tesseract)

The best OCR tool that I have found up to now is a commercial one:
Acrobat Professional.

Paul
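
For what it is worth, here is a rough sketch of how the Tesseract route could be scripted for a multi-page scan. It is only an illustration under assumptions that are not from the thread: that `pdftoppm` (from poppler-utils) and a Tesseract build that reads PNG are both installed, that 300 dpi greyscale is a sensible rendering for the scan, and that the text is English. The function names, the `pages` directory and `scan.pdf` are made up for the example.

```python
import shutil
import subprocess
from pathlib import Path


def rasterize_cmd(pdf, prefix, dpi=300):
    """Build the pdftoppm command that renders every page to a greyscale PNG.

    Output files are named <prefix>-<page number>.png. The 300 dpi default
    is an assumption, not a value taken from the thread.
    """
    return ["pdftoppm", "-r", str(dpi), "-gray", "-png", str(pdf), str(prefix)]


def ocr_cmd(image, lang="eng"):
    """Build the tesseract command for one image.

    Tesseract appends ".txt" to the output base, so the text for
    page-01.png ends up in page-01.txt in the same directory.
    """
    image = Path(image)
    return ["tesseract", str(image), str(image.with_suffix("")), "-l", lang]


def page_number(image):
    """Extract the numeric page suffix that pdftoppm adds to each file name."""
    return int(Path(image).stem.rsplit("-", 1)[1])


def ocr_pdf(pdf, workdir="pages"):
    """Rasterize the PDF, OCR each page in order, and return the joined text."""
    for tool in ("pdftoppm", "tesseract"):
        if shutil.which(tool) is None:
            raise RuntimeError(f"{tool} not found on PATH")
    out = Path(workdir)
    out.mkdir(exist_ok=True)
    subprocess.run(rasterize_cmd(pdf, out / "page"), check=True)
    texts = []
    for image in sorted(out.glob("page-*.png"), key=page_number):
        subprocess.run(ocr_cmd(image), check=True)
        texts.append(image.with_suffix(".txt").read_text(encoding="utf-8"))
    return "\n".join(texts)
```

Calling `ocr_pdf("scan.pdf")` would then return the recognised text of all pages as one string. On a noisy typewritten scan the result will still need proofreading, and it may be worth experimenting with the resolution before giving up on a page.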