Fedora Users — A question on OCR for bad old document?

To: Community support for Fedora users <users@xxxxxxxxxxxxxxxxxxxxxxx>

Subject: A question on OCR for bad old document?

From: mike cloaked <mike.cloaked@xxxxxxxxx>

Date: Sun, 6 Jun 2010 22:01:32 +0100

Delivered-to: users@xxxxxxxxxxxxxxxxxxxxxxx

Reply-to: Community support for Fedora users <users@xxxxxxxxxxxxxxxxxxxxxxx>

I have a scanned PDF of a very old document that was typewritten
about half a century ago. The scan is noisy and the letters are far
from clear. The text can (mostly) be made out by eye, but it is 19
pages long, and I would like to OCR it to get digitised text and save
the eye strain and a lot of typing.

I have tried various routes, including converting the PDF to JPG,
TIFF and other formats after fiddling with it in GIMP to turn it (not
very well) from greyscale into a monochrome indexed image before
running OCR on it. I have tried GOCR, OCRAD and gscan2pdf, but all
give pretty awful results with a very low success rate.
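
To make the greyscale-to-monochrome step concrete: one common
alternative to picking a cut-off by hand in GIMP is Otsu's method, which
chooses a global threshold from the image histogram. The following is
only a minimal sketch in plain Python, assuming the page is already
available as a flat list of 0-255 grey values; the function names
otsu_threshold and binarize are hypothetical, and this is not the
procedure that was used above.

```python
def otsu_threshold(pixels):
    """Return the grey level (0-255) that maximises between-class variance.

    `pixels` is assumed to be a flat, non-empty list of ints in 0..255.
    """
    # Histogram of grey levels.
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1

    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_bg = 0      # number of pixels at or below the candidate threshold
    sum_bg = 0    # sum of grey values at or below the candidate threshold
    for t in range(256):
        w_bg += hist[t]
        if w_bg == 0:
            continue          # no dark pixels yet
        w_fg = total - w_bg
        if w_fg == 0:
            break             # every pixel is already in the dark class
        sum_bg += t * hist[t]
        mean_bg = sum_bg / w_bg
        mean_fg = (sum_all - sum_bg) / w_fg
        # Between-class variance, up to a constant factor.
        var = w_bg * w_fg * (mean_bg - mean_fg) ** 2
        if var > best_var:
            best_var, best_t = var, t
    return best_t


def binarize(pixels, threshold):
    """Map each pixel to 0 (ink) or 255 (paper) using the threshold."""
    return [0 if p <= threshold else 255 for p in pixels]


if __name__ == "__main__":
    # Toy data: dark "ink" values and bright "paper" values.
    page = [30] * 50 + [40] * 50 + [200] * 100 + [220] * 100
    t = otsu_threshold(page)
    mono = binarize(page, t)
    print("threshold:", t, "ink pixels:", mono.count(0))
```

A single global threshold like this tends to struggle when a scan is
unevenly lit or stained, so it is a starting point rather than a fix.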

Does anyone have any guidance, or a URL to point me to, that may help
with turning that old scanned document into something sensible as a
plain text file within Fedora?

Thanks in advance for any tips.

-- 
mike c