Fedora Users — Re: OCR

On Wed, Nov 22, 2006 at 02:07:06AM +0100, Alain PORTAL wrote:
> Hi,
>
> Is there an ocr tool in Fedora ?

I've had little luck with the open source OCR tools I've tried; they
don't recognize text very well.

There IS the Tesseract OCR engine, recently released as open source by
Google. It's available on SourceForge: sourceforge.net/projects/tesseract-ocr

A warning: it's fairly rough. There's no GUI. It accepts only
TIFF files as input. It knows nothing about multi-column
pages or any other aspect of page layout.

But it does a pretty good job of recognizing text.

Not that it's perfect; far from it. I've been using it on a bunch
of the (admittedly lousy quality) PDF legal documents from Groklaw. On
those low-quality docs it makes quite a few mistakes, but it nevertheless
gets the great bulk of the text right.

One other caveat with Tesseract is that it tends not to work when
compiled with a modern GCC. If you have an older system around with
GCC 2.95, or thereabouts, you can compile a static binary there
and run it on a newer system (I compiled it on RH 7.3 and it runs
on my CentOS 4.4).

To use it, it's helpful to have some scripts to automate some of the
processes, depending on the source of your scanned images. If you're
scanning stuff on your own scanner your needs may be different than
mine were when I started using it for the aforementioned purpose.

Since I'm converting PDF to text (these PDF files are low-resolution
scanned images of -- apparently -- low-quality originals; they do not
contain text!) I need to convert the documents to TIFF files, one per
page. To do that I hacked a copy of the pdf2ps script that comes with
Ghostscript so that it emits TIFF instead of PostScript. Then I threw
together a script to turn the resulting TIFF files (one per page) into
text using tesseract. FYI, in case it's of any assistance to anyone,
I'll append it here:

#!/bin/sh
# Takes one parameter, the path to a PDF file to be processed.
# Uses the custom script 'pdf2tif' to generate the .tif files
# at 300x300 dpi, one per page, in the current directory.
# Then runs $progdir/tesseract on every .tif file found there,
# deleting the .raw and .map files that tesseract drops.

pdf2tif "$1"

# edit this to point to wherever you've got your tesseract binary
progdir=..

for j in *.tif
do
    x=`basename "$j" .tif`
    "${progdir}/tesseract" "${j}" "${x}"
    rm -f "${x}.raw" "${x}.map"
    # comment out the next line if you want to keep the .tif files when done.
    rm "${j}"
done
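
The loop above assumes there is at least one .tif file and that tesseract always succeeds. As a rough sketch only, here is a slightly more defensive variant wrapped in a function. The name ocr_pages and its two arguments are my own invention for illustration, not anything that ships with Tesseract; it calls tesseract the same way as the script above (input file, then output base name) and leaves the .tif files in place.

```shell
#!/bin/sh
# Sketch only: ocr_pages is a hypothetical helper, not part of Tesseract.
# Usage: ocr_pages /path/to/tesseract [directory]
ocr_pages() {
    tess=$1
    dir=${2:-.}
    for tif in "$dir"/*.tif
    do
        # If nothing matches, the pattern is left unexpanded; skip it.
        [ -e "$tif" ] || continue
        base=${tif%.tif}
        if "$tess" "$tif" "$base"
        then
            # Remove the by-products; the .tif itself is kept.
            rm -f "$base.raw" "$base.map"
        else
            echo "tesseract failed on $tif" 1>&2
        fi
    done
}
```

Something like `ocr_pages ../tesseract .` would then replace the for loop. Quoting the variables matters mostly if your PDF names contain spaces.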

-------------

here's pdf2tif:

#!/bin/sh
# Derived from pdf2ps.
# Convert a PDF file to TIFF files, one per page.

# Collect any leading options so they can be passed through to gs.
OPTIONS=""
while true
do
    case "$1" in
    -?*) OPTIONS="$OPTIONS $1" ;;
    *) break ;;
    esac
    shift
done

if [ $# -eq 2 ]
then
    outfile=$2
elif [ $# -eq 1 ]
then
    # gs replaces %02d with the page number: foo.pdf -> foo-01.tif, foo-02.tif, ...
    outfile=`basename "$1" .pdf`-%02d.tif
else
    echo "Usage: `basename $0` [gs options] input.pdf [output.tif]" 1>&2
    exit 1
fi

# Doing an initial 'save' helps keep fonts from being flushed between pages.
# We have to include the options twice because -I only takes effect if it
# appears before other options.
exec gs $OPTIONS -q -dNOPAUSE -dBATCH -dSAFER -r300x300 -sDEVICE=tiffg3 "-sOutputFile=$outfile" $OPTIONS -c save pop -f "$1"
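
Since pdf2tif numbers the pages with %02d, the per-page text files can be glued back together in page order with a plain glob. A minimal sketch, assuming tesseract wrote its text to <base>.txt for each page (foo-01.txt, foo-02.txt, ...); join_pages is a made-up name for illustration.

```shell
#!/bin/sh
# Sketch only: join_pages is a hypothetical helper.
# Usage: join_pages foo   (reads foo-01.txt, foo-02.txt, ...; writes foo.txt)
join_pages() {
    base=$1
    # The shell expands the pattern in sorted order, so the zero-padded
    # page numbers produced by %02d come out in page order.
    set -- "$base"-[0-9]*.txt
    if [ ! -e "$1" ]
    then
        echo "no page files found for $base" 1>&2
        return 1
    fi
    cat "$@" > "$base.txt"
}
```

Note that the ordering only holds up to 99 pages; for longer documents you would want a wider pattern such as %03d in pdf2tif.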

-----------------------

--
-------------------------------------------------------------------------------
Under no circumstances will I ever purchase anything offered to me as
the result of an unsolicited e-mail message. Nor will I forward chain
letters, petitions, mass mailings, or virus warnings to large numbers
of others. This is my contribution to the survival of the online
community.
--Roger Ebert, December, 1996
----------------------------- The Boulder Pledge -----------------------------
