On Wed, Nov 22, 2006 at 02:07:06AM +0100, Alain PORTAL wrote:
> Hi,
>
> Is there an ocr tool in Fedora ?

I've had little luck with the open source OCR tools I've tried; they
don't recognize text very well.

There IS the Tesseract OCR engine, recently released as open source by
Google. It's available on SourceForge:

    sourceforge.net/projects/tesseract-ocr

A warning: it's fairly rough. There's no GUI. It accepts only TIFF
files as input. It knows nothing about multi-column pages or any other
aspect of page layout. But it does a pretty good job of recognizing
text. Not that it's perfect; far from it. I've been using it on a
bunch of the (admittedly lousy quality) PDF legal documents from
Groklaw. On those low-quality docs it makes quite a few mistakes, but
nevertheless gets the great bulk of it right.

One other caveat with Tesseract is that it tends not to work when
compiled with a modern GCC. If you have an older system around with
GCC version 2.95, or thereabouts, you can compile a static binary there
and run it on a newer system (I compiled it on RH 7.3 and it runs on my
CentOS 4.4).

To use it, it's helpful to have some scripts to automate parts of the
process, depending on the source of your scanned images. If you're
scanning stuff on your own scanner, your needs may be different from
mine.

The PDF files I'm converting are low-resolution scanned images of,
apparently, low-quality originals; they do not contain text! So I need
to convert each document to TIFF files, one per page. To do that I
hacked a copy of the pdf2ps script that comes with Ghostscript so that
it emits TIFF instead of PostScript. Then I threw together a script to
turn the resulting TIFF files (one per page) into text using tesseract.

FYI, in case it's of any assistance to anyone, I'll append it here:

#!/bin/sh
# Takes one parameter, the path to a PDF file to be processed.
# Uses the custom script 'pdf2tif' to generate the .tif files
# (at 300x300 dpi) and drops them in the current directory.
# Then runs $progdir/tesseract on them, deleting the .raw
# and .map files that tesseract drops.
# Note: the loop processes EVERY .tif file in the current directory,
# so run this in a directory that holds no other .tif files.

pdf2tif "$1"

# Edit this to point to wherever you've got your tesseract binary.
progdir=..

for j in *.tif
do
    x=`basename "$j" .tif`
    "${progdir}/tesseract" "${j}" "${x}"
    rm "${x}.raw"
    rm "${x}.map"
    # Comment out the next line if you want to keep the .tif files when done.
    rm "${j}"
done

-------------
here's pdf2tif:

#!/bin/sh
# Derived from pdf2ps.
# Convert PDF to TIFF files, one per page.

# Collect any leading options; they are passed through to gs.
OPTIONS=""
while true
do
    case "$1" in
    -?*) OPTIONS="$OPTIONS $1" ;;
    *) break ;;
    esac
    shift
done

if [ $# -eq 2 ]
then
    outfile=$2
elif [ $# -eq 1 ]
then
    # Default output name: <input>-NN.tif, where gs fills in the page number.
    outfile=`basename "$1" .pdf`-%02d.tif
else
    echo "Usage: `basename "$0"` [gs options] input.pdf [output.tif]" 1>&2
    exit 1
fi

# Doing an initial 'save' helps keep fonts from being flushed between pages.
# We have to include the options twice because -I only takes effect if it
# appears before other options.
exec gs $OPTIONS -q -dNOPAUSE -dBATCH -dSAFER -r300x300 -sDEVICE=tiffg3 "-sOutputFile=$outfile" $OPTIONS -c save pop -f "$1"

-----------------------

-- 
-------------------------------------------------------------------------------
Under no circumstances will I ever purchase anything offered to me as
the result of an unsolicited e-mail message. Nor will I forward chain
letters, petitions, mass mailings, or virus warnings to large numbers
of others. This is my contribution to the survival of the online
community.
        --Roger Ebert, December, 1996
----------------------------- The Boulder Pledge -----------------------------
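
A possible follow-up step, not part of the scripts above: the OCR loop leaves one text file per page, so it can be handy to join them back into a single document. The sketch below assumes that tesseract writes its result to <outputbase>.txt and that the pages carry the default pdf2tif names (<base>-01, <base>-02, ...). The function name join_pages is made up for illustration, and the two-digit pattern only covers documents of up to 99 pages.

```sh
#!/bin/sh
# Sketch only: join per-page OCR output into a single text file.
# Assumption: pages are named <base>-NN.txt, i.e. the default pdf2tif
# pattern <base>-%02d.tif with tesseract writing <page>.txt for each one.
# join_pages is a hypothetical helper, not part of tesseract or Ghostscript.

join_pages() {
    base=$1
    out="${base}.txt"

    # Start with an empty output file.
    : > "$out"

    # Zero-padded page numbers make the shell glob expand in page order.
    for page in "${base}"-[0-9][0-9].txt
    do
        # Skip the literal pattern if no page files exist.
        [ -f "$page" ] || continue
        cat "$page" >> "$out"
    done
}
```

For a file converted from mydoc.pdf, calling "join_pages mydoc" in the directory holding mydoc-01.txt, mydoc-02.txt, and so on would write the combined text to mydoc.txt.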