Fedora Users — Fedora Linux OCR howto v0.1

Below is the process I use to get good OCR results in Linux. Posted here for posterity.

My Setup:
-------------
Fedora Core 1 (Linux)
Kooka ("Scan & OCR Program" in KDE Graphics menu)
gocr (gocr-0.37-0.rhfc1.dag.i386.rpm)
Canon CanoScan LIDE 30 USB scanner

Notes about input document quality:

1. The better the quality of the document, the better your OCR results will be. If the characters on the page don't have at least some whitespace all the way around, you've got an uphill battle. And you know what Sun Tzu says about that!
2. Text that is of a uniform size and weight will yield better results.
3. If you're scanning text from non-white stock, you'll have to fiddle with the brightness and contrast when scanning to try to get as near true black and true white as you can.
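To make note 3 a bit more concrete, here is a rough Python sketch of what a brightness/contrast adjustment amounts to: a linear stretch of the gray values so the ink lands near true black and the paper near true white. This is only an illustration, not what Kooka or the scanner driver does internally. The function name, the sample values, and the black/white points are all made up for the example.

```python
def stretch_contrast(pixels, black_point, white_point):
    """Linearly remap gray values so black_point -> 0 and white_point -> 255."""
    if white_point <= black_point:
        raise ValueError("white_point must be greater than black_point")
    span = white_point - black_point
    stretched = []
    for value in pixels:
        # Clamp first, so anything darker than the ink or lighter than the
        # paper saturates at pure black or pure white.
        clamped = min(max(value, black_point), white_point)
        stretched.append(round((clamped - black_point) * 255 / span))
    return stretched


# Hypothetical gray values (0 = black, 255 = white) from dark ink on tan paper.
sample = [60, 75, 120, 180, 200]
print(stretch_contrast(sample, black_point=60, white_point=200))
```

In practice you pick the black point near the darkest ink and the white point near the paper color, which is what fiddling with the brightness and contrast sliders approximates.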

Scanning
------------
1. Connect scanner
2. Place document on scanner glass, careful to keep the lines of text as horizontal as possible.
3. Open Kooka and set the following:
   Scan Mode: "Gray"
   Resolution: 150 - 300 dpi, depending on the size of the text on the page.
4. Do a "Preview Scan"
5. Select the text area you want to scan for OCR
6. Do a "Final Scan" and select an output image format. I use "PNG" for no good reason.
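As a rough way to think about the 150 - 300 dpi choice in step 3: a typographic point is 1/72 inch, so the nominal height of the text in pixels is point size times dpi divided by 72. The Python sketch below picks the lowest resolution that clears a minimum pixel height. The 30 pixel threshold and the 150/200/300 choices are assumptions for illustration, not values taken from gocr or Kooka, and the point size is the nominal font size, so actual letters come out somewhat smaller.

```python
POINTS_PER_INCH = 72  # one typographic point is 1/72 inch


def glyph_height_px(point_size, dpi):
    """Nominal height in pixels of text of the given point size at the given dpi."""
    return point_size * dpi / POINTS_PER_INCH


def pick_resolution(point_size, min_height_px=30, choices=(150, 200, 300)):
    """Return the lowest dpi in choices that reaches min_height_px.

    min_height_px is a made-up rule of thumb; tune it against your own results.
    """
    for dpi in choices:
        if glyph_height_px(point_size, dpi) >= min_height_px:
            return dpi
    # Nothing reached the threshold, so fall back to the highest resolution offered.
    return choices[-1]


for size in (8, 12, 18):
    print(size, "pt ->", pick_resolution(size), "dpi")
```

Smaller text gets pushed toward 300 dpi and larger text can get away with 150, which matches the range given in step 3.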

Note: For some odd reason, I have to close Kooka after scanning and before OCR to get things to work the way they should. YMMV.

OCR
-----------
1. Select the image you saved in step 6 above, and select a few words from the image to do a test OCR run.
2. Click the "OCR on Selection" button on the Kooka toolbar.
3. Use the default gocr settings for the first run.
4. Check results. Adjust gocr settings if needed to get better results.
5. Repeat as necessary.
6. Try selecting all of your image text and click the "OCR on Selection" button again.
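If you'd rather run the same try, check, adjust loop outside Kooka, gocr can also be called directly. The Python sketch below builds a gocr command line using the -i (input file), -l (gray level threshold), -d (dust size) and -s (space width) options as described in the gocr man page. Check `man gocr` for your version before relying on them. The helper names and the file name in the usage comment are hypothetical, and depending on how gocr was built, a PNG may need converting to PNM first.

```python
import subprocess


def build_gocr_command(image_path, gray_level=None, dust_size=None, space_width=None):
    """Build a gocr command line; settings left as None use gocr's defaults."""
    command = ["gocr", "-i", image_path]
    if gray_level is not None:
        command += ["-l", str(gray_level)]   # gray level threshold
    if dust_size is not None:
        command += ["-d", str(dust_size)]    # ignore specks smaller than this
    if space_width is not None:
        command += ["-s", str(space_width)]  # gap between words, in pixels
    return command


def run_gocr(image_path, **settings):
    """Run gocr (it must be installed and on PATH) and return the recognized text."""
    result = subprocess.run(
        build_gocr_command(image_path, **settings),
        capture_output=True, text=True, check=True,
    )
    return result.stdout


# Example use (hypothetical file name): first run with defaults, then adjust.
#   print(run_gocr("page.pnm"))
#   print(run_gocr("page.pnm", gray_level=160, dust_size=10))
```

Starting with no settings mirrors step 3, and passing one setting at a time makes it easier to see which change actually improved the output.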

Good luck.

--
Mitch Wiedemann
mc Computer Consulting
mc2@xxxxxxxxxxxxx
http://www.lightlink.com/mc2