On 21/04/07, Keith G. Robertson-Turner <fedora-gmane.00003@xxxxxxxxxxxxxxxxxxxxxxx> wrote:
I have some PDF documents that are photocopied text documents (embedded images, rather than text glyphs). When I open these with Evince, I am able to copy and paste the actual text. At first I thought this was some kind of OCR process, but then I realised it's actually the document itself, which has the original text embedded in it (OCRed and embedded during the original scan). Is there any command I can use to extract the text from these PDF documents in a batch? I have a couple of thousand documents that need converting.
Have you looked at pdftk? "If PDF is electronic paper, then pdftk is an electronic staple-remover, hole-punch, binder, secret-decoder-ring, and X-Ray-glasses. Pdftk is a command-line tool for doing everyday things with PDF documents." http://www.accesspdf.com/pdftk/
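For the batch part of the question, here is a rough sketch of how a loop over a couple of thousand files might look. It is only an illustration under some assumptions of mine: it shells out to pdftotext (from poppler-utils, the same Poppler library Evince is built on) rather than to pdftk, it assumes pdftotext is installed and on the PATH, and the function name extract_all and the directory names in the usage note are made up for the example.

```python
import subprocess
from pathlib import Path


def extract_all(src_dir, out_dir, run=subprocess.run):
    """Run pdftotext on every *.pdf in src_dir, writing <name>.txt to out_dir.

    Returns the file names for which the command exited with a non-zero
    status. `run` is injectable so the loop can be exercised without
    pdftotext being installed.
    """
    src_dir, out_dir = Path(src_dir), Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    failed = []
    for pdf in sorted(src_dir.glob("*.pdf")):
        txt = out_dir / (pdf.stem + ".txt")
        # Assumption: pdftotext is available; -layout tries to keep the
        # physical layout of the page in the text output.
        cmd = ["pdftotext", "-layout", str(pdf), str(txt)]
        result = run(cmd)
        if result.returncode != 0:
            failed.append(pdf.name)
    return failed
```

Calling extract_all("scans", "text") (both directory names hypothetical) would leave one .txt file per PDF in text/ and return a list of any files that could not be converted, so they can be checked by hand afterwards.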