Verily I say unto thee, that bdk@xxxxxx spake thusly:

> I think pdftohtml is part of poppler-utils

Got it, thanks.

However, now there's another problem - it doesn't really work. All it
produces is "empty" html files, that is - they are proper html (head,
body, etc.) but the actual content is not there.

IOW it looks like it can only work if the content of the PDF really is
text, and not a scanned image of text.

This definitely works with Evince, I just wish there was a way to
automate it with a batch script, rather than me having to copy and
paste the text out of 2000 documents.

Here's the original PDF file:

http://antitrust.slated.org/www.iowaconsumercase.org/011607/0000/PX00111.pdf

And here's a video of Evince "OCRing" the text from the image:

http://media.slated.org/albums/userpics/Evince_podit.mp4 (H264 MP4)

Download the PDF and try it yourself.

It's bizarre, surely there's a way to automate this?

TIA.

--
K.
http://slated.org

.----
| I found [Vista] to be a dangerously unstable operating system,
| which has caused me to lose data ... unfortunately this product
| is unfit for any user. - [H]ardOCP, <http://tinyurl.com/3bpfs2>
`----

Fedora Core release 5 (Bordeaux) on sky, running kernel 2.6.20-1.2312.fc5
01:31:48 up 4 days, 23:03, 3 users, load average: 0.57, 0.52, 0.54
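
On the batch-script question above, here is a minimal sketch of what such a script might look like. It rests on an assumption that is not confirmed anywhere in this post: that the scanned PDFs carry a hidden text layer (which would explain why text can be selected in Evince) and that pdftotext, which ships in poppler-utils alongside pdftohtml, can read that layer. The function names build_command and convert_all are made up for illustration. If the files turn out to be pure images, this produces empty .txt files and a real OCR step would be needed instead.

```python
import subprocess
import sys
from pathlib import Path


def build_command(pdf):
    """Return a pdftotext invocation that writes <name>.txt next to the PDF.

    -layout asks pdftotext to keep the original physical layout.
    """
    pdf = Path(pdf)
    return ["pdftotext", "-layout", str(pdf), str(pdf.with_suffix(".txt"))]


def convert_all(directory):
    """Run pdftotext on every *.pdf in directory.

    Returns the list of PDFs whose output was missing or blank, i.e. the
    ones where the hidden-text-layer assumption apparently does not hold.
    """
    empty = []
    for pdf in sorted(Path(directory).glob("*.pdf")):
        subprocess.run(build_command(pdf), check=True)
        txt = pdf.with_suffix(".txt")
        if not txt.exists() or not txt.read_text(errors="replace").strip():
            empty.append(pdf)
    return empty


if __name__ == "__main__":
    target = sys.argv[1] if len(sys.argv) > 1 else "."
    for pdf in convert_all(target):
        print(f"no extractable text: {pdf}")
```

Run it with the directory holding the PDFs as the only argument. Any file it lists at the end is one where no text came out, so those would still need the manual route or an actual OCR tool.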