Fedora Users — Re: Copying text from a protected pdf file

To: For users of Fedora Core releases <fedora-list@xxxxxxxxxx>

Subject: Re: Copying text from a protected pdf file

From: Paul Smith <phhs80@xxxxxxxxx>

Date: Thu, 15 Sep 2005 23:45:11 +0100

On 9/15/05, Deron Meranda <deron.meranda@xxxxxxxxx> wrote:
> > > > > > I have got a pdf file, whose text I would like to copy to a word
> > > > > > processor. However, it seems to be protected, as when I copy and paste
> > > > > > a piece of text from there into a word processor, I only see garbage.
> ...
> > Thanks, Leonard. I have just checked: the pdf file is not copy
> > protected, but, even so, what I can copy into a word processor is
> > garbage. It may be something related to encodings.
> 
> It could be encodings.  Text in PDF is really only in terms of glyphs,
> not characters, which makes text extraction particularly difficult
> and font-specific.  Fortunately there are a few standard PDF encodings
> defined by Adobe (these map "characters" to glyphs, and are not
> quite what you'd usually think of as an "encoding"), but
> each PDF file can create its own custom encodings as well and
> visually you'd see nothing different.  There's also nothing to keep
> the "text" in a PDF file from being written weird (such as writing
> from right-to-left) since it's just graphics instructions; but most PDF
> generating programs do it in the obvious way.
> 
> You might want to look at the "pdftotext" program (which is part of
> the xpdf package, obsoleted in FC4).  It generally can do a good job
> of extracting text.
> 
> Just some more information... are your documents generally
> written in English (or use the English alphabet)?  And are they more
> like plain prose (paragraphs of text), or fanciful like marketing materials
> with lots of interspersed graphics, panels, and so forth?
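
To make the quoted point about custom encodings concrete, here is a toy sketch in Python. It does not parse a real PDF; the two encoding tables, the sample word, and the function names are all made up for illustration. It only shows how two files can draw the same glyphs while a naive copy of the underlying codes gives readable text in one case and garbage in the other.

```python
# Toy model only: no real PDF parsing. All names and tables are hypothetical.

WORD = "olá"  # arbitrary non-English sample word

# "Standard-like" encoding: each glyph is selected by its usual character code.
standard_encoding = {ord(c): c for c in WORD}

# Custom encoding, as a file with its own font encoding might define:
# the codes 1, 2, 3 select the same three glyphs.
custom_encoding = {1: "o", 2: "l", 3: "á"}


def render(codes, encoding):
    """What the viewer draws: each code is looked up in the font's encoding."""
    return "".join(encoding[code] for code in codes)


def naive_copy(codes):
    """What a naive copy yields: the raw codes treated as character codes."""
    return "".join(chr(code) for code in codes)


standard_codes = [ord(c) for c in WORD]
custom_codes = [1, 2, 3]

# Both files look the same on screen...
print(render(standard_codes, standard_encoding))
print(render(custom_codes, custom_encoding))

# ...but copying the raw codes only works for the first one.
print(repr(naive_copy(standard_codes)))
print(repr(naive_copy(custom_codes)))
```

If a file behaves like the second case, a tool can only recover the text when the file also carries some mapping from its codes back to characters.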

Thanks, Deron. My documents are not written in English, and they only
have text and tables, apparently created with MS Windows. pdftotext
and pdftohtml do not produce good or reasonable results.

Paul
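
For anyone retrying pdftotext on non-English documents, one thing that may be worth checking is the output encoding. The sketch below is a minimal Python wrapper, assuming a pdftotext binary that accepts the `-enc` and `-layout` options; the helper names and file paths are hypothetical. It cannot help if the PDF itself lacks the information needed to map glyphs back to characters.

```python
import shutil
import subprocess


def build_pdftotext_command(pdf_path, txt_path, encoding="UTF-8", keep_layout=True):
    """Build the argument list for pdftotext (helper name is hypothetical)."""
    cmd = ["pdftotext", "-enc", encoding]  # -enc selects the output text encoding
    if keep_layout:
        cmd.append("-layout")  # try to keep the physical layout, useful for tables
    cmd += [pdf_path, txt_path]
    return cmd


def extract_text(pdf_path, txt_path):
    """Run pdftotext if it is installed; raise if it is missing or fails."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext not found on PATH")
    subprocess.run(build_pdftotext_command(pdf_path, txt_path), check=True)
```

Usage would be something like `extract_text("document.pdf", "document.txt")`, where both file names are placeholders.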