Re: Copying text from a protected pdf file

--- Paul Smith <phhs80@xxxxxxxxx> wrote:

> On 9/15/05, Deron Meranda <deron.meranda@xxxxxxxxx> wrote:
> > > > > > > I have got a pdf file, whose text I would like to copy to a word
> > > > > > > processor. However, it seems to be protected, as when I copy and paste
> > > > > > > a piece of text from there into a word processor, I only see garbage.
> > ...
> > > Thanks, Leonard. I have just checked: the pdf file is not copy
> > > protected, but, even so, what I can copy into a word processor is
> > > garbage. It may be something relating with encodings.
> > 
> > It could be encodings.  Text in PDF is really only in terms of glyphs,
> > not characters, which makes text extraction particularly difficult
> > and font-specific.  Fortunately there are a few standard PDF encodings
> > defined by Adobe (these map "characters" to glyphs, and are not
> > quite the same things as you'd think of an "encoding" being), but
> > each PDF file can create its own custom encodings as well and
> > visually you'd see nothing different.  There's also nothing to keep
> > the "text" in a PDF file from being written weird (such as writing
> > from right-to-left) since it's just graphics instructions; but most PDF
> > generating programs do it in the obvious way.
> > 
> > You might want to look at the "pdftotext" program (which is part of
> > the xpdf package, obsoleted in FC4).  It generally can do a good job
> > of extracting text.
> > 
> > Just some more information... are your documents generally
> > written in English (or use the English alphabet)?  And are they more
> > like plain prose (paragraphs of text), or fanciful like marketing materials
> > with lots of interspersed graphics, panels, and so forth?
> 
> Thanks, Deron. My documents are not written in English, and they only
> have text and tables, apparently created with MS Windows. pdftotext
> and pdftohtml do not produce good or reasonable results.
> 
> Paul
> 

Have you tried converting your file to postscript and
then using ps2ascii or something similar?
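
In case it helps, a rough sketch of that two-step conversion as a small
Python script. It assumes the Ghostscript tools pdf2ps and ps2ascii are
installed and on your PATH; the function names and the choice of output
file names are mine, just for illustration. I cannot promise it gets
around a custom font encoding, but it is quick to try.

```python
import subprocess
from pathlib import Path


def build_commands(pdf_path):
    """Return the two command lines: PDF to PostScript, then PostScript to text."""
    pdf = Path(pdf_path)
    ps = pdf.with_suffix(".ps")    # intermediate PostScript file
    txt = pdf.with_suffix(".txt")  # final plain-text output
    return [
        ["pdf2ps", str(pdf), str(ps)],
        ["ps2ascii", str(ps), str(txt)],
    ]


def convert(pdf_path):
    """Run both steps in order; raises CalledProcessError if a step fails."""
    for argv in build_commands(pdf_path):
        subprocess.run(argv, check=True)
    return Path(pdf_path).with_suffix(".txt")
```

Calling convert("report.pdf") (a placeholder file name) would leave
report.ps and report.txt next to the original file.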

Best Regards,

Antonio

