--- Paul Smith <phhs80@xxxxxxxxx> wrote:

> On 9/15/05, Deron Meranda <deron.meranda@xxxxxxxxx> wrote:
> > > > > > > I have got a pdf file, whose text I would like to copy to a word
> > > > > > > processor. However, it seems to be protected, as when I copy and paste
> > > > > > > a piece of text from there into a word processor, I only see garbage.
> > ...
> > > Thanks, Leonard. I have just checked: the pdf file is not copy
> > > protected, but, even so, what I can copy into a word processor is
> > > garbage. It may be something related to encodings.
> >
> > It could be encodings. Text in PDF is really only in terms of glyphs,
> > not characters, which makes text extraction particularly difficult
> > and font-specific. Fortunately there are a few standard PDF encodings
> > defined by Adobe (these map "characters" to glyphs, and are not
> > quite the same things as you'd think of an "encoding" being), but
> > each PDF file can create its own custom encodings as well and
> > visually you'd see nothing different. There's also nothing to keep
> > the "text" in a PDF file from being written weird (such as writing
> > from right-to-left) since it's just graphics instructions; but most PDF
> > generating programs do it in the obvious way.
> >
> > You might want to look at the "pdftotext" program (which is part of
> > the xpdf package, obsoleted in FC4). It generally can do a good job
> > of extracting text.
> >
> > Just some more information... are your documents generally
> > written in English (or use the English alphabet)? And are they more
> > like plain prose (paragraphs of text), or fanciful like marketing materials
> > with lots of interspersed graphics, panels, and so forth?
>
> Thanks, Deron. My documents are not written in English, and they only
> have text and tables, apparently created with MS Windows. pdftotext
> and pdftohtml do not produce good or reasonable results.
>
> Paul

Have you tried converting your file to PostScript and then using
ps2ascii or something similar?

Best Regards,
Antonio
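
For what it's worth, the PostScript route suggested above could be scripted roughly as follows. This is only a sketch under a few assumptions: it uses Ghostscript's pdf2ps for the first step (the message does not say which converter to use, so that choice is mine), it assumes both pdf2ps and ps2ascii are on the PATH, and the function names build_commands and convert are made up for illustration. There is no guarantee the result will be any better than what pdftotext gives on a file with custom encodings.

```python
import shutil
import subprocess
from pathlib import Path


def build_commands(pdf_path):
    """Return the two argv lists: PDF -> PostScript, then PostScript -> text."""
    pdf = Path(pdf_path)
    ps = pdf.with_suffix(".ps")
    txt = pdf.with_suffix(".txt")
    return [
        # Assumed choice of converter: Ghostscript's pdf2ps wrapper script.
        ["pdf2ps", str(pdf), str(ps)],
        # ps2ascii takes an input PostScript file and an output text file.
        ["ps2ascii", str(ps), str(txt)],
    ]


def convert(pdf_path):
    """Run both steps; return the text file path, or None if a tool is missing."""
    commands = build_commands(pdf_path)
    for argv in commands:
        if shutil.which(argv[0]) is None:
            return None  # tool not installed; nothing is run
    for argv in commands:
        subprocess.run(argv, check=True)  # raises CalledProcessError on failure
    return Path(commands[-1][-1])
```

Calling convert("report.pdf") would leave report.ps and report.txt next to the input file if both tools are installed, and return None without running anything otherwise.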