On 9/16/05, Antonio Olivares <olivares14031@xxxxxxxxx> wrote:
> > > > > > > > I have got a pdf file, whose text I would like to copy
> > > > > > > > to a word processor. However, it seems to be protected,
> > > > > > > > as when I copy and paste a piece of text from there into
> > > > > > > > a word processor, I only see garbage.
> > > ...
> > > > Thanks, Leonard. I have just checked: the pdf file is not copy
> > > > protected, but, even so, what I can copy into a word processor
> > > > is garbage. It may be something related to encodings.
> > >
> > > It could be encodings. Text in PDF is really only in terms of
> > > glyphs, not characters, which makes text extraction particularly
> > > difficult and font-specific. Fortunately there are a few standard
> > > PDF encodings defined by Adobe (these map "characters" to glyphs,
> > > and are not quite the same thing as you'd think of an "encoding"
> > > being), but each PDF file can create its own custom encodings as
> > > well, and visually you'd see nothing different. There's also
> > > nothing to keep the "text" in a PDF file from being written in a
> > > weird order (such as from right to left), since it's just graphics
> > > instructions; but most PDF-generating programs do it in the
> > > obvious way.
> > >
> > > You might want to look at the "pdftotext" program (which is part
> > > of the xpdf package, obsoleted in FC4). It generally can do a good
> > > job of extracting text.
> > >
> > > Just some more information... are your documents generally
> > > written in English (or use the English alphabet)? And are they
> > > more like plain prose (paragraphs of text), or fanciful like
> > > marketing materials with lots of interspersed graphics, panels,
> > > and so forth?
> >
> > Thanks, Deron. My documents are not written in English, and they
> > only have text and tables, apparently created with MS Windows.
> > pdftotext and pdftohtml do not produce good or reasonable results.
>
> Have you tried converting your file to postscript and then using
> ps2ascii or something similar?

Yes, Antonio, I tried that, but with no better results.

Paul
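
For anyone following the thread, Deron's point about custom encodings can be illustrated with a toy Python sketch. It is not a PDF parser and does not reflect how any particular tool works; the function names and the code-assignment scheme are assumptions made up for illustration. It models a font that assigns arbitrary codes to the glyphs it uses: the page renders correctly, but reading the codes back as ordinary characters yields garbage unless the code-to-character map is known.

```python
def build_subset_encoding(text, first_code=0x21):
    """Toy model (assumption): give each distinct character a code
    in order of first use, the way a custom font encoding might."""
    code_for_char = {}
    for ch in text:
        if ch not in code_for_char:
            code_for_char[ch] = first_code + len(code_for_char)
    return code_for_char


def encode(text, code_for_char):
    """Turn the text into the byte codes that would be stored in the file."""
    return bytes(code_for_char[ch] for ch in text)


def naive_extract(codes):
    """What copy and paste sees if the codes are read as Latin-1 characters."""
    return codes.decode("latin-1")


def mapped_extract(codes, code_for_char):
    """Extraction that knows the custom code-to-character mapping."""
    char_for_code = {code: ch for ch, code in code_for_char.items()}
    return "".join(char_for_code[c] for c in codes)


original = "Año nuevo"
encoding = build_subset_encoding(original)
codes = encode(original, encoding)

print(naive_extract(codes))             # garbage: codes read as plain characters
print(mapped_extract(codes, encoding))  # original text recovered
```

If a file carries no usable mapping back to characters, extraction tools have only the codes to work with, which would be consistent with pdftotext, pdftohtml and ps2ascii all producing the same kind of garbage.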