On 9/16/05, George White <aa056@xxxxxxxxxxxxxx> wrote:
> > I have got a pdf file, whose text I would like to copy to a word
> > processor. However, it seems to be protected, as when I copy and paste
> > a piece of text from there into a word processor, I only see garbage.
> > Is there some way of getting clean text from the pdf file?
>
> The PDF format has many ways to display text. To be able to extract text
> you need a file that stores strings and uses font information to render them
> in the viewer. You may be seeing images that were rasterized long ago.
> You should provide the output of the "pdffonts" command, preferably for a
> minimal document (a big document could combine sections that use fonts with
> images).
>
> For example, the simplest case is a document that uses the PostScript Type 1
> fonts provided by the viewer:
>
> $ pdffonts /usr/share/doc/cups-1.1.20/ssr.pdf
> name                                 type         emb sub uni object ID
> ------------------------------------ ------------ --- --- --- ---------
> Times-Roman                          Type 1       no  no  no       4  0
> Helvetica                            Type 1       no  no  no       7  0
> Helvetica-Bold                       Type 1       no  no  no       8  0
> Times-Bold                           Type 1       no  no  no       5  0
> Courier                              Type 1       no  no  no       3  0
> Symbol                               Type 1       no  no  no       9  0
> Times-Italic                         Type 1       no  no  no       6  0
>
> --
> George N. White III
> Head of St. Margarets Bay, Nova Scotia

Thanks, George. In my case,

$ pdffonts myfile.pdf
name                                 type         emb sub uni object ID
------------------------------------ ------------ --- --- --- ---------
DTUUBE+TTBC19E318t00                 TrueType     yes yes no      13  0
URMVBE+TTBC18C910t00                 TrueType     yes yes no      16  0
TOYVBE+Symbol                        Type 1C      yes yes no      19  0
Helvetica                            Type 1C      yes no  no      22  0
CLLUBE+TTBC1802E0t00                 TrueType     yes yes no      34  0
Helvetica-Bold                       Type 1C      yes no  no      43  0
Helvetica-Oblique                    Type 1C      yes no  no      58  0
$

Paul
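
For anyone who wants to read a pdffonts listing like the one above mechanically, here is a rough Python sketch. It assumes the seven-column layout shown in this thread (newer pdffonts versions may print extra columns), and the function names parse_pdffonts and suspect_fonts are made up for illustration. The heuristic is also only an assumption: a subsetted font ("sub" = yes) with no ToUnicode map ("uni" = no) is a common, but not the only, reason that copy and paste gives garbage.

```python
# Listing copied from the message above; only whitespace-separated fields matter.
SAMPLE = """\
name                                 type         emb sub uni object ID
------------------------------------ ------------ --- --- --- ---------
DTUUBE+TTBC19E318t00                 TrueType     yes yes no      13  0
URMVBE+TTBC18C910t00                 TrueType     yes yes no      16  0
TOYVBE+Symbol                        Type 1C      yes yes no      19  0
Helvetica                            Type 1C      yes no  no      22  0
CLLUBE+TTBC1802E0t00                 TrueType     yes yes no      34  0
Helvetica-Bold                       Type 1C      yes no  no      43  0
Helvetica-Oblique                    Type 1C      yes no  no      58  0
"""


def parse_pdffonts(text):
    """Parse a pdffonts table (layout as in this thread) into a list of dicts."""
    fonts = []
    for line in text.splitlines():
        parts = line.split()
        # Skip blank lines, the header row, and the dashed separator row.
        if len(parts) < 7 or parts[0] == "name" or set(parts[0]) == {"-"}:
            continue
        # Read flags from the right, because the type column may contain
        # a space ("Type 1C"). The last two fields are the object ID.
        fonts.append({
            "name": parts[0],
            "type": " ".join(parts[1:-5]),
            "emb": parts[-5],
            "sub": parts[-4],
            "uni": parts[-3],
        })
    return fonts


def suspect_fonts(text):
    """Return fonts that are subsetted but carry no ToUnicode map (heuristic)."""
    return [f for f in parse_pdffonts(text)
            if f["sub"] == "yes" and f["uni"] == "no"]


if __name__ == "__main__":
    for font in suspect_fonts(SAMPLE):
        print(font["name"], "(" + font["type"] + ")")
```

In practice the text would come from running pdffonts on the file and capturing its output rather than from a pasted string. Fonts the sketch reports are only candidates; whether text can actually be recovered still depends on how the font encoding was built.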