On 9/21/05, George White <aa056@xxxxxxxxxxxxxx> wrote:
> > > > > I have got a pdf file, whose text I would like to copy to a word
> > > > > processor. However, it seems to be protected, as when I copy and paste
> > > > > a piece of text from there into a word processor, I only see garbage.
> > > > > Is there some way of getting clean text from the pdf file?
> > > >
> > > > The PDF format has many ways to display text. To be able to extract text
> > > > you need a file that stores strings and uses font information to render
> > > > them in the viewer. You may be seeing images that were rasterized long ago.
> > > > You should provide the output of the "pdffonts" command, preferably for a
> > > > minimal document (a big document could combine sections that use fonts
> > > > with images).
> > > >
> > > > For example, the simplest case is a document that uses the PostScript
> > > > Type 1 fonts provided by the viewer:
> > > >
> > > > $ pdffonts /usr/share/doc/cups-1.1.20/ssr.pdf
> > > > name                                 type         emb sub uni object ID
> > > > ------------------------------------ ------------ --- --- --- ---------
> > > > Times-Roman                          Type 1       no  no  no       4  0
> > > > Helvetica                            Type 1       no  no  no       7  0
> > > > Helvetica-Bold                       Type 1       no  no  no       8  0
> > > > Times-Bold                           Type 1       no  no  no       5  0
> > > > Courier                              Type 1       no  no  no       3  0
> > > > Symbol                               Type 1       no  no  no       9  0
> > > > Times-Italic                         Type 1       no  no  no       6  0
> > >
> > > Thanks, George. In my case,
> > >
> > > $ pdffonts myfile.pdf
> > > name                                 type         emb sub uni object ID
> > > ------------------------------------ ------------ --- --- --- ---------
> > > DTUUBE+TTBC19E318t00                 TrueType     yes yes no      13  0
> > > URMVBE+TTBC18C910t00                 TrueType     yes yes no      16  0
> > > TOYVBE+Symbol                        Type 1C      yes yes no      19  0
> > > Helvetica                            Type 1C      yes no  no      22  0
> > > CLLUBE+TTBC1802E0t00                 TrueType     yes yes no      34  0
> > > Helvetica-Bold                       Type 1C      yes no  no      43  0
> > > Helvetica-Oblique                    Type 1C      yes no  no      58  0
> > > $
> >
> > Is it possible to find the missing fonts to install them?
>
> Do you have a friend at the No Such Agency?
>
> The four embedded subsets will be a problem. When you extract text from a PDF
> file you don't get encoding or font information, so even if the fonts are
> installed you would have to manually assign the font to each fragment. A
> subsetted font may not use any recognizable encoding. I have some where it
> appears that the subsets are encoded starting with ASCII control-character
> codes (e.g., 0x01, 0x02, ...). If you are dealing with normal text, you might
> be dealing with a simple substitution code. Try constructing a
> table by working with short strings from text that seems to be in the same
> font.
>
> I'm looking at a document where "off" becomes "<ACK><BEL><BEL>", so my table
> would have:
>
> o -> 6
> f -> 7

Thanks, George. Now I understand how complicated it is to achieve my goal,
and therefore it is better to give up!

Paul
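
For anyone who wants to try the table approach George describes, here is a minimal sketch in Python. It assumes the garbled text really is a one-to-one substitution within a single font, and that the pasted text is available as raw bytes. The function names (build_table, decode) and the sample byte strings are made up for illustration; only the "off" -> <ACK><BEL><BEL> pair comes from the message above.

```python
def build_table(samples):
    """Build a code -> character table from (garbled, plain) sample pairs.

    Each pair must come from text set in the same subsetted font, since
    every subset may use its own private encoding.
    """
    table = {}
    for garbled, plain in samples:
        if len(garbled) != len(plain):
            raise ValueError("sample lengths differ: %r vs %r" % (garbled, plain))
        for code, char in zip(garbled, plain):
            # Keep the first mapping seen and reject contradictions, which
            # would mean the text is not a simple substitution.
            known = table.setdefault(code, char)
            if known != char:
                raise ValueError("code 0x%02x maps to both %r and %r"
                                 % (code, known, char))
    return table


def decode(garbled, table, unknown="?"):
    """Translate garbled bytes, marking codes not yet in the table."""
    return "".join(table.get(code, unknown) for code in garbled)


# George's example: "off" is pasted as <ACK><BEL><BEL> (0x06 0x07 0x07).
table = build_table([(b"\x06\x07\x07", "off")])

print(decode(b"\x07\x06\x06", table))  # prints "foo"
print(decode(b"\x06\x07\x01", table))  # prints "of?" (0x01 is not known yet)
```

Each new short string you can guess adds entries to the table, and the "?" marks show which codes are still missing. A separate table is needed for every subsetted font in the document.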