On 4/9/06, George N. White III <aa056@xxxxxxxxxxxxxx> wrote:
> > I print to a file, file.ps, a web-page with text. Then, I apply ps2pdf
> > and I get file.pdf. However, I cannot copy (from file.pdf) the text to
> > a text editor. Can one get a pdf file with copyable text?
>
> Does this work with a really trivial web page?
>
> What does "pdffonts file.pdf" show?
>
> If the pdf file uses strings, then you stand a better chance of being able
> to cut and paste from a pdf viewer to the editor, but you may run into
> encoding issues, so the pasted text is gibberish.
>
> I get:
>
> $ cat t.html
> abc
>
> Print to ps from Firefox, convert to pdf, load in Adobe Reader, and
> cut and paste gives: "^Y^Z^[", so the encoding is a problem. Xpdf
> would not let me copy the text. The t.html.ps file has:
>
> 8 dict begin
> /FontName /Nimbus_Roman_No9_L.Regular.0.0.Set0 def
> /FontType 1 def
> /FontMatrix [ 0.001 0 0 0.001 0 0 ]readonly def
> /PaintType 0 def
> /FontBBox [-168 -281 1031 1098]readonly def
> /Encoding [
> /.notdef
> /uni0066/uni0069/uni006C/uni0065/uni003A/uni002F/uni0068/uni006F
> /uni006D/uni0067/uni0077/uni0074/uni0057/uni0073/uni002E/uni0031
> /uni0020/uni0030/uni0034/uni0039/uni0032/uni0036/uni0041/uni004D
> /uni0061/uni0062/uni0063/
>
> This is the 'abc' --> '^Y^Z^[' encoding.
>
> $ pdffonts t.html.pdf
> name                         type         emb sub uni object ID
> ---------------------------- ------------ --- --- --- ---------
> YNAHAD+Nimbus_Roman_No9_L.Regular.0.0.Set0
>                              Type 1C      yes yes no       9  0
>
> If the pdf file uses images, you need to use an OCR tool to get the text.
> I have seen cases where printing docs to PS on Win32 results in the
> text being rasterized in the driver so the PS file has images. This may
> happen with screen fonts and/or certain effects (transparency, text
> outlines filled with colored patterns).

Thanks, George and Mike.
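
The 'abc' --> '^Y^Z^[' mapping quoted above can be checked by hand: in the
quoted /Encoding array, each glyph's position is its character code, so
/uni0061 ('a') sits at code 25, which is Ctrl-Y. The following is only a
rough Python sketch of that lookup, not something any of the tools in this
thread do. It assumes the array starts at code 0 with /.notdef and covers
only the part of the array that was quoted; the helper names (glyph_to_char,
build_table, decode) are made up for illustration.

```python
import re

# Glyph names copied from the quoted /Encoding array (only the part shown
# in the message; the real array continues). Position in the list is the
# character code, starting at 0 with /.notdef.
ENCODING = """
/.notdef
/uni0066/uni0069/uni006C/uni0065/uni003A/uni002F/uni0068/uni006F
/uni006D/uni0067/uni0077/uni0074/uni0057/uni0073/uni002E/uni0031
/uni0020/uni0030/uni0034/uni0039/uni0032/uni0036/uni0041/uni004D
/uni0061/uni0062/uni0063
"""

def glyph_to_char(name):
    """Turn a 'uniXXXX' glyph name into its character; anything else
    (such as .notdef) becomes U+FFFD."""
    m = re.fullmatch(r"uni([0-9A-Fa-f]{4})", name)
    return chr(int(m.group(1), 16)) if m else "\ufffd"

def build_table(encoding_text):
    """Map character code -> character, using list position as the code."""
    names = [n for n in encoding_text.replace("\n", "").split("/") if n]
    return {code: glyph_to_char(name) for code, name in enumerate(names)}

def decode(pasted, table):
    """Translate pasted control characters back through the table."""
    return "".join(table.get(ord(ch), "\ufffd") for ch in pasted)

table = build_table(ENCODING)
pasted = "\x19\x1a\x1b"  # ^Y ^Z ^[ as pasted from Adobe Reader
print(decode(pasted, table))  # abc
```

This only recovers text when the font's encoding array is at hand, as it is
in the .ps file here; it does nothing for the pdf itself.
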
After pstill, I get

$ pdffonts file.pdf
name                                 type         emb sub uni object ID
------------------------------------ ------------ --- --- --- ---------
Nimbus_Roman_No9_L.Regular.0.0.Set0  Type 1       yes no  no      33  0
Verdana.Bold.0.0.Set0                Type 1       yes no  no      37  0
Verdana.Regular.0.0.Set0             Type 1       yes no  no      41  0
Lucida_Sans.Regular.0.0.Set0         Type 1       yes no  no      45  0
Arial.Regular.0.0.Set0               Type 1       yes no  no      49  0
Arial.Bold.0.0.Set0                  Type 1       yes no  no      53  0
Verdana.Italic.0.0.Set0              Type 1       yes no  no      57  0
[1]-  Done                    acroread anselmo.pdf
[2]+  Done                    kwrite
$

After ps2pdf, I get

$ pdffonts file.pdf
name                                 type         emb sub uni object ID
------------------------------------ ------------ --- --- --- ---------
EOZSTF+Verdana.Regular.0.0.Set0      Type 1C      yes yes no      13  0
MQEXGW+Arial.Regular.0.0.Set0        Type 1C      yes yes no      19  0
DMCZLT+Lucida_Sans.Regular.0.0.Set0  Type 1C      yes yes no      17  0
YTBXNU+Nimbus_Roman_No9_L.Regular.0.0.Set0 Type 1C yes yes no      8  0
GBGOAU+Verdana.Bold.0.0.Set0         Type 1C      yes yes no      10  0
GMTXSU+Arial.Bold.0.0.Set0           Type 1C      yes yes no      23  0
AJKQFS+Verdana.Italic.0.0.Set0       Type 1C      yes yes no      26  0
$

Paul
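
In both listings the "uni" column is "no" for every font. As far as I
understand pdffonts, that column reports whether the font carries a
ToUnicode map, which is what a viewer needs to turn character codes back
into text when the encoding is nonstandard. Below is a small Python sketch
that picks such fonts out of pdffonts output. It assumes the seven-column
layout shown in this thread (newer pdffonts versions may print extra
columns), and parse_pdffonts is a made-up helper name; the sample rows are
copied from the ps2pdf listing.

```python
# Two rows copied from the ps2pdf listing; a real run would pass in the
# captured output of "pdffonts file.pdf" instead.
SAMPLE = """\
name                                 type         emb sub uni object ID
------------------------------------ ------------ --- --- --- ---------
EOZSTF+Verdana.Regular.0.0.Set0 Type 1C yes yes no 13 0
YTBXNU+Nimbus_Roman_No9_L.Regular.0.0.Set0 Type 1C yes yes no 8 0
"""

def parse_pdffonts(text):
    """Parse pdffonts rows, assuming the layout used in this thread:
    name, type (one or more words), emb, sub, uni, object number, generation.
    Fields are read from the right because long names overflow their column
    and the type may contain spaces."""
    fonts = []
    for line in text.splitlines()[2:]:  # skip the header and the dashed rule
        tokens = line.split()
        if len(tokens) < 7:
            continue  # not a font row (blank line, shell noise, ...)
        fonts.append({
            "name": tokens[0],
            "type": " ".join(tokens[1:-5]),
            "emb": tokens[-5] == "yes",
            "sub": tokens[-4] == "yes",
            "uni": tokens[-3] == "yes",
        })
    return fonts

fonts = parse_pdffonts(SAMPLE)
missing = [f["name"] for f in fonts if not f["uni"]]
print(missing)
```

Any font that ends up in "missing" is a likely candidate for producing
gibberish on copy and paste, as with the t.html test above.
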