On Sun, 9 Apr 2006, Paul Smith wrote:
> I print to a file, file.ps, a web-page with text. Then, I apply ps2pdf
> and I get file.pdf. However, I cannot copy (from file.pdf) the text to
> a text editor. Can one get a pdf file with copyable text?
Does this work with a really trivial web page?
What does "pdffonts file.pdf" show?
If the pdf file stores the text as character strings (rather than as
images), then you stand a better chance of being able to cut and paste
from a pdf viewer to the editor, but you may run into encoding issues
that turn the pasted text into gibberish.
I get:
$ cat t.html
abc
Printing to ps from Firefox, converting to pdf, loading the result in
Adobe Reader, and cutting and pasting gives "^Y^Z^[", so the encoding
is a problem. Xpdf would not let me copy the text. The t.html.ps file
has:
8 dict begin
/FontName /Nimbus_Roman_No9_L.Regular.0.0.Set0 def
/FontType 1 def
/FontMatrix [ 0.001 0 0 0.001 0 0 ]readonly def
/PaintType 0 def
/FontBBox [-168 -281 1031 1098]readonly def
/Encoding [
/.notdef
/uni0066/uni0069/uni006C/uni0065/uni003A/uni002F/uni0068/uni006F
/uni006D/uni0067/uni0077/uni0074/uni0057/uni0073/uni002E/uni0031
/uni0020/uni0030/uni0034/uni0039/uni0032/uni0036/uni0041/uni004D
/uni0061/uni0062/uni0063/
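This is the 'abc' --> '^Y^Z^[' encoding: counting /.notdef as slot 0,
/uni0061, /uni0062 and /uni0063 (a, b, c) sit in slots 25, 26 and 27,
and those character codes are the control characters ^Y, ^Z and ^[.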
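
To make the slot arithmetic concrete, here is a small Python sketch. It
is only an illustration: the glyph names are copied from the excerpt
above (the array is cut off after /uni0063, so only slots 0..27 are
known), and the helper names glyph_to_char, encode, decode and caret are
made up for this example. It assumes every name of interest has the
"uniXXXX" form.

```python
# Glyph names copied from the /Encoding array in t.html.ps.
# The excerpt is truncated, so only slots 0..27 are listed here.
ENCODING = [
    ".notdef",
    "uni0066", "uni0069", "uni006C", "uni0065",
    "uni003A", "uni002F", "uni0068", "uni006F",
    "uni006D", "uni0067", "uni0077", "uni0074",
    "uni0057", "uni0073", "uni002E", "uni0031",
    "uni0020", "uni0030", "uni0034", "uni0039",
    "uni0032", "uni0036", "uni0041", "uni004D",
    "uni0061", "uni0062", "uni0063",
]


def glyph_to_char(name):
    """Turn a 'uniXXXX' glyph name into its character; None otherwise."""
    if name.startswith("uni") and len(name) == 7:
        return chr(int(name[3:], 16))
    return None


def encode(text):
    """Character codes the page content would carry for `text`."""
    slot = {glyph_to_char(n): i for i, n in enumerate(ENCODING)}
    return bytes(slot[ch] for ch in text)


def decode(codes):
    """Map codes back to text by way of the glyph names."""
    return "".join(glyph_to_char(ENCODING[c]) or "?" for c in codes)


def caret(codes):
    """Show control bytes in caret notation, e.g. 0x19 -> ^Y."""
    return "".join("^" + chr(c + 64) if c < 32 else chr(c) for c in codes)


raw = encode("abc")
print(list(raw))    # [25, 26, 27]
print(caret(raw))   # ^Y^Z^[
print(decode(raw))  # abc
```

A viewer that pastes the raw codes gives you the second line; one that
goes back through the glyph names could in principle give you the third.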
$ pdffonts t.html.pdf
name                                        type         emb sub uni object ID
------------------------------------------- ------------ --- --- --- ---------
YNAHAD+Nimbus_Roman_No9_L.Regular.0.0.Set0  Type 1C      yes yes no       9  0
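
The "uni" column says whether the font carries an explicit ToUnicode
map; "no" does not by itself mean the text cannot be recovered, but it
is a useful warning sign. As a rough sketch, the following Python picks
out such fonts from pdffonts output. It assumes the column layout shown
above (other pdffonts versions may print extra columns), assumes font
names contain no spaces, and the function name parse_pdffonts is
invented for this example.

```python
SAMPLE = """\
name type emb sub uni object ID
---------------------------- ------------ --- --- --- ---------
YNAHAD+Nimbus_Roman_No9_L.Regular.0.0.Set0
Type 1C yes yes no 9 0
"""


def parse_pdffonts(output):
    """Parse pdffonts-style text into a list of dicts, one per font.

    A long font name may push the remaining columns onto the next
    line, so incomplete rows are carried over and joined.
    """
    rows, carry = [], []
    for line in output.splitlines():
        stripped = line.strip()
        # Skip blank lines and the two header lines.
        if not stripped or stripped.startswith(("name ", "---")):
            continue
        tokens = carry + stripped.split()
        flags = tokens[-5:-2]  # emb, sub, uni
        complete = (
            len(tokens) >= 7
            and all(f in ("yes", "no") for f in flags)
            and tokens[-2].isdigit()
            and tokens[-1].isdigit()
        )
        if not complete:
            carry = tokens  # wait for the rest of the row
            continue
        carry = []
        rows.append({
            "name": tokens[0],
            "type": " ".join(tokens[1:-5]),
            "emb": flags[0],
            "sub": flags[1],
            "uni": flags[2],
        })
    return rows


for font in parse_pdffonts(SAMPLE):
    if font["uni"] == "no":
        print(font["name"], "has no ToUnicode map")
```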
If the pdf file uses images, you need to use an OCR tool to get the text.
I have seen cases where printing docs to PS on Win32 results in the
text being rasterized in the driver so the PS file has images. This may
happen with screen fonts and/or certain effects (transparency, text
outlines filled with colored patterns).
--
George N. White III <aa056@xxxxxxxxxxxxxx>