692450 – text not copyable but readable in the resulting pdf (ps2pdf)

Summary: text not copyable but readable in the resulting pdf (ps2pdf)

Status:	RESOLVED INVALID

Alias:	None

Product:	Ghostscript
Classification:	Unclassified
Component:	PDF Writer
Version:	master
Hardware:	Macintosh MacOS X

Importance:	P4 normal
Assignee:	Ken Sharp

Reported:	2011-08-20 15:17 UTC by pengyu.ut
Modified:	2011-08-22 13:26 UTC
CC List:	1 user

Attachments
the example ps file (348.46 KB, application/postscript) 2011-08-22 10:12 UTC, pengyu.ut

Description pengyu.ut 2011-08-20 15:17:30 UTC

Running ps2pdf on the attached PS file results in a PDF file where the text is not copyable (the copied text is gibberish). Since the text is still readable in a PDF reader, I'd think that there might be a way to make the text copyable. Does anybody know how to do it?

Comment 1 Ken Sharp 2011-08-22 06:52:50 UTC

(In reply to comment #0)
> Running ps2pdf on the attached PS file results in a PDF file where the text is
> not copyable (the copied text is gibberish). Since the text is still readable
> in a PDF reader, I'd think that there might be a way to make the text copyable.
> Does anybody know how to do it?

No sample PostScript file attached. 

The likelihood is that the incoming PostScript does not contain Unicode information, and the text is encoded with an encoding that is not ASCII-compatible. In that case it is impossible to create a PDF file where the text can be searched/copied in Acrobat.

But without a sample file I can't tell.

Comment 2 pengyu.ut 2011-08-22 10:12:32 UTC

Created attachment 7812 [details]
the example ps file

The text is not copyable in the PDF generated from the PS file by ps2pdf.

Comment 3 Ken Sharp 2011-08-22 10:28:08 UTC

(In reply to comment #2)
> Created an attachment (id=7812) [details]
> the example ps file
> 
> The text is not copyable in the PDF generated from the PS file by ps2pdf.

The fonts Calibri and Calibri-Bold (TrueType fonts) are embedded as subset CIDFonts. The embedded fonts do not have any Unicode information, so it's not possible to construct a ToUnicode CMap for these fonts.

Because they are CIDFonts we cannot use the Encoding or glyph names reliably, and in fact the glyph names are useless, being of the form '/c00' etc. Also, the Encoding starts at index 0, so it's not any kind of ASCII encoding.

In the absence of any information about the text, pdfwrite is unable to write any meaningful Unicode or other information to the file. As a result the text in these fonts (and only these fonts, the remainder of the text *is* copyable) cannot be searched/copied.
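
To illustrate why glyph names such as '/c00' are of no help: a Unicode value can only be derived from a glyph name when the name follows a known convention, for example a name from the Adobe Glyph List or the `uniXXXX` form. The sketch below is a simplified illustration only, not pdfwrite's actual code. The table `AGL_SUBSET` and the function `glyph_name_to_unicode` are hypothetical names, and the table holds just a handful of entries.

```python
import re

# Tiny, illustrative subset of the Adobe Glyph List (assumption: the real
# list has thousands of entries; only a few are reproduced here).
AGL_SUBSET = {
    "A": 0x0041,
    "a": 0x0061,
    "space": 0x0020,
    "period": 0x002E,
    "eacute": 0x00E9,
}

# The 'uniXXXX' naming convention: four uppercase hex digits.
_UNI_NAME = re.compile(r"uni([0-9A-F]{4})")


def glyph_name_to_unicode(name):
    """Return a code point for a glyph name, or None if the name is meaningless."""
    name = name.lstrip("/")
    if name in AGL_SUBSET:
        return AGL_SUBSET[name]
    match = _UNI_NAME.fullmatch(name)
    if match:
        return int(match.group(1), 16)
    # Names such as 'c00' are arbitrary labels; nothing can be inferred.
    return None


for glyph in ["/A", "/uni00E9", "/c00", "/c01"]:
    print(glyph, glyph_name_to_unicode(glyph))
```

With names like '/c00' every lookup falls through to `None`, which is why no ToUnicode information can be written for these fonts.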

Adobe Acrobat Distiller exhibits exactly the same behaviour.

This is not a Ghostscript bug, and there is no scope for improving this, given the content of the incoming PostScript, so this is not a potential enhancement either.

I suggest that if you want a PDF file, you save it as such from the creating application, Adobe InDesign CS3.

Comment 4 pengyu.ut 2011-08-22 10:39:27 UTC

I know that the following idea may not be feasible, but I just want to ask to make sure.

Since the Calibri and Calibri-Bold fonts are embedded in the PS file, is it possible to compare the embedded fonts to the system fonts (for example, by pattern matching the glyphs; after all, there are fewer than a few hundred glyphs in such fonts) to reverse engineer the correct Unicode?

Comment 5 Ken Sharp 2011-08-22 11:11:08 UTC

(In reply to comment #4)

> Since the Calibri and Calibri-Bold fonts are embedded in the PS file, is it
> possible to compare the embedded fonts to the system fonts (for example, by
> pattern matching the glyphs; after all, there are fewer than a few hundred
> glyphs in such fonts) to reverse engineer the correct Unicode?

Technically it's possible, but in practice it's untenable; it would be horribly slow.

Comment 6 pengyu.ut 2011-08-22 11:42:12 UTC

Please excuse my limited knowledge. But I'm wondering how you draw the conclusion "horribly slow". Considering that OCR can be done very fast nowadays, I guess comparing a few hundred glyphs to a few hundred other glyphs should be finished in a reasonable time if a proper pattern matching algorithm is used.

Then the only problem, as I see it, is whether there is a ready-to-use pattern matching tool for glyphs. A quick Google search suggests that FontForge may have such a capability (but I'm not sure).

Considering that there are messed-up PS files of this kind out there on the web, I think that it may be worthwhile to add this function to ps2pdf (or to a separate tool in Ghostscript).

Comment 7 Ken Sharp 2011-08-22 13:10:00 UTC

(In reply to comment #6)
> Please excuse my limited knowledge. But I'm wondering how you draw the
> conclusion "horribly slow".

Have you seen how many fonts are installed on the average Windows PC? And many of them are Far Eastern fonts with thousands of glyphs.

Also, there is no satisfactory cross-platform method for doing this.

> Considering that OCR can be done very fast
> nowadays, I guess comparing a few hundred glyphs to a few hundred other glyphs
> should be finished in a reasonable time if a proper pattern matching algorithm
> is used.

I wouldn't use OCR; I would hash the outline descriptions and compare hashes.
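
The hash-and-compare idea can be sketched roughly as follows. This is a minimal illustration under several assumptions, not anything Ghostscript implements: outlines are modelled as lists of `(operator, coordinates)` tuples, the reference font is a plain dictionary from character to outline, and the names `outline_hash`, `build_index` and `recover_unicode` are hypothetical. It also assumes the embedded subset reproduces the reference outlines exactly; any rescaling or rounding would change the hash and defeat the match.

```python
import hashlib


def outline_hash(outline):
    """Hash an outline given as a sequence of (operator, coords) tuples."""
    digest = hashlib.sha256()
    for operator, coords in outline:
        digest.update(operator.encode("ascii"))
        for value in coords:
            digest.update(b" %d" % value)
        digest.update(b";")
    return digest.hexdigest()


def build_index(reference_font):
    """Map outline hash -> character for every glyph in a reference font."""
    return {outline_hash(outline): char
            for char, outline in reference_font.items()}


def recover_unicode(embedded_glyphs, index):
    """Map each embedded glyph name to a character, or None if unmatched."""
    return {name: index.get(outline_hash(outline))
            for name, outline in embedded_glyphs.items()}


# Made-up outlines standing in for real font data.
reference = {
    "I": [("moveto", (0, 0)), ("lineto", (0, 700)), ("closepath", ())],
    "-": [("moveto", (50, 300)), ("lineto", (350, 300)), ("closepath", ())],
}
embedded = {
    "c00": [("moveto", (0, 0)), ("lineto", (0, 700)), ("closepath", ())],
    "c01": [("moveto", (10, 10)), ("lineto", (90, 90)), ("closepath", ())],
}

print(recover_unicode(embedded, build_index(reference)))
```

Each lookup is a single dictionary access, so the cost lies mostly in building the index over every glyph of every installed font, which is the performance concern raised above.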


> Considering that there are messed-up PS files of this kind out there on the
> web, I think that it may be worthwhile to add this function to ps2pdf (or to
> a separate tool in Ghostscript).

The kind of thing which irritates our customers is reduced performance, so I do not think we will be interested in attempting such an enhancement, especially since Adobe Acrobat Distiller (the de facto standard) behaves the same way. If you can find a PostScript to PDF conversion tool which can take this file and produce a PDF file which is searchable for the text in Calibri, I will be very impressed.

Comment 8 pengyu.ut 2011-08-22 13:26:44 UTC

My reference to OCR was not meant to suggest that OCR is the solution. My point is that OCR, which is a commonly used operation, should be slower than comparing glyphs, yet OCR is not that slow, so comparing glyphs should at least be faster than OCR.

Your main concern is speed, which, in my humble opinion, is not necessarily a good justification for not adding such a useful feature. After all, you can disable the feature by default, so an ordinary user will not experience the slowness. Only a user who wants the feature would turn on the command-line option for it, and at that point speed is not a problem for him/her.