Bug 692345 - Wrong coding of Polish diacritic letters "ż" and "Ż"
Summary: Wrong coding of Polish diacritic letters "ż" and "Ż"
Status: RESOLVED INVALID
Alias: None
Product: Bug Tracker
Classification: Unclassified
Component: General
Version: unspecified
Hardware: All All
Importance: P4 normal
Assignee: Default assignee
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2011-07-15 11:39 UTC by Gal
Modified: 2011-07-15 14:24 UTC (History)
0 users

See Also:
Customer:
Word Size: ---


Attachments
MS Word file (19.50 KB, application/msword)
2011-07-15 14:04 UTC, Gal
ODT file (7.67 KB, application/vnd.oasis.opendocument.text)
2011-07-15 14:04 UTC, Gal
Adobe Acrobat conversion of input file (25.55 KB, application/x-download)
2011-07-15 14:05 UTC, Gal
PDFCreator conversion of input file (10.17 KB, application/x-download)
2011-07-15 14:05 UTC, Gal

Description Gal 2011-07-15 11:39:23 UTC
I'm not 100% sure that the problem described below is caused by Ghostscript, but my analysis led me to this conclusion.

I'm a webmaster, so I have to care about how my content appears in Google's search results. While checking this, I discovered that in many search-result summaries for PDF files the Polish diacritic letter "ż" (z with a dot above) is replaced with the letter "Ŝ" (S with circumflex, which is not used in Polish). As a result, these documents cannot be found with correct search phrases such as "ważne"; they appear only for the phrase "waŜne" (there is no such word in Polish). You can see examples with the search "waŜne filetype:pdf" (http://www.google.pl/search?hl=pl&q=waŜne+filetype%3Apdf).

The same problem affects the letter "Ż" (the capital version of "ż"): it is shown as "ś" (s with acute accent, a letter that is used in Polish).

I analysed the problem, and it is most likely caused by an error in the free software used to generate/print PDF files; to be exact, an error in the Ghostscript library which these programs use. In support of this explanation, the problem does not occur when the PDF is generated using Adobe software.

The problem also occurs in other search engines' results, e.g. Bing.

Of course, when you open the file, the text is shown with the correct letters.
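
The two substitutions reported above can be compared by code point. The short Python sketch below only prints the Unicode values of the expected and the displayed letters and the distance between them. It shows that both wrong letters sit exactly 0x20 below the expected ones; that is an observation about the reported pairs, and nothing in this report establishes why the values were off. The names `PAIRS`, `describe` and `offsets` are illustrative.

```python
import unicodedata

# (expected letter, letter shown in the search results), as reported above.
# Escapes are used so the precomposed code points are unambiguous.
PAIRS = [
    ("\u017c", "\u015c"),  # z with dot above -> S with circumflex
    ("\u017b", "\u015b"),  # Z with dot above -> s with acute
]


def describe(ch):
    """Return the code point and the Unicode name of a single character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


def offsets(pairs):
    """Return the code point distance (expected minus shown) for each pair."""
    return [ord(expected) - ord(shown) for expected, shown in pairs]


for expected, shown in PAIRS:
    print(describe(expected), "shown as", describe(shown))

print("offsets:", [hex(d) for d in offsets(PAIRS)])
```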
Comment 1 Ken Sharp 2011-07-15 11:54:45 UTC
(In reply to comment #0)


> word in polish). You can see examples of this for this search "waŜne
> filetype:pdf" (http://www.google.pl/search?hl=pl&q=waŜne+filetype%3Apdf).

For me this URL returns a (presumably Polish) Google page. My Polish isn't up to reading what it says, but it looks very much like a 'returned no results' error.


> I analysed the problem and it is most likely caused by the error in the free
> software used to generate/print PDF files, to be exact error in the Ghostscript
> library which these programs use. To prove this explanation, I can say that
> this problem does not occur when the PDF is generated using Adobe software.
> 
> The problem occurs also in other search engines results e.g. Bing.
> 
> Of course when you open file the text is shown with the correct letters.

The point of PDF is to display the content in a portable fashion. It is not, however, a text format. The text encoding need not be ASCII, or anything like it. If the document displays correctly then it is probably correct.

If you would like to supply an example input file, and a command line used to convert it, then I will look at the conversion and confirm if there is a problem. Without that I cannot investigate the issue. However it seems likely to me that there is no Ghostscript bug here.
Comment 2 Gal 2011-07-15 13:13:46 UTC
I get lots of search results:
http://www.google.com/search?q=waŜne
http://www.bing.com/search?q=waŝne&FORM=RCRE

In all of these result summaries the word should be shown as "ważne", not "waŜne".

>However it seems likely to me that there is no Ghostscript bug here.
Maybe the problem lies in how those free programs use Ghostscript's output.
Comment 3 Ken Sharp 2011-07-15 13:40:47 UTC
(In reply to comment #2)

> All those words in results summaries should be shown as "ważne" not "waŜne".

It looks like Google is indexing the result of the ToUnicode CMap conversion. In one of the search engine matches, the glyph in question has the character code 0x01, and the ToUnicode CMap maps this to U+015C, which is indeed an S with circumflex.
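
To illustrate the lookup described above, here is a minimal Python sketch that reads the `bfchar` entries of a ToUnicode CMap and reports which Unicode text each character code maps to. The CMap fragment is hypothetical: it was written for this example to mirror the 0x01 to U+015C mapping and was not extracted from any file in this report. The sketch assumes the CMap stream has already been decompressed to text and ignores `bfrange` entries. `SAMPLE_CMAP` and `parse_bfchar` are illustrative names.

```python
import re

# Hypothetical ToUnicode CMap fragment, written for this example only.
SAMPLE_CMAP = """\
/CIDInit /ProcSet findresource begin
12 dict begin
begincmap
1 begincodespacerange
<00> <FF>
endcodespacerange
2 beginbfchar
<01> <015C>
<02> <0061>
endbfchar
endcmap
"""

BFCHAR_BLOCK = re.compile(r"beginbfchar(.*?)endbfchar", re.S)
BFCHAR_ENTRY = re.compile(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>")


def parse_bfchar(cmap_text):
    """Return {character code: Unicode string} for every bfchar entry."""
    mapping = {}
    for block in BFCHAR_BLOCK.findall(cmap_text):
        for src, dst in BFCHAR_ENTRY.findall(block):
            # The destination is hex-encoded UTF-16BE text.
            mapping[int(src, 16)] = bytes.fromhex(dst).decode("utf-16-be")
    return mapping


for code, text in sorted(parse_bfchar(SAMPLE_CMAP).items()):
    points = " ".join(f"U+{ord(ch):04X}" for ch in text)
    print(f"0x{code:02X} -> {points}")
```

A text extractor that trusts this table will emit U+015C for the glyph with code 0x01, regardless of what the glyph looks like on screen.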


> >However it seems likely to me that there is no Ghostscript bug here.
> Maybe the problem is how those free programs use Ghostscript results.

I believe this problem was resolved some time ago, and I notice that the PDF file in the case I checked was produced by version 8.61 which is three and a half years old.

This is why pointing to search engine results is not helpful.

If you can give me an input file and a command line for Ghostscript which results in incorrect output, or as in this case incorrect ToUnicode values, then please reopen this issue.

Failing that, I believe this issue has been resolved; any remaining issues you see with search engine results are caused by old versions of Ghostscript having been used to create the PDFs, so I am closing this as INVALID.
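
Checking which Ghostscript version produced a given PDF, as was done above for the 8.61 file, can be sketched in Python as follows. The sketch only parses a Producer string that has already been read from the PDF's document information. The sample strings are assumptions about what such a field typically looks like, and `ghostscript_version` is an illustrative helper. No cut-off version is hard-coded, because this report does not state in which release the ToUnicode problem was fixed.

```python
import re

PRODUCER_RE = re.compile(r"Ghostscript\s+(\d+)\.(\d+)")


def ghostscript_version(producer):
    """Return (major, minor) if the Producer string names Ghostscript, else None."""
    match = PRODUCER_RE.search(producer)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


# Assumed example values of the Producer field.
for producer in ("GPL Ghostscript 8.61", "Acrobat Distiller 9.0 (Windows)"):
    print(producer, "->", ghostscript_version(producer))
```

The returned tuples compare naturally, so a caller can test them against whatever minimum version they consider safe.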
Comment 4 Gal 2011-07-15 13:55:28 UTC
I can't reproduce the problem, because the source files used to generate the PDFs are MS Word or OpenOffice documents. I tested it on a PC (Windows XP) with PDFCreator and a few other free PDF converters.

Can you check whether it is a Ghostscript problem using an MS Word or ODT file (input_file.doc, input_file.odt)?

I have also included the results of PDF conversion from the MS Word input file, using PDFCreator and Adobe Acrobat. Both look the same, but when you copy the text into a text editor (e.g. Notepad) you will see the difference.

Do you know what these free PDF converters use to generate input for Ghostscript? Maybe some other open source software is the cause of this issue.
Comment 5 Ken Sharp 2011-07-15 14:02:19 UTC
(In reply to comment #4)

> Can you check if it is a Ghostscript problem using MS Word or ODT file
> (input_file.doc, input_file.odt)?

No, we need an input file which Ghostscript can use, either PostScript or PDF.

 
> I also include results of PDF conversion form MS word input file, using
> PDFCreator and Adobe Acrobat. Both look the same but when you copy text to text
> editor (e.g. Notepad) you will see the difference.

There are no attachments.

 
> Do you know what is used by this free PDF converters to generate input for
> Ghostscript? Maybe some other open source software which is cause of this
> issue.

PDFCreator uses the Windows print subsystem.

As I said in comment #3, there *was* a bug relating to generation of ToUnicode CMaps for certain TrueType fonts. This was fixed some time back. If you want to know exactly when, you will have to search this database, as I no longer remember. PDF files created using an old version of Ghostscript were incorrect (and anyone who still uses the old versions will still generate incorrect ToUnicode CMaps). Given that it's unlikely that the owners will regenerate these old files using a newer version of Ghostscript, there is nothing we can do about it.
Comment 6 Gal 2011-07-15 14:03:07 UTC
Yes, you are right. This issue is resolved.
I've installed the new PDFCreator (it uses Ghostscript 9) and the results are OK.

Thanks for the help!
Comment 7 Gal 2011-07-15 14:04:32 UTC
Created attachment 7671
MS Word file
Comment 8 Gal 2011-07-15 14:04:50 UTC
Created attachment 7672
ODT file
Comment 9 Gal 2011-07-15 14:05:28 UTC
Created attachment 7673
Adobe Acrobat conversion of input file
Comment 10 Gal 2011-07-15 14:05:56 UTC
Created attachment 7674
PDFCreator conversion of input file
Comment 11 Gal 2011-07-15 14:17:28 UTC
Sorry, one more question: is it possible to somehow fix a PDF that has this problem, using Ghostscript or other software?
Comment 12 Ken Sharp 2011-07-15 14:24:12 UTC
(In reply to comment #11)
> Sorry, one more question, is it possible to somehow fix PDF with this problem,
> using Ghostscript or other software?

Edit the ToUnicode CMap. Other than that, no.
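
As a rough illustration of what such an edit involves, the Python sketch below rewrites the destination of one `bfchar` entry in CMap text. It assumes the ToUnicode stream has already been extracted and decompressed. Writing it back into a PDF (recompressing the stream, updating its length and the cross-reference table) is not covered, and the fix has to be repeated for each affected font. The fragment and the names `SAMPLE_BFCHAR` and `set_bfchar` are hypothetical. The entry is selected by character code rather than by its current Unicode value, since "ś" is also a legitimate Polish letter and a blanket substitution could damage correct text.

```python
import re

# Hypothetical bfchar section, written for this example only.
SAMPLE_BFCHAR = """\
2 beginbfchar
<01> <015C>
<02> <0061>
endbfchar
"""

BFCHAR_BLOCK = re.compile(r"(beginbfchar)(.*?)(endbfchar)", re.S)
ENTRY = re.compile(r"<([0-9A-Fa-f]+)>(\s*)<([0-9A-Fa-f]+)>")


def set_bfchar(cmap_text, code, new_text):
    """Point the bfchar entry for `code` at `new_text`; leave other entries alone."""
    new_hex = new_text.encode("utf-16-be").hex().upper()

    def fix_entry(entry):
        src, gap, dst = entry.groups()
        if int(src, 16) == code:
            dst = new_hex
        return f"<{src}>{gap}<{dst}>"

    def fix_block(block):
        # Only touch text between beginbfchar and endbfchar.
        return block.group(1) + ENTRY.sub(fix_entry, block.group(2)) + block.group(3)

    return BFCHAR_BLOCK.sub(fix_block, cmap_text)


# Remap character code 0x01 to U+017C (z with dot above).
print(set_bfchar(SAMPLE_BFCHAR, 0x01, "\u017c"))
```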