Bug 690648 - many incorrectly encoded Latin Extended-A characters
Status: RESOLVED FIXED
Alias: None
Product: Ghostscript
Classification: Unclassified
Component: PDF Writer
Version: 9.00
Hardware: PC Windows XP
Importance: P5 major
Assignee: Ken Sharp
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2009-07-23 11:20 UTC by Robert
Modified: 2010-11-15 18:00 UTC

See Also:
Customer:
Word Size: ---


Attachments
test1.zip (181.06 KB, application/zip)
2009-07-24 01:48 UTC, Robert
test.zip (96.39 KB, application/zip)
2009-09-11 03:51 UTC, Robert

Description Robert 2009-07-23 11:20:36 UTC
This is the procedure that demonstrates the problem:

(1) Open http://en.wikipedia.org/wiki/Latin_Extended-A_unicode_block
and print it to a postscript file (I used the HP Universal Print Driver
ver. 4.7 installed in Windows XP Pro. SP3).

(2) Convert the PS file to a PDF file using the following command:
ps2pdf14 test1.ps test1.pdf

(3) Open the PDF file in a PDF reader (I've tried Adobe Reader 8),
select the whole text and copy it to a text processing application
(I've tried the OpenOffice.org Writer 3.1 and MS Word 2003 SP3) as
unformatted text.

The problem is that the following letters are not shown correctly in
the pasted text (although the Adobe Reader displays and prints them
correctly):

U+010A Ċ Latin Capital Letter C with dot above
U+010B ċ Latin Small Letter C with dot above
U+0110 Đ Latin Capital Letter D with stroke
U+0111 đ Latin Small Letter D with stroke
U+0116 Ė Latin Capital Letter E with dot above
U+0117 ė Latin Small Letter E with dot above
U+0120 Ġ Latin Capital Letter G with dot above
U+0121 ġ Latin Small Letter G with dot above
U+0122 Ģ Latin Capital Letter G with cedilla
U+0123 ģ Latin Small Letter G with cedilla
U+0130 İ Latin Capital Letter I with dot above
U+0136 Ķ Latin Capital Letter K with cedilla
U+0137 ķ Latin Small Letter K with cedilla
U+013B Ļ Latin Capital Letter L with cedilla
U+013C ļ Latin Small Letter L with cedilla
U+0145 Ņ Latin Capital Letter N with cedilla
U+0146 ņ Latin Small Letter N with cedilla
U+0150 Ő Latin Capital Letter O with double acute
U+0151 ő Latin Small Letter O with double acute
U+0156 Ŗ Latin Capital Letter R with cedilla
U+0157 ŗ Latin Small Letter R with cedilla
U+015E Ş Latin Capital Letter S with cedilla
U+015F ş Latin Small Letter S with cedilla
U+0162 Ţ Latin Capital Letter T with cedilla
U+0163 ţ Latin Small Letter T with cedilla
U+0170 Ű Latin Capital Letter U with double acute
U+0171 ű Latin Small Letter U with double acute
U+017B Ż Latin Capital Letter Z with dot above
U+017C ż Latin Small Letter Z with dot above

I'd say that the letters get encoded incorrectly in the PDF file.

-- rpr.
Comment 1 Alex Cherepanov 2009-07-23 12:20:02 UTC
Please attach PS files generated by your "HP Universal Print Driver ver. 4.7
installed in Windows XP Pro. SP". Many Ghostscript developers use Mac, and some
prefer GNU/Linux.
Comment 2 Robert 2009-07-24 01:48:28 UTC
Created attachment 5238 [details]
test1.zip

The ZIP file contains:
- the PS file I produced using the HP Universal Print Driver ver. 4.7
  on a MS Windows XP Pro. SP3.
- the PDF file created using the following command:
ps2pdf14 test1.ps test1.pdf

-- rpr.
Comment 3 Ken Sharp 2009-07-24 07:23:15 UTC
There is no Unicode information in the job, so we are forced to fall back to
attempting to construct the information based on the glyph names. To do this we
use the /Unicode /Decoding resource.

In there I see that:

16#00EA LC300000 Cdot ecircumflex 
16#00EB LC290000 cdot edieresis 

So Cdot is returned as Unicode code point U+00EA, which is actually
ecircumflex, and that is indeed what the character pastes as. The Unicode code
point for Cdot should be U+010A, and at that position in the Decoding resource I see:

16#010A LI620000 Cdotaccent 
16#010B cdotaccent

So if the glyph had been called Cdotaccent it would work correctly.

This seems fairly clearly wrong to me: Cdot should have the same Unicode code
point as Cdotaccent, I would think, and certainly should not have the same
value as ecircumflex.

If I change the Decoding resource then the character does in fact copy and paste
as expected.
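
The lookup described above can be sketched roughly as follows. This is only an illustration in Python, not Ghostscript's own code. It assumes the Decoding entries have the shape quoted in this comment (a 16#XXXX code point followed by names), and it assumes that tokens such as LC300000 are identifiers rather than glyph names; both assumptions come from reading the quoted lines, not from the resource format's documentation.

```python
import re

# Lines quoted in comment 3, as they stood before the fix.
DECODING_LINES = """\
16#00EA LC300000 Cdot ecircumflex
16#00EB LC290000 cdot edieresis
16#010A LI620000 Cdotaccent
16#010B cdotaccent
"""

# Assumption: tokens like LC300000 are identifiers, not glyph names.
ID_TOKEN = re.compile(r"^[A-Z]{2}\d{6}$")


def glyph_to_unicode(text):
    """Map each glyph name to the code point that starts its line."""
    mapping = {}
    for line in text.splitlines():
        tokens = line.split()
        if not tokens or not tokens[0].startswith("16#"):
            continue
        code_point = int(tokens[0][3:], 16)
        for name in tokens[1:]:
            if not ID_TOKEN.match(name):
                mapping[name] = code_point
    return mapping


table = glyph_to_unicode(DECODING_LINES)
print(f"Cdot -> U+{table['Cdot']:04X}")              # U+00EA, the ecircumflex slot
print(f"Cdotaccent -> U+{table['Cdotaccent']:04X}")  # U+010A, the intended value
```

With the pre-fix lines, a glyph named Cdot resolves to the ecircumflex code point, which matches what the pasted text shows.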

Fixed in revision 9887, patch here:

http://ghostscript.com/pipermail/gs-cvs/2009-July/009585.html
Comment 4 Robert 2009-09-11 03:18:04 UTC
Why is this bug still present in ver. 8.70 which was compiled on 2009-07-31?
Comment 5 Ken Sharp 2009-09-11 03:33:39 UTC
As far as I can tell it is not present in the current version, the test file
works correctly for me. 

However, this was a fix to external files rather than C code. Is it possible
that you are using existing Resources, stored on disk, from a previous version
of Ghostscript? Did you build GS yourself or download a built binary?

Try invoking GS using the -I switch to specify the 8.70 Resource directory. Also
check the file gs/Resource/Decoding/Unicode; it should contain these lines:

16#010A LI620000 Cdot Cdotaccent 
16#010B cdot cdotaccent

If it instead contains this:

16#010A LI620000 Cdotaccent 
16#010B cdotaccent

Then it is incorrect, the current source contains the correct definitions.
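
The check above could be scripted along these lines. It is a minimal sketch in Python that assumes the resource text uses the line shape quoted in this comment; the helper names names_for and has_cdot_fix are made up for illustration and are not part of Ghostscript.

```python
def names_for(text, code_point):
    """Return the tokens that follow the given 16#XXXX entry, or [] if absent."""
    wanted = f"16#{code_point:04X}"
    for line in text.splitlines():
        tokens = line.split()
        if tokens and tokens[0].upper() == wanted:
            return tokens[1:]
    return []


def has_cdot_fix(text):
    """True when Cdot and cdot are listed on the U+010A and U+010B entries."""
    return "Cdot" in names_for(text, 0x010A) and "cdot" in names_for(text, 0x010B)


# The two variants quoted in comment 5.
FIXED = "16#010A LI620000 Cdot Cdotaccent\n16#010B cdot cdotaccent\n"
STALE = "16#010A LI620000 Cdotaccent\n16#010B cdotaccent\n"

print(has_cdot_fix(FIXED))  # True
print(has_cdot_fix(STALE))  # False
```

To test an installed copy, read the contents of gs/Resource/Decoding/Unicode into a string and pass it to has_cdot_fix.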
Comment 6 Robert 2009-09-11 03:51:55 UTC
Created attachment 5370 [details]
test.zip

Actually, the problem still exists with only one letter:
U+0111 đ Latin Small Letter D with stroke

I'm attaching the test.zip file that contains:
test1.ps
test2.ps
- PS files created using the HP Universal Print Driver ver. 4.7 and ver. 5.0
respectively on a MS Windows XP Pro. SP3.
test1.pdf
test2.pdf
- PDF files created using the following commands:
ps2pdf14 test1.ps test1.pdf
ps2pdf14 test2.ps test2.pdf

-- rpr.
Comment 7 Ken Sharp 2010-09-17 11:59:18 UTC
The glyph /dslash is incorrectly named as /dmacron in the PostScript file:

/TTE1A75770t00 findfont /CharStrings get begin
/dmacron 255 def
end
/TTE1A75770t00 findfont /Encoding get
dup 72 /dmacron put


Since the Decoding relies on the glyph name in order to get the correct Unicode code point for the ToUnicode CMap, an incorrect glyph name will lead to an incorrect Unicode code point, and therefore incorrect cut and paste.

Renaming the glyph in the embedded font:

/TTE1A75770t00 findfont /CharStrings get begin
/dslash 255 def
end
/TTE1A75770t00 findfont /Encoding get
dup 72 /dslash put

generates a PDF file with a correct ToUnicode CMap and the lower case d with slash copies and pastes correctly. 

Therefore the remaining problem is with the generation of the PostScript (and possibly with the original TrueType font which may contain an incorrect PostScript name table).
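
To illustrate why the glyph name matters, here is a simplified Python sketch of how ToUnicode bfchar entries might be derived from an Encoding and a name table. It is not the pdfwrite implementation. The two-entry name table is a stand-in for the real Decoding resource, and what Ghostscript actually emits for an unrecognized name is not shown in this thread; the sketch simply omits the entry.

```python
# Stand-in name table; the real Decoding resource is far larger.
NAME_TO_UNICODE = {"dslash": 0x0111, "dcroat": 0x0111}


def bfchar_entries(encoding, name_to_unicode):
    """Build ToUnicode bfchar lines for a one-byte encoding.

    encoding maps character codes to glyph names. Names missing from the
    table get no entry in this sketch.
    """
    lines = []
    for code in sorted(encoding):
        code_point = name_to_unicode.get(encoding[code])
        if code_point is not None:
            lines.append(f"<{code:02X}> <{code_point:04X}>")
    return lines


# Character code 72 as in the PostScript fragment above.
print(bfchar_entries({72: "dmacron"}, NAME_TO_UNICODE))  # []
print(bfchar_entries({72: "dslash"}, NAME_TO_UNICODE))   # ['<48> <0111>']
```

Only the renamed glyph yields a mapping from character code 0x48 to U+0111, which is what a PDF reader needs in order to copy the text correctly.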

This issue was resolved with revision 9887.
Comment 8 Robert 2010-11-14 20:00:20 UTC
Ken, thank you for your work on this issue.

I've done a few additional tests on this issue and it seems you were right.

On two MS Windows XP SP3 systems I printed a simple text containing letter đ (Latin Small Letter D with stroke) in DejaVu Serif font (v. 2.32) to a file using various PS printers. The created PS files always contained /dmacron instead of /dslash.

All the PS printers used one of the following versions of Microsoft's PostScript Printer Driver (PSCRIPT5.DLL):
ver. 6.0.6001.22127 (vistasp1_ldr.080302-0124)
ver. 6.1.7600.16385 (win7_rtm.090713-1255)

As the PSCRIPT5.DLL file contains an instance of the ASCII string "dmacron", I tried changing it to "dslash" (using a hex editor) and then copied the modified PSCRIPT5.DLL to the printer drivers directory ("C:\WINDOWS\system32\spool\drivers\w32x86\3").

After restarting Windows I ran the same tests again, and the generated PS files contained the /dslash glyph.

But if the text was in one of the common Windows fonts, such as Arial, Times New Roman or Courier New, the generated PS files still contained the /dmacron glyph.

So I also edited some of the font files and replaced the ASCII string "dmacron"
with "dslash". Then, in the Control Panel, I replaced the original fonts with the
modified ones. After that, printing the letter đ in the respective fonts generated PS
files that contained the /dslash glyph.

Finally, I'd like to ask: is /dmacron really an incorrect name for the glyph? The http://en.wikipedia.org/wiki/D_with_stroke page says that in PostScript the glyph can be encoded as dcroat, dmacron and dslash. If that is correct, could all of these glyph names be supported in GhostScript?

-- rpr.
Comment 9 Ken Sharp 2010-11-15 08:18:41 UTC
(In reply to comment #8)
 
> At the end I'd like to ask is the /dmacron really an incorrect name of the
> glyph? The http://en.wikipedia.org/wiki/D_with_stroke page says that in
> PostScript the glyph can be encoded as dcroat, dmacron and dslash. If it is
> correct, then could all the glyph names be implemented in GhostScript?

If you can find me an Adobe document which says this I will happily change the glyph; I'm afraid I'm not prepared to treat a Wikipedia entry as gospel. If you look at:

http://en.wikipedia.org/wiki/Macron

You will see that a macron is either above the glyph or below it, not part of it like the dslash. You can see an example of a dmacron on that page.

The Adobe Glyph List does not contain an entry for Dmacron, although it does have the same encoding entry for Dcroat and Dslash. It does have an entry for dmacron and dcroat which are the same, but no entry for dslash. 

This is the only Adobe document I can find which mentions these glyphs, and there is no case where the d/D slash and macron are defined the same. The fact that a third glyph can be approximated one way when upper case, and another way when lower case, is not conclusive in my opinion.

Of course there is nothing to stop you defining this yourself, particularly since this is a PostScript resource. You can either alter the Decoding resource and rebuild Ghostscript, or simply use the -I switch to point Ghostscript at a modified set of resources on disk instead of the built-in versions. This will allow you to have a version of Resource/Decoding/Unicode which is different to the one we ship.
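
As a rough sketch of the first option, the Python helper below appends an extra glyph name to the entry for a given code point. It assumes the line shape quoted earlier in this thread (16#XXXX followed by names). The sample 16#0111 line is hypothetical, since the thread does not quote the real one, and add_alias is a made-up helper name.

```python
def add_alias(text, code_point, alias):
    """Append alias to the line for code_point if it is not already listed."""
    wanted = f"16#{code_point:04X}"
    out = []
    for line in text.splitlines():
        tokens = line.split()
        if tokens and tokens[0].upper() == wanted and alias not in tokens[1:]:
            line = line.rstrip() + " " + alias
        out.append(line)
    return "\n".join(out) + "\n"


# Hypothetical entry: the real 16#0111 line is not quoted in this thread.
SAMPLE = "16#0111 dcroat dslash\n"
print(add_alias(SAMPLE, 0x0111, "dmacron"), end="")  # 16#0111 dcroat dslash dmacron
```

If the same name is already listed on another entry, as Cdot was on 16#00EA, that entry would presumably need editing as well, so check a modified copy before pointing -I at it.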
Comment 10 Robert 2010-11-15 10:21:47 UTC
(In reply to comment #9)
> If you can find me an Adobe document which says this I will happily change the
> glyph, I'm afraid I'm not prepared to treat a Wikipedia entry as gospel.

http://en.wikipedia.org/wiki/Adobe_Glyph_List references two Adobe documents:
(1) Adobe Glyph List - http://partners.adobe.com/public/developer/en/opentype/glyphlist.txt
(2) Adobe Glyph List For New Fonts - http://partners.adobe.com/public/developer/en/opentype/aglfn.txt

Regarding LATIN CAPITAL/SMALL LETTER D WITH STROKE the lists define the following:
(1):
Dcroat;0110
Dslash;0110
dcroat;0111
dmacron;0111

(2):
0110;Dcroat;LATIN CAPITAL LETTER D WITH STROKE
0111;dcroat;LATIN SMALL LETTER D WITH STROKE

According to my tests the GhostScript 9.00 PDF Writer recognizes the following glyph names:
Dcroat, Dslash, dcroat and dslash 
which means that it does not follow strictly either (1) or (2).

Also, I'd say that Microsoft's PostScript Printer Driver (PSCRIPT5.DLL) uses definitions from (1).
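
The comparison above can be reproduced with a short Python sketch. It parses only the four glyphlist.txt entries quoted in this comment, not the full list, and the set of recognized names is the one reported from the tests with 9.00.

```python
# Entries quoted above from glyphlist.txt, in "name;hex code point" form.
AGL_LINES = """\
Dcroat;0110
Dslash;0110
dcroat;0111
dmacron;0111
"""


def parse_glyph_list(text):
    """Parse 'name;hex' lines, skipping blank lines and '#' comments."""
    table = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, _, hex_values = line.partition(";")
        # An entry may list several space-separated code points.
        table[name] = [int(h, 16) for h in hex_values.split()]
    return table


agl = parse_glyph_list(AGL_LINES)
# Names that the tests above found the 9.00 PDF Writer to recognize.
recognized = {"Dcroat", "Dslash", "dcroat", "dslash"}

print(sorted(set(agl) - recognized))  # ['dmacron']
print(sorted(recognized - set(agl)))  # ['dslash']
```

So, for these entries, dmacron is in list (1) but not recognized, while dslash is recognized but not in list (1).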

-- rpr.
Comment 11 Ken Sharp 2010-11-15 10:55:43 UTC
(In reply to comment #10)
> (In reply to comment #9)
> > If you can find me an Adobe document which says this I will happily change the
> > glyph, I'm afraid I'm not prepared to treat a Wikipedia entry as gospel.
> 
> http://en.wikipedia.org/wiki/Adobe_Glyph_List references two Adobe documents:
> (1) Adobe Glyph List -
> http://partners.adobe.com/public/developer/en/opentype/glyphlist.txt

This is the document I referenced in comment #9, it does not state that Dmacron and dmacron are the same as Dslash and dslash.

> (2) Adobe Glyph List For New Fonts -
> http://partners.adobe.com/public/developer/en/opentype/aglfn.txt

This one doesn't reference either Dmacron or dmacron at all as far as I can see.

> 
> Regarding LATIN CAPITAL/SMALL LETTER D WITH STROKE the lists define the
> following:
> (1):
> Dcroat;0110
> Dslash;0110
> dcroat;0111
> dmacron;0111

Yep, so lower case dcroat is the same as dmacron and upper case Dcroat is the same as Dslash. Doesn't say anything about the relationship between Dslash and Dmacron or dslash and dmacron.

 
> (2):
> 0110;Dcroat;LATIN CAPITAL LETTER D WITH STROKE
> 0111;dcroat;LATIN SMALL LETTER D WITH STROKE

Yes, but this isn't referencing Dmacron or dmacron at all. All this says is that Dcroat and dcroat are (in effect) the same as Dslash and dslash. Which contradicts the earlier document, but that's OK because this is for newer fonts.
Comment 12 Robert 2010-11-15 11:39:54 UTC
AFAIK, the GhostScript PDF Writer should be able to map glyph names to Unicode code points. If it uses the glyph names from the Adobe Glyph List, it should recognize dmacron and map it to U+0111 (LATIN SMALL LETTER D WITH STROKE).

In comment #7 above you insisted that dmacron is an incorrect glyph name for LATIN SMALL LETTER D WITH STROKE and that it should be dslash. But dslash is not defined in the Adobe Glyph List.

I conclude that the GhostScript PDF Writer should be fixed so that it recognizes dmacron and maps it to U+0111.

On the other hand, for LATIN CAPITAL LETTER D WITH STROKE (U+0110) the Adobe Glyph List defines Dcroat and Dslash. Both of them are correctly recognized by the GhostScript PDF Writer.

-- rpr.
Comment 13 James Cloos 2010-11-15 18:00:21 UTC
FWIW, I just tried grep(1)ing every .afm file I could find (or, more precisely, which locate(1) could find) for '[dD]macron' and looked at the corresponding .pf[ab] files.

In each case the dmacron and Dmacron glyphs were what one would expect for U+0111 LATIN SMALL LETTER D WITH STROKE and U+0110 LATIN CAPITAL LETTER D WITH STROKE.

So, even if Adobe doesn’t specify it as so, font designers have used the Dmacron and dmacron names for the glyphs which look like Đ and đ rather than like D̄ and d̄.
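
For anyone who wants to repeat the search without grep, a rough Python equivalent is sketched below. It looks for the glyph-name field (N name) in AFM character metric lines. The sample fragment is made up for illustration and its metrics are not taken from any real font; judging what the glyphs look like still means opening the matching .pf[ab] file.

```python
import re

# Matches the "N <name> ;" field of an AFM character metrics line.
GLYPH_NAME = re.compile(r";\s*N\s+([dD]macron)\s*;")


def find_macron_glyphs(afm_text):
    """Return the Dmacron/dmacron glyph names declared in AFM character metrics."""
    return GLYPH_NAME.findall(afm_text)


# Made-up AFM fragment; the numbers are placeholders, not real metrics.
SAMPLE_AFM = """\
C -1 ; WX 600 ; N Dmacron ; B 0 0 600 700 ;
C -1 ; WX 550 ; N dmacron ; B 0 0 550 720 ;
C 100 ; WX 550 ; N d ; B 0 0 550 720 ;
"""

print(find_macron_glyphs(SAMPLE_AFM))  # ['Dmacron', 'dmacron']
```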