This is the procedure that demonstrates the problem:

(1) Open http://en.wikipedia.org/wiki/Latin_Extended-A_unicode_block and print it to a PostScript file (I used the HP Universal Print Driver ver. 4.7 installed in Windows XP Pro. SP3).
(2) Convert the PS file to a PDF file using the following command:
    ps2pdf14 test1.ps test1.pdf
(3) Open the PDF file in a PDF reader (I've tried Adobe Reader 8), select the whole text and copy it to a text processing application (I've tried OpenOffice.org Writer 3.1 and MS Word 2003 SP3) as unformatted text.

The problem is that the following letters are not shown correctly in the pasted text (although Adobe Reader displays and prints them correctly):

U+010A Ċ Latin Capital Letter C with dot above
U+010B ċ Latin Small Letter C with dot above
U+0110 Đ Latin Capital Letter D with stroke
U+0111 đ Latin Small Letter D with stroke
U+0116 Ė Latin Capital Letter E with dot above
U+0117 ė Latin Small Letter E with dot above
U+0120 Ġ Latin Capital Letter G with dot above
U+0121 ġ Latin Small Letter G with dot above
U+0122 Ģ Latin Capital Letter G with cedilla
U+0123 ģ Latin Small Letter G with cedilla
U+0130 İ Latin Capital Letter I with dot above
U+0136 Ķ Latin Capital Letter K with cedilla
U+0137 ķ Latin Small Letter K with cedilla
U+013B Ļ Latin Capital Letter L with cedilla
U+013C ļ Latin Small Letter L with cedilla
U+0145 Ņ Latin Capital Letter N with cedilla
U+0146 ņ Latin Small Letter N with cedilla
U+0150 Ő Latin Capital Letter O with double acute
U+0151 ő Latin Small Letter O with double acute
U+0156 Ŗ Latin Capital Letter R with cedilla
U+0157 ŗ Latin Small Letter R with cedilla
U+015E Ş Latin Capital Letter S with cedilla
U+015F ş Latin Small Letter S with cedilla
U+0162 Ţ Latin Capital Letter T with cedilla
U+0163 ţ Latin Small Letter T with cedilla
U+0170 Ű Latin Capital Letter U with double acute
U+0171 ű Latin Small Letter U with double acute
U+017B Ż Latin Capital Letter Z with dot above
U+017C ż Latin Small Letter Z with dot above

I'd say that the letters get encoded incorrectly in the PDF file.

-- rpr.
Please attach the PS files generated by your "HP Universal Print Driver ver. 4.7 installed in Windows XP Pro. SP3". Many Ghostscript developers use Mac, and some prefer GNU/Linux.
Created attachment 5238: test1.zip

The ZIP file contains:
- the PS file I produced using the HP Universal Print Driver ver. 4.7 on MS Windows XP Pro. SP3.
- the PDF file created using the following command:
    ps2pdf14 test1.ps test1.pdf

-- rpr.
There is no Unicode information in the job, so we are forced to fall back to constructing it from the glyph names. To do this we use the /Unicode /Decoding resource. In there I see:

    16#00EA LC300000 Cdot ecircumflex
    16#00EB LC290000 cdot edieresis

So Cdot is returned as Unicode code point U+00EA, which is actually ecircumflex, and that is indeed what the character pastes as. The Unicode code point for Cdot should be U+010A, and at that position in the Decoding resource I see:

    16#010A LI620000 Cdotaccent
    16#010B cdotaccent

So if the glyph had been called Cdotaccent it would work correctly. This seems fairly clearly wrong to me: Cdot should have the same Unicode code point as Cdotaccent, I would think, and certainly should not have the same value as ecircumflex. If I change the Decoding resource then the character does in fact copy and paste as expected.

Fixed in revision 9887, patch here: http://ghostscript.com/pipermail/gs-cvs/2009-July/009585.html
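To make the lookup described above concrete, here is a minimal Python sketch. It is not Ghostscript code. It assumes the line shape seen in the lines quoted in this comment (a 16#hex code point, an optional eight-character ID such as LI620000, then one or more glyph names); the real resource file may be laid out differently. The AFTER_FIX sample is the corrected pair of lines a developer expects to find in a fixed resource, not text taken from the patch itself, and the function and variable names are made up for illustration.

```python
import re

# Assumption: an ID token looks like two capitals plus six digits (e.g. LI620000).
_ID_TOKEN = re.compile(r"^[A-Z]{2}\d{6}$")


def parse_decoding_lines(text):
    """Return {glyph_name: [code points]} for lines shaped like the quoted ones."""
    table = {}
    for line in text.splitlines():
        tokens = line.split()
        if not tokens or not tokens[0].startswith("16#"):
            continue  # not a "16#XXXX ..." entry
        code_point = int(tokens[0][3:], 16)
        for token in tokens[1:]:
            if _ID_TOKEN.match(token):
                continue  # skip the optional ID column
            table.setdefault(token, []).append(code_point)
    return table


# Lines quoted in this comment (before the fix).
BEFORE_FIX = """\
16#00EA LC300000 Cdot ecircumflex
16#00EB LC290000 cdot edieresis
16#010A LI620000 Cdotaccent
16#010B cdotaccent
"""

# Corrected lines a developer expects in the fixed resource.
AFTER_FIX = """\
16#010A LI620000 Cdot Cdotaccent
16#010B cdot cdotaccent
"""

before = parse_decoding_lines(BEFORE_FIX)
after = parse_decoding_lines(AFTER_FIX)
print(hex(before["Cdot"][0]))  # 0xea: the same value as ecircumflex
print(hex(after["Cdot"][0]))   # 0x10a: the same value as Cdotaccent
```

With the old lines, Cdot and ecircumflex share U+00EA, which matches the wrong character seen after copy and paste.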
Why is this bug still present in ver. 8.70 which was compiled on 2009-07-31?
As far as I can tell it is not present in the current version; the test file works correctly for me. However, this was a fix to external files rather than C code. Is it possible that you are using existing Resources, stored on disk, from a previous version of Ghostscript? Did you build GS yourself or download a built binary?

Try invoking GS using the -I switch to specify the 8.70 Resource directory. Also check the file gs/Resource/Decoding/Unicode; it should contain these lines:

    16#010A LI620000 Cdot Cdotaccent
    16#010B cdot cdotaccent

If it instead contains this:

    16#010A LI620000 Cdotaccent
    16#010B cdotaccent

then it is incorrect; the current source contains the correct definitions.
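The two checks suggested above can be sketched in Python as follows. This is only an illustration: the helper names are invented, the resource directory path is a placeholder, and the executable name is an assumption (Windows builds typically ship gswin32c rather than gs). The command is built but not run. The -I, -dBATCH, -dNOPAUSE, -sDEVICE, -dCompatibilityLevel and -sOutputFile switches are standard Ghostscript options; the rest is hypothetical.

```python
def decoding_has_cdot_fix(decoding_text):
    """True if the 16#010A and 16#010B lines list Cdot and cdot respectively."""
    wanted = {"16#010A": "Cdot", "16#010B": "cdot"}
    found = set()
    for line in decoding_text.splitlines():
        tokens = line.split()
        if tokens and tokens[0] in wanted and wanted[tokens[0]] in tokens[1:]:
            found.add(tokens[0])
    return found == set(wanted)


def gs_pdfwrite_command(ps_path, pdf_path, resource_dir, gs_exe="gs"):
    """Build (not run) a pdfwrite command that points -I at resource_dir."""
    return [
        gs_exe,
        "-I" + resource_dir,         # search this directory first
        "-dBATCH", "-dNOPAUSE",
        "-sDEVICE=pdfwrite",
        "-dCompatibilityLevel=1.4",  # what ps2pdf14 asks for
        "-sOutputFile=" + pdf_path,
        ps_path,
    ]


FIXED = "16#010A LI620000 Cdot Cdotaccent\n16#010B cdot cdotaccent\n"
STALE = "16#010A LI620000 Cdotaccent\n16#010B cdotaccent\n"

print(decoding_has_cdot_fix(FIXED))  # True
print(decoding_has_cdot_fix(STALE))  # False

# "/path/to/gs8.70/Resource" is a placeholder, not a real install location.
cmd = gs_pdfwrite_command("test1.ps", "test1.pdf", "/path/to/gs8.70/Resource")
print(" ".join(cmd))
```

To try it, pass the command list to subprocess.run, or simply type the printed line into a shell.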
Created attachment 5370: test.zip

Actually, the problem still exists with only one letter:

U+0111 đ Latin Small Letter D with stroke

The attached ZIP file contains:
- test1.ps, test2.ps: PS files created using the HP Universal Print Driver ver. 4.7 and ver. 5.0 respectively on MS Windows XP Pro. SP3.
- test1.pdf, test2.pdf: PDF files created using the following commands:
    ps2pdf14 test1.ps test1.pdf
    ps2pdf14 test2.ps test2.pdf

-- rpr.
The glyph /dslash is incorrectly named /dmacron in the PostScript file:

    /TTE1A75770t00 findfont /CharStrings get begin
    /dmacron 255 def
    end
    /TTE1A75770t00 findfont /Encoding get
    dup 72 /dmacron put

Since the Decoding relies on the glyph name in order to get the correct Unicode code point for the ToUnicode CMap, an incorrect glyph name will lead to an incorrect Unicode code point, and therefore incorrect cut and paste.

Renaming the glyph in the embedded font:

    /TTE1A75770t00 findfont /CharStrings get begin
    /dslash 255 def
    end
    /TTE1A75770t00 findfont /Encoding get
    dup 72 /dslash put

generates a PDF file with a correct ToUnicode CMap, and the lower case d with slash copies and pastes correctly.

Therefore the remaining problem is with the generation of the PostScript (and possibly with the original TrueType font, which may contain an incorrect PostScript name table). This issue was resolved with revision 9887.
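The manual rename shown above could be scripted roughly as follows. This is a text-level sketch only, with an invented helper name: it assumes the glyph name appears as a clear-text PostScript literal name, as in the snippet quoted in this comment, and it would not be safe on a file that carries the name inside binary or encoded font data.

```python
import re

# A PostScript name ends at whitespace or one of the delimiters ( ) < > [ ] { } / %.
# The lookahead rejects longer names that merely start with the old name.
_NAME_END = r"(?![^\s()<>\[\]{}/%])"


def rename_glyph(ps_text, old, new):
    """Replace the literal name /old with /new wherever it is a whole name."""
    pattern = re.compile("/" + re.escape(old) + _NAME_END)
    return pattern.sub(lambda match: "/" + new, ps_text)


SAMPLE = """\
/TTE1A75770t00 findfont /CharStrings get begin
/dmacron 255 def
end
/TTE1A75770t00 findfont /Encoding get
dup 72 /dmacron put
"""

fixed = rename_glyph(SAMPLE, "dmacron", "dslash")
print(fixed)
```

Both the CharStrings entry and the Encoding entry have to change together, which is why the sketch rewrites every whole-name occurrence rather than only the first.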
Ken, thank you for your work on this issue. I've done a few additional tests and it seems you were right.

On two MS Windows XP SP3 systems I printed a simple text containing the letter đ (Latin Small Letter D with stroke) in the DejaVu Serif font (v. 2.32) to a file using various PS printers. The created PS files always contained /dmacron instead of /dslash. All the PS printers used one of the following versions of Microsoft's PostScript Printer Driver (PSCRIPT5.DLL):
- ver. 6.0.6001.22127 (vistasp1_ldr.080302-0124)
- ver. 6.1.7600.16385 (win7_rtm.090713-1255)

As the PSCRIPT5.DLL file contains an instance of the ASCII string "dmacron", I changed it to "dslash" (using a hex editor) and copied the modified PSCRIPT5.DLL to the printer drivers directory ("C:\WINDOWS\system32\spool\drivers\w32x86\3"). After restarting Windows I ran the same tests again, and the generated PS files contained the /dslash glyph.

However, if the text was in one of the common Windows fonts, such as Arial, Times New Roman or Courier New, the generated PS files still contained the /dmacron glyph. So I also edited some of the font files and replaced the ASCII string "dmacron" with "dslash". Then, in the Control Panel, I replaced the original fonts with the modified ones. After that, printing the letter đ in the respective fonts generated PS files that contained the /dslash glyph.

At the end I'd like to ask: is the /dmacron really an incorrect name of the glyph? The http://en.wikipedia.org/wiki/D_with_stroke page says that in PostScript the glyph can be encoded as dcroat, dmacron and dslash. If it is correct, then could all the glyph names be implemented in GhostScript?

-- rpr.
(In reply to comment #8)
> At the end I'd like to ask is the /dmacron really an incorrect name of the
> glyph? The http://en.wikipedia.org/wiki/D_with_stroke page says that in
> PostScript the glyph can be encoded as dcroat, dmacron and dslash. If it is
> correct, then could all the glyph names be implemented in GhostScript?

If you can find me an Adobe document which says this I will happily change the glyph; I'm afraid I'm not prepared to treat a Wikipedia entry as gospel. If you look at http://en.wikipedia.org/wiki/Macron you will see that a macron is either above the glyph or below it, not part of it like the dslash. You can see an example of a dmacron on that page.

The Adobe Glyph List does not contain an entry for Dmacron, although it does have the same encoding entry for Dcroat and Dslash. It does have an entry for dmacron and dcroat which are the same, but no entry for dslash. This is the only Adobe document I can find which mentions these glyphs, and there is no case where the d/D slash and macron are defined the same. The fact that a third glyph can be approximated one way when upper case, and another way when lower case, is not conclusive in my opinion.

Of course there is nothing to stop you defining this yourself, particularly since this is a PostScript resource. You can either alter the Decoding resource and rebuild Ghostscript, or simply use the -I switch to point Ghostscript at a modified set of resources on disk instead of the built-in versions. This will allow you to have a version of Resource/Decoding/Unicode which is different from the one we ship.
(In reply to comment #9)
> If you can find me an Adobe document which says this I will happily change the
> glyph, I'm afraid I'm not prepared to treat a Wikipedia entry as gospel.

http://en.wikipedia.org/wiki/Adobe_Glyph_List references two Adobe documents:
(1) Adobe Glyph List - http://partners.adobe.com/public/developer/en/opentype/glyphlist.txt
(2) Adobe Glyph List For New Fonts - http://partners.adobe.com/public/developer/en/opentype/aglfn.txt

Regarding LATIN CAPITAL/SMALL LETTER D WITH STROKE the lists define the following:

(1):
    Dcroat;0110
    Dslash;0110
    dcroat;0111
    dmacron;0111

(2):
    0110;Dcroat;LATIN CAPITAL LETTER D WITH STROKE
    0111;dcroat;LATIN SMALL LETTER D WITH STROKE

According to my tests the GhostScript 9.00 PDF Writer recognizes the following glyph names: Dcroat, Dslash, dcroat and dslash, which means that it does not strictly follow either (1) or (2). Also, I'd say that Microsoft's PostScript Printer Driver (PSCRIPT5.DLL) uses definitions from (1).

-- rpr.
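The comparison between the two lists can be reproduced with a few lines of Python. This sketch parses only the excerpts quoted in this comment, not the full Adobe files, and the function names are invented for illustration. It relies on the two line shapes shown: name;hex for glyphlist.txt (where a name may map to several space-separated code points) and hex;name;description for aglfn.txt.

```python
def parse_glyphlist(text):
    """glyphlist.txt shape: 'name;hex[ hex ...]'; '#' starts a comment line."""
    mapping = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, codes = line.split(";", 1)
        mapping[name] = [int(code, 16) for code in codes.split()]
    return mapping


def parse_aglfn(text):
    """aglfn.txt shape: 'hex;name;description'; '#' starts a comment line."""
    mapping = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        code, name, _description = line.split(";", 2)
        mapping[name] = [int(code, 16)]
    return mapping


def names_for(mapping, code_point):
    """All glyph names that map to exactly this single code point."""
    return sorted(n for n, cps in mapping.items() if cps == [code_point])


# Excerpts exactly as quoted in this comment.
AGL_EXCERPT = "Dcroat;0110\nDslash;0110\ndcroat;0111\ndmacron;0111\n"
AGLFN_EXCERPT = (
    "0110;Dcroat;LATIN CAPITAL LETTER D WITH STROKE\n"
    "0111;dcroat;LATIN SMALL LETTER D WITH STROKE\n"
)

agl = parse_glyphlist(AGL_EXCERPT)
aglfn = parse_aglfn(AGLFN_EXCERPT)

print(names_for(agl, 0x0110))    # ['Dcroat', 'Dslash']
print(names_for(agl, 0x0111))    # ['dcroat', 'dmacron']
print(names_for(aglfn, 0x0111))  # ['dcroat']
print("dslash" in agl)           # False
```

In these excerpts, dmacron maps to U+0111 in list (1), while dslash appears in neither list.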
(In reply to comment #10)
> (In reply to comment #9)
> > If you can find me an Adobe document which says this I will happily change the
> > glyph, I'm afraid I'm not prepared to treat a Wikipedia entry as gospel.
>
> http://en.wikipedia.org/wiki/Adobe_Glyph_List references two Adobe documents:
> (1) Adobe Glyph List -
> http://partners.adobe.com/public/developer/en/opentype/glyphlist.txt

This is the document I referenced in comment #9; it does not state that Dmacron and dmacron are the same as Dslash and dslash.

> (2) Adobe Glyph List For New Fonts -
> http://partners.adobe.com/public/developer/en/opentype/aglfn.txt

This one doesn't reference either Dmacron or dmacron at all as far as I can see.

>
> Regarding LATIN CAPITAL/SMALL LETTER D WITH STROKE the lists define the
> following:
> (1):
> Dcroat;0110
> Dslash;0110
> dcroat;0111
> dmacron;0111

Yep, so lower case dcroat is the same as dmacron and upper case Dcroat is the same as Dslash. It doesn't say anything about the relationship between Dslash and Dmacron or dslash and dmacron.

> (2):
> 0110;Dcroat;LATIN CAPITAL LETTER D WITH STROKE
> 0111;dcroat;LATIN SMALL LETTER D WITH STROKE

Yes, but this isn't referencing Dmacron or dmacron at all. All this says is that Dcroat and dcroat are (in effect) the same as Dslash and dslash. Which contradicts the earlier document, but that's OK because this is for newer fonts.
AFAIK, the GhostScript PDF Writer should be able to map glyph names to Unicode code points. If it uses the glyph names from the Adobe Glyph List, it should recognize dmacron and map it to U+0111 (LATIN SMALL LETTER D WITH STROKE).

In comment #7 above you insisted that dmacron is an incorrect glyph name for LATIN SMALL LETTER D WITH STROKE and that it should be dslash. But dslash is not defined in the Adobe Glyph List. I conclude that the GhostScript PDF Writer should be fixed so that it recognizes dmacron and maps it to U+0111.

On the other hand, for LATIN CAPITAL LETTER D WITH STROKE (U+0110) the Adobe Glyph List defines Dcroat and Dslash. Both of them are correctly recognized by the GhostScript PDF Writer.

-- rpr.
FWIW, I just tried grep(1)ing every .afm file I could find (or, more precisely, which locate(1) could find) for '[dD]macron' and looked at the corresponding .pf[ab] files. In each case the dmacron and Dmacron glyphs were what one would expect for U+0111 LATIN SMALL LETTER D WITH STROKE and U+0110 LATIN CAPITAL LETTER D WITH STROKE. So, even if Adobe doesn’t specify it as so, font designers have used the Dmacron and dmacron names for the glyphs which look like Đ and đ rather than like D̄ and d̄.