Many mappings in Ghostscript’s ‘Resource\Decoding\Unicode’ differ from the current version of the Adobe Glyph List, and many names are missing. Also, the names in the Zapf Dingbats font were removed in revision 4030, unintentionally I think, because the log message only says ‘Fix : gs/Resource/Decoding/Unicode contained wrong codes for Cyrillic and Hebrew.’

I found these problems while looking at bug #691907 ‘PDFs with TrueType fonts from Windows PostScript files not searchable’: in the absence of a GlyphNames2Unicode generated by the driver, this resource is used to construct the PDF ToUnicode CMap that is used for searching the PDF and copying text.

I’ll attach below (comment #1) an updated version of this file. I also attach my best attempt at updating the ‘FCO_Unicode’ resource, although that one is more problematic. Details in comment #2.
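To illustrate why a wrong or missing name in this resource matters for searching, here is a minimal Python sketch of how a glyph-name table could feed the `bfchar` entries of a ToUnicode CMap. It is not Ghostscript's actual code: the function name, the sample encoding and the two-entry table are all hypothetical, and only code points in the Basic Multilingual Plane are exercised.

```python
# Hypothetical sketch: turn a font encoding (char code -> glyph name) into
# ToUnicode bfchar lines, using a glyph-name -> Unicode table like the
# 'Unicode' Decoding resource. Not Ghostscript's implementation.

def to_unicode_bfchar(encoding, glyph_to_unicode):
    """Return '<code> <unicode>' lines for every code whose glyph name is known."""
    lines = []
    for code in sorted(encoding):
        cp = glyph_to_unicode.get(encoding[code])
        if cp is None:
            # Unknown glyph name: no entry is emitted, so text using this
            # code cannot be searched or copied correctly.
            continue
        utf16 = chr(cp).encode("utf-16-be").hex().upper()
        lines.append("<%02X> <%s>" % (code, utf16))
    return lines

# Illustrative data only.
sample_encoding = {0x41: "A", 0xB7: "bullet", 0xC0: "unknownglyph"}
sample_table = {"A": 0x0041, "bullet": 0x2022}

bfchar_lines = to_unicode_bfchar(sample_encoding, sample_table)
for line in bfchar_lines:
    print(line)
```

In this sketch the code 0xC0 gets no entry because its glyph name is not in the table, which is the kind of gap the updated resource is meant to close.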
Created attachment 7168
Suggested new version of ‘Resource\Decoding\Unicode’

Bug #691918 : Update the mappings in the 'Unicode' Decoding resource, both fixing mappings for existing names and adding new names/mappings. See comment in code for the various sources of glyph names.

Notes:
- The attachment is not a diff, but the full file. Because of formatting changes (see comment in code), a diff would be unreadable.
- There are 112 changes to the mappings, and a lot of new names. No names were deleted.
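Since a textual diff is unreadable here, the figures in the notes (changed mappings, new names, no deletions) are easier to check by comparing the two tables as name-to-code-point mappings. A minimal Python sketch, assuming both versions have already been parsed into dictionaries; the parsing step is omitted and the two small tables are illustrative excerpts, not the real resources:

```python
# Hypothetical sketch: compare two glyph-name tables semantically instead of
# diffing the files line by line.

def compare_tables(old, new):
    """Return (changed, added, deleted) between two name -> code point dicts."""
    changed = {name: (old[name], new[name])
               for name in old
               if name in new and old[name] != new[name]}
    added = sorted(set(new) - set(old))
    deleted = sorted(set(old) - set(new))
    return changed, added, deleted

# Illustrative excerpts only.
old_table = {"mu": 0x03BC, "macron": 0x00AF}
new_table = {"mu": 0x00B5, "macron": 0x00AF, "hyphen": 0x002D}

changed, added, deleted = compare_tables(old_table, new_table)
print(len(changed), "changed;", len(added), "added;", len(deleted), "deleted")
```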
Created attachment 7169
Suggested new version of ‘Resource\Decoding\FCO_Unicode’

Bug #691918 : Update the mappings in the 'FCO_Unicode' Decoding resource, but without adding any new names. The updates follow the current Adobe Glyph List v2.0. Also includes changes that were made in the past to the ‘Unicode’ resource but not to ‘FCO_Unicode’.

DETAILS
-------

When it was added to the repository, ‘FCO_Unicode’ was a slightly modified copy of ‘Unicode’. Since then, mappings in ‘Unicode’ were corrected a few times but those in ‘FCO_Unicode’ were not, so ‘FCO_Unicode’ suffered from bitrot.

To see which differences are really intended, I compared the original ‘FCO_Unicode’ with the ‘Unicode’ version at that time. The differences I found, and what I did about them, are listed below:

- The following 2 names had different mappings in ‘FCO_Unicode’ and ‘Unicode’ at the time that ‘FCO_Unicode’ was added, and there are explicit comments that these differences are desired; I preserved these differences:
    fraction mapped to 16#2215 instead of 16#2044 [kept]
    macron mapped to 16#02C9 instead of 16#00AF [kept]

- The following were present in ‘FCO_Unicode’ but not in ‘Unicode’, without any comment stating why; assuming these could come from slightly wrong names in the FCO fonts, I preserved these differences:
    Delta mapped to 16#0394 [AGLv2: Deltagreek]
    Omega mapped to 16#03A9 [AGLv2: Omegagreek]
    periodcentered mapped to 16#2022 [AGLv2: bullet]
    periodcentered.1 mapped to 16#00B7 [AGLv2: periodcentered]

- The following were also present in ‘FCO_Unicode’ but not in ‘Unicode’, without documenting the difference, but are now present in the new ‘Unicode’ resource with identical mapping:
    hyphen mapped to 16#002D [not a difference anymore]
    scedilla mapped to 16#015F [not a difference anymore]
    Scedilla mapped to 16#015E [not a difference anymore]
    tcommaaccent mapped to 16#0163 [not a difference anymore]
    Tcommaaccent mapped to 16#0162 [not a difference anymore]

- The following had a different mapping in ‘FCO_Unicode’ and ‘Unicode’, but ‘Unicode’ was changed to the same mapping as in the old ‘FCO_Unicode’:
    mu mapped to 16#00B5, not 16#03BC [not a difference anymore]

- The following was present in both ‘FCO_Unicode’ and ‘Unicode’, but has since been removed from ‘Unicode’; because originally it was not an FCO-specific difference, I removed it from the new ‘FCO_Unicode’ as well:
    idot originally mapped to 16#0069 [removed; AGLv2: i]

- The following was a duplicate mapping in ‘FCO_Unicode’, absent from ‘Unicode’. Since these resource types don’t allow duplicates (only one mapping is effective, the others are silently ignored), and because the ‘Unicode’ name assigned at the time for 16#021A looks incorrect (upsilonlatin), I did not preserve this old mapping:
    Tcommaaccent duplicate mapping to 16#021A [removed]
  (From another point of view, tcommaaccent -> 16#0163 and Tcommaaccent -> 16#0162 are historical accidents. The characters at 16#0163/16#0162 have a cedilla; the t/T with comma below are at 16#021B/16#021A. I kept the AGLv2 mappings; I did not fix them.)

Not having those ‘FCO’ fonts to check, and wanting to fix the wrong mappings while being as conservative as possible, the attached ‘FCO_Unicode’ file:
- starts by being a copy of the new ‘Unicode’...
- ... but does not contain any NEW names compared to the existing ‘FCO_Unicode’ (new names would be useful only if the fonts use them) ...
- ... and preserves the 2+4 differences in the first 2 groups above.

If you consider a different approach (like adding all new AGL names too) more suitable, either drop me a note here or, at your choice, make any changes you consider necessary.
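The three construction steps above can be sketched in Python as follows. This is only an illustration of the approach, not the script used to produce the attachment: the function name is hypothetical, the tables are short excerpts taken from the lists above, and `hypothetical_new_name` stands in for any name that exists only in the new ‘Unicode’.

```python
# Hypothetical sketch of the conservative 'FCO_Unicode' construction:
# copy the new 'Unicode', keep only names the old 'FCO_Unicode' already had,
# then re-apply the 2+4 preserved FCO-specific differences.

def build_fco_unicode(new_unicode, old_fco, preserved):
    # Steps 1 and 2: copy of the new table, restricted to existing FCO names.
    result = {name: cp for name, cp in new_unicode.items() if name in old_fco}
    # Step 3: the preserved differences override or extend the copy.
    result.update(preserved)
    return result

# Excerpt of the new 'Unicode' table (illustrative).
new_unicode = {
    "fraction": 0x2044, "macron": 0x00AF, "periodcentered": 0x00B7,
    "mu": 0x00B5, "hyphen": 0x002D, "Tcommaaccent": 0x0162,
    "hypothetical_new_name": 0xE000,  # placeholder for a name new to 'Unicode'
}

# Excerpt of the old 'FCO_Unicode' table (illustrative).
old_fco = {
    "fraction": 0x2215, "macron": 0x02C9, "Delta": 0x0394, "Omega": 0x03A9,
    "periodcentered": 0x2022, "periodcentered.1": 0x00B7,
    "mu": 0x00B5, "hyphen": 0x002D, "Tcommaaccent": 0x0162, "idot": 0x0069,
}

# The 2 documented and 4 undocumented differences that are kept.
preserved = {
    "fraction": 0x2215, "macron": 0x02C9,
    "Delta": 0x0394, "Omega": 0x03A9,
    "periodcentered": 0x2022, "periodcentered.1": 0x00B7,
}

new_fco = build_fco_unicode(new_unicode, old_fco, preserved)
for name in sorted(new_fco):
    print("%-18s 16#%04X" % (name, new_fco[name]))
```

In this sketch, `idot` disappears simply because it is no longer in the new ‘Unicode’, and no name outside the old ‘FCO_Unicode’ is introduced. Using a dictionary also reflects the no-duplicates rule: a name can carry only one mapping.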
Ken, this should probably be a bounty.
Adding the bountiable keyword at Henry's suggestion (Henry, is that all I need to do?).

I've reviewed the changes and I'm happy with them; however, we are currently in code freeze for the 9.01 release, so I can't commit them yet. I'll leave the bug report open until the freeze is lifted, then commit and close.
Changes adopted in revision 12139 http://ghostscript.com/pipermail/gs-cvs/2011-February/012297.html