The existing UTF-8 additions for the Windows platform resulted in incompatible changes to the GSAPI and GSDLL interfaces. Existing GSAPI clients could previously pass ‘command line’ options (i.e. arguments to functions like gsapi_init_with_args()) that included extended characters (input filenames, include path, FONTPATH, etc.) as long as these characters were part of the installed ANSI codepage. This no longer works unless the GSAPI client is specifically modified to convert ‘command line’ arguments to UTF-8 by itself, an encoding that is not a native one on Windows.

Steps to reproduce
-------

- First we need an ‘existing GSAPI client’. A gswin32c.exe from a previous release, or the current gswin32c.exe compiled with WINDOWS_NO_UNICODE, will show the problem. We will name this executable ‘gswin32c_ansi.exe’.

- Create a PostScript file whose name contains non-7-bit-ASCII characters but stays within the system’s ANSI codepage. Let’s say this is "Grün.ps" ("ü" is Alt+0252 in the ‘Windows: Western’ ANSI codepage). The contents do not really matter; a single ‘showpage’ is enough.

- Attempt to display this file using the ‘existing GSAPI client’ and the new *UTF-8* DLL:

    gswin32c_ansi.exe Grün.ps
    GPL Ghostscript GIT PRERELEASE 9.04 (2011-03-30)
    Copyright (C) 2010 Artifex Software, Inc. All rights reserved.
    This software comes with NO WARRANTY: see the file PUBLIC for details.
    Error: /undefinedfilename in (Gr\374n.ps)
    ...
    GPL Ghostscript GIT PRERELEASE 9.04: Unrecoverable error, exit code 1

The error happens because the ‘existing GSAPI client’ passes ANSI strings in the argv[] parameter. The filesystem access functions of the new DLL take these bytes, consider them a UTF-8 string, convert them to UTF-16 (actually destroying them), and pass the result to Windows.

More on the scope and limits of this bug report
-------

The current GS version introduces 2 incompatible changes.
(A) PostScript operators that interact with the host expect any PostScript string parameter they take to be UTF-8 encoded. (B) ‘Command line arguments’ (switches, input filenames, etc.) are expected to be UTF-8 encoded too. (More precisely, any filenames/paths/etc. in there must be UTF-8; the switch letters, PS operator names, various names recognized by GS, etc. use only 7-bit ASCII characters, whose codification is identical across ASCII <-> ANSI <-> UTF-8.)

The current report is about eliminating (B), not about (A). (A) does not typically affect PostScript-as-a-page-description-language files, only PostScript-as-a-programming-language files, plus some uses like concatenation of PS files with save-runfile-restore-repeat. The changes help somewhat with (A) by making gsapi_revision()/gsdll_revision() report whether file PS operators expect UTF-8 or ANSI strings, so the GSAPI client knows what to do if generating PS code on the fly.

Next steps
-------

Comment #1 below describes what type of changes I consider necessary. Comment #2 is a list of all GSAPI and GSDLL interface functions/messages/and similar, telling which need changing and how, and which remain untouched and why.

No patch yet. I hope to have something this weekend. (I hoped the same last weekend too, but nothing came out for lack of time...) It’s not much code, but the overhead of doing this correctly (no platform-specific code in i*.c files) may be significant. Plus, I don’t know if anyone agrees with the design in general.
My suggestions on what to do:

(i)
-------

Functions that take text parameters get encoding-specific variants. To do this in a ‘Windowsly’ manner, names for the ANSI variants get an "A" suffix, and those that expect wchar_t parameters get the suffix "W". The no-suffix names are kept for backward compatibility and are aliases for the ANSI entry points. For reasons explained later, there is another variant that takes UTF-8 strings, doing no conversion at all; I suggest the suffix "8" for these.

Example for gsapi_run_file():

- Add: int gsapi_run_fileA (..., const char *file_name/*ANSI*/, ...);
  New function. All it does is convert file_name ANSI->UTF-8 and call the UTF-8 worker function (see below). Exported from the *.DEF file as both ‘gsapi_run_fileA’ (suffix ‘A’) and ‘gsapi_run_file’ (no suffix). (‘EXPORTS’ syntax: ‘externalname = internalname [@ordinal]’.)

- Add: int gsapi_run_fileW (..., const wchar_t *file_name/*UTF-16*/, ...);
  New function. It converts UTF-16->UTF-8 and calls the same UTF-8 worker function. Exported from the *.DEF file as ‘gsapi_run_fileW’ (suffix ‘W’).

- Keep unchanged: int gsapi_run_file (..., const char *file_name/*UTF-8*/, ...);
  This is the worker function called by the ‘A’ and ‘W’ variants. Also exported from the *.DEF as ‘gsapi_run_file8’ (suffix ‘8’).

Why the ‘A/W’ variants: We need the ANSI version for backwards compatibility, and the Unicode version for greater coverage of filenames. UTF-8 provides the same coverage as UTF-16 (aka ‘Unicode’ aka ‘wide character’), but UTF-8 is not a native Windows encoding. These variants with exactly these suffixes are ‘the Windows way’ for this kind of job. Think of these functions as helpers for calling the UTF-8 variant.

Why the no-suffix variant [on Windows]: Ensures that existing GSAPI/GSDLL clients can link, unmodified, to the new DLL. Note that with the name-juggling in the DEF, the no-suffix externally visible symbol links to the ANSI function on Windows and to the UTF-8 one on all other platforms.
Why the ‘8’ variant: Currently, gswin32.exe and gswin32c.exe get the command line as UTF-16 and convert it automatically to UTF-8. In the process, the exact bytes that represent these arguments change. We implicitly assume that nobody will put binary bytes (true binary, not hex or ASCII85) on the command line, and that this limitation is much less annoying than not being able to simply type filenames with extended characters. For ‘gsapi_init_with_args()’ the situation is a bit different: the ‘argv[]’ is [could be] generated by a program. This program can very well pass a ‘-c’ with binary tokens that would be destroyed by a charset conversion. For such an exotic use, I suggest providing these ‘8’ entries. Note this requires exactly ZERO coding; the function is there anyway.

(ii)
-------

I also consider that such an important change as the encoding for filename strings passed to PostScript operators must be signaled by ‘gs_revision()’. At least the GSAPI client can then know what kind of DLL it loaded and act appropriately. This also allows writing clients that work with both ‘new’ and ‘old’ DLLs, and use the new functionality if available.

(iii)
-------

After doing the above changes, gswin32c.exe and gswin32.exe need to be changed to call the new entry points. Currently they call the no-suffix function but pass UTF-8 parameters. What they have to do is pass the wide-char strings to the ‘W’ functions directly. I consider the wchar_t -> UTF-8 conversion to be currently misplaced: being in the frontends, it resolves the problem for those frontends only; its place is ‘after’ the GSAPI interface, to help all clients and not introduce incompatibilities.

(iv)
-------

These changes also have to be documented in API.htm and DLL.htm. Until now, the specs were silent about character encoding, assuming some general consensus that the encoding is the one used by the host.
Now, because the encoding on Windows is forced to be a non-native one, the docs must say exactly where this encoding is required and where it is not.
List of GSAPI and GSDLL interface elements, and my opinion on what has to change and what can/must remain as-is:

FUNCTIONS THAT CHANGE by backwards-compatible additions
-------

gsapi_revision()
gsdll_revision()

Taking a shortcut here: the ‘product’ and ‘copyright’ strings always consist of 1-byte 7-bit ASCII characters; the character encoding used does not vary and there are no "A/W/8" variants. The ‘product’ changes to optionally include a ‘built-in features’ list. This list follows the product name proper, is enclosed in square brackets, and consists of space-delimited predefined tokens. There is currently only one such feature:

utf8osops
  If present, operators that interface with the host OS interpret any [PostScript] strings they take as parameters as being UTF-8 encoded. Note: no initial BOM is expected or allowed. If absent, these PostScript strings are assumed to use the same encoding as the host OS. For Windows this is the ANSI codepage in effect, which currently happens (i) for GS versions 9.02 and earlier and (ii) for versions 9.04 and later when compiled with the option WINDOWS_NO_UNICODE. For the other platforms, this is the only case currently implemented. Note: GS 9.03 does not return this information, although it can be compiled with or without this feature.

FUNCTIONS THAT CHANGE because they need encoding-specific variants
-------

It turns out there are only 3 of these (2 x GSAPI + 1 x GSDLL).

gsapi_init_with_args()
gsdll_init()
gsapi_run_file()

These functions need "A/W/8" variants for the possible encodings used by the ‘argv’ and ‘file_name’ parameters. The ‘callback’ parameter of gsdll_init() does not vary depending on a character encoding; see below why (GSDLL_CALLBACK detailed by message).
FUNCTIONS THAT DO NOT CHANGE
-------

gsapi_new_instance()
GSDLL_CALLBACK message GSDLL_POLL
gsapi_delete_instance()
gsapi_run_string_begin()
gsdll_execute_begin()
gsapi_run_string_end()
gsdll_execute_end()
gsapi_exit()
gsdll_exit()
gsapi_set_poll()
the poll_fn() callback passed to gsapi_set_poll()

These functions do not manipulate character data, and do not have any char/char[]/char* parameters, so they do not need any change.

gsapi_run_string_continue()
gsdll_execute_cont()
gsapi_run_string_with_length()
gsapi_run_string()
gsapi_set_stdio()
the stdin_fn() for gsapi_set_stdio()
GSDLL_CALLBACK message GSDLL_STDIN
the stdout_fn() for gsapi_set_stdio()
GSDLL_CALLBACK message GSDLL_STDOUT
the stderr_fn() for gsapi_set_stdio()

Character buffers handled by these functions are actually fragments of the PostScript interpreter's input/output/error streams, and as such are governed by PostScript syntax and semantics. Nothing here depends on the encoding used by the host OS for textual information, so these functions don't change (no "A/W/8" variants of them).

gsdll_lock_device()
GSDLL_CALLBACK message GSDLL_DEVICE
gsdll_copy_dib()
GSDLL_CALLBACK message GSDLL_SYNC
gsdll_copy_palette()
GSDLL_CALLBACK message GSDLL_PAGE
gsdll_draw()
GSDLL_CALLBACK message GSDLL_SIZE
gsdll_get_bitmap_row()

The ‘device’ parameter does not point to textual information; it is rather of an opaque type, so not subject to character encoding.

gsdll_get_bitmap()

The ‘device’ parameter is actually of an opaque type, not textual information that may be subject to character encoding. The ‘pbitmap’ is a pointer to binary data. Also, this is an OS/2-specific entry point, and the UTF-8 file operators are not [yet?] implemented for this platform.

gsapi_set_display_callback()
the display_callback passed to gsapi_set_display_callback()

The only char data I found here is the ‘const char *component_name’ parameter of the display_separation() structure member.
This is used to pass SeparationNames from the PostScript program, so its contents (including any implied encoding, although this is primarily just an array of bytes) are dictated by the running PostScript code. No character encoding needs to be specified at the GSAPI interface level.

the vd_trace_host_s pointed to by a vd_trace_interface_s member

This structure is completely defined and controlled by the module that draws the ‘trace image’, which is the GSAPI client. Hence, no need for a character encoding contract at the GSAPI interface level.

the vd_trace_interface_s passed to gsapi_set_visual_tracer()

The only member with text data is text(..., char *ASCIIZ), and the only place I found it used is in base\gxhintn.c::t1_hinter__paint_glyph(). From what I see, it is used to label the Visual Trace image. The text consists of debug information (it does not come from PostScript, nor from the command line/file names/etc.). I conclude the character encoding used is currently defined by the C compiler. Having a more precise specification of the encoding would be only marginally useful. I consider defining "A/W/8" variants of vd_trace_interface_s to be overkill.

gsapi_set_visual_tracer()

If this were a WINAPI function, it would have "A/W" variants. But for the reasons explained above, I consider that it does not need them.
The consensus on this was that we don't want to take on multiple APIs. Robin Watts has prepared a release document to warn commercial customers that we will go forward with UTF-8 next release (9.05). Reassigning to him so he can attach the document and close the bug. Thank you for the detailed and complete analysis of this issue.
Created attachment 7728 [details] UnicodeNode.txt This is being included in the release notes for 9.04, which will ship with WINDOWS_NO_UNICODE. Unless we get feedback from customers to change our mind, we will proceed with the "incompatible change" to the API in 9.05. At the moment, the feeling is strongly that taking on multiple different APIs for different platforms is something we'd like to avoid.
Thanks again for your analysis.
I hope we call this UnicodeNote.txt (not UnicodeNode) when we ship it. ;-)
Sags: I have revisited the unicode handling, driven at least in part by your comments here. I have taken a slightly different approach, but I think it gives us the backwards compatibility that we want. See the following link for my current attempt: http://git.ghostscript.com/?p=user/robin/ghostpdl.git/.git;a=commitdiff;h=0d3ce254a0a03d3530623c5be5b4c1d92466f87c Any comments you may have are gratefully accepted.
I took a look at the patch from comment #7, and here are a few notes on what I think. Hope these will be of some help.

-------------
(A) WINDOWS_NO_UNICODE test

The test in msvc.mak (and pcl6_msvc.mak), line ‘!if "$(WINDOWS_NO_UNICODE)" != "1"’, seems reversed.

-------------
(B) XCFLAGS

[It's not related to this patch; it’s just that I stumbled upon this while looking at why WINDOWS_NO_UNICODE=1 doesn’t work.]

$(XCFLAGS) is added to the compiler command line twice, once in the msvc.mak line ‘CFLAGS=$(CFLAGS) $(XCFLAGS) $(UNICODECFLAGS)’ and once in the msvccmd.mak line ‘CCFLAGS=$(PLATOPT) $(FPFLAGS) $(CPFLAGS) $(CFLAGS) $(XCFLAGS) ...’. I think the one to delete is the 2nd one.

-------------
(C) The ‘string type fuzziness’

In C, strings are sequences of bytes ended by a byte with the value 0x00. As long as we use a single character encoding everywhere, there is no confusion about the semantics of these bytes. When we work with varying encodings, we need to know each string's encoding in order to handle it.

- One way is to ‘tag’ each string with its encoding, e.g. instead of ‘char *’ use ‘struct { char *p; enum {..} encoding;}’ - too complicated.
- Another is to pass around extra parameters to specify the encoding used by each ‘char *’ or ‘char []’ parameter.
- A 3rd way is to use different functions, each working with a specific encoding, as in ‘fnA(char * /*ANSI*/)’ + ‘fn8(char * /*UTF-8*/)’ + ‘fnW(wchar_t * /*UTF-16*/)’. [Note: I used the suffix ‘8’ for the UTF-8 variant, but anything you like (‘_utf8’) is OK with me.]

The patch does not address this need, and gsapi_set_arg_encoding() makes things worse from this point of view.

-------------
(C.1) ‘String type fuzziness’ part #1: the DLL entry points
      plus why "A"/"W" and why the UTF-8 entry is not so useful

The exact type (including the encoding) of strings passed to some DLL entry points is not known at compile time. For example, it is not possible to write a prototype that expresses exactly what these functions expect.
Workarounds exist, but they are neither elegant nor 100% safe. The problem is most visible with .NET Framework clients that want to call the GS DLL via P/Invoke. The runtime can convert strings, but (a) a correct prototype is needed and (b) the only available options are ANSI and Unicode (which means UTF-16). I’m not a .NET expert, but I think the only way to pass strings as UTF-8 is to handle that parameter as if it were a generic ‘void *’ and manually allocate unmanaged memory and store the necessary bytes there.

I think it’s not a big change, compared to the current patch, to separate "A"/"W"/"8" (or "_utf8" if you prefer) entries. Pseudo-code for gsapi_run_file():

- Don’t export gsapi_set_arg_encoding(), but keep it for internal use.

- In iapi.c (assuming arg decoding initialized to GS_ARG_ENCODING_LOCAL):

    int gsapi_run_file (char *file_name)
    {
        ... the current function, unchanged
    }
    #if WINDOWS
    int gsapi_run_file8 (char *file_name)
    {
        gsapi_set_arg_encoding(GS_ARG_ENCODING_UTF8);
        i = gsapi_run_file (file_name);
        gsapi_set_arg_encoding(GS_ARG_ENCODING_LOCAL);
        return i;
    }
    int gsapi_run_fileW (wchar_t *file_name)
    {
        gsapi_set_arg_encoding(GS_ARG_ENCODING_UTF16/*future development*/);
        i = gsapi_run_file ((char *)file_name);
        gsapi_set_arg_encoding(GS_ARG_ENCODING_LOCAL);
        return i;
    }
    #endif

- In gsdll32.def:

    EXPORTS
        gsapi_run_file
        gsapi_run_fileA = gsapi_run_file
        gsapi_run_fileW
        gsapi_run_file8

Note the gsapi_*() functions are not reentrant (for the same context) and their arguments are processed to completion before they return, which makes this juggling with GS_ARG_ENCODING_* all that’s needed. Also note there are only 3 functions (2 x GSAPI and 1 x GSDLL, see comment #2) that need such wrappers. [The explanations here seem incomplete, but this is all I could write in the time I had.]

-------------
(C.2) ‘String type fuzziness’ part #2: handling of arguments

I’m referring to gsargs.h::arg_source.chars, gsargs.c::arg_next(), etc.
To analyze the arguments, arg_next() looks for spaces, quotes, backslashes, etc. The code looks written with single-byte 7-bit ASCII encoding in mind, and works with encodings that are ‘compatible enough’ with ASCII. UTF-8 and single-byte ANSI encodings are ‘compatible enough’, but double-byte ANSI and UTF-16 are not.

With the current solution, strings in arg_source are considered to use whatever encoding they had when coming from ‘the outside’, and are converted to UTF-8 just before returning from arg_next(). There are problems with this.

- Some of these strings are ‘pushed back’ strings, and these are already in UTF-8 (having been converted by the arg_next() that retrieved them the 1st time). The main part of arg_next() works OK with these, but double-converts them on exit. The situation can be detected by checking arg_source.u.s.parsed and skipping the conversion if true, but see below. Example:

  [In these examples gswin32c_ansi.exe is the ‘old GSAPI client’ obtained here by compiling with WINDOWS_NO_UNICODE, but it calls the Unicode DLL. The file named ‘é’ (U+00E9) does not exist.]

  a) gswin32c_ansi -fé
     -> Error: /undefinedfilename in (\303\251)
     OK (the file does not exist). ‘é’ is U+00E9 (0000 0000 1110 1001), which encoded as UTF-8 gives "\xC3\xA9" (110 00011 10 101001).

  b) gswin32c_ansi -c 0 -fé
     -> Error: /undefinedfilename in (\303\203\302\251)
     Wrong: the ANSI string ‘é’ ("\xE9") has been converted to UTF-8 once ("\xC3\xA9"), then it was ‘pushed back’, considered to be ANSI-encoded (U+00C3 U+00A9), and converted to UTF-8 a 2nd time.

- GS_OPTIONS, when retrieved from the Registry, is also stored here encoded as UTF-8 but does not have .parsed == true. It also suffers from double-decoding.
  Define ‘GS_OPTIONS=-sTEST=é’ in the Registry:

    gswin32c_ansi -c TEST == quit
    -> (\303\203\302\251)

  While I’m here [this problem existed before the current patch; it’s not a regression]: when values are fetched from true environment variables (not from the Registry), they are incorrectly ANSI-encoded. The ‘correct’ value seen here comes from the 2nd decoding:

    set GS_OPTIONS=-sTEST=é
    gswin32c_ansi -c TEST == quit
    -> (\303\251)

  This happens because gp_wgetv.c::gp_getenv() calls getenv(); it must call one of getenv() or _wgetenv() depending on WINDOWS_NO_UNICODE (converting the encoding when needed). The preceding 2 problems are similar to those described in bug #688965 "GS_OPTIONS env doesn't allow quoted arguments or spaces".

- [All?] other strings use the encoding they came in with from ‘the outside’. arg_next() does not work correctly if this encoding is a double-byte ANSI one <http://msdn.microsoft.com/en-us/goglobal/bb964654>. See bug #691117 "Gs cannot open lib files when directory has some SJIS letters in its name" for the explanation of the ‘2nd byte matched for backslash’ problem. [Not a regression; this never worked.] Also, we cannot support UTF-16 as an ‘outside encoding’ by simply adding a new arg_decode() function; all of arg_next() would need to be rewritten to handle 2-byte characters (including NUL bytes that don’t terminate strings).

I think the only way to solve the above problems is to have everything in arg_source as UTF-8. This means converting strings to UTF-8 BEFORE putting them there (instead of when retrieving them), and converting what is read from @files on-the-fly (taking care of multi-byte characters) - exactly what I think you wanted to avoid :(
(In reply to comment #8)
> (A) WINDOWS_NO_UNICODE test
>
> The test in msvc.mak (and pcl6_msvc.mak), line ‘!if "$(WINDOWS_NO_UNICODE)"
> != "1"’, seems reversed.

I agree that the test is wrong. I think it should simply be:

    !if "$(WINDOWS_NO_UNICODE)" != ""

> (B) XCFLAGS
>
> [It's not related to this patch; it’s just that I stumbled upon this while
> looking at why WINDOWS_NO_UNICODE=1 doesn’t work.]
>
> $(XCFLAGS) is added to the compiler command line twice, once in the msvc.mak
> line ‘CFLAGS=$(CFLAGS) $(XCFLAGS) $(UNICODECFLAGS)’ and once in the
> msvccmd.mak line ‘CCFLAGS=$(PLATOPT) $(FPFLAGS) $(CPFLAGS) $(CFLAGS)
> $(XCFLAGS) ...’. I think the one to delete is the 2nd one.

Also seems plausible.

> In C, strings are sequences of bytes ended by a byte with the value 0x00. As
> long as we use a single character encoding everywhere, there is no confusion
> about the semantics of these bytes.

The plan was to use a single encoding (utf8) for everything after arg_next, and to use a single (varying set of) encoding(s) before that.

> (C.2) ‘String type fuzziness’ part #2: handling of arguments
>
> I’m referring to gsargs.h::arg_source.chars, gsargs.c::arg_next(), etc.
>
> To analyze the arguments, arg_next() looks for spaces, quotes, backslashes,
> etc. The code looks written with single-byte 7-bit ASCII encoding in mind,
> and works with encodings that are ‘compatible enough’ with ASCII. UTF-8 and
> single-byte ANSI encodings are ‘compatible enough’, but double-byte ANSI
> and UTF-16 are not.

I have a reworked patch here that moves the decoding to be earlier within arg_next; we now decode the strings a 'rune' at a time (i.e. possibly multiple bytes are consumed from the arg to create single unicode values). We then look at those runes to check for quoting etc. These unicode values are then written back as utf8 into the arg buffer. This should lift the 'compatible enough' restriction.

> I think the only way to solve the above problems is to have everything in
> arg_source as UTF-8. This means converting strings to UTF-8 BEFORE putting
> them there (instead of when retrieving them), and converting what is read
> from @files on-the-fly (taking care of multi-byte characters) - exactly what
> I think you wanted to avoid :(

I hope that my new patch effectively achieves this, but in a 'just in time' manner. I still have the strcmp's from imainarg to deal with, but I have a plan for this too.

> (C.1) ‘String type fuzziness’ part #1: the DLL entry points
>       plus why "A"/"W" and why the UTF-8 entry is not so useful

The desire to have a simple prototype for pinvoke is a good one. I am therefore leaning towards offering the gsapi helper functions you suggest (though I personally prefer leaving gsapi_set_arg_encoding as an external). I will upload a new patch for your consideration tomorrow.
Created attachment 9144 [details] 0001-Command-line-argument-character-encoding-handling-re.patch New patch to try to address Sags' concerns.
I took a look at the patch, and here are my notes. I did not actually *run* the code; everything here comes from inspecting the source (apologies if some of it turns out to be completely wrong). Anyway, hope this will help.

---
(I) The GSDLL interface

A single function needs wrappers, OK. I suggest adding a "gsdll_initA = gsdll_init" line to the EXPORTS section of the DEF, to have both "A"+"W" names as is customary on Windows. No extra C code needed (compared to what already exists in the patch), but please add the function pointer type to gsdll.h. Then, the interface specs [as seen from callers of the DLL] will be:

gsdll_init
  Always uses the "local" encoding. On Windows, this is equivalent to gsdll_initA; the "no suffix" name is kept for backwards compatibility.
gsdll_initA
  Always "local" encoding; Windows-specific.
gsdll_initW
  Always "unicode" encoding; Windows-specific.
gsdll_init_with_encoding
  The encoding is specified by an extra parameter.

Note: To use UTF-8 encoding on Windows (which anyway is "unusual"), one will use gsdll_init_with_encoding(.., GS_ARG_ENCODING_UTF8). As above, it looks [will look] OK to me.

Notes on the implementation:

(a) gsdll.c::gsdll_init() calls itself (!) and gsdll.c::gsdll_initW() calls gsdll_init() too. Looks like a typo; it seems both should call gsdll_init_with_encoding().

(b) Either add a prototype for gsdll_init_with_encoding() (the one assumed by the compiler when it finds [will find] it called from gsdll_init/W() is not suitable), or move gsdll_init/W() after it.

(c) [added after reviewing what I wrote] This interface also suffers from the "encoding contamination" of the no-suffix entry point by the other entry points, as described below for GSAPI - see (e). And even between the GSDLL and GSAPI interfaces!

----
(II) The GSAPI interface

It is not clear to me how you see this specified, more precisely which function (name as visible from the GSAPI client) uses which character set.
Looking at the code, my best guess is:

gsapi_* (no suffix)
  Expects arguments to have the encoding set by the most recent call to gsapi_set_arg_encoding(). By default, the encoding is the "local" one, to be compatible with old clients.
gsapi_*A
  Always "local" encoding; Windows-specific.
gsapi_*W
  Always "unicode" encoding; Windows-specific.
gsapi_*_utf8
  Always UTF-8 encoding.

If this is what you wanted, then there are problems with the wrappers:

(d) I don't see a good reason for the 2 interfaces, GSDLL and GSAPI, to be so different in design. The "no suffix" functions have different specs (one is ANSI, the other is "specified by a separate call"). The "variable" encoding is specified by a separate *entry* for GSAPI but by an extra *parameter* for GSDLL. [IMO, the GSDLL way is better.]

(e) The existing "W"/"_utf8" wrappers call gsapi_set_arg_encoding() internally, altering the semantics of the "no suffix" entry point. I find this neither intuitive nor desirable at all. Note the example for gsapi_run_file() in my comment #8 takes care to set the encoding back to the default, exactly to avoid interference. [There the "no suffix" variant is specified as "always local", but this does not matter from this point of view.]

(f) You have a gsapi_run_file_utf8() but no gsapi_init_with_args_utf8(). I suggest deleting gsapi_run_file_utf8(). This brings GSAPI almost in line with GSDLL: UTF-8 is obtained by explicitly setting the encoding - either by an extra parameter (GSDLL) or an extra call (GSAPI). If you absolutely want the *_utf8() functions, add the 2nd one too.

(g) You need separate wrappers for the "A" version; you cannot simply have "gsapi_run_fileA = gsapi_run_file" in the DEF, because the functionality of the "A" entry point differs from that of the "no suffix" entry point - one expects "local" encoding, the other whatever was last specified by gsapi_set_arg_encoding().
(h) gsapi_run_file() has an extra problem, and both the patch and my example in comment #8 (but not the one in comment #1) have a bug. Here there are 2 different places where the encoding must be taken into account: (1) the filename itself and (2) the contents of that file. (2) is handled by your JIT-recoder in arg_next() (as an effect of the internal call to gsapi_set_arg_encoding()), but the filename parameter must be converted in the wrapper; it cannot be passed as-is to the GS "core", because the "core" expects UTF-8. [At least I didn't find where that name is converted.]

-----
(III) Other notes about the implementation, in no particular order:

(i) What exactly is a "rune"?

According to get_rune_utf8(), utf16le_get_rune(), and put_rune(), a "rune" is an int containing the 31-bit (or 21-bit, to be more precise) Unicode Value ("UV"). But then gp_local_arg_encoding_get_rune() is wrong: it returns the 1..4 bytes used by the UTF-8 encoding of the character (including the bits that indicate lead/trail bytes). You can actually simplify the latter function, because it's easy to get the UV from the 1..2 wchar_ts returned by MultiByteToWideChar(); you don't need to call WideCharToMultiByte(CP_UTF8,...).

(j) Let EOF be EOF, not 0

It's a bad idea to use 0 as an EOF indicator. This is most visible in utf16le_get_rune(), because the file will be "truncated" at any *byte* with the value 0. Usually EOF is -1, which differs from any UV because valid UVs are in the range 0..0x10FFFF (so always positive 32-bit ints). If you don't want to rely on C's EOF #define, use -1 directly.

(k) WINDOWS_NO_UNICODE, plus some other platforms, look broken

The init done by gsapi_set_arg_encoding(GS_ARG_ENCODING_LOCAL) does not look OK. For WINDOWS_NO_UNICODE or non-Windows, it ends up setting minst->rune_from_local = NULL, which is passed to arg_init() and interpreted to mean UTF-8. Command line input: UTF-8 is native on Linux, maybe on Mac(?), but on OS/2? VMS?
Also, WINDOWS_NO_UNICODE should expect ANSI. File functions: arg_next() converts everything TO UTF-8 via put_rune(). So any filename (input/output/-I path/font path/etc.) unexpectedly ends up as UTF-8, which the file functions don't always handle (depending on platform and WINDOWS_NO_UNICODE).

(l) Surrogate pairs when decoding UTF-16

The decoding in utf16le_get_rune() looks wrong. 1st, the high and low surrogates have different ranges; this matters for validation and for subtracting the base values. Next, after combining the 10+10 bits, add 0x10000.

(m) The name "rune_from_local" [not that important...]

This name (and the comment near it) in struct gs_main_instance_s is unfortunate. The function does not look to be the platform-specific "local"/"host" -> UV conversion, but the one that converts from whatever was set by gsapi_set_arg_encoding() - which does not necessarily have anything to do with the "local" encoding.
Errata to comment #11:

Forget (c); it is actually not a problem, because the encoding is stored per context and gsdll_init() *creates* this context, so a gsdll_init*() call cannot have effects that persist to the next (duh!). Note that (e) remains, because gsapi_run_file*() can be called more than once per gsapi_new_instance(), and the "no suffix" copy should (see (h)) take into account the encoding set by gsapi_set_arg_encoding() (for the "filename" parameter). Also, I think gsapi_init_with_args*() can be called multiple times per context.

In turn, add:

(n) The BOM

Windows UTF-8 and UTF-16 files usually begin with a BOM, which needs to be skipped. This applies to reading @files. (Discussed a bit on IRC, but I forgot to include this here - again!)

(o) Fetching values of true environment variables (not of Registry entries)

See the part starting with "When values are fetched from true environment" in comment #8. To fix, use either getenv() or _wgetenv() + conversion to UTF-8, depending on WINDOWS_NO_UNICODE.
Erratum bis, about (h): "(2) the contents of that file" is wrong; the file passed to gsapi_run_file*() contains PostScript code, so it is not (and should not be) subject to gsapi_set_arg_encoding(). Also, this file's contents are (correctly) not affected by the JIT-recoder. The problem with the filename itself remains.
(In reply to comment #11)
> I suggest adding a "gsdll_initA = gsdll_init" line to the EXPORTS section
> of the DEF, to have both "A"+"W" names as is customary on Windows. No extra
> C code needed (compared to what already exists in the patch), but please
> add the function pointer type to gsdll.h.

Done.

> (a) gsdll.c::gsdll_init() calls itself (!) and gsdll.c::gsdll_initW()
> calls gsdll_init() too. Looks like a typo, seems both should call
> gsdll_init_with_encoding().

D'oh. Fixed. Thanks.

> (b) Either
> Add a prototype for gsdll_init_with_encoding(). The one assumed
> by the compiler when it finds [will find] them called from
> gsdll_init/W() is not suitable.
> Or
> Move gsdll_init/W() after it.

Done. (The latter.)

> (d) I don't see a good reason for the 2 interfaces, GSDLL and GSAPI, to
> be so different in design. The "no prefix" functions have different
> specs (one is ANSI, the other is "specified by a separate call").
> The "variable" encoding is specified by a separate *entry* for GSAPI
> but by an extra *parameter* for GSDLL. [IMO, the GSDLL way is better.]

With the gsdll calls there is no concept of 'specified in a previous call', as the gsdll calls create an instance, operate on it, then delete the instance. Hence the encoding needs to be specified (possibly implicitly) at call time. The gsdll interface is now deprecated, AIUI, so I'm fixing this for completeness' sake.

With gsapi, the structure of the interface is more 'make the context', 'set the context', 'make various calls with that context', 'destroy the context'. As such, a separate call to 'set the encoding used by the context' seems in keeping to me, rather than forcing the encoding to be passed on every call. This is a personal choice, I admit, so I'm prepared to see other people weigh in with their opinions.

> (e) The existing "W"/"_utf8" wrappers call gsapi_set_arg_encoding()
> internally, altering the semantic of the "no prefix" entry point. I
> find this neither intuitive nor desirable at all.
> Note the example
> for gsapi_run_file() in my comment #8 takes care to set the encoding
> back to the default, exactly to avoid interference. [There the "no
> prefix" is specified as "always local", but this does not matter
> from this point of view.]

If you're calling the A or W functions once, then you should be calling them all the time. If you're calling set_arg_encoding, then you shouldn't be calling the A or W interfaces. Problems only occur when you mix and match the two. I'm not sure that setting them back to 'always local' helps, either, as this will still go wrong. Consider:

  gsapi_set_arg_encoding(UTF8)
  gsapi_run_file("file1")
  gsapi_run_fileW("file2")
  gsapi_run_file("file3")

file3 should be run in the UTF8 context, not UCS2 (as my code would run it) or local (as yours would). I've updated the code so that the old arg encoding is always maintained.

> (f) You have a gsapi_run_file_utf8() but no gsapi_init_with_args_utf8().
> I suggest to delete gsapi_run_file_utf8(). This brings GSAPI almost
> in line with GSDLL: UTF-8 obtained by explicitly setting the
> encoding - either by an extra parameter (GSDLL) or extra call
> (GSAPI). If you absolutely want the *_utf8(), add the 2nd one too.

Deleted it.

> (g) You need separate wrappers for the "A" version, you cannot simply
> have "gsapi_run_fileA = gsapi_run_file" in the DEF because the
> functionality of the "A" entry point differs from that of the "no
> prefix" entry point - one expects "local" encoding, the other
> whatever was last specified by gsapi_set_arg_encoding().

Added.

> (h) gsapi_run_file() has an extra problem, and both the patch and my
> example in comment #8 (but not the one in comment #1) have a bug.
> Here there are 2 different places where the encoding must be taken
> into account: (1) the filename itself and (2) the contents of that
> file.
> (2) is handled by your JIT-recoder in arg_next() (as an
> effect of the internal call to gsapi_set_arg_encoding()), but the
> filename parameter must be converted in the wrapper; it cannot be
> passed as-is to the GS "core", because the "core" expects UTF-8.
> [At least I didn't find where that name is converted.]

You are quite right. I have this fixed locally.

> (i) What exactly is a "rune"?

Henry also mentioned his dislike for the name. I've renamed 'rune' to 'codepoint'.

> According to get_rune_utf8(), utf16le_get_rune(), and put_rune(), a
> "rune" is an int containing the 31-bit (or 21-bit, to be more precise)
> Unicode Value ("UV"). But then gp_local_arg_encoding_get_rune() is
> wrong: it returns the 1..4 bytes used by the UTF-8 encoding of the
> character (including the bits that indicate lead/trail bytes). You can
> actually simplify the latter function, because it's easy to get the UV
> from the 1..2 wchar_ts returned by MultiByteToWideChar(); you don't
> need to call WideCharToMultiByte(CP_UTF8,...).

This is a mistake that crept in during my refactoring, when I changed from decoding whole args at once to single 'characters' at a time. I'll fix it.

> (j) Let EOF be EOF, not 0
>
> It's a bad idea to use 0 as an EOF indicator. This is most visible in
> utf16le_get_rune(), because the file will be "truncated" at any *byte*
> with the value 0. Usually EOF is -1, and differs from any UV because
> valid UVs are in the range 0..0x10FFFF (so always positive 32-bit
> ints). If you don't want to rely on C's EOF #define, use -1 directly.

You're right. Peter had a macro to read chars, which I expanded and simplified. And broke. Fixed now.

> (k) WINDOWS_NO_UNICODE, plus some other platforms, look broken
>
> The init done by gsapi_set_arg_encoding(GS_ARG_ENCODING_LOCAL) does
> not look OK. For WINDOWS_NO_UNICODE or non-Windows, it ends up
> setting minst->rune_from_local = NULL, which is passed to arg_init()
> and interpreted to mean UTF-8.

OK.
I've renamed WINDOWS_NO_UNICODE to GS_NO_UTF8. Any OS can now specify this, and it will mean that gs will build using the old-style "8 bit clean" system that it did before. (Filenames etc. will be accepted in simple 8-bit ASCII and parroted back out to system functions in the same way.) Any attempt to call W entry points, or to set the encoding to anything other than local, will fail in this case.

> Command line input: UTF-8 is native on Linux, maybe on Mac(?), but on
> OS/2? VMS? Also WINDOWS_NO_UNICODE should expect ANSI.

OS/2, VMS etc. will have to define GS_NO_UTF8.

> File functions: arg_next() converts everything TO UTF-8 via put_rune().
> So any filename (input/ output/ -I path/ font path/ etc) unexpectedly
> ends up as UTF-8, which file functions don't always handle (depending
> on platform and WINDOWS_NO_UNICODE).

If GS_NO_UTF8 is defined, nothing will be rewritten, and everything will work as before. If GS_NO_UTF8 is NOT defined, then the file functions will be expected to cope.

> (l) Surrogate pairs when decoding UTF-16
>
> The decoding in utf16le_get_rune() looks wrong. 1st, the high/low
> surrogates have different ranges; matters for validation and for
> subtracting the base values. Next, after combining the 10+10 bits add
> 0x10000.

Fixed.

> (m) The name "rune_from_local" [not that important...]
>
> This name (and the comment near it) in struct gs_main_instance_s is
> unfortunate. The function does not look to be the platform-specific
> "local"/"host" -> UV conversion, but the one that converts from whatever
> was set by gsapi_set_arg_encoding() - which does not necessarily have
> anything to do with the "local" encoding.

This has now been renamed to 'get_codepoint' and the comment updated.

> (n) The BOM
>
> Windows UTF-8 and UTF-16 files usually begin with a BOM, which needs to be
> skipped. This applies to reading @files. (Discussed a bit on IRC, but I
> forgot to include this here - again!)

I now skip the BOM in both UTF-8 and UTF-16 when reading.
I skip it wherever it occurs in the string, not just at the start, but really I can't see this being a problem. (Famous last words.)

> (o) Fetching values of true environment variables (not of Registry entries)
>
> See the part starting with "When values are fetched from true environment" in
> comment #8. To fix, use either getenv() or _wgetenv() + conversion to UTF-8,
> depending on WINDOWS_NO_UNICODE.

I think we need to assume that when we getenv values, they will be in Ghostscript's internal format.

I'll upload the new patch in a mo.

Sags: Thanks again for your time in looking at this. I know this is a painful, long drawn-out process, but I hope we'll end up with something worthwhile in the end.
Created attachment 9150 [details]
0001-Reworked-unicode-command-line-handling.patch

Updated patch.
Created attachment 9151 [details]
0001-Fixed-Reworked-unicode-command-line-handling.patch

Updated version of the patch that passes tests.
(In reply to comment #9)
> (In reply to comment #8)
> > (B) XCFLAGS
> >
> > [It's not related to this patch, it's just that I stumbled upon this while
> > looking at why WINDOWS_NO_UNICODE=1 doesn't work.]
> >
> > $(XCFLAGS) is added to the compiler command line twice, once in the
> > msvc.mak line 'CFLAGS=$(CFLAGS) $(XCFLAGS) $(UNICODECFLAGS)' and once in
> > the msvccmd.mak line 'CCFLAGS=$(PLATOPT) $(FPFLAGS) $(CPFLAGS) $(CFLAGS)
> > $(XCFLAGS) ...'. I think the one to delete is the 2nd one.
>
> Also seems plausible.

Turns out that removing this means that the PCL build no longer supports PCL due to the 'PCL_INCLUDED' flag being dropped.
I took a look at the new patch, and the interface design looks OK to me. There are implementation problems, but fixing these does not affect the interface itself.

To get all new interface functions available, you need some fixes to the DEFs. There are 2 EXPORTS missing from gsdll32.def: gsapi_init_with_argsA and gsapi_init_with_argsW. The other 3 DEFs (the 64-bit one and the 2 "metro" ones) are also missing some EXPORTS (most of the new functions). Note the GSDLLEXPORT modifier in the source code makes the linker export these functions automatically, but with mangled names like '_gsapi_init_with_argsA@12' - not what we want.

Here are some of the implementation problems that I found and diagnosed [so far, at least], in no particular order. Note these come mostly from inspecting and tracing the code; I don't have any "real world" and/or "serious" tests.

--- The char type may be signed

In gsargs.c::arg_next() (1 occurrence) and in gsargs.c::get_codepoint_utf8() (2 places):

wrong: (file ? fgetc(file) : (**astr ? *(*astr)++ : EOF))
fix:   (file ? fgetc(file) : (**astr ? (int)(unsigned char)*(*astr)++ : EOF))

When the char type is signed (which for MSVC is the default), converting it to int sign-extends the value - not what you want.

--- Use of character classification CRTL functions with Unicode Values

isspace() and most character classification functions are defined only for 0..255 and EOF (for other arguments the debug CRTL even asserts). arg_next() works with EOF and UVs (so a much larger range), and passing these to isspace() looks suspicious at least. I've also seen isprint() somewhere, but I'm not sure it's affected.

--- In iapi.c::gsapi_init_with_argsA() and iapi.c::gsapi_run_fileA(), for the GS_NO_UTF8 case, you set "local" encoding with:

code = gsapi_set_arg_encoding(lib, clean8bit_get_codepoint);

but gsapi_set_arg_encoding() wants an enum, not a function pointer, for the 2nd argument.
I think here you need:

gs_main_inst_arg_decode(get_minst_from_memory(ctx->memory), clean8bit_get_codepoint);
code = 0;

--- The GS_NO_UTF8 version of gswin32c.exe does not start; it says "Can't create Ghostscript instance". The cause is iapi.c::gsapi_set_arg_encoding(GS_ARG_ENCODING_LOCAL) (called from iapi.c::gsapi_new_instance()) returning e_Fail. Suggested fix in iapi.c::gsapi_set_arg_encoding():

#if defined(GS_NO_UTF8)
    if (encoding == GS_ARG_ENCODING_LOCAL) {
        /* For GS_NO_UTF8 builds, we don't use utf8 internally, and we assume
         * that all inputs are 8 bit clean. */
        gs_main_inst_arg_decode(get_minst_from_memory(ctx->memory), clean8bit_get_codepoint);
+       return 0;    <== ADD THIS
    }
#else

--- GS_OPTIONS is broken (the other envvars are affected too)

There are 2 distinct problems that contribute to this.

(a) The gp_getenv() functions are supposed to return strings in GS's internal encoding. The Windows implementation is partly wrong, in that if it finds a true environment variable the returned value is ANSII encoded. It must return ANSII if GS_NO_UTF8, utf-8 otherwise. This affects all environment variables, not only GS_OPTIONS.

Note that, independent of whether the parent process passed an ANSI or Unicode environment, Windows does the necessary conversion based on the function used to retrieve the values: getenv() always returns ANSII, _wgetenv() always Unicode (wchar_t).

(b) If I understood correctly, for the encoding of arg_source.u.s.chars you expect:
- if .parsed == true, then encoding == GS's internal one (utf-8)
- if .parsed == false, then encoding according to gsapi_set_arg_encoding() (readable by .get_codepoint())

The "pushed back" GS_OPTIONS has .parsed == false and encoding == utf-8 (or ANSII if affected by bug (a)), independent of gsapi_set_arg_encoding(). Note that you cannot set .parsed == true for this one, because it must be parsed to separate individual switches and filenames.
--- Possible overflow of utf-8 buffers for arguments

In gsargs.h::arg_list you have .cstr[arg_str_max + 1]. I think you now need more than +1, because gsargs.c::codepoint_to_utf8(), called from gsargs.c::arg_next(), may write more than 1 byte there. Maybe the tests "if (i == arg_str_max - 1) errprintf("Command too long")" need some thought too.
(In reply to comment #18)
> I took a look at the new patch, and the interface design looks OK to me.
> There are implementation problems, but fixing these does not affect the
> interface itself.

That's great news, thanks.

> To get all new interface functions available, you need some fixes to the
> DEFs. There are 2 EXPORTS missing from gsdll32.def: gsapi_init_with_argsA
> and gsapi_init_with_argsW. The other 3 DEFs (the 64-bit one and the 2
> "metro") are also missing some EXPORTS (most of the new functions).

I will update the patch shortly with a new version that fixes this (and the following issues).

> The char type may be signed
>
> In gsargs.c::arg_next() (1 occurrence) and in gsargs.c::get_codepoint_utf8()
> (2 places):

Fixed.

> Use of character classification CRTL functions with Unicode Values

Fixed.

> In iapi.c::gsapi_init_with_argsA() and iapi.c::gsapi_run_fileA(), for the
> GS_NO_UTF8 case, you set "local" encoding with:

Fixed (in a different way).

> The GS_NO_UTF8 version of gswin32c.exe does not start, it says "Can't create
> Ghostscript instance".

Fixed.

> GS_OPTIONS is broken (the other envvars are affected too)
>
> There are 2 distinct problems that contribute to this.
>
> (a)
> The gp_getenv() functions are supposed to return strings in GS's internal
> encoding. The Windows implementation is partly wrong, in that if it finds a
> true environment variable the returned value is ANSII encoded. It must
> return ANSII if GS_NO_UTF8, utf-8 otherwise. This affects all environment
> variables, not only GS_OPTIONS.
>
> Note that, independent on whether the parent process passed an ANSI or
> Unicode environment, Windows does the necessary conversion based on the
> function used to retrieve the values: getenv() always returns ANSII,
> _wgetenv() always Unicode (wchar_t).

(a) requires more thought from me.
> (b)
> If I understood correctly, for the encoding of arg_source.u.s.chars you
> expect:
> - if .parsed == true, then encoding == GS's internal one (utf-8)
> - if .parsed == false, then encoding according to gsapi_set_arg_encoding()
> (readable by .get_codepoint())

(b) is fixed; we now have separate flags for parsed and decoded. parsed implies decoded, but we can have decoded but unparsed args in the list now.

> Possible overflow of utf-8 buffers for arguments

Fixed.

Once again, many thanks!
(In reply to comment #19)
> > GS_OPTIONS is broken (the other envvars are affected too)
> >
> > There are 2 distinct problems that contribute to this.
> >
> > (a)
> > The gp_getenv() functions are supposed to return strings in GS's internal
> > encoding. The Windows implementation is partly wrong, in that if it finds a
> > true environment variable the returned value is ANSII encoded. It must
> > return ANSII if GS_NO_UTF8, utf-8 otherwise. This affects all environment
> > variables, not only GS_OPTIONS.
> >
> > Note that, independent on whether the parent process passed an ANSI or
> > Unicode environment, Windows does the necessary conversion based on the
> > function used to retrieve the values: getenv() always returns ANSII,
> > _wgetenv() always Unicode (wchar_t).
>
> (a) requires more thought from me.

So, I guess the right thing to do here is to change gp_getenv on windows to use _wgetenv() to get an environment variable as Unicode, and then convert that to utf8 before returning it into gs itself. For 8 bit environment variables, this means we'll get the bytes interpreted as the current codepage as they go into unicode, and then converted down to utf8. For 16 bit environment variables, we'll get a direct conversion down. This sounds correct to me.
Created attachment 9231 [details]
0001-Latest-unicode-handling.patch

Latest version that should address all of Sags' comments, except for the handling of Windows environment variables.
Created attachment 9232 [details]
0001-Latest-unicode-handling2.patch

And again, with the Windows environment variable changes too. Hopefully this is complete now.
Committed in:

commit 4c1a3aac35f1b4b7711c2e9e2f0f38fe121f51db
Author: Robin Watts <robin.watts@artifex.com>
Date:   Tue Sep 25 12:39:12 2012 +0100

    Command line argument character encoding handling rework.

Recap of the situation: Historically gs has always taken command line args in whatever the 'local' encoding of that OS is; for windows this means they are encoded as per the currently selected codepage; for mac/linux this tends to mean utf8. Similarly, whenever gs goes to access a file, it would pass the filenames to the platform specific file opening code - on windows this would assume that it was in the local encoding, and on linux would assume utf8.

There are downsides to dealing with codepage encoding - not least the fact that some files can't be accessed easily, or that the behaviour of PS programs may change according to what platform we are on.

Last year we added code so that gs on windows would run as a unicode app, and would take the supplied unicode argv and convert to utf8 before passing them to the core of gs. The file reading functions would then convert from utf8 back to unicode before opening the files. This code was disabled by default, but seemed to work well in tests.

1 objection was raised to this - namely that we were redefining the meaning of the entry points (to take utf8 rather than locally encoded args) and that this might break legacy code. So, it was suggested that we add a new entry point specifically for utf8, and that we deprecate the old entry point.

In starting this work, I spotted another problem; gs supports the '@file' arg (where more command line args can be given in a file). The current solution of rewriting the command line args does not address this.

Instead, we update the gsargs arg handling code with a 'get_codepoint' function pointer. We update the arg_next function to use this to decode the supplied args (or OPTIONS or file contents) into simple unicode characters.
We then write these into the returned arg string as utf8 (or in the case of GS_NO_UTF8 builds, as 8 bit 'clean' characters).

This commit adds a new gsapi_set_arg_encoding function, that allows the caller to set the encoding used for the args/strings supplied to the subsequent functions. By default, 'local' encoding is assumed, hence preserving existing behaviour. By calling:

  gs_set_arg_encoding(minst, GS_ARG_ENCODING_UTF8);

we can get the UTF8 encoding that we desire. We leave ourselves scope for supporting other encodings (such as EBCDIC?) in the future.

The commit also changes it so that gs always runs with UTF8 encoding internally by default; to avoid it, make with 'GS_NO_UTF8=1'.

For the benefit of windows users we implement UTF16LE encoding (i.e. wchar_t's). And we expose this through the gsapi level as gsapi_run_fileW (and gsapi_init_with_argsW) so that .net can bind nicely to it without any casting trickery being required.

Many thanks to Sags for his tireless efforts and comments while this work was underway.

Any remaining problems etc, we'll either reopen this bug, or (if the problems are small), create a new bug just to deal with those.