Bug 692381 - utf8-Ghostscript - Incompatible changes to the GSAPI/GSDLL interfaces
Summary: utf8-Ghostscript - Incompatible changes to the GSAPI/GSDLL interfaces
Status: RESOLVED FIXED
Alias: None
Product: Ghostscript
Classification: Unclassified
Component: General (show other bugs)
Version: master
Hardware: PC Windows XP
: P4 normal
Assignee: Robin Watts
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2011-07-28 11:14 UTC by SaGS
Modified: 2014-02-17 04:43 UTC (History)
2 users (show)

See Also:
Customer:
Word Size: ---


Attachments
UnicodeNode.txt (4.87 KB, text/plain)
2011-07-30 16:54 UTC, Robin Watts
Details
0001-Command-line-argument-character-encoding-handling-re.patch (36.07 KB, patch)
2012-12-12 17:15 UTC, Robin Watts
Details | Diff
0001-Reworked-unicode-command-line-handling.patch (64.99 KB, patch)
2012-12-17 17:14 UTC, Robin Watts
Details | Diff
0001-Fixed-Reworked-unicode-command-line-handling.patch (64.92 KB, patch)
2012-12-18 11:58 UTC, Robin Watts
Details | Diff
0001-Latest-unicode-handling.patch (69.79 KB, patch)
2013-01-18 12:53 UTC, Robin Watts
Details | Diff
0001-Latest-unicode-handling2.patch (70.62 KB, patch)
2013-01-18 15:42 UTC, Robin Watts
Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description SaGS 2011-07-28 11:14:21 UTC
The existing utf8 additions for the Windows platform resulted in 
incompatible changes of the GSAPI and GSDLL interfaces. Existing GSAPI 
clients could previously pass ‘command line’ options (= arguments to 
functions like gsapi_init_with_agrs()) that included extended characters 
(input filenames, include path, FONTPATH, etc) as long as these characters 
are part of the installed ANSII codepage. Now this does not work anymore 
unless the GSAPI client is specifically modified to convert ‘command line’ 
arguments to utf8 by itself, an encoding that is not a native one on Windows.

Steps to reproduce -------

- First we need an ‘existing GSPSI client’. A gswin32c.exe from a previous
  release or the current gswin32c.exe compiled with WINDOWS_NO_UNICODE will
  show the problem. We will name this executable ‘gswin32c_ansii.exe’.

- Create a PostScript file containing non-7-bit-ASCII characters in the 
  name but still in the system’s ANSII codepage. Let's say this is "Grün.ps"
  ("ü" is Alt+0252 in the ‘Windows: Western’ ANSII codepage). The contents
  does not really matter, a single ‘showpage’ is enough.

- Attempt to display this file using the ‘existing GSPSI client’ and the 
  new *utf8* DLL:

    gswin32c_ansii.exe Grün.ps

    GPL Ghostscript GIT PRERELEASE 9.04 (2011-03-30)
    Copyright (C) 2010 Artifex Software, Inc.  All rights reserved.
    This software comes with NO WARRANTY: see the file PUBLIC for details.
    Error: /undefinedfilename in (Gr\374n.ps)
    ...
    GPL Ghostscript GIT PRERELEASE 9.04: Unrecoverable error, exit code 1

  The error happens because the ‘existing GSPSI client’ passes ANSII 
  strings for the argv[] parameter. The filesystem access functions of
  the new DLL take these bytes, consider them an utf8 string, convert to 
  utf16 actually destroying it, and pass the result to Windows.

More on the scope and limits of this bug report -------

Current GS version introduces 2 incompatible changes.

(A) PostScript operators that interact with the host expect any PostScript
    string parameter they take to be utf8-encoded.

(B) ‘Command line arguments’ (switches, input filenames, etc) are expected
    to be utf8-encoded too. (More precisely, any filenames/ paths/ etc 
    inthere must be utf8; the switch letters, PS operator names, various
    names recognized by GS, etc use only 7-bit ASCII characters and their
    codification is identical ASCII <-> ANSII <-> utf8.)

The current report is about eliminating (B) and not about (A). (A) does not 
typically affect PostScript-as-a-page-description-language files, only 
PostScript-as-a-programming-language files plus some uses like concatenation 
of PS files with save-runfile-restore-repeat. The changes help somehow 
with (A) by making gsapi_revision()/gsdll_revision() to report if file 
PS operators expect utf8 or ANSII strings, so the GSAPI client knows what 
to do if generating PS code on the fly.

Next steps -------

Comment #1 below describes what type of changes I consider necessary.
Comment #2 is a list of all GSAPI and GSDLL interface functions/ messages/
and similar, telling which need changing and how, and which remain 
untouched and why.

No patch yet. I hope to have something this weekend. (I hoped the last 
weekend too, but nothing came out for lack of time...) It’s not much code,
but it’s possible the overhead to do this correctly (no platform-specific
code in i*.c files) to be significant. Plus, I don’t know if anyone agrees 
with the design in general.
Comment 1 SaGS 2011-07-28 11:15:11 UTC
My suggestions on what to do:

(i) -------
Functions that take text parameters get encoding-specific variants. To do 
this in a ‘Windowsly’ manner, names for the ANSII variants get an "A" 
suffix and for those that expect wchar_t parameters the suffix is "W". The
no-suffix names are kept for backward compatibility, and are aliases for 
the ANSII entry points. For reasons explained later, there is another 
variant that takes utf8 strings, doing no conversion at all; I suggets the
suffix "8" for these.

Example for gsapi_run_file():

- Add:

    int gsapi_run_fileA (..., const char *file_name/*ANSII*/, ...);

  New function. All it does is convert file_name ANSII->utf8 and call the
  utf8 worker function (see below). Exported from the *.DEF file as both
  ‘gsapi_run_fileA’ (suffix ‘A’) and ‘gsapi_run_file’ (no suffix).
  (‘EXPORTS’ syntax: ‘externalname = internalname [@ordinal]’.)

- Add:

    int gsapi_run_fileW (..., const wchar_t *file_name/*utf16*/, ...);

  New function. It converts utf16->utf8 and calls the same utf8 worker 
  function. Exported from the *.DEF file as ‘gsapi_run_fileW’ (suffix ‘W’).

- Keep unchanged:

    int gsapi_run_file (..., const char *file_name/*utf8*/, ...);

  This is the worker function called by the ‘A’ and ‘W’ variants. Also 
  exported from the *.DEF as ‘gsapi_run_file8’ (suffix ‘8’).

Why the ‘A/W’ variants:
    We need the ANSII version for backwards compatibility, and the Unicode 
    version for grater coverage of filenames. Utf8 provides the same coverage 
    as utf16 (aka ‘Unicode’ aka ‘wide character’), but utf8 is not a native 
    Windows encoding. These variants with exactly these suffixes are ‘the 
    Windows way’ for this kind of job. Think of these functions as helpers
    for calling the utf8 variant.

Why the no-suffix variant [on Windows]:
    Ensures that existing GSAPI/ GSDLL clients can link, unmodified, to the
    new DLL. Note that with the name-juggling in the DEF, the no-prefix 
    externally visible symbol links to the ANSII functions on Windows and
    to the utf8 one on all other platforms.

Why the ‘8’ variant:
    Currently, the gswin32.exe and gswin32c.dll get command-line as utf16 
    and convert automatically to utf8. In the process, the exact bytes 
    that represent these arguments change. We implicitely assume that 
    nonbody will put binary bytes (true binary, not hex or ascii85) on the 
    command line, and that this limitation is much less annoying than 
    not being able to simply type filenames with extended characters.
    For ‘gsapi_init_with_args()’ the situation si a bit different, the 
    ‘argv[]’ is [could be] generated by a program. This program can very
    well pass a ‘-c’ with binary tokens that would be destroyed by a 
    charset conversion. For such an exotic use, I suggest to provide these
    ‘8’ entries. Note this requires exactly ZERO coding, the function is 
    there anyway.

(ii) -------
I also consider that such an important change as the encoding for 
filename strings passed to PostScript operators must be signaled by 
‘gs_revision()’. At least the GSAPI client can know what kind of DLL it 
loaded and cat appropriately. Also allows writing clients that work with 
both ‘new’ and ‘old’ DLLs, and une the new functionality if available.

(iii) -------
After doing the above changes, the gswin32c.exe and gswin32.exe need to be 
changed to call the new entry points. Currently they call the no-suffix 
function but pass utf8 parameters. What they have to do is pass the 
wide char strings to the ‘W’ functions directly. I consider the 
wchar_t -> utf8 conversion is currently misplaced: being in the frontends
resolve the problems for those frontends; their place is ‘after’ the 
GSAPI interface, to help all clients and not introduce incompatibilities.

(iv) -------
Also these have to be documented in API.htm and DLL.htm. Until now, the 
specs were silent about character encoding, assuming some general 
consensus that the encoding is the one used by the host. Now, because the 
encoding on Windows is forced to be a non-native one, the docs may say 
exactly where is this encoding required and where not.
Comment 2 SaGS 2011-07-28 11:17:11 UTC
List of GSAPI and GSDLL interface elements, and my opinion on what has to 
change and what can/must remain as-is:

FUNCTIONS THAT CHANGE by backwards-compatile additions -------

gsapi_revision()
gsdll_revision()
    Taking a shortcut here: the ‘product’ and ‘copyright’ strings always 
    consist of 1-byte 7-bit ASCII characters; the character encoding used 
    does not vary and there are no "A/W/8" variants.

    The ‘product’ changes to optionnaly include a ‘built-in features’ list.
    This list follows the product name proper, is enclosed in square 
    brackets, and consists of space-delimited predefined tokens.

    There is currently only one such feature:

    utf8osops   If present, operators that interface with the host os 
                interpret any [PostScript] strings they take as parameters
                as being UTF-8 encoded. Note: no initial BOM is expected or 
                allowed.
                If absent, these PostScript strings are assumed to use the
                same encoding as the host os. For Windows this is the ANSII
                codepage in effect, and currently happens (i) for GS versions 
                9.02 and earlier and (ii) for versions 9.04 and later when 
                compiled with the option WIDNOWS_NO_UNICODE. For the other 
                platforms, this is the only case currently implemented.
                Note: GS 9.03 does not return this information, although it
                      can be compiled with or without this feature.

FUNCTIONS THAT CHANGE because they need encoding-specific variants -------

It turs out there are only 3 of these (2 x GSAPI + 1 x GSDLL).

gsapi_init_with_args()      gsdll_init()
gsapi_run_file()
    These functions need "A/W/8" variants for the possible encoding used by 
    the ‘argv’ and ‘file_name’ parameters. The ‘callback’ parameter of 
    gsdll_init() does not vary depending on a character encoding, see below
    why (GSDLL_CALLBACK detailed by message).

FUNCTIONS THAT DO NOT CHANGE -------

gsapi_new_instance()        GSDLL_CALLBACK message GSDLL_POLL
gsapi_delete_instance()
gsapi_run_string_begin()    gsdll_execute_begin()
gsapi_run_string_end()      gsdll_execute_end()
gsapi_exit()                gsdll_exit()
gsapi_set_poll()
the poll_fn() callback passed to gsapi_set_poll()
    These functions do not manipulate character data, and do not have any
    char/ char[]/ char* parameters, so they do not need any change.

gsapi_run_string_continue()            gsdll_execute_cont()
gsapi_run_string_with_length()
gsapi_run_string()
gsapi_set_stdio()
the stdin_fn() for gsapi_set_stdio()   GSDLL_CALLBACK message GSDLL_STDIN
the stdout_fn() for gsapi_set_stdio()  GSDLL_CALLBACK message GSDLL_STDOUT
the stderr_fn() for gsapi_set_stdio()
    Character buffers handled by these functions are actually fragments from
    the PostScript interpreter’s input/ output/ error streams, and as such
    are goverend by the PostScript syntax and semantics. Nothing here depends
    on the encodind used by the host os for textual information, so these 
    functions don’t change (no "A/W/8" variants of them).

gsdll_lock_device()         GSDLL_CALLBACK message GSDLL_DEVICE
gsdll_copy_dib()            GSDLL_CALLBACK message GSDLL_SYNC
gsdll_copy_palette()        GSDLL_CALLBACK message GSDLL_PAGE
gsdll_draw()                GSDLL_CALLBACK message GSDLL_SIZE
gsdll_get_bitmap_row()
    The ‘device’ parameter does not point to textual information, it’s rather
    of an opaque type, so not subject to character encoding. 
gsdll_get_bitmap()
    The ‘device’ parameter is actually of an opaque type, and not textual 
    information that may be subject to character encoding. The ‘pbitmap’
    is a pointer to binary data. Also, this is an OS/2-specific entry 
    point, and the utf8 file operators are not [yet?] implemented for this 
    platform.

gsapi_set_display_callback()
the display_callback passed to gsapi_set_display_callback()
    The only char data I found here is the ‘const char *component_name’
    parameter of the display_separation() structure member. This is used
    to pass SeparationNames from the PostScript program, so its contents
    (including any implied encoding, although this is primarily just an array
    of bytes) is dictated by the running PostScript code. No character 
    encoding to be specified at the GSAPI interface level.

the vd_trace_host_s pointed by a vd_trace_interface_s member
    This structure is completely defined and controlled by the module that
    draws the ‘trace image’, which is the GSAPI client. Hence, no need for 
    a character encoding contract at the GSAPI interface level.
the vd_trace_interface_s passed to gsapi_set_visual_tracer()
    The only member with text data is text(..., char *ASCIIZ), and the only 
    place I found it used is in base\gxhintn.c::t1_hinter__paint_glyph().
    From what I see, it is used to label the Visual Trace image. The text
    consists of debug information (does not come from PostScript nor the
    command line/ file names/ etc). I conclude the character encoding used
    is currently defined by the C compiler. Having a more precise 
    specification of the encoding would be marginally useful. I consider 
    defining "A/W/8" variants of vd_trace_interface_s to be overkill.
gsapi_set_visual_tracer()
    If this were a WINAPI function, it would have "A/W" variants.
    But for the reasons explained above, I consider it does not need them.
Comment 3 Henry Stiles 2011-07-29 21:04:38 UTC
The consensus on this was we don't want to take on multiple api's.  Robin Watts has prepared a release document to warn commercial customer that we will go forward with utf-8 next release (9.05).  Reassigning to him so he can attach the document and close the bug.  Thank you for the detailed and complete analysis of this issue.
Comment 4 Robin Watts 2011-07-30 16:54:17 UTC
Created attachment 7728 [details]
UnicodeNode.txt

This is being included in the release notes for 9.04, which will ship with WINDOWS_NO_UNICODE. Unless we get feedback from customers to change our mind, we will proceed with the "incompatible change" to the API in 9.05.

At the moment, the feeling is strongly that taking on multiple different APIs for different platforms is something we'd like to avoid.
Comment 5 Robin Watts 2011-07-30 16:54:49 UTC
Thanks again for your analysis.
Comment 6 Ray Johnston 2011-07-30 18:25:48 UTC
I hope we call this UnicodeNote.txt (not UnicodeNode) when we ship it. ;-)
Comment 7 Robin Watts 2012-09-25 19:25:48 UTC
Sags: I have revisited the unicode handling, driven at least in part by your comments here. I have taken a slightly different approach, but I think it gives us the backwards compatibility that we want.

See the following link for my current attempt:

http://git.ghostscript.com/?p=user/robin/ghostpdl.git/.git;a=commitdiff;h=0d3ce254a0a03d3530623c5be5b4c1d92466f87c

Any comments you may have gratefully accepted.
Comment 8 SaGS 2012-10-14 22:01:29 UTC
I took a look at the patch from comment #7, and here are a few notes about what I think. Hope these will be of some help.


-------------
(A) WINDOWS_NO_UNICODE test

The test in msvc.mak (and pcl6_msvc.mak) line ‘!if "$(WINDOWS_NO_UNICODE)" != "1"’ seems reversed.


-------------
(B) XCFLAGS

[It's not related with this patch, it’s just that I stumbled upon this while looking why WINDOWS_NO_UNICODE=1 doesn’t work.]

$(XCFLAGS) is added to the compiler command line twice, once in msvc.mak ‘CFLAGS=$(CFLAGS) $(XCFLAGS) $(UNICODECFLAGS)’ and once in msvccmd.mak line ‘CCFLAGS=$(PLATOPT) $(FPFLAGS) $(CPFLAGS) $(CFLAGS) $(XCFLAGS) ...’. I think the one to delete is the 2nd one.


-------------
(C) The ‘string type fuzzyness’

In C, strings are sequences of bytes ended by a byte with the value 0x00. As long as we use a single character encoding everywhere, there is no confusion about the semantic of these bytes.

When we work with varying encodings, to know how to handle each string we need to know its encoding.

- One way is to ‘tag’ each string with its encoding, like instead of 
  ‘char *’ use ‘struct { char *p; enum {..} encoding;}’ - too complicated.

- Another is to pass around extra parameters to specify the encoding used 
  by each ‘char *’ or ‘char []’ parameter.

- A 3rd way is to use different functions, each working with a specific 
  encoding as in ‘fnA(char * /*ANSI*/)’ + ‘fn8(char * /*UTF-8*/)’ + 
  ‘fnW(wchar_t * /*UTF-16*/)’.
  [Note: I used the suffix ‘8’ for the utf-8 variant, but anything you 
  like (‘_utf8’) is OK with me.]

The patch does not address this need, and gsapi_set_arg_encoding() makes things worse from this point of view.


-------------
(C.1) ‘String type fuzzyness’ part #1: the DLL entry points
      plus why "A"/"W" and why the UTF-8 entry is not so usefull

The exact type (including the encoding) of strings passed to some DLL entry points is not known at compile-time. For example it is not possible to write a prototype that expresses exactly what these functions expect. Workarounds exist, but are not elegant, neither 100% safe.

The problem is more visible with .NET Framework clients that want to call the GS DLL via P/Invoke. The runtime can convert strings, but (a) a correct prototype is needed and (b) the only available options are ANSII and Unicode (which means UTF-16). I’m not a .NET expert, but I think the only way to pass strings as UTF-8 is to handle that parameter as if it were a generic ‘void *’ and do manual allocation of unmanaged memory and storing of the necessary bytes there.

I think it’s not a big change, compared to the current patch, to separate "A"/"W"/"8" (or "_utf8" if you prefer) entries.
Pseudo-code for gsapi_run_file():

  - Don’t export gsapi_set_arg_encoding(), but keep it for internal use.

  - In iapi.c (assuming arg decoding initialized to GS_ARG_ENCODING_LOCAL):

    int gsapi_run_file (char *file_name)
    {
        ... the current function, unchanged
    }

    #if WINDOWS
    int gsapi_run_file8 (char *file_name)
    {
        gsapi_set_arg_encoding(GS_ARG_ENCODING_UTF8);
        i = gsapi_run_file (file_name);
        gsapi_set_arg_encoding(GS_ARG_ENCODING_LOCAL);
        return i;
    }

    int gsapi_run_fileW (wchar_t *file_name)
    {
        gsapi_set_arg_encoding(GS_ARG_ENCODING_UTF16/*future developement*/);
        i = gsapi_run_file ((char *)file_name);
        gsapi_set_arg_encoding(GS_ARG_ENCODING_LOCAL);
        return i;
    }
    #endif

  - In gsdll32.def:

    EXPORTS
        gsapi_run_file
        gsapi_run_fileA = gsapi_run_file
        gsapi_run_fileW
        gsapi_run_file8

Note the gsapi_*() functions are not reentrant (for the same context) and their arguments are processed to completion before they return, which makes this juggling with the GS_ARG_ENCODING_* be all that’s needed.
Also note there are only 3 functions (2 × GSAPI and 1 × GSDLL, see comment #2) that need such wrappers.

[The explanations here seem incomplete, but this is all I could write in the time I had.]


-------------
(C.2) ‘String type fuzzyness’ part #2: handling of arguments


I’m refering to gsargs.h::arg_source.chars, gsargs.c::arg_next(), etc.

To analyze the arguments, arg_next() looks for spaces, quotes, backslash, etc. The code looks written with single-byte 7-bit ASCII encoding in mind, and works with encodings that are ‘compatible enough’ with ASCII. UTF-8 and single-byte ANSII encodings are ‘compatible enough’, but double-byte ANSII and UTF-16 are not.

With the current solution, strings in arg_source are considered to use whatever encoding they had when coming from ‘the outside’, and are converted to UTF-8 just before returning from arg_next(). There are problems with this.

- Some of these strings are ‘pushed back’ strings, and these are already 
  in UTF-8 (being converted by the arg_next() that retrived them the 1st 
  time). The main part of arg_next() works OK with these, but 
  double-converts them on exit. The situation can be detected by checking
  arg_source.u.s.parsed and skip converting if true, but see below.

  Example:
  [In these examples gswin32c_ASNII.exe is the ‘old GSAPI client’ obtained
  here by compiling with WINDOWS_NO_UNICODE, but it calls the Unicode DLL.
  The file named ‘é’ (U+00E9) does not exist.]

   a) gswin32c_ANSII -fé

   -> Error: /undefinedfilename in (\303\251)

      OK (the file does not exist). ‘é’ is U+00E9 (0000 0000 1110 1001), 
      encoded as UTF-8 gives "\xC3\xA9" (110 00011 10 101001).

   b) gswin32c_ANSII -c 0 -fé

   -> Error: /undefinedfilename in (\303\203\302\251)

      Wrong, the ANSII string ‘é’ ("\xE9") has beed converted to UTF-8 
      once "\xC3\xA9" then it was ‘pushed back’, considered to be 
      ANSII-encoded (U+00C3 U+00A9), and converted to UTF-8 a 2nd time.

- The GS_OPTIONS, when retrieved from the Registry, is also stored here 
  encoded as UTF-8 but does not have .parsed == true. It also suffers
  from double-decoding.

      ..define ‘GS_OPTIONS=-sTEST=é’ in the Registry
      gswin32c_ANSII -c TEST == quit

   -> (\303\203\302\251)

  While I’m here [this problem existed before the current patch, it’s 
  not a regression]: When values are fetched from true environment 
  variables (not from the Registry), they are incorrectly ANSII encoded.
  The ‘correct’ value seen here comes from the 2nd decoding.

      set GS_OPTIONS=-sTEST=é
      gswin32c_ANSII -c TEST == quit

   -> (\303\251)

  This happens because gp_wgetv.c::gp_getenv() calls getenv(), it must
  call one of getenv() or _wgetenv() depending on WINDOWS_NO_UNICODE
  (converting the encoding when needed).

The preceding 2 problems are similar to those described in bug #688965 "GS_OPTIONS env doesn't allow quoted arguments or spaces".

- [All?] other strings use the encoding they came in from ‘the outside’.

  arg_next() does not work correctly if this encoding is a double-byte 
  ANSII one <http://msdn.microsoft.com/en-us/goglobal/bb964654>. See 
  bug #691117 "Gs cannot open lib files when directory has some SJIS 
  letters in its name" for the explanation of the ‘2nd byte matched for 
  backslash’ problem.
  [Not a regression, this never worked].

  Also, we cannot support UTF-16 as an ‘outside encoding’ by simply 
  adding a new arg_decode() function; all of arg_next() would need to 
  be rewritten to handle 2-byte characters (including NUL bytes that 
  don’t terminate strings).

I think the only way to solve the above problems is to have everything in arg_source as UTF-8. This means converting strings to UTF-8 BEFORE putting them there (instead of when retrieving them), and converting what is read from @files on-the-fly (taking care of multi-byte characters) - exactly what I think you wanted to avoid :(
Comment 9 Robin Watts 2012-12-12 00:55:43 UTC
(In reply to comment #8)
> (A) WINDOWS_NO_UNICODE test
> 
> The test in msvc.mak (and pcl6_msvc.mak) line ‘!if "$(WINDOWS_NO_UNICODE)" !=
> "1"’ seems reversed.

I agree that the test is wrong. I think it should simply be:

  !if "$(WINDOWS_NO_UNICODE)" != ""

> (B) XCFLAGS
> 
> [It's not related with this patch, it’s just that I stumbled upon this while
> looking why WINDOWS_NO_UNICODE=1 doesn’t work.]
> 
> $(XCFLAGS) is added to the compiler command line twice, once in msvc.mak
> ‘CFLAGS=$(CFLAGS) $(XCFLAGS) $(UNICODECFLAGS)’ and once in msvccmd.mak line
> ‘CCFLAGS=$(PLATOPT) $(FPFLAGS) $(CPFLAGS) $(CFLAGS) $(XCFLAGS) ...’. I think
> the one to delete is the 2nd one.

Also seems plausible.

> In C, strings are sequences of bytes ended by a byte with the value 0x00. As
> long as we use a single character encoding everywhere, there is no confusion
> about the semantic of these bytes.

The plan was to use a single encoding (utf8) for everything after arg_next, and to use a single (varying set of) encoding(s) before that.

> (C.2) ‘String type fuzzyness’ part #2: handling of arguments
> 
> 
> I’m refering to gsargs.h::arg_source.chars, gsargs.c::arg_next(), etc.
> 
> To analyze the arguments, arg_next() looks for spaces, quotes, backslash, etc.
> The code looks written with single-byte 7-bit ASCII encoding in mind, and
> works with encodings that are ‘compatible enough’ with ASCII. UTF-8 and
> single-byte ANSII encodings are ‘compatible enough’, but double-byte ANSII
> and UTF-16 are not.

I have a reworked patch here that moves the decoding to be earlier within arg_next; we now decode the strings a 'rune' at a time (i.e. possibly multiple bytes are consumed from the arg to create single unicode values). We then look at those runes to check for quoting etc. These unicode values are then written back as utf8 into the arg buffer.

This should lift the 'compatible enough' restriction.

> I think the only way to solve the above problems is to have everything in
> arg_source as UTF-8. This means converting strings to UTF-8 BEFORE putting
> them there (instead of when retrieving them), and converting what is read from
> @files on-the-fly (taking care of multi-byte characters) - exactly what I
> think you wanted to avoid :(

I hope that my new patch effectively achieves this, but in a 'just in time' manner.

I still have the strcmp's from imainarg to deal with, but I have a plan for this too.

> (C.1) ‘String type fuzzyness’ part #1: the DLL entry points
>       plus why "A"/"W" and why the UTF-8 entry is not so usefull

The desire to have a simple prototype for pinvoke is a good one. I am therefore leaning towards offering the gsapi helper functions you suggest (though I personally prefer leaving gsapi_set_arg_encoding as an external).

I will upload a new patch for your consideration tomorrow.
Comment 10 Robin Watts 2012-12-12 17:15:54 UTC
Created attachment 9144 [details]
0001-Command-line-argument-character-encoding-handling-re.patch

New patch to try to address Sags concerns.
Comment 11 SaGS 2012-12-16 22:41:24 UTC
I took a look at the patch, and here are my notes. I did not actually
*run* the code, everything here comes from inspecting the source 
(apologies if some of it turns to be completely wrong). Anyway, hope
that will help.


---
(I) The GSDLL interface

A single function needs wrappers, OK.

I suggest adding a "gsdll_initA = gsdll_init" line to the EXPORTS section 
of the DEF, to have both "A"+"W" names as is customary on Windows. No extra 
C code needed (compared to what already exists in the patch), but please 
add the function pointer type to gsdll.h.

Then, the interface specs [as seen from callers of the DLL] will be:

    gsdll_init      Always uses the "local" encoding. On Windows, this is
                    equivalent to gsdll_initA; the "no prefix" name is kept
                    for backwards compatibility.

    gsdll_initA     Always "local" encoding; Windows-specific.
    gsdll_initW     Always "unicode" encoding; Windows-specific.

    gsdll_init_with_encoding
                    The encoding is specified by an extra parameter.

Note:
    To use UTF-8 encoding on Windows (which anyway is "unusual") one 
    will use gsdll_init_with_encoding(.., GS_ARG_ENCODING_UTF8).

As above, it looks [will look] OK to me.

Notes on the implementation:

(a) gsdll.c::gsdll_init() calls itself (!) and gsdll.c::gsdll_initW() 
    calls gsdll_init() too. Looks like a typo, seems both should call 
    gsdll_init_with_encoding().

(b) Either
        Add a prototype for gsdll_init_with_encoding(). The one assumed
        by the compiler when it finds [will find] them called from 
        gsdll_init/W() is not suitable.
    Or
        Move gsdll_init/W() after it.

(c) [added after reviewing what I wrote] This interface also suffers 
    from the "encoding contamination" of the no-prefix entry point from
    the other entry points as described below for GSAPI - see (e). And
    even between the GSDLL and GSAPI interfaces!


----
(II) The GSAPI interface

It is not clear to me how you see this specified, more precisely which 
function (name as visible from the GSAPI client) uses which character 
set. Looking at the code, my best guess is:

    gsapi_* (no suffix)
                    Expects arguments to have the encoding set by the
                    most recent call to gsapi_set_arg_encoding().
                    By default, the encoding is the "local" one,
                    to be compatible with old clients.

    gsapi_*A        Always "local" encoding; Windows-specific.
    gsapi_*W        Always "unicode" encoding; Windows-specific.
    gsapi_*_utf8    Always UTF-8 encoding.

If this is what you wanted, then there are problems with the wrappers:

(d) I don't see a good reason for the 2 interfaces, GSDLL and GSAPI, to
    be so different in design. The "no prefix" functions have different 
    specs (one is ANSI, the other is "specified by a separate call").
    The "variable" encoding is specified by a separate *entry* for GSAPI
    but by an extra *parameter* for GSDLL. [IMO, the GSDLL way is better.]

(e) The existing "W"/"_utf8" wrappers call gsapi_set_arg_encoding() 
    internally, altering the semantic of the "no prefix" entry point. I
    find this neither intuitive nor desirable at all. Note the example 
    for gsapi_run_file() in my comment #8 takes care to set the encoding 
    back to the default, exactly to avoid interference. [There the "no 
    prefix" is specified as "always local", but this does not matter 
    from this point of view.]

(f) You have a gsapi_run_file_utf8() but no gsapi_init_with_args_utf8().
    I suggest to delete gsapi_run_file_utf8(). This brings GSAPI almost
    in line with GSDLL: UTF-8 obtained by explicitly setting the
    encoding - either by an extra parameter (GSDLL) or extra call
    (GSAPI). If you absolutely want the *_uf8(), add the 2nd one too.

(g) You need separate wrappers for the "A" version, you cannot simply
    have "gsapi_run_fileA = gsapi_run_file" in the DEF because the
    functionality of the "A" entry point differs from that of the "no
    prefix" entry point - one expects "local" encoding, the other 
    whatever was last specified by gsapi_set_arg_encoding().

(h) gsapi_run_file() has an extra problem, and both the patch and my
    example in comment #8 (but not the one in comment #1) have a bug.
    Here there are 2 different places where the encoding must be taken
    into account: (1) the filename itself and (2) the contents of that
    file. (2) is handled by your JIT-recoder in arg_next() (as an
    effect of the internal call to gsapi_set_arg_encoding()), but the
    filename parameter must be converted in the wrapper; it cannot be
    passed as-is to the GS "core", because the "core" expects UTF-8.
    [At least I didn't find where is that name converted.]


-----
(III) Other notes about the implementation, in no particular order:

(i) What exactly is a "rune"?

According to get_rune_utf8(), utf16le_get_rune(), and put_rune(), a 
"rune" in an int containing the 31-bit (or 21-bit, to be more precise) 
Unicode Value ("UV"). But then gp_local_arg_encoding_get_rune() is
wrong, it returns the 1..4 bytes used by the UTF-8 encoding of the
character (including the bits that indicate lead/trail bytes). You can
actually simplify the latter function, because it's easy to get the UV
from the 1..2 whar_ts returned by MultiByteToWideChar(); you don't 
need to call WideCharToMultiByte(CP_UTF8,...).

(j) Let EOF be EOF, not 0

It's a bad idea to use 0 as en EOF indicator. This is most visible in
utf16le_get_rune(), because the file will be "truncated" at any *byte*
with the value 0. Usually EOF is -1, and differs from any UV because
valid UVs are in the range 0..0x10FFFF (so always positive 32-bit 
ints). If you don't want to rely on C's EOF #define, use -1 directly.

(k) WINDOWS_NO_UNICODE, plus some other platforms, look broken

The init done by gsapi_set_arg_encoding(GS_ARG_ENCODING_LOCAL) does 
not look OK. For WINDOWS_NO_UNICODE or non-Windows, it ends up 
setting minst->rune_from_local = NULL, which is passed to arg_init()
and interpreted to mean UTF-8.

Command line input: UTF-8 is native on Linux, maybe on MAc(?) but on
OS/2? VMS? Also WINDOWS_NO_UNICODE should expect ANSI.

File functions: arg_next() converts everything TO UTF-8 via put_rune().
So any filename (input/ output/ -I path/ font path/ etc) unexpectedly
ends up as UTF-8, which file functions don't always handle (depending
on platform and WINDOWS_NO_UNICODE).

(l) Surrogate pairs when decoding UTF-16

The decoding in utf16le_get_rune() looks wrong. 1st, the high/low
surrogates have different ranges; matters for validation and for 
subtracting the base values. Next, after combining the 10+10 bits add
0x10000.

(m) The name "rune_from_local" [not that important...]

This name (and the comment near it) in struct gs_main_instance_s is 
unfortunate. The function does not look to be the platform-specific 
"local"/"host" -> Uv conversion, but the one that converts from whatever
was set by gsapi_set_arg_encoding() - which does not necessarily have 
anything to do with the "local" encoding.
Comment 12 SaGS 2012-12-17 05:24:56 UTC
Errata to comment #11:

Forget (c), it is actually not a problem, because the encoding is stored per context and gsdll_init() *creates* this context, so a gsdll_init*() call cannot have effects that persist to the next (duh!).

Note that (e) remains, because gsapi_run_file*() can be called more than once per gsapi_new_instance() and the "no prefix" copy should (see (h)) take into account the encoding set by gsapi_set_arg_encoding() (for the "filename" parameter). Also I think gsapi_init_with_args*() can be called multiple times per context.


In turn, add:

(n) The BOM

Windows UTF-8 and UTF-16 files usually begin with a BOM, which needs to be skipped. This applies to reading @files. (Discussed a bit on IRC, but I forgot to include this here - again!)

(o) Fetching values of true environment variables (not of Registry entries)

See the part starting with "When values are fetched from true environment" in comment #8. To fix, use either getenv() or _wgetenv() + conversion to UTF-8, depending on WINDOWS_NO_UNICODE.
Comment 13 SaGS 2012-12-17 05:36:30 UTC
Erratum bis, about (h):

The "(2) the contents of that file" is wrong, the file passed to gsapi_tun _file*() contains PostScript code so is not (and should not be) subject to gsapi_set_arg_encoding(). Also this file's contents is (correctly) not affected by the JIT-recoder. The problem with the filename itself remains.
Comment 14 Robin Watts 2012-12-17 17:06:56 UTC
(In reply to comment #11)
> I suggest adding a "gsdll_initA = gsdll_init" line to the EXPORTS section 
> of the DEF, to have both "A"+"W" names as is customary on Windows. No extra 
> C code needed (compared to what already exists in the patch), but please 
> add the function pointer type to gsdll.h.

Done.

> (a) gsdll.c::gsdll_init() calls itself (!) and gsdll.c::gsdll_initW() 
>     calls gsdll_init() too. Looks like a typo, seems both should call 
>     gsdll_init_with_encoding().

D'Oh. Fixed. Thanks.

> (b) Either
>         Add a prototype for gsdll_init_with_encoding(). The one assumed
>         by the compiler when it finds [will find] them called from 
>         gsdll_init/W() is not suitable.
>     Or
>         Move gsdll_init/W() after it.

Done. (The latter)

> (d) I don't see a good reason for the 2 interfaces, GSDLL and GSAPI, to
>     be so different in design. The "no prefix" functions have different 
>     specs (one is ANSI, the other is "specified by a separate call").
>     The "variable" encoding is specified by a separate *entry* for GSAPI
>     but by an extra *parameter* for GSDLL. [IMO, the GSDLL way is better.]

With the gsdll calls there is no concept of 'specified in a previous call', as the gsdll calls create an instance, operate on it, then delete the instance. Hence the encoding needs to be specified (possibly implicitly) at call time. The gsdll interface is now deprecated, AIUI, so I'm fixing this for completeness sake.

With gsapi the structure of the interface is more 'make the context', 'set the context', 'make various calls with that context', 'destroy the context'. As such a separate call to 'set the encoding used by the context' seems in keeping to me, rather than forcing the encoding to be passed on every call.

This is a personal choice, I admit, so I'm prepared to see other people weigh in with their opinions.

> (e) The existing "W"/"_utf8" wrappers call gsapi_set_arg_encoding() 
>     internally, altering the semantic of the "no prefix" entry point. I
>     find this neither intuitive nor desirable at all. Note the example 
>     for gsapi_run_file() in my comment #8 takes care to set the encoding 
>     back to the default, exactly to avoid interference. [There the "no 
>     prefix" is specified as "always local", but this does not matter 
>     from this point of view.]

If you're calling the A or W functions once, then you should be calling them all the time. If you're calling set_arg_encoding, then you shouldn't be calling the A or W interfaces. Problems only occurs when you mix and match the two.

I'm not sure that setting them back to 'always local' helps, either, as this will still go wrong. Consider:

  gsapi_set_arg_encoding(UTF8)
  gsapi_run_file("file1")
  gsapi_run_fileW("file2")
  gsapi_run_file("file3")
  
file3 should be run in the UTF8 context, not UCS2 (as my code would run it) or local (as yours would).

I've updated the code so that the old arg encoding is always maintained.

> (f) You have a gsapi_run_file_utf8() but no gsapi_init_with_args_utf8().
>     I suggest to delete gsapi_run_file_utf8(). This brings GSAPI almost
>     in line with GSDLL: UTF-8 obtained by explicitly setting the
>     encoding - either by an extra parameter (GSDLL) or extra call
>     (GSAPI). If you absolutely want the *_uf8(), add the 2nd one too.

Deleted it.

> (g) You need separate wrappers for the "A" version, you cannot simply
>     have "gsapi_run_fileA = gsapi_run_file" in the DEF because the
>     functionality of the "A" entry point differs from that of the "no
>     prefix" entry point - one expects "local" encoding, the other 
>     whatever was last specified by gsapi_set_arg_encoding().

Added.

> (h) gsapi_run_file() has an extra problem, and both the patch and my
>     example in comment #8 (but not the one in comment #1) have a bug.
>     Here there are 2 different places where the encoding must be taken
>     into account: (1) the filename itself and (2) the contents of that
>     file. (2) is handled by your JIT-recoder in arg_next() (as an
>     effect of the internal call to gsapi_set_arg_encoding()), but the
>     filename parameter must be converted in the wrapper; it cannot be
>     passed as-is to the GS "core", because the "core" expects UTF-8.
>     [At least I didn't find where is that name converted.]

You are quite right. I have this fixed locally.

> (i) What exactly is a "rune"?

Henry also mentioned his dislike for the name. I've renamed 'rune' to 'codepoint'.

> According to get_rune_utf8(), utf16le_get_rune(), and put_rune(), a 
> "rune" in an int containing the 31-bit (or 21-bit, to be more precise) 
> Unicode Value ("UV"). But then gp_local_arg_encoding_get_rune() is
> wrong, it returns the 1..4 bytes used by the UTF-8 encoding of the
> character (including the bits that indicate lead/trail bytes). You can
> actually simplify the latter function, because it's easy to get the UV
> from the 1..2 whar_ts returned by MultiByteToWideChar(); you don't 
> need to call WideCharToMultiByte(CP_UTF8,...).

This is a mistake that crept in during my refactoring, when I changed from decoding whole args at once to single 'characters' at a time. I'll fix it.

> (j) Let EOF be EOF, not 0
> 
> It's a bad idea to use 0 as en EOF indicator. This is most visible in
> utf16le_get_rune(), because the file will be "truncated" at any *byte*
> with the value 0. Usually EOF is -1, and differs from any UV because
> valid UVs are in the range 0..0x10FFFF (so always positive 32-bit 
> ints). If you don't want to rely on C's EOF #define, use -1 directly.

You're right. Peter had a macro to read chars, which I expanded, and simplified. And broke. Fixed now.

> (k) WINDOWS_NO_UNICODE, plus some other platforms, look broken
> 
> The init done by gsapi_set_arg_encoding(GS_ARG_ENCODING_LOCAL) does 
> not look OK. For WINDOWS_NO_UNICODE or non-Windows, it ends up 
> setting minst->rune_from_local = NULL, which is passed to arg_init()
> and interpreted to mean UTF-8.

OK. I've renamed WINDOWS_NO_UNICODE to GS_NO_UTF8. Any OS can now specify this, and it will mean that gs will build using the old style "8 bit clean" system that it did before. (Filenames etc will be accepted in simple 8 bit ascii and parroted back out to system functions in the same way).

Any attempt to call W entry points or to set the encoding to anything other than local will fail in this case.

> Command line input: UTF-8 is native on Linux, maybe on MAc(?) but on
> OS/2? VMS? Also WINDOWS_NO_UNICODE should expect ANSI.

OS/2, VMS etc will have to define GS_NO_UTF8.

> File functions: arg_next() converts everything TO UTF-8 via put_rune().
> So any filename (input/ output/ -I path/ font path/ etc) unexpectedly
> ends up as UTF-8, which file functions don't always handle (depending
> on platform and WINDOWS_NO_UNICODE).

If GS_NO_UTF8 is defined, nothing will be rewritten, and everything will work as before. If GS_NO_UTF8 is NOT defined, then the file functions will be expected to cope.

> (l) Surrogate pairs when decoding UTF-16
> 
> The decoding in utf16le_get_rune() looks wrong. 1st, the high/low
> surrogates have different ranges; matters for validation and for 
> subtracting the base values. Next, after combining the 10+10 bits add
> 0x10000.

Fixed.

> (m) The name "rune_from_local" [not that important...]
> 
> This name (and the comment near it) in struct gs_main_instance_s is 
> unfortunate. The function does not look to be the platform-specific 
> "local"/"host" -> Uv conversion, but the one that converts from whatever
> was set by gsapi_set_arg_encoding() - which does not necessarily have 
> anything to do with the "local" encoding.

This has now been renamed to 'get_codepoint' and the comment updated.

> (n) The BOM
> 
> Windows UTF-8 and UTF-16 files usually begin with a BOM, which needs to be
> skipped. This applies to reading @files. (Discussed a bit on IRC, but I
> forgot to include this here - again!)

I now skip the BOM in both UTF-8 and UTF-16 when reading. I skip it wherever it occurs in the string, not just at the start, but really I can't see this being a problem. (Famous last words).

> (o) Fetching values of true environment variables (not of Registry entries)
>
> See the part starting with "When values are fetched from true environment" in
> comment #8. To fix, use either getenv() or _wgetenv() + conversion to UTF-8,
> depending on WINDOWS_NO_UNICODE.

I think we need to assume that when we getenv values, that they will be in ghostscripts internal format.

I'll upload the new patch in a mo.

Sags: Thanks again for your time in looking at this. I know this is a painful, long drawn out process, but I hope we'll end up with something worthwhile in the end.
Comment 15 Robin Watts 2012-12-17 17:14:01 UTC
Created attachment 9150 [details]
0001-Reworked-unicode-command-line-handling.patch

Updated patch.
Comment 16 Robin Watts 2012-12-18 11:58:54 UTC
Created attachment 9151 [details]
0001-Fixed-Reworked-unicode-command-line-handling.patch

Updated version of the patch that passes tests.
Comment 17 Robin Watts 2013-01-10 12:26:10 UTC
(In reply to comment #9)
> (In reply to comment #8)
> > (B) XCFLAGS
> > 
> > [It's not related with this patch, it’s just that I stumbled upon this while
> > looking why WINDOWS_NO_UNICODE=1 doesn’t work.]
> > 
> > $(XCFLAGS) is added to the compiler command line twice, once in msvc.mak
> > ‘CFLAGS=$(CFLAGS) $(XCFLAGS) $(UNICODECFLAGS)’ and once in msvccmd.mak line
> > ‘CCFLAGS=$(PLATOPT) $(FPFLAGS) $(CPFLAGS) $(CFLAGS) $(XCFLAGS) ...’. I think
> > the one to delete is the 2nd one.
> 
> Also seems plausible.

Turns out that removing this means that the PCL build no longer supports PCL due to the 'PCL_INCLUDED' flag being dropped.
Comment 18 SaGS 2013-01-13 23:47:28 UTC
I took a look at the new patch, and the interface design looks OK to me. There are implementation problems, but fixing these does not affect the interface itself.

To get all new interface functions available, you need some fixes to the DEFs. There are 2 EXPORTS missing from gsdll32.def: gsapi_init_with_argsA and gsapi_init_with_argsW. The other 3 DEFs (the 64-bit one and the 2 "metro") are also missing some EXPORTS (most of the new functions).

Note the GSDLLEXPORT modifier in the source code makes the linker export these functions automatically, but with mangled names ‘_gsapi_init_with_argsA@12’ - not what we want.



Here are some of the implementation problems that I found and diagnosed [so far, at least] in no particular order. Note these come mostly from inspecting and tracing the code, I don't have any "real world" and/or "serious" tests.

---
The char type may be signed

In gsargs.c::arg_next() (1 occurrence) and in gsargs.c::get_codepoint_utf8() (2 places):

wrong:  (file ? fgetc(file) : (**astr ? *(*astr)++ : EOF))
fix:    (file ? fgetc(file) : (**astr ? (int)(unsigned char)*(*astr)++ : EOF))

When the char type is signed (which for MSVC is the default), converting it to int sign-extends the value - not what you want.

---
Use of character classification CRTL functions with Unicode Values

isspace() and most character classification functions are defined only for 0..255 and EOF (for other arguments the debug CRTL even asserts). arg_next() works with EOF and UVs (so much larger range), and passing these to isspace() looks suspicious at least. I've also seen isprint() somewhere, but not sure it's affected.

---
In iapi.c::gsapi_init_with_argsA() and iapi.c::gsapi_run_fileA(), for the GS_NO_UTF8 case, you set "local" encoding with:

    code = gsapi_set_arg_encoding(lib, clean8bit_get_codepoint);

but gsapi_set_arg_encoding() wants an enum not a function pointer for the 2nd argument. I think here you need:

    gs_main_inst_arg_decode(get_minst_from_memory(ctx->memory), clean8bit_get_codepoint);
    code = 0;

---
The GS_NO_UTF8 version of gswin32c.exe does not start, it says "Can't create Ghostscript instance". The cause is iapi.c::gsapi_set_arg_encoding(GS_ARG_ENCODING_LOCAL) (called from iapi.c::gsapi_new_instance()) returning e_Fail.
Suggested fix in iapi.c::gsapi_set_arg_encoding():

    #if defined(GS_NO_UTF8)
        if (encoding == GS_ARG_ENCODING_LOCAL) {
            /* For GS_NO_UTF8 builds, we don't use utf8 internally, and we assume
             * that all inputs are 8 bit clean. */
            gs_main_inst_arg_decode(get_minst_from_memory(ctx->memory), clean8bit_get_codepoint);
+           return 0; <== ADD THIS
        }
    #else

---
GS_OPTIONS is broken (the other envvars are affected too)

There are 2 distinct problems that contribute to this.

(a)
The gp_getenv() functions are supposed to return strings in GS's internal encoding. The Windows implementation is partly wrong, in that if it finds a true environment variable the returned value is ANSII encoded. It must return ANSII if GS_NO_UTF8, utf-8 otherwise. This affects all environment variables, not only GS_OPTIONS.

Note that, independent on whether the parent process passed an ANSI or Unicode environment, Windows does the necessary conversion based on the function used to retrieve the values: getenv() always returns ANSII, _wgetenv() always Unicode (wchar_t).

(b)
If I understood correctly, for the encoding of arg_source.u.s.chars you expect:
- if .parsed == true, then encoding == GS's internal one (utf-8)
- if .parsed == false, then encoding according to gsapi_set_arg_encoding() (readable by .get_codepoint())

The "pushed back" GS_OPTIONS has .parsed == false and encoding == utf-8 (or ANSII if affected by bug (a)), independent of gsapi_set_arg_encoding(). Note that you cannot set .parsed == true for this one, because it must be parsed to separate individual switches and filenames.

---
Possible overflow of utf-8 buffers for arguments

In gsargs.h::arg_list you have .cstr[arg_str_max + 1]. I think now you need more than +1, because gsargs.c::codepoint_to_utf8(), called from gsargs.c::agr_next(), may write more than 1 byte there. Maybe the tests "if (i == arg_str_max - 1) errprintf("Command too long")" need some thought too.
Comment 19 Robin Watts 2013-01-15 13:55:24 UTC
(In reply to comment #18)
> I took a look at the new patch, and the interface design looks OK to me.
> There are implementation problems, but fixing these does not affect the
> interface itself.

That's great news, thanks.

> To get all new interface functions available, you need some fixes to the
> DEFs. There are 2 EXPORTS missing from gsdll32.def: gsapi_init_with_argsA
> and gsapi_init_with_argsW. The other 3 DEFs (the 64-bit one and the 2
> "metro") are also missing some EXPORTS (most of the new functions).

I will update the patch shortly with a new version that fixes this (and the following issues).

> The char type may be signed
> 
> In gsargs.c::arg_next() (1 occurrence) and in gsargs.c::get_codepoint_utf8()
> (2 places):

Fixed.

> Use of character classification CRTL functions with Unicode Values

Fixed.

> In iapi.c::gsapi_init_with_argsA() and iapi.c::gsapi_run_fileA(), for the
> GS_NO_UTF8 case, you set "local" encoding with:

Fixed (in a different way).

> The GS_NO_UTF8 version of gswin32c.exe does not start, it says "Can't create
> Ghostscript instance".

Fixed.

> GS_OPTIONS is broken (the other envvars are affected too)
> 
> There are 2 distinct problems that contribute to this.
> 
> (a)
> The gp_getenv() functions are supposed to return strings in GS's internal
> encoding. The Windows implementation is partly wrong, in that if it finds a
> true environment variable the returned value is ANSII encoded. It must
> return ANSII if GS_NO_UTF8, utf-8 otherwise. This affects all environment
> variables, not only GS_OPTIONS.
> 
> Note that, independent on whether the parent process passed an ANSI or
> Unicode environment, Windows does the necessary conversion based on the
> function used to retrieve the values: getenv() always returns ANSII,
> _wgetenv() always Unicode (wchar_t).

(a) requires more thought from me.

> (b)
> If I understood correctly, for the encoding of arg_source.u.s.chars you
> expect:
> - if .parsed == true, then encoding == GS's internal one (utf-8)
> - if .parsed == false, then encoding according to gsapi_set_arg_encoding()
> (readable by .get_codepoint())

(b) is fixed; we now have separate flags for parsed and decoded. parsed implies decoded, but we can have decoded but unparsed args in the list now.

> Possible overflow of utf-8 buffers for arguments

Fixed.

Once again, many thanks!
Comment 20 Robin Watts 2013-01-18 12:48:11 UTC
(In reply to comment #19)
> > GS_OPTIONS is broken (the other envvars are affected too)
> > 
> > There are 2 distinct problems that contribute to this.
> > 
> > (a)
> > The gp_getenv() functions are supposed to return strings in GS's internal
> > encoding. The Windows implementation is partly wrong, in that if it finds a
> > true environment variable the returned value is ANSII encoded. It must
> > return ANSII if GS_NO_UTF8, utf-8 otherwise. This affects all environment
> > variables, not only GS_OPTIONS.
> > 
> > Note that, independent on whether the parent process passed an ANSI or
> > Unicode environment, Windows does the necessary conversion based on the
> > function used to retrieve the values: getenv() always returns ANSII,
> > _wgetenv() always Unicode (wchar_t).
> 
> (a) requires more thought from me.

So, I guess the right thing to do here is to change gp_getenv on windows to use _wgetenv() to get an environment variable as Unicode, and then convert that to utf8 before returning it into gs itself.

For 8 bit environment variables, this means we'll get the bytes interpreted as the current codepage as they go into unicode, and then converted down to utf8.

For 16 bit environment variables, we'll get a direct conversion down.

This sounds correct to me.
Comment 21 Robin Watts 2013-01-18 12:53:38 UTC
Created attachment 9231 [details]
0001-Latest-unicode-handling.patch

Latest version that should address all Sags comments, except for the handling of windows environment variables.
Comment 22 Robin Watts 2013-01-18 15:42:20 UTC
Created attachment 9232 [details]
0001-Latest-unicode-handling2.patch

And again, with windows environment variable changes too.

Hopefully this is complete now.
Comment 23 Robin Watts 2013-02-18 11:32:14 UTC
Committed in:

commit 4c1a3aac35f1b4b7711c2e9e2f0f38fe121f51db
Author: Robin Watts <robin.watts@artifex.com>
Date:   Tue Sep 25 12:39:12 2012 +0100

    Command line argument character encoding handling rework.

    Recap of the situation: Historically gs has always taken command
    line args in whatever the 'local' encoding of that OS is; for windows
    this means they are encoded as per the currently selected codepage;
    for mac/linux this tends to mean utf8.

    Similarly, whenever gs goes to access a file, it would pass the filenames
    to the platform specific file opening code - on windows this would
    assume that it was in the local encoding, and on linux would assume
    utf8.

    There are downsides to dealing with codepage encoding - not least the
    fact that some files can't be accessed easily, or that the behaviour
    of PS programs may change according to what platform we are on.

    Last year we added code so that gs on windows would run as a unicode app,
    and would take the supplied unicode argv and convert to utf8 before
    passing them to the core of gs. The file reading functions would then
    convert from utf8 back to unicode before opening the files. This
    code was disabled by default, but seemed to work well in tests.

    1 objection was raised to this - namely that we were redefining the
    meaning of the entry points (to take utf8 rather than locally encoded
    args) and that this might break legacy code. So, it was suggested that
    we add a new entry point specifically for utf8, and that we deprecate
    the old entry point.

    In starting this work, I spotted another problem; gs supports the
    '@file' arg (where more command line args can be given in a file).
    The current solution of rewriting the command line args does not
    address this.

    Instead, we update the gsargs arg handling code with a 'get_codepoint'
    function pointer. We update the arg_next function to use this to
    decode the supplied args (or OPTIONS or file contents) into simple
    unicode characters. We then write these into the returned arg
    string as utf8 (or in the case of GS_NO_UTF8 builds, as 8 bit 'clean'
    characters).

    This commit adds a new gsapi_set_arg_encoding function, that allows
    the caller to set the encoding used for the args/strings supplied
    to the subsequent functions. By default, 'local' encoding is assumed,
    hence preserving existing behaviour. By calling:

     gs_set_arg_encoding(minst, GS_ARG_ENCODING_UTF8);

    we can get the UTF8 encoding that we desire. We leave ourselves scope
    for supporting other encodings (such as EBCDIC?) in the future.

    The commit also changes it so that the gs always runs with UTF8 encoding
    internally by default; to avoid it, make with 'GS_NO_UTF8=1'.

    For the benefit of windows users we implement UTF16LE encoding (i.e.
    wchar_t's). And we expose this through the gsapi level as
    gsapi_run_fileW (and gsapi_init_with_argsW) so that .net can bind
    nicely to it without any casting trickery being required.

    Many thanks to Sags for his tireless efforts and comments while
    this work was underway.

Any remaining problems etc, we'll either reopen this bug, or (if the problems are small), create a new bug just to deal with those.