Bug 691222 - Gs cannot open Unicode file names on Windows
Summary: Gs cannot open Unicode file names on Windows
Status: NOTIFIED FIXED
Alias: None
Product: Ghostscript
Classification: Unclassified
Component: General
Version: master
Hardware: PC Windows XP
Importance: P4 enhancement
Assignee: Robin Watts
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2010-03-27 09:57 UTC by Masaki Ushizaka
Modified: 2012-04-12 17:13 UTC

See Also:
Customer: 73
Word Size: ---


Attachments

Description Masaki Ushizaka 2010-03-27 09:57:55 UTC
Ghostscript uses the ANSI ("A") variants of the Win32 APIs.  As a result, it can only open files whose names consist of characters in the current code page.  If a file name containing other characters is given on the command line, Ghostscript raises an 'undefinedfilename' error.
Explorer, on the other hand, runs in Unicode.  It can create files whose names contain any Unicode characters, so some of those files cannot be opened by Ghostscript.

The customer has requested that Ghostscript be able to open any file, even if its name contains Unicode characters that are not in the current code page.
Comment 1 Masaki Ushizaka 2010-03-29 12:47:57 UTC
Redirecting stdin might work as a temporary workaround.

- If you are using cmd.exe or batch file, then:

  C>gswin32c - <unicode_filename.ps

- If you launch gs from your code:
  1) Launch gs using cmd.exe and its redirection:

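    /* Backslashes must be escaped, and CreateProcessW needs a writable command-line buffer. */
    WCHAR cmd[] = L"C:\\WINDOWS\\system32\\cmd.exe /c gswin32c.exe - < unicode_filename.ps";
    CreateProcessW(NULL, cmd, ...);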

  2) or, redirect stdin by yourself:

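    /* The handle must be inheritable (SECURITY_ATTRIBUTES.bInheritHandle = TRUE). */
    HANDLE hFile = CreateFileW(L"unicode_filename.ps", ...);
    STARTUPINFO si = { 0 };
    TCHAR cmd[] = _T("gswin32c.exe -");
    ...
    si.cb = sizeof(si);
    si.dwFlags = STARTF_USESTDHANDLES;
    si.hStdInput = hFile;
    ...
    /* Pass bInheritHandles = TRUE so that the child process receives hFile. */
    CreateProcess(NULL, cmd, ... &si, ..);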
Comment 2 Masaki Ushizaka 2010-03-29 13:17:21 UTC
The stdin redirection in comment #1 has a disadvantage.  When reading from stdin, gs may spool the input into a temporary file, which takes more disk space and more time to process.
Comment 3 Ray Johnston 2010-04-01 16:52:10 UTC
Question 1: Do the file names come in as Windows 'wide character' strings or as UTF-8?

Question 2: What is the "correct" file open function on Windows that will work with this kind of file name? (A simple change to gp_fopen() in 'base/gp_mswin.c' may suffice, along with finding any calls in the code that call fopen directly instead of gp_fopen.)
Comment 4 Masaki Ushizaka 2010-04-03 12:03:21 UTC
> question 1: Do the file names come in as Windows 'Wide character' strings 
> or as UTF-8?

The current gswin32*.exe receives file names as multibyte strings, encoded in
the multibyte character set (MBCS), through its main() function.
On most European-language versions of Windows, a multibyte string is just a
sequence of single bytes.  For Chinese, Japanese, and Korean, it is a mixture of
single-byte and double-byte characters.  The multibyte encoding would be UTF-8
if the current code page is 65001.
In most cases, the characters in a multibyte character set are a subset of Unicode.
NT-series Windows manages file names in Unicode (wide characters).
Windows converts Unicode to multibyte when it calls main() or WinMain().
Characters that cannot be converted to multibyte are replaced with '?' (the ASCII question mark).
File names containing such replaced characters can no longer be opened with fopen().
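
The loss described above can be illustrated with a small, portable model. This is only a sketch under stated assumptions: Latin-1 stands in for the real ANSI code page, the function name narrow_with_replacement is hypothetical, and the actual Windows conversion has more cases (for example best-fit mappings). It shows why two different Unicode names can end up as the same unusable narrow name.

```c
#include <stddef.h>
#include <stdint.h>

/* Simplified model of a lossy wide-to-narrow conversion (an assumption,
 * not the real Windows algorithm): UTF-16 units up to 0xFF are kept,
 * everything else becomes '?'. Returns the number of replaced units.
 * Output is truncated to fit outsize and always NUL-terminated. */
size_t narrow_with_replacement(const uint16_t *in, char *out, size_t outsize)
{
    size_t lost = 0, n = 0;

    if (outsize == 0)
        return 0;
    for (; *in != 0 && n + 1 < outsize; in++) {
        if (*in <= 0xFF) {
            out[n++] = (char)*in;
        } else {
            out[n++] = '?';   /* the original character is gone for good */
            lost++;
        }
    }
    out[n] = '\0';
    return lost;
}
```

Once a character has been replaced, the original name cannot be recovered from the narrow string, which is why the name has to be obtained in Unicode in the first place.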

There are two ways to get the command-line file names in Unicode:
1) Use wmain() or wWinMain() instead of main() or WinMain(), or
2) After the program has launched, call LPWSTR GetCommandLineW(VOID).


> Question 2: What is the "correct" file open function on windows that will 
> work with this kind of file name? (it may be a simple change in 
> 'base/gp_mswin.c' to gp_fopen() would suffice (and maybe find any calls 
> in the code that directly call fopen instead of gp_fopen).

FILE* _wfopen(const wchar_t *filename, const wchar_t *mode) would be the one.
It is declared in stdio.h and does not need Windows.h.
(We could certainly use CreateFileW(), but that would not match our style.)

To handle Unicode file names, we need to get three things right:

a) Get the file names in Unicode.
b) Do not lose the Unicode file names during processing.
c) Use a file-open API that accepts Unicode file names.

Making Ghostscript's internal text encoding UTF-8 might be an option,
but if we did that, we might need to add a text conversion
every time we get a file name from "outside" (from a directory search, from PS code, ...).


These are the issues on Windows.  I need to study what happens in the Unix world.
Comment 5 Martin Osieka 2011-06-03 13:08:00 UTC
May I ask if someone is looking into this issue on Windows?

My external workaround is to convert the file name and path to the short (8.3) form before I call gs. But this dirty trick only works if the file system supports short-name aliases.
Comment 6 Henry Stiles 2011-06-03 16:41:33 UTC
> 
> These are the issues on Windows.  I need some study to know what is going on
> Unix world.

On Unix (MacOS X and Linux) we open unicode filenames properly.  So it looks like a fix could be localized (no pun intended) to gp_mswin.c.

Hello Martin, yes we are looking into this problem.
Comment 7 Martin Osieka 2011-06-03 17:45:20 UTC
(In reply to comment #6)
> > 
> > These are the issues on Windows.  I need some study to know what is going on
> > Unix world.
> 
> On Unix (MacOS X and Linux) we open unicode filenames properly.  So it looks
> like a fix could be localized (no pun intended) to gp_mswin.c.
> 
> Hello Martin, yes we are looking into this problem.

Hi Henry, does this mean that you are using UTF-8 internally?

If this is the case then I could do a quick test by providing a UTF-8 filename
to gsapi_init_with_args() and patching my current clone of gs so that gp_fopen()
converts the UTF-8 filename to wchar_t* and uses _wfopen().

I'm sure there are more important things on your todo list, but it would be
really good to close this issue on Windows. At the moment I'm able to spend some
time looking into gs issues on Windows, so if you need some support...
Comment 8 Henry Stiles 2011-06-03 17:55:07 UTC
I think there is consensus among the staff that we should fix this properly.  Robin Watts is going to try to fix it, and we'd appreciate your help testing when he has a solution.  Thanks Martin.
Comment 9 Robin Watts 2011-06-04 23:08:30 UTC
Believed fixed in:

commit 0ea739147fd02ee0e63e58c036bb63fa841ddd3c
Author: Robin Watts <Robin.Watts@artifex.com>
Date:   Sat Jun 4 22:04:12 2011 +0100

    Bug 691222: Make windows build use UTF-8 encoding.

    We change the windows builds to use the 'wmain' rather than 'main'
    entrypoints. This means we get the command line supplied in 'wchar_t's
    rather than chars. We convert back to chars using UTF-8 encoding, and
    call (what was) the main entrypoint.

    This means that we can cope with unicode filenames/paths etc.

    To match the encoding, we therefore need to wrap every use of the
    filenames with the associated utf-8 -> wchar_t conversion and use
    the unicode file access functions (_wfopen instead of fopen etc)
    instead.

    Simple testing seems to indicate that this works. I think I've got
    every occurrence of file access, but it's possible I've missed some. If so
    I'll fix them piecemeal as they are reported.

    This should solve bug 691222, and hopefully 691117.
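
For readers following the approach in the commit message, the wchar_t to UTF-8 step can be sketched as below. This is an illustration only, not the code from the commit: the helper names utf16_to_utf8, argv_to_utf8, and free_utf8_argv are hypothetical, and UTF-16 units are held in uint16_t so the sketch also compiles where wchar_t is wider than 16 bits.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper, not Ghostscript's actual code: encode a
 * NUL-terminated UTF-16 string as UTF-8. Returns the number of bytes
 * required, including the terminating NUL. When out is NULL nothing is
 * written, so a first call can be used to size the buffer. */
size_t utf16_to_utf8(const uint16_t *in, char *out)
{
    size_t n = 0;

    while (*in) {
        uint32_t cp = *in++;

        /* Combine a high/low surrogate pair into one code point. */
        if (cp >= 0xD800 && cp <= 0xDBFF && *in >= 0xDC00 && *in <= 0xDFFF)
            cp = 0x10000 + ((cp - 0xD800) << 10) + (uint32_t)(*in++ - 0xDC00);
        else if (cp >= 0xD800 && cp <= 0xDFFF)
            cp = 0xFFFD;               /* unpaired surrogate: use U+FFFD */

        if (cp < 0x80) {
            if (out) out[n] = (char)cp;
            n += 1;
        } else if (cp < 0x800) {
            if (out) {
                out[n]     = (char)(0xC0 | (cp >> 6));
                out[n + 1] = (char)(0x80 | (cp & 0x3F));
            }
            n += 2;
        } else if (cp < 0x10000) {
            if (out) {
                out[n]     = (char)(0xE0 | (cp >> 12));
                out[n + 1] = (char)(0x80 | ((cp >> 6) & 0x3F));
                out[n + 2] = (char)(0x80 | (cp & 0x3F));
            }
            n += 3;
        } else {
            if (out) {
                out[n]     = (char)(0xF0 | (cp >> 18));
                out[n + 1] = (char)(0x80 | ((cp >> 12) & 0x3F));
                out[n + 2] = (char)(0x80 | ((cp >> 6) & 0x3F));
                out[n + 3] = (char)(0x80 | (cp & 0x3F));
            }
            n += 4;
        }
    }
    if (out) out[n] = '\0';
    return n + 1;
}

/* Release an array returned by argv_to_utf8(). */
void free_utf8_argv(char **argv)
{
    char **p;

    if (argv == NULL)
        return;
    for (p = argv; *p != NULL; p++)
        free(*p);
    free(argv);
}

/* Convert a whole argument vector. Returns a NULL-terminated array,
 * or NULL if an allocation fails. */
char **argv_to_utf8(int argc, const uint16_t *const *wargv)
{
    char **argv = calloc((size_t)argc + 1, sizeof *argv);
    int i;

    if (argv == NULL)
        return NULL;
    for (i = 0; i < argc; i++) {
        argv[i] = malloc(utf16_to_utf8(wargv[i], NULL));
        if (argv[i] == NULL) {
            free_utf8_argv(argv);
            return NULL;
        }
        utf16_to_utf8(wargv[i], argv[i]);
    }
    return argv;
}
```

On Windows, where wchar_t is a 16-bit UTF-16 unit, a wmain() could pass its argv through a conversion of this kind and then call the former main() with the UTF-8 result.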
Comment 10 Martin Osieka 2011-06-05 09:46:45 UTC
(In reply to comment #9)
> Believed fixed in:
> 
> commit 0ea739147fd02ee0e63e58c036bb63fa841ddd3c
> Author: Robin Watts <Robin.Watts@artifex.com>
> Date:   Sat Jun 4 22:04:12 2011 +0100
> 
>     This should solve bug 691222, and hopefully 691117.

Great.

What does this mean for the API? I guess callers now have to provide UTF-8 strings. If so, this should be documented in api.html. The same applies to text on stdin, stdout, and stderr.

Maybe the copy/paste commands in the gs window system-menu extension could also put/get wchar_t text on the clipboard and convert it to/from UTF-8. This would allow Unicode file names to be copied via the clipboard.

The fix should solve Bug 690026 too.

Remark 1: wmode[] in gp_fopen() should not use a fixed length. I know it should never be longer than 3, but...

Remark 2: Is the call to setlocale() in main() still useful?
Comment 11 Ray Johnston 2011-06-05 18:35:05 UTC
So, how does this work for paths defined as either environment variables or
PostScript strings (e.g., -I___ or -sOutputFile= -sGenericResourceDir=___ or GS_FONTPATH environment variable/registry key) ?

Where in the process are strings UTF-8 vs. wchar ?
Comment 12 Martin Osieka 2011-06-06 05:40:03 UTC
I did a quick test in the morning feeding the api with utf8 arguments to access .ps and .pdf files with unicode file and folder names. This works fine now.
Comment 13 Martin Osieka 2011-06-06 06:32:16 UTC
(In reply to comment #11)
> So, how does this work for paths defined as either environment variables or
> PostScript strings (e.g., -I___ or -sOutputFile= -sGenericResourceDir=___ or
> GS_FONTPATH environment variable/registry key) ?
> 
> Where in the process are strings UTF-8 vs. wchar ?

wchar world     conversion               utf8 world
---------------------------------------------------------
Windows                                  ghostscript
Call gs         wmain( argv) => main
Filenames       _wfopen <= gp_fopen      Open a file
Registry        RegQueryValueExW <= ?    Get/set a value
Environment     _wgetenv <= ?            Get a value
Clipboard       CF_UNICODETEXT <= ?      Get/set a text snippet
Call gs_api     none                     
                                         File content (depending on a BOM?)
Comment 14 Robin Watts 2011-06-07 11:45:41 UTC
(In reply to comment #7)
> Hi Henry, does this mean that you are using UTF-8 internally? 

Yes, exactly.

> If this is the case then I could do a quick test by providing a utf-8 filename
> to gsapi_init_with_args() and patch my current clone of gs so that gp_fopen()
> converts the utf-8 filename to wchar_t* and uses _wfopen(). 

That's all done now.

(In reply to comment #9)
> What does it mean for the api interface? I guess you have to provide UTF-8
> strings. If this is the case then this should be documented in api.html. Same
> for stdin, stdout, stderr texts.

You're absolutely right. I'll get onto that.

> Maybe also the copy/paste commands in the gs window systemmenu extension could
> then put/get wchar_t from clipboard and convert them to UTF-8. This would
> allow to copy unicode filenames via clipboard.

I am unfamiliar with that extension.

> The fix should solve Bug 690026 too.

I'll look into that.

> Remark 1: gp_fopen() wmode[] should not use a fixed length. I know it should
> never be longer than 3, but...

This is an internal API, and I didn't feel the count/malloc/free overhead was justified here.

> Remark 2: Is the call of setlocale() in main() still useful?

Not a clue.

(In reply to comment #11)
> So, how does this work for paths defined as either environment variables or
> PostScript strings (e.g., -I___ or -sOutputFile= -sGenericResourceDir=___ or
> GS_FONTPATH environment variable/registry key) ?
> 
> Where in the process are strings UTF-8 vs. wchar ?

On Unix, the environment always passes us UTF-8 encoded values (environment keys, command lines etc), and gp_fopen (which just calls fopen) expects UTF-8 encoded values too.

On Windows, we are (as far as possible) in the same situation. 'main' (now actually called main_utf8) is called with UTF-8 encoded values. Environment keys are similarly assumed to be UTF-8, and gp_fopen likewise expects UTF-8 encoded values.

The difference between Windows and Unix is that on Windows we have a thin shim layer in there to do the conversion for us (wmain converts from wchar to UTF-8 and calls main_utf8, gp_fopen converts from UTF-8 to wchar and calls _wfopen).
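
The file-open half of such a shim might look roughly like the sketch below. It is not the code in gp_mswin.c: the names utf8_to_utf16 and fopen_utf8 are hypothetical, the decoder assumes well-formed UTF-8 and does only minimal checking, and the fixed-size mode buffer is a simplification that rejects overlong mode strings.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#ifdef _WIN32
#include <wchar.h>
#endif

/* Hypothetical helper: decode NUL-terminated UTF-8 into UTF-16.
 * Returns the number of 16-bit units required, including the
 * terminator. When out is NULL nothing is written. Input is assumed
 * to be well formed; malformed sequences give unspecified characters
 * but never read past the terminating NUL. */
size_t utf8_to_utf16(const char *in, uint16_t *out)
{
    const unsigned char *s = (const unsigned char *)in;
    size_t n = 0;

    while (*s) {
        uint32_t cp;
        int extra;

        if (*s < 0x80)      { cp = *s;        extra = 0; }
        else if (*s < 0xE0) { cp = *s & 0x1F; extra = 1; }
        else if (*s < 0xF0) { cp = *s & 0x0F; extra = 2; }
        else                { cp = *s & 0x07; extra = 3; }
        s++;
        /* Fold in the continuation bytes (10xxxxxx). */
        while (extra-- > 0 && (*s & 0xC0) == 0x80)
            cp = (cp << 6) | (uint32_t)(*s++ & 0x3F);

        if (cp >= 0x10000) {           /* needs a surrogate pair */
            cp -= 0x10000;
            if (out) {
                out[n]     = (uint16_t)(0xD800 | (cp >> 10));
                out[n + 1] = (uint16_t)(0xDC00 | (cp & 0x3FF));
            }
            n += 2;
        } else {
            if (out) out[n] = (uint16_t)cp;
            n += 1;
        }
    }
    if (out) out[n] = 0;
    return n + 1;
}

/* Open a file whose name is UTF-8 encoded. On Windows the name is
 * converted to UTF-16 and passed to _wfopen; elsewhere fopen is
 * called directly. */
FILE *fopen_utf8(const char *fname, const char *mode)
{
#ifdef _WIN32
    wchar_t wmode[8];
    wchar_t *wname;
    FILE *f;
    size_t i;

    for (i = 0; mode[i] != '\0'; i++) {
        if (i >= 7)
            return NULL;               /* mode strings are short ASCII */
        wmode[i] = (wchar_t)mode[i];
    }
    wmode[i] = L'\0';
    wname = malloc(utf8_to_utf16(fname, NULL) * sizeof(wchar_t));
    if (wname == NULL)
        return NULL;
    utf8_to_utf16(fname, (uint16_t *)wname);
    f = _wfopen(wname, wmode);
    free(wname);
    return f;
#else
    return fopen(fname, mode);
#endif
}
```

Keeping the conversion inside the open function means the rest of the program can treat file names as ordinary char strings on every platform.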

(In reply to comment #12)
> I did a quick test in the morning feeding the api with utf8 arguments to
> access .ps and .pdf files with unicode file and folder names. This works fine
> now.

Fabulous. Many thanks for your help testing this. I don't have a non-english version of windows, so I am particularly grateful for any testing/suggestions/ pointing-out-of-stupid-errors you can offer!
Comment 15 Robin Watts 2011-06-07 12:17:14 UTC
I've spun bug 692259 out with the copy/paste system menu suggestion for discussion. All comments welcome.
Comment 16 Martin Osieka 2011-06-07 13:14:11 UTC
(In reply to comment #14)

> Environment
> keys are similarly assumed to be UTF-8.

Bear in mind that the Windows environment uses wchar_t. So use _wgetenv to get the variables and convert them to UTF-8.
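
A minimal sketch of that suggestion follows. The helper name getenv_utf8, its return convention, and the 256-unit name buffer are assumptions made for illustration; Ghostscript's own environment accessor is not shown in this thread and may differ.

```c
#include <stdlib.h>
#include <string.h>
#ifdef _WIN32
#include <windows.h>
#endif

/* Hypothetical helper: fetch an environment variable as UTF-8.
 * Returns the number of bytes needed including the terminating NUL,
 * or -1 if the variable is unset or a conversion fails. The value is
 * copied into buf only when it fits in buflen bytes. */
int getenv_utf8(const char *name, char *buf, int buflen)
{
#ifdef _WIN32
    wchar_t wname[256];
    const wchar_t *wval;
    int need;

    /* The variable name is UTF-8 too, so widen it first. */
    if (MultiByteToWideChar(CP_UTF8, 0, name, -1, wname, 256) == 0)
        return -1;
    wval = _wgetenv(wname);
    if (wval == NULL)
        return -1;
    /* First call sizes the result, second call performs the copy. */
    need = WideCharToMultiByte(CP_UTF8, 0, wval, -1, NULL, 0, NULL, NULL);
    if (need == 0)
        return -1;
    if (need <= buflen)
        WideCharToMultiByte(CP_UTF8, 0, wval, -1, buf, buflen, NULL, NULL);
    return need;
#else
    /* Elsewhere the value is taken as is (assumed to be UTF-8 already). */
    const char *val = getenv(name);
    int need;

    if (val == NULL)
        return -1;
    need = (int)strlen(val) + 1;
    if (need <= buflen)
        memcpy(buf, val, (size_t)need);
    return need;
#endif
}
```

A caller can pass a zero-length buffer first to learn the required size, then allocate and call again.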

I would also check if the registry access has to be adapted (using the ...W variants of function calls).

> Fabulous. Many thanks for your help testing this. I don't have a non-english
> version of windows, so I am particularly grateful for any testing/suggestions/
> pointing-out-of-stupid-errors you can offer!

I'm located in Switzerland and prefer english versions too. It is difficult to work in different locations like San Diego and to share equipment.
But you can always create unicode filenames of documents (Windows >= W2K ;-).
Comment 17 Robin Watts 2011-06-07 18:45:35 UTC
(In reply to comment #16)
> > Environment keys are similarly assumed to be UTF-8.
> Respect that the Windows environment is using wchar_t. So use _wgetenv to get
> the variables and convert them to utf8.

Yes, I've been coding that up today, together with fixes to a few other things I spotted.
 
> I'm located in Switzerland and prefer english versions too. It is difficult to
> work in different locations like San Diego and to share equipment.
> But you can always create unicode filenames of documents (Windows >= W2K ;-).

Yes, that much I've done.

Thanks again.
Comment 18 Russell Lang 2011-06-08 12:42:06 UTC
Similar changes are also needed in gp_msprn.c.  The printer name will need to be
passed in as UTF-8 and then converted to wide-character Unicode before calling OpenPrinterW().