Ghostscript uses the ANSI series of APIs on Win32. As a natural result, ghostscript can only open files whose names consist of characters in the current code page. If a file name containing other characters is given on the command line, ghostscript generates an 'undefinedfilename' error. Explorer, on the other hand, runs in Unicode and can create files whose names contain any Unicode characters, so some of those files cannot be opened by ghostscript. The customer has requested that ghostscript be able to open any file, even if its name contains Unicode characters that are not in the current code page.
Using stdin redirection might work as a temporary solution.

- If you are using cmd.exe or a batch file:
    C>gswin32c - <unicode_filename.ps

- If you launch gs from your own code:
  1) Launch gs through cmd.exe and use its redirection:
       CreateProcessW(NULL, L"C:\\WINDOWS\\system32\\cmd.exe /c gswin32c.exe - < unicode_filename.ps", ...);
  2) or redirect stdin yourself:
       HANDLE hFile = CreateFileW(L"unicode_filename.ps", ...);
       STARTUPINFO si = { 0, };
       ...
       si.hStdInput = hFile;
       ...
       CreateProcess(NULL, _T("gswin32c.exe -"), ... &si, ..);
There is a disadvantage to the stdin redirection in comment #1. When reading from stdin, gs may spool the contents into a temporary file, which can take more disk space and more time to process.
Question 1: Do the file names come in as Windows 'wide character' strings or as UTF-8?

Question 2: What is the "correct" file open function on Windows that will work with this kind of file name? (It may be that a simple change to gp_fopen() in 'base/gp_mswin.c' would suffice, plus finding any calls in the code that call fopen directly instead of gp_fopen.)
> question 1: Do the file names come in as Windows 'Wide character' strings
> or as UTF-8?

The current gswin32*.exe receives file names as "multibyte strings", that is, strings in a multibyte character set (MBCS), through its main() function. On most European-language versions of Windows, a multibyte string is just a sequence of single bytes. For Chinese, Japanese, and Korean, it is a mixture of single-byte and double-byte characters. The multibyte encoding would be UTF-8 if the current code page were 65001. In most cases, the characters in a multibyte code set are a subset of the Unicode characters.

NT-series Windows manages file names in Unicode (wide characters). Windows converts Unicode to multibyte when it calls main() or WinMain(). Characters that cannot be converted to multibyte are replaced with '?' (the ASCII question mark). File names with replaced characters are no longer usable with fopen().

There are two ways to get command-line file names in Unicode:
1) Use wmain() or wWinMain() instead of main() or WinMain(), or
2) after the program has launched, call LPWSTR GetCommandLineW(VOID).

> Question 2: What is the "correct" file open function on windows that will
> work with this kind of file name? (it may be a simple change in
> 'base/gp_mswin.c' to gp_fopen() would suffice (and maybe find any calls
> in the code that directly call fopen instead of gp_fopen).

FILE* _wfopen(const wchar_t *filename, const wchar_t *mode) would be the one. It is declared in stdio.h and does not need Windows.h. (We could certainly use CreateFileW(), but that would not match our style.)

To handle Unicode file names, we need to get three things right:
a) Get the file names in Unicode.
b) Do not lose the Unicode file names during processing.
c) Use a file-open API that accepts Unicode file names.

Switching ghostscript's internal text encoding to UTF-8 might be an option, but if we did that, we might need to add a series of text conversions every time we get a file name from "outside" (from a directory search, from PS code, ...).

These are the issues on Windows. I need to do some study to find out what happens in the Unix world.
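To illustrate why a file name with replaced characters can no longer be opened, here is a minimal, portable sketch. It is not the conversion Windows actually performs: the helper name narrow_to_latin1 is hypothetical, and the "code page" is assumed to be plain Latin-1 (U+0000 to U+00FF) to keep the example short.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: narrow NUL-terminated UTF-16 code units to an
 * 8-bit "code page" which, for this sketch only, is Latin-1.
 * Every code unit above 0xFF becomes '?' (so a surrogate pair yields two).
 * Returns the number of code units that were replaced. */
static size_t narrow_to_latin1(const uint16_t *wide, char *out, size_t out_size)
{
    size_t replaced = 0;
    size_t i = 0;

    assert(wide != NULL && out != NULL);
    if (out_size == 0)
        return 0;
    /* Stop early if the output buffer is full; keep room for the NUL. */
    for (; wide[i] != 0 && i + 1 < out_size; i++) {
        if (wide[i] <= 0xFF) {
            out[i] = (char)wide[i];
        } else {
            out[i] = '?';
            replaced++;
        }
    }
    out[i] = '\0';
    return replaced;
}
```

With this sketch, a name such as "a" U+65E5 U+672C ".ps" narrows to "a??.ps". Many different Unicode names collapse to that same string, and no file with that literal name exists, so a later fopen() has nothing to find.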
May I ask whether someone is looking into this issue on Windows? My external workaround is to convert the file name and path to the short (8.3) form before I call gs. But this dirty trick only works if the file system supports short-name aliases.
> These are the issues on Windows. I need some study to know what is going on
> Unix world.

On Unix (MacOS X and Linux) we open Unicode filenames properly. So it looks like a fix could be localized (no pun intended) to gp_mswin.c.

Hello Martin, yes, we are looking into this problem.
(In reply to comment #6)
> > These are the issues on Windows. I need some study to know what is going on
> > Unix world.
>
> On Unix (MacOS X and Linux) we open unicode filenames properly. So it looks
> like a fix could be localized (no pun intended) to gp_mswin.c.
>
> Hello Martin, yes we are looking into this problem.

Hi Henry, does this mean that you are using UTF-8 internally?

If so, I could do a quick test by providing a UTF-8 filename to gsapi_init_with_args() and patching my current clone of gs so that gp_fopen() converts the UTF-8 filename to wchar_t* and uses _wfopen().

I'm sure there are more important things on your to-do list, but it would be really nice to close this issue on Windows. At the moment I'm able to spend some time looking into gs issues on Windows, so if you need some support...
I think there is consensus among the staff that we should fix this properly. Robin Watts is going to try to fix it, and we'd appreciate your help testing when he has a solution. Thanks, Martin.
Believed fixed in:

commit 0ea739147fd02ee0e63e58c036bb63fa841ddd3c
Author: Robin Watts <Robin.Watts@artifex.com>
Date: Sat Jun 4 22:04:12 2011 +0100

    Bug 691222: Make windows build use UTF-8 encoding.

    We change the windows builds to use the 'wmain' rather than 'main'
    entrypoints. This means we get the command line supplied in 'wchar_t's
    rather than chars. We convert back to chars using UTF-8 encoding, and
    call (what was) the main entrypoint. This means that we can cope with
    unicode filenames/paths etc.

    To match the encoding, we therefore need to wrap every use of the
    filenames with the associated utf-8 -> wchar_t conversion and use the
    unicode file access functions (_wfopen instead of fopen etc) instead.

    Simple testing seems to indicate that this works. I think I've got every
    occurrence of file access, but it's possible I've missed some. If so I'll
    fix them piecemeal as they are reported.

    This should solve bug 691222, and hopefully 691117.
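For readers who want to see what "convert back to chars using UTF-8 encoding" involves, here is a portable sketch of the wchar_t to UTF-8 direction. It is not the code from the commit: the helper name utf16_to_utf8 is hypothetical, and uint16_t stands in for the 16-bit wchar_t used on Windows.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: encode a NUL-terminated UTF-16 string as UTF-8.
 * Returns the number of bytes written (excluding the terminating NUL),
 * or (size_t)-1 if the buffer is too small or a surrogate is unpaired. */
static size_t utf16_to_utf8(const uint16_t *in, char *out, size_t out_size)
{
    size_t n = 0;

    assert(in != NULL && out != NULL);
    if (out_size == 0)
        return (size_t)-1;
    while (*in != 0) {
        uint32_t cp = *in++;
        size_t need;

        if (cp >= 0xD800 && cp <= 0xDBFF) {        /* high surrogate */
            if (*in < 0xDC00 || *in > 0xDFFF)
                return (size_t)-1;                 /* not followed by a low one */
            cp = 0x10000 + ((cp - 0xD800) << 10) + (uint32_t)(*in++ - 0xDC00);
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
            return (size_t)-1;                     /* stray low surrogate */
        }

        need = cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
        if (n + need >= out_size)
            return (size_t)-1;                     /* keep room for the NUL */

        if (need == 1) {
            out[n++] = (char)cp;
        } else if (need == 2) {
            out[n++] = (char)(0xC0 | (cp >> 6));
            out[n++] = (char)(0x80 | (cp & 0x3F));
        } else if (need == 3) {
            out[n++] = (char)(0xE0 | (cp >> 12));
            out[n++] = (char)(0x80 | ((cp >> 6) & 0x3F));
            out[n++] = (char)(0x80 | (cp & 0x3F));
        } else {
            out[n++] = (char)(0xF0 | (cp >> 18));
            out[n++] = (char)(0x80 | ((cp >> 12) & 0x3F));
            out[n++] = (char)(0x80 | ((cp >> 6) & 0x3F));
            out[n++] = (char)(0x80 | (cp & 0x3F));
        }
    }
    out[n] = '\0';
    return n;
}
```

A wmain-style entry point could apply such a function to each argv element before calling the UTF-8 main. On Windows itself, WideCharToMultiByte() with CP_UTF8 does the same job and is the more likely choice.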
(In reply to comment #9)
> Believed fixed in:
>
> commit 0ea739147fd02ee0e63e58c036bb63fa841ddd3c
> Author: Robin Watts <Robin.Watts@artifex.com>
> Date: Sat Jun 4 22:04:12 2011 +0100
>
> This should solve bug 691222, and hopefully 691117.

Great. What does this mean for the API? I guess you now have to provide UTF-8 strings. If so, this should be documented in api.html. The same goes for stdin, stdout and stderr text.

Maybe the copy/paste commands in the gs window system-menu extension could then also put/get wchar_t text on the clipboard and convert it to/from UTF-8. This would allow Unicode filenames to be copied via the clipboard.

The fix should solve Bug 690026 too.

Remark 1: wmode[] in gp_fopen() should not use a fixed length. I know it should never be longer than 3, but...
Remark 2: Is the call to setlocale() in main() still useful?
So, how does this work for paths defined as environment variables or as PostScript strings (e.g., -I___, -sOutputFile=, -sGenericResourceDir=___, or the GS_FONTPATH environment variable/registry key)?

Where in the process are strings UTF-8 vs. wchar?
I did a quick test in the morning feeding the api with utf8 arguments to access .ps and .pdf files with unicode file and folder names. This works fine now.
(In reply to comment #11)
> So, how does this work for paths defined as either environment variables or
> PostScript strings (e.g., -I___ or -sOutputFile= -sGenericResourceDir=___ or
> GS_FONTPATH environment variable/registry key) ?
>
> Where in the process are strings UTF-8 vs. wchar ?

              wchar world       conversion  utf8 world
--------------------------------------------------------------------------
              Windows                       ghostscript
Call gs       wmain(argv)       =>          main
Filenames     _wfopen           <=          gp_fopen     Open a file
Registry      RegQueryValueExW  <=          ?            Get/set a value
Environment   _wgetenv          <=          ?            Get a value
Clipboard     CF_UNICODETEXT    <=          ?            Get/set a text snippet
Call gs_api                     none
File content  (depending on a BOM?)
(In reply to comment #7)
> Hi Henry, does this mean that you are using UTF-8 internally?

Yes, exactly.

> If this is the case then I could do a quick test by providing a utf-8 filename
> to gsapi_init_with_args() and patch my current clone of gs so that gd_fopen()
> converts the utf-8 filename to wchar_t* and uses _wfopen().

That's all done now.

(In reply to comment #9)
> What does it mean for the api interface? I guess you have to provide UTF-8
> strings. If this is the case then this should be documented in api.html. Same
> for stdin, stdout, stderr texts.

You're absolutely right. I'll get onto that.

> Maybe also the copy/paste commands in the gs window systemmenu extension could
> then put/get wchar_t from clipboard and convert them to UTF-8. This would
> allow to copy unicode filenames via clipboard.

I am unfamiliar with that extension.

> The fix should should solve Bug 690026 too.

I'll look into that.

> Remark 1: gd_fopen() wmode[] should not use a fixed length. I know it should
> never be longer than 3, but...

This is an internal API, and I didn't feel the count/malloc/free overhead was justified here.

> Remark 2: Is the call of setlocale() in main() still useful?

Not a clue.

(In reply to comment #11)
> So, how does this work for paths defined as either environment variables or
> PostScript strings (e.g., -I___ or -sOutputFile= -sGenericResourceDir=___ or
> GS_FONTPATH environment variable/registry key) ?
>
> Where in the process are strings UTF-8 vs. wchar ?

On Unix, the environment always passes us UTF-8 encoded values (environment keys, command lines etc), and gp_fopen (which just calls fopen) expects UTF-8 encoded values too.

On Windows, we are (as far as possible) in the same situation. 'main' (now actually called main_utf8) is called with UTF-8 encoded values. Environment keys are similarly assumed to be UTF-8, and gp_fopen likewise assumes UTF-8 encoded values.

The difference between Windows and Unix is that on Windows we have a thin shim layer to do the conversion for us (wmain converts from wchar to UTF-8 and calls main_utf8; gp_fopen converts from UTF-8 to wchar and calls _wfopen).

(In reply to comment #12)
> I did a quick test in the morning feeding the api with utf8 arguments to
> access .ps and .pdf files with unicode file and folder names. This works fine
> now.

Fabulous. Many thanks for your help testing this. I don't have a non-English version of Windows, so I am particularly grateful for any testing/suggestions/pointing-out-of-stupid-errors you can offer!
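A minimal sketch of the file-open half of such a shim follows. It is not the actual gp_fopen from gp_mswin.c: the name fopen_utf8 and the fixed buffer sizes are assumptions made for illustration. MultiByteToWideChar() and _wfopen() are the real Windows calls, and on other platforms the sketch simply forwards to fopen().

```c
#include <assert.h>
#include <stdio.h>

#ifdef _WIN32
#include <windows.h>
#endif

/* Hypothetical wrapper: open a file whose name is given in UTF-8. */
static FILE *fopen_utf8(const char *name_utf8, const char *mode)
{
    assert(name_utf8 != NULL && mode != NULL);
#ifdef _WIN32
    {
        wchar_t wname[MAX_PATH];   /* assumed limit for this sketch */
        wchar_t wmode[8];          /* mode strings are short, e.g. "rb" */

        /* A length of -1 converts up to and including the terminating NUL.
         * A return value of 0 means failure (invalid UTF-8, or the
         * destination buffer is too small). */
        if (MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS, name_utf8, -1,
                                wname, MAX_PATH) == 0)
            return NULL;
        if (MultiByteToWideChar(CP_UTF8, 0, mode, -1, wmode, 8) == 0)
            return NULL;
        return _wfopen(wname, wmode);
    }
#else
    /* On Unix-like systems the byte string goes to fopen unchanged. */
    return fopen(name_utf8, mode);
#endif
}
```

The fixed-size wmode buffer mirrors the trade-off discussed in Remark 1: because the conversion is bounded by the buffer size, an over-long mode string fails cleanly instead of overflowing.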
I've spun bug 692259 out with the copy/paste system menu suggestion for discussion. All comments welcome.
(In reply to comment #14)
> Environment
> keys are similarly assumed to be UTF-8.

Bear in mind that the Windows environment uses wchar_t. So use _wgetenv to get the variables and convert them to UTF-8. I would also check whether the registry access has to be adapted (using the ...W variants of the function calls).

> Fabulous. Many thanks for your help testing this. I don't have a non-english
> version of windows, so I am particularly grateful for any testing/suggestions/
> pointing-out-of-stupid-errors you can offer!

I'm located in Switzerland and prefer English versions too. It is difficult to work in different locations like San Diego and to share equipment. But you can always create documents with Unicode filenames (Windows >= W2K ;-).
(In reply to comment #16)
> > Environment keys are similarly assumed to be UTF-8.
> Respect that the Windows environment is using wchar_t. So use _wgetenv to get
> the variables and convert them to utf8.

Yes, I've been coding that up today, together with fixes to a few other things I spotted.

> I'm located in Switzerland and prefer english versions too. It is difficult to
> work in different locations like San Diego and to share equipment.
> But you can always create unicode filenames of documents (Windows >= W2K ;-).

Yes, that much I've done. Thanks again.
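The environment lookup suggested above could look roughly like the sketch below. This is an illustration, not the code that was committed: getenv_utf8 is a hypothetical name, and the 256-character limit on the variable name is an assumption. _wgetenv(), MultiByteToWideChar() and WideCharToMultiByte() are the real Windows calls, and on other platforms the value is copied as-is.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#ifdef _WIN32
#include <windows.h>
#endif

/* Hypothetical helper: return a malloc'ed UTF-8 copy of an environment
 * variable, or NULL if it is unset or on error. The caller frees it. */
static char *getenv_utf8(const char *name_utf8)
{
#ifdef _WIN32
    wchar_t wname[256];            /* assumed limit for this sketch */
    const wchar_t *wvalue;
    char *value;
    int len;

    assert(name_utf8 != NULL);
    if (MultiByteToWideChar(CP_UTF8, 0, name_utf8, -1, wname, 256) == 0)
        return NULL;
    wvalue = _wgetenv(wname);
    if (wvalue == NULL)
        return NULL;
    /* First call reports the size needed (including the NUL). */
    len = WideCharToMultiByte(CP_UTF8, 0, wvalue, -1, NULL, 0, NULL, NULL);
    if (len == 0)
        return NULL;
    value = malloc((size_t)len);
    /* Second call performs the conversion into the new buffer. */
    if (value != NULL &&
        WideCharToMultiByte(CP_UTF8, 0, wvalue, -1, value, len,
                            NULL, NULL) == 0) {
        free(value);
        value = NULL;
    }
    return value;
#else
    const char *raw;
    char *value;

    assert(name_utf8 != NULL);
    raw = getenv(name_utf8);
    if (raw == NULL)
        return NULL;
    value = malloc(strlen(raw) + 1);
    if (value != NULL)
        strcpy(value, raw);
    return value;
#endif
}
```

A caller would use it as, for example, getenv_utf8("GS_FONTPATH") and free() the result when done. Returning an owned copy avoids handing out a pointer into a temporary conversion buffer.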
Similar changes are also needed in gp_msprn.c. The printer name will need to be passed in as UTF-8 and then converted to Unicode (wchar_t) before calling OpenPrinterW().