690026 – Pass unicode text to ghostscript

Status:	RESOLVED FIXED

Alias:	None

Product:	Ghostscript
Classification:	Unclassified
Component:	General
Version:	8.63
Hardware:	PC Windows 2000

Importance:	P4 enhancement
Assignee:	Ray Johnston

URL:
Keywords:

Depends on:
Blocks:

Reported:	2008-08-19 08:30 UTC by Ryan
Modified:	2012-05-08 19:04 UTC
CC List:	2 users

See Also:
Customer:
Word Size:	---

Attachments
example (18 bytes, application/postscript) 2008-08-20 05:46 UTC, Ryan

Description Ryan 2008-08-19 08:30:18 UTC

When I try to open a PS file whose name contains Unicode text, Ghostscript 
does not accept the text and reads it as ?????.

For example, if I paste in Japanese Unicode-formatted text (テスト材料.ps), it 
is displayed as question marks (?????.ps) in the Ghostscript window and the 
file is not found.

Comment 1 Alex Cherepanov 2008-08-19 17:55:19 UTC

Ghostscript doesn't know or care about any character encoding. All it sees
is a sequence of octets represented as a PostScript string. This string
is passed intact to the C library.

You probably need to encode your file name as UTF-8.

You can check what encoding you get from the system by enumerating the current
directory using the following PS program.

(*) {==} =string filenameforall

Comment 2 Ryan 2008-08-20 05:46:13 UTC

Created attachment 4296
example

Comment 3 Ryan 2008-08-20 05:51:45 UTC

I created a UTF-8 text document, named it テスト材料.ps, and when I paste that 
into Ghostscript I get ?????.ps

The Windows Server 2003 kernel is Unicode UTF-16, so I'm guessing that is how 
the file name is stored (independent of what encoding is used inside the file).

Comment 4 Ray Johnston 2008-08-20 08:36:52 UTC

Postscript strings have their own escape convention that can be difficult to
master. I recommend passing unicode to Ghostscript's command line GS> prompt
as hex strings enclosed in < >. For example, examples/tiger.eps would be:
    <6578616d706c65732f74696765722e657073>
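
A minimal Python sketch of how such a hex string can be produced. The helper name ps_hex_string is hypothetical, and the default ASCII encoding is an assumption that only fits names like the tiger.eps example:

```python
def ps_hex_string(name, encoding="ascii"):
    """Return `name` as a PostScript hex string literal such as <6578...>.

    `encoding` decides which bytes end up in the string. ASCII is only an
    assumption that suits plain Latin names like the tiger.eps example.
    """
    return "<" + name.encode(encoding).hex() + ">"


print(ps_hex_string("examples/tiger.eps"))
# prints <6578616d706c65732f74696765722e657073>
```

For non-ASCII names the result depends entirely on the encoding chosen, so the hex string is only useful if it matches what the system file functions expect.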

Comment 5 SaGS 2008-08-21 04:02:51 UTC

I think the problem here is not connected to PostScript, PS string encoding, 
etc, and it’s not even specific to Ghostscript, but appears with all ANSI 
Windows applications.

In Windows, many functions, including file functions that take strings, come 
in 2 flavours: an "ANSI" version that expects strings to use the Windows 
installed ANSI codepage(*), and the "Unicode" version that takes Unicode 
strings. The names for the former end in "A", as in "FindFirstFileA()", and 
the latter in "W" (from "wide char"), like "FindFirstFileW()". 
The "undecorated" names, as in "FindFirstFile()", are just #defines that map 
to one or the other, depending on "UNICODE" being defined or not. Note that 
most "W" functions do not exist on Windows 95/98/ME, and simply using them 
makes the executable not even start on those systems.

Ghostscript is an 'ANSI' application, meaning it uses, by default, the ANSI 
installed codepage. If it were a Unicode app, then it would use Unicode 
internally, "FindFirstFile()" would map to "FindFirstFileW()", and could 
access all filenames. As I see it, being an ANSI and not Unicode app is the 
only GS-specific thing that is part of the problem signaled.

The ANSI codepages use a single-byte charset for western (and other) versions 
of Windows. The filesystem does support Unicode filenames, accepting 
characters outside the installed ANSI codepage. So the "ANSI" file functions 
convert filenames between Unicode (used by the filesystem) and the installed 
ANSI codepage (used by "ANSI" apps), and in the process characters that have 
no equivalent are mapped to a default character. Ghostscript and all ANSI apps 
really receive "?????.ps" (<3F3F3F3F3F2E7073>) from "FindFirst/NextFileA()", 
so cannot access the file. The file would (should!) be accessible on a 
Japanese version of Windows, but I do not have such a system to verify.

---
(*) Sometimes the OEM codepage, see "SetFileApisToANSI/OEM()" in Platform SDK.
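
The lossy conversion described above can be imitated in Python. This is only a sketch: it uses Python's own cp1252 codec with errors="replace" as a stand-in for a western ANSI codepage, and the real Windows conversion may apply best-fit mappings that this does not reproduce. The helper name ansi_view is hypothetical.

```python
def ansi_view(name, codepage="cp1252"):
    """Encode `name` the way a single-byte ANSI codepage might see it.

    Characters with no equivalent in `codepage` become '?', which mimics
    (but is not) the default-character substitution done by Windows.
    """
    return name.encode(codepage, errors="replace")


seen = ansi_view("テスト材料.ps")
print(seen)                # b'?????.ps'
print(seen.hex().upper())  # 3F3F3F3F3F2E7073
```

Once the name has been reduced to question marks, the original characters cannot be recovered, which is why the file cannot be opened by that name.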

Comment 6 Ryan 2008-08-21 08:03:26 UTC

I appreciate all the info you've provided. Are there any plans to have a future 
version of ghostscript with unicode support?

Also, I have tested this on a Japanese version of Windows 2003 and it behaves 
the same way with the ?'s

Comment 7 Ryan 2008-08-21 11:17:26 UTC

I'm having a problem with entering hex values using <>. For example if I want 
to use the file c:\temp\テスト材料.ps I entered the following:

gswin32.exe -sDEVICE=pdfwrite -sOutputFile="c:\temp\test.pdf" -dBATCH -dNOPAUSE -q <633A5C74656D705C30C630B930C8675065992E7073>

or 
63 3A 5C 74 65 6D 70 5C 30C6 30B9 30C8 6750 6599 2E 70 73
c  :  \  t  e  m  p  \  テ    ス   ト    材    料    .  p  s

Does it have something to do with the Japanese characters being 4 hex digits 
each, while the standard Latin characters are only 2?

Comment 8 Ray Johnston 2008-08-21 11:27:16 UTC

To use hex from the command line to run a file, you must use:

gswin32.exe -sDEVICE=pdfwrite -sOutputFile="c:\temp\test.pdf" -dBATCH -dNOPAUSE -q -c "<633A5C74656D705C30C630B930C8675065992E7073> run"

The < > syntax to encode a hex string is PostScript syntax, not Windows shell
syntax. The -c option feeds PostScript to the interpreter from the command
line, and the 'run' PostScript operator runs the file.
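
For scripted use, the same command line can be assembled programmatically. The sketch below only builds the argument list and does not launch Ghostscript; the helper name gs_run_args is hypothetical, the options are the ones quoted in this thread, and the caller is assumed to supply the file name already encoded as the bytes Ghostscript should see.

```python
def gs_run_args(path_bytes, output):
    """Build an argument list that runs a file named by raw bytes.

    `path_bytes` must already be in whatever encoding the system file
    functions expect; this helper only wraps it in PostScript < > syntax.
    """
    ps = "<" + path_bytes.hex().upper() + "> run"
    return [
        "gswin32.exe",
        "-sDEVICE=pdfwrite",
        "-sOutputFile=" + output,
        "-dBATCH",
        "-dNOPAUSE",
        "-q",
        "-c",
        ps,
    ]


args = gs_run_args(b"a.ps", r"c:\temp\test.pdf")
print(args[-1])  # <612E7073> run
```

Passing the list to something like subprocess.run avoids shell quoting issues, since the < and > characters never reach the Windows shell as redirection operators.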

Comment 9 Ryan 2008-08-21 12:07:13 UTC

Sorry, but it's still not reading the file.

If I use "a.ps" translated to hex as: 61 2E 70 73 it works fine. 
But when using a Unicode character with 4 hex digits such as テ.ps: 30C6 2E 70 73 
Ghostscript tries to read the first character 30C6 as 30 C6, which leads to the 
file not being found.

Comment 10 SaGS 2008-08-22 03:02:10 UTC

It's wrong to mix Unicode Values and ASCII as in comment #9 
("<30C6 2E ...>"). The encoding for filenames must be according to 
the Windows installed ANSI codepage, and this codepage must support 
the characters you want to use.
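
To see why the mixed string fails, it helps to compare the byte sequences that different encodings produce for the same name. A short Python sketch (the helper name hex_in is hypothetical, and Python's cp932 codec is assumed to match the Windows codepage for these characters):

```python
def hex_in(name, encoding):
    """Return the bytes of `name` in `encoding` as uppercase hex."""
    return name.encode(encoding).hex().upper()


name = "テ.ps"
print(hex_in(name, "utf-16-be"))  # 30C6002E00700073
print(hex_in(name, "utf-8"))      # E383862E7073
print(hex_in(name, "cp932"))      # 83652E7073
```

None of these equals 30C62E7073: that string uses a UTF-16 code unit for the first character and single bytes for the rest, so it is not valid in any one encoding.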

Example:

I assume the original poster wants to open a file whose name 
(5 chars) is contained in attachment #4296, extension ".ps".

As I said, 1st condition is that the Windows default codepage supports 
those characters. I got this on an English version of Windows XP with 
SP3 by going to "Control Panel"/ "Regional and Language Options" and:

- Check "Language" tab/ "Install Files for East Asian Languages";
- Set "Advanced" tab/ "Language for non-Unicode programs" to 
  "Japanese". WARNING: This affects ALL ANSI apps, so you may want to 
  restore the previous value after this test. Especially since 932 
  is a double-byte character set, and some applications are only 
  able to handle single-byte charsets.

For verification, open the following Registry key:

    HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Nls\CodePage

and see that "ACP" is 932. The codepage identified by this Registry 
entry is the one used (among other things) for filenames by the ANSI 
apps (as GS is), and 932 supports the characters in attachment #4296.

Attachment #4296 is encoded as UTF-8. Let's translate that to cp932 
("Japanese (Shift-JIS)"). We will translate UTF-8 -> Unicode Values, 
then UVs -> Codepage 932. The correspondence Unicode <-> CP932 is at
http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP932.TXT

UTF-8 from                                       Japanese
attachment #4296  Unicode                        CP 932
----------------  -----------------------------  -----------
ef bb bf          U+feff "Byte Order Mark"
e3 83 86          U+30c6 "Katakana Letter Te"    83 65
e3 82 b9          U+30b9 "Katakana Letter Su"    83 58
e3 83 88          U+30c8 "Katakana Letter To"    83 67
e6 9d 90          U+6750 -                       8D DE
e6 96 99          U+6599 -                       97 BF
                  U+002E "Full Stop"             2E
                  U+0070 "Latin Small Letter P"  70
                  U+0073 "Latin Small Letter S"  73

So to open that file in Postscript you have to code:

    <8365835883678DDE97BF2E7073> (r) file

This did work fine for me [in the conditions I mentioned].
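
The manual table lookup above can be cross-checked with a few lines of Python. This is a sketch under two assumptions: Python's cp932 codec agrees with the Microsoft mapping table for these characters, and the attachment bytes are exactly the ones listed in the table. The names ATTACHMENT_HEX and attachment_to_ps_hex are hypothetical.

```python
# The 18 bytes of the attachment as listed in the table (BOM + 5 chars).
ATTACHMENT_HEX = "efbbbf e38386 e382b9 e38388 e69d90 e69699"


def attachment_to_ps_hex(attachment_hex, extension=".ps", codepage="cp932"):
    """Convert UTF-8 attachment bytes to a PostScript hex string in `codepage`."""
    raw = bytes.fromhex(attachment_hex)  # fromhex ignores the spaces
    stem = raw.decode("utf-8-sig")       # utf-8-sig drops the leading BOM
    encoded = (stem + extension).encode(codepage)
    return "<" + encoded.hex().upper() + ">"


print(attachment_to_ps_hex(ATTACHMENT_HEX))
# prints <8365835883678DDE97BF2E7073>
```

The resulting string only names the file correctly when the Windows ANSI codepage is 932, as in the setup described above.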

Comment 11 Robin Watts 2011-06-08 14:07:08 UTC

The latest version of gs in git (607afb7) contains changes to make Windows builds operate as consistently as possible with Unix ones.

Under Unix, all filenames/options/command lines/environment variables are UTF-8 encoded. This means gs deals with them as UTF-8 internally, and calls out to system functions that accept the UTF-8 and convert it back to the true Unicode format before processing.

Now, under Windows, we get called through the Unicode entrypoints, encode everything into UTF-8 and operate internally as UTF-8. When we make calls to system functions, we convert from UTF-8 to Unicode and call the Unicode version of such functions.

We'd be grateful for any testing that can be done on this code - hopefully this should help here?
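
The boundary conversion described in this comment can be illustrated in Python. This is not Ghostscript's code, only a sketch of the idea, assuming the Windows "Unicode" format is UTF-16 little-endian; the function names from_system and to_system are hypothetical.

```python
def from_system(wide):
    """Convert a UTF-16LE name received from the system to internal UTF-8."""
    return wide.decode("utf-16-le").encode("utf-8")


def to_system(internal):
    """Convert an internal UTF-8 name back to UTF-16LE for a system call."""
    return internal.decode("utf-8").encode("utf-16-le")


wide_name = "テスト材料.ps".encode("utf-16-le")
internal = from_system(wide_name)
print(internal.hex())                     # UTF-8 bytes used internally
print(to_system(internal) == wide_name)   # True: the round trip is lossless
```

Unlike the ANSI codepage path, both directions here are lossless, so no character is replaced by a default character along the way.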

Comment 12 Ray Johnston 2012-05-08 19:04:16 UTC

As far as we know, building gs on Windows with USEUNICODE=1 (which causes
msvc.mak to omit the /DWINDOWS_NO_UNICODE C flag) works and will resolve
this issue.

Closing as FIXED. Anyone testing this and finding a problem can open a new bug,
or REOPEN this bug with the details of the problem.