697555 – Ghostscript input to read filenames with umlaute from file in CMD

Summary: Ghostscript input to read filenames with umlaute from file in CMD

Status:	RESOLVED FIXED

Alias:	None

Product:	Ghostscript
Classification:	Unclassified
Component:	Client API
Version:	unspecified
Hardware:	PC Windows 10

Importance:	P4 normal
Assignee:	Default assignee

Reported:	2017-02-06 07:44 UTC by oliver.majchrzak
Modified:	2017-04-13 04:41 UTC
CC List:	2 users

Attachments
Jürgen1.pdf+Jürgen2.pdf, files.lst and DOS batch (6.64 KB, application/x-zip-compressed) 2017-02-06 08:24 UTC, oliver.majchrzak
files.lst converted from WINDOWS-1252 to UTF-8 with iconv (32 bytes, application/octet-stream) 2017-02-06 10:48 UTC, oliver.majchrzak

Description oliver.majchrzak 2017-02-06 07:44:20 UTC

I am using Ghostscript 9.20 from the Windows command prompt. Ghostscript is supposed to read filenames from a file, but some of the filenames contain umlauts (e.g. ü, ä, ö), such as "Jürgen1.pdf" and "Jürgen2.pdf". Ghostscript 9.20 swallows the umlaut ü and cannot read filenames containing umlauts at all. I asked about this at http://stackoverflow.com/questions/41978376/ghostscript-input-to-read-filenames-with-umlaute-from-file-in-cmd/41981486?noredirect=1#comment71235061_41981486 and the people there told me to report it as a bug in Bugzilla. The batch script that fails is below:

chcp 1252
set file_output=Jürgen_merged
dir "Jürgen*.pdf" /b /o:n > files.txt
"C:\Program Files (x86)\Gawk\gawk4.1\gawk" "{ print \"\042\" $0 \"\042\" }" files.txt > files.lst
"C:\Program Files (x86)\gs\gs9.20\bin\gswin64c" -sPAPERSIZE=a4 -sDEVICE=pdfwrite -o "%file_output%.pdf" @files.lst
del files.lst

Comment 1 Ken Sharp 2017-02-06 07:53:10 UTC

We're going to need a file to reproduce the problem. Ideally please supply a file (just one file please) with a name containing an umlaut, and the files.lst file you use to try and access it.

Comment 2 oliver.majchrzak 2017-02-06 08:24:44 UTC

Created attachment 13360
Jürgen1.pdf+Jürgen2.pdf, files.lst and DOS batch

Comment 3 oliver.majchrzak 2017-02-06 08:28:14 UTC

(In reply to Ken Sharp from comment #1)
> We're going to need a file to reproduce the problem. Ideally please supply a
> file (just one file please) with a name containing an umlaut, and the
> files.lst file you use to try and access it.

Here you go: the sample files Jürgen1.pdf and Jürgen2.pdf, the DOS batch script, and files.lst. I use (g)awk only to wrap the filenames in double quotes, since filenames containing blanks otherwise can't be read properly. The version I am using is GPL Ghostscript 9.20 (2016-09-26).

Comment 4 Ken Sharp 2017-02-06 09:13:28 UTC

The first problem is that the content of the file specified by the @file syntax needs to be in UTF-8 format; the code reading it expects the data to be UTF-8, and the files.lst file is not UTF-8 encoded.

However, even when that is fixed, the problem doesn't go away because (I believe) there's a bug in the UTF-8 processing. In gsargs.c, get_codepoint_utf8(), at around line 99:

        } while (((c & 0xC0) == 0xC0) && --len);
        if (len) {

The problem is that if 'c & 0xC0' is not equal to 0xC0, the code doesn't execute --len and simply exits the loop, even though it did consume a byte. The next line tests len and, because it wasn't decremented, finds it is not 0, so the code decides the rune was improperly formatted and goes round again, neatly discarding the UTF-8 codes.
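To make the bit test concrete, here is a minimal C sketch. It is not the Ghostscript source, and both function names are hypothetical. It applies the mask quoted above to a single byte and, for comparison, the standard test for a UTF-8 continuation byte, which has the form 10xxxxxx.

```c
#include <stdbool.h>

/* The comparison from the loop quoted above: true only when BOTH top
 * bits are set (11xxxxxx). In UTF-8 that is the pattern of the first
 * byte of a multi-byte sequence. Hypothetical helper name. */
static bool top_two_bits_set(unsigned char c)
{
    return (c & 0xC0) == 0xC0;
}

/* Standard UTF-8 rule: a continuation byte has the top bit set and the
 * next bit clear (10xxxxxx). Hypothetical helper name. */
static bool is_utf8_continuation(unsigned char c)
{
    return (c & 0xC0) == 0x80;
}
```

In UTF-8, "ü" (U+00FC) is the two bytes 0xC3 0xBC. The second byte, 0xBC, fails the '== 0xC0' comparison but passes the '== 0x80' one, so a loop that keeps going only while the first comparison holds stops at the very byte it should accept.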

Comment 5 Robin Watts 2017-02-06 09:48:13 UTC

(In reply to Ken Sharp from comment #1)
> We're going to need a file to reproduce the problem. Ideally please supply a
> file (just one file please) with a name containing an umlaut, and the
> files.lst file you use to try and access it.

I don't want a script to produce files.lst. I want a copy of files.lst.

I don't have gawk on my machine, and I don't want to have to install it just to get this working.

Comment 6 Robin Watts 2017-02-06 10:04:34 UTC

(In reply to Ken Sharp from comment #4)
> The first problem is that the content of the file specified by the @file
> syntax needs to be in UTF-8 format; the code reading it expects the data to
> be UTF-8, and the files.lst file is not UTF-8 encoded.

That is indeed the problem.

> However, even when that is fixed, the problem doesn't go away because (I
> believe) there's a bug in the UTF-8 processing.

I think the UTF-8 processing is fine. The problem is just that we are expecting files.lst to be in UTF-8 format.

Comment 7 oliver.majchrzak 2017-02-06 10:10:49 UTC

It doesn't matter whether I create files.lst with dir or manually. I converted files.lst to UTF-8 with iconv, and Ghostscript still can't handle a files.lst that contains filenames with umlauts.

Comment 8 Robin Watts 2017-02-06 10:21:08 UTC

(In reply to oliver.majchrzak from comment #7)
> It doesn't matter if I create the files.lst with dir or manually. I
> converted the files.lst with iconv to UTF8 and still ghostscript can't
> handle the file.lst with filenames with umlaute.

Ok, that doesn't change my point. Give me a set of example files so I know that I'm running EXACTLY the same as you are.

Every extra step you make me take ("oh, you'll need to install gawk" etc.) makes it less likely that I will look into your bug. Make it easy for me.

I'm not trying to be awkward or difficult here. The GS developers are all very busy, and while we try to respond to free user bugs in a timely fashion, having to jump through hoops to reproduce things eats our time.

Comment 9 oliver.majchrzak 2017-02-06 10:48:22 UTC

Created attachment 13363
files.lst converted from WINDOWS-1252 to UTF-8 with iconv

I was just kindly asked by the Stack Overflow community to pass this bug on to you. I already have my workaround, and I am only a non-professional with no programming experience! I am sorry to hassle you with this bug. But since I use Ghostscript often and think this is a reasonable bug, I posted it on this forum.

I converted the file list with iconv:

iconv -f WINDOWS-1252 -t UTF-8 files.txt > files.lst

The resulting files.lst is attached.
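For readers wondering what that conversion does to the name, here is a rough C sketch of the byte-level change. It is an illustration only, and the function name is hypothetical. It treats every input byte as an ISO-8859-1 code point, which matches Windows-1252 only for ASCII and for bytes 0xA0 to 0xFF; bytes 0x80 to 0x9F map differently in Windows-1252, so iconv remains the right tool for real files.

```c
#include <stddef.h>

/* Convert in[0..in_len) to NUL-terminated UTF-8 in out, treating each
 * input byte as an ISO-8859-1 code point (an assumption; see lead-in).
 * Returns the number of bytes written, not counting the NUL, or
 * (size_t)-1 if out_cap is too small. Hypothetical helper. */
static size_t latin1_to_utf8(const unsigned char *in, size_t in_len,
                             unsigned char *out, size_t out_cap)
{
    size_t o = 0;

    for (size_t i = 0; i < in_len; i++) {
        unsigned char c = in[i];

        if (c < 0x80) {
            /* ASCII is copied unchanged; keep one byte free for the NUL. */
            if (o + 1 >= out_cap)
                return (size_t)-1;
            out[o++] = c;
        } else {
            /* U+0080..U+00FF becomes two bytes: 110000xx 10xxxxxx. */
            if (o + 2 >= out_cap)
                return (size_t)-1;
            out[o++] = (unsigned char)(0xC0 | (c >> 6));
            out[o++] = (unsigned char)(0x80 | (c & 0x3F));
        }
    }
    if (o >= out_cap)
        return (size_t)-1;
    out[o] = '\0';
    return o;
}
```

With this sketch, the single byte 0xFC that stands for "ü" in the original list becomes the pair 0xC3 0xBC, which is the form the @file reader expects.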

Thanks

Comment 10 Robin Watts 2017-02-06 11:24:23 UTC

(In reply to oliver.majchrzak from comment #9)
> I was just kindly asked by the stackoverflow community to pass this bug on
> to you. I got my workaround already and I am only a non-professional with no
> programming experience! I am sorry to hassle you with this bug. But since I
> use ghostscript often and think this is a reasonable bug, I posted it onto
> this forum.

It's certainly a reasonable bug. And now that you have supplied the missing file, I can see that it is a genuine one.

> I converted the files.lst with iconv -f WINDOWS-1252 -t UTF-8 files.txt >
> files.lst The resulting files.lst attached.

Many thanks.

This DOES show a thinko in my utf-8 handling. There are therefore 2 problems here. 1) That the @file is assumed to be in utf-8 format, and 2) that the utf-8 handling is broken.

Fixing 2 is simple enough:

commit a65893f973c65d2ba22f8b2a2c6cf0822fc8c1da
Author: Robin Watts <robin.watts@artifex.com>
Date:   Mon Feb 6 19:20:40 2017 +0000

    Bug 697555: Fix UTF-8 handling of args.

    The logic for checking for continuation bytes in UTF-8 was
    broken. Continuation bytes have the top bit set, but not the top 2
    bits set.

    This leaves the issue of @files on windows always being taken
    as UTF-8.

Thanks for bringing this to our attention.
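As an illustration of the rule stated in the commit message (continuation bytes have the top bit set but not the top two), here is a minimal C sketch of decoding one UTF-8 sequence. It is not the code from the commit; the function name and signature are hypothetical, and the sketch does not reject overlong encodings or surrogate code points.

```c
#include <stddef.h>

/* Decode one UTF-8 sequence from s, reading at most n bytes.
 * On success stores the code point in *cp and returns the number of
 * bytes consumed; returns 0 if the sequence is truncated or malformed.
 * Hypothetical helper, not the Ghostscript implementation. */
static size_t decode_utf8(const unsigned char *s, size_t n, unsigned long *cp)
{
    unsigned long v;
    size_t len;
    unsigned char c;

    if (n == 0)
        return 0;
    c = s[0];

    /* The lead byte determines the sequence length. */
    if (c < 0x80) {
        len = 1; v = c;
    } else if ((c & 0xE0) == 0xC0) {
        len = 2; v = c & 0x1F;
    } else if ((c & 0xF0) == 0xE0) {
        len = 3; v = c & 0x0F;
    } else if ((c & 0xF8) == 0xF0) {
        len = 4; v = c & 0x07;
    } else {
        return 0; /* stray continuation byte or invalid lead byte */
    }
    if (len > n)
        return 0;

    /* Every following byte must be a continuation byte: 10xxxxxx. */
    for (size_t i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return 0;
        v = (v << 6) | (unsigned long)(s[i] & 0x3F);
    }
    *cp = v;
    return len;
}
```

Fed the bytes 0xC3 0xBC, this sketch consumes both and yields U+00FC ("ü"). A lone 0xBC, or 0xC3 followed by an ASCII byte, is reported as malformed.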

Comment 11 Chris Liddell (chrisl) 2017-02-06 12:01:15 UTC

(In reply to Robin Watts from comment #10)
> (In reply to oliver.majchrzak from comment #9)
> > I was just kindly asked by the stackoverflow community to pass this bug on
> > to you. I got my workaround already and I am only a non-professional with no
> > programming experience! I am sorry to hassle you with this bug. But since I
> > use ghostscript often and think this is a reasonable bug, I posted it onto
> > this forum.
> 
> It's certainly a reasonable bug. And having supplied the missing file, I can
> see that it is a genuine one.
> 
> > I converted the files.lst with iconv -f WINDOWS-1252 -t UTF-8 files.txt >
> > files.lst The resulting files.lst attached.
> 
> Many thanks.
> 
> This DOES show a thinko in my utf-8 handling. There are therefore 2 problems
> here. 1) That the @file is assumed to be in utf-8 format, and 2) that the
> utf-8 handling is broken.
<SNIP>

The 1) "problem" was a known limitation imposed on us because the contents of the @file.lst argument are (effectively) handled in PostScript - once we're in PostScript, I don't think there's a way to get strings from the current Windows codepage to UTF-8 that we usefully handle in PostScript.

Comment 12 Robin Watts 2017-02-09 02:50:44 UTC

(In reply to Chris Liddell (chrisl) from comment #11)
> The 1) "problem" was a known limitation imposed on us because the contents
> of the @file.lst argument are (effectively) handled in PostScript - once
> we're in PostScript, I don't think there's a way to get strings from the
> current Windows codepage to UTF-8 that we usefully handle in PostScript.

In theory the contents of an @file go through the same conversion functions as are used for decoding command line arguments.

The problem is that the Windows executables convert the arguments to UTF-8 and pass them in, so the conversion functions are set up to expect UTF-8 in that case...

Comment 13 oliver.majchrzak 2017-04-13 04:41:52 UTC

With the release of Ghostscript 9.21 the bug is fixed. Ghostscript under Windows (in my case MS-Windows 10) can now read the file list from an @file, and the file does not even need to be UTF-8 encoded. Thanks!!!