I am using Ghostscript 9.20 in the Windows command prompt. Ghostscript should read filenames from a file, but some of the filenames contain umlauts (ü, ä, ö), for example "Jürgen1.pdf" and "Jürgen2.pdf". Ghostscript 9.20 swallows the umlaut ü and cannot read filenames with umlauts at all.
I asked about this at http://stackoverflow.com/questions/41978376/ghostscript-input-to-read-filenames-with-umlaute-from-file-in-cmd/41981486?noredirect=1#comment71235061_41981486 and the people there told me to report it as a bug in Bugzilla.
The batch script that fails:
    chcp 1252
    set file_output=Jürgen_merged
    dir "Jürgen*.pdf" /b /o:n > files.txt
    "C:\Program Files (x86)\Gawk\gawk4.1\gawk" "{ print \"\042\" $0 \"\042\" }" files.txt > files.lst
    "C:\Program Files (x86)\gs\gs9.20\bin\gswin64c" -sPAPERSIZE=a4 -sDEVICE=pdfwrite -o "%file_output%.pdf" @files.lst
    del files.lst
We're going to need a file to reproduce the problem. Ideally please supply a file (just one file please) with a name containing an umlaut, and the files.lst file you use to try and access it.
Created attachment 13360 [details] Jürgen1.pdf+Jürgen2.pdf, files.lst and DOS batch
(In reply to Ken Sharp from comment #1)
> We're going to need a file to reproduce the problem. Ideally please supply a
> file (just one file please) with a name containing an umlaut, and the
> files.lst file you use to try and access it.
Here you go: the attachment contains the sample files Jürgen1.pdf and Jürgen2.pdf, the batch script, and the files.lst passed via @files.lst. I use (g)awk only to wrap each filename in double quotes, since filenames containing blanks otherwise can't be read properly. I use this version: GPL Ghostscript 9.20 (2016-09-26)
The first problem is that the content of the file specified by the @file syntax needs to be in UTF-8 format. The code reading it expects UTF-8, and the files.lst file is not UTF-8 encoded.
However, even when that is fixed, the problem doesn't go away because (I believe) there's a bug in the UTF-8 processing. In gsargs.c, get_codepoint_utf8(), at around line 99:
    } while (((c & 0xC0) == 0xC0) && --len);
    if (len) {
The problem is that if 'c & 0xC0' is not equal to 0xC0, the loop exits without executing --len, even though it did consume a byte. The next line tests len, and because it wasn't decremented it is not 0, so the code decides the rune was improperly formatted and goes round again, neatly discarding the UTF-8 codes.
(In reply to Ken Sharp from comment #1)
> We're going to need a file to reproduce the problem. Ideally please supply a
> file (just one file please) with a name containing an umlaut, and the
> files.lst file you use to try and access it.
I don't want a script to produce files.lst. I want a copy of files.lst. I don't have gawk on my machine, and I don't want to have to install it just to get this working.
(In reply to Ken Sharp from comment #4)
> The first problem is that the content of the file specified by the @file
> syntax need to be in UTF-8 format, the code reading it expects the data to
> be UTF-8, and the files.lst file is not UTF-8 encoded.
That is indeed the problem.
> However, even when that is fixed, the problem doesn't go away because (I
> believe) there's a bug in the UTF-8 processing.
I think the UTF-8 processing is fine. The problem is just that we are expecting files.lst to be in UTF-8 format.
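To illustrate the encoding mismatch described above, here is a minimal Python sketch. It uses Python's own UTF-8 decoder, not Ghostscript's code, so it only shows the general effect: a filename saved as Windows-1252 is not valid UTF-8, and a lenient reader that skips the invalid byte loses the umlaut. The helper name try_utf8 is hypothetical.

```python
# A filename as the batch script would write it under code page 1252.
raw = "Jürgen1.pdf".encode("cp1252")  # 'ü' becomes the single byte 0xFC


def try_utf8(data):
    """Return the text if data is valid UTF-8, otherwise None."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        return None


# Strict decoding rejects the lone 0xFC byte.
strict = try_utf8(raw)

# A lenient decoder that skips invalid bytes silently drops the umlaut.
lossy = raw.decode("utf-8", errors="ignore")

print(raw, strict, lossy)
```

The lossy result matches the symptom reported in this bug: the ü disappears from the name, so the file cannot be found.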
It doesn't matter whether I create files.lst with dir or manually. I converted files.lst to UTF-8 with iconv, and Ghostscript still can't handle a files.lst containing filenames with umlauts.
(In reply to oliver.majchrzak from comment #7)
> It doesn't matter if I create the files.lst with dir or manually. I
> converted the files.lst with iconv to UTF8 and still ghostscript can't
> handle the file.lst with filenames with umlaute.
OK, that doesn't change my point. Give me a set of example files so I know that I'm running EXACTLY the same as you are. Every extra step you make me take ("oh, you'll need to install gawk" etc.) makes it less likely that I will look into your bug. Make it easy for me.
I'm not trying to be awkward or difficult here. The GS developers are all very busy, and while we try to respond to free user bugs in a timely fashion, having to jump through hoops to reproduce things eats our time.
Created attachment 13363 [details] files.lst converted from WINDOWS-1252 to UTF-8 with iconv
I was just kindly asked by the Stack Overflow community to pass this bug on to you. I already have my workaround, and I am only a non-professional with no programming experience! I am sorry to hassle you with this bug, but since I use Ghostscript often and think this is a reasonable bug, I posted it here.
I converted files.lst with:
    iconv -f WINDOWS-1252 -t UTF-8 files.txt > files.lst
The resulting files.lst is attached. Thanks
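For readers without iconv, the same re-encoding step can be sketched in Python. This is an illustration only, not part of the reporter's workflow. The function name recode_cp1252_to_utf8 and the in-memory sample listing are assumptions; a real script would read files.txt and write files.lst in binary mode.

```python
def recode_cp1252_to_utf8(data):
    """Decode Windows-1252 bytes and re-encode the text as UTF-8."""
    return data.decode("cp1252").encode("utf-8")


# Sample content standing in for a quoted file list saved under code page 1252.
listing = '"Jürgen1.pdf"\r\n"Jürgen2.pdf"\r\n'.encode("cp1252")
converted = recode_cp1252_to_utf8(listing)

# In Windows-1252 'ü' is one byte (0xFC); in UTF-8 it is two (0xC3 0xBC).
print(listing)
print(converted)
```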
(In reply to oliver.majchrzak from comment #9)
> I was just kindly asked by the stackoverflow community to pass this bug on
> to you. I got my workaround already and I am only a non-professional with no
> programming experience! I am sorry to hassle you with this bug. But since I
> use ghostscript often and think this is a reasonable bug, I posted it onto
> this forum.
It's certainly a reasonable bug. And having supplied the missing file, I can see that it is a genuine one.
> I converted the files.lst with iconv -f WINDOWS-1252 -t UTF-8 files.txt >
> files.lst The resulting files.lst attached.
Many thanks.
This DOES show a thinko in my UTF-8 handling. There are therefore 2 problems here: 1) that the @file is assumed to be in UTF-8 format, and 2) that the UTF-8 handling is broken.
Fixing 2 is simple enough:
    commit a65893f973c65d2ba22f8b2a2c6cf0822fc8c1da
    Author: Robin Watts <robin.watts@artifex.com>
    Date:   Mon Feb 6 19:20:40 2017 +0000

        Bug 697555: Fix UTF-8 handling of args.

        The logic for checking for continuation bytes in UTF-8 was broken.
        Continuation bytes have the top bit set, but not the top 2 bits set.
This leaves the issue of @files on Windows always being taken as UTF-8.
Thanks for bringing this to our attention.
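The continuation-byte rule from the commit message can be sketched as follows. This is a Python illustration of the rule only, not the actual change made in gsargs.c. The function names are hypothetical, and the "as quoted" check simply mirrors the `(c & 0xC0) == 0xC0` comparison cited earlier in this thread. The decoder is deliberately minimal and does not reject overlong encodings or surrogates.

```python
def is_continuation_as_quoted(c):
    # Comparison quoted earlier in the thread: true only for 11xxxxxx,
    # which is the pattern of a lead byte, not a continuation byte.
    return (c & 0xC0) == 0xC0


def is_continuation(c):
    # Continuation bytes are 10xxxxxx: top bit set, second bit clear.
    return (c & 0xC0) == 0x80


def decode_codepoint(data, pos=0):
    """Decode one UTF-8 code point at data[pos]; return (codepoint, next_pos)."""
    lead = data[pos]
    if lead < 0x80:
        return lead, pos + 1            # plain ASCII
    if (lead & 0xE0) == 0xC0:
        extra, cp = 1, lead & 0x1F      # 110xxxxx: one continuation byte
    elif (lead & 0xF0) == 0xE0:
        extra, cp = 2, lead & 0x0F      # 1110xxxx: two continuation bytes
    elif (lead & 0xF8) == 0xF0:
        extra, cp = 3, lead & 0x07      # 11110xxx: three continuation bytes
    else:
        raise ValueError("invalid UTF-8 lead byte")
    for i in range(1, extra + 1):
        if pos + i >= len(data) or not is_continuation(data[pos + i]):
            raise ValueError("missing UTF-8 continuation byte")
        cp = (cp << 6) | (data[pos + i] & 0x3F)
    return cp, pos + extra + 1


encoded = "ü".encode("utf-8")           # bytes 0xC3 0xBC
print(is_continuation_as_quoted(encoded[1]), is_continuation(encoded[1]))
print(decode_codepoint(encoded))
```

With the 0xC0 comparison, the second byte of ü (0xBC) is not recognised as a continuation byte, which is consistent with the umlaut being dropped.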
(In reply to Robin Watts from comment #10)
> This DOES show a thinko in my utf-8 handling. There are therefore 2 problems
> here. 1) That the @file is assumed to be in utf-8 format, and 2) that the
> utf-8 handling is broken.
<SNIP>
The 1) "problem" was a known limitation imposed on us because the contents of the @file.lst argument are (effectively) handled in PostScript. Once we're in PostScript, I don't think there's a way to get strings from the current Windows codepage to UTF-8 that we usefully handle in PostScript.
(In reply to Chris Liddell (chrisl) from comment #11)
> The 1) "problem" was a known limitation imposed on us because the contents
> of the @file.lst argument are (effectively) handled in Postscript - once
> we're in Postscript, I don't think there's a way to get strings from the
> current Windows codepage to UTF-8 that we usefully handle in Postscript.
In theory, the contents of an @file go through the same conversion functions that are used for decoding command line arguments. The problem is that the Windows executables convert the arguments to UTF-8 and pass them in, so the conversion functions are set up to expect UTF-8 in that case...
With the release of Ghostscript 9.21, the bug is fixed. Ghostscript under Windows (in my case MS Windows 10) can read the file list from an @file. The file does not even need to be UTF-8 encoded. Thanks!!!