Bug 686853 - filenameforall enumerates incompatible with Adobe
Summary: filenameforall enumerates incompatible with Adobe
Status: RESOLVED FIXED
Alias: None
Product: Ghostscript
Classification: Unclassified
Component: PS Interpreter (show other bugs)
Version: master
Hardware: PC Windows 2000
: P4 normal
Assignee: Chris Liddell (chrisl)
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2003-05-04 03:25 UTC by Igor Melichev
Modified: 2014-03-13 04:47 UTC (History)
2 users (show)

See Also:
Customer:
Word Size: ---


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Igor Melichev 2003-05-04 03:25:32 UTC
filenameforall enumerates directory items at on Windows.
It must not. I didn't test other platforms.
Comment 1 Russell Lang 2003-05-07 00:42:02 UTC
GS 8.00 on Unix also returns directory names.  Unix file enumeration 
has provision for handling (*/*), while Windows does not.
Changing Windows to not return directory names is easy.  
If it is decided to change this, I'm willing to fix the 
Windows version since I wrote half of it.
Comment 2 Russell Lang 2003-05-08 02:55:21 UTC
Fix for Windows submitted to code review.  OS/2 doesn't need fixing.
Comment 3 Ray Johnston 2003-05-08 09:47:42 UTC
Adobe enmumerates files in subdirectories, recursively.

Rather than just not returning directories, the correct fix in
this area is to enumerate files in subdirectories. Note that 
the 'bare' filename should *NOT* be enumerated, so empty
sub-directories will not be enumerated, but sudirectories
with files in them will be enumerated.

Also on Windows, we can arbitrarily use the '/' character
as the path separator.
Comment 4 Igor Melichev 2003-05-08 13:56:43 UTC
The committed patch
http://www.ghostscript.com/pipermail/gs-cvs/2003-May/003246.html
is a partial fix for this bug on Windows.
Comment 5 Ray Johnston 2003-05-08 15:26:23 UTC
For example, if the files are
 file1
 subdir1/
 subdir1/file2
 subdir1/subdir2/
 subdir1/subdir2/file3


It should list the 'leaves' of the directory tree:

   file1	subdir1/file2		subdir1/subdir2/file3

Note that also if there is:

   subdir3/
   subdir3/subdir4
   subdir5/subdir6/file6

the only thing that should be enumerated is:	subdir5/sbudir6/file6
Comment 6 Ray Johnston 2006-03-01 10:21:22 UTC
When implementing recursive enumeration of files in subdirectories,
we need to make sure that the file enumaration of resourceforall does NOT
go into subdirectories. Probably the simplest way to handle this is to
add a 'recurse' boolean to the gp_enumerate_init and then have a
separate operator for file enumeration that does not recurse to be used
internally by the resourceforall searching.

BTW, see comments in bug 688565 for more commentary on why this is needed.
Comment 7 Michael (Micksa) Slade 2006-06-25 01:53:49 UTC
Also, I just found this in mkromfs.c:296:

/* This relies on the gp_enumerate_* which should not return directories, nor */
/* should it recurse into directories (unlike Adobe's implementation)         */
void process_path(char *path, const char *prefix, const char *add_prefix,
Xlist_element *Xlist_head,
                int compression, int *inode_count, int *totlen, FILE *out)
Comment 8 Ken Sharp 2013-06-26 16:20:32 UTC
Commits bcc1052afdec7b4da66c832c97bcba990a1863e0
and f13bfba957c536630a241351df49c5007a0664d9

Fix this for Linux and NTFS file systems respectively. After consideration we have decided not to attempt this for DOS, OS/2 or VMS file systems. These are effectively obselete operating systems, and we can''t even check that modified code would compile.

If anyone wants to supply us with tested code for these we will adopt it, but this is no longer bountiable.
Comment 9 SaGS 2013-08-20 06:06:20 UTC
Some more things that seem wrong:

(A) Current code recurses only in some directories. In GS's source directory (gs\) execute

    DEBUGBIN\gswin32c -c "(t*) dup = {=} 256 string filenameforall quit"

Note that files are listed only if they start with "t" AND all eventual intermediary directories also start with "t". There are many other files that start with "t", like "doc\thirdparty.htm" and "tiff\html\TIFFTechNote2.html".

The cause is that at the point the code checks for directories, FindFirst/NextFile() already filtered them. So the code only sees, and recurses into, directories that themselves match the pattern.

Strictly speaking the 1st of these does NOT match the pattern, because the whole string starts with "d" not "t", but the 2nd one does.

And if we talk about "the whole result string", then why filenameforall does not return ALL files in tiff\, toolbin\, trio\ and all their subdirectories at any depth. All of these start with a "t" (the one in "[T]iff", "[T]oolbin", or "[T]rio"), so they match the pattern.


(B) Extra slashes:

    DEBUGBIN\gswin32c -c "(doc\\\\*) dup = {=} 256 string filenameforall quit"

returns strings like "doc\/API.htm" (with "\/").

Note there are other doc\ dirs (in expat\, freetype\, contrib\japanese\, and others) but it did not "recurse" to find them.

Eventually a 3rd, but this is not a regression:
(C) Escaped "*?" are currently treated the same as non escaped ones. In debugbin\ (less files...) execute the following 2 commands:

    gswin32c -c "(*) dup = {=} 256 string filenameforall quit"
    gswin32c -c "(\\*) dup = {=} 256 string filenameforall quit"

Both return all files. OK for (*). But the template (\\*) specifies an escaped "*", so a filename must consist of an actual asterisc to match the template.

That's relatively easy to fix because on Windows "*?" are not allowed in filenames. In the loop that removes the "\" escapes check if the char that is escaped is a "?" or "*". If it's the case, set a flag so that gp_enumerate_files_next() returns the verdict "no [more] files" immediately, without attempting any FindFirstFile().
Comment 10 Ken Sharp 2013-08-21 02:21:03 UTC
(In reply to comment #9)

> Strictly speaking the 1st of these does NOT match the pattern, because the
> whole string starts with "d" not "t", but the 2nd one does.
> 
> And if we talk about "the whole result string", then why filenameforall does
> not return ALL files in tiff\, toolbin\, trio\ and all their subdirectories
> at any depth. All of these start with a "t" (the one in "[T]iff",
> "[T]oolbin", or "[T]rio"), so they match the pattern.

According to the spec, template matching is device dependent, so anything we do is OK. However, its not the way Adobe Acrobat Distiller works. The code has been modified so that the template is applied at the current directory level *only*, sub directories which match the template will have their contents fully enumerated. This seems to be what Adobe does.

Note that Adobe Distiller only returns the segment of the path relative to the current directory when the template does not contain a full filespec, I don't intend to try and duplicate that.

> (B) Extra slashes:
> 
>     DEBUGBIN\gswin32c -c "(doc\\\\*) dup = {=} 256 string filenameforall
> quit"
> 
> returns strings like "doc\/API.htm" (with "\/").

This should now only ever return strings containing single path separators, no matter how many and what combination of separators are supplied. If you use '\\' then the supplied string will be filled in with '\\' up to the point where directory recursion begins, after that it will use the PostScript file separator of '/', however mixed  separators like '\/' should not occur.

 
> Note there are other doc\ dirs (in expat\, freetype\, contrib\japanese\, and
> others) but it did not "recurse" to find them.

These are not children of the current directory (debugbin), so no it won't find those. See (A) for more on the matching scheme.

 
> Eventually a 3rd, but this is not a regression:
> (C) Escaped "*?" are currently treated the same as non escaped ones. In
> debugbin\ (less files...) execute the following 2 commands:

[...]
> That's relatively easy to fix because on Windows "*?" are not allowed in
> filenames. In the loop that removes the "\" escapes check if the char that
> is escaped is a "?" or "*".

I'm afraid its not quite that simple. The string is not in fact escaped at all at this point (at some time in the past it may have been) so there is no reason to remove escapes at all and I've removed the code which does. I have added an additional check to parse the pattern up to the final separator looking for illegal characters and set an abort if it finds any.

The code was then tested using COMPILE_INITS=0, so that we could be sure that file based resources are being exercised and with COMPILE_INITS=1, and in both cases with GS_NO_UTF8=1 and without specifying GS_NO_UTF8 to ensure that UTF8 handling worked correctly.

I tested with '/' and with '\\', with multiple copies of each in the pattern string and with illegal characters in the string. This all worked as expected (taking the notes above into account)

commit 735589e5b16cf3d31c6bf73ebd75259db9d18f3c does all this for Windows, assigning to Chris to evaluate what (if anything) needs to be done for Unix systems.
Comment 11 SaGS 2013-08-21 04:11:32 UTC
First a simple [to explain and check] thing:
By
     j > 0 && pattern[j-1] == '/' || pattern[j-1] == '\\'
you mean
    (j > 0 && pattern[j-1] == '/') || pattern[j-1] == '\\'
or
    j > 0 && (pattern[j-1] == '/' || pattern[j-1] == '\\')

I have the impression you want the 2nd, but the C operator precedence votes for the 1st.


Next: sorry, but the latest patch (735589e5b16cf3d31c6bf73ebd75259db9d18f3c) broke escape processing again.

In GS's source directory (the gs\) do:

    subst T: .
    cd /d T:\debugbin
    gswin32c -c "(g*) dup = {=} 256 string filenameforall quit"
    gswin32c -c "(\g*) dup = {=} 256 string filenameforall quit"
    gswin32c -c "(\\g*) dup = {=} 256 string filenameforall quit"
    gswin32c -c "(\\\g*) dup = {=} 256 string filenameforall quit"
    gswin32c -c "(\\\\g*) dup = {=} 256 string filenameforall quit"

Actual results:

 - the 1st 2 cases return results from DEBUGBIN\;
 - the other 3 return results from T:\.

Expected results:

 - all but the last return files from DEBUGBIN\ with no path included
   in the result strings, that is "gsdll32.dll" etc;
 - the last returns files from the drive's T: root with a "\" at the
   start of the result strings, that is "\ghostscript-ufst.vcproj" etc.

Explanation:

When using the "()" for a filenameforall template, there are TWO types of escapes:

[1] The PostsScript "()" string syntax escapes. The character used is "\".
    Tha PS syntax scanner processes these, and the result is a sequence of
    bytes that form the string object.

    In the 5 cases, those bytes are (shown here as chars, for readability):

        "g" "*"
        "g" "*" 	-- the 1st "\" is useless, "escapes" a regular char
        "\" "g" "*"	-- 1st "\" escapes the 2nd
        "\" "g" "*"	-- escaped "\" + unnecessary-escaped "g"
        "\" "\" "g"	-- 2 escaped "\"

    You can see this on the first line of each output, printed by "dup =".

    Note that you can write string litterals in other ways, and then this 
    type of escapes does not apply. For example (\\\g*) can also be written
    as <5C672A> - the PS syntax scanner produces identical string objects.

[2] The filenameforall template has its own [processing of] escapes, and
    the escape character used happens to be the same "\". This processing 
    is done on the bytes of the string object, no matter how it was created.

    The templates above are interpreted as:

    first 2: files in current directory that start with a "g"
    next 2:  "\"+"g" -> "g"; the escape char is unnecessary in this case 
             (but not forbidden), because the character after it is a 
             regular one; so these templates also mean files in current 
             directory that start with a "g".
    last:    "\"+"\" -> a single "\" to be matched literraly; so this asks
             for files in the root, with names that start with a "g".

The loop after

    "/* translate the template into a pattern discarding the escape  */"

was dealing with type [2]. Type [1], indeed, is done very early, by the syntax scanner, before even the string object is created. It was simply removing the "\" that are escape chars, but not the escap"ED" ones. This is OK for all chars except "*?".

The problem I signaled in comment #9 at (C) is that this loop cancels the difference between a type-[2]-escaped "*?" which must be matched litterally, and unescaped "*?" that are wildcards. There is no way to tell FindFirst/NextFile() to treat "*?" litterally, so the former cannot be passed successfully to these functions. As it turns out "*?" are not allowd at all in Windows filenames, if you encounter a type-[2]-escaped "*" or "?" you know there are no files to be found.

What you do with

    /* Scan for illegal characters in the directory path */

is a completely different thing: shortcut the use of wildcards in paths, because it is not implemented anyway. It has no relation with type-[2]-escaped-"*?".
Comment 12 Ken Sharp 2013-08-21 05:08:23 UTC
(In reply to comment #11)
> First a simple [to explain and check] thing:
> By
>      j > 0 && pattern[j-1] == '/' || pattern[j-1] == '\\'
> you mean
>     (j > 0 && pattern[j-1] == '/') || pattern[j-1] == '\\'
> or
>     j > 0 && (pattern[j-1] == '/' || pattern[j-1] == '\\')
> 
> I have the impression you want the 2nd, but the C operator precedence votes
> for the 1st.

Well that's merely some safety first stuff but you're correct, it was meant to be the second.


> In GS's source directory (the gs\) do:
> 
>     subst T: .
>     cd /d T:\debugbin
>     gswin32c -c "(g*) dup = {=} 256 string filenameforall quit"
>     gswin32c -c "(\g*) dup = {=} 256 string filenameforall quit"
>     gswin32c -c "(\\g*) dup = {=} 256 string filenameforall quit"
>     gswin32c -c "(\\\g*) dup = {=} 256 string filenameforall quit"
>     gswin32c -c "(\\\\g*) dup = {=} 256 string filenameforall quit"
> 
> Actual results:
> 
>  - the 1st 2 cases return results from DEBUGBIN\;
>  - the other 3 return results from T:\.

This appears correct to me, in the first 2 cases we enumerate from the current directory, in the latter three we start from '\', since /d changes the current drive that means we should enumerate from T:\.

Given that the processing of escapes is defined as device-dependent, and that the usage of '*' or '?' in the path is illegal in Windows, *and* that this code only runs on Windows, I don't see what's to be gained by altering the code so that we honour the escapes.

Yes its true that the (stupid) lines 3 and 4 would have a different result, but frankly I don't care. The current result is *way* better than anything previously, I don't plan to do any further work on this.
Comment 13 Chris Liddell (chrisl) 2013-08-21 05:12:36 UTC
Reopened as a reminder to review the Unix code in the light of changes outlined above.
Comment 14 Chris Liddell (chrisl) 2014-03-13 04:47:28 UTC
I believe the Unix code produces output sufficiently close to Adobe to be considered "correct".

If anyone stumbles across somewhere it's not, then please open a *new* bug for it.