Bug 703905 - configure --with-tessdata does not support paths with double slashes
Status: UNCONFIRMED
Alias: None
Product: Ghostscript
Classification: Unclassified
Component: Build Process
Version: 9.54.0
Hardware: PC All
Importance: P4 normal
Assignee: Robin Watts
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2021-06-03 01:34 UTC by Jérome Perrin
Modified: 2021-06-03 07:03 UTC

See Also:
Customer:
Word Size: ---


Attachments
"proof of concept" patch (3.13 KB, application/mbox)
2021-06-03 01:34 UTC, Jérome Perrin

Description Jérome Perrin 2021-06-03 01:34:15 UTC
Created attachment 21044: "proof of concept" patch

Hello,

./configure supports setting the default path in which the OCR device looks up tessdata files.

This is intended to be used with something like:
  ./configure --with-tessdata=/path/to/tessdata

but if it is used with a double slash like this:
  ./configure --with-tessdata=/path//to/tessdata

this will not work as expected. At runtime the tessdata files will be looked up in /path/ instead of /path//to/tessdata.

I agree that using double slashes like this does not really make sense, and that I should be using /path/to/tessdata rather than /path//to/tessdata, but it took me some time to find out what was wrong, so I figured it would be better to report it. I have an environment where I install a lot of software from source with ./configure --prefix=/path//with//double/slashes, and generally everything works.


Here is what happens. It starts with the --with-tessdata option handled by configure:

https://git.ghostscript.com/?p=ghostpdl.git;a=blob;f=configure.ac;h=1532609f344ca523735c76d100c3cd2a92a15652;hb=bbe40fcc5a89ca45bf8d988ca87613627e21d2d6#l3162


dnl look for default tessdata...
AC_ARG_WITH([tessdata],  AC_HELP_STRING([--with-tessdata],
    [set tesseract data search path]), tessdata="$withval", tessdata="")

if test "x$tessdata" = "x"; then
        tessdata="${datadir}/tessdata"
fi

AC_SUBST(tessdata)



The value is then substituted into the top-level Makefile:

https://git.ghostscript.com/?p=ghostpdl.git;a=blob;f=Makefile.in;h=5a64799050d4ff751ddcd880c0fd093800030a86;hb=bbe40fcc5a89ca45bf8d988ca87613627e21d2d6#l123


Then, from base/ocr.mak, it is passed as a -D flag to the C compiler:

https://git.ghostscript.com/?p=ghostpdl.git;a=blob;f=base/ocr.mak;h=3c9e278758dfc31ef9dd93d29cc2b2b2ed3efc90;hb=bbe40fcc5a89ca45bf8d988ca87613627e21d2d6#l31

        $(TESSCXX) $(D_)LEPTONICA_INTERCEPT_MALLOC=1$(_D) $(I_)$(LEPTONICADIR)$(D)src$(_I) $(GLO_)tessocr.$(OBJ) $(C_) $(D_)TESSDATA="$(TESSDATA)"$(_D) $(GLSRC)tessocr.cpp

Finally, in base/tessocr.cpp, it is turned into a char * variable by two "stringizing" macros:

https://git.ghostscript.com/?p=ghostpdl.git;a=blob;f=base/tessocr.cpp;h=fec5979d6972b1388ce06f0485d9dc1010c58911;hb=bbe40fcc5a89ca45bf8d988ca87613627e21d2d6#l188

#ifndef TESSDATA
#define TESSDATA tessdata
#endif
#define STRINGIFY2(S) #S
#define STRINGIFY(S) STRINGIFY2(S)
static char *tessdata_prefix = STRINGIFY(TESSDATA);

The problem here is that if TESSDATA contains a double slash, everything after the double slash is stripped, because // starts a line comment in C.

From the GCC documentation (https://gcc.gnu.org/onlinedocs/cpp/Stringizing.html):

> Comments are replaced by whitespace long before stringizing happens, so they never appear in stringized text.
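
To make the truncation concrete, here is a minimal standalone sketch of the same two macros. The name TESSDATA_UNQUOTED is a hypothetical stand-in for the value the -D flag supplies, and the sketch assumes a compiler mode that accepts // comments (C99 or later, or C++). It is not code from Ghostscript.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical stand-in for the value supplied by the -D flag.
   The preprocessor treats everything from the double slash to the end
   of the line as a comment and drops it before recording the macro body. */
#define TESSDATA_UNQUOTED /path//to/tessdata

#define STRINGIFY2(S) #S
#define STRINGIFY(S) STRINGIFY2(S)

/* Returns the stringized path; the part after the double slash is gone. */
static const char *stringized_tessdata(void)
{
    return STRINGIFY(TESSDATA_UNQUOTED);
}

/* Returns 1 if the stringized value no longer mentions "tessdata". */
static int tessdata_was_truncated(void)
{
    return strstr(stringized_tessdata(), "tessdata") == NULL;
}
```

Calling tessdata_was_truncated() from a small main() should report that the directory name after the double slash never reaches the string.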

I don't have much experience with C and autotools, especially when it comes to portability to exotic platforms, but it seems that an autoconf macro like AC_DEFINE would produce a variable holding all the -D flags to pass to the C compiler, with the proper escaping.
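
As a rough illustration of what "proper escaping" buys on the C side: if the macro arrives already quoted as a string literal, no stringizing is needed and the double slash survives. This is only a sketch under assumptions. The macro name TESSDATA_LITERAL and the quoted -D form in the comment are hypothetical, and the sketch is not necessarily what the attached patch does.

```c
#include <assert.h>
#include <string.h>

/* Assumption: the build passes the path already quoted, for example
   -DTESSDATA_LITERAL='"/path//to/tessdata"' (hypothetical macro name).
   The fallback below simulates that flag so the sketch stands alone. */
#ifndef TESSDATA_LITERAL
#define TESSDATA_LITERAL "/path//to/tessdata"
#endif

/* Inside a string literal a double slash is ordinary text, not a comment,
   so the full path is preserved. */
static const char *tessdata_prefix_literal(void)
{
    return TESSDATA_LITERAL;
}

/* Returns 1 if the full path, double slash included, is intact. */
static int tessdata_literal_is_intact(void)
{
    return strcmp(tessdata_prefix_literal(), "/path//to/tessdata") == 0;
}
```

The trade-off is that the quoting has to survive make and the shell on every supported platform, which is the portability concern mentioned above.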

I have attached a "proof of concept" patch that I did not test much, but it seems to solve the issue for me.

Thanks !