Bug 696116

Summary:	letters of (hidden) ocr text get separated by spaces
Product:	Ghostscript	Reporter:	RegEchse
Component:	PDF Writer	Assignee:	Ken Sharp <ken.sharp>
Status:	RESOLVED WONTFIX
Severity:	normal	CC:	bart.libert
Priority:	P4
Version:	master
Hardware:	PC
OS:	Linux
Customer:		Word Size:	---
Attachments:	pdf output of tesseract

Description RegEchse 2015-07-26 06:04:56 UTC

Created attachment 11825 [details]
pdf output of tesseract

The attached pdf is some output from tesseract (3.04.00).
If it is processed with gs (gs -sDEVICE=pdfwrite -o test_gs.pdf test.pdf)
the "hidden text" becomes essentially useless because in test_gs.pdf the
text contains additional spaces between every character, e.g. "h o m o t o p y"
instead of "homotopy" (as it was before).

I noticed the bug with v9.16 but also checked that it still occurs
with the current master branch (e32fe0b1).

Comment 1 Ken Sharp 2015-07-28 06:58:38 UTC

The output file produced by pdfwrite does not, in fact, contain additional spaces. The font being used is a CIDFont, which means that character codes are 2 bytes. The first byte of each of the codes is 0x00, which some editors may render as white space; however, there is no space there. Indeed, if you look at a (decompressed) version of the original file you can see that this too contains the same construction, e.g.:

BT
3 Tr 1 0 0 1 153.257 1635.286 Tm /f-0-0 2 Tf 95554.286 Tz [ <0020> ] TJ 
ET

The equivalent in the pdfwrite output is:

BT
/R10 2 Tf
955.543 0 0 1 153.257 1635.29 Tm
3 Tr
[(  )-500]TJ

Note that the first 'space' in the () pair is actually a 0x00.
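
To illustrate the point about 2-byte codes, here is a small Python sketch. It is not part of Ghostscript; the function name split_cid_codes is made up for illustration, and it assumes every code is exactly two bytes, big-endian, as described above. It shows that the hex string <0020> and the literal string containing 0x00 followed by 0x20 are the same single character code.

```python
def split_cid_codes(data: bytes) -> list:
    """Split a PDF string operand into 2-byte, big-endian character codes.

    Assumption for this sketch: the font uses fixed 2-byte codes.
    """
    if len(data) % 2:
        raise ValueError("expected an even number of bytes for 2-byte codes")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


# Hex string operand <0020> as it appears in the original file.
original = bytes.fromhex("0020")
# Literal string operand from the pdfwrite output: a 0x00 byte, then 0x20.
rewritten = b"\x00\x20"

print(split_cid_codes(original))    # one code, not two characters
print(split_cid_codes(rewritten))   # the same single code
# A byte-per-character view of the same data shows two "characters",
# which is why an editor may appear to display an extra blank.
print(repr(rewritten.decode("latin-1")))
```

Both operands decode to one character code; only a viewer that treats each byte as its own character sees anything that looks like an additional space.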

The original PDF file does not contain spaces (actually not quite true, as the first glyph is a space, but still...), so the resulting PDF file produced by pdfwrite *also* does not contain spaces. pdfwrite is unable to determine where the original file has a BT and ET, so it simply emits the whole text in one text group (BT/ET pair).

There is something distinctly odd about the way that Acrobat treats our output. If I try to copy just "The Homotopy Extension Property" then Acrobat puts it on the clipboard with spaces. If I copy the top two lines, then Acrobat copies the text without spaces. The different behaviour is confusing to say the least! Possibly this is due to the fact that there are no spaces, and that there is now only a single text group.

Comment 2 RegEchse 2015-07-31 16:05:21 UTC

First of all, i have to say that i (unfortunately) don't
understand most of the technical pieces you gave.
(I have virtually no knowledge of ps and/or pdf and none about pdf internals.)

> (...) so the resulting PDF file produced by pdfwrite *also* does not contain spaces.
As i said, i can't extract any information from your two source samples
but all i notice is that this "*also*" is somewhat dubious; because what i
observe is that the original (tesseract) pdf _is_ nicely searchable and the
gs processed one is simply not (which contradicts the whole point of ocr-ing).
So something has to be different between how these "no white spaces" are represented
before and after gs.
In addition i don't understand/see why this should change. What can gs possibly
change here so that it works before but not afterwards?

> (...) pdfwrite is unable to determine where the original file has a BT and ET (...)
That sounds very strange to me: how can a standardized file format be
ambiguous in such a sense that it can't be parsed in a unique way!?
Or are you saying that the tesseract pdf isn't valid to begin with?
I'm a bit confused about this.

I'd be thankful for any further explanation on this, because i'd
really like to understand better what the actual problem is here.
(In case my questions can't be answered in a reasonable way for a
"pdf noob" i also accept "read/learn the pdf spec" as an answer. ;))

Comment 3 Ken Sharp 2015-08-01 01:58:19 UTC

(In reply to RegEchse from comment #2)

> > (...) so the resulting PDF file produced by pdfwrite *also* does not contain spaces.
> As i said, i can't extract any information from your two source samples
> but all i notice is that this "*also*" is somewhat dubious; because what i
> observe is that the original (tesseract) pdf _is_ nicely searchable and the
> gs processed one is simply not (which contradicts the whole point of
> ocr-ing).
> So something has to be different between how these "no white spaces" are
> represented
> before and after gs.

I didn't say that there were no differences. In fact there is something decidedly weird about the way Acrobat (I'm assuming you are using Acrobat) is dealing with the output file.

Under some conditions copy and paste does result in Acrobat inserting spaces in the copied text (but note that this has nothing to do with the actual character codes in the PDF file). Under other conditions, this doesn't happen. However, in this case the initial character is dropped. I did mention this in the last paragraph of comment #1.

The presence of the spaces in the pasted text is an artefact of the process, caused by Acrobat; it doesn't indicate the presence of a space character code in the PDF file (though I can understand why you would think that it does).


> In addition i don't understand/see why this should change. What can gs
> possibly
> change here so that it works before but not afterwards?

Ghostscript is changing *everything* in your PDF file.

The way the pdfwrite device works is that the original input is broken down into marking operations (by the relevant Page Description Language interpreter), which are sent to the device. The device then processes these. A rendering device renders them to a bitmap; pdfwrite builds these graphics primitives into a brand new PDF file. Note that a graphics primitive is something like 'move here', 'draw a line to here', 'draw this bitmap', 'draw some text'. It is not the same as the input page description language.

Many people think that GS+pdfwrite is 'processing' or 'manipulating' or 'editing' their original PDF file. It isn't; it's creating a brand new one.

In the case of the input being a PDF file, the output PDF file should have the same visual appearance as the input, that's the design goal of pdfwrite, anything else is a bonus.


> > (...) pdfwrite is unable to determine where the original file has a BT and ET (...)
> That sounds very strange to me: how can a standardized file format be
> ambiguous in such a sense that it can't be parsed in a unique way!?

See above. The PDF interpreter can see the BT and ET tokens. However these do not correspond to graphics primitives and so are not passed to the graphics library. As a result the pdfwrite device never sees them. It simply sees a sequence of text operations.
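
As a rough illustration of that flow, here is a toy model in Python. It is only a sketch: the names interpret and toy_device, the tuple layout, and the simplified operator handling are all hypothetical and bear no relation to the real Ghostscript internals. It shows how text-object delimiters can be consumed by an interpreter so that the output device sees only a sequence of text operations.

```python
def interpret(stream):
    """Toy interpreter: turn (operator, operands) pairs into 'draw text' primitives."""
    state = {}
    for op, operands in stream:
        if op in ("BT", "ET"):
            continue  # text-object delimiters are not marking operations
        if op in ("Tj", "TJ"):
            # A marking operation: pass it on with a snapshot of the state.
            yield ("draw_text", operands, dict(state))
        else:
            state[op] = operands  # e.g. Tm, Tf, Tr only update the state


def toy_device(primitives):
    """Toy output device: it never sees BT/ET, so it emits one text group."""
    lines = ["BT"]
    for _kind, text, _state in primitives:
        lines.append(f"{text} Tj")
    lines.append("ET")
    return lines


# Two separate text objects in the (hypothetical) input stream.
stream = [
    ("BT", None), ("Tf", ("/f-0-0", 2)), ("Tj", "(The)"), ("ET", None),
    ("BT", None), ("Tf", ("/f-0-0", 2)), ("Tj", "(Homotopy)"), ("ET", None),
]
print(toy_device(interpret(stream)))
```

In this toy model the two input text objects come out as a single BT/ET group, because the grouping information was dropped before the device was involved.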


> Or are you saying that the tesseract pdf isn't valid to begin with?

No, I have not said this at any point.


> I'd be thankful for any further explanation on this, because i'd
> really like to understand better what the actual problem is here.

I don't know what the problem is yet, I haven't had time to have a look at the problem other than a quick peek to see if it was something obvious; it isn't.


> (In case my questions can't be answered in a reasonable way for a
> "pdf noob" i also accept "read/learn the pdf spec" as an answer. ;))

That would be an unreasonable answer, the specification is *huge* and very complex. I do try to explain my findings, but right now I don't have anything to pass on other than my quick explanation that there are no extra spaces in the output PDF file.

Comment 4 Bart Libert 2015-08-05 04:01:46 UTC

Also not a pdf expert here, but I see the same thing in evince on linux (debian).

The original file is generated by tesseract 3.04.00 and converted with ghostscript 3.04.00.

I also tried using "pdftotext" on the files and I see that on the tesseract file, the resulting text file is "normal", while on the ghostscript file, spaces are added there as well.

I did note that a small percentage of words were correctly rendered in both files.

The file I see this in contains some sensitive information, so I cannot share it, but if it would help you, I can try to make a "censored" version and try to reproduce it with that one.

Comment 5 Ken Sharp 2015-08-05 04:52:22 UTC

(In reply to Bart Libert from comment #4)

> The file I see this in contains some sensitive information, so I cannot
> share it, but if it would help you, I can try to make a "censored" version
> and try to reproduce it with that one.

Unless you are certain that this is a different problem (in which case, please open a separate bug) then more examples don't help.

As I said, I'll get to it when I have time.

Comment 6 Bart Libert 2015-08-05 07:20:45 UTC

I think it's the same issue, just wanted to mention it also occurs with other viewers.
No rush, take your time

Comment 7 Ray Johnston 2015-08-05 07:29:37 UTC

BTW, I noticed in:  Comment 4 Bart Libert 2015-08-05 04:01:46 PDT

> The original file is generated by tesseract 3.04.00 and converted with
> ghostscript 3.04.00.

There is no such thing as ghostscript 3.04.00 (our versions are x.xx and the
latest AGPL release was 9.16). It's probably just a typo, but I mention it
in case you weren't using a "real" version of gs, but instead were using
something that may have been modified.

Comment 8 Bart Libert 2015-08-05 07:33:19 UTC

It was 9.16 indeed, I made a mistake there, my apologies.

Comment 9 Ken Sharp 2015-08-07 07:20:14 UTC

There is, I'm sorry to say, nothing I can do about this.

The original PDF file uses a CIDFontType 2, drawn in rendering mode 3 and the font is defined in such a way that the only glyph is the /.notdef. The width of that glyph is 0, the /DW entry in the font is set to 500 and there is no /Widths array.

Now, there is no way currently for us to pass the /DW from the PDF font to the internal font structure. So in order to get the positioning of each glyph correct, we draw the glyph, then we override the current point by adding the Width (or DW) value to the value of the current point before the glyph was drawn, and then make that the new currentpoint.

What this means in practice is that we emit text which looks like this:

[(T)-500(h)-500(e)-500]TJ

The reason the 500 is negative is that the TJ operator subtracts the number from the current x position before drawing the next glyph. Obviously, if it's negative we are effectively adding it on.
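
The arithmetic can be sketched in Python using the horizontal displacement formula from the PDF specification, tx = ((w0 - Tj/1000) * Tfs + Tc + Tw) * Th. The function name tj_advance and its parameter names are made up for this illustration; the font size of 2 is taken from the "2 Tf" in the snippets in comment #1, and the remaining values are left at their defaults.

```python
def tj_advance(w0, adjustment, font_size,
               char_spacing=0.0, word_spacing=0.0, h_scale=1.0):
    """Horizontal displacement (text space) after one glyph in a TJ array.

    w0          glyph width in text space units (glyph units / 1000)
    adjustment  the number following the glyph in the TJ array,
                in thousandths of a text space unit
    font_size   Tfs, the size operand of Tf
    word_spacing applies only to the single-byte code 32 in real PDF;
                it is left at 0 here.
    """
    return ((w0 - adjustment / 1000.0) * font_size
            + char_spacing + word_spacing) * h_scale


# Zero-width glyph followed by -500, as in [(T)-500(h)-500(e)-500]TJ
via_adjustment = tj_advance(w0=0.0, adjustment=-500, font_size=2)
# Glyph with a real width of 500/1000 and no adjustment
via_width = tj_advance(w0=0.5, adjustment=0, font_size=2)

print(via_adjustment, via_width)
```

Both calls give the same advance, which is why the two ways of writing the text place the glyphs identically even though the operands differ.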

It seems that Acrobat cannot handle this kind of construction in its search.

If I alter the DW to 0, or modify the TrueType font hmtx table so that the glyph has a width of 500, then we write the text in a different way altogether:

1.17942 0 0 1 438.171 1647.63 Tm
(T)Tj
12.4999 0 Td
(h)Tj
12.4999 0 Td
(e)Tj
12.4999 0 Td

And in this case Acrobat *can* search for the text (with some caveats regarding highlighting).

So what you appear to be facing is a limitation in the Acrobat search facility, which is exposed by the way we emit the text.
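
One plausible mechanism, offered purely as an assumption and not as the actual algorithm used by Acrobat, evince or pdftotext, is that a text extractor infers word breaks from gaps between glyphs. The hypothetical Python sketch below (the function name, regular expression and threshold are all invented for illustration) inserts a space whenever the TJ adjustment moves the pen well beyond the glyph's own width.

```python
import re


def extract_with_gap_heuristic(tj_array, gap_threshold=200):
    """Hypothetical extractor: add a space after any glyph run whose TJ
    adjustment opens a gap larger than gap_threshold (thousandths of a unit).
    """
    out = []
    # Match "(text)" optionally followed by a numeric adjustment.
    for text, adj in re.findall(r"\(([^)]*)\)\s*(-?\d+(?:\.\d+)?)?", tj_array):
        out.append(text)
        # A negative adjustment moves the pen forward, i.e. opens a gap.
        gap = -float(adj) if adj else 0.0
        if gap > gap_threshold:
            out.append(" ")
    return "".join(out).strip()


print(extract_with_gap_heuristic("[(T)-500(h)-500(e)-500]TJ"))
print(extract_with_gap_heuristic("[(The)]TJ"))
```

Under this assumed heuristic the first form yields letters separated by spaces and the second yields the intact word, which would match the symptom reported here; whether any real viewer works this way has not been established in this bug.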

Now I don't know whether Tesseract always emits its text with the same default spacing. If it does, then it's possible that it could be modified to use a /DW of 0, or the GlyphLessFont could be modified as I've done here so that the one and only glyph used gets a width of 500.

Unfortunately I don't see any prospect for resolving this in Ghostscript.