706426 – Text selection in PDF mishandled for Arabic (and probably other RTL languages)

Status:	RESOLVED FIXED

Alias:	None

Product:	MuPDF
Classification:	Unclassified
Component:	AppKit
Version:	unspecified
Hardware:	PC All

Importance:	P4 normal
Assignee:	Robin Watts

Reported:	2023-02-22 17:36 UTC by Fred Ross-Perry
Modified:	2023-04-21 19:04 UTC
CC List:	2 users

Attachments
sample.pdf (4.08 MB, application/pdf) 2023-02-22 17:36 UTC, Fred Ross-Perry
sample video (506.61 KB, video/mp4) 2023-02-22 17:36 UTC, Fred Ross-Perry

Description Fred Ross-Perry 2023-02-22 17:36:03 UTC

Created attachment 23804
sample.pdf

Open the attached sample PDF and attempt to select text of varying lengths.
The selection handles are misplaced, and the selected text is often wrong while you are dragging one of the handles.

We see something similar in iOS as well.

Also attached is a video from a customer.

Text selection in PDF relies on mupdf's fz_highlight_selection(), which may be causing the problem.

Comment 1 Fred Ross-Perry 2023-02-22 17:36:50 UTC

Created attachment 23805
sample video

Comment 2 Fred Ross-Perry 2023-02-22 17:37:15 UTC

Here is the related Zendesk ticket:

https://artifex-smartoffice.zendesk.com/agent/tickets/5242

Comment 3 Fred Ross-Perry 2023-02-22 17:39:11 UTC

Sebastian reports that mupdf-gl behaves a bit weirdly too.

Comment 4 Fred Ross-Perry 2023-02-23 19:30:37 UTC

You can see the strange behavior in mupdf-gl by entering redaction mode and attempting to select a run of text.

Comment 5 Robin Watts 2023-02-23 19:51:46 UTC

So, the first thing to be said is that this is a very broken file.

The primary information in a PDF file for text is of the form:

Select font XXX
Move to X,Y
Put glyph number N
Move to X,Y
Put glyph number M
...

i.e. PDF encodes the position of glyphs. It doesn't tell us anything about the logical ordering of those glyphs.

Note that it talks about 'glyph numbers', not Unicode values. So a file might say glyph number 56 and mean "x" or "?" or gamma or Aleph, or whatever particular glyph the author of the font has chosen to use that number for.

In order to know what Unicode value the glyph corresponds to (i.e. the character that the glyph represents), PDF files allow fonts to provide "ToUnicode" tables. These allow a PDF consumer (such as MuPDF) to map from the glyph to the Unicode value (or values!) that this glyph came from.

Frequently, PDF files are badly constructed in that these tables are omitted. When they aren't there, consumers are forced to make wilder and wilder guesses, frequently resulting in complete garbage being extracted.
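
As a rough illustration of the mapping described above (this is not MuPDF code; the table contents and the `glyphs_to_text` helper are invented for the sketch), a ToUnicode lookup amounts to something like:

```python
# Hypothetical ToUnicode table for one font: glyph number -> Unicode string.
# The glyph numbers and their meanings are made up for illustration.
TO_UNICODE = {
    56: "\u0627",  # in this made-up font, glyph 56 is Arabic Alef
    57: "fi",      # a single glyph (a ligature) can map to several values
}


def glyphs_to_text(glyphs, to_unicode):
    """Map glyph numbers to text, using U+FFFD where the table has no entry."""
    return "".join(to_unicode.get(g, "\ufffd") for g in glyphs)
```

With a missing or wrong table, the glyphs still draw correctly on the page; only the extracted text suffers, which is why a file can look fine and still paste as gibberish.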

In this case, the file does have a ToUnicode table. BUT... it's full of rubbish.

This can be verified by going to Acrobat, selecting some of the text, copying, and trying to paste it elsewhere. The result is random gibberish.

This means that any attempt to search within or cut-and-paste from this file is doomed to failure. As such, one wonders what the possible point of selecting can be...

As it is, none of the characters that the PDF file claims are within it, are identified as right-to-left characters, so any special right-to-left processing we could possibly do would be incorrect here.

If you imagine that the output on the page is:

ABCDEFG

that's actually achieved by the PDF file doing:

Move to X,Y
Display G.
Move left a bit.
Display F.
Move left a bit.
Display E.
Move left a bit.
Display D.
... etc

That's a reasonable thing to do for R2L chars, because they are coming out in the "logical" order, but it's a very strange thing to do for L2R chars.

I suspect that if the text were properly identified as R2L characters, the way we handle selection would actually be pretty good, but the file is so hopelessly broken that we're really stymied here.
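
To make the ABCDEFG example concrete, here is a minimal sketch of the difference between emission order and on-page order. The `(x, y, char)` tuples and both helper names are assumptions for illustration, not anything taken from the PDF format or from MuPDF:

```python
def emitted_text(ops):
    """Text in the order the content stream places the glyphs."""
    return "".join(ch for _x, _y, ch in ops)


def visual_text(ops):
    """Text read left to right along the line, ignoring emission order."""
    return "".join(ch for _x, _y, ch in sorted(ops, key=lambda op: op[0]))


# "Display G, move left a bit, display F, ..." on one baseline (y = 100).
ops = [(60 - 10 * i, 100, ch) for i, ch in enumerate("GFEDCBA")]
```

Here `emitted_text(ops)` gives "GFEDCBA" while `visual_text(ops)` gives "ABCDEFG". A consumer only sees the first ordering plus the positions, so it has to infer the reading order itself.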

Comment 6 Robin Watts 2023-02-23 20:09:21 UTC

Actually, on subsequent review, this claim:

> As it is, none of the characters that the PDF file claims are within it, are
> identified as right-to-left characters

is false. Continuing to investigate.

Comment 7 Robin Watts 2023-02-23 20:21:11 UTC

And actually:

> In this case, the file does have a ToUnicode table. BUT... it's full of rubbish.

That's wrong too. On second glance the ToUnicode table looks reasonable. I am confused...

Comment 8 Robin Watts 2023-02-24 17:32:56 UTC

(In reply to Robin Watts from comment #7)
> And actually:
> 
> > In this case, the file does have a ToUnicode table. BUT... it's full of rubbish.
> 
> That's wrong too. On second glance the ToUnicode table looks reasonable. I
> am confused...

The first couple of fonts have a decent ToUnicode table. The latter ones don't.

At any rate, I think I have a solution to work around this a bit.

Comment 9 Robin Watts 2023-03-01 10:50:42 UTC

Fixed with:

commit 500c0299af8d086d0a20bb2c3a1e7d5c72872cc9
Author: Robin Watts <Robin.Watts@artifex.com>
Date:   Fri Feb 24 17:31:48 2023 +0000

    Bug 706426: Workaround strange text order in extraction.

    The example file has a load of text that purports to be Arabic.
    Unfortunately, while it appears as arabic, the ToUnicode tables
    for many (but not all) of the fonts in the document are horribly
    broken, and so when cut and pasted, you get garbage characters.

    Stranger still, in the lines where such text is used, the
    characters within words are sent left to right, but the words
    (and the spaces between words) within a line themselves are
    sent right to left.

    So, for the characters A, B, C, SPACE, D, E, F, SPACE, G, H, I
    we might actually get the following appearance:  GHI DEF ABC

    The stext extraction for this is then:

      <line>ABC</line>
      <line> </line>
      <line>DEF</line>
      <line> </line>
      <line>GHI</line>

    which gives unexpected results when trying to select it.

    As a workaround against this, we tweak the behaviour of the
    stext device.

    Typically as each character arrives, we consider adding it onto
    the end of the "current line". If it doesn't fit, then we abandon
    that line, and move to a new one.

    The solution implemented here is to perform some extra processing.
    Whenever we are finished with the current line, before we start a
    new one, we check to see if the entirety of the current line fits
    before one of the previous lines. If so, we join the two.

    Clearly, we could go further with this processing and consider
    whether the newly extended line then fits before/after other lines,
    but this seems enough for now.

    This gives sensible behaviour for text selection for this file,
    and shouldn't hurt anything else. That's as much as we can hope
    for, because cut-and-paste is clearly a lost cause thanks to
    the broken ToUnicodes.
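
The line-joining workaround described in the commit message above can be sketched roughly as follows. This is a simplified model, not the actual stext device code: the `Line` class, the `TOLERANCE` value, and the `extract` and `finish_line` helpers are all assumptions made for the example, and real extraction deals with far more than one baseline and exact glyph widths.

```python
from dataclasses import dataclass

TOLERANCE = 0.5  # hypothetical slack, in page units


@dataclass
class Line:
    x0: float  # left edge
    x1: float  # right edge
    y: float   # baseline
    text: str


def fits_after(line, x, y):
    """Does a char starting at (x, y) continue on the end of this line?"""
    return abs(line.y - y) < TOLERANCE and abs(line.x1 - x) < TOLERANCE


def finish_line(lines, cur):
    """Before starting a new line, try to join cur in front of an earlier one."""
    for prev in lines:
        same_baseline = abs(prev.y - cur.y) < TOLERANCE
        if same_baseline and abs(cur.x1 - prev.x0) < TOLERANCE:
            prev.x0 = cur.x0
            prev.text = cur.text + prev.text
            return
    lines.append(cur)


def extract(chars):
    """chars: iterable of (x, y, width, char) in emission order."""
    lines, cur = [], None
    for x, y, w, ch in chars:
        if cur is not None and fits_after(cur, x, y):
            cur.x1 = x + w
            cur.text += ch
            continue
        if cur is not None:
            finish_line(lines, cur)
        cur = Line(x, x + w, y, ch)
    if cur is not None:
        finish_line(lines, cur)
    return lines


# Appearance "GHI DEF ABC", emitted as A, B, C, SPACE, D, E, F, SPACE, G, H, I.
xs = {ch: 10 * i for i, ch in enumerate("GHI.DEF:ABC")}  # '.' and ':' mark the two spaces
order = "ABC:DEF.GHI"
chars = [(xs[c], 100, 10, " " if c in ".:" else c) for c in order]
```

Without the join step in `finish_line`, this input would produce the five separate lines shown in the commit message. With it, each finished fragment is attached in front of the fragment to its right, leaving a single line in on-page order.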

Further improvements for similar (less broken) files come from:

commit 48bf9c0cb2bd8e3008abb82f4cf69cbeaaf56a36
Author: Robin Watts <Robin.Watts@artifex.com>
Date:   Mon Feb 27 15:17:16 2023 +0000

    Fix problems when selecting/highlighting right to left text.

    When selecting right to left text (such as trying to apply
    a highlight to arabic text in mupdf-gl), the selection box
    would behave very oddly.

    If the selection point was in the right hand half of a glyph
    we'd assume that it should be included, and in the left hand
    half, that it should be excluded from the selection. This is
    correct for l2r, but wrong for r2l text.

    Fixed by considering the directionality of text.

    The l2r/r2l direction detection could possibly be better as
    it may trip over neutral stuff at the start of lines, but this
    is MUCH better than before.

    Also, when returning the list of highlight quads for a selection
    we were assuming that successive chars would always 'merge'
    new quads on the right. Again, this is wrong for r2l text.
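
The two direction-dependent fixes in the commit message above can be illustrated with a small sketch. It is not the MuPDF implementation: glyphs are reduced to horizontal extents `(x0, x1)`, and `end_includes_glyph`, `merge_quads` and `EPS` are names invented for this example.

```python
EPS = 0.5  # hypothetical tolerance for "these two boxes touch"


def end_includes_glyph(px, x0, x1, rtl=False):
    """Is the glyph spanning [x0, x1] included when the selection ends at px?

    For l2r text the glyph is included once px passes its midpoint going
    right; for r2l text the test is mirrored.
    """
    mid = (x0 + x1) / 2
    return px < mid if rtl else px > mid


def merge_quads(boxes):
    """Merge the boxes of successive chars into runs, on either side.

    boxes: list of (x0, x1) in logical (reading) order on one line.
    """
    merged = []
    for x0, x1 in boxes:
        if merged and abs(merged[-1][1] - x0) < EPS:
            # New box joins on the right of the run (l2r text).
            merged[-1] = (merged[-1][0], x1)
        elif merged and abs(merged[-1][0] - x1) < EPS:
            # New box joins on the left of the run (r2l text).
            merged[-1] = (x0, merged[-1][1])
        else:
            merged.append((x0, x1))
    return merged
```

For example, `merge_quads([(20, 30), (10, 20), (0, 10)])` collapses three r2l glyph boxes into the single run `(0, 30)`, which a merge-on-the-right-only rule would leave as three separate quads.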

Comment 10 Mark Mentovai 2023-04-21 19:04:00 UTC

The fix for this bug caused bug 706642.