Summary: | Text selection in PDF mishandled for Arabic (and probably other RTL languages) | ||
---|---|---|---|
Product: | MuPDF | Reporter: | Fred Ross-Perry <fred.ross-perry> |
Component: | AppKit | Assignee: | Robin Watts <robin.watts> |
Status: | RESOLVED FIXED | ||
Severity: | normal | CC: | mark, robin.watts |
Priority: | P4 | ||
Version: | unspecified | ||
Hardware: | PC | ||
OS: | All | ||
Customer: | | Word Size: | --- |
Attachments: |
sample.pdf
sample video |
Created attachment 23805 [details]
sample video
Here is the related Zen ticket: https://artifex-smartoffice.zendesk.com/agent/tickets/5242

Sebastian reports that mupdf-gl behaves a bit weirdly too.

You can see the strange behavior in mupdf-gl by entering redaction mode and attempting to select a run of text.

So, the first thing to be said is that this is a very broken file.

The primary information in a PDF file for text is of the form:

    Select font XXX
    Move to X,Y
    Put glyph number N
    Move to X,Y
    Put glyph number M
    ...

i.e. PDF encodes the position of glyphs. It doesn't tell us anything about the logical ordering of those glyphs.

Note that it talks about 'glyph numbers', not Unicode values. So a file might say glyph number 56 and mean "x" or "?" or gamma or Aleph, or whatever particular glyph the author of the font has chosen to use that number for.

In order to know what Unicode value the glyph corresponds to (i.e. the character that the glyph represents), PDF files allow fonts to provide "ToUnicode" tables. These allow a PDF consumer (such as MuPDF) to map from the glyph to the Unicode value (or values!) that this glyph came from.

Frequently, PDF files are badly constructed in that these tables are omitted. When they aren't there, consumers are forced to make wilder and wilder guesses, frequently resulting in complete garbage being extracted.

In this case, the file does have a ToUnicode table. BUT... it's full of rubbish. This can be verified by going to Acrobat, selecting some of the text, copying, and trying to paste it elsewhere. Random gibberish is given.

This means that any attempt to search within or cut-and-paste from this file is doomed to failure. As such, one wonders what the possible point of selecting can be...

As it is, none of the characters that the PDF file claims are within it are identified as right-to-left characters, so any special right-to-left processing we could possibly do would be incorrect here.
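As a rough illustration of the glyph-to-Unicode lookup and the right-to-left check described above, here is a minimal Python sketch. It is not MuPDF code (MuPDF is written in C); the table contents and the names `TO_UNICODE`, `glyphs_to_text` and `has_rtl` are made up for the example. It only assumes that a ToUnicode table can be modelled as a mapping from glyph number to a Unicode string.

```python
import unicodedata

# Hypothetical ToUnicode table for one font: glyph number -> Unicode string.
# One glyph may map to several code points (e.g. a ligature).
TO_UNICODE = {
    56: "\u0627",        # ARABIC LETTER ALEF
    57: "\u0644\u0627",  # a single lam-alef ligature glyph -> two code points
    58: "x",
}

# Bidirectional classes that mark strongly right-to-left characters.
RTL_CLASSES = {"R", "AL"}


def glyphs_to_text(glyphs, to_unicode):
    """Map glyph numbers to text; U+FFFD stands in for glyphs with no entry."""
    return "".join(to_unicode.get(g, "\ufffd") for g in glyphs)


def has_rtl(text):
    """True if any character in text has a strong right-to-left bidi class."""
    return any(unicodedata.bidirectional(ch) in RTL_CLASSES for ch in text)


text = glyphs_to_text([56, 57, 58], TO_UNICODE)
print(has_rtl(text))
```

If the table maps glyphs to the wrong code points, `has_rtl` can return false for text that is drawn as Arabic, which is the situation suspected above.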
If you imagine that the output on the page is:

    ABCDEFG

that's actually achieved by the PDF file doing:

    Move to X,Y
    Display G.
    Move left a bit.
    Display F.
    Move left a bit.
    Display E.
    Move left a bit.
    Display D.
    ... etc

That's a reasonable thing to do for R2L chars, because they are coming out in the "logical" order, but it's a very strange thing to do for L2R chars.

I suspect that if the text was properly identified as R2L characters, the way we handled selection would actually be pretty good, but the file is so hopelessly broken we're really stymied here.

Actually, on subsequent review, this claim:
> As it is, none of the characters that the PDF file claims are within it, are
> identified as right-to-left characters
is false. Continuing to investigate.
And actually:
> In this case, the file does have a ToUnicode table. BUT... it's full of rubbish.
That's wrong too. On second glance the ToUnicode table looks reasonable. I am confused...
(In reply to Robin Watts from comment #7)
> And actually:
>
> > In this case, the file does have a ToUnicode table. BUT... it's full of rubbish.
>
> That's wrong too. On second glance the ToUnicode table looks reasonable. I
> am confused...

The first couple of fonts have a decent ToUnicode table. The latter ones don't.

At any rate, I think I have a solution to work around this a bit.

Fixed with:

commit 500c0299af8d086d0a20bb2c3a1e7d5c72872cc9
Author: Robin Watts <Robin.Watts@artifex.com>
Date: Fri Feb 24 17:31:48 2023 +0000

    Bug 706426: Workaround strange text order in extraction.

    The example file has a load of text that purports to be Arabic.
    Unfortunately, while it appears as arabic, the ToUnicode tables for many (but not all) of the fonts in the document are horribly broken, and so when cut and pasted, you get garbage characters.

    Stranger still, in the lines where such text is used, the characters within words are sent left to right, but the words (and the spaces between words) within a line themselves are sent right to left.

    So, for the characters A, B, C, SPACE, D, E, F, SPACE, G, H, I we might actually get the following appearance:

        GHI DEF ABC

    The stext extraction for this is then:

        <line>ABC</line>
        <line> </line>
        <line>DEF</line>
        <line> </line>
        <line>GHI</line>

    which gives unexpected results when trying to select it.

    As a workaround against this, we tweak the behaviour of the stext device. Typically as each character arrives, we consider adding it onto the end of the "current line". If it doesn't fit, then we abandon that line, and move to a new one.

    The solution implemented here is to perform some extra processing. Whenever we are finished with the current line, before we start a new one, we check to see if the entirety of the current line fits before one of the previous lines. If so, we join the two.
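The join step described in the commit message can be sketched roughly as follows. This is a simplified Python model, not the actual stext device code (which is C inside MuPDF): the `Line` class, the fixed glyph `width`, the tolerance `tol`, and the function names are all assumptions made for the illustration, and only horizontal text on a shared baseline is considered.

```python
from dataclasses import dataclass


@dataclass
class Line:
    x0: float  # left edge of the line
    x1: float  # right edge of the line
    y: float   # baseline
    text: str


def fits_before(cur, prev, tol=0.5):
    """True if cur shares prev's baseline and ends where prev starts."""
    return abs(cur.y - prev.y) <= tol and abs(cur.x1 - prev.x0) <= tol


def finish_line(lines, cur):
    """On completing cur, join it in front of an earlier line if it fits."""
    for prev in reversed(lines):
        if fits_before(cur, prev):
            prev.text = cur.text + prev.text
            prev.x0 = cur.x0
            return
    lines.append(cur)


def extract(chars, width=1.0, tol=0.5):
    """chars: (x, y, ch) tuples in content-stream order; x is the left edge."""
    lines, cur = [], None
    for x, y, ch in chars:
        if cur is not None and abs(y - cur.y) <= tol and abs(x - cur.x1) <= tol:
            # The glyph continues the current line on its right-hand end.
            cur.text += ch
            cur.x1 = x + width
        else:
            # It does not fit: finish the current line and start a new one.
            if cur is not None:
                finish_line(lines, cur)
            cur = Line(x, x + width, y, ch)
    if cur is not None:
        finish_line(lines, cur)
    return lines


# A, B, C, SPACE, D, E, F, SPACE, G, H, I sent so the page shows "GHI DEF ABC".
chars = [(8, 0, "A"), (9, 0, "B"), (10, 0, "C"), (7, 0, " "),
         (4, 0, "D"), (5, 0, "E"), (6, 0, "F"), (3, 0, " "),
         (0, 0, "G"), (1, 0, "H"), (2, 0, "I")]
lines = extract(chars)
print([line.text for line in lines])  # ['GHI DEF ABC']
```

Without the `fits_before` check in `finish_line`, the same input would produce the five separate lines listed in the commit message.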
    Clearly, we could go further with this processing and consider whether the newly extended line then fits before/after other lines, but this seems enough for now.

This gives sensible behaviour for text selection for this file, and shouldn't hurt anything else. That's as much as we can hope for, because cut-and-paste is clearly a lost cause thanks to the broken ToUnicodes.

Improvements in similar (less broken) files additionally given by:

commit 48bf9c0cb2bd8e3008abb82f4cf69cbeaaf56a36
Author: Robin Watts <Robin.Watts@artifex.com>
Date: Mon Feb 27 15:17:16 2023 +0000

    Fix problems when selecting/highlighting right to left text.

    When selecting right to left text (such as trying to apply a highlight to arabic text in mupdf-gl), the selection box would behave very oddly.

    If the selection point was in the right hand half of a glyph we'd assume that it should be included, and in the left hand half, that it should be excluded from the selection. This is correct for l2r, but wrong for r2l text.

    Fixed by considering the directionality of text. The l2r/r2l direction detection could possibly be better as it may trip over neutral stuff at the start of lines, but this is MUCH better than before.

    Also, when returning the list of highlight quads for a selection we were assuming that successive chars would always 'merge' new quads on the right. Again, this is wrong for r2l text.

The fix for this bug caused bug 706642.
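To make the half-glyph rule from the second commit concrete, here is a small Python sketch. It is an assumption about the general idea only, not a transcription of the MuPDF change: the function names, the use of plain horizontal extents instead of quads, and the `rtl` flag are all hypothetical.

```python
def end_includes_char(px, x0, x1, rtl):
    """Does a selection ending at horizontal position px include the glyph
    spanning [x0, x1]?

    For l2r text the glyph is included once the point passes its midpoint
    going right; for r2l text the test is mirrored.
    """
    mid = (x0 + x1) / 2.0
    return px < mid if rtl else px > mid


def merge_extent(run, glyph):
    """Grow a highlight extent (x0, x1) to cover the next glyph, whether it
    lies to the right (l2r) or to the left (r2l) of the run so far."""
    return (min(run[0], glyph[0]), max(run[1], glyph[1]))


# A point in the right-hand half of a glyph spanning 0..10:
print(end_includes_char(7.5, 0, 10, rtl=False))  # True
print(end_includes_char(7.5, 0, 10, rtl=True))   # False
```

Taking the minimum and maximum in `merge_extent` avoids baking in the assumption that each new glyph extends the highlight on the right.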
Created attachment 23804 [details]
sample.pdf

Open the attached sample PDF and attempt to select text of varying lengths. The selection handles are misplaced, and the selected text is often wrong while you are dragging one of the handles.

We see something similar in iOS as well.

Also attached is a video from a customer.

Text selection in PDF relies on mupdf's fz_highlight_selection(), which may be causing the problem.