Bug 703890 - NARROW NO-BREAK SPACE (U+202F) breaks pdfwrite output
Status: RESOLVED WONTFIX
Alias: None
Product: Ghostscript
Classification: Unclassified
Component: PDF Writer
Version: 9.53.3
Hardware: PC Linux
Importance: P4 normal
Assignee: Default assignee
 
Reported: 2021-05-27 18:21 UTC by 9l8jrjr3hqsl1lxeizhchk5k
Modified: 2021-07-06 16:20 UTC (History)


Attachments
source files, generated pdfs and previews (40.33 KB, application/zip)
2021-05-27 18:21 UTC, 9l8jrjr3hqsl1lxeizhchk5k

Description 9l8jrjr3hqsl1lxeizhchk5k 2021-05-27 18:21:17 UTC
Created attachment 21028
source files, generated pdfs and previews

When Ghostscript's pdfwrite device processes a LibreOffice-generated PDF that contains a NARROW NO-BREAK SPACE (U+202F), the resulting PDF is broken after the space.


Steps to Reproduce:
1. Create a file U+202F.odt containing U+202F and some text
2. Generate the PDF: libreoffice --convert-to pdf U+202F.odt
3. Run Ghostscript: gs -dBATCH -dNOPAUSE -q -sDEVICE=pdfwrite -sOutputFile=U+202F.gs.pdf U+202F.pdf


Actual Results:
- The PDF is broken after U+202F
- gs/pdfwrite reports errors:

   **** Error reading a content stream. The page may be incomplete.
               Output may be incorrect.
   **** Error: File did not complete the page properly and may be damaged.
               Output may be incorrect.


Expected Results: 
- no errors


Additional Information:
$ uname -r 
5.11.21-200.fc33.x86_64

$ gs --version
9.53.3

$ libreoffice --version
LibreOffice 7.0.5.2 00(Build:2)


The attachment zip contains:

- Test odt: U+202F.odt
- libreoffice generated pdf: U+202F.pdf
- output from gs/pdfwrite: U+202F.gs.pdf
- images of U+202F.pdf and U+202F.gs.pdf for reference


Please contact me if more information is needed.
Comment 1 Ray Johnston 2021-05-27 21:34:38 UTC
Running:
debugbin/gswin64c -sDEVICE=pdfwrite -o out.pdf -dPDFSTOPONERROR U+202F.pdf

shows: Error: /invalidfont in /--pdfshowpage_finish--

However, the font can be processed for rendering. The pdfwrite device uses
more font information than rendering does. Using -dPDFDEBUG indicates that
GlyphNames2Unicode is being used, but the exact source of the problem will
have to be determined by digging into pdfwrite and/or the font processing
code.
Comment 2 Ken Sharp 2021-05-28 06:52:34 UTC
Open the file with Adobe Acrobat and it gives a font error.

The font is invalid, not our fault.
Comment 3 Ken Sharp 2021-05-31 16:03:02 UTC
Some additional information:

The problem is precisely the definition of the 'non-breaking space' in the embedded TrueType font. Either the original font (LiberationSans) has a broken description of the glyph, or the process of subsetting the font for embedding in the PDF file has broken it. I suspect it's the latter but I can't be certain.

The font is set up so that character code 1 is defined as the non-breaking space, which maps to GID 1 in the TrueType font. The PDF file uses that character code as the first character in the text:

56.8 773.989 Td /F1 12 Tf[<0102> ]TJ

Character code 2 is the 'N'. It's common practice to number the glyphs in the order they are used when creating a subset font. So the non-breaking space gets character code 1, the 'N' gets character code 2, 'o' is 3 and so on.
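
The numbering scheme described above can be sketched in a few lines of Python. This is only an illustration, not LibreOffice's actual subsetting code: the helper name assign_subset_codes and the sample string are assumptions, chosen so that U+202F receives code 1, 'N' code 2 and 'o' code 3, as in the description. The hex string <0102> from the TJ operator is read as two single-byte character codes.

```python
def assign_subset_codes(text):
    """Give each distinct character a code in order of first use, starting at 1.

    Hypothetical helper for illustration only.
    """
    codes = {}
    for ch in text:
        if ch not in codes:
            codes[ch] = len(codes) + 1
    return codes


# Assumed sample text: a narrow no-break space followed by "No".
codes = assign_subset_codes("\u202fNo")

# The hex string <0102> from the content stream, as single-byte codes.
shown = list(bytes.fromhex("0102"))

# Map each code in the string back to the character it stands for.
by_code = {code: ch for ch, code in codes.items()}
for code in shown:
    print(code, repr(by_code[code]))
```

The first code in the string maps back to U+202F, which is why the failure occurs on the very first glyph of the text.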

Using ttfdump on the embedded subset font we can see that the GLYF table begins like this (glyph 0 is the /.notdef):

'glyf' Table - Glyph Data
-------------------------
Size = 6420 bytes, 29 entries
	Glyph   0: off = 0x00000000, len = 0

	Glyph   1: off = 0x00000000, len = 16
	  numberOfContours:	-1  (Composite)
	  xMin:			0
	  yMin:			0
	  xMax:			0
	  yMax:			0

	     0: Flags:		0x1006
		Glyf Index:	29
		X BOffset:	0
		Y BOffset:	0
		Other:		Round X,Y to Grid            


So Glyph 1 is defined as a composite character; that is, it is composed of one or more other glyph descriptions. This is commonly done to save space for accented characters: you can describe eacute as 'e' plus an acute, and aacute as 'a' plus the same acute description, so you only need to embed one description each of the acute, e and a, rather than a, e, aacute and eacute. The more you can reuse an accent the greater the savings.
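
As a rough illustration of that saving, the count of full outlines can be worked out as below. The function name outlines_needed is hypothetical, and the sketch ignores the few bytes each composite record itself occupies.

```python
def outlines_needed(bases, accents, use_composites):
    """Count full glyph outlines needed to cover every base+accent pair."""
    if use_composites:
        # Accented glyphs only reference the shared base and accent outlines.
        return len(bases) + len(accents)
    # Each accented glyph carries its own complete outline.
    return len(bases) + len(bases) * len(accents)


# The example from the text: a, e and a single acute accent.
small_with = outlines_needed(["a", "e"], ["acute"], use_composites=True)
small_without = outlines_needed(["a", "e"], ["acute"], use_composites=False)

# A larger, assumed set: five vowels and three accents.
vowels = ["a", "e", "i", "o", "u"]
marks = ["acute", "grave", "circumflex"]
large_with = outlines_needed(vowels, marks, use_composites=True)
large_without = outlines_needed(vowels, marks, use_composites=False)

print(small_with, small_without, large_with, large_without)
```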

In this case we see that glyph 1 has the component glyph Glyf index 29.

But from the maxp table:

'maxp' Table - Maximum Profile
------------------------------
Size = 32 bytes (expecting 32 bytes)
	'maxp' version:		  1.0
	numGlyphs:		29

Remember from the GLYF table above that we start numbering glyphs from 0, so 29 glyphs means glyphs numbered 0 to 28, and indeed the GLYF table has its last entry at gid=28.

Since we can't find glyph 29 to draw it, the font is genuinely invalid. This is also why Acrobat throws an error when it tries to draw character code 1: it has the same problem we do.
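
The check that fails here can be sketched in Python. This is a minimal illustration, not Ghostscript's code: the function names are hypothetical, and the 16-byte record is rebuilt from the ttfdump fields shown above (flags 0x1006, glyph index 29, zero offsets) rather than copied from the actual font file. The flag constants follow the TrueType 'glyf' table layout for composite glyphs.

```python
import struct

# Component flag bits from the TrueType 'glyf' table layout.
ARG_1_AND_2_ARE_WORDS = 0x0001
WE_HAVE_A_SCALE = 0x0008
MORE_COMPONENTS = 0x0020
WE_HAVE_AN_X_AND_Y_SCALE = 0x0040
WE_HAVE_A_TWO_BY_TWO = 0x0080


def component_glyph_ids(glyph):
    """Return the glyph indices referenced by a composite glyph record."""
    (number_of_contours,) = struct.unpack_from(">h", glyph, 0)
    if number_of_contours >= 0:
        return []  # simple glyph, no components
    ids = []
    pos = 10  # skip numberOfContours and the four bounding-box fields
    while True:
        flags, gid = struct.unpack_from(">HH", glyph, pos)
        ids.append(gid)
        pos += 4
        # Two offset arguments: 16-bit each if the flag is set, else 8-bit.
        pos += 4 if flags & ARG_1_AND_2_ARE_WORDS else 2
        # Optional transformation data.
        if flags & WE_HAVE_A_SCALE:
            pos += 2
        elif flags & WE_HAVE_AN_X_AND_Y_SCALE:
            pos += 4
        elif flags & WE_HAVE_A_TWO_BY_TWO:
            pos += 8
        if not flags & MORE_COMPONENTS:
            return ids


def out_of_range(glyph, num_glyphs):
    """Component indices that do not exist; valid glyphs are 0..num_glyphs-1."""
    return [gid for gid in component_glyph_ids(glyph) if gid >= num_glyphs]


# Glyph 1, reconstructed from the ttfdump fields (an assumption, see above).
glyph_1 = struct.pack(">hhhhh", -1, 0, 0, 0, 0) + struct.pack(">HHbb", 0x1006, 29, 0, 0)

print(len(glyph_1), out_of_range(glyph_1, 29))  # prints: 16 [29]
```

The reconstructed record is 16 bytes long, matching the len = 16 reported by ttfdump, and its only component lies one past the last valid glyph index.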

I'd suggest you report this as a bug to LibreOffice rather than to us, and I hope the information here might be of some value in such a report.
Comment 4 9l8jrjr3hqsl1lxeizhchk5k 2021-07-06 16:20:40 UTC
Posted it to an existing LibreOffice bug: "Narrow No-Break Space (U+202F) causes PDF Error by using bundled Liberation fonts"

https://bugs.documentfoundation.org/show_bug.cgi?id=112152