Bug 705288 - Support direct use of CID-Keyed fonts from PostScript with txtwrite
Status: UNCONFIRMED
Alias: None
Product: Ghostscript
Classification: Unclassified
Component: Other Driver
Version: 9.56.1
Hardware: PC Windows 10
Importance: P4 enhancement
Assignee: Ken Sharp
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2022-05-09 08:42 UTC by Holger
Modified: 2022-05-18 09:17 UTC

See Also:
Customer:
Word Size: ---


Attachments
A simple file (containing only "TEST") created with LibreOffice Writer and printed to file via the HP LaserJet 4350 driver (18.92 KB, application/postscript)
2022-05-09 08:42 UTC, Holger

Description Holger 2022-05-09 08:42:03 UTC
Created attachment 22505

I create some PS files on Windows Server 2019 with the HP LaserJet 4350 printer driver (or some other PS printer driver) and try to extract the text from these files with the following command:

gswin64c.exe  -dNOPAUSE -dBATCH -sDEVICE=txtwrite -sOutputFile="c:\temp\embtxt1.txt.%d" "C:\temp\test.ps"

but it reports:
*** C stack overflow. Quiting


I tried to debug Ghostscript, and it loops between these two functions:

>	gsdll64.dll!textw_text_resync(gs_text_enum_s * pte, const gs_text_enum_s * pfrom) line 1957	C
 	gsdll64.dll!gs_text_resync(gs_text_enum_s * pte, const gs_text_enum_s * pfrom) line 690	C
 	gsdll64.dll!textw_text_resync(gs_text_enum_s * pte, const gs_text_enum_s * pfrom) line 1958	C
 	gsdll64.dll!gs_text_resync(gs_text_enum_s * pte, const gs_text_enum_s * pfrom) line 690	C
 	gsdll64.dll!textw_text_resync(gs_text_enum_s * pte, const gs_text_enum_s * pfrom) line 1958	C
 	gsdll64.dll!gs_text_resync(gs_text_enum_s * pte, const gs_text_enum_s * pfrom) line 690
Comment 1 Ken Sharp 2022-05-18 09:17:18 UTC
The PostScript program uses a CID-Keyed font directly, which the txtwrite device does not support; it only supports Type 0 fonts with CID-Keyed descendants.

I've made a commit which resolves the recursion and emits a warning that the font type is not supported before exiting: 5527bce8f1c0c6cd62c4a0a19fc511507ae53da9

I'm altering this to an enhancement to support CID-Keyed fonts directly (note to self: steal process_cid_text from gdevpdtc.c).

However, I should probably mention that even with support for the font type, the text extracted from this document will never be 'Text'. PostScript does not support ToUnicode CMaps, so there is no way to add Unicode information to the font. The CMap which is used has a custom Ordering and Registry, which means we cannot extract any meaning from it. The CIDs do not correspond to ASCII character codes (it's a subset font) and are 2-byte codes anyway.

The final result of all that is that nothing in the PostScript program allows us to determine a Unicode code point for the text, so we must fall back on using the character codes, which are not ASCII. I believe the output from this example would be:

0x00 0x37 0x00 0x28 0x00 0x36 0x00 0x37

Treated as UTF-16, that would be "7(67".
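
As a quick check of the byte sequence quoted above, here is a minimal Python sketch that decodes those eight bytes as UTF-16. It assumes big-endian byte order (high byte first, as the listing is written) and is only an illustration; it is not part of Ghostscript or the txtwrite device. The variable names are made up for this example.

```python
# The eight bytes quoted in the comment: four 2-byte character codes.
raw = bytes([0x00, 0x37, 0x00, 0x28, 0x00, 0x36, 0x00, 0x37])

# Assumption: big-endian order, i.e. each code is written high byte first.
decoded = raw.decode("utf-16-be")
print(decoded)

# The document visibly contains "TEST", but the character codes of the
# subset font do not map to that text.
intended = "TEST"
print(decoded == intended)
```

The decoded string has the same length as the visible text, but none of the codes match it, which is why the character codes alone are not enough to recover the original text.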