I've been attempting to use pstotext or ps2ascii to extract text from some PDFs, but whenever I run either on a PDF generated by Adobe InDesign, it gives me a fatal error:

    $ ps2ascii fails.pdf
    \Gamma Error: /rangecheck in --get--
    Operand stack:
       --nostringval-- --dict:10/10(L)-- 600 2007 5307 68 --nostringval-- 68
    Execution stack:
       %interp_exit .runexec2 --nostringval-- --nostringval-- --nostringval-- 2 %stopped_push --nostringval-- --nostringval-- --nostringval-- false 1 %stopped_push 2 3 %oparray_pop 2 3 %oparray_pop 2 3 %oparray_pop --nostringval-- 2 1 1 --nostringval-- %for_pos_int_continue --nostringval-- --nostringval-- --nostringval-- --nostringval-- %array_continue --nostringval-- false 1 %stopped_push --nostringval-- %loop_continue --nostringval-- 3 10 %oparray_pop --nostringval-- 6 10 %oparray_pop (\000V\000G\000I\000D\000V\000G\000I\000G\000V\000D\000I) --nostringval-- %string_continue --nostringval--
    Dictionary stack:
       --dict:1166/1686(ro)(G)-- --dict:0/20(G)-- --dict:78/200(L)-- --dict:78/200(L)-- --dict:104/127(ro)(G)-- --dict:238/347(ro)(G)-- --dict:20/24(L)-- --dict:4/6(L)-- --dict:21/32(L)-- --dict:20/31(L)--
    Current allocation mode is local
    AFPL Ghostscript 8.14: Unrecoverable error, exit code 1

I'm not sure if the error is caused by something InDesign is doing (perhaps InDesign's forced use of CID fonts has something to do with it?). I'll attach the fails.pdf file as well. Any help would be appreciated.
Created attachment 666: Simple PDF causing problem
Created attachment 667: patch

There's no way to recover ASCII from the strings encoded for a CID font. The attached patch fixes the PostScript error but generates wrong results: it just dumps the strings in their unmodified encoding. Text should be extracted from the PDF before conversion to PostScript, using the /ToUnicode CMap. The latter is an enhancement request, not a bug.
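To illustrate why the raw strings are not ASCII, here is a minimal Python sketch (not part of the patch). It assumes the font uses fixed 2-byte codes, as with an Identity-H encoding; the report does not confirm which CMap the file uses, and `split_codes` is a hypothetical helper name. It splits the first four codes of the string shown in the execution stack of the original report.

```python
def split_codes(raw: bytes, width: int = 2) -> list[int]:
    """Split a CID-font string into fixed-width big-endian character codes.

    Assumption: every code is `width` bytes long (true for Identity-H,
    not necessarily for other CMaps).
    """
    if len(raw) % width:
        raise ValueError("string length is not a multiple of the code width")
    return [int.from_bytes(raw[i:i + width], "big")
            for i in range(0, len(raw), width)]


# First four codes of (\000V\000G\000I\000D...) from the error report.
raw = b"\x00V\x00G\x00I\x00D"
print(split_codes(raw))  # [86, 71, 73, 68]
```

The resulting numbers select glyphs in the font; they are not character values, so the printable bytes "V", "G", "I", "D" in the dump say nothing about the text on the page.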
We should apply the patch and close the bug, but open a new enhancement request for the Unicode mode.
The patch has been committed to the head branch. An enhancement request (bug 687492) was created to track the development of the ps2ascii utility. There are 2 issues here: (1) Decode source strings with well-known CMap files into Unicode or ASCII when possible. (2) Use the ToUnicode CMap if possible, but first we need to pass it from the PDF to the PostScript level (bug 685335).
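As a rough illustration of issue (2), the Python sketch below reads the bfchar and bfrange sections of a ToUnicode CMap and uses them to decode a string of 2-byte codes. It is not Ghostscript code. The names `parse_tounicode`, `decode` and `SAMPLE_CMAP` are hypothetical, the sample mapping is invented and is not taken from fails.pdf, and the fixed 2-byte code width is an assumption.

```python
import re

HEX = r"<([0-9A-Fa-f]+)>"
BFCHAR = re.compile(HEX + r"\s*" + HEX)
# Third operand is either one hex string or an array of hex strings.
BFRANGE = re.compile(HEX + r"\s*" + HEX + r"\s*(?:" + HEX + r"|\[(.*?)\])", re.S)


def to_text(hex_digits: str) -> str:
    # Destination values are UTF-16BE code units written in hex.
    return bytes.fromhex(hex_digits).decode("utf-16-be")


def parse_tounicode(cmap_text: str) -> dict[int, str]:
    """Collect code -> Unicode string mappings from a ToUnicode CMap."""
    mapping: dict[int, str] = {}
    for section in re.findall(r"beginbfchar(.*?)endbfchar", cmap_text, re.S):
        for src, dst in BFCHAR.findall(section):
            mapping[int(src, 16)] = to_text(dst)
    for section in re.findall(r"beginbfrange(.*?)endbfrange", cmap_text, re.S):
        for lo, hi, dst, array in BFRANGE.findall(section):
            codes = range(int(lo, 16), int(hi, 16) + 1)
            if array:
                # Array form: one destination per code in the range.
                targets = [to_text(h) for h in re.findall(HEX, array)]
            else:
                # Simple form: destination increments along with the code.
                start = int(dst, 16)
                targets = [to_text(f"{start + i:0{len(dst)}x}")
                           for i in range(len(codes))]
            mapping.update(zip(codes, targets))
    return mapping


def decode(raw: bytes, mapping: dict[int, str], width: int = 2) -> str:
    """Decode fixed-width codes; unmapped codes become U+FFFD."""
    codes = (int.from_bytes(raw[i:i + width], "big")
             for i in range(0, len(raw), width))
    return "".join(mapping.get(code, "\ufffd") for code in codes)


# Invented mapping, for illustration only.
SAMPLE_CMAP = """
2 beginbfchar
<0056> <0054>
<0047> <0065>
endbfchar
1 beginbfrange
<0049> <004A> <0078>
endbfrange
"""

mapping = parse_tounicode(SAMPLE_CMAP)
print(decode(b"\x00V\x00G\x00I", mapping))  # Tex
```

Codes with no entry decode to U+FFFD rather than raising, so a partial CMap still yields readable output for the codes it does cover.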