Bug 687466

Summary: ps2ascii fails on PDF generated by Adobe InDesign
Product: Ghostscript
Reporter: Jason Rhinelander <ghostscript>
Component: PDF Interpreter
Assignee: Alex Cherepanov <alex>
Status: NOTIFIED FIXED    
Severity: normal    
Priority: P2    
Version: 8.14   
Hardware: PC   
OS: Linux   
Customer:
Word Size: ---
Attachments: Simple PDF causing problem; patch

Description Jason Rhinelander 2004-05-14 15:47:10 UTC
I've been attempting to use pstotext or ps2ascii to extract text from some
PDFs, but whenever I run either on a PDF generated by Adobe InDesign, it gives
me a fatal error:

$ ps2ascii fails.pdf
 
 
\Gamma Error: /rangecheck in --get--
Operand stack:
   --nostringval--   --dict:10/10(L)--   600   2007   5307   68  
--nostringval--   68
Execution stack:
   %interp_exit   .runexec2   --nostringval--   --nostringval--  
--nostringval--   2   %stopped_push   --nostringval--   --nostringval--  
--nostringval--   false   1   %stopped_push   2 3   %oparray_pop   2   3  
%oparray_pop   2   3   %oparray_pop   --nostringval--   2   1   1
--nostringval--   %for_pos_int_continue   --nostringval--   --nostringval--  
--nostringval--  --nostringval--   %array_continue   --nostringval--   false   1
  %stopped_push   --nostringval--   %loop_continue   --nostringval--   3   10  
%oparray_pop   --nostringval--   6   10   %oparray_pop  
(\000V\000G\000I\000D\000V\000G\000I\000G\000V\000D\000I)   --nostringval--  
%string_continue   --nostringval--
Dictionary stack:
   --dict:1166/1686(ro)(G)--   --dict:0/20(G)--   --dict:78/200(L)--  
--dict:78/200(L)--   --dict:104/127(ro)(G)--   --dict:238/347(ro)(G)--  
--dict:20/24(L)--   --dict:4/6(L)--   --dict:21/32(L)--   --dict:20/31(L)--
Current allocation mode is local
AFPL Ghostscript 8.14: Unrecoverable error, exit code 1


I'm not sure if the error is caused by something InDesign is doing (perhaps
InDesign's forced use of CID fonts has something to do with it?).  I'll attach
the fails.pdf file as well - any help would be appreciated.
Comment 1 Jason Rhinelander 2004-05-14 15:47:55 UTC
Created attachment 666
Simple PDF causing problem
Comment 2 Alex Cherepanov 2004-05-15 14:53:53 UTC
Created attachment 667
patch

There's no way to recover ASCII from strings encoded for a
CID font. The attached patch fixes the PostScript error but generates
wrong results: it just dumps the strings in their unmodified encoding.

Text should be extracted from the PDF before conversion to PostScript,
using the /ToUnicode CMap. The latter is an enhancement request, not a bug.
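
To illustrate the point, here is a minimal Python sketch (not the Ghostscript
code and not the attached patch). It assumes the string on the operand stack
above, (\000V\000G\000I\000D...), holds fixed 2-byte character codes with the
high byte first. The to_unicode table is invented for the example; the real
mapping would have to come from the font's /ToUnicode CMap in the PDF.

```python
# First four 2-byte codes of the string shown in the execution stack.
raw = b"\x00V\x00G\x00I\x00D"


def split_codes(data, width=2):
    """Split a CID-font string into fixed-width big-endian character codes."""
    if len(data) % width:
        raise ValueError("string length is not a multiple of the code width")
    return [int.from_bytes(data[i:i + width], "big")
            for i in range(0, len(data), width)]


def decode(codes, table):
    """Map codes to text; codes missing from the table become U+FFFD."""
    return "".join(table.get(code, "\ufffd") for code in codes)


codes = split_codes(raw)

# HYPOTHETICAL mapping, made up for illustration only. The codes themselves
# carry no information about which characters they stand for.
to_unicode = {0x0056: "T", 0x0047: "e", 0x0049: "x", 0x0044: "t"}

text = decode(codes, to_unicode)
```

Without such a table the codes are just numbers chosen by the producer, which
is why dumping the bytes unchanged yields wrong output.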
Comment 3 Ray Johnston 2004-05-26 10:18:31 UTC
We should apply the patch and close the bug, but open a new
enhancement request for the Unicode mode.

Comment 4 Alex Cherepanov 2004-05-31 19:02:57 UTC
The patch is committed to the head branch.
An enhancement request (bug 687492) was created to track the
development of the ps2ascii utility.

There are 2 issues here:
(1) Decode source strings with well-known CMap files into Unicode or ASCII
    when possible.
(2) Use the ToUnicode CMap if possible, but first we need to pass it from the
    PDF to the PostScript level (bug 685335).
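
The two issues above could combine roughly as in the following Python sketch.
It is only an illustration under stated assumptions, not how ps2ascii works:
parse_bfchar and extract_text are hypothetical names, only the
beginbfchar/endbfchar sections of a ToUnicode CMap are handled (bfrange is
ignored), and preferring ToUnicode over a well-known CMap table is a design
choice made for the example.

```python
import re

# One <source code> <UTF-16BE destination> pair inside a bfchar section.
BFCHAR_PAIR = re.compile(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>")


def parse_bfchar(cmap_text):
    """Collect code -> Unicode string pairs from beginbfchar sections."""
    table = {}
    for section in re.findall(r"beginbfchar(.*?)endbfchar", cmap_text, re.S):
        for src, dst in BFCHAR_PAIR.findall(section):
            table[int(src, 16)] = bytes.fromhex(dst).decode("utf-16-be")
    return table


def extract_text(codes, to_unicode=None, known_cmap=None):
    """Prefer ToUnicode, then a well-known CMap table, else show the raw code."""
    out = []
    for code in codes:
        if to_unicode and code in to_unicode:
            out.append(to_unicode[code])
        elif known_cmap and code in known_cmap:
            out.append(known_cmap[code])
        else:
            out.append("<%04X>" % code)  # undecodable: keep the code visible
    return "".join(out)


# Made-up ToUnicode fragment, for illustration only.
SAMPLE_CMAP = """
2 beginbfchar
<0056> <0054>
<0047> <0065>
endbfchar
"""

table = parse_bfchar(SAMPLE_CMAP)
result = extract_text([0x56, 0x47, 0x49, 0x44], table, {0x49: "x"})
```

Codes that neither table covers are emitted as hex placeholders rather than
guessed, so a reader can tell decoded text from undecoded codes.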