Bug 695204 - PDFwrite - Text cannot be selected in output file
Summary: PDFwrite - Text cannot be selected in output file
Status: RESOLVED WONTFIX
Alias: None
Product: Ghostscript
Classification: Unclassified
Component: PDF Writer
Version: 9.14
Hardware: PC Windows 7
Importance: P4 normal
Assignee: Ken Sharp
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2014-05-02 10:14 UTC by pihug12
Modified: 2014-05-04 11:16 UTC
0 users

See Also:
Customer:
Word Size: ---


Attachments
File 1 (2.46 MB, application/pdf)
2014-05-02 10:26 UTC, pihug12
File 2 (2.59 MB, application/pdf)
2014-05-02 10:30 UTC, pihug12

Description pihug12 2014-05-02 10:14:06 UTC
I'm using the « gswin64c.exe -dBATCH -dNOPAUSE -q -sDEVICE=pdfwrite -sOutputFile=out.pdf file1.pdf file2.pdf » trick to create a third file with the two source files merged.

I'm using SumatraPDF (on Windows 7 x64) to read the files.

I can select/highlight/copy the text on the source PDFs, but I can't on the output one.
Comment 1 pihug12 2014-05-02 10:26:49 UTC
Created attachment 10872
File 1
Comment 2 pihug12 2014-05-02 10:30:28 UTC
Created attachment 10873
File 2
Comment 3 Ken Sharp 2014-05-02 11:05:30 UTC
The output file does not contain any text. It contains a series of paths and fills which make up objects with the same visual appearance as the original input files, but there are no fonts and no text in the resulting PDF file.

Interestingly, the original file1.pdf, when decompressed, comes to 130 MB, so this is an astonishingly large file.

You don't need to run 2 files; 1 is quite enough to show the difference. I will look at it, but since the visual appearance is the same I'm currently inclined to say this is not a bug.
Comment 4 Ken Sharp 2014-05-03 01:35:49 UTC
The problem is that the text you see in the PDF file is not text. What is actually there is an image, a bitmap.

There is text, and you can find it using search, but you can't select it using Acrobat. The reason is that the text is drawn in text rendering mode 7 (clip). We don't support preserving that as text, we convert it all to paths and clip to the path. Acrobat won't let you select the text so this seems reasonable to me.

It looks like someone has laid a series of white rectangles over the top of the original image, then 'cut out' the text from those images letting the background shine through in those areas.

If I alter the text rendering mode from 7 to 3, then Acrobat will allow me to select the text, and if I run that file through pdfwrite I can select the text in the output file.
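
To check which text rendering modes a page actually uses, the decompressed content stream can be scanned for `Tr` operators. The following is a minimal sketch, not part of Ghostscript: it assumes the stream has already been decompressed into a byte string, and the names `TEXT_RENDER_MODES`, `TR_PATTERN` and `count_render_modes` are hypothetical. It does not parse PDF syntax, so a match inside a string or inline image data would be counted as well.

```python
import re
from collections import Counter

# Text rendering modes defined by the PDF specification for the Tr operator.
TEXT_RENDER_MODES = {
    0: "fill",
    1: "stroke",
    2: "fill, then stroke",
    3: "invisible",
    4: "fill and add to clipping path",
    5: "stroke and add to clipping path",
    6: "fill, then stroke, and add to clipping path",
    7: "add to clipping path only",
}

# A single digit 0-7 as a whole token, followed by whitespace and "Tr".
# Assumption: tokens are separated by ASCII whitespace.
TR_PATTERN = re.compile(rb"(?:^|(?<=\s))([0-7])\s+Tr(?=\s|$)")


def count_render_modes(stream: bytes) -> Counter:
    """Count how often each text rendering mode is set in the stream."""
    return Counter(int(m.group(1)) for m in TR_PATTERN.finditer(stream))


if __name__ == "__main__":
    sample = b"BT /F1 12 Tf 7 Tr (Hello) Tj ET\nBT 3 Tr (x) Tj ET"
    for mode, hits in sorted(count_render_modes(sample).items()):
        print(mode, TEXT_RENDER_MODES[mode], hits)
```

A file like the one in this report would show mode 7 for its text, which is the case pdfwrite converts to paths.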
Comment 5 pihug12 2014-05-04 09:57:57 UTC
Thanks for your reply.

> If I alter the text rendering mode from 7 to 3, then Acrobat will allow me
> to select the text, and if I run that file through pdfwrite I can select
> the text in the output file.

Is there any way to do this easily? I don't understand what I have to do.
Comment 6 Ken Sharp 2014-05-04 11:16:56 UTC
(In reply to comment #5)
> Thanks for your reply.
> 
> > If I alter the text rendering mode from 7 to 3, then Acrobat will allow me
> > to select the text, and if I run that file through pdfwrite I can select
> > the text in the output file.
> 
> Is there any way to do this easily?

Basically, no. I decompressed the file, then did a search and replace (using a binary editor) for '7 Tr' replacing with '3 Tr', inspecting each occurrence because the sequence is short enough to potentially exist in a binary stream, such as image data. The result, of course, looks incorrect, because the clip is no longer applied.
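
The search-and-replace described above can be sketched in code. This is only an illustration under stated assumptions, not a supported workflow: it assumes the data has already been decompressed, the names `TR_7`, `find_mode7` and `switch_to_invisible` are hypothetical, and a person still has to review each offset, because the same bytes can occur in image data. Replacing the single byte `7` with `3` keeps the data the same length.

```python
import re

# "7" as a whole token, followed by whitespace and the Tr operator.
# Assumption: tokens are separated by ASCII whitespace.
TR_7 = re.compile(rb"(?:^|(?<=\s))7\s+Tr(?=\s|$)")


def find_mode7(data: bytes) -> list:
    """Return the byte offset of the '7' in every '7 Tr' candidate."""
    return [m.start() for m in TR_7.finditer(data)]


def switch_to_invisible(data: bytes, approved) -> bytes:
    """Change '7 Tr' to '3 Tr' only at offsets a person has approved."""
    out = bytearray(data)
    for offset in find_mode7(data):
        if offset in approved:
            out[offset] = ord("3")  # same length, so nothing else shifts
    return bytes(out)


if __name__ == "__main__":
    data = b"q 7 Tr (a) Tj Q\x00\xff 7 Tr \x10"
    for offset in find_mode7(data):
        # Show some context so each candidate can be inspected by hand.
        print(offset, data[max(0, offset - 8):offset + 8])
    print(switch_to_invisible(data, approved={2}))
```

In the example, only the first candidate is approved; the second, which sits next to binary bytes, is left alone.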

> I don't understand what do I have to do.

To be honest, this is not really something you want to undertake with a PDF file. If you can alter the way the file is created then that might help your workflow, but once the file has been created like this it is very hard to alter.