Bug 689719 - Why do some pdf files display so badly?
Status: NOTIFIED WORKSFORME
Alias: None
Product: Ghostscript
Classification: Unclassified
Component: PDF Interpreter
Version: master
Hardware: PC Linux
Importance: P4 normal
Assignee: Default assignee
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2008-02-23 07:31 UTC by Sammy Umar
Modified: 2008-12-19 08:31 UTC
0 users

See Also:
Customer:
Word Size: ---


Attachments
Sample pdf file (484.87 KB, application/pdf)
2008-02-23 07:32 UTC, Sammy Umar

Description Sammy Umar 2008-02-23 07:31:24 UTC
Many physics journals have put all of the older articles onto the
web as pdf files. The new ones have no problem but the older ones
display very badly using gs or gs based viewers. They seem to 
display much better with xpdf or AdobeReader. Is there any way
to improve the display quality or is this an inherent problem?
I am attaching a sample file.
Thanks
Comment 1 Sammy Umar 2008-02-23 07:32:03 UTC
Created attachment 3809
Sample pdf file
Comment 2 Ray Johnston 2008-02-23 12:42:19 UTC
This file is one of the prevalent 'PDF in name only' PDF's that many
lame applications create (such as scanners). 

The PDF consists of a single image per page. As -dPDFDEBUG with gs shows:

%Resolving: [3 0]
<<
/Type /XObject /Subtype /Image /Name /I0 /Filter [
/CCITTFaxDecode ]
/Width 5169 /Height 7129 /BitsPerComponent 1 /ColorSpace /DeviceGray /Length 4 0 R
/DecodeParms [
<<
/Columns 5169 /Rows 7129 /K -1 /EndOfBlock false
>>
]
>>

Thus this text is rendered at approximately 720 dpi.
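
As a rough check of that figure, the arithmetic can be sketched as below.
The pixel dimensions come from the image dictionary above and the 720 dpi
value from this comment. The 96 dpi screen resolution is an assumption
chosen only for illustration, and the function names are hypothetical.

```python
# Pixel dimensions taken from the image dictionary in the -dPDFDEBUG output.
WIDTH_PX = 5169
HEIGHT_PX = 7129

SCAN_DPI = 720    # resolution quoted in the comment
SCREEN_DPI = 96   # assumption: a typical desktop display, not from the report


def implied_size_inches(pixels, dpi):
    """Physical length covered by `pixels` samples at `dpi` samples per inch."""
    return pixels / dpi


def downscale_factor(source_dpi, display_dpi):
    """How many source pixels land on one display pixel along each axis."""
    return source_dpi / display_dpi


page_w = implied_size_inches(WIDTH_PX, SCAN_DPI)
page_h = implied_size_inches(HEIGHT_PX, SCAN_DPI)
factor = downscale_factor(SCAN_DPI, SCREEN_DPI)

print(f"{page_w:.2f} x {page_h:.2f} in, "
      f"{factor:.1f} source pixels per screen pixel per axis")
```

At 720 dpi the bitmap corresponds to a page of roughly 7.2 by 9.9 inches,
which is plausible for a journal page. Under the assumed screen resolution,
each screen pixel has to stand in for several source pixels in each
direction, which is where the display quality is lost.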

Running:
     gs -sFile=bug_689719.pdf toolbin/pdf_info.ps
shows:
    bug_689719.pdf has 3 pages.
    Producer: g42pdf.pl 1.0

By default Ghostscript doesn't perform any 'image smoothing', but Adobe
Acrobat does.

The image looks better on my screen when I force Ghostscript to use an
image filter with:

    gs -dDOINTERPOLATE bug_689719.pdf 
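
To illustrate why smoothing matters when a 1-bit image is shown at a much
lower resolution, here is a toy sketch. It is not Ghostscript's actual
interpolation code, and the two helper functions are hypothetical. It only
contrasts dropping samples with averaging them on a single row of pixels,
using the toy convention 1 = black and 0 = white.

```python
def subsample(row, factor):
    """Keep every `factor`-th sample and discard the rest (no smoothing)."""
    return [row[i] for i in range(0, len(row), factor)]


def box_average(row, factor):
    """Average each run of `factor` samples into one grey value in [0, 1]."""
    out = []
    for i in range(0, len(row), factor):
        chunk = row[i:i + factor]
        out.append(sum(chunk) / len(chunk))
    return out


# One row of a 1-bit scan: a thin 2-pixel stroke, then a wider 6-pixel stroke.
row = [0, 0, 0, 1, 1, 0, 0, 0,
       0, 1, 1, 1, 1, 1, 1, 0]

print(subsample(row, 8))    # both strokes fall between the kept samples
print(box_average(row, 8))  # both strokes survive as shades of grey
```

With plain subsampling, whether a stroke appears depends on where the kept
samples happen to fall, so thin parts of glyphs drop out or turn blocky.
Averaging keeps them as grey levels, which is the kind of effect a smoothing
filter gives on screen.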
Comment 3 Marcos H. Woehrmann 2008-02-24 13:04:49 UTC
Ray, could you please clarify your 'PDF in name only' comment.  

I don't see why this file is any less a 'real' PDF file than any other.  I agree
that it's a simple PDF file, but surely there isn't a complexity requirement in
the PDF spec.
Comment 4 Ray Johnston 2008-02-24 13:42:30 UTC
Marcos requested clarification of my 'PDF in name only' comment.

It's not an issue of compliance with the PDF spec, but rather a comment on
a PDF that doesn't conform to the 'spirit' of creating a Portable Document
Format that provides many advantages over other formats such as TIFF.

At least this one doesn't use lossy (JPEG) compression.

For a page that looks like text, and is placed into an archive of documents,
it seems that one might expect something more if the document is a PDF
instead of a TIFF or JPEG.

Most PDF's that have text are 'searchable' i.e., the text is in the PDF as
PDF text operators, and usually with embedded fonts (or font subsets) to make
the PDF portable. Also, these 'real' PDF's don't have a specific resolution
'baked in' so they print and display well at a wide range of resolutions/zoom
factors.

A PDF that is nothing more than a PDF wrapper on a full page bitmap is neither
resolution independent, nor is it searchable. Also the file size is usually
larger than a 'real' PDF.

Ghostscript originally put many fonts into PDF's as bitmap fonts, and that
had the latter (resolution specific) limitation, but at least it was text,
although sometimes the Encoding would keep it from being searchable by tools
that didn't handle Type 3 fonts correctly.

Note that there are tools to convert images into searchable PDF's. Scansoft is
one that I've used that works quite well, although like most OCR based s/w
it may require manual 'cleanup'. Their cleanup tool is worthwhile as well.
This software came with my Fujitsu ScanSnap scanner, but this scanner does
default to non-OCR mode, creating exactly the type of PDF attached to this
report.