With Ghostscript 8.70, processing the attached test.pdf with this script

    GS=/opt/ghostscript/ghostscript-8.70/bin/gs
    pdf_src="$1"
    dest="$(basename "$pdf_src" .pdf)_gs870.pdf"
    "$GS" -sDEVICE=pdfwrite -dNOPAUSE -dCompressFonts=false -dCompressPages=false -dUseFlateCompression=false -sOutputFile="$dest" "$pdf_src" -c quit

produces test_gs870.pdf without errors. But if I view test_gs870.pdf with xpdf, I see many of these "errors" on the console:

    Error: Illegal entry in bfrange block in ToUnicode CMap

Both PDFs display correctly. I'm currently running Ubuntu Linux 8.04 LTS.
Created attachment 5316: gs870.zip (distill_870.sh, test_gs870.pdf, test.pdf)
This appears to be an error with xpdf, not Ghostscript. The original file includes a ToUnicode CMap which uses the ~bfchar operators:

    0 beginbfrange
    endbfrange
    5 beginbfchar
    <0004> <0021>
    <002B> <0048>
    <0048> <0065>
    <004F> <006C>
    <0052> <006F>
    endbfchar

The output from pdfwrite encodes the same information but uses the ~bfrange operators instead:

    5 beginbfrange
    <04><04><0021>
    <2b><2b><0048>
    <48><48><0065>
    <4f><4f><006c>
    <52><52><006f>
    endbfrange

Now this is nominally slightly less efficient, since it requires a start and an end code for each range, but the spec (Adobe Technical Note 5014, the Adobe CMap and CIDFont Files Specification, p72 and 73) does not say that these cannot be the same.

In fact it appears that xpdf is assuming that the start and end codes will be 4 hex digits (2 bytes) long. If I modify the entries thus:

    5 beginbfrange
    <0004><0004><0021>
    <002b><002b><0048>
    <0048><0048><0065>
    <004f><004f><006c>
    <0052><0052><006f>
    endbfrange

the problem disappears. The spec says:

"Values for srcCodeLo and srcCodeHi must be in hexadecimal notation."

There is no apparent requirement for null padding, so this requirement by xpdf would seem to be incorrect.

Further, Adobe Acrobat is capable of opening both files *and* searching for the text in both cases. Since Acrobat requires the ToUnicode table to search for text in a CIDFont, it would seem that Adobe do not require this either.
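The manual edit described above (zero-padding the one-byte source codes to two bytes) can be sketched in Python. This is only an illustration and not part of Ghostscript or xpdf: the function names `parse_bfrange` and `pad_source_codes` are hypothetical, and the sketch assumes every entry has the simple `<lo><hi><dst>` form with an even number of hex digits per token.

```python
import re

# One bfrange entry: <srcCodeLo> <srcCodeHi> <dst>; whitespace between tokens is optional.
ENTRY = re.compile(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>")

def parse_bfrange(block):
    """Return a list of (lo, hi, dst) byte-string tuples, one per entry."""
    return [tuple(bytes.fromhex(tok) for tok in m.groups())
            for m in ENTRY.finditer(block)]

def pad_source_codes(block, width=2):
    """Rewrite entries so both source codes are zero-padded to `width` bytes.

    The destination (Unicode) value is left untouched.
    """
    lines = []
    for lo, hi, dst in parse_bfrange(block):
        lo_hex = lo.rjust(width, b"\x00").hex()
        hi_hex = hi.rjust(width, b"\x00").hex()
        lines.append(f"<{lo_hex}><{hi_hex}><{dst.hex()}>")
    return "\n".join(lines)

if __name__ == "__main__":
    print(pad_source_codes("<04><04><0021> <2b><2b><0048> <48><48><0065>"))
```

Run on the pdfwrite entries quoted above, this yields the two-byte form that xpdf accepts; entries that are already two bytes wide pass through unchanged.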
The PDF 1.7 spec, at page 472, section 5.9.2, says:

"It must use the ...beginbfrange, and endbfrange operators to define the mapping from character codes to Unicode character sequences expressed in UTF-16BE encoding" (also see Adobe TechNote 5411).

Is this important?
>It must use the ...beginbfrange, and endbfrange operators to define the mapping
>from character codes to Unicode character sequences expressed in UTF-16BE
>encoding (also see Adobe TechNote 5411). Is this important?

Well, it does do this. The character codes are expressed as single bytes, but the Unicode code points are expressed as 16 bits. E.g.:

    <04><04><0021>

Starting code = 0x04, end code = 0x04, Unicode = 0x0021. So the final (Unicode) value is expressed as a UTF-16BE value. Character codes are just numbers and there's no indication that they need to be a particular number of bits (in general in PostScript or PDF they are not).

It's worth noting that CMaps may encode character codes of arbitrary size; the largest I've heard of is 5 bytes (a Chinese CMap), so requiring two-byte character codes would be unhelpful.

Also, as I said, Acrobat is happy with the CMap (and the PDF file), and Acrobat is generally regarded as the standard.
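To make the reading of an entry such as `<04><04><0021>` concrete, here is a minimal Python sketch that expands one bfrange entry into a mapping from source code bytes to Unicode text. The function name `expand_bfrange` is hypothetical. The sketch assumes the simple string form of the destination (not the array form) and follows the usual bfrange convention that successive codes in a range increment the last byte of the destination.

```python
def expand_bfrange(lo_hex, hi_hex, dst_hex):
    """Expand one <lo><hi><dst> entry into {source code bytes: Unicode string}."""
    width = len(lo_hex) // 2          # byte width of the source codes
    lo, hi = int(lo_hex, 16), int(hi_hex, 16)
    dst = bytes.fromhex(dst_hex)      # destination is UTF-16BE
    mapping = {}
    for offset, code in enumerate(range(lo, hi + 1)):
        last = dst[-1] + offset       # only the last byte is incremented
        if last > 0xFF:
            raise ValueError("range overflows the last destination byte")
        utf16 = dst[:-1] + bytes([last])
        mapping[code.to_bytes(width, "big")] = utf16.decode("utf-16-be")
    return mapping

if __name__ == "__main__":
    print(expand_bfrange("04", "04", "0021"))   # single-code range
    print(expand_bfrange("41", "43", "0061"))   # three consecutive codes
```

The source code keeps whatever byte width the entry uses, while the destination is always decoded as UTF-16BE, which is the distinction drawn in the comment above.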
Thank you, Ken, for your patience; please forgive me, this is all a bit new to me.

I'm not convinced by

    1 begincodespacerange
    <00><ff>
    endcodespacerange

because 5411 says

    1 begincodespacerange
    <0000><FFFF>
    endcodespacerange

So I ran

    wget http://www.pragma-ade.com/general/manuals/mk.pdf

and processed it with

    GS=/opt/ghostscript/ghostscript-8.70/bin/gs
    pdf_src="$1"
    dest="$(basename "$pdf_src" .pdf)_gs870.pdf"
    "$GS" -sDEVICE=pdfwrite -dNOPAUSE -dCompressFonts=false -dCompressPages=false -dFirstPage=102 -dUseFlateCompression=false -sOutputFile="$dest" "$pdf_src" -c quit

(note -dFirstPage=102). I see this:

    1 begincodespacerange
    <00><ff>
    endcodespacerange
    8 beginbfrange
    <05><07><dcca>
    <09><09><0046>
    <18><18><0055>
    <83><83><0061>
    <85><8b><dd56>
    <8e><92><dd5f>
    <94><96><dd65>
    <98><9a><dd69>
    endbfrange
    endcmap
    CMapName currentdict /CMap defineresource pop
    end
    end
    endstream
    endobj
    1049 0 obj
    <</BaseFont/HCNWLM+CambriaMath-2-Identity-H/ToUnicode 1286 0 R/Type/Font /Encoding /Identity-H/DescendantFonts[1050 0 R]/Subtype/Type0>>

Is it OK?

If I process the whole file with

    GS=/opt/ghostscript/ghostscript-8.70/bin/gs
    pdf_src="$1"
    dest="$(basename "$pdf_src" .pdf)_gs870.pdf"
    "$GS" -sDEVICE=pdfwrite -dNOPAUSE -dCompressFonts=false -dCompressPages=false -dUseFlateCompression=false -sOutputFile="$dest" "$pdf_src" -c quit

I get an error at (or after) page 11, but this may be another story.
Operand stack:
   --dict:8/17(L)-- MplSh1 --dict:8/9(ro)(L)-- --dict:8/8(ro)(L)-- --nostringval-- --dict:8/9(ro)(L)--
Execution stack:
   %interp_exit .runexec2 --nostringval-- --nostringval-- --nostringval-- 2 %stopped_push --nostringval-- --nostringval-- --nostringval-- false 1 %stopped_push 1862 1 3 %oparray_pop 1861 1 3 %oparray_pop 1845 1 3 %oparray_pop --nostringval-- --nostringval-- 12 1 316 --nostringval-- %for_pos_int_continue --nostringval-- --nostringval-- false 1 %stopped_push --nostringval-- --nostringval-- --nostringval-- %array_continue --nostringval-- false 1 %stopped_push --nostringval-- %loop_continue --nostringval-- --nostringval-- --nostringval--
Dictionary stack:
   --dict:1155/1684(ro)(G)-- --dict:1/20(G)-- --dict:76/200(L)-- --dict:76/200(L)-- --dict:106/127(ro)(G)-- --dict:285/300(ro)(G)-- --dict:22/25(L)-- --dict:4/6(L)-- --dict:25/40(L)-- --dict:1/1(ro)(G)-- --dict:8/15(L)-- --dict:3/5(L)--
Current allocation mode is local
Last OS error: 2
GPL Ghostscript 8.70: Unrecoverable error, exit code 1
Acroread 9 under Linux has no problem with mk.pdf.
>I'm not convinced by
>1 begincodespacerange <00><ff> endcodespacerange
>because 5411 says
>1 begincodespacerange <0000><FFFF> endcodespacerange

Not sure what you mean by 5411, is this another tech note?

The ~codespacerange operators just declare the minimum and maximum values for the character codes; since we aren't using values outside 00->ff, this is enough for us. Note these are the *input* codes, not the output values.

Multiple bytes in the code space range have a different meaning to what you might expect. Each byte is considered separately, not as forming a single value. See p48 of tech note 5014 for more on this. In the case of 0000 ffff this is the full complement of 2-byte character codes, but we really don't need to specify that.

Now as for mk.pdf, I don't see a problem with the CMap you quote, but I would need to see the whole file to be more sure. I can't really comment on your other error.

We've drifted rather far away from your xpdf problem. If you think there is a problem with a different file using GS, you should open a new issue; if you think there is still a problem with xpdf, then you should reopen this issue.
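The point that each byte of a codespace range is considered separately can be illustrated with a short Python sketch. The function name `in_codespace` is hypothetical, and the two-byte range `<8140><9FFC>` is used only as an illustrative value, not something taken from the files in this report.

```python
def in_codespace(code, lo, hi):
    """True if `code` falls inside the codespace range lo..hi.

    The test is made byte by byte: byte i of the code must lie between
    byte i of lo and byte i of hi. A code of a different length never matches.
    """
    if not (len(code) == len(lo) == len(hi)):
        return False
    return all(l <= c <= h for c, l, h in zip(code, lo, hi))

if __name__ == "__main__":
    lo, hi = bytes.fromhex("8140"), bytes.fromhex("9ffc")
    # 0x8230 lies numerically between 0x8140 and 0x9FFC, but its second
    # byte (0x30) is below 0x40, so it is outside the codespace range.
    print(in_codespace(bytes.fromhex("8230"), lo, hi))
    print(in_codespace(bytes.fromhex("8150"), lo, hi))
    # A one-byte code never matches a two-byte range, and vice versa.
    print(in_codespace(b"\x04", b"\x00\x00", b"\xff\xff"))
```

With a `<00><ff>` range every one-byte code matches, and with `<0000><FFFF>` every two-byte code matches, so for those two ranges the only thing that differs is the code length.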
>>I'm not convinced by
>>1 begincodespacerange <00><ff> endcodespacerange
>>because 5411 says
>>1 begincodespacerange <0000><FFFF> endcodespacerange
>
>Not sure what you mean by 5411, is this another tech note?

Yes: ToUnicode Mapping File Tutorial, Technical Note #5411.

Please see also the PDF 1.7 spec, page 472, subsection 5.9.2.

Anyway, for mk.pdf maybe I will open another issue.

Many thanks.
>Please see also
>
>PDF 1.7 spec, page 472, subsection 5.9.2

I've looked at both of these and I do not see a specific problem. It's true that the examples all use 2-byte codespace ranges, and as a result all use 2-byte bfchar/bfrange operators. However, there is nothing in the spec which says that 2-byte ranges are a requirement. In fact the spec makes it reasonably clear (under begincodespacerange) that single-byte ranges are permissible.

Given that we do use 2-byte codes (but always between 00 and ff), it's possibly arguable that we should use 2-byte ranges, but as I said, Acrobat seems happy with what we currently output.
On page 3 of Adobe Technical Note #5411 there is the following text:

"codespacerange" definition, without exception, shall always be used:

    1 begincodespacerange
    <0000><FFFF>
    endcodespacerange

See also the PDF 1.7 spec, which at page 472, section 5.9.2, says:

"Additional guidance regarding the CMap defined in this entry is provided in Adobe Technical Note #5411, ToUnicode Mapping File Tutorial"

But test_gs870.pdf does not follow this, hence I am reopening the issue.

Many thanks to Ken for his patience.
In my opinion, the paragraph that actually clarifies this debate is the following, taken from PDF32000_2008.pdf section 9.10.3 "ToUnicode CMaps":

"The CMap file shall contain begincodespacerange and endcodespacerange operators that are consistent with the encoding that the font uses. In particular, for a simple font, the codespace shall be one byte long."

Almost the same text can be found in the PDF Reference, section 5.9.2 "ToUnicode CMaps", 3rd bullet. I consider this implies that fonts using 2-byte charcodes should use 2-byte ‘character codes’ in the ToUnicode CMap.

This is also consistent with the following, found in section 9.7.6.2 "CMap Mapping", and which refers to ‘regular’ CMaps:

"The code extracted from the string shall be looked up in the character code mappings for codes of that length. ... Failing that, it shall be looked up in the notdef mappings, ..."

(Note the "of that length".) It's clear from this that the length of the code matters a lot. If there are no ranges of the correct length, the respective ‘character code’ is taken as undefined; there’s no ‘delete leading zeroes and retry’.

To put it another way, in a Type 0 font ‘character codes’ are conceptually not numbers, for which one could talk about ‘0-filling the value’, but sequences of 1 or more bytes. For example, <00FF> starts not as the number 255, but as the 2 bytes 0 and 255; these are extracted from a show string and used in a certain way (maybe to derive a CID, which IS a number) to locate the corresponding character.

Also note that the ToUnicode CMap is attached to the Type 0 composite font, not to its CIDFont descendant (the latter indeed identifies characters by numbers and not by sequences of bytes).
>Almost the same text can be found in the PDF Reference, section
>5.9.2 "ToUnicode CMaps", 3rd bullet. I consider this implies that fonts using
>2-byte charcodes should use 2-byte ‘character codes’ in the ToUnicode CMap.

I agree with this completely; in fact the point about codespace ranges is that they are not linear. If the range contains two bytes, then the first byte of the character code is compared to the first byte of the start code and the first byte of the end code. Similarly, the second byte of the character code is compared to the second byte of the start and end codes.

Since the two bytes are considered separately, it's pretty clear that a 2-byte code space range requires 2-byte character codes (but note that Acrobat appears to promote single-byte codes by inserting '00').

What isn't clear is that ToUnicode CMaps *always* have a two-byte code space range. This seems to be documented only in tech note 5411, the ToUnicode Mapping File Tutorial, where it says (as Luigi correctly points out):

"the following “codespacerange” definition, without exception, shall always be used:

    1 begincodespacerange
    <0000> <FFFF>
    endcodespacerange"

This makes it mandatory to have a 2-byte code space range, and therefore the bfrange/bfchar parameters must also be 2 bytes.

In passing, Luigi also tells me that the xpdf maintainer has already 'fixed' this and the next release of xpdf will process these CMaps without complaint. Nevertheless, we should fix pdfwrite so it conforms to the 'spec'.
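The conclusion above (the bfrange source codes should have the same byte width as the declared codespace range) can be checked mechanically. The following Python sketch is a rough illustration, not how pdfwrite or xpdf do it: the function names are hypothetical, only bfrange blocks with the simple `<lo><hi><dst>` form are inspected, and the regular expressions assume a well-formed CMap.

```python
import re

HEX = r"<([0-9A-Fa-f]+)>"

def codespace_widths(cmap_text):
    """Byte widths declared between begincodespacerange and endcodespacerange."""
    widths = set()
    for body in re.findall(r"begincodespacerange(.*?)endcodespacerange",
                           cmap_text, re.S):
        for lo, _hi in re.findall(HEX + r"\s*" + HEX, body):
            widths.add(len(lo) // 2)
    return widths

def mismatched_sources(cmap_text):
    """bfrange source codes whose byte width is not a declared codespace width."""
    widths = codespace_widths(cmap_text)
    bad = []
    for body in re.findall(r"beginbfrange(.*?)endbfrange", cmap_text, re.S):
        for lo, hi, _dst in re.findall(HEX + r"\s*" + HEX + r"\s*" + HEX, body):
            bad.extend(src for src in (lo, hi) if len(src) // 2 not in widths)
    return bad

if __name__ == "__main__":
    cmap = """1 begincodespacerange <0000> <FFFF> endcodespacerange
    2 beginbfrange <04><04><0021> <002b><002b><0048> endbfrange"""
    print(mismatched_sources(cmap))  # lists the one-byte source codes
```

An empty result means every inspected source code has a width that the codespace range declares; it says nothing about whether the values themselves fall inside the range.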
Fixed in revision 10018, patch here: http://ghostscript.com/pipermail/gs-cvs/2009-August/009738.html