/Encoding prevents characters in a specific font from rendering but they do in ghostscript, chrome, and acrobat #14117

jberkenbilt · 2021-10-12T18:55:04Z

Note: I am the author of qpdf and very knowledgeable about PDF, but I have only just started digging into TrueType fonts while investigating this issue. I have passable JavaScript skills, but it is not my area of expertise. However, I am happy to assist in producing a fix. I am going to dig into the code, but I thought I'd post the issue right away in case it's something that would be easily fixable by someone with more knowledge. Details below.

Attach (recommended) or Link to PDF file here:

ttf-font-encoding.pdf

Configuration:

Web browser and its version: gulp server -> displayed on chrome
Operating system and its version: Ubuntu 20.04
PDF.js version: github master at 3945965
Is a browser extension: no

Steps to reproduce the problem:

Start gulp server, then load the file
Observe that there is not very much text on the page; see below for correct rendering

What is the expected behavior? (add screenshot)

This is a screenshot of the file as rendered by chrome:

This is as rendered by ghostscript:

Poppler also can't render this. Here is the file as rendered by evince:

What went wrong? (add screenshot)

In the attached PDF, which is in "QDF" format and can be easily edited in a text editor that can handle binary files (like emacs), you can find two fonts defined: /F31 (object 6, base font /FSVHNM+Arial) and /F35 (object 7, base font /QIJLAK+Calibri). All the characters displayed with /F31 do not render. All the characters displayed with /F35 render properly. I have removed almost all extraneous information from the PDF file but have left in all the text from the original file (after removing sensitive information) rendered in either of those two fonts. Neither font has a /ToUnicode map, so it is easy to read the text in the content stream.

If you edit object 6 to comment out the encoding (replace the space at offset 6752 with %), then the file renders properly with pdf.js as well as poppler. The fonts have /Flags 32, indicating a non-symbolic font. Removing /Flags has no bearing on the rendering.

If you extract the fontfile from the broken font from object 12 into a file, you can observe that the font file has two charmaps, one of which has format 0 and encoding "Apple Roman", and the other has format 2 and encoding "Unicode". When loading in pdf.js with the Javascript console displayed, you can observe these warnings:

Warning: cmap table has unsupported format: 2
util.js:28 Warning: TT: undefined function: 32

However, I'm not sure this is actually important since the file renders properly using presumably the other charmap with /Encoding removed.
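For anyone who wants to check which charmaps an extracted font file carries without a dedicated font tool, here is a minimal sketch that walks the TrueType table directory and the cmap encoding records. The function name `listCmapSubtables` is made up for illustration and is not part of PDF.js; the sketch assumes a plain TrueType file (not a collection) and does no bounds checking.

```javascript
// Sketch: list the (platformId, encodingId, format) of every cmap subtable
// in a TrueType font held in a Uint8Array. Hypothetical helper, not PDF.js code.
function listCmapSubtables(bytes) {
  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);
  // Offset table: sfnt version (4 bytes), numTables (2), then 6 more bytes.
  const numTables = view.getUint16(4);
  let cmapOffset = -1;
  for (let i = 0; i < numTables; i++) {
    // Each table record is 16 bytes: tag, checksum, offset, length.
    const record = 12 + i * 16;
    const tag = String.fromCharCode(
      view.getUint8(record),
      view.getUint8(record + 1),
      view.getUint8(record + 2),
      view.getUint8(record + 3)
    );
    if (tag === "cmap") {
      cmapOffset = view.getUint32(record + 8);
      break;
    }
  }
  if (cmapOffset < 0) {
    return [];
  }
  // cmap header: version (2 bytes), number of encoding records (2 bytes).
  const count = view.getUint16(cmapOffset + 2);
  const result = [];
  for (let i = 0; i < count; i++) {
    // Each encoding record is 8 bytes: platformId, encodingId, offset.
    const record = cmapOffset + 4 + i * 8;
    const platformId = view.getUint16(record);
    const encodingId = view.getUint16(record + 2);
    // The offset is relative to the start of the cmap table; the subtable
    // itself begins with its format number.
    const format = view.getUint16(cmapOffset + view.getUint32(record + 4));
    result.push({ platformId, encodingId, format });
  }
  return result;
}
```

In Node, the result of `fs.readFileSync` on the extracted font file can be passed in directly, since a `Buffer` is a `Uint8Array`.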

Looking at the debugging output from gs -dNODISPLAY -dBATCH -dTTFDEBUG /tmp/ttf-font-encoding.pdf, it appears that ghostscript is deciding to use the builtin encoding from the cmap and is disregarding /Encoding, but I'm not sure, and I have intentionally not dug into the ghostscript code because it is GPL-2 and I don't want it to contaminate my thinking if I help with a fix.

Anything else I would say would be well into speculative territory at this point. Hopefully someone will be able to shed some light on this and help find a solution. My hunch is that we are dealing with an incorrect PDF or an incorrect TTF file that some other viewers are able to handle because of heuristics they have to work around broken files. I know from qpdf that a lot of the work of PDF readers is dealing with all the broken files in the wild, since for most of the world, "It works in Acrobat" seems to mean the PDF is good. :-)

jberkenbilt · 2021-10-12T19:04:33Z

I should say that my "plan A" is to come up with (or have you come up with) a fix to pdf.js, and my "plan B" is to figure out exactly what properties of the file make it render properly so I can programmatically detect broken files and run them through ghostscript's pdfwrite device. Passing the files through gs -sDEVICE=pdfwrite produces valid files. It's interesting to note that removing /Encoding /WinAnsiEncoding from the working font prevents that font from rendering, so I have yet to figure out why the broken font works without an explicit encoding in some viewers. I'm thinking it's probably just luck.

Snuffleupagus · 2021-10-12T20:35:03Z

> If you edit object 6 to comment out the encoding (replace the space at offset 6752 with %), then the file renders properly with pdf.js as well as poppler.

As far as I'm concerned this is a red herring, since replacing the /Encoding-entry with bogus data results in another code-path being taken in the readCmapTable function. (And that's a function that you want to be very careful when touching, don't ask me how I know :-)

> When loading in pdf.js with the Javascript console displayed, you can observe these warnings:
> Warning: cmap table has unsupported format: 2
> util.js:28 Warning: TT: undefined function: 32

That warning should thus explain the font rendering errors, since we currently don't support that particular cmap format.
In all the years that I've been around the PDF.js project that is the first case I can recall where cmap format 2 support was required, hence it seems safe to assume that it's probably very rare in practice.

All-in-all, it seems that the correct solution here would be to implement support for cmap format 2; the following should be helpful https://developer.apple.com/fonts/TrueType-Reference-Manual/RM06/Chap6cmap.html
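For reference, a minimal sketch of a format 2 lookup, written from the layout described at that link: a 6-byte header, 256 `subHeaderKeys` (each a subheader index multiplied by 8), an array of 8-byte subheaders, then the glyph index array. The function name and signature are hypothetical, this is not the code that ended up in PDF.js, and two-byte codes are assumed to be passed as a single number with the lead byte in the high 8 bits.

```javascript
// Sketch: map one character code to a glyph index using a cmap format 2
// subtable located at byte offset `start` of the DataView `view`.
// Hypothetical helper based on the published table layout, not PDF.js code.
function lookupCmapFormat2(view, start, charCode) {
  if (view.getUint16(start) !== 2) {
    throw new Error("not a format 2 subtable");
  }
  const keysStart = start + 6; // skip format, length, language
  const subHeadersStart = keysStart + 512; // 256 uint16 subHeaderKeys
  const isSingleByte = charCode < 0x100;
  const highByte = isSingleByte ? charCode : charCode >> 8;
  const lowByte = charCode & 0xff;
  // Each key is (subheader index * 8), i.e. already a byte offset.
  const key = view.getUint16(keysStart + 2 * highByte);
  // Key 0 marks a single-byte code; any other key marks a lead byte.
  if ((key === 0) !== isSingleByte) {
    return 0;
  }
  const sub = subHeadersStart + key;
  const firstCode = view.getUint16(sub);
  const entryCount = view.getUint16(sub + 2);
  const idDelta = view.getInt16(sub + 4);
  const idRangeOffsetPos = sub + 6;
  const idRangeOffset = view.getUint16(idRangeOffsetPos);
  if (lowByte < firstCode || lowByte >= firstCode + entryCount) {
    return 0; // outside the range covered by this subheader
  }
  // idRangeOffset counts bytes from its own position to the glyph index
  // array entry that corresponds to firstCode.
  const glyphPos = idRangeOffsetPos + idRangeOffset + 2 * (lowByte - firstCode);
  const glyph = view.getUint16(glyphPos);
  // A stored 0 means "missing glyph"; otherwise idDelta is added modulo 65536.
  return glyph === 0 ? 0 : (glyph + idDelta) & 0xffff;
}
```

For a font that only needs ASCII, every lookup goes through subheader 0, which is why the format looks like an odd choice for such a file.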

jberkenbilt · 2021-10-12T20:47:29Z

Before I read your message, I noticed that it is the format 2 issue, and I found that the following workaround causes the file to render properly because the 1,0 cmap works:

diff --git a/src/core/fonts.js b/src/core/fonts.js
index aeaf00e82..758c4cf31 100644
--- a/src/core/fonts.js
+++ b/src/core/fonts.js
@@ -1452,6 +1452,20 @@ class Font {
           }
         }
 
+        if (useTable) {
+          const oldPos = file.pos;
+          file.pos = start + offset;
+          const format = file.getUint16();
+          file.pos = oldPos;
+          if (!(format === 0 || format === 4 || format === 6)) {
+            // This cmap has an unsupported format, so we won't be
+            // able to use it. The list of supported formats is
+            // duplicated below.
+            useTable = false;
+            canBreak = false;
+          }
+        }
+
         if (useTable) {
           potentialTable = {
             platformId,

I may apply this locally for now. It's pretty harmless since, if the format is not supported, the cmap that it picks won't work anyway, but it's a little ugly. I haven't opened a pull request because I doubt you would consider this a proper fix. I will look into format 2 and see what I can do.
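The format test in the patch amounts to a membership check against the formats the reader can parse (0, 4 and 6 at the time of writing). A minimal sketch of that check as a standalone predicate follows; the names are hypothetical and not part of PDF.js. Using a set avoids having to negate a chain of comparisons, where misplaced parentheses are easy to miss.

```javascript
// Sketch: the cmap formats the workaround treats as readable.
// Hypothetical names, not PDF.js identifiers.
const SUPPORTED_CMAP_FORMATS = new Set([0, 4, 6]);

// Returns true when a cmap subtable with this format number can be used.
function isSupportedCmapFormat(format) {
  return SUPPORTED_CMAP_FORMATS.has(format);
}

// Example: of the two charmaps in the problem font, only format 0 is usable.
const usable = [0, 2].filter(isSupportedCmapFormat);
```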

jberkenbilt · 2021-10-12T20:52:45Z

This file has only English text encoded in ASCII, though the font has a few characters that fall outside ASCII. Reading up on format 2, it seems like a very odd choice. However, I will continue to study it.

jberkenbilt · 2021-10-12T21:37:56Z

I will be able to give you a pull request implementing support for format 2.

jberkenbilt · 2021-10-12T22:33:13Z

My changes are working on my test files. I need to create a test, which I will do tomorrow. I'll push up a draft pull request without tests for early review.

Implement TrueType character map "format 2" (fixes #14117)

If a PDF included an embedded TrueType font whose preferred character map (cmap) was in "format 2", the code would select that character map and then refuse to read it because of an unsupported format, thus causing the characters not to be rendered. This commit implements support for format 2 as described at the link below. https://developer.apple.com/fonts/TrueType-Reference-Manual/RM06/Chap6cmap.html

Snuffleupagus added font-conversion font-truetype labels Oct 12, 2021

Snuffleupagus linked a pull request Oct 13, 2021 that will close this issue

Implement TrueType character map "format 2" (fixes #14117) #14118

Merged

Snuffleupagus closed this as completed in #14118 Oct 13, 2021

Snuffleupagus added a commit that referenced this issue Oct 13, 2021

Merge pull request #14118 from adventhp/cmap-format-2

29fd87b

Implement TrueType character map "format 2" (fixes #14117)
