/Encoding prevents characters in a specific font from rendering but they do in ghostscript, chrome, and acrobat #14117

Closed
jberkenbilt opened this issue Oct 12, 2021 · 6 comments · Fixed by #14118

Comments

@jberkenbilt
Contributor

Note: I am the author of qpdf and very knowledgeable about PDF, but I have only just started digging into TrueType fonts while investigating this issue. I have passable Javascript skill but it is not my area of expertise. However I am happy to assist in producing a fix. I am going to dig into the code, but I thought I'd post the issue right away in case it's something that would be easily fixable by someone with more knowledge. Details below.

Attach (recommended) or Link to PDF file here:

ttf-font-encoding.pdf

Configuration:

  • Web browser and its version: gulp server -> displayed on chrome
  • Operating system and its version: Ubuntu 20.04
  • PDF.js version: github master at 3945965
  • Is a browser extension: no

Steps to reproduce the problem:

  1. Start gulp server, then load the file
  2. Observe that there is not very much text on the page; see below for correct rendering

01-rendered-with-pdfjs

What is the expected behavior? (add screenshot)

This is a screenshot of the file as rendered by chrome:

02-rendered-with-chrome

This is as rendered by ghostscript:

03-rendered-with-gs

Poppler also can't render this. Here is the file as rendered by evince:

04-rendered-with-evince

What went wrong? (add screenshot)

In the attached PDF, which is in "QDF" format and can be easily edited in a text editor that can handle binary files (like emacs), you can find two fonts defined: /F31 (object 6, base font /FSVHNM+Arial) and /F35 (object 7, base font /QIJLAK+Calibri). All the characters displayed with /F31 do not render. All the characters displayed with /F35 render properly. I have removed almost all extraneous information from the PDF file but have left in all the text from the original file (after removing sensitive information) rendered in either of those two fonts. Neither font has a /ToUnicode map, so it is easy to read the text in the content stream.

If you edit object 6 to comment out the encoding (replace the space at offset 6752 with %), then the file renders properly with pdf.js as well as poppler. The fonts have /Flags 32, indicating a non-symbolic font. Removing /Flags has no bearing on the rendering.

If you extract the fontfile from the broken font from object 12 into a file, you can observe that the font file has two charmaps, one of which has format 0 and encoding "Apple Roman", and the other has format 2 and encoding "Unicode". When loading in pdf.js with the Javascript console displayed, you can observe these warnings:

Warning: cmap table has unsupported format: 2
util.js:28 Warning: TT: undefined function: 32

However, I'm not sure this is actually important since the file renders properly using presumably the other charmap with /Encoding removed.
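
To check which character maps a font file carries without a dedicated font tool, the header of the cmap table can be read directly. The following is a minimal sketch, not PDF.js code: `listCmapSubtables` is a hypothetical helper, and it assumes `cmapBytes` holds only the bytes of the cmap table (already located through the font's table directory), not the whole font file.

```javascript
// Sketch: list the encoding records of a TrueType "cmap" table.
// Assumption: `cmapBytes` is a Uint8Array containing just the cmap table.
function listCmapSubtables(cmapBytes) {
  const view = new DataView(
    cmapBytes.buffer,
    cmapBytes.byteOffset,
    cmapBytes.byteLength
  );
  // Bytes 0-1 hold the table version; bytes 2-3 the number of records.
  const numTables = view.getUint16(2);
  const records = [];
  for (let i = 0; i < numTables; i++) {
    // Each encoding record is 8 bytes: platform ID, encoding ID, and a
    // 32-bit offset to the subtable, relative to the start of the cmap table.
    const recordPos = 4 + i * 8;
    const platformId = view.getUint16(recordPos);
    const encodingId = view.getUint16(recordPos + 2);
    const offset = view.getUint32(recordPos + 4);
    // The first 16-bit word of every subtable is its format number.
    const format = view.getUint16(offset);
    records.push({ platformId, encodingId, offset, format });
  }
  return records;
}
```

For a font like the one described above, the result would contain one record with `format: 0` and another with `format: 2`.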

Looking at the debugging output from gs -dNODISPLAY -dBATCH -dTTFDEBUG /tmp/ttf-font-encoding.pdf, it appears that ghostscript is deciding to use the builtin encoding from the cmap and is disregarding /Encoding, but I'm not sure, and I have intentionally not dug into the ghostscript code because it is GPL-2 and I don't want it to contaminate my thinking if I help with a fix.

Anything else I would say would be well into speculative territory at this point. Hopefully someone will be able to shed some light on this and help find a solution. My hunch is that we are dealing with an incorrect PDF or an incorrect TTF file that some other viewers are able to handle because of heuristics they have to work around broken files. I know from qpdf that a lot of the work of PDF readers is dealing with all the broken files in the wild, since for most of the world, "It works in Acrobat" seems to mean the PDF is good. :-)

@jberkenbilt
Contributor Author

I should say that my "plan A" is to come up with (or have you come up with) a fix to pdf.js, and my "plan B" is to figure out exactly what properties of the file make it render properly so I can programmatically detect broken files and run them through ghostscript's pdfwrite device. Passing the files through gs -sDEVICE=pdfwrite produces valid files. It's interesting to note that removing /Encoding /WinAnsiEncoding from the working font prevents that font from rendering, so I have yet to figure out why the broken font works without an explicit encoding in some viewers. I'm thinking it's probably just luck.

@Snuffleupagus
Collaborator

If you edit object 6 to comment out the encoding (replace the space at offset 6752 with %), then the file renders properly with pdf.js as well as poppler.

As far as I'm concerned this is a red herring, since replacing the /Encoding-entry with bogus data results in another code-path being taken in the readCmapTable function. (And that's a function that you want to be very careful about touching, don't ask me how I know :-)

When loading in pdf.js with the Javascript console displayed, you can observe these warnings:

Warning: cmap table has unsupported format: 2
util.js:28 Warning: TT: undefined function: 32

That warning should thus explain the font rendering errors, since we currently don't support that particular cmap format.
In all the years that I've been around the PDF.js project, this is the first case I can recall where cmap format 2 support was required, hence it seems safe to assume that it's probably very rare in practice.


All-in-all, it seems that the correct solution here would be to implement support for cmap format 2; the following should be helpful: https://developer.apple.com/fonts/TrueType-Reference-Manual/RM06/Chap6cmap.html
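
For orientation, a format 2 lookup can be sketched from the layout described in that reference: a 6-byte header, 256 `subHeaderKeys` (each holding a subheader index multiplied by 8), a list of 8-byte subheaders, and a glyph index array. This is only a rough sketch under those assumptions, not the code that was later merged into PDF.js; `lookupFormat2` is a hypothetical name, and `table` is assumed to be a DataView that starts at the first byte of the format 2 subtable.

```javascript
// Sketch of a cmap "format 2" lookup (hypothetical helper, not PDF.js code).
// `table`: DataView starting at the format 2 subtable.
// `code`: a one-byte (0..0xFF) or two-byte (0x100..0xFFFF) character code.
// Returns the glyph index, or 0 (the missing glyph) if there is no mapping.
function lookupFormat2(table, code) {
  const KEYS_POS = 6; // after format, length, and language (2 bytes each)
  const SUBHEADERS_POS = KEYS_POS + 256 * 2;

  const highByte = (code >> 8) & 0xff;
  const lowByte = code & 0xff;

  let subHeaderPos;
  if (code <= 0xff) {
    // A single-byte code is only valid if its key is 0; a non-zero key
    // marks the byte as the first byte of a two-byte code.
    if (table.getUint16(KEYS_POS + lowByte * 2) !== 0) {
      return 0;
    }
    subHeaderPos = SUBHEADERS_POS; // subheader 0 handles single bytes
  } else {
    const key = table.getUint16(KEYS_POS + highByte * 2);
    if (key === 0) {
      return 0; // the high byte is not a valid lead byte
    }
    subHeaderPos = SUBHEADERS_POS + key; // key is already index * 8
  }

  const firstCode = table.getUint16(subHeaderPos);
  const entryCount = table.getUint16(subHeaderPos + 2);
  const idDelta = table.getInt16(subHeaderPos + 4);
  const idRangeOffsetPos = subHeaderPos + 6;
  const idRangeOffset = table.getUint16(idRangeOffsetPos);

  if (lowByte < firstCode || lowByte >= firstCode + entryCount) {
    return 0;
  }
  // idRangeOffset counts bytes from its own position to the glyph index
  // array entry that corresponds to firstCode.
  const glyphPos = idRangeOffsetPos + idRangeOffset + (lowByte - firstCode) * 2;
  const glyphId = table.getUint16(glyphPos);
  // A stored 0 means "missing glyph"; otherwise idDelta is added modulo 65536.
  return glyphId === 0 ? 0 : (glyphId + idDelta) & 0xffff;
}
```

A font that contains only ASCII text would exercise only the single-byte branch, where every key is 0 and subheader 0 is used.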

@jberkenbilt
Contributor Author

Before I read your message, I noticed that it is the format 2 issue, and I found that the following workaround causes the file to render properly because the 1,0 cmap works:

diff --git a/src/core/fonts.js b/src/core/fonts.js
index aeaf00e82..758c4cf31 100644
--- a/src/core/fonts.js
+++ b/src/core/fonts.js
@@ -1452,6 +1452,20 @@ class Font {
           }
         }
 
+        if (useTable) {
+          const oldPos = file.pos;
+          file.pos = start + offset;
+          const format = file.getUint16();
+          file.pos = oldPos;
+          if (!(format === 0 || format === 4 || format === 6)) {
+            // This cmap has an unsupported format, so we won't be
+            // able to use it. The list of supported formats is
+            // duplicated below.
+            useTable = false;
+            canBreak = false;
+          }
+        }
+
         if (useTable) {
           potentialTable = {
             platformId,

I may apply this locally for now. It's pretty harmless, since a cmap with an unsupported format wouldn't work anyway, but it's a little ugly. I haven't opened a pull request because I doubt you would consider this a proper fix. I will look into format 2 and see what I can do.
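
Stripped of the PDF.js file-reading details, the idea behind the workaround is "take the most preferred cmap whose format can actually be read". A minimal sketch of that selection follows; `pickUsableCmap` and the record shape are hypothetical, the candidates are assumed to be sorted by preference already, and the default set of formats is the list from the diff above.

```javascript
// Formats named as supported in the workaround diff (0, 4 and 6).
const SUPPORTED_CMAP_FORMATS = new Set([0, 4, 6]);

// Sketch: return the first candidate whose format is readable, or null.
// Assumption: `candidates` is an array of { platformId, encodingId, format }
// objects, ordered from most to least preferred.
function pickUsableCmap(candidates, supported = SUPPORTED_CMAP_FORMATS) {
  const usable = candidates.find(c => supported.has(c.format));
  return usable === undefined ? null : usable;
}
```

With the two charmaps from this font, a preferred format 2 table would be skipped and the format 0 table returned instead.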

@jberkenbilt
Contributor Author

This file has only English text encoded in ASCII, though the font has a few characters that fall outside of ASCII. Reading up on format 2, it seems like a very odd choice. However, I will continue to study it.

@jberkenbilt
Contributor Author

I will be able to give you a pull request implementing support for format 2.

@jberkenbilt
Contributor Author

My changes are working on my test files. I need to create a test, which I will do tomorrow. I'll push up a draft pull request without tests for early review.

Snuffleupagus added a commit that referenced this issue Oct 13, 2021
Implement TrueType character map "format 2" (fixes #14117)
bh213 pushed a commit to bh213/pdf.js that referenced this issue Jun 3, 2022
If a PDF included an embedded TrueType font whose preferred character
map (cmap) was in "format 2", the code would select that character map
and then refuse to read it because of an unsupported format, thus
causing the characters not to be rendered. This commit implements
support for format 2 as described at the link below.

https://developer.apple.com/fonts/TrueType-Reference-Manual/RM06/Chap6cmap.html
rousek pushed a commit to signosoft/pdf.js that referenced this issue Aug 10, 2022