Skip to content

Implement TrueType character map "format 2" (fixes #14117) #14118

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Oct 13, 2021
Merged

Implement TrueType character map "format 2" (fixes #14117) #14118

merged 1 commit into from
Oct 13, 2021

Conversation

jberkenbilt
Copy link
Contributor

If a PDF included an embedded TrueType font whose preferred character
map (cmap) was in "format 2", the code would select that character map
and then refuse to read it because of an unsupported format, thus
causing the characters not to be rendered. This commit implements
support for format 2 as described at the link below.

https://developer.apple.com/fonts/TrueType-Reference-Manual/RM06/Chap6cmap.html

@jberkenbilt
Copy link
Contributor Author

@Snuffleupagus This lints cleanly and works on a small sample of my test files that I have exercised it with. Tomorrow I will endeavor to create an automated tests and will manually test with the remaining files from my sample of known broken files. Years of coding to the PDF spec turns out to be good preparation for implementing format 2 from the Apple spec you referenced. In addition to just loading up PDF files, I ran this in the debugger and verified that the character map as loaded matched what this little Python script produces. This script uses freetype. This is just FYI.

#!/usr/bin/env python3
import freetype
import sys
import os
import PIL.Image as Image

whoami = os.path.basename(sys.argv[0])

try:
    filename = sys.argv[1]
except IndexError:
    exit(f'Usage: {whoami} font-filename')

face = freetype.Face(filename)
for i, c in enumerate(face.charmaps):
    print(f'charmap {i}:', c.cmap_format)

print('asc/desc:', face.ascender, face.descender)
print('bbox:', face.bbox.xMin, face.bbox.yMin, face.bbox.xMax, face.bbox.yMax)
print('num_glyphs:', face.num_glyphs)
for i, c in enumerate(face.charmaps):
    print(f'charmap {i}: {c.encoding_name}')
face.set_char_size(4096)
for i in face.get_chars():
    char, glyph = i
    print(chr(char), hex(glyph).replace('0x', ''))
    face.load_glyph(glyph)
    bitmap = face.glyph.bitmap
    width = bitmap.width
    height = bitmap.rows
    try:
        img = Image.frombytes(
            "L", (width, height), bytes(bytearray(bitmap.buffer)))
        img.save(f'/tmp/z/glyph-{glyph}.png')
    except Exception as e:
        print(char, e)

Copy link
Collaborator

@Snuffleupagus Snuffleupagus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This lints cleanly and works on a small sample of my test files that I have exercised it with.

Did you manage to run the PDF.js test-suite successfully with these changes?
Please see https://github.com/mozilla/pdf.js/wiki/Contributing#4-run-lint-and-testing


Overall, this looks really good and seems to follow the specification nicely!
I've added a couple of comments, and please remember to squash the commits when addressing them.

@jberkenbilt
Copy link
Contributor Author

Did you manage to run the PDF.js test-suite successfully with these changes? Please see https://github.com/mozilla/pdf.js/wiki/Contributing#4-run-lint-and-testing

I did. There were some failures, but there were also some failures when I just straight up ran the test suite after running makeref. I will study closely to make sure there are no regressions. Of course, this is only adding code to a place that would have otherwise caused a failure, so regressions are highly unlikely. Famous last words though. I'll check.

@jberkenbilt
Copy link
Contributor Author

Comments addressed. I will retest, check the automated test results, squash commits, and remove draft status.

@jberkenbilt
Copy link
Contributor Author

Everything checks out for me locally. I'm just looking at automated tests now.

@Snuffleupagus
Copy link
Collaborator

/botio-linux preview

@pdfjsbot
Copy link

From: Bot.io (Linux m4)


Received

Command cmd_preview from @Snuffleupagus received. Current queue size: 0

Live output at: http://54.241.84.105:8877/3ecef91c370183e/output.txt

@pdfjsbot
Copy link

From: Bot.io (Linux m4)


Success

Full output at http://54.241.84.105:8877/3ecef91c370183e/output.txt

Total script time: 4.33 mins

Published

@Snuffleupagus
Copy link
Collaborator

I'm just looking at automated tests now.

Please note that we only run a subset of all tests with GitHub Actions (mostly the unit-test), and that the full test-suite needs to be triggered manually (by approved users).

/botio test

@pdfjsbot
Copy link

From: Bot.io (Linux m4)


Received

Command cmd_test from @Snuffleupagus received. Current queue size: 0

Live output at: http://54.241.84.105:8877/6b115664a3836d2/output.txt

@pdfjsbot
Copy link

From: Bot.io (Windows)


Received

Command cmd_test from @Snuffleupagus received. Current queue size: 0

Live output at: http://54.193.163.58:8877/7394175ad690d94/output.txt

@pdfjsbot
Copy link

From: Bot.io (Linux m4)


Failed

Full output at http://54.241.84.105:8877/6b115664a3836d2/output.txt

Total script time: 3.93 mins

  • Font tests: Passed
  • Unit tests: FAILED
  • Integration Tests: Passed
  • Regression tests: FAILED

Image differences available at: http://54.241.84.105:8877/6b115664a3836d2/reftest-analyzer.html#web=eq.log

@pdfjsbot
Copy link

From: Bot.io (Windows)


Failed

Full output at http://54.193.163.58:8877/7394175ad690d94/output.txt

Total script time: 5.45 mins

  • Font tests: Passed
  • Unit tests: FAILED
  • Integration Tests: Passed
  • Regression tests: FAILED

Image differences available at: http://54.193.163.58:8877/7394175ad690d94/reftest-analyzer.html#web=eq.log

If a PDF included an embedded TrueType font whose preferred character
map (cmap) was in "format 2", the code would select that character map
and then refuse to read it because of an unsupported format, thus
causing the characters not to be rendered. This commit implements
support for format 2 as described at the link below.

https://developer.apple.com/fonts/TrueType-Reference-Manual/RM06/Chap6cmap.html
@jberkenbilt jberkenbilt marked this pull request as ready for review October 13, 2021 11:56
@jberkenbilt
Copy link
Contributor Author

@Snuffleupagus This is good from my end. The image tests that fail are eq test that contain DCTDecode (JPEG) images. There shouldn't be eq tests with JPEG images. They will not render consistently. I used to try to use heuristics like downsampling and converting to bitonal with thresholding, but these are not reliable. My recommendation is that JPEG images in the test PDFs be replaced with images compressed with a non-lossy compression scheme whose behavior isn't dependent on floating point arithmetic. You can see some image test failures if you just check out master, run gulp makeref, and then gulp test immediately without making any changes.

In any case, my local test files are all good and unit tests are passing again, so unless you have anything else you need me to do, I'm ready to have this merged when you are.

What does your release process look like? If I need this fix for my local dev, do I need to patch it, or will a fix be released soon? Thanks.

@jberkenbilt
Copy link
Contributor Author

@Snuffleupagus BTW, I greatly appreciate the quick review and feedback. Also Mr. Snuffleupagus was always one of my favorite Sesame Street characters, and it was a great day indeed when Big Bird was vindicated about his existence. Sadly I was already an adult by then and missed the actual event.

@Snuffleupagus
Copy link
Collaborator

/botio test

@pdfjsbot
Copy link

From: Bot.io (Linux m4)


Received

Command cmd_test from @Snuffleupagus received. Current queue size: 0

Live output at: http://54.241.84.105:8877/2f02e54c869c4d3/output.txt

@pdfjsbot
Copy link

From: Bot.io (Windows)


Received

Command cmd_test from @Snuffleupagus received. Current queue size: 0

Live output at: http://54.193.163.58:8877/87e134431f7ed77/output.txt

@pdfjsbot
Copy link

From: Bot.io (Linux m4)


Failed

Full output at http://54.241.84.105:8877/2f02e54c869c4d3/output.txt

Total script time: 22.11 mins

  • Font tests: Passed
  • Unit tests: Passed
  • Integration Tests: Passed
  • Regression tests: FAILED
  errors: 604
  different ref/snapshot: 19

Image differences available at: http://54.241.84.105:8877/2f02e54c869c4d3/reftest-analyzer.html#web=eq.log

@Snuffleupagus
Copy link
Collaborator

/botio-linux browsertest

@pdfjsbot
Copy link

From: Bot.io (Linux m4)


Received

Command cmd_browsertest from @Snuffleupagus received. Current queue size: 0

Live output at: http://54.241.84.105:8877/59677b28d22d398/output.txt

@pdfjsbot
Copy link

From: Bot.io (Windows)


Failed

Full output at http://54.193.163.58:8877/87e134431f7ed77/output.txt

Total script time: 40.97 mins

  • Font tests: Passed
  • Unit tests: Passed
  • Integration Tests: Passed
  • Regression tests: FAILED
  different ref/snapshot: 6
  different first/second rendering: 1

Image differences available at: http://54.193.163.58:8877/87e134431f7ed77/reftest-analyzer.html#web=eq.log

@Snuffleupagus
Copy link
Collaborator

/botio-linux browsertest

@pdfjsbot
Copy link

From: Bot.io (Linux m4)


Received

Command cmd_browsertest from @Snuffleupagus received. Current queue size: 1

Live output at: http://54.241.84.105:8877/8e10a2fdc572b81/output.txt

@pdfjsbot
Copy link

From: Bot.io (Linux m4)


Failed

Full output at http://54.241.84.105:8877/59677b28d22d398/output.txt

Total script time: 17.88 mins

  • Regression tests: FAILED
  errors: 604
  different ref/snapshot: 18
  different first/second rendering: 1

Image differences available at: http://54.241.84.105:8877/59677b28d22d398/reftest-analyzer.html#web=eq.log

@pdfjsbot
Copy link

From: Bot.io (Linux m4)


Failed

Full output at http://54.241.84.105:8877/8e10a2fdc572b81/output.txt

Total script time: 19.52 mins

  • Regression tests: FAILED
  different ref/snapshot: 21
  different first/second rendering: 1

Image differences available at: http://54.241.84.105:8877/8e10a2fdc572b81/reftest-analyzer.html#web=eq.log

Copy link
Collaborator

@Snuffleupagus Snuffleupagus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

r=me, looks great; thank you for the patch!

@Snuffleupagus Snuffleupagus merged commit 29fd87b into mozilla:master Oct 13, 2021
@Snuffleupagus
Copy link
Collaborator

/botio makeref

@pdfjsbot
Copy link

From: Bot.io (Linux m4)


Received

Command cmd_makeref from @Snuffleupagus received. Current queue size: 0

Live output at: http://54.241.84.105:8877/1906d3c397232bb/output.txt

@pdfjsbot
Copy link

From: Bot.io (Windows)


Received

Command cmd_makeref from @Snuffleupagus received. Current queue size: 1

Live output at: http://54.193.163.58:8877/1eb54024c8791ec/output.txt

@Snuffleupagus
Copy link
Collaborator

The image tests that fail are eq test that contain DCTDecode (JPEG) images. There shouldn't be eq tests with JPEG images.

While there's unfortunately a number of intermittent failures, I don't think they have anything to do with JPEG images in particular. Note that all image formats are decoded/handled internally by the PDF.js library, see e.g. https://github.com/mozilla/pdf.js/blob/master/src/core/jpg.js, hence there's nothing inherently special about JPEG images here.

My recommendation is that JPEG images in the test PDFs be replaced with images compressed with a non-lossy compression scheme whose behavior isn't dependent on floating point arithmetic.

Sorry, but I don't understand how this would be appropriate/correct. First of all that'd render every test-case for the PDF.js built-in JPEG decoder useless, and secondly it wouldn't really work for any of the linked test-case.

What does your release process look like? If I need this fix for my local dev, do I need to patch it, or will a fix be released soon?

We don't have a set release schedule, unfortunately.
Given that the latest release was less than two weeks ago, see https://github.com/mozilla/pdf.js/releases, it's probably going to at least a month before the next official release (the last few releases have been approximately two months apart).

@jberkenbilt
Copy link
Contributor Author

Regarding intermittent image failures, you can just disregard my comments. I spoke prematurely and didn't think it through all the way. I was remember my own experiences with jpeg tests across platforms, but even that wouldn't apply to running it twice on the same machine back to back. So never mind about the images.

Thanks for the reply on the release schedule. I'll find a way to patch locally and will try to figure out a way to get automatically notified when the next release is out.

@jberkenbilt jberkenbilt deleted the cmap-format-2 branch October 13, 2021 13:33
@pdfjsbot
Copy link

From: Bot.io (Linux m4)


Success

Full output at http://54.241.84.105:8877/1906d3c397232bb/output.txt

Total script time: 20.15 mins

  • Lint: Passed
  • Make references: Passed
  • Check references: Passed

@pdfjsbot
Copy link

From: Bot.io (Windows)


Failed

Full output at http://54.193.163.58:8877/1eb54024c8791ec/output.txt

Total script time: 31.55 mins

  • Lint: Passed
  • Make references: FAILED

@Snuffleupagus
Copy link
Collaborator

/botio-windows makeref

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
3 participants