Represent cid chars using integers, not strings. #5111

Merged (1 commit) on Aug 5, 2014

nnethercote
Contributor

cid chars are 16-bit unsigned integers. Currently we convert them to
single-char strings when inserting them into the CMap, and then convert
them back to integers when extracting them from the CMap. This patch
changes CMap so that cid chars stay in integer format throughout, saving
both time and space.

When loading the PDF from issue #4580, this change reduces peak RSS from
~600 to ~370 MiB. It also improves overall speed on that PDF by ~26%,
going from 724 ms to 533 ms.
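
To illustrate the difference described above, here is a minimal sketch contrasting the two storage strategies. `StringCidMap` and `IntCidMap` are hypothetical, simplified stand-ins written for this example; they are not the actual pdf.js `CMap` code, and the `lookup` method name is an assumption.

```javascript
// Hypothetical, simplified stand-ins for the two storage strategies.
// Neither is the actual pdf.js CMap implementation.

// Old approach: each CID is stored as a single-character string.
function StringCidMap() {
  this._map = [];
}
StringCidMap.prototype.mapCidRange = function (low, high, dstLow) {
  while (low <= high) {
    this._map[low++] = String.fromCharCode(dstLow++);
  }
};
StringCidMap.prototype.lookup = function (charCode) {
  // Callers need an integer, so the string is converted back on every read.
  return this._map[charCode].charCodeAt(0);
};

// New approach: the CID stays an integer from insertion to lookup.
function IntCidMap() {
  this._map = [];
}
IntCidMap.prototype.mapCidRange = function (low, high, dstLow) {
  while (low <= high) {
    this._map[low++] = dstLow++;
  }
};
IntCidMap.prototype.lookup = function (charCode) {
  return this._map[charCode];
};

var oldMap = new StringCidMap();
var newMap = new IntCidMap();
oldMap.mapCidRange(0x20, 0x22, 500);
newMap.mapCidRange(0x20, 0x22, 500);
```

Both maps return the same CID for a given char code; the integer version simply skips the two conversions and avoids allocating a string per entry.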

@@ -210,18 +214,24 @@ var CMap = (function CMapClosure() {
this.numCodespaceRanges++;
},

mapRange: function(low, high, dstLow) {
mapCidRange: function(low, high, dstLow) {
var lastByte = dstLow.length - 1;
Collaborator

Nit: This line is now superfluous.

@nnethercote
Contributor Author

I removed the line of dead code.

@Snuffleupagus
Collaborator

/botio test

@pdfjsbot

pdfjsbot commented Aug 1, 2014

From: Bot.io (Windows)


Received

Command cmd_test from @Snuffleupagus received. Current queue size: 0

Live output at: http://107.22.172.223:8877/4d7720a8bf14435/output.txt

@pdfjsbot

pdfjsbot commented Aug 1, 2014

From: Bot.io (Linux)


Received

Command cmd_test from @Snuffleupagus received. Current queue size: 0

Live output at: http://107.21.233.14:8877/5aba5d3fdf2b161/output.txt

@pdfjsbot

pdfjsbot commented Aug 1, 2014

From: Bot.io (Windows)


Success

Full output at http://107.22.172.223:8877/4d7720a8bf14435/output.txt

Total script time: 19.75 mins

  • Font tests: Passed
  • Unit tests: Passed
  • Regression tests: Passed

@pdfjsbot

pdfjsbot commented Aug 1, 2014

From: Bot.io (Linux)


Failed

Full output at http://107.21.233.14:8877/5aba5d3fdf2b161/output.txt

Total script time: 23.02 mins

  • Font tests: Passed
  • Unit tests: Passed
  • Regression tests: FAILED

Image differences available at: http://107.21.233.14:8877/5aba5d3fdf2b161/reftest-analyzer.html#web=eq.log

@yurydelendik
Contributor

I wonder how the Linux ref image got fixed. Otherwise it looks good.

@nnethercote
Contributor Author

> I wonder how the Linux ref image got fixed.

I don't know. I didn't see that change on my Linux machine. In theory the visible behaviour should be unchanged.

@nnethercote
Contributor Author

> I didn't see that change on my Linux machine.

Oh, it's Chromium-only. I can reproduce the difference now. Seems like a clear improvement... :)

@nnethercote
Contributor Author

I've worked out what changed. It's related to the way the '.notdef' glyph is handled in one place. Here's the old code, from src/core/fonts.js:

          // If the font is actually a CID font then we should use the charset
          // to map CIDs to GIDs.
          for (glyphId = 0; glyphId < charsets.length; glyphId++) {
            var cidString = String.fromCharCode(charsets[glyphId]);
            var charCode = properties.cMap.charCodeOf(cidString);
            charCodeToGlyphId[charCode] = glyphId;
          }

When charsets[glyphId] is '.notdef', cidString is '\x00': String.fromCharCode converts its argument to an integer, '.notdef' converts to NaN, and NaN in turn yields '\x00'. We then call charCodeOf('\x00'), which gives 0, and so we set charCodeToGlyphId[0].
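
The coercion described above can be checked directly in any JavaScript engine. This is a small standalone sketch; the variable names are chosen for illustration only.

```javascript
// '.notdef' is not numeric, so converting it to a number yields NaN.
var asNumber = Number('.notdef');

// String.fromCharCode converts its argument to a 16-bit unsigned integer,
// and NaN becomes 0 in that conversion, giving the NUL character.
var cidString = String.fromCharCode('.notdef');
var isNul = cidString === '\x00';
```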

Now consider the new code.

          // If the font is actually a CID font then we should use the charset
          // to map CIDs to GIDs.
          for (glyphId = 0; glyphId < charsets.length; glyphId++) {
            var cid = charsets[glyphId];
            var charCode = properties.cMap.charCodeOf(cid);
            charCodeToGlyphId[charCode] = glyphId;
          }

charsets[glyphId] is '.notdef', so cid is too. We then do charCodeOf('.notdef'), which gives -1, and so we set charCodeToGlyphId[-1], which I guess is effectively the same as not setting anything, since the -1th element probably won't be subsequently read.
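
A short sketch of why the -1 case is harmless in practice. It assumes the reverse lookup behaves like Array.prototype.indexOf over an array of CIDs; the standalone `map` and `charCodeOf` below are hypothetical stand-ins, not the real CMap.

```javascript
// Assumption: entries live in an array indexed by char code, and
// charCodeOf() is a reverse lookup, sketched here with indexOf.
var map = [];
map[0] = 0;
map[65] = 34;
function charCodeOf(cid) {
  return map.indexOf(cid);
}

var charCodeToGlyphId = [];
var charCode = charCodeOf('.notdef'); // no entry equals '.notdef'
charCodeToGlyphId[charCode] = 0;

// The write created a property named "-1", not an array element,
// so the array is still empty as far as indexed reads are concerned.
var lengthAfterWrite = charCodeToGlyphId.length;
var entryForZero = charCodeToGlyphId[0];
```

Because a negative key is an ordinary property rather than an index, the array length stays at 0 and entry 0 remains unset, unlike in the old code.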

I haven't worked out why this causes different behaviour in Firefox and Chromium. But I can replicate the Chromium bug with my patch applied if I add the statement if (cid === '.notdef') { charCode = 0; } before the final assignment.

So my patch is clearly an improvement, but the handling of '.notdef' is still not good. One possibility is to check for a -1 return value from charCodeOf() in this loop. But there might be other misuses of '.notdef'... in which case fixing them in a follow-up PR might be appropriate, since it's well beyond the scope of this PR.
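
The first possibility mentioned above, checking for a -1 return value inside the loop, might look roughly like the following. This is only a sketch: `buildCharCodeToGlyphId` and `fakeCMap` are hypothetical names invented for this example, and nothing here is taken from the PR or from the follow-up issue.

```javascript
// Hypothetical helper, not code from the PR or from pdf.js.
function buildCharCodeToGlyphId(charsets, cMap) {
  var charCodeToGlyphId = [];
  for (var glyphId = 0; glyphId < charsets.length; glyphId++) {
    var cid = charsets[glyphId];
    var charCode = cMap.charCodeOf(cid);
    if (charCode === -1) {
      // No char code maps to this entry (for example '.notdef'); skip it
      // explicitly instead of writing to index -1.
      continue;
    }
    charCodeToGlyphId[charCode] = glyphId;
  }
  return charCodeToGlyphId;
}

// Stand-in cMap for illustration; charCodeOf is assumed to return -1
// when the CID is not present.
var fakeCMap = {
  _map: [0, 10, 20],
  charCodeOf: function (cid) {
    return this._map.indexOf(cid);
  }
};
var result = buildCharCodeToGlyphId(['.notdef', 10, 20], fakeCMap);
```

The visible behaviour matches the patched loop, but the skip is now explicit rather than relying on index -1 never being read.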

@yurydelendik, what do you think?

@yurydelendik
Contributor

That's good. I just wanted to know the reason. Thanks for checking this out. We just need to file a new issue for that.

@nnethercote
Contributor Author

I filed #5132 for the .notdef follow-up.

@yurydelendik
Contributor

/botio test

@pdfjsbot

pdfjsbot commented Aug 4, 2014

From: Bot.io (Windows)


Received

Command cmd_test from @yurydelendik received. Current queue size: 0

Live output at: http://107.22.172.223:8877/871855041bd5bca/output.txt

@pdfjsbot

pdfjsbot commented Aug 4, 2014

From: Bot.io (Linux)


Received

Command cmd_test from @yurydelendik received. Current queue size: 0

Live output at: http://107.21.233.14:8877/625a34a08028cb4/output.txt

@pdfjsbot

pdfjsbot commented Aug 4, 2014

From: Bot.io (Windows)


Success

Full output at http://107.22.172.223:8877/871855041bd5bca/output.txt

Total script time: 19.75 mins

  • Font tests: Passed
  • Unit tests: Passed
  • Regression tests: Passed

@pdfjsbot

pdfjsbot commented Aug 4, 2014

From: Bot.io (Linux)


Failed

Full output at http://107.21.233.14:8877/625a34a08028cb4/output.txt

Total script time: 22.91 mins

  • Font tests: Passed
  • Unit tests: Passed
  • Regression tests: FAILED

Image differences available at: http://107.21.233.14:8877/625a34a08028cb4/reftest-analyzer.html#web=eq.log

@yurydelendik
Contributor

/botio-linux makeref

@pdfjsbot

pdfjsbot commented Aug 5, 2014

From: Bot.io (Linux)


Received

Command cmd_makeref from @yurydelendik received. Current queue size: 0

Live output at: http://107.21.233.14:8877/84a5e71335580d0/output.txt

yurydelendik added a commit that referenced this pull request Aug 5, 2014
Represent cid chars using integers, not strings.
@yurydelendik yurydelendik merged commit 6865c28 into mozilla:master Aug 5, 2014
@yurydelendik
Contributor

Thank you for the patch.

@pdfjsbot

pdfjsbot commented Aug 5, 2014

From: Bot.io (Linux)


Success

Full output at http://107.21.233.14:8877/84a5e71335580d0/output.txt

Total script time: 22.72 mins

  • Lint: Passed
  • Make references: Passed
  • Check references: Passed

@nnethercote nnethercote deleted the better-cidchars branch August 6, 2014 00:40