Skip to content

Remove non-displayable chars from outline title (#14267) #14268

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Nov 13, 2021

Conversation

calixteman
Copy link
Contributor

Copy link
Collaborator

@Snuffleupagus Snuffleupagus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, but I don't think that we should implement it this way!

First of all, I'm not sure why it's a problem for the API itself to return such strings as-is (and this is potentially a breaking api-change).
Secondly, we're already ignoring some characters in the viewer instead; see

/**
* @private
*/
_normalizeTextContent(str) {
return removeNullCharacters(str) || /* en dash = */ "\u2013";
}

which uses

pdf.js/src/shared/util.js

Lines 569 to 580 in 7d6d3fc

const NullCharactersRegExp = /\x00/g;
/**
* @param {string} str
*/
function removeNullCharacters(str) {
if (typeof str !== "string") {
warn("The argument for removeNullCharacters must be a string.");
return str;
}
return str.replace(NullCharactersRegExp, "");
}

Hence my question is if it'd be generally safe to simply update removeNullCharacters to skip the [\x00-\x1F] range instead, or if that could cause issues at one of its call-sites?
For example what about e.g.

const urlNullRemoved = removeNullCharacters(url);
, although I suppose that none of the [\x00-\x1F]-chars would make sense in a link?

@calixteman
Copy link
Contributor Author

Sorry, but I don't think that we should implement it this way!

First of all, I'm not sure why it's a problem for the API itself to return such strings as-is (and this is potentially a breaking api-change).

Yep, you're right: I didn't think about the api.

, although I suppose that none of the [\x00-\x1F]-chars would make sense in a link?

I think it should be possible to have those chars in using some entities.
So I think we can extend removeNullChar to range [\x00-\x1F] but we shouldn't use it when creating the href but something based on encodeURIComponent. Wdyt ?

@Snuffleupagus
Copy link
Collaborator

Snuffleupagus commented Nov 12, 2021

How about limiting the scope of these changes to just the functionality needed for the viewer, since that's basically the only thing we have to fix here as far as I can tell, with e.g. something along these lines (although a better parameter name might not hurt):

diff --git a/src/shared/util.js b/src/shared/util.js
index ba94fec8b..188e97737 100644
--- a/src/shared/util.js
+++ b/src/shared/util.js
@@ -567,15 +567,20 @@ class AbortException extends BaseException {
 }
 
 const NullCharactersRegExp = /\x00/g;
+const InvisibleCharactersRegExp = /[\x00-\x1F]/g;
 
 /**
  * @param {string} str
+ * @param {boolean} [removeInvisible]
  */
-function removeNullCharacters(str) {
+function removeNullCharacters(str, removeInvisible = false) {
   if (typeof str !== "string") {
     warn("The argument for removeNullCharacters must be a string.");
     return str;
   }
+  if (removeInvisible) {
+    return str.replace(InvisibleCharactersRegExp, "");
+  }
   return str.replace(NullCharactersRegExp, "");
 }
 
diff --git a/web/base_tree_viewer.js b/web/base_tree_viewer.js
index 6e203aea8..2c1d7c92f 100644
--- a/web/base_tree_viewer.js
+++ b/web/base_tree_viewer.js
@@ -59,7 +59,10 @@ class BaseTreeViewer {
    * @private
    */
   _normalizeTextContent(str) {
-    return removeNullCharacters(str) || /* en dash = */ "\u2013";
+    return (
+      removeNullCharacters(str, /* removeInvisible = */ true) ||
+      /* en dash = */ "\u2013"
+    );
   }
 
   /**

That way we'd not change the default behaviour, and these viewer changes would apply not just to the OutlineView but also the AttachmentsView/LayersView as well.
Furthermore, we could easily test it by adding one (or two) new tests to the pre-existing ones in test/unit/util_spec.js.

@calixteman
Copy link
Contributor Author

I had a look on what acrobat and pdfium are doing here.
In pdfium every chars in [00-1F] is replaced by a white space (0x20):
https://pdfium.googlesource.com/pdfium/+/refs/heads/main/core/fpdfdoc/cpdf_bookmark.cpp#40
In acrobat it's a bit different:

  • 00: text after 00 is removed
  • 01-08: the char is not replaced
  • 09-1F: the char is replaced by a space (0x20)

@Snuffleupagus
Copy link
Collaborator

Based on #14268 (comment), I wonder if we shouldn't perhaps use a slightly modified and "combined" approach here:

  • Leave the \x00 case alone, in order to avoid introducing any regressions (e.g. with hyper-links).
  • For the \x01-\x1F range replace all of them with regular spaces, since I cannot imagine that rendering some of them (as Adobe Reader does) is helpful. This new replacement could then be limited to a new replaceInvisible-mode in the existing removeNullCharacters function (such that we don't have to make a lot of code-changes).

@calixteman
Copy link
Contributor Author

/botio unittest

@pdfjsbot
Copy link

From: Bot.io (Linux m4)


Received

Command cmd_unittest from @calixteman received. Current queue size: 0

Live output at: http://54.241.84.105:8877/ccf63129ff28450/output.txt

@pdfjsbot
Copy link

From: Bot.io (Windows)


Received

Command cmd_unittest from @calixteman received. Current queue size: 0

Live output at: http://54.193.163.58:8877/f98b3e69763516d/output.txt

@pdfjsbot
Copy link

From: Bot.io (Linux m4)


Success

Full output at http://54.241.84.105:8877/ccf63129ff28450/output.txt

Total script time: 3.03 mins

  • Unit Tests: Passed

@pdfjsbot
Copy link

From: Bot.io (Windows)


Success

Full output at http://54.193.163.58:8877/f98b3e69763516d/output.txt

Total script time: 6.82 mins

  • Unit Tests: Passed

if (typeof str !== "string") {
warn("The argument for removeNullCharacters must be a string.");
return str;
}
if (replaceInvisible) {
return str.replace(InvisibleCharactersRegExp, " ");
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My suggestion was to purposely leave the \x00-case intact, even in the new mode, since there could very well be regressions by changing that.
Not to mentioned that having the \x00-char treated differently depending on the value of the new parameter feels surprising.

In the PDF document that originally prompted us to add the removeNullCharacters-function, I believe that this will lead to a bunch of trailing spaces, which could also potentially affect formatting of the outline-titles.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't you think we should trim the outline title when rendering it ?

Comment on lines 190 to 192
const str = Array.from(Array(32).keys())
.map(x => String.fromCharCode(x))
.join("a");
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this string perhaps missing the last "a" character, in order for it to be consistently formatted? Currently this code gives:

\u0000a\u0001a\u0002a\u0003a\u0004a\u0005a\u0006a\u0007a\u0008a	a
a\u000ba\u000ca
a\u000ea\u000fa\u0010a\u0011a\u0012a\u0013a\u0014a\u0015a\u0016a\u0017a\u0018a\u0019a\u001aa\u001ba\u001ca\u001da\u001ea\u001f

whereas I believe that you probably want to use e.g.

const str = [...Array(32).keys()]
  .map(x => String.fromCharCode(x) + "a")
  .join("");

since that'd give you:

\u0000a\u0001a\u0002a\u0003a\u0004a\u0005a\u0006a\u0007a\u0008a	a
a\u000ba\u000ca
a\u000ea\u000fa\u0010a\u0011a\u0012a\u0013a\u0014a\u0015a\u0016a\u0017a\u0018a\u0019a\u001aa\u001ba\u001ca\u001da\u001ea\u001fa

Copy link
Collaborator

@Snuffleupagus Snuffleupagus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

r=me, with the last comments addressed; thank you!

Comment on lines 194 to 200
const expected =
"a" +
" "
.repeat(31)
.split("")
.map(x => x + "a")
.join("");
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given that it's the expected value, I cannot help wondering if simply spelling it out explicitly wouldn't actually be clearer and easier to read?

const expected =
  "a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a";

@@ -59,7 +59,11 @@ class BaseTreeViewer {
* @private
*/
_normalizeTextContent(str) {
return removeNullCharacters(str) || /* en dash = */ "\u2013";
// Chars in range [0x00-0x1F] will be replaced with a white space.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The char-range probably needs to be updated here.

 - it aims to fix mozilla#14267;
 - there is nothing about chars in range [0-1F] in the specs but acrobat doesn't display them in any way.
@calixteman
Copy link
Contributor Author

/botio unittest

@pdfjsbot
Copy link

From: Bot.io (Linux m4)


Received

Command cmd_unittest from @calixteman received. Current queue size: 0

Live output at: http://54.241.84.105:8877/5780b58353ce4b7/output.txt

@pdfjsbot
Copy link

From: Bot.io (Windows)


Received

Command cmd_unittest from @calixteman received. Current queue size: 0

Live output at: http://54.193.163.58:8877/a56a4aba850fcee/output.txt

@pdfjsbot
Copy link

From: Bot.io (Linux m4)


Success

Full output at http://54.241.84.105:8877/5780b58353ce4b7/output.txt

Total script time: 3.06 mins

  • Unit Tests: Passed

@pdfjsbot
Copy link

From: Bot.io (Windows)


Success

Full output at http://54.193.163.58:8877/a56a4aba850fcee/output.txt

Total script time: 6.12 mins

  • Unit Tests: Passed

@calixteman calixteman merged commit 85c6dd5 into mozilla:master Nov 13, 2021
@calixteman calixteman deleted the outline branch November 13, 2021 16:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Outline title can contain non-displayable chars
3 participants