Remove non-displayable chars from outline title (#14267) #14268

calixteman · 2021-11-12T18:48:16Z

it aims to fix Outline title can contain non-displayable chars #14267;
there is nothing about chars in range [0-1F] in the specs but acrobat doesn't display them in any way.

Snuffleupagus

Sorry, but I don't think that we should implement it this way!

First of all, I'm not sure why it's a problem for the API itself to return such strings as-is (and this is potentially a breaking api-change).
Secondly, we're already ignoring some characters in the viewer instead; see

pdf.js/web/base_tree_viewer.js

Lines 58 to 63 in 7d6d3fc

    
             /** 
        
              * @private 
        
              */ 
        
             _normalizeTextContent(str) { 
        
               return removeNullCharacters(str) || /* en dash = */ "\u2013"; 
        
             }

which uses

pdf.js/src/shared/util.js

Lines 569 to 580 in 7d6d3fc

    
           const NullCharactersRegExp = /\x00/g; 
        
           /** 
        
            * @param {string} str 
        
            */ 
        
           function removeNullCharacters(str) { 
        
             if (typeof str !== "string") { 
        
               warn("The argument for removeNullCharacters must be a string."); 
        
               return str; 
        
             } 
        
             return str.replace(NullCharactersRegExp, ""); 
        
           }

Hence my question is if it'd be generally safe to simply update removeNullCharacters to skip the [\x00-\x1F] range instead, or if that could cause issues at one of its call-sites?
For example what about e.g.

pdf.js/src/display/display_utils.js

Line 350 in 7d6d3fc

const urlNullRemoved = removeNullCharacters(url);

, although I suppose that none of the [\x00-\x1F]-chars would make sense in a link?

calixteman · 2021-11-12T19:42:46Z

Sorry, but I don't think that we should implement it this way!

First of all, I'm not sure why it's a problem for the API itself to return such strings as-is (and this is potentially a breaking api-change).

Yep, you're right: I didn't think about the api.

, although I suppose that none of the [\x00-\x1F]-chars would make sense in a link?

I think it should be possible to have those chars in using some entities.
So I think we can extend removeNullChar to range [\x00-\x1F] but we shouldn't use it when creating the href but something based on encodeURIComponent. Wdyt ?

Snuffleupagus · 2021-11-12T21:00:13Z

How about limiting the scope of these changes to just the functionality needed for the viewer, since that's basically the only thing we have to fix here as far as I can tell, with e.g. something along these lines (although a better parameter name might not hurt):

diff --git a/src/shared/util.js b/src/shared/util.js
index ba94fec8b..188e97737 100644
--- a/src/shared/util.js
+++ b/src/shared/util.js
@@ -567,15 +567,20 @@ class AbortException extends BaseException {
 }
 
 const NullCharactersRegExp = /\x00/g;
+const InvisibleCharactersRegExp = /[\x00-\x1F]/g;
 
 /**
  * @param {string} str
+ * @param {boolean} [removeInvisible]
  */
-function removeNullCharacters(str) {
+function removeNullCharacters(str, removeInvisible = false) {
   if (typeof str !== "string") {
     warn("The argument for removeNullCharacters must be a string.");
     return str;
   }
+  if (removeInvisible) {
+    return str.replace(InvisibleCharactersRegExp, "");
+  }
   return str.replace(NullCharactersRegExp, "");
 }
 
diff --git a/web/base_tree_viewer.js b/web/base_tree_viewer.js
index 6e203aea8..2c1d7c92f 100644
--- a/web/base_tree_viewer.js
+++ b/web/base_tree_viewer.js
@@ -59,7 +59,10 @@ class BaseTreeViewer {
    * @private
    */
   _normalizeTextContent(str) {
-    return removeNullCharacters(str) || /* en dash = */ "\u2013";
+    return (
+      removeNullCharacters(str, /* removeInvisible = */ true) ||
+      /* en dash = */ "\u2013"
+    );
   }
 
   /**

That way we'd not change the default behaviour, and these viewer changes would apply not just to the OutlineView but also the AttachmentsView/LayersView as well.
Furthermore, we could easily test it by adding one (or two) new tests to the pre-existing ones in test/unit/util_spec.js.

calixteman · 2021-11-13T11:40:22Z

I had a look on what acrobat and pdfium are doing here.
In pdfium every chars in [00-1F] is replaced by a white space (0x20):
https://pdfium.googlesource.com/pdfium/+/refs/heads/main/core/fpdfdoc/cpdf_bookmark.cpp#40
In acrobat it's a bit different:

00: text after 00 is removed
01-08: the char is not replaced
09-1F: the char is replaced by a space (0x20)

Snuffleupagus · 2021-11-13T12:38:27Z

Based on #14268 (comment), I wonder if we shouldn't perhaps use a slightly modified and "combined" approach here:

Leave the \x00 case alone, in order to avoid introducing any regressions (e.g. with hyper-links).
For the \x01-\x1F range replace all of them with regular spaces, since I cannot imagine that rendering some of them (as Adobe Reader does) is helpful. This new replacement could then be limited to a new replaceInvisible-mode in the existing removeNullCharacters function (such that we don't have to make a lot of code-changes).

calixteman · 2021-11-13T13:38:24Z

/botio unittest

pdfjsbot · 2021-11-13T13:38:25Z

From: Bot.io (Linux m4)

Received

Command cmd_unittest from @calixteman received. Current queue size: 0

Live output at: http://54.241.84.105:8877/ccf63129ff28450/output.txt

pdfjsbot · 2021-11-13T13:38:26Z

From: Bot.io (Windows)

Received

Command cmd_unittest from @calixteman received. Current queue size: 0

Live output at: http://54.193.163.58:8877/f98b3e69763516d/output.txt

pdfjsbot · 2021-11-13T13:41:28Z

From: Bot.io (Linux m4)

Success

Full output at http://54.241.84.105:8877/ccf63129ff28450/output.txt

Total script time: 3.03 mins

Unit Tests: Passed

pdfjsbot · 2021-11-13T13:45:16Z

From: Bot.io (Windows)

Success

Full output at http://54.193.163.58:8877/f98b3e69763516d/output.txt

Total script time: 6.82 mins

Unit Tests: Passed

Snuffleupagus · 2021-11-13T14:34:42Z

src/shared/util.js

  if (typeof str !== "string") {
    warn("The argument for removeNullCharacters must be a string.");
    return str;
  }
+  if (replaceInvisible) {
+    return str.replace(InvisibleCharactersRegExp, " ");


My suggestion was to purposely leave the \x00-case intact, even in the new mode, since there could very well be regressions by changing that.
Not to mentioned that having the \x00-char treated differently depending on the value of the new parameter feels surprising.

In the PDF document that originally prompted us to add the removeNullCharacters-function, I believe that this will lead to a bunch of trailing spaces, which could also potentially affect formatting of the outline-titles.

Don't you think we should trim the outline title when rendering it ?

test/unit/api_spec.js

Snuffleupagus · 2021-11-13T14:48:39Z

test/unit/util_spec.js

+      const str = Array.from(Array(32).keys())
+        .map(x => String.fromCharCode(x))
+        .join("a");


Is this string perhaps missing the last "a" character, in order for it to be consistently formatted? Currently this code gives:

\u0000a\u0001a\u0002a\u0003a\u0004a\u0005a\u0006a\u0007a\u0008a a a\u000ba\u000ca a\u000ea\u000fa\u0010a\u0011a\u0012a\u0013a\u0014a\u0015a\u0016a\u0017a\u0018a\u0019a\u001aa\u001ba\u001ca\u001da\u001ea\u001f

whereas I believe that you probably want to use e.g.

const str = [...Array(32).keys()] .map(x => String.fromCharCode(x) + "a") .join("");

since that'd give you:

\u0000a\u0001a\u0002a\u0003a\u0004a\u0005a\u0006a\u0007a\u0008a a a\u000ba\u000ca a\u000ea\u000fa\u0010a\u0011a\u0012a\u0013a\u0014a\u0015a\u0016a\u0017a\u0018a\u0019a\u001aa\u001ba\u001ca\u001da\u001ea\u001fa

Snuffleupagus

r=me, with the last comments addressed; thank you!

test/unit/api_spec.js

Snuffleupagus · 2021-11-13T15:42:16Z

test/unit/util_spec.js

+      const expected =
+        "a" +
+        " "
+          .repeat(31)
+          .split("")
+          .map(x => x + "a")
+          .join("");


Given that it's the expected value, I cannot help wondering if simply spelling it out explicitly wouldn't actually be clearer and easier to read?

const expected = "a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a";

Snuffleupagus · 2021-11-13T15:44:05Z

web/base_tree_viewer.js

@@ -59,7 +59,11 @@ class BaseTreeViewer {
   * @private
   */
  _normalizeTextContent(str) {
-    return removeNullCharacters(str) || /* en dash = */ "\u2013";
+    // Chars in range [0x00-0x1F] will be replaced with a white space.


The char-range probably needs to be updated here.

- it aims to fix mozilla#14267; - there is nothing about chars in range [0-1F] in the specs but acrobat doesn't display them in any way.

calixteman · 2021-11-13T16:02:40Z

/botio unittest

pdfjsbot · 2021-11-13T16:02:41Z

From: Bot.io (Linux m4)

Received

Command cmd_unittest from @calixteman received. Current queue size: 0

Live output at: http://54.241.84.105:8877/5780b58353ce4b7/output.txt

pdfjsbot · 2021-11-13T16:02:41Z

From: Bot.io (Windows)

Received

Command cmd_unittest from @calixteman received. Current queue size: 0

Live output at: http://54.193.163.58:8877/a56a4aba850fcee/output.txt

pdfjsbot · 2021-11-13T16:05:46Z

From: Bot.io (Linux m4)

Success

Full output at http://54.241.84.105:8877/5780b58353ce4b7/output.txt

Total script time: 3.06 mins

Unit Tests: Passed

pdfjsbot · 2021-11-13T16:08:49Z

From: Bot.io (Windows)

Success

Full output at http://54.193.163.58:8877/a56a4aba850fcee/output.txt

Total script time: 6.12 mins

Unit Tests: Passed

calixteman requested a review from Snuffleupagus November 12, 2021 18:48

Snuffleupagus requested changes Nov 12, 2021

View reviewed changes

calixteman force-pushed the outline branch from 6fbd318 to 7467fc7 Compare November 13, 2021 13:36

Snuffleupagus requested changes Nov 13, 2021

View reviewed changes

calixteman force-pushed the outline branch from 7467fc7 to 82b112f Compare November 13, 2021 15:28

Snuffleupagus approved these changes Nov 13, 2021

View reviewed changes

Remove non-displayable chars from outline title (mozilla#14267)

7041c62

- it aims to fix mozilla#14267; - there is nothing about chars in range [0-1F] in the specs but acrobat doesn't display them in any way.

calixteman force-pushed the outline branch from 82b112f to 7041c62 Compare November 13, 2021 15:56

calixteman merged commit 85c6dd5 into mozilla:master Nov 13, 2021

calixteman deleted the outline branch November 13, 2021 16:13

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Remove non-displayable chars from outline title (#14267) #14268

Remove non-displayable chars from outline title (#14267) #14268

calixteman commented Nov 12, 2021

Snuffleupagus left a comment

calixteman commented Nov 12, 2021

Snuffleupagus commented Nov 12, 2021 •

edited

Loading

calixteman commented Nov 13, 2021

Snuffleupagus commented Nov 13, 2021

calixteman commented Nov 13, 2021

pdfjsbot commented Nov 13, 2021

pdfjsbot commented Nov 13, 2021

pdfjsbot commented Nov 13, 2021

pdfjsbot commented Nov 13, 2021

Snuffleupagus Nov 13, 2021

calixteman Nov 13, 2021

Snuffleupagus Nov 13, 2021

Snuffleupagus left a comment

Snuffleupagus Nov 13, 2021

Snuffleupagus Nov 13, 2021

calixteman commented Nov 13, 2021

pdfjsbot commented Nov 13, 2021

pdfjsbot commented Nov 13, 2021

pdfjsbot commented Nov 13, 2021

pdfjsbot commented Nov 13, 2021

	/**
	* @private
	*/
	_normalizeTextContent(str) {
	return removeNullCharacters(str) \|\| /* en dash = */ "\u2013";
	}

	const NullCharactersRegExp = /\x00/g;

	/**
	* @param {string} str
	*/
	function removeNullCharacters(str) {
	if (typeof str !== "string") {
	warn("The argument for removeNullCharacters must be a string.");
	return str;
	}
	return str.replace(NullCharactersRegExp, "");
	}

Remove non-displayable chars from outline title (#14267) #14268

Remove non-displayable chars from outline title (#14267) #14268

Conversation

calixteman commented Nov 12, 2021

Snuffleupagus left a comment

Choose a reason for hiding this comment

calixteman commented Nov 12, 2021

Snuffleupagus commented Nov 12, 2021 • edited Loading

calixteman commented Nov 13, 2021

Snuffleupagus commented Nov 13, 2021

calixteman commented Nov 13, 2021

pdfjsbot commented Nov 13, 2021

From: Bot.io (Linux m4)

Received

pdfjsbot commented Nov 13, 2021

From: Bot.io (Windows)

Received

pdfjsbot commented Nov 13, 2021

From: Bot.io (Linux m4)

Success

pdfjsbot commented Nov 13, 2021

From: Bot.io (Windows)

Success

Snuffleupagus Nov 13, 2021

Choose a reason for hiding this comment

calixteman Nov 13, 2021

Choose a reason for hiding this comment

Snuffleupagus Nov 13, 2021

Choose a reason for hiding this comment

Snuffleupagus left a comment

Choose a reason for hiding this comment

Snuffleupagus Nov 13, 2021

Choose a reason for hiding this comment

Snuffleupagus Nov 13, 2021

Choose a reason for hiding this comment

calixteman commented Nov 13, 2021

pdfjsbot commented Nov 13, 2021

From: Bot.io (Linux m4)

Received

pdfjsbot commented Nov 13, 2021

From: Bot.io (Windows)

Received

pdfjsbot commented Nov 13, 2021

From: Bot.io (Linux m4)

Success

pdfjsbot commented Nov 13, 2021

From: Bot.io (Windows)

Success

Snuffleupagus commented Nov 12, 2021 •

edited

Loading