-
Notifications
You must be signed in to change notification settings - Fork 10.3k
Remove non-displayable chars from outline title (#14267) #14268
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
calixteman
commented
Nov 12, 2021
- it aims to fix Outline title can contain non-displayable chars #14267;
- there is nothing about chars in range [0-1F] in the specs but acrobat doesn't display them in any way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, but I don't think that we should implement it this way!
First of all, I'm not sure why it's a problem for the API itself to return such strings as-is (and this is potentially a breaking api-change).
Secondly, we're already ignoring some characters in the viewer instead; see
pdf.js/web/base_tree_viewer.js
Lines 58 to 63 in 7d6d3fc
/** | |
* @private | |
*/ | |
_normalizeTextContent(str) { | |
return removeNullCharacters(str) || /* en dash = */ "\u2013"; | |
} |
which uses
Lines 569 to 580 in 7d6d3fc
const NullCharactersRegExp = /\x00/g; | |
/** | |
* @param {string} str | |
*/ | |
function removeNullCharacters(str) { | |
if (typeof str !== "string") { | |
warn("The argument for removeNullCharacters must be a string."); | |
return str; | |
} | |
return str.replace(NullCharactersRegExp, ""); | |
} |
Hence my question is if it'd be generally safe to simply update removeNullCharacters
to skip the [\x00-\x1F]
range instead, or if that could cause issues at one of its call-sites?
For example what about e.g.
pdf.js/src/display/display_utils.js
Line 350 in 7d6d3fc
const urlNullRemoved = removeNullCharacters(url); |
[\x00-\x1F]
-chars would make sense in a link?
Yep, you're right: I didn't think about the api.
I think it should be possible to have those chars in using some entities. |
How about limiting the scope of these changes to just the functionality needed for the viewer, since that's basically the only thing we have to fix here as far as I can tell, with e.g. something along these lines (although a better parameter name might not hurt): diff --git a/src/shared/util.js b/src/shared/util.js
index ba94fec8b..188e97737 100644
--- a/src/shared/util.js
+++ b/src/shared/util.js
@@ -567,15 +567,20 @@ class AbortException extends BaseException {
}
const NullCharactersRegExp = /\x00/g;
+const InvisibleCharactersRegExp = /[\x00-\x1F]/g;
/**
* @param {string} str
+ * @param {boolean} [removeInvisible]
*/
-function removeNullCharacters(str) {
+function removeNullCharacters(str, removeInvisible = false) {
if (typeof str !== "string") {
warn("The argument for removeNullCharacters must be a string.");
return str;
}
+ if (removeInvisible) {
+ return str.replace(InvisibleCharactersRegExp, "");
+ }
return str.replace(NullCharactersRegExp, "");
}
diff --git a/web/base_tree_viewer.js b/web/base_tree_viewer.js
index 6e203aea8..2c1d7c92f 100644
--- a/web/base_tree_viewer.js
+++ b/web/base_tree_viewer.js
@@ -59,7 +59,10 @@ class BaseTreeViewer {
* @private
*/
_normalizeTextContent(str) {
- return removeNullCharacters(str) || /* en dash = */ "\u2013";
+ return (
+ removeNullCharacters(str, /* removeInvisible = */ true) ||
+ /* en dash = */ "\u2013"
+ );
}
/** That way we'd not change the default behaviour, and these viewer changes would apply not just to the OutlineView but also the AttachmentsView/LayersView as well. |
I had a look on what acrobat and pdfium are doing here.
|
Based on #14268 (comment), I wonder if we shouldn't perhaps use a slightly modified and "combined" approach here:
|
/botio unittest |
From: Bot.io (Linux m4)ReceivedCommand cmd_unittest from @calixteman received. Current queue size: 0 Live output at: http://54.241.84.105:8877/ccf63129ff28450/output.txt |
From: Bot.io (Windows)ReceivedCommand cmd_unittest from @calixteman received. Current queue size: 0 Live output at: http://54.193.163.58:8877/f98b3e69763516d/output.txt |
From: Bot.io (Linux m4)SuccessFull output at http://54.241.84.105:8877/ccf63129ff28450/output.txt Total script time: 3.03 mins
|
From: Bot.io (Windows)SuccessFull output at http://54.193.163.58:8877/f98b3e69763516d/output.txt Total script time: 6.82 mins
|
src/shared/util.js
Outdated
if (typeof str !== "string") { | ||
warn("The argument for removeNullCharacters must be a string."); | ||
return str; | ||
} | ||
if (replaceInvisible) { | ||
return str.replace(InvisibleCharactersRegExp, " "); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My suggestion was to purposely leave the \x00
-case intact, even in the new mode, since there could very well be regressions by changing that.
Not to mentioned that having the \x00
-char treated differently depending on the value of the new parameter feels surprising.
In the PDF document that originally prompted us to add the removeNullCharacters
-function, I believe that this will lead to a bunch of trailing spaces, which could also potentially affect formatting of the outline-titles.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't you think we should trim the outline title when rendering it ?
test/unit/util_spec.js
Outdated
const str = Array.from(Array(32).keys()) | ||
.map(x => String.fromCharCode(x)) | ||
.join("a"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this string perhaps missing the last "a" character, in order for it to be consistently formatted? Currently this code gives:
\u0000a\u0001a\u0002a\u0003a\u0004a\u0005a\u0006a\u0007a\u0008a a
a\u000ba\u000ca
a\u000ea\u000fa\u0010a\u0011a\u0012a\u0013a\u0014a\u0015a\u0016a\u0017a\u0018a\u0019a\u001aa\u001ba\u001ca\u001da\u001ea\u001f
whereas I believe that you probably want to use e.g.
const str = [...Array(32).keys()]
.map(x => String.fromCharCode(x) + "a")
.join("");
since that'd give you:
\u0000a\u0001a\u0002a\u0003a\u0004a\u0005a\u0006a\u0007a\u0008a a
a\u000ba\u000ca
a\u000ea\u000fa\u0010a\u0011a\u0012a\u0013a\u0014a\u0015a\u0016a\u0017a\u0018a\u0019a\u001aa\u001ba\u001ca\u001da\u001ea\u001fa
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
r=me, with the last comments addressed; thank you!
test/unit/util_spec.js
Outdated
const expected = | ||
"a" + | ||
" " | ||
.repeat(31) | ||
.split("") | ||
.map(x => x + "a") | ||
.join(""); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given that it's the expected value, I cannot help wondering if simply spelling it out explicitly wouldn't actually be clearer and easier to read?
const expected =
"a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a";
web/base_tree_viewer.js
Outdated
@@ -59,7 +59,11 @@ class BaseTreeViewer { | |||
* @private | |||
*/ | |||
_normalizeTextContent(str) { | |||
return removeNullCharacters(str) || /* en dash = */ "\u2013"; | |||
// Chars in range [0x00-0x1F] will be replaced with a white space. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The char-range probably needs to be updated here.
- it aims to fix mozilla#14267; - there is nothing about chars in range [0-1F] in the specs but acrobat doesn't display them in any way.
/botio unittest |
From: Bot.io (Linux m4)ReceivedCommand cmd_unittest from @calixteman received. Current queue size: 0 Live output at: http://54.241.84.105:8877/5780b58353ce4b7/output.txt |
From: Bot.io (Windows)ReceivedCommand cmd_unittest from @calixteman received. Current queue size: 0 Live output at: http://54.193.163.58:8877/a56a4aba850fcee/output.txt |
From: Bot.io (Linux m4)SuccessFull output at http://54.241.84.105:8877/5780b58353ce4b7/output.txt Total script time: 3.06 mins
|
From: Bot.io (Windows)SuccessFull output at http://54.193.163.58:8877/a56a4aba850fcee/output.txt Total script time: 6.12 mins
|