Use ActualText when getting the text for the text layer #20014
/botio test
From: Bot.io (Linux m4)
Received
Command cmd_test from @calixteman received. Current queue size: 0
Live output at: http://54.241.84.105:8877/b97b27b230d6a57/output.txt
From: Bot.io (Windows)
Received
Command cmd_test from @calixteman received. Current queue size: 0
Live output at: http://54.193.163.58:8877/1161b6cc8eacaff/output.txt
From: Bot.io (Linux m4)
Failed
Full output at http://54.241.84.105:8877/b97b27b230d6a57/output.txt
Total script time: 32.07 mins
Image differences available at: http://54.241.84.105:8877/b97b27b230d6a57/reftest-analyzer.html#web=eq.log
From: Bot.io (Windows)
Failed
Full output at http://54.193.163.58:8877/1161b6cc8eacaff/output.txt
Total script time: 58.70 mins
Image differences available at: http://54.193.163.58:8877/1161b6cc8eacaff/reftest-analyzer.html#web=eq.log
Can the args[1]?.get("ActualText") be exposed in the getOperatorList result as well? E.g. something like this:
args = [
  args[0].name,
  args[1] instanceof Dict ? args[1].get("MCID") : null,
  args[1] instanceof Dict ? args[1].get("ActualText") : null // <--- extra arg
];
in lines 2300 to 2303 in d2a6638:
args = [
  args[0].name,
  args[1] instanceof Dict ? args[1].get("MCID") : null,
];
Not sure whether it's a breaking change, but it's crucial for reconstructing content (e.g. SVG) from the results of getOperatorList() when not using getTextContent().
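To make the request concrete, here is a minimal sketch of how a consumer might read such an argument while walking an operator list. It assumes the { fnArray, argsArray } shape returned by getOperatorList() and the [tag, MCID] arguments shown above. The third actualText argument is only the proposal from this comment, not something pdf.js is confirmed to emit, and collectMarkedContent and the OPS stand-in are hypothetical names.

```javascript
// Hypothetical helper: list marked-content sections found in an operator list.
// `opList` is assumed to have the { fnArray, argsArray } shape, and `OPS` maps
// operator names to numeric codes (in real code, take it from pdf.js itself).
function collectMarkedContent(opList, OPS) {
  const sections = [];
  for (let i = 0; i < opList.fnArray.length; i++) {
    if (opList.fnArray[i] !== OPS.beginMarkedContentProps) {
      continue;
    }
    // Assumed argument layout: [tag, MCID, actualText]. The third entry is the
    // proposed extra argument and is treated as optional here.
    const [tag, mcid = null, actualText = null] = opList.argsArray[i] ?? [];
    sections.push({ index: i, tag, mcid, actualText });
  }
  return sections;
}

// Stand-in data for illustration only; the numeric codes are made up.
const OPS = { beginMarkedContentProps: 1, showText: 2, endMarkedContent: 3 };
const opList = {
  fnArray: [1, 2, 3],
  argsArray: [["Span", null, "fi"], [[]], []],
};
console.log(collectMarkedContent(opList, OPS));
```

With an argument like this, a PDF-to-SVG converter could keep text and shapes in operator order without a second pass over getTextContent().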
Could you file a bug and explain why such a feature would be useful?
Could it help to fix an existing issue in the current viewer?
@calixteman OK, I'll open another ticket for it. I don't think it's related to the current issue with the viewer.
Actually, I opened the original ticket because I got wrong text from getOperatorList(). The viewer is also affected, so I used it in the ticket because that is easier to reproduce than a code snippet.
I was building a PDF -> SVG conversion tool with getOperatorList(). I found getTextContent() not useful for this: it only extracts text, the shape info can only be obtained from getOperatorList(), and there's no easy way to interleave the text and shapes back into the correct order from the results of both functions. So I ditched getTextContent() and use only getOperatorList(), including for the text.
/botio-linux preview
From: Bot.io (Linux m4)
Received
Command cmd_preview from @timvandermeij received. Current queue size: 0
Live output at: http://54.241.84.105:8877/ab0c7336a1331e6/output.txt
From: Bot.io (Linux m4)
Success
Full output at http://54.241.84.105:8877/ab0c7336a1331e6/output.txt
Total script time: 1.14 mins
Published
Looks good to me, with the comments addressed and passing tests.
@@ -3446,6 +3457,11 @@ class PartialEvaluator {
    return;
  case OPS.beginMarkedContent:
    flushTextContentItem();
    if (args[0]?.name === "Span") {
      textContentItem.span = stringToPDFString(
        args[1]?.get("ActualText") || ""
Does this PR also fix #12237 perhaps?
This would be a start at fixing that issue. This is the first step: getting this /ActualText into the text content. That issue is asking for src/display/text_layer.js:#processItems to draw spans containing this actual text in the right places. That will also mean accumulating the text drawing that would have been done (to know the bounds of the glyphs that will be drawn) so that the bounds of the span can be calculated.
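The bounds accumulation described here could be sketched roughly as follows. This is only an illustration, assuming each drawn glyph can be reduced to an axis-aligned box [x0, y0, x1, y1]. The unionBounds helper and the sample boxes are hypothetical and are not taken from text_layer.js.

```javascript
// Hypothetical helper: merge per-glyph boxes into one bounding box for a span.
// Each box is assumed to be [x0, y0, x1, y1] with x0 <= x1 and y0 <= y1.
function unionBounds(boxes) {
  if (boxes.length === 0) {
    return null; // nothing was drawn, so the span has no bounds
  }
  let [minX, minY, maxX, maxY] = boxes[0];
  for (const [x0, y0, x1, y1] of boxes.slice(1)) {
    minX = Math.min(minX, x0);
    minY = Math.min(minY, y0);
    maxX = Math.max(maxX, x1);
    maxY = Math.max(maxY, y1);
  }
  return [minX, minY, maxX, maxY];
}

// Made-up boxes for the glyphs that an /ActualText span would replace.
const glyphBoxes = [
  [10, 20, 16, 30],
  [16, 18, 22, 30],
  [22, 20, 27, 31],
];
console.log(unionBounds(glyphBoxes)); // [ 10, 18, 27, 31 ]
```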
@@ -3923,6 +3923,20 @@ Caron Broadcasting, Inc., an Ohio corporation (“Lessee”).`)
    expect(items[1].fontName).not.toEqual(items[0].fontName);
  });

  it("get the text a content stream containing some ActualText", async function () {
    const loadingTask = getDocument(buildGetDocumentParams("issue20007.pdf"));
I don't really know why, but the unit test failure suggests that this can't be loaded:
TEST-UNEXPECTED-FAIL | get the text a content stream containing some ActualText | in firefox | ResponseException: Unexpected server response (404) while retrieving PDF "http://127.0.0.1:38175/test/pdfs/issue20007.pdf". in http://127.0.0.1:38175/src/shared/util.js (line 501)
Moreover, is the movement in the reference test expected?
if (args[0]?.name === "Span") {
  textContentItem.span = stringToPDFString(
    args[1]?.get("ActualText") || ""
  );
}
I'm not sure what this addition does. This is for a BMC, which is just a tag, so there never is an args[1]? beginMarkedContentProps (below) is for BDC, which is a tag and dictionary.
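The distinction raised here can be illustrated with a small sketch, assuming (as the comment says) that BMC carries only a tag while BDC carries a tag plus a property dictionary. readActualText is a hypothetical helper, a plain Map stands in for the pdf.js Dict, and the stringToPDFString decoding step from the patch is left out.

```javascript
// Hypothetical helper: return the /ActualText of a marked-content operator,
// or null when the operator cannot carry one.
function readActualText(operator, args) {
  if (operator !== "BDC") {
    return null; // BMC has only a tag operand, so there is no args[1] to read
  }
  const [tag, props] = args;
  if (tag?.name !== "Span" || typeof props?.get !== "function") {
    return null; // not a Span, or the second operand is not dictionary-like
  }
  return props.get("ActualText") ?? null;
}

// Stand-ins for illustration: a Map plays the role of the property dictionary.
const spanTag = { name: "Span" };
const spanProps = new Map([["ActualText", "fi"]]);

console.log(readActualText("BMC", [spanTag])); // null
console.log(readActualText("BDC", [spanTag, spanProps])); // fi
```

Under that assumption, a lookup placed in the BMC branch would always see an undefined args[1], which is the concern raised in the comment.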
No description provided.