Use ActualText when getting the text for the text layer #20014

Open
wants to merge 1 commit into
base: master
from

Conversation

calixteman
Contributor

No description provided.

@calixteman calixteman linked an issue Jun 16, 2025 that may be closed by this pull request
@calixteman
Contributor Author

/botio test

@moz-tools-bot
Collaborator

From: Bot.io (Linux m4)


Received

Command cmd_test from @calixteman received. Current queue size: 0

Live output at: http://54.241.84.105:8877/b97b27b230d6a57/output.txt

@moz-tools-bot
Collaborator

From: Bot.io (Windows)


Received

Command cmd_test from @calixteman received. Current queue size: 0

Live output at: http://54.193.163.58:8877/1161b6cc8eacaff/output.txt

@moz-tools-bot
Collaborator

From: Bot.io (Linux m4)


Failed

Full output at http://54.241.84.105:8877/b97b27b230d6a57/output.txt

Total script time: 32.07 mins

  • Unit tests: FAILED
  • Integration Tests: FAILED
  • Regression tests: FAILED
  different ref/snapshot: 1

Image differences available at: http://54.241.84.105:8877/b97b27b230d6a57/reftest-analyzer.html#web=eq.log

@moz-tools-bot
Collaborator

From: Bot.io (Windows)


Failed

Full output at http://54.193.163.58:8877/1161b6cc8eacaff/output.txt

Total script time: 58.70 mins

  • Unit tests: FAILED
  • Integration Tests: FAILED
  • Regression tests: FAILED
  different ref/snapshot: 1

Image differences available at: http://54.193.163.58:8877/1161b6cc8eacaff/reftest-analyzer.html#web=eq.log

Could args[1]?.get("ActualText") be exposed in the getOperatorList() result as well?

For example, something like this:

args = [
    args[0].name,
    args[1] instanceof Dict ? args[1].get("MCID") : null,
    args[1] instanceof Dict ? args[1].get("ActualText") : null  // <--- extra arg
];

in

pdf.js/src/core/evaluator.js

Lines 2300 to 2303 in d2a6638

args = [
  args[0].name,
  args[1] instanceof Dict ? args[1].get("MCID") : null,
];

I'm not sure whether it's a breaking change, but it's crucial for reconstructing content (e.g. SVG) from the results of getOperatorList() when not using getTextContent().
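To illustrate what a consumer could do if the extra argument were exposed, here is a minimal sketch. It assumes the operator list has the `{ fnArray, argsArray }` shape returned by getOperatorList(), and that the third argument carries the ActualText string as proposed in this comment; that third argument does not exist today. `collectActualText` and the numeric op code are hypothetical stand-ins, so pass the real `OPS.beginMarkedContentProps` constant in practice.

```javascript
// Stand-in for pdfjsLib.OPS.beginMarkedContentProps. The numeric value is an
// assumption made for this sketch; use the library constant in real code.
const BEGIN_MARKED_CONTENT_PROPS = 70;

/**
 * Scans an operator list shaped like { fnArray, argsArray } and returns the
 * marked-content entries that carry a (hypothetical) third ActualText arg.
 */
function collectActualText(opList, opCode = BEGIN_MARKED_CONTENT_PROPS) {
  const found = [];
  for (let i = 0; i < opList.fnArray.length; i++) {
    if (opList.fnArray[i] !== opCode) {
      continue;
    }
    // Operators without operands may have a null args entry.
    const [tag, mcid = null, actualText = null] = opList.argsArray[i] ?? [];
    if (actualText !== null) {
      found.push({ index: i, tag, mcid, actualText });
    }
  }
  return found;
}

// Hand-written mock data, not output from a real PDF.
const mockOpList = {
  fnArray: [70, 44, 71, 70],
  argsArray: [["Span", null, "fi"], [[]], null, ["P", 3]],
};
console.log(collectActualText(mockOpList));
```

Keeping the op code as a parameter avoids hard-coding a value that belongs to the library.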

Contributor Author

Could you file a bug and explain why such a feature would be useful?
Could it help fix an existing issue in the current viewer?

@fnlctrl fnlctrl Jun 17, 2025

@calixteman OK, I'll open another ticket for it. I don't think it's related to the current issue with the viewer.

I opened the original ticket because I got wrong text from getOperatorList(). The viewer is also affected, so I used it in the ticket because that is easier to reproduce than a code snippet.

I was building a PDF-to-SVG conversion tool with getOperatorList(). I found getTextContent() not useful for this: it only extracts text, the shape info can only be obtained from getOperatorList(), and there is no easy way to interleave the text and shapes back into the correct order from the results of both functions. So I ditched getTextContent() and use only getOperatorList(), including for text.
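As a rough sketch of the single-pass approach described here, the following walks one operator list and keeps text and path entries in drawing order. It assumes the `{ fnArray, argsArray }` shape returned by getOperatorList(), and that the first argument of a show-text operator is an array mixing glyph objects (with a `unicode` string) and numbers for spacing adjustments. `interleaveTextAndPaths` and the op codes in `OPS_SKETCH` are hypothetical stand-ins; real code should use the library's `OPS` constants.

```javascript
// Stand-ins for pdfjsLib.OPS.showText and pdfjsLib.OPS.constructPath.
// The numeric values are assumptions made for this sketch.
const OPS_SKETCH = { showText: 44, constructPath: 91 };

/**
 * Walks the operator list once and returns text and path entries in the
 * order in which they appear, so no merging of two result sets is needed.
 */
function interleaveTextAndPaths(opList, ops = OPS_SKETCH) {
  const out = [];
  opList.fnArray.forEach((fn, i) => {
    const args = opList.argsArray[i] ?? [];
    if (fn === ops.showText) {
      // Keep glyph objects, skip numeric spacing adjustments.
      const text = (args[0] ?? [])
        .filter(g => typeof g === "object" && g !== null)
        .map(g => g.unicode ?? "")
        .join("");
      out.push({ kind: "text", index: i, text });
    } else if (fn === ops.constructPath) {
      out.push({ kind: "path", index: i, args });
    }
  });
  return out;
}

// Hand-written mock data, not output from a real PDF.
const mockPage = {
  fnArray: [91, 44, 91],
  argsArray: [[["m"]], [[{ unicode: "H" }, -20, { unicode: "i" }]], [["l"]]],
};
console.log(interleaveTextAndPaths(mockPage).map(e => e.kind));
```

The text recovered this way is the glyphs' own Unicode mapping, which is why an ActualText override would have to come from the marked-content operators instead.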

@timvandermeij
Contributor

/botio-linux preview

@moz-tools-bot
Collaborator

From: Bot.io (Linux m4)


Received

Command cmd_preview from @timvandermeij received. Current queue size: 0

Live output at: http://54.241.84.105:8877/ab0c7336a1331e6/output.txt

@moz-tools-bot
Collaborator

From: Bot.io (Linux m4)


Success

Full output at http://54.241.84.105:8877/ab0c7336a1331e6/output.txt

Total script time: 1.14 mins

Published

Contributor

@timvandermeij timvandermeij left a comment

Looks good to me, with the comments addressed and passing tests.

@@ -3446,6 +3457,11 @@ class PartialEvaluator {
    return;
  case OPS.beginMarkedContent:
    flushTextContentItem();
    if (args[0]?.name === "Span") {
      textContentItem.span = stringToPDFString(
        args[1]?.get("ActualText") || ""
Contributor

Does this PR also fix #12237 perhaps?

Contributor

This would be a start at fixing that issue. It is the first step: getting the /ActualText into the text content. That issue asks for src/display/text_layer.js:#processItems to draw spans containing this actual text in the right places. Doing so also means accumulating the text drawing that would otherwise have been done, to know the bounds of the glyphs that would be drawn, so that the bounds of the span can be calculated.
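The bounds accumulation mentioned here could look roughly like the sketch below: take the box of each glyph that would have been drawn and compute the smallest rectangle containing all of them. The `[x0, y0, x1, y1]` box shape and the `spanBounds` helper are hypothetical conventions for illustration, not the text layer's actual data structures.

```javascript
/**
 * Returns the smallest [x0, y0, x1, y1] rectangle containing every glyph box,
 * or null when no glyph was drawn inside the span.
 */
function spanBounds(glyphBoxes) {
  if (glyphBoxes.length === 0) {
    return null;
  }
  let minX = Infinity;
  let minY = Infinity;
  let maxX = -Infinity;
  let maxY = -Infinity;
  for (const [x0, y0, x1, y1] of glyphBoxes) {
    // Use both corners so boxes given in either orientation work.
    minX = Math.min(minX, x0, x1);
    minY = Math.min(minY, y0, y1);
    maxX = Math.max(maxX, x0, x1);
    maxY = Math.max(maxY, y0, y1);
  }
  return [minX, minY, maxX, maxY];
}

// Two adjacent glyph boxes, made-up coordinates.
console.log(spanBounds([[10, 700, 16, 712], [16, 698, 21, 712]]));
```

A span carrying the ActualText could then be positioned and sized from this rectangle, independent of how many glyphs it replaces.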

@@ -3923,6 +3923,20 @@ Caron Broadcasting, Inc., an Ohio corporation (“Lessee”).`)
      expect(items[1].fontName).not.toEqual(items[0].fontName);
    });

    it("get the text a content stream containing some ActualText", async function () {
      const loadingTask = getDocument(buildGetDocumentParams("issue20007.pdf"));
Contributor

I don't really know why, but the unit test failure suggests that this can't be loaded:

TEST-UNEXPECTED-FAIL | get the text a content stream containing some ActualText | in firefox | ResponseException: Unexpected server response (404) while retrieving PDF "http://127.0.0.1:38175/test/pdfs/issue20007.pdf". in http://127.0.0.1:38175/src/shared/util.js (line 501)

Moreover, is the movement in the reference test expected?

Comment on lines +3463 to +3467
if (args[0]?.name === "Span") {
  textContentItem.span = stringToPDFString(
    args[1]?.get("ActualText") || ""
  );
}
Contributor

I'm not sure what this addition does. This case handles BMC, which takes only a tag, so there is never an args[1]? beginMarkedContentProps (below) handles BDC, which takes a tag and a dictionary.
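For context, in a content stream the two operators look like `/Span BMC` (tag only) and `/Span << /ActualText (text) >> BDC` (tag plus a property list, which may also be given as a resource name). The sketch below shows why ActualText can only come from the BDC form. It is a simplified illustration, not the evaluator's code: `actualTextOf` is a hypothetical helper and a `Map` stands in for a PDF dictionary.

```javascript
/**
 * Returns the ActualText for a marked-content operator, or null if there is
 * none. `operands` is the operand list that precedes the operator.
 */
function actualTextOf(operator, operands) {
  if (operator !== "BDC") {
    // BMC carries only a tag, so there is no property list to inspect.
    return null;
  }
  const [, props] = operands;
  if (!(props instanceof Map)) {
    // The property list was given by name (or is missing); this sketch
    // does not resolve named property lists.
    return null;
  }
  return props.get("ActualText") ?? null;
}

console.log(actualTextOf("BMC", ["Span"]));
console.log(actualTextOf("BDC", ["Span", new Map([["ActualText", "fi"]])]));
```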

Development

Successfully merging this pull request may close these issues.

[Bug]: Cannot obtain text info with custom fonts
5 participants