Skip to content

Improve "EI" detection in inline images (PR 12028 follow-up, issue 16454) #16461

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
May 23, 2023

Conversation

Snuffleupagus
Copy link
Collaborator

Given that inline images may contain "EI"-sequences in the image-data itself, actually finding the end-of-image operator isn't always straightforward.
Here we extend the implementation from PR #12028 to potentially check all of the following bytes, rather than stopping immediately. While we have fairly decent test-coverage for this code, whenever you're changing it there's unfortunately a slightly higher than normal risk of regressions. (You'd really wish that PDF generators just stop using inline images.)

@Snuffleupagus Snuffleupagus changed the title Improve "EI" detection in inline images WIP (PR 12028 follow-up, issue 16454) Improve "EI" detection in inline images (PR 12028 follow-up, issue 16454) May 23, 2023
@mozilla mozilla deleted a comment from moz-tools-bot May 23, 2023
@mozilla mozilla deleted a comment from moz-tools-bot May 23, 2023
@mozilla mozilla deleted a comment from moz-tools-bot May 23, 2023
@mozilla mozilla deleted a comment from moz-tools-bot May 23, 2023
…454)

Given that inline images may contain "EI"-sequences in the image-data itself, actually finding the end-of-image operator isn't always straightforward.
Here we extend the implementation from PR 12028 to potentially check all of the following bytes, rather than stopping immediately. While we have fairly decent test-coverage for this code, whenever you're changing it there's unfortunately a slightly higher than normal risk of regressions. (You'd really wish that PDF generators just stop using inline images.)
@moz-tools-bot
Copy link
Collaborator

From: Bot.io (Windows)


Received

Command cmd_test from @Snuffleupagus received. Current queue size: 0

Live output at: http://54.193.163.58:8877/239c38d95c3977a/output.txt

@moz-tools-bot
Copy link
Collaborator

From: Bot.io (Linux m4)


Received

Command cmd_test from @Snuffleupagus received. Current queue size: 0

Live output at: http://54.241.84.105:8877/dd0a979f561bdda/output.txt

@moz-tools-bot
Copy link
Collaborator

From: Bot.io (Linux m4)


Failed

Full output at http://54.241.84.105:8877/dd0a979f561bdda/output.txt

Total script time: 27.99 mins

  • Font tests: Passed
  • Unit tests: Passed
  • Integration Tests: FAILED
  • Regression tests: FAILED
  different ref/snapshot: 11

Image differences available at: http://54.241.84.105:8877/dd0a979f561bdda/reftest-analyzer.html#web=eq.log

@moz-tools-bot
Copy link
Collaborator

From: Bot.io (Windows)


Failed

Full output at http://54.193.163.58:8877/239c38d95c3977a/output.txt

Total script time: 34.63 mins

  • Font tests: Passed
  • Unit tests: Passed
  • Integration Tests: FAILED
  • Regression tests: FAILED
  different ref/snapshot: 30
  different first/second rendering: 1

Image differences available at: http://54.193.163.58:8877/239c38d95c3977a/reftest-analyzer.html#web=eq.log

@calixteman
Copy link
Contributor

I didn't know this feature (BI/EI) before today.
Can't we try to just decode the stream we have ? and then once it's decoded we know for sure where it ends whatever its contents are (EI, ascii, whatever).

@Snuffleupagus
Copy link
Collaborator Author

Can't we try to just decode the stream we have ? and then once it's decoded we know for sure where it ends whatever its contents are (EI, ascii, whatever).

How would that actually work, since I don't see how we can safely parse the stream without knowing where it's supposed to end? Given that we're, in the general case, dealing with "raw" binary data it's often not possible to actually tell from the data itself when the inline-image is supposed to end (unless it's e.g. a JPEG image).

@calixteman
Copy link
Contributor

In the pdf in the issue, the first inlined image is 5x206 and from the CS entry it's a color-indexed image where the index is lower than 255 so it means that the image should take 1030 bytes.
And in opening the pdf (where I removed the compressed streams with qpdf), I can see that the number of bytes between ID and EI is 1030.
It probably depends on the D entry, but I suppose (I'm not sure of that) that the size is guessable for Flate or LZW, Jpeg, ...
For ASCIIHex or ASCII85, I suppose it shouldn't be a problem.

@Snuffleupagus
Copy link
Collaborator Author

Snuffleupagus commented May 23, 2023

In the pdf in the issue, the first inlined image is 5x206 and from the CS entry it's a color-indexed image where the index is lower than 255 so it means that the image should take 1030 bytes.

In a perfect world that's true, however in many PDF documents that simply doesn't hold and it's unfortunately not possible to make any assumptions about the inline-image length based only on the dict-parameters. There's even examples of this in the PDF document that this PR fixes.

Please note that I've tried that many years ago, and it just doesn't work out in practice :-(

@calixteman
Copy link
Contributor

We could just try to guess the length, then check that at ID+length (modulo whitespaces) we've an EI and if not just fallback on the current code.

@Snuffleupagus
Copy link
Collaborator Author

We could just try to guess the length, then check that at ID+length (modulo whitespaces) we've an EI and if not just fallback on the current code.

But if that guess is off by a lot, that could very easily make us skip over another inline-image that immediately follows the current one.

@calixteman
Copy link
Contributor

We could just try to guess the length, then check that at ID+length (modulo whitespaces) we've an EI and if not just fallback on the current code.

But if that guess is off by a lot, that could very easily make us skip over another inline-image that immediately follows the current one.

That'd be very unfortunate... but who knows... Did you already see such a pdf irl ?
Out of curiosity I checked how it works in PDFium:
https://pdfium.googlesource.com/pdfium/+/refs/heads/chromium/2964/core/fpdfapi/page/cpdf_streamparser.cpp#119

and as far as I understand, they try to guess from the dictionary or in reading the stream.
Are inlined images frequent in the wild ?
Anyway I'm fine to accept the patch (I was just emitting some thoughts), but I think we should really improve that stuff if it's worth doing it.

@Snuffleupagus
Copy link
Collaborator Author

That'd be very unfortunate... but who knows... Did you already see such a pdf irl ?

I can't remember now, but given all the weird things that happen in bad PDF documents I wouldn't consider it impossible.

and as far as I understand, they try to guess from the dictionary or in reading the stream.

One additional complication here is that inline-images may reference a ColorSpace in the /Resources-dict of the /Page, and we (obviously) don't have access to that data in the Parser-code.

Are inlined images frequent in the wild ?

Not all that common, thankfully.

Anyway I'm fine to accept the patch (I was just emitting some thoughts), but I think we should really improve that stuff if it's worth doing it.

Yes, I'm obviously biased but I think we should land this as-is :-)

Given that there's been a few years since the last time this code was changed, it's not immediately clear how much we gain from trying to re-factor this (especially given the regression risks).

@calixteman
Copy link
Contributor

Yes, I'm obviously biased but I think we should land this as-is :-)

Given that there's been a few years since the last time this code was changed, it's not immediately clear how much we gain from trying to re-factor this (especially given the regression risks).

Yes I agree with you. Maybe you could just add a TODO and we wait for the next issue :).

@calixteman
Copy link
Contributor

Given that there's been a few years since the last time this code was changed, it's not immediately clear how much we gain from trying to re-factor this (especially given the regression risks).

I tend to think that the sum of heuristics to handle a new failure finally leads to have unmaintainable code and increase the tech debt. That said I'm fine to have them but just to handle some corner cases nobody thought about in writing the spec.
They could just have forced to have a mandatory entry Length...

@Snuffleupagus
Copy link
Collaborator Author

Snuffleupagus commented May 23, 2023

I tend to think that the sum of heuristics to handle a new failure finally leads to have unmaintainable code and increase the tech debt.

That's obviously correct, but in this case we might need to keep a fallback to (something along the lines of) the current stream-parsing anyway; hence it's not immediately clear to me how much we'd be able to simplify things even if we re-factored this!?

(And as mentioned above, handling of ColorSpace-entries could be challenging for us.)

They could just have forced to have a mandatory entry Length...

Or simply not have invented the concept of inline-images at all, which would have been even better :-)

Copy link
Contributor

@calixteman calixteman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thank you.

@Snuffleupagus Snuffleupagus merged commit a6f9505 into mozilla:master May 23, 2023
@Snuffleupagus
Copy link
Collaborator Author

/botio makeref

@moz-tools-bot
Copy link
Collaborator

From: Bot.io (Linux m4)


Received

Command cmd_makeref from @Snuffleupagus received. Current queue size: 0

Live output at: http://54.241.84.105:8877/e6e3a5045628054/output.txt

@moz-tools-bot
Copy link
Collaborator

From: Bot.io (Windows)


Received

Command cmd_makeref from @Snuffleupagus received. Current queue size: 0

Live output at: http://54.193.163.58:8877/d1bf51194f261f7/output.txt

@moz-tools-bot
Copy link
Collaborator

From: Bot.io (Linux m4)


Success

Full output at http://54.241.84.105:8877/e6e3a5045628054/output.txt

Total script time: 23.61 mins

  • Lint: Passed
  • Make references: Passed
  • Check references: Passed

@moz-tools-bot
Copy link
Collaborator

From: Bot.io (Windows)


Success

Full output at http://54.193.163.58:8877/d1bf51194f261f7/output.txt

Total script time: 24.40 mins

  • Lint: Passed
  • Make references: Passed
  • Check references: Passed

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
3 participants