-
Notifications
You must be signed in to change notification settings - Fork 10.3k
Improve "EI" detection in inline images (PR 12028 follow-up, issue 16454) #16461
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
fcae700
to
d65e6b7
Compare
…454) Given that inline images may contain "EI"-sequences in the image-data itself, actually finding the end-of-image operator isn't always straightforward. Here we extend the implementation from PR 12028 to potentially check all of the following bytes, rather than stopping immediately. While we have fairly decent test-coverage for this code, whenever you're changing it there's unfortunately a slightly higher than normal risk of regressions. (You'd really wish that PDF generators just stop using inline images.)
d65e6b7
to
dfbbb8c
Compare
From: Bot.io (Windows)ReceivedCommand cmd_test from @Snuffleupagus received. Current queue size: 0 Live output at: http://54.193.163.58:8877/239c38d95c3977a/output.txt |
From: Bot.io (Linux m4)ReceivedCommand cmd_test from @Snuffleupagus received. Current queue size: 0 Live output at: http://54.241.84.105:8877/dd0a979f561bdda/output.txt |
From: Bot.io (Linux m4)FailedFull output at http://54.241.84.105:8877/dd0a979f561bdda/output.txt Total script time: 27.99 mins
Image differences available at: http://54.241.84.105:8877/dd0a979f561bdda/reftest-analyzer.html#web=eq.log |
From: Bot.io (Windows)FailedFull output at http://54.193.163.58:8877/239c38d95c3977a/output.txt Total script time: 34.63 mins
Image differences available at: http://54.193.163.58:8877/239c38d95c3977a/reftest-analyzer.html#web=eq.log |
I didn't know this feature (BI/EI) before today. |
How would that actually work, since I don't see how we can safely parse the stream without knowing where it's supposed to end? Given that we're, in the general case, dealing with "raw" binary data it's often not possible to actually tell from the data itself when the inline-image is supposed to end (unless it's e.g. a JPEG image). |
In the pdf in the issue, the first inlined image is 5x206 and from the CS entry it's a color-indexed image where the index is lower than 255 so it means that the image should take 1030 bytes. |
In a perfect world that's true, however in many PDF documents that simply doesn't hold and it's unfortunately not possible to make any assumptions about the inline-image length based only on the dict-parameters. There's even examples of this in the PDF document that this PR fixes. Please note that I've tried that many years ago, and it just doesn't work out in practice :-( |
We could just try to guess the length, then check that at ID+length (modulo whitespaces) we've an EI and if not just fallback on the current code. |
But if that guess is off by a lot, that could very easily make us skip over another inline-image that immediately follows the current one. |
That'd be very unfortunate... but who knows... Did you already see such a pdf irl ? and as far as I understand, they try to guess from the dictionary or in reading the stream. |
I can't remember now, but given all the weird things that happen in bad PDF documents I wouldn't consider it impossible.
One additional complication here is that inline-images may reference a ColorSpace in the /Resources-dict of the /Page, and we (obviously) don't have access to that data in the
Not all that common, thankfully.
Yes, I'm obviously biased but I think we should land this as-is :-) Given that there's been a few years since the last time this code was changed, it's not immediately clear how much we gain from trying to re-factor this (especially given the regression risks). |
Yes I agree with you. Maybe you could just add a TODO and we wait for the next issue :). |
I tend to think that the sum of heuristics to handle a new failure finally leads to have unmaintainable code and increase the tech debt. That said I'm fine to have them but just to handle some corner cases nobody thought about in writing the spec. |
That's obviously correct, but in this case we might need to keep a fallback to (something along the lines of) the current stream-parsing anyway; hence it's not immediately clear to me how much we'd be able to simplify things even if we re-factored this!? (And as mentioned above, handling of ColorSpace-entries could be challenging for us.)
Or simply not have invented the concept of inline-images at all, which would have been even better :-) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Thank you.
/botio makeref |
From: Bot.io (Linux m4)ReceivedCommand cmd_makeref from @Snuffleupagus received. Current queue size: 0 Live output at: http://54.241.84.105:8877/e6e3a5045628054/output.txt |
From: Bot.io (Windows)ReceivedCommand cmd_makeref from @Snuffleupagus received. Current queue size: 0 Live output at: http://54.193.163.58:8877/d1bf51194f261f7/output.txt |
From: Bot.io (Linux m4)SuccessFull output at http://54.241.84.105:8877/e6e3a5045628054/output.txt Total script time: 23.61 mins
|
From: Bot.io (Windows)SuccessFull output at http://54.193.163.58:8877/d1bf51194f261f7/output.txt Total script time: 24.40 mins
|
Given that inline images may contain "EI"-sequences in the image-data itself, actually finding the end-of-image operator isn't always straightforward.
Here we extend the implementation from PR #12028 to potentially check all of the following bytes, rather than stopping immediately. While we have fairly decent test-coverage for this code, whenever you're changing it there's unfortunately a slightly higher than normal risk of regressions. (You'd really wish that PDF generators just stop using inline images.)