Improve "EI" detection in inline images (PR 12028 follow-up, issue 16454) #16461

Snuffleupagus · 2023-05-23T13:49:45Z

Given that inline images may contain "EI"-sequences in the image-data itself, actually finding the end-of-image operator isn't always straightforward.
Here we extend the implementation from PR #12028 to potentially check all of the following bytes, rather than stopping immediately. While we have fairly decent test-coverage for this code, whenever you're changing it there's unfortunately a slightly higher than normal risk of regressions. (You'd really wish that PDF generators would just stop using inline images.)
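
To illustrate the idea in the description, here is a minimal sketch of an "EI" scan that keeps checking the bytes after a candidate instead of accepting it immediately. It is not the PDF.js implementation: the name `findInlineImageEnd`, the `lookahead` window of 15 bytes, and the "printable ASCII or whitespace" plausibility test are all assumptions made for illustration.

```javascript
// Hypothetical helpers; NOT the actual PDF.js code.
const isWhitespace = (b) =>
  b === 0x00 || b === 0x09 || b === 0x0a || b === 0x0c || b === 0x0d || b === 0x20;

// Assumption: bytes after a real "EI" look like content-stream text,
// i.e. whitespace or printable ASCII.
const isPlausibleContentByte = (b) => isWhitespace(b) || (b >= 0x21 && b <= 0x7e);

/**
 * Returns the index of the "E" in the first plausible "EI" operator at or
 * after `start`, or -1 if none is found.
 */
function findInlineImageEnd(bytes, start = 0, lookahead = 15) {
  for (let i = start; i + 1 < bytes.length; i++) {
    if (bytes[i] !== 0x45 || bytes[i + 1] !== 0x49) continue; // "E", "I"
    // The operator must be delimited by whitespace (or the data boundary).
    if (i > start && !isWhitespace(bytes[i - 1])) continue;
    const after = i + 2;
    if (after < bytes.length && !isWhitespace(bytes[after])) continue;
    // Keep checking the following bytes rather than accepting right away.
    const end = Math.min(bytes.length, after + lookahead);
    let plausible = true;
    for (let j = after; j < end; j++) {
      if (!isPlausibleContentByte(bytes[j])) {
        plausible = false;
        break;
      }
    }
    if (plausible) return i;
  }
  return -1;
}
```

With this sketch, an "EI" that is followed by binary data is rejected as part of the image, and the scan continues to the next candidate.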

moz-tools-bot · 2023-05-23T15:08:23Z

From: Bot.io (Windows)

Received

Command cmd_test from @Snuffleupagus received. Current queue size: 0

Live output at: http://54.193.163.58:8877/239c38d95c3977a/output.txt

moz-tools-bot · 2023-05-23T15:08:24Z

From: Bot.io (Linux m4)

Received

Command cmd_test from @Snuffleupagus received. Current queue size: 0

Live output at: http://54.241.84.105:8877/dd0a979f561bdda/output.txt

moz-tools-bot · 2023-05-23T15:36:24Z

From: Bot.io (Linux m4)

Failed

Full output at http://54.241.84.105:8877/dd0a979f561bdda/output.txt

Total script time: 27.99 mins

Font tests: Passed
Unit tests: Passed
Integration Tests: FAILED
Regression tests: FAILED

  different ref/snapshot: 11

Image differences available at: http://54.241.84.105:8877/dd0a979f561bdda/reftest-analyzer.html#web=eq.log

moz-tools-bot · 2023-05-23T15:43:02Z

From: Bot.io (Windows)

Failed

Full output at http://54.193.163.58:8877/239c38d95c3977a/output.txt

Total script time: 34.63 mins

Font tests: Passed
Unit tests: Passed
Integration Tests: FAILED
Regression tests: FAILED

  different ref/snapshot: 30
  different first/second rendering: 1

Image differences available at: http://54.193.163.58:8877/239c38d95c3977a/reftest-analyzer.html#web=eq.log

calixteman · 2023-05-23T15:54:18Z

I didn't know about this feature (BI/EI) before today.
Can't we just try to decode the stream we have? Once it's decoded, we'd know for sure where it ends, whatever its contents are (EI, ASCII, whatever).

Snuffleupagus · 2023-05-23T16:08:45Z

> Can't we just try to decode the stream we have? Once it's decoded, we'd know for sure where it ends, whatever its contents are (EI, ASCII, whatever).

How would that actually work, since I don't see how we can safely parse the stream without knowing where it's supposed to end? Given that we're, in the general case, dealing with "raw" binary data it's often not possible to actually tell from the data itself when the inline-image is supposed to end (unless it's e.g. a JPEG image).
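
JPEG is the exception mentioned here because its data carries its own end marker. A minimal sketch of that case, assuming the data begins with the SOI marker (0xFFD8); the name `findJpegEnd` is hypothetical and this is not how PDF.js necessarily does it:

```javascript
/**
 * Returns the offset just past the JPEG EOI marker (0xFFD9), or -1 if the
 * data does not start with SOI (0xFFD8) or no EOI marker is found.
 */
function findJpegEnd(bytes, start = 0) {
  if (bytes[start] !== 0xff || bytes[start + 1] !== 0xd8) return -1;
  let pos = start + 2;
  while (pos + 1 < bytes.length) {
    if (bytes[pos] !== 0xff) {
      pos++; // entropy-coded data
      continue;
    }
    const marker = bytes[pos + 1];
    if (marker === 0xd9) return pos + 2; // EOI: the image ends here
    if (marker === 0xff) {
      pos++; // fill byte before a marker
      continue;
    }
    if (marker === 0x00 || marker === 0x01 || (marker >= 0xd0 && marker <= 0xd7)) {
      pos += 2; // stuffed byte, TEM, or restart marker: no length field
      continue;
    }
    // Any other marker starts a segment with a 2-byte big-endian length
    // (the length includes the two length bytes themselves).
    if (pos + 3 >= bytes.length) return -1;
    const length = (bytes[pos + 2] << 8) | bytes[pos + 3];
    pos += 2 + length;
  }
  return -1;
}
```

Skipping whole segments by their declared length avoids being fooled by an 0xFFD9 byte pair inside a header segment. For "raw" sample data there is no comparable marker, which is the problem described above.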

calixteman · 2023-05-23T16:42:17Z

In the PDF from the issue, the first inline image is 5x206, and from the CS entry it's an indexed-color image whose index is lower than 255, so the image should take 5 × 206 = 1030 bytes.
And when opening the PDF (after using qpdf to remove the stream compression), I can see that the number of bytes between ID and EI is 1030.
It probably depends on the D entry, but I suppose (I'm not sure about that) that the size can be guessed for Flate, LZW, JPEG, ...
For ASCIIHex or ASCII85, I suppose it shouldn't be a problem.
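
A minimal sketch of that arithmetic, assuming unfiltered sample data and rows padded to a whole number of bytes; the function and parameter names are hypothetical:

```javascript
/**
 * Expected size in bytes of unfiltered inline-image sample data.
 * Assumption: no filter is applied, and each row is padded to a byte boundary.
 */
function expectedInlineImageLength({ width, height, bitsPerComponent, numComponents }) {
  const bitsPerRow = width * numComponents * bitsPerComponent;
  const bytesPerRow = Math.ceil(bitsPerRow / 8);
  return bytesPerRow * height;
}

// The 5x206 indexed image from the issue: one component, 8 bits per component.
const guessed = expectedInlineImageLength({
  width: 5,
  height: 206,
  bitsPerComponent: 8,
  numComponents: 1,
});
console.log(guessed); // 1030
```

For filtered data (Flate, LZW, JPEG) this number describes the decoded size only, so the encoded length would have to come from the decoder itself.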

Snuffleupagus · 2023-05-23T16:55:56Z

> In the PDF from the issue, the first inline image is 5x206, and from the CS entry it's an indexed-color image whose index is lower than 255, so the image should take 5 × 206 = 1030 bytes.

In a perfect world that's true, however in many PDF documents that simply doesn't hold and it's unfortunately not possible to make any assumptions about the inline-image length based only on the dict-parameters. There are even examples of this in the PDF document that this PR fixes.

Please note that I've tried that many years ago, and it just doesn't work out in practice :-(

calixteman · 2023-05-23T17:07:32Z

We could just try to guess the length, then check that at ID+length (modulo whitespace) there's an EI, and if not just fall back to the current code.
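
A minimal sketch of this proposal, under the assumption that a length guess is already available; `findEndWithLengthGuess` and the `fallback` callback (standing in for the existing scanning code) are hypothetical names:

```javascript
// PDF whitespace characters (NUL, TAB, LF, FF, CR, SPACE).
const isPdfSpace = (b) =>
  b === 0x00 || b === 0x09 || b === 0x0a || b === 0x0c || b === 0x0d || b === 0x20;

/**
 * Tries the guessed length first: if, after skipping whitespace, an "EI"
 * operator sits at dataStart + guessedLength, its index is returned.
 * Otherwise the decision is delegated to `fallback(bytes, dataStart)`.
 */
function findEndWithLengthGuess(bytes, dataStart, guessedLength, fallback) {
  let pos = dataStart + guessedLength;
  while (pos < bytes.length && isPdfSpace(bytes[pos])) pos++;
  const hasEI =
    bytes[pos] === 0x45 && // "E"
    bytes[pos + 1] === 0x49 && // "I"
    (pos + 2 >= bytes.length || isPdfSpace(bytes[pos + 2]));
  return hasEI ? pos : fallback(bytes, dataStart);
}
```

A wrong guess only costs one extra check as long as no "EI" happens to sit at the guessed position; it does nothing to make the fallback itself simpler.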

Snuffleupagus · 2023-05-23T17:25:03Z

> We could just try to guess the length, then check that at ID+length (modulo whitespace) there's an EI, and if not just fall back to the current code.

But if that guess is off by a lot, that could very easily make us skip over another inline-image that immediately follows the current one.

calixteman · 2023-05-23T17:45:52Z

> > We could just try to guess the length, then check that at ID+length (modulo whitespace) there's an EI, and if not just fall back to the current code.

> But if that guess is off by a lot, that could very easily make us skip over another inline-image that immediately follows the current one.

That'd be very unfortunate... but who knows... Have you ever seen such a PDF IRL?
Out of curiosity I checked how it works in PDFium:
https://pdfium.googlesource.com/pdfium/+/refs/heads/chromium/2964/core/fpdfapi/page/cpdf_streamparser.cpp#119

and as far as I understand, they try to guess from the dictionary or by reading the stream.
Are inline images frequent in the wild?
Anyway I'm fine with accepting the patch (I was just sharing some thoughts), but I think we should really improve this code if it's worth doing.

Snuffleupagus · 2023-05-23T18:01:16Z

> That'd be very unfortunate... but who knows... Have you ever seen such a PDF IRL?

I can't remember now, but given all the weird things that happen in bad PDF documents I wouldn't consider it impossible.

> and as far as I understand, they try to guess from the dictionary or by reading the stream.

One additional complication here is that inline-images may reference a ColorSpace in the /Resources-dict of the /Page, and we (obviously) don't have access to that data in the Parser-code.

> Are inline images frequent in the wild?

Not all that common, thankfully.

> Anyway I'm fine with accepting the patch (I was just sharing some thoughts), but I think we should really improve this code if it's worth doing.

Yes, I'm obviously biased but I think we should land this as-is :-)

Given that it's been a few years since the last time this code was changed, it's not immediately clear how much we'd gain from trying to re-factor this (especially given the regression risks).

calixteman · 2023-05-23T18:04:52Z

> Yes, I'm obviously biased but I think we should land this as-is :-)
>
> Given that it's been a few years since the last time this code was changed, it's not immediately clear how much we'd gain from trying to re-factor this (especially given the regression risks).

Yes I agree with you. Maybe you could just add a TODO and we wait for the next issue :).

calixteman · 2023-05-23T18:12:01Z

> Given that it's been a few years since the last time this code was changed, it's not immediately clear how much we'd gain from trying to re-factor this (especially given the regression risks).

I tend to think that piling up heuristics, one for each new failure, eventually leads to unmaintainable code and increases the tech debt. That said, I'm fine with having them, but only to handle corner cases nobody thought about when writing the spec.
They could simply have made a Length entry mandatory...

Snuffleupagus · 2023-05-23T18:28:39Z

> I tend to think that piling up heuristics, one for each new failure, eventually leads to unmaintainable code and increases the tech debt.

That's obviously correct, but in this case we might need to keep a fallback to (something along the lines of) the current stream-parsing anyway; hence it's not immediately clear to me how much we'd be able to simplify things even if we re-factored this!?

(And as mentioned above, handling of ColorSpace-entries could be challenging for us.)

> They could simply have made a Length entry mandatory...

Or simply not have invented the concept of inline-images at all, which would have been even better :-)

calixteman

LGTM. Thank you.

Snuffleupagus · 2023-05-23T20:23:47Z

/botio makeref

moz-tools-bot · 2023-05-23T20:23:49Z

From: Bot.io (Linux m4)

Received

Command cmd_makeref from @Snuffleupagus received. Current queue size: 0

Live output at: http://54.241.84.105:8877/e6e3a5045628054/output.txt

moz-tools-bot · 2023-05-23T20:23:49Z

From: Bot.io (Windows)

Received

Command cmd_makeref from @Snuffleupagus received. Current queue size: 0

Live output at: http://54.193.163.58:8877/d1bf51194f261f7/output.txt

moz-tools-bot · 2023-05-23T20:47:26Z

From: Bot.io (Linux m4)

Success

Full output at http://54.241.84.105:8877/e6e3a5045628054/output.txt

Total script time: 23.61 mins

Lint: Passed
Make references: Passed
Check references: Passed

moz-tools-bot · 2023-05-23T20:48:14Z

From: Bot.io (Windows)

Success

Full output at http://54.193.163.58:8877/d1bf51194f261f7/output.txt

Total script time: 24.40 mins

Lint: Passed
Make references: Passed
Check references: Passed

Snuffleupagus added the core label May 23, 2023

Snuffleupagus linked an issue May 23, 2023 that may be closed by this pull request

Pdf fails to render.Failed to execute 'setLineDash' on 'CanvasRenderingContext2D': The object must have a callable @@iterator property. #16454

Closed

Snuffleupagus changed the title ~~Improve "EI" detection in inline images WIP (PR 12028 follow-up, issue 16454)~~ Improve "EI" detection in inline images (PR 12028 follow-up, issue 16454) May 23, 2023

Snuffleupagus force-pushed the issue-16454 branch from fcae700 to d65e6b7 Compare May 23, 2023 14:46

mozilla deleted a comment from moz-tools-bot May 23, 2023

Snuffleupagus force-pushed the issue-16454 branch from d65e6b7 to dfbbb8c Compare May 23, 2023 15:05

calixteman approved these changes May 23, 2023

Snuffleupagus merged commit a6f9505 into mozilla:master May 23, 2023

Snuffleupagus deleted the issue-16454 branch May 23, 2023 20:48

Yuras mentioned this pull request Jun 17, 2023

Process (by ignoring) inline images Yuras/pdf-toolbox#82

Open
