Gracefully tokenize invalid objects #968

Open
pietermarsman opened this issue Jun 25, 2024 · 1 comment
Comments

@pietermarsman
Member

From #947.

This PDF

<< /ColorSpace @pgfcolorspaces >>

raises

pdfminer.psparser.PSSyntaxError: Invalid dictionary construct: [/'ColorSpace', /b'@', /b'pgfcolorspaces']

The cause is that << /ColorSpace @pgfcolorspaces >> is actually an invalid PDF object. With the current implementation, @ and pgfcolorspaces are tokenized as two separate KWDs. Ideally this invalid object would be tokenized as a single token with a different class from KWD.

The invalid PDF object has a << and >>. This indicates that it is a dictionary object (section 3.2.6). Dictionary keys are names (section 3.2.4) and the value can be any object. But the value in question (@pgfcolorspaces) is not a valid object because it starts with an @:

  • Booleans (3.2.1) are either true or false
  • Numerics (3.2.2) are numbers with a potential leading +, - or .
  • Literal strings (3.2.3) start with a (
  • Hexadecimal strings (3.2.3) start with a single <
  • Name objects (3.2.4) start with a /
  • Array objects (3.2.5) start with a [
  • Dictionary objects (3.2.6) start with a <<
  • Stream objects (3.2.7) start with a dictionary.
  • Null objects (3.2.8) are simply null
  • And indirect objects (3.2.9) start with a numeric object.
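
The list above can be condensed into a check on the leading characters of an object. The sketch below is only an illustration in plain Python: `classify_object_start` is a hypothetical helper, not part of pdfminer.six, and it does not try to tell streams from dictionaries or indirect objects from numerics.

```python
import re

# Optional sign, then digits with an optional fraction, or a bare fraction.
NUMBER_START = re.compile(r"[+-]?(?:\d+\.?\d*|\.\d+)")
WORD_START = re.compile(r"[A-Za-z]+")


def classify_object_start(text: str) -> str:
    """Return the kind of PDF object `text` could start, or 'invalid'."""
    s = text.lstrip()
    if s.startswith("<<"):
        return "dictionary or stream"
    if s.startswith("<"):
        return "hexadecimal string"
    if s.startswith("("):
        return "literal string"
    if s.startswith("/"):
        return "name"
    if s.startswith("["):
        return "array"
    if NUMBER_START.match(s):
        return "numeric or indirect object"
    word = WORD_START.match(s)
    if word and word.group() in ("true", "false"):
        return "boolean"
    if word and word.group() == "null":
        return "null"
    # Nothing in the list above matches, e.g. a leading '@'.
    return "invalid"


print(classify_object_start("@pgfcolorspaces"))  # invalid
```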

Starting an object with an @ is therefore not allowed by the PDF spec. The question is: how should this unexpected object be tokenized?

Currently the tokenizer checks for a couple of special characters (%/-+.(<>) to recognize most objects (e.g. numerics, strings, names, etc.). If the token starts with an alphabetic character, it checks whether it is a boolean and otherwise assumes it is a multi-character keyword (ending at the first whitespace or special character from above). All non-special, non-alphabetic characters are assumed to be single-character keywords.
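
To make that behaviour concrete, here is a toy tokenizer that follows the description above. It is a simplified stand-in, not the actual pdfminer.psparser code: the `TOKEN` pattern, the `tokenize` function and the token labels are assumptions made for illustration, and strings, hex strings and comments are left out.

```python
import re

# Alternatives are tried left to right at each position.
TOKEN = re.compile(
    r"""
    (?P<ws>\s+)
  | (?P<name>/[^\s%/()<>\[\]{}]*)
  | (?P<number>[+-]?(?:\d+\.?\d*|\.\d+))
  | (?P<word>[A-Za-z]+)
  | (?P<dict_delim><<|>>)
  | (?P<other>.)
    """,
    re.VERBOSE | re.DOTALL,
)


def tokenize(data: str) -> list[tuple[str, str]]:
    """Split `data` into (label, text) pairs, mimicking the described rules."""
    tokens = []
    for m in TOKEN.finditer(data):
        kind, text = m.lastgroup, m.group()
        if kind == "ws":
            continue
        if kind == "word" and text in ("true", "false"):
            tokens.append(("BOOL", text))
        elif kind == "name":
            tokens.append(("LIT", text[1:]))
        elif kind == "number":
            tokens.append(("NUM", text))
        else:
            # Alphabetic words, << and >>, and every other single
            # character all end up as keywords.
            tokens.append(("KWD", text))
    return tokens


print(tokenize("<< /ColorSpace @pgfcolorspaces >>"))
# [('KWD', '<<'), ('LIT', 'ColorSpace'), ('KWD', '@'),
#  ('KWD', 'pgfcolorspaces'), ('KWD', '>>')]
```

The `@` and `pgfcolorspaces` come out as two separate keywords, which matches the list shown in the error message.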

This works well for PDFs with correct syntax. For example, the array keywords ([ and ]) are parsed correctly, and so are indirect objects (which use the keyword R). But for PDFs with incorrect syntax it puts each "unexpected" character on its own in a KWD. Ideally pdfminer.six distinguishes between expected, known keywords (using KWD) and other unexpected characters (using another class). I think that is the preferred solution here.
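
A rough sketch of what that distinction could look like, again in plain Python rather than as a patch to pdfminer.six. The `INVALID` label, the `KNOWN_KEYWORDS` set and the `tokenize` function are all hypothetical; strings, hex strings and comments are out of scope here.

```python
import re

# Hypothetical, incomplete set of keywords the tokenizer would accept.
KNOWN_KEYWORDS = {"R", "obj", "endobj", "stream", "endstream", "null"}

NUMBER = re.compile(r"[+-]?(?:\d+\.?\d*|\.\d+)")

# A name, a delimiter, or a run of characters up to whitespace/delimiter.
TOKEN = re.compile(
    r"(?P<name>/[^\s<>\[\]{}()/%]*)"
    r"|(?P<delim><<|>>|[\[\]{}])"
    r"|(?P<run>[^\s<>\[\]{}()/%]+)"
)


def tokenize(data: str) -> list[tuple[str, str]]:
    """Label known keywords as KWD and any other run as one INVALID token."""
    tokens = []
    for m in TOKEN.finditer(data):
        text = m.group()
        if m.lastgroup == "name":
            tokens.append(("LIT", text[1:]))
        elif m.lastgroup == "delim" or text in KNOWN_KEYWORDS:
            tokens.append(("KWD", text))
        elif text in ("true", "false"):
            tokens.append(("BOOL", text))
        elif NUMBER.fullmatch(text):
            tokens.append(("NUM", text))
        else:
            # The whole unexpected run stays together in a single token.
            tokens.append(("INVALID", text))
    return tokens


print(tokenize("<< /ColorSpace @pgfcolorspaces >>"))
# [('KWD', '<<'), ('LIT', 'ColorSpace'),
#  ('INVALID', '@pgfcolorspaces'), ('KWD', '>>')]
```

With a single INVALID token the dictionary has an even number of entries again, so a parser could decide per token whether to drop the entry, keep it as an opaque value, or report it.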

pietermarsman changed the title from "Gracefully tokenize characters that are objects." to "Gracefully tokenize invalid objects" on Jun 25, 2024
@norbusan

norbusan commented Dec 2, 2024

To add something here: we just tried pdfminer.six's pdf2txt.py on this arXiv document: https://arxiv.org/pdf/2411.18626 and it failed with:

pdfminer.psexceptions.PSSyntaxError: Invalid dictionary construct: [/'AIS', /b'fals', /b'e', /'BM', /'Normal', /'CA', 1, /'OP', False, /'OPM', 1, /'SA', True, /'SMask', /'None', /'Type', /'ExtGState', /'ca', 1, /'op', False]

I guess the error is technically correct: the dictionary has an odd number of entries because false was split into two tokens, fals and e (no idea why).
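
The odd-count check can be illustrated with a small sketch. This is not pdfminer.six's code; `build_dict` is a hypothetical helper, and names are represented as plain strings with a leading slash.

```python
def build_dict(entries: list) -> dict:
    """Pair up alternating keys and values, rejecting an odd-length list."""
    if len(entries) % 2 != 0:
        raise ValueError(f"Invalid dictionary construct: {entries!r}")
    return dict(zip(entries[0::2], entries[1::2]))


# 'false' split into 'fals' and 'e' leaves five entries instead of four.
broken = ["/AIS", "fals", "e", "/BM", "/Normal"]
try:
    build_dict(broken)
except ValueError as exc:
    print(exc)

# With the value intact the same entries pair up cleanly.
print(build_dict(["/AIS", False, "/BM", "/Normal"]))
```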

What puzzles me is that:

  • qpdf --check reports no error
  • pdfcpu validate reports no error
  • okular/browsers display the document without any error
  • pdftotext produces output (although not really useful)

I am not sure what the correct solution is, but I would love to see a two-fold approach:

  • a validate-like command that returns a list of errors found
  • the text extraction part that tries to be resilient in case of errors like the above
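
As a very rough sketch of how the two parts could share one code path: a lenient dictionary builder that keeps going and records what it skipped, plus a validate function that only returns the recorded errors. Both `parse_dict_lenient` and `validate` are hypothetical names, not existing pdfminer.six functions, and names are again represented as strings with a leading slash.

```python
def parse_dict_lenient(entries: list) -> tuple[dict, list[str]]:
    """Build a dict from alternating keys/values, skipping stray entries."""
    result: dict = {}
    errors: list[str] = []
    i = 0
    while i < len(entries):
        key = entries[i]
        if not (isinstance(key, str) and key.startswith("/")):
            # Not a name where a key is expected: report it and resync.
            errors.append(f"entry {i}: expected a name as key, got {key!r}")
            i += 1
            continue
        if i + 1 >= len(entries):
            errors.append(f"entry {i}: key {key!r} has no value")
            break
        result[key[1:]] = entries[i + 1]
        i += 2
    return result, errors


def validate(entries: list) -> list[str]:
    """Return only the list of problems found."""
    return parse_dict_lenient(entries)[1]


entries = ["/AIS", "fals", "e", "/BM", "/Normal", "/CA", 1]
parsed, problems = parse_dict_lenient(entries)
print(parsed)    # {'AIS': 'fals', 'BM': '/Normal', 'CA': 1}
print(problems)  # ["entry 2: expected a name as key, got 'e'"]
```

Note that the AIS value is still wrong in this sketch; the lenient path only guarantees that the remaining entries are recovered.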

Thanks for your work on all this!
