Gracefully tokenize invalid objects #968

pietermarsman · 2024-06-25T19:45:09Z

From #947.

This PDF

<< /ColorSpace @pgfcolorspaces >>

raises

pdfminer.psparser.PSSyntaxError: Invalid dictionary construct: [/'ColorSpace', /b'@', /b'pgfcolorspaces']

The cause is that << /ColorSpace @pgfcolorspaces >> is actually an invalid PDF object. With the current implementation, @ and pgfcolorspaces are tokenized as two separate KWDs. Ideally this invalid object would be tokenized as a single token with a different class from KWD.

The invalid PDF object has a << and >>. This indicates that it is a dictionary object (section 3.2.6). Dictionary keys are names (section 3.2.4) and the value can be any object. But the value in question (@pgfcolorspaces) is not a valid object because it starts with an @:

Booleans (3.2.1) are either true or false
Numerics (3.2.2) are numbers with a potential leading +, - or .
Literal strings (3.2.3) start with a (
Hexadecimal strings (3.2.3) start with a single <
Name objects (3.2.4) start with a /
Array objects (3.2.5) start with a [
Dictionary objects (3.2.6) start with a <<
Stream objects (3.2.7) start with a dictionary
Null objects (3.2.8) are simply null
And indirect objects (3.2.9) start with a numeric object.

Starting an object with an @ is therefore not an option in the PDF spec. The question is: how should this unexpected object be tokenized?
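To make the list above concrete, here is a minimal sketch that guesses the object type from the start of a token. The function name object_kind is made up for illustration and is not part of pdfminer.six. Streams and indirect objects are not listed separately because they start with a dictionary and a numeric respectively.

```python
def object_kind(token: bytes) -> str:
    """Guess which PDF object a token could start, based on its first bytes.

    Hypothetical helper for illustration only, not a pdfminer.six API.
    """
    if token in (b"true", b"false"):
        return "boolean"
    if token == b"null":
        return "null"
    if token.startswith(b"<<"):  # must be checked before the single "<"
        return "dictionary"
    if token.startswith(b"<"):
        return "hex string"
    if token.startswith(b"("):
        return "literal string"
    if token.startswith(b"/"):
        return "name"
    if token.startswith(b"["):
        return "array"
    if token[:1] and token[:1] in b"+-.0123456789":
        return "numeric"
    return "invalid"


print(object_kind(b"/ColorSpace"))      # name
print(object_kind(b"@pgfcolorspaces"))  # invalid
```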

Currently the tokenizer checks for a couple of special characters (%/-+.(<>) to recognize most objects (e.g. numerics, strings, names, etc.). If the token starts with an alphabetic character, it checks if it is a boolean and otherwise assumes it is a multi-character keyword (ending at the first whitespace or special character from above). Every other character, neither special nor alphabetic, is assumed to be a keyword on its own.
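A toy tokenizer that mimics only the branches described above reproduces the split of @ and pgfcolorspaces into two keywords. This is a simplified sketch and not the actual pdfminer.six code: toy_tokenize, the delimiter set and the token labels are assumptions, and numerics, strings and comments are left out.

```python
import re

# Assumption: a name or keyword ends at whitespace or one of these delimiters.
DELIMS = rb"\s/%\[\]()<>{}"
NAME = re.compile(rb"/([^" + DELIMS + rb"]*)")
WORD = re.compile(rb"[A-Za-z][^" + DELIMS + rb"]*")


def toy_tokenize(data: bytes):
    """Return (label, bytes) pairs for a tiny subset of PDF syntax."""
    tokens = []
    pos = 0
    while pos < len(data):
        ch = data[pos:pos + 1]
        if ch.isspace():
            pos += 1
        elif data.startswith((b"<<", b">>"), pos):
            tokens.append(("KWD", data[pos:pos + 2]))
            pos += 2
        elif ch == b"/":
            m = NAME.match(data, pos)
            tokens.append(("NAME", m.group(1)))
            pos = m.end()
        elif ch.isalpha():
            # Alphabetic start: a boolean, otherwise a multi-character keyword.
            m = WORD.match(data, pos)
            word = m.group(0)
            tokens.append(("BOOL" if word in (b"true", b"false") else "KWD", word))
            pos = m.end()
        else:
            # Anything else becomes a one-character keyword, including "@".
            tokens.append(("KWD", ch))
            pos += 1
    return tokens


for token in toy_tokenize(b"<< /ColorSpace @pgfcolorspaces >>"):
    print(token)
# ('KWD', b'<<')
# ('NAME', b'ColorSpace')
# ('KWD', b'@')
# ('KWD', b'pgfcolorspaces')
# ('KWD', b'>>')
```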

This works well for PDFs with correct syntax. For example, the array keywords are parsed correctly ([ and ]). And the same for indirect objects (which use the keyword R). But for PDFs with incorrect syntax it puts each "unexpected" character on its own in a KWD. Ideally pdfminer.six distinguishes between expected and known keywords (using KWD) and other unexpected characters (using another class). I think that is the preferred solution here.
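One possible shape for that distinction, as a sketch only: read the whole run up to the next whitespace or delimiter and pick the token class from its first character. Keyword, InvalidToken and read_word are hypothetical names invented here, not existing pdfminer.six classes, and the delimiter set is an assumption.

```python
import re
from dataclasses import dataclass

# Assumption: a word ends at whitespace or one of these delimiters.
END_OF_WORD = re.compile(rb"[\s/%\[\]()<>{}]")


@dataclass(frozen=True)
class Keyword:
    """An expected keyword such as R or obj (hypothetical class)."""
    name: bytes


@dataclass(frozen=True)
class InvalidToken:
    """A run of characters that cannot start any PDF object (hypothetical class)."""
    raw: bytes


def read_word(data: bytes, pos: int):
    """Read one word starting at pos; return (token, position after the word)."""
    m = END_OF_WORD.search(data, pos)
    end = m.start() if m else len(data)
    raw = data[pos:end]
    # Alphabetic start: treat as a keyword. Anything else: one invalid token.
    token = Keyword(raw) if raw[:1].isalpha() else InvalidToken(raw)
    return token, end


print(read_word(b"@pgfcolorspaces >>", 0))
# (InvalidToken(raw=b'@pgfcolorspaces'), 15)
```

With a separate class, the parser could decide per context whether to skip the invalid token, substitute a null, or report it, instead of failing on the dictionary as a whole.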


norbusan · 2024-12-02T23:21:49Z

To add something here: we just tried pdfminer.six pdf2txt.py on this arXiv document: https://arxiv.org/pdf/2411.18626 and it failed with:

pdfminer.psexceptions.PSSyntaxError: Invalid dictionary construct: [/'AIS', /b'fals', /b'e', /'BM', /'Normal', /'CA', 1, /'OP', False, /'OPM', 1, /'SA', True, /'SMask', /'None', /'Type', /'ExtGState', /'ca', 1, /'op', False]

I guess the error itself is correct: the list has an odd number of entries because false was split into the two tokens fals and e (no idea why).
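To illustrate why an odd count is fatal: a dictionary is built by pairing up a flat list of keys and values, so one extra token leaves a key without a value. The following is a self-contained sketch of that pairing, not the pdfminer.six implementation; build_dict is a made-up name and ValueError stands in for PSSyntaxError.

```python
def build_dict(objs):
    """Pair a flat [key, value, key, value, ...] list into a dict."""
    if len(objs) % 2 != 0:
        # An odd number of entries means some key has no value.
        raise ValueError(f"Invalid dictionary construct: {objs!r}")
    return dict(zip(objs[0::2], objs[1::2]))


print(build_dict(["AIS", False, "BM", "Normal"]))
# {'AIS': False, 'BM': 'Normal'}

try:
    build_dict(["AIS", "fals", "e", "BM", "Normal"])
except ValueError as exc:
    print(exc)
# Invalid dictionary construct: ['AIS', 'fals', 'e', 'BM', 'Normal']
```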

What puzzles me is that:

qpdf --check reports no error
pdfcpu validate reports no error
okular/browsers display the document without any error
pdftotext produces output (although not really useful)

I am not sure what the correct solution is, but I would love to see a two-fold approach:

a validate-like command that returns a list of the errors found
a text extraction part that tries to be resilient to errors like the one above
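Both halves could in principle share one code path that collects errors instead of raising. The sketch below is purely hypothetical: Name and build_dict_lenient are made-up names, not pdfminer.six APIs. It assumes dictionary keys must be names, records an error for any other token found in key position, and skips that token.

```python
class Name(str):
    """Stand-in for a PDF name object such as /Type (hypothetical)."""


def build_dict_lenient(objs):
    """Pair keys and values, returning (dict, list of error messages)."""
    result, errors = {}, []
    i = 0
    while i < len(objs):
        key = objs[i]
        if not isinstance(key, Name):
            # Resilience: report the stray token and move on.
            errors.append(f"expected a name at index {i}, got {key!r}")
            i += 1
            continue
        if i + 1 >= len(objs):
            errors.append(f"key {key!r} at index {i} has no value")
            break
        result[key] = objs[i + 1]
        i += 2
    return result, errors


tokens = [Name("AIS"), b"fals", b"e", Name("BM"), Name("Normal")]
parsed, problems = build_dict_lenient(tokens)
print(parsed)    # {'AIS': b'fals', 'BM': 'Normal'}
print(problems)  # ["expected a name at index 2, got b'e'"]
```

A validate command would print the error list, while text extraction would continue with the partial dictionary. Note that the value stored for AIS is still the broken fals token, so this only limits the damage and does not repair it.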

Thanks for your work on all this!

pietermarsman changed the title ~~Gracefully tokenize characters that are objects.~~ Gracefully tokenize invalid objects Jun 25, 2024

pietermarsman added type: new feature status: needs solution labels Jun 25, 2024
