The cause is that << /ColorSpace @pgfcolorspaces >> is actually an invalid PDF object. With the current implementation @ and pgfcolorspaces are both tokenized as KWD's. Ideally this invalid object is tokenized as a single token with a different class from KWD.
The invalid PDF object has a << and >>. This indicates that it is a dictionary object (section 3.2.6). Dictionary keys are names (section 3.2.4) and the value can be any object. But the value in question (@pgfcolorspaces) is not a valid object because it starts with an @:
Booleans (3.2.1) are either true or false
Numerics (3.2.2) are numbers with a potential leading +, - or .
Literal strings (3.2.3) start with a (
Hexadecimal strings (3.2.3) start with a single <
Name objects (3.2.4) start with a /
Array objects (3.2.5) start with a [
Dictionary objects (3.2.6) start with a <<
Stream objects (3.2.7) start with a dictionary.
Null objects (3.2.8) are simply null
And indirect objects (3.2.9) start with a numeric object.
So starting an object with an @ is not an option in the PDF spec. The question is: how should this unexpected object be tokenized?
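To make the list above concrete, here is a minimal sketch that maps the start of a token to the object type it could begin. The function name `leading_object_kind` is made up for illustration and is not part of pdfminer.six; it only encodes the first-character rules listed above.

```python
from typing import Optional


def leading_object_kind(text: str) -> Optional[str]:
    """Return the PDF object type that `text` could start, or None if none fits.

    Hypothetical helper for illustration only; it is not pdfminer.six code.
    """
    if not text:
        return None
    if text in ("true", "false"):
        return "boolean"
    if text == "null":
        return "null"
    if text.startswith("<<"):
        return "dictionary"  # a stream object also starts with a dictionary
    if text.startswith("<"):
        return "hexadecimal string"
    if text.startswith("("):
        return "literal string"
    if text.startswith("/"):
        return "name"
    if text.startswith("["):
        return "array"
    if text[0] in "+-.0123456789":
        return "numeric"  # an indirect object also starts with a numeric
    return None


print(leading_object_kind("/ColorSpace"))      # name
print(leading_object_kind("@pgfcolorspaces"))  # None
```

None of the rules match `@pgfcolorspaces`, which is why it cannot be a valid dictionary value.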
Currently the tokenizer checks for a couple of special characters (%/-+.(<>) to recognize most objects (e.g. numerics, strings, names, etc.). If the token starts with an alphabetical character, it checks if it is a boolean and otherwise assumes it is a multi-character keyword (ending at the first whitespace or special character from above). All non-special non-alphabetical characters are assumed to be keywords.
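The behaviour described above can be modelled with a toy tokenizer. This is a simplified sketch of the keyword branch only, written from the description in this issue; it is not the actual pdfminer.six code, and the name `tokenize_keywords` is made up. The branches for the special characters are left out on purpose.

```python
from typing import List, Tuple

SPECIAL = "%/-+.(<>"  # the special characters mentioned above


def tokenize_keywords(data: str) -> List[Tuple[str, str]]:
    """Toy model of the fallback rules described above (not pdfminer.six code)."""
    tokens: List[Tuple[str, str]] = []
    i = 0
    while i < len(data):
        ch = data[i]
        if ch.isspace():
            i += 1
        elif ch in SPECIAL:
            # Numerics, strings, names, etc. are out of scope for this sketch.
            raise NotImplementedError(f"special character {ch!r} not modelled")
        elif ch.isalpha():
            # Read up to the next whitespace or special character.
            j = i
            while j < len(data) and not data[j].isspace() and data[j] not in SPECIAL:
                j += 1
            word = data[i:j]
            kind = "BOOL" if word in ("true", "false") else "KWD"
            tokens.append((kind, word))
            i = j
        else:
            # Any other character becomes a keyword on its own.
            tokens.append(("KWD", ch))
            i += 1
    return tokens


print(tokenize_keywords("@pgfcolorspaces"))
# [('KWD', '@'), ('KWD', 'pgfcolorspaces')]
```

The `@` falls into the last branch and is emitted separately from the word that follows it, which matches the two KWDs seen for this PDF.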
This works well for PDFs with correct syntax. For example, the array keywords are parsed correctly ([ and ]). And the same for indirect objects (which use the keyword R). But for PDFs with incorrect syntax it puts each "unexpected" character on its own in a KWD. Ideally pdfminer.six distinguishes between expected and known keywords (using KWD) and other unexpected characters (using another class). I think that is the preferred solution here.
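A rough sketch of what that distinction could look like, covering only the fallback branch (words that did not match a numeric, string, name, etc.). The class names `Keyword` and `UnexpectedToken`, the function `classify_words` and the contents of `KNOWN_KEYWORDS` are all assumptions for illustration. Only `[`, `]` and `R` come from the text above; a real list would have to cover more, for example content stream operators.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass(frozen=True)
class Keyword:
    """An expected, known keyword (what KWD is used for today)."""
    name: str


@dataclass(frozen=True)
class UnexpectedToken:
    """Anything in the fallback branch that is not a known keyword."""
    text: str


# Illustrative set only; not the list pdfminer.six uses.
KNOWN_KEYWORDS = frozenset({"[", "]", "R", "obj", "endobj", "stream", "endstream"})


def classify_words(data: str) -> List[Union[Keyword, UnexpectedToken]]:
    """Split on whitespace and classify each word as known or unexpected."""
    tokens: List[Union[Keyword, UnexpectedToken]] = []
    for word in data.split():
        if word in KNOWN_KEYWORDS:
            tokens.append(Keyword(word))
        else:
            # The whole word stays together as a single token.
            tokens.append(UnexpectedToken(word))
    return tokens


print(classify_words("@pgfcolorspaces R"))
# [UnexpectedToken(text='@pgfcolorspaces'), Keyword(name='R')]
```

With a separate class, later parsing stages can decide how to handle the invalid value (skip it, warn, or raise) without confusing it with a real keyword.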
pietermarsman changed the title from "Gracefully tokenize characters that are objects." to "Gracefully tokenize invalid objects" on Jun 25, 2024
From #947.
This PDF
raises