allow @ in identifiers #947

norbusan · 2024-02-15T01:31:46Z

Identifiers can contain the @ symbol. A typical example is

put @resources << /ColorSpace @pgfcolorspaces >>

as found in documents created with the LaTeX packages pgf and colorspaces.

At the moment a PDF containing the above will trigger an error:

pdfminer.psparser.PSSyntaxError: Invalid dictionary construct: [/'ColorSpace', /b'@', /b'pgfcolorspaces']

since @ is parsed as a separate token.

The PR was tested against about 20 different PDF papers from the arXiv corpus.

Checklist

I have read CONTRIBUTING.md.
I have added a concise human-readable description of the change to CHANGELOG.md.
I have tested that this fix is effective or that this feature works.
I have added docstrings to newly created methods and classes.
I have updated the README.md and the readthedocs documentation. Or verified that this is not necessary.

pietermarsman · 2024-06-24T06:25:40Z

Hi,

This seems important.

I cannot reproduce this (tried it with this random example). Can you share the code and example PDFs that trigger this?

norbusan · 2024-06-24T06:59:45Z

Hi @pietermarsman
thanks for checking in after that long time. A document that fails is here: https://arxiv.org/pdf/2210.16408
With current master I get:

[~/arXiv/pdfminer-fixes/pdfminer.six.git] ./pdf2txt.py ~/Downloads/2210.16408v4.pdf 
Traceback (most recent call last):
  File "/home/norbert/arXiv/pdfminer-fixes/pdfminer.six.git/./pdf2txt.py", line 317, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/norbert/arXiv/pdfminer-fixes/pdfminer.six.git/./pdf2txt.py", line 311, in main
    outfp = extract_text(**vars(parsed_args))
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/norbert/arXiv/pdfminer-fixes/pdfminer.six.git/./pdf2txt.py", line 62, in extract_text
    pdfminer.high_level.extract_text_to_fp(fp, **locals())
  File "/home/norbert/arXiv/pdfminer-fixes/pdfminer.six.git/pdfminer/high_level.py", line 132, in extract_text_to_fp
    interpreter.process_page(page)
  File "/home/norbert/arXiv/pdfminer-fixes/pdfminer.six.git/pdfminer/pdfinterp.py", line 997, in process_page
    self.render_contents(page.resources, page.contents, ctm=ctm)
  File "/home/norbert/arXiv/pdfminer-fixes/pdfminer.six.git/pdfminer/pdfinterp.py", line 1016, in render_contents
    self.execute(list_value(streams))
  File "/home/norbert/arXiv/pdfminer-fixes/pdfminer.six.git/pdfminer/pdfinterp.py", line 1027, in execute
    (_, obj) = parser.nextobject()
               ^^^^^^^^^^^^^^^^^^^
  File "/home/norbert/arXiv/pdfminer-fixes/pdfminer.six.git/pdfminer/psparser.py", line 636, in nextobject
    raise PSSyntaxError(error_msg)
pdfminer.psparser.PSSyntaxError: Invalid dictionary construct: [/'ColorSpace', /b'@', /b'pgfcolorspaces']

pietermarsman · 2024-06-24T16:11:23Z

Thanks for the (much quicker) response. Will check it out.

pietermarsman · 2024-06-24T16:27:31Z

Can confirm that it does fix the issue for this PDF. I'm wondering if we should add other characters as well. The PDF reference, section 3.2.4 on Name Objects, says that any non-whitespace, non-delimiter character can be used. Figuring that out now.
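
To make the "non-whitespace, non-delimiter" rule concrete, here is a minimal sketch. It is not pdfminer.six code: the helper names `is_regular_char` and `is_valid_name_body` are made up for illustration, and `#xx` escapes inside names are ignored for simplicity.

```python
# Whitespace and delimiter characters as defined by the PDF reference.
PDF_WHITESPACE = b"\x00\t\n\x0c\r "
PDF_DELIMITERS = b"()<>[]{}/%"


def is_regular_char(byte: int) -> bool:
    """Return True if the byte is neither whitespace nor a delimiter."""
    return byte not in PDF_WHITESPACE and byte not in PDF_DELIMITERS


def is_valid_name_body(body: bytes) -> bool:
    """Check the part of a name object that follows the leading '/'.

    Hypothetical helper: '#xx' escape sequences are not validated here.
    """
    return all(is_regular_char(b) for b in body)
```

Under this rule `@` is a regular character, so `/@pgfcolorspaces` would be an acceptable name. The rule only applies to what follows a leading `/`, though.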

Will try to add a small test to this PR and then merge it.

pietermarsman · 2024-06-24T16:33:45Z

Hmm, when I use mutool to clean up the PDF I get:

$ mutool clean -gggg -l -dzc Downloads/2210.16408 Downloads/2210.16408.pdf
...
error: unknown keyword: 'put'
error: unknown keyword: '@resources'

pietermarsman · 2024-06-24T17:02:57Z

Hmm, I was not familiar with these put commands. It seems to be a PostScript thing that pdfminer.six ignores.

I've run out of time today trying to understand that. Will continue next Monday.

pietermarsman · 2024-06-25T19:40:28Z

I was looking more into << /ColorSpace @pgfcolorspaces >> and realized that this is actually an invalid PDF object. The best course of action is to create an issue first, and discuss the solutions before going further with this MR. That is why I'm closing this MR. More details below.

The invalid PDF object has a << and >>. This indicates that it is a dictionary object (section 3.2.6). Dictionary keys are names (section 3.2.4) and the value can be any object. But the value in question (@pgfcolorspaces) is not a valid object because it starts with an @:

Booleans (3.2.1) are either true or false
Numerics (3.2.2) are numbers with a potential leading +, - or .
Literal strings (3.2.3) start with a (
Hexadecimal strings (3.2.3) start with a single <
Name objects (3.2.4) start with a /
Array objects (3.2.5) start with a [
Dictionary objects (3.2.6) start with a <<
Stream objects (3.2.7) start with a dictionary.
Null objects (3.2.8) are simply null
And indirect objects (3.2.9) start with a numeric object.

So starting an object with an @ is not an option in the PDF spec. The question then is: how should this unexpected object be tokenized?
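
The list above can be sketched as a lookup on how a token starts. This is only an illustration of the reasoning, not how pdfminer.six classifies objects, and `classify_object_start` is a hypothetical name.

```python
def classify_object_start(token: bytes) -> str:
    """Guess the PDF object type from how a token starts (illustrative only)."""
    if token in (b"true", b"false"):
        return "boolean"
    if token == b"null":
        return "null"
    if token.startswith(b"<<"):
        # A stream also starts with a dictionary.
        return "dictionary (or stream)"
    first = token[:1]
    if first == b"(":
        return "literal string"
    if first == b"<":
        return "hexadecimal string"
    if first == b"/":
        return "name"
    if first == b"[":
        return "array"
    if first.isdigit() or first in (b"+", b"-", b"."):
        # An indirect object or reference also starts with a number.
        return "numeric (or indirect object)"
    return "not a PDF object"
```

With this sketch, `classify_object_start(b"@pgfcolorspaces")` falls through every branch and ends up as "not a PDF object".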

Currently the tokenizer checks for a couple of special characters (%/-+.(<>) to recognize most objects (e.g. numerics, strings, names, etc.). If the token starts with an alphabetical character, it checks if it is a boolean and otherwise assumes it is a multi-character keyword (ending at the first whitespace or special character from above). All non-special non-alphabetical characters are assumed to be single-character keywords.
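
A toy tokenizer that follows the same rough rules shows why the error message contains `@` and `pgfcolorspaces` as two tokens. This is a simplified sketch, not the pdfminer.six implementation: it ignores numbers, strings and comments, and the names `TOKEN` and `tokenize` are made up for this example.

```python
import re
from typing import List

# Toy rules only: names, dictionary delimiters, alphabetic keywords,
# and "anything else" as a single-character token.
TOKEN = re.compile(
    rb"""
      (?P<name>/[^\s()<>\[\]{}/%]*)          # name object: '/' plus regular characters
    | (?P<dict><<|>>)                        # dictionary delimiters
    | (?P<word>[A-Za-z][^\s()<>\[\]{}/%]*)   # keyword starting with a letter
    | (?P<other>\S)                          # any other character stands alone
    """,
    re.VERBOSE,
)


def tokenize(data: bytes) -> List[bytes]:
    """Split data into raw tokens using the toy rules above."""
    return [match.group(0) for match in TOKEN.finditer(data)]


print(tokenize(b"<< /ColorSpace @pgfcolorspaces >>"))
# [b'<<', b'/ColorSpace', b'@', b'pgfcolorspaces', b'>>']
```

Because `@` is not a letter and not one of the special characters, it ends up alone, and the letters after it form a separate keyword.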

This works well for PDFs with correct syntax. For example, the array keywords are parsed correctly ([ and ]). And the same for indirect objects (which use the keyword R). But for PDFs with incorrect syntax it puts all the "unexpected" characters on their own in a KWD. Ideally pdfminer.six distinguishes between expected and known keywords (using KWD) and other unexpected characters (using another class). I think that is the preferred solution here, but it is a lot more work than this MR.
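
Roughly, the idea of a separate class for unexpected characters could look like the sketch below. Everything here is hypothetical: `Keyword`, `UnexpectedToken`, `KNOWN_KEYWORDS` and `classify_keyword` are illustrative names, not existing pdfminer.six classes, and the keyword set is deliberately incomplete.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Keyword:
    """A keyword the parser knows how to handle."""
    name: bytes


@dataclass(frozen=True)
class UnexpectedToken:
    """Anything that is neither a known keyword nor a valid object."""
    raw: bytes


# Assumed, incomplete set of known keywords for the purpose of this sketch.
KNOWN_KEYWORDS = frozenset({b"obj", b"endobj", b"R", b"[", b"]", b"cm", b"g", b"G"})


def classify_keyword(raw: bytes) -> Union[Keyword, UnexpectedToken]:
    """Wrap a raw token in Keyword if it is known, else in UnexpectedToken."""
    if raw in KNOWN_KEYWORDS:
        return Keyword(raw)
    return UnexpectedToken(raw)
```

Downstream code could then decide per type whether to raise, skip or log, instead of treating `@` the same way as `R` or `[`.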

pietermarsman · 2024-06-25T19:48:01Z

I'm sorry to close this MR after we have been discussing it for a while. But deviating from the PDF spec is something I want to do carefully. Closing this and discussing things in an issue gives me a bit more headspace to think about the consequences.

pietermarsman · 2024-06-25T19:55:43Z

Also I have difficulties finding the specification for the put ... syntax. I thought it was a PostScript thing but cannot find it in the specs anywhere. At least, in the specs the put is always at the end of the line, not at the beginning. So that's also a mystery to me.

It seems like it is declaring the variable @resources. And so probably @pgfcolorspaces was declared before. In that case we need to insert the content of this variable instead of parsing the variable name itself. That is also a major change. To do that, we definitely need a specification.

petervwyatt · 2024-06-26T00:32:31Z

This is caused by certain bad LaTeX PDF producers that use some kind of macro expansion/replacement (search all of GitHub for "@pgfcolorspaces" for source documents and similar examples).

Here's the start of the content stream for page 1:

1 0 0 1 -72 769.89 cm
obj
@pgfcolorspaces
<< >> put
@resources
<< >> >>
put
@pgfcolorspaces
<< /pgfprgb [ /Pattern /DeviceRGB ] >> 0 g
0 G
0 g
0 G
....

From a formal PDF syntax PoV, encountering a content stream token commencing with "@" or "put" syntactically implies an unrecognized (new?) operator, whereas "obj" is an existing formally defined PDF keyword that is syntactically out-of-place (invalid in content streams). None of this is encapsulated between the BX/EX compatibility operators so erroring is an acceptable outcome (otherwise you'd suppress errors).

There is currently a public discussion occurring on this exact topic - see PDF Errata #363 - if you want to make sure your parsing rules are correct.
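
As a rough sketch of the BX/EX point: unrecognized operators are only reported when they occur outside a compatibility section. The function name `find_unrecognized` and the `known` set are assumptions made for this example, and operands are left out entirely.

```python
from typing import Iterable, List, Set


def find_unrecognized(operators: Iterable[str], known: Set[str]) -> List[str]:
    """Return unknown operators that appear outside any BX/EX section."""
    depth = 0  # nesting level of BX ... EX compatibility sections
    errors: List[str] = []
    for op in operators:
        if op == "BX":
            depth += 1
        elif op == "EX":
            depth = max(depth - 1, 0)
        elif op not in known and depth == 0:
            errors.append(op)
    return errors


known_ops = {"cm", "g", "G"}
stream = ["cm", "obj", "@pgfcolorspaces", "put", "g"]
print(find_unrecognized(stream, known_ops))
# ['obj', '@pgfcolorspaces', 'put']
print(find_unrecognized(["BX"] + stream + ["EX"], known_ops))
# []
```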

norbusan · 2024-06-26T01:19:06Z

Thanks @pietermarsman and @petervwyatt
It is disturbing that the pgf package or the graphics driver does something bad here.
I will get in contact with the respective teams on the TeX side to get this fixed in one way or another.

norbusan · 2024-06-26T01:35:17Z

OK, I see now what the problem is (the wrong driver was used for the graphicx package, which created the incorrect document).
This needs to be fixed on the arXiv.org side.
Sorry for the noise.

norbusan marked this pull request as draft February 15, 2024 01:33

Add changelog entry

465a8f7

norbusan marked this pull request as ready for review February 15, 2024 01:39

pietermarsman closed this Jun 25, 2024

pietermarsman mentioned this pull request Jun 25, 2024

Gracefully tokenize invalid objects #968

Open

norbusan mentioned this pull request Jun 26, 2024

[Discussion/suggestion] Wrong driver selection results in incorrect PDF - forcibly use correct driver? latex3/graphics-def#41

Open
