allow @ in identifiers #947

Closed
wants to merge 2 commits into from

@norbusan norbusan commented Feb 15, 2024

Identifiers can contain the @ symbol, a typical example is

put @resources << /ColorSpace @pgfcolorspaces >>

as found in documents created with LaTeX package pgf and colorspaces.

At the moment a pdf containing the above will trigger an error:

pdfminer.psparser.PSSyntaxError: Invalid dictionary construct: [/'ColorSpace', /b'@', /b'pgfcolorspaces']

since @ is parsed as separate token.
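To illustrate why the split token leads to this error, here is a minimal sketch. It is not pdfminer.six's actual code; `build_dict` is a hypothetical helper and tokens are plain strings. It assumes the parser pairs up whatever it finds between `<<` and `>>`, so an odd number of items cannot form key/value pairs:

```python
def build_dict(tokens):
    """Pair up the tokens found between << and >> (hypothetical helper)."""
    if len(tokens) % 2 != 0:
        # An odd number of items cannot be split into key/value pairs.
        raise ValueError(f"Invalid dictionary construct: {tokens!r}")
    return dict(zip(tokens[0::2], tokens[1::2]))


# '@' split off as its own token: three items, so pairing fails.
split_tokens = ["/ColorSpace", "@", "pgfcolorspaces"]

# '@pgfcolorspaces' kept as one token: two items, so pairing succeeds.
joined_tokens = ["/ColorSpace", "@pgfcolorspaces"]
```

Calling `build_dict(split_tokens)` raises, while `build_dict(joined_tokens)` returns a one-entry dictionary.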

The PR was tested against about 20 different PDF papers from the arXiv corpus.

Checklist

  • I have read CONTRIBUTING.md.
  • I have added a concise human-readable description of the change to CHANGELOG.md.
  • I have tested that this fix is effective or that this feature works.
  • I have added docstrings to newly created methods and classes.
  • I have updated the README.md and the readthedocs documentation. Or verified that this is not necessary.

@norbusan norbusan marked this pull request as draft February 15, 2024 01:33
@norbusan norbusan marked this pull request as ready for review February 15, 2024 01:39
@pietermarsman
Member

Hi,

This seems important.

I cannot reproduce this (tried it with this random example). Can you share the code and an example PDF that trigger this?

@norbusan
Author

Hi @pietermarsman,
thanks for checking in after such a long time. A document that fails is here: https://arxiv.org/pdf/2210.16408
With current master I get:

[~/arXiv/pdfminer-fixes/pdfminer.six.git] ./pdf2txt.py ~/Downloads/2210.16408v4.pdf 
Traceback (most recent call last):
  File "/home/norbert/arXiv/pdfminer-fixes/pdfminer.six.git/./pdf2txt.py", line 317, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/norbert/arXiv/pdfminer-fixes/pdfminer.six.git/./pdf2txt.py", line 311, in main
    outfp = extract_text(**vars(parsed_args))
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/norbert/arXiv/pdfminer-fixes/pdfminer.six.git/./pdf2txt.py", line 62, in extract_text
    pdfminer.high_level.extract_text_to_fp(fp, **locals())
  File "/home/norbert/arXiv/pdfminer-fixes/pdfminer.six.git/pdfminer/high_level.py", line 132, in extract_text_to_fp
    interpreter.process_page(page)
  File "/home/norbert/arXiv/pdfminer-fixes/pdfminer.six.git/pdfminer/pdfinterp.py", line 997, in process_page
    self.render_contents(page.resources, page.contents, ctm=ctm)
  File "/home/norbert/arXiv/pdfminer-fixes/pdfminer.six.git/pdfminer/pdfinterp.py", line 1016, in render_contents
    self.execute(list_value(streams))
  File "/home/norbert/arXiv/pdfminer-fixes/pdfminer.six.git/pdfminer/pdfinterp.py", line 1027, in execute
    (_, obj) = parser.nextobject()
               ^^^^^^^^^^^^^^^^^^^
  File "/home/norbert/arXiv/pdfminer-fixes/pdfminer.six.git/pdfminer/psparser.py", line 636, in nextobject
    raise PSSyntaxError(error_msg)
pdfminer.psparser.PSSyntaxError: Invalid dictionary construct: [/'ColorSpace', /b'@', /b'pgfcolorspaces']

@pietermarsman
Member

Thanks for the (much quicker) response. Will check it out.

@pietermarsman
Member

I can confirm that it fixes the issue for this PDF. I'm wondering if we should add other characters as well. Section 3.2.4 of the PDF reference, on Name Objects, says that any non-whitespace, non-delimiter character can be used. Figuring that out now.

Will try to add a small test to this PR and then merge it.

@pietermarsman
Member

Hmm, when I use mutool to clean up the PDF I get:

$ mutool clean -gggg -l -dzc Downloads/2210.16408 Downloads/2210.16408.pdf
...
error: unknown keyword: 'put'
error: unknown keyword: '@resources'

@pietermarsman
Member

pietermarsman commented Jun 24, 2024

Hmm, I was not familiar with these put commands. It seems to be a PostScript thing that pdfminer.six ignores.

I ran out of time today trying to understand that. Will continue next Monday.

@pietermarsman
Member

I was looking more into << /ColorSpace @pgfcolorspaces >> and realized that this is actually an invalid PDF object. The best course of action is to create an issue first, and discuss the solutions before going further with this MR. That is why I'm closing this MR. More details below.

The invalid PDF object has a << and >>. This indicates that it is a dictionary object (section 3.2.6). Dictionary keys are names (section 3.2.4) and the value can be any object. But the value in question (@pgfcolorspaces) is not a valid object because it starts with an @:

  • Booleans (3.2.1) are either true or false
  • Numerics (3.2.2) are numbers with a potential leading +, - or .
  • Literal strings (3.2.3) start with a (
  • Hexadecimal strings (3.2.3) start with a single <
  • Name objects (3.2.4) start with a /
  • Array objects (3.2.5) start with a [
  • Dictionary objects (3.2.6) start with a <<
  • Stream objects (3.2.7) start with a dictionary.
  • Null objects (3.2.8) are simply null
  • And indirect objects (3.2.9) start with a numeric object.

So starting an object with a @ is not an option in the PDF spec. So the question is: how should this unexpected object be tokenized?
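The list above can be sketched as a small lookup on how a token starts. This is only an illustration of the argument, not pdfminer.six code; `classify_object_start` and its return labels are made-up names:

```python
def classify_object_start(token: str) -> str:
    """Guess the PDF object type from how a token starts (illustrative only)."""
    first = token[:1]
    if token in ("true", "false"):
        return "boolean"
    if token == "null":
        return "null"
    if token.startswith("<<"):
        return "dictionary or stream"
    if first == "<":
        return "hexadecimal string"
    if first == "(":
        return "literal string"
    if first == "/":
        return "name"
    if first == "[":
        return "array"
    if first and first in "+-.0123456789":
        return "numeric or indirect object"
    # Nothing in the list above matches, e.g. a leading '@'.
    return "not a valid object start"
```

With this sketch, `classify_object_start("@pgfcolorspaces")` falls through every branch, which is the point made above: no object type in the spec begins with an @.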

Currently the tokenizer checks for a couple of special characters (%/-+.(<>) to recognize most objects (e.g. numerics, strings, names, etc.). If the token starts with an alphabetical character, it checks if it is a boolean and otherwise assumes it is a multi-character keyword (ending at the first whitespace or special character from above). All non-special non-alphabetical characters are assumed to be single-character keywords.
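A heavily simplified sketch of that behaviour, to make the @ case concrete. This is a toy, not the real `psparser` code: it only handles names, `<<`, `>>` and keywords, and the `keyword_start` parameter is a hypothetical knob showing roughly what allowing @ in identifiers would change:

```python
SPECIAL = "%/-+.(<>"  # the special characters mentioned above


def tokenize(data: str, keyword_start: str = "") -> list:
    """Toy tokenizer: names, '<<', '>>' and keywords only."""
    tokens = []
    i = 0
    while i < len(data):
        ch = data[i]
        if ch.isspace():
            i += 1
        elif data.startswith("<<", i) or data.startswith(">>", i):
            tokens.append(data[i:i + 2])
            i += 2
        elif ch == "/" or ch.isalpha() or ch in keyword_start:
            # Multi-character token: runs until whitespace or a special char.
            j = i + 1
            while j < len(data) and not data[j].isspace() and data[j] not in SPECIAL:
                j += 1
            tokens.append(data[i:j])
            i = j
        else:
            # Any other character becomes a keyword on its own.
            tokens.append(ch)
            i += 1
    return tokens


sample = "<< /ColorSpace @pgfcolorspaces >>"
default_tokens = tokenize(sample)
relaxed_tokens = tokenize(sample, keyword_start="@")
```

In `default_tokens` the @ ends up as a separate token before `pgfcolorspaces`; in `relaxed_tokens` it stays attached as `@pgfcolorspaces`.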

This works well for PDFs with correct syntax. For example, the array keywords are parsed correctly ([ and ]). And the same for indirect objects (which use the keyword R). But for PDFs with incorrect syntax it puts each "unexpected" character on its own in a KWD. Ideally pdfminer.six distinguishes between expected and known keywords (using KWD) and other unexpected characters (using another class). I think that is the preferred solution here, but it is a lot more work than this MR.
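Roughly what I have in mind, as a sketch only. The class names `Keyword` and `UnexpectedToken`, the `classify` helper and the small `KNOWN_KEYWORDS` set are all hypothetical and do not exist in pdfminer.six:

```python
from dataclasses import dataclass

# Tiny illustrative subset; a real list would cover all PDF keywords/operators.
KNOWN_KEYWORDS = frozenset({"[", "]", "<<", ">>", "R", "obj", "endobj", "cm", "g", "G"})


@dataclass(frozen=True)
class Keyword:
    """A token that the PDF spec defines."""
    name: str


@dataclass(frozen=True)
class UnexpectedToken:
    """A token that is neither a known keyword nor a valid object."""
    text: str


def classify(raw: str):
    """Wrap a raw keyword-like token in the matching class."""
    if raw in KNOWN_KEYWORDS:
        return Keyword(raw)
    return UnexpectedToken(raw)
```

Downstream code could then decide per class whether to raise, warn or skip, instead of treating `@` the same way as `R`.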

@pietermarsman
Member

I'm sorry to close this MR after we have been discussing it for a while. But deviating from the PDF spec is something I want to do carefully. Closing this and discussing things in an issue gives me a bit more headspace to think about the consequences.

@pietermarsman
Member

Also, I have difficulty finding the specification for the put ... syntax. I thought it was a PostScript thing but cannot find it in the specs anywhere. At least, in the specs the put is always at the end of the line, not at the beginning. So that's also a mystery to me.

It seems like it is declaring the variable @resources. And so probably @pgfcolorspaces was declared before. In that case we need to insert the content of this variable, instead of parsing the variable name itself. That is also a major change. To do that, we definitely need a specification.

@petervwyatt

This is caused by certain bad LaTeX PDF producers that use some kind of macro expansion/replacement (search all of GitHub for "@pgfcolorspaces" for source documents and similar examples).

Here's the start of the content stream for page 1:

1 0 0 1 -72 769.89 cm
obj
@pgfcolorspaces
<< >> put
@resources
<< >> >>
put
@pgfcolorspaces
<< /pgfprgb [ /Pattern /DeviceRGB ] >> 0 g
0 G
0 g
0 G
....

From a formal PDF syntax PoV, encountering a content stream token commencing with "@" or "put" syntactically implies an unrecognized (new?) operator, whereas "obj" is an existing formally defined PDF keyword that is syntactically out-of-place (invalid in content streams). None of this is encapsulated between the BX/EX compatibility operators so erroring is an acceptable outcome (otherwise you'd suppress errors).
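As a rough sketch of the BX/EX rule (illustrative only, not taken from any particular parser; `check_operators` and the three-entry `KNOWN_OPERATORS` set are hypothetical, and operands are left out entirely):

```python
KNOWN_OPERATORS = {"cm", "g", "G"}  # tiny illustrative subset


def check_operators(operators):
    """Return the unknown operators ignored inside BX/EX; raise outside."""
    depth = 0  # BX/EX sections may be nested
    ignored = []
    for op in operators:
        if op == "BX":
            depth += 1
        elif op == "EX":
            if depth == 0:
                raise ValueError("EX without matching BX")
            depth -= 1
        elif op in KNOWN_OPERATORS:
            continue
        elif depth > 0:
            # Inside a compatibility section: skip silently.
            ignored.append(op)
        else:
            raise ValueError(f"unknown operator outside BX/EX: {op!r}")
    return ignored
```

For example, `check_operators(["cm", "BX", "put", "EX", "g"])` tolerates the unknown `put`, while the same stream without the BX/EX pair raises, which matches the content stream shown above.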

There is currently a public discussion occurring on this exact topic - see PDF Errata #363 - if you want to make sure your parsing rules are correct.

@norbusan
Author

Thanks @pietermarsman and @petervwyatt,
it is disturbing that the pgf package or the graphics driver does something bad here.
I will get in contact with the respective teams on the TeX side to get this fixed one way or another.

@norbusan
Author

OK, I see now what the problem is (the wrong driver was used for the graphicx package, which created the incorrect document).
This needs to be fixed on the arXiv.org side.
Sorry for the noise.

3 participants