allow @ in identifiers #947
Identifiers can contain the @ symbol. A typical example is

```
put @resources << /ColorSpace @pgfcolorspaces >>
```

as found in documents created with the LaTeX package pgf and colorspaces. At the moment a PDF containing the above will trigger an error:

```
pdfminer.psparser.PSSyntaxError: Invalid dictionary construct: [/'ColorSpace', /b'@', /b'pgfcolorspaces']
```

since @ is parsed as a separate token.
Hi, this seems important. I cannot reproduce this (tried it with this random example). Can you share the code and example PDFs that trigger it?
Hi @pietermarsman
Thanks for the (much quicker) response. Will check it out.
Can confirm that it does fix the issue for this PDF. I'm wondering if we should add other characters as well. The PDF reference, section 3.2.4 on Name Objects, says that any non-whitespace, non-delimiter character can be used. Figuring that out now. Will try to add a small test to this PR and then merge it.
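To make the "non-whitespace, non-delimiter" rule concrete, here is a minimal sketch of the character classes. It assumes the white-space and delimiter lists from the PDF reference; the names `is_regular`, `regular_runs` and `REGULAR_RUN` are hypothetical and this is not pdfminer's tokenizer.

```python
import re

# White-space and delimiter bytes as listed in the PDF reference.
PDF_WHITESPACE = b"\x00\t\n\x0c\r "
PDF_DELIMITERS = b"()<>[]{}/%"

# Hypothetical pattern: a maximal run of regular (non-white-space,
# non-delimiter) bytes. This is an illustration, not pdfminer's own regex.
REGULAR_RUN = re.compile(rb"[^\x00\t\n\x0c\r ()<>\[\]{}/%]+")


def is_regular(byte: bytes) -> bool:
    """Return True if the single byte is neither white space nor a delimiter."""
    return (
        len(byte) == 1
        and byte not in PDF_WHITESPACE
        and byte not in PDF_DELIMITERS
    )


def regular_runs(data: bytes) -> list:
    """Return the runs of regular characters in data, dropping everything else."""
    return REGULAR_RUN.findall(data)
```

Under these assumptions `@` is a regular character, so `regular_runs(b"put @resources << /ColorSpace @pgfcolorspaces >>")` keeps `@resources` and `@pgfcolorspaces` in one piece instead of splitting off the `@`.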
Hmm, when I use mutools to clean up the PDF I get:
Hmm, I was not familiar with these. I ran out of time today trying to understand that. Will continue next Monday.
I was looking more into The invalid PDF object has a
So starting an object with a Currently the tokenizer checks for a couple of special characters ( This works well for PDFs with correct syntax. For example, the array keywords are parsed correctly (
I'm sorry to close this MR after we have been discussing it for a while, but deviating from the PDF spec is something I want to do carefully. Closing this and discussing things in an issue gives me a bit more headspace to think about the consequences.
Also I have difficulties finding the specification for the It seems like it is declaring the variable
This is caused by certain bad LaTeX PDF producers that use some kind of macro expansion/replacement (search all of GitHub for "@pgfcolorspaces" for source documents and similar examples). Here's the start of the content stream for page 1:
From a formal PDF syntax PoV, encountering a content stream token commencing with "@" or "put" syntactically implies an unrecognized (new?) operator, whereas "obj" is an existing formally defined PDF keyword that is syntactically out of place (invalid in content streams). None of this is encapsulated between the BX/EX compatibility operators, so erroring is an acceptable outcome (otherwise you'd suppress errors). There is currently a public discussion occurring on this exact topic (see PDF Errata #363) if you want to make sure your parsing rules are correct.
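The BX/EX point can be sketched roughly as below. This is only an illustration under stated assumptions: `unknown_operators` is a hypothetical helper, the operator stream is assumed to be already tokenized into strings, and the caller supplies the set of known operators (deliberately abbreviated in the usage note).

```python
def unknown_operators(operators, known):
    """Yield (index, operator, suppressed) for each unrecognized operator.

    `suppressed` is True when the operator sits inside a BX ... EX
    compatibility section, where unrecognized operators are ignored
    rather than reported as errors. Sections may nest, hence the counter.
    """
    depth = 0
    for index, op in enumerate(operators):
        if op == "BX":
            depth += 1
        elif op == "EX":
            # Tolerate a stray EX instead of going negative.
            depth = max(depth - 1, 0)
        elif op not in known:
            yield index, op, depth > 0
```

For example, with `known = {"q", "Q"}` and the stream `["q", "put", "BX", "@foo", "EX", "Q"]`, the helper reports `put` as unsuppressed (an error is acceptable) and `@foo` as suppressed.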
Thanks @pietermarsman and @petervwyatt
Ok, I see now what the problem is (wrong driver was used for the graphicx package which created the incorrect document).
The PR was tested against about 20 different PDF papers from the arXiv corpus.