Plain citations parser giving unrelated citations back. How come? Attached a test case #12893
Comments
Hi! Thanks for checking out JabRef and its text parser! I have worked on this feature for some time.

TL;DR: Plain citation parsing in JabRef does not work very well. The rule-based parser is underdeveloped (rule-based algorithms are hard to develop), Grobid gives irrelevant entries, and an LLM is hard to set up.

We call this feature the plain citations parser (sometimes the plain references parser). In JabRef 5.16 there are two ways to parse citations: rule-based and Grobid. In our many experiments (see [1], [2], [3]), both parsers gave not-so-good results. Grobid quite often returns irrelevant entries, which is probably what you are experiencing (for some time JabRef's Grobid instance was also down [4]).

In the new version of JabRef (6.0-alpha), we have added warnings about confabulations in plain citation parsing [5] and expanded the documentation [6]. One could also use an LLM to parse citations [7]; people (including me) say it works well enough [8], [9].
Hello! I’m interested in working on this issue. Let me know if I can take it up.
@Bha2912, this issue is not marked as So, for now, you can look at other issues.
Hi @InAnYan, I changed the title based on your comments. I know how difficult this is, as I have been doing my best with vim and regexes myself for years. But what Grobid returns at the moment is, in my opinion, not worth it.

For the next test I used the JabRef development version with the original text input above. Grobid got just 10% correct and 90% was complete garbage, so for now I do not trust the Grobid parser one bit. In the rule-based result, I see that on each line the previous author is pasted back into the author field.

anystyle.io helped in my case. Would that way of dealing with plain citations not be a better idea?
JabRef version
Other (please describe below)
Operating system
GNU / Linux
Details on version and operating system
JabRef 5.16--2024-07-25--771c4cd Linux 6.12.20-2-manjaro amd64 Java 21.0.2 JavaFX 22.0.2+4
Checked with the latest development build (copy version output from About dialog)
Steps to reproduce the behaviour
JabRef 5.16
There is a problem with the text parser: it changes the citations into completely unrelated citations.
I created a test case; see the two attached files.
testcase.txt
testcase.bib.txt
How come these totally different citations are matched? Is there a way to parse the strings without using Grobid? I think blindly trusting Grobid is wrong. At least verifying the whole title string would already help to see that something is not right.
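A title check of the kind suggested above could look roughly like the following. This is only a sketch in Python, not JabRef code: the function names `normalize` and `title_appears_in_source` and the sample citation are hypothetical, and it assumes the parser returns a title string that can be compared with the plain text that was pasted in.

```python
import re


def normalize(text: str) -> str:
    """Casefold and collapse every run of non-word characters into one space."""
    return re.sub(r"[\W_]+", " ", text.casefold()).strip()


def title_appears_in_source(parsed_title: str, source_text: str) -> bool:
    """Return True if the parsed title occurs, word for word, in the source text.

    Both strings are normalized first, so case and punctuation differences
    are ignored. Padding with spaces keeps matches on whole-word boundaries.
    """
    title = normalize(parsed_title)
    if not title:
        return False
    return f" {title} " in f" {normalize(source_text)} "


# Hypothetical input line, made up for illustration only.
citation = (
    "Smith, J. (2020). Deep learning for citation parsing. "
    "Journal of Examples, 12(3), 45-67."
)

matching = title_appears_in_source("Deep Learning for Citation Parsing", citation)
unrelated = title_appears_in_source("A survey of protein folding", citation)
```

An entry whose title fails this check could be flagged for manual review instead of being accepted silently. An exact word-for-word match is deliberately strict; a real implementation would probably need some tolerance for hyphenation and abbreviated titles.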
Appendix
No response