Plain citations parser giving unrelated citations back. How come? Attached a test case #12893

Open
2 tasks done
bwakkie opened this issue Apr 7, 2025 · 4 comments

Comments

@bwakkie

bwakkie commented Apr 7, 2025

JabRef version

Other (please describe below)

Operating system

GNU / Linux

Details on version and operating system

JabRef 5.16--2024-07-25--771c4cd Linux 6.12.20-2-manjaro amd64 Java 21.0.2 JavaFX 22.0.2+4

Checked with the latest development build (copy version output from About dialog)

  • I made a backup of my libraries before testing the latest development version.
  • I have tested the latest development version and the problem persists

Steps to reproduce the behaviour

JabRef 5.16

There is a problem with the text parser: it turns the citations into completely unrelated citations.
I created a test case; see the two attached files.

testcase.txt

testcase.bib.txt

How come these totally different citations are matched? Is there a way to parse the strings without using Grobid? I think blindly trusting Grobid is wrong. Even just verifying the whole title string would already help to show that something is not right.
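
The title check suggested above could be sketched roughly as follows. This is a minimal illustration in Python, not JabRef or Grobid code. It assumes that "verifying the title" means checking that the title returned by the parser actually occurs in the pasted citation text. The names `normalize` and `title_matches_source` and the 0.8 threshold are hypothetical choices made for this example.

```python
import re
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """Lowercase and collapse every run of characters outside a-z/0-9 into one space.

    Simplification: non-ASCII letters are treated as separators.
    """
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def title_matches_source(parsed_title: str, source_citation: str,
                         threshold: float = 0.8) -> bool:
    """Return True if the parsed title plausibly comes from the source citation.

    Hypothetical helper: the threshold is an arbitrary assumption, not a tuned value.
    """
    title = normalize(parsed_title)
    source = normalize(source_citation)
    if not title:
        return False
    # Exact containment after normalization is the strongest signal.
    if title in source:
        return True
    # Otherwise, require that one contiguous block of the title, covering at
    # least `threshold` of its length, appears verbatim in the source text.
    matcher = SequenceMatcher(None, title, source, autojunk=False)
    match = matcher.find_longest_match(0, len(title), 0, len(source))
    return match.size / len(title) >= threshold
```

An entry whose title fails this check could be flagged for manual review instead of being imported silently.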

Appendix

No response

@InAnYan
Member

InAnYan commented Apr 7, 2025

Hi! Thanks for checking out JabRef and its text parser! I have worked on this feature for some time.

TL;DR: Plain citation parsing in JabRef does not work very well: the rule-based parser is underdeveloped (rule-based algorithms are hard to develop), Grobid gives irrelevant entries, and the LLM is hard to set up.

We call this feature the plain citations parser (sometimes the plain references parser).

In JabRef 5.16 there are two ways to parse citations: rule-based and Grobid. As we have found in many experiments (see 1, 2, 3), both parsers give mediocre results. Grobid quite often gives irrelevant entries (in fact, JabRef's Grobid instance was down for some time 4), which may be what you are experiencing.

In the new version of JabRef (6.0-alpha), we have added warnings about confabulations in plain citation parsing 5 and expanded the documentation 6. One can also use an LLM to parse citations 7; people (including me) say it works well enough 8, 9.

Footnotes

  1. https://github.com/JabRef/jabref/issues/11805

  2. https://github.com/JabRef/jabref/issues/12211

  3. https://github.com/JabRef/jabref/issues/6672

  4. https://github.com/JabRef/jabref/issues/12211

  5. https://github.com/JabRef/jabref/issues/11825

  6. https://docs.jabref.org/collect/newentryfromplaintext

  7. https://docs.jabref.org/collect/newentryfromplaintext#llm

  8. https://github.com/JabRef/jabref/issues/11805#issuecomment-2445963839

  9. https://github.com/JabRef/jabref/issues/12211#issuecomment-2484934246

@Bha2912

Bha2912 commented Apr 7, 2025

Hello! I’m interested in working on this issue. Let me know if I can take it up.

@InAnYan
Member

InAnYan commented Apr 7, 2025

@Bha2912, this issue is not marked as a good first issue (or with any other label), and we currently have no plan for solving it (it is still under discussion).

So, for now, you can look at other issues.

@bwakkie changed the title from "Text parser giving unrelated citations back. How come? Attached a test case" to "Plain citations parser giving unrelated citations back. How come? Attached a test case" on Apr 7, 2025
@bwakkie
Author

bwakkie commented Apr 7, 2025

Hi @InAnYan, I changed the title based on your comments.

I know how difficult it is, as I have been doing my best with vim and regexes myself for years. But what Grobid returns at the moment is, in my opinion, not worth it.
I know that Grobid can be trained, but how to do that is a bit over my head for now. I have my own Grobid server running and it gives similar problems, hence I was looking for an alternative solution.

I used the JabRef development version for the next test, based on the original text input above. Grobid got just 10% correct and the other 90% was complete garbage, so for now I do not trust the Grobid parser one bit.

For the rule-based result, I see that on each line the previous author is pasted back into the author fields:

testcase_rulebased.bib.txt

anystyle.io helped in my case. Would this way of dealing with plain citations not be a better idea?
e.g. parse -> show user -> user fixes -> parser learns -> include in a library
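
The workflow above could be sketched like this. It is a minimal illustration in Python under stated assumptions, not how JabRef or AnyStyle is implemented. `ReviewLoop`, `parse`, `review`, and `training_data` are hypothetical names, and "parser learns" is reduced here to collecting the user's corrections as training examples; the actual retraining step is left out.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

Entry = Dict[str, str]  # simplified stand-in for a BibTeX entry


@dataclass
class ReviewLoop:
    # Any parser backend (rule-based, Grobid, ...) mapping a citation string to an entry.
    parse: Callable[[str], Entry]
    # Shows the parsed entry to the user and returns the (possibly corrected) entry.
    review: Callable[[str, Entry], Entry]
    # Corrections collected for a later training run (assumption: training happens elsewhere).
    training_data: List[Tuple[str, Entry]] = field(default_factory=list)
    library: List[Entry] = field(default_factory=list)

    def process(self, citation: str) -> Entry:
        parsed = self.parse(citation)               # parse
        corrected = self.review(citation, parsed)   # show user, user fixes
        if corrected != parsed:
            # Only entries the user had to fix are kept as training examples.
            self.training_data.append((citation, corrected))
        self.library.append(corrected)              # include in a library
        return corrected
```

The point of the design is that nothing reaches the library without passing through `review`, so a wrong parse costs the user one correction instead of silently polluting the library.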
