Use a database as a backend for JabRef library management #12708
Regarding the de-coupling of the UI from the data, starting points are:
In last year's GSoC we opted for Postgres because it has many plugins, especially for fast regular expression search, so that the DBMS does the regex indexing and resolving - and not the client.
XTDB? There is so much more to do with literature than a bibtex/biblatex schema covers, and immutable data would enable some great opportunities. Examples include a traceable history of screening or detecting duplicates from the past (recognising previously imported entries even after they have changed).
Since this is about memory saving, more things need to be considered:
- Loading of a `.bib` file
- Presenting data to JabRef
- Saving of a `.bib` file
One decision driver: do not spawn another process (because this might be forbidden in certain environments).
In my opinion, the scope of a technology decision should be broader and not limited solely to storing BibTeX entries. I'm concerned that JabRef may end up using too many technologies in parallel. For example, abbreviations are currently implemented using MVStore. Ideally, JabRef should rely on a single storage technology. I'm also not in favor of packaging a full PostgreSQL database with JabRef for several reasons:
That said, even file-based databases should be carefully evaluated. For instance, DuckDB currently does not support automatic updates to full-text search indexes when the input table changes.

Conclusion:
Just for fun, I created a comparison with a little help from ChatGPT. It might not be perfectly accurate, but in my opinion, using comparisons like this - along with prototypes - is the right approach for making sustainable decisions.
Thank you for the comparison start! We should include
📚 Alternative Approach: BibTeXIndexer — Streaming BibTeX Parser with Lazy Loading and Random Access

An alternative worth considering — or possibly excluding, depending on the use case — is the BibTeXIndexer, a streaming, file-based parser that supports lazy loading and random access of entries. This approach does not use a database, which makes it lightweight and memory-efficient. However, it also means that full-text search functionality is not built in and must be implemented separately, for example using a search engine like Apache Lucene.

Key Characteristics:
Depending on your application (e.g. GUI tools like JabRef vs. web apps or cloud backends), this model may be ideal for performance and simplicity — or it might lack the flexibility needed for advanced search and filtering without additional components.
A working prototype of the BibTeXIndexer can be accessed here: View on GitHub Gist.

⚙️ Strategy

The program processes the `.bib` file in a single streaming pass: it tracks brace balance to find entry boundaries and records the byte offset and length of each entry in an in-memory index, so that individual entries can later be read on demand via seek + read.
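Since the gist itself is not reproduced here, the following is only a minimal sketch of that strategy under the assumptions just described; the class and method names (`BibOffsetIndex`, `readEntry`) are made up and are not the prototype's actual API:

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.LinkedHashMap;
import java.util.Map;

/** Sketch of an offset index over a .bib file: one streaming pass, then random access. */
public class BibOffsetIndex {

    /** citation key -> {byteOffset, byteLength} of the raw entry text. */
    private final Map<String, long[]> index = new LinkedHashMap<>();
    private final Path bibFile;

    public BibOffsetIndex(Path bibFile) throws IOException {
        this.bibFile = bibFile;
        try (BufferedInputStream in = new BufferedInputStream(Files.newInputStream(bibFile))) {
            long pos = 0;            // absolute byte position in the file
            long entryStart = -1;    // -1 = currently outside any entry
            int braceDepth = 0;
            boolean inKey = false;
            StringBuilder key = new StringBuilder();
            String currentKey = null;
            int c;
            while ((c = in.read()) != -1) {
                if (entryStart < 0) {
                    if (c == '@') {  // e.g. "@article{key," starts a new entry
                        entryStart = pos;
                        braceDepth = 0;
                        inKey = false;
                        currentKey = null;
                        key.setLength(0);
                    }
                } else if (c == '{') {
                    if (++braceDepth == 1) inKey = true;     // citation key follows the first '{'
                } else if (c == '}') {
                    if (--braceDepth == 0) {                 // entry closed: record offset + length
                        if (currentKey != null) {
                            index.put(currentKey, new long[] {entryStart, pos - entryStart + 1});
                        }
                        entryStart = -1;
                        inKey = false;
                    }
                } else if (inKey) {
                    if (c == ',') {                          // key ends at the first comma
                        currentKey = key.toString().trim();
                        inKey = false;
                    } else {
                        key.append((char) c);                // assumes ASCII citation keys
                    }
                }
                pos++;
            }
        }
        // Caveat (see limitations below): @string/@comment blocks, braces inside quoted
        // field values, and unbalanced braces are not handled by this sketch.
    }

    /** Lazily load a single entry's raw text via seek + read. */
    public String readEntry(String citationKey) throws IOException {
        long[] span = index.get(citationKey);
        if (span == null) {
            return null;
        }
        try (RandomAccessFile raf = new RandomAccessFile(bibFile.toFile(), "r")) {
            raf.seek(span[0]);
            byte[] buf = new byte[(int) span[1]];
            raf.readFully(buf);
            return new String(buf, StandardCharsets.UTF_8);
        }
    }
}
```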
✅ Advantages
⚠️ Limitations

| Limitation | Description |
|---|---|
| 🐢 More File I/O | Each entry read requires a file seek + read. |
| 🧠 More Complex Code | Offset tracking and brace balancing are trickier than full parsing. |
| 🔁 Full File Scan on Load | A one-time pass is needed to build the index. |
| 💥 Malformed Entry Risk | Unbalanced braces may confuse the indexer. |
| 💾 No Built-in Caching | Frequently accessed entries are not cached unless added manually. |
💬 File I/O Suitability for Desktop Use
Yes — The I/O model is efficient and acceptable for local applications like JabRef.
- All access is local and buffered — no network or remote calls.
- Modern SSDs offer fast seeks, making random access viable.
- Memory usage is minimal, even for huge files.
- Adding a small cache for recently viewed entries would improve UX further.
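As a concrete illustration of the last bullet, a small least-recently-used cache can be built on plain JDK classes. This is only a sketch (`EntryCache` is a hypothetical name, not an existing JabRef class):

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Tiny LRU cache for recently viewed raw entries. */
public class EntryCache extends LinkedHashMap<String, String> {
    private final int maxEntries;

    public EntryCache(int maxEntries) {
        super(16, 0.75f, true);   // accessOrder = true -> least-recently-used eviction
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<String, String> eldest) {
        return size() > maxEntries; // evict the oldest entry once the cap is exceeded
    }
}
```

Used together with the indexer sketch above:

```java
EntryCache cache = new EntryCache(100);
String raw = cache.get(citationKey);
if (raw == null) {
    raw = index.readEntry(citationKey);   // seek + read only on a cache miss
    cache.put(citationKey, raw);
}
```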
🔄 Use in JabRef: Lazy Entry Loading
BibTeXIndexer can improve JabRef’s efficiency by supporting:

📖 Lazy Detail View
- Load only the selected entry when the user clicks it.
- Greatly reduces memory load and speeds up handling of large `.bib` files.

✍️ On-Demand Editing
- Write changes back to the file in-place or append as needed.
- Keeps JabRef responsive with low file-locking and I/O overhead.
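One simple way to realise the append variant of on-demand editing is to write the updated entry at the end of the file and repoint the offset index; the old bytes become dead space until the next full save compacts the file. A minimal sketch, reusing the hypothetical indexer from above (the file must be opened read-write):

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;

public class AppendUpdate {
    /** Appends the new version of an entry; returns {offset, length} for the index. */
    public static long[] appendUpdatedEntry(RandomAccessFile raf, String updatedEntryText)
            throws IOException {
        byte[] bytes = updatedEntryText.getBytes(StandardCharsets.UTF_8);
        raf.seek(raf.length());
        raf.write('\n');                    // keep entries separated in the file
        long offset = raf.getFilePointer(); // start of the appended copy
        raf.write(bytes);
        // Caller repoints its index: index.put(key, new long[] {offset, bytes.length});
        return new long[] {offset, bytes.length};
    }
}
```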
🔎 Use with Apache Lucene: Fast Entry Search
BibTeXIndexer is a perfect backend for Lucene-based search:

During Indexing:
- Extract metadata fields (title, author, year, etc.).
- Store entry key + byte position + length as Lucene document fields.

During Search:
- Search returns document keys.
- Use the index to seek directly to the matching entry in the `.bib` file.
- No need to store or duplicate the full entry text in the Lucene index.

Benefits:
- ✅ Lightweight, metadata-driven Lucene index.
- ✅ Fast retrieval of full BibTeX source.
- ✅ Minimal disk I/O and memory usage.
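To make the scheme concrete, here is a minimal, self-contained Lucene sketch (the demo class, field names, and example offsets are invented for illustration; this is not JabRef code):

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class MetadataIndexDemo {
    public static void main(String[] args) throws Exception {
        Directory dir = new ByteBuffersDirectory();          // in-memory for the demo
        StandardAnalyzer analyzer = new StandardAnalyzer();

        // During indexing: metadata + file position, but not the full entry text
        try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(analyzer))) {
            Document doc = new Document();
            doc.add(new StringField("key", "Smith2020", Field.Store.YES)); // exact-match key
            doc.add(new TextField("title", "A Study of BibTeX Parsing", Field.Store.NO));
            doc.add(new TextField("author", "Jane Smith", Field.Store.NO));
            doc.add(new StoredField("offset", 10_240L));     // byte position in the .bib file
            doc.add(new StoredField("length", 512L));        // byte length of the entry
            writer.addDocument(doc);
        }

        // During search: hits yield key + offset; the .bib file itself provides the text
        try (DirectoryReader reader = DirectoryReader.open(dir)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            TopDocs hits = searcher.search(
                    new QueryParser("title", analyzer).parse("bibtex"), 10);
            for (ScoreDoc hit : hits.scoreDocs) {
                Document d = searcher.doc(hit.doc);
                long offset = d.getField("offset").numericValue().longValue();
                long length = d.getField("length").numericValue().longValue();
                // seek + read the raw entry from the .bib file, e.g. with the indexer above
                System.out.printf("%s @ offset=%d length=%d%n", d.get("key"), offset, length);
            }
        }
    }
}
```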
🧠 Summary

BibTeXIndexer is a robust, modern solution for handling large `.bib` files with:
- Fast indexing
- Lazy access
- In-place updates
- Excellent scalability

It’s suitable for integration into reference managers like JabRef, or search systems using Apache Lucene — giving the best of both speed and resource efficiency.
We need to work out how the indexer fulfils the requirements of regex-based search and normalization of terms (Duesseldorf, Dusseldorf, Düsseldorf, D\"{u}sseldorf) and also normalization of names. The idea seems similar to https://github.com/dhis2/json-tree, but json-tree misses normalization.
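For the Unicode part of term normalization, a common technique (an assumption here, not necessarily what JabRef will use) is NFD decomposition plus stripping of combining marks, which folds Düsseldorf and Dusseldorf to the same term; the transliteration Duesseldorf and LaTeX forms like D\"{u}sseldorf would still need extra language- or TeX-aware rules. A minimal sketch:

```java
import java.text.Normalizer;

public class TermNormalizer {
    /** Fold a term to a diacritic-free, lower-case form at index and search time. */
    public static String normalize(String term) {
        String decomposed = Normalizer.normalize(term, Normalizer.Form.NFD);
        // \p{M} matches Unicode combining marks (the accents separated out by NFD)
        return decomposed.replaceAll("\\p{M}", "").toLowerCase();
    }

    public static void main(String[] args) {
        System.out.println(normalize("Düsseldorf")); // -> dusseldorf
        System.out.println(normalize("Dusseldorf")); // -> dusseldorf
    }
}
```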
Reading more: Lucene could solve that. However, @LoayGhreeb switched to Postgres for the reasons discussed at #11803. I did not enforce writing an ADR, so we need to collect the reasons from him. Regex performance and wrong matches were IMHO the main reasons...
Is your suggestion for improvement related to a problem? Please describe.
Currently, JabRef struggles with libraries that have over 1000 entries (#10209).
Short reason and solution: JabRef stores all information in RAM. JabRef needs a mechanism to manage lots of data. This is a perfect use case for databases!
Longer issue description: look at how JabRef manages libraries and entries:

1. JabRef reads a `.bib` file.
2. JabRef parses the `.bib` file into `BibDatabase` (with `BibDatabaseContext`) and `BibEntry`. Those are Java objects that are stored in RAM.
3. JabRef writes changes back to the `.bib` file.

So, JabRef's original philosophy is to be a file editor. However, when you have a giant library, you just don't have enough JVM heap. It is limited.
Describe the solution you'd like
JabRef should have a mechanism for managing a lot of data and use it for storing and manipulating libraries.
This is the purpose of databases! A DBMS will also cache data: a typical DBMS stores data in pages; some pages are kept in RAM, others are offloaded to disk. This is a perfect solution for giant libraries, as you are no longer limited by RAM, but by the space on your HDD/SSD!
Moreover, a DBMS allows you to query data quickly and powerfully. Here is one place where SQL can be used: #10209 (comment). Search functionality is also a perfect use case for databases.
Additional context
This is planned as a GSoC project. Beware: while this project is quite important for JabRef, it might turn out to be very complex.
We aim for a relational DBMS like SQLite, DuckDB, or Postgres. In particular, we want the database to be embedded.
In fact, we want Postgres to be our backend, as Postgres has powerful search capabilities. It can actually be used as an embedded database; check out this library: https://github.com/zonkyio/embedded-postgres.
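For illustration, a minimal sketch of that library's use (the table and regex query are invented examples, not JabRef's schema). Note that it does start a postgres process under the hood, which conflicts with the "do not spawn another process" decision driver mentioned above:

```java
import io.zonky.test.db.postgres.embedded.EmbeddedPostgres;

import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;

public class EmbeddedPostgresDemo {
    public static void main(String[] args) throws Exception {
        // Starts a throwaway Postgres instance (this spawns a postgres process)
        try (EmbeddedPostgres pg = EmbeddedPostgres.start();
             Connection conn = pg.getPostgresDatabase().getConnection();
             Statement st = conn.createStatement()) {

            // Hypothetical schema, not JabRef's actual one
            st.execute("CREATE TABLE entry (citation_key TEXT PRIMARY KEY, title TEXT)");
            st.execute("INSERT INTO entry VALUES ('Smith2020', 'A Study of BibTeX Parsing')");

            // Postgres regex operator '~*' (case-insensitive) as a taste of its search power
            try (ResultSet rs = st.executeQuery(
                    "SELECT citation_key FROM entry WHERE title ~* 'bibtex'")) {
                while (rs.next()) {
                    System.out.println(rs.getString("citation_key"));
                }
            }
        }
    }
}
```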
Here are some materials for this project: