
Use a database as a backend for JabRef library management #12708

Open
InAnYan opened this issue Mar 12, 2025 · 11 comments

Comments

@InAnYan
Member

InAnYan commented Mar 12, 2025

Is your suggestion for improvement related to a problem? Please describe.

Currently, JabRef struggles with libraries that have over 1000 entries (#10209).

In short: JabRef stores the entire library in RAM and needs a mechanism for managing large amounts of data. This is a perfect use case for databases!

In more detail, consider how JabRef manages libraries and entries:

  1. Load the .bib file.
  2. Parse it into BibDatabase (with BibDatabaseContext) and BibEntry objects. These are Java objects stored in RAM.
  3. Manipulate the library through those objects.
  4. Save those objects back to a .bib file.

So JabRef's original philosophy is to be a file editor. However, with a giant library you simply run out of JVM heap, which is limited.
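The in-memory model above can be sketched as follows. This is a hypothetical, heavily simplified Python stand-in for the Java objects; `load_bib` and its regex are illustrative only, not JabRef's actual parser:

```python
import re

# Hypothetical, heavily simplified stand-in for JabRef's current model:
# the whole library is parsed into objects that live in RAM, so heap
# usage grows linearly with the number of entries.
def load_bib(text: str) -> list:
    """Parse a (toy) .bib string into in-memory entry dicts."""
    entries = []
    for m in re.finditer(r"@(\w+)\{([^,]+),", text):
        entries.append({"type": m.group(1), "key": m.group(2)})
    return entries

library = load_bib("@article{smith2020,}\n@book{doe2019,}")
# All entries are resident at once: a 100k-entry library means 100k
# objects on the heap before the user even sees the table.
```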

Describe the solution you'd like

JabRef should have a mechanism for managing large amounts of data and use it for storing and manipulating libraries.

This is exactly what databases are for! A DBMS also caches data: a typical DBMS stores data in pages, keeping some in RAM and offloading the rest to disk. This is a perfect solution for giant libraries, because you are no longer limited by RAM but by the space on your HDD/SSD!

Moreover, a DBMS lets you query data quickly and expressively. Here is one place where SQL could be used: #10209 (comment). Search functionality is also a perfect use case for databases.
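As a sketch of such queries, SQLite here stands in for whichever embedded DBMS is chosen, and the `entries(...)` schema is a hypothetical illustration, not JabRef's actual data model:

```python
import sqlite3

# Sketch of SQL-backed queries over entries. SQLite stands in for
# whichever embedded DBMS is chosen; the entries(...) schema below is a
# hypothetical illustration, not JabRef's actual data model.
con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE entries (
    citation_key TEXT PRIMARY KEY, entry_type TEXT,
    author TEXT, title TEXT, year INTEGER)""")
con.executemany(
    "INSERT INTO entries VALUES (?, ?, ?, ?, ?)",
    [("smith2020", "article", "Smith, J.", "Indexing at Scale", 2020),
     ("doe2019", "book", "Doe, A.", "Databases for Libraries", 2019)])

# The DBMS resolves queries itself; the client never needs to hold the
# whole library in RAM.
recent = con.execute(
    "SELECT citation_key FROM entries WHERE year >= 2020").fetchall()
hits = con.execute(
    "SELECT citation_key FROM entries WHERE title LIKE ?",
    ("%Libraries%",)).fetchall()
```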

Additional context

This is planned as a GSoC project. Beware, while this project is quite important for JabRef, it might turn out to be very complex.

We aim for a relational DBMS such as SQLite, DuckDB, or Postgres. In particular, we want the database to be embedded.

In fact, we want Postgres to be our backend, as Postgres has powerful capabilities for search. It can actually be used as an embedded database; check out this library: https://github.com/zonkyio/embedded-postgres.

Here are some materials for this project:

@github-project-automation github-project-automation bot moved this to Normal priority in Prioritization Mar 12, 2025
@subhramit subhramit moved this from Normal priority to High priority in Prioritization Mar 12, 2025
@koppor
Member

koppor commented Mar 12, 2025

In last year's GSoC we opted for Postgres because it has many extensions, especially for fast regular-expression search, so that the DBMS, not the client, does the regex indexing and matching.

@ryan-carpenter

XTDB? There is so much more to do with literature than a bibtex/biblatex schema covers, and immutable data would enable some great opportunities. Examples include a traceable history of screening or detecting duplicates from the past (recognising previously imported entries even after they have changed).

@koppor
Member

koppor commented Mar 16, 2025

Since this is about memory saving, more things need to be considered:

Loading of a .bib file

  • If the bib file is read with "plain" JabRef, strings might be in memory, too
  • latex-to-unicode and unicode-to-latex conversion (for search) should be done in the DB, not in Java (hopefully via a stored procedure)

Presenting data to JabRef

  • Tableviews might help here to gain speed and reduce memory consumption on JabRef's side.

Saving of a .bib file

  • If the bib file is saved with "plain" JabRef, strings might be in memory, too
  • Postgres should offer some pre-formatting for saving (e.g., views, stored procedures, ...)
  • Maybe, a "simple" table view also helps
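The stored-procedure idea above (doing latex-to-unicode in the DB rather than in Java) can be sketched with a user-defined SQL function. Here SQLite's `create_function()` stands in for a stored procedure, and the escape table is a tiny illustrative subset:

```python
import sqlite3

# Sketch of "latex-to-unicode in the DB": a normalisation function is
# registered with the database so that queries, not Java code, resolve
# LaTeX escapes. SQLite's create_function() stands in for a stored
# procedure; the escape table is a tiny illustrative subset.
LATEX_TO_UNICODE = {'\\"{u}': "ü", '\\"{o}': "ö", "\\'{e}": "é"}

def latex_to_unicode(s: str) -> str:
    for esc, ch in LATEX_TO_UNICODE.items():
        s = s.replace(esc, ch)
    return s

con = sqlite3.connect(":memory:")
con.create_function("latex_to_unicode", 1, latex_to_unicode)
con.execute("CREATE TABLE entries (citation_key TEXT, title TEXT)")
con.execute("INSERT INTO entries VALUES (?, ?)",
            ("a", 'A study of D\\"{u}sseldorf'))

# The query sees the normalised form even though the stored value still
# contains the raw LaTeX escape:
row = con.execute(
    "SELECT latex_to_unicode(title) FROM entries").fetchone()
```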

@koppor
Member

koppor commented Mar 19, 2025

One decision driver: do not spawn another process (because this might be forbidden in certain environments)

@ungerts
Contributor

ungerts commented Mar 23, 2025

In my opinion, the scope of a technology decision should be broader and not limited solely to storing BibTeX entries. I'm concerned that JabRef may end up using too many technologies in parallel. For example, abbreviations are currently implemented using MVStore. Ideally, JabRef should rely on a single storage technology.

I'm also not in favor of packaging a full PostgreSQL database with JabRef for several reasons:

  • Security: PostgreSQL may open network ports, and if poorly configured, could introduce potential security vulnerabilities.
  • Resource Usage: PostgreSQL is a server-grade database meant to run on dedicated systems. Running it on end-user machines can consume significant CPU, memory, and disk resources, negatively impacting performance and user experience. Additionally, managing Postgres-specific tasks like cache configuration or vacuuming to reclaim disk space may be beyond the intended scope of JabRef.
  • Initialization and Portability: SQLite database files, for example, can be easily copied and delivered as part of the application, simplifying distribution - particularly for features like abbreviations.

That said, even file-based databases should be carefully evaluated. For instance, DuckDB currently does not support automatic updates to full-text search indexes when the input table changes.

Conclusion:
A detailed architectural decision should be made regarding JabRef's persistence technology, taking into account scalability, performance, maintainability, and user environment constraints.

@ungerts
Contributor

ungerts commented Mar 29, 2025

Just for fun, I created a comparison with a little help from ChatGPT. It might not be perfectly accurate, but in my opinion, using comparisons like this - along with prototypes - is the right approach for making sustainable decisions.

| Criteria | SQLite 🥇 (Recommended) | DuckDB 🥈 (Alternative) | PostgreSQL ❌ (Not Ideal) | HSQLDB ⚠️ (Legacy Option) | H2 🟡 (Viable Alternative) |
| --- | --- | --- | --- | --- | --- |
| Installation Required | ❌ None – fully embedded | ❌ None – fully embedded | ✅ Yes – needs external server | ❌ None – fully embedded | ❌ None – fully embedded |
| Resource Usage | 🟢 Very low | 🟢 Low to moderate | 🔴 High | 🟢 Low | 🟢 Low |
| Crash Safety | 🟢 Excellent – no external processes | 🟢 Excellent – no external processes | 🔴 Poor – background processes may linger | 🟡 Historically had issues | 🟢 Generally good |
| Performance (Local Usage) | 🟢 Good for most library sizes (<10k entries) | 🟢 Optimized for analytics, fast on big data | 🟢 Great – but overkill for local use | 🟡 Adequate but aging | 🟢 Fast and efficient |
| Scalability | 🟡 Medium – may slow with very large DBs | 🟢 High – columnar storage efficient | 🟢 High – built for large concurrent loads | 🟡 Medium | 🟡 Medium to good |
| Multi-user Support | 🔴 No – single-writer only | 🔴 No – not built for concurrent access | 🟢 Yes – designed for multi-user environments | 🔴 Poor – file-based, single user | 🔴 Basic – embedded usage only |
| Java Integration | 🟢 Mature and stable JDBC drivers | 🟢 Good – newer but actively improving | 🟢 Mature JDBC ecosystem | 🟢 Native Java | 🟢 Native Java |
| Access Method | 🔹 JNI (Java Native Interface) | 🔹 JNI (Java Native Interface) | 🔸 Network via TCP/IP | 🟢 Direct Java API | 🟢 Direct Java API |
| Cross-platform Compatibility | 🟢 Excellent | 🟢 Excellent | 🟢 Excellent | 🟢 Excellent | 🟢 Excellent |
| Query Features | 🟡 Basic SQL | 🟢 Advanced analytics, SQL + functions | 🟢 Full SQL, indexing, full-text search | 🟡 Basic SQL | 🟢 Full SQL, stored procedures |
| Full-Text Search (FTS) | 🟢 Yes – via FTS5 extension | 🟢 Yes – via FTS extension (DuckDB Docs) | 🟢 Yes – built-in full-text search | 🔴 Very limited or none | 🟢 Yes – via native or Lucene (H2 Docs) |
| Portability (Database File) | 🟢 Excellent – single .db file | 🟢 Excellent – file-based | 🔴 Poor – tied to server instance | 🟢 Good – single file | 🟢 Good – file-based |
| Ease of Embedding in JabRef | 🟢 Very easy | 🟢 Easy | 🔴 Complex – requires local server connection | 🟢 Easy – Java-native | 🟢 Easy – Java-native |
| End-user Complexity | 🟢 Transparent – no setup or config | 🟢 Transparent | 🔴 Requires setup and config | 🟢 Transparent | 🟢 Transparent |
| Fits Intended Use? | ✅ Yes – built for local embedded use | 🟡 Partially – analytics-focused | ❌ No – intended for server environments | 🟡 Partial – legacy option | ✅ Yes – built for embedded Java apps |

@koppor
Member

koppor commented Mar 30, 2025

Thank you for starting the comparison! We should also include:

  • regex-based search
  • normalisation of terms (Duesseldorf, Dusseldorf, Düsseldorf, D"{u}sseldorf)
  • ease of supporting lazy loading (potentially all of them)
  • potential to be used by multiple clients
  • modern and legacy macOS support, ARM support (potentially all of them)
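The normalisation-of-terms criterion can be made concrete with a sketch of what any backend would need to support so that all four spellings collide on one search key. The escape table and transliteration rules here are deliberately crude and purely illustrative; mapping "ue" to "u" would also rewrite ordinary words, so a real implementation needs language-aware rules:

```python
import unicodedata

# Crude sketch: make Duesseldorf, Dusseldorf, Düsseldorf and
# D"{u}sseldorf all normalise to the same search key.
LATEX = {'"{u}': "ü", '"{o}': "ö", '"{a}': "ä"}  # tiny subset of escapes

def normalize_term(term: str) -> str:
    for esc, ch in LATEX.items():
        term = term.replace(esc, ch)
    term = term.lower()
    # Strip diacritics: ü -> u, é -> e, ...
    term = unicodedata.normalize("NFKD", term)
    term = "".join(c for c in term if not unicodedata.combining(c))
    # Collapse German digraph transliterations onto the bare vowel.
    for digraph, bare in (("ue", "u"), ("oe", "o"), ("ae", "a")):
        term = term.replace(digraph, bare)
    return term
```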

@ungerts
Contributor

ungerts commented Mar 30, 2025

📚 Alternative Approach: BibTeXIndexer — Streaming BibTeX Parser with Lazy Loading and Random Access

An alternative worth considering — or possibly excluding depending on the use case — is the BibTeXIndexer, a streaming, file-based parser that supports lazy loading and random access of entries.

This approach does not use a database, which makes it lightweight and memory-efficient. However, it also means that full-text search functionality is not built-in and must be implemented separately, for example, using a search engine like Apache Lucene.

Key Characteristics:

  • ✅ Fully file-based — no need for a database
  • ✅ Low memory usage and fast entry access
  • ❌ No native search — external indexing (e.g. Lucene) required

Depending on your application (e.g. GUI tools like JabRef vs. web apps or cloud backends), this model may be ideal for performance and simplicity — or it might lack the flexibility needed for advanced search and filtering without additional components.

BibTeXIndexer is a high-performance, memory-efficient BibTeX indexer that supports fast random access, lazy loading, and in-place entry updates — ideal for large .bib files and tools like JabRef.

A working prototype of the BibTeXIndexer can be accessed here: View on GitHub Gist.


⚙️ Strategy

The program processes .bib files in a streaming, non-blocking way:

  1. Line-by-line reading using UTF-8 and BufferedReader.
  2. Brace-counting logic detects BibTeX entry blocks.
  3. Byte-level tracking records start position and length of each entry.
  4. Line ending detection ensures cross-platform compatibility.
  5. Entries are:
    • Loaded lazily only when accessed.
    • Updated in-place if length permits.
    • Appended if content is too long or key is new.
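The steps above can be sketched as follows. This is not the actual BibTeXIndexer prototype from the Gist; it is a minimal illustration that assumes well-formed entries whose "@" line opens a new block:

```python
import os
import tempfile

# Sketch of the strategy: one streaming pass records the byte offset
# and length of every @entry block via brace counting; entries are then
# loaded lazily by seek(). Assumes well-formed, balanced entries.

def build_index(path: str) -> dict:
    """Map citation key -> (byte offset, byte length) of its entry."""
    index, key, start, depth, offset = {}, None, 0, 0, 0
    with open(path, "rb") as f:
        for raw in f:
            line = raw.decode("utf-8")
            if depth == 0 and line.lstrip().startswith("@"):
                start = offset + raw.index(b"@")
                key = line.split("{", 1)[1].split(",", 1)[0].strip()
            depth += line.count("{") - line.count("}")
            if key is not None and depth == 0:
                index[key] = (start, offset + len(raw) - start)
                key = None
            offset += len(raw)
    return index

def read_entry(path: str, index: dict, key: str) -> str:
    """Lazily load a single entry by seeking to its recorded offset."""
    start, length = index[key]
    with open(path, "rb") as f:
        f.seek(start)
        return f.read(length).decode("utf-8")

# Tiny demo: index a two-entry file, then load one entry on demand.
_bib = ("@article{smith2020,\n  title = {Indexing at Scale},\n}\n"
        "\n@book{doe2019,\n  title = {Databases},\n}\n")
with tempfile.NamedTemporaryFile("w", suffix=".bib", delete=False,
                                 encoding="utf-8") as _tmp:
    _tmp.write(_bib)
    _path = _tmp.name
index = build_index(_path)
entry = read_entry(_path, index, "doe2019")
os.unlink(_path)
```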

✅ Advantages

| Feature | Description |
| --- | --- |
| 🧠 Low Memory Usage | Streams entries; doesn't load entire file into RAM. |
| Random Access | Fast lookup by seeking to byte offset. |
| 🧵 UTF-8 Safe | Correctly handles multibyte characters. |
| 🛠️ Update-Safe | Supports safe in-place modification or appending. |
| 🔄 Line Ending Awareness | Respects LF / CRLF endings for consistent output. |
| 📦 Integratable | Ready for use in tools like JabRef or search systems like Lucene. |

⚠️ Disadvantages

| Limitation | Description |
| --- | --- |
| 🐢 More File I/O | Each entry read requires a file seek + read. |
| 🧠 More Complex Code | Offset tracking and brace balancing are trickier than full parsing. |
| 🔁 Full File Scan on Load | A one-time pass is needed to build the index. |
| 💥 Malformed Entry Risk | Unbalanced braces may confuse the indexer. |
| 💾 No Built-in Caching | Frequently accessed entries are not cached unless added manually. |

💬 File I/O Suitability for Desktop Use

Yes — The I/O model is efficient and acceptable for local applications like JabRef.

  • All access is local and buffered — no network or remote calls.
  • Modern SSDs offer fast seeks, making random access viable.
  • Memory usage is minimal, even for huge files.
  • Adding a small cache for recently viewed entries would improve UX further.

🔄 Use in JabRef: Lazy Entry Loading

BibTeXIndexer can improve JabRef’s efficiency by supporting:

📖 Lazy Detail View

  • Load only the selected entry when the user clicks it.
  • Greatly reduces memory load and speeds up large .bib file handling.

✍️ On-Demand Editing

  • Write changes back to the file in-place or append as needed.
  • Keeps JabRef responsive with low file locking and I/O overhead.

🔎 Use with Apache Lucene: Fast Entry Search

BibTeXIndexer is a perfect backend for Lucene-based search:

During Indexing:

  • Extract metadata fields (title, author, year, etc.).
  • Store entry key + byte position + length as Lucene document fields.

During Search:

  • Search returns document keys.
  • Use the index to seek directly to the matching entry in the .bib file.
  • No need to store or duplicate full entry text in the Lucene index.

Benefits:

  • ✅ Lightweight, metadata-driven Lucene index.
  • ✅ Fast retrieval of full BibTeX source.
  • ✅ Minimal disk I/O and memory usage.
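The indexing/search split above can be sketched with a plain dict standing in for the Lucene index. The file layout, toy entries, and field extraction are illustrative assumptions; the point is that the index holds only metadata plus byte positions, and hits are resolved by seeking into the .bib file:

```python
import os
import tempfile

# Two toy entries; in real use these would come from the .bib file.
entries = {
    "smith2020": "@article{smith2020,\n  title = {Indexing at Scale},\n}\n",
    "doe2019": "@book{doe2019,\n  title = {Databases for Libraries},\n}\n",
}
bib = "".join(entries.values())
with tempfile.NamedTemporaryFile("w", suffix=".bib", delete=False,
                                 encoding="utf-8") as tmp:
    tmp.write(bib)
    path = tmp.name

# "Indexing": each document stores metadata tokens plus byte position,
# never the full entry text (the list of dicts stands in for Lucene).
docs, offset = [], 0
for key, text in entries.items():
    title = text.split("title = {", 1)[1].split("}", 1)[0]
    data = text.encode("utf-8")
    docs.append({"key": key, "tokens": set(title.lower().split()),
                 "offset": offset, "length": len(data)})
    offset += len(data)

def search(term: str) -> list:
    """'Search': match on indexed metadata, return document keys only."""
    return [d["key"] for d in docs if term.lower() in d["tokens"]]

def fetch(key: str) -> str:
    """Resolve a hit by seeking into the .bib file."""
    d = next(d for d in docs if d["key"] == key)
    with open(path, "rb") as f:
        f.seek(d["offset"])
        return f.read(d["length"]).decode("utf-8")

hits = search("databases")   # matches only the doe2019 document
entry = fetch(hits[0])
os.unlink(path)
```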

🧠 Summary

BibTeXIndexer is a robust, modern solution for handling large .bib files with:

  • Fast indexing
  • Lazy access
  • In-place updates
  • Excellent scalability

It’s suitable for integration into reference managers like JabRef, or search systems using Apache Lucene — giving the best of both speed and resource efficiency.

@koppor
Member

koppor commented Mar 30, 2025

We need to work out how the indexer fulfills the requirements of regex-based search and normalisation of terms (Duesseldorf, Dusseldorf, Düsseldorf, D"{u}sseldorf), and also normalization of names.

The idea seems similar to https://github.com/dhis2/json-tree, but json-tree lacks normalization.

@koppor
Member

koppor commented Mar 30, 2025

Reading more: Lucene could solve that. However, @LoayGhreeb switched to Postgres for the reasons given in #11803. I did not insist on an ADR, so we need to collect the reasons from him. RegEx performance and wrong matches were IMHO the main reasons...
