Parse MeSH terms in PubMed MEDLINE records #12532

ryan-carpenter opened this issue Feb 19, 2025 · 10 comments

ryan-carpenter commented Feb 19, 2025

Is your suggestion for improvement related to a problem? Please describe.

MEDLINE records are indexed with headings and subheadings (MeSH terms), having a one-to-many relationship between headings and subheadings. PubMed displays the MeSH terms individually (in pairs), like this.

Notice that the heading "Kidney Diseases" repeats for each associated subheading, with trailing asterisks denoting "Major topics". This is not how the MeSH terms appear in PubMed exports, and therefore, not how JabRef imports them.

This is how the terms come in PubMed text files.

MH  - *Kidney Diseases/diagnosis/epidemiology/physiopathology/therapy

JabRef imports this unchanged as one keyword:

*Kidney Diseases/diagnosis/epidemiology/physiopathology/therapy

This is how the terms appear in PubMed XML.

<MeshHeading>
    <DescriptorName UI="D007674" MajorTopicYN="Y">Kidney Diseases</DescriptorName>
    <QualifierName UI="Q000175" MajorTopicYN="N">diagnosis</QualifierName>
    <QualifierName UI="Q000453" MajorTopicYN="N">epidemiology</QualifierName>
    <QualifierName UI="Q000503" MajorTopicYN="N">physiopathology</QualifierName>
    <QualifierName UI="Q000628" MajorTopicYN="N">therapy</QualifierName>
</MeshHeading>

Again, JabRef imports this as one keyword, this time separating the subheadings with a comma:

Kidney Diseases, diagnosis, epidemiology, physiopathology, therapy
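For comparison, here is a minimal sketch of how a single `MeshHeading` element could be turned into one keyword per heading/subheading pair, with `MajorTopicYN="Y"` rendered as a trailing asterisk. The class `MeshHeadingXml` and its methods are hypothetical names for illustration and are not part of JabRef's `MedlineImporter`; only the JDK's built-in DOM parser is used.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Hypothetical helper (not JabRef API): one keyword per descriptor/qualifier pair. */
final class MeshHeadingXml {

    /** Parses one MeshHeading element and returns PubMed-style keywords. */
    static List<String> toKeywords(String meshHeadingXml) {
        Element root;
        try {
            root = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(meshHeadingXml)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalArgumentException("Not a well-formed MeshHeading element", e);
        }

        NodeList descriptors = root.getElementsByTagName("DescriptorName");
        if (descriptors.getLength() == 0) {
            return List.of();
        }
        String heading = label((Element) descriptors.item(0));

        NodeList qualifiers = root.getElementsByTagName("QualifierName");
        List<String> keywords = new ArrayList<>();
        if (qualifiers.getLength() == 0) {
            keywords.add(heading); // heading without subheadings
        }
        for (int i = 0; i < qualifiers.getLength(); i++) {
            keywords.add(heading + "/" + label((Element) qualifiers.item(i)));
        }
        return keywords;
    }

    /** Returns the element text, with a trailing asterisk if it is a major topic. */
    private static String label(Element element) {
        String text = element.getTextContent().trim();
        return "Y".equals(element.getAttribute("MajorTopicYN")) ? text + "*" : text;
    }
}
```

Applied to the element above, this would yield four keywords, each starting with `Kidney Diseases*/`, because the descriptor is marked as a major topic and the qualifiers are not.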

Describe the solution you'd like

I would like JabRef to import MeSH terms as individual keywords using the same format as PubMed, where each heading has a maximum of one subheading and the major topic is displayed as an asterisk at the end of the heading or subheading string. Keywords generated from plain text or XML files from PubMed should have the same format in JabRef.

The keywords should look like this:
Kidney Diseases*/diagnosis Kidney Diseases*/epidemiology Kidney Diseases*/physiopathology Kidney Diseases*/therapy

The BibTeX source should look like this (assuming the user-defined keyword separator is a semicolon):

Kidney Diseases*/diagnosis; Kidney Diseases*/epidemiology; Kidney Diseases*/physiopathology; Kidney Diseases*/therapy

Parsing MeSH terms this way lets the keywords fit better in the GUI and makes it easier to search and filter by keyword.
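The requested expansion of a text-format `MH` line can be sketched as follows. This is an illustration under assumptions, not JabRef's implementation: `MeshTermSplitter` is a hypothetical class, and it assumes that `/` only ever separates a heading from its subheadings and that a leading `*` marks the major topic.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not JabRef API): expands one MH value into heading/subheading pairs. */
final class MeshTermSplitter {

    private static final String MH_PREFIX = "MH  - ";

    /** Accepts a raw "MH  - ..." line or just its value and returns one keyword per pair. */
    static List<String> split(String line) {
        String value = line.startsWith(MH_PREFIX) ? line.substring(MH_PREFIX.length()) : line;
        String[] parts = value.trim().split("/");

        String heading = moveAsterisk(parts[0]);
        List<String> keywords = new ArrayList<>();
        if (parts.length == 1) {
            keywords.add(heading); // heading without subheadings stays a single keyword
            return keywords;
        }
        for (int i = 1; i < parts.length; i++) {
            keywords.add(heading + "/" + moveAsterisk(parts[i]));
        }
        return keywords;
    }

    /** Moves a leading major-topic asterisk to the end of the term. */
    private static String moveAsterisk(String term) {
        String trimmed = term.trim();
        return trimmed.startsWith("*") ? trimmed.substring(1) + "*" : trimmed;
    }
}
```

The resulting list could then be joined with whatever keyword separator is in effect, for example `String.join("; ", MeshTermSplitter.split(line))` for a semicolon.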

Additional context
Ideally, the MEDLINE importer (and other importers) would check if the user-defined keyword separator is included in the input, and warn or choose a substitution in case of conflict. List items appear one per line in PubMed text files, so the keyword separator should not be found in any lines that begin with `MH  - `.

* Parses the keyword list and uses {@link Keyword#DEFAULT_HIERARCHICAL_DELIMITER} as hierarchical delimiter.

Regex for moving asterisks to the end.

(?<slash>/{0,1})\*(?<subhead>.+?(?=/|$))(?<=^MH  - .*)

Replace with

${slash}${subhead}*
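The regex above relies on a variable-length lookbehind to restrict matches to `MH` lines, which not every regex engine accepts. A minimal Java sketch of the same replacement, assuming the prefix check is done in code instead of in the pattern, could look like this. `MeshAsteriskMover` is a hypothetical name, and the pattern is a simplified variant of the one above rather than an exact copy.

```java
import java.util.regex.Pattern;

/** Hypothetical helper (not JabRef API): moves major-topic asterisks to the end of each term. */
final class MeshAsteriskMover {

    private static final String MH_PREFIX = "MH  - ";

    // An optional slash, the asterisk, then the term up to the next slash or end of line.
    private static final Pattern LEADING_ASTERISK =
            Pattern.compile("(?<slash>/?)\\*(?<subhead>[^/]+)");

    /** Rewrites MH lines only; every other line is returned unchanged. */
    static String moveAsterisks(String line) {
        if (!line.startsWith(MH_PREFIX)) {
            return line;
        }
        String value = line.substring(MH_PREFIX.length());
        return MH_PREFIX + LEADING_ASTERISK.matcher(value).replaceAll("${slash}${subhead}*");
    }
}
```

Checking the prefix with `startsWith` keeps the pattern itself free of lookarounds, at the cost of handling one line at a time.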

Discussion on JabRef Discourse

@ryan-carpenter
Author

Edits:

  • Removed * from the xml-import example. The MEDLINE importer currently does not mark the major topic.
  • Updated suggested import format to Heading/subheading instead of Heading /subheading

Keywords could have the format Heading / subheading, Heading /subheading, or Heading/subheading. I suggest the last one, because this is the format of PubMed text files and provides the greatest compatibility with existing libraries. This is the format JabRef already uses, as long as the heading has only one subheading, so it makes sense to keep this pattern.

github-project-automation bot moved this to Normal priority in Prioritization on Feb 24, 2025
@ungerts
Contributor

ungerts commented Mar 21, 2025

Implemented a prototype to identify potential challenges or problems related to this issue.

Image

I encountered two main problems:

  • Inconsistent Keyword Separators in JabRef: JabRef uses different characters as keyword separators depending on the context. For instance, MedlineImporter.class uses the semicolon (;), defined as a constant within the class. Meanwhile, the Preferences configuration uses a comma (,) as the default separator. This inconsistency can lead to confusion or unexpected behavior in standard configurations.

  • Comma in MeSH Terms: MeSH terms may contain commas, which conflict with the default keyword separator (,). Currently, there is no mechanism to escape or handle such cases, resulting in the incorrect splitting of MeSH terms.

In my opinion, addressing these inconsistencies is essential to developing a sustainable and robust solution.

@koppor
Member

koppor commented Mar 22, 2025

"Models, Molecular" seems like a hierarchical keyword? Is this always the case that comma is used for hierarchy?

@ungerts
Contributor

ungerts commented Mar 22, 2025

Although it might look like a hierarchy, it's not actually one. I experimented a bit with the MeSH database, and treating the comma in Models, Molecular as a hierarchical delimiter doesn’t work. In that interpretation, you’d end up with two hierarchies named Models, which doesn’t make sense.

Here’s the actual hierarchy:

Analytical, Diagnostic and Therapeutic Techniques and Equipment Category
        Investigative Techniques
            Models, Theoretical
                Models, Molecular
                    Molecular Docking Simulation
                    Molecular Dynamics Simulation
                    Pharmacophore

Looking at the hierarchy for Diabetes Mellitus reveals even more complex scenarios.

@ungerts
Contributor

ungerts commented Mar 22, 2025

It seems there will also be some UI-related issues—the keyword box is too small to handle a large set of keywords.

Image

@koppor
Member

koppor commented Mar 23, 2025

> It seems there will also be some UI-related issues—the keyword box is too small to handle a large set of keywords.

This refs #6856

@koppor
Member

koppor commented Mar 23, 2025

>   • Inconsistent Keyword Separators in JabRef: JabRef uses different characters as keyword separators depending on the context. For instance, MedlineImporter.class uses the semicolon (;), defined as a constant within the class. Meanwhile, the Preferences configuration uses a comma (,) as the default separator. This inconsistency can lead to confusion or unexpected behavior in standard configurations.

I deduce:

  1. Have the keyword separator config be used in MedlineImporter
  2. Have escaping for the keyword separator be implemented, especially

@ryan-carpenter
Author

> "Models, Molecular" seems like a hierarchical keyword? Is this always the case that comma is used for hierarchy?

No, unfortunately, commas do not reflect the hierarchy even though it often appears this way.

Here is an example where the comma looks like part of the hierarchy and where it would be possible to abbreviate the keyword by removing the redundant portion Aged.

Image

Now consider an example where it is clear that commas do not in fact represent the hierarchy. Notice that Models, Animal has the peer Models, Theoretical but neither of these is a child of Models. Similarly, Disease Models, Animal is not a child of Disease Models, and in fact there is no subject in MEDLINE called "Disease Models". Note: there are three hierarchies shown because Disease Models, Animal exists in each of them.

Image

@ryan-carpenter
Author

> It seems there will also be some UI-related issues—the keyword box is too small to handle a large set of keywords.

This is a problem in the current situation as well. MEDLINE terms that have multiple subheadings are too long for the entry editor. Since keywords do not wrap, they get cut off and cannot be seen beyond the right boundary of the field.

@ryan-carpenter
Author

> I deduce:
>
> 1. Have the keyword separator config be used in `MedlineImporter`
>
> 2. Have escaping for the keyword separator be implemented, especially

Escaping is the only real solution, because the keyword separator comes from the source and not the user. Different sources use different separators, so the user preference alone cannot be relied upon. Substitution is another possibility, but this too would be unreliable since the substitution would have to use a character that does not occur in the source.
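To make the escaping idea concrete, here is a minimal sketch of a round trip in which any occurrence of the keyword separator inside a keyword is escaped when the field is written and unescaped when it is read back. Everything here is an assumption for illustration: `KeywordEscaping` is a hypothetical class, the backslash as escape character is an arbitrary choice, and JabRef may need a different convention to stay compatible with BibTeX.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not JabRef API): escapes the separator inside keywords. */
final class KeywordEscaping {

    private static final char ESCAPE = '\\'; // assumed escape character

    /** Joins keywords, escaping the separator and the escape character itself. */
    static String join(List<String> keywords, char separator) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < keywords.size(); i++) {
            if (i > 0) {
                out.append(separator).append(' ');
            }
            for (char c : keywords.get(i).toCharArray()) {
                if (c == separator || c == ESCAPE) {
                    out.append(ESCAPE);
                }
                out.append(c);
            }
        }
        return out.toString();
    }

    /** Splits on unescaped separators only and removes the escape characters. */
    static List<String> split(String field, char separator) {
        List<String> keywords = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean escaped = false;
        for (char c : field.toCharArray()) {
            if (escaped) {
                current.append(c); // escaped character is taken literally
                escaped = false;
            } else if (c == ESCAPE) {
                escaped = true;
            } else if (c == separator) {
                keywords.add(current.toString().trim());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        keywords.add(current.toString().trim());
        return keywords;
    }
}
```

With a comma as separator, `Models, Molecular` would be stored as `Models\, Molecular` and read back as a single keyword, regardless of which separator the user has configured.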

koppor moved this from Normal priority to High priority in Prioritization on Apr 1, 2025