Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve KeywordList parsing to support escaped characters and nested structures #12888

Open
wants to merge 8 commits into
base: main
Choose a base branch
from

Conversation

krishnagjsForGit
Copy link

@krishnagjsForGit krishnagjsForGit commented Apr 6, 2025

Closes #12810

What I changed
Implemented escape handling in KeywordList.parse(String, Character, Character) to support escaping the keyword separator using backslash (). This prevents incorrect splitting of keywords when delimiters appear within the keyword itself.

Refactored the parsing logic for clarity, including renaming loop variables for better readability and intent.

Added test cases in KeywordListTest to cover escape scenarios such as:

Escaped delimiter: "one\,two" → ["one,two"]

Escaped backslash: "one\\two" → ["one\two"]

Mixed escaped and unescaped delimiters.

Where the changes are
KeywordList.java: Modified the parse method to include escape handling logic using a character-by-character loop.

KeywordListTest.java: Added JUnit test cases to ensure parsing behaves correctly with escaped delimiters and backslashes.

Why I made these changes
Fixes bug #12810: Current parsing breaks when delimiters appear inside keywords (e.g., in MeSH terms). There was no way to escape the delimiter character, leading to incorrect keyword splitting.

Improves consistency: Provides a more robust, predictable behavior for keyword parsing, especially when users or importers use delimiters in keyword values.

Next Steps
Awaiting review and feedback.

If the community prefers a different escape character or behavior, I’m happy to adjust the implementation.

After this is merged, similar logic could be extracted or reused where keyword parsing happens elsewhere (e.g., importers).

Mandatory checks

  • I own the copyright of the code submitted and I license it under the MIT license
  • [/] Change in CHANGELOG.md described in a way that is understandable for the average user (if change is visible to the user)
  • Tests created for changes (if applicable)
  • Manually tested changed features in running JabRef (always required)
  • [/] Screenshots added in PR description (if change is visible to the user)
  • [/] Checked developer's documentation: Is the information available and up to date? If not, I outlined it in this pull request.
  • Checked documentation: Is the information available and up to date? If not, I created an issue at https://github.com/JabRef/user-documentation/issues or, even better, I submitted a pull request to the documentation repository.

Copy link

trag-bot bot commented Apr 6, 2025

@trag-bot didn't find any issues in the code! ✅✨

@krishnagjsForGit
Copy link
Author

can someone help me on this. is this a valid comment from Bot ? I have never used DisplayName annotation and I changed method name to be more comprehensive. Am I missing anything here ?

@Siedlerchr
Copy link
Member

Seems like a false positive from the bot @koppor

@koppor
Copy link
Member

koppor commented Apr 6, 2025

@krishnagjsForGit please change the PR title to contain text summarizing the fix.

@krishnagjsForGit krishnagjsForGit changed the title Fix for issue https://github.com/JabRef/jabref/issues/12810 Improve KeywordList parsing to support escaped characters and nested structures Apr 6, 2025
@krishnagjsForGit
Copy link
Author

krishnagjsForGit commented Apr 6, 2025

@krishnagjsForGit please change the PR title to contain text summarizing the fix.

Done. is this fine @koppor ?

} else if (currentChar == '\\') {
isEscaping = true;
} else if (currentChar == delimiter) {
tokens.add(currentToken.toString().trim());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we should add a check here to ensure the token isn't empty. This would make the implementation more reliable and error-tolerant. For example, consider a .bib file that contains a token with only whitespace, like , ,. The current implementation would trim the string to an empty one and still add it, which might not be the desired behavior.

currentToken.append(currentChar);
isEscaping = false;
} else if (currentChar == '\\') {
isEscaping = true;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure whether a backslash (\) is a valid character in a keyword. The current implementation would remove the backslash, which might not be the intended behavior.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Solution escaped delimiters and escaped hierarchical delimiters.

keywordList.add(chainRoot);
List<String> tokens = splitRespectingEscapes(keywordString, delimiter);

for (String token : tokens) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe there's an opportunity to eliminate this loop. In the _splitRespectingEscapes_ method, all tokens are already available and accessible. It feels inefficient to first collect all tokens in a List and then iterate over it.

Sketch:

public static KeywordList parse(String keywordString, Character delimiter, Character hierarchicalDelimiter) {
    if (StringUtil.isBlank(keywordString)) {
        return new KeywordList();
    }

    Objects.requireNonNull(delimiter);
    Objects.requireNonNull(hierarchicalDelimiter);

    KeywordList keywordList = new KeywordList();
    List<String> hierarchy = new ArrayList<>();
    StringBuilder currentToken = new StringBuilder();
    boolean isEscaping = false;

    for (int i = 0; i < keywordString.length(); i++) {
        char currentChar = keywordString.charAt(i);

        if (isEscaping) {
            currentToken.append(currentChar);
            isEscaping = false;
        } else if (currentChar == '\\') {
            isEscaping = true;
        } else if (currentChar == hierarchicalDelimiter) {
            hierarchy.add(currentToken.toString().trim());
            currentToken.setLength(0);
        } else if (currentChar == delimiter) {
            hierarchy.add(currentToken.toString().trim());
            keywordList.add(Keyword.of(hierarchy.toArray(new String[0])));
            hierarchy.clear();
            currentToken.setLength(0);
        } else {
            currentToken.append(currentChar);
        }
    }

    // Handle final token
    if (currentToken.length() > 0 || !hierarchy.isEmpty()) {
        hierarchy.add(currentToken.toString().trim());
        keywordList.add(Keyword.of(hierarchy.toArray(new String[0])));
    }

    return keywordList;
}

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Implement escaping for keyword separators
4 participants