[BUG] synonym_graph filter fails with word_delimiter_graph when using whitespace or classic tokenizer in synonym_analyzer – similar to #16263 #18037

Open
nupurjaiswal opened this issue Apr 22, 2025 · 5 comments
Labels
bug, Indexing

Comments

@nupurjaiswal

Describe the bug

I'm encountering a bug similar to #16263 while configuring analyzers that use both word_delimiter_graph and synonym_graph. I'm migrating from Solr to OpenSearch 2.19 and hit a limitation with the synonym_graph filter when it references a custom synonym_analyzer (a custom analyzer built on the whitespace tokenizer).

When I define a simple synonym analyzer using the whitespace tokenizer (i.e., no_split_synonym_analyzer) and point the synonym_graph filter at it, everything works as expected.

However, the moment I add any additional filters such as word_delimiter_graph, asciifolding, or hunspell, I encounter the following error:

{
  "error": {
    "root_cause": [
      {
        "type": "illegal_argument_exception",
        "reason": "Token filter [custom_word_delimiter] cannot be used to parse synonyms"
      }
    ],
    "type": "illegal_argument_exception",
    "reason": "Token filter [custom_word_delimiter] cannot be used to parse synonyms"
  },
  "status": 400
}

Use Case
In our Solr configuration, we handle synonym normalization for terms like:

covid, covid-19, covid 19

skydiving, sky diving, sky-diving

handheld, hand-held

This works seamlessly there even when using filters like WordDelimiterGraphFilterFactory, Hunspell, etc.

We want to achieve similar behavior in OpenSearch, including using a synonym_graph filter with a custom analyzer that includes:

word_delimiter_graph (with preserve_original or catenate_all)

asciifolding (with preserve_original)

hunspell

and a pattern_replace filter

Sample Config (Works):

"analyzer": {
  "test_analyzer": {
    "type": "custom",
    "tokenizer": "whitespace",
    "filter": [
      "lowercase",
      "custom_synonym_graph-replacement_filter"
    ]
  },
  "no_split_synonym_analyzer": {
    "type": "custom",
    "tokenizer": "whitespace"
  }
}

Sample Config (Fails):
When adding custom_word_delimiter, asciifolding, or hunspell to the same analyzer:

"test_analyzer": {
  "type": "custom",
  "tokenizer": "whitespace",
  "filter": [
    "lowercase",
    "custom_word_delimiter",
    "custom_hunspell_stemmer",
    "custom_synonym_graph-replacement_filter"
  ]
}

Results in:

Token filter [custom_word_delimiter] cannot be used to parse synonyms

It would be great if OpenSearch could enhance the synonym_graph behavior to:

Allow more flexible use of filters in synonym_analyzer, especially word_delimiter_graph, which is commonly used in language normalization pipelines.

A similar issue was resolved in the past here: #16263 — perhaps this one can be handled in a similar fashion.

Related component

Indexing

To Reproduce

Create Mapping

{
  "settings": {
    "analysis": {
      "char_filter": {
        "custom_pattern_replace": {
          "type": "pattern_replace",
          "pattern": "[({.,\\[\\]“”/})]",
          "replacement": " "
        }
      },
      "filter": {
        "custom_ascii_folding": {
          "type": "asciifolding",
          "preserve_original": true
        },
        "custom_pattern_replace_filter": {
          "type": "pattern_replace",
          "pattern": "(-)",
          "replacement": " ",
          "all": true
        },
        "custom_synonym_graph-replacement_filter": {
          "type": "synonym_graph",
          "synonyms": [
            "laptop, notebook",
            "covid, covid-19, covid 19",
            "skydiving, sky diving, sky-diving",
            "handheld, hand-held"
          ],
          "synonym_analyzer": "no_split_synonym_analyzer"
        },
        "custom_word_delimiter": {
          "type": "word_delimiter_graph",
          "generate_word_parts": true,
          "catenate_all": true,
          "split_on_numerics": false,
          "split_on_case_change": false
        },
        "custom_hunspell_stemmer": {
          "type": "hunspell",
          "locale": "en_US"
        }
      },
      "analyzer": {
        "test_analyzer": {
          "type": "custom",
          "char_filter": [
            "custom_pattern_replace"
          ],
          "tokenizer": "whitespace",
          "filter": [
            "custom_ascii_folding",
            "lowercase",
            "custom_word_delimiter",
            "custom_hunspell_stemmer",
            "custom_synonym_graph-replacement_filter",
            "custom_pattern_replace_filter",
            "flatten_graph"
          ]
        },
        "no_split_synonym_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace"
        }
      }
    }
  }
}

Error:

{
    "error": {
        "root_cause": [
            {
                "type": "illegal_argument_exception",
                "reason": "Token filter [custom_word_delimiter] cannot be used to parse synonyms"
            }
        ],
        "type": "illegal_argument_exception",
        "reason": "Token filter [custom_word_delimiter] cannot be used to parse synonyms"
    },
    "status": 400
}

Expected behavior

  • The synonym_graph filter with a whitespace/classic-tokenizer synonym_analyzer should work even when the main analyzer chain also contains filters like word_delimiter_graph, asciifolding, or hunspell.
  • It should not throw errors when a custom synonym_analyzer is provided.
  • Currently, it works only if the synonym_analyzer uses the standard tokenizer alongside other filters.
  • It should also work with the whitespace or classic tokenizer, allowing more flexibility.
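
Once index creation succeeds, a request like the following (index name hypothetical) would let us verify the expansion, e.g. that covid-19 is expanded to the other variants:

POST /test-index/_analyze
{
  "analyzer": "test_analyzer",
  "text": "covid-19"
}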

Additional Details

Host/Environment (please complete the following information):

  • OpenSearch version: 2.19
@nupurjaiswal added the bug and untriaged labels on Apr 22, 2025
@github-actions bot added the Indexing label on Apr 22, 2025
@msfroh
Contributor

msfroh commented Apr 28, 2025

Something that I just noticed with the previous change, here:

if (synonymAnalyzerName != null) {
    Analyzer customSynonymAnalyzer;
    try {
        customSynonymAnalyzer = analysisRegistry.getAnalyzer(synonymAnalyzerName);
    } catch (IOException e) {
        throw new RuntimeException(e);
    }
    if (customSynonymAnalyzer != null) {
        return customSynonymAnalyzer;
    }
}
return new CustomAnalyzer(
    tokenizer,
    charFilters.toArray(new CharFilterFactory[0]),
    tokenFilters.stream().map(TokenFilterFactory::getSynonymFilter).toArray(TokenFilterFactory[]::new)
);

If the named analyzer is not returned from the AnalysisRegistry, then we fall through and try using the old behavior.

IMO, this is incorrect. If the named analyzer is not returned, it's an error. My hunch is this is what's happening. The if statement on line 159 should have an else that throws an exception.

Why does this happen? My guess is that we try to create the filter before the custom analyzer has been registered. We might need to rejig the loading of synonym analyzers to make it lazier.
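
Something like this (untested sketch; the exception type and message are just illustrative):

if (synonymAnalyzerName != null) {
    Analyzer customSynonymAnalyzer;
    try {
        customSynonymAnalyzer = analysisRegistry.getAnalyzer(synonymAnalyzerName);
    } catch (IOException e) {
        throw new RuntimeException(e);
    }
    if (customSynonymAnalyzer != null) {
        return customSynonymAnalyzer;
    }
    // The user explicitly named an analyzer; failing to resolve it should be
    // an error rather than a silent fall-through to the legacy path below.
    throw new IllegalArgumentException(
        "Failed to find synonym_analyzer [" + synonymAnalyzerName + "]"
    );
}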

@nupurjaiswal -- out of curiosity, if you use the built-in whitespace analyzer, does your mapping work? My hunch is that it will. I think right now, the synonym_analyzer will only work with built-in analyzers.

@nupurjaiswal
Author

@msfroh The synonym_graph filter works with a built-in analyzer alongside other filters like word delimiter and hunspell. That was fixed in #16263.

But if I add a custom analyzer with whitespace tokenizer for the synonym graph filter, I get an exception while creating the index.

@msfroh
Contributor

msfroh commented Apr 28, 2025

Right, so to be clear, your example above would work with "synonym_analyzer":"whitespace", right?

Obviously, you should be able to parse your synonyms with a custom analyzer as well -- we need to fix that. I just want to make sure that it works with any built-in analyzer, not just standard analyzer.

@nupurjaiswal
Author

@msfroh That's correct. "synonym_analyzer":"whitespace" works with the other filters.
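
For reference, this variant of the filter definition works (synonym list trimmed for brevity):

"custom_synonym_graph-replacement_filter": {
  "type": "synonym_graph",
  "synonyms": [
    "covid, covid-19, covid 19",
    "handheld, hand-held"
  ],
  "synonym_analyzer": "whitespace"
}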

@erg

erg commented May 13, 2025

I'm having a related (the same?) issue. If I include my synonym_filter line, index creation fails on OpenSearch 2.19 in AWS. On 2.17 I could move the synonym_filter line around and it worked (though not the way I need it to). Now I can't even move it around. Changing to the "standard" analyzer does not fix it. OpenSearch versions newer than 2.19 are not yet supported in AWS, so I can't test on those.

{
  "error": {
    "root_cause": [
      {
        "type": "illegal_argument_exception",
        "reason": "Token filter [word_delimiter_filter] cannot be used to parse synonyms"
      }
    ],
    "type": "illegal_argument_exception",
    "reason": "Token filter [word_delimiter_filter] cannot be used to parse synonyms"
  },
  "status": 400
}

Shuffling the filters around, I can also get this error:

{
  "error": {
    "root_cause": [
      {
        "type": "illegal_argument_exception",
        "reason": "Failed to build synonyms"
      }
    ],
    "type": "illegal_argument_exception",
    "reason": "Failed to build synonyms"
  },
  "status": 400
}

// Filter order that produces the "Failed to build synonyms" error:
          "filter": [
            "lowercase",
            "synonym_filter",
            "asciifolding",
            "edge_ngram_filter",
            "word_delimiter_filter",
            "unique_filter"
          ]
{
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "analyzer": "text_analyzer"
      }
    }
  },
  "settings": {
    "analysis": {
      "filter": {
        "synonym_filter": {
          "type": "synonym_graph",
          "format": "solr",
          "synonyms_path": "$analysis/conjunctions.txt",
          "expand": true,
          "ignore_case": true
        },
        "unique_filter": {
          "type": "unique",
          "only_on_same_position": false
        },
        "edge_ngram_filter": {
          "type": "edge_ngram",
          "min_gram": 2,
          "max_gram": 30,
          "preserve_original": true
        },
        "word_delimiter_filter": {
          "type": "word_delimiter_graph",
          "generate_word_parts": false,
          "generate_number_parts": false,
          "catenate_words": true,
          "catenate_numbers": true,
          "catenate_all": false,
          "split_on_numerics": false,
          "split_on_case_change": false,
          "preserve_original": true
        }
      },
      "analyzer": {
        "text_analyzer": {
          "tokenizer": "whitespace",
          "char_filter": ["html_strip"],
          "filter": [
            "lowercase",
            "asciifolding",
            "word_delimiter_filter",
            "edge_ngram_filter",
            "synonym_filter", // comment this line out and it works
            "unique_filter"
          ]
        }
      }
    }
  }
}

Edit: I realized I have a preprocessor that inlines my .txt files. Here's the full JSON:

{
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "analyzer": "text_analyzer"
      }
    }
  },
  "settings": {
    "analysis": {
      "filter": {
        "synonym_filter": {
          "type": "synonym_graph",
          "format": "solr",
          "expand": true,
          "ignore_case": true,
          "synonyms": [
            "and,&,/,or"
          ]
        },
        "unique_filter": {
          "type": "unique",
          "only_on_same_position": false
        },
        "edge_ngram_filter": {
          "type": "edge_ngram",
          "min_gram": 2,
          "max_gram": 30,
          "preserve_original": true
        },
        "word_delimiter_filter": {
          "type": "word_delimiter_graph",
          "generate_word_parts": false,
          "generate_number_parts": false,
          "catenate_words": true,
          "catenate_numbers": true,
          "catenate_all": false,
          "split_on_numerics": false,
          "split_on_case_change": false,
          "preserve_original": true
        }
      },
      "analyzer": {
        "text_analyzer": {
          "tokenizer": "standard",
          "char_filter": [
            "html_strip"
          ],
          "filter": [
            "lowercase",
            "synonym_filter",
            "asciifolding",
            "edge_ngram_filter",
            "word_delimiter_filter",
            "unique_filter"
          ]
        }
      }
    }
  }
}
