[BUG] synonym_graph filter fails with word_delimiter_graph when using whitespace or classic tokenizer in synonym_analyzer – similar to #16263 #18037

Open
nupurjaiswal opened this issue Apr 22, 2025 · 5 comments
Labels
bug, Indexing

Comments

@nupurjaiswal

Describe the bug

I'm encountering a bug similar to #16263 while configuring analyzers that use both word_delimiter_graph and synonym_graph. I'm migrating from Solr to OpenSearch 2.19 and hit a limitation with the synonym_graph filter when it references a custom synonym_analyzer (a custom analyzer built on the whitespace tokenizer).

When I define a simple synonym analyzer using the whitespace tokenizer (i.e., no_split_synonym_analyzer) and point the synonym_graph filter at it, everything works as expected.

However, the moment I add any additional filters such as word_delimiter_graph, asciifolding, or hunspell, I encounter the following error:

{
  "error": {
    "root_cause": [
      {
        "type": "illegal_argument_exception",
        "reason": "Token filter [custom_word_delimiter] cannot be used to parse synonyms"
      }
    ],
    "type": "illegal_argument_exception",
    "reason": "Token filter [custom_word_delimiter] cannot be used to parse synonyms"
  },
  "status": 400
}

Use Case
In our Solr configuration, we handle synonym normalization for terms like:

covid, covid-19, covid 19

skydiving, sky diving, sky-diving

handheld, hand-held

This works seamlessly there even when using filters like WordDelimiterGraphFilterFactory, Hunspell, etc.

We want to achieve similar behavior in OpenSearch, including using a synonym_graph filter with a custom analyzer that includes:

word_delimiter_graph (with preserve_original or catenate_all)

asciifolding (with preserve_original)

hunspell

and a pattern_replace filter

Sample Config (Works):

"analyzer": {
  "test_analyzer": {
    "type": "custom",
    "tokenizer": "whitespace",
    "filter": [
      "lowercase",
      "custom_synonym_graph-replacement_filter"
    ]
  },
  "no_split_synonym_analyzer": {
    "type": "custom",
    "tokenizer": "whitespace"
  }
}

Sample Config (Fails):
When adding custom_word_delimiter, asciifolding, or hunspell to the same analyzer:

"test_analyzer": {
  "type": "custom",
  "tokenizer": "whitespace",
  "filter": [
    "lowercase",
    "custom_word_delimiter",
    "custom_hunspell_stemmer",
    "custom_synonym_graph-replacement_filter"
  ]
}

Results in:

Token filter [custom_word_delimiter] cannot be used to parse synonyms

It would be great if OpenSearch could enhance the synonym_graph behavior to:

Allow more flexible use of filters in synonym_analyzer, especially word_delimiter_graph, which is commonly used in language normalization pipelines.

A similar issue was resolved in the past here: #16263 — perhaps this one can be handled in a similar fashion.

Related component

Indexing

To Reproduce

Create Mapping

{
  "settings": {
    "analysis": {
      "char_filter": {
        "custom_pattern_replace": {
          "type": "pattern_replace",
          "pattern": "[({.,\\[\\]“”/})]",
          "replacement": " "
        }
      },
      "filter": {
        "custom_ascii_folding": {
          "type": "asciifolding",
          "preserve_original": true
        },
        "custom_pattern_replace_filter": {
          "type": "pattern_replace",
          "pattern": "(-)",
          "replacement": " ",
          "all": true
        },
        "custom_synonym_graph-replacement_filter": {
          "type": "synonym_graph",
          "synonyms": [
            "laptop, notebook",
            "covid, covid-19, covid 19",
            "skydiving, sky diving, sky-diving",
            "handheld, hand-held"
          ],
          "synonym_analyzer": "no_split_synonym_analyzer"
        },
        "custom_word_delimiter": {
          "type": "word_delimiter_graph",
          "generate_word_parts": true,
          "catenate_all": true,
          "split_on_numerics": false,
          "split_on_case_change": false
        },
        "custom_hunspell_stemmer": {
          "type": "hunspell",
          "locale": "en_US"
        }
      },
      "analyzer": {
        "test_analyzer": {
          "type": "custom",
          "char_filter": [
            "custom_pattern_replace"
          ],
          "tokenizer": "whitespace",
          "filter": [
            "custom_ascii_folding",
            "lowercase",
            "custom_word_delimiter",
            "custom_hunspell_stemmer",
            "custom_synonym_graph-replacement_filter",
            "custom_pattern_replace_filter",
            "flatten_graph"
          ]
        },
        "no_split_synonym_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace"
        }
      }
    }
  }
}

Error:

{
    "error": {
        "root_cause": [
            {
                "type": "illegal_argument_exception",
                "reason": "Token filter [custom_word_delimiter] cannot be used to parse synonyms"
            }
        ],
        "type": "illegal_argument_exception",
        "reason": "Token filter [custom_word_delimiter] cannot be used to parse synonyms"
    },
    "status": 400
}

Expected behavior

  • The synonym_graph filter with a whitespace/classic-tokenizer synonym_analyzer should work even when the main analyzer chain also contains filters like word_delimiter_graph, asciifolding, or hunspell.
  • It should not throw errors when a custom synonym_analyzer is provided.
  • Currently, it works only if the synonym_analyzer uses the standard tokenizer alongside other filters.
  • It should also work with the whitespace or classic tokenizer, allowing more flexibility.
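
Once index creation succeeds, a request like the following (index name hypothetical) would let us verify the expansion, e.g. that covid-19 is expanded to the other variants:

POST /test-index/_analyze
{
  "analyzer": "test_analyzer",
  "text": "covid-19"
}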

Additional Details

Host/Environment (please complete the following information):

  • OpenSearch version: 2.19
@nupurjaiswal added the bug and untriaged labels on Apr 22, 2025
@github-actions bot added the Indexing label on Apr 22, 2025
@msfroh
Contributor

msfroh commented Apr 28, 2025

Something that I just noticed with the previous change, here:

if (synonymAnalyzerName != null) {
    Analyzer customSynonymAnalyzer;
    try {
        customSynonymAnalyzer = analysisRegistry.getAnalyzer(synonymAnalyzerName);
    } catch (IOException e) {
        throw new RuntimeException(e);
    }
    if (customSynonymAnalyzer != null) {
        return customSynonymAnalyzer;
    }
}
return new CustomAnalyzer(
    tokenizer,
    charFilters.toArray(new CharFilterFactory[0]),
    tokenFilters.stream().map(TokenFilterFactory::getSynonymFilter).toArray(TokenFilterFactory[]::new)
);

If the named analyzer is not returned from the AnalysisRegistry, then we fall through and try using the old behavior.

IMO, this is incorrect. If the named analyzer is not returned, it's an error. My hunch is this is what's happening. The if statement on line 159 should have an else that throws an exception.

Why does this happen? My guess is that we try to create the filter before the custom analyzer has been registered. We might need to rejig the loading of synonym analyzers to make it lazier.
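
Something like this (untested sketch; the exception type and message are just illustrative):

if (synonymAnalyzerName != null) {
    Analyzer customSynonymAnalyzer;
    try {
        customSynonymAnalyzer = analysisRegistry.getAnalyzer(synonymAnalyzerName);
    } catch (IOException e) {
        throw new RuntimeException(e);
    }
    if (customSynonymAnalyzer != null) {
        return customSynonymAnalyzer;
    }
    // The user explicitly named an analyzer; failing to resolve it should be
    // an error rather than a silent fall-through to the legacy path below.
    throw new IllegalArgumentException(
        "Failed to find synonym_analyzer [" + synonymAnalyzerName + "]"
    );
}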

@nupurjaiswal -- out of curiosity, if you use the built-in whitespace analyzer, does your mapping work? My hunch is that it will. I think right now, the synonym_analyzer will only work with built-in analyzers.

@nupurjaiswal
Author

@msfroh The synonym_graph filter works with a built-in analyzer alongside other filters like word delimiter and hunspell. That was fixed in #16263.

But if I add a custom analyzer with whitespace tokenizer for the synonym graph filter, I get an exception while creating the index.

@msfroh
Contributor

msfroh commented Apr 28, 2025

Right, so to be clear, your example above would work with "synonym_analyzer":"whitespace", right?

Obviously, you should be able to parse your synonyms with a custom analyzer as well -- we need to fix that. I just want to make sure that it works with any built-in analyzer, not just standard analyzer.

@nupurjaiswal
Author

@msfroh That's correct. "synonym_analyzer":"whitespace" works with the other filters.
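
For reference, this variant of the filter definition works (synonym list trimmed for brevity):

"custom_synonym_graph-replacement_filter": {
  "type": "synonym_graph",
  "synonyms": [
    "covid, covid-19, covid 19",
    "handheld, hand-held"
  ],
  "synonym_analyzer": "whitespace"
}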

@erg

erg commented May 13, 2025

I'm having a related (the same?) issue. If I include my synonym_filter line, index creation fails on OpenSearch 2.19 in AWS. On 2.17 I could move the synonym_filter line around and it worked (though not the way I need it to). Now I can't even move it around. Changing to the "standard" analyzer does not fix it. OpenSearch versions newer than 2.19 are not yet supported in AWS, so I can't test on those.

{
  "error": {
    "root_cause": [
      {
        "type": "illegal_argument_exception",
        "reason": "Token filter [word_delimiter_filter] cannot be used to parse synonyms"
      }
    ],
    "type": "illegal_argument_exception",
    "reason": "Token filter [word_delimiter_filter] cannot be used to parse synonyms"
  },
  "status": 400
}

Shuffling the filters around, I can also get this error:

{
  "error": {
    "root_cause": [
      {
        "type": "illegal_argument_exception",
        "reason": "Failed to build synonyms"
      }
    ],
    "type": "illegal_argument_exception",
    "reason": "Failed to build synonyms"
  },
  "status": 400
}

// Filter order that produces the "Failed to build synonyms" error:
          "filter": [
            "lowercase",
            "synonym_filter",
            "asciifolding",
            "edge_ngram_filter",
            "word_delimiter_filter",
            "unique_filter"
          ]
{
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "analyzer": "text_analyzer"
      }
    }
  },
  "settings": {
    "analysis": {
      "filter": {
        "synonym_filter": {
          "type": "synonym_graph",
          "format": "solr",
          "synonyms_path": "$analysis/conjunctions.txt",
          "expand": true,
          "ignore_case": true
        },
        "unique_filter": {
          "type": "unique",
          "only_on_same_position": false
        },
        "edge_ngram_filter": {
          "type": "edge_ngram",
          "min_gram": 2,
          "max_gram": 30,
          "preserve_original": true
        },
        "word_delimiter_filter": {
          "type": "word_delimiter_graph",
          "generate_word_parts": false,
          "generate_number_parts": false,
          "catenate_words": true,
          "catenate_numbers": true,
          "catenate_all": false,
          "split_on_numerics": false,
          "split_on_case_change": false,
          "preserve_original": true
        }
      },
      "analyzer": {
        "text_analyzer": {
          "tokenizer": "whitespace",
          "char_filter": ["html_strip"],
          "filter": [
            "lowercase",
            "asciifolding",
            "word_delimiter_filter",
            "edge_ngram_filter",
            "synonym_filter", // comment this line out and it works
            "unique_filter"
          ]
        }
      }
    }
  }
}

Edit: I realized I have a preprocessor that inlines my .txt files. Here's the full JSON:

{
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "analyzer": "text_analyzer"
      }
    }
  },
  "settings": {
    "analysis": {
      "filter": {
        "synonym_filter": {
          "type": "synonym_graph",
          "format": "solr",
          "expand": true,
          "ignore_case": true,
          "synonyms": [
            "and,&,/,or"
          ]
        },
        "unique_filter": {
          "type": "unique",
          "only_on_same_position": false
        },
        "edge_ngram_filter": {
          "type": "edge_ngram",
          "min_gram": 2,
          "max_gram": 30,
          "preserve_original": true
        },
        "word_delimiter_filter": {
          "type": "word_delimiter_graph",
          "generate_word_parts": false,
          "generate_number_parts": false,
          "catenate_words": true,
          "catenate_numbers": true,
          "catenate_all": false,
          "split_on_numerics": false,
          "split_on_case_change": false,
          "preserve_original": true
        }
      },
      "analyzer": {
        "text_analyzer": {
          "tokenizer": "standard",
          "char_filter": [
            "html_strip"
          ],
          "filter": [
            "lowercase",
            "synonym_filter",
            "asciifolding",
            "edge_ngram_filter",
            "word_delimiter_filter",
            "unique_filter"
          ]
        }
      }
    }
  }
}
