[BUG] synonym_graph filter fails with word_delimiter_graph when using whitespace or classic tokenizer in synonym_analyzer – similar to #16263 #18037
Comments
Something that I just noticed with the previous change, here: Lines 152 to 167 in 5068fad
IMO, this is incorrect. If the named analyzer is not returned, it's an error. My hunch is this is what's happening. Why does this happen? My guess is that we try to create the filter before the custom analyzer has been registered. We might need to rejig the loading of synonym analyzers to make it lazier. @nupurjaiswal -- out of curiosity, does it work if you use the built-in
Right, so to be clear, your example above would work with that. Obviously, you should be able to parse your synonyms with a custom analyzer as well -- we need to fix that. I just want to make sure that it works with any built-in analyzer, not just
@msfroh That's correct. "synonym_analyzer": "whitespace" works with the other filters.
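For readers following the workaround above, the working arrangement can be sketched roughly as follows: the synonym_graph filter names the built-in whitespace analyzer directly as its synonym_analyzer instead of a custom analyzer. All index, filter, and analyzer names here are illustrative, not taken from the original report:

```json
{
  "settings": {
    "analysis": {
      "filter": {
        "my_word_delimiter": {
          "type": "word_delimiter_graph",
          "catenate_all": true,
          "preserve_original": true
        },
        "my_synonyms": {
          "type": "synonym_graph",
          "synonym_analyzer": "whitespace",
          "synonyms": ["and,&,/,or"]
        }
      },
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "whitespace",
          "filter": ["lowercase", "my_word_delimiter", "my_synonyms"]
        }
      }
    }
  }
}
```

The key point of the workaround is that "synonym_analyzer" refers to a built-in analyzer by name, so no custom analyzer registration ordering comes into play.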
I'm having a related (the same?) issue. If I include my
Shuffling the filters around, I can also get this error:
{
"mappings": {
"properties": {
"name": {
"type": "text",
"analyzer": "text_analyzer"
}
}
},
"settings": {
"analysis": {
"filter": {
"synonym_filter": {
"type": "synonym_graph",
"format": "solr",
"synonyms_path": "$analysis/conjunctions.txt",
"expand": true,
"ignore_case": true
},
"unique_filter": {
"type": "unique",
"only_on_same_position": false
},
"edge_ngram_filter": {
"type": "edge_ngram",
"min_gram": 2,
"max_gram": 30,
"preserve_original": true
},
"word_delimiter_filter": {
"type": "word_delimiter_graph",
"generate_word_parts": false,
"generate_number_parts": false,
"catenate_words": true,
"catenate_numbers": true,
"catenate_all": false,
"split_on_numerics": false,
"split_on_case_change": false,
"preserve_original": true
}
},
"analyzer": {
"text_analyzer": {
"tokenizer": "whitespace",
"char_filter": ["html_strip"],
"filter": [
"lowercase",
"asciifolding",
"word_delimiter_filter",
"edge_ngram_filter",
"synonym_filter", // comment this line out and it works
"unique_filter"
]
}
}
}
}
}
Edit: I realized I have a preprocessor to inline my .txt files. Here's the full JSON:
{
"mappings": {
"properties": {
"name": {
"type": "text",
"analyzer": "text_analyzer"
}
}
},
"settings": {
"analysis": {
"filter": {
"synonym_filter": {
"type": "synonym_graph",
"format": "solr",
"expand": true,
"ignore_case": true,
"synonyms": [
"and,&,/,or"
]
},
"unique_filter": {
"type": "unique",
"only_on_same_position": false
},
"edge_ngram_filter": {
"type": "edge_ngram",
"min_gram": 2,
"max_gram": 30,
"preserve_original": true
},
"word_delimiter_filter": {
"type": "word_delimiter_graph",
"generate_word_parts": false,
"generate_number_parts": false,
"catenate_words": true,
"catenate_numbers": true,
"catenate_all": false,
"split_on_numerics": false,
"split_on_case_change": false,
"preserve_original": true
}
},
"analyzer": {
"text_analyzer": {
"tokenizer": "standard",
"char_filter": [
"html_strip"
],
"filter": [
"lowercase",
"synonym_filter",
"asciifolding",
"edge_ngram_filter",
"word_delimiter_filter",
"unique_filter"
]
}
}
}
}
}
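To see how the analyzer above behaves (or to trigger the failure at analysis time rather than at index creation), one can POST a body like the following to the index's _analyze endpoint. The sample text is an assumption chosen to exercise the "and,&,/,or" synonym rule; it is not from the thread:

```json
{
  "analyzer": "text_analyzer",
  "text": "black & white or grey"
}
```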
Describe the bug
I'm encountering a bug similar to #16263 while configuring analyzers that use both word_delimiter_graph and synonym_graph. I'm currently migrating from Solr to OpenSearch 2.19 and hit a limitation with the synonym_graph filter when it uses a custom synonym_analyzer (whitespace tokenizer).
When I define a simple synonym analyzer using the whitespace tokenizer (i.e., no_split_synonym_analyzer) and apply the synonym_graph filter using this, everything works as expected.
However, the moment I add any additional filters such as word_delimiter_graph, asciifolding, or hunspell, I encounter the following error:
Use Case
In our Solr configuration, we handle synonym normalization for terms like:
This works seamlessly there even when using filters like WordDelimiterGraphFilterFactory, Hunspell, etc.
We want to achieve similar behavior in OpenSearch, including using a synonym_graph filter with a custom analyzer that includes:
word_delimiter_graph (with preserve_original or catenate_all)
asciifolding (with preserve_original)
hunspell
and a pattern_replace filter
Sample Config (Works):
Sample Config (Fails):
When adding custom_word_delimiter, asciifolding, or hunspell to the same analyzer:
Results in:
Token filter [custom_word_delimiter] cannot be used to parse synonyms
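The reporter's exact failing config is not reproduced above, but the shape that triggers this error is a synonym_graph filter whose synonym_analyzer is a custom analyzer containing word_delimiter_graph. A minimal sketch, assuming the filter and analyzer names mentioned in this issue (custom_word_delimiter, no_split_synonym_analyzer) and an illustrative synonym rule:

```json
{
  "settings": {
    "analysis": {
      "filter": {
        "custom_word_delimiter": {
          "type": "word_delimiter_graph",
          "catenate_all": true,
          "preserve_original": true
        },
        "my_synonym_filter": {
          "type": "synonym_graph",
          "synonym_analyzer": "no_split_synonym_analyzer",
          "synonyms": ["and,&,/,or"]
        }
      },
      "analyzer": {
        "no_split_synonym_analyzer": {
          "tokenizer": "whitespace",
          "filter": ["lowercase", "custom_word_delimiter"]
        }
      }
    }
  }
}
```

Creating an index with settings of this shape is what produces the "cannot be used to parse synonyms" error, because the filter chain of the custom synonym_analyzer is validated for synonym parsing.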
It would be great if OpenSearch could enhance the synonym_graph behavior to allow more flexible use of filters in synonym_analyzer, especially word_delimiter_graph, which is commonly used in language normalization pipelines.
A similar issue was resolved in the past here: #16263 — perhaps this one can be handled in a similar fashion.
Related component
Indexing
To Reproduce
Create Mapping
Error:
Expected behavior
Additional Details
Host/Environment (please complete the following information):