updating standard analyzer docs #9747
base: main
Conversation
Signed-off-by: Anton Rubin <[email protected]>
Thank you for submitting your PR. The PR states are In progress (or Draft) -> Tech review -> Doc review -> Editorial review -> Merged. Before you submit your PR for doc review, make sure the content is technically accurate. If you need help finding a tech reviewer, tag a maintainer. When you're ready for doc review, tag the assignee of this PR. The doc reviewer may push edits to the PR directly or leave comments and editorial suggestions for you to address (let us know in a comment if you have a preference). The doc reviewer will arrange for an editorial review.
Signed-off-by: kolchfa-aws <[email protected]>
@udabhas Could you please review this PR? Thanks!
| Parameter | Type | Default | Description |
Most of the documentation pages use `Data type` instead of `Type`. For example: https://github.com/opensearch-project/documentation-website/pull/9479/files. Let's stick to a single nomenclature across the documentation; either `Type` or `Data type` is fine, as long as it is consistent.
@sandeshkr419 That's updated now across the repo; all instances of "Data Type" have been changed to "Data type".
Signed-off-by: Anton Rubin <[email protected]>
Signed-off-by: Anton Rubin <[email protected]>
Signed-off-by: AntonEliatra <[email protected]>
@sandeshkr419 I think this addresses all the points. Could you double-check, please?
Thank you, @AntonEliatra! Please see my comments and let me know if you have any questions.
- `standard` tokenizer: Removes most punctuation and splits text on spaces and other common delimiters.
- `lowercase` token filter: Converts all tokens to lowercase, ensuring case-insensitive matching.
- `stop` token filter: Removes common stopwords, such as "the", "is", and "and", from the tokenized output.
- **Tokenization**: It uses the [`standard`]({{site.url}}{{site.baseurl}}/analyzers/tokenizers/standard/) tokenizer, which splits text into words based on Unicode text segmentation rules, handling spaces, punctuation, and common delimiters.
Suggested change:
- **Tokenization**: Uses the [`standard`]({{site.url}}{{site.baseurl}}/analyzers/tokenizers/standard/) tokenizer, which splits text into words based on Unicode text segmentation rules, handling spaces, punctuation, and common delimiters.
- **Lowercasing**: It applies the [`lowercase`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/lowercase/) token filter to convert all tokens to lowercase, ensuring consistent matching regardless of input case.
Suggested change:
- **Lowercasing**: Applies the [`lowercase`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/lowercase/) token filter to convert all tokens to lowercase, ensuring consistent matching regardless of input case.
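For context, the pipeline described in the excerpt above (standard tokenizer, then lowercase filter, then stop filter) can be tried directly with the `_analyze` API. A minimal sketch, assuming a running OpenSearch cluster; the sample text is illustrative, not taken from the PR diff:

```json
GET /_analyze
{
  "analyzer": "standard",
  "text": "The QUICK Brown Foxes jumped!"
}
```

With default settings this returns the lowercased tokens `the`, `quick`, `brown`, `foxes`, and `jumped`. Note that the `stop` filter is disabled by default, so stopwords such as "the" are kept unless the `stopwords` parameter is configured.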
Use the following command to create an index named `my_standard_index` with a `standard` analyzer:
```json
PUT /my_standard_index
{
  "mappings": {
    "properties": {
      "my_field": {
        "type": "text",
        "analyzer": "standard"
      }
    }
  }
}
```
{% include copy-curl.html %}
## Parameters
Before: You can configure a `standard` analyzer with the following parameters.
After: The `standard` analyzer supports the following optional parameters:
Suggested change:
The `standard` analyzer supports the following optional parameters.
Before: Use the following command to configure an index with a custom analyzer that is equivalent to the `standard` analyzer:
After: The following example creates index `products` and configures `max_token_length` and `stopwords`:
Suggested change:
The following example creates a `products` index and configures the `max_token_length` and `stopwords` parameters:
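As a sketch of what that example might look like (the analyzer name `my_custom_analyzer` and the parameter values here are illustrative assumptions, not the PR's actual snippet):

```json
PUT /products
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "standard",
          "max_token_length": 10,
          "stopwords": "_english_"
        }
      }
    }
  }
}
```

The `max_token_length` parameter splits any token longer than the limit at that length (the default is 255), and `stopwords` accepts either a predefined list such as `_english_` or a custom array of words.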
## Generated tokens
Before: Use the following request to examine the tokens generated using the analyzer:
After: Use the following `_analyze` API to see how the `my_manual_stopwords_analyzer` processes text:
Suggested change:
Use the following `_analyze` API request to see how the `my_manual_stopwords_analyzer` processes text:
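A sketch of such a request, assuming an index (hypothetically named `my_index`) whose settings define `my_manual_stopwords_analyzer`; the sample text is illustrative:

```json
GET /my_index/_analyze
{
  "analyzer": "my_manual_stopwords_analyzer",
  "text": "The slow turtle swims away"
}
```

The response lists each emitted token along with its character offsets, type, and position.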
Before: The response contains the generated tokens:
After: The returned tokens are:
- separated based on spacing
Suggested change:
- Split on spaces
- lowercased
Suggested change:
- Lowercased
- stopwords removed
Suggested change:
- Stopwords removed
Description
updating standard analyzer docs
Version
all
Checklist
For more information on following Developer Certificate of Origin and signing off your commits, please check here.