Is your feature request related to a problem? Please describe
OpenSearch currently does not support a native field type for hashing input values at index time. This limits options for fingerprinting documents, detecting duplicates, or creating hashed bucketing strategies efficiently within the index.
Describe the solution you'd like
Introduce support for a murmur3 (or general-purpose hash) field type that automatically computes a consistent hash of string or keyword inputs during indexing. This hash should be stored and queryable. Ideally, this should allow configurable hash functions (starting with Murmur3).
Related component
No response
Describe alternatives you've considered
Why use Murmur3 instead of other hash functions?
Murmur3 is preferred in this context for several reasons:
Speed and performance: It is extremely fast and non-cryptographic, making it ideal for indexing-time transformations where security is not a concern.
Uniform distribution: It provides good hash distribution, which helps avoid skewed buckets in aggregations.
Lightweight: Unlike cryptographic hashes like SHA-256, Murmur3 has low CPU overhead.
Deterministic and portable: For a given input, seed, and Murmur3 variant, hash values are consistent across systems and languages.
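To make the "deterministic" point concrete, here is a minimal pure-Python sketch of the 32-bit Murmur3 variant (MurmurHash3 x86_32). The variant, the default seed of 0, and the function name `murmur3_32` are assumptions made for illustration; this request does not specify which Murmur3 variant or seed a field type would use.

```python
def _rotl32(x: int, r: int) -> int:
    """Rotate a 32-bit integer left by r bits."""
    return ((x << r) | (x >> (32 - r))) & 0xFFFFFFFF


def murmur3_32(data: bytes, seed: int = 0) -> int:
    """MurmurHash3 x86_32 of `data`, returned as an unsigned 32-bit int."""
    c1, c2 = 0xCC9E2D51, 0x1B873593
    h = seed & 0xFFFFFFFF
    n = len(data)
    rounded = n - (n % 4)

    # Body: mix the input one little-endian 4-byte block at a time.
    for i in range(0, rounded, 4):
        k = int.from_bytes(data[i:i + 4], "little")
        k = (k * c1) & 0xFFFFFFFF
        k = _rotl32(k, 15)
        k = (k * c2) & 0xFFFFFFFF
        h ^= k
        h = _rotl32(h, 13)
        h = (h * 5 + 0xE6546B64) & 0xFFFFFFFF

    # Tail: mix the remaining 1 to 3 bytes, if any.
    tail = data[rounded:]
    if tail:
        k = int.from_bytes(tail, "little")
        k = (k * c1) & 0xFFFFFFFF
        k = _rotl32(k, 15)
        k = (k * c2) & 0xFFFFFFFF
        h ^= k

    # Finalization: fold in the length and force avalanche.
    h ^= n
    h ^= h >> 16
    h = (h * 0x85EBCA6B) & 0xFFFFFFFF
    h ^= h >> 13
    h = (h * 0xC2B2AE35) & 0xFFFFFFFF
    h ^= h >> 16
    return h


text = "The quick brown fox jumps over the lazy dog."
first = murmur3_32(text.encode("utf-8"))
second = murmur3_32(text.encode("utf-8"))
print(first == second)  # same input and seed give the same hash
```

The function uses only integer arithmetic on the input bytes, so any implementation that follows the same variant and seed should produce the same value for the same bytes.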
That said, allowing support for pluggable hash algorithms (e.g., Murmur3, CityHash, SHA-1) could be even more flexible and future-proof. Let me know your thoughts on the above! Thanks
This assumes that content_hash automatically computes the hash of the input (e.g. the value of content) at index time.
Note: the hash can also be sha1 or another hash algorithm if support is added for this in the future.
POST /documents/_doc/1
{
  "title": "First Document",
  "content": "The quick brown fox jumps over the lazy dog.",
  "content_hash": "The quick brown fox jumps over the lazy dog."
}
Internally, OpenSearch will hash that string using Murmur3 and store the hash in content_hash.
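The index-time behaviour described above could be simulated roughly as follows. This is only a sketch of the intended semantics, not OpenSearch code: `index_document` and `hash_value` are hypothetical names, and `zlib.crc32` is used as a stand-in non-cryptographic hash because the Python standard library has no Murmur3.

```python
import zlib


def hash_value(text: str) -> int:
    """Stand-in hash (CRC32). A real field type would use Murmur3 here."""
    return zlib.crc32(text.encode("utf-8"))


def index_document(doc: dict, hashed_fields: set) -> dict:
    """Return the document as it would be stored: every field listed in
    `hashed_fields` is replaced by the hash of its string value."""
    stored = dict(doc)  # shallow copy, the caller's document is untouched
    for field in hashed_fields:
        if field in stored:
            stored[field] = hash_value(stored[field])
    return stored


doc = {
    "title": "First Document",
    "content": "The quick brown fox jumps over the lazy dog.",
    "content_hash": "The quick brown fox jumps over the lazy dog.",
}
stored = index_document(doc, {"content_hash"})
print(stored["content_hash"])  # an integer hash instead of the original text
```

The client sends the raw string in content_hash, and only the stored representation changes; the content field keeps the original text.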
Reference: https://www.atatus.com/blog/understanding-murmur-hashing/
Additional context
Example 1: Basic field
Assumes content_hash automatically computes the hash of the input (e.g. value of content) at index time.
Note: hash can also be sha1 or another hash algorithm if support is added for this in the future.
Internally, OpenSearch will hash that string using Murmur3 and store the hash in content_hash.
A search on content_hash then finds documents that match the given content hash (e.g. for duplicate detection).
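The duplicate-detection lookup mentioned above can be sketched in a few lines. The `HashIndex` class and `content_hash` function are hypothetical, and `zlib.crc32` again stands in for Murmur3. Because any non-cryptographic hash can collide, the sketch treats a hash match as a candidate and confirms it against the original content.

```python
import zlib
from collections import defaultdict


def content_hash(text: str) -> int:
    """Stand-in hash (CRC32). A real field type would use Murmur3 here."""
    return zlib.crc32(text.encode("utf-8"))


class HashIndex:
    """Toy in-memory index keyed by content hash."""

    def __init__(self):
        self._by_hash = defaultdict(list)  # hash -> list of document IDs
        self._docs = {}                    # document ID -> original content

    def add(self, doc_id: str, content: str) -> None:
        self._docs[doc_id] = content
        self._by_hash[content_hash(content)].append(doc_id)

    def find_duplicates(self, content: str) -> list:
        """Return IDs of documents whose content equals `content`."""
        candidates = self._by_hash.get(content_hash(content), [])
        # A hash match is only a candidate; compare the text to rule out collisions.
        return [doc_id for doc_id in candidates if self._docs[doc_id] == content]


index = HashIndex()
index.add("1", "The quick brown fox jumps over the lazy dog.")
index.add("2", "The quick brown fox jumps over the lazy dog.")
index.add("3", "A completely different document.")
print(index.find_duplicates("The quick brown fox jumps over the lazy dog."))
```

The lookup compares one fixed-size value per document instead of the full text, which is the main appeal of a hashed field for this use case.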
Example 2: Using Multi-field
This stores:
user_id as the original keyword
user_id.hash as the Murmur3 hash of that keyword
After documents are indexed, a cardinality aggregation on user_id.hash provides an efficient way to estimate the number of unique user_ids.
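As a rough illustration of counting unique users through the hashed sub-field, the sketch below counts distinct hash values. It makes several assumptions: `estimate_unique` is a hypothetical helper, `zlib.crc32` stands in for Murmur3, and an exact set is used for simplicity, whereas OpenSearch's cardinality aggregation is approximate (based on HyperLogLog++).

```python
import zlib


def user_id_hash(user_id: str) -> int:
    """Stand-in hash (CRC32) playing the role of the user_id.hash sub-field."""
    return zlib.crc32(user_id.encode("utf-8"))


def estimate_unique(docs: list, field: str = "user_id") -> int:
    """Estimate the number of unique values of `field` by counting distinct
    hashes. Collisions can only make the estimate too low, never too high."""
    hashes = {user_id_hash(doc[field]) for doc in docs}
    return len(hashes)


docs = [
    {"user_id": "alice"},
    {"user_id": "bob"},
    {"user_id": "alice"},
    {"user_id": "carol"},
    {"user_id": "bob"},
]
print(estimate_unique(docs))  # number of distinct user_id hashes
```

Working on precomputed fixed-size hashes avoids hashing each user_id again at query time, which is where the efficiency gain would come from.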