[Feature Request] Support for hash-based field type in OpenSearch #18175

deshsidd · 2025-05-01T19:35:21Z

Is your feature request related to a problem? Please describe

OpenSearch currently does not support a native field type for hashing input values at index time. This limits options for fingerprinting documents, detecting duplicates, or creating hashed bucketing strategies efficiently within the index.

Describe the solution you'd like

Introduce support for a murmur3 (or general-purpose hash) field type that automatically computes a consistent hash of string or keyword inputs during indexing. This hash should be stored and queryable. Ideally, this should allow configurable hash functions (starting with Murmur3).

Related component

No response

Describe alternatives you've considered

Why use Murmur3 instead of other hash functions?
Murmur3 is preferred in this context for several reasons:

Speed and performance: It is extremely fast and non-cryptographic, making it ideal for indexing-time transformations where security is not a concern.
Uniform distribution: It provides good hash distribution, which helps avoid skewed buckets in aggregations.
Lightweight: Unlike cryptographic hashes like SHA-256, Murmur3 has low CPU overhead.
Deterministic and portable: Hash values are consistent across systems and languages, provided the same Murmur3 variant, seed, and input byte encoding are used.
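
As a rough illustration of the determinism point, here is a minimal pure-Python sketch of the 32-bit x86 variant of MurmurHash3. The function names are made up for this example, and the choice of the 32-bit variant is an assumption made for brevity; an actual field type could just as well use a 128-bit variant.

```python
MASK32 = 0xFFFFFFFF


def _rotl32(x: int, r: int) -> int:
    """Rotate a 32-bit value left by r bits."""
    return ((x << r) | (x >> (32 - r))) & MASK32


def murmur3_32(data: bytes, seed: int = 0) -> int:
    """Illustrative MurmurHash3 (x86, 32-bit) over raw bytes."""
    c1, c2 = 0xCC9E2D51, 0x1B873593
    h = seed & MASK32

    # Body: mix the input four bytes (little-endian) at a time.
    n_blocks = len(data) // 4
    for i in range(n_blocks):
        k = int.from_bytes(data[4 * i:4 * i + 4], "little")
        k = (k * c1) & MASK32
        k = _rotl32(k, 15)
        k = (k * c2) & MASK32
        h ^= k
        h = _rotl32(h, 13)
        h = (h * 5 + 0xE6546B64) & MASK32

    # Tail: mix the remaining 1 to 3 bytes, if any.
    tail = data[4 * n_blocks:]
    if tail:
        k = int.from_bytes(tail, "little")
        k = (k * c1) & MASK32
        k = _rotl32(k, 15)
        k = (k * c2) & MASK32
        h ^= k

    # Finalization: fold in the length and avalanche the bits.
    h ^= len(data)
    h ^= h >> 16
    h = (h * 0x85EBCA6B) & MASK32
    h ^= h >> 13
    h = (h * 0xC2B2AE35) & MASK32
    h ^= h >> 16
    return h


if __name__ == "__main__":
    a = murmur3_32("user123".encode("utf-8"))
    b = murmur3_32("user123".encode("utf-8"))
    print(a == b)  # True: same bytes and same seed give the same hash
```

Note that the result depends on the exact bytes fed in, so two clients only agree if they also agree on the string encoding (UTF-8 here) and the seed.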

That said, allowing support for pluggable hash algorithms (e.g., Murmur3, CityHash, SHA-1) could be even more flexible and future-proof. Let me know your thoughts on the above! Thanks

Reference: https://www.atatus.com/blog/understanding-murmur-hashing/

Additional context

Example 1: Basic field

PUT /documents
{
  "mappings": {
    "properties": {
      "title": { "type": "text" },
      "content": { "type": "text" },
      "content_hash": {
        "type": "hash",
        "hash": "murmur3"
      }
    }
  }
}

This assumes content_hash automatically computes the hash of the value supplied to it (e.g. the value of content) at index time.
Note: hash can also be sha1 or another hash algorithm if support is added for this in the future.

POST /documents/_doc/1
{
  "title": "First Document",
  "content": "The quick brown fox jumps over the lazy dog.",
  "content_hash": "The quick brown fox jumps over the lazy dog."
}

Internally, OpenSearch will hash that string using Murmur3 and store the hash in content_hash.

GET /documents/_search
{
  "query": {
    "term": {
      "content_hash": {
        "value": "ef73781effc5774100f87fe2f437a435"  // precomputed murmur3 hash
      }
    }
  }
}

This searches for documents that match the given content hash (e.g. for duplicate detection).
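
To make the duplicate-detection idea concrete, here is a small client-side sketch of the same logic: fingerprint each document's content and group document IDs by fingerprint. It uses SHA-1 from Python's standard library as a stand-in, because the standard library has no Murmur3; `content_fingerprint` and `find_duplicates` are hypothetical helper names, not part of any OpenSearch API.

```python
import hashlib
from collections import defaultdict


def content_fingerprint(text: str) -> str:
    """Hex digest of the UTF-8 bytes of text (SHA-1 as a stand-in hash)."""
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


def find_duplicates(docs: dict[str, str]) -> list[list[str]]:
    """Return groups of document IDs whose content hashes to the same value."""
    groups: dict[str, list[str]] = defaultdict(list)
    for doc_id, content in docs.items():
        groups[content_fingerprint(content)].append(doc_id)
    # Only fingerprints shared by more than one document indicate duplicates.
    return [ids for ids in groups.values() if len(ids) > 1]


if __name__ == "__main__":
    docs = {
        "1": "The quick brown fox jumps over the lazy dog.",
        "2": "A different document.",
        "3": "The quick brown fox jumps over the lazy dog.",
    }
    print(find_duplicates(docs))  # [['1', '3']]
```

With a hash field in the index, the grouping step would presumably be replaced by a term query (or a terms aggregation) on the stored hash rather than being done on the client.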

Example 2: Using Multi-field

PUT my-index
{
  "mappings": {
    "properties": {
      "user_id": {
        "type": "keyword",
        "fields": {
          "hash": {
            "type": "hash",
            "hash": "murmur3"
          }
        }
      }
    }
  }
}

This stores:
user_id as the original keyword
user_id.hash as the Murmur3 hash of that keyword

PUT my-index/_doc/1
{
  "user_id": "user123"
}

PUT my-index/_doc/2
{
  "user_id": "user456"
}

The two requests above index the sample documents.

GET my-index/_search
{
  "size": 0,
  "aggs": {
    "unique_users": {
      "cardinality": {
        "field": "user_id.hash"
      }
    }
  }
}

Cardinality aggregation on user_id.hash provides an efficient way to estimate the number of unique user_ids.
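
For intuition on why hashed values suit cardinality estimation, here is a minimal sketch of a plain HyperLogLog estimator, the family of algorithm that cardinality aggregations are typically built on. This is not OpenSearch's implementation: the 64-bit hash is taken from the first 8 bytes of SHA-1 as a stand-in for Murmur3, and `hash64`, `estimate_cardinality`, and the default precision `p=12` are assumptions made for this example.

```python
import hashlib
import math


def hash64(value: str) -> int:
    """Stand-in 64-bit hash: the first 8 bytes of SHA-1 over the UTF-8 bytes."""
    digest = hashlib.sha1(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def estimate_cardinality(values, p: int = 12) -> float:
    """Estimate the number of distinct values with 2**p HyperLogLog registers."""
    m = 1 << p
    registers = [0] * m
    for value in values:
        x = hash64(value)
        idx = x >> (64 - p)                    # top p bits select a register
        rest = x & ((1 << (64 - p)) - 1)       # remaining 64 - p bits
        # Position of the leftmost 1-bit in rest (leading zeros plus one).
        rank = (64 - p) - rest.bit_length() + 1
        if rank > registers[idx]:
            registers[idx] = rank

    alpha = 0.7213 / (1 + 1.079 / m)           # bias constant, valid for m >= 128
    raw = alpha * m * m / sum(2.0 ** -r for r in registers)

    # Small-range correction: fall back to linear counting on empty registers.
    zeros = registers.count(0)
    if raw <= 2.5 * m and zeros:
        return m * math.log(m / zeros)
    return raw


if __name__ == "__main__":
    ids = [f"user{i}" for i in range(50_000)]
    # Every ID appears twice; duplicates do not change the registers.
    print(round(estimate_cardinality(ids + ids)))
```

Because the estimator only ever looks at hash bits, storing a precomputed hash at index time lets the aggregation skip hashing each value at query time, which is where the efficiency gain would come from.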


shwetathareja · 2025-05-05T09:30:41Z

Thanks @deshsidd for opening the issue.

Have you looked into the mapper-murmur3 plugin? It already provides the capability to compute hashes at index time and store them: https://github.com/opensearch-project/OpenSearch/tree/main/plugins/mapper-murmur3/src

deshsidd added the enhancement and untriaged labels May 1, 2025

deshsidd self-assigned this May 1, 2025

github-actions bot added the _No response_ label May 1, 2025

deshsidd added Indexing & Search and removed _No response_ labels May 1, 2025

github-project-automation bot added this to Search Project Board May 1, 2025

github-project-automation bot moved this to 🆕 New in Search Project Board May 1, 2025

mch2 removed the untriaged label May 7, 2025
