[Feature Request] Support for hash-based field type in OpenSearch #18175

Open
deshsidd opened this issue May 1, 2025 · 1 comment
Assignees
Labels
enhancement Enhancement or improvement to existing feature or request Indexing & Search

Comments

@deshsidd
Contributor

deshsidd commented May 1, 2025

Is your feature request related to a problem? Please describe

OpenSearch currently does not support a native field type for hashing input values at index time. This limits options for fingerprinting documents, detecting duplicates, or creating hashed bucketing strategies efficiently within the index.

Describe the solution you'd like

Introduce support for a murmur3 (or general-purpose hash) field type that automatically computes a consistent hash of string or keyword inputs during indexing. This hash should be stored and queryable. Ideally, this should allow configurable hash functions (starting with Murmur3).

Related component

No response

Describe alternatives you've considered

Why use Murmur3 instead of other hash functions?
Murmur3 is preferred in this context for several reasons:

  • Speed and performance: It is extremely fast and non-cryptographic, making it ideal for indexing-time transformations where security is not a concern.
  • Uniform distribution: It provides good hash distribution, which helps avoid skewed buckets in aggregations.
  • Lightweight: Unlike cryptographic hashes like SHA-256, Murmur3 has low CPU overhead.
  • Deterministic and portable: Hash values are consistent across systems and languages.

That said, allowing support for pluggable hash algorithms (e.g., Murmur3, CityHash, SHA-1) could be even more flexible and future-proof. Let me know your thoughts on the above! Thanks

Reference: https://www.atatus.com/blog/understanding-murmur-hashing/

Additional context

Example 1: Basic field

PUT /documents
{
  "mappings": {
    "properties": {
      "title": { "type": "text" },
      "content": { "type": "text" },
      "content_hash": {
        "type": "hash",
        "hash": "murmur3"
      }
    }
  }
}

Assumes content_hash automatically computes the hash of the input (e.g. value of content) at index time.
Note: hash can also be sha1 or another hash algorithm if support is added for this in the future.
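
One way to picture the pluggable "hash" parameter is a lookup table from algorithm name to hash function. The sketch below is hypothetical and uses only algorithms available in Python's standard hashlib module (Murmur3 is not in the standard library, so it is left out here); the names HASHERS and compute_hash are invented for illustration and do not describe OpenSearch internals.

```python
import hashlib
from typing import Callable, Dict

# Hypothetical registry: the value of the mapping's "hash" parameter
# selects which function is applied to the field value at index time.
HASHERS: Dict[str, Callable[[bytes], str]] = {
    "sha1": lambda data: hashlib.sha1(data).hexdigest(),
    "sha256": lambda data: hashlib.sha256(data).hexdigest(),
}


def compute_hash(value: str, algorithm: str) -> str:
    """Hash `value` with the named algorithm and return a hex string."""
    try:
        hasher = HASHERS[algorithm]
    except KeyError:
        raise ValueError(f"unsupported hash algorithm: {algorithm!r}") from None
    return hasher(value.encode("utf-8"))


print(compute_hash("The quick brown fox jumps over the lazy dog.", "sha1"))
```

Rejecting unknown names up front mirrors how a mapping with an unsupported "hash" value would presumably be rejected when the index is created.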

POST /documents/_doc/1
{
  "title": "First Document",
  "content": "The quick brown fox jumps over the lazy dog.",
  "content_hash": "The quick brown fox jumps over the lazy dog."
}

Internally, OpenSearch will hash that string using Murmur3 and store the hash in content_hash.

GET /documents/_search
{
  "query": {
    "term": {
      "content_hash": {
        "value": "ef73781effc5774100f87fe2f437a435"
      }
    }
  }
}

This searches for documents whose content_hash matches the given precomputed Murmur3 hash (e.g. for duplicate detection).
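
The duplicate-detection idea can be sketched outside OpenSearch as grouping documents by a fingerprint of their content. This is a rough illustration, not the proposed implementation: SHA-1 from the standard library stands in for Murmur3, and the fingerprint and find_duplicates helpers are hypothetical.

```python
import hashlib
from collections import defaultdict
from typing import Dict, List


def fingerprint(text: str) -> str:
    """Stand-in content hash (SHA-1 here; the issue proposes Murmur3)."""
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


def find_duplicates(docs: Dict[str, str]) -> List[List[str]]:
    """Return groups of document IDs whose content hashes are identical."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for doc_id, content in docs.items():
        groups[fingerprint(content)].append(doc_id)
    return [ids for ids in groups.values() if len(ids) > 1]


docs = {
    "1": "The quick brown fox jumps over the lazy dog.",
    "2": "A different document.",
    "3": "The quick brown fox jumps over the lazy dog.",
}
print(find_duplicates(docs))
```

With a hashed field in the index, the same grouping would be done by a term query or a terms aggregation on the hash instead of in client code.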

Example 2: Using Multi-field

PUT my-index
{
  "mappings": {
    "properties": {
      "user_id": {
        "type": "keyword",
        "fields": {
          "hash": {
            "type": "hash",
            "hash": "murmur3"
          }
        }
      }
    }
  }
}

This stores:

  • user_id as the original keyword
  • user_id.hash as the Murmur3 hash of that keyword

PUT my-index/_doc/1
{
  "user_id": "user123"
}

PUT my-index/_doc/2
{
  "user_id": "user456"
}

These requests index two documents.

GET my-index/_search
{
  "size": 0,
  "aggs": {
    "unique_users": {
      "cardinality": {
        "field": "user_id.hash"
      }
    }
  }
}

Cardinality aggregation on user_id.hash provides an efficient way to estimate the number of unique user_ids.
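
To show why precomputed hashes help here: approximate distinct counting works on hash values rather than on the original strings. The sketch below is a simplified HyperLogLog with a linear-counting fallback for small counts. It is not the HyperLogLog++ implementation that OpenSearch's cardinality aggregation uses, and it derives 64-bit hashes from SHA-1 as a stand-in for Murmur3; hash64 and estimate_cardinality are names invented for this example.

```python
import hashlib
import math
from typing import Iterable


def hash64(value: str) -> int:
    """Stand-in 64-bit hash: the first 8 bytes of SHA-1."""
    digest = hashlib.sha1(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def estimate_cardinality(hashes: Iterable[int], p: int = 12) -> float:
    """Estimate the number of distinct 64-bit hashes using 2**p registers."""
    m = 1 << p
    registers = [0] * m
    for h in hashes:
        idx = h >> (64 - p)                   # top p bits pick a register
        rest = h & ((1 << (64 - p)) - 1)      # remaining 64 - p bits
        # Rank = 1-based position of the leftmost 1-bit in `rest`.
        rank = (64 - p) - rest.bit_length() + 1
        registers[idx] = max(registers[idx], rank)

    alpha = 0.7213 / (1 + 1.079 / m)          # bias constant for m >= 128
    raw = alpha * m * m / sum(2.0 ** -r for r in registers)

    # For small counts, linear counting on empty registers is more accurate.
    zeros = registers.count(0)
    if raw <= 2.5 * m and zeros:
        return m * math.log(m / zeros)
    return raw


user_ids = ["user123", "user456", "user123"]
print(round(estimate_cardinality(hash64(u) for u in user_ids)))
```

Because the estimator only ever looks at hashes, storing the hash at index time moves the hashing cost out of the query path.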

@deshsidd added the enhancement and untriaged labels on May 1, 2025
@deshsidd self-assigned this on May 1, 2025
@shwetathareja
Member

Thanks @deshsidd for opening the issue.

Have you looked into the mapper-murmur3 plugin, which already provides the capability to generate hashes at index time and store them? https://github.com/opensearch-project/OpenSearch/tree/main/plugins/mapper-murmur3/src
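
For comparison with the multi-field example in the issue description, a mapping that uses the existing plugin might look like the sketch below, written as a Python dict that serializes to the request body. It assumes the plugin is installed and registers a field type named murmur3; the index and field names are reused from the issue purely for illustration.

```python
import json

# Assumed request body for `PUT my-index` with the mapper-murmur3 plugin
# installed; "murmur3" as the type name is an assumption to verify
# against the plugin's documentation.
mapping = {
    "mappings": {
        "properties": {
            "user_id": {
                "type": "keyword",
                "fields": {
                    "hash": {"type": "murmur3"}
                },
            }
        }
    }
}

print(json.dumps(mapping, indent=2))
```

The cardinality aggregation from Example 2 would then target user_id.hash in the same way.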

@mch2 removed the untriaged label on May 7, 2025