
[RFC] Cloud-native OpenSearch #17957

@yupeng9

Description

Is your feature request related to a problem? Please describe

Elasticsearch, from which OpenSearch was forked, was launched in 2010 and initially designed for on-premise deployments and smaller-scale applications, at a time when cloud computing was not yet prevalent. This is reflected in several of its design patterns:

  • Bundled Roles: Elasticsearch combined multiple roles (data storage, search, master node responsibilities) within a single node. This simplified setup and management for single-node or small clusters.
  • Mesh Design for Metadata: Cluster metadata was propagated to every node using a mesh design. This provided resilience against node failures as any node could provide the necessary information.
  • Data Movement Algorithms: Elasticsearch included algorithms to move data around the cluster. This was essential for rebalancing data in response to cluster growth, node additions, or node removals.

While this design initially provided convenience and ease of use, it has encountered significant challenges as cloud computing has become dominant and use cases have grown in scale and complexity:

  • Reliability Challenges at Scale: Managing very large Elasticsearch clusters (e.g., exceeding 200 nodes) becomes increasingly difficult. Data movement operations and cluster metadata replication can be time-consuming and lead to cluster pauses.
  • Rolling Restarts and Upgrades: Performing rolling restarts or upgrades on large clusters can introduce latency spikes and potential data inconsistencies. Rollbacks can also be problematic. While Blue/Green deployments offer a safer alternative, they require significant additional capacity, which can be costly.
  • Challenges with Reactive Scaling: Dynamically adding or removing replicas in response to changes in serving traffic is challenging. Additionally, imbalances in traffic across shards can lead to uneven replica distribution, impacting performance and resource utilization.
  • Lack of Shard Distribution Control: Managing shard distribution in large clusters is complex. Fine-tuning shard size and distribution can be crucial for optimizing performance, especially for latency-sensitive applications.

We propose transitioning OpenSearch to a cloud-native architecture. The shift towards cloud computing and the demands of modern, large-scale applications necessitate a rethinking of OpenSearch's architecture. A cloud-native architecture, leveraging cloud services and adopting a shared-nothing design, can address many of these challenges:

  • Cloud Components: Utilizing cloud-based services for storage, compute, and networking can provide greater scalability, elasticity, and resilience.
  • Shared-Nothing Architecture: In a shared-nothing architecture, nodes operate independently and do not share resources. This can simplify cluster management, improve scalability, and reduce dependencies between nodes.

By embracing a cloud-native approach, OpenSearch can better meet the requirements of large-scale, cloud-based deployments. This transition can lead to improved operational efficiency, enhanced reliability, and greater scalability, enabling OpenSearch to handle the demands of modern data-intensive applications.

Describe the solution you'd like

Goal

  • Remove the 200-node limit on cluster size to enable high scalability.
  • Allow rolling upgrades and restarts with a 20% capacity buffer, eliminating the need for Blue/Green deployments.
  • Support rapid horizontal scaling per shard to provide elasticity.
  • Improve reliability and reduce operational work for large clusters.
  • Allow fine tuning of shard size and distribution for better query performance.

Architecture

The proposed architecture is shown in Figure 1. It includes several modular changes compared to today's architecture.


Figure 1. Cloud-native OpenSearch architecture

External Metadata Storage

An important architectural change is to decouple the logical metadata from the physical cluster state and store the metadata in a shared external storage system. This would result in better scalability and faster access; currently, cluster state propagation to all nodes can be slow in large clusters.

This idea was also proposed in an RFC that lists faster index creation/updates, better memory efficiency, and faster crash recovery among the benefits of this change. An additional benefit of external metadata storage is that it can simplify the control plane implementation and delegate durable metadata state management to an external system, such as etcd or ZooKeeper, which are known to be reliable distributed coordination systems. The RFC recommends RocksDB for metadata storage; however, RocksDB is not a distributed system and therefore does not provide the high availability and durability of strongly consistent KV storage systems such as etcd or ZooKeeper. Many technologies, including Kubernetes and Apache Kafka/Pinot/HDFS/YARN, use etcd or ZooKeeper for metadata storage and coordination. As the Hadoop ecosystem is being replaced by more modern, cloud-native, real-time data tools, we propose etcd as the default external metadata storage rather than ZooKeeper. The metadata storage layer can be abstracted to allow for other storage options, such as ZooKeeper, as sketched below.
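
To illustrate what such an abstraction layer could look like, here is a minimal sketch of a pluggable metadata-store contract. All names are hypothetical and not an existing OpenSearch API; the intent is only to show how etcd, ZooKeeper, or another strongly consistent backend could sit behind the same interface.

```java
import java.util.function.Consumer;

// Hypothetical abstraction, not an existing OpenSearch API: a pluggable metadata store
// that could be backed by etcd, ZooKeeper, or another strongly consistent KV system.
public interface MetadataStore {

    /** A change notification for a metadata path (created, updated, or deleted). */
    record MetadataEvent(String path, byte[] value, long version) {}

    /** Read the metadata document stored under the given path, or null if absent. */
    byte[] get(String path);

    /** Write metadata guarded by a compare-and-swap on the version, to avoid lost updates. */
    boolean putIfVersion(String path, byte[] value, long expectedVersion);

    /** Subscribe to changes under a path prefix (maps to etcd or ZooKeeper watches). */
    AutoCloseable watch(String pathPrefix, Consumer<MetadataEvent> onChange);
}
```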

Additionally, etcd provides primitives such as watches that help nodes build reactive, event-driven features, including leader election, service discovery, and dynamic configuration.
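
For example, a data node (or the controller) could subscribe to a goal-state key and react to changes without polling. The snippet below is a minimal sketch using the jetcd client (io.etcd:jetcd-core); the endpoint and the /opensearch/goal-state/... key layout are assumptions made for illustration only.

```java
import java.nio.charset.StandardCharsets;

import io.etcd.jetcd.ByteSequence;
import io.etcd.jetcd.Client;
import io.etcd.jetcd.Watch;

// Minimal sketch with the jetcd client; endpoint and key layout are illustrative assumptions.
public class GoalStateWatcher {
    public static void main(String[] args) throws Exception {
        Client client = Client.builder().endpoints("http://127.0.0.1:2379").build();

        // Each data node watches only its own goal-state key, not the global cluster state.
        ByteSequence key =
            ByteSequence.from("/opensearch/goal-state/data-node-1", StandardCharsets.UTF_8);

        Watch.Listener listener = Watch.listener(response ->
            response.getEvents().forEach(event -> {
                String goal = event.getKeyValue().getValue().toString(StandardCharsets.UTF_8);
                // In a real node this would trigger reconciliation of local shard assignments.
                System.out.println("Goal state changed: " + goal);
            }));

        try (Watch.Watcher watcher = client.getWatchClient().watch(key, listener)) {
            Thread.sleep(60_000); // keep the watch open for the demo
        }
    }
}
```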

Shared-nothing Data Nodes

Another important architectural change to enable the cloud-native design is to make the data nodes shared-nothing: each data node is independent and does not need to communicate with other data nodes. As shown in Figure 1, this is made possible by the following changes:

  • Use segment replication instead of document replication, so that index segments are fetched from remote storage (e.g. S3).
  • Use the external metadata storage described above for the logical data and its goal state, so that a new data node knows which indexes and shards it serves along with other configurations (see the sketch after this list). Unlike today's design, the data node fetches only the metadata relevant to it rather than the global metadata, and it does not need to communicate with the master node to collect the cluster state.
  • Use a dedicated coordinator to serve queries, so that the data node does not need to maintain the global cluster state for routing and forwarding queries to other data nodes, as it does in today's design.
  • Pull-based ingestion (introduced in this RFC) can further enhance the shared-nothing design, as it removes the need for the translog: a data node can quickly recover by replaying messages from the streaming source. In fact, with pull-based ingestion we can even enable a leaderless (or primary-less) design, because every shard replica is equivalent to the primary in terms of data ingestion capability.
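
The sketch below illustrates the bootstrap flow of such a shared-nothing data node: fetch only this node's goal state, recover shards from remote storage, then publish the actual state. All type names, paths, and helpers are hypothetical assumptions, not existing OpenSearch APIs.

```java
import java.util.List;

// Hypothetical bootstrap flow for a shared-nothing data node; names and paths are illustrative.
public final class DataNodeBootstrap {

    /** Stand-in for a remote segment repository such as an S3-backed store. */
    interface SegmentStore {
        void downloadSegments(String index, int shard, String localDir);
    }

    /** Stand-in for the external metadata store (e.g. the etcd-backed abstraction above). */
    interface MetadataStore {
        byte[] get(String path);
        void put(String path, byte[] value);
    }

    record ShardAssignment(String index, int shard) {}

    private final MetadataStore metadata;
    private final SegmentStore segments;

    DataNodeBootstrap(MetadataStore metadata, SegmentStore segments) {
        this.metadata = metadata;
        this.segments = segments;
    }

    void start(String nodeId) {
        // 1. Fetch only this node's goal state; no global cluster state, no call to a master node.
        byte[] goal = metadata.get("/opensearch/goal-state/" + nodeId);
        List<ShardAssignment> assignments = parseAssignments(goal);

        // 2. Recover each assigned shard from remote storage (segment replication).
        for (ShardAssignment a : assignments) {
            segments.downloadSegments(a.index(), a.shard(),
                "/var/lib/opensearch/" + a.index() + "/" + a.shard());
        }

        // 3. Publish the node's actual state so the controller and coordinators can observe it.
        metadata.put("/opensearch/actual-state/" + nodeId, "{\"status\":\"STARTED\"}".getBytes());
    }

    private List<ShardAssignment> parseAssignments(byte[] goal) {
        // Parsing elided; the goal state would describe which indexes and shards this node serves.
        return List.of();
    }
}
```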

This shared-nothing design brings in many benefits:

  • Horizontal scalability: data nodes can be scaled out linearly and quickly without any shared bottlenecks, which is valuable for auto-scaling based on traffic patterns.
  • Reduced contention: since the cluster state is not shared among data nodes, there is no waiting on its propagation.
  • Better fault isolation: since a data node does not depend on others, one data node crashing does not affect the other nodes, which improves cluster resilience.
  • Easier operation: since data nodes become more modular, they are easier to operate and maintain. We can add/remove data nodes much like microservices, and perform rolling restarts/upgrades without the 2x capacity needed for Blue/Green deployments.

Controller with Goal State vs Actual State Convergence Model

With the metadata stored in an external storage system, we could simplify the master node design by adopting the goal-state vs. actual-state convergence model, a design pattern proven resilient in large-scale distributed systems, with real-world examples like Kubernetes. A system that naturally drifts toward correctness provides strong self-healing capabilities and thus eases operation and maintenance.

With the metadata stored in external storage, the controller can also store the goal state of the data nodes, such as the indexes and shards a given data node should serve. Each data node in turn writes its actual state to the external storage, such as node health and the configuration of the indexes it is serving. The controller can watch and compare the goal state and the actual state, and update the goal state when needed, for example to assign shard leadership to a given data node.
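
A minimal sketch of one convergence pass is shown below; the store paths, the per-node comparison, and the repair action are hypothetical and only illustrate the goal-state vs. actual-state pattern, not a concrete controller design.

```java
import java.util.Map;

// Hypothetical reconciliation loop for the controller; paths and actions are illustrative.
public final class ConvergenceController {

    /** Stand-in for the external metadata store. */
    interface MetadataStore {
        Map<String, String> readPrefix(String pathPrefix); // path -> value
        void put(String path, String value);
    }

    private final MetadataStore metadata;

    ConvergenceController(MetadataStore metadata) {
        this.metadata = metadata;
    }

    /** One reconciliation pass: compare goal state with actual state and repair the difference. */
    void reconcileOnce() {
        Map<String, String> goal = metadata.readPrefix("/opensearch/goal-state/");
        Map<String, String> actual = metadata.readPrefix("/opensearch/actual-state/");

        for (Map.Entry<String, String> e : goal.entrySet()) {
            String nodeId = e.getKey().substring("/opensearch/goal-state/".length());
            String actualState = actual.get("/opensearch/actual-state/" + nodeId);

            if (actualState == null) {
                // Node has not reported: e.g. reassign its shards or elect a new shard leader.
                metadata.put("/opensearch/goal-state/" + nodeId, reassign(e.getValue()));
            }
            // If goal and actual already match, nothing to do: the system has converged.
        }
    }

    private String reassign(String currentGoal) {
        // Placement logic elided; in practice this would pick healthy nodes for the shards.
        return currentGoal;
    }
}
```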

Stateless Coordinator

With the external metadata storage, it is possible to make the coordinator a stateless service, so that it does not need to join the cluster like the data nodes do. It can watch the metadata store and therefore does not need to receive the cluster state from the master node. The cluster state includes the actual state of the data nodes, such as shard allocation and node health, which can be used to build the routing table the coordinator uses to dispatch queries. A stateless coordinator greatly simplifies deployment and operation, and it allows the coordinator tier to scale linearly and horizontally.
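
The sketch below shows how a stateless coordinator might maintain its routing table purely from watched actual-state updates, without ever joining the cluster. The data structures and the shape of the updates are assumptions for illustration.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;

// Hypothetical routing-table construction for a stateless coordinator; names are illustrative.
public final class CoordinatorRoutingTable {

    record ShardId(String index, int shard) {}

    // shard -> data nodes currently serving it, rebuilt from watched actual-state updates.
    private final Map<ShardId, List<String>> routing = new ConcurrentHashMap<>();

    /** Called from a metadata-store watch whenever a data node publishes its actual state. */
    void onActualStateUpdate(String nodeId, List<ShardId> servedShards) {
        // Remove the node from shards it no longer serves.
        routing.values().forEach(nodes -> nodes.remove(nodeId));
        // Add it to the shards it reports as started.
        for (ShardId shard : servedShards) {
            routing.computeIfAbsent(shard, s -> new CopyOnWriteArrayList<>()).add(nodeId);
        }
    }

    /** Pick a node to dispatch a query for the given shard (round-robin/random selection elided). */
    String route(ShardId shard) {
        List<String> nodes = routing.getOrDefault(shard, List.of());
        return nodes.isEmpty() ? null : nodes.get(0);
    }
}
```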

Related component

Cluster Manager

Describe alternatives you've considered

No response

Additional context

This RFC is also written in this Google Doc for comments and discussion.
