
SWBus (Switch Bus)


Overview

SWBus is a high-performance and scalable message channel for SONiC internal services. It is designed to provide an easy-to-use interface for intra- and inter-switch communication between internal services.

At its core, it provides a mesh network between all switches and helps with message routing and forwarding between them.

SWBus is built on top of the gRPC framework and does not impose any restriction on how its message payloads are serialized.

Top level concepts

The SWBus is composed of the following components:

  • SWitch Bus Core (swbus-core): This is the core implementation of SWBus. It provides the protocol, the connection management for the mesh network, and the message routing, but it does not handle the messages themselves and has no knowledge of the specific network being created.
  • SWitch Bus Daemon (swbusd): This is the service that runs on each switch. It runs SWitch Bus Core internally, provides the gRPC interfaces, parses the network configuration files, and calls SWitch Bus Core to create the network.
  • SWitch Bus API (swbus-api): This is the API for any service that wants to use SWitch Bus to communicate with other services. It is essentially a wrapper around the gRPC interfaces provided by SWitch Bus Daemon. It also provides a message filter layer for:
    • Local message routing, so that only messages sent to other services are forwarded to the SWitch Bus Daemon.
    • Incoming message dispatching, so it can forward incoming messages to the correct message handler.

The 3 components are designed to be used together, as the graph below shows:

```mermaid
graph TB
subgraph NodeA
    subgraph ServiceA
        service-logic-a0("Service Logic")
        service-logic-a1("Service Logic")
        swbus-api-a("Switch Bus API (swbus-api)")

        service-logic-a0 <--> swbus-api-a
        service-logic-a1 <--> swbus-api-a
    end

    subgraph ServiceB
        service-logic-b("Service Logic")
        swbus-api-b("Switch Bus API (swbus-api)")

        service-logic-b <--> swbus-api-b
    end

    subgraph swbusd-a["Switch Bus Daemon (swbusd)"]
        grpc-server-a("gRPC Server")
        swbus-core-a

        swbus-api-a <--> grpc-server-a
        swbus-api-b <--> grpc-server-a
        grpc-server-a <--> swbus-core-a
    end
end

subgraph NodeB
    subgraph ServiceC
        service-logic-c0("Service Logic")
        service-logic-c1("Service Logic")
        swbus-api-c("Switch Bus API (swbus-api)")

        service-logic-c0 <--> swbus-api-c
        service-logic-c1 <--> swbus-api-c
    end

    subgraph swbusd-c["Switch Bus Daemon (swbusd)"]
        grpc-server-c("gRPC Server")
        swbus-core-c

        swbus-api-c <--> grpc-server-c
        grpc-server-c <--> swbus-core-c
    end
end

swbus-core-a <--> swbus-core-c
```

Service Logic (Resource) as Endpoint

In the SWBus network, all service logic (not services) is defined as the endpoints of the network, i.e., entities that send and receive messages. Each endpoint has its own unique address in the system, called a Service Path, which has the structure below:

The first three fields form the node locator, the next two the service locator, and the last two the resource locator:

| Region ID | Cluster ID | Node ID | Service Type | Service ID | Resource Type | Resource ID |
| --- | --- | --- | --- | --- | --- | --- |
| region-a | switch-cluster-a | 10.0.0.1-dpu0 | hamgrd | 0 | hascope | eni-0a1b2c3d4e5f6 |

The service path can be represented as a string: region-a/switch-cluster-a/10.0.0.1-dpu0/hamgrd/0/hascope/eni-0a1b2c3d4e5f6.

We can also add dedicated message filters in swbus-api in order to implement endpoints that are only available within the service itself, such as a Redis DB. This simplifies and unifies communication with other existing services that use the swss-common Redis communication channel.

| Region ID | Cluster ID | Node ID | Service Type | Service ID | Resource Type | Resource ID |
| --- | --- | --- | --- | --- | --- | --- |
| region-b | switch-cluster-b | 10.0.0.2-dpu1 | redis | APPL.SOME_TABLE | data | some_key:some_subkey |

The service path can be represented as this string: region-b/switch-cluster-b/10.0.0.2-dpu1/redis/APPL.SOME_TABLE/data/some_key:some_subkey.

In network requests, the service path is defined in the protobuf message below:

```protobuf
message ServicePath {
  // Server location
  string region_id = 10;
  string cluster_id = 20;
  string node_id = 30;

  // Service info
  string service_type = 110;
  string service_id = 120;

  // Resource info
  string resource_type = 210;
  string resource_id = 220;
}
```
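
To make the path format concrete, below is a minimal Rust sketch of formatting and parsing the 7-segment string form. The type and method names are illustrative only, not the actual swbus types; the field names follow the protobuf message above.

```rust
// Illustrative sketch only - not the actual swbus implementation.
#[derive(Debug, Default, PartialEq)]
struct ServicePath {
    region_id: String,
    cluster_id: String,
    node_id: String,
    service_type: String,
    service_id: String,
    resource_type: String,
    resource_id: String,
}

impl ServicePath {
    /// Render as "region/cluster/node/service_type/service_id/resource_type/resource_id".
    fn to_path_string(&self) -> String {
        [
            &self.region_id,
            &self.cluster_id,
            &self.node_id,
            &self.service_type,
            &self.service_id,
            &self.resource_type,
            &self.resource_id,
        ]
        .map(|s| s.as_str())
        .join("/")
    }

    /// Parse the 7-segment string form back into a ServicePath.
    fn parse(s: &str) -> Option<ServicePath> {
        let p: Vec<&str> = s.split('/').collect();
        if p.len() != 7 {
            return None;
        }
        Some(ServicePath {
            region_id: p[0].into(),
            cluster_id: p[1].into(),
            node_id: p[2].into(),
            service_type: p[3].into(),
            service_id: p[4].into(),
            resource_type: p[5].into(),
            resource_id: p[6].into(),
        })
    }
}

fn main() {
    let sp = ServicePath::parse(
        "region-a/switch-cluster-a/10.0.0.1-dpu0/hamgrd/0/hascope/eni-0a1b2c3d4e5f6",
    )
    .unwrap();
    assert_eq!(sp.service_type, "hamgrd");
    println!("{}", sp.to_path_string());
}
```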

Network topology

Once the network is set up, it forms a mesh between all the switches. Each switch maintains a route table pointing to the others, so that messages can be routed to the correct destination.

As an example, the network topology is shown below.

[Figure: swbus-topo-full]

Message routing

All messages in SWBus are unicast and are routed based on the service path.

A message is routed in a manner similar to longest prefix matching (see the sketch after this list). It will:

  • In swbus-api:
    • Use the full service path to find an exact match. If there is a match, route the message there.
    • If not, try again with only the service location to find a match.
    • If not, forward the message to the SWBus Daemon to find a match.
  • In swbusd:
    • Use the service location to find a match.
    • If not, try again with only the region id, cluster id, and node id to find a match.
    • If not, try again with only the region id and cluster id to find a match.
    • If not, try again with only the region id to find a match.
    • If still no match is found, return a NO_ROUTE error.
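
A minimal Rust sketch of this lookup order is below. The types and names are illustrative, not the actual swbus-core code, and the sketch only shows the swbusd side of the search.

```rust
use std::collections::HashMap;

// Illustrative next-hop placeholder; the real swbus-core entry also carries a
// connection proxy and a hop count.
struct NextHop;

/// Join the first `n` '/'-separated segments of a service path.
fn prefix(path: &str, n: usize) -> String {
    path.split('/').take(n).collect::<Vec<_>>().join("/")
}

/// swbusd-side lookup: service location (5 segments), then node (3 segments),
/// then cluster (2), then region (1). `None` maps to a NO_ROUTE error.
/// The swbus-api side works the same way, trying 7 then 5 segments before
/// handing the message to swbusd.
fn lookup<'a>(routes: &'a HashMap<String, NextHop>, dest: &str) -> Option<&'a NextHop> {
    [5, 3, 2, 1]
        .iter()
        .find_map(|&n| routes.get(&prefix(dest, n)))
}

fn main() {
    let mut routes = HashMap::new();
    routes.insert("region-a/switch-cluster-a/10.0.0.2-dpu1".to_string(), NextHop);
    let dest = "region-a/switch-cluster-a/10.0.0.2-dpu1/hamgrd/0/hascope/eni-0a1b2c3d4e5f6";
    assert!(lookup(&routes, dest).is_some()); // matched at the node-id level
}
```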

Life of a packet

With this architecture, let's walk through how a message gets sent and handled.

Let's say we have 2 services in the network - HAMgrD on DPU0 and DPU1 - and they would like to communicate with each other about the same ENI resource - 0a1b2c3d4e5f6.

  • Sender: region-a/switch-cluster-a/10.0.0.1-dpu0/hamgrd/0/hascope/eni-0a1b2c3d4e5f6
  • Receiver: region-a/switch-cluster-a/10.0.0.2-dpu1/hamgrd/0/hascope/eni-0a1b2c3d4e5f6

The life of a packet is as follows:

  • HAMgrD in DPU0 sends a message to the receiver with:
    • Destination = region-a/switch-cluster-a/10.0.0.2-dpu1/hamgrd/0/hascope/eni-0a1b2c3d4e5f6
    • Source = region-a/switch-cluster-a/10.0.0.1-dpu0/hamgrd/0/hascope/eni-0a1b2c3d4e5f6
  • In swbus-api:
    • First, use the full service path to find the next hop: region-a/switch-cluster-a/10.0.0.2-dpu1/hamgrd/0/hascope/eni-0a1b2c3d4e5f6
    • Since it is not found, use the service location to find the next hop: region-a/switch-cluster-a/10.0.0.2-dpu1/hamgrd/0
    • Still not found, so forward the message to the SWBus Daemon to find the next hop.
  • In swbusd:
    • Use the service location to find the next hop: region-a/switch-cluster-a/10.0.0.2-dpu1/hamgrd/0
    • Since it is not found, use the node id to find the next hop: region-a/switch-cluster-a/10.0.0.2-dpu1
    • This time, the next hop is found as a gRPC connection to the swbusd running on 10.0.0.2 for DPU1, so the message is forwarded there.
  • On 10.0.0.2, in swbusd:
    • Use the service location to find the next hop: region-a/switch-cluster-a/10.0.0.2-dpu1/hamgrd/0
    • We will find the next hop as the hamgrd service on DPU1, so we will forward the message there.
  • On 10.0.0.2, in HAMgrD swbus-api:
    • In the message filter layer, use the full service path to find the next hop: region-a/switch-cluster-a/10.0.0.2-dpu1/hamgrd/0/hascope/eni-0a1b2c3d4e5f6
    • We will find the message handler for ENI resource 0a1b2c3d4e5f6, so we will forward the message there.
  • Upon receiving the message, the receiver can optionally respond with an ACK message after delivery, which goes through a similar routing process back to the sender.

Network management

Connection store

The core of the network management is the connection store and route table inside the swbus-core.

Whenever a new connection is established, we will add it to the connection store:

  • Each connection will be a bi-directional gRPC stream.
  • A swbus connection will be created for each connection, which maintains:
    • The connection metadata
    • A worker for reading and writing messages
    • A proxy factory so that anyone can create a proxy to send messages to the connection

With this, we will update the route table so that we can route messages to this connection.
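
As a rough Rust sketch of this bookkeeping (type and field names are illustrative, not the actual swbus-core definitions; assumes the tokio crate):

```rust
use std::collections::HashMap;
use tokio::sync::mpsc;

struct SwbusMessage; // placeholder for the real message type

// Connection metadata.
struct SwbusConnInfo {
    peer_service_path: String,
}

// One entry per established connection: the worker owns the bidirectional gRPC
// stream; the mpsc sender acts as the "proxy" that queues messages to it.
struct SwbusConn {
    info: SwbusConnInfo,
    to_worker: mpsc::Sender<SwbusMessage>,
}

impl SwbusConn {
    // Proxy factory: anyone can obtain a handle to send messages to this connection.
    fn new_proxy(&self) -> mpsc::Sender<SwbusMessage> {
        self.to_worker.clone()
    }
}

// The connection store itself, keyed by the peer's service path.
struct SwbusConnStore {
    conns: HashMap<String, SwbusConn>,
}

fn main() {
    let (tx, _rx) = mpsc::channel::<SwbusMessage>(64);
    let conn = SwbusConn {
        info: SwbusConnInfo { peer_service_path: "region-a/cluster-a/10.0.0.2-dpu0".into() },
        to_worker: tx,
    };
    let _proxy = conn.new_proxy(); // handed to the multiplexer's next-hop entry
    let _store = SwbusConnStore { conns: HashMap::new() };
}
```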

Multiplexer and route table

The route table and message relay functionality is implemented by the SwbusMultiplexer in the swbus-core.

The route table is essentially a hash map with a string (service path) as the key and a next hop object as the value. The next hop object contains:

  • The connection proxy that can relay messages to the connection.
  • The hop count, which is used during route updates so that only the shortest path is kept in the route table.

```mermaid
classDiagram
    class ServiceHost

    class SwbusConnStore
    class SwbusConn
    class SwbusConnInfo
    class SwbusConnWorker
    class SwbusConnProxy

    class SwbusMultiplexer
    class SwbusNextHop

    ServiceHost "1" *-- "1" SwbusConnStore
    ServiceHost "1" *-- "1" SwbusMultiplexer

    SwbusConnStore "1" *-- "0..n" SwbusConn
    SwbusConn "1" *-- "1" SwbusConnWorker
    SwbusConn --> SwbusConnInfo

    SwbusMultiplexer "1" *-- "0..n" SwbusNextHop
    SwbusNextHop --> SwbusConnInfo
    SwbusNextHop "1" *-- "1" SwbusConnProxy
    SwbusConnProxy .. SwbusConnWorker : Queue message to worker\nvia mpsc channel
    SwbusConnWorker --> SwbusMultiplexer : Forward message to Mux\nfor message forwarding
```
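
A condensed Rust sketch of the next-hop bookkeeping described above (names are illustrative; the real SwbusNextHop also wraps a SwbusConnProxy):

```rust
use std::collections::HashMap;

// Illustrative next hop: in swbus-core this also holds the connection proxy
// used to relay the message.
struct SwbusNextHop {
    hop_count: u32,
}

struct RouteTable {
    // Key: service path (or a prefix of one), value: next hop.
    routes: HashMap<String, SwbusNextHop>,
}

impl RouteTable {
    /// Keep only the shortest path per key: a candidate replaces the existing
    /// entry only if its hop count is strictly lower.
    fn update(&mut self, key: String, candidate: SwbusNextHop) {
        match self.routes.get(&key) {
            Some(current) if current.hop_count <= candidate.hop_count => {} // keep current
            _ => {
                self.routes.insert(key, candidate);
            }
        }
    }
}

fn main() {
    let mut table = RouteTable { routes: HashMap::new() };
    table.update("region-a/cluster-a/10.0.0.2-dpu0".into(), SwbusNextHop { hop_count: 2 });
    table.update("region-a/cluster-a/10.0.0.2-dpu0".into(), SwbusNextHop { hop_count: 1 });
    assert_eq!(table.routes["region-a/cluster-a/10.0.0.2-dpu0"].hop_count, 1);
}
```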

The route table will be updated in several situations:

  • When a new connection gets established, the initial route entries will be added. No matter whether the connection was created locally or remotely, the hop count is considered to be 1.

Route scope and connection type

As we can see from the service path and message routing defined above, in swbusd a route / service path can be at the regional level or down to the service level, so not all routes need to be sent to every peer.

To control how widely a route will be broadcast, we first define a type for each connection that describes the purpose or origin of the connection and determines what kinds of routes it will receive:

  • Client: This connection comes from a CLI implementation.
  • Node: This connection is established for intra-node communication for all services on that node.
  • Cluster: This connection is established for cross-node communication within the same cluster.
  • Region: This connection is established for cross-cluster communication within the same region.
  • Global: This connection is established for cross-region communication.

When a route is added in swbusd, e.g., a new peer is connected or new routes are received from other peers, the route will be broadcast to each peer based on the route scope and connection type as below:

| Route Scope \ Peer Connection Type | Client | Node | Intra-Cluster | Inter-Cluster | Inter-Region |
| --- | --- | --- | --- | --- | --- |
| Client | No | No | No | No | No |
| Node | No | No | No | No | No |
| Intra-Cluster | No | No | Yes | No | No |
| Inter-Cluster | No | No | Yes | Yes | No |
| Inter-Region | No | No | Yes | Yes | Yes |
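
The matrix reduces to a simple ordering rule over scopes, sketched below in Rust (the enum and function names are illustrative, not the actual swbus-core code):

```rust
// Illustrative encoding of the broadcast matrix above: scopes are ordered, and
// a route is announced only to peers whose connection scope it covers.
#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum Scope {
    Client,
    Node,
    InCluster, // "Intra-Cluster" above
    InRegion,  // "Inter-Cluster"
    Global,    // "Inter-Region"
}

fn should_announce(route_scope: Scope, conn_type: Scope) -> bool {
    // Client- and Node-scoped routes and connections never participate;
    // otherwise a route goes to any peer whose scope it is at or above.
    conn_type >= Scope::InCluster && route_scope >= conn_type
}

fn main() {
    assert!(should_announce(Scope::Global, Scope::InCluster)); // Inter-Region row: Yes
    assert!(!should_announce(Scope::InCluster, Scope::InRegion)); // Intra-Cluster row: No
    assert!(!should_announce(Scope::Node, Scope::InCluster)); // Node row: all No
}
```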

Route Exchange Working Theory

Whenever there is a change to the local route table, swbusd will send route announcements to all the connections that are of the InCluster, InRegion or Global type. The change can be triggered in 3 situations:

  • A new connection is established with a peer swbusd. The service path of the peer carried in the initial connection will be added to the local route table.
  • A connection to a peer swbusd is lost. All the routes from the peer via the connection will be removed from the route table.
  • A route announcement is received from a peer swbusd, and the local route table is updated after processing the announcement.

Route Announcement

When swbusd starts up, an asynchronous Tokio task called RouteAnnouncer is spawned. Its primary role is to send route announcements to peer swbusd instances. A communication channel is established between the Multiplexer and RouteAnnouncer to handle route announcement requests. The Multiplexer acts as the producer, generating requests based on the three scenarios mentioned earlier. RouteAnnouncer, in turn, retrieves these requests from the channel and sends route announcements to each connected peer. If multiple requests accumulate in the channel, only a single route announcement is sent instead of one per request.
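
A minimal Tokio sketch of this coalescing pattern follows. It assumes the tokio crate; the names are illustrative, not the actual swbusd code.

```rust
use tokio::sync::mpsc;

// Stand-in for sending one route announcement to every connected peer.
async fn announce_routes_to_peers() {
    println!("announcing routes to all eligible peers");
}

// RouteAnnouncer-style loop: the Multiplexer pushes unit requests into the
// channel; requests that piled up while we were announcing are drained, so a
// burst of requests results in a single announcement.
async fn route_announcer(mut rx: mpsc::Receiver<()>) {
    while rx.recv().await.is_some() {
        while rx.try_recv().is_ok() {} // coalesce queued requests
        announce_routes_to_peers().await;
    }
}

#[tokio::main]
async fn main() {
    let (tx, rx) = mpsc::channel(64);
    let announcer = tokio::spawn(route_announcer(rx));
    // Multiplexer side: a burst of route-table changes.
    for _ in 0..3 {
        tx.send(()).await.unwrap();
    }
    drop(tx); // closing the channel ends the announcer loop
    announcer.await.unwrap();
}
```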

Route Processing Logic

  1. Retrieving Connections: RouteAnnouncer first fetches all active connections from the Multiplexer. For each connection, it determines the connection type and requests the Multiplexer to export its route table accordingly.
  2. Route Exporting Rules (a sketch of the per-target rule follows this list):
    • The Multiplexer exports only one route per target: the one with the lowest hop count that does not go through the peer to which the RouteAnnouncer is exporting. The local route table may contain multiple route entries for the same target through different peers.
    • If the route with the lowest hop count to a target has a next hop through the same peer to which the RouteAnnouncer is exporting, that route entry is excluded to prevent routing loops. If an alternative route to the target exists with either the same lowest hop count or one hop higher, it is exported. Otherwise, the route target is omitted completely.
    • It also exports the routes from my_routes, with a hop count of 0 and a route scope greater than or equal to the target scope.
  3. Avoiding Routing Loops: swbusd does not announce a route to a peer if the route's next hop is via that same peer. This precaution prevents routing loops in case the peer swbusd loses its path to the route target. However, if the local swbusd has an alternative path with a higher hop count, it can still provide connectivity. In such a scenario, when the peer swbusd loses its route, it sends an update to the local swbusd, indicating the removal of the route. The local swbusd then removes the entry and exposes the alternative path as the new lowest-hop-count route. This updated route is subsequently announced to the peer swbusd, enabling it to establish a new route through the alternative path.
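
A compact Rust sketch of the per-target export rule from step 2 (illustrative types; not the actual Multiplexer code):

```rust
// One candidate route to a target, as stored in the local route table.
struct Route {
    via_peer: String, // which peer connection the next hop goes through
    hop_count: u32,
}

/// Pick the route to export to `exporting_to`: the lowest-hop-count route not
/// via that peer, but only if it is within one hop of the overall best route.
/// Returning None omits the target from the announcement entirely.
fn export_route<'a>(candidates: &'a [Route], exporting_to: &str) -> Option<&'a Route> {
    let best = candidates.iter().map(|r| r.hop_count).min()?;
    candidates
        .iter()
        .filter(|r| r.via_peer != exporting_to) // avoid routing loops
        .filter(|r| r.hop_count <= best + 1)    // same as best, or one hop higher
        .min_by_key(|r| r.hop_count)
}

fn main() {
    let candidates = vec![
        Route { via_peer: "peer-a".into(), hop_count: 1 },
        Route { via_peer: "peer-b".into(), hop_count: 2 },
    ];
    // Exporting to peer-a: the best route goes through peer-a, so the
    // one-hop-higher alternative through peer-b is exported instead.
    assert_eq!(export_route(&candidates, "peer-a").unwrap().hop_count, 2);
}
```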

Route Update

  • The Multiplexer keeps the announced routes per connection. When swbusd receives a new route announcement from the same connection, it finds the new routes and the deleted routes by comparing against the previously announced routes (see the sketch after this list).
  • The Multiplexer will first remove the deleted route entries and then add the new ones.
  • When adding new routes, it will skip the ones matching 'my-routes', which are the routes it advertises to its peers (see 'routes' in the YAML config below). My-routes are considered permanent and of the lowest cost.
  • If the local route table is updated after processing the route announcement, a route announcement request is sent to RouteAnnouncer.
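
The diff against the previously announced set can be sketched as below (illustrative; routes are reduced to their string keys):

```rust
use std::collections::HashSet;

/// Compare a peer's new announcement with what it announced previously,
/// returning (new routes to add, stale routes to remove).
fn diff_announcement(
    previous: &HashSet<String>,
    current: &HashSet<String>,
) -> (Vec<String>, Vec<String>) {
    let added = current.difference(previous).cloned().collect();
    let removed = previous.difference(current).cloned().collect();
    (added, removed)
}

fn main() {
    let previous: HashSet<String> = ["a/b/node1".to_string()].into();
    let current: HashSet<String> = ["a/b/node2".to_string()].into();
    let (added, removed) = diff_announcement(&previous, &current);
    assert_eq!(added, vec!["a/b/node2".to_string()]);
    assert_eq!(removed, vec!["a/b/node1".to_string()]);
}
```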

Network setup

Initial network setup

When swbusd is launched, it will read the network configuration, such as the routes that need to be announced and all the peers it needs to connect to.

Each peer will contain the following key information:

  • IP endpoint of the peer swbusd gRPC server
  • Service Path of the peer, with the region id, cluster id and node id. No service locator or resource locator will be provided.

An example configuration is shown below:

```yaml
# Routes that will be advertised to all peers
routes:
- key: region-a/cluster-a/10.0.0.1-dpu0
  scope: cluster
peers:
- id: region-a/cluster-a/10.0.0.2-dpu0
  endpoint: 10.0.0.2:8000
  type: cluster
- id: region-a/cluster-a/10.0.0.3-dpu0
  endpoint: 10.0.0.3:8000
  type: cluster
...
```
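
For illustration, this configuration could be modeled in Rust roughly as below. This is a sketch assuming the serde and serde_yaml crates and a hypothetical file path; the real swbusd config types may differ.

```rust
use serde::Deserialize;

#[derive(Debug, Deserialize)]
struct SwbusConfig {
    routes: Vec<RouteConfig>, // routes advertised to all peers ("my routes")
    peers: Vec<PeerConfig>,
}

#[derive(Debug, Deserialize)]
struct RouteConfig {
    key: String,   // e.g. "region-a/cluster-a/10.0.0.1-dpu0"
    scope: String, // e.g. "cluster"
}

#[derive(Debug, Deserialize)]
struct PeerConfig {
    id: String,       // peer's service path (node locator only)
    endpoint: String, // peer swbusd gRPC endpoint, e.g. "10.0.0.2:8000"
    #[serde(rename = "type")]
    conn_type: String, // connection type, e.g. "cluster"
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let yaml = std::fs::read_to_string("swbusd.yaml")?; // hypothetical path
    let cfg: SwbusConfig = serde_yaml::from_str(&yaml)?;
    println!("{} route(s), {} peer(s)", cfg.routes.len(), cfg.peers.len());
    Ok(())
}
```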

In our case, since all nodes are in the same cluster and all initial routes are cluster-scoped, the network will soon converge to the state below:

[Figure: swbus-topo-route-only]

Service locator announcement

Once the endpoints are connected and the initial routes are set up, all services will start to connect to swbusd and announce their service locators.

Since the network is structured, we don't need to announce the service locators to any peer; hence, all service locators can be added using Node scope. The end result is shown below:

[Figure: swbus-topo-with-service]

Resource location announcement

After a service is up, it will start to load its managed resources. Each resource can also be represented as a service path, but it only needs to exist in the swbus-api of the service itself. Hence, the resource locators are added as message filters in swbus-api.

The end result will be the same as the example shown in the "Network topology" section.

Failure handling

Whenever a connection is broken, it will be detected by the SwbusConnWorker when the gRPC stream is closed. The worker then breaks out of its worker loop and unregisters itself from the Multiplexer, which updates the route table accordingly.

TODO: Aggressive retry and backoff mechanism for swbusd-initiated connections.

Debug infra

To debug message routing issues, SWBus supports 2 types of messages: Ping and TraceRoute, which act similarly to the ICMP ping and traceroute tools frequently used in debugging regular network issues.

Both messages are handled in the same way - whenever the Multiplexer or swbus-api receives a message, it checks whether it is one of these infra messages and handles it accordingly.

Ping

  • A Ping message contains the same header as regular messages, which contains the source and destination service paths as well as the TTL.
  • When a Ping message is received by the endpoint (swbus-api), it will respond with an ACK message, which serves as the Pong.
  • If the TTL of the Ping message reaches 0 in the Multiplexer, it will respond with a SWBUS_ERROR_CODE_UNREACHABLE error.
  • If no route is found, it will respond with a SWBUS_ERROR_CODE_NO_ROUTE error.
  • Otherwise, the Multiplexer will forward the message to the next hop just like a regular message.

Trace Route

  • A TraceRouteRequest message contains the same header as regular messages, which contains the source and destination service paths as well as the TTL.
  • If the TTL of the TraceRouteRequest message reaches 0 in the Multiplexer, it will respond with a SWBUS_ERROR_CODE_UNREACHABLE error.
  • If no route is found, it will respond with a SWBUS_ERROR_CODE_NO_ROUTE error (the TTL / no-route handling shared with Ping is sketched after this list).
  • Otherwise, the Multiplexer will first respond with a TraceRouteResponse message with the same trace id, then forward the message to the next hop just like a regular message.
  • When a TraceRouteRequest message is received by the endpoint (swbus-api), it will respond with an ACK message, which serves as the completion signal of the trace route.
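
A small Rust sketch of the TTL / no-route handling shared by both message types (error code names taken from the text above; types simplified and illustrative):

```rust
// Simplified error codes, named after the ones in the text above.
#[derive(Debug, PartialEq)]
enum SwbusErrorCode {
    Unreachable, // SWBUS_ERROR_CODE_UNREACHABLE
    NoRoute,     // SWBUS_ERROR_CODE_NO_ROUTE
}

#[derive(Debug, PartialEq)]
enum Disposition {
    RespondError(SwbusErrorCode),
    Forward, // relay to the next hop like a regular message
}

/// How the Multiplexer disposes of a Ping or TraceRouteRequest.
fn handle_infra_message(ttl: u8, has_route: bool) -> Disposition {
    if ttl == 0 {
        Disposition::RespondError(SwbusErrorCode::Unreachable)
    } else if !has_route {
        Disposition::RespondError(SwbusErrorCode::NoRoute)
    } else {
        // For TraceRouteRequest, the Multiplexer also responds with a
        // TraceRouteResponse carrying the same trace id before forwarding.
        Disposition::Forward
    }
}

fn main() {
    assert_eq!(
        handle_infra_message(0, true),
        Disposition::RespondError(SwbusErrorCode::Unreachable)
    );
    assert_eq!(handle_infra_message(8, true), Disposition::Forward);
}
```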

swbus-edge

swbus-edge is a component running inside a swbus client, such as hamgrd. It connects to the local swbusd with a service path representing all resources under the service, and swbusd routes messages destined for the service to swbus-edge. The swbus client registers handlers, each with the service path of a target resource, and swbus-edge routes each message to the corresponding handler based on its service path. Outgoing messages, i.e., those not matching any local route, are sent to swbusd. Below is the UML diagram of swbus-edge.

```mermaid
classDiagram

class SwbusCoreClient {
    +new()
    +register_svc()
    +unregister_svc()
    +push_svc()
    +connect()
    +start()
    +send()
}

class SwbusEdgeRuntime {
    +new()
    +start()
    +add_handler()
    +add_private_handler()
    +send()
}

class SwbusMessageRouter {
    +new()
    +start() Result
    +add_route()
    +add_private_route()
}

class SwbusMessageHandlerProxy {
    +new()
    +send()
}

class SimpleSwbusEdgeClient {
    +new()
    +recv()
    +handle_received_message()
    +send()
    +send_raw()
    +outgoing_message_to_swbus_message()
}

SwbusEdgeRuntime --> SwbusMessageRouter : has
SwbusMessageRouter --> SwbusCoreClient : has
SwbusMessageRouter --> RouteMap : uses
RouteMap "1" *--> "n" SwbusMessageHandlerProxy : has
SwbusMessageHandlerProxy --> SwbusMessage : uses
SimpleSwbusEdgeClient --> SwbusEdgeRuntime : uses
```
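
To illustrate the dispatch rule described above, here is a minimal Rust sketch. The types and names are illustrative, not the actual swbus-edge API, and it assumes the tokio crate.

```rust
use std::collections::HashMap;
use tokio::sync::mpsc;

struct Message {
    destination: String, // destination service path
}

// Illustrative edge router: handlers are registered per resource service path;
// anything without a local match is forwarded to the local swbusd.
struct EdgeRouter {
    handlers: HashMap<String, mpsc::Sender<Message>>,
    to_swbusd: mpsc::Sender<Message>,
}

impl EdgeRouter {
    async fn route(&self, msg: Message) {
        match self.handlers.get(&msg.destination) {
            // Incoming message for a managed resource: dispatch locally.
            Some(handler) => {
                let _ = handler.send(msg).await;
            }
            // Outgoing message: hand it to swbusd for mesh routing.
            None => {
                let _ = self.to_swbusd.send(msg).await;
            }
        }
    }
}

#[tokio::main]
async fn main() {
    let (to_handler, mut handler_rx) = mpsc::channel(8);
    let (to_swbusd, _swbusd_rx) = mpsc::channel(8);
    let path = "region-a/switch-cluster-a/10.0.0.1-dpu0/hamgrd/0/hascope/eni-0a1b2c3d4e5f6";
    let router = EdgeRouter {
        handlers: HashMap::from([(path.to_string(), to_handler)]),
        to_swbusd,
    };
    router.route(Message { destination: path.to_string() }).await;
    assert!(handler_rx.recv().await.is_some()); // dispatched to the local handler
}
```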