Commit 983ad63

Merge pull request sonic-net#221 from bandaru-viswanath/master
Local Mode support for Drop Monitor (Silicon Telemetry)
2 parents 8b9dcd0 + b892644 commit 983ad63

1 file changed

+114
-9
lines changed

devops/tam/tam-drop-monitor-hld.md

+114-9
Original file line number | Diff line number | Diff line change
@@ -82,6 +82,8 @@
8282
|---|-----------|------------------|-----------------------------------|
8383
| 0.1 | 10/15/2019 | Shirisha Dasari | Initial Version |
8484
| 0.2 | 07/01/2020 | Bandaru Viswanath | Major update to accommodate enhancements to use new TAM infrastructure, DB schemas and UI |
85+
| 0.3 | 06/11/2021 | Bandaru Viswanath | Introduce the Local Mode |
86+
8587

8688
## About This Manual
8789

@@ -106,13 +108,25 @@ This document describes the high level design of Drop Monitor feature in SONiC.
106108

107109
# 1 Feature Overview
108110

109-
The Drop Monitor feature in SONiC allows the user to setup packet-drop monitoring sessions for specific flows. A Collector, identified by an IP address and associated transport parameters, can be configured on the switch to send packet drop-reports.
111+
The Drop Monitor feature in SONiC allows the user to set up packet-drop monitoring sessions for specific flows. A Collector, identified by an IP address and associated transport parameters, can be configured on the switch as the destination for packet drop-reports. The mode in which reports are sent to an external collector is termed `external` mode.
112+
113+
Additionally, to enable quick and targeted packet-drop debugging, the Drop Monitor feature supports reporting information locally about dropped flows without requiring an external Collector. This mode is termed `local` mode.
114+
115+
The two modes, *local* and *external*, are mutually exclusive. That is, when an external collector is configured, information on dropped flows is unavailable locally on the Switch. Likewise, when used in *local* mode, drop reports are not sent to any external collector.
116+
117+
The *external* mode is the default mode.
118+
119+
The *local* mode is meant for debugging purposes only and is limited in terms of scale (number of flows that can be monitored). It is not intended as a replacement for true drop monitoring with an external Collector.
110120

111121
## 1.1 Requirements
112122

113123
### 1.1.1 Functional Requirements
114124

115-
1.0 Drop Monitor feature allows user to configure a Drop Monitor session on a given switch and send the drop-reports to a specified collector. Drop Monitor session is defined by flow classifiers that are used to identify a flow that needs to be monitored for packet drops.
125+
1.0 Drop Monitor feature allows user to configure a Drop Monitor session on a given switch. Drop Monitor session is defined by flow classifiers that are used to identify a flow that needs to be monitored for packet drops.
126+
127+
1.1 Drop Monitor supports *external* mode, where it can send the drop-reports to a specified collector.
128+
129+
1.2 Drop Monitor supports *local* mode, where it can provide information on dropped flows locally on the Switch.
116130

117131
2.0 Drop Monitor provisioning as listed below.
118132

@@ -124,7 +138,9 @@ The Drop Monitor feature in SONiC allows the user to setup packet-drop monitorin
124138

125139
2.4 TAM collector configuration that can be attached to Drop Monitor session to send drop reports.
126140

127-
2.5 An aging-interval configuration. If the Drop Monitor feature doesn't notice packet drops for this duration, it considers packet drops to have stopped.
141+
2.5 The *local* mode is facilitated with a built-in collector named "local". This collector provides flow information locally on the Switch.
142+
143+
2.6 An aging-interval configuration. If the Drop Monitor feature doesn't notice packet drops for this duration, it considers packet drops to have stopped.
128144

129145
3.0 When the first packet of the flow is dropped by the switch, a "Drop-start" report is sent to the collector. This report contains the event type (Drop-start), first 128 bytes of the packet dropped, flow details and the drop reasons for the packet drop.
130146

@@ -149,11 +165,13 @@ The TAM Drop Monitor feature supports the new management framework and KLISH CLI
149165
- To activate / de-activate the feature
150166
- To create/clear appropriate Drop Monitor configuration on a per-flow-group basis and switch-wide.
151167
- To display current status and statistics for the Drop Monitor on a per flow-group basis.
168+
- To display packet drop information on a per-flow basis, when the Drop Monitor feature is used in *local* mode.
152169

153170
### 1.1.3 Scalability Requirements
154171

155172
- Number of Drop Monitor sessions that can be supported is proportional to the availability of resources in hardware such as ACLs. No specific constraints are imposed.
156173
- Only a single collector is supported.
174+
- When used in *local* mode, not more than 100 flows may be monitored for packet drops.
157175

158176
## 1.2 Design Overview
159177

@@ -225,6 +243,12 @@ The DropMonitorMgr runs in the TAM docker and is used to pass drop monitor confi
225243

226244
The DropMonitorMgr configures the source IP address to be used in drop reports to the system IP address. 9073 is configured as the source port number to be used in drop reports.
227245

246+
### 3.1.2 Local Mode
247+
248+
A DropMonitorCollector thread runs as part of the DropMonitorMgr daemon when Drop Monitor is set to *local* mode. SAI is set up to send the drop reports locally, to a socket listening on a UDP port. The DropMonitorCollector thread receives the drop reports, decodes them and writes the relevant information to the TAM_DROPMONITOR_FLOW_STATUS_TABLE table in the COUNTERS_DB.
249+
250+
A specific CPU queue is configured to receive the drop reports from the hardware. This queue is rate-limited to 500pps to prevent flooding of the CPU. For local debugging purposes, no more than 100 flows are expected to be monitored. Given that drop reports are stateful (not all drops are reported by the hardware), the 500pps limit is expected to be more than sufficient.
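
The sizing argument above can be checked with simple arithmetic. The sketch below is illustrative only; the function name is hypothetical and the result is an average budget that assumes reports are spread evenly across flows, which the design does not state.

```python
def per_flow_report_budget(rate_limit_pps: int, max_flows: int) -> float:
    """Average drop reports per second that each flow may generate
    before the CPU queue rate limit is reached (even-spread assumption)."""
    if max_flows <= 0:
        raise ValueError("max_flows must be positive")
    return rate_limit_pps / max_flows


# Figures taken from the text: 500pps queue limit, at most 100 flows.
budget = per_flow_report_budget(rate_limit_pps=500, max_flows=100)
print(budget)  # 5.0
```

With the stated limits, each monitored flow can average five reports per second before the queue starts discarding reports.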
251+
228252
## 3.2 DB Changes
229253

230254
### 3.2.1 CONFIG DB
@@ -236,6 +260,7 @@ TAM\_DROPMONITOR\_TABLE
236260
key = global ; Only one instance and
237261
; has a fixed key "global".
238262
aging-interval = 1 * 5DIGIT ; Aging interval in seconds
263+
mode = 1 * 255VCHAR ; "external" or "local"
239264

240265
Example:
241266
> keys *TAM_DROPMONITOR_TABLE*
@@ -245,6 +270,8 @@ TAM\_DROPMONITOR\_TABLE
245270

246271
1) "aging-interval"
247272
2) 3600
273+
3) "mode"
274+
4) "external"
248275

249276
TAM\_DROPMONITOR\_SESSIONS\_TABLE
250277

@@ -354,7 +381,20 @@ N/A
354381

355382
### 3.2.5 COUNTER DB
356383

357-
N/A
384+
TAM\_DROPMONITOR\_FLOW\_STATUS\_TABLE
385+
386+
;Defines TAM drop monitor flow status.
387+
388+
key = flow-id ; Flow Id, a unique integer
389+
src-ip = ipv4_address ; SRC IP of the flow 5-tuple
390+
src-port = 1 * 5DIGIT ; SRC L4 port number of the flow 5-tuple
391+
dst-ip = ipv4_address ; DST IP of the flow 5-tuple
392+
dst-port = 1 * 5DIGIT ; DST L4 port number of the flow 5-tuple
393+
protocol = 1 * 3DIGIT ; Protocol number of the flow 5-tuple
394+
state = 1*255VCHAR ; drop state for the flow
395+
; can be one of "dropping" or "inactive"
396+
timestamp = 1*255VCHAR ; time at which the drops were detected
397+
drop-reason = 1*255VCHAR ; Reason for packet drop
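
As an illustration of the schema above, the following sketch checks one flow-status entry against the listed fields. It is not part of the design: the function name is hypothetical, the entry is represented as a plain dictionary rather than read from the database, and the numeric ranges (ports 0 to 65535, protocol 0 to 255) are assumptions based on standard L4 port and IP protocol widths.

```python
import ipaddress

VALID_STATES = {"dropping", "inactive"}  # states named in the schema


def validate_flow_status(fields: dict) -> list:
    """Return a list of problems found in one flow-status entry (empty if valid)."""
    problems = []
    for name in ("src-ip", "dst-ip"):
        try:
            ipaddress.IPv4Address(fields.get(name, ""))
        except ValueError:
            problems.append(f"{name}: not an IPv4 address")
    # Assumed ranges: 16-bit L4 ports, 8-bit IP protocol number.
    for name, upper in (("src-port", 65535), ("dst-port", 65535), ("protocol", 255)):
        try:
            value = int(fields[name])
        except (KeyError, TypeError, ValueError):
            problems.append(f"{name}: missing or not a number")
            continue
        if not 0 <= value <= upper:
            problems.append(f"{name}: out of range 0..{upper}")
    if fields.get("state") not in VALID_STATES:
        problems.append("state: must be 'dropping' or 'inactive'")
    for name in ("timestamp", "drop-reason"):
        if not fields.get(name):
            problems.append(f"{name}: must be non-empty")
    return problems


entry = {
    "src-ip": "10.10.1.1", "src-port": "5656",
    "dst-ip": "10.10.2.2", "dst-port": "80",
    "protocol": "6", "state": "dropping",
    "timestamp": "2021-06-11 11:22AM", "drop-reason": "L3_DEST_MISS",
}
print(validate_flow_status(entry))  # []
```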
358398

359399

360400
## 3.3 Switch State Service Design
@@ -425,8 +465,8 @@ A Drop Monitoring session associated a previously defined flow-group as describe
425465
- The Drop Monitor session must have a unique name for referencing.
426466
- The flow-group must be previously created with the `flow-group` command (under `config-tam` hierarchy). For drop-monitoring, the flow-group must be associated with an interface.
427467
- The sampling-rate can be set, by referencing a previously created sampler, created with the `sampler` command (under `config-tam` hierarchy).
428-
- A collector must be associated with the session, where the drop-reports will be sent. The collector must be previously created with the `collector` command (under `config-tam` hierarchy)..
429-
468+
- A collector must be associated with the session, where the drop-reports will be sent. The collector must be previously created with the `collector` command (under `config-tam` hierarchy). When Drop Monitor is set up in `local` mode, the collector parameter is optional and is ignored.
469+
430470
When a session that is previously created is removed (with the `no` command), the associated flows are no longer monitored for drops by the switch.
431471

432472
The following attributes are supported for drop-monitor sessions.
@@ -435,7 +475,7 @@ The following attributes are supported for drop-monitor sessions.
435475
|--------------------------|-------------------------------------|
436476
| `name` | A string that uniquely identifies the Drop Monitor session |
437477
| `flowgroup` | Specifies the name of *flow-group* |
438-
| `collector` | Specifies the name of the *collector* |
478+
| `collector` | Specifies the name of the *collector*|
439479
| `sample-rate` | Specifies the name of the *sampler* |
440480

441481

@@ -447,6 +487,32 @@ sonic(config-tam-dm)# session <name> flowgroup <fg-name> collector <col-name> [s
447487
sonic (config-tam-dm)# no session <name>
448488
```
449489

490+
#### 3.6.2.4 Setting up Drop Monitoring mode
491+
492+
The `mode` command changes the Drop Monitoring mode. By default, the `external` mode is used. This command can be used to change the mode to `local` and back. The mode can be switched only when no sessions are active.
493+
494+
The command syntax for setting the Drop Monitoring mode is as follows:
495+
496+
```
497+
sonic (config-tam-dm)# [no] mode { external | local }
498+
```
499+
| **Attribute** | **Description** |
500+
|--------------------------|-------------------------------------|
501+
| `mode` | One of the two strings `external` and `local`, representing the monitoring mode. Default value is `external`. |
502+
503+
The `no` form of the command reverts the mode to the default, that is, `external` mode.
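
The rules for the `mode` command can be summarised in a short sketch. This is an illustration under stated assumptions, not the actual implementation: the function, exception and parameter names are hypothetical, and treating a request for the already-active mode as a permitted no-op is an assumption the text does not confirm.

```python
DEFAULT_MODE = "external"
VALID_MODES = ("external", "local")


class ModeSwitchError(Exception):
    """Raised when a mode switch is attempted while sessions are active."""


def apply_mode_command(current_mode, requested_mode, active_sessions, negate=False):
    """Return the mode in effect after `[no] mode { external | local }`."""
    if requested_mode not in VALID_MODES:
        raise ValueError(f"unknown mode: {requested_mode!r}")
    # The `no` form always reverts to the default mode.
    new_mode = DEFAULT_MODE if negate else requested_mode
    # Assumption: re-applying the current mode is not a switch and is allowed.
    if new_mode != current_mode and active_sessions > 0:
        raise ModeSwitchError("remove all active sessions before switching mode")
    return new_mode


print(apply_mode_command("external", "local", active_sessions=0))            # local
print(apply_mode_command("local", "local", active_sessions=0, negate=True))  # external
```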
504+
505+
#### 3.6.2.5 Clearing dropped flows (Local Mode)
506+
507+
This command clears all flows that are currently tracked as dropped flows by the Drop Monitor while in Local Mode. It removes the associated information from the TAM_DROPMONITOR_FLOW_STATUS_TABLE. If a flow experiences drops again, it will be reported again.
508+
509+
The command syntax for clearing the Drop Monitor tracked dropped-flows is as follows:
510+
511+
```
512+
sonic# clear tam drop-monitor flows
513+
514+
```
515+
450516
### 3.6.3 Show Commands
451517

452518
#### 3.6.3.1 Listing the Drop Monitor attributes
@@ -464,12 +530,13 @@ sonic # show tam drop-monitor
464530
Status : Active
465531
Switch ID : 2020
466532
Aging Interval : 60
533+
Mode : external
467534
468535
```
469536

470-
#### 3.6.3.1 Listing the Drop Monitor sessions
537+
#### 3.6.3.2 Listing the Drop Monitor sessions
471538

472-
The following command lists the details for all drop-monitor sessions or for a specific session. Note that only explicitly configured tuples in the associated flow-group are displayed.
539+
The following command lists the details for all drop-monitor sessions or for a specific session. Note that only explicitly configured tuples in the associated flow-group are displayed. When configured in `local` mode, the *Collector* name is shown as *local*.
473540

474541
```
475542
sonic # show tam drop-monitor sessions [<name>]
@@ -502,6 +569,30 @@ Packet Count : 7656
502569
503570
```
504571

572+
#### 3.6.3.3 Listing the dropped flows (Local Mode)
573+
574+
The following command lists the details for all flows that are dropped by the Switch. The details include the 5-tuple of the flow, time-stamp of the first detected drop and the drop-reason.
575+
576+
The flows listed in this command output are tracked until they are no longer dropped (drop-stop event) or until the user explicitly clears them with the `clear` command.
577+
578+
This command provides data only when Drop Monitor is configured in `local` mode. Otherwise, it returns an appropriate error.
579+
580+
```
581+
sonic # show tam drop-monitor flows
582+
```
583+
584+
Sample usage shown below.
585+
586+
```
587+
sonic # show tam drop-monitor flows
588+
589+
src-ip dst-ip src-port dst-port protocol drop-reason timestamp
590+
----------- -------------- -------- -------- -------- --------------------- ---------------------
591+
10.10.1.1 10.10.2.2 5656 80 6 L3_DEST_MISS 2021-06-11 11:22AM
592+
10.10.1.1 10.10.2.2 5656 80 6 UNKNOWN_VLAN 2021-06-11 11:20AM
593+
594+
```
595+
505596
### 3.6.4 Sample Workflow
506597

507598
This section provides a sample Drop Monitor workflow using CLI, for monitoring the packet drops as described below.
@@ -721,6 +812,8 @@ TBD
721812

722813
* Drop Monitor feature is an *advanced* feature that is not available in all the Broadcom SONiC packages.
723814

815+
* The Drop Monitor feature is a Broadcom SONiC-only feature. It will not be contributed to the Community.
816+
724817
## Specific Limitations
725818

726819
Drop Monitor feature in SONiC inherits the limitations of the underlying firmware and the hardware. These are listed below.
@@ -729,6 +822,18 @@ Drop Monitor feature in SONiC inherits the limitations of the underlying firmwar
729822
2. Drop Monitor flows must be IPv4 flows
730823
3. Drop Monitor is supported on TD3-X7, TH2 and TH3 platforms only.
731824

825+
## Local Mode design notes
826+
827+
The 'Local' mode is meant for drop monitoring of a limited number of flows (<100 flows) on the Switch. Otherwise, the number of reports may overwhelm the CPU. A specific CPU queue is assigned for this traffic and is rate-limited to 500pps to prevent CPU spikes.
828+
829+
A side effect of this rate-limiting is that some drop reports may get dropped.
830+
831+
1. If the drop-start reports are dropped, then the associated flows won't be reported (as dropped) in COUNTERS_DB.
832+
2. If the drop-active reports are dropped, then the drop-reasons are not updated in COUNTERS_DB.
833+
3. If the drop-stop reports are dropped, then the flows remain in the COUNTERS_DB until they are explicitly cleared via the clear command.
834+
835+
However, given that Local mode is used for limited debugging (fewer than 100 flows), the worst-case number of drop reports hitting the CPU is expected to remain below the rate limit of 500pps.
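
The three side effects listed above can be reproduced with a small model of the flow table. This is a sketch under assumptions, not the DropMonitorCollector code: all function names are hypothetical, the table is a plain dictionary keyed by flow id, a drop-stop report is assumed to remove the flow, a drop-active report for an unknown flow is assumed to be ignored, and `OTHER_REASON` is a placeholder rather than a real drop reason.

```python
def apply_report(flows, flow_id, event, drop_reason=None):
    """Apply one received drop report to the in-memory flow table."""
    if event == "drop-start":
        flows[flow_id] = {"state": "dropping", "drop-reason": drop_reason}
    elif event == "drop-active":
        # Assumption: updates for flows never seen via drop-start are ignored.
        if flow_id in flows:
            flows[flow_id]["drop-reason"] = drop_reason
    elif event == "drop-stop":
        # Assumption: a drop-stop report removes the flow from the table.
        flows.pop(flow_id, None)
    else:
        raise ValueError(f"unknown event: {event!r}")


def replay(reports, lost=()):
    """Replay reports in order, skipping the indices discarded by the rate limiter."""
    flows = {}
    for index, (flow_id, event, reason) in enumerate(reports):
        if index not in lost:
            apply_report(flows, flow_id, event, reason)
    return flows


reports = [
    (1, "drop-start", "L3_DEST_MISS"),
    (1, "drop-active", "OTHER_REASON"),  # placeholder drop reason
    (1, "drop-stop", None),
]

print(replay(reports[:2], lost={0}))  # {}: the flow is never reported
print(replay(reports[:2], lost={1}))  # drop-reason stays L3_DEST_MISS
print(replay(reports, lost={2}))      # flow remains until explicitly cleared
```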
836+
732837
## Supported Drop Reasons
733838

734839
The drop reasons supported are as below:-
