Commit c5fd06a

[Auto-Techsupport] Minor Changes to Auto-Techsupport Feature Documented (#990)
Changes include the following:
- APP Extension Registry/De-Registry
- Remove the stale information related to SONiC-SONiC upgrade
- Section related to Tech-support locking mechanism
1 parent 05de04c commit c5fd06a

1 file changed, 48 additions and 107 deletions

doc/auto_techsupport_and_coredump_mgmt.md

@@ -7,25 +7,29 @@
 * [1. Overview](#1-overview)
 * [2. High Level Requirements](#2-high-level-requirements)
 * [3. Core Dump Generation in SONiC](#3-core-dump-generation-in-sonic)
-* [4. Schema Additions](#4-schema-additions)
-* [6. CLI Enhancements](#5-cli-enhancements)
+* [4. Memory usage based techsupport invocation](#4-Memory-usage-based-techsupport-invocation)
+* [5. Schema Additions](#5-schema-additions)
+* [6. CLI Enhancements](#6-cli-enhancements)
 * [7. Design](#6-design)
-* [6.1 Modifications to coredump-compress script](#61-Modifications-to-coredump-compress-script)
-* [6.2 coredump_gen_handler script](#62-coredump_gen_handler-script)
-* [6.3 Modifications to generate_dump script](#64-Modifications-to-generate-dump-script)
-* [6.4 techsupport_cleanup script](#65-techsupport_cleanup-script)
-* [6.5 Warmboot consideration](#65-Warmboot-consideration)
-* [6.6 MultiAsic consideration](#66-MultiAsic-consideration)
-* [6.7 Design choices for max-techsupport-limit & max-techsupport-limit arguments](#67-Design-choices-for-max-core-limit-&-max-techsupport-limit-arguments)
-* [8. Test Plan](#7-Test-Plan)
-* [9. SONiC-to-SONiC Upgrade Considerations](#8-SONiC-to-SONiC-Upgrade-Considerations)
+* [7.1 Modifications to coredump-compress script](#71-Modifications-to-coredump-compress-script)
+* [7.2 coredump_gen_handler script](#72-coredump_gen_handler-script)
+* [7.3 Modifications to generate_dump script](#73-Modifications-to-generate-dump-script)
+* [7.4 techsupport_cleanup script](#74-techsupport_cleanup-script)
+* [7.5 Warmboot consideration](#75-Warmboot-consideration)
+* [7.6 MultiAsic consideration](#76-MultiAsic-consideration)
+* [7.7 Design choices for max-techsupport-limit & max-techsupport-limit arguments](#77-Design-choices-for-max-core-limit-&-max-techsupport-limit-arguments)
+* [7.8 Techsupport Locking](#78-Techsupport-Locking)
+* [8. Test Plan](#8-Test-Plan)
+* [9. SONiC-to-SONiC Upgrade Considerations](#9-SONiC-to-SONiC-Upgrade-Considerations)
+* [10. App Extension Consideration](#9-App-Extension-Considerations)
+* [11. Open questions](#10-Open-questions)


 ### Revision
 | Rev | Date | Author | Change Description |
 |:---:|:-----------:|:-------------------------|:----------------------|
 | 1.0 | 06/22/2021 | Vivek Reddy Karri | Auto Invocation of Techsupport, triggered by a core dump |
-| 1.1 | TBD | Vivek Reddy Karri | Add the capability to Register/Deregister app extension to AUTO_TECHSUPPORT_FEATURE table |
+| 1.1 | 04/08/2022 | Vivek Reddy Karri | Add the capability to Register/Deregister app extension to AUTO_TECHSUPPORT_FEATURE table |
 | 2.0 | TBD | Vivek Reddy Karri | Extending Support for Kernel Dumps |
 | 3.0 | 02/2022 | Stepan Blyshchak | Extending Support for memory usage threshold crossed |

@@ -380,13 +384,13 @@ sonic_dump_r-lionfish-16_20210901_222408 teamd python3.1630535045.34.

 ```

-## 6. Design
+## 7. Design

-### 6.1 Modifications to coredump-compress script
+### 7.1 Modifications to coredump-compress script

 The coredump-compress script is updated to invoke the `coredump_gen_handler` script once it is done writing the core file to /var/core. Any stdout/stderr seen during the execution of `coredump_gen_handler` script is redirected to `/tmp/coredump_gen_handler.log`. This script is enhanced to determine which container the dump belongs to and passes it to the coredump_gen_handler script.

-### 6.2 coredump_gen_handler script
+### 7.2 coredump_gen_handler script

 A script under the name `coredump_gen_handler.py` is added to `/usr/local/bin/` directory which will be invoked after a coredump is generated. The script first checks if this feature is enabled by the user. The script then verifies if a core dump file is created within the last 20 sec and if yes, it moves forward.
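The recency check described for `coredump_gen_handler.py` in the hunk above could look roughly like the sketch below. It is an illustration only, not the script's actual code: the function name `recent_core_files` and its parameters are hypothetical, and using file modification time as the "created" timestamp is an assumption.

```python
import os
import time

CORE_DUMP_DIR = "/var/core"   # directory named in the design text
TIME_WINDOW_SEC = 20          # "within the last 20 sec"


def recent_core_files(core_dir=CORE_DUMP_DIR, window=TIME_WINDOW_SEC, now=None):
    """Return the files in core_dir modified within the last `window` seconds.

    Hypothetical helper: mtime stands in for the creation time.
    """
    now = time.time() if now is None else now
    recent = []
    try:
        entries = os.listdir(core_dir)
    except OSError:
        # Directory missing or unreadable: nothing to act on.
        return recent
    for name in entries:
        path = os.path.join(core_dir, name)
        if not os.path.isfile(path):
            continue
        if now - os.path.getmtime(path) <= window:
            recent.append(path)
    return sorted(recent)
```

A handler built this way would continue only when the returned list is non-empty.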

@@ -405,11 +409,11 @@ DATE sonic NOTICE coredump_gen_handler[pid]: coredump_cleanup is disabled. No c
 DATE sonic ERR coredump_gen_handler[pid]: "show techsupport --since '2 days ago'" was run, but no techsupport dump is found
 ```

-### 6.3 Modifications to generate_dump script
+### 7.3 Modifications to generate_dump script

 The generate_dump script is updated to invoke the `techsupport_cleanup` script to handle the cleanup of techsupport files. Any stdout/stderr seen during the execution of `techsupport_cleanup` script is redirected to `/tmp/coredump_gen_handler.log`

-### 6.4 techsupport_cleanup script
+### 7.4 techsupport_cleanup script

 A script under the name `techsupport_cleanup.py` is added to `/usr/local/bin/` directory which will be invoked after a techsupport dump is created. The script first checks if the feature is enabled by the user. It then checks if the limit configured by the user has crossed and deletes the old techsupport files, if any.
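As a rough illustration of the size-based cleanup that `techsupport_cleanup.py` is described as performing, the sketch below deletes the oldest files in a directory until the total size falls under a byte limit. The function name `cleanup_oldest` and the byte-based limit are assumptions made for this example; the design expresses the limit as a percentage of disk space, which a caller would first convert to bytes.

```python
import os


def cleanup_oldest(dump_dir, max_bytes):
    """Delete the oldest files in dump_dir until total size <= max_bytes.

    Hypothetical helper; returns the list of deleted paths, oldest first.
    """
    files = []
    for name in os.listdir(dump_dir):
        path = os.path.join(dump_dir, name)
        if os.path.isfile(path):
            st = os.stat(path)
            files.append((st.st_mtime, path, st.st_size))
    files.sort()  # oldest modification time first
    total = sum(size for _, _, size in files)
    deleted = []
    for _, path, size in files:
        if total <= max_bytes:
            break
        os.remove(path)
        total -= size
        deleted.append(path)
    return deleted
```

Deleting oldest-first keeps the most recent dumps, which are usually the ones still under investigation.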

@@ -419,17 +423,17 @@ DATE sonic NOTICE techsupport_cleanup[pid]: techsupport_cleanup is disabled. No
 DATE sonic INFO coredump_gen_handler[pid]: max-techsupport-size argument is not set. No Cleanup is performed, current size occupied: 456 MB
 ```

-### 6.5 Warmboot consideration
+### 7.5 Warmboot consideration

 No changes to this flow

-### 6.6 MultiAsic consideration
+### 7.6 MultiAsic consideration

 Configuration specified for the default feature name in the AUTO_TECHSUPPORT_FEATURE table is applied across all the masic instances.

 i.e. rate_limit_interval defined in the AUTO_TECHSUPPORT_FEATURE|swss key is applied for swss1, swss2, etc

-### 6.7 Design choices for max-techsupport-limit & max-techsupport-limit arguments
+### 7.7 Design choices for max-techsupport-limit & max-techsupport-limit arguments

 Firstly, Size-based cleanup design was inspired from MaxUse= Argument in the systemd-coredump.conf https://www.freedesktop.org/software/systemd/man/coredump.conf.html
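The MultiAsic rule in the hunk above (the `swss` entry also covers `swss1`, `swss2`, and so on) can be pictured with the sketch below. Stripping a trailing ASIC index from the container name is an assumption about how the mapping might be done, and both function names are hypothetical.

```python
import re


def feature_key(container_name):
    """Map a per-ASIC container name such as 'swss1' to its feature key 'swss'.

    Assumption: the ASIC instance is encoded as a trailing decimal index.
    """
    return re.sub(r"\d+$", "", container_name)


def rate_limit_for(container_name, feature_table, default=None):
    """Look up rate_limit_interval for a container via its default feature name."""
    entry = feature_table.get(feature_key(container_name), {})
    return entry.get("rate_limit_interval", default)
```

For example, with `{"swss": {"rate_limit_interval": "600"}}` as the table, both `swss` and `swss1` resolve to `"600"`.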

@@ -452,7 +456,15 @@ A default value of 5% would amount to a minimum of 500 MB which is a already a d

 Although if the admin feels otherwise, these values are configurable.

-## 7. Test Plan
+### 7.8 Techsupport Locking
+
+Recently, an enhancement was made to the techsupport script so that only one instance runs at a time, using a locking mechanism. When another instance of the techsupport script tries to run, it exits with a relevant exit code. This applies regardless of how techsupport was invoked, i.e. manually or through auto-techsupport.
+
+With this change, a rate-limit-interval of zero would not make any difference. The locking mechanism implicitly imposes a minimum rate-limit-interval equal to the techsupport execution time. Since the techsupport execution time can't be determined in advance and varies with the underlying machine and system state, the range of values configurable for the rate-limit-interval is left unchanged.
+
+A relevant message will be logged to syslog when the invocation fails because of the LOCKFAIL exit code.
+
+## 8. Test Plan

 Enhance the existing techsupport sonic-mgmt test with the following cases.
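To make the single-instance behaviour in section 7.8 concrete, here is a minimal sketch of a non-blocking advisory lock using `fcntl.flock`. It does not reproduce the techsupport script's actual mechanism: the lock path, the `EXIT_LOCKFAIL` value, and the function names are all hypothetical.

```python
import fcntl
import os

LOCK_PATH = "/tmp/techsupport-example.lock"  # hypothetical lock file
EXIT_LOCKFAIL = 2                            # hypothetical exit code value


def try_lock(path=LOCK_PATH):
    """Return an fd holding an exclusive lock, or None if another holder exists."""
    fd = os.open(path, os.O_CREAT | os.O_RDWR, 0o644)
    try:
        # LOCK_NB makes the call fail immediately instead of waiting.
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        os.close(fd)
        return None
    return fd


def release(fd):
    """Drop the lock and close the descriptor."""
    fcntl.flock(fd, fcntl.LOCK_UN)
    os.close(fd)
```

A caller that gets `None` back would log a message and exit with the lock-failure code. This is why a rate-limit-interval of zero has no practical effect: a second invocation cannot start until the first releases the lock.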

@@ -462,96 +474,25 @@ Enhance the existing techsupport sonic-mgmt test with the following cases.
 | 2 | Check if the techsupport cleanup is working as expected |
 | 3 | Check if the global rate-limit-interval & per-process rate-limit-interval is working as expected |
 | 4 | Check if the core-dump cleanup is working as expected |
-| 5 | Check if the core-dump generated when reaching memory threshold |
-## 8. SONiC-to-SONiC Upgrade Considerations
+| 5 | Check if the core-dump generated when reaching memory threshold |

-The default config required for auto_techsupport is present in the init_cfg.json. Therefore, when a clean installation of SONiC is performed, the configuration is found in the config DB and the feature is active.
+## 9. SONiC-to-SONiC Upgrade Considerations

-However, in the case of SONiC-SONiC upgrade, the previous config_db.json is migrated and init_cfg.json is not involved. In that case, it is the responsibility of the admin to provide the config, if the admin wants to leverage this feature.
+The configuration in the init_cfg.json is loaded to the running config, i.e. CONFIG_DB, even in the case of a SONiC-SONiC upgrade from an older image which doesn't support this feature.

-Load this Example config provided below to enable the feature. Each of the fields are explained in Section 4 and can be modified accordingly
+### 10 App Extension Considerations
+
+Detailed info related to Application Extension can be found here: https://github.com/Azure/SONiC/blob/master/doc/sonic-application-extension/sonic-application-extention-hld.md
+
+A new AUTO_TECHSUPPORT_FEATURE register/deregister option will be introduced. The existing FeatureRegistry class will be enhanced to also add/delete configuration related to the AUTO_TECHSUPPORT_FEATURE table.
+
+This will be run when the application installs/uninstalls. Since the auto-techsupport feature uses a compile-time flag to determine whether to enable/disable itself, it is not possible to determine that at runtime when the application is installed.
+
+Thus the decision whether to enable the new feature will be based on the current values of the AUTO_TECHSUPPORT & AUTO_TECHSUPPORT_FEATURE tables. The new feature will default to disabled if the global state is shown as disabled in init_cfg.json. If not, the feature will be enabled. The rate-limit-interval & memory threshold are set to 600 & 10% by default.
+
+When the app gets uninstalled, all the config will be cleared unless the keep-config option is used.

-```
-{
-    "AUTO_TECHSUPPORT": {
-        "GLOBAL": {
-            "state": "enabled",
-            "rate_limit_interval": "180",
-            "max_techsupport_limit": "10.0",
-            "max_core_limit": "5.0",
-            "available_mem_threashold": "10.0",
-            "since": "2 days ago"
-        }
-    },
-    "AUTO_TECHSUPPORT_FEATURE": {
-        "bgp": {
-            "state": "enabled",
-            "rate_limit_interval": "600"
-        },
-        "database": {
-            "state": "enabled",
-            "rate_limit_interval": "600"
-        },
-        "lldp": {
-            "state": "enabled",
-            "rate_limit_interval": "600"
-        },
-        "pmon": {
-            "state": "enabled",
-            "rate_limit_interval": "600"
-        },
-        "radv": {
-            "state": "enabled",
-            "rate_limit_interval": "600"
-        },
-        "snmp": {
-            "state": "enabled",
-            "rate_limit_interval": "600"
-        },
-        "swss": {
-            "state": "enabled",
-            "rate_limit_interval": "600"
-        },
-        "syncd": {
-            "state": "enabled",
-            "rate_limit_interval": "600"
-        },
-        "teamd": {
-            "state": "enabled",
-            "rate_limit_interval": "600"
-        },
-        "dhcp_relay": {
-            "state": "enabled",
-            "rate_limit_interval": "600"
-        },
-        "mgmt-framework": {
-            "state": "enabled",
-            "rate_limit_interval": "600"
-        },
-        "mux": {
-            "state": "enabled",
-            "rate_limit_interval": "600"
-        },
-        "nat": {
-            "state": "enabled",
-            "rate_limit_interval": "600"
-        },
-        "sflow": {
-            "state": "enabled",
-            "rate_limit_interval": "600"
-        },
-        "macsec": {
-            "state": "enabled",
-            "rate_limit_interval": "600"
-        },
-        "telemetry": {
-            "state": "enabled",
-            "rate_limit_interval": "600"
-        }
-    }
-}
-```

-# Open question
+## 11. Open questions

 1. Is 10 % free memory/90 % used memory threshold a reasonable default?
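The register/deregister behaviour described under App Extension Considerations in the hunk above might be sketched as follows. This is not the FeatureRegistry implementation: the function names, the in-memory `cfg` dictionary standing in for CONFIG_DB, and the per-feature key `available_mem_threshold` are assumptions made for illustration. The defaults of 600 and 10% come from the text.

```python
DEFAULT_RATE_LIMIT_INTERVAL = "600"   # default stated in the text
DEFAULT_MEM_THRESHOLD = "10.0"        # 10% default stated in the text


def default_feature_entry(global_state):
    """Build the default per-feature entry from the global state."""
    state = "disabled" if global_state == "disabled" else "enabled"
    return {
        "state": state,
        "rate_limit_interval": DEFAULT_RATE_LIMIT_INTERVAL,
        # Hypothetical key name for the per-feature memory threshold.
        "available_mem_threshold": DEFAULT_MEM_THRESHOLD,
    }


def register(cfg, name):
    """Add a feature entry on app install; an existing entry is left untouched."""
    global_state = cfg.get("AUTO_TECHSUPPORT", {}).get("GLOBAL", {}).get("state")
    table = cfg.setdefault("AUTO_TECHSUPPORT_FEATURE", {})
    table.setdefault(name, default_feature_entry(global_state))


def deregister(cfg, name, keep_config=False):
    """Remove the feature entry on app uninstall unless keep_config is set."""
    if not keep_config:
        cfg.get("AUTO_TECHSUPPORT_FEATURE", {}).pop(name, None)
```

Leaving an existing entry untouched on re-registration is a design assumption here, chosen so that configuration preserved with keep-config survives a reinstall.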
