Support ASIC/SDK health event #3020

Merged
merged 4 commits into sonic-net:master from asic-sdk-health-event
Apr 29, 2024

Conversation

stephenxs
Collaborator

What I did

Support ASIC/SDK health event

  1. Initialization
    • Fetch capabilities and expose to STATE_DB
    • Register the event handler and categories for each severity when supported
  2. Handle suppression of ASIC/SDK health event categories
  3. Handle ASIC/SDK health events reported by SAI Redis in the callback context
    • Decode it
    • Log message
    • Send event
  4. Eliminate old events of each severity according to users' configuration
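
Step 2 can be sketched roughly as follows: start from the full set of categories and erase each name found in the comma-separated suppression list, skipping names that are not recognized. This is a minimal sketch, not the code in this PR. `HealthCategory`, `kCategoryByName`, and `interestedCategories` are hypothetical names, and the strings "firmware" and "asic_hw" are assumed (only "software" and "cpu_hw" appear in the unit test input).

```cpp
#include <set>
#include <sstream>
#include <string>
#include <unordered_map>

// Hypothetical stand-in for the SAI health-event category enum.
enum class HealthCategory { Software, Firmware, CpuHw, AsicHw };

// Assumed name-to-category map; "firmware" and "asic_hw" are illustrative names.
static const std::unordered_map<std::string, HealthCategory> kCategoryByName = {
    {"software", HealthCategory::Software},
    {"firmware", HealthCategory::Firmware},
    {"cpu_hw", HealthCategory::CpuHw},
    {"asic_hw", HealthCategory::AsicHw},
};

// Start from every category, then drop each one named in the comma-separated
// suppression list. Unknown names are skipped instead of throwing.
std::set<HealthCategory> interestedCategories(const std::string &suppressedList)
{
    std::set<HealthCategory> interested = {
        HealthCategory::Software, HealthCategory::Firmware,
        HealthCategory::CpuHw, HealthCategory::AsicHw};

    std::istringstream tokens(suppressedList);
    std::string name;
    while (std::getline(tokens, name, ','))
    {
        auto it = kCategoryByName.find(name);
        if (it != kCategoryByName.end())
        {
            interested.erase(it->second);
        }
    }
    return interested;
}
```

With the list "software,cpu_hw,invalid_category", the sketch leaves the firmware and ASIC hardware categories registered and ignores the invalid name.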

Signed-off-by: Stephen Sun [email protected]

Why I did it

How I verified it

Unit test.

Details if related

@stephenxs stephenxs changed the title ASIC/SDK health event Support ASIC/SDK health event Jan 23, 2024
@stephenxs stephenxs force-pushed the asic-sdk-health-event branch from 6155182 to 49a17e7 Compare February 2, 2024 03:21
@stephenxs stephenxs force-pushed the asic-sdk-health-event branch from 49a17e7 to 61cee46 Compare February 23, 2024 06:25
@prsunny
Collaborator

prsunny commented Feb 26, 2024

@prabhataravind to review once the PR is ready

@stephenxs stephenxs force-pushed the asic-sdk-health-event branch from f4388b3 to 9291de7 Compare March 14, 2024 09:43
@stephenxs stephenxs marked this pull request as ready for review March 14, 2024 09:43
@stephenxs stephenxs requested a review from prsunny as a code owner March 14, 2024 09:43
@stephenxs stephenxs force-pushed the asic-sdk-health-event branch from 9291de7 to 38e58eb Compare March 19, 2024 14:12
@stephenxs stephenxs force-pushed the asic-sdk-health-event branch 3 times, most recently from 4133a0d to e1bfded Compare March 31, 2024 04:43
@stephenxs
Collaborator Author

/azpw run

@mssonicbld
Collaborator

/AzurePipelines run

Azure Pipelines successfully started running 1 pipeline(s).

@stephenxs stephenxs force-pushed the asic-sdk-health-event branch from e1bfded to 6a7c155 Compare April 1, 2024 22:45
@stephenxs
Collaborator Author

Many covered lines were reported as not covered. Retrying for now.

@stephenxs
Collaborator Author

/azpw run

@mssonicbld
Collaborator

/AzurePipelines run

Azure Pipelines successfully started running 1 pipeline(s).

@prsunny prsunny requested a review from prabhataravind April 15, 2024 18:55
@prsunny
Collaborator

prsunny commented Apr 15, 2024

@kperumalbfn for viz

@stephenxs stephenxs force-pushed the asic-sdk-health-event branch from 6a7c155 to 93acfe5 Compare April 16, 2024 00:23
@stephenxs
Collaborator Author

Looks like the coverage report is not accurate. Retriggered.

@stephenxs
Collaborator Author

/azpw run

@mssonicbld
Collaborator

/AzurePipelines run

Azure Pipelines successfully started running 1 pipeline(s).

@stephenxs
Collaborator Author

Looks like there is an issue in coverage report. Many covered lines were reported as uncovered.

(gdb) bt
#0  SwitchOrch::doCfgSuppressAsicSdkHealthEventTableTask (this=0x5555561431c0, consumer=...) at ../../orchagent/switchorch.cpp:948
#1  0x0000555555a0c691 in SwitchOrch::doTask (this=0x5555561431c0, consumer=...) at ../../orchagent/switchorch.cpp:1008
#2  0x000055555586d712 in Orch::doTask (this=0x5555561431c0) at ../../orchagent/orch.cpp:541
#3  0x000055555583d508 in switchorch_test::SwitchOrchTest_SwitchOrchTestSuppressCategories_Test::TestBody (this=<optimized out>) at switchorch_ut.cpp:158
#4  0x0000555555ceb1a7 in void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) ()
#5  0x0000555555ce143e in testing::Test::Run() ()
#6  0x0000555555ce1595 in testing::TestInfo::Run() ()
#7  0x0000555555ce1a29 in testing::TestSuite::Run() ()
#8  0x0000555555ce2072 in testing::internal::UnitTestImpl::RunAllTests() ()
#9  0x0000555555ceb717 in bool testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) ()
#10 0x0000555555ce1658 in testing::UnitTest::Run() ()
#11 0x00005555556d7050 in main ()

@stephenxs stephenxs force-pushed the asic-sdk-health-event branch from eedc5b6 to bac887f Compare April 18, 2024 13:54
@stephenxs
Collaborator Author

Build failures were caused by a UT failure that I didn't see locally. It may be related to the bookworm docker. Will fix it.

linux-foundation-easycla bot commented Apr 19, 2024

CLA Signed

The committers listed above are authorized under a signed CLA.

@stephenxs stephenxs force-pushed the asic-sdk-health-event branch from 4d94f35 to 51ffd31 Compare April 19, 2024 09:30
@stephenxs
Collaborator Author

Build failures were caused by a UT failure that I didn't see locally. It may be related to the bookworm docker. Will fix it.

Fixed. It was caused by a failure to load the Lua script in the slave docker.

@stephenxs
Collaborator Author

Hi @prsunny
Many covered lines were reported as uncovered. Who can help check this?
Thanks.

orchagent/switchorch.cpp | 34.2% | 148-150,161-162,205-206,228,231,878-879,883,885,887-888,895-896,909,913,915,917-918,920,922-924,926,928,930-932,938,940,942-944,947,949-951,953-954,956,958,960-963,966,968-969,973,975,978,980,983,985,989,992,994,1014,1016,1073,1080-1086,1088,1090,1092,1094-1095,1097-1098,1101-1104,1106,1108,1110-1111,1113,1117,1120,1122,1124,1128-1129,1132-1134,1136,1138,1140,1142,1144,1259,1261-1262,1264
Thread 1 "tests" hit Breakpoint 2, SwitchOrch::doCfgSuppressAsicSdkHealthEventTableTask (this=0x55555a8ad3e0, consumer=...) at ../../orchagent/switchorch.cpp:915
915         SWSS_LOG_ENTER();
(gdb) n
917         auto &map = consumer.m_toSync;
(gdb) 
918         auto it = map.begin();
(gdb) 
920         while (it != map.end())
(gdb) 
922             auto keyOpFieldsValues = it->second;
(gdb) 
923             auto key = kfvKey(keyOpFieldsValues);
(gdb) 
924             auto op = kfvOp(keyOpFieldsValues);
(gdb) 
926             SWSS_LOG_INFO("KEY: %s, OP: %s", key.c_str(), op.c_str());
(gdb) 
928             if (key.empty())
(gdb) 
938                 saiSeverity = switch_asic_sdk_health_event_severity_to_switch_attribute_map.at(key);
(gdb) 
947             if (op == SET_COMMAND)
(gdb) 
949                 bool categoriesConfigured = false;
(gdb) 
950                 bool continueMainLoop = false;
(gdb) 
951                 for (const auto &cit : kfvFieldsValues(keyOpFieldsValues))
(gdb) 
953                     auto fieldName = fvField(cit);
(gdb) 
954                     auto fieldValue = fvValue(cit);
(gdb) 
956                     SWSS_LOG_INFO("FIELD: %s, VALUE: %s", fieldName.c_str(), fieldValue.c_str());
(gdb) 
958                     if (m_supportedAsicSdkHealthEventAttributes.find(saiSeverity) == m_supportedAsicSdkHealthEventAttributes.end())
(gdb) 
966                     if (fieldName == "categories")
(gdb) 
968                         registerAsicSdkHealthEventCategories(saiSeverity, key, fieldValue);
(gdb) 

Thread 1 "tests" hit Breakpoint 1, SwitchOrch::registerAsicSdkHealthEventCategories (this=0x55555a8ad3e0, saiSeverity=SAI_SWITCH_ATTR_REG_WARNING_SWITCH_ASIC_SDK_HEALTH_CATEGORY, severityString="warning", 
    suppressed_category_list="software,cpu_hw,invalid_category", isInitializing=false) at ../../orchagent/switchorch.cpp:878
878             auto &&categories = tokenize(suppressed_category_list, ',');
(gdb) 
879             for (auto category : categories)
(gdb) 
883                     interested_categories_set.erase(switch_asic_sdk_health_event_category_map.at(category));
(gdb) 
879             for (auto category : categories)
(gdb) 
883                     interested_categories_set.erase(switch_asic_sdk_health_event_category_map.at(category));
(gdb) 
879             for (auto category : categories)
(gdb) 
883                     interested_categories_set.erase(switch_asic_sdk_health_event_category_map.at(category));
(gdb) 
Thread 1 "tests" hit Breakpoint 3, SwitchOrch::onSwitchAsicSdkHealthEvent (this=0x55555a8ad3e0, switch_id=141733920768, severity=SAI_SWITCH_ASIC_SDK_HEALTH_SEVERITY_FATAL, timestamp=..., 
    category=SAI_SWITCH_ASIC_SDK_HEALTH_CATEGORY_FW, data=..., description=...) at ../../orchagent/switchorch.cpp:1080
1080        std::vector<swss::FieldValueTuple> values;
(gdb) n
1081        const string &severity_str = switch_asic_sdk_health_event_severity_reverse_map.at(severity);
(gdb) 
1082        const string &category_str = switch_asic_sdk_health_event_category_reverse_map.at(category);
(gdb) 
1083        string description_str;
(gdb) 
1084        const std::time_t &t = (std::time_t)timestamp.tv_sec;
(gdb) 
1085        stringstream time_ss;
(gdb) 
1086        time_ss << std::put_time(std::localtime(&t), "%Y-%m-%d %H:%M:%S");
(gdb) 
1088        switch (data.data_type)
(gdb) 
1092            vector<uint8_t> description_with_terminator(description.list, description.list + description.count);
(gdb) 

1094            description_with_terminator.push_back(0);
(gdb) 
1095            description_str = string(reinterpret_cast<char*>(description_with_terminator.data()));
(gdb) 
525           basic_string(const _CharT* __s, const _Alloc& __a = _Alloc())
(gdb) 
1104                                      description_str.end()))
(gdb) 
1103                                      }),
(gdb) 
1104                                      description_str.end()))
(gdb) 
1097            if (description_str.end() !=
(gdb) 

1092            vector<uint8_t> description_with_terminator(description.list, description.list + description.count);
(gdb) 
1117            { "sai_timestamp", time_ss.str() },
(gdb) 
1120            { "description", description_str }};
(gdb) 
1122        if (0 == gMyAsicName.size())
(gdb) 
1128            SWSS_LOG_NOTICE("[%s] ASIC/SDK health event occurred at %s, asic %s, category %s: %s", severity_str.c_str(), time_ss.str().c_str(), gMyAsicName.c_str(), category_str.c_str(), description_str.c_str());
(gdb) 
1129            params["asic_name"] = gMyAsicName;
(gdb) 
525           basic_string(const _CharT* __s, const _Alloc& __a = _Alloc())
(gdb) 
1132        values.emplace_back("severity", severity_str);
(gdb) 
1133        values.emplace_back("category", category_str);
(gdb) 
1134        values.emplace_back("description", description_str);
(gdb) 
1136        m_asicSdkHealthEventTable->set(time_ss.str(),values);
(gdb) 
525           basic_string(const _CharT* __s, const _Alloc& __a = _Alloc())
(gdb) 
1136        m_asicSdkHealthEventTable->set(time_ss.str(),values);
(gdb) 
525           basic_string(const _CharT* __s, const _Alloc& __a = _Alloc())
(gdb) 
1138        event_publish(g_events_handle, "asic-sdk-health-event", &params);
(gdb) 
525           basic_string(const _CharT* __s, const _Alloc& __a = _Alloc())
(gdb) 
1140        if (severity == SAI_SWITCH_ASIC_SDK_HEALTH_SEVERITY_FATAL)
(gdb) 
1142            m_fatalEventCount++;
(gdb) 
1120            { "description", description_str }};
(gdb) 
1085        stringstream time_ss;
(gdb) 
1083        string description_str;
(gdb) 
1080        std::vector<swss::FieldValueTuple> values;
(gdb) 
1144    }
(gdb) 
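
For readers following the last part of the trace: the handler copies the description bytes, appends a terminator, builds a string from the result, and formats the event timestamp with `std::put_time`. The sketch below restates those two steps in isolation. It is not the code in this PR. `decodeDescription` and `formatTimestamp` are hypothetical names, replacing non-printable characters with '?' is an assumption (the trace does not show what the real code does with them), and `std::gmtime` is used here instead of the `std::localtime` call in the trace so the output does not depend on the host time zone.

```cpp
#include <cctype>
#include <cstddef>
#include <cstdint>
#include <ctime>
#include <iomanip>
#include <sstream>
#include <string>
#include <vector>

// Build a string from a counted byte list that may lack a terminator.
std::string decodeDescription(const std::uint8_t *list, std::size_t count)
{
    // Copy the bytes and append a terminator so the C-string constructor stops safely.
    std::vector<std::uint8_t> withTerminator(list, list + count);
    withTerminator.push_back(0);
    std::string text(reinterpret_cast<const char *>(withTerminator.data()));

    // Assumption: substitute '?' for anything that is not printable.
    for (char &c : text)
    {
        if (!std::isprint(static_cast<unsigned char>(c)))
        {
            c = '?';
        }
    }
    return text;
}

// Format seconds since the epoch as "YYYY-MM-DD HH:MM:SS" in UTC.
std::string formatTimestamp(std::time_t seconds)
{
    std::tm utc = *std::gmtime(&seconds);
    std::ostringstream out;
    out << std::put_time(&utc, "%Y-%m-%d %H:%M:%S");
    return out.str();
}
```

Because the string is built from a C-style pointer, any byte after an embedded zero in the description is dropped, which matches the terminator-based construction seen in the trace.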

@prsunny
Collaborator

prsunny commented Apr 22, 2024

/azp run

Azure Pipelines successfully started running 1 pipeline(s).

@prsunny
Collaborator

prsunny commented Apr 22, 2024

Hi @prsunny Many covered lines were identified as uncovered. who can help to check this? Thanks.


I see coverage works for other PRs. Let's check the latest result.

@stephenxs stephenxs force-pushed the asic-sdk-health-event branch from 51ffd31 to f70164a Compare April 23, 2024 02:17
Contributor

@prabhataravind prabhataravind left a comment

lgtm

@stephenxs
Collaborator Author

/apzw run

@prsunny
Collaborator

prsunny commented Apr 29, 2024

/azp run

Azure Pipelines successfully started running 1 pipeline(s).

@prsunny prsunny merged commit 054ed34 into sonic-net:master Apr 29, 2024
13 checks passed
@stephenxs stephenxs deleted the asic-sdk-health-event branch April 29, 2024 22:22
6 participants