Skip to content

Routeorch calls SAI API set attribute for a routing entry which was just removed from SAI in the same bulk operation. Should recreate it in such case. #9434

Closed
sonic-net/sonic-swss
#2071
@stephenxs

Description

@stephenxs

Description

Orchagent exited due to setting a non-existing routing entry during bgp speaker test with the latest swss
The issue was observed during azure pipeline kvm t0 bgp speaker test of PR #9397.

Steps to reproduce the issue:

Describe the results you received:

The following error observed during bgp speaker test

Nov 30 18:18:03.998387 vlab-01 ERR swss#orchagent: :- meta_sai_validate_route_entry: object key SAI_OBJECT_TYPE_ROUTE_ENTRY:{"dest":"193.9.80.128/25","switch_id":"oid:0x21000000000000","vr":"oid:0x3000000000022"} doesn't exist
Nov 30 18:18:03.998979 vlab-01 ERR swss#orchagent: :- flush_setting_entries: EntityBulker.flush set entry attribute failed, number of entries to set: 1000, status: SAI_STATUS_INVALID_PARAMETER
Nov 30 18:18:03.999571 vlab-01 ERR swss#orchagent: :- meta_sai_validate_route_entry: object key SAI_OBJECT_TYPE_ROUTE_ENTRY:{"dest":"192.236.72.128/25","switch_id":"oid:0x21000000000000","vr":"oid:0x3000000000022"} doesn't exist
Nov 30 18:18:04.000069 vlab-01 ERR swss#orchagent: :- flush_setting_entries: EntityBulker.flush set entry attribute failed, number of entries to set: 1000, status: SAI_STATUS_INVALID_PARAMETER
Nov 30 18:18:04.000690 vlab-01 ERR swss#orchagent: :- meta_sai_validate_route_entry: object key SAI_OBJECT_TYPE_ROUTE_ENTRY:{"dest":"192.254.80.128/25","switch_id":"oid:0x21000000000000","vr":"oid:0x3000000000022"} doesn't exist
Nov 30 18:18:04.001128 vlab-01 ERR swss#orchagent: :- flush_setting_entries: EntityBulker.flush set entry attribute failed, number of entries to set: 1000, status: SAI_STATUS_INVALID_PARAMETER
Nov 30 18:18:04.001549 vlab-01 ERR swss#orchagent: :- meta_sai_validate_route_entry: object key SAI_OBJECT_TYPE_ROUTE_ENTRY:{"dest":"192.188.120.128/25","switch_id":"oid:0x21000000000000","vr":"oid:0x3000000000022"} doesn't exist
Nov 30 18:18:04.001984 vlab-01 ERR swss#orchagent: :- flush_setting_entries: EntityBulker.flush set entry attribute failed, number of entries to set: 321, status: SAI_STATUS_INVALID_PARAMETER
Nov 30 18:18:04.003547 vlab-01 ERR swss#orchagent: :- addRoutePost: Failed to set route 0.0.0.0/0 with packet action forward, -23
Nov 30 18:18:04.003869 vlab-01 ERR swss#orchagent: :- handleSaiSetStatus: Encountered failure in set operation, exiting orchagent, SAI API: SAI_API_ROUTE, status: SAI_STATUS_NOT_EXECUTED

route orchagent operating flow on the routing entries:

  • set,
  • then remove
  • and then set again.

So for the second set operation, route orchagent should call "create" instead of "set".
However, according to sairedis.rec, it called "set" for the second set operation when the routing entries didn't exist in SAI, which caused SAIredis failure.

ROUTE_TABLE in swss.rec:

2021-11-30.18:17:48.627923|ROUTE_TABLE:192.188.120.128/25|SET|nexthop:10.0.0.59,10.0.0.61,10.0.0.63|ifname:PortChannel0002,PortChannel0003,PortChannel0004
2021-11-30.18:17:49.928608|ROUTE_TABLE:192.236.72.128/25|SET|nexthop:10.0.0.59,10.0.0.61,10.0.0.63|ifname:PortChannel0002,PortChannel0003,PortChannel0004
2021-11-30.18:17:50.119089|ROUTE_TABLE:192.254.80.128/25|SET|nexthop:10.0.0.59,10.0.0.61,10.0.0.63|ifname:PortChannel0002,PortChannel0003,PortChannel0004
2021-11-30.18:17:50.380784|ROUTE_TABLE:193.9.80.128/25|SET|nexthop:10.0.0.59,10.0.0.61,10.0.0.63|ifname:PortChannel0002,PortChannel0003,PortChannel0004
2021-11-30.18:17:58.852071|ROUTE_TABLE:192.188.120.128/25|DEL
2021-11-30.18:17:58.872891|ROUTE_TABLE:192.254.80.128/25|DEL
2021-11-30.18:17:58.901643|ROUTE_TABLE:193.9.80.128/25|DEL
2021-11-30.18:17:58.941370|ROUTE_TABLE:192.236.72.128/25|DEL
2021-11-30.18:18:00.723525|ROUTE_TABLE:192.236.72.128/25|SET|nexthop:10.0.0.57|ifname:PortChannel0001
2021-11-30.18:18:00.997692|ROUTE_TABLE:193.9.80.128/25|SET|nexthop:10.0.0.57|ifname:PortChannel0001
2021-11-30.18:18:01.004276|ROUTE_TABLE:192.254.80.128/25|SET|nexthop:10.0.0.57|ifname:PortChannel0001
2021-11-30.18:18:01.455867|ROUTE_TABLE:192.188.120.128/25|SET|nexthop:10.0.0.57|ifname:PortChannel0001

Describe the results you expected:

Should not be error.

Output of show version:

(paste your output here)

Output of show techsupport:

(paste your output here or download and attach the file here )

Additional information you deem important (e.g. issue happens only occasionally):

The flow should be like this.
During the previous test, it announced the routing entries at the beginning and withdrew them after test finished. And then the next test started, the same set of routing entries were announced again.
So for each routing entry P, fpmsyncd issued the next operations:

  • add route P
  • remove route P
  • add route P

routeorch should handle the notifications as below:

  • add route P: call SAI API create_route in bulk mode and then insert P into m_syncdRoutes after bulk call returned
  • remove route P: call SAI API remove_route in bulk mode and then remove it from m_syncdRoutes after bulk call returned
  • add route P: call SAI API create_route in bulk mode and then insert P into m_syncdRoutes after bulk call returned

However, since there were a large amount of routing entries being notified, the notifications of P for remove and the second add were packed and handled in the same bulk call.

Then the flow was:

  • add route P: called SAI API gRouteBulker.create_entry in bulk mode and then inserted P into m_syncdRoutes after bulk call returned
  • remove route P: called SAI API gRouteBulker.remove_entry in bulk mode, the call was pending.
  • add route P: called SAI API gRouteBulker.set_entry_attribute since P had not been removed from m_syncdRoutes, which is problematic because when this operation would be handled by SAI P had been removed.
  • called bulk.flush and then handled the result:

When SAI handled the operations,

  • remove_entry: notified vendor SAI to remove P and removed P from m_saiObjectCollection as well.
  • set_entry_attribute: reported error since P had been removed from m_saiObjectCollection when remove_entry was called for P. Therefore, it notified orchagent to exit.

I think this is a bug in the routeorch for a while but it occurs only in the very rare scenarios.
It is observed after PR sonic-net/sonic-swss#1992 was merged because in that PR there is an optimization that pushes notifications from redis-db table to m_toSync as many as possible:
The old logic:

        std::deque<KeyOpFieldsValuesTuple> entries;
        subTable->pops(entries);
        return addToSync(entries);

The new logic:

        do
        {
            std::deque<KeyOpFieldsValuesTuple> entries;
            subTable->pops(entries);
            update_size = addToSync(entries);
            total_size += update_size;
        } while (update_size != 0);
        return total_size;

So in the new logic, there is a higher probability for orchagent to merge all notifications together and for remove and the second set to be bundled in one bulk, and therefore, it is more likely to trigger the bug.

A solution can be:

  • Introduce a flag indicating whether a prefix is pending removal.
  • In add route process, if a prefix with the pending removal flag is about to be added, don't handle it until the bulk has been flushed.

WAs can be:

  • Revert the optimization
  • Add delay between two tests where orchagent was aborted.

Metadata

Metadata

Assignees

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions