-
Notifications
You must be signed in to change notification settings - Fork 1.4k
zebra: kernel level graceful restart functionality. #4197
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
💚 Basic BGPD CI results: SUCCESS, 0 tests failedResults table
For details, please contact louberger |
Continuous Integration Result: SUCCESSFULCongratulations, this patch passed basic tests Tested-by: NetDEF / OpenSourceRouting.org CI System CI System Testrun URL: https://ci1.netdef.org/browse/FRR-FRRPULLREQ-7356/ This is a comment from an automated CI system. Warnings Generated during build:Checkout code: Successful with additional warnings
CLANG Static Analyzer Summary
12 Static Analyzer issues remaining.See details at |
Thanks for your contribution! Could you break this up into a few logically separate commits? I'd put the lib and debug changes into separate commits. And then maybe the startup options. |
so this looks like a basic form of graceful restart that we know other people are working on. Let me see if I can get them over here and looking at what you have done. |
@donaldsharp: Sure Donald, Thanks. |
Hi Stephen After getting some reviews on startup part, I will break it into multiple commits if needed. It is not a big diff I feel, so it should be easy to through in single shot. |
@donaldsharp @srimohans @qlyoung Hi Folks, I request you to review this PR or get attention of right person. We use FRR on SONIC (https://azure.github.io/SONiC/), and it is critical to us that when we perform s/w upgrade to FRR, there is no down time. If we can add this PR as part of main FRR repo then we can continue use FRR without any additional maintenance work. We do not want to keep this fix as extra patch, which we need to apply for each FRR version while deploying, that would be cumbersome. Kindly review and let me know if any thing can be improved in diff, Thanks a lot. |
I'm not a big fan of calling this Additionally I would like to see some thought from @riw777 about what a good length of time should be for holding the stale routes |
@donaldsharp : I decided to start with kernel_reconcile, because we do not throw old kernel routes till timer_expire. We try to reconcile first with latest update from protocols. I am happy to choose any good name. We have to choose name for 2 main things: It may be that, this fix can be changed, as you mentioned there are folks working on similar solution. But till then, this fix is critical to us as part of FRR, which we deploy over SONiC. Thanks for looking into it. |
I think this might be easier to find in the future if it were renamed to something like "kernel level graceful restart support" or something... the current name is going to make it hard to figure out this is related to the protocol level graceful restart capabilities. |
Thanks Russ, |
I am planning to go for following logical commits to address review comments: 1.) Move statistic variable in zebra_router.h and may change name for these variables. |
💚 Basic BGPD CI results: SUCCESS, 0 tests failedResults table
For details, please contact louberger |
Continuous Integration Result: SUCCESSFULCongratulations, this patch passed basic tests Tested-by: NetDEF / OpenSourceRouting.org CI System CI System Testrun URL: https://ci1.netdef.org/browse/FRR-FRRPULLREQ-7416/ This is a comment from an automated CI system. Warnings Generated during build:Checkout code: Successful with additional warnings
CLANG Static Analyzer Summary
No Changes in Static Analysis warnings compared to base12 Static Analyzer issues remaining.See details at |
💚 Basic BGPD CI results: SUCCESS, 0 tests failedResults table
For details, please contact louberger |
Continuous Integration Result: SUCCESSFULCongratulations, this patch passed basic tests Tested-by: NetDEF / OpenSourceRouting.org CI System CI System Testrun URL: https://ci1.netdef.org/browse/FRR-FRRPULLREQ-7419/ This is a comment from an automated CI system. Warnings Generated during build:Checkout code: Successful with additional warnings
CLANG Static Analyzer Summary
No Changes in Static Analysis warnings compared to base12 Static Analyzer issues remaining.See details at |
Pending Tasks: |
💚 Basic BGPD CI results: SUCCESS, 0 tests failedResults table
For details, please contact louberger |
Continuous Integration Result: SUCCESSFULCongratulations, this patch passed basic tests Tested-by: NetDEF / OpenSourceRouting.org CI System CI System Testrun URL: https://ci1.netdef.org/browse/FRR-FRRPULLREQ-7422/ This is a comment from an automated CI system. Warnings Generated during build:Checkout code: Successful with additional warnings
CLANG Static Analyzer Summary
No Changes in Static Analysis warnings compared to base12 Static Analyzer issues remaining.See details at |
@praveen-li once you are done with pending tasks let me know. I will check, go ahead and approve. |
@srimohans Sure, I will add support for multicast and will run my tests ( which are only for unicast as of now), and then I will approach you for approval. Thanks for actively looking into it. I appreciate it.
|
@praveen-li me, as well... |
lib: zebra kernel stale route flag. Signed-off-by: Praveen Chaudhary <[email protected]>
Signed-off-by: Praveen Chaudhary <[email protected]>
and add kernel_reconcile_rt in if condition. Signed-off-by: Praveen Chaudhary <[email protected]>
Make kernel_gr_timer configurable. lib: Change the name for zebra flag.
Signed-off-by: Praveen Chaudhary [email protected]
1da299f
to
92fdecd
Compare
@riw777 @srimohans @donaldsharp Hi Reviewers, I have done with all the changes. Kindly note: for multicast, there is no work needed, since we do not push those routes to Kernel. I confirmed the same with Donald. Earlier I was trying to create actual topology to learn routes from peers, but I found in this case, sharpd is best way to test quickly. Copied below a summary for few tests. Kindly review. Thanks a lot for help. Note: I tested with 900K max routes. Test case 1.) Add 12 k routes from sharpd and then restart zebra with -K 35 option. After restart, enter new routes as shown below. Routes before restart: Routes after restart: Expectation: Restart options:
Test case 2.) Add 14 k routes from sharpd and then restart zebra with -K 55 option secs. After restart add no routes. Expectation: Routes before restart Restart options:
Test 3: Test with 900 K routes, with -K 25. Install same 900K routes after restart. Routes before and after restart: Expectation: ZEbra logs and kernel Routes:
Test 4: Test with 900 K routes, with -K 20. Do not install any routes after restart. Routes before restart: Expectation: ZEbra logs and kernel Routes:
|
💚 Basic BGPD CI results: SUCCESS, 0 tests failedResults table
For details, please contact louberger |
Continuous Integration Result: SUCCESSFULCongratulations, this patch passed basic tests Tested-by: NetDEF / OpenSourceRouting.org CI System CI System Testrun URL: https://ci1.netdef.org/browse/FRR-FRRPULLREQ-7491/ This is a comment from an automated CI system. Warnings Generated during build:Checkout code: Successful with additional warnings
CLANG Static Analyzer Summary
No Changes in Static Analysis warnings compared to base12 Static Analyzer issues remaining.See details at |
nice feature. |
Thanks Philippe, Yes after my last commit, I made it configurable with all positive values allowed. Also I updated test results in my last comment with different # of routes and with different timer values. Cheers. |
Signed-off-by: Praveen Chaudhary [email protected]
Continuous Integration Result: SUCCESSFULCongratulations, this patch passed basic tests Tested-by: NetDEF / OpenSourceRouting.org CI System CI System Testrun URL: https://ci1.netdef.org/browse/FRR-FRRPULLREQ-7560/ This is a comment from an automated CI system. Warnings Generated during build:Checkout code: Successful with additional warnings
CLANG Static Analyzer Summary
No Changes in Static Analysis warnings compared to base12 Static Analyzer issues remaining.See details at |
💚 Basic BGPD CI results: SUCCESS, 0 tests failedResults table
For details, please contact louberger |
@srimohans, @donaldsharp, @riw777 Hi Reviewers Feel free to let me know, if any thing pending from my end. As mentioned, I have not added yield in rib_sweep_table as of now, because It is a common code. And yield may have issues while marking route as stale specially during client<->zebra GR, where we may mark new route as stale. So yill further comment, I will assume approve\merge of this feature will be decided after Client<->Zebra design. Thanks for reviewing this. |
Praveen, we had concluded on the design for clietn-Zebra interaction and it will be time stamp based. @donaldsharp will syncup with you on if your code needs any redesign. I can approve after that. I don't have any more comments on this PR. Thanks, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This needs a specification doc that bolts down how exactly things are supposed to work, just so responsibilities & states are clear on shutdown & startup
pretty please :)
(same comment as on #4301)
I have a design doc at this location: I can modify it just to have FRR information. |
same comment as #4301 => we want this to be part of the config instead of a startup flag => but for that we need to know when we've finished loading config => working on adding a "config done" mechanism |
@praveen-li interesting, let me find some time to read that & see if it answers the GR details i'm looking for |
@eqvinox : Sure, Kindly note that PR 4301 is based on this PR, so I may close this PR later. And only 4301 will make the final cut. Also, my document includes details of how we were trying to solve it in SONiC so you may skip few topics. Thanks, |
FRR doc without sonic details. |
This comment has been minimized.
This comment has been minimized.
this needs a rebase |
the alternative approach was already committed. This should be closed |
alternate implementation was committed |
zebra: kernel level graceful restart functionality.
lib: zebra kernel stale route flag.
Signed-off-by: Praveen Chaudhary [email protected]
Problem Statement:
On some systems FRR runs inside a docker with other process with it, mainly FPM Related. While s/w upgrade to FRR docker, minimum disturbance to Control Plane Route (i.e Kernel Route) should happen. Today with -r flag or kill -9, Zebra Route are retained in kernel. But during startup Zebra
i.) Without -k flag: Cleans all kernel routes and insert them back if learned via BGPD/OSPF etc again. This may take 10-15 secs.
ii.) With -k flags: Zebra rib contains stale routes, which will not be deleted ever.
Solution Proposed:
Perform Kernel level graceful restart with routes learned from kernel during start up.
Steps:
Perform below if Zebra runs with -K flag .(capital K)
1.) Add a special flag to all routes learned from kernel during start up.
2.) Start a timer of 60 (Configurable) secs to clean stale kernel routes.
3.) If within 60 secs, a route is learned for same prefix (from same or different protocol), then call rib_delnode() on old kernel route. This way rib_process() will always choose new route.
4.) After 60 secs, sweep tables and delete remaining old kernel routes until all stale kernel routes are marked for deletion.
Note:
. -K is chosen to keep it similar to -k. Both deals with old kernel routes.
. Since FRR may run different protocol after startup, so a timer period should be chosen without keeping any particular protocol in mind. I chose 60 secs as per Russ White`s advice. Any reasonable value is good. Also, It is configurable now.
Thanks a lot. Kindly Review:
Test Results: (I have not pasted scaled test result, not to make description so lengthy). [Test results with 900K routes are copied in one of the comment below after latest commit.]
———————————————————————————————————————
Testing for FRR Master:
--Setup includes--
2 Arista BGP Peer.
DUT (Debian Kernel)
———————————————————————————————————————
Routes are published to DUT from 2 BGP peers: (Track 192.168.0.* routes)
Peer1:
Peer2:
Zebra DUT:
Kernel Route: route -n
Stop FRR
Changes in Published route
Start FRR
Zebra logs:
Note: All logs may not be part of final diff
Ip monitor:
Zebra DUT:
Kernel: