tests/k8s: fix kbs installation on Azure AKS #11164
Conversation
Hi @Xynnn007 , let's see if it fixes things in the CI.
OK, it still failed... I will change the code to wait for the IP address to show up.
JFYI I merged #11157
The latest CI execution showed that this fixed the KBS installation; however, the job got "red" because tests failed. So the code is ready for review. In the meantime I will re-trigger the job to see if the test failures were caused by infra or not.
Rebased to main and squashed commits. Ready to review.
KBS is installed correctly but tests are still failing. I cannot think of a reason why my changes would break the tests, so it seems to be some instability in the CI. I suggest merging this PR and then keeping an eye on the next nightly jobs.
I wouldn't be so quick to dismiss this as infra instability - I'm seeing tests failing consistently that I haven't seen before. It seems like this is the common failure mode now:
Also this seems suspicious:
Is it possible that this PR is introducing new control plane containers that eat into our CPU limits?
Hi @sprt !
You are right! Thanks for the heads up!
I followed the instructions in https://learn.microsoft.com/en-us/azure/aks/app-routing , and essentially I'm creating the ingress resource with . It will create nginx pods as a result, under the hood. I don't know why one of them is pending.
It is a sound hypothesis. I tested with the Azure subscription granted to me by Red Hat, and I didn't notice that problem. Well, I also didn't bother to run the tests either... let me put this PR on draft and see if I can reproduce it locally. In the meantime, if you have any tips about the problem and solution, please talk to your little brother here ;)
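For reference, a minimal sketch of an ingress handled by the App Routing add-on might look like the following (resource names, namespace, and the backend service are illustrative assumptions; the `webapprouting.kubernetes.azure.com` class is what the add-on registers, per the linked Microsoft docs):

```yaml
# Hypothetical ingress routed through the AKS App Routing add-on's
# managed nginx. Names and the backend service are examples only.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: kbs-ingress        # example name
  namespace: coco-tenant   # example namespace
spec:
  ingressClassName: webapprouting.kubernetes.azure.com
  rules:
  - http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: kbs      # example backend service
            port:
              number: 8080
```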
Ok, after further investigation my findings are:
I'm trying to avoid deploying our own nginx ingress, but honestly I'm not seeing many alternatives. Ideas?
The AKS CLI recently introduced a regression that prevents using aks-preview extensions (Azure/azure-cli#31345), and hence creating CI clusters. To address this, we temporarily hardcode the last known good version of aks-preview. Note that I removed the comment about this being a Mariner requirement, as aks-preview is also a requirement of AKS App Routing, which will be introduced soon in #11164. Signed-off-by: Aurélien Bombo <[email protected]>
(Heads up: changing any label will trigger a rerun of the CI, so we're seeing different results right now.)
If I'm understanding correctly, the added nginx pod maxes out the node CPU capacity. How about moving the affected tests to the |
You got it correctly. Either we (somehow) lower the CPU requested by each nginx pod or we increase the worker node size. I don't know how to lower the CPU requests of the Azure managed nginx ingress, so we can try to increase the VM SKU as you suggested, @sprt. How? Is it configured on the subscription?
This is actually defined in the CI scripts below. The "normal" VM SKU has 4 vCPUs, the "small" one has 2. I would:
kata-containers/tests/integration/kubernetes/run_kubernetes_tests.sh Lines 36 to 99 in 1de466f
kata-containers/tests/gha-run-k8s-common.sh Lines 39 to 51 in 1de466f
The Azure AKS addon-http-application-routing add-on is deprecated and cannot be enabled on new clusters which has caused some CI jobs to fail. Migrated our code to use approuting instead. Unlike addon-http-application-routing, this add-on doesn't configure a managed cluster DNS zone, but the created ingress has a public IP. To avoid having to deal with DNS setup, we will be using that address from now on. Thus, some functions no longer used are deleted. Fixes kata-containers#11156 Signed-off-by: Wainer dos Santos Moschetta <[email protected]>
An AKS managed ingress controller is used, which keeps two nginx pod replicas, each requesting 500m of CPU. On the small VMs we've used in the CI for running the CoCo non-TEE tests, that leaves only a small amount of CPU for the tests. Actually, one of these pod replicas won't even get started. So let's patch the ingress controller to keep only one replica of nginx. Signed-off-by: Wainer dos Santos Moschetta <[email protected]>
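As a sketch, reducing the managed controller to a single replica could be done with a patch along these lines (the deployment name and namespace below are assumptions for illustration; the CI script targets whatever resource the App Routing add-on actually creates):

```yaml
# Hypothetical strategic-merge patch lowering the managed nginx
# ingress controller to a single replica, e.g. applied with:
#   kubectl patch deployment nginx -n app-routing-system --patch-file patch.yaml
# (deployment name and namespace are assumptions)
spec:
  replicas: 1
```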
_print_instance_type() returns the instance type of the AKS nodes, based on the host type. Tests are grouped per host type into "small" and "normal" sets based on their CPU requirements: "small" tests require few CPUs and "normal" ones more. There is a 3rd case: the "all" host type maps to the union of the "small" and "normal" tests, which should be handled by _print_instance_type() properly. In this case, it should return the largest instance type possible, because the "normal" tests will be executed too. Signed-off-by: Wainer dos Santos Moschetta <[email protected]>
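The logic described in that commit message can be sketched as follows (the SKU names and the `K8S_TEST_HOST_TYPE` variable name are illustrative assumptions, not necessarily what the CI scripts actually use; the vCPU counts match the "small has 2, normal has 4" description above):

```shell
#!/usr/bin/env bash
# Hypothetical sketch of a host-type -> instance-type mapping.
_print_instance_type() {
    case "${K8S_TEST_HOST_TYPE:-}" in
        small)
            echo "Standard_D2s_v5"   # 2 vCPUs (example SKU)
            ;;
        normal|all)
            # "all" also runs the "normal" tests, so it must pick
            # the largest instance type.
            echo "Standard_D4s_v5"   # 4 vCPUs (example SKU)
            ;;
        *)
            echo "Unknown host type: '${K8S_TEST_HOST_TYPE:-}'" >&2
            return 1
            ;;
    esac
}

K8S_TEST_HOST_TYPE=all _print_instance_type
```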
Running on the "all" host type has two consequences: 1) the "normal" tests run too (until now, only the "small" tests did), increasing the coverage; 2) the AKS cluster is created with larger VMs. This is a new requirement due to the current ingress controller for the KBS service eating too many vCPUs and leaving only a few for the tests (resulting in failures). Signed-off-by: Wainer dos Santos Moschetta <[email protected]>
local ingress="${1:-}"

if [[ "${ingress}" = "aks" ]]; then
    echo "Patch the ingress controller to have only one replica of nginx"
I'd add a one-line comment to quickly explain why this is needed 🙂
I've just added a few commits... @sprt 's comment sparked some ideas. Here, I'm now exporting . As I changed the workflow, I pushed the changes to a topic branch to run tests... as soon as I get the results I will post them here.
In https://github.com/kata-containers/kata-containers/actions/runs/14713221646/job/41292948604 KBS got installed and all tests except for pid-ns passed. I'm going to spin the job again and see if the same test fails.
@wainersm Seeing a bunch of failures here:
Please disregard the jobs triggered on this PR. The change in the workflow matters for those tests to stop failing, so I've tested using the
OK, the last job succeeded: https://github.com/kata-containers/kata-containers/actions/runs/14713221646/job/41305482015 It's ready to be merged if there are no other comments. In the meantime I will re-trigger the job one more time, just in case...
@@ -419,13 +419,20 @@ function kbs_k8s_deploy() {
     fi
 }

-# Return the kbs service host name in case ingress is configured
+# Return the kbs service public IP in case ingress is configured
nit: Should we make this "echoes the..." as (unfortunately) we can't return a string value here?
while true; do
    host=$(kubectl get ingress "${KBS_INGRESS_NAME}" -n "${KBS_NS}" -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
    [[ -z "${host}" && ${SECONDS} -lt 30 ]] || break
    sleep 5
Do you need to do SECONDS += 5 here? At the moment I think it will loop forever if there is an issue with ingress coming up?
Rá! This is a trick that I learned from @ldoktor a while ago: SECONDS is a bash variable which is updated each time it's referenced. Quoting bash's manpage:
SECONDS
    Each time this parameter is referenced, it expands to the number of seconds since shell invocation. If a value is assigned to SECONDS, the value returned upon subsequent references is the number of seconds since the assignment plus the value assigned. The number of seconds at shell invocation and the current time are always determined by querying the system clock. If SECONDS is unset, it loses its special properties, even if it is subsequently reset.
Nice - so it's just a magic variable. TIL!
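The SECONDS-based timeout pattern discussed above can be demonstrated with a self-contained sketch (the kubectl query is stubbed out so `host` stays empty, and the real script's 30s budget is shrunk to 3s so the example finishes quickly):

```shell
#!/usr/bin/env bash
# Sketch of the SECONDS-based polling loop from the snippet above.
SECONDS=0   # resetting makes SECONDS count from this point onward
host=""
while true; do
    host=""   # stub: the real script runs a kubectl jsonpath query here
    [[ -z "${host}" && ${SECONDS} -lt 3 ]] || break
    sleep 1
done
echo "gave up after ${SECONDS}s with host='${host}'"
```

Because bash re-evaluates SECONDS on every reference, the loop naturally gives up once the 3-second budget elapses, with no manual counter increments.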
The re-trigger from yesterday passed too: https://github.com/kata-containers/kata-containers/actions/runs/14713221646/job/41314671718 Now it's the fun part: merge to realize that on production it fails :D |
Thank you @wainersm!
Force-merged as this is a workflow change tested on a feature branch.
This is how I tested it locally:
The
Warning: annotation "kubernetes.io/ingress.class" is deprecated, please use 'spec.ingressClassName' instead
warning. I have a fix that I will send to trustee. This doesn't break the installation, so we don't need to bump the KBS version used here.
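That warning means the manifest still selects its ingress class via the deprecated annotation; the fix is to move the class name into the spec field. A minimal sketch (resource and class names are illustrative):

```yaml
# Before (deprecated):
#   metadata:
#     annotations:
#       kubernetes.io/ingress.class: nginx
# After: select the ingress class via the spec field instead.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: kbs-ingress       # example name
spec:
  ingressClassName: nginx # example class name
```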