
tests/k8s: fix kbs installation on Azure AKS #11164


Merged
merged 4 commits into kata-containers:main on
Apr 29, 2025

Conversation

wainersm
Contributor

The Azure AKS addon-http-application-routing add-on is deprecated and cannot be enabled on new clusters, which has caused some CI jobs to fail.

Migrated our code to use approuting instead. Unlike addon-http-application-routing, this add-on doesn't configure a managed cluster DNS zone, but the created ingress has a public IP. To avoid having to deal with DNS setup, we will be using that address from now on. Consequently, some functions that are no longer used have been deleted.

Fixes #11156
Signed-off-by: Wainer dos Santos Moschetta [email protected]
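For reference, enabling the new add-on on an existing cluster boils down to something like the following (a hedged sketch, not the exact CI change; the resource group and cluster names are placeholders):

$ az aks approuting enable --resource-group "${RESOURCE_GROUP}" --name "${CLUSTER_NAME}"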


This is how I tested it locally:

$ export KBS_INGRESS=aks
$ ./tests/integration/kubernetes/gha-run.sh create-cluster
<output omitted>
$ ./tests/integration/kubernetes/gha-run.sh get-cluster-credentials
<output omitted>
$ ./tests/integration/kubernetes/gha-run.sh deploy-coco-kbs
<output omitted>
::endgroup::
/tmp/trustee/kbs/config/kubernetes/overlays /tmp/trustee/kbs/config/kubernetes ~/src/github.com/kata-containers/kata-containers
::group::/tmp/trustee/kbs/config/kubernetes/overlays/ingress.yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: kbs
  namespace: coco-tenant
  annotations:
    kubernetes.io/ingress.class: webapprouting.kubernetes.azure.com
spec:
  rules:
  - host: ""
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: kbs
            port:
              number: 8080
::endgroup::
/tmp/trustee/kbs/config/kubernetes ~/src/github.com/kata-containers/kata-containers
::group::Deploy the KBS
namespace/coco-tenant created
configmap/kbs-config-d655m9k8gk created
configmap/policy-config-tf9h6d9bb2 created
secret/kbs-auth-public-key-t2f5fd6t89 created
secret/keys-gmf2g75547 created
service/kbs created
deployment.apps/kbs created
Warning: annotation "kubernetes.io/ingress.class" is deprecated, please use 'spec.ingressClassName' instead
ingress.networking.k8s.io/kbs created
~/src/github.com/kata-containers/kata-containers
::endgroup::
::group::Check the service healthy
pod/kbs-checker-767919 created
pod "kbs-checker-767919" deleted
KBS service respond to requests
::endgroup::
::group::Check the kbs service is exposed
Trying to connect at http://51.8.197.252:80. Timeout=350
KBS service respond to requests at http://51.8.197.252:80
::endgroup::

Regarding the warning about the deprecated "kubernetes.io/ingress.class" annotation ("please use 'spec.ingressClassName' instead"): I have a fix that I will send to trustee. It doesn't break the installation, so we don't need to bump the kbs version used here.
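For the record, the trustee-side fix presumably amounts to setting the class on the Ingress spec instead of the deprecated annotation. Something roughly equivalent can be tried against the manifest above with a patch (object names taken from that manifest; this is only an illustration):

$ kubectl patch ingress kbs -n coco-tenant --type=merge \
    -p '{"spec": {"ingressClassName": "webapprouting.kubernetes.azure.com"}}'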

@wainersm wainersm added the area/ci (Issues affecting the continuous integration) and area/testing (Issues related to testing) labels Apr 17, 2025
@wainersm wainersm requested review from sprt and stevenhorsman April 17, 2025 18:30
@katacontainersbot katacontainersbot added the size/medium (Average sized task) label Apr 17, 2025
@wainersm
Contributor Author

Hi @Xynnn007, let's see if this fixes things in the CI.

@wainersm
Contributor Author

ok, it still failed... I will change the code to wait for the IP address to show up.

@sprt
Contributor

sprt commented Apr 22, 2025

JFYI I merged #11157

@wainersm
Contributor Author

The latest CI execution showed that this fixed the kbs installation; however, the job got "red" because tests failed. So the code is ready for review.

In the meantime I will re-trigger the job to see whether the test failures were caused by infra or not.

@wainersm
Contributor Author

Rebased to main and squashed commits. Ready to review.

@wainersm
Contributor Author

KBS is installed correctly but tests are still failing. I cannot think of a reason why my changes would break the tests, so it seems to be some instability in the CI.

I suggest merging this PR and then keeping an eye on the next nightly jobs.

@sprt
Contributor

sprt commented Apr 23, 2025

I wouldn't be so quick to dismiss this as infra instability - I'm seeing tests failing consistently that I haven't seen before.

It seems like this is the common failure mode now:

https://github.com/kata-containers/kata-containers/actions/runs/14625338669/job/41036764485?pr=11164#step:17:5498

#   Warning  FailedScheduling  90s   default-scheduler  0/1 nodes are available: 1 Insufficient cpu. preemption: 0/1 nodes are available: 1 No preemption victims found for incoming pod.

Also this seems suspicious:

https://github.com/kata-containers/kata-containers/actions/runs/14625338669/job/41036764485?pr=11164#step:17:5498

app-routing-system   nginx-6564f49b47-c6xcs                          1/1     Running   0          23m
app-routing-system   nginx-6564f49b47-tdbd7                          0/1     Pending   0          23m
  • Two nginx pods in this app-routing-system namespace - are we creating these?
  • One of them is in the pending state

Is it possible that this PR is introducing new control plane containers that eat into our CPU limits?

@wainersm
Contributor Author

Hi @sprt !

I wouldn't be so quick to dismiss this as infra instability - I'm seeing tests failing consistently that I haven't seen before.

You are right! Thanks for the heads up!

It seems like this is the common failure mode now:

https://github.com/kata-containers/kata-containers/actions/runs/14625338669/job/41036764485?pr=11164#step:17:5498

#   Warning  FailedScheduling  90s   default-scheduler  0/1 nodes are available: 1 Insufficient cpu. preemption: 0/1 nodes are available: 1 No preemption victims found for incoming pod.

Also this seems suspicious:

https://github.com/kata-containers/kata-containers/actions/runs/14625338669/job/41036764485?pr=11164#step:17:5498

app-routing-system   nginx-6564f49b47-c6xcs                          1/1     Running   0          23m
app-routing-system   nginx-6564f49b47-tdbd7                          0/1     Pending   0          23m
* Two nginx pods in this app-routing-system namespace - are we creating these?

I followed the instructions in https://learn.microsoft.com/en-us/azure/aks/app-routing, and essentially I'm creating the ingress resource with ingressClassName set to webapprouting.kubernetes.azure.com, just as the docs show.

It creates nginx pods as a result, under the hood. I don't know why one of them is pending.

* One of them is in the pending state

Is it possible that this PR is introducing new control plane containers that eat into our CPU limits?

It is a sound hypothesis. I tested with the Azure subscription granted to me by Red Hat, and I didn't notice that problem. Well, I also didn't bother to run the tests... let me put this PR in draft and see if I can reproduce it locally. In the meantime, if you have any tips about the problem and solution, please talk to your little brother here ;)

@wainersm wainersm added the do-not-merge (PR has problems or depends on another) label Apr 23, 2025
@wainersm
Contributor Author

Ok, after further investigation my findings are:

  • the approuting add-on creates a managed nginx ingress controller in the app-routing-system namespace
  • it is deployed with an auto-scaler configured for a minimum of 2 replicas. That's why in tests/k8s: fix kbs installation on Azure AKS #11164 (comment) we see two nginx pods
    • I managed to set 1 replica by patching the controller:
    kubectl patch nginxingresscontroller/default -n app-routing-system --type=json -p='[{"op": "add", "path": "/spec/scaling/minReplicas", "value": 1}]'

    • More info about the controller can be found here
  • Each nginx pod requests 500m of CPU, which is too much for our cluster (single node). Reducing the replicas to one helps, but it's still too much. Here is the problem: I could not find a way to lower the CPU requirements via the controller. I can edit/patch the deployment/nginx object, however the controller re-creates it with the default settings (i.e. new pods still get 500m of CPU). A quick way to double-check these numbers is sketched at the end of this comment.

I'm trying to avoid deploying our own nginx ingress, but honestly I'm not seeing many alternatives. Ideas?
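If anyone wants to double-check the numbers, something like this should do it (a rough sketch; the deployment and namespace names are the ones the add-on creates, as seen above):

kubectl get deployment nginx -n app-routing-system -o jsonpath='{.spec.template.spec.containers[0].resources.requests.cpu}'
kubectl describe node | grep -A 7 'Allocated resources'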

sprt added a commit that referenced this pull request Apr 24, 2025
The AKS CLI recently introduced a regression that prevents using
aks-preview extensions (Azure/azure-cli#31345), and hence creating
CI clusters.

To address this, we temporarily hardcode the last known good version of
aks-preview.

Note that I removed the comment about this being a Mariner requirement,
as aks-preview is also a requirement of AKS App Routing, which will
be introduced soon in #11164.

Signed-off-by: Aurélien Bombo <[email protected]>
@sprt
Contributor

sprt commented Apr 24, 2025

(Heads up: changing any label will trigger a rerun of the CI, so we're seeing different results right now.)

  • Each nginx pod requests 500m of CPU, which is too much for our cluster (single node). Reducing the replicas to one helps, but it's still too much. Here is the problem: I could not find a way to lower the CPU requirements via the controller. I can edit/patch the deployment/nginx object, however the controller re-creates it with the default settings (i.e. new pods still get 500m of CPU)

I'm trying to avoid deploying our own nginx ingress, but honestly I'm not seeing many alternatives. Ideas?

If I'm understanding correctly, the added nginx pod maxes out the node CPU capacity. How about moving the affected tests to the normal VM SKU, instead of small?

@wainersm
Contributor Author

(Heads up: changing any label will trigger a rerun of the CI, so we're seeing different results right now.)

  • Each nginx pod requests 500m of CPU, which is too much for our cluster (single node). Reducing the replicas to one helps, but it's still too much. Here is the problem: I could not find a way to lower the CPU requirements via the controller. I can edit/patch the deployment/nginx object, however the controller re-creates it with the default settings (i.e. new pods still get 500m of CPU)

I'm trying to avoid deploying our own nginx ingress, but honestly I'm not seeing many alternatives. Ideas?

If I'm understanding correctly, the added nginx pod maxes out the node CPU capacity. How about moving the affected tests to the normal VM SKU, instead of small?

You got it correctly. Either we (somehow) lower the CPU requested by each nginx pod, or we increase the worker node size. I don't know how to lower the CPU requests of the Azure managed nginx ingress, so we can try to increase the VM SKU as you suggested, @sprt. How? Is it configured on the subscription?

@sprt
Contributor

sprt commented Apr 25, 2025

This is actually defined in the CI scripts below. The "normal" VM SKU has 4 vCPUs, the "small" one has 2.

I would:

  1. Decrease the number of nginx replicas to 1. (Assuming that, as you said, we can't reliably decrease the CPU requests)
  2. Move tests that are failing with Insufficient cpu from K8S_TEST_SMALL_HOST_UNION to K8S_TEST_NORMAL_HOST_UNION (a sketch of this follows the script excerpt below).
  3. Get a few passing runs to convince ourselves that this is a good solution.

K8S_TEST_SMALL_HOST_ATTESTATION_REQUIRED_UNION=( \
"k8s-guest-pull-image-encrypted.bats" \
"k8s-guest-pull-image-authenticated.bats" \
"k8s-guest-pull-image-signature.bats" \
"k8s-initdata.bats" \
"k8s-confidential-attestation.bats" \
)
K8S_TEST_SMALL_HOST_UNION=( \
"k8s-guest-pull-image.bats" \
"k8s-confidential.bats" \
"k8s-sealed-secret.bats" \
"k8s-attach-handlers.bats" \
"k8s-block-volume.bats" \
"k8s-caps.bats" \
"k8s-configmap.bats" \
"k8s-copy-file.bats" \
"k8s-cpu-ns.bats" \
"k8s-credentials-secrets.bats" \
"k8s-cron-job.bats" \
"k8s-custom-dns.bats" \
"k8s-empty-dirs.bats" \
"k8s-env.bats" \
"k8s-exec.bats" \
"k8s-file-volume.bats" \
"k8s-hostname.bats" \
"k8s-inotify.bats" \
"k8s-job.bats" \
"k8s-kill-all-process-in-container.bats" \
"k8s-limit-range.bats" \
"k8s-liveness-probes.bats" \
"k8s-measured-rootfs.bats" \
"k8s-memory.bats" \
"k8s-nested-configmap-secret.bats" \
"k8s-oom.bats" \
"k8s-optional-empty-configmap.bats" \
"k8s-optional-empty-secret.bats" \
"k8s-pid-ns.bats" \
"k8s-pod-quota.bats" \
"k8s-policy-hard-coded.bats" \
"k8s-policy-deployment.bats" \
"k8s-policy-job.bats" \
"k8s-policy-logs.bats" \
"k8s-policy-pod.bats" \
"k8s-policy-pvc.bats" \
"k8s-policy-rc.bats" \
"k8s-port-forward.bats" \
"k8s-projected-volume.bats" \
"k8s-qos-pods.bats" \
"k8s-replication.bats" \
"k8s-seccomp.bats" \
"k8s-sysctls.bats" \
"k8s-security-context.bats" \
"k8s-shared-volume.bats" \
"k8s-volume.bats" \
"k8s-nginx-connectivity.bats" \
)
K8S_TEST_NORMAL_HOST_UNION=( \
"k8s-number-cpus.bats" \
"k8s-parallel.bats" \
"k8s-sandbox-vcpus-allocation.bats" \
"k8s-scale-nginx.bats" \
)

function _print_instance_type() {
    case "${K8S_TEST_HOST_TYPE}" in
        small)
            echo "Standard_D2s_v5"
            ;;
        normal)
            echo "Standard_D4s_v5"
            ;;
        *)
            echo "Unknown instance type '${K8S_TEST_HOST_TYPE}'" >&2
            exit 1
    esac
}
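
As a purely hypothetical illustration of step 2 above, moving one of the affected suites (say, k8s-cpu-ns.bats) into the normal set would just mean relocating its entry:

K8S_TEST_NORMAL_HOST_UNION=( \
"k8s-number-cpus.bats" \
"k8s-parallel.bats" \
"k8s-sandbox-vcpus-allocation.bats" \
"k8s-scale-nginx.bats" \
"k8s-cpu-ns.bats" \
)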

The Azure AKS addon-http-application-routing add-on is deprecated and
cannot be enabled on new clusters, which has caused some CI jobs to fail.

Migrated our code to use approuting instead. Unlike
addon-http-application-routing, this add-on doesn't
configure a managed cluster DNS zone, but the created ingress has a
public IP. To avoid having to deal with DNS setup, we will be using that
address from now on. Thus, some functions that are no longer used have been deleted.

Fixes kata-containers#11156
Signed-off-by: Wainer dos Santos Moschetta <[email protected]>
An AKS managed ingress controller is used, which keeps two nginx pod
replicas, each requesting 500m of CPU. On small VMs like the ones we've
used in CI for running the CoCo non-TEE tests, that leaves only a small
amount of CPU for the tests. Actually, one of these pod replicas won't
even get started. So let's patch the ingress controller to keep only
one nginx replica.

Signed-off-by: Wainer dos Santos Moschetta <[email protected]>
_print_instance_type() returns the instance type of the AKS nodes, based
on the host type. Tests are grouped per host type in "small" and "normal"
sets based on the CPU requirements: "small" tests require few CPUs and
"normal" more.

There is a 3rd case: the "all" host type maps to the union of "small"
and "normal" tests, which should be handled by _print_instance_type()
properly. In this case, it should return the largest instance type
possible because the "normal" tests will be executed too.

Signed-off-by: Wainer dos Santos Moschetta <[email protected]>
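A minimal sketch of that handling, assuming we simply extend the case statement quoted earlier (this is not the literal diff):

function _print_instance_type() {
    case "${K8S_TEST_HOST_TYPE}" in
        small)
            echo "Standard_D2s_v5"
            ;;
        normal|all)
            # "all" also runs the "normal" tests, so return the larger SKU
            echo "Standard_D4s_v5"
            ;;
        *)
            echo "Unknown instance type '${K8S_TEST_HOST_TYPE}'" >&2
            exit 1
    esac
}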
By running on the "all" host type there are two consequences:

1) run the "normal" tests too (until now, only the "small" tests run),
   thus increasing the coverage
2) create the AKS cluster with larger VMs. This is a new requirement due
   to the current ingress controller for the KBS service eating too many
   vCPUs and leaving only a few for the tests (resulting in failures)

Signed-off-by: Wainer dos Santos Moschetta <[email protected]>
local ingress="${1:-}"

if [[ "${ingress}" = "aks" ]]; then
echo "Patch the ingress controller to have only one replica of nginx"
Contributor

I'd add a one-line comment to quickly explain why this is needed 🙂
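
Something along these lines, for instance (the comment wording is only a suggestion; the patch command itself is the one discussed earlier in the thread):

if [[ "${ingress}" = "aks" ]]; then
    # The AKS managed nginx ingress controller defaults to a minimum of two
    # replicas, each requesting 500m of CPU, which starves the tests on a
    # single small node; keep a single replica instead.
    echo "Patch the ingress controller to have only one replica of nginx"
    kubectl patch nginxingresscontroller/default -n app-routing-system \
        --type=json \
        -p='[{"op": "add", "path": "/spec/scaling/minReplicas", "value": 1}]'
fi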

@wainersm
Contributor Author

I've just added a few commits... @sprt's comment sparked some ideas. I'm now exporting K8S_TEST_HOST_TYPE=all and I tweaked the script to create a cluster with larger ("normal") VMs. One side effect is that the coco non-TEE job will begin to run a few tests from the "normal" subset, which increases coverage. I also added the code to lower the number of nginx replicas to one, although with a bigger cluster it would be able to launch the default two replicas.

As I changed the workflow, I pushed the changes to a topic branch to run the tests... as soon as I get the results I will post them here.
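
Roughly, the local equivalent of the workflow tweak is the following (a sketch reusing the commands from the test procedure in the description; with K8S_TEST_HOST_TYPE=all the cluster should be created with the larger VM SKU):

$ export KBS_INGRESS=aks
$ export K8S_TEST_HOST_TYPE=all
$ ./tests/integration/kubernetes/gha-run.sh create-cluster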

@wainersm
Contributor Author

In https://github.com/kata-containers/kata-containers/actions/runs/14713221646/job/41292948604 KBS got installed and all tests except for pid-ns passed. I'm going to spin the job again and see if the same test fails.

@sprt
Contributor

sprt commented Apr 28, 2025

@wainersm Seeing a bunch of failures here:

https://github.com/kata-containers/kata-containers/actions/runs/14713397922/job/41292694318?pr=11164#step:17:11250

[run_kubernetes_tests.sh:147] ERROR: Tests FAILED from suites: k8s-cpu-ns.bats k8s-credentials-secrets.bats k8s-limit-range.bats k8s-pod-quota.bats k8s-qos-pods.bats k8s-nginx-connectivity.bats

@wainersm
Contributor Author

@wainersm Seeing a bunch of failures here:

https://github.com/kata-containers/kata-containers/actions/runs/14713397922/job/41292694318?pr=11164#step:17:11250

[run_kubernetes_tests.sh:147] ERROR: Tests FAILED from suites: k8s-cpu-ns.bats k8s-credentials-secrets.bats k8s-limit-range.bats k8s-pod-quota.bats k8s-qos-pods.bats k8s-nginx-connectivity.bats

Please disregard the jobs triggered on this PR. The workflow change matters for those tests to stop failing, so I've tested using the topic/wainersm-fix_kbs_on_aks topic branch. The latest and greatest job is https://github.com/kata-containers/kata-containers/actions/runs/14713221646/job/41305482015

@wainersm
Contributor Author

ok, the last job succeeded: https://github.com/kata-containers/kata-containers/actions/runs/14713221646/job/41305482015

It's ready to be merged if there are no other comments. In the meantime I will re-trigger the job one more time just in case...

@wainersm wainersm removed the do-not-merge (PR has problems or depends on another) label Apr 29, 2025
@sprt sprt self-requested a review April 29, 2025 01:15
@@ -419,13 +419,20 @@ function kbs_k8s_deploy() {
fi
}

# Return the kbs service host name in case ingress is configured
# Return the kbs service public IP in case ingress is configured
Member

nit: Should we make this "echos the..." as (unfortunately) we can't return a string value here?

while true; do
    host=$(kubectl get ingress "${KBS_INGRESS_NAME}" -n "${KBS_NS}" -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
    [[ -z "${host}" && ${SECONDS} -lt 30 ]] || break
    sleep 5
Member

Do you need to do SECONDS += 5 here? At the moment I think it will loop forever if there is an issue with ingress coming up?

Contributor Author

Ha! This is a trick that I learned from @ldoktor a while ago: SECONDS is a bash variable which is updated each time it's referenced. Quoting bash's manpage:

SECONDS
       Each time this parameter is referenced, it expands to the number
       of seconds since shell invocation. If a value is assigned to
       SECONDS, the value returned upon subsequent references is the
       number of seconds since the assignment plus the value assigned.
       The number of seconds at shell invocation and the current time
       are always determined by querying the system clock. If SECONDS
       is unset, it loses its special properties, even if it is
       subsequently reset.
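
A tiny demo of that behaviour (hypothetical shell session; the timing is approximate):

$ SECONDS=0; sleep 3; echo "elapsed: ${SECONDS}s"
elapsed: 3s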

Member

Nice - so it's just a magic variable. TIL!

@wainersm
Contributor Author

ok, the last job succeeded: https://github.com/kata-containers/kata-containers/actions/runs/14713221646/job/41305482015

It's ready to be merged if there are no other comments. In the meantime I will re-trigger the job one more time just in case...

The re-trigger from yesterday passed too: https://github.com/kata-containers/kata-containers/actions/runs/14713221646/job/41314671718

Now comes the fun part: merge it, only to find that it fails in production :D

Contributor

@sprt sprt left a comment

Thank you @wainersm!

@sprt sprt merged commit 19371e2 into kata-containers:main Apr 29, 2025
951 of 984 checks passed
@sprt
Contributor

sprt commented Apr 29, 2025

Force-merged as a workflow change tested in a feature branch.

Labels
area/ci (Issues affecting the continuous integration), area/testing (Issues related to testing), ok-to-test, size/medium (Average sized task)
Development

Successfully merging this pull request may close these issues.

CI: failing to deploy KBS on AKS due recent change on Azure routing API
4 participants