Calculate metrics of credential fetching from Pods & upload to s3 #512


Merged (4 commits, Jun 13, 2025)

Conversation

xdu31
Contributor

@xdu31 xdu31 commented May 29, 2025

Issue #, if available:

Description of changes:

  1. Change the Pod spec to invoke Pod Identity from a normal container instead of from an initContainer
  2. Push credential-fetching metrics
  3. Push the credential-fetching time range, sample count, p50, p90, and p99 to S3

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
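The percentile step in item 3 can be sketched as follows, assuming latency samples are collected one per line. The `percentile` helper, sample values, output path, and `RESULTS_BUCKET` name are illustrative, not part of the PR:

```shell
# percentile <p>: nearest-rank percentile of newline-separated numbers on stdin.
percentile() {
  jq -s --argjson p "$1" 'sort | .[(((length * $p / 100) | ceil) - 1)]'
}

# Example samples (ms); in the real task these come from the credential-fetch loop.
samples="12
34
56
78
90"

total_samples=$(printf '%s\n' "$samples" | wc -l | tr -d ' ')
p50=$(printf '%s\n' "$samples" | percentile 50)
p90=$(printf '%s\n' "$samples" | percentile 90)
p99=$(printf '%s\n' "$samples" | percentile 99)

# Write the summary document that would be uploaded to S3.
cat > /tmp/credential-metrics.json <<EOF
{
  "total_samples": $total_samples,
  "p50": $p50,
  "p90": $p90,
  "p99": $p99
}
EOF

# Upload step (bucket/key are placeholders):
# aws s3 cp /tmp/credential-metrics.json "s3://$RESULTS_BUCKET/credential-metrics.json"
```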

DIMENSION_NAME="ClusterName"
DIMENSION_VALUE=$CLUSTER_NAME
METRIC_LATENCY_NAME="CredentialFetchLatency"
PERIOD=300
Contributor

This should be a tunable parameter.
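One way to make these tunable, as a sketch: fall back to the current hard-coded values only when the caller does not supply them (variable names match the snippet above):

```shell
# Use caller-supplied values when present, otherwise the current defaults.
DIMENSION_NAME="${DIMENSION_NAME:-ClusterName}"
METRIC_LATENCY_NAME="${METRIC_LATENCY_NAME:-CredentialFetchLatency}"
PERIOD="${PERIOD:-300}"
```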

METRIC_LATENCY_NAME="CredentialFetchLatency"
PERIOD=300

START_TIME=$(aws eks $ENDPOINT_FLAG --region $REGION describe-cluster \
Contributor

Could you add comments on why you consider this the start time?

Also, please add comments overall to make the script more readable for future users/consumers on your team, especially wherever you are making assumptions.

"total_samples": $total_samples,
"p50": $p50,
"p90": $p90,
"p99": $p99
Contributor

I see you are printing the latency numbers, but I don't see you validating that the p99 is under x seconds. Am I missing something here?

Contributor Author

There are 2 options:

  1. Fail the task when the threshold is breached
  2. Integrate with an alarm on threshold breaches
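Option 1 could be sketched like this; `P99_THRESHOLD` and the placeholder `p99` value are hypothetical, not from the PR:

```shell
# Sketch of option 1: fail the task when p99 breaches a tunable threshold.
P99_THRESHOLD="${P99_THRESHOLD:-2}"
p99="${p99:-1.5}"  # in the real task this comes from the metrics computation

# awk handles the float comparison, since plain [ ] only compares integers.
if awk -v v="$p99" -v t="$P99_THRESHOLD" 'BEGIN { exit !(v < t) }'; then
  echo "p99 ${p99}s is under the ${P99_THRESHOLD}s threshold"
else
  echo "p99 ${p99}s breached the ${P99_THRESHOLD}s threshold" >&2
  exit 1
fi
```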

Contributor

The intent of these tests is to alert teams when results do not meet expectations.

@xdu31 xdu31 requested a review from hakuna-matatah May 30, 2025 19:20
@@ -84,6 +84,18 @@ spec:
default: "200"
- name: cl2-uniform-qps
default: "100"
- name: cl2-metric-dimension-name
description: "default metric dimension name"
default: "ClusterName"
Contributor

You have the same defaults at the task level; we don't have to carry them to the pipeline. The pipeline takes the task-level defaults when they are not supplied.

Same for the other params as well.

Contributor Author

But when we create a pipeline run, that's at the pipeline level? So we don't need to make another code change here to change the name.

Contributor

Yeah, but the pipeline run will take the task-level defaults when they are not supplied.

echo "p99 is less than 2"
else
echo "p99 is 2 or more"
exit 1
Contributor

We don't want to exit and fail the test; we need to capture the test result like this:

exit_code=$?
if [ $exit_code -eq 0 ]; then
echo "1" | tee $(results.datapoint.path)
else
echo "0" | tee $(results.datapoint.path)
fi
exit $exit_code

and use that to emit the result like how we do it here -

so it can be used to configure an alarm and cut tickets to your team.
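Following that pattern, the p99 check could record a datapoint instead of failing outright. A sketch, with `RESULT_FILE` standing in for Tekton's `$(results.datapoint.path)` and a placeholder `p99` value:

```shell
RESULT_FILE="${RESULT_FILE:-/tmp/datapoint}"
p99="${p99:-1.5}"  # placeholder for the computed p99 (seconds)

# Record 1 on success and 0 on breach, without aborting the task here;
# an alarm on the emitted datapoint cuts the ticket instead.
if awk -v v="$p99" 'BEGIN { exit !(v < 2) }'; then
  echo "1" | tee "$RESULT_FILE"
else
  echo "0" | tee "$RESULT_FILE"
fi
```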

--statistics SampleCount \
--output json)

total_samples=$(echo "$response" | jq '[.Datapoints[].SampleCount] | add // 0')
Contributor
@hakuna-matatah Jun 2, 2025

We need to capture the rate at which credentials are fetched from the service, to be able to compare that against the scheduler throughput.

Currently you have some kind of measurement by being able to control the timeout param https://github.com/awslabs/kubernetes-iteration-toolkit/pull/512/files#diff-d2d660edac904aa96e330bfae7bf67ef6885190877c5ab7668f0f157057da03fR61 accordingly, by computing the total number of pods, the client pod creation rate/scheduler rate, and what your service throughput could be.

This kind of measurement will only give an approximation, not the fully accurate throughput of your service.
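The approximation described above amounts to dividing the sample count by the measurement window. A sketch with hypothetical stand-in values (in the real task, the sample count comes from the CloudWatch query and the window is the query `PERIOD`):

```shell
fetch_samples=1500    # hypothetical: SampleCount summed over the window
window_seconds=300    # hypothetical: the PERIOD used in the query

# Approximate credential-fetch rate; as noted above, this is only an
# approximation of the service throughput, not an exact measurement.
rate=$(awk -v n="$fetch_samples" -v w="$window_seconds" 'BEGIN { printf "%.2f", n / w }')
echo "approximate credential fetch rate: ${rate}/s"
```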

sleep "$SLEEP_TIME"
done

# s3 api call
while ! aws s3 ls; do

I think it's better to verify the role being used, to make sure we are not using the instance role:
aws sts get-caller-identity | grep <role_name>
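A slightly stricter version of that check, matching the assumed-role ARN instead of grepping; `EXPECTED_ROLE` and the sample ARN are placeholders:

```shell
# Verify the credentials come from the Pod Identity role, not the instance role.
EXPECTED_ROLE="${EXPECTED_ROLE:-pod-identity-test-role}"

# In the real task:
# arn=$(aws sts get-caller-identity --query Arn --output text)
arn="arn:aws:sts::123456789012:assumed-role/${EXPECTED_ROLE}/session"  # sample value

case "$arn" in
  *":assumed-role/${EXPECTED_ROLE}/"*) echo "using expected role: $EXPECTED_ROLE" ;;
  *) echo "unexpected identity: $arn" >&2; exit 1 ;;
esac
```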

Contributor Author

I'm a bit concerned about the STS quota we consume running in this account.


I understand about the limits, but I think we need a way to make sure that the command is successful because of the Pod Identity token and not because of the instance role.

Contributor Author

I added a comment on how we can make sure this test is working as intended without checking the assumed-role identity.

start_epoch=$(date +%s%3N)
# fetch credentials
for i in $(seq 0 $((MAX_ATTEMPTS - 1))); do
if curl -S -H "Authorization: $AUTH_TOKEN" http://169.254.170.23/v1/credentials; then

Do we need to verify that we got a successful response? Maybe check for status code 2xx?
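The suggested 2xx check could look like this, using curl's `%{http_code}` write-out; the `http_ok` helper and sample value are hypothetical:

```shell
# Return 0 only when the agent answered with a 2xx status.
http_ok() {
  [ "$1" -ge 200 ] && [ "$1" -lt 300 ]
}

# In the fetch loop this would be:
# code=$(curl -s -o /dev/null -w '%{http_code}' \
#          -H "Authorization: $AUTH_TOKEN" http://169.254.170.23/v1/credentials)
code=200  # sample value for illustration

if http_ok "$code"; then
  echo "credentials fetched (HTTP $code)"
else
  echo "credential fetch failed (HTTP $code)" >&2
fi
```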

@xdu31 xdu31 requested review from hakuna-matatah and kmala June 2, 2025 16:48
@xdu31 xdu31 force-pushed the pia-pod-spec branch 3 times, most recently from e28b3ba to 3b3722a Compare June 6, 2025 19:45
@hakuna-matatah
Contributor

lgtm

@hakuna-matatah hakuna-matatah merged commit 84c4e3a into awslabs:main Jun 13, 2025
4 checks passed
3 participants