Calculate metrics of credential fetching from Pods & upload to s3 #512


Merged (4 commits, Jun 13, 2025)

Conversation

xdu31
Contributor

@xdu31 xdu31 commented May 29, 2025

Issue #, if available:

Description of changes:

  1. Change the Pod spec to invoke Pod Identity from a normal container instead of from an initContainer
  2. Push credential-fetching metrics
  3. Push the credential-fetching time range, sample count, p50, p90, and p99 to S3

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
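The percentile step in item 3 can be sketched as follows, assuming latency samples are collected one per line. The `percentile` helper, sample values, output path, and `RESULTS_BUCKET` name are illustrative, not part of the PR:

```shell
# percentile <p>: nearest-rank percentile of newline-separated numbers on stdin.
percentile() {
  jq -s --argjson p "$1" 'sort | .[(((length * $p / 100) | ceil) - 1)]'
}

# Example samples (ms); in the real task these come from the credential-fetch loop.
samples="12
34
56
78
90"

total_samples=$(printf '%s\n' "$samples" | wc -l | tr -d ' ')
p50=$(printf '%s\n' "$samples" | percentile 50)
p90=$(printf '%s\n' "$samples" | percentile 90)
p99=$(printf '%s\n' "$samples" | percentile 99)

# Write the summary document that would be uploaded to S3.
cat > /tmp/credential-metrics.json <<EOF
{
  "total_samples": $total_samples,
  "p50": $p50,
  "p90": $p90,
  "p99": $p99
}
EOF

# Upload step (bucket/key are placeholders):
# aws s3 cp /tmp/credential-metrics.json "s3://$RESULTS_BUCKET/credential-metrics.json"
```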

DIMENSION_NAME="ClusterName"
DIMENSION_VALUE=$CLUSTER_NAME
METRIC_LATENCY_NAME="CredentialFetchLatency"
PERIOD=300
Contributor

This should be a tunable parameter.
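One way to make these tunable, as a sketch: fall back to the current hard-coded values only when the caller does not supply them (variable names match the snippet above):

```shell
# Use caller-supplied values when present, otherwise the current defaults.
DIMENSION_NAME="${DIMENSION_NAME:-ClusterName}"
METRIC_LATENCY_NAME="${METRIC_LATENCY_NAME:-CredentialFetchLatency}"
PERIOD="${PERIOD:-300}"
```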

METRIC_LATENCY_NAME="CredentialFetchLatency"
PERIOD=300

START_TIME=$(aws eks $ENDPOINT_FLAG --region $REGION describe-cluster \
Contributor

Could you add comments on why you consider this the start time?

Also, please add comments overall to make the script more readable for future users/consumers on your team, especially wherever you are making assumptions.

"total_samples": $total_samples,
"p50": $p50,
"p90": $p90,
"p99": $p99
Contributor

I see you are printing the latency numbers, but I don't see you validating that the p99 is under x seconds. Am I missing something here?

Contributor Author

There are 2 options:

  1. Fail the task when the threshold is breached
  2. Integrate with an alarm on threshold breaches
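Option 1 could be sketched like this; `P99_THRESHOLD` and the placeholder `p99` value are hypothetical, not from the PR:

```shell
# Sketch of option 1: fail the task when p99 breaches a tunable threshold.
P99_THRESHOLD="${P99_THRESHOLD:-2}"
p99="${p99:-1.5}"  # in the real task this comes from the metrics computation

# awk handles the float comparison, since plain [ ] only compares integers.
if awk -v v="$p99" -v t="$P99_THRESHOLD" 'BEGIN { exit !(v < t) }'; then
  echo "p99 ${p99}s is under the ${P99_THRESHOLD}s threshold"
else
  echo "p99 ${p99}s breached the ${P99_THRESHOLD}s threshold" >&2
  exit 1
fi
```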

Contributor

The intent of these tests is to alert teams when results do not meet expectations.

@xdu31 xdu31 requested a review from hakuna-matatah May 30, 2025 19:20
@@ -84,6 +84,18 @@ spec:
default: "200"
- name: cl2-uniform-qps
default: "100"
- name: cl2-metric-dimension-name
description: "default metric dimension name"
default: "ClusterName"
Contributor

You have the same defaults at the task level; we don't have to carry them to the pipeline. The pipeline takes the task-level defaults when they are not supplied.

Same for the other params as well.

Contributor Author

But when we create a pipeline run, that's at the pipeline level? So we don't need to make another code change here to change the name.

Contributor

Yeah, but the pipeline run will take the task-level defaults when they are not supplied.

echo "p99 is less than 2"
else
echo "p99 is 2 or more"
exit 1
Contributor

We don't want to exit and fail the test; we need to capture the test result like this:

exit_code=$?
if [ $exit_code -eq 0 ]; then
echo "1" | tee $(results.datapoint.path)
else
echo "0" | tee $(results.datapoint.path)
fi
exit $exit_code

and use that to emit the result like how we do it here -

so it can be used to configure an alarm and cut tickets to your team.
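Following that pattern, the p99 check could record a datapoint instead of failing outright. A sketch, with `RESULT_FILE` standing in for Tekton's `$(results.datapoint.path)` and a placeholder `p99` value:

```shell
RESULT_FILE="${RESULT_FILE:-/tmp/datapoint}"
p99="${p99:-1.5}"  # placeholder for the computed p99 (seconds)

# Record 1 on success and 0 on breach, without aborting the task here;
# an alarm on the emitted datapoint cuts the ticket instead.
if awk -v v="$p99" 'BEGIN { exit !(v < 2) }'; then
  echo "1" | tee "$RESULT_FILE"
else
  echo "0" | tee "$RESULT_FILE"
fi
```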

--statistics SampleCount \
--output json)

total_samples=$(echo "$response" | jq '[.Datapoints[].SampleCount] | add // 0')
Contributor
@hakuna-matatah Jun 2, 2025

We need to capture the rate at which credentials are fetched from the service, to be able to compare that against the scheduler throughput.

Currently you have some kind of measurement by being able to control the timeout param https://github.com/awslabs/kubernetes-iteration-toolkit/pull/512/files#diff-d2d660edac904aa96e330bfae7bf67ef6885190877c5ab7668f0f157057da03fR61 accordingly, by computing the total number of pods, the client pod creation rate/scheduler rate, and what your service throughput could be.

This kind of measurement will only give an approximation, not the fully accurate throughput of your service.
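The approximation described above amounts to dividing the sample count by the measurement window. A sketch with hypothetical stand-in values (in the real task, the sample count comes from the CloudWatch query and the window is the query `PERIOD`):

```shell
fetch_samples=1500    # hypothetical: SampleCount summed over the window
window_seconds=300    # hypothetical: the PERIOD used in the query

# Approximate credential-fetch rate; as noted above, this is only an
# approximation of the service throughput, not an exact measurement.
rate=$(awk -v n="$fetch_samples" -v w="$window_seconds" 'BEGIN { printf "%.2f", n / w }')
echo "approximate credential fetch rate: ${rate}/s"
```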

sleep "$SLEEP_TIME"
done

# s3 api call
while ! aws s3 ls; do

I think it's better to verify the role being used, to make sure we are not using the instance role:
aws sts get-caller-identity | grep <role_name>
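A slightly stricter version of that check, matching the assumed-role ARN instead of grepping; `EXPECTED_ROLE` and the sample ARN are placeholders:

```shell
# Verify the credentials come from the Pod Identity role, not the instance role.
EXPECTED_ROLE="${EXPECTED_ROLE:-pod-identity-test-role}"

# In the real task:
# arn=$(aws sts get-caller-identity --query Arn --output text)
arn="arn:aws:sts::123456789012:assumed-role/${EXPECTED_ROLE}/session"  # sample value

case "$arn" in
  *":assumed-role/${EXPECTED_ROLE}/"*) echo "using expected role: $EXPECTED_ROLE" ;;
  *) echo "unexpected identity: $arn" >&2; exit 1 ;;
esac
```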

Contributor Author

I'm a bit concerned about the STS quota we consume running in this account.


I understand about the limits, but I think we need a way to make sure that the command is successful because of the Pod Identity token and not because of the instance role.

Contributor Author

I added a comment on how we can make sure this test is working as intended without checking the assumed-role identity.

start_epoch=$(date +%s%3N)
# fetch credentials
for i in $(seq 0 $((MAX_ATTEMPTS - 1))); do
if curl -S -H "Authorization: $AUTH_TOKEN" http://169.254.170.23/v1/credentials; then

Do we need to verify that we got a successful response? Maybe check for status code 2xx?
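The suggested 2xx check could look like this, using curl's `%{http_code}` write-out; the `http_ok` helper and sample value are hypothetical:

```shell
# Return 0 only when the agent answered with a 2xx status.
http_ok() {
  [ "$1" -ge 200 ] && [ "$1" -lt 300 ]
}

# In the fetch loop this would be:
# code=$(curl -s -o /dev/null -w '%{http_code}' \
#          -H "Authorization: $AUTH_TOKEN" http://169.254.170.23/v1/credentials)
code=200  # sample value for illustration

if http_ok "$code"; then
  echo "credentials fetched (HTTP $code)"
else
  echo "credential fetch failed (HTTP $code)" >&2
fi
```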

@xdu31 xdu31 requested review from hakuna-matatah and kmala June 2, 2025 16:48
@xdu31 xdu31 force-pushed the pia-pod-spec branch 3 times, most recently from e28b3ba to 3b3722a Compare June 6, 2025 19:45
@hakuna-matatah
Contributor

lgtm

@hakuna-matatah hakuna-matatah merged commit 84c4e3a into awslabs:main Jun 13, 2025
4 checks passed
3 participants