-
Notifications
You must be signed in to change notification settings - Fork 44
Calculate metrics of credential fetching from Pods & upload to s3 #512
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
DIMENSION_NAME="ClusterName" | ||
DIMENSION_VALUE=$CLUSTER_NAME | ||
METRIC_LATENCY_NAME="CredentialFetchLatency" | ||
PERIOD=300 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this should be tunable parameter
METRIC_LATENCY_NAME="CredentialFetchLatency" | ||
PERIOD=300 | ||
|
||
START_TIME=$(aws eks $ENDPOINT_FLAG --region $REGION describe-cluster \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could you add comments on why you consider this as start time ?
Also, please add comments overall to make it more readable for future users/consumers on your team, especially wherever you are making assumptions.
"total_samples": $total_samples, | ||
"p50": $p50, | ||
"p90": $p90, | ||
"p99": $p99 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see you are printing the latency numbers, but i don't see you validating the p99 under x
seconds. Am i missing something here ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are 2 options:
- fail the task on threshold breaching
- integrate with alarm on threshold breaching
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Intent of these test is to alert teams when results not meet expectations.
@@ -84,6 +84,18 @@ spec: | |||
default: "200" | |||
- name: cl2-uniform-qps | |||
default: "100" | |||
- name: cl2-metric-dimension-name | |||
description: "default metric dimension name" | |||
default: "ClusterName" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You have the same defaults at task level, we don't have to carry this to pipeline. Pipeline would take task level defaults when not supplied.
Same for other params as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
but when we make a runpipeline, that's on pipeline level? So we don't need to make another code change here to change the name
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, but run pipeline will take task level defaults when not supplied.
echo "p99 is less than 2" | ||
else | ||
echo "p99 is 2 or more" | ||
exit 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't want to exit and fail the test, we need to capture the test result like this
kubernetes-iteration-toolkit/tests/tekton-resources/tasks/generators/clusterloader/load-slos.yaml
Lines 171 to 177 in 3f5523d
exit_code=$? | |
if [ $exit_code -eq 0 ]; then | |
echo "1" | tee $(results.datapoint.path) | |
else | |
echo "0" | tee $(results.datapoint.path) | |
fi | |
exit $exit_code |
and use that to emit result like how we do it here -
Line 360 in 3f5523d
value: $(tasks.generate.results.datapoint) |
So it can used configure alarm and cut tickets to your team.
--statistics SampleCount \ | ||
--output json) | ||
|
||
total_samples=$(echo "$response" | jq '[.Datapoints[].SampleCount] | add // 0') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to capture what is the rate
at which credentials are fetched from service. To be able to compare that against the scheduler throughput.
Currently you have some kind of measuring by able to control the timeout param
https://github.com/awslabs/kubernetes-iteration-toolkit/pull/512/files#diff-d2d660edac904aa96e330bfae7bf67ef6885190877c5ab7668f0f157057da03fR61
accordingly by computing total number of pods and client pod creation rate/scheduler rate and what your service throughput could be.
This will only give some kind of approximation but not fully accurate throughput of your service using this kind of measurement.
sleep "$SLEEP_TIME" | ||
done | ||
|
||
# s3 api call | ||
while ! aws s3 ls; do |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think its better to verify the role being used to make sure we are not using the instance role.
aws sts get-caller-identity | grep <role_name>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm a bit concerned about sts quota we consume running in this account
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i understand about the limits but i think we need a way to make sure that the command is successful because of token by pod identity and not using the instance role.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added a comment on how we can make sure this test is working as intended without checking on the assumed role identity
start_epoch=$(date +%s%3N) | ||
# fetch credentials | ||
for i in $(seq 0 $((MAX_ATTEMPTS - 1))); do | ||
if curl -S -H "Authorization: $AUTH_TOKEN" http://169.254.170.23/v1/credentials; then |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we need to verify that we got a successful response? may be status code 2xx ?
e28b3ba
to
3b3722a
Compare
lgtm |
Issue #, if available:
Description of changes:
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.