If spec.replicas is not defined, should default to the current value #3689
Comments
Yeah, this is a serious issue that affects me as well. I use an HPA, and if I don't set the replicas field to the current replica count, it wipes out all but one of my pods before recreating them. I have persistent client connections I can't afford to lose in this way.
I came across this issue while debugging the same behavior with an autoscaling/v2 HPA, and raised it in the CNCF Slack channel for Argo Rollouts as well: https://cloud-native.slack.com/archives/C01U781DW2E/p1736195993217089
What do you mean by "new version rollout"? Do you have a process that will delete the Rollout object and recreate it, or something that removes the `replicas` field? Kubernetes audit logs can be useful to find who updated an object. Sometimes the managedFields can also provide some information.
I believe I'm seeing that same bug, here's a timeline of a change that does two things:
I've pulled the timestamps and strings from audit logs. Timestamps (requestReceivedTimestamp) are typically when the rollouts-controller creates the corresponding Event; the actual resource creation might differ slightly.
Note that ~2s before Argo even created the ReplicaSet for the next version, it scaled down the existing, healthy ReplicaSet to 1. Ten seconds later, the HPA set the scale to the minimum. Argo then proceeded with starting the canary run.
@chrono So what is impactful is
By doing this, the Rollout is told that no replicas are specified, so it defaults to 1 as expected. Instead, you could deploy the HPA first and let it manage the replicas field. If you use Argo CD to manage the replicas, then you can remove the config from Git and use `ignoreDifferences` so you never apply a `replicas` value.
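The `ignoreDifferences` approach suggested here might look like the following illustrative Argo CD Application snippet (names are placeholders; this assumes Argo CD manages the Rollout and the HPA owns the replica count):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: my-app              # placeholder
spec:
  # source/destination omitted for brevity
  ignoreDifferences:
    - group: argoproj.io
      kind: Rollout
      jsonPointers:
        - /spec/replicas    # let the HPA own this field
```

Note that `ignoreDifferences` only affects diffing; for the sync itself to leave the live field alone, the Application's sync options need to respect ignored differences as well.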
That's the point of the bug. There's already a live Rollout with `replicas` set.
I could also try to teach every developer to
That's why folks are using Argo CD: as an abstraction, so that practitioners don't have to understand and execute the nitty-gritty details. And if things can't be done in a safe, predictable manner, perhaps the tooling should fail safe and throw its hands up: "I need manual intervention here."
At this point, Argo CD does not know the behavior of your Kubernetes resources. It is told to match your desired state, and it does that by removing the replicas field as instructed. Argo Rollouts also does as instructed: it has no replicas defined in the manifest, so it defaults to 1 (the same as a Deployment). I believe the underlying issue is an Argo CD one: how to stop managing the `replicas` field in the desired Git state without removing it from the live state. This use case seems to be documented in the Argo CD documentation for `ignoreDifferences`.
Almost any behavior would be better than what currently happens. The `replicas` in the Rollout should describe the final state, but it immediately updates replica counts on the existing Kubernetes objects. Let's say I have a ReplicaSet controlled by an HPA, and although my original Rollout said 5 pods, it is currently running 10. If I run a new rollout, only updating the image tag, it will immediately destroy 5 of the active pods to bring that original ReplicaSet in line with the spec. If I don't specify replicas, we get slashed to 1 replica.
Perhaps it's useful to temporarily ignore the HPA in this discussion and focus on the Rollout and its management of ReplicaSets. A developer, for some [leave this undefined] reason, unsets the replica count, or sets it to some value that will not result in a healthy system. When they're making this change as part of a deployment strategy like canary or blue-green, I'd expect Argo to do its best to treat this new configuration as part of the change it's trying to validate and roll out. So I wouldn't expect Argo to scale down the old, healthy ReplicaSet to the new (potentially unsafe / broken / untested) value immediately before starting the deployment strategy. I would expect Argo to try to introduce the new configuration as part of the strategy, and to allow automatic health assessments to judge its performance.
@insylogo The HPA will set the `replicas` field on the Rollout. You mention that your HPA controls the ReplicaSet; the answer above assumes that the Rollout is the target configured in your HPA's `scaleTargetRef`.
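For context, an HPA that manages a Rollout does so through `scaleTargetRef`; a minimal sketch with placeholder names and an assumed CPU metric:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-rollout-hpa        # placeholder
spec:
  scaleTargetRef:
    apiVersion: argoproj.io/v1alpha1
    kind: Rollout             # the HPA scales the Rollout, not the ReplicaSet
    name: my-rollout          # placeholder
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 80
```

With this wiring, the HPA writes the replica count to the Rollout's scale subresource, and the Rollout controller propagates it down to the ReplicaSets it owns.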
@chrono That would be an interesting idea to solve the problem, but in my opinion this is not a problem that should be solved within the Rollout or Deployment controller. I think a feature could be added to Argo CD to help prevent these kinds of issues.
Something that could be added, in my opinion, is a validation that would fail the resource update if the `replicas` field is removed from a live Rollout.
I think that would meet this behavior. If you are using Argo CD, the sync would fail, blocked by this validation, unless the Application is configured so that it does not dangerously override the `replicas` field.
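The validation being proposed could be sketched as follows. This is a hypothetical check, not an existing Argo Rollouts or Argo CD feature; `validateReplicasUpdate` is an invented name, and a `*int32` models the optional field as it does in the Kubernetes API:

```go
package main

import "fmt"

// validateReplicasUpdate rejects an update that removes spec.replicas
// from a Rollout that previously had it set. A nil pointer models an
// unset field, mirroring *int32 in the Kubernetes API types.
func validateReplicasUpdate(oldReplicas, newReplicas *int32) error {
	if oldReplicas != nil && newReplicas == nil {
		return fmt.Errorf("spec.replicas was removed; refusing update to avoid scaling down to the default of 1")
	}
	return nil
}

func main() {
	five := int32(5)
	// Update unsets replicas on a live object: rejected.
	fmt.Println(validateReplicasUpdate(&five, nil))
	// Replicas unchanged: allowed.
	fmt.Println(validateReplicasUpdate(&five, &five))
}
```

In practice this kind of check would live in a validating admission webhook or in Argo CD's sync path, so the failing sync surfaces the problem instead of the pods disappearing.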
Checklist:
Describe the bug
In #119, defaulting of a missing `replicas:` field to 1 was added. This behaviour could impact the reliability of a service. The defaulting was added to work around an HPA deficiency (kubernetes/kubernetes#111781).

We are using a custom autoscaler, not the HPA. Our deployments tend not to have the `replicas:` field defined. Our expectation is that during a new version rollout, the number of replicas is carried over from the previous version.

What we observed: when the `replicas:` field is not set, argo-rollouts sets it to 1. A few seconds later, our autoscaler sets `replicas:` back to the proper value. Unfortunately, during those few seconds Kubernetes starts killing pods, and until the new pods are spun up and ready, the service is degraded.

To Reproduce
Prepare a deployment without a `spec.replicas:` field. Observe the "Defaulting .spec.replica to 1" message in the argo-rollouts log.

Expected behavior
The number of replicas, if not defined, should be carried over from the current state. The code should do something like
(Above isn't syntactically correct, sorry.)
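The author's original snippet doesn't survive above. As a hedged sketch of the suggested carry-over behavior (hypothetical names, not the project's actual code), it might look like:

```go
package main

import "fmt"

// defaultReplicas illustrates the suggested behavior: if spec.replicas
// is unset (nil), fall back to the observed status.replicas instead of
// a hard-coded 1. Zero is treated as a valid, carried-over value.
func defaultReplicas(specReplicas *int32, statusReplicas int32) int32 {
	if specReplicas != nil {
		return *specReplicas
	}
	return statusReplicas
}

func main() {
	// Unset spec: carry over the current count from status.
	fmt.Println(defaultReplicas(nil, 7)) // 7
	// Explicit spec wins over status.
	three := int32(3)
	fmt.Println(defaultReplicas(&three, 7)) // 3
}
```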
`rollout.status.replicas` is NOT marked as an optional field, so it should be available all the time. A value of `0` (zero) for the number of status replicas may be wanted/correct, and should be carried over, too.

Screenshots
Version
1.7.0
Logs
Message from the maintainers:
Impacted by this bug? Give it a 👍. We prioritize the issues with the most 👍.