Application Controller's live state cache doesn't use watch cache when talking to Kubernetes #18838
Labels: bug (something isn't working), component:core (syncing, diffing, cluster state cache), type:bug, version:2.14 (latest confirmed affected version is 2.14)
Describe the bug
When maintaining a registered cluster's live state cache to track the state of K8s resources, the Application Controller uses an unusual API call pattern that performs poorly at scale, especially when there are many resources of a particular kind.
At the moment, we can see here that for every API resource kind, we create a separate goroutine that:
- […] `timeout` parameter to the WATCH request options; instead, the watch connection is run for a limited time (10 min by default) and then ended by stopping the watcher here and nullifying the RV there as well.
- `RetryUntilSucceed` is retried again (since it failed explicitly after that 10 min timeout).
- […] `etcd` once again with the page size equal to 500.
This approach has a couple of problems:
- Lists issued to `etcd` are much more heavy-weight than lists issued to the `kube-apiserver`'s watch cache.
  - […] `kube-apiserver` simply gets a copy of all of the resources from the cache (which already contains deserialized data) and sends it to the client.
  - […] `kube-apiserver` has to get this information from `etcd` directly (applying non-trivial load to it), decode and deserialize it when fetching it from there.
- […] `etcd` list calls if there are lots of resources of a particular kind.
- […] `etcd` compaction window which defaults to 1min.
To Reproduce
Follow steps 1-6 from the Getting Started to register any cluster in Argo CD in a default setup.
Observe logs of `kube-apiserver` to see periodic (every 10 minutes) LISTs of all resources issued directly to `etcd` (no `resourceVersion` parameter in the URI of the logged API request) rather than to the `kube-apiserver`'s watch cache (`resourceVersion=0` in the URI string).
Expected behavior
`argocd-application-controller`'s live state cache properly implements the List&Watch pattern (when tracking the state of cluster resources), where it issues a LIST API call served from the watch cache (i.e. with `resourceVersion=0`) and follows it with WATCH requests only (with increasing RV).
Screenshots
Version
Logs
Logs from `kube-apiserver` for Pods from my small dev cluster I used for debugging:
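When reading such logs, the two cases can be told apart by the `resourceVersion` query parameter of the logged request URI. The helper below is a minimal sketch of that check, not part of Argo CD or Kubernetes: `classifyList`, the `source*` constants, and the sample URIs are hypothetical. It assumes a `kube-apiserver` that serves a LIST without `resourceVersion` from `etcd`, as described in this report. Anything other than those two cases is reported as unknown.

```go
package main

import (
	"fmt"
	"net/url"
)

const (
	sourceEtcd       = "etcd"
	sourceWatchCache = "watch cache"
	sourceUnknown    = "unknown"
)

// classifyList reports where a logged LIST request is expected to be served
// from, based only on its resourceVersion query parameter:
//   - unset or empty -> etcd (most recent data is requested)
//   - "0"            -> kube-apiserver's watch cache (any version is fine)
//
// WATCH requests and other resourceVersion values are returned as unknown,
// because this sketch does not model them.
func classifyList(requestURI string) (string, error) {
	u, err := url.ParseRequestURI(requestURI)
	if err != nil {
		return sourceUnknown, err
	}
	q := u.Query()
	if q.Get("watch") == "true" {
		return sourceUnknown, nil // a WATCH, not a LIST
	}
	switch q.Get("resourceVersion") {
	case "":
		return sourceEtcd, nil
	case "0":
		return sourceWatchCache, nil
	default:
		return sourceUnknown, nil
	}
}

func main() {
	// Made-up URIs shaped like the ones discussed in this report.
	for _, uri := range []string{
		"/api/v1/pods?limit=500",
		"/api/v1/pods?resourceVersion=0",
	} {
		src, err := classifyList(uri)
		if err != nil {
			fmt.Println("cannot parse:", uri, err)
			continue
		}
		fmt.Printf("%s -> %s\n", uri, src)
	}
}
```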