
Application Controller's live state cache doesn't use watch cache when talking to Kubernetes #18838

Open
3 tasks done
tosi3k opened this issue Jun 27, 2024 · 4 comments
Labels
bug Something isn't working component:core Syncing, diffing, cluster state cache type:bug version:2.14 Latest confirmed affected version is 2.14

Comments

@tosi3k

tosi3k commented Jun 27, 2024

Checklist:

  • I've searched in the docs and FAQ for my answer: https://bit.ly/argocd-faq.
  • I've included steps to reproduce the bug.
  • I've pasted the output of argocd version.

Describe the bug

When maintaining a registered cluster's live state cache to track the state of K8s resources, the Application Controller uses an API call pattern that performs poorly at scale, especially when there are a lot of resources of a particular kind.

At the moment, for every API resource kind the controller spawns a separate goroutine that:

  • issues a paginated LIST API call (default page size of 500, no resourceVersion set, so it is served directly from etcd),
  • opens a WATCH starting from the resourceVersion returned by that list,
  • and repeats this cycle every 10 minutes (see the logs below).

This approach has a couple of problems:

  • Lists issued to etcd are much more heavy-weight than lists issued to the kube-apiserver's watch cache.
    • When using the watch cache, kube-apiserver simply gets a copy of all of the resources from the cache (which already contains deserialized data) and sends it to the client.
    • Otherwise, kube-apiserver has to fetch the data directly from etcd (putting non-trivial load on it) and decode and deserialize every object it retrieves (see the Go sketch after this list for the difference in request shape).
  • The default page size of 500 for K8s API calls results in many paginated etcd list calls when there are lots of resources of a particular kind.
    • This multiplies the etcd load described in the previous point.
    • With a huge number of objects of a particular kind, e.g. 150 thousand Pods, a paginated LIST with such a small page size takes ages and may keep failing with 410 Gone errors once the list falls out of the etcd compaction window (which defaults to 1 min).
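
For illustration, here is a minimal client-go sketch (not Argo CD's actual code; the clientset setup is assumed) contrasting the two list modes: the paginated, etcd-backed LIST the controller issues today versus a LIST served from the kube-apiserver's watch cache.

package main

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Build a clientset from the local kubeconfig (the details don't matter here).
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	cs := kubernetes.NewForConfigOrDie(cfg)
	ctx := context.Background()

	// What shows up in the logs today: no resourceVersion plus an explicit page
	// size, so every page is a quorum read answered directly from etcd
	// ("/api/v1/pods?limit=500").
	_, _ = cs.CoreV1().Pods("").List(ctx, metav1.ListOptions{Limit: 500})

	// Watch-cache-friendly variant: resourceVersion=0 means "any version is
	// acceptable", which lets kube-apiserver reply from its in-memory watch
	// cache without touching etcd ("/api/v1/pods?resourceVersion=0").
	_, _ = cs.CoreV1().Pods("").List(ctx, metav1.ListOptions{ResourceVersion: "0"})
}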

To Reproduce

Follow steps 1-6 from the Getting Started guide to register any cluster in Argo CD with a default setup.

Observe the kube-apiserver logs to see periodic (every 10 minutes) LISTs of all resources issued directly to etcd (no resourceVersion parameter in the URI of the logged API request) rather than served from the kube-apiserver's watch cache (which would show resourceVersion=0 in the URI).

Expected behavior

argocd-application-controller's live state cache properly implements the List&Watch pattern when tracking the state of cluster resources: it issues an initial LIST API call served from the watch cache (i.e. with resourceVersion=0) and from then on follows it with WATCH requests only (with an increasing resourceVersion), as sketched below.
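
A rough client-go sketch of that pattern, for illustration only (the function name and setup are hypothetical; in practice client-go's Reflector / shared informers implement this properly, including re-listing on expired resource versions):

package livecache

import (
	"context"
	"fmt"

	"k8s.io/apimachinery/pkg/api/meta"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

func listAndWatchPods(ctx context.Context, cs kubernetes.Interface) error {
	// Initial LIST answered from the kube-apiserver's watch cache, not etcd.
	list, err := cs.CoreV1().Pods("").List(ctx, metav1.ListOptions{ResourceVersion: "0"})
	if err != nil {
		return err
	}
	rv := list.ResourceVersion

	// From here on, only WATCH calls with an ever-advancing resourceVersion;
	// no periodic full re-lists.
	for {
		w, err := cs.CoreV1().Pods("").Watch(ctx, metav1.ListOptions{
			ResourceVersion:     rv,
			AllowWatchBookmarks: true,
		})
		if err != nil {
			return err // only an expired RV (410 Gone) would warrant a fresh list
		}
		for ev := range w.ResultChan() {
			if acc, err := meta.Accessor(ev.Object); err == nil {
				rv = acc.GetResourceVersion() // bookmarks keep the RV fresh between real events
			}
			fmt.Printf("observed %s event\n", ev.Type)
		}
		w.Stop()
	}
}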

Screenshots

Version

argocd: v2.10.9+c071af8
  BuildDate: 2024-04-30T16:39:16Z
  GitCommit: c071af808170bfc39cbdf6b9be4d0212dd66db0c
  GitTreeState: clean
  GoVersion: go1.21.9
  Compiler: gc
  Platform: linux/amd64
argocd-server: v2.10.9+c071af8

Logs

Logs from kube-apiserver for Pods from my small dev cluster I used for debugging:

INFO 2024-06-27T16:26:36.948644Z "HTTP" verb="LIST" URI="/api/v1/pods?limit=500" latency="52.720934ms" userAgent="argocd-application-controller/v0.0.0 (linux/amd64) kubernetes/$Format" audit-ID="2a65317a-6c0d-4cb8-9550-04ca9cfd5d0a" srcIP="34.16.56.103:1634" apf_pl="workload-high" apf_fs="kube-system-service-accounts" apf_iseats=1 apf_fseats=0 apf_additionalLatency="0s" apf_execution_time="52.044796ms" resp=200
INFO 2024-06-27T16:26:37.229913Z "HTTP" verb="WATCH" URI="/api/v1/pods?allowWatchBookmarks=true&resourceVersion=2174400&watch=true" latency="1.275025ms" userAgent="argocd-application-controller/v0.0.0 (linux/amd64) kubernetes/$Format" audit-ID="16b9589b-636a-4ce4-9ffb-8693c6ab20d0" srcIP="34.16.56.103:1634" apf_pl="workload-high" apf_fs="kube-system-service-accounts" apf_iseats=1 apf_fseats=0 apf_additionalLatency="0s" apf_init_latency="734.606µs" apf_execution_time="736.272µs" resp=0
INFO 2024-06-27T16:36:37.960481Z "HTTP" verb="LIST" URI="/api/v1/pods?limit=500" latency="64.554862ms" userAgent="argocd-application-controller/v0.0.0 (linux/amd64) kubernetes/$Format" audit-ID="02353768-038e-4011-ba95-be3b75b7c768" srcIP="34.16.56.103:1634" apf_pl="workload-high" apf_fs="kube-system-service-accounts" apf_iseats=1 apf_fseats=0 apf_additionalLatency="0s" apf_execution_time="63.743979ms" resp=200
INFO 2024-06-27T16:36:38.226185Z "HTTP" verb="WATCH" URI="/api/v1/pods?allowWatchBookmarks=true&resourceVersion=2181052&watch=true" latency="1.345608ms" userAgent="argocd-application-controller/v0.0.0 (linux/amd64) kubernetes/$Format" audit-ID="9bc33fd4-7352-4b13-aa83-c8240da8948c" srcIP="34.16.56.103:1634" apf_pl="workload-high" apf_fs="kube-system-service-accounts" apf_iseats=1 apf_fseats=0 apf_additionalLatency="0s" apf_init_latency="564.697µs" apf_execution_time="566.634µs" resp=0
INFO 2024-06-27T16:46:38.954864Z "HTTP" verb="LIST" URI="/api/v1/pods?limit=500" latency="58.856642ms" userAgent="argocd-application-controller/v0.0.0 (linux/amd64) kubernetes/$Format" audit-ID="140aaf47-e307-4eaa-b446-098f43f6d292" srcIP="34.16.56.103:1634" apf_pl="workload-high" apf_fs="kube-system-service-accounts" apf_iseats=1 apf_fseats=0 apf_additionalLatency="0s" apf_execution_time="58.256763ms" resp=200
INFO 2024-06-27T16:46:39.237466Z "HTTP" verb="WATCH" URI="/api/v1/pods?allowWatchBookmarks=true&resourceVersion=2187695&watch=true" latency="1.368998ms" userAgent="argocd-application-controller/v0.0.0 (linux/amd64) kubernetes/$Format" audit-ID="f61a7719-25c7-4fa5-887c-aadd5cd23b02" srcIP="34.16.56.103:1634" apf_pl="workload-high" apf_fs="kube-system-service-accounts" apf_iseats=1 apf_fseats=0 apf_additionalLatency="0s" apf_init_latency="661.437µs" apf_execution_time="663.141µs" resp=0
@tosi3k tosi3k added the bug Something isn't working label Jun 27, 2024
@tosi3k tosi3k changed the title from "Application Controller Live state cache doesn't use watch cache when talking to Kubernetes" to "Application Controller's live state cache doesn't use watch cache when talking to Kubernetes" on Jun 27, 2024
@wojtek-t

/cc

@tosi3k
Author

tosi3k commented Jul 9, 2024

I'll try crafting some solution this week.

@andrii-korotkov-verkada
Contributor

ArgoCD versions 2.10 and below have reached EOL. Can you upgrade and let us know if the issue is still present, please?

@andrii-korotkov-verkada andrii-korotkov-verkada added the version:EOL Latest confirmed affected version has reached EOL label Nov 11, 2024
@tosi3k
Author

tosi3k commented Nov 20, 2024

@andrii-korotkov-verkada the issue is still present in Argo CD, unfortunately.

@andrii-korotkov-verkada andrii-korotkov-verkada added version:2.14 Latest confirmed affected version is 2.14 and removed version:EOL Latest confirmed affected version has reached EOL labels Nov 20, 2024