
Application Controller's live state cache doesn't use watch cache when talking to Kubernetes #18838

Open
3 tasks done
tosi3k opened this issue Jun 27, 2024 · 4 comments
Labels
bug Something isn't working component:core Syncing, diffing, cluster state cache type:bug version:2.14 Latest confirmed affected version is 2.14

Comments

@tosi3k

tosi3k commented Jun 27, 2024

Checklist:

  • I've searched in the docs and FAQ for my answer: https://bit.ly/argocd-faq.
  • I've included steps to reproduce the bug.
  • I've pasted the output of argocd version.

Describe the bug

When maintaining a registered cluster's live state cache to track the state of K8s resources, the Application Controller uses an API call pattern that performs poorly at scale, especially when there are a lot of resources of a particular kind.

At the moment, for every API resource kind the controller spawns a separate goroutine that:

  • issues a paginated LIST API call (default page size of 500, no resourceVersion set, so it is served directly from etcd),
  • opens a WATCH starting from the resourceVersion returned by that list,
  • and repeats this cycle every 10 minutes (see the logs below).

This approach has a couple of problems:

  • Lists issued to etcd are much more heavy-weight than lists issued to the kube-apiserver's watch cache.
    • When using the watch cache, kube-apiserver simply gets a copy of all of the resources from the cache (which already contains deserialized data) and sends it to the client.
    • Otherwise, kube-apiserver has to fetch the data directly from etcd (putting non-trivial load on it) and decode and deserialize every object it retrieves (see the Go sketch after this list for the difference in request shape).
  • The default page size of 500 for K8s API calls results in many paginated etcd list calls when there are lots of resources of a particular kind.
    • This multiplies the etcd load described in the previous point.
    • With a huge number of objects of a particular kind, e.g. 150 thousand Pods, a paginated LIST with such a small page size takes ages and may keep failing with 410 Gone errors once the list falls out of the etcd compaction window (which defaults to 1 min).
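
For illustration, here is a minimal client-go sketch (not Argo CD's actual code; the clientset setup is assumed) contrasting the two list modes: the paginated, etcd-backed LIST the controller issues today versus a LIST served from the kube-apiserver's watch cache.

package main

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Build a clientset from the local kubeconfig (the details don't matter here).
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	cs := kubernetes.NewForConfigOrDie(cfg)
	ctx := context.Background()

	// What shows up in the logs today: no resourceVersion plus an explicit page
	// size, so every page is a quorum read answered directly from etcd
	// ("/api/v1/pods?limit=500").
	_, _ = cs.CoreV1().Pods("").List(ctx, metav1.ListOptions{Limit: 500})

	// Watch-cache-friendly variant: resourceVersion=0 means "any version is
	// acceptable", which lets kube-apiserver reply from its in-memory watch
	// cache without touching etcd ("/api/v1/pods?resourceVersion=0").
	_, _ = cs.CoreV1().Pods("").List(ctx, metav1.ListOptions{ResourceVersion: "0"})
}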

To Reproduce

Follow steps 1-6 from the Getting Started guide to register any cluster in Argo CD with a default setup.

Observe the kube-apiserver logs to see periodic (every 10 minutes) LISTs of all resources issued directly to etcd (no resourceVersion parameter in the URI of the logged API request) rather than served from the kube-apiserver's watch cache (which would show resourceVersion=0 in the URI).

Expected behavior

argocd-application-controller's live state cache properly implements the List&Watch pattern when tracking the state of cluster resources: it issues an initial LIST API call served from the watch cache (i.e. with resourceVersion=0) and from then on follows it with WATCH requests only (with an increasing resourceVersion), as sketched below.
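
A rough client-go sketch of that pattern, for illustration only (the function name and setup are hypothetical; in practice client-go's Reflector / shared informers implement this properly, including re-listing on expired resource versions):

package livecache

import (
	"context"
	"fmt"

	"k8s.io/apimachinery/pkg/api/meta"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

func listAndWatchPods(ctx context.Context, cs kubernetes.Interface) error {
	// Initial LIST answered from the kube-apiserver's watch cache, not etcd.
	list, err := cs.CoreV1().Pods("").List(ctx, metav1.ListOptions{ResourceVersion: "0"})
	if err != nil {
		return err
	}
	rv := list.ResourceVersion

	// From here on, only WATCH calls with an ever-advancing resourceVersion;
	// no periodic full re-lists.
	for {
		w, err := cs.CoreV1().Pods("").Watch(ctx, metav1.ListOptions{
			ResourceVersion:     rv,
			AllowWatchBookmarks: true,
		})
		if err != nil {
			return err // only an expired RV (410 Gone) would warrant a fresh list
		}
		for ev := range w.ResultChan() {
			if acc, err := meta.Accessor(ev.Object); err == nil {
				rv = acc.GetResourceVersion() // bookmarks keep the RV fresh between real events
			}
			fmt.Printf("observed %s event\n", ev.Type)
		}
		w.Stop()
	}
}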

Screenshots

Version

argocd: v2.10.9+c071af8
  BuildDate: 2024-04-30T16:39:16Z
  GitCommit: c071af808170bfc39cbdf6b9be4d0212dd66db0c
  GitTreeState: clean
  GoVersion: go1.21.9
  Compiler: gc
  Platform: linux/amd64
argocd-server: v2.10.9+c071af8

Logs

Logs from kube-apiserver for Pods from my small dev cluster I used for debugging:

INFO 2024-06-27T16:26:36.948644Z "HTTP" verb="LIST" URI="/api/v1/pods?limit=500" latency="52.720934ms" userAgent="argocd-application-controller/v0.0.0 (linux/amd64) kubernetes/$Format" audit-ID="2a65317a-6c0d-4cb8-9550-04ca9cfd5d0a" srcIP="34.16.56.103:1634" apf_pl="workload-high" apf_fs="kube-system-service-accounts" apf_iseats=1 apf_fseats=0 apf_additionalLatency="0s" apf_execution_time="52.044796ms" resp=200
INFO 2024-06-27T16:26:37.229913Z "HTTP" verb="WATCH" URI="/api/v1/pods?allowWatchBookmarks=true&resourceVersion=2174400&watch=true" latency="1.275025ms" userAgent="argocd-application-controller/v0.0.0 (linux/amd64) kubernetes/$Format" audit-ID="16b9589b-636a-4ce4-9ffb-8693c6ab20d0" srcIP="34.16.56.103:1634" apf_pl="workload-high" apf_fs="kube-system-service-accounts" apf_iseats=1 apf_fseats=0 apf_additionalLatency="0s" apf_init_latency="734.606µs" apf_execution_time="736.272µs" resp=0
INFO 2024-06-27T16:36:37.960481Z "HTTP" verb="LIST" URI="/api/v1/pods?limit=500" latency="64.554862ms" userAgent="argocd-application-controller/v0.0.0 (linux/amd64) kubernetes/$Format" audit-ID="02353768-038e-4011-ba95-be3b75b7c768" srcIP="34.16.56.103:1634" apf_pl="workload-high" apf_fs="kube-system-service-accounts" apf_iseats=1 apf_fseats=0 apf_additionalLatency="0s" apf_execution_time="63.743979ms" resp=200
INFO 2024-06-27T16:36:38.226185Z "HTTP" verb="WATCH" URI="/api/v1/pods?allowWatchBookmarks=true&resourceVersion=2181052&watch=true" latency="1.345608ms" userAgent="argocd-application-controller/v0.0.0 (linux/amd64) kubernetes/$Format" audit-ID="9bc33fd4-7352-4b13-aa83-c8240da8948c" srcIP="34.16.56.103:1634" apf_pl="workload-high" apf_fs="kube-system-service-accounts" apf_iseats=1 apf_fseats=0 apf_additionalLatency="0s" apf_init_latency="564.697µs" apf_execution_time="566.634µs" resp=0
INFO 2024-06-27T16:46:38.954864Z "HTTP" verb="LIST" URI="/api/v1/pods?limit=500" latency="58.856642ms" userAgent="argocd-application-controller/v0.0.0 (linux/amd64) kubernetes/$Format" audit-ID="140aaf47-e307-4eaa-b446-098f43f6d292" srcIP="34.16.56.103:1634" apf_pl="workload-high" apf_fs="kube-system-service-accounts" apf_iseats=1 apf_fseats=0 apf_additionalLatency="0s" apf_execution_time="58.256763ms" resp=200
INFO 2024-06-27T16:46:39.237466Z "HTTP" verb="WATCH" URI="/api/v1/pods?allowWatchBookmarks=true&resourceVersion=2187695&watch=true" latency="1.368998ms" userAgent="argocd-application-controller/v0.0.0 (linux/amd64) kubernetes/$Format" audit-ID="f61a7719-25c7-4fa5-887c-aadd5cd23b02" srcIP="34.16.56.103:1634" apf_pl="workload-high" apf_fs="kube-system-service-accounts" apf_iseats=1 apf_fseats=0 apf_additionalLatency="0s" apf_init_latency="661.437µs" apf_execution_time="663.141µs" resp=0
@tosi3k tosi3k added the bug Something isn't working label Jun 27, 2024
@tosi3k tosi3k changed the title from "Application Controller Live state cache doesn't use watch cache when talking to Kubernetes" to "Application Controller's live state cache doesn't use watch cache when talking to Kubernetes" on Jun 27, 2024
@wojtek-t

/cc

@tosi3k
Author

tosi3k commented Jul 9, 2024

I'll try crafting some solution this week.

@andrii-korotkov-verkada
Contributor

ArgoCD versions 2.10 and below have reached EOL. Can you upgrade and let us know if the issue is still present, please?

@andrii-korotkov-verkada andrii-korotkov-verkada added the version:EOL Latest confirmed affected version has reached EOL label Nov 11, 2024
@tosi3k
Author

tosi3k commented Nov 20, 2024

@andrii-korotkov-verkada the issue is still present in Argo CD, unfortunately.

@andrii-korotkov-verkada andrii-korotkov-verkada added version:2.14 Latest confirmed affected version is 2.14 and removed version:EOL Latest confirmed affected version has reached EOL labels Nov 20, 2024