gcsfuse gcs-fuse-csi-driver increase default memory capacity #372

raed114 · 2024-11-11T13:52:27Z

What you would like to accomplish:
Increasing the default amount of the GCSFuse sidecar's memory from 100Mi to 200Mi instead of using a mutation webhook.

How this should work:
Implemented either automatically and/or will have a native option to increase the memory of the sidecar container.*

Explanation of the problem:
The gcs-fuse-csi-driver sidecar container seems to be repeatedly restarting and shows as OOMKilled all the while the node doesn't seem to run out of resources. or doesn't have any memory pressure issues. While the restart of this container eventually happened, it took 3 hours in this case (instead of a few minutes), which made the workload unresponsive, it became stuck until the eventual eviction of the pod

Several error messages after a deep investigation into it:

MountVolume.SetUp failed for volume "<VOLUME_NAME>" : kubernetes.io/csi: mounter.SetUpAt failed to determine if the node service has VOLUME_MOUNT_GROUP capability: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial unix /var/lib/kubelet/plugins/gcsfuse.csi.storage.gke.io/csi.sock: connect: connection refused"
(combined from similar events): Memory cgroup out of memory: Killed process 2960932 (gcs-fuse-csi-dr) total-vm:1340248kB, anon-rss:100796kB, file-rss:26844kB, shmem-rss:0kB, UID:0 pgtables:372kB oom_score_adj:-997

Since this sidecar is created automatically and is managed by default, problems like these can cause serious downtime for the workload.

The text was updated successfully, but these errors were encountered:

hime · 2024-11-18T03:45:11Z

Hi @raed114,

The connection refused error tends to be related to unavailability from the gcs-fuse-csi-driver, which makes me suspect there's something deeper going on. Could you provide the following?

Are you using GKE Autopilot?
Are you using managed driver on GKE?
Could you verify the gcsfuse-node-* pod is healthy on your nodes?
Could you share how many gcsfuse-backed pods you are running per VM/Node?
Could you share the Cluster ID with me? You can get the id by running gcloud container clusters describe <cluster-name> --location <cluster-location> | grep id:

Ahmed-Alhameedawi · 2024-12-17T15:31:00Z

Do you happen to have any updates on this? I have a similar issue; the problem occurs when copying any file over 1Gib in size.

hime · 2024-12-19T01:56:53Z

Hi @Ahmed-Alhameedawi, please refer to our Public Documentation where you can adjust sidecar resources according to your workload needs.

roelschr-ft · 2025-03-11T14:36:52Z

Hi @hime I'm also facing similar issues. We are using Argo Workflows that will mount GCS buckets via the CSI driver. We encountered this issue when submitting many workflow requests in parallel. Here are some more info from our side:

GKE managed CSI driver (enabled via terraform);
Pod status:

gcsfusecsi-node-p2z69                                      1/2     CrashLoopBackOff   31 (3m11s ago)   4d1h
gcsfusecsi-node-qfk7j                                      1/2     CrashLoopBackOff   29 (4m1s ago)    4d1h
gcsfusecsi-node-sqdwh                                      1/2     CrashLoopBackOff   27 (3m6s ago)    4d1h

Pod desc:

    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       OOMKilled
      Exit Code:    137
      Started:      Tue, 11 Mar 2025 15:27:11 +0100
      Finished:     Tue, 11 Mar 2025 15:27:12 +0100
    Ready:          False

...
Events:
  Type     Reason   Age                     From     Message
  ----     ------   ----                    ----     -------
  Warning  BackOff  3m14s (x525 over 118m)  kubelet  Back-off restarting failed container gcs-fuse-csi-driver in pod gcsfusecsi-node-sqdwh_kube-system(4e236cb2-33dc-4504-8489-00afef490824)

Node desc:

Non-terminated Pods:          (8 in total)
  Namespace                   Name                         CPU Requests  CPU Limits  Memory Requests  Memory Limits  Age
  ---------                   ----                         ------------  ----------  ---------------  -------------  ---
  kube-system                 anetd-fvcjx                  205m (5%)     1 (25%)     230Mi (1%)       65Mi (0%)      4d1h
  kube-system                 fluentbit-gke-rq9jd          105m (2%)     1 (25%)     230Mi (1%)       565Mi (4%)     4d1h
  kube-system                 gcsfusecsi-node-sqdwh        10m (0%)      0 (0%)      30Mi (0%)        150Mi (1%)     4d1h
  kube-system                 ip-masq-agent-7pqfw          10m (0%)      0 (0%)      16Mi (0%)        0 (0%)         4d1h
  kube-system                 netd-45jcf                   8m (0%)       1 (25%)     65Mi (0%)        65Mi (0%)      4d1h
  kube-system                 pdcsi-node-xtrdg             10m (0%)      0 (0%)      20Mi (0%)        500Mi (3%)     4d1h
  mmsm                        metadata-56bcb478f7-wvk49    250m (6%)     500m (12%)  512Mi (3%)       512Mi (3%)     23h
  mmsm                        recval-64c7fbcb45-r7q7p      250m (6%)     250m (6%)   512Mi (3%)       512Mi (3%)     175m
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests      Limits
  --------           --------      ------
  cpu                848m (21%)    3750m (95%)
  memory             1615Mi (12%)  2369Mi (17%)
  ephemeral-storage  0 (0%)        0 (0%)
  hugepages-1Gi      0 (0%)        0 (0%)
  hugepages-2Mi      0 (0%)        0 (0%)
Events:
  Type     Reason      Age                 From            Message
  ----     ------      ----                ----            -------
  Warning  OOMKilling  80s (x20 over 99m)  kernel-monitor  (combined from similar events): Memory cgroup out of memory: Killed process 1291761 (gcs-fuse-csi-dr) total-vm:1405596kB, anon-rss:100680kB, file-rss:31928kB, shmem-rss:0kB, UID:0 pgtables:376kB oom_score_adj:-997

workflow pod events:

  Warning  FailedMount  2m51s (x107 over 3h24m)  kubelet  MountVolume.SetUp failed for volume "bucket0" : kubernetes.io/csi: mounter.SetUpAt failed to determine if the node service has VOLUME_MOUNT_GROUP capability: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial unix /var/lib/kubelet/plugins/gcsfuse.csi.storage.gke.io/csi.sock: connect: connection refused"

hime · 2025-03-11T17:20:57Z

@roelschr-ft could you provide GKE version that you are using?

roelschr-ft · 2025-03-12T08:36:33Z

@roelschr-ft could you provide GKE version that you are using?

cluster: 1.30.8-gke.1261000
nodes: 1.27.3-gke.100

raj-prince mentioned this issue Nov 17, 2024

gcsfuse gcs-fuse-csi-driver increase default memory capacity GoogleCloudPlatform/gcsfuse#2682

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

gcsfuse gcs-fuse-csi-driver increase default memory capacity #372

gcsfuse gcs-fuse-csi-driver increase default memory capacity #372

raed114 commented Nov 11, 2024

hime commented Nov 18, 2024 •

edited

Loading

Ahmed-Alhameedawi commented Dec 17, 2024

hime commented Dec 19, 2024

roelschr-ft commented Mar 11, 2025 •

edited

Loading

hime commented Mar 11, 2025

roelschr-ft commented Mar 12, 2025 •

edited

Loading

gcsfuse gcs-fuse-csi-driver increase default memory capacity #372

gcsfuse gcs-fuse-csi-driver increase default memory capacity #372

Comments

raed114 commented Nov 11, 2024

hime commented Nov 18, 2024 • edited Loading

Ahmed-Alhameedawi commented Dec 17, 2024

hime commented Dec 19, 2024

roelschr-ft commented Mar 11, 2025 • edited Loading

hime commented Mar 11, 2025

roelschr-ft commented Mar 12, 2025 • edited Loading

hime commented Nov 18, 2024 •

edited

Loading

roelschr-ft commented Mar 11, 2025 •

edited

Loading

roelschr-ft commented Mar 12, 2025 •

edited

Loading