
Serve an LLM with multiple GPUs in GKE doesn't work and fails with "The node was low on resource: ephemeral-storage." #1581

@raushan2016

Description

2025-01-03 11:06:50.367 PST
2025-01-03T19:06:50.366843Z  INFO text_generation_launcher: Download: [13/30] -- ETA: 0:06:42.769236
  • Instead of /data, another mount (/etc/hosts, on /dev/nvme0n1p1, which reports the same usage as the overlay root /) is filling up and eventually runs out of disk space.
root@llm-689555d8bf-62gjd:/etc# df
Filesystem     1K-blocks     Used Available Use% Mounted on
overlay         98831908 75370476  23445048  77% /
tmpfs              65536        0     65536   0% /dev
/dev/nvme0n2   153707984       28 153691572   1% /data
tmpfs           62914560       12  62914548   1% /dev/shm
/dev/nvme0n1p1  98831908 75370476  23445048  77% /etc/hosts
tmpfs           62914560       12  62914548   1% /run/secrets/kubernetes.io/serviceaccount
tmpfs           49441048        0  49441048   0% /proc/acpi
tmpfs           49441048        0  49441048   0% /proc/scsi
tmpfs           49441048        0  49441048   0% /sys/firmware
root@llm-fb5d99cb-569b7:/usr/src# df  
Filesystem     1K-blocks      Used Available Use% Mounted on
overlay         98831908  25463724  73351800  26% /
tmpfs              65536         0     65536   0% /dev
/dev/nvme0n2   153707984 137809580  15882020  90% /data
tmpfs           62914560     48980  62865580   1% /dev/shm
/dev/nvme0n1p1  98831908  25463724  73351800  26% /etc/hosts
tmpfs           62914560        12  62914548   1% /run/secrets/kubernetes.io/serviceaccount
tmpfs           49439752         0  49439752   0% /proc/acpi
tmpfs           49439752         0  49439752   0% /proc/scsi
tmpfs           49439752         0  49439752   0% /sys/firmware
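
To make the comparison between the two `df` outputs above easier to repeat, here is a minimal Python sketch that parses plain `df` text and lists the mounts at or above a usage threshold. The helper names `parse_df` and `mounts_over` are hypothetical (they are not part of the sample or of any GKE tooling), and the sample text is a few rows copied from the first output above.

```python
def parse_df(text):
    """Parse plain `df` output into a list of dicts, one per mount."""
    rows = []
    for line in text.strip().splitlines():
        parts = line.split()
        # Skip the header and anything that does not look like a data row.
        if len(parts) < 6 or not parts[1].isdigit():
            continue
        rows.append({
            "filesystem": parts[0],
            "used_1k": int(parts[2]),
            "use_pct": int(parts[4].rstrip("%")),
            "mount": " ".join(parts[5:]),
        })
    return rows


def mounts_over(rows, threshold_pct):
    """Return mount points whose Use% is at or above threshold_pct."""
    return [r["mount"] for r in rows if r["use_pct"] >= threshold_pct]


# Rows copied from the first df output in this issue.
SAMPLE = """
Filesystem     1K-blocks     Used Available Use% Mounted on
overlay         98831908 75370476  23445048  77% /
/dev/nvme0n2   153707984       28 153691572   1% /data
tmpfs           62914560       12  62914548   1% /dev/shm
/dev/nvme0n1p1  98831908 75370476  23445048  77% /etc/hosts
"""

if __name__ == "__main__":
    print(mounts_over(parse_df(SAMPLE), 70))
```

On the rows above this flags `/` and `/etc/hosts` while `/data` stays nearly empty, which matches the symptom described: the download is landing on the node's boot disk rather than on the `/data` volume.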

NOTE: Other samples might also be impacted by the above change, since we don't have any automated gates for validation.
