Open
Description
- Impacted sample: https://github.com/GoogleCloudPlatform/kubernetes-engine-samples/blob/main/ai-ml/llm-multiple-gpus/llama3-70b/text-generation-inference.yaml
- Impacted public tutorial: https://cloud.google.com/kubernetes-engine/docs/tutorials/serve-multiple-gpu#autopilot
- The tutorial doesn't work end to end, and the container keeps crashing with "The node was low on resource: ephemeral-storage. Threshold quantity: 10120387530, available: 9166736Ki. Container llm was using 57900024Ki, request is 0, has larger consumption of ephemeral-storage"
- The container log is stuck at download 13/30:
2025-01-03 11:06:50.367 PST
2025-01-03T19:06:50.366843Z  INFO text_generation_launcher: Download: [13/30] -- ETA: 0:06:42.769236
- For some reason, the download is not filling /data. Instead, the filesystem mounted at / and /etc/hosts (/dev/nvme0n1p1) fills up and eventually runs out of disk space:
root@llm-689555d8bf-62gjd:/etc# df
Filesystem 1K-blocks Used Available Use% Mounted on
overlay 98831908 75370476 23445048 77% /
tmpfs 65536 0 65536 0% /dev
/dev/nvme0n2 153707984 28 153691572 1% /data
tmpfs 62914560 12 62914548 1% /dev/shm
/dev/nvme0n1p1 98831908 75370476 23445048 77% /etc/hosts
tmpfs 62914560 12 62914548 1% /run/secrets/kubernetes.io/serviceaccount
tmpfs 49441048 0 49441048 0% /proc/acpi
tmpfs 49441048 0 49441048 0% /proc/scsi
tmpfs 49441048 0 49441048 0% /sys/firmware
- The suspected cause is this change in the underlying Git repo: "Update references to Hugging Face DLC for TGI" (#1495).
- Using the old image
ghcr.io/huggingface/text-generation-inference:2.0.4
results in a successful run. The disk usage for that run is below.
root@llm-fb5d99cb-569b7:/usr/src# df
Filesystem 1K-blocks Used Available Use% Mounted on
overlay 98831908 25463724 73351800 26% /
tmpfs 65536 0 65536 0% /dev
/dev/nvme0n2 153707984 137809580 15882020 90% /data
tmpfs 62914560 48980 62865580 1% /dev/shm
/dev/nvme0n1p1 98831908 25463724 73351800 26% /etc/hosts
tmpfs 62914560 12 62914548 1% /run/secrets/kubernetes.io/serviceaccount
tmpfs 49439752 0 49439752 0% /proc/acpi
tmpfs 49439752 0 49439752 0% /proc/scsi
tmpfs 49439752 0 49439752 0% /sys/firmware
NOTE: Other samples might also be impacted by the above change, since we don't have any automated gates for validation.
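To make the numbers above easier to compare, here is a rough sketch that converts the quantities from the eviction message and the two `df` outputs to GiB. The values are copied from the outputs above. The helper name `kib_to_gib` and the dictionary layout are only illustrative, and the sketch assumes `Ki` and `df`'s 1K-blocks both mean 1024 bytes.

```python
KIB_PER_GIB = 1024 ** 2   # KiB in one GiB
BYTES_PER_GIB = 1024 ** 3


def kib_to_gib(kib: int) -> float:
    """Convert a KiB count (df 1K-blocks or a Kubernetes 'Ki' quantity) to GiB."""
    return kib / KIB_PER_GIB


# Values from the eviction message.
threshold_gib = 10_120_387_530 / BYTES_PER_GIB   # threshold is reported in bytes
available_gib = kib_to_gib(9_166_736)
container_gib = kib_to_gib(57_900_024)

# "Used" column from the two df outputs, in KiB.
used_new_image = {"/": 75_370_476, "/data": 28}
used_old_image = {"/": 25_463_724, "/data": 137_809_580}

print(f"eviction threshold:       {threshold_gib:.2f} GiB")
print(f"available at eviction:    {available_gib:.2f} GiB")
print(f"ephemeral usage of 'llm': {container_gib:.2f} GiB")

for mount in ("/", "/data"):
    new_gib = kib_to_gib(used_new_image[mount])
    old_gib = kib_to_gib(used_old_image[mount])
    print(f"{mount:<6} new image: {new_gib:7.2f} GiB   old image: {old_gib:7.2f} GiB")

# Extra data on the root filesystem with the new image, relative to the old one.
root_growth_gib = kib_to_gib(used_new_image["/"] - used_old_image["/"])
print(f"extra data on / with the new image: {root_growth_gib:.2f} GiB")
```

With the new image, the root filesystem holds tens of GiB more than with the old image while /data stays essentially empty, which is consistent with the download landing on ephemeral storage instead of the /data volume. Note that the `df` snapshots and the eviction message were captured at different moments, so the figures are not expected to match exactly.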