Description
Hello! I was just trying out llgtrt yesterday and was blown away by how easy it was to get LoRA adapters up and running. I had been struggling for over a week to get Phi4 + LoRA working with TensorRT-LLM + Triton Server and finally gave up. The performance of this server is phenomenal as well, and I am so happy not to have to mess with those gigantic config.pbtxt files that change every release. Wow, I am so impressed with this project! Thank you!
One thing I noticed, though, is that the Docker image is quite large: 35 GB as measured by dive. It includes things like the Rust toolchain, a pile of Python libraries, etc. that are not required at runtime. Much of the fault lies with the NVIDIA TensorRT container, which is a total mess.
I was able to get the image size down to 7.7 GB (78% smaller) by having this as the final stage:
```dockerfile
FROM nvcr.io/nvidia/cuda:12.8.1-runtime-ubuntu24.04 AS llgtrt_prod

RUN DEBIAN_FRONTEND=noninteractive apt-get update \
    && apt-get upgrade -y \
    && apt-get install -y --no-install-recommends \
        # These are runtime dependencies of tensorrt_llm
        libpython3.12-dev \
        libopenmpi-dev \
    && rm -rf /var/lib/apt/lists/*

COPY --from=llgtrt_builder /usr/local/lib/python3.12/dist-packages/tensorrt_llm/libs /usr/local/lib
COPY --from=llgtrt_builder /usr/lib/x86_64-linux-gnu/libnvinfer.so.10 /usr/local/lib/libnvinfer.so.10
COPY --from=llgtrt_builder /workspaces/llgtrt/target/release/llgtrt /usr/local/bin/llgtrt
```
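For anyone who wants to reproduce this, something along these lines should build just the trimmed stage and let you check the size (the llgtrt_prod stage name comes from the snippet above; the image tag is only an example):

```bash
# Build only the trimmed final stage (stage name from the snippet above,
# tag is illustrative)
docker build --target llgtrt_prod -t llgtrt:runtime-test .

# Inspect the resulting size and layers
docker image ls llgtrt
dive llgtrt:runtime-test
```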
I haven't tested the full range of llgtrt capabilities, but Phi4 with five LoRA adapters works great with this as the final image.
I recommend producing two Docker images: one for model building (which should also include the LoRA export script, btw) and one for runtime. To avoid breaking changes, the runtime stage could be published as llgtrt:<version>-runtime; a rough sketch of the publishing flow is below.
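Something like this could produce and publish the two variants (just a sketch: the stage names match the snippet above, and the repository and version are placeholders):

```bash
# Hypothetical release flow; repo name and version are placeholders.
VERSION=x.y.z

# Full build image (Rust toolchain, TensorRT-LLM build deps, LoRA export script)
docker build --target llgtrt_builder -t llgtrt:"$VERSION" .

# Slim runtime-only image
docker build --target llgtrt_prod -t llgtrt:"$VERSION"-runtime .

docker push llgtrt:"$VERSION"
docker push llgtrt:"$VERSION"-runtime
```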
Having a smaller image also helps with security (less attack surface) and auto-scale time (the image doesn't take as long to pull).