Commit 99b939a

Add note on hf-token for llama3 model (#386)
Signed-off-by: Sivanantham Chinnaiyan <[email protected]>
1 parent f991e85 commit 99b939a

File tree: 1 file changed (+48 −0)

  • docs/modelserving/v1beta1/llm/huggingface/text_generation

docs/modelserving/v1beta1/llm/huggingface/text_generation/README.md

Lines changed: 48 additions & 0 deletions
````diff
@@ -6,6 +6,23 @@ In this example, we demonstrate how to deploy `Llama3 model` for text generation
 KServe Hugging Face runtime by default uses vLLM to serve the LLM models for faster time-to-first-token (TTFT) and higher token generation throughput than the Hugging Face API. vLLM is implemented with common inference optimization techniques, such as paged attention, continuous batching and an optimized CUDA kernel.
 If the model is not supported by vLLM, KServe falls back to the Hugging Face backend as a failsafe.
 
+!!! note
+    The Llama3 model requires a Hugging Face Hub token to download the model. You can set the token using the `HF_TOKEN`
+    environment variable.
+
+Create a secret with the Hugging Face token.
+
+=== "Yaml"
+    ```yaml
+    apiVersion: v1
+    kind: Secret
+    metadata:
+      name: hf-secret
+    type: Opaque
+    stringData:
+      HF_TOKEN: <token>
+    ```
+
 === "Yaml"
 
 ```yaml
````
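The Secret above uses `stringData`, which accepts the token in plain text; Kubernetes base64-encodes it into the `data` field on admission. A minimal sketch of that encoding, using a hypothetical placeholder token value:

```python
import base64

# Hypothetical placeholder token; `stringData` accepts this raw value,
# while a Secret's `data` field must hold the base64-encoded form.
token = "hf_example_token"
encoded = base64.b64encode(token.encode()).decode()

# Decoding the `data` form recovers the original `stringData` value.
assert base64.b64decode(encoded).decode() == token
```

This is why the manifest can embed `<token>` directly under `stringData` without any manual encoding step.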
````diff
@@ -22,6 +39,13 @@ If the model is not supported by vLLM, KServe falls back to HuggingFace backend
       args:
       - --model_name=llama3
       - --model_id=meta-llama/meta-llama-3-8b-instruct
+      env:
+      - name: HF_TOKEN
+        valueFrom:
+          secretKeyRef:
+            name: hf-secret
+            key: HF_TOKEN
+            optional: false
       resources:
         limits:
           cpu: "6"
````
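Inside the container, the serving runtime simply reads the injected variable from the process environment; with `optional: false`, the pod will not start unless the Secret key exists, so the variable should always be present by the time the server runs. A minimal sketch of that lookup (the function name `resolve_hf_token` is illustrative, not KServe's API):

```python
import os

def resolve_hf_token(env=os.environ):
    # Kubernetes injects HF_TOKEN from the `hf-secret` Secret via secretKeyRef;
    # Hugging Face tooling picks it up from the environment to authenticate
    # model downloads from the Hub.
    token = env.get("HF_TOKEN")
    if token is None:
        raise RuntimeError("HF_TOKEN is not set; gated model download will fail")
    return token

# Simulated injected environment for illustration.
assert resolve_hf_token({"HF_TOKEN": "hf_example"}) == "hf_example"
```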
````diff
@@ -150,6 +174,23 @@ curl -H "content-type:application/json" -H "Host: ${SERVICE_HOSTNAME}" \
 You can use the `--backend=huggingface` argument to perform the inference using the Hugging Face API. KServe Hugging Face backend runtime also
 supports the OpenAI `/v1/completions` and `/v1/chat/completions` endpoints for inference.
 
+!!! note
+    The Llama3 model requires a Hugging Face Hub token to download the model. You can set the token using the `HF_TOKEN`
+    environment variable.
+
+Create a secret with the Hugging Face token.
+
+=== "Yaml"
+    ```yaml
+    apiVersion: v1
+    kind: Secret
+    metadata:
+      name: hf-secret
+    type: Opaque
+    stringData:
+      HF_TOKEN: <token>
+    ```
+
 === "Yaml"
 
 ```yaml
````
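As an alternative to applying the Secret manifest, the same object can be created imperatively; this is a sketch assuming the current kubeconfig context points at the namespace where the InferenceService will run:

```shell
# Create the hf-secret Secret with the HF_TOKEN key directly from a literal value.
kubectl create secret generic hf-secret --from-literal=HF_TOKEN=<token>
```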
````diff
@@ -167,6 +208,13 @@ supports the OpenAI `/v1/completions` and `/v1/chat/completions` endpoints for i
       - --model_name=llama3
       - --model_id=meta-llama/meta-llama-3-8b-instruct
       - --backend=huggingface
+      env:
+      - name: HF_TOKEN
+        valueFrom:
+          secretKeyRef:
+            name: hf-secret
+            key: HF_TOKEN
+            optional: false
       resources:
         limits:
           cpu: "6"
````
