Commit 99b939a

Add note on hf-token for llama3 model (#386)
Signed-off-by: Sivanantham Chinnaiyan <[email protected]>
1 parent f991e85 commit 99b939a

File tree: 1 file changed (+48 −0)

  • docs/modelserving/v1beta1/llm/huggingface/text_generation

docs/modelserving/v1beta1/llm/huggingface/text_generation/README.md

Lines changed: 48 additions & 0 deletions
````diff
@@ -6,6 +6,23 @@ In this example, we demonstrate how to deploy `Llama3 model` for text generation
 KServe Hugging Face runtime by default uses vLLM to serve the LLM models for faster time-to-first-token (TTFT) and higher token generation throughput than the Hugging Face API. vLLM is implemented with common inference optimization techniques, such as paged attention, continuous batching and an optimized CUDA kernel.
 If the model is not supported by vLLM, KServe falls back to the Hugging Face backend as a failsafe.
 
+!!! note
+    The Llama3 model requires a Hugging Face Hub token to download the model. You can set the token using the `HF_TOKEN`
+    environment variable.
+
+Create a secret with the Hugging Face token.
+
+=== "Yaml"
+    ```yaml
+    apiVersion: v1
+    kind: Secret
+    metadata:
+      name: hf-secret
+    type: Opaque
+    stringData:
+      HF_TOKEN: <token>
+    ```
+
 === "Yaml"
 
 ```yaml
````
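The Secret above uses `stringData`, which accepts the token in plain text; Kubernetes base64-encodes it into the `data` field on admission. A minimal sketch of that encoding, using a hypothetical placeholder token value:

```python
import base64

# Hypothetical placeholder token; `stringData` accepts this raw value,
# while a Secret's `data` field must hold the base64-encoded form.
token = "hf_example_token"
encoded = base64.b64encode(token.encode()).decode()

# Decoding the `data` form recovers the original `stringData` value.
assert base64.b64decode(encoded).decode() == token
```

This is why the manifest can embed `<token>` directly under `stringData` without any manual encoding step.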
````diff
@@ -22,6 +39,13 @@ If the model is not supported by vLLM, KServe falls back to HuggingFace backend
       args:
       - --model_name=llama3
       - --model_id=meta-llama/meta-llama-3-8b-instruct
+      env:
+      - name: HF_TOKEN
+        valueFrom:
+          secretKeyRef:
+            name: hf-secret
+            key: HF_TOKEN
+            optional: false
       resources:
         limits:
           cpu: "6"
````
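Inside the container, the serving runtime simply reads the injected variable from the process environment; with `optional: false`, the pod will not start unless the Secret key exists, so the variable should always be present by the time the server runs. A minimal sketch of that lookup (the function name `resolve_hf_token` is illustrative, not KServe's API):

```python
import os

def resolve_hf_token(env=os.environ):
    # Kubernetes injects HF_TOKEN from the `hf-secret` Secret via secretKeyRef;
    # Hugging Face tooling picks it up from the environment to authenticate
    # model downloads from the Hub.
    token = env.get("HF_TOKEN")
    if token is None:
        raise RuntimeError("HF_TOKEN is not set; gated model download will fail")
    return token

# Simulated injected environment for illustration.
assert resolve_hf_token({"HF_TOKEN": "hf_example"}) == "hf_example"
```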
````diff
@@ -150,6 +174,23 @@ curl -H "content-type:application/json" -H "Host: ${SERVICE_HOSTNAME}" \
 You can use the `--backend=huggingface` argument to perform the inference using the Hugging Face API. KServe Hugging Face backend runtime also
 supports the OpenAI `/v1/completions` and `/v1/chat/completions` endpoints for inference.
 
+!!! note
+    The Llama3 model requires a Hugging Face Hub token to download the model. You can set the token using the `HF_TOKEN`
+    environment variable.
+
+Create a secret with the Hugging Face token.
+
+=== "Yaml"
+    ```yaml
+    apiVersion: v1
+    kind: Secret
+    metadata:
+      name: hf-secret
+    type: Opaque
+    stringData:
+      HF_TOKEN: <token>
+    ```
+
 === "Yaml"
 
 ```yaml
````
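As an alternative to applying the Secret manifest, the same object can be created imperatively; this is a sketch assuming the current kubeconfig context points at the namespace where the InferenceService will run:

```shell
# Create the hf-secret Secret with the HF_TOKEN key directly from a literal value.
kubectl create secret generic hf-secret --from-literal=HF_TOKEN=<token>
```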
````diff
@@ -167,6 +208,13 @@ supports the OpenAI `/v1/completions` and `/v1/chat/completions` endpoints for i
       - --model_name=llama3
       - --model_id=meta-llama/meta-llama-3-8b-instruct
       - --backend=huggingface
+      env:
+      - name: HF_TOKEN
+        valueFrom:
+          secretKeyRef:
+            name: hf-secret
+            key: HF_TOKEN
+            optional: false
       resources:
         limits:
           cpu: "6"
````
