
DeepSeek-R1-Distill-Llama-70B asym_int4 result error #13233

Open
@Zjq9409

Description


Model launch script:

#!/bin/bash
MODEL_PATH=${MODEL_PATH:-"/llm/models/DeepSeek-R1-Distill-Llama-70B/"}
SERVED_MODEL_NAME=${SERVED_MODEL_NAME:-"DeepSeek-R1-Distill-Llama-70B"}
TENSOR_PARALLEL_SIZE=${TENSOR_PARALLEL_SIZE:-4}

MAX_NUM_SEQS=${MAX_NUM_SEQS:-256}
MAX_NUM_BATCHED_TOKENS=${MAX_NUM_BATCHED_TOKENS:-8000}
MAX_MODEL_LEN=${MAX_MODEL_LEN:-8000}
LOAD_IN_LOW_BIT=${LOAD_IN_LOW_BIT:-"asym_int4"}
PORT=${PORT:-8006}

echo "Starting service with model: $MODEL_PATH"
echo "Served model name: $SERVED_MODEL_NAME"
echo "Tensor parallel size: $TENSOR_PARALLEL_SIZE"
echo "Max num sequences: $MAX_NUM_SEQS"
echo "Max num batched tokens: $MAX_NUM_BATCHED_TOKENS"
echo "Max model length: $MAX_MODEL_LEN"
echo "Load in low bit: $LOAD_IN_LOW_BIT"
echo "Port: $PORT"

export CCL_WORKER_COUNT=2
export SYCL_CACHE_PERSISTENT=1
export FI_PROVIDER=shm
export CCL_ATL_TRANSPORT=ofi
export CCL_ZE_IPC_EXCHANGE=sockets
export CCL_ATL_SHM=1

export USE_XETLA=OFF
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=2
export TORCH_LLM_ALLREDUCE=0

export CCL_SAME_STREAM=1
export CCL_BLOCKING_WAIT=0

export VLLM_USE_V1=0
export IPEX_LLM_LOWBIT=$LOAD_IN_LOW_BIT

source /opt/intel/1ccl-wks/setvars.sh
python -m ipex_llm.vllm.xpu.entrypoints.openai.api_server \
  --port $PORT \
  --model $MODEL_PATH \
  --trust-remote-code \
  --block-size 8 \
  --gpu-memory-utilization 0.95 \
  --device xpu \
  --dtype float16 \
  --enforce-eager \
  --load-in-low-bit $LOAD_IN_LOW_BIT \
  --host 0.0.0.0 \
  --max-model-len $MAX_MODEL_LEN \
  --max-num-batched-tokens $MAX_NUM_BATCHED_TOKENS \
  --max-num-seqs $MAX_NUM_SEQS \
  --tensor-parallel-size $TENSOR_PARALLEL_SIZE \
  --disable-async-output-proc \
  --distributed-executor-backend ray

Python client test script:

import requests
import json

# API endpoint
url = "http://localhost:8005/v1/chat/completions"

# Request headers
headers = {
    "Content-Type": "application/json"
}

# Request body
data = {
    "model": "/llm/models/DeepSeek-R1-Distill-Llama-70B/",
    "messages": [
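        # System prompt: "You are a Chinese translation assistant that translates Chinese into English"
        {"role": "system", "content": "你是一个中文翻译助手,会将中文翻译成英文"},
        # User message: "Hello, where are you from?"
        {"role": "user", "content": "你好呀,你来自哪里?"}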
    ],
    "temperature": 0.5,
    "max_tokens": 1024
}

# Send the POST request
response = requests.post(url, headers=headers, data=json.dumps(data))

# Print the response
print("Status Code:", response.status_code)
try:
    result = response.json()
    print(json.dumps(result, indent=2, ensure_ascii=False))

    # Extract and print the assistant's reply
    if "choices" in result:
        for choice in result["choices"]:
            print("\n【Assistant】:", choice["message"]["content"])
except Exception as e:
    print("Failed to parse response:", e)
    print(response.text)
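
Note that the launch script defaults to `PORT=8006`, while the client above targets port 8005. If the server was not started with `PORT=8005`, the request would go to the wrong port. A minimal sketch of one way to keep the two in sync, assuming the client is run in an environment where the same `PORT` variable is set (the helper name `build_chat_url` is hypothetical and not part of the original script):

```python
import os


def build_chat_url(host: str = "localhost", default_port: int = 8006) -> str:
    """Build the chat-completions URL (hypothetical helper).

    Reads the PORT environment variable used by the launch script and
    falls back to the launch script's default of 8006 when it is unset.
    """
    port = int(os.environ.get("PORT", default_port))
    return f"http://{host}:{port}/v1/chat/completions"
```

With this helper, `url = build_chat_url()` would replace the hard-coded URL in the client script.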

The output is incorrect:

【Assistant】
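
To help narrow down why the reply is empty, it may be useful to look at more than the `content` field. The following is a rough sketch, assuming the server returns the standard OpenAI-style chat completion schema (`choices[].finish_reason`, `choices[].message.content`); the function name `summarize_choices` is hypothetical:

```python
from typing import Any


def summarize_choices(result: dict[str, Any]) -> list[dict[str, Any]]:
    """Return one small summary per choice in an OpenAI-style response."""
    summaries = []
    for choice in result.get("choices", []):
        message = choice.get("message") or {}
        content = message.get("content") or ""
        summaries.append({
            "index": choice.get("index"),
            "finish_reason": choice.get("finish_reason"),
            "content_chars": len(content),
            # True when the reply is missing or contains only whitespace
            "is_empty": not content.strip(),
        })
    return summaries
```

Calling `print(summarize_choices(response.json()))` in the client would show whether the generation stopped immediately (for example `finish_reason` of `"stop"` with zero characters) or ran up to `max_tokens` (`"length"`), which may help distinguish a quantization problem from a request-side one.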
