Description
Our test results show that concurrency is low. Is there a problem with my configuration?
PrefillWorker handles only a single request at a time. Is that the bottleneck here? How can I change it to process requests in batches?
dynamo/examples/llm/components/prefill_worker.py, lines 103 to 106 at commit 457e78f
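For illustration, here is a minimal sketch of one way a prefill loop could drain several queued requests into a batch instead of awaiting them one at a time. Everything here is an assumption for the sketch, not the actual dynamo API: `queue`, `engine.prefill_batch`, and the tuning constants are hypothetical names.

```python
# Hypothetical sketch only -- not dynamo's actual prefill_worker code.
# Idea: block for one request, then opportunistically drain more requests
# from the queue (bounded by a batch size and a short timeout) and hand
# the whole batch to the engine in a single call.
import asyncio

MAX_BATCH_SIZE = 8        # assumed tuning knob
BATCH_TIMEOUT_S = 0.005   # how long to wait for more requests to arrive

async def prefill_loop(queue: asyncio.Queue, engine) -> None:
    while True:
        # Wait for the first request, then try to collect more until the
        # batch is full or the timeout window closes.
        batch = [await queue.get()]
        deadline = asyncio.get_running_loop().time() + BATCH_TIMEOUT_S
        while len(batch) < MAX_BATCH_SIZE:
            remaining = deadline - asyncio.get_running_loop().time()
            if remaining <= 0:
                break
            try:
                batch.append(await asyncio.wait_for(queue.get(), remaining))
            except asyncio.TimeoutError:
                break
        # `prefill_batch` is a hypothetical engine method: one call for N
        # requests instead of N sequential calls.
        await engine.prefill_batch(batch)
```

Note that if the underlying engine already does continuous batching (as vLLM does), it may be enough to submit each queued request to the engine concurrently (e.g. as separate tasks) rather than awaiting each one serially, and let the engine form the batches itself.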
Dynamo result: request throughput is 1.10 req/s (TP1 BF16 3P1D)
============ Serving Benchmark Result ============
Backend: vllm
Traffic request rate: 32.0
Max request concurrency: not set
Successful requests: 1000
Benchmark duration (s): 907.39
Total input tokens: 4007607
Total generated tokens: 56814
Total generated tokens (retokenized): 50731
Request throughput (req/s): 1.10
Input token throughput (tok/s): 4416.63
Output token throughput (tok/s): 62.61
Total token throughput (tok/s): 4479.24
Concurrency: 481.16
----------------End-to-End Latency----------------
Mean E2E Latency (ms): 436601.67
Median E2E Latency (ms): 440035.82
---------------Time to First Token----------------
Mean TTFT (ms): 433931.46
Median TTFT (ms): 437615.93
P99 TTFT (ms): 865498.48
---------------Inter-Token Latency----------------
Mean ITL (ms): 11.60
Median ITL (ms): 11.52
P95 ITL (ms): 12.52
P99 ITL (ms): 13.84
Max ITL (ms): 119.50
==================================================
sglang result: request throughput is 6.42 req/s (TP1 BF16 3P1D)
============ Serving Benchmark Result ============
Backend: sglang
Traffic request rate: 32.0
Max request concurrency: not set
Successful requests: 1000
Benchmark duration (s): 155.80
Total input tokens: 4007607
Total generated tokens: 109354
Total generated tokens (retokenized): 76166
Request throughput (req/s): 6.42
Input token throughput (tok/s): 25723.23 (8574.41 tok/s/gpu)
Output token throughput (tok/s): 701.90
Total token throughput (tok/s): 26425.13
Concurrency: 398.61
----------------End-to-End Latency----------------
Mean E2E Latency (ms): 62102.68
Median E2E Latency (ms): 62686.63
---------------Time to First Token----------------
Mean TTFT (ms): 60663.67
Median TTFT (ms): 61226.33
P99 TTFT (ms): 120785.82
---------------Inter-Token Latency----------------
Mean ITL (ms): 13.30
Median ITL (ms): 12.51
P95 ITL (ms): 15.42
P99 ITL (ms): 29.09
Max ITL (ms): 322.48
==================================================
Graph config: 3 prefill workers and 1 decode worker
Frontend:
  served_model_name: model-thought-7b
  endpoint: dynamo.Processor.chat/completions
  port: 8000
  ServiceArgs:
    workers: 4

Processor:
  model: /mnt/yscfs/model-thought/rag_fc_7b_20250314_Qwen25_7b_instruct_yuanshi_chat_v4_bf16_thought_action_250313/
  router: round-robin
  ServiceArgs:
    workers: 4

VllmWorker:
  model: /mnt/yscfs/model-thought/rag_fc_7b_20250314_Qwen25_7b_instruct_yuanshi_chat_v4_bf16_thought_action_250313/
  kv-transfer-config: '{"kv_connector":"DynamoNixlConnector"}'
  max-model-len: 16384
  remote-prefill: true
  conditional-disagg: true
  max-local-prefill-length: 10
  tensor-parallel-size: 1
  ServiceArgs:
    workers: 1
    resources:
      gpu: 1

PrefillWorker:
  model: /mnt/yscfs/model-thought/rag_fc_7b_20250314_Qwen25_7b_instruct_yuanshi_chat_v4_bf16_thought_action_250313/
  kv-transfer-config: '{"kv_connector":"DynamoNixlConnector"}'
  max-model-len: 16384
  max-num-batched-tokens: 16384
  tensor-parallel-size: 1
  ServiceArgs:
    workers: 3
    resources:
      gpu: 1
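For context, assuming this config follows the stock disaggregated example layout in the dynamo repo, the graph would be launched with something like the following (the graph module path and config filename are assumptions):

```
dynamo serve graphs.disagg:Frontend -f ./configs/disagg.yaml
```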