
Compared to SGLang, Dynamo services have lower throughput. Is it a problem with my configuration? #387

@a4zhangfei

Description


Our test results show that throughput under concurrent load is not high. Is there a problem with my configuration?
PrefillWorker handles a single request at a time. Is that the problem here? How can it be changed to process requests in batches?

prefill_request = await prefill_queue.dequeue_prefill_request()
if prefill_request is not None:
    print(f"Dequeued prefill request: {prefill_request.request_id}")
    async for _ in self.generate(prefill_request):
        pass
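One possible direction is to drain several queued requests per iteration and overlap their prefill passes. The sketch below assumes dequeue_prefill_request returns None when the queue is empty (as the if-check above suggests) and that the underlying engine accepts concurrent generate calls; MAX_BATCH, run_one_prefill, and prefill_loop are illustrative names, not Dynamo API:

import asyncio

MAX_BATCH = 8  # assumed knob; not an existing Dynamo or vLLM setting

async def run_one_prefill(worker, prefill_request):
    # Drive a single prefill to completion, mirroring the loop in the snippet above.
    async for _ in worker.generate(prefill_request):
        pass

async def prefill_loop(worker, prefill_queue):
    while True:
        # As in the snippet above, dequeue may return None when the queue is empty.
        first = await prefill_queue.dequeue_prefill_request()
        if first is None:
            await asyncio.sleep(0.01)  # avoid a busy loop while the queue is empty
            continue
        batch = [first]
        # Opportunistically drain a few more requests so several prefills are
        # in flight at once and the engine can batch them internally.
        while len(batch) < MAX_BATCH:
            nxt = await prefill_queue.dequeue_prefill_request()
            if nxt is None:
                break
            batch.append(nxt)
        print(f"Dequeued {len(batch)} prefill request(s)")
        await asyncio.gather(*(run_one_prefill(worker, r) for r in batch))

Whether this closes the gap depends on how the prefill queue and the NIXL KV transfers behave under load, so treat it as a starting point rather than a fix.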

Dynamo result: request throughput is 1.10 req/s (TP1 BF16 3P1D)

============ Serving Benchmark Result ============
Backend:                                 vllm
Traffic request rate:                    32.0
Max request concurrency:                 not set
Successful requests:                     1000
Benchmark duration (s):                  907.39
Total input tokens:                      4007607
Total generated tokens:                  56814
Total generated tokens (retokenized):    50731
Request throughput (req/s):              1.10
Input token throughput (tok/s):          4416.63
Output token throughput (tok/s):         62.61
Total token throughput (tok/s):          4479.24
Concurrency:                             481.16
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   436601.67
Median E2E Latency (ms):                 440035.82
---------------Time to First Token----------------
Mean TTFT (ms):                          433931.46
Median TTFT (ms):                        437615.93
P99 TTFT (ms):                           865498.48
---------------Inter-Token Latency----------------
Mean ITL (ms):                           11.60
Median ITL (ms):                         11.52
P95 ITL (ms):                            12.52
P99 ITL (ms):                            13.84
Max ITL (ms):                            119.50
==================================================

SGLang result: request throughput is 6.42 req/s (TP1 BF16 3P1D)

============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    32.0
Max request concurrency:                 not set
Successful requests:                     1000
Benchmark duration (s):                  155.80
Total input tokens:                      4007607
Total generated tokens:                  109354
Total generated tokens (retokenized):    76166
Request throughput (req/s):              6.42
Input token throughput (tok/s):          25723.23   (≈8574.41 tok/s/gpu)
Output token throughput (tok/s):         701.90
Total token throughput (tok/s):          26425.13
Concurrency:                             398.61
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   62102.68
Median E2E Latency (ms):                 62686.63
---------------Time to First Token----------------
Mean TTFT (ms):                          60663.67
Median TTFT (ms):                        61226.33
P99 TTFT (ms):                           120785.82
---------------Inter-Token Latency----------------
Mean ITL (ms):                           13.30
Median ITL (ms):                         12.51
P95 ITL (ms):                            15.42
P99 ITL (ms):                            29.09
Max ITL (ms):                            322.48
==================================================
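To put the two runs side by side, the gap reduces to a few ratios. All numbers below are copied from the benchmark outputs above; dividing by 3 assumes the per-GPU annotation refers to the 3 prefill workers:

# Figures copied from the two benchmark outputs above (TP1, BF16, 3P1D).
dynamo = {"req_s": 1.10, "input_tok_s": 4416.63, "mean_ttft_ms": 433931.46}
sglang = {"req_s": 6.42, "input_tok_s": 25723.23, "mean_ttft_ms": 60663.67}

print(f"request throughput ratio: {sglang['req_s'] / dynamo['req_s']:.1f}x")                   # ~5.8x
print(f"input token throughput ratio: {sglang['input_tok_s'] / dynamo['input_tok_s']:.1f}x")   # ~5.8x
print(f"TTFT ratio (Dynamo / SGLang): {dynamo['mean_ttft_ms'] / sglang['mean_ttft_ms']:.1f}x") # ~7.2x
print(f"SGLang input tok/s per prefill GPU: {sglang['input_tok_s'] / 3:.2f}")                  # ~8574.41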

Graph config: 3 prefill workers and 1 decode worker (3P1D)

Frontend:
  served_model_name: model-thought-7b
  endpoint: dynamo.Processor.chat/completions
  port: 8000
  ServiceArgs:
    workers: 4

Processor:
  model: /mnt/yscfs/model-thought/rag_fc_7b_20250314_Qwen25_7b_instruct_yuanshi_chat_v4_bf16_thought_action_250313/
  router: round-robin
  ServiceArgs:
    workers: 4

VllmWorker:
  model: /mnt/yscfs/model-thought/rag_fc_7b_20250314_Qwen25_7b_instruct_yuanshi_chat_v4_bf16_thought_action_250313/
  kv-transfer-config: '{"kv_connector":"DynamoNixlConnector"}'
  max-model-len: 16384
  remote-prefill: true
  conditional-disagg: true
  max-local-prefill-length: 10
  tensor-parallel-size: 1
  ServiceArgs:
    workers: 1
    resources:
      gpu: 1

PrefillWorker:
  model: /mnt/yscfs/model-thought/rag_fc_7b_20250314_Qwen25_7b_instruct_yuanshi_chat_v4_bf16_thought_action_250313/
  kv-transfer-config: '{"kv_connector":"DynamoNixlConnector"}'
  max-model-len: 16384
  max-num-batched-tokens: 16384
  tensor-parallel-size: 1
  ServiceArgs:
    workers: 3
    resources:
      gpu: 1
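For reference, a request against this graph can be issued as sketched below. This assumes the Frontend exposes an OpenAI-compatible /v1/chat/completions route on port 8000; model-thought-7b and the port come from the Frontend block above, everything else is illustrative:

import requests

# Port and served_model_name come from the Frontend block above. The route is
# assumed to be the OpenAI-compatible chat completions path exposed by the
# Dynamo frontend; adjust it if your deployment differs.
resp = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "model-thought-7b",
        "messages": [{"role": "user", "content": "Hello, how are you?"}],
        "max_tokens": 64,
        "stream": False,
    },
    timeout=600,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])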
