
Is this performance normal for qwen3 8b with llama.cpp? #13232

Open
@markussiebert

Description


I have a question about the performance of the Qwen3 model (specifically the 8B Q8_K_XL variant) when running on an Intel Arc A770 GPU.

Current observations:

- Memory bandwidth (IMC): 25,000 MiB/s read, 50 MiB/s write
- Compute utilization: approximately 30%
- CPU core usage: 10 of 12 cores at 100%

Inference is quite slow, at about 8 tokens/second. Is this an expected result?

By comparison, the DeepSeek 0528 model uses nearly 100% compute and only a single CPU core.
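As a rough sanity check on these numbers (my own back-of-envelope, not something stated above): token generation is typically memory-bandwidth bound, so decode speed is roughly effective bandwidth divided by the bytes read per token, which is about the model size. The sketch below assumes an ~8.5 GiB weight file for an 8B model at Q8-class quantization and a spec-sheet A770 bandwidth of roughly 560 GB/s; both figures are assumptions, not measurements from this issue.

```python
def est_tokens_per_sec(bandwidth_gib_s: float, model_size_gib: float) -> float:
    """Bandwidth-bound decode estimate: each generated token reads ~all weights once."""
    return bandwidth_gib_s / model_size_gib

MODEL_GIB = 8.5     # assumed: 8B parameters at ~Q8-class quantization
CPU_BW_GIB = 24.4   # observed IMC read above: 25,000 MiB/s ~= 24.4 GiB/s (system RAM)
GPU_BW_GIB = 520.0  # assumed: A770 spec-sheet ~560 GB/s ~= 520 GiB/s VRAM bandwidth

print(f"CPU-bound estimate: {est_tokens_per_sec(CPU_BW_GIB, MODEL_GIB):.1f} tok/s")
print(f"GPU-bound estimate: {est_tokens_per_sec(GPU_BW_GIB, MODEL_GIB):.1f} tok/s")
```

The observed 8 tok/s sits between the ~3 tok/s CPU-only and ~61 tok/s GPU-only estimates, which, together with 10 CPU cores at 100% and high system-RAM reads, would be consistent with much of the model running on the CPU rather than being fully offloaded to the A770.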
