
Is this performance normal for qwen3 8b with llama.cpp? #13232

Open
@markussiebert

Description


I have a question about the performance of the Qwen3 model (specifically the 8B Q8_K_XL variant) when running on an Intel Arc A770 GPU.

Current observations:

- Memory bandwidth (IMC): 25,000 MiB/s read, 50 MiB/s write
- Compute utilization: approximately 30%
- CPU core usage: 10 of 12 cores at 100%

Inference is quite slow, at about 8 tokens/second. Is this an expected result?

By comparison, the DeepSeek 0528 model uses nearly 100% compute and only a single CPU core.
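As a rough sanity check on these numbers (my own back-of-envelope, not something stated above): token generation is typically memory-bandwidth bound, so decode speed is roughly effective bandwidth divided by the bytes read per token, which is about the model size. The sketch below assumes an ~8.5 GiB weight file for an 8B model at Q8-class quantization and a spec-sheet A770 bandwidth of roughly 560 GB/s; both figures are assumptions, not measurements from this issue.

```python
def est_tokens_per_sec(bandwidth_gib_s: float, model_size_gib: float) -> float:
    """Bandwidth-bound decode estimate: each generated token reads ~all weights once."""
    return bandwidth_gib_s / model_size_gib

MODEL_GIB = 8.5     # assumed: 8B parameters at ~Q8-class quantization
CPU_BW_GIB = 24.4   # observed IMC read above: 25,000 MiB/s ~= 24.4 GiB/s (system RAM)
GPU_BW_GIB = 520.0  # assumed: A770 spec-sheet ~560 GB/s ~= 520 GiB/s VRAM bandwidth

print(f"CPU-bound estimate: {est_tokens_per_sec(CPU_BW_GIB, MODEL_GIB):.1f} tok/s")
print(f"GPU-bound estimate: {est_tokens_per_sec(GPU_BW_GIB, MODEL_GIB):.1f} tok/s")
```

The observed 8 tok/s sits between the ~3 tok/s CPU-only and ~61 tok/s GPU-only estimates, which, together with 10 CPU cores at 100% and high system-RAM reads, would be consistent with much of the model running on the CPU rather than being fully offloaded to the A770.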
