
ggml-cpu: Support Q3_K SIMD on s390x #13301


Merged 1 commit into ggml-org:master on May 4, 2025

Conversation

@taronaeo (Contributor) commented on May 4, 2025

This pull request adds a SIMD implementation of Q3_K quantisation for the s390x platform.

Before SIMD Benchmark

| model | size | params | backend | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| qwen2 1.5B Q3_K - Medium | 780.32 MiB | 1.54 B | BLAS | 16 | pp512 | 110.51 ± 0.89 |
| qwen2 1.5B Q3_K - Medium | 780.32 MiB | 1.54 B | BLAS | 16 | tg128 | 8.41 ± 0.22 |
| qwen2 3B Q3_K - Medium | 1.48 GiB | 3.09 B | BLAS | 16 | pp512 | 49.05 ± 0.39 |
| qwen2 3B Q3_K - Medium | 1.48 GiB | 3.09 B | BLAS | 16 | tg128 | 3.37 ± 0.05 |
| qwen2 7B Q3_K - Medium | 3.54 GiB | 7.62 B | BLAS | 16 | pp512 | 24.96 ± 0.12 |
| qwen2 7B Q3_K - Medium | 3.54 GiB | 7.62 B | BLAS | 16 | tg128 | 1.54 ± 0.02 |

After SIMD Benchmark

| model | size | params | backend | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| qwen2 1.5B Q3_K - Medium | 780.32 MiB | 1.54 B | BLAS | 16 | pp512 | 105.84 ± 2.11 |
| qwen2 1.5B Q3_K - Medium | 780.32 MiB | 1.54 B | BLAS | 16 | tg128 | 23.93 ± 3.07 |
| qwen2 3B Q3_K - Medium | 1.48 GiB | 3.09 B | BLAS | 16 | pp512 | 55.32 ± 0.20 |
| qwen2 3B Q3_K - Medium | 1.48 GiB | 3.09 B | BLAS | 16 | tg128 | 14.92 ± 0.76 |
| qwen2 7B Q3_K - Medium | 3.54 GiB | 7.62 B | BLAS | 16 | pp512 | 26.98 ± 0.04 |
| qwen2 7B Q3_K - Medium | 3.54 GiB | 7.62 B | BLAS | 16 | tg128 | 6.84 ± 0.09 |

Verification

To ensure that this implementation did not break anything, the SIMD instruction set has been tested on the following models:

  • Tested Qwen2.5 1.5B (Q3_K)
  • Tested Qwen2.5 3B (Q3_K)
  • Tested Qwen2.5 7B (Q3_K)

Note

Tests were conducted on an IBM z15 Mainframe with 16 IFLs (cores) and 128 GB Memory on a shared R&D LPAR.

Please review this pull request and consider merging into the main repository. Thank you!

@github-actions github-actions bot added the ggml changes relating to the ggml tensor library for machine learning label May 4, 2025
@CISC CISC merged commit 6eb7d25 into ggml-org:master May 4, 2025
46 checks passed
gabe-l-hart added a commit to gabe-l-hart/llama.cpp that referenced this pull request May 6, 2025
* origin/master: (27 commits)
llama : fix build_ffn without gate (ggml-org#13336)
CUDA: fix bad asserts for partial offload (ggml-org#13337)
convert : qwen2/3moe : set yarn metadata if present (ggml-org#13331)
CUDA: fix --split-mode row for MMQ (ggml-org#13323)
gguf-py : avoid requiring pyside6 for other scripts (ggml-org#13036)
CUDA: fix logic for clearing padding with -ngl 0 (ggml-org#13320)
sampling : Integrate Top-nσ into main sampling chain (and add it to the server) (ggml-org#13264)
server : Webui - change setText command from parent window to also send the message. (ggml-org#13309)
mtmd : rename llava directory to mtmd (ggml-org#13311)
clip : fix confused naming ffn_up and ffn_down (ggml-org#13290)
convert : bailingmoe : set yarn metadata if present (ggml-org#13312)
SYCL: Disable mul_mat kernels for noncontiguous tensor b (ggml-org#13308)
mtmd : add C public API (ggml-org#13184)
rpc : use backend registry, support dl backends (ggml-org#13304)
ggml : activate s390x simd for Q3_K (ggml-org#13301)
llava/mtmd : fixes to fully support dl backends (ggml-org#13303)
llama : build windows releases with dl backends (ggml-org#13220)
CUDA: fix race condition in MMQ stream-k fixup (ggml-org#13299)
CUDA: fix race condition in MMQ ids_dst (ggml-org#13294)
vulkan: Additional type support for unary, binary, and copy (ggml-org#13266)
...