
[pull] master from ggml-org:master #201


Closed

wants to merge 91 commits into master from ggml-org:master

Conversation

pull[bot] commented Jun 13, 2025

See Commits and Changes for more details.


Created by pull[bot] (v2.0.0-alpha.1)

Can you help keep this open source service alive? 💖 Please sponsor : )

ggerganov and others added 8 commits June 13, 2025 08:03
* cmake: Simplify build-info.cpp generation

The rebuild of build-info.cpp is still triggered when .git/index changes.

* cmake: generate build-info.cpp in build dir
Update oneMath commit to merged PR uxlfoundation/oneMath#669
which adds SYCL-Graph support for recording CUDA BLAS commands.

With this change the `MUL_MAT` tests now pass on DPC++ CUDA backends with SYCL-Graph
enabled. Prior to this change, an error would be thrown.

```
$ GGML_SYCL_DISABLE_GRAPH=0 ./bin/test-backend-ops -b SYCL0 -o MUL_MAT -p type_a=f16,type_b=f32,m=16,n=1,k=256,bs=\\[1,1\\],nr=\\[2

UR CUDA ERROR:
        Value:           700
        Name:            CUDA_ERROR_ILLEGAL_ADDRESS
        Description:     an illegal memory access was encountered
        Function:        operator()
        Source Location: $HOME/dpcpp/unified-runtime/source/adapters/cuda/queue.cpp:154

Native API failed. Native API returns: 2147483646 (UR_RESULT_ERROR_UNKNOWN)
Exception caught at file:$HOME/llama.cpp/ggml/src/ggml-sycl/ggml-sycl.cpp, line:3598, func:operator()
SYCL error: CHECK_TRY_ERROR((stream)->wait()): Meet error in this line code!
  in function ggml_backend_sycl_synchronize at $HOME/llama.cpp/ggml/src/ggml-sycl/ggml-sycl.cpp:3598
$HOME/llama.cpp/ggml/src/ggml-sycl/../ggml-sycl/common.hpp:118: SYCL error
Could not attach to process.  If your uid matches the uid of the target
process, check the setting of /proc/sys/kernel/yama/ptrace_scope, or try
again as the root user.  For more details, see /etc/sysctl.d/10-ptrace.conf
ptrace: Operation not permitted.
No stack.
The program is not being run.
```
* cmake: Add ability to pass in LLAMA_BUILD_NUMBER/COMMIT

* cmake: Pass on LLAMA_BUILD_* to GGML_BUILD_*
ggerganov and others added 2 commits June 13, 2025 13:47
* batch : rework llama_batch_allocr

ggml-ci

* cont : move validation inside class

ggml-ci

* cont : move output counting to class

ggml-ci

* cont : minor

ggml-ci

* batch : add TODOs

ggml-ci
* Update multimodal.md

* Update multimodal.md
@github-actions bot added the documentation (Improvements or additions to documentation) label on Jun 13, 2025
ggerganov and others added 3 commits June 13, 2025 18:35
* batch : add LLAMA_BATCH_DEBUG environment variable

ggml-ci

* cont : improve seq_id display
* vocab : prevent integer overflow during load

* Add static cast and GGML_ABORT

---------

Co-authored-by: Georgi Gerganov <[email protected]>
ggerganov and others added 2 commits June 13, 2025 20:03
* compare llama-bench: add option to plot

* Address review comments: convert case + add type hints

* Add matplotlib to requirements

* fix tests

* Improve comment and fix assert condition for test

* Add back default test_name, add --plot_log_scale

* use log_scale regardless of x_values
p1-0tr and others added 5 commits June 14, 2025 17:25
Currently, when a model generates output that looks like a tool call but is
invalid, an exception is thrown and not handled, causing the CLI or
llama-server to bail. Instead, handle the chat parser exception and simply
return the generated text in such cases.

Signed-off-by: Piotr Stankiewicz <[email protected]>
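
A minimal sketch of the fallback behaviour described in this commit, assuming a chat parser that throws on malformed input; `parse_tool_calls` and `chat_result` are placeholder names, not the actual llama.cpp API:

```
#include <stdexcept>
#include <string>

// Placeholder for the real chat parser, which throws on malformed tool calls.
static std::string parse_tool_calls(const std::string & s) {
    if (s.find("\"arguments\"") == std::string::npos) {
        throw std::runtime_error("output looks like a tool call but does not parse");
    }
    return s;
}

struct chat_result {
    std::string text;
    bool        is_tool_call = false;
};

static chat_result handle_model_output(const std::string & generated) {
    try {
        // Happy path: the output parses as a tool call.
        return { parse_tool_calls(generated), /*is_tool_call=*/ true };
    } catch (const std::exception &) {
        // Malformed tool call: return the raw generated text and keep serving,
        // instead of letting the exception take down the CLI/server.
        return { generated, /*is_tool_call=*/ false };
    }
}
```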
* batch : verify multi-sequence input batches

ggml-ci

* cont : auto-gen positions + verify multi-seq input

ggml-ci

* cont : first print debug info, then perform validation

ggml-ci

* cont : fix position auto-gen + add comments

ggml-ci
Adds:

* Dots1Model to convert_hf_to_gguf.py

* Computation graph code to llama-model.cpp

* Chat template to llama-chat.cpp to detect this model's template.

---

The architecture is called "dots.llm1" (I decided to shorten it to dots1 or
DOTS1 in the code generally).

The only models following this architecture that exist as of the writing of
this commit are "dots.llm1.inst" and "dots.llm1.base" from here:

* https://huggingface.co/rednote-hilab/dots.llm1.inst

* https://huggingface.co/rednote-hilab/dots.llm1.base

The model architecture is a combination of Qwen and Deepseek parts, as
seen here:

https://github.com/huggingface/transformers/blob/ffe12627b4e84489d2ab91dd0ec00614855edc79/src/transformers/models/dots1/modular_dots1.py
slaren and others added 29 commits June 19, 2025 21:24
Support for Arm runtime feature detection has now been added to GGML_CPU_ALL_VARIANTS. This removes the old and not very functional code.
* CUDA: add conv_2d_dw

* better naming

* simplify using template

* Review: fix operation ordering in ggml-cuda, use __forceinline__, use more const
* model : more uniform output id handling

ggml-ci

* cont : revert n_outputs < n_tokens optimization

ggml-ci

* cont : fix out_ids initialization

ggml-ci
Works around an issue that may cause CUDA graph capture to fail when a cuBLAS handle is destroyed in a different thread.
* Add PowerPC feature detection and scoring

* ggml-cpu: Implement GGML_CPU_ALL_VARIANTS for PowerPC

* ggml-cpu: Delay some initializations until function is called

When using GGML_BACKEND_DL=ON, these initializations might use
instructions that are not supported by the current CPU.

---------

Co-authored-by: Diego Devesa <[email protected]>
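
A small illustrative sketch (not the actual ggml-cpu code) of why the initialization is delayed: a namespace-scope static runs when a CPU-variant library is loaded, possibly executing unsupported instructions, while a function-local static only runs once the selected variant is actually called.

```
#include <array>
#include <cstddef>
#include <cstdint>

struct lut_t {
    std::array<uint16_t, 256> table;
    lut_t() {
        // This constructor may be compiled with ISA extensions (e.g. VSX)
        // that the running CPU does not support.
        for (size_t i = 0; i < table.size(); ++i) {
            table[i] = static_cast<uint16_t>(i * i);
        }
    }
};

// Before: constructed at library load time for every CPU variant.
// static lut_t g_lut;

// After: constructed on first use, only inside the variant that was selected.
static const lut_t & get_lut() {
    static lut_t lut; // lazy and thread-safe since C++11
    return lut;
}
```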
* Add header and namespace to use enqueue_functions extension

* Convert submit and parallel_for to use new extension in convert.cpp

* Convert submit and parallel_for to use extension in ggml-sycl.cpp

* Convert submit and parallel_for to use extension in gla.cpp

* Convert submit and parallel_for in mmq.cpp

* Convert submit and parallel_for in mmvq.cpp

* Convert submit and parallel_for in remaining files

* Convert all simple parallel_for to nd_launch from enqueue_functions
extension

* Wrapping extension in general function

Create a general function that enables the enqueue_functions extension if it
is enabled in the compiler; otherwise, call the general SYCL function to
launch kernels.

---------

Signed-off-by: nscipione <[email protected]>
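
A rough sketch of the kind of wrapper described above, assuming the DPC++ feature-test macro `SYCL_EXT_ONEAPI_ENQUEUE_FUNCTIONS`; the function name and exact dispatch are illustrative, not the actual ggml-sycl helper:

```
#include <sycl/sycl.hpp>

// Launch an nd-range kernel through the enqueue_functions extension when the
// compiler provides it, otherwise through the standard SYCL handler API.
template <typename Kernel>
void launch_nd(sycl::queue & q, sycl::nd_range<3> range, Kernel kernel) {
#ifdef SYCL_EXT_ONEAPI_ENQUEUE_FUNCTIONS
    namespace syclex = sycl::ext::oneapi::experimental;
    // Extension path: free-function submit + nd_launch.
    syclex::submit(q, [&](sycl::handler & cgh) {
        syclex::nd_launch(cgh, range, kernel);
    });
#else
    // Fallback path: classic queue::submit with handler::parallel_for.
    q.submit([&](sycl::handler & cgh) {
        cgh.parallel_for(range, kernel);
    });
#endif
}
```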
* vocab : prevent stack overflow in tokenize

* vocab : return error instead of aborting on oversized token count

* vocab : INT32_MIN from llama_tokenize on overflow
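
A hedged sketch of how a caller might react to the new error signal, assuming the current vocab-based `llama_tokenize` signature (negative return = buffer too small, `INT32_MIN` = token count overflowed `int32_t`); check `llama.h` for the exact parameters:

```
#include <climits>
#include <cstdint>
#include <string>
#include <vector>

#include "llama.h"

static bool tokenize_safe(const llama_vocab * vocab, const std::string & text,
                          std::vector<llama_token> & out) {
    out.resize(text.size() + 2); // first-guess capacity
    int32_t n = llama_tokenize(vocab, text.c_str(), (int32_t) text.size(),
                               out.data(), (int32_t) out.size(),
                               /*add_special*/ true, /*parse_special*/ true);
    if (n == INT32_MIN) {
        return false; // token count does not fit in int32_t: overflow, give up
    }
    if (n < 0) {
        out.resize((size_t) -n); // buffer too small: -n is the required size
        n = llama_tokenize(vocab, text.c_str(), (int32_t) text.size(),
                           out.data(), (int32_t) out.size(), true, true);
        if (n < 0) {
            return false;
        }
    }
    out.resize((size_t) n);
    return true;
}
```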
* CUDA: add conv_2d_transpose

* remove direct include of cuda_fp16

* Review: add brackets for readability, remove ggml_set_param and add asserts
* ggml : add ggml_roll

* use set/get_op_params & std::min
ggml-ci
* memory : rename interface to llama_memory_context_i

ggml-ci

* cont : fix comments

* cont : use "mctx" for referencing a memory context

ggml-ci
…13792)

* Add support for VK_EXT_debug_utils to add labels to Vulkan objects. In step 1, compute pipelines get labeled.

* remove #ifdef for debug utils and add queue marker.
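
For reference, a minimal C-API sketch of how VK_EXT_debug_utils attaches a name to a compute pipeline (the ggml Vulkan backend uses its own wrappers; the helper name here is made up):

```
#include <cstdint>
#include <vulkan/vulkan.h>

static void label_pipeline(VkDevice device, VkPipeline pipeline, const char * name) {
    // The extension entry point has to be loaded at runtime.
    auto set_name = (PFN_vkSetDebugUtilsObjectNameEXT)
        vkGetDeviceProcAddr(device, "vkSetDebugUtilsObjectNameEXT");
    if (set_name == nullptr) {
        return; // VK_EXT_debug_utils not available; labels are optional
    }

    VkDebugUtilsObjectNameInfoEXT info = {};
    info.sType        = VK_STRUCTURE_TYPE_DEBUG_UTILS_OBJECT_NAME_INFO_EXT;
    info.objectType   = VK_OBJECT_TYPE_PIPELINE;
    info.objectHandle = (uint64_t) pipeline;
    info.pObjectName  = name; // shows up in RenderDoc / validation messages
    set_name(device, &info);
}
```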
* CUDA: add mean operation

* add back sum_rows_f32_cuda

* Review: early exit if col!=0
#14326)

Mistral Small 2506 models using the Pixtral vision encoder were running out
of GPU memory when processing images larger than 1024x1024 pixels, due to
exponential memory growth from unlimited image size.

This fix applies the same 1024x1024 limit used by Qwen2VL models to
prevent OOM issues while maintaining compatibility with existing models.
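
A simple sketch of the clamping idea (illustrative only; the actual limit is applied in the multimodal preprocessing code): scale the image down so its longest side is at most 1024 pixels while preserving the aspect ratio.

```
#include <algorithm>
#include <utility>

static std::pair<int, int> clamp_image_size(int width, int height, int max_side = 1024) {
    const int longest = std::max(width, height);
    if (longest <= max_side) {
        return {width, height}; // already within the limit
    }
    const double scale = (double) max_side / (double) longest;
    return {
        std::max(1, (int) (width  * scale)),
        std::max(1, (int) (height * scale)),
    };
}
```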
* run : avoid double tokenization by adopting common_tokenize heuristic

* build : fix windows gcc and clang warnings

* lint : fixed trailing whitespace

* run : fix is_first flag