Ollama: Crash on Intel Iris Xe due to non-configurable pipeline parallelism (n_copies=4) #13240

Open
@jaredpilcher

Description

🐛 Describe the bug

When running models like devstral:latest with the ipex-llm build of Ollama, the runner process crashes with exit status 1. The server log consistently shows "pipeline parallelism enabled (n_copies=4)" by default. This aggressive parallelism setting appears to overwhelm the resources of integrated GPUs like the Intel Iris Xe, causing the Intel SYCL driver to fail with a "UR errorException".

This n_copies=4 setting cannot be configured. The runner is launched with --parallel 4 (see the server log below), but the num_parallel parameter is not recognized in the Modelfile for this version of Ollama, making it impossible to reduce the parallelism and prevent the crash.
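
For reference, in mainline Ollama this kind of request parallelism is normally controlled by the server-level OLLAMA_NUM_PARALLEL environment variable rather than a Modelfile parameter. Whether the ipex-llm build honors it has not been verified here, but on Windows the invocation would look like:

  set OLLAMA_NUM_PARALLEL=1
  ollama serve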

To Reproduce

  1. Install ipex-llm[cpp] on a Windows machine with an Intel Iris Xe GPU (the full command sequence is summarized after this list).
  2. Start the Ollama server: ollama serve
  3. Run a large model: ollama run devstral:latest
  4. The process will begin to load, allocate buffers, and then crash.
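
In shell form, the reproduction amounts to the following; the pip install line follows the ipex-llm quickstart and is an assumption about how the environment was set up, while the serve and run commands are the ones listed above:

  pip install --pre --upgrade ipex-llm[cpp]
  ollama serve
  ollama run devstral:latest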

Expected behavior

The model should load and run without crashing the driver. Ideally, the default n_copies for integrated GPUs would be lower (e.g., 1 or 2), or a supported Modelfile parameter (like num_parallel) would allow configuring it manually.

Environment

  • OS: Windows
  • Hardware: Lenovo ThinkPad T14s Gen 4
  • GPU: Intel(R) Iris(R) Xe Graphics
  • Library: ipex-llm[cpp]

Attempts to Troubleshoot

  • Setting num_gpu to a low value (e.g., 10) in a Modelfile does not solve the issue; the n_copies=4 setting still triggers the crash.
  • Adding PARAMETER num_parallel 1 to a Modelfile results in Error: unknown parameter 'num_parallel', confirming that this version does not support the standard Ollama parameter for controlling parallelism (see the Modelfile sketch after this list).
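
For completeness, the Modelfile used in these attempts looked roughly like the following; the FROM line, the custom model name, and the exact values are illustrative rather than verbatim:

  FROM devstral:latest
  PARAMETER num_gpu 10
  # The next line is what produces: Error: unknown parameter 'num_parallel'
  PARAMETER num_parallel 1

The error appears when the Modelfile is applied (e.g., via ollama create devstral-lowparallel -f Modelfile).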

Ollama Server Log

llama_model_loader: - type  f32:   81 tensors
llama_model_loader: - type q4_K:  241 tensors
llama_model_loader: - type q6_K:   41 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q4_K - Medium
print_info: file size   = 13.34 GiB (4.86 BPW)
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: special tokens cache size = 1000
load: token to piece cache size = 0.8498 MB
print_info: arch             = llama
print_info: vocab_only       = 1
print_info: model type       = ?B
print_info: model params     = 23.57 B
print_info: general.name     = Devstral Small 2505
print_info: vocab type       = BPE
print_info: n_vocab          = 131072
print_info: n_merges         = 269443
print_info: BOS token        = 1 '<s>'
print_info: EOS token        = 2 '</s>'
print_info: UNK token        = 0 '<unk>'
print_info: LF token         = 1010 'Ċ'
print_info: EOG token        = 2 '</s>'
print_info: max token length = 150
llama_model_load: vocab only - skipping tensors
time=2025-06-28T23:37:14.961-06:00 level=INFO source=server.go:430 msg="starting llama server" cmd="C:\\Users\\pilchj\\AppData\\Local\\miniconda3\\envs\\llm-cpp\\Lib\\site-packages\\bigdl\\cpp\\libs\\ollama\\ollama-lib.exe runner --model C:\\Users\\pilchj\\.ollama\\models\\blobs\\sha256-b3a2c9a8fef9be8d2ef951aecca36a36b9ea0b70abe9359eab4315bf4cd9be01 --ctx-size 8192 --batch-size 512 --n-gpu-layers 999 --threads 6 --no-mmap --parallel 4 --port 50155"
time=2025-06-28T23:37:14.984-06:00 level=INFO source=sched.go:450 msg="loaded runners" count=1
time=2025-06-28T23:37:14.994-06:00 level=INFO source=server.go:605 msg="waiting for llama runner to start responding"
time=2025-06-28T23:37:14.996-06:00 level=INFO source=server.go:639 msg="waiting for server to become available" status="llm server error"
using override patterns: []
time=2025-06-28T23:37:15.256-06:00 level=INFO source=runner.go:883 msg="starting go runner"
time=2025-06-28T23:37:15.266-06:00 level=INFO source=ggml.go:109 msg=system CPU.0.LLAMAFILE=1 compiler=cgo(clang)
ModelParams: {NumGpuLayers:999 MainGpu:0 UseMmap:false UseMlock:false TensorSplit:[] Progress:0x7ff7e1aee200 VocabOnly:false OverrideTensors:[]}
time=2025-06-28T23:37:15.269-06:00 level=INFO source=runner.go:944 msg="Server listening on 127.0.0.1:50155"
time=2025-06-28T23:37:15.502-06:00 level=INFO source=server.go:639 msg="waiting for server to become available" status="llm server loading model"
get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory
llama_model_load_from_file_impl: using device SYCL0 (Intel(R) Iris(R) Xe Graphics) - 14789 MiB free
get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory
llama_model_load_from_file_impl: using device SYCL0 (Intel(R) Iris(R) Xe Graphics) - 14789 MiB free
llama_model_loader: loaded meta data with 41 key-value pairs and 363 tensors from C:\Users\pilchj\.ollama\models\blobs\sha256-b3a2c9a8fef9be8d2ef951aecca36a36b9ea0b70abe9359eab4315bf4cd9be01 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                  general.base_model.0.name str              = Devstrall Small 2505
llama_model_loader: - kv   2:          general.base_model.0.organization str              = Mistralai
llama_model_loader: - kv   3:              general.base_model.0.repo_url str              = https://huggingface.co/mistralai/Devs...
llama_model_loader: - kv   4:               general.base_model.0.version str              = 2505
llama_model_loader: - kv   5:                   general.base_model.count u32              = 1
llama_model_loader: - kv   6:                           general.basename str              = Devstral
llama_model_loader: - kv   7:                          general.file_type u32              = 15
llama_model_loader: - kv   8:                          general.languages arr[str,24]      = ["en", "fr", "de", "es", "pt", "it", ...
llama_model_loader: - kv   9:                            general.license str              = apache-2.0
llama_model_loader: - kv  10:                               general.name str              = Devstral Small 2505
llama_model_loader: - kv  11:                    general.parameter_count u64              = 23572403200
llama_model_loader: - kv  12:               general.quantization_version u32              = 2
llama_model_loader: - kv  13:                         general.size_label str              = Small
llama_model_loader: - kv  14:                               general.tags arr[str,1]       = ["text2text-generation"]
llama_model_loader: - kv  15:                               general.type str              = model
llama_model_loader: - kv  16:                            general.version str              = 2505
llama_model_loader: - kv  17:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv  18:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv  19:                 llama.attention.key_length u32              = 128
llama_model_loader: - kv  20:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  21:               llama.attention.value_length u32              = 128
llama_model_loader: - kv  22:                          llama.block_count u32              = 40
llama_model_loader: - kv  23:                       llama.context_length u32              = 131072
llama_model_loader: - kv  24:                     llama.embedding_length u32              = 5120
llama_model_loader: - kv  25:                  llama.feed_forward_length u32              = 32768
llama_model_loader: - kv  26:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv  27:                       llama.rope.freq_base f32              = 1000000000.000000
llama_model_loader: - kv  28:                           llama.vocab_size u32              = 131072
llama_model_loader: - kv  29:                    tokenizer.chat_template str              = {%- set today = strftime_now("%Y-%m-%...
llama_model_loader: - kv  30:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  31:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  32:            tokenizer.ggml.add_space_prefix bool             = false
llama_model_loader: - kv  33:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  34:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  35:                      tokenizer.ggml.merges arr[str,269443]  = ["Ġ Ġ", "Ġ t", "e r", "i n", "Ġ �...
llama_model_loader: - kv  36:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  37:                         tokenizer.ggml.pre str              = tekken
llama_model_loader: - kv  38:                  tokenizer.ggml.token_type arr[i32,131072]  = [3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...
llama_model_loader: - kv  39:                      tokenizer.ggml.tokens arr[str,131072]  = ["<unk>", "<s>", "</s>", "[INST]", "[...
llama_model_loader: - kv  40:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - type  f32:   81 tensors
llama_model_loader: - type q4_K:  241 tensors
llama_model_loader: - type q6_K:   41 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q4_K - Medium
print_info: file size   = 13.34 GiB (4.86 BPW)
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: special tokens cache size = 1000
load: token to piece cache size = 0.8498 MB
print_info: arch             = llama
print_info: vocab_only       = 0
print_info: n_ctx_train      = 131072
print_info: n_embd           = 5120
print_info: n_layer          = 40
print_info: n_head           = 32
print_info: n_head_kv        = 8
print_info: n_rot            = 128
print_info: n_swa            = 0
print_info: n_embd_head_k    = 128
print_info: n_embd_head_v    = 128
print_info: n_gqa            = 4
print_info: n_embd_k_gqa     = 1024
print_info: n_embd_v_gqa     = 1024
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-05
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: n_ff             = 32768
print_info: n_expert         = 0
print_info: n_expert_used    = 0
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 0
print_info: rope scaling     = linear
print_info: freq_base_train  = 1000000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 131072
print_info: rope_finetuned   = unknown
print_info: ssm_d_conv       = 0
print_info: ssm_d_inner      = 0
print_info: ssm_d_state      = 0
print_info: ssm_dt_rank      = 0
print_info: ssm_dt_b_c_rms   = 0
print_info: model type       = 13B
print_info: model params     = 23.57 B
print_info: general.name     = Devstral Small 2505
print_info: vocab type       = BPE
print_info: n_vocab          = 131072
print_info: n_merges         = 269443
print_info: BOS token        = 1 '<s>'
print_info: EOS token        = 2 '</s>'
print_info: UNK token        = 0 '<unk>'
print_info: LF token         = 1010 'Ċ'
print_info: EOG token        = 2 '</s>'
print_info: max token length = 150
load_tensors: loading model tensors, this can take a while... (mmap = false)
get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory
get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory
get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory
load_tensors: offloading 40 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 41/41 layers to GPU
load_tensors:        SYCL0 model buffer size = 13302.36 MiB
load_tensors:          CPU model buffer size =   360.00 MiB
llama_init_from_model: n_seq_max     = 4
llama_init_from_model: n_ctx         = 8192
llama_init_from_model: n_ctx_per_seq = 2048
llama_init_from_model: n_batch       = 2048
llama_init_from_model: n_ubatch      = 512
llama_init_from_model: flash_attn    = 0
llama_init_from_model: freq_base     = 1000000000.0
llama_init_from_model: freq_scale    = 1
llama_init_from_model: n_ctx_per_seq (2048) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
Running with Environment Variables:
  GGML_SYCL_DEBUG: 0
  GGML_SYCL_DISABLE_OPT: 1
Build with Macros:
  GGML_SYCL_FORCE_MMQ: no
  GGML_SYCL_F16: no
Found 1 SYCL devices:
|  |                   |                                       |       |Max    |        |Max  |Global |                     |
|  |                   |                                       |       |compute|Max work|sub  |mem    |                     |
|ID|        Device Type|                                   Name|Version|units  |group   |group|size   |       Driver version|
|--|-------------------|---------------------------------------|-------|-------|--------|-----|-------|---------------------|
| 0| [level_zero:gpu:0]|                 Intel Iris Xe Graphics|   12.3|     96|     512|   32| 15507M|            1.6.32960|
SYCL Optimization Feature:
|ID|        Device Type|Reorder|
|--|-------------------|-------|
| 0| [level_zero:gpu:0]|      N|
llama_kv_cache_init: kv_size = 8192, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 40, can_shift = 1
llama_kv_cache_init:      SYCL0 KV buffer size =  1280.00 MiB
llama_init_from_model: KV self size  = 1280.00 MiB, K (f16):  640.00 MiB, V (f16):  640.00 MiB
llama_init_from_model:  SYCL_Host  output buffer size =     2.08 MiB
get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory
get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory
llama_init_from_model: pipeline parallelism enabled (n_copies=4)
llama_init_from_model:      SYCL0 compute buffer size =   410.02 MiB
llama_init_from_model:  SYCL_Host compute buffer size =    74.02 MiB
llama_init_from_model: graph nodes  = 1126 (with bs=512), 1006 (with bs=1)
llama_init_from_model: graph splits = 3
time=2025-06-28T23:37:57.940-06:00 level=WARN source=runner.go:802 msg="%s: warming up the model with an empty run - please wait ... " !BADKEY=loadModel
UR errorException caught at file:D:\actions-runner\release-cpp-oneapi_2024_2\_work\llm.cpp\llm.cpp\ollama-llama-cpp\ggml\src\ggml-sycl\common.cpp, line:99
time=2025-06-28T23:37:59.880-06:00 level=INFO source=server.go:639 msg="waiting for server to become available" status="llm server not responding"
time=2025-06-28T23:38:01.936-06:00 level=INFO source=server.go:639 msg="waiting for server to become available" status="llm server error"
time=2025-06-28T23:38:02.187-06:00 level=ERROR source=sched.go:456 msg="error loading llama server" error="llama runner process has terminated: exit status 1"
[GIN] 2025/06/28 - 23:38:02 | 500 |   48.2099198s |       127.0.0.1 | POST     "/api/generate"
