[CPU vs GPU vs NPU] Phi3 on Qualcomm Elite 8: CPU works but GPU and NPU don't work #1535

Open
@sourcesync

Describe the bug
I can successfully run a Phi3 model using onnxruntime_genai on a Qualcomm Elite 8 CPU (Windows 11, ARM64) through phi3-qa.py. I can't get the model to run on the GPU (via the DML EP) or on the NPU (via the QNN EP). I built both onnxruntime (with --use_dml and --use_qnn) and onnxruntime_genai from source.

To Reproduce
Steps to reproduce the behavior:

  1. Build onnxruntime with the --use_dml and --use_qnn flags.
  2. Build onnxruntime_genai.
  3. pip install the custom-built wheels in a Python 3.10 venv.
  4. Run phi3-qa.py with each of the "-e" values cpu, dml, and qnn. Only cpu works.
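
Before step 4, it may help to confirm that the custom-built onnxruntime wheel actually exposes the DML and QNN execution providers. The following is a minimal sketch, assuming the wheel is importable as `onnxruntime` inside the venv. `missing_providers` and `REQUIRED` are hypothetical names introduced here for illustration; `get_available_providers()` is the standard ONNX Runtime Python call.

```python
# Hypothetical helper: report which required execution providers are absent.
# The strings below are the standard ONNX Runtime provider identifiers.
REQUIRED = ["CPUExecutionProvider", "DmlExecutionProvider", "QNNExecutionProvider"]


def missing_providers(available, required=REQUIRED):
    """Return the required provider names that are not in `available`."""
    available = set(available)
    return [name for name in required if name not in available]


if __name__ == "__main__":
    try:
        import onnxruntime as ort
    except ImportError:
        print("onnxruntime is not installed in this environment")
    else:
        available = ort.get_available_providers()
        print("available:", available)
        print("missing:", missing_providers(available))
```

If either provider shows up as missing, the wheel was probably not built with the corresponding flag, and the dml or qnn run would not be expected to use the GPU or NPU.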

Errors
CPU:
python.exe phi3-qa.py -m cpu_and_mobile/cpu-int4-rtn-block-32-acc-level-4 --timings -e cpu
...

[This works as expected, model runs on the CPU]

GPU:
python.exe phi3-qa.py -m cpu_and_mobile/cpu-int4-rtn-block-32-acc-level-4 --timings -e dml
...
2025-06-06 16:03:08.4113577 [E:onnxruntime:onnxruntime-genai, sequential_executor.cc:572 onnxruntime::ExecuteKernel] Non-zero status code returned while running DmlFusedNode_0_0 node. Name:'DmlFusedNode_0_0' Status Message: C:\Users\QCWorkshop\Projects\onnx-runtime\build_from_source\onnxruntime\onnxruntime\core\providers\dml\DmlExecutionProvider\src\MLOperatorAuthorImpl.cpp(2816)\onnxruntime.dll!00007FFE37DFFF24: (caller: 00007FFE37E1C404) Exception(2) tid(966c) 80070057 The parameter is incorrect.

[Program crashes with the error above]

NPU:
python.exe phi3-qa.py -m cpu_and_mobile/cpu-int4-rtn-block-32-acc-level-4 --timings -e qnn
...
Starting stage: Graph Preparation Initializing
Completed stage: Graph Preparation Initializing (347 us)
Starting stage: Graph Transformations and Optimizations
Completed stage: Graph Transformations and Optimizations (894 us)
Starting stage: Graph Sequencing for Target
Completed stage: Graph Sequencing for Target (287 us)
Starting stage: VTCM Allocation
Completed stage: VTCM Allocation (61 us)
Starting stage: Parallelization Optimization
Completed stage: Parallelization Optimization (69 us)
Starting stage: Finalizing Graph Sequence
Completed stage: Finalizing Graph Sequence (136 us)
Starting stage: Completion
Completed stage: Completion (21 us)
...

[It appears to work, but it is running on the CPU, not the NPU]
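
One way to check which provider the model folder itself requests is to read its genai_config.json. This is only a sketch under an assumption: that provider options live under `model.decoder.session_options.provider_options`, which is not shown anywhere in this report. `configured_providers` is a hypothetical helper name.

```python
import json
from pathlib import Path


def configured_providers(model_dir):
    """List provider names found in genai_config.json.

    Assumes provider options are stored as a list of single-key dicts under
    model.decoder.session_options.provider_options (an assumed layout).
    """
    config_path = Path(model_dir) / "genai_config.json"
    config = json.loads(config_path.read_text(encoding="utf-8"))
    session = config.get("model", {}).get("decoder", {}).get("session_options", {})
    names = []
    for entry in session.get("provider_options", []):
        names.extend(entry.keys())
    return names


if __name__ == "__main__":
    # Model folder taken from the commands in this report.
    model_dir = Path("cpu_and_mobile/cpu-int4-rtn-block-32-acc-level-4")
    if (model_dir / "genai_config.json").exists():
        print(configured_providers(model_dir))
    else:
        print("genai_config.json not found in", model_dir)
```

An empty list would be consistent with the CPU fallback seen in the NPU run, though that is an inference and not something the logs confirm.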

Expected behavior
I expect the model to run on the GPU through the DML execution provider and on the NPU through the QNN execution provider.
