[CPU vs GPU vs NPU] Phi3 on Qualcomm Elite 8: CPU works but GPU and NPU don't work #1535

**Describe the bug**
I can successfully run a Phi3 model with onnxruntime_genai on the CPU of a Qualcomm Elite 8 machine (Windows 11, ARM64) through phi3-qa.py. I can't get the same model to run on the GPU (via the DML EP) or on the NPU (via the QNN EP). I built both onnxruntime (with --use_dml and --use_qnn) and onnxruntime_genai from source.


**To Reproduce**
Steps to reproduce the behavior:
1. Build onnxruntime with the --use_dml and --use_qnn flags.
2. Build onnxruntime_genai.
3. pip install the custom-built wheels in a Python 3.10 venv.
4. Run phi3-qa.py with each of the "-e" values cpu, dml, and qnn. Only cpu works.
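Before step 4, it may be worth confirming that the custom onnxruntime wheel actually exposes the DML and QNN execution providers. The following is a minimal sketch, not part of the original repro: `missing_providers` and the `REQUIRED` mapping are hypothetical helpers introduced here, and the only onnxruntime call used is `onnxruntime.get_available_providers()`.

```python
# Sketch: check which execution providers the installed onnxruntime build reports.
# REQUIRED and missing_providers are illustrative helpers, not part of phi3-qa.py.

# Maps the "-e" value used in the repro to the provider name onnxruntime reports.
REQUIRED = {
    "cpu": "CPUExecutionProvider",
    "dml": "DmlExecutionProvider",
    "qnn": "QNNExecutionProvider",
}


def missing_providers(available, wanted=REQUIRED):
    """Return the "-e" values whose provider is absent from `available`."""
    return [flag for flag, name in wanted.items() if name not in available]


if __name__ == "__main__":
    try:
        import onnxruntime as ort
    except ImportError:
        print("onnxruntime is not installed in this environment")
    else:
        available = ort.get_available_providers()
        print("Available providers:", available)
        print("Missing for this repro:", missing_providers(available))
```

If `dml` or `qnn` shows up as missing, the problem is likely in the build or the installed wheel rather than in the model or in phi3-qa.py.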


**Errors**
CPU:
python.exe phi3-qa.py -m cpu_and_mobile/cpu-int4-rtn-block-32-acc-level-4 --timings -e cpu
...

[This works as expected, model runs on the CPU]


GPU: 
python.exe phi3-qa.py -m cpu_and_mobile/cpu-int4-rtn-block-32-acc-level-4 --timings -e dml
...
2025-06-06 16:03:08.4113577 [E:onnxruntime:onnxruntime-genai, sequential_executor.cc:572 onnxruntime::ExecuteKernel] Non-zero status code returned while running DmlFusedNode_0_0 node. Name:'DmlFusedNode_0_0' Status Message: C:\Users\QCWorkshop\Projects\onnx-runtime\build_from_source\onnxruntime\onnxruntime\core\providers\dml\DmlExecutionProvider\src\MLOperatorAuthorImpl.cpp(2816)\onnxruntime.dll!00007FFE37DFFF24: (caller: 00007FFE37E1C404) Exception(2) tid(966c) 80070057 The parameter is incorrect.

[Program crashes with the error above]


NPU:
python.exe phi3-qa.py -m cpu_and_mobile/cpu-int4-rtn-block-32-acc-level-4 --timings -e qnn
...
Starting stage: Graph Preparation Initializing
Completed stage: Graph Preparation Initializing (347 us)
Starting stage: Graph Transformations and Optimizations
Completed stage: Graph Transformations and Optimizations (894 us)
Starting stage: Graph Sequencing for Target
Completed stage: Graph Sequencing for Target (287 us)
Starting stage: VTCM Allocation
Completed stage: VTCM Allocation (61 us)
Starting stage: Parallelization Optimization
Completed stage: Parallelization Optimization (69 us)
Starting stage: Finalizing Graph Sequence
Completed stage: Finalizing Graph Sequence (136 us)
Starting stage: Completion
Completed stage: Completion (21 us)
...

[It appears to work, but it's running on the CPU, not the NPU]
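All three runs above point at the same cpu_and_mobile/cpu-int4-rtn-block-32-acc-level-4 folder, so it may help to check which execution provider that folder's genai_config.json actually requests. The sketch below assumes the config keeps provider settings under model → decoder → session_options → provider_options as a list of single-key objects. That layout is an assumption to verify against the actual file, and both function names are hypothetical.

```python
# Sketch: list the execution providers named in a model folder's genai_config.json.
# Assumes provider settings live at model.decoder.session_options.provider_options
# as a list of {"<provider>": {...}} objects; verify against the real file.
import json
from pathlib import Path


def requested_providers(config):
    """Return provider names found in an already-parsed genai_config dict."""
    session_options = (
        config.get("model", {}).get("decoder", {}).get("session_options", {})
    )
    names = []
    for entry in session_options.get("provider_options", []):
        names.extend(entry.keys())
    return names


def load_requested_providers(model_dir):
    """Read <model_dir>/genai_config.json and return the providers it names."""
    path = Path(model_dir) / "genai_config.json"
    with path.open(encoding="utf-8") as f:
        return requested_providers(json.load(f))
```

For example, `load_requested_providers("cpu_and_mobile/cpu-int4-rtn-block-32-acc-level-4")` returning an empty list would suggest the folder's config names no provider, which, under the assumed layout, would mean only the "-e" flag could be steering the run away from the CPU.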



**Expected behavior**
I expect the model to run on the GPU with the DML execution provider and on the NPU with the QNN execution provider.



