Description
Describe the bug
I can successfully run a Phi3 model using onnxruntime_genai on a Qualcomm Elite 8 CPU (Windows 11, ARM64) through phi3-qa.py. However, I can't get the model to run on the GPU (via the DML EP) or on the NPU (via the QNN EP). I built both onnxruntime (with --use_dml and --use_qnn) and onnxruntime_genai from source.
To Reproduce
Steps to reproduce the behavior:
- Build onnxruntime with the --use_dml and --use_qnn flags.
- Build onnxruntime_genai.
- pip install the custom-built wheels in a Python 3.10 venv.
- Run phi3-qa.py with each "-e" value: cpu, dml, and qnn. Only cpu works.
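Before comparing the three runs, it may help to confirm that the custom onnxruntime wheel actually exposes the DML and QNN providers inside the venv. The snippet below is a minimal sketch, not part of phi3-qa.py: the helper name missing_providers and the flag-to-provider mapping are my own assumptions for illustration, and the only library call it relies on is onnxruntime.get_available_providers().

```python
# Hypothetical helper, not part of phi3-qa.py or onnxruntime_genai.
# Assumed mapping from the "-e" values used above to onnxruntime's
# execution provider names.
EXPECTED = {
    "cpu": "CPUExecutionProvider",
    "dml": "DmlExecutionProvider",
    "qnn": "QNNExecutionProvider",
}


def missing_providers(available, expected=EXPECTED):
    """Return the "-e" values whose provider is absent from `available`."""
    return sorted(flag for flag, name in expected.items() if name not in available)


if __name__ == "__main__":
    try:
        import onnxruntime as ort
    except ImportError:
        print("onnxruntime is not installed in this environment")
    else:
        # Lists the providers compiled into the installed wheel.
        available = ort.get_available_providers()
        print("available:", available)
        print("missing:", missing_providers(available))
```

If "dml" or "qnn" shows up as missing, the wheel installed in the venv is probably not the one built with --use_dml and --use_qnn. An empty result only shows that the providers were compiled in, not that onnxruntime_genai selects them at run time.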
Errors
CPU:
python.exe phi3-qa.py -m cpu_and_mobile/cpu-int4-rtn-block-32-acc-level-4 --timings -e cpu
...
[This works as expected, model runs on the CPU]
GPU:
python.exe phi3-qa.py -m cpu_and_mobile/cpu-int4-rtn-block-32-acc-level-4 --timings -e dml
...
2025-06-06 16:03:08.4113577 [E:onnxruntime:onnxruntime-genai, sequential_executor.cc:572 onnxruntime::ExecuteKernel] Non-zero status code returned while running DmlFusedNode_0_0 node. Name:'DmlFusedNode_0_0' Status Message: C:\Users\QCWorkshop\Projects\onnx-runtime\build_from_source\onnxruntime\onnxruntime\core\providers\dml\DmlExecutionProvider\src\MLOperatorAuthorImpl.cpp(2816)\onnxruntime.dll!00007FFE37DFFF24: (caller: 00007FFE37E1C404) Exception(2) tid(966c) 80070057 The parameter is incorrect.
[Program crashes with the error above]
NPU:
python.exe phi3-qa.py -m cpu_and_mobile/cpu-int4-rtn-block-32-acc-level-4 --timings -e qnn
...
Starting stage: Graph Preparation Initializing
Completed stage: Graph Preparation Initializing (347 us)
Starting stage: Graph Transformations and Optimizations
Completed stage: Graph Transformations and Optimizations (894 us)
Starting stage: Graph Sequencing for Target
Completed stage: Graph Sequencing for Target (287 us)
Starting stage: VTCM Allocation
Completed stage: VTCM Allocation (61 us)
Starting stage: Parallelization Optimization
Completed stage: Parallelization Optimization (69 us)
Starting stage: Finalizing Graph Sequence
Completed stage: Finalizing Graph Sequence (136 us)
Starting stage: Completion
Completed stage: Completion (21 us)
...
[It appears to work, but the model is running on the CPU, not the NPU]
Expected behavior
I expect the model to run on the GPU with -e dml and on the NPU with -e qnn, i.e. working DML and QNN execution provider support.