Description
Is your feature request related to a problem? Please describe.
The current catalog of models on QAI-Hub optimized for NPU (Neural Processing Unit) acceleration lacks advanced, open-source models capable of powering tool-augmented, agentic, and function-calling workflows directly on device.
The latest Qwen3 series offers groundbreaking advancements in reasoning, code generation, multilingual understanding, and dynamic tool use, making them ideal candidates for edge AI scenarios where NPU inference is required.
Request: Add the Qwen3 models listed below, prioritized from largest to smallest for NPU-optimized deployment, to enable cutting-edge on-device intelligent agents.
Details of models being requested (Ordered by Priority for NPU Deployment):
🔺 Highest Priority
Model Name: Qwen3-32B
Type: Dense
Source repo link: https://github.com/QwenLM/Qwen3
Use Case: NPU-accelerated intelligent agent with dynamic tool orchestration, large-scale multi-turn reasoning, multilingual interaction.
🔺 High Priority
Model Name: Qwen3-30B-A3B (MoE with 3B active parameters)
Type: Mixture of Experts
Source repo link: https://github.com/QwenLM/Qwen3
Use Case: Memory-efficient model excellent for constrained NPU memory, still achieving strong performance in reasoning and external tool use.
⚪ Medium Priority
Model Name: Qwen3-14B
Type: Dense
Source repo link: https://github.com/QwenLM/Qwen3
Use Case: Mid-size agentic assistant for mobile/edge deployment with solid multi-step reasoning and OpenAI-compatible function-calling workflows.
⚪ Medium Priority
Model Name: Qwen3-8B
Type: Dense
Source repo link: https://github.com/QwenLM/Qwen3
Use Case: Mobile-friendly assistant model for real-time reasoning and tool-augmented coding tasks, optimized for limited NPU memory.
🔻 Lower Priority
Model Name: Qwen3-4B
Type: Dense
Source repo link: https://github.com/QwenLM/Qwen3
Use Case: Lightweight fallback for very constrained NPU setups while retaining essential tool-calling and multilingual capabilities.
Additional Context for Requested Models:
- Native dynamic thinking and non-thinking modes to optimize different reasoning workflows.
- Fully compatible with OpenAI API standards including /v1/chat/completions and function/tool schemas.
- Designed for dynamic external tool invocation, multi-turn dialogues, and intelligent agent workflows.
- Quantized versions (such as Q4_K_M) may be required to maximize NPU compatibility without sacrificing functionality.
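As a sketch of the kind of OpenAI-compatible request these models should accept — note the endpoint URL and model name are illustrative, and passing `enable_thinking` via `chat_template_kwargs` is how servers such as vLLM expose Qwen3's thinking toggle, which is an assumption for other serving stacks:

```python
import json

# Illustrative endpoint for a locally served, NPU-accelerated model
# (not a confirmed QAI-Hub interface).
BASE_URL = "http://localhost:8000/v1/chat/completions"

# An OpenAI-style tool schema the served model would need to parse.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

payload = {
    "model": "qwen3-8b",  # illustrative model id
    "messages": [{"role": "user", "content": "What's the weather in Taipei?"}],
    "tools": tools,
    "tool_choice": "auto",
    # Qwen3's chat template supports toggling thinking mode; vLLM forwards
    # this via chat_template_kwargs (assumed here for other stacks).
    "chat_template_kwargs": {"enable_thinking": False},
}

# The request body that would be POSTed to BASE_URL.
body = json.dumps(payload)
```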
Key Requirements:
- Full NPU Optimization for all models (quantization, kernel acceleration).
- Support OpenAI API endpoints: /v1/chat/completions, /v1/models, /v1/completions.
- Ensure Tools/Function Calling Support: parsing of OpenAI tool schemas, dynamic invocation with arguments.
- Documentation Needed: show how to run thinking vs. non-thinking mode, and clearly outline the quantization types used, memory usage, NPU compatibility constraints, and any performance trade-offs.
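The tool-calling requirement above can be sketched as the dispatch step a client performs on the model's reply. The response shape follows the standard OpenAI chat-completions tool-call format; the `get_weather` handler and the mocked message are hypothetical stand-ins:

```python
import json

# Hypothetical local handler the agent can invoke.
def get_weather(city: str) -> str:
    return f"Sunny in {city}"

HANDLERS = {"get_weather": get_weather}

# Mocked assistant message in the OpenAI chat-completions tool-call format.
message = {
    "role": "assistant",
    "content": None,
    "tool_calls": [{
        "id": "call_0",
        "type": "function",
        "function": {
            "name": "get_weather",
            # Arguments arrive as a JSON-encoded string, not a dict.
            "arguments": "{\"city\": \"Taipei\"}",
        },
    }],
}

# Dynamic invocation: parse each call's arguments and dispatch to the
# named handler, producing "tool" messages to feed back into the dialogue.
results = []
for call in message.get("tool_calls", []):
    fn = HANDLERS[call["function"]["name"]]
    args = json.loads(call["function"]["arguments"])
    results.append({
        "role": "tool",
        "tool_call_id": call["id"],
        "content": fn(**args),
    })

print(results[0]["content"])  # → Sunny in Taipei
```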
References: