[MODEL REQUEST] requesting new model (Qwen3 Series (32B → 4B) for NPU-Optimized Inference with Tools/Function Calling & OpenAI API Compatibility on QAI-Hub) #195

@zytoh0

Is your feature request related to a problem? Please describe.
The current catalog of models on QAI-Hub optimized for NPU (Neural Processing Unit) acceleration lacks advanced, open-source models capable of powering tool-augmented, agentic, and function-calling workflows directly on device.

The latest Qwen3 series offers groundbreaking advancements in reasoning, code generation, multilingual understanding, and dynamic tool use, making them ideal candidates for edge AI scenarios where NPU inference is required.

Request: Add the Qwen3 models listed below, prioritized from largest to smallest for NPU-optimized deployment, to enable cutting-edge, on-device intelligent agents.

Details of models being requested (Ordered by Priority for NPU Deployment):

🔺 Highest Priority

Model Name: Qwen3-32B

Type: Dense

Source repo link: https://github.com/QwenLM/Qwen3

Use Case: NPU-accelerated intelligent agent with dynamic tool orchestration, large-scale multi-turn reasoning, multilingual interaction.

🔺 High Priority

Model Name: Qwen3-30B-A3B (MoE with 3B active parameters)

Type: Mixture of Experts

Source repo link: https://github.com/QwenLM/Qwen3

Use Case: Compute-efficient MoE (only ~3B parameters active per token), reducing per-token NPU compute while retaining strong reasoning and external tool use; note that the full 30B weights must still fit in memory.

⚪ Medium Priority

Model Name: Qwen3-14B

Type: Dense

Source repo link: https://github.com/QwenLM/Qwen3

Use Case: Mid-size agentic assistant for mobile/edge deployment with solid multi-step reasoning and OpenAI API function-calling workflows.

⚪ Medium Priority

Model Name: Qwen3-8B

Type: Dense

Source repo link: https://github.com/QwenLM/Qwen3

Use Case: Mobile-friendly assistant model for real-time reasoning and tool-augmented coding tasks, optimized for limited NPU memory.

🔻 Lower Priority

Model Name: Qwen3-4B

Type: Dense

Source repo link: https://github.com/QwenLM/Qwen3

Use Case: Lightweight fallback for very constrained NPU setups while retaining essential tool-calling and multilingual capabilities.

Additional Context for Requested Models:

  • Native dynamic thinking and non-thinking modes to optimize different reasoning workflows.
  • Fully compatible with OpenAI API standards including /v1/chat/completions and function/tool schemas.
  • Designed for dynamic external tool invocation, multi-turn dialogues, and intelligent agent workflows.
  • Quantized versions (such as Q4_K_M) may be required to maximize NPU compatibility without sacrificing functionality.
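To make the OpenAI-compatibility point concrete, here is a minimal sketch of the kind of /v1/chat/completions request body with a tool schema that these models would need to accept. The model id (`qwen3-8b-npu`) and the `get_weather` tool are illustrative placeholders, not real QAI-Hub identifiers.

```python
import json

# OpenAI-style tool (function) schema. The tool name and parameters here are
# hypothetical; any JSON-Schema-described function would follow this shape.
tool_schema = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Return the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "City name"},
            },
            "required": ["city"],
        },
    },
}

# Request body as it would be POSTed to /v1/chat/completions.
request_body = {
    "model": "qwen3-8b-npu",  # placeholder model id, not a real QAI-Hub name
    "messages": [{"role": "user", "content": "What's the weather in Tokyo?"}],
    "tools": [tool_schema],
    "tool_choice": "auto",
}

print(json.dumps(request_body, indent=2))
```

A server claiming the compatibility described above should accept this payload unchanged from any standard OpenAI client.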

Key Requirements:

  • Full NPU Optimization for all models (quantization, kernel acceleration).
  • Support OpenAI API endpoints: /v1/chat/completions, /v1/models, /v1/completions.
  • Ensure Tools/Function Calling Support: parsing of OpenAI tool schemas, dynamic invocation with arguments.
  • Documentation Needed: how to run thinking vs. non-thinking mode; the quantization types used; memory usage; NPU compatibility constraints; and any performance trade-offs.
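The "dynamic invocation with arguments" requirement can be sketched as follows: parse an OpenAI-style tool call out of an assistant message and dispatch it to a local Python function. The `assistant_message` dict below is a hand-written stand-in for `response["choices"][0]["message"]` from a compatible server, and `get_weather` is a dummy tool.

```python
import json

def get_weather(city: str) -> str:
    return f"Sunny in {city}"  # dummy tool implementation

TOOLS = {"get_weather": get_weather}

# Stand-in for response["choices"][0]["message"] from /v1/chat/completions.
assistant_message = {
    "role": "assistant",
    "content": None,
    "tool_calls": [
        {
            "id": "call_0",
            "type": "function",
            "function": {
                "name": "get_weather",
                "arguments": json.dumps({"city": "Tokyo"}),
            },
        }
    ],
}

results = []
for call in assistant_message.get("tool_calls", []):
    fn = TOOLS[call["function"]["name"]]
    # In the OpenAI format, arguments arrive as a JSON-encoded string.
    args = json.loads(call["function"]["arguments"])
    # Feed the result back as a "tool" role message for the next turn.
    results.append({
        "role": "tool",
        "tool_call_id": call["id"],
        "content": fn(**args),
    })

print(results[0]["content"])  # -> Sunny in Tokyo
```

The loop-and-dispatch pattern is what the requested models' tool-calling support ultimately has to drive on device.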
