[Feature] Phi-4-MM support #6544

Open
2 of 7 tasks
lifuhuang opened this issue May 23, 2025 · 0 comments
lifuhuang commented May 23, 2025

Update

Basic text + vision support is already in main. However, there are many known issues across the board.

Known limitations (see the Execution Plan below for the full list):

  1. Audio capabilities: audio input is not supported at all yet.
  2. LoRA / Image quality: Phi4MM depends on LoRA for full image capability, but there are some compatibility issues with the native SGL LoRA solution. We are working on solving them by refactoring / generalizing SGL LoRA capabilities. (Fixed with Refactor LoRA handling to support adapter tensors in fused format #6585, Fix incorrect LoRA weight loading for fused gate_up_proj #6734, and Support LoRA in TestOpenAIVisionServer and fix fused kv_proj loading bug #6861.)
  3. Image tokens: Phi4MM supports two image token conventions (<|image1|> and <|endoftext10|>); currently we only support the latter. If you use the default chat template, it will automatically pick the supported one.
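If you build prompts by hand rather than through the default chat template, a workaround for limitation 3 is to rewrite `<|image1|>`-style placeholders into the supported `<|endoftext10|>` token before sending the request. The helper below is a hypothetical sketch, not part of SGLang:

```python
import re

# Phi-4-MM accepts two image-placeholder conventions, <|imageN|> and
# <|endoftext10|>; SGLang currently handles only the latter, so normalize
# hand-built prompts before submitting them. (Hypothetical helper.)
_IMAGE_TAG = re.compile(r"<\|image\d+\|>")

def normalize_phi4mm_prompt(prompt: str) -> str:
    """Replace <|imageN|> placeholders with the supported <|endoftext10|> token."""
    return _IMAGE_TAG.sub("<|endoftext10|>", prompt)

print(normalize_phi4mm_prompt("<|image1|> Describe this picture."))
# → <|endoftext10|> Describe this picture.
```

Prompts produced by the default chat template do not need this step, since the template already emits the supported token.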

Motivation

Support the Phi4 Multimodal model (https://huggingface.co/microsoft/Phi-4-multimodal-instruct) in SGL.

Execution Plan:

Related resources

No response
