This repository contains SoTA algorithms, models, and interesting projects in the area of multimodal understanding and content generation.
ONE is short for "ONE for all".
- [2025.04.10] We release v0.3.0. More than 15 SoTA generative models are added, including Flux, CogView4, OpenSora 2.0, Movie Gen 30B, and CogVideoX 5B~30B. Have fun!
- [2025.02.21] We support DeepSeek Janus-Pro, a SoTA multimodal understanding and generation model.
- [2024.11.06] v0.2.0 is released.
To install v0.3.0, first install MindSpore 2.5.0, then run `pip install mindone`.

Alternatively, to install the latest version from the master branch, run:

```bash
git clone https://github.com/mindspore-lab/mindone.git
cd mindone
pip install -e .
```
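Before using mindone, you can confirm that the underlying MindSpore installation works. MindSpore ships a built-in installation self-check, `mindspore.run_check()`, which prints the installed version and whether a test computation succeeds:

```bash
# Verify MindSpore: prints the installed version and a success message
python -c "import mindspore; mindspore.run_check()"
```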
We support state-of-the-art diffusion models for generating images, audio, and video. Let's get started using Stable Diffusion 3 as an example.
Hello MindSpore from Stable Diffusion 3!
```python
import mindspore
from mindone.diffusers import StableDiffusion3Pipeline

# Load the Stable Diffusion 3 pipeline in float16 to reduce memory usage
pipe = StableDiffusion3Pipeline.from_pretrained(
    "stabilityai/stable-diffusion-3-medium-diffusers",
    mindspore_dtype=mindspore.float16,
)
prompt = "A cat holding a sign that says 'Hello MindSpore'"
# The pipeline returns a tuple; the first element is the list of generated images
image = pipe(prompt)[0][0]
image.save("sd3.png")
```
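For reproducible results, a seed can be passed through a random generator. This is a minimal sketch assuming the pipeline's `generator` argument accepts `np.random.Generator`, which mindone.diffusers uses in place of HF diffusers' `torch.Generator`:

```python
import numpy as np
import mindspore
from mindone.diffusers import StableDiffusion3Pipeline

pipe = StableDiffusion3Pipeline.from_pretrained(
    "stabilityai/stable-diffusion-3-medium-diffusers",
    mindspore_dtype=mindspore.float16,
)
# Assumption: a NumPy Generator plays the role of torch.Generator in HF diffusers
generator = np.random.Generator(np.random.PCG64(seed=42))
image = pipe(
    "A cat holding a sign that says 'Hello MindSpore'",
    generator=generator,
)[0][0]
image.save("sd3_seed42.png")
```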
- mindone.diffusers is under active development; most tasks have been tested with MindSpore 2.5.0 on Ascend Atlas 800T A2 machines.
- Compatible with HF diffusers 0.32.2.
component | features |
---|---|
pipeline | 160+ pipelines covering text-to-image, text-to-video, and text-to-audio tasks |
models | 50+ base models (autoencoders and transformers), matching HF diffusers |
schedulers | 35+ diffusion schedulers (e.g., DDPM, DPM-Solver), matching HF diffusers |
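Since the API mirrors HF diffusers, components can be mixed and matched in the usual way. The sketch below swaps the scheduler of a loaded pipeline; it assumes `DiffusionPipeline` and `DPMSolverMultistepScheduler` are exported by `mindone.diffusers` under the same names as in HF diffusers 0.32.2:

```python
import mindspore
from mindone.diffusers import DiffusionPipeline, DPMSolverMultistepScheduler

# Load any supported checkpoint (SDXL here) in float16
pipe = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    mindspore_dtype=mindspore.float16,
)
# Rebuild a different scheduler from the current scheduler's config,
# exactly as in the HF diffusers workflow
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)
image = pipe("a watercolor painting of a lighthouse at dawn")[0][0]
image.save("sdxl_dpm.png")
```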
task | model | inference | finetune | pretrain | institute |
---|---|---|---|---|---|
Image-to-Video | hunyuanvideo-i2v 🔥🔥 | ✅ | ❌ | ❌ | Tencent |
Text/Image-to-Video | wan2.1 🔥🔥🔥 | ✅ | ❌ | ❌ | Alibaba |
Text-to-Image | cogview4 🔥🔥🔥 | ✅ | ❌ | ❌ | Zhipu AI |
Text-to-Video | step_video_t2v 🔥🔥 | ✅ | ❌ | ❌ | StepFun |
Image-Text-to-Text | qwen2_vl 🔥🔥🔥 | ✅ | ❌ | ❌ | Alibaba |
Any-to-Any | janus 🔥🔥🔥 | ✅ | ✅ | ✅ | DeepSeek |
Any-to-Any | emu3 🔥🔥 | ✅ | ✅ | ✅ | BAAI |
Class-to-Image | var 🔥🔥 | ✅ | ✅ | ✅ | ByteDance |
Text/Image-to-Video | hpcai open sora 1.2/2.0 🔥🔥 | ✅ | ✅ | ✅ | HPC-AI Tech |
Text/Image-to-Video | cogvideox 1.5 5B~30B 🔥🔥 | ✅ | ✅ | ✅ | Zhipu AI |
Text-to-Video | open sora plan 1.3 🔥🔥 | ✅ | ✅ | ✅ | PKU |
Text-to-Video | hunyuanvideo 🔥🔥 | ✅ | ✅ | ✅ | Tencent |
Text-to-Video | movie gen 30B 🔥🔥 | ✅ | ✅ | ✅ | Meta |
Video-Encode-Decode | magvit | ✅ | ✅ | ✅ | |
Text-to-Image | story_diffusion | ✅ | ❌ | ❌ | ByteDance |
Image-to-Video | dynamicrafter | ✅ | ❌ | ❌ | Tencent |
Video-to-Video | venhancer | ✅ | ❌ | ❌ | Shanghai AI Lab |
Text-to-Video | t2v_turbo | ✅ | ✅ | ✅ | |
Image-to-Video | svd | ✅ | ✅ | ✅ | Stability AI |
Text-to-Video | animate diff | ✅ | ✅ | ✅ | CUHK |
Text/Image-to-Video | video composer | ✅ | ✅ | ✅ | Alibaba |
Text-to-Image | flux 🔥 | ✅ | ✅ | ❌ | Black Forest Labs |
Text-to-Image | stable diffusion 3 🔥 | ✅ | ✅ | ❌ | Stability AI |
Text-to-Image | kohya_sd_scripts | ✅ | ✅ | ❌ | kohya |
Text-to-Image | stable diffusion xl | ✅ | ✅ | ✅ | Stability AI |
Text-to-Image | stable diffusion | ✅ | ✅ | ✅ | Stability AI |
Text-to-Image | hunyuan_dit | ✅ | ✅ | ✅ | Tencent |
Text-to-Image | pixart_sigma | ✅ | ✅ | ✅ | Huawei |
Text-to-Image | fit | ✅ | ✅ | ✅ | Shanghai AI Lab |
Class-to-Video | latte | ✅ | ✅ | ✅ | Shanghai AI Lab |
Class-to-Image | dit | ✅ | ✅ | ✅ | Meta |
Text-to-Image | t2i-adapter | ✅ | ✅ | ✅ | Shanghai AI Lab |
Text-to-Image | ip adapter | ✅ | ✅ | ✅ | Tencent |
Text-to-3D | mvdream | ✅ | ✅ | ✅ | ByteDance |
Image-to-3D | instantmesh | ✅ | ✅ | ✅ | Tencent |
Image-to-3D | sv3d | ✅ | ✅ | ✅ | Stability AI |
Text/Image-to-3D | hunyuan3d-1.0 | ✅ | ✅ | ✅ | Tencent |
task | model | inference | finetune | pretrain | features |
---|---|---|---|---|---|
Image-Text-to-Text | pllava 🔥 | ✅ | ❌ | ❌ | supports video and image captioning |