Support CogVideoX T2V #165
Conversation
CogVideoX-5b: https://wandb.ai/sayakpaul/finetrainers-cog/runs/7g3yu56z
CogVideoX 1.5 5B: https://wandb.ai/sayakpaul/finetrainers-cog/runs/pgdk9tit
return [("video", output)] | ||
|
||
|
||
def _get_t5_prompt_embeds( |
In the future, I would love to think of a way to reuse encode_prompt()
but that is a battle for a different day.
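As a rough illustration of that reuse idea (purely a sketch; the wrapper below is hypothetical and simply delegates to the diffusers pipeline method rather than re-implementing the T5 encoding):

```python
# Hypothetical sketch: delegate prompt encoding to CogVideoXPipeline.encode_prompt
# instead of keeping a separate _get_t5_prompt_embeds. Not the code in this PR.
from typing import List, Optional, Union

import torch
from diffusers import CogVideoXPipeline


def encode_prompts_via_pipeline(
    pipe: CogVideoXPipeline,
    prompt: Union[str, List[str]],
    max_sequence_length: int = 226,
    device: Optional[torch.device] = None,
) -> torch.Tensor:
    # encode_prompt returns (prompt_embeds, negative_prompt_embeds); CFG is skipped here.
    prompt_embeds, _ = pipe.encode_prompt(
        prompt=prompt,
        negative_prompt=None,
        do_classifier_free_guidance=False,
        max_sequence_length=max_sequence_length,
        device=device,
    )
    return prompt_embeds
```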
During fine-tuning, if CFG is not used, should the negative prompt `""` also be encoded here (i.e., encode both the prompt and `negative_prompt=""` for all LoRA / SFT datasets)?
We randomly drop out the caption or zero it out based on the chosen scheme, so that should suffice, I guess?
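For context, a minimal sketch of what such a dropout scheme typically looks like (the probability value, function name, and scheme labels are illustrative, not finetrainers' actual implementation):

```python
import random
from typing import List, Tuple

import torch


def maybe_drop_captions(
    captions: List[str],
    prompt_embeds: torch.Tensor,
    dropout_p: float = 0.1,          # illustrative probability
    scheme: str = "empty_string",    # or "zero_embedding"
) -> Tuple[List[str], torch.Tensor]:
    # Randomly replace the caption with "" (so an empty prompt gets encoded),
    # or zero out the already-computed text embeddings, depending on the scheme.
    if random.random() < dropout_p:
        if scheme == "empty_string":
            captions = ["" for _ in captions]
        else:
            prompt_embeds = torch.zeros_like(prompt_embeds)
    return captions, prompt_embeds
```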
From a quick look, this is different from how I would have tried to implement the CogVideoX integration, and this is where we could introduce a better abstraction for loss calculation. Will do a refactor tomorrow if I find time, or whenever I'm back from the New Year holidays.
As such, the trainer should not contain anything model-specific. If something needs to be implemented per-model, it should be done via a helper utility invoked from the trainer (a short sketch follows after the list below).
For schedulers, we need to support three kinds:
- CogVideoXDDPM
- FlowMatching
- Normal DDPM (we don't have any model yet, so this is not actionable)
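A hedged sketch of the helper-utility registration described above; the registry shape and helper names are assumptions, with stand-in callables instead of the real ones:

```python
from typing import Any, Callable, Dict


# Stand-ins for the real per-model helpers.
def cogvideox_prepare_latents(**kwargs) -> Dict[str, Any]: ...
def cogvideox_post_latent_preparation(**kwargs) -> Dict[str, Any]: ...
def cogvideox_calculate_timesteps(**kwargs) -> Any: ...


SUPPORTED_MODEL_CONFIGS: Dict[str, Dict[str, Callable[..., Any]]] = {
    # CogVideoX registers a DDPM-style timestep helper in addition to the shared hooks;
    # flow-matching models (LTX-Video, Hunyuan Video) would omit "calculate_timesteps"
    # and fall back to the trainer's default path.
    "cogvideox": {
        "prepare_latents": cogvideox_prepare_latents,
        "post_latent_preparation": cogvideox_post_latent_preparation,
        "calculate_timesteps": cogvideox_calculate_timesteps,
    },
}
```

The trainer then only looks up whichever helpers were registered, without branching on the model name.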
```python
def post_latent_preparation(latents: torch.Tensor, **kwargs) -> torch.Tensor:
    return {"latents": latents}
```
We also have to use the VAE scaling factor here. The same bug is in Hunyuan Video (will update in a separate PR).
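For reference, a hedged sketch of the scaling being referred to (the helper name `_scale_latents` appears later in this thread; treating `vae_config` as an object with attribute access is an assumption):

```python
import torch


def _scale_latents(latents: torch.Tensor, vae_config) -> torch.Tensor:
    # If the VAE config sets invert_scale_latents=True, the factor is applied inversely;
    # otherwise the latents are multiplied by the scaling factor.
    if getattr(vae_config, "invert_scale_latents", False):
        return latents / vae_config.scaling_factor
    return latents * vae_config.scaling_factor
```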
```python
from diffusers.pipelines.cogvideo.pipeline_cogvideox import get_resize_crop_region_for_grid


def prepare_rotary_positional_embeddings(
```
In diffusers, we plan to make the RoPE a part of the modeling components themselves, as done for Flux, Mochi, Hunyuan Video, etc. Will be doing it for Cog too, so that we don't have to do it here.
Until that happens, I guess it will stay here?
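For reference, a hedged sketch of what that preparation roughly looks like, loosely mirroring the diffusers CogVideoX pipeline internals (argument names are illustrative; check the pipeline for the exact handling of the 1.5 variant):

```python
import torch
from diffusers.models.embeddings import get_3d_rotary_pos_embed
from diffusers.pipelines.cogvideo.pipeline_cogvideox import get_resize_crop_region_for_grid


def prepare_rotary_positional_embeddings(
    height: int,
    width: int,
    num_frames: int,
    vae_scale_factor_spatial: int,
    patch_size: int,
    attention_head_dim: int,
    base_height: int,
    base_width: int,
    device: torch.device,
):
    # Latent-space grid size after VAE downsampling and transformer patchification.
    grid_height = height // (vae_scale_factor_spatial * patch_size)
    grid_width = width // (vae_scale_factor_spatial * patch_size)
    base_size_height = base_height // patch_size
    base_size_width = base_width // patch_size

    grid_crops_coords = get_resize_crop_region_for_grid(
        (grid_height, grid_width), base_size_width, base_size_height
    )
    freqs_cos, freqs_sin = get_3d_rotary_pos_embed(
        embed_dim=attention_head_dim,
        crops_coords=grid_crops_coords,
        grid_size=(grid_height, grid_width),
        temporal_size=num_frames,
    )
    return freqs_cos.to(device), freqs_sin.to(device)
```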
I think I get the idea you're trying to convey and can do it myself. I will let you do the refactors you mentioned in comments like this. IMO, in this PR:
I will revert the
I asked questions about some code, looking forward to the answers. Thank you for your support of CogVideoX.
Thanks @zRzRzRzRzRzRzR! I am working on addressing some of @a-r-r-o-w's feedback and will let both of you know once I have pushed the latest changes.
@a-r-r-o-w, in the latest commits, I have
PTAL and I can iterate further if need be. @zRzRzRzRzRzRzR, thanks for suggesting the changes related to padding the frames. I have committed your fix here. PTAL; you're welcome to review the other changes too.
finetrainers/trainer.py
if "calculate_timesteps" in self.model_config.keys(): | ||
timesteps = self.model_config["calculate_timesteps"]( | ||
scheduler=self.scheduler, | ||
latent_conditions=latent_conditions, | ||
generator=self.state.generator | ||
) | ||
else: # As flow-based calculations are more common for now. | ||
sigmas = self.derive_sigmas_for_flow(batch_size=batch_size, scheduler_sigmas=scheduler_sigmas) | ||
timesteps = (sigmas * 1000.0).long() | ||
|
||
# Same reason as above. |
Because right now, the else branch is more common so it made sense to do it this way. Open to changes.
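For illustration, a hedged sketch of what a per-model `calculate_timesteps` entry could look like for a DDPM-style scheduler such as CogVideoX's (the exact signature in finetrainers may differ):

```python
import torch


def calculate_timesteps(scheduler, latent_conditions, generator=None, **kwargs):
    # DDPM-style training samples a random discrete timestep per example,
    # unlike the flow-matching path that derives timesteps from sigmas.
    latents = latent_conditions["latents"]
    batch_size = latents.shape[0]
    timesteps = torch.randint(
        0, scheduler.config.num_train_timesteps, (batch_size,), generator=generator
    ).to(latents.device)
    return timesteps.long()
```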
Okay to merge now. Just wanted to do a few 5000-step runs on each of the following cases to ensure everything is working as expected:

- Does SNR weighting for Cog have to be `1 / (1 - alphas_cumprod)`? This seems incorrect in the original implementation and our old script. SNR should be calculated as `alphas_cumprod / (1 - alphas_cumprod)`. If the latter works better, that is what we should use (see the short sketch after this list).
- I've commented out the `invert_scale_latents` bits. I think it is required only for Image-to-Video training, but will do a test run to confirm.
- CogVideoX 1.0 LoRA with current SNR implementation
- CogVideoX 1.5 LoRA with current SNR implementation
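As referenced in the first item above, a short sketch of the two weightings (schedule values are illustrative; the point is only the algebraic relationship):

```python
import torch

# Illustrative alphas_cumprod values from a DDPM-style schedule.
alphas_cumprod = torch.linspace(0.9999, 0.01, steps=1000)

snr = alphas_cumprod / (1.0 - alphas_cumprod)   # standard SNR definition
old_weight = 1.0 / (1.0 - alphas_cumprod)       # weighting in the original implementation

# The two differ by exactly 1: 1 / (1 - a) == a / (1 - a) + 1.
assert torch.allclose(old_weight, snr + 1.0)
```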
TODO:
- make SNR weighting opt-in. Should be done in a follow-up PR as I would like to add different commonly used options for DDIM weighting and test all of them (a rough sketch of the opt-in idea follows after this list)
- update README like we have for LTX and Hunyuan. Should think about creating a `docs/` folder with model-specific guides and training run examples
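As referenced in the first TODO item, a hedged sketch of what opt-in SNR weighting could look like (the flag name and helper are assumptions, not existing finetrainers options):

```python
import argparse

import torch

parser = argparse.ArgumentParser()
parser.add_argument(
    "--snr_weighting",
    action="store_true",
    help="Hypothetical flag: apply SNR-based loss weighting instead of uniform weighting.",
)


def loss_weight(alphas_cumprod: torch.Tensor, timesteps: torch.Tensor, use_snr: bool) -> torch.Tensor:
    # Default (opt-out): uniform weighting of the per-timestep loss.
    if not use_snr:
        return torch.ones_like(timesteps, dtype=torch.float32)
    a = alphas_cumprod[timesteps]
    return a / (1.0 - a)
```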
I believe Image-to-Video is not supported yet. Let's do it in follow-up PR.
@a-r-r-o-w I don't appreciate an intervention like this. I have tried to respect the comments and have done my best to address and sort them out in good spirit. The changes you added went in without any discussion or review whatsoever. I don't think this is explained by "we want to ship fast".
I thought you said above that you were okay with me taking up the refactors. To address why the implementation before my latest commits was a no-go:
The reason for doing it this way is:
When I said "we want to ship fast", I understood most of how this was to be designed for minimal coupling between different moving parts. If I had implemented it, this would have been a straight shot 30 mins of work because:
From working with all the existing trainers for image models, I have some ideas in mind that I would like to re-use and some that I want to introduce. For me, it is clear how to do it fast and move fast, and the changes that you see are exactly how I would have done it on the first try. So, in that sense, I do think we are moving fast. That said, I was under the assumption that it was okay for me to take up the refactor, and only then made the changes. With that in mind, I will no longer modify your PRs unless explicitly granted permission. Explaining, iterating and testing all this would just have taken a lot more time, and I want to prioritize more important things:
This will help us truly be a "memory-efficient" trainer, because currently we are nowhere near that for the decent resolutions people want to use. Eventually, the functions that we use as "model utilities" can be made into ModelSpec classes with syntactic sugar, so that we don't have to work with all the ignored kwargs and other hacks made to support the three models we have now.
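A hedged sketch of that ModelSpec idea (only the helpers that appear in this thread are listed; the full field set and defaults are assumptions):

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Optional


@dataclass
class ModelSpec:
    # Helpers named in this thread; the complete set of fields is an assumption.
    prepare_latents: Callable[..., Dict[str, Any]]
    post_latent_preparation: Callable[..., Dict[str, Any]]
    # Optional hooks: models without them fall back to the trainer's defaults
    # (e.g. the flow-matching timestep path when calculate_timesteps is None).
    prepare_rotary_positional_embeddings: Optional[Callable[..., Any]] = None
    calculate_timesteps: Optional[Callable[..., Any]] = None
```

The trainer would then call `spec.prepare_latents(...)` and friends directly, instead of indexing into a dict of functions and ignoring unused kwargs.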
I think the latent scaling is required regardless of whether you're precomputing or not, no? But I see that you have moved that to
Yeah, of course. As also mentioned over DM, if you have collaborators on the repo who are willing to take up work with good intentions, it doesn't hurt to allow them a bit more time and to let them grow in doing so. If I were being slow, I would not have adjusted this PR itself quickly, addressing your feedback. It's not repetitive, either. This was the first PR I wanted to take up, and it would have been a good opportunity for me to develop the muscle memory for the design we're hoping to have. It doesn't hurt to have this much allowance.
The code for non-precomputed latent training and multiplying the scaling factor (in prepare_latents) was not added by me; my commits don't have any such changes. At the time I was asked for review, the scaling factor was being multiplied twice in the non-precomputed code path: once in prepare_latents and once in post_latent_preparation (via a call to _scale_latents), which was also wrong. It was only correct when using pre-computed latents. That said, sincere apologies for the misunderstanding on my part that you were okay with me taking up the refactor.
It was properly guarded:

```python
def post_latent_preparation(latents: torch.Tensor, **kwargs) -> torch.Tensor:
    if kwargs.get("precompute_conditions", False) and kwargs.get("vae_config", None) is not None:
        latents = _scale_latents(latents, kwargs.get("vae_config"))
    latents = _pad_frames(latents, kwargs.get("denoier_config", None))
    latents = latents.permute(0, 2, 1, 3, 4)  # [B, F, C, H, W]
    return {"latents": latents}
```

Isn't this exactly what was done in this commit?

```python
def prepare_latents(
    vae: AutoencoderKLCogVideoX,
    image_or_video: torch.Tensor,
    device: Optional[torch.device] = None,
    dtype: Optional[torch.dtype] = None,
    generator: Optional[torch.Generator] = None,
    precompute: bool = False,
    **kwargs,
) -> torch.Tensor:
    device = device or vae.device
    dtype = dtype or vae.dtype
    if image_or_video.ndim == 4:
        image_or_video = image_or_video.unsqueeze(2)
    assert image_or_video.ndim == 5, f"Expected 5D tensor, got {image_or_video.ndim}D tensor"
    image_or_video = image_or_video.to(device=device, dtype=vae.dtype)
    image_or_video = image_or_video.permute(0, 2, 1, 3, 4)  # [B, C, F, H, W]
    if not precompute:
        latents = vae.encode(image_or_video).latent_dist.sample(generator=generator)
        if not vae.config.invert_scale_latents:
            latents = latents * vae.config.scaling_factor
        else:
            ...
```

```python
def post_latent_preparation(vae_config: Dict[str, Any], latents: torch.Tensor, **kwargs) -> torch.Tensor:
    if not vae_config.invert_scale_latents:
        latents = latents * vae_config.scaling_factor
    else:
        latents = 1 / vae_config.scaling_factor * latents
    latents = _pad_frames(latents, kwargs.get("denoier_config", None))
    latents = latents.permute(0, 2, 1, 3, 4)  # [B, F, C, H, W]
    return {"latents": latents}
```

Or am I terribly misunderstanding something?
Example command:
Run: https://wandb.ai/sayakpaul/finetrainers-cog/runs/he705j4z
Testing Cog 1.5 as well as the 5B variant from 1.0. ✅