Bug in preparing latents

Hi, thanks for your great contributions! While training this model myself, I found what may be a bug in your training script.
According to:
https://github.com/huggingface/diffusers/blob/df1d7b01f18795a2d81eb1fd3f5d220db58cfae6/src/diffusers/pipelines/cogvideo/pipeline_cogvideox_image2video.py#L391-L396
CogVideoX should not apply the `vae_scaling_factor` to the latents, and DaS uses this same pipeline code at inference (without the scaling factor). In training, however, DaS multiplies the latents by this factor, which is inconsistent with inference:
https://github.com/IGL-HKUST/DiffusionAsShader/blob/897fc5850fbfeefbfab273a9b8aa23dc060c2c19/training/cogvideox_image_to_video_sft.py#L892-L894
As a result, the first (reference) frame of the output video is slightly brighter than the subsequent frames.
A simple fix is to remove this multiplication from the training script.
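
To make the mismatch concrete, here is a minimal sketch in plain Python rather than the actual training code. The helper `prepare_image_latents`, the toy values in `sampled`, and the number used for `VAE_SCALING_FACTOR` are all hypothetical stand-ins; the real factor should be read from the VAE config.

```python
from statistics import fmean

# Hypothetical value, for illustration only; the real factor comes from the VAE config.
VAE_SCALING_FACTOR = 0.7


def prepare_image_latents(sampled, scale=1.0):
    """Return conditioning latents, multiplied element-wise by `scale`."""
    return [x * scale for x in sampled]


def mean_abs(values):
    """Average magnitude, used here as a rough proxy for latent scale."""
    return fmean(abs(v) for v in values)


# Toy stand-in for image_latent_dist.sample()
sampled = [0.5, -1.0, 2.0, -0.25]

# Training path as reported in this issue: latents are multiplied by the factor.
train_latents = prepare_image_latents(sampled, scale=VAE_SCALING_FACTOR)

# Inference path as reported in this issue: no scaling is applied.
infer_latents = prepare_image_latents(sampled)

# Proposed fix: drop the multiplication in training so both paths agree.
fixed_train_latents = prepare_image_latents(sampled)

# The ratio differs from 1 whenever the two paths use different scales.
ratio = mean_abs(train_latents) / mean_abs(infer_latents)
print(f"train/inference magnitude ratio: {ratio:.3f}")
print("paths match after fix:", fixed_train_latents == infer_latents)
```

With the multiplication in place, the conditioning latents seen in training differ in magnitude from those seen at inference by exactly the scaling factor. Removing it makes the two paths identical.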

Given that the DaS results are good, I think that is because you only train the side branch? But if someone trains the pretrained parameters (main branch) under this condition, it will hurt the results significantly.

Bug in preparing latents #25


	image_latents = image_latent_dist.sample() * VAE_SCALING_FACTOR
	image_latents = image_latents.permute(0, 2, 1, 3, 4) # [B, F, C, H, W]
	image_latents = image_latents.to(memory_format=torch.contiguous_format, dtype=weight_dtype)
