Description
Hi, thanks for your great contributions! While training this model myself, I found what might be a bug in your training script.
According to:
https://github.com/huggingface/diffusers/blob/df1d7b01f18795a2d81eb1fd3f5d220db58cfae6/src/diffusers/pipelines/cogvideo/pipeline_cogvideox_image2video.py#L391-L396
the CogVideoX model should not apply the vae_scaling_factor to the latents. This pipeline is also what DaS uses at inference (without the scale factor). However, DaS multiplies the latents by this factor during training, which is misaligned with inference:
DiffusionAsShader/training/cogvideox_image_to_video_sft.py
Lines 892 to 894 in 897fc58
As a result, the first (reference) frame of the output video is slightly brighter than the subsequent frames.
An elegant solution is to remove this multiplication from the training script.
Considering that the DaS results are good, I think that is because you only train the side branch? But if someone trains the pretrained parameters (main branch) under this condition, it will hurt the results greatly.
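
To make the mismatch concrete, here is a minimal sketch in plain Python. It is not the DaS or diffusers code: `condition_latents` is a hypothetical helper, the latents are a toy list, and the value 0.7 for the scaling factor is only an illustrative assumption. It just shows that if training multiplies the conditioning latents by the factor and inference does not, the model sees conditioning latents at inference that are larger by 1 / factor than anything it saw in training.

```python
# Illustrative value only (assumption); the real factor comes from the VAE config.
SCALING_FACTOR = 0.7


def condition_latents(raw_latents, scaling_factor, apply_scaling):
    """Hypothetical helper: optionally scale the encoded reference-image latents."""
    if apply_scaling:
        return [scaling_factor * x for x in raw_latents]
    return list(raw_latents)


# Toy stand-in for the VAE-encoded reference image.
raw = [1.0, -2.0, 0.5]

# Training path as described in this issue: latents multiplied by the factor.
train_latents = condition_latents(raw, SCALING_FACTOR, apply_scaling=True)

# Inference path as described in this issue: no multiplication.
infer_latents = condition_latents(raw, SCALING_FACTOR, apply_scaling=False)

# Per-element ratio between what inference feeds and what training fed.
ratios = [i / t for i, t in zip(infer_latents, train_latents)]
print(ratios)
```

Removing the multiplication on the training side (calling the helper with `apply_scaling=False` in both places) makes the two paths identical, which is the fix proposed above.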