Hunyuan Video LoRA #126

Conversation
@sayakpaul Could you give this a review? Note that I've left some TODOs for myself for future refactors; we should prioritize getting the trainers out there. I will need some more time to complete the longer finetuning run I was trying. I accidentally set …
README.md (outdated)

```
dataloader_cmd="--dataloader_num_workers 0"

# Diffusion arguments
diffusion_cmd="--flow_resolution_shifting"
```
Suggested change:

```diff
-diffusion_cmd="--flow_resolution_shifting"
+diffusion_cmd=""
```
Removing this because I've yet to test which option is better, since we don't know exactly how Hunyuan was trained.
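(For context: resolution-dependent timestep shifting in flow-matching trainers typically follows the SD3 recipe, where the noise schedule is shifted more aggressively for longer token sequences. Below is a rough sketch of that idea; the constants and the linear interpolation of `mu` are assumptions borrowed from SD3, not necessarily what `--flow_resolution_shifting` implements here.)

```python
import math

def shift_sigma(sigma: float, seq_len: int, base_seq_len: int = 256,
                max_seq_len: int = 4096, base_shift: float = 0.5,
                max_shift: float = 1.16) -> float:
    # Interpolate the shift parameter mu linearly in the token count.
    m = (max_shift - base_shift) / (max_seq_len - base_seq_len)
    mu = seq_len * m + base_shift - base_seq_len * m
    # Push sigma toward more noise for longer (higher-resolution) sequences.
    return math.exp(mu) / (math.exp(mu) + (1 / sigma - 1))
```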
Thanks for getting this in quickly!
```diff
---video_resolution_buckets 17x512x768 49x512x768 61x512x768 129x512x768 \
+--video_resolution_buckets 49x512x768 \
```
Why is this getting changed?
This was incorrect when I merged LTX: I had copied the bucket values from my multiresolution run, but the validation prompts and other settings from the single-resolution run.
```
@@ -71,7 +71,7 @@ training_cmd="--training_type lora \
  --seed 42 \
  --mixed_precision bf16 \
  --batch_size 1 \
  --train_steps 2000 \
```
So, train shorter with a smaller LR? 👁️
A higher learning rate seems to make the model worse somehow when doing stylistic training :/ I've yet to find the optimal training configuration for LTXV, but ~1000-1500 steps seems to be okay.
```diff
-pipe.set_adapters(["ltxv-lora"], [1.0])
+pipe.set_adapters(["ltxv-lora"], [0.75])
```
Golden number?
Since I haven't found the optimal training settings for the LoRA yet, using it at full strength (1.0) leads to slightly worse quality outputs. 0.75 seems to strike a nice balance, but ideally the scale should be explored by whoever trained the LoRA.
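(For anyone reproducing this: a minimal sketch of loading a trained LoRA into a diffusers pipeline and dialing its strength down. The model id is the public LTX-Video checkpoint; the LoRA path is a placeholder.)

```python
import torch
from diffusers import LTXPipeline

pipe = LTXPipeline.from_pretrained(
    "Lightricks/LTX-Video", torch_dtype=torch.bfloat16
).to("cuda")

# Register the LoRA under a name so its contribution can be scaled later.
pipe.load_lora_weights("path/to/trained/lora", adapter_name="ltxv-lora")

# Scales below 1.0 blend the LoRA with the base weights; 0.75 traded a bit
# of style strength for better output quality in this run.
pipe.set_adapters(["ltxv-lora"], [0.75])
```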
```
--max_grad_norm 1.0"

# Validation arguments
validation_cmd="--validation_prompts \"afkx A baker carefully cuts a green bell pepper cake on a white plate against a bright yellow background, followed by a strawberry cake with a similar slice of cake being cut before the interior of the bell pepper cake is revealed with the surrounding cake-to-object sequence.@@@49x512x768:::afkx A cake shaped like a Nutella container is carefully sliced, revealing a light interior, amidst a Nutella-themed setup, showcasing deliberate cutting and preserved details for an appetizing dessert presentation on a white base with accompanying jello and cutlery, highlighting culinary skills and creative cake designs.@@@49x512x768:::afkx A cake shaped like a Nutella container is carefully sliced, revealing a light interior, amidst a Nutella-themed setup, showcasing deliberate cutting and preserved details for an appetizing dessert presentation on a white base with accompanying jello and cutlery, highlighting culinary skills and creative cake designs.@@@61x512x768:::afkx A vibrant orange cake disguised as a Nike packaging box sits on a dark surface, meticulous in its detail and design, complete with a white swoosh and 'NIKE' logo. A person's hands, holding a knife, hover over the cake, ready to make a precise cut, amidst a simple and clean background.@@@61x512x768:::afkx A vibrant orange cake disguised as a Nike packaging box sits on a dark surface, meticulous in its detail and design, complete with a white swoosh and 'NIKE' logo. A person's hands, holding a knife, hover over the cake, ready to make a precise cut, amidst a simple and clean background.@@@97x512x768:::afkx A vibrant orange cake disguised as a Nike packaging box sits on a dark surface, meticulous in its detail and design, complete with a white swoosh and 'NIKE' logo. A person's hands, holding a knife, hover over the cake, ready to make a precise cut, amidst a simple and clean background.@@@129x512x768:::A person with gloved hands carefully cuts a cake shaped like a Skittles bottle, beginning with a precise incision at the lid, followed by careful sequential cuts around the neck, eventually detaching the lid from the body, revealing the chocolate interior of the cake while showcasing the layered design's detail.@@@61x512x768:::afkx A woman with long brown hair and light skin smiles at another woman with long blonde hair. The woman with brown hair wears a black jacket and has a small, barely noticeable mole on her right cheek. The camera angle is a close-up, focused on the woman with brown hair's face. The lighting is warm and natural, likely from the setting sun, casting a soft glow on the scene. The scene appears to be real-life footage@@@61x512x768\" \
```
🧠 `@@@129x512x768`. I kid you not, I thought it was something else completely.
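(For anyone else squinting at that string: each validation entry appears to be `caption@@@FRAMESxHEIGHTxWIDTH`, with `:::` separating entries. A small illustration of splitting it; my own sketch, not code from this repo.)

```python
raw = (
    "afkx A baker cuts a cake@@@49x512x768:::"
    "afkx A Nike-box cake is sliced@@@129x512x768"
)

for entry in raw.split(":::"):
    caption, resolution = entry.rsplit("@@@", 1)
    frames, height, width = (int(v) for v in resolution.split("x"))
    print(frames, height, width, caption[:30])
```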
```
cmd="accelerate launch --config_file accelerate_configs/uncompiled_8.yaml --gpu_ids $GPU_IDS train.py \
  $model_cmd \
  $dataset_cmd \
  $dataloader_cmd \
  $diffusion_cmd \
  $training_cmd \
  $optimizer_cmd \
  $validation_cmd \
  $miscellaneous_cmd"
```
Wow, very neat way to segregate the commands!
```diff
@@ -206,7 +206,9 @@ def validate_args(args: Args):


 def _add_model_arguments(parser: argparse.ArgumentParser) -> None:
-    parser.add_argument("--model_name", type=str, required=True, choices=["ltx_video"], help="Name of model to train.")
+    parser.add_argument(
+        "--model_name", type=str, required=True, choices=["hunyuan_video", "ltx_video"], help="Name of model to train."
+    )
```
We could determine the `choices` automatically from the config map we have right now. TODO
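(A sketch of that TODO, assuming a hypothetical registry dict; the `SUPPORTED_MODEL_CONFIGS` name is invented for illustration.)

```python
import argparse

# Hypothetical registry; the real config map lives elsewhere in the trainer.
SUPPORTED_MODEL_CONFIGS = {"hunyuan_video": {}, "ltx_video": {}}

def _add_model_arguments(parser: argparse.ArgumentParser) -> None:
    parser.add_argument(
        "--model_name",
        type=str,
        required=True,
        choices=sorted(SUPPORTED_MODEL_CONFIGS),
        help="Name of model to train.",
    )
```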
```python
    return pipe


def prepare_conditions(
```
Should this be decorated with `torch.no_grad()`?
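(To illustrate the suggestion, assuming the text encoders are frozen here; a toy stand-in, not the repo's code.)

```python
import torch

encoder = torch.nn.Linear(4, 4)  # stand-in for a frozen text encoder

@torch.no_grad()  # decorator form: nothing inside tracks gradients
def prepare_conditions(text_encoder, tokens):
    return text_encoder(tokens)

out = prepare_conditions(encoder, torch.randn(1, 4))
print(out.requires_grad)  # False: cheaper and safe for frozen encoders

# Equivalent context-manager form at the call site:
with torch.no_grad():
    out = encoder(torch.randn(1, 4))
```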
```python
    if isinstance(prompt, str):
        prompt = [prompt]

    conditions = {}
    conditions.update(
        _get_llama_prompt_embeds(tokenizer, text_encoder, prompt, prompt_template, device, dtype, max_sequence_length)
    )
    conditions.update(_get_clip_prompt_embeds(tokenizer_2, text_encoder_2, prompt, device, dtype))

    guidance = torch.tensor([guidance], device=device, dtype=dtype) * 1000.0
    conditions["guidance"] = guidance

    return conditions
```
Wonder if it's possible to leverage the `encode_prompt()` from the pipeline itself. TODO
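(The idea here: diffusers pipelines expose an `encode_prompt()` method that returns the text embeddings directly. A hedged sketch of what that might look like; the checkpoint id is the community diffusers-format one, and the exact signature and return values may differ across diffusers versions.)

```python
import torch
from diffusers import HunyuanVideoPipeline

pipe = HunyuanVideoPipeline.from_pretrained(
    "hunyuanvideo-community/HunyuanVideo", torch_dtype=torch.bfloat16
)
# Returns Llama embeddings, the pooled CLIP embedding, and the attention mask.
prompt_embeds, pooled_prompt_embeds, prompt_attention_mask = pipe.encode_prompt(
    prompt="afkx A baker carefully cuts a cake"
)
```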
I think it's better to keep custom implementations here per model, because it's cleaner to understand and debug without jumping into the diffusers codebase. Also, our pipelines sometimes contain additional checks and extras. Let's revisit this idea later.
Co-authored-by: Sayak Paul <[email protected]>
All yours @sayakpaul for the initial designing and refactors 🪄 I'm still trying to figure out how best to implement precomputation, because the current approach just loads all the models and is not really ideal. I will have a refactor out in a few hours.
Script:
Slurm: