
missing projection weights for finetuning problem #3


Open
setayeshk opened this issue Jan 28, 2025 · 0 comments

Comments


setayeshk commented Jan 28, 2025

Hello,

I was trying to fine-tune this model on the FlickrFA dataset, but the projection heads start from random weights, which drastically reduces the model's performance (I used the CLIP trainer for my training process). I also have a problem loading the model back after pushing it.
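For context, this is roughly what I would like to be able to do so that the projection heads start from meaningful weights (just a sketch; pretrained_clip_repo is a placeholder, since I am not sure the released checkpoint is stored as a single CLIPModel):

from transformers import CLIPModel

# Sketch (assumption): copy the trained projection heads and logit scale
# from a full pretrained CLIPModel checkpoint into the model I assemble below.
pretrained = CLIPModel.from_pretrained(pretrained_clip_repo)  # hypothetical repo id
clip_model.visual_projection.load_state_dict(pretrained.visual_projection.state_dict())
clip_model.text_projection.load_state_dict(pretrained.text_projection.state_dict())
clip_model.logit_scale.data.copy_(pretrained.logit_scale.data)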

My plan was to fine-tune the projection heads first and then, after a few epochs, add LoRA adapters to the last two layers of the text and image encoders and fine-tune those as well; a sketch of that idea is below.
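A minimal sketch of that two-phase idea, assuming the standard transformers CLIPModel attributes and the peft library (the target_modules list is my guess; restricting it to only the last two layers would require listing the exact module names, which I have not verified):

from peft import LoraConfig, get_peft_model

# Phase 1 (sketch): freeze everything except the projection heads and the
# logit scale, and train only those for a few epochs.
for p in clip_model.parameters():
    p.requires_grad = False
for p in clip_model.visual_projection.parameters():
    p.requires_grad = True
for p in clip_model.text_projection.parameters():
    p.requires_grad = True
clip_model.logit_scale.requires_grad = True

# Phase 2 (sketch): attach LoRA adapters to the attention projections.
# "q_proj"/"v_proj" match the CLIP vision layers and "query"/"value" match
# the RoBERTa text layers; this targets every layer for simplicity.
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj", "query", "value"],
)
clip_model = get_peft_model(clip_model, lora_config)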

I also didn't quite understand the purpose of the clip_wrapper in your training notebook. Doesn't setting hidden_size to 1 cause issues? Wouldn't that mean the training loss is computed on projection heads of size 1, which seems meaningless?

Could you please clarify this for me and provide guidance on the best way to fine-tune your model on FlickrFA while ensuring the projection heads start with meaningful weights and maintaining good performance?

This is my code for building and fine-tuning the model:

from transformers import (
    AutoModel,
    AutoTokenizer,
    CLIPConfig,
    CLIPFeatureExtractor,
    CLIPModel,
    CLIPVisionModel,
    EarlyStoppingCallback,
)

# Load the text and vision encoders plus their preprocessors.
text_encoder = AutoModel.from_pretrained(text_model_name)
tokenizer = AutoTokenizer.from_pretrained(text_model_name)

vision_encoder = CLIPVisionModel.from_pretrained(vision_model_name)
vision_processor = CLIPFeatureExtractor.from_pretrained(feature_extractor_name)

# Build a combined CLIP config from the two encoder configs.
config = CLIPConfig.from_text_vision_configs(
    text_config=text_encoder.config,
    vision_config=vision_encoder.config,
)

# Assemble the CLIP model. The projection heads and logit scale are not part
# of either encoder checkpoint, so they come out randomly initialized here.
clip_model = CLIPModel(config)
clip_model.text_model = text_encoder
clip_model.vision_model = vision_encoder

trainer = CLIPTrainer(
    model=clip_model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    data_collator=data_collator,
    tokenizer=tokenizer,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
)

trainer.train()
trainer.save_model(args.output_dir)

And this is the code for loading it back, which produces the warning below:

from transformers import CLIPVisionModel, RobertaModel

# Load the two encoders back from the pushed checkpoint.
vision_encoder = CLIPVisionModel.from_pretrained(args.model_repo, revision=args.revision)
text_encoder = RobertaModel.from_pretrained(args.model_repo, revision=args.revision)

Some weights of CLIPVisionModel were not initialized from the model checkpoint at Setayeshk/Clipfa_finetune and are newly initialized: ['vision_model.embeddings.class_embedding', 'vision_model.embeddings.patch_embedding.weight', 'vision_model.embeddings.position_embedding.weight', 'vision_model.encoder.layers.0.layer_norm1.bias', 'vision_model.encoder.layers.0.layer_norm1.weight', 'vision_model.encoder.layers.0.layer_norm2.bias', 'vision_model.encoder.layers.0.layer_norm2.weight', 'vision_model.encoder.layers.0.mlp.fc1.bias', 'vision_model.encoder.layers.0.mlp.fc1.weight', 'vision_model.encoder.layers.0.mlp.fc2.bias', 'vision_model.encoder.layers.0.mlp.fc2.weight', 'vision_model.encoder.layers.0.self_attn.k_proj.bias', 'vision_model.encoder.layers.0.self_attn.k_proj.weight', 'vision_model.encoder.layers.0.self_attn.out_proj.bias', 'vision_model.encoder.layers.0.self_attn.out_proj.weight', 'vision_model.encoder ...
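For reference, this is what I expected to be able to do instead (a sketch, assuming trainer.save_model pushed the full combined CLIPModel to the repo, which may not be the case):

from transformers import CLIPModel

# Sketch (assumption): load the combined checkpoint as a single CLIPModel
# and pull the two encoders out of it, instead of loading each encoder
# class directly against the combined state dict.
clip_model = CLIPModel.from_pretrained(args.model_repo, revision=args.revision)
vision_encoder = clip_model.vision_model
text_encoder = clip_model.text_model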
