Update to fit_mixin.fit to allow fine tuning to resume from a checkpoint #3269
Conversation
Reverted a push that was stylistic because it did not play well with the linter.
I'll take care of the formatting, no worries :) I'll play around with it myself in a while, before the v4.0 release that's coming soon.
That is awesome of you. I am embarrassed to say I was researching and failing to figure out what formatting changes were needed! lol With this change, hopefully LlamaIndex will soon be updated to use resume_from_checkpoint too. On a side note: is there something I can check to see the formatting requirements, or do I need to just run the formatter myself? Thank you for your time!
You can also run the formatter yourself indeed! The details are here: https://github.com/UKPLab/sentence-transformers?tab=readme-ov-file#development-setup
Code is now acceptable to ruff and ruff-format. Thank you for pointing that out, and next time I will read the README!
Looks very solid! Well done, thanks. I ran it locally and it worked out of the box.
Summary
Adds a parameter to the fit_mixin.fit method that allows a fine-tune to be continued from the latest checkpoint, if one exists in the checkpoint directory.
Motivations
A faulty PSU and many hours of fine-tuning went to waste. The fit method in fit_mixin has a parameter that allows checkpoints to be saved, and the method it uses for training, Transformers' trainer.train, has a parameter (a str or a bool) that allows resuming from a checkpoint. This change takes advantage of that parameter so the saved checkpoints can now be used; a rough sketch of the mechanism follows.
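A minimal sketch of the mechanism being relied on, not the exact code in this PR: transformers' Trainer.train accepts resume_from_checkpoint as a bool or a checkpoint path, and transformers.trainer_utils provides a helper for locating the latest checkpoint in a directory. The resolve_checkpoint helper name below is made up for illustration.

```python
from typing import Optional, Union

from transformers.trainer_utils import get_last_checkpoint


def resolve_checkpoint(checkpoint_dir: str, resume_from_checkpoint: Union[bool, str]) -> Optional[str]:
    """Turn the user-facing argument into the value trainer.train expects."""
    if isinstance(resume_from_checkpoint, str):
        # An explicit checkpoint path was given; pass it through unchanged.
        return resume_from_checkpoint
    if resume_from_checkpoint:
        # Returns e.g. "<checkpoint_dir>/checkpoint-1500", or None if nothing is found.
        return get_last_checkpoint(checkpoint_dir)
    return None


# Inside fit(), the call could then look roughly like:
#   trainer.train(resume_from_checkpoint=resolve_checkpoint(checkpoint_path, resume_from_checkpoint))
```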
Results
When a checkpoint is found, fine-tuning will begin from the latest step recorded in trainer_state.json (written by the underlying Trainer). The benefit is that a fine-tune can be continued without retraining the model on data it has already processed.
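Purely as an illustration of where that step comes from: each checkpoint produced by the underlying Trainer contains a trainer_state.json whose global_step field records how far training had progressed. The checkpoint path below is a made-up example.

```python
import json

# Example checkpoint path; the actual directory name depends on checkpoint_path and the saved step.
with open("checkpoints/checkpoint-1500/trainer_state.json") as f:
    state = json.load(f)

print(state["global_step"])  # e.g. 1500 -> resumed training continues from this step
```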
Changes
Adds an optional parameter resume_from_checkpoint to fit_mixin.fit that triggers a check for checkpoints in the checkpoint directory and continues fine-tuning from the latest one if any are found. A hypothetical usage example follows.
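A hypothetical usage sketch, assuming the new parameter is named resume_from_checkpoint as described above; the model, data, and loss are placeholders only.

```python
from torch.utils.data import DataLoader

from sentence_transformers import InputExample, SentenceTransformer, losses

model = SentenceTransformer("all-MiniLM-L6-v2")
train_examples = [InputExample(texts=["first sentence", "second sentence"], label=0.8)]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.CosineSimilarityLoss(model)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=2,
    checkpoint_path="checkpoints/",   # existing parameter: where checkpoints are written
    checkpoint_save_steps=500,
    resume_from_checkpoint=True,      # proposed parameter: pick up from the latest checkpoint, if any
)
```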
Related Issue
This is related to #1605 and provides a way to use resume_from_checkpoint, at least with the fit() method in fit_mixin.