Fix PTL2.2 saving multiple `*-last.ckpt` checkpoints in resumed training #8480
Conversation
Fix PTL2.2 saving multiple `*-last.ckpt` checkpoints when resuming from previous run
Signed-off-by: He Huang (Steve) <[email protected]>
This PR is stale because it has been open for 14 days with no activity. Remove the stale label, comment, or update, or this will be closed in 7 days.
@stevehuang52 this problem basically happens when we try to resume for the 2nd time, right?
@athitten Yes, it happens when we resume for the 2nd time; the first resume is fine.
pass
else:
return False
if self.dirpath is None:
@stevehuang52 why have we added `if self.dirpath is None:` instead of `assert self.dirpath is not None`, which is the case in the PTL code here?
Agreed, `assert` hides the actual error; we should raise it properly with a useful error message (like here).
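To make the suggestion concrete, here is a minimal sketch of raising a descriptive error instead of asserting; the helper name and message below are illustrative assumptions, not the code in this PR:

```python
from pathlib import Path
from typing import Optional, Union


def _require_dirpath(dirpath: Optional[Union[str, Path]]) -> Path:
    """Hypothetical helper: raise a descriptive error instead of relying on
    `assert`, which is stripped under `python -O` and hides the actual problem."""
    if dirpath is None:
        raise ValueError(
            "ModelCheckpoint.dirpath is not set, so previous `*-last.ckpt` "
            "checkpoints cannot be located and removed when resuming training."
        )
    return Path(dirpath)
```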
@@ -454,3 +454,29 @@ def _remove_unfinished_checkpoints(checkpoint_dir: Union[Path, str]) -> None:
    # delete markers
    for marker_path in existing_marker_filepaths:
        os.remove(marker_path)

def _should_remove_checkpoint(self, trainer: "pl.Trainer", previous: str, current: str) -> bool:
@stevehuang52 we want to avoid overriding PTL's protected methods in NeMo as much as possible, so that NeMo doesn't break when PTL changes something in those protected methods.
Maybe we can create this PR in PTL itself? It's not a lot of modification, just the addition of a couple of lines in the `_should_remove_checkpoint` function.
@athitten thanks for the advice. I've created a PR in PTL (Lightning-AI/pytorch-lightning#19613); let's see if they'll approve it.
Great, thank you very much @stevehuang52.
Seems they closed the PR above. Is the resolution implemented here?
@titu1994 Yeah, they don't want to do the fix in PTL, so we will do it in NeMo (which is this PR).
Signed-off-by: He Huang (Steve) <[email protected]>
jenkins
jenkins
Signed-off-by: stevehuang52 <[email protected]>
jenkins
jenkins
@athitten seems that Jenkins passed without issues. Could you please review the PR again and see if there are any other changes or tests needed? Thanks~
LGTM, thanks very much @stevehuang52
Fix PTL2.2 saving multiple `*-last.ckpt` checkpoints in resumed training (#8480)

* Fix PTL2.2 saving multiple `*-last.ckpt` checkpoints when resuming from previous run

  Signed-off-by: He Huang (Steve) <[email protected]>

* Fix missing import

  Signed-off-by: He Huang (Steve) <[email protected]>

* fix broken test

  Signed-off-by: stevehuang52 <[email protected]>

---------

Signed-off-by: He Huang (Steve) <[email protected]>
Signed-off-by: stevehuang52 <[email protected]>
Co-authored-by: Abhishree Thittenamane <[email protected]>
Signed-off-by: Michal Futrega <[email protected]>
* add init griffin
* remove unnecessary imports
* add sft
* add sft model init
* add text gen starategy for Griffin no cache
* test SFT
* minor fix to config
* fix logprob output issue
* sft WS fixed
* replace trainer in conversion script
* Revert "Fix PTL2.2 saving multiple `*-last.ckpt` checkpoints in resumed training (#8480)" This reverts commit 11b7a73.
* Revert "FSDP update to PTL 2.2 (#8658)" This reverts commit 355e36c.
* init dist opt
* add peft
* fix generate script
* convert to HF format
* further cleanups
* minor fix
* minor fix
* more refactoring
* remove local path from config
* undo unnecessary changes
* remove pretraining
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* fix val param sync
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* minor fix
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* Addresing MR comments
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* code ql fixed
* more code ql
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* address comments
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* add jenkins
* remove jenkins for momentarily
* add reqs for griffin
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* add req test
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* add reqs to nlp
* add reqs to nlp
* replace torch scan
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* jit fusion for embedding decoder
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* jit fusion for embedding decoder
* add fix to rglru
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci

---------

Co-authored-by: Ali Taghibakhshi <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Eric Harper <[email protected]>
What does this PR do?

Fix PTL2.2 saving multiple `*-last.ckpt` checkpoints in resumed training.

In the current PTL 2.2 `model_checkpoint.py`, when saving a new `*-last.ckpt` checkpoint, the previous `*-last.ckpt` checkpoint that the model was resumed from is not deleted, which results in multiple checkpoints. This causes "multiple *-last.ckpt checkpoints" errors when trying to resume from a previous job.

The fix is to check whether the checkpoint to be saved is also a `*-last.ckpt` checkpoint. If it is, delete the previous `*-last.ckpt` checkpoint when both are under the same directory.
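To make the described behavior concrete, here is a minimal sketch of the idea as an override of PTL's `_should_remove_checkpoint`; the subclass name and the exact checks are illustrative assumptions, not necessarily the code merged in this PR:

```python
from pathlib import Path

import pytorch_lightning as pl
from pytorch_lightning.callbacks import ModelCheckpoint


class PatchedModelCheckpoint(ModelCheckpoint):
    """Hypothetical sketch: also allow removing the stale `*-last.ckpt`
    checkpoint that training was resumed from, so only a single
    `*-last.ckpt` remains after saving a new one."""

    def _should_remove_checkpoint(self, trainer: "pl.Trainer", previous: str, current: str) -> bool:
        # Keep PTL's default behaviour, which (among other things) refuses to
        # remove the checkpoint the run was resumed from.
        if super()._should_remove_checkpoint(trainer, previous, current):
            return True

        previous_path, current_path = Path(previous), Path(current)
        # The checkpoint being saved is itself a `*-last.ckpt` checkpoint ...
        if not current_path.name.endswith("-last.ckpt"):
            return False
        # ... so the stale `*-last.ckpt` left over from the previous run can be
        # removed, provided both checkpoints live in the same directory.
        return previous_path.name.endswith("-last.ckpt") and previous_path.parent == current_path.parent
```

Deferring to the parent implementation first keeps PTL's default protections, and the extra branch only relaxes them for a stale `*-last.ckpt` sitting in the same directory as the one being written.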
PR Type: