
Fix PTL2.2 saving multiple *-last.ckpt checkpoints in resumed training #8480


Merged
athitten merged 9 commits into main from stevehuang52-patch-4 on Mar 23, 2024

Conversation

stevehuang52 (Collaborator) commented Feb 21, 2024

What does this PR do?

Fix PTL2.2 saving multiple *-last.ckpt checkpoints in resumed training.

In the current PTL 2.2 `model_checkpoint.py`, the previous `*-last.ckpt` checkpoint that the model was resumed from is not deleted when a new `*-last.ckpt` checkpoint is saved, which results in multiple `*-last.ckpt` checkpoints. This causes a "multiple *-last.ckpt checkpoints" error when trying to resume from a previous job.

The fix is to check whether the checkpoint about to be saved is also a `*-last.ckpt` checkpoint. If it is, the previous `*-last.ckpt` is deleted, provided both are in the same directory.
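
Roughly, the decision can be sketched as follows (a minimal, self-contained sketch; the helper name, signature, and `dirpath` argument are illustrative, not the actual NeMo/PTL method):

```python
from pathlib import Path


def _should_remove_resumed_last_ckpt(previous: str, current: str, dirpath: str) -> bool:
    """Sketch: should the `*-last.ckpt` we resumed from be deleted when a new
    `*-last.ckpt` checkpoint is about to be saved?"""
    previous_path, current_path = Path(previous), Path(current)
    both_are_last = previous_path.name.endswith("-last.ckpt") and current_path.name.endswith("-last.ckpt")
    same_dir = previous_path.parent.absolute() == Path(dirpath).absolute()
    # Only delete the old `-last.ckpt` when the new checkpoint is also a `-last.ckpt`
    # saved into the same directory; otherwise keep it.
    return both_are_last and same_dir
```

In stock PTL 2.2 the resumed-from checkpoint is never removed, so each resumed run leaves behind one more `*-last.ckpt`.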

PR Type:

  • New Feature
  • Bugfix
  • Documentation

github-actions bot (Contributor) commented Mar 7, 2024

This PR is stale because it has been open for 14 days with no activity. Remove the stale label, comment, or update the PR, or it will be closed in 7 days.

github-actions bot added the stale label Mar 7, 2024
stevehuang52 removed the stale label Mar 7, 2024
athitten (Collaborator) commented Mar 7, 2024

@stevehuang52 this problem basically happens when we try to resume a 2nd time, right?

stevehuang52 (Collaborator, Author) commented Mar 7, 2024

> @stevehuang52 this problem basically happens when we try to resume a 2nd time, right?

@athitten Yes, it happens when we resume for the 2nd time; the first resume is fine. The first resume saves a new `*-last.ckpt` without deleting the one it resumed from, so the second resume then finds multiple `*-last.ckpt` checkpoints and fails.

                pass
            else:
                return False
        if self.dirpath is None:
athitten (Collaborator):

@stevehuang52 why have we added `if self.dirpath is None:` instead of `assert self.dirpath is not None`, which is what the PTL code does here?

stevehuang52 (Collaborator, Author):

@athitten I remember @titu1994 told me to avoid using assert in NeMo and to raise exceptions instead, so that they can be caught by error-handling code. However, it's up to you if you would like to keep it identical to PTL.

Collaborator:

Agreed, an assert hides the actual error; we should raise it properly with a useful error message (like here).
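
(For context, a minimal sketch of the raise-instead-of-assert pattern being discussed; the class and method names below are illustrative, not the actual NeMo code.)

```python
class CheckpointCallbackSketch:
    """Illustrative stand-in for the real checkpoint callback."""

    def __init__(self, dirpath=None):
        self.dirpath = dirpath

    def _check_dirpath(self) -> None:
        # Raising instead of `assert self.dirpath is not None` keeps the check
        # active under `python -O` and lets callers catch a descriptive error.
        if self.dirpath is None:
            raise ValueError(f"{type(self).__name__}.dirpath must be set before removing checkpoints.")
```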

@@ -454,3 +454,29 @@ def _remove_unfinished_checkpoints(checkpoint_dir: Union[Path, str]) -> None:
        # delete markers
        for marker_path in existing_marker_filepaths:
            os.remove(marker_path)

    def _should_remove_checkpoint(self, trainer: "pl.Trainer", previous: str, current: str) -> bool:
athitten (Collaborator), Mar 8, 2024:

@stevehuang52 we want to avoid overriding PTL's protected methods in NeMo as much as possible, so that NeMo doesn't break when PTL changes something in those protected methods.

Maybe we can create this PR in PTL itself, since it's not a lot of modification, just the addition of a couple of lines in the `_should_remove_checkpoint` function.
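
(For context, here is a minimal sketch of what overriding a protected PTL method looks like; the subclass name and pass-through behavior are illustrative assumptions, not the actual NeMo callback.)

```python
from pytorch_lightning.callbacks import ModelCheckpoint


class PatchedModelCheckpoint(ModelCheckpoint):
    """Illustrative subclass that overrides a protected PTL method."""

    def _should_remove_checkpoint(self, trainer, previous: str, current: str) -> bool:
        # Custom `-last.ckpt` handling would go here; this sketch just falls back
        # to the parent. If PTL renames or changes this protected method, the
        # override silently stops applying or breaks, which is the risk raised above.
        return super()._should_remove_checkpoint(trainer, previous, current)
```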

stevehuang52 (Collaborator, Author):

@athitten thanks for the advice. I've created a PR in PTL: Lightning-AI/pytorch-lightning#19613; let's see if they'll approve it.

Collaborator:

Great, thank you very much @stevehuang52.

titu1994 (Collaborator):

Seems they closed the PR above. Is the resolution implemented here?

stevehuang52 (Collaborator, Author):

@titu1994 Yeah, they don't want to do the fix in PTL, so we will do the fix in NeMo (which is this PR).

stevehuang52 (Collaborator, Author):

jenkins

athitten (Collaborator):

jenkins

Signed-off-by: stevehuang52 <[email protected]>
github-actions bot added the core (Changes to NeMo Core) label Mar 19, 2024
stevehuang52 requested a review from athitten March 19, 2024 19:39
athitten (Collaborator):

jenkins

stevehuang52 requested a review from titu1994 March 20, 2024 18:58
stevehuang52 (Collaborator, Author):

jenkins

stevehuang52 (Collaborator, Author):

@athitten it seems that Jenkins passed without issues. Could you please review the PR again and see if there are any other changes or tests needed? Thanks~

athitten (Collaborator) left a comment:

LGTM, thanks very much @stevehuang52

athitten merged commit 11b7a73 into main Mar 23, 2024
athitten deleted the stevehuang52-patch-4 branch March 23, 2024 03:14
michal2409 pushed a commit that referenced this pull request Apr 3, 2024
Fix PTL2.2 saving multiple `*-last.ckpt` checkpoints in resumed training (#8480)

* Fix PTL2.2 saving multiple `*-last.ckpt` checkpoints when resuming from previous run

Signed-off-by: He Huang (Steve) <[email protected]>

* Fix missing import

Signed-off-by: He Huang (Steve) <[email protected]>

* fix broken test

Signed-off-by: stevehuang52 <[email protected]>

---------

Signed-off-by: He Huang (Steve) <[email protected]>
Signed-off-by: stevehuang52 <[email protected]>
Co-authored-by: Abhishree Thittenamane <[email protected]>
Signed-off-by: Michal Futrega <[email protected]>
JRD971000 pushed a commit that referenced this pull request Apr 10, 2024
JRD971000 added a commit that referenced this pull request Apr 11, 2024
JRD971000 added a commit that referenced this pull request May 3, 2024
* add init griffin

* remove unnecessary imports

* add sft

* add sft model init

* add text gen strategy for Griffin no cache

* test SFT

* minor fix to config

* fix logprob output issue

* sft WS fixed

* replace trainer in conversion script

* Revert "Fix PTL2.2 saving multiple `*-last.ckpt` checkpoints in resumed training (#8480)"

This reverts commit 11b7a73.

* Revert "FSDP update to PTL 2.2 (#8658)"

This reverts commit 355e36c.

* init dist opt

* add peft

* fix generate script

* convert to HF format

* further cleanups

* minor fix

* minor fix

* more refactoring

* remove local path from config

* undo unnecessary changes

* remove pretraining

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix val param sync

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* minor fix

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Addressing MR comments

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* code ql fixed

* more code ql

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* address comments

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* add jenkins

* remove jenkins momentarily

* add reqs for griffin

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* add req test

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* add reqs to nlp

* add reqs to nlp

* replace torch scan

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* jit fusion for embedding decoder

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* jit fusion for embedding decoder

* add fix to rglru

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Co-authored-by: Ali Taghibakhshi <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Eric Harper <[email protected]>
rohitrango pushed a commit to rohitrango/NeMo that referenced this pull request Jun 25, 2024
Fix PTL2.2 saving multiple `*-last.ckpt` checkpoints in resumed training (NVIDIA#8480)

rohitrango pushed a commit to rohitrango/NeMo that referenced this pull request Jun 25, 2024
Labels
core Changes to NeMo Core