[WIP] Async checkpointing #3701

S1ro1 · 2025-08-01T01:30:18Z

Very much WIP; it overrides a bunch of stuff that I'm not sure is stable to override.
TODO: discuss whether we want a somewhat different (and more easily maintainable) approach.

S1ro1 · 2025-08-01T01:32:01Z

src/accelerate/dist_checkpointing.py

+    from accelerate import Accelerator
+
+
+class AccelerateStorageWriter(FileSystemWriter):


This class is the issue: I'm overriding some fairly internal parts of PyTorch, and I don't know whether I should (I've asked on their Slack whether it's safe). Without it, we can't save the optimizer into one directory and the model into another, which is what we currently do.

S1ro1 · 2025-08-01T01:34:50Z

src/accelerate/dist_checkpointing.py

+        model_storage_md, optim_storage_md = {}, {}
+        for wr_list in results:
+            for wr in wr_list:
+                new_index = dataclasses.asdict(wr.index)


The WriteResult dataclass is frozen (which says a lot about the kind of war crimes I'm committing here), so we have to use some fancy Python tricks to get around that.
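For readers unfamiliar with the trick, here is a minimal sketch of rebuilding a frozen dataclass instead of mutating it. `Index` and `Result` are hypothetical stand-ins, not PyTorch's real `MetadataIndex` or `WriteResult`, and the `"state."` prefix is assumed only for illustration; the actual code in this PR may differ.

```python
import dataclasses


@dataclasses.dataclass(frozen=True)
class Index:
    # Hypothetical stand-in for the index stored on a write result.
    fqn: str
    offset: tuple = ()


@dataclasses.dataclass(frozen=True)
class Result:
    # Hypothetical stand-in for a frozen write result.
    index: Index
    size_in_bytes: int


def strip_prefix(result: Result, prefix: str) -> Result:
    """Return a copy of `result` whose index name has `prefix` removed."""
    # Frozen instances reject attribute assignment, so dump the fields to a
    # plain dict, edit the dict, and construct a fresh index from it.
    fields = dataclasses.asdict(result.index)
    fields["fqn"] = fields["fqn"].removeprefix(prefix)  # Python 3.9+
    new_index = type(result.index)(**fields)
    # dataclasses.replace builds a new instance, leaving the original as-is.
    return dataclasses.replace(result, index=new_index)
```

The original instance is never modified, so anything else still holding a reference to it sees the old values.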

S1ro1 · 2025-08-01T01:36:25Z

src/accelerate/dist_checkpointing.py

+            result = []
+            for to_get in ["model", "optim"]:
+                result.append(
+                    Metadata(


By default, DCP thinks we're saving an object called "state" into one directory, which we're not: we're saving "optimizer" into one subdirectory and "model" into another. That's why we have to update the metadata (remove the "state" prefix and split it into two).
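As a rough illustration of the prefix removal and split described above, here is a sketch over a plain dict. The key layout (`"state.model.…"`, `"state.optim.…"`) and the helper name `split_metadata` are assumptions made for the example; the real code operates on DCP `Metadata` objects, not on plain dicts.

```python
def split_metadata(flat, root="state", groups=("model", "optim")):
    """Split keys like 'state.model.w' into one dict per group.

    Hypothetical helper: the dotted key layout is assumed for illustration.
    """
    out = {group: {} for group in groups}
    for key, value in flat.items():
        # Drop the shared root prefix, then peel off the group name.
        rest = key.removeprefix(root + ".")  # Python 3.9+
        group, _, name = rest.partition(".")
        if group not in out:
            raise KeyError(f"unexpected metadata key: {key!r}")
        out[group][name] = value
    return out
```

Each resulting dict can then be written as the metadata for its own subdirectory.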

S1ro1 · 2025-08-01T01:37:35Z

src/accelerate/dist_checkpointing.py

+            self.fs.rename(tmp_path, metadata_path)
+
+
+def save_model_and_optimizer(


This is the only "public" API we expose, and even that is a stretch: we only use it internally in accelerator.save_state.

HuggingFaceDocBuilderDev · 2025-08-01T01:38:25Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

WIP: very much wip but works (probably)

354b0b5


S1ro1 changed the title ~~WIP: [Async checkpointing]~~ [WIP] Async checkpointing Aug 1, 2025
