Checkpoint appears corrupted with Python 3.12 #239

ThomasRieutord · 2025-04-08T11:49:03Z

What happened?

When working on transfer learning, we tried to use a registered checkpoint in an environment with Python 3.12 and the checkpoint appeared to be corrupted.

However, the same checkpoint loads in an environment with Python 3.11, so the problem seems to come from the changes between 3.11 and 3.12. Indeed, the package tarfile (on which PyTorch relies to load checkpoints) mentions changes in 3.12.

This should not be a problem as long as Anemoi relies on Python 3.11, but I thought it was worth mentioning for future updates.

What are the steps to reproduce the bug?

Download the checkpoint "proper-osprey" from the Anemoi catalog. It will be named 4b23cfdc-f24f-428a-98ce-1c800979e30a.ckpt
Execute the following in the Python 3.12 environment:

import torch
ckpt = torch.load("4b23cfdc-f24f-428a-98ce-1c800979e30a.ckpt", map_location=torch.device("cpu"), weights_only=False)
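Since the same file loads under Python 3.11, it may be worth confirming that both environments really see identical bytes before blaming the interpreter. The following is a minimal sketch using only the standard library; the helper name file_fingerprint is hypothetical and not part of Anemoi or PyTorch.

```python
import hashlib


def file_fingerprint(path, chunk_size=1 << 20):
    """Return (size_in_bytes, sha256_hex) for the file at `path`.

    Hypothetical helper for comparing a checkpoint across environments.
    """
    digest = hashlib.sha256()
    size = 0
    with open(path, "rb") as handle:
        # Read in chunks so a large checkpoint does not need to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
            size += len(chunk)
    return size, digest.hexdigest()


# Usage (run in both the 3.11 and the 3.12 environment and compare):
# print(file_fingerprint("4b23cfdc-f24f-428a-98ce-1c800979e30a.ckpt"))
```

If the size or hash differs between the two environments, the download or copy is the more likely culprit than the Python version.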

Version

python 3.12, torch 2.6, anemoi-training 0.3.2.post246

Platform (OS and architecture)

Linux laptop 6.11.0-21-generic ~24.04.1-Ubuntu SMP PREEMPT_DYNAMIC

Relevant log output

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/.../python3.12/site-packages/torch/serialization.py", line 1326, in load
    with _open_zipfile_reader(opened_file) as opened_zipfile:
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/.../python3.12/site-packages/torch/serialization.py", line 671, in __init__
    super().__init__(torch._C.PyTorchFileReader(name_or_buffer))
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: PytorchStreamReader failed reading zip archive: invalid header or archive is corrupted
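The error above says the zip archive has an invalid header or is corrupted. One way to narrow this down independently of PyTorch is to inspect the file with Python's own zipfile module. This is only a sketch: inspect_checkpoint_archive is a hypothetical helper, and zipfile is a different reader from the one PyTorch uses internally, so a clean result here does not guarantee that torch.load will succeed.

```python
import zipfile


def inspect_checkpoint_archive(path):
    """Report whether `path` looks like a readable zip archive.

    Hypothetical diagnostic helper. Returns a dict with:
      is_zip    - True if zipfile recognises the file as a zip archive
      n_entries - number of members listed (0 if the archive is unreadable)
      bad_entry - name of the first member failing its check, or None
      error     - message if the central directory could not be parsed
    """
    report = {
        "is_zip": zipfile.is_zipfile(path),
        "n_entries": 0,
        "bad_entry": None,
        "error": None,
    }
    if not report["is_zip"]:
        return report
    try:
        with zipfile.ZipFile(path) as archive:
            report["n_entries"] = len(archive.namelist())
            # testzip() reads every member and returns the first bad one.
            report["bad_entry"] = archive.testzip()
    except zipfile.BadZipFile as exc:
        report["error"] = str(exc)
    return report


# Usage:
# print(inspect_checkpoint_archive("4b23cfdc-f24f-428a-98ce-1c800979e30a.ckpt"))
```

Running this under both Python 3.11 and 3.12 and comparing the reports may indicate whether the file itself is damaged or whether the failure is specific to how it is read in the 3.12 environment.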

Accompanying data

No response

Organisation

No response


mtgarciag · 2025-04-29T08:16:22Z

Hi! For me it worked following Gabriel's steps in Slack (which use Python 3.11). For training with the stretched grid, I also had to change the lines regarding the node_loss_weights in the training config to:

node_loss_weights:
  _target_: anemoi.training.losses.nodeweights.ReweightedGraphNodeAttribute
  target_nodes: ${graph.data}
  node_attribute: area_weight
  scaled_attribute: cutout_mask
  weight_frac_of_total: 0.3

and add weights_only=False to torch.load() in line 76 of checkpoint.py (this seems to be an issue with PyTorch 2.6).

ThomasRieutord · 2025-04-29T14:33:32Z

Hi mtgarciag, thanks for your reply! Yes, it works with Python 3.11 but not with Python 3.12. As long as we use 3.11 it's OK, but I thought it was useful to mention it in anticipation of a future switch to 3.12.

ThomasRieutord added the "bug" (Something isn't working) label Apr 8, 2025

github-project-automation bot added this to Anemoi-dev Apr 8, 2025
