Checkpoint appears corrupted with Python 3.12 #239

Open
ThomasRieutord opened this issue Apr 8, 2025 · 2 comments
Labels
bug Something isn't working

Comments

@ThomasRieutord

ThomasRieutord commented Apr 8, 2025

What happened?

When working on transfer learning, we tried to use a registered checkpoint in an environment with Python 3.12, and the checkpoint appeared to be corrupted.

However, the same checkpoint loads in an environment with Python 3.11, so the problem seems to come from changes between 3.11 and 3.12. Indeed, the documentation of the tarfile module (on which PyTorch relies to load checkpoints) mentions changes in 3.12.

This should not be a problem while Anemoi relies on Python 3.11 but I thought it was worth mentioning for future updates.

What are the steps to reproduce the bug?

  1. Download the checkpoint "proper-osprey" from the Anemoi catalog. It will be named 4b23cfdc-f24f-428a-98ce-1c800979e30a.ckpt
  2. Execute the following in the Python 3.12 environment:
import torch
ckpt = torch.load("4b23cfdc-f24f-428a-98ce-1c800979e30a.ckpt", map_location=torch.device("cpu"), weights_only=False)
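Before comparing interpreters, it may help to confirm that both environments are reading byte-identical files. The following is a minimal sketch using only the standard library; the helper name `sha256_of` is hypothetical and is not part of Anemoi or PyTorch. Running it in the Python 3.11 and the Python 3.12 environment and comparing the two digests would show whether the download itself differs.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # iter() with a sentinel stops when read() returns an empty bytes object
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


if __name__ == "__main__":
    # File name taken from the reproduction steps above
    ckpt = Path("4b23cfdc-f24f-428a-98ce-1c800979e30a.ckpt")
    if ckpt.exists():
        print(sha256_of(ckpt))
```

If the digests match, the file on disk is the same in both environments and the difference is more likely in how it is read.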

Version

python 3.12, torch 2.6, anemoi-training 0.3.2.post246

Platform (OS and architecture)

Linux laptop 6.11.0-21-generic ~24.04.1-Ubuntu SMP PREEMPT_DYNAMIC

Relevant log output

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/.../python3.12/site-packages/torch/serialization.py", line 1326, in load
    with _open_zipfile_reader(opened_file) as opened_zipfile:
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/.../python3.12/site-packages/torch/serialization.py", line 671, in __init__
    super().__init__(torch._C.PyTorchFileReader(name_or_buffer))
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: PytorchStreamReader failed reading zip archive: invalid header or archive is corrupted
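The error above is raised while the file is opened as a zip archive, so one way to narrow it down is to inspect the file with Python's own `zipfile` module, independently of PyTorch. This is only a diagnostic sketch: the function name `describe_archive` and the keys of the returned dictionary are made up for illustration, and a result from `zipfile` does not guarantee that PyTorch's reader would agree.

```python
import zipfile


def describe_archive(path):
    """Report basic zip-level facts about a checkpoint file."""
    with open(path, "rb") as fh:
        magic = fh.read(4)
    info = {
        # A zip archive with at least one member starts with a local file header
        "starts_with_zip_magic": magic == b"PK\x03\x04",
        "is_zipfile": zipfile.is_zipfile(path),
        "n_members": None,
        "first_bad_member": None,
    }
    if info["is_zipfile"]:
        with zipfile.ZipFile(path) as zf:
            info["n_members"] = len(zf.namelist())
            # testzip() reads every member and returns the first name
            # whose CRC check fails, or None if all members are fine
            info["first_bad_member"] = zf.testzip()
    return info
```

Calling `describe_archive("4b23cfdc-f24f-428a-98ce-1c800979e30a.ckpt")` in both environments and comparing the results would indicate whether the standard library also considers the archive invalid under Python 3.12. Note that `testzip()` reads the whole file, which can take a while for large checkpoints.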

Accompanying data

No response

Organisation

No response

ThomasRieutord added the bug label Apr 8, 2025

@mtgarciag

Hi! For me it worked following Gabriel's steps in Slack (which use Python 3.11). For training with the stretched grid, I also had to change the lines regarding the node_loss_weights in the training config to:

node_loss_weights:
  _target_: anemoi.training.losses.nodeweights.ReweightedGraphNodeAttribute
  target_nodes: ${graph.data}
  node_attribute: area_weight
  scaled_attribute: cutout_mask
  weight_frac_of_total: 0.3

and add weights_only=False to torch.load() in line 76 of checkpoint.py (this seems to be an issue with PyTorch 2.6, where the default value of weights_only changed from False to True).
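Since the `weights_only` change is tied to the PyTorch version rather than the Python version, it can be useful to tell the two effects apart. Below is a small sketch that decides, from a version string such as `torch.__version__`, whether `torch.load()` defaults to `weights_only=True`. The helper name `default_weights_only_is_true` is hypothetical; the only fact it encodes is that the default flipped in PyTorch 2.6.

```python
def default_weights_only_is_true(torch_version):
    """Return True if torch.load() defaults to weights_only=True.

    The default changed from False to True in PyTorch 2.6.
    Accepts strings such as "2.6.0", "2.6.0+cu124" or "2.5.1".
    """
    # Drop any local build suffix ("+cu124"), then keep major and minor only
    release = torch_version.split("+")[0]
    major, minor = (int(part) for part in release.split(".")[:2])
    return (major, minor) >= (2, 6)


# Intended use (requires torch, so shown as a comment only):
#   import torch
#   if default_weights_only_is_true(torch.__version__):
#       ckpt = torch.load(path, map_location="cpu", weights_only=False)
```

Keep in mind that `weights_only=False` allows arbitrary objects to be unpickled, so it should only be used with checkpoints from a trusted source.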

@ThomasRieutord
Author

Hi mtgarciag, thanks for your reply! Yes, it works with Python 3.11 but not with Python 3.12. As long as we use 3.11 it's OK, but I thought it was useful to mention it in anticipation of a future switch to 3.12.
