Description
What happened?
Inference runs successfully with anemoi-inference==0.4.9, but upgrading to anemoi-inference==0.5.0 results in a torch.OutOfMemoryError. The environment is otherwise identical in both cases, with anemoi-models==0.4.0 installed.
What are the steps to reproduce the bug?
Run an inference step using a GraphTransformer model (n320 -> TriNodes(refinement=7) -> n320) with 1024 channels and the following config:
checkpoint: my_ckpt.ckpt
date: 2023-06-01
runner: default
input: mars
lead_time: 360
output:
  grib: test_n320.grib
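The reproduction can be sketched as below. The `anemoi-inference run` entry point is an assumption based on the documented CLI; adjust to your install if it differs.

```shell
# Write the config shown above to a file (indentation matters: `grib`
# is nested under `output`).
cat > test_n320.yaml <<'EOF'
checkpoint: my_ckpt.ckpt
date: 2023-06-01
runner: default
input: mars
lead_time: 360
output:
  grib: test_n320.grib
EOF

# Then run once per environment (commands commented out here, since they
# need the checkpoint and a GPU):
#   pip install "anemoi-inference==0.4.9" "anemoi-models==0.4.0"   # -> runs fine
#   pip install "anemoi-inference==0.5.0" "anemoi-models==0.4.0"   # -> CUDA OOM
#   anemoi-inference run test_n320.yaml
```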
Version
0.5.0
Platform (OS and architecture)
x86_64 GNU/Linux
Relevant log output
...
File "VENVS_DIR/aifs-inference/lib/python3.11/site-packages/anemoi/models/models/encoder_processor_decoder.py", line 188, in forward
x_data_latent, x_latent = self._run_mapper(
^^^^^^^^^^^^^^^^^
File "VENVS_DIR/aifs-inference/lib/python3.11/site-packages/anemoi/models/models/encoder_processor_decoder.py", line 159, in _run_mapper
return checkpoint(
^^^^^^^^^^^
File "VENVS_DIR/aifs-inference/lib/python3.11/site-packages/torch/_compile.py", line 32, in inner
return disable_fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "VENVS_DIR/aifs-inference/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py", line 745, in _fn
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "VENVS_DIR/aifs-inference/lib/python3.11/site-packages/torch/utils/checkpoint.py", line 496, in checkpoint
ret = function(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^
File "VENVS_DIR/aifs-inference/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "VENVS_DIR/aifs-inference/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "VENVS_DIR/aifs-inference/lib/python3.11/site-packages/anemoi/models/layers/mapper.py", line 344, in forward
x_dst = super().forward(x, batch_size, shard_shapes, model_comm_group)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "VENVS_DIR/aifs-inference/lib/python3.11/site-packages/anemoi/models/layers/mapper.py", line 260, in forward
(x_src, x_dst), edge_attr = self.proc(
^^^^^^^^^^
File "VENVS_DIR/aifs-inference/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "VENVS_DIR/aifs-inference/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "VENVS_DIR/aifs-inference/lib/python3.11/site-packages/anemoi/models/layers/block.py", line 512, in forward
edge_attr_list, edge_index_list = sort_edges_1hop_chunks(
^^^^^^^^^^^^^^^^^^^^^^^
File "VENVS_DIR/aifs-inference/lib/python3.11/site-packages/anemoi/models/distributed/khop_edges.py", line 121, in sort_edges_1hop_chunks
edge_index_chunk, edge_attr_chunk = bipartite_subgraph(
^^^^^^^^^^^^^^^^^^^
File "VENVS_DIR/aifs-inference/lib/python3.11/site-packages/torch_geometric/utils/subgraph.py", line 192, in bipartite_subgraph
edge_attr = edge_attr[edge_mask] if edge_attr is not None else None
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 120.00 MiB. GPU 0 has a total capacity of 39.56 GiB of which 17.12 MiB is free. Process 2750488 has 9.46 GiB memory in use. Including non-PyTorch memory, this process has 16.54 GiB memory in use. Process 2750490 has 9.48 GiB memory in use. Process 2750487 has 4.03 GiB memory in use. Of the allocated memory 15.94 GiB is allocated by PyTorch, and 112.30 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
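The error message itself suggests enabling expandable segments in the CUDA caching allocator to mitigate fragmentation. A minimal sketch of applying that workaround (it does not address the underlying regression between 0.4.9 and 0.5.0, and must take effect before the first CUDA allocation):

```python
import os

# Allocator setting suggested by the OOM message above. Set it before
# importing torch (or at least before any CUDA tensor is created), e.g.
# at the top of the driver script or via the shell environment.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

# import torch  # import and allocate only after the variable is set
```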
Accompanying data
No response
Organisation
ECMWF