
Increase in memory usage with anemoi-inference=0.5.0 #197

Open
@JPXKQX

Description


What happened?

Inference runs successfully with anemoi-inference==0.4.9, but upgrading to anemoi-inference==0.5.0 results in a torch.OutOfMemoryError. The environment is otherwise unchanged in both cases, with anemoi-models==0.4.0 installed.

What are the steps to reproduce the bug?

Run an inference step using a GraphTransformer model (n320 -> TriNodes(refinement=7) -> n320) with 1024 channels and the following configuration:

checkpoint: my_ckpt.ckpt
date: 2023-06-01
runner: default
input: mars
lead_time: 360
output:
  grib: test_n320.grib
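For reference, saving the config above to a file and invoking the runner might look like the following sketch (the `anemoi-inference run` subcommand is an assumption about the CLI entry point and is left commented out; adjust to your installation):

```shell
# Write the reproduction config from the report to disk.
cat > config.yaml <<'EOF'
checkpoint: my_ckpt.ckpt
date: 2023-06-01
runner: default
input: mars
lead_time: 360
output:
  grib: test_n320.grib
EOF
# Assumed CLI invocation (not confirmed by the report):
# anemoi-inference run config.yaml
wc -l < config.yaml
```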

Version

0.5.0

Platform (OS and architecture)

x86_64 GNU/Linux

Relevant log output

...
  File "VENVS_DIR/aifs-inference/lib/python3.11/site-packages/anemoi/models/models/encoder_processor_decoder.py", line 188, in forward
    x_data_latent, x_latent = self._run_mapper(
                              ^^^^^^^^^^^^^^^^^
  File "VENVS_DIR/aifs-inference/lib/python3.11/site-packages/anemoi/models/models/encoder_processor_decoder.py", line 159, in _run_mapper
    return checkpoint(
           ^^^^^^^^^^^
  File "VENVS_DIR/aifs-inference/lib/python3.11/site-packages/torch/_compile.py", line 32, in inner
    return disable_fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "VENVS_DIR/aifs-inference/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py", line 745, in _fn
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "VENVS_DIR/aifs-inference/lib/python3.11/site-packages/torch/utils/checkpoint.py", line 496, in checkpoint
    ret = function(*args, **kwargs)
          ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "VENVS_DIR/aifs-inference/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "VENVS_DIR/aifs-inference/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "VENVS_DIR/aifs-inference/lib/python3.11/site-packages/anemoi/models/layers/mapper.py", line 344, in forward
    x_dst = super().forward(x, batch_size, shard_shapes, model_comm_group)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "VENVS_DIR/aifs-inference/lib/python3.11/site-packages/anemoi/models/layers/mapper.py", line 260, in forward
    (x_src, x_dst), edge_attr = self.proc(
                                ^^^^^^^^^^
  File "VENVS_DIR/aifs-inference/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "VENVS_DIR/aifs-inference/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "VENVS_DIR/aifs-inference/lib/python3.11/site-packages/anemoi/models/layers/block.py", line 512, in forward
    edge_attr_list, edge_index_list = sort_edges_1hop_chunks(
                                      ^^^^^^^^^^^^^^^^^^^^^^^
  File "VENVS_DIR/aifs-inference/lib/python3.11/site-packages/anemoi/models/distributed/khop_edges.py", line 121, in sort_edges_1hop_chunks
    edge_index_chunk, edge_attr_chunk = bipartite_subgraph(
                                        ^^^^^^^^^^^^^^^^^^^
  File "VENVS_DIR/aifs-inference/lib/python3.11/site-packages/torch_geometric/utils/subgraph.py", line 192, in bipartite_subgraph
    edge_attr = edge_attr[edge_mask] if edge_attr is not None else None

torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 120.00 MiB. GPU 0 has a total capacity of 39.56 GiB of which 17.12 MiB is free. Process 2750488 has 9.46 GiB memory in use. Including non-PyTorch memory, this process has 16.54 GiB memory in use. Process 2750490 has 9.48 GiB memory in use. Process 2750487 has 4.03 GiB memory in use. Of the allocated memory 15.94 GiB is allocated by PyTorch, and 112.30 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
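As a possible mitigation while the regression is investigated, the error message itself suggests enabling expandable segments. Note this only helps if the failure is driven by allocator fragmentation, not by a genuinely larger peak allocation in 0.5.0:

```shell
# Allocator setting suggested by the OOM message above; must be set
# before the process initializes CUDA. Mitigates fragmentation only.
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
echo "$PYTORCH_CUDA_ALLOC_CONF"
```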

Accompanying data

No response

Organisation

ECMWF
