feat: parallel inference without slurm #121

cathalobrien · 2025-02-03T15:49:31Z

This PR extends the parallel inference added in #55 to work without Slurm within a single node. It's much nicer to debug and run now :D

When running parallel inference without Slurm, you have to add world_size to the config. At the moment, this value is ignored in favour of SLURM_NTASKS when running with srun. An example config running parallel inference across 4 processes on a single node is shown below. You can launch this job as normal with anemoi-inference run parinf.yaml.

checkpoint: /path/to/inference-last.ckpt
lead_time: 60
runner: parallel
world_size: 4 #Only required if running parallel inference without Slurm
input:
  grib: /path/to/input.grib
output:
  grib: /path/to/output.grib
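The precedence between world_size and SLURM_NTASKS described above can be sketched roughly as follows. This is only an illustration: resolve_world_size is a hypothetical helper name, not a function from anemoi-inference, and the real checks in the runner may differ.

```python
import os


def resolve_world_size(config_world_size=None, environ=None):
    """Hypothetical helper: decide how many parallel processes to use.

    Assumption: SLURM_NTASKS being set means the job was launched by srun,
    in which case the config value is ignored.
    """
    environ = os.environ if environ is None else environ

    slurm_ntasks = environ.get("SLURM_NTASKS")
    if slurm_ntasks is not None:
        # Slurm's task count takes priority over the config entry.
        return int(slurm_ntasks)

    if config_world_size is None:
        raise ValueError("world_size must be set in the config when running without Slurm")

    world_size = int(config_world_size)
    if world_size < 1:
        raise ValueError("world_size must be a positive integer")
    return world_size
```

With this sketch, resolve_world_size(4, {}) returns 4, while resolve_world_size(4, {"SLURM_NTASKS": "8"}) returns 8 because the Slurm value wins.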

How it works:

- Check whether anemoi-inference was launched by srun.
- If not, spawn config.world_size processes:
  - master_addr is localhost
  - master_port is a hash of the node name, within a range.
- Each spawned process runs a slimmed-down version of RunCmd.run, but with the config preloaded.
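A minimal, standard-library-only sketch of the launch logic described above. Everything named here is an assumption for illustration: the SLURM_PROCID check for detecting srun, the port range (PORT_BASE, PORT_SPAN), the use of md5 for the hash, and the torch.distributed-style environment variables are not taken from the PR's code.

```python
import hashlib
import multiprocessing as mp
import os
import socket

PORT_BASE = 10000  # assumed lower bound of the port range
PORT_SPAN = 10000  # assumed number of ports in the range


def launched_by_srun(environ=None):
    """Assumption: srun sets SLURM_PROCID for every task it starts."""
    environ = os.environ if environ is None else environ
    return "SLURM_PROCID" in environ


def master_port_for(node_name):
    """Map a node name to a port in [PORT_BASE, PORT_BASE + PORT_SPAN).

    hashlib is used instead of the built-in hash(), which is salted per
    process for strings and so would give each process a different port.
    """
    digest = hashlib.md5(node_name.encode("utf-8")).hexdigest()
    return PORT_BASE + int(digest, 16) % PORT_SPAN


def rank_environments(world_size, node_name=None):
    """Build one environment dict per process to be spawned."""
    node_name = socket.gethostname() if node_name is None else node_name
    port = master_port_for(node_name)
    return [
        {
            "MASTER_ADDR": "localhost",
            "MASTER_PORT": str(port),
            "WORLD_SIZE": str(world_size),
            "RANK": str(rank),
            "LOCAL_RANK": str(rank),
        }
        for rank in range(world_size)
    ]


def _worker(env):
    """Placeholder worker: a real one would run inference with the preloaded config."""
    os.environ.update(env)


def spawn_local(world_size):
    """Start one process per rank on this node and wait for all of them."""
    ctx = mp.get_context("spawn")
    procs = [ctx.Process(target=_worker, args=(env,)) for env in rank_environments(world_size)]
    for proc in procs:
        proc.start()
    for proc in procs:
        proc.join()
    return [proc.exitcode for proc in procs]
```

A launcher built this way would call spawn_local(config.world_size) only when launched_by_srun() is false, and would need to do so under an `if __name__ == "__main__":` guard because of the spawn start method.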

Issues:

- The master port calculation would lead to a clash if two parallel inference jobs ran on the same node at the same time.
- At the moment, I have a copy of RunCmd.run in runners/parallel.py. It would be nice to be able to use that code directly, rather than having to maintain a copy. To do this, I would just need to be able to pass a loaded config instead of a path.
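To make the first issue concrete: because the port depends only on the node name, two jobs on the same node compute the same port, and the second one cannot bind it. The sketch below shows how such a clash could be detected, plus one possible mitigation (asking the OS for a free port). This is not what the PR does; port_is_free and find_free_port are hypothetical helper names.

```python
import socket


def port_is_free(port, host="127.0.0.1"):
    """Return True if a TCP socket can currently be bound to (host, port)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            # Typically "address already in use": another job holds the port.
            return False
        return True


def find_free_port(host="127.0.0.1"):
    """Ask the OS for an unused port by binding to port 0."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind((host, 0))
        return sock.getsockname()[1]
```

Note that find_free_port has an inherent race: the port is released when the socket closes, so another process could grab it before the parallel job binds it. The chosen port would also have to be communicated to every spawned process, which a pure hash of the node name avoids.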

📚 Documentation preview 📚: https://anemoi-inference--121.org.readthedocs.build/en/121/

codecov-commenter · 2025-02-03T16:25:37Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 98.03%. Comparing base (90728d5) to head (9d22d57).
Report is 44 commits behind head on main.

Additional details and impacted files

@@           Coverage Diff           @@
##             main     #121   +/-   ##
=======================================
  Coverage   98.03%   98.03%           
=======================================
  Files           3        3           
  Lines          51       51           
=======================================
  Hits           50       50           
  Misses          1        1


CHANGELOG.md

src/anemoi/inference/config.py

src/anemoi/inference/runners/parallel.py

cathalobrien added 5 commits February 3, 2025 14:32

wip

746e928

forgot

7521b1e

works now except for detecting srun use

2b831b4

works now :)

b099f62

pre-commit why hast thou forsaken me

6936d67

cathalobrien requested review from gmertes and HCookie February 3, 2025 15:49

cathalobrien mentioned this pull request Feb 3, 2025

Feature: parallel inference without slurm #112

Closed

changelog

35e87ee

cathalobrien and others added 5 commits February 4, 2025 09:47

added some more guards for invalid config entries

8398c0b

updated docs to explain to how to launch parinf without slurm

a08119f

[pre-commit.ci] auto fixes from pre-commit.com hooks

b124d6a

for more information, see https://pre-commit.ci

Frailty, thy name is pre-commit

777c43e

Merge branch 'feature/par-inf-without-slurm' of github.com:ecmwf/anemoi-inference into feature/par-inf-without-slurm

475dd22

HCookie changed the title ~~parallel inference without slurm~~ feat: parallel inference without slurm Feb 6, 2025

HCookie reviewed Feb 6, 2025

feedback

997c500

github-actions bot added the documentation Improvements or additions to documentation label Feb 6, 2025

Docs and error state minimum models version and give workarounds

9d22d57

HCookie reviewed Feb 12, 2025

src/anemoi/inference/runners/parallel.py (outdated, resolved)

HCookie reviewed Feb 12, 2025

src/anemoi/inference/runners/parallel.py (resolved)

HCookie assigned cathalobrien Feb 12, 2025

gmertes reviewed Feb 13, 2025

src/anemoi/inference/runners/parallel.py (outdated, resolved)

src/anemoi/inference/config.py (outdated, resolved)

feedback

24e292f

github-actions bot added the config label Feb 13, 2025

pre-commit-ci bot and others added 4 commits February 13, 2025 16:39

[pre-commit.ci] auto fixes from pre-commit.com hooks

68b6529

for more information, see https://pre-commit.ci

update docs to reflect new min version for models

a4ddf5f

Merge branch 'feature/par-inf-without-slurm' of github.com:ecmwf/anemoi-inference into feature/par-inf-without-slurm

ef7bf42

[pre-commit.ci] auto fixes from pre-commit.com hooks

8464674

for more information, see https://pre-commit.ci

HCookie mentioned this pull request Feb 14, 2025

If I want to apply anemoi-inference on a machine with two GPUs, how should I proceed? #139

Closed

cathalobrien and others added 3 commits February 14, 2025 14:41

Keeping parallel stuff seperate

261b07e

[pre-commit.ci] auto fixes from pre-commit.com hooks

0fb80cc

for more information, see https://pre-commit.ci

changed to kwargs

450f9c1

gmertes approved these changes Feb 14, 2025

cathalobrien mentioned this pull request Feb 14, 2025

Running anemoi-training example notebook #119

Closed

cathalobrien merged commit 90d7911 into main Feb 14, 2025
74 of 79 checks passed

cathalobrien deleted the feature/par-inf-without-slurm branch February 14, 2025 17:02

DeployDuck mentioned this pull request Feb 14, 2025

chore(main): Release 0.4.10 #140

Merged

HCookie added the enhancement New feature or request label Feb 17, 2025
