
feat: parallel inference without slurm #121

Merged
merged 21 commits into from
Feb 14, 2025

@cathalobrien (Contributor) commented Feb 3, 2025

This PR extends the parallel inference added in #55 to work without Slurm on a single node. It's much nicer to debug and run now :D

When running parallel inference without Slurm, you have to add world_size to the config. At the moment, this value is ignored in favour of SLURM_NTASKS when running with srun. An example config that runs parallel inference across 4 processes on a single node is shown below. You can launch this job as normal with anemoi-inference run parinf.yaml.

checkpoint: /path/to/inference-last.ckpt
lead_time: 60
runner: parallel
world_size: 4 # Only required if running parallel inference without Slurm
input:
  grib: /path/to/input.grib
output:
  grib: /path/to/output.grib

How it works:

  • check if anemoi-inference is launched by srun
  • if not, spawn config.world_size processes
    • master_addr is localhost
    • master_port is a hash of the node name, within a range.
  • Each spawned process runs a slimmed-down version of RunCmd.run, but with the config preloaded
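
The steps above can be sketched roughly as follows. This is a minimal illustration using only the standard library, not the code in runners/parallel.py. Several details are assumptions: that srun is detected via the presence of SLURM_NTASKS, the port range (PORT_BASE, PORT_SPAN), the choice of SHA-256 as the hash, the RANK and WORLD_SIZE variable names, and the helper names themselves, which are hypothetical.

```python
import hashlib
import multiprocessing as mp
import os
import socket

# Hypothetical port range; the PR only says the port falls "within a range".
PORT_BASE = 10000
PORT_SPAN = 10000


def launched_by_srun(env=None):
    """Assumption: an srun launch is detected by SLURM_NTASKS being set."""
    env = os.environ if env is None else env
    return "SLURM_NTASKS" in env


def resolve_world_size(config, env=None):
    """SLURM_NTASKS wins under srun; otherwise use config['world_size']."""
    env = os.environ if env is None else env
    if launched_by_srun(env):
        return int(env["SLURM_NTASKS"])
    return int(config["world_size"])


def master_port_for(node_name):
    """Map a node name to a deterministic port inside the assumed range."""
    digest = hashlib.sha256(node_name.encode()).hexdigest()
    return PORT_BASE + int(digest, 16) % PORT_SPAN


def _worker(rank, world_size, port, config):
    """Per-process entry point; receives the already loaded config."""
    os.environ.update(
        {
            "MASTER_ADDR": "localhost",
            "MASTER_PORT": str(port),
            "RANK": str(rank),
            "WORLD_SIZE": str(world_size),
        }
    )
    # A real worker would build the runner from `config` and run it here.


def spawn_workers(config):
    """Spawn one local process per rank when not launched by srun."""
    if launched_by_srun():
        raise RuntimeError("srun already starts one process per task")
    world_size = resolve_world_size(config)
    port = master_port_for(socket.gethostname())
    ctx = mp.get_context("spawn")
    procs = [
        ctx.Process(target=_worker, args=(rank, world_size, port, config))
        for rank in range(world_size)
    ]
    for proc in procs:
        proc.start()
    for proc in procs:
        proc.join()
    return [proc.exitcode for proc in procs]
```

Because the sketch uses the "spawn" start method, a script calling spawn_workers({"world_size": 4}) should do so under an `if __name__ == "__main__":` guard. Every process on a given node derives the same port from the node name, which is what allows the ranks to find each other without extra coordination.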

Issues:

  • The master port calculation would lead to a clash if two parallel inference processes ran on the same node at the same time.
  • At the moment, I have a copy of RunCmd.run in runners/parallel.py. It would be nice to use that code directly, rather than having to maintain a copy. To do this, I would just need to be able to pass a loaded config instead of a path
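
One possible way to avoid the port clash described in the first issue, which this PR does not implement, is to ask the operating system for an unused port instead of hashing the node name. The sketch below binds to port 0 so that the kernel picks a free port; find_free_port is a hypothetical helper name. The parent process would have to choose the port once and pass it to every spawned process, and there is still a small window in which another program could claim the port after the probe socket closes.

```python
import socket


def find_free_port(host="127.0.0.1"):
    """Return a TCP port that was free at the time of the call."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        # Binding to port 0 asks the OS to assign any unused port.
        sock.bind((host, 0))
        return sock.getsockname()[1]
```

Two parallel inference jobs started on the same node would then normally receive different ports, at the cost of no longer being able to derive the port independently in each process.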

📚 Documentation preview 📚: https://anemoi-inference--121.org.readthedocs.build/en/121/

@codecov-commenter commented Feb 3, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 98.03%. Comparing base (90728d5) to head (9d22d57).
Report is 44 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #121   +/-   ##
=======================================
  Coverage   98.03%   98.03%           
=======================================
  Files           3        3           
  Lines          51       51           
=======================================
  Hits           50       50           
  Misses          1        1           


@HCookie HCookie changed the title parallel inference without slurm feat: parallel inference without slurm Feb 6, 2025
@github-actions github-actions bot added the documentation Improvements or additions to documentation label Feb 6, 2025
@cathalobrien cathalobrien merged commit 90d7911 into main Feb 14, 2025
74 of 79 checks passed
@cathalobrien cathalobrien deleted the feature/par-inf-without-slurm branch February 14, 2025 17:02
@HCookie HCookie added the enhancement New feature or request label Feb 17, 2025
Labels: config, documentation, enhancement
Projects
Status: Done
4 participants