
ci: use all available CUDA devices for parallel tests #767

Open
wants to merge 7 commits into
base: main
from

Conversation

egparedes
Contributor

@egparedes egparedes commented May 29, 2025

Enhance the pytest and CSCS-CI configuration to use all CUDA devices during parallel test runs.
This works by adding code to the pytest_configure() hook, which every pytest worker executes, that sets the CUDA_VISIBLE_DEVICES environment variable to a different device for each worker id. The list of available devices must be defined explicitly, as a comma-separated list, in the custom PYTEST_XDIST_SPLIT_CUDA_VISIBLE_DEVICES environment variable.

Additionally, replace the custom NUM_PROCESSES env variable with the standard PYTEST_XDIST_AUTO_NUM_WORKERS to control the number of pytest-xdist workers.
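
For readers who want a concrete picture of the hook described above, here is a minimal sketch. It is not the code from this PR. It assumes that workers are identified through the `PYTEST_XDIST_WORKER` environment variable that pytest-xdist sets (`gw0`, `gw1`, ...) and that devices are assigned round-robin when there are more workers than devices. The helper name `pick_cuda_device` is hypothetical.

```python
import os
import re

# Custom variable named in this PR: comma-separated list of usable devices.
SPLIT_VAR = "PYTEST_XDIST_SPLIT_CUDA_VISIBLE_DEVICES"


def pick_cuda_device(worker_id, devices_csv):
    """Return the device entry for an xdist worker id such as 'gw3', or None.

    Hypothetical helper. Assumption: round-robin assignment (worker index
    modulo the number of listed devices).
    """
    devices = [d.strip() for d in devices_csv.split(",") if d.strip()]
    match = re.fullmatch(r"gw(\d+)", worker_id or "")
    if not devices or match is None:
        # Not running under xdist, or no device list configured: do nothing.
        return None
    return devices[int(match.group(1)) % len(devices)]


def pytest_configure(config):
    """pytest hook, run in every worker process: pin the worker to one device."""
    worker_id = os.environ.get("PYTEST_XDIST_WORKER")
    device = pick_cuda_device(worker_id, os.environ.get(SPLIT_VAR, ""))
    if device is not None:
        os.environ["CUDA_VISIBLE_DEVICES"] = device
```

With `PYTEST_XDIST_SPLIT_CUDA_VISIBLE_DEVICES=0,1,2,3` and four workers, this sketch would give each worker its own device. The variable generally has to be set before any CUDA library initializes in the worker process, which is presumably why an early hook such as pytest_configure() is used.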

@egparedes
Contributor Author

cscs-ci run default

@egparedes egparedes requested a review from Copilot May 29, 2025 13:42
Contributor

@Copilot Copilot AI left a comment

Pull Request Overview

This PR updates the CI and pytest configurations to utilize all available CUDA devices for parallel test runs and standardizes worker configuration by replacing a custom environment variable with the standard PYTEST_XDIST_AUTO_NUM_WORKERS.

  • In noxfile.py, the NUM_PROCESSES env variable is replaced with a hard-coded "auto" setting.
  • In pytest_config.py, logic is added to split the CUDA_VISIBLE_DEVICES among pytest-xdist workers.
  • In ci/base.yml, the custom NUM_PROCESSES variable is removed and replaced with PYTEST_XDIST_AUTO_NUM_WORKERS along with the new PYTEST_XDIST_SPLIT_CUDA_VISIBLE_DEVICES setup.

Reviewed Changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| noxfile.py | Replaced custom NUM_PROCESSES with standard auto configuration. |
| model/testing/src/icon4py/model/testing/pytest_config.py | Added logic to distribute CUDA devices across pytest workers. |
| ci/base.yml | Updated environment variables to support the new configuration. |

@egparedes
Contributor Author

cscs-ci run default

@egparedes
Contributor Author

cscs-ci run default

@egparedes egparedes requested a review from Copilot May 29, 2025 14:12
Contributor

@Copilot Copilot AI left a comment

Pull Request Overview

This PR enhances parallel testing configurations by allocating distinct CUDA devices to each pytest worker and standardizing the worker count through PYTEST_XDIST_AUTO_NUM_WORKERS.

  • Replaces the custom NUM_PROCESSES environment variable with the standard PYTEST_XDIST_AUTO_NUM_WORKERS in noxfile.py and ci/base.yml.
  • Adds a pytest_configure hook to assign CUDA devices based on worker IDs in model/testing/src/icon4py/model/testing/pytest_config.py.
  • Updates CI configurations to echo CUDA-related environment variables for diagnostic purposes.

Reviewed Changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| noxfile.py | Removed usage of NUM_PROCESSES and hardcoded 'auto' for pytest worker count. |
| model/testing/src/icon4py/model/testing/pytest_config.py | Added logic to split CUDA devices among pytest workers based on environment configuration. |
| ci/default.yml | Introduced echo commands to display CUDA environment variables for debugging. |
| ci/base.yml | Replaced NUM_PROCESSES with PYTEST_XDIST_AUTO_NUM_WORKERS and added split CUDA devices configuration. |

Mandatory Tests

Please make sure you run these tests via comment before you merge!

  • cscs-ci run default

Optional Tests

To run benchmarks you can use:

  • cscs-ci run benchmark-bencher

To run tests and benchmarks with the DaCe backend you can use:

  • cscs-ci run dace

To run test levels ignored by the default test suite (mostly simple datatests for static field computations) you can use:

  • cscs-ci run extra

For more detailed information please look at CI in the EXCLAIM universe.

@egparedes
Contributor Author

cscs-ci run default

@egparedes egparedes force-pushed the use-4-gpus-on-cscs-ci branch from a4aded7 to 636af39 Compare June 16, 2025 14:06
@egparedes
Contributor Author

cscs-ci run default
