Description
Two approaches to the async problem:
- The comprehensive one, which would turn all jobs into actual SLURM jobs with dependencies that can be submitted ahead of time (see the sketch after this list). This would allow:
  - all current configurations to run async
  - users to write arbitrary scripts and have them turned into such jobs automatically
- The targeted approach, where we only focus on some jobs of interest. This has the following limitations:
  - Only jobs that explicitly submit a corresponding SLURM job can run async.
  - This also implies that any job in an async config must be implemented this way, including user-provided jobs.
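
For reference, the SLURM mechanism approach 1 relies on is submitting jobs ahead of time with `--dependency`. A minimal sketch, assuming `sbatch` is available on the login node; the `submit` helper and the script names are illustrative, not part of the current code:

```python
import subprocess

def submit(script, dependency=None):
    """Submit a batch script with sbatch and return its job id."""
    cmd = ["sbatch", "--parsable"]  # --parsable makes sbatch print just the job id
    if dependency is not None:
        # only start once the job we depend on has completed successfully
        cmd.append(f"--dependency=afterok:{dependency}")
    cmd.append(script)
    return subprocess.run(cmd, check=True, capture_output=True, text=True).stdout.strip()

prep_id = submit("prepare_data.job")    # submitted immediately
submit("icon.job", dependency=prep_id)  # queued, starts only after prepare_data succeeds
```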
Implications
Although approach 1 seems more appealing, it implies a lot of refactoring because it conflicts with several design choices. In particular, the current structure assumes that
- jobs have access to the single running Python interpreter and its memory,
- jobs run sequentially.
In order to generate SLURM jobs out of any of these jobs, we would need to ensure that
- the job's Python module is transferred to the working directory along with all the modules it imports,
- the configuration objects are dumped to files in that working directory,

all of this knowing that some of the jobs currently act one after the other in the same directory.
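
To make that concrete, here is a rough sketch of what generating a standalone SLURM job from an arbitrary chain job could look like. All names (`materialize_job`, `job.py`, `cfg.pkl`) are hypothetical and only illustrate the transfer/dump steps listed above:

```python
import pickle
import shutil
from pathlib import Path

def materialize_job(job_module_path, cfg, workdir):
    workdir = Path(workdir)
    workdir.mkdir(parents=True, exist_ok=True)
    # 1. transfer the job's Python module (its imported helper modules would
    #    need the same treatment, which is the hard part)
    shutil.copy(job_module_path, workdir / "job.py")
    # 2. dump the configuration objects so the batch job can reload them
    with open(workdir / "cfg.pkl", "wb") as f:
        pickle.dump(cfg, f)
    # 3. write a batch script that replays the job in a fresh interpreter
    script = workdir / "job.sbatch"
    script.write_text(
        "#!/bin/bash\n"
        "#SBATCH --job-name=chain_job\n"
        f"cd {workdir}\n"
        "python -c 'import pickle, job; job.main(pickle.load(open(\"cfg.pkl\", \"rb\")))'\n"
    )
    return script
```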
Proposed Road map
In conclusion, here is the proposed road map:
- target the jobs of interest, namely `icon` and `prepare_data`, and make their `main` function return the job id(s) that they submitted
- implement the dependency mechanism in `run_chain`, with an error when trying to run async with jobs not ready for it (see the sketch below)
- later: make as many jobs async as possible so that configs other than icon can be made async

Most of the work is in `prepare_data`. It will need to be either broken into pieces or equipped with error messages for all the pieces not ready for async yet.
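
A minimal sketch of the targeted approach in `run_chain`, assuming each async-ready job's `main` returns the SLURM job id(s) it submitted; the `ASYNC_READY` flag and the `dependencies` keyword are illustrative, not existing API:

```python
def run_chain(jobs, cfg, run_async=True):
    dependencies = []
    for job in jobs:
        if run_async and not getattr(job, "ASYNC_READY", False):
            # explicit error for jobs not ready for async yet
            raise RuntimeError(f"Job {job.__name__} cannot run asynchronously")
        # main() receives the ids of the previously submitted jobs so it can
        # pass them to sbatch as --dependency=afterok:<id>, and returns its own ids
        job_ids = job.main(cfg, dependencies=dependencies)
        if run_async:
            dependencies = list(job_ids)
```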