Cleanup of node, process and thread count. #625
Comments
The `SchedulerAdaptorDescription` should include methods to determine which counts can be used.
We could add a `SchedulerAdaptorDescription.supportsMultiNode()` method that returns a boolean. This would flag the local and ssh adaptors as unable to run jobs across multiple machines (i.e. nodeCount > 1). We could also flag GridEngine as not supporting multi node, which would make the whole parallel environment mapping much easier. @arnikz do you need GridEngine multi node support?
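A minimal sketch of how such a flag might be used to reject an unsupported node count early. Only `supportsMultiNode()` comes from the proposal above; the `AdaptorCapabilities` interface, `getName()`, `NodeCountCheck` and `validate` are hypothetical names used for illustration and are not part of the Xenon API.

```java
// Hypothetical stand-in for the adaptor description; only
// supportsMultiNode() corresponds to the method proposed in this thread.
interface AdaptorCapabilities {
    String getName();
    boolean supportsMultiNode();
}

final class NodeCountCheck {
    private NodeCountCheck() {}

    // Throws if the adaptor cannot honour the requested node count.
    static void validate(AdaptorCapabilities adaptor, int nodeCount) {
        if (nodeCount < 1) {
            throw new IllegalArgumentException(
                "nodeCount must be at least 1, got " + nodeCount);
        }
        if (nodeCount > 1 && !adaptor.supportsMultiNode()) {
            throw new IllegalArgumentException(
                "Adaptor '" + adaptor.getName()
                + "' cannot run jobs across multiple nodes");
        }
    }
}
```

With a check like this, a request for more than one node on the local or ssh adaptor would fail at submission time instead of being silently mapped to something else.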
Makes sense for
It does seem that SGE supports Not sure if this completely solves the issue though. It would allow correct behavior when you specify "-nodes 10", but I'm not sure what would happen if you said "-nodes 10 -core-per-node 16" on a cluster with 4 cores per machine...
There is some info here that discusses getting info on the nodes using https://serverfault.com/questions/266848/how-to-reserve-complete-nodes-on-sun-grid-engine
After giving this some further thought, another option would be to go for a concept based on tasks instead, somewhat similar to what SLURM is doing. The idea is that each task represents an "executable" being started somewhere (this can also be a script, of course). A task may need one or more cores. In addition, you may wish to start several of these tasks instead of just one. This is straightforward to specify using:
When you want more than one task (
With this approach, running sequential (single task, single core) and multithreaded (single task, multiple cores) jobs is still simple. In addition, it also allows schedulers to decide what the best task-to-node assignment is (by simply not specifying When the job is started, you can either start the executable once per job, or once for each task. The first seems to be the default on all schedulers:
To start once per task, the adaptors can use the nodefile (TORQUE and SGE) or In the
The rest follows from there...
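A rough sketch of the task-based model described in this comment, to make the arithmetic concrete. The class and field names (`TaskRequest`, `tasks`, `coresPerTask`, `tasksPerNode`) are assumptions chosen for illustration, since the option names in the original comment did not survive. A `tasksPerNode` of 0 is used here to mean "leave the task-to-node assignment to the scheduler".

```java
// Hypothetical task-based resource request; all names are illustrative
// assumptions and not taken from the Xenon API.
final class TaskRequest {
    final int tasks;         // number of executables (or scripts) to start
    final int coresPerTask;  // cores needed by each task
    final int tasksPerNode;  // 0 means: let the scheduler decide

    TaskRequest(int tasks, int coresPerTask, int tasksPerNode) {
        if (tasks < 1 || coresPerTask < 1 || tasksPerNode < 0) {
            throw new IllegalArgumentException("invalid task request");
        }
        this.tasks = tasks;
        this.coresPerTask = coresPerTask;
        this.tasksPerNode = tasksPerNode;
    }

    // Total number of cores the job needs.
    int totalCores() {
        return tasks * coresPerTask;
    }

    // Nodes implied by the request, or -1 when left to the scheduler.
    int nodes() {
        if (tasksPerNode == 0) {
            return -1;
        }
        return (tasks + tasksPerNode - 1) / tasksPerNode; // ceiling division
    }
}
```

In this sketch a sequential job is `new TaskRequest(1, 1, 0)` and a multithreaded job on four cores is `new TaskRequest(1, 4, 0)`; only multi-task jobs need to say anything about nodes.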
Implemented in a27b63f, which passes all unit and integration tests. Not entirely sure about the mapping in SGE yet. We need a multi node, multi core cluster setup to test this.
There is some recurring confusion about the semantics of the node, process and thread count in the `JobDescription`. See for example xenon-middleware/xenon-cli#63, xenon-middleware/xenon-cli#57 and #206.
Currently we have:
This filters through to xenon-cli, which has command options to set these values.
After some discussion we came to the following command line options for the cli:
and for starting the processes -one- of the following options:
All options are optional. If no values are set, the default is used. This leads to the following behavior:
- `--cores-per-node 2`: you will get 1 node, 2 cores, 1 executable started
- `--cores-per-node 2 -start-per-core`: you will get 1 node, 2 cores, 2 executables started
- `--nodes 2`: you will get 2 nodes, 1 core each, 1 executable started on the first node
- `--nodes 2 --cores-per-node 2`: you will get 2 nodes, 2 cores each, 1 executable started on the first node
- `--nodes 2 --cores-per-node 2 -start-per-node`: you will get 2 nodes, 2 cores each, 1 executable started on each node (2 in total)
- `--nodes 2 --cores-per-node 2 -start-per-core`: you will get 2 nodes, 2 cores each, 1 executable started on each core (4 in total)

This approach is slightly less flexible than the previous one, as it is not possible to directly express starting a job on 4 nodes with 4 processes per node and 4 threads per process (for running a mixed MPI/OpenMP job, for example). However, just starting 4 nodes with 16 cores each will probably give you the same result.
For the `JobDescription` this would result in `processesPerNode` being renamed to `coresPerNode`, `threadsPerProcess` disappearing, and `startSingleProcess` turning into some enum.

Any comments?
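To make the proposal concrete, here is a minimal sketch of what the reworked description could look like. It is not the actual Xenon `JobDescription`: the class name `JobDescriptionSketch`, the `StartMode` enum and its constants, and `executablesStarted()` are hypothetical. The defaults (1 node, 1 core per node, one executable on the first node) are inferred from the behaviour listed in this issue.

```java
// Hypothetical enum replacing the startSingleProcess boolean.
enum StartMode { ONCE, PER_NODE, PER_CORE }

// Illustrative sketch only; not the actual Xenon JobDescription.
final class JobDescriptionSketch {
    private int nodeCount = 1;                    // default: 1 node
    private int coresPerNode = 1;                 // default: 1 core per node
    private StartMode startMode = StartMode.ONCE; // default: start once

    JobDescriptionSketch setNodeCount(int nodeCount) {
        if (nodeCount < 1) {
            throw new IllegalArgumentException("nodeCount must be >= 1");
        }
        this.nodeCount = nodeCount;
        return this;
    }

    JobDescriptionSketch setCoresPerNode(int coresPerNode) {
        if (coresPerNode < 1) {
            throw new IllegalArgumentException("coresPerNode must be >= 1");
        }
        this.coresPerNode = coresPerNode;
        return this;
    }

    JobDescriptionSketch setStartMode(StartMode startMode) {
        if (startMode == null) {
            throw new IllegalArgumentException("startMode must not be null");
        }
        this.startMode = startMode;
        return this;
    }

    // Number of executables started for this description.
    int executablesStarted() {
        switch (startMode) {
            case PER_NODE:
                return nodeCount;
            case PER_CORE:
                return nodeCount * coresPerNode;
            default:
                return 1; // ONCE: a single executable on the first node
        }
    }
}
```

For example, `new JobDescriptionSketch().setNodeCount(2).setCoresPerNode(2).setStartMode(StartMode.PER_CORE)` corresponds to the last case in the list above and starts 4 executables.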