MPI part 2 #76
Conversation
Great, really liked the timing graphs!
> # Lightening the load
>
> What if we relax the problem a little? Suppose instead of all processes
> needing all of the results back, only process zero needs the assembled
I was wondering why all the processes needed all the data!
Yes - they never really explained their reasons for that...!
> recv_ptr <- new_data(recv_data) # on rank 0
>
> mpi_time <- -(get_mpi_time())
> mpi_gather_to_zero(send_ptr, recv_ptr) # Gather in recv_ptr
Here, is it the case that no data is actually gathered into the recv_ptr
for the non-rank-0 processes because they aren't valid pointers to the type of data required? Do you always need to pass a pointer to 0 for the non-gathering processes?
It's a funny business really. Yes, for the non-zero nodes the recv_ptr arg is ignored. We need all the processes to reach the MPI_Gather call together, and we have to supply some recv_ptr anyway just to make that line compile. I think we could alternatively say something like this if we wanted:
`mpi_gather_to_zero(send_ptr, (rank == 0) ? recv_ptr : (void *) 0)`
which is both a bit clearer, and slightly messier at the same time...! We are sort of trusting MPI_Gather to not do anything with the invalid thing we send, if we are a non-gathering node.
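For what it's worth, that trust is backed by the standard: the receive arguments to MPI_Gather are significant only at the root, so non-root ranks can legitimately pass a null pointer and allocate nothing. A minimal sketch in plain MPI C (not the blog's R wrappers):

```c
/* Minimal plain-MPI sketch: recvbuf is significant only at the root,
 * so non-root ranks can pass NULL and allocate nothing. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int chunk = 1000;                      /* values produced per rank */
    double *send = malloc(chunk * sizeof(double));
    for (int i = 0; i < chunk; i++) send[i] = rank + i * 1e-3;

    /* Only rank 0 allocates the full receive buffer. */
    double *recv = (rank == 0)
        ? malloc((size_t) size * chunk * sizeof(double))
        : NULL;

    /* recv, recvcount and recvtype are ignored everywhere except rank 0. */
    MPI_Gather(send, chunk, MPI_DOUBLE,
               recv, chunk, MPI_DOUBLE,
               0, MPI_COMM_WORLD);

    free(send);
    free(recv);                                  /* free(NULL) is a no-op */
    MPI_Finalize();
    return 0;
}
```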
content/blog/r-mpi-part-two.md (Outdated)
> Lastly, note the run on my desktop is a bit quicker than the cluster job. But
> remember here I'm making poor use of the cluster nodes, only giving them one
> process (one core) each. HPC cluster nodes don't really offer you benefit in
> speed for single core; it's the extra RAM and extra cores that give you the
Might be nice to mention one or two more realistic examples which would make better use of cluster nodes.
I've tried to rewrite this bit...
Co-authored-by: EmmaLRussell <[email protected]>
It was a lovely read; I'm quite new to this kind of stuff, but your explanations were clear! This kind of data exchange with pointers between R and C++ is pretty crazy too. Looking forward to seeing how much more optimised it can get when you use the cores in the nodes too!
A couple of very belated comments here. I'm still a bit unenlightened, I'm afraid!
content/blog/r-mpi-part-two.md (Outdated)
> # MPI Communication - Worst Case
>
> Now we're ready to write some very naive code, in which a number of
I get lost through here working out what we're trying to achieve, I'm afraid.
I think I've improved this a bit?
content/blog/r-mpi-part-two.md (Outdated)
> of parallelisation will make that faster. Further, we're not really doing
> that much work in the loop, so the sequential part of getting the memory
> is a significant chunk of the total time. That limits how much speed-up
> we could ever achieve.
I believe this also gets into "strong scaling" vs "weak scaling" ideas
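Not something from the post, but to put a rough number on that idea: Amdahl's law says that if a fraction s of the runtime stays sequential (here, allocating the big buffer), then n processes can never give a speedup better than 1 / (s + (1 - s) / n), which is capped at 1 / s however many cores you add. Strong scaling is this fixed-problem-size picture; weak scaling grows the problem with the core count instead. A toy calculation with a made-up 20% sequential fraction:

```c
/* Toy Amdahl's-law calculation (made-up numbers, not measurements from
 * the post): with sequential fraction s, the best possible speedup on
 * n processes is 1 / (s + (1 - s) / n), which can never exceed 1 / s. */
#include <stdio.h>

int main(void) {
    const double s = 0.20;                       /* assume 20% sequential */
    for (int n = 1; n <= 64; n *= 2) {
        double speedup = 1.0 / (s + (1.0 - s) / n);
        printf("n = %2d  max speedup = %.2f\n", n, speedup);
    }
    return 0;                                    /* bound is 1 / 0.2 = 5x */
}
```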
> What if we relax the problem a little? Suppose instead of all processes
> needing all of the results back, only process zero needs the assembled
> bulk. All the other processes could then only allocate memory for the
> data they create, and contribute just that to the MPI call.
Am I right that this is the thought process as we shift from shared memory to message passing as the paradigm? It might be worth linking to the Rust book chapter on this?
I wasn't really going for that intentionally. It was more that I just started with a "worst case" for performance, where every node wants to have all the data from every node - for some reason. (The global sim happens to want to do this with certain bits of the algorithm.)
What I'm really aiming for is the more basic idea that throwing more cores/nodes at a job won't necessarily make it faster, if the comms cost (and also memory usage) scales up with the number of nodes - so the graphs aren't very exciting.
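For what it's worth, my mental model of that worst case in plain MPI C is an MPI_Allgather, where every rank allocates the full receive buffer. That's an assumption about the shape of the code rather than a copy of the post's wrappers, but it shows why both the memory and the comms cost grow with the number of processes:

```c
/* Rough sketch of the "worst case" exchange (every rank ends up holding
 * every rank's chunk), illustrated with MPI_Allgather -- an assumption
 * about the shape of the code, not a copy of the post's wrappers. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int chunk = 1000;
    double *send = malloc(chunk * sizeof(double));
    for (int i = 0; i < chunk; i++) send[i] = rank + i * 1e-3;

    /* Every rank allocates size * chunk doubles: both the memory footprint
     * and the communication volume grow with the number of processes. */
    double *recv = malloc((size_t) size * chunk * sizeof(double));

    double t = -MPI_Wtime();                     /* same timing idiom as the post */
    MPI_Allgather(send, chunk, MPI_DOUBLE,
                  recv, chunk, MPI_DOUBLE,
                  MPI_COMM_WORLD);
    t += MPI_Wtime();
    if (rank == 0) printf("allgather took %f s\n", t);

    free(send);
    free(recv);
    MPI_Finalize();
    return 0;
}
```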