
MPI part 2 #76


Open · wants to merge 7 commits into base: source

Conversation

weshinsley (Contributor)

No description provided.

@weshinsley requested a review from a team · June 29, 2023 13:49

@EmmaLRussell (Contributor) left a comment


Great, really liked the timing graphs!

# Lightening the load

What if we relax the problem a little? Suppose instead of all processes
needing all of the results back, only process zero needs the assembled bulk.

Contributor

I was wondering why all the processes needed all the data!

Contributor Author

Yes - they never really explained their reasons for that...!

recv_ptr <- new_data(recv_data) # on rank 0

mpi_time <- -(get_mpi_time())
mpi_gather_to_zero(send_ptr, recv_ptr) # Gather in recv_ptr

Contributor

Here, is it the case that no data is actually gathered into the recv_ptr for the non-rank-0 processes because they aren't valid pointers to the type of data required? Do you always need to pass a pointer to 0 for the non-gathering processes?

@weshinsley (Contributor Author) · Jul 6, 2023

It's a funny business really. Yes, for the non-zero nodes, the recv_ptr arg is ignored. We need all the processes to reach the MPI_Gather command together, and we basically need to supply a recv_ptr anyway to make that line compile syntactically. I think we could alternatively say something like this if we wanted :-

mpi_gather_to_zero(send_ptr, (rank == 0) ? recv_ptr : (void *) 0)

which is both a bit clearer, and slightly messier at the same time...! We are sort of trusting MPI_Gather to not do anything with the invalid thing we send, if we are a non-gathering node.
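
For anyone reading along, here is a minimal standalone sketch of that pattern in plain C MPI. The variable names (`chunk`, `all`) and sizes are purely illustrative, not taken from the post; the point is that the receive arguments of `MPI_Gather` are only significant at the root, so the non-gathering ranks can legally pass `NULL`:

```c
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[]) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int n = 4;                 /* elements contributed per rank */
    double chunk[4];
    for (int i = 0; i < n; i++) {
        chunk[i] = rank * 100.0 + i; /* some per-rank data */
    }

    /* Only rank 0 allocates the full receive buffer; everyone else
       passes NULL, which MPI_Gather ignores on non-root ranks. */
    double *all = (rank == 0) ? malloc(sizeof(double) * n * size) : NULL;

    MPI_Gather(chunk, n, MPI_DOUBLE,
               all, n, MPI_DOUBLE,
               0, MPI_COMM_WORLD);

    if (rank == 0) {
        printf("Gathered %d values on rank 0\n", n * size);
        free(all);
    }

    MPI_Finalize();
    return 0;
}
```

Every rank still has to reach the collective call, as the comment above says; only the root's receive arguments are ever read.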

Lastly, note the run on my desktop is a bit quicker than the cluster job. But
remember here I'm making poor use of the cluster nodes, only giving them one
process (one core) each. HPC cluster nodes don't really offer you a benefit in
single-core speed; it's the extra RAM and extra cores that give you the

Contributor

Might be nice to mention one or two more realistic examples which would make better use of cluster nodes.

Contributor Author

I've tried to rewrite this bit...

@M-Kusumgar (Contributor) left a comment


It was a lovely read; I am quite new to this kind of stuff, but your explanations were clear! This type of data exchange with pointers between R and C++ is pretty crazy too. Looking forward to how much more optimised it can get when you use the cores in the nodes too!

@weshinsley requested a review from richfitz · September 14, 2023 10:43

@richfitz (Member) left a comment


A couple of very belated comments here. I'm still a bit unenlightened, I'm afraid!


# MPI Communication - Worst Case

Now we're ready to write some very naive code, in which a number of

Member

I get lost through here working out what we're trying to achieve, I'm afraid.

Contributor Author

I think I've improved this a bit?

of parallelisation will make that faster. Further, we're not really doing
that much work in the loop, so the sequential part of getting the memory
is a significant chunk of the total time. That limits how much speed-up
we could ever achieve.

Member

I believe this also gets into "strong scaling" vs "weak scaling" ideas
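
Since strong vs weak scaling has come up, a minimal statement of the strong-scaling bound being alluded to (Amdahl's law), writing $s$ for the serial fraction of the runtime and $p$ for the number of processes, is

$$\text{speed-up}(p) = \frac{1}{s + (1 - s)/p} \le \frac{1}{s},$$

whereas weak scaling (Gustafson's framing) grows the problem size with $p$ instead of holding it fixed.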

What if we relax the problem a little? Suppose instead of all processes
needing all of the results back, only process zero needs the assembled
bulk. All the other processes could then only allocate memory for the
data they create, and contribute just that to the MPI call.

Member

Am I right that this is the thought process as we shift from shared memory to message passing as the paradigm? It might be worth linking the Rust book chapter on this?

@weshinsley (Contributor Author) · Oct 2, 2023

I wasn't very intentionally going for that. It was more that I just started with a "worst case" on performance, where every node wants to have all the data from every node - for some reason. (The global sim happens to want to do this with certain bits of the algorithm)

What I'm really aiming for is the more basic idea that throwing more cores/nodes at a job won't necessarily make it faster, if the comms cost (and also memory usage) scales up with the number of nodes - so the graphs aren't very exciting.
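
To put a rough number on that scaling (a back-of-envelope model, not figures from the post): if each of $p$ processes produces $m$ bytes and every process wants the full result back (the all-gather worst case), then naively

$$\text{memory per rank} \approx p\,m, \qquad \text{total bytes moved} \approx p\,(p - 1)\,m,$$

so memory per rank grows linearly and total traffic roughly quadratically with $p$, even though the useful work per rank stays fixed. Real collectives are cleverer than this naive model, but the trend is why adding nodes alone stops helping.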
