Failure when testing on 2-nodes #299

yjhong89 opened this issue Mar 31, 2025 · 2 comments

yjhong89 commented Mar 31, 2025

Hi!

I am currently testing on two H100 nodes (each node has 8 GPUs).
The following error occurs, and it may be related to the SSH connection.
Is it possible to change the port, or is there something else I can try?

  • On a single node, nccl-tests works fine.
root@yj-videogen-multinode-job-master-0:/workspace/nccl-tests# mpirun --allow-run-as-root -x NCCL_SHM_DISABLED=1 -x NCCL_DEBUG=INFO -np 16 -N 8 -H hostfile ./build/all_reduce_perf -b 8 -e 1G -f 2
ssh: connect to host (worker_ip) port 22: Connection refused
--------------------------------------------------------------------------
PRTE has lost communication with a remote daemon.

  HNP daemon   : [prterun-yj-videogen-multinode-job-master-0-34@0,0] on node yj-videogen-multinode-job-master-0
  Remote daemon: [prterun-yj-videogen-multinode-job-master-0-34@0,1] on node (worker_ip)

This is usually due to either a failure of the TCP network
connection to the node, or possibly an internal failure of
the daemon itself. We cannot recover from this failure, and
therefore will terminate the job.
--------------------------------------------------------------------------

kiskra-nvidia (Member) commented

Your MPI installation does not work correctly in a multi-node setting. You need to fix that before you try running the NCCL tests. I'm guessing you'll see the same issue with an MPI "hello world"-type program, or even something as simple as:

mpirun --allow-run-as-root -x NCCL_SHM_DISABLED=1 -x NCCL_DEBUG=INFO -np 16 -N 8 -H hostfile hostname
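
Since the log above shows `ssh: connect to host (worker_ip) port 22: Connection refused`, it may also help to confirm that every host accepts TCP connections on the SSH port before involving mpirun at all. Below is a minimal sketch, assuming bash with `/dev/tcp` support, the coreutils `timeout` command, and a hostfile with one host per line (optionally followed by `slots=N`). The helper names `hosts_from_file`, `check_ssh_port`, and `report_hosts` are hypothetical and not part of nccl-tests or Open MPI.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for illustration; not part of nccl-tests or Open MPI.

# Print the first field (the host) of every hostfile line that is
# neither blank nor a comment.
hosts_from_file() {
    awk '!/^[ \t]*(#|$)/ { print $1 }' "$1"
}

# Return 0 if a TCP connection to host:port succeeds within 3 seconds.
# The port defaults to 22, the one named in the error message.
check_ssh_port() {
    local host="$1" port="${2:-22}"
    timeout 3 bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$host" "$port" 2>/dev/null
}

# Print one status line per host listed in the hostfile.
report_hosts() {
    local hostfile="$1" port="${2:-22}" host
    for host in $(hosts_from_file "$hostfile"); do
        if check_ssh_port "$host" "$port"; then
            echo "$host: port $port open"
        else
            echo "$host: port $port unreachable"
        fi
    done
}
```

After sourcing the file, `report_hosts hostfile 22` prints one line per host. A host reported as unreachable most likely has no SSH daemon listening on that port (common in container images), which would have to be fixed before any multi-node mpirun launch can work.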


AddyLaddy (Collaborator) commented

My favorite "canary" test of an MPI installation is to download and compile a simple MPI program:

wget https://raw.githubusercontent.com/pmodels/mpich/main/examples/cpi.c
mpicc -o cpi cpi.c

This does more than just a "Hello World!" printf: it demonstrates that basic MPI collective calls are functional across the job.
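
The commands above stop at compilation. As a rough sketch of the launch step, the wrapper below reuses the process counts and host argument from the failing command earlier in this thread. Those flags are copied from that command, not verified, so adjust the host specification to your own setup. `run_cpi` and the `MPIRUN` override are hypothetical names added here so the command line can be dry-run without MPI.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper for illustration. Set MPIRUN=echo to print the
# arguments instead of launching anything.
MPIRUN="${MPIRUN:-mpirun}"

# Launch ./cpi with a total rank count, ranks per node, and a host argument.
# The flags mirror the mpirun invocation quoted earlier in this thread.
run_cpi() {
    local np="$1" per_node="$2" hosts="$3"
    "$MPIRUN" --allow-run-as-root -np "$np" -N "$per_node" -H "$hosts" ./cpi
}
```

For example, `run_cpi 16 8 hostfile` repeats the two-node layout from the original report. If this launch hits the same SSH error, the problem lies in the MPI and SSH setup and not in NCCL.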
