Description
I'm testing roc_shmem on two nodes on Frontier. If my code calls MPI_Init and then roc_shmem_init, I get the error message below.
Assertion failed in file ../src/mpid/ch4/src/ch4_impl.h at line 128: *(&incomplete) >= 0
/opt/cray/pe/lib64/libmpi_cray.so.12(MPL_backtrace_show+0x26) [0x7fffeca719ab]
/opt/cray/pe/lib64/libmpi_cray.so.12(+0x1fedbf4) [0x7fffec4abbf4]
/opt/cray/pe/lib64/libmpi_cray.so.12(MPI_Iprobe+0x264e) [0x7fffeacb2c2e]
/autofs/nccs-svm1_home1/nanding/mysoftware/ROC_SHMEM/mytest/./mpi-based-init() [0x2705a0]
/autofs/nccs-svm1_home1/nanding/mysoftware/ROC_SHMEM/mytest/./mpi-based-init() [0x26b9f4]
/usr/lib64/libstdc++.so.6(+0xdca33) [0x7fffe8266a33]
/lib64/libpthread.so.0(+0xa6ea) [0x7fffed3a26ea]
/lib64/libc.so.6(clone+0x3f) [0x7fffe7eaca6f]
MPICH ERROR [Rank 1] [job id 1377452.0] [Wed Jul 12 15:39:36 2023] [frontier08573] - Abort(1): Internal error
srun: error: frontier08572: task 0: Segmentation fault
srun: Terminating StepId=1377452.0
The functional_test runs fine because that code does not call MPI_Init. Could you please advise how ROC_SHMEM can coexist with MPI? Maybe I am missing something in my code, shown below.
MPI_Init(&c, &v);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &nranks);
printf("Rank %d, MPI Init Done\n",rank);
fflush(stdout);
mype = roc_shmem_my_pe();
npes = roc_shmem_n_pes();
printf("ROC_SHMEM Rank %d / %d \n",mype,rank);
fflush(stdout);
char name[MPI_MAX_PROCESSOR_NAME];
int resultlength;
MPI_Get_processor_name(name, &resultlength);
// application picks the device each PE will use
hipGetDeviceCount(&ndevices);
hipSetDevice(rank%ndevices);
roc_shmem_init();
printf("(%d) ROC_SHMEM Init Done\n",mype);
fflush(stdout);