On some machines, tests take too much time to complete when using oversubscribe #902

Open
sanvila opened this issue Mar 27, 2025 · 9 comments

Comments

@sanvila

sanvila commented Mar 27, 2025

Hello. I reported this to Debian here:

https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1101363

On AWS instances of types c7a.large, m7a.large, r7a.large, which incidentally have 2 vCPUs, the Debian package for dbcsr used to take less than 4 minutes to build.

After I added PRTE_MCA_rmaps_default_mapping_policy=:oversubscribe, so that it also builds OK on systems with a single CPU, the build on systems with 2 CPUs now fails with a timeout, like this:

11:  **********************************************************************
11:   -- TESTING dbcsr_multiply (N, C,            5 , A, N, N) ............... PASSED !
11:  **********************************************************************
11:  test_name multiply_LIMITS_MIX_3
11:  The solution is CORRECT !
11:  **********************************************************************
11:   -- TESTING dbcsr_multiply (T, N,            5 , A, N, N) ............... PASSED !
11:  **********************************************************************
11/19 Test #11: dbcsr_unittest1 .......................................***Timeout 1500.01 sec
[...]
The following tests FAILED:
	 11 - dbcsr_unittest1 (Timeout)
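
(For context, the oversubscription policy mentioned above is set through an environment variable; below is a minimal sketch of how it can be exported before the test suite runs. The exact Debian build invocation may differ.)

# Hedged sketch: allow more MPI ranks than detected slots for the whole test run.
export PRTE_MCA_rmaps_default_mapping_policy=:oversubscribe
ctest --output-on-failure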

I tried increasing the timeout, like this:

--- a/tests/CMakeLists.txt
+++ b/tests/CMakeLists.txt
@@ -140,6 +140,7 @@ foreach (dbcsr_test ${DBCSR_TESTS_FTN})
   endif ()
   set_tests_properties(
     ${dbcsr_test} PROPERTIES ENVIRONMENT OMP_NUM_THREADS=${NUM_THREADS}
+                             TIMEOUT 3600
                              PROCESSORS ${test_processors})
 endforeach ()

but 3600 was not enough, and 7200 was not enough either (still timeouts), which makes me think that maybe the proper fix should be somewhere else.

Thanks.

@alazzaro
Member

I have no clue what the problem can be... We routinely run the same test on the CI (example) and it takes <1m. Actually, that specific test (dbcsr_unittest1) does not even use MPI... So, I would say the computation is slow... From the full log, I can see that it is actually running the operations up to:

11:  The solution is CORRECT !
11:  **********************************************************************
11:   -- TESTING dbcsr_multiply (T, T,            3 , N, S, N) ............... PASSED !
11:  **********************************************************************
11:  test_name multiply_LIMITS_MIX_3
11:  The solution is CORRECT !
11:  **********************************************************************
11:   -- TESTING dbcsr_multiply (T, C,            3 , N, S, N) ............... PASSED !
11:  **********************************************************************
11/19 Test #11: dbcsr_unittest1 .......................................***Timeout 1500.01 sec

I can see 2 actions here:

  • I can add some timers for each operation
  • Split the test into multiple tests

In any case, this doesn't explain why it is so slow in your run...
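
(For reference, that single test can be re-run on its own with plain ctest from the build directory; -R and --verbose are standard ctest options, and the build directory layout is assumed:)

# Run only the affected unit test and print its full output.
ctest -R dbcsr_unittest1 --verbose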

@alazzaro
Member

Could you test with a single rank and thread? (i.e. by changing the defaults?)

@sanvila
Author

sanvila commented Mar 27, 2025

> Could you test with a single rank and thread? (i.e. by changing the defaults?)

Sure, if you tell me what I should do (some change in tests/CMakeLists.txt I guess).

(I forgot to say: I'm not familiar with the package, I am just routinely rebuilding all 37000 Debian source packages for QA purposes).

@hfp
Member

hfp commented Mar 27, 2025

I don't know exactly what over-subscription in OpenMPI means, specifically whether CPU affinity is relaxed. Generally, MPI is not meant for "tasking", i.e. abusing ranks like ordinary processes. My experience with over-subscription in OpenMPI is that people use it because mpirun gave a warning about an unknown number of "slots" on the target system. This always played out badly for performance, likely because CPU affinity was not fully relaxed, i.e. no affinity at all.

@hfp
Member

hfp commented Mar 27, 2025

Using over-subscription "to move processes around as needed" without also saying "forget about affinity" is likely causing problems. MPI implementations try really hard to pin processes according to the "hardware geometry", which contradicts over-subscription.
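
(A hedged illustration of that distinction with Open MPI's mpirun, not taken from this thread; --oversubscribe and --bind-to are standard mpirun options, and the test binary name is only an example:)

# Oversubscribe while explicitly relaxing binding, so ranks are not pinned to cores.
mpirun --oversubscribe --bind-to none -np 2 ./dbcsr_unittest1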

@alazzaro
Member

> Could you test with a single rank and thread? (i.e. by changing the defaults?)
>
> Sure, if you tell me what I should do (some change in tests/CMakeLists.txt I guess).
>
> (I forgot to say: I'm not familiar with the package, I am just routinely rebuilding all 37000 Debian source packages for QA purposes).

Yes, this is one way. You can change the lines after:

set(TEST_MPI_RANKS
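
(For orientation, a sketch of the kind of cache variables defined near that line; the variable names come from this thread, but the exact defaults in tests/CMakeLists.txt may differ:)

# Illustrative defaults only; check tests/CMakeLists.txt for the real values.
set(TEST_MPI_RANKS "auto" CACHE STRING "Number of MPI ranks per test")
set(TEST_OMP_THREADS 2 CACHE STRING "Number of OpenMP threads per test")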

@sanvila
Author

sanvila commented Mar 29, 2025

Ok, I tried reducing both TEST_MPI_RANKS and TEST_OMP_THREADS from 2 to 1 and now the package takes the usual 4 minutes to build again (not several hours). On machines with 1 CPU it also takes 4 minutes.

What does this little experiment tell us about a potential fix? (hopefully one that works for everybody)

I don't know much about this. Is it really OK to use 2 ranks and 2 threads on a machine with only 2 vCPUs? (Is it the product, 2x2 = 4, that matters?)

Thanks.

@alazzaro
Member

> Ok, I tried reducing both TEST_MPI_RANKS and TEST_OMP_THREADS from 2 to 1 and now the package takes the usual 4 minutes to build again (not several hours). On machines with 1 CPU it also takes 4 minutes.
>
> What does this little experiment tell us about a potential fix? (hopefully one that works for everybody)
>
> I don't know much about this. Is it really OK to use 2 ranks and 2 threads on a machine with only 2 vCPUs? (Is it the product, 2x2 = 4, that matters?)
>
> Thanks.

For the moment, I would suggest setting the following during the cmake phase:

-DTEST_MPI_RANKS=1 \
-DTEST_OMP_THREADS=1 \
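
That is, a configure step roughly like the following (source/build paths and any other options are illustrative):

# Hedged example: limit the test suite to 1 MPI rank and 1 OpenMP thread.
cmake -DTEST_MPI_RANKS=1 -DTEST_OMP_THREADS=1 ..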

You can also try 2 threads (but definitely a single MPI rank).
Would that be possible? In the long run, we can change the test to be less compute-intensive...

@sanvila
Author

sanvila commented Mar 29, 2025

Ok, I'll probably try -DTEST_MPI_RANKS=1 as it seems to be the minimal change which makes it work here. Thanks.
