On some machines, tests take too much time to complete when using oversubscribe #902

Open
sanvila opened this issue Mar 27, 2025 · 9 comments

Comments

@sanvila

sanvila commented Mar 27, 2025

Hello. I reported this to Debian here:

https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1101363

On AWS instances of types c7a.large, m7a.large, r7a.large, which incidentally have 2 vCPUs, the Debian package for dbcsr used to take less than 4 minutes to build.

After I added PRTE_MCA_rmaps_default_mapping_policy=:oversubscribe, so that it also builds OK on systems with a single CPU, the build on systems with 2 CPUs now fails with a timeout, like this:

11:  **********************************************************************
11:   -- TESTING dbcsr_multiply (N, C,            5 , A, N, N) ............... PASSED !
11:  **********************************************************************
11:  test_name multiply_LIMITS_MIX_3
11:  The solution is CORRECT !
11:  **********************************************************************
11:   -- TESTING dbcsr_multiply (T, N,            5 , A, N, N) ............... PASSED !
11:  **********************************************************************
11/19 Test #11: dbcsr_unittest1 .......................................***Timeout 1500.01 sec
[...]
The following tests FAILED:
	 11 - dbcsr_unittest1 (Timeout)
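
(For context, the oversubscription policy mentioned above is set through an environment variable; below is a minimal sketch of how it can be exported before the test suite runs. The exact Debian build invocation may differ.)

# Hedged sketch: allow more MPI ranks than detected slots for the whole test run.
export PRTE_MCA_rmaps_default_mapping_policy=:oversubscribe
ctest --output-on-failure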

I tried increasing the timeout, like this:

--- a/tests/CMakeLists.txt
+++ b/tests/CMakeLists.txt
@@ -140,6 +140,7 @@ foreach (dbcsr_test ${DBCSR_TESTS_FTN})
   endif ()
   set_tests_properties(
     ${dbcsr_test} PROPERTIES ENVIRONMENT OMP_NUM_THREADS=${NUM_THREADS}
+                             TIMEOUT 3600
                              PROCESSORS ${test_processors})
 endforeach ()

but 3600 was not enough, and 7200 was not enough either (still timeouts), which makes me think that maybe the proper fix should be somewhere else.

Thanks.

@alazzaro
Member

I have no clue what the problem can be... We routinely run the same test on the CI (example) and it takes <1m. Actually, that specific test (dbcsr_unittest1) does not even use MPI... So, I would say the computation is slow... From the full log, I can see that it is actually running the operations up to:

11:  The solution is CORRECT !
11:  **********************************************************************
11:   -- TESTING dbcsr_multiply (T, T,            3 , N, S, N) ............... PASSED !
11:  **********************************************************************
11:  test_name multiply_LIMITS_MIX_3
11:  The solution is CORRECT !
11:  **********************************************************************
11:   -- TESTING dbcsr_multiply (T, C,            3 , N, S, N) ............... PASSED !
11:  **********************************************************************
11/19 Test #11: dbcsr_unittest1 .......................................***Timeout 1500.01 sec

I can see 2 actions here:

  • I can add some timers for each operation
  • Split the test into multiple tests

In any case, this doesn't explain why it is so slow in your run...
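
(For reference, that single test can be re-run on its own with plain ctest from the build directory; -R and --verbose are standard ctest options, and the build directory layout is assumed:)

# Run only the affected unit test and print its full output.
ctest -R dbcsr_unittest1 --verbose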

@alazzaro
Member

Could you test with a single rank and thread? (i.e. by changing the defaults?)

@sanvila
Author

sanvila commented Mar 27, 2025

> Could you test with a single rank and thread? (i.e. by changing the defaults?)

Sure, if you tell me what I should do (some change in tests/CMakeLists.txt I guess).

(I forgot to say: I'm not familiar with the package, I am just routinely rebuilding all 37000 Debian source packages for QA purposes).

@hfp
Member

hfp commented Mar 27, 2025

I don't know exactly what over-subscription in OpenMPI means, specifically whether CPU affinity is relaxed. Generally, MPI is not meant for "tasking", i.e. abusing ranks like ordinary processes. My experience with over-subscription in OpenMPI is that people use it because mpirun gave a warning about an unknown number of "slots" on the target system. This always played out badly for performance, likely because CPU affinity was not fully relaxed, i.e. no affinity at all.

@hfp
Member

hfp commented Mar 27, 2025

Using over-subscription "to move processes around as needed" without also saying "forget about affinity" is likely causing problems. MPI implementations try really hard to pin processes according to the "hardware geometry", which contradicts over-subscription.
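
(A hedged illustration of that distinction with Open MPI's mpirun, not taken from this thread; --oversubscribe and --bind-to are standard mpirun options, and the test binary name is only an example:)

# Oversubscribe while explicitly relaxing binding, so ranks are not pinned to cores.
mpirun --oversubscribe --bind-to none -np 2 ./dbcsr_unittest1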

@alazzaro
Member

> Could you test with a single rank and thread? (i.e. by changing the defaults?)
>
> Sure, if you tell me what I should do (some change in tests/CMakeLists.txt I guess).
>
> (I forgot to say: I'm not familiar with the package, I am just routinely rebuilding all 37000 Debian source packages for QA purposes).

Yes, this is one way. You can change the lines after:

set(TEST_MPI_RANKS
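
(For orientation, a sketch of the kind of cache variables defined near that line; the variable names come from this thread, but the exact defaults in tests/CMakeLists.txt may differ:)

# Illustrative defaults only; check tests/CMakeLists.txt for the real values.
set(TEST_MPI_RANKS "auto" CACHE STRING "Number of MPI ranks per test")
set(TEST_OMP_THREADS 2 CACHE STRING "Number of OpenMP threads per test")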

@sanvila
Author

sanvila commented Mar 29, 2025

Ok, I tried reducing both TEST_MPI_RANKS and TEST_OMP_THREADS from 2 to 1 and now the package takes the usual 4 minutes to build again (not several hours). On machines with 1 CPU it also takes 4 minutes.

What does this little experiment tell us about a potential fix? (hopefully one that works for everybody)

I don't know much about this. Is it really OK to use 2 ranks and 2 threads on a machine with only 2 vCPUs? (Is it the product, 2x2 = 4, that matters?)

Thanks.

@alazzaro
Member

> Ok, I tried reducing both TEST_MPI_RANKS and TEST_OMP_THREADS from 2 to 1 and now the package takes the usual 4 minutes to build again (not several hours). On machines with 1 CPU it also takes 4 minutes.
>
> What does this little experiment tell us about a potential fix? (hopefully one that works for everybody)
>
> I don't know much about this. Is it really OK to use 2 ranks and 2 threads on a machine with only 2 vCPUs? (Is it the product, 2x2 = 4, that matters?)
>
> Thanks.

For the moment, I would suggest setting the following during the cmake phase:

-DTEST_MPI_RANKS=1 \
-DTEST_OMP_THREADS=1 \
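
That is, a configure step roughly like the following (source/build paths and any other options are illustrative):

# Hedged example: limit the test suite to 1 MPI rank and 1 OpenMP thread.
cmake -DTEST_MPI_RANKS=1 -DTEST_OMP_THREADS=1 ..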

You can also try 2 threads (but definitely a single MPI rank).
Would that be possible? In the long run, we can change the test to be less compute-intensive...

@sanvila
Author

sanvila commented Mar 29, 2025

Ok, I'll probably try -DTEST_MPI_RANKS=1 as it seems to be the minimal change which makes it work here. Thanks.
