This repository was archived by the owner on Nov 17, 2023. It is now read-only.
CentOS GPU tests failing in master #16951
Closed
Description
CentOS GPU tests are failing in master.
I couldn't reproduce this on a p3 instance running Ubuntu 18.04. Trying the CI AMI now.
This seems to be a problem in the base AMI. I reproduced it by running the following commands:
time ci/build.py --docker-registry mxnetci --platform centos7_gpu --docker-build-retries 3 --shm-size 500m /work/runtime_functions.sh build_centos7_gpu
time ci/build.py --docker-registry mxnetci --nvidiadocker --platform centos7_gpu --docker-build-retries 3 --shm-size 500m /work/runtime_functions.sh unittest_centos7_gpu
The failure is:
[07:03:53] src/nnvm/legacy_json_util.cc:217: Symbol successfully upgraded!
terminate called after throwing an instance of 'dmlc::Error'
what(): [07:03:59] /work/mxnet/3rdparty/mshadow/mshadow/./stream_gpu-inl.h:107: Check failed: err == CUBLAS_STATUS_SUCCESS (7 vs. 0) : Destory cublas handle failed
Stack trace:
[bt] (0) /work/mxnet/python/mxnet/../../lib/libmxnet.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x2b) [0x7f0376aa865b]
[bt] (1) /work/mxnet/python/mxnet/../../lib/libmxnet.so(void mshadow::DeleteStream<mshadow::gpu>(mshadow::Stream<mshadow::gpu>*)+0x227) [0x7f037aa308e7]
[bt] (2) /work/mxnet/python/mxnet/../../lib/libmxnet.so(mshadow::Stream<mshadow::gpu>* mshadow::NewStream<mshadow::gpu>(bool, bool, int)+0x244) [0x7f037aa30e14]
[bt] (3) /work/mxnet/python/mxnet/../../lib/libmxnet.so(void mxnet::engine::ThreadedEnginePerDevice::GPUWorker<(dmlc::ConcurrentQueueType)0>(mxnet::Context, bool, mxnet::engine::ThreadedEnginePerDevice::ThreadWorkerBlock<(dmlc::ConcurrentQueueType)0>*, std::shared_ptr<dmlc::ManualEvent> const&)+0x19f) [0x7f037aa513ef]
[bt] (4) /work/mxnet/python/mxnet/../../lib/libmxnet.so(std::_Function_handler<void (std::shared_ptr<dmlc::ManualEvent>), mxnet::engine::ThreadedEnginePerDevice::PushToExecute(mxnet::engine::OprBlock*, bool)::{lambda()#4}::operator()() const::{lambda(std::shared_ptr<dmlc::ManualEvent>)#1}>::_M_invoke(std::_Any_data const&, std::shared_ptr<dmlc::ManualEvent>)+0x46) [0x7f037aa51626]
[bt] (5) /work/mxnet/python/mxnet/../../lib/libmxnet.so(std::thread::_Impl<std::_Bind_simple<std::function<void (std::shared_ptr<dmlc::ManualEvent>)> (std::shared_ptr<dmlc::ManualEvent>)> >::_M_run()+0x44) [0x7f037aa3d1c4]
[bt] (6) /usr/lib64/libstdc++.so.6(+0xb5070) [0x7f03e2478070]
[bt] (7) /usr/lib64/libpthread.so.0(+0x7e65) [0x7f03f4f92e65]
[bt] (8) /usr/lib64/libc.so.6(clone+0x6d) [0x7f03f45b288d]
/work/runtime_functions.sh: line 1312: 6 Aborted (core dumped) python3.6 -m "nose" $NOSE_COVERAGE_ARGUMENTS $NOSE_TIMER_ARGUMENTS --with-xunit --xunit-file nosetests_gpu.xml --verbose tests/python/gpu
2019-11-30 07:03:59,955 - root - INFO - Waiting for status of container ea33d765417a for 600 s.
2019-11-30 07:04:00,117 - root - INFO - Container exit status: {'StatusCode': 134, 'Error': None}
2019-11-30 07:04:00,117 - root - ERROR - Container exited with an error 😞
2019-11-30 07:04:00,117 - root - INFO - Executed command for reproduction:
ci/build.py --docker-registry mxnetci --nvidiadocker --platform centos7_gpu --docker-build-retries 3 --shm-size 500m /work/runtime_functions.sh unittest_centos7_gpu
2019-11-30 07:04:00,117 - root - INFO - Stopping container: ea33d765417a
2019-11-30 07:04:00,119 - root - INFO - Removing container: ea33d765417a
2019-11-30 07:04:00,140 - root - CRITICAL - Execution of ['/work/runtime_functions.sh', 'unittest_centos7_gpu'] failed with status: 134
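A note on reading the log: the container exit status 134 is consistent with the "Aborted (core dumped)" line, since exit statuses above 128 conventionally mean 128 plus the signal number, and 134 - 128 = 6 is SIGABRT on Linux. That matches `terminate called` after the uncaught `dmlc::Error`. A minimal sketch of that decoding follows; the helper name `describe_exit_status` is hypothetical and not part of `ci/build.py`.

```python
import signal


def describe_exit_status(status_code):
    """Describe a process/container exit status (hypothetical helper).

    Assumes the common shell/Docker convention that a status above 128
    means the process was terminated by signal (status - 128).
    """
    if status_code == 0:
        return "success"
    if status_code > 128:
        signum = status_code - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # Signal number not known on this platform.
            name = "unknown signal"
        return f"killed by signal {signum} ({name})"
    return f"exited with code {status_code}"


if __name__ == "__main__":
    # Status taken from the log above.
    print(describe_exit_status(134))
```

So the 134 is only the symptom of the abort; the actual cause is the failed cuBLAS check in `stream_gpu-inl.h`.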
One solution would be to update the AMI.