This repository was archived by the owner on Nov 17, 2023. It is now read-only.

CentOS GPU tests failing in master #16951

Closed
@larroy

Description

CentOS GPU tests are failing in master:

http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Fcentos-gpu/detail/master/1341/

I couldn't reproduce this on a p3 instance running Ubuntu 18.04. Trying on the CI AMI now.

It seems to be a problem in the base AMI; I reproduced it by running the following commands:

time ci/build.py --docker-registry mxnetci --platform centos7_gpu --docker-build-retries 3 --shm-size 500m /work/runtime_functions.sh build_centos7_gpu
time ci/build.py --docker-registry mxnetci --nvidiadocker --platform centos7_gpu --docker-build-retries 3 --shm-size 500m /work/runtime_functions.sh unittest_centos7_gpu
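For scripting the reproduction, the two commands above can be wrapped in a small driver. This is only a sketch: the helper names `build_py_command` and `run_repro` are made up for illustration, every flag is copied from the commands above, and it assumes it is run from the root of an MXNet checkout where `ci/build.py` exists.

```python
import subprocess


def build_py_command(function_name, nvidiadocker=False):
    """Build the ci/build.py argument list for one runtime_functions.sh step."""
    cmd = ["ci/build.py", "--docker-registry", "mxnetci"]
    if nvidiadocker:
        # The commands above pass this flag only for the unit-test step.
        cmd.append("--nvidiadocker")
    cmd += [
        "--platform", "centos7_gpu",
        "--docker-build-retries", "3",
        "--shm-size", "500m",
        "/work/runtime_functions.sh", function_name,
    ]
    return cmd


def run_repro(runner=subprocess.run):
    """Run build, then tests; return the first non-zero exit code, else 0."""
    steps = [
        build_py_command("build_centos7_gpu"),
        build_py_command("unittest_centos7_gpu", nvidiadocker=True),
    ]
    for cmd in steps:
        result = runner(cmd)
        if result.returncode != 0:
            return result.returncode
    return 0
```

Calling `run_repro()` runs both steps in order and stops at the first failure, so a broken build is not mistaken for a test failure.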

Failure is:

[07:03:53] src/nnvm/legacy_json_util.cc:217: Symbol successfully upgraded!
terminate called after throwing an instance of 'dmlc::Error'
  what():  [07:03:59] /work/mxnet/3rdparty/mshadow/mshadow/./stream_gpu-inl.h:107: Check failed: err == CUBLAS_STATUS_SUCCESS (7 vs. 0) : Destory cublas handle failed
Stack trace:
  [bt] (0) /work/mxnet/python/mxnet/../../lib/libmxnet.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x2b) [0x7f0376aa865b]
  [bt] (1) /work/mxnet/python/mxnet/../../lib/libmxnet.so(void mshadow::DeleteStream<mshadow::gpu>(mshadow::Stream<mshadow::gpu>*)+0x227) [0x7f037aa308e7]
  [bt] (2) /work/mxnet/python/mxnet/../../lib/libmxnet.so(mshadow::Stream<mshadow::gpu>* mshadow::NewStream<mshadow::gpu>(bool, bool, int)+0x244) [0x7f037aa30e14]
  [bt] (3) /work/mxnet/python/mxnet/../../lib/libmxnet.so(void mxnet::engine::ThreadedEnginePerDevice::GPUWorker<(dmlc::ConcurrentQueueType)0>(mxnet::Context, bool, mxnet::engine::ThreadedEnginePerDevice::ThreadWorkerBlock<(dmlc::ConcurrentQueueType)0>*, std::shared_ptr<dmlc::ManualEvent> const&)+0x19f) [0x7f037aa513ef]
  [bt] (4) /work/mxnet/python/mxnet/../../lib/libmxnet.so(std::_Function_handler<void (std::shared_ptr<dmlc::ManualEvent>), mxnet::engine::ThreadedEnginePerDevice::PushToExecute(mxnet::engine::OprBlock*, bool)::{lambda()#4}::operator()() const::{lambda(std::shared_ptr<dmlc::ManualEvent>)#1}>::_M_invoke(std::_Any_data const&, std::shared_ptr<dmlc::ManualEvent>)+0x46) [0x7f037aa51626]
  [bt] (5) /work/mxnet/python/mxnet/../../lib/libmxnet.so(std::thread::_Impl<std::_Bind_simple<std::function<void (std::shared_ptr<dmlc::ManualEvent>)> (std::shared_ptr<dmlc::ManualEvent>)> >::_M_run()+0x44) [0x7f037aa3d1c4]
  [bt] (6) /usr/lib64/libstdc++.so.6(+0xb5070) [0x7f03e2478070]
  [bt] (7) /usr/lib64/libpthread.so.0(+0x7e65) [0x7f03f4f92e65]
  [bt] (8) /usr/lib64/libc.so.6(clone+0x6d) [0x7f03f45b288d]


/work/runtime_functions.sh: line 1312:     6 Aborted                 (core dumped) python3.6 -m "nose" $NOSE_COVERAGE_ARGUMENTS $NOSE_TIMER_ARGUMENTS --with-xunit --xunit-file nosetests_gpu.xml --verbose tests/python/gpu
2019-11-30 07:03:59,955 - root - INFO - Waiting for status of container ea33d765417a for 600 s.
2019-11-30 07:04:00,117 - root - INFO - Container exit status: {'StatusCode': 134, 'Error': None}
2019-11-30 07:04:00,117 - root - ERROR - Container exited with an error 😞
2019-11-30 07:04:00,117 - root - INFO - Executed command for reproduction:

ci/build.py --docker-registry mxnetci --nvidiadocker --platform centos7_gpu --docker-build-retries 3 --shm-size 500m /work/runtime_functions.sh unittest_centos7_gpu

2019-11-30 07:04:00,117 - root - INFO - Stopping container: ea33d765417a
2019-11-30 07:04:00,119 - root - INFO - Removing container: ea33d765417a
2019-11-30 07:04:00,140 - root - CRITICAL - Execution of ['/work/runtime_functions.sh', 'unittest_centos7_gpu'] failed with status: 134
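The two numbers in this log can be decoded with a few lines of Python. This is a sketch under stated assumptions: the helper names are hypothetical, the signal table uses Linux x86 numbering, and the cuBLAS table is a subset of the `cublasStatus_t` enum as I understand it, so it is worth checking against the CUDA headers installed in the container. Under those assumptions, status 134 is 128 + 6, i.e. the process died from SIGABRT (consistent with the "Aborted (core dumped)" line), and cuBLAS error 7 is `CUBLAS_STATUS_INVALID_VALUE`.

```python
import re

# Assumption: Linux x86 signal numbers (small subset).
LINUX_SIGNALS = {6: "SIGABRT", 9: "SIGKILL", 11: "SIGSEGV", 15: "SIGTERM"}

# Assumption: subset of cublasStatus_t; verify against the installed cuBLAS headers.
CUBLAS_STATUS = {
    0: "CUBLAS_STATUS_SUCCESS",
    1: "CUBLAS_STATUS_NOT_INITIALIZED",
    3: "CUBLAS_STATUS_ALLOC_FAILED",
    7: "CUBLAS_STATUS_INVALID_VALUE",
    8: "CUBLAS_STATUS_ARCH_MISMATCH",
    11: "CUBLAS_STATUS_MAPPING_ERROR",
    13: "CUBLAS_STATUS_EXECUTION_FAILED",
    14: "CUBLAS_STATUS_INTERNAL_ERROR",
}


def describe_exit_status(code):
    """Shell/Docker convention: an exit status of 128 + N means killed by signal N."""
    if code > 128:
        return LINUX_SIGNALS.get(code - 128, "signal %d" % (code - 128))
    return None


def cublas_status_from_log(line):
    """Extract the cuBLAS status name from an mshadow 'Check failed' log line."""
    match = re.search(r"err == CUBLAS_STATUS_SUCCESS \((\d+) vs\. 0\)", line)
    if match is None:
        return None
    return CUBLAS_STATUS.get(int(match.group(1)), "unknown")


log_line = ("Check failed: err == CUBLAS_STATUS_SUCCESS (7 vs. 0) : "
            "Destory cublas handle failed")
print(describe_exit_status(134))         # SIGABRT
print(cublas_status_from_log(log_line))  # CUBLAS_STATUS_INVALID_VALUE
```

Note that the abort is raised from `DeleteStream` called inside `NewStream` (frames 1 and 2 of the stack trace), so the destroy failure is reported while the stream is still being set up.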

One solution would be to update the AMI.
