TBlob bug about dltensor #15931
Description
TBlob does not disable or overload the default copy constructor/assignment, so the compiler-generated ones are used. This results in a shallow copy of `dltensor_` (a field of type DLTensor in TBlob, see here), leaving the copy with a dangling shape pointer.
Environment info (Required)
Python 3.7.3
Built from source (master at 5a4c01b)
Minimum reproducible example
To reproduce this error, I made a minor change to the function `NumpyDotForward` (in src/operator/numpy/np_dot-inl.h) for illustration.
Here is the function after my modification.
I modified one line and added two lines (marked by comments):
```cpp
template<typename xpu>
inline void NumpyDotForward(const nnvm::NodeAttrs& attrs,
                            const OpContext& ctx,
                            const std::vector<TBlob>& inputs,
                            const std::vector<OpReqType>& req,
                            const std::vector<TBlob>& outputs) {
  using namespace mshadow;
  using namespace mxnet_op;
  CHECK_EQ(inputs.size(), 2U);
  CHECK_EQ(outputs.size(), 1U);
  const TBlob& a = inputs[0];
  const TBlob& b = inputs[1];
  // const TBlob& out = outputs[0];
  TBlob out = outputs[0];  // ** changed by me **
  const mxnet::TShape a_shape = a.shape_;
  const mxnet::TShape b_shape = b.shape_;
  out = out.reshape(out.shape_);  // ** added by me **
  out = TBlob(out.dltensor());  // ** added by me **
  MSHADOW_REAL_TYPE_SWITCH(out.type_flag_, DType, {
    if (b_shape.ndim() < 3) {
      // Case 1, 2, 3, 4, 5: a is N-D array (N >= 1) and b is vector or matrix, sum product
      // over the last axis of a and the first axis of b
      TensordotIntAxesImpl<xpu>(1, ctx, a, b, out, req[0]);
    } else {
      // Case 3, 5.5: a is N-D array and b is M-D array (M > 2), sum product over the last axis
      // of a and the 2nd-to-last axis of b
      const Tuple<int> a_axes_summed({a_shape.ndim() - 1});
      const Tuple<int> b_axes_summed({b_shape.ndim() - 2});
      TensordotImpl<xpu>(a_axes_summed, b_axes_summed, ctx, a, b, out, req);
    }
  });
}
```
Steps to reproduce
- replace `NumpyDotForward` with the above one
- build
- run the following:

```python
from mxnet import np
a = np.array([[1, 2, 3], [4, 5, 6]])
b = np.array([[1, 1], [1, 1], [1, 1]])
np.dot(a, b)
```
The expected result is

```
array([[ 6.,  6.],
       [15., 15.]])
```

but the actual result is

```
array([[0., 0.],
       [0., 0.]])
```
Possible cause of this problem
`TBlob::dltensor_.shape` is a pointer. When TBlob `b` is assigned to TBlob `a`, the pointer is shallow-copied:

```cpp
a.dltensor_.shape = b.dltensor_.shape
```

But `b.dltensor_.shape` points to `b.shape_.data()`. So when `b` is a temporary (such as the return value of `TBlob::reshape()`), `b.shape_` is destroyed once the temporary goes out of scope, and `a.dltensor_.shape` is left pointing to invalid memory.
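The hazard can be demonstrated in isolation with a miniature of the TBlob/DLTensor relationship (the names below are hypothetical stand-ins, not the actual MXNet types): a struct keeps a non-owning pointer into its own storage, and the compiler-generated copy operations copy that pointer bit-for-bit, so the copy still aliases the source's buffer.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical miniature of TBlob's layout: `dl.shape` is meant to point
// into this object's own `shape_` storage, just as TBlob::dltensor_.shape
// points at TBlob::shape_.data().
struct MiniDLTensor {
  int64_t* shape;  // non-owning view into the owner's shape storage
};

struct MiniBlob {
  std::vector<int64_t> shape_;
  MiniDLTensor dl{};

  explicit MiniBlob(std::vector<int64_t> s) : shape_(std::move(s)) {
    dl.shape = shape_.data();  // analogous to TBlob::SetDLTensor
  }
  // No user-defined copy constructor/assignment: the compiler-generated
  // ones copy `dl.shape` verbatim, so a copy's pointer still refers to the
  // *source* object's buffer -- the shallow copy described above.
};

// Returns true if `dst`'s internal pointer aliases `src`'s storage.
bool copy_aliases_source(const MiniBlob& src, const MiniBlob& dst) {
  return dst.dl.shape == src.shape_.data();
}
```

If the source object is a temporary, its `shape_` vector is destroyed with it and the surviving copy's `dl.shape` dangles, which matches the garbage output seen in the repro.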
Quick fix (IMO)
- disable the default assignment operator and copy constructor (declare them private, or `= delete` them), or
- overload them and use `SetDLTensor` to avoid the shallow copy
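The second option can be sketched on the same hypothetical miniature type: user-defined copy operations re-derive the internal pointer from the copy's own storage, which is what TBlob could do by calling `SetDLTensor` in its copy constructor/assignment (names below are illustrative, not the actual MXNet code).

```cpp
#include <cstdint>
#include <vector>

struct MiniDLTensor {
  int64_t* shape;  // non-owning view into the owner's shape storage
};

struct FixedBlob {
  std::vector<int64_t> shape_;
  MiniDLTensor dl{};

  explicit FixedBlob(std::vector<int64_t> s) : shape_(std::move(s)) {
    ResetDL();
  }
  // User-defined copy operations: copy the owned data, then re-point the
  // internal view at *this* object's storage instead of the source's.
  FixedBlob(const FixedBlob& other) : shape_(other.shape_) {
    ResetDL();
  }
  FixedBlob& operator=(const FixedBlob& other) {
    shape_ = other.shape_;
    ResetDL();
    return *this;
  }

 private:
  void ResetDL() { dl.shape = shape_.data(); }  // analogous to SetDLTensor
};
```

With this, a copy remains valid even after the source is destroyed, since its pointer targets its own `shape_` buffer.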
Comments
This bug has nothing to do with np.dot. I just used it for illustration.
Thanks to @yzhliu @reminisce @haojin2 for their help.