This repository was archived by the owner on Nov 17, 2023. It is now read-only.

TBlob bug about dltensor #15931

Closed
@hzfan

Description

TBlob neither disables nor overloads the default copy constructor/assignment, so the compiler-generated ones can be used. This results in a shallow copy of dltensor_ (a field of type DLTensor in TBlob) and leaves the copy holding a dangling pointer into freed memory.
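For context, here is a simplified sketch of the relevant TBlob layout (the field names follow the MXNet source; everything else is abbreviated and assumed):

class TBlob {
 public:
  void* dptr_;           // pointer to the underlying data
  mxnet::TShape shape_;  // owns the dimension array
  int type_flag_;
 private:
  // Filled in by SetDLTensor(): dltensor_.shape is set to a raw
  // pointer into this object's own shape_ storage.
  DLTensor dltensor_;
};

The compiler-generated copy duplicates shape_ correctly but copies dltensor_ bit-for-bit, so the copy's dltensor_.shape still points into the source object.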

Environment info (Required)

Python 3.7.3
Built from source (master at 5a4c01b)

Minimal reproducible example

To reproduce this error, I made a minor change to the function NumpyDotForward (in src/operator/numpy/np_dot-inl.h) for illustration.

Here is the function after my modification.
I modified one line and added two lines (marked by comments):

template<typename xpu>
inline void NumpyDotForward(const nnvm::NodeAttrs& attrs,
                            const OpContext& ctx,
                            const std::vector<TBlob>& inputs,
                            const std::vector<OpReqType>& req,
                            const std::vector<TBlob>& outputs) {
  using namespace mshadow;
  using namespace mxnet_op;

  CHECK_EQ(inputs.size(), 2U);
  CHECK_EQ(outputs.size(), 1U);

  const TBlob& a = inputs[0];
  const TBlob& b = inputs[1];
  // const TBlob& out = outputs[0];
  TBlob out = outputs[0];  // ** changed by me **
  const mxnet::TShape a_shape = a.shape_;
  const mxnet::TShape b_shape = b.shape_;
  out = out.reshape(out.shape_);  // ** added by me **
  out = TBlob(out.dltensor());  // ** added by me **
  MSHADOW_REAL_TYPE_SWITCH(out.type_flag_, DType, {
    if (b_shape.ndim() < 3) {
      // Case 1, 2, 3, 4, 5: a is N-D array (N >= 1) and b is vector or matrix, sum product
      //        over the last axis of a and the first axis of b
      TensordotIntAxesImpl<xpu>(1, ctx, a, b, out, req[0]);
    } else {
      // Case 3, 5.5: a is N-D array and b is M-D array (M > 2), sum product over the last axis
      //         of a and the 2nd-to-last axis of b
      const Tuple<int> a_axes_summed({a_shape.ndim() - 1});
      const Tuple<int> b_axes_summed({b_shape.ndim() - 2});
      TensordotImpl<xpu>(a_axes_summed, b_axes_summed, ctx, a, b, out, req);
    }
  });
}

Steps to reproduce

  1. replace NumpyDotForward with the version above
  2. build MXNet
  3. run the following:
from mxnet import np
a = np.array([[1, 2, 3], [4, 5, 6]])
b = np.array([[1, 1], [1, 1], [1, 1]])
np.dot(a, b)

The expected result is

array([[ 6.,  6.],
       [15., 15.]])

But the actual result is

array([[0., 0.],
       [0., 0.]])

Possible cause of this problem

TBlob.dltensor_.shape is a pointer. When TBlob b is assigned to TBlob a, the pointer is shallow-copied:

a.dltensor_.shape = b.dltensor_.shape

But b.dltensor_.shape points to b.shape_.data(). So when b is a temporary (such as the return value of TBlob::reshape()), b.shape_.data() is destroyed once the temporary goes out of scope, and a.dltensor_.shape is left pointing to invalid memory.
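The same failure mode can be reproduced outside MXNet. Below is a minimal, self-contained analog; View and Blob are hypothetical stand-ins for DLTensor and TBlob:

#include <cstdint>
#include <iostream>
#include <vector>

struct View {                      // stands in for DLTensor
  const int64_t* shape = nullptr;  // non-owning pointer
};

struct Blob {                      // stands in for TBlob
  std::vector<int64_t> shape_;     // owning storage, like TShape
  View view_;                      // like dltensor_

  explicit Blob(std::vector<int64_t> s) : shape_(std::move(s)) {
    view_.shape = shape_.data();   // like SetDLTensor()
  }
  // No user-defined copy: the compiler-generated assignment deep-copies
  // shape_ but copies view_.shape verbatim, so it keeps pointing into
  // the source object's storage.
};

int main() {
  Blob a({1});
  {
    Blob tmp({2, 3});
    a = tmp;                       // a.view_.shape now aliases tmp.shape_
  }                                // tmp is destroyed here
  std::cout << a.view_.shape[0];   // undefined behavior: dangling read
  return 0;
}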

Quick fix (IMO)

  • disable the default assignment/copy constructor (declare them private, or = delete them), or
  • overload them and call SetDLTensor to avoid the shallow copy (see the sketch below)
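A sketch of what the second option could look like (this is not an actual patch; the SetDLTensor signature and the DLTensor ctx fields are assumptions based on my reading of the headers):

class TBlob {
 public:
  TBlob(const TBlob& other)
      : dptr_(other.dptr_),
        shape_(other.shape_),
        type_flag_(other.type_flag_) {
    // Rebuild dltensor_ from this object's own fields instead of
    // inheriting the source's shape pointer.
    SetDLTensor(other.dltensor_.ctx.device_type, other.dltensor_.ctx.device_id);
  }

  TBlob& operator=(const TBlob& other) {
    dptr_ = other.dptr_;
    shape_ = other.shape_;
    type_flag_ = other.type_flag_;
    SetDLTensor(other.dltensor_.ctx.device_type, other.dltensor_.ctx.device_id);
    return *this;
  }
  // ... remaining members unchanged ...
};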

Comments

This bug has nothing to do with np.dot. I just used it for illustration.

Thanks to @yzhliu, @reminisce, and @haojin2 for their help.

Metadata
Labels: Backend, Bug, Operator
