
[mxnet 2.0] [item 2.4] Turning on large tensor support by default #17331

Open

@apeforest

Description

Currently, MXNet only supports tensors with fewer than 2^31 elements. To support large tensors, users need to recompile MXNet with the USE_INT64_TENSOR_SIZE compiler flag set to ON.
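Whether a given binary was built with this flag can be checked at runtime. A minimal sketch using the mxnet.runtime module (available since MXNet 1.5), assuming the feature is exposed under the name INT64_TENSOR_SIZE:

```python
import mxnet as mx
from mxnet.runtime import Features

# Compile-time feature flags are queryable at runtime.
features = Features()
print(features.is_enabled('INT64_TENSOR_SIZE'))  # True only on an int64 build

# On an int64 build, arrays with more than 2**31 - 1 elements become possible:
# big = mx.nd.zeros((2**31 + 1,), dtype='int8')  # ~2 GB; requires int64 build
```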

Large tensors are often needed in applications such as recommender systems with sparse embedding matrices, and in graph neural network frameworks such as DGL.

To provide a better user experience, we would like to turn this compiler flag on by default so that the MXNet binary release supports large tensors out of the box.

RFC: https://lists.apache.org/thread.html/df53b8c26e9e0433378dd803baba9fec4dd922728a5ce9135dc164b3@%3Cdev.mxnet.apache.org%3E

Current Status:

Large tensor support is already implemented in the MXNet backend and C API. Over 80 operators have been tested, and more are being tested.

There was performance degradation in a few operators, such as transpose; this has been fixed (#16104).

Model Inference Performance

int64/int32 P50 is the 50th-percentile inference runtime of each build.
Diff (%) = (int32 P50 - int64 P50) / int32 P50 * 100, i.e., the runtime speedup of the int64 build relative to the int32 build. A positive value means inference time is reduced when using int64 tensor indices.

| Model | Mode | int64 P50 (ms) | int32 P50 (ms) | Diff (%) |
| --- | --- | --- | --- | --- |
| resnext101_64x4d | gluon | 47.34253883 | 49.46685 | 4.29% |
| resnext101_64x4d | module | 28.83672714 | 28.48792 | -1.22% |
| resnext50 | gluon | 17.14539528 | 18.05592 | 5.04% |
| resnext50 | module | 10.05506516 | 9.636641 | -4.34% |
| nin | gluon | 2.574443817 | 2.608061 | 1.29% |
| nin | module | 2.432107925 | 2.737761 | 11.16% |
| resnet18 | gluon | 3.895759583 | 3.638268 | -7.08% |
| resnet18 | module | 2.954959869 | 3.182888 | 7.16% |
| wavernn | gluon | 262.9389763 | 256.5546 | -2.49% |
| caffenet | gluon | 2.930879593 | 3.087759 | 5.08% |
| caffenet | module | 3.169536591 | 3.225327 | 1.73% |
| vgg19 | gluon | 14.18304443 | 13.89098 | -2.10% |
| vgg19 | module | 13.80157471 | 14.33492 | 3.72% |
| maskrcnn | gluon | 2340.852737 | 2391.741 | 2.13% |
| maskrcnn | module | 1943.515778 | 1926.38 | -0.89% |
| superres | gluon | 17.39168167 | 18.00895 | 3.43% |
| superres | module | 16.98470116 | 17.26198 | 1.61% |
| resnet101 | gluon | 18.73707771 | 18.4412 | -1.60% |
| resnet101 | module | 16.66593552 | 14.78386 | -12.73% |
| vgg16 | gluon | 12.403965 | 16.2611 | 23.72% |
| vgg16 | module | 17.93074608 | 11.83605 | -51.49% |
| yolov3 | gluon | 22.96686172 | 23.01311 | 0.20% |
| yolov3 | module | 18.57829094 | 20.05506 | 7.36% |
| ssd | gluon | 17.17400551 | 16.73698 | -2.61% |
| ssd | module | 13.98611069 | 14.00757 | 0.15% |
| rnn | gluon | 28.2740593 | 28.92017 | 2.23% |
| rnn | module | 19.32096481 | 28.63479 | 32.53% |
| a3c | gluon | 0.928401947 | 0.94223 | 1.47% |
| a3c | module | 0.673055649 | 0.858545 | 21.61% |
| squeezenetv10 | gluon | 4.072666168 | 4.251957 | 4.22% |
| squeezenetv10 | module | 3.686189651 | 3.818274 | 3.46% |
| resnet152 | gluon | 25.8705616 | 27.65441 | 6.45% |
| resnet152 | module | 20.5206871 | 21.03257 | 2.43% |
| resnet34 | gluon | 6.978273392 | 7.166862 | 2.63% |
| resnet34 | module | 5.693674088 | 5.653858 | -0.70% |
| squeezenetv11 | gluon | 3.037929535 | 3.165722 | 4.04% |
| squeezenetv11 | module | 2.890110016 | 2.983332 | 3.12% |
| resnext101 | gluon | 29.1929245 | 27.65107 | -5.58% |
| resnext101 | module | 15.9804821 | 17.51709 | 8.77% |
| bert | gluon | 44.32678223 | 43.77675 | -1.26% |
| bert | module | 43.85828972 | 45.38655 | 3.37% |
| resnet50 | gluon | 10.39171219 | 10.31256 | -0.77% |
| resnet50 | module | 9.351491928 | 8.312941 | -12.49% |
| fasterrcnn | gluon | 1041.807413 | 1061.532 | 1.86% |
| fasterrcnn | module | 702.3141384 | 703.7232 | 0.20% |
| inception | gluon | 7.934331894 | 8.714437 | 8.95% |
| inception | module | 5.178928375 | 5.363703 | 3.44% |
| Average | gluon | n/a | n/a | 0.69% |
| Average | module | n/a | n/a | -0.37% |

Model Training Performance

Percentage Change = (int64 - int32) / int32 * 100; a positive value means higher throughput with the int64 build.

| Model | int64 Samples/Second | int32 Samples/Second | Percentage Change |
| --- | --- | --- | --- |
| xception | 67.51961 | 68.61849 | -1.60% |
| resnet50_v2 | 299.0174 | 299.1728 | -0.05% |
| gnmt | 7.65 | 7.675 | -0.33% |
| vgg16 | 228.4218 | 230.0739 | -0.72% |
| bert | 38.1 | 46.7 | -18.42% |
| yolo3_darknet53_custom | 31.6145 | 40.65 | -22.23% |
| inceptionv3 | 225.4025 | 227.1884 | -0.79% |
| se_resnet152_v1 | 123.7371 | 124.1493 | -0.33% |
| word_language_model | 15651.19 | 15524.71 | 0.81% |
| mobilenet0.25_cifar10* | 56.6609205 | 60.5992765 | 6.50% |
| resnet101_v1 | 176.6355 | 177.3132 | -0.38% |
| squeezenet1.0 | 790.7722 | 790.1395 | 0.08% |
| mobilenetv2_0.75 | 680.4143 | 672.2202 | 1.22% |
| ssd | 66.2365 | 67.56 | -1.96% |
| Average | n/a | n/a | -3.44% |

* measures speed instead of throughput

What Caused the Performance Drop in BERT

Thanks to @JonTanS for running the profiler, we have pinpointed the performance degradation to the operator broadcast_axis (from 138 ms to 177 ms) and MXNDArraySyncCopyToCPU (from 592 ms to 679 ms).

Running an operator-level profiler, we identified a 2.2x performance drop in the broadcast_axis operator.

w/o USE_INT64_TENSOR_SIZE flag:

```
[{'broadcast_axis': [{'inputs': {'data': (1, 1024, 1), 'axis': (0, 2), 'size': (1024, 8)}, 'max_storage_mem_alloc_gpu/0': 16777.2168, 'avg_time_forward_broadcast_axis': 2.7753}]}]
```

w/ USE_INT64_TENSOR_SIZE flag:

```
[{'broadcast_axis': [{'inputs': {'data': (1, 1024, 1), 'axis': (0, 2), 'size': (1024, 8)}, 'max_storage_mem_alloc_gpu/0': 16777.2168, 'avg_time_forward_broadcast_axis': 6.3178}]}]
```
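The inputs above correspond to broadcasting a (1, 1024, 1) array out to (1024, 1024, 8). A rough sketch for reproducing this kind of per-operator timing from Python with simple wall-clock measurement (not the exact profiler harness used above; timings depend on hardware and build):

```python
import time
import mxnet as mx

data = mx.nd.ones((1, 1024, 1))

# Warm-up run so we don't time kernel initialization.
mx.nd.broadcast_axis(data, axis=(0, 2), size=(1024, 8))
mx.nd.waitall()

runs = 100
start = time.time()
for _ in range(runs):
    out = mx.nd.broadcast_axis(data, axis=(0, 2), size=(1024, 8))
mx.nd.waitall()  # MXNet executes asynchronously; sync before stopping the clock
print('avg forward time: %.4f ms' % ((time.time() - start) / runs * 1000))
```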

Why Is the broadcast_axis Operator Affected

The per-element index computation performs many div/mul/mod ALU operations, and these indices changed from int32 to int64; 64-bit integer division and modulo are substantially more expensive, particularly on GPU.

```cpp
template<typename OP>
struct broadcast_kernel {
  template<typename IType, typename OType>
  MSHADOW_XINLINE static void Map(index_t i,
                                  IType *input,
                                  OType *output,
                                  mshadow::Shape<MXNET_SPECIAL_MAX_NDIM> in_shape,
                                  mshadow::Shape<MXNET_SPECIAL_MAX_NDIM> out_shape,
                                  const OpReqType req,
                                  const uint32_t ndim) {
    size_t in_stride = 1;
    size_t out_stride = 1;
    index_t idx = i;     // flat output index being decomposed
    index_t in_idx = i;  // flat input index being reconstructed
    // Walk dimensions from innermost to outermost. Each iteration costs one
    // mod, one div, and up to two muls on index_t, which becomes int64_t when
    // USE_INT64_TENSOR_SIZE is ON.
    for (int iter = ndim - 1; iter >= 0; --iter) {
      size_t dim_idx = idx % out_shape[iter];
      in_idx -= dim_idx * out_stride;
      if (in_shape[iter] != 1) {
        in_idx += dim_idx * in_stride;  // non-broadcast dim: keep the coordinate
      }
      idx /= out_shape[iter];
      in_stride *= in_shape[iter];
      out_stride *= out_shape[iter];
    }
    KERNEL_ASSIGN(output[i], req, OP::Map(input[in_idx]));
  }
};
```
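To see why the index type matters, the loop above can be mirrored in pure Python (an illustrative sketch; broadcast_input_index is our name, not an MXNet function). For every output element, the kernel pays one modulo, one division, and up to two multiplications per dimension on index_t, and it is exactly these operations that widened from 32-bit to 64-bit:

```python
def broadcast_input_index(i, in_shape, out_shape):
    """Map a flat output index to the flat input index it reads,
    mirroring the loop in broadcast_kernel above."""
    in_stride = out_stride = 1
    idx = in_idx = i
    for it in range(len(out_shape) - 1, -1, -1):
        dim_idx = idx % out_shape[it]   # one mod per dim
        in_idx -= dim_idx * out_stride
        if in_shape[it] != 1:           # non-broadcast dimension
            in_idx += dim_idx * in_stride
        idx //= out_shape[it]           # one div per dim
        in_stride *= in_shape[it]
        out_stride *= out_shape[it]
    return in_idx

# Broadcasting (1, 1024, 1) -> (1024, 1024, 8): output element (p, r, c)
# reads input element (0, r, 0), i.e. flat input index r.
assert broadcast_input_index(5 * 1024 * 8 + 42 * 8 + 3,
                             (1, 1024, 1), (1024, 1024, 8)) == 42
```

One possible mitigation, under the assumption that most tensors still fit in 32 bits, is to dispatch to 32-bit index arithmetic whenever the tensor has fewer than 2^31 elements and only pay the int64 cost for genuinely large tensors.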

TODO
