[mxnet 2.0] [item 2.4] Turning on large tensor support by default #17331
Description
Currently, MXNet only supports tensors with fewer than 2^31 elements. To work with larger tensors, users need to recompile MXNet with the USE_INT64_TENSOR_SIZE compiler flag set to ON.
Large tensors are common in applications such as recommendation systems with sparse embedding matrices, and in graph neural network libraries such as DGL.
To provide a better user experience, we would like to turn this compiler flag on by default so that the MXNet binary release supports large tensors out of the box.
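To make the limit concrete, here is a minimal sketch (the shape and dtype are illustrative, not taken from this issue): allocating and indexing a tensor with more than 2^31 elements only works on a build compiled with USE_INT64_TENSOR_SIZE=ON; on a default build the shape is rejected or the 32-bit index overflows.

```python
# Minimal sketch of what the 2**31 element limit means in practice
# (shape and dtype chosen for illustration only).
import mxnet as mx

# 2**31 + 1 elements (~8.6 GB as float32). On a build with
# USE_INT64_TENSOR_SIZE=ON this allocates and indexes correctly;
# on the default int32 build the size check fails or the index overflows.
x = mx.nd.zeros((2**31 + 1,), dtype='float32')
print(x.size)     # 2147483649
print(x[2**31])   # addressing an element beyond 2**31 needs 64-bit indices
```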
Current Status:
Large tensor support is already implemented in the MXNet backend and C API. Over 80 operators have been tested, and more are being tested.
There was performance degradation in a few operators such as transpose; it has since been fixed (#16104).
Model Inference Performance
int64/int32 P50 records the 50th-percentile inference runtime for the int64 and int32 builds respectively.
Diff (%): relative runtime change of the int64 build vs. the int32 build, computed against the int32 baseline (see the sketch after the table).
A positive value means inference time is reduced when using int64 as the tensor index type.
Model | Mode | int64 P50 (ms) | int32 P50 (ms) | Diff (%) |
---|---|---|---|---|
resnext101_64x4d | gluon | 47.34253883 | 49.46685 | 4.29% |
resnext101_64x4d | module | 28.83672714 | 28.48792 | -1.22% |
resnext50 | gluon | 17.14539528 | 18.05592 | 5.04% |
resnext50 | module | 10.05506516 | 9.636641 | -4.34% |
nin | gluon | 2.574443817 | 2.608061 | 1.29% |
nin | module | 2.432107925 | 2.737761 | 11.16% |
resnet18 | gluon | 3.895759583 | 3.638268 | -7.08% |
resnet18 | module | 2.954959869 | 3.182888 | 7.16% |
wavernn | gluon | 262.9389763 | 256.5546 | -2.49% |
caffenet | gluon | 2.930879593 | 3.087759 | 5.08% |
caffenet | module | 3.169536591 | 3.225327 | 1.73% |
vgg19 | gluon | 14.18304443 | 13.89098 | -2.10% |
vgg19 | module | 13.80157471 | 14.33492 | 3.72% |
maskrcnn | gluon | 2340.852737 | 2391.741 | 2.13% |
maskrcnn | module | 1943.515778 | 1926.38 | -0.89% |
superres | gluon | 17.39168167 | 18.00895 | 3.43% |
superres | module | 16.98470116 | 17.26198 | 1.61% |
resnet101 | gluon | 18.73707771 | 18.4412 | -1.60% |
resnet101 | module | 16.66593552 | 14.78386 | -12.73% |
vgg16 | gluon | 12.403965 | 16.2611 | 23.72% |
vgg16 | module | 17.93074608 | 11.83605 | -51.49% |
yolov3 | gluon | 22.96686172 | 23.01311 | 0.20% |
yolov3 | module | 18.57829094 | 20.05506 | 7.36% |
ssd | gluon | 17.17400551 | 16.73698 | -2.61% |
ssd | module | 13.98611069 | 14.00757 | 0.15% |
rnn | gluon | 28.2740593 | 28.92017 | 2.23% |
rnn | module | 19.32096481 | 28.63479 | 32.53% |
a3c | gluon | 0.928401947 | 0.94223 | 1.47% |
a3c | module | 0.673055649 | 0.858545 | 21.61% |
squeezenetv10 | gluon | 4.072666168 | 4.251957 | 4.22% |
squeezenetv10 | module | 3.686189651 | 3.818274 | 3.46% |
resnet152 | gluon | 25.8705616 | 27.65441 | 6.45% |
resnet152 | module | 20.5206871 | 21.03257 | 2.43% |
resnet34 | gluon | 6.978273392 | 7.166862 | 2.63% |
resnet34 | module | 5.693674088 | 5.653858 | -0.70% |
squeezenetv11 | gluon | 3.037929535 | 3.165722 | 4.04% |
squeezenetv11 | module | 2.890110016 | 2.983332 | 3.12% |
resnext101 | gluon | 29.1929245 | 27.65107 | -5.58% |
resnext101 | module | 15.9804821 | 17.51709 | 8.77% |
bert | gluon | 44.32678223 | 43.77675 | -1.26% |
bert | module | 43.85828972 | 45.38655 | 3.37% |
resnet50 | gluon | 10.39171219 | 10.31256 | -0.77% |
resnet50 | module | 9.351491928 | 8.312941 | -12.49% |
fasterrcnn | gluon | 1041.807413 | 1061.532 | 1.86% |
fasterrcnn | module | 702.3141384 | 703.7232 | 0.20% |
inception | gluon | 7.934331894 | 8.714437 | 8.95% |
inception | module | 5.178928375 | 5.363703 | 3.44% |
Average | gluon | n/a | n/a | 0.69% |
Average | module | n/a | n/a | -0.37% |
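For reference, the Diff (%) values are consistent with a relative change taken against the int32 baseline. A quick sanity check on the resnext101_64x4d (gluon) row (plain Python, not part of the benchmark scripts):

```python
# Reproduce the Diff (%) value for one row of the table
# (assumption: the diff is computed relative to the int32 baseline).
int64_p50 = 47.34253883  # resnext101_64x4d, gluon, int64 build (ms)
int32_p50 = 49.46685     # resnext101_64x4d, gluon, int32 build (ms)

diff_pct = (int32_p50 - int64_p50) / int32_p50 * 100
print(f"{diff_pct:.2f}%")  # 4.29% -> positive, i.e. the int64 build is faster here
```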
Model Training Performance
Model | int64 Samples/Second | int32 Samples/Second | Percentage Change |
---|---|---|---|
xception | 67.51961 | 68.61849 | -1.60% |
resnet50_v2 | 299.0174 | 299.1728 | -0.05% |
gnmt | 7.65 | 7.675 | -0.33% |
vgg16 | 228.4218 | 230.0739 | -0.72% |
bert | 38.1 | 46.7 | -18.42% |
yolo3_darknet53_custom | 31.6145 | 40.65 | -22.23% |
inceptionv3 | 225.4025 | 227.1884 | -0.79% |
se_resnet152_v1 | 123.7371 | 124.1493 | -0.33% |
word_language_model | 15651.19 | 15524.71 | 0.81% |
*mobilenet0.25_cifar10 | 56.6609205 | 60.5992765 | 6.50% |
resnet101_v1 | 176.6355 | 177.3132 | -0.38% |
squeezenet1.0 | 790.7722 | 790.1395 | 0.08% |
mobilenetv2_0.75 | 680.4143 | 672.2202 | 1.22% |
ssd | 66.2365 | 67.56 | -1.96% |
Average | n/a | n/a | -3.44% |
* measures speed instead of throughput
What Caused Performance Drop in BERT
Thanks to @JonTanS for running the profiler, we have pinpointed the performance degradation to the broadcast_axis operator (from 138ms to 177ms) and MXNDArraySyncCopyToCPU (from 592ms to 679ms).
Running the operator-level profiler, we could identify a 2.2x performance drop in the broadcast_axis operator.
w/o USE_INT64_TENSOR_SIZE flag:
```
[{'broadcast_axis': [{'inputs': {'data': (1, 1024, 1), 'axis': (0, 2), 'size': (1024, 8)},
                      'max_storage_mem_alloc_gpu/0': 16777.2168,
                      'avg_time_forward_broadcast_axis': 2.7753}]}]
```
w/ USE_INT64_TENSOR_SIZE flag:
```
[{'broadcast_axis': [{'inputs': {'data': (1, 1024, 1), 'axis': (0, 2), 'size': (1024, 8)},
                      'max_storage_mem_alloc_gpu/0': 16777.2168,
                      'avg_time_forward_broadcast_axis': 6.3178}]}]
```
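For anyone who wants to reproduce the comparison, below is a simple timing sketch of the same broadcast_axis call. This is a hand-rolled micro-benchmark, not the exact opperf invocation that produced the numbers above; run it once on an int32 build and once on an int64 build.

```python
# Hand-rolled micro-benchmark of broadcast_axis on the profiled input shape.
import time
import mxnet as mx

ctx = mx.gpu(0) if mx.context.num_gpus() > 0 else mx.cpu()
data = mx.nd.random.uniform(shape=(1, 1024, 1), ctx=ctx)

# Warm-up so lazy allocation and kernel setup are not measured.
for _ in range(10):
    mx.nd.broadcast_axis(data, axis=(0, 2), size=(1024, 8))
mx.nd.waitall()

runs = 100
start = time.time()
for _ in range(runs):
    out = mx.nd.broadcast_axis(data, axis=(0, 2), size=(1024, 8))
mx.nd.waitall()  # block until all asynchronous kernels have finished
print("avg forward time: %.4f ms" % ((time.time() - start) / runs * 1000))
```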
Why is the broadcast_axis Operator Affected
The kernel performs many div/mul/mod ALU operations on index variables, which changed from int32 to int64 with the flag on; 64-bit integer division and modulo are considerably more expensive than their 32-bit counterparts.
```cpp
template<typename OP>
struct broadcast_kernel {
  template<typename IType, typename OType>
  MSHADOW_XINLINE static void Map(index_t i,
                                  IType *input,
                                  OType *output,
                                  mshadow::Shape<MXNET_SPECIAL_MAX_NDIM> in_shape,
                                  mshadow::Shape<MXNET_SPECIAL_MAX_NDIM> out_shape,
                                  const OpReqType req,
                                  const uint32_t ndim) {
    size_t in_stride = 1;
    size_t out_stride = 1;
    index_t idx = i;     // index_t becomes int64_t when USE_INT64_TENSOR_SIZE=ON
    index_t in_idx = i;
    // Map the flat output index i back to the corresponding input element one
    // dimension at a time; each iteration costs a mod, a div and two muls on
    // the index type, which is where the int64 slowdown comes from.
    for (int iter = ndim - 1; iter >= 0; --iter) {
      size_t dim_idx = idx % out_shape[iter];
      in_idx -= dim_idx * out_stride;
      if (in_shape[iter] != 1) {
        in_idx += dim_idx * in_stride;
      }
      idx /= out_shape[iter];
      in_stride *= in_shape[iter];
      out_stride *= out_shape[iter];
    }
    KERNEL_ASSIGN(output[i], req, OP::Map(input[in_idx]));
  }
};
```
TODO
- (DONE) update MXNet development doc and FAQ for adding new operators (@ChaiBapchya)
- (DONE) turning on nightly tests for large tensor (@access2rohit)
  - skipping tests that cannot fit in nightly CI machine #17450
  - Re-Enabling Large Tensor and Vector Nightly on GPU #16164
  - enabling build stage gpu_int64 to enable large tensor nightly runs #17546
- test performance in npx operators (@access2rohit)
- (DONE) test more operators (@ChaiBapchya)
  - Implement remaining nn_basic ops in opperf #17456
- (DONE) adding end-to-end tests for a list of models (@jonatan1626)
  - Updated PartialSortSmallK for LT support #17462
- Fix training regression in BERT model
- setting the flag to ON and clean up (@apeforest)