[mxnet 2.0] [item 2.4] Turning on large tensor support by default #17331
Description
Currently, MXNet only supports tensors with fewer than 2^31 elements. To work with larger tensors, users need to recompile MXNet with the USE_INT64_TENSOR_SIZE compiler flag set to ON.
Large tensors are common in applications such as recommendation systems with sparse embedding matrices, and in graph neural network libraries such as DGL.
To provide a better user experience, we would like to turn this compiler flag on by default so that the MXNet binary release supports large tensors out of the box.
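To make the limit concrete, here is a minimal sketch (the shape and dtype are illustrative, not taken from this issue): allocating and indexing a tensor with more than 2^31 elements only works on a build compiled with USE_INT64_TENSOR_SIZE=ON; on a default build the shape is rejected or the 32-bit index overflows.

```python
# Minimal sketch of what the 2**31 element limit means in practice
# (shape and dtype chosen for illustration only).
import mxnet as mx

# 2**31 + 1 elements (~8.6 GB as float32). On a build with
# USE_INT64_TENSOR_SIZE=ON this allocates and indexes correctly;
# on the default int32 build the size check fails or the index overflows.
x = mx.nd.zeros((2**31 + 1,), dtype='float32')
print(x.size)     # 2147483649
print(x[2**31])   # addressing an element beyond 2**31 needs 64-bit indices
```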
Current Status:
Large tensor support is already implemented in the MXNet backend and C API. Over 80 operators have been tested, and more are being tested.
There was performance degradation in a few operators such as transpose; it has since been fixed (#16104).
Model Inference Performance
int64/int32 P50 records the 50th-percentile inference runtime for the int64 and int32 builds respectively.
Diff (%): relative runtime change of the int64 build vs. the int32 build, computed against the int32 baseline (see the sketch after the table).
A positive value means inference time is reduced when using int64 as the tensor index type.
Model | Mode | int64 P50 (ms) | int32 P50 (ms) | Diff (%) |
---|---|---|---|---|
resnext101_64x4d | gluon | 47.34253883 | 49.46685 | 4.29% |
resnext101_64x4d | module | 28.83672714 | 28.48792 | -1.22% |
resnext50 | gluon | 17.14539528 | 18.05592 | 5.04% |
resnext50 | module | 10.05506516 | 9.636641 | -4.34% |
nin | gluon | 2.574443817 | 2.608061 | 1.29% |
nin | module | 2.432107925 | 2.737761 | 11.16% |
resnet18 | gluon | 3.895759583 | 3.638268 | -7.08% |
resnet18 | module | 2.954959869 | 3.182888 | 7.16% |
wavernn | gluon | 262.9389763 | 256.5546 | -2.49% |
caffenet | gluon | 2.930879593 | 3.087759 | 5.08% |
caffenet | module | 3.169536591 | 3.225327 | 1.73% |
vgg19 | gluon | 14.18304443 | 13.89098 | -2.10% |
vgg19 | module | 13.80157471 | 14.33492 | 3.72% |
maskrcnn | gluon | 2340.852737 | 2391.741 | 2.13% |
maskrcnn | module | 1943.515778 | 1926.38 | -0.89% |
superres | gluon | 17.39168167 | 18.00895 | 3.43% |
superres | module | 16.98470116 | 17.26198 | 1.61% |
resnet101 | gluon | 18.73707771 | 18.4412 | -1.60% |
resnet101 | module | 16.66593552 | 14.78386 | -12.73% |
vgg16 | gluon | 12.403965 | 16.2611 | 23.72% |
vgg16 | module | 17.93074608 | 11.83605 | -51.49% |
yolov3 | gluon | 22.96686172 | 23.01311 | 0.20% |
yolov3 | module | 18.57829094 | 20.05506 | 7.36% |
ssd | gluon | 17.17400551 | 16.73698 | -2.61% |
ssd | module | 13.98611069 | 14.00757 | 0.15% |
rnn | gluon | 28.2740593 | 28.92017 | 2.23% |
rnn | module | 19.32096481 | 28.63479 | 32.53% |
a3c | gluon | 0.928401947 | 0.94223 | 1.47% |
a3c | module | 0.673055649 | 0.858545 | 21.61% |
squeezenetv10 | gluon | 4.072666168 | 4.251957 | 4.22% |
squeezenetv10 | module | 3.686189651 | 3.818274 | 3.46% |
resnet152 | gluon | 25.8705616 | 27.65441 | 6.45% |
resnet152 | module | 20.5206871 | 21.03257 | 2.43% |
resnet34 | gluon | 6.978273392 | 7.166862 | 2.63% |
resnet34 | module | 5.693674088 | 5.653858 | -0.70% |
squeezenetv11 | gluon | 3.037929535 | 3.165722 | 4.04% |
squeezenetv11 | module | 2.890110016 | 2.983332 | 3.12% |
resnext101 | gluon | 29.1929245 | 27.65107 | -5.58% |
resnext101 | module | 15.9804821 | 17.51709 | 8.77% |
bert | gluon | 44.32678223 | 43.77675 | -1.26% |
bert | module | 43.85828972 | 45.38655 | 3.37% |
resnet50 | gluon | 10.39171219 | 10.31256 | -0.77% |
resnet50 | module | 9.351491928 | 8.312941 | -12.49% |
fasterrcnn | gluon | 1041.807413 | 1061.532 | 1.86% |
fasterrcnn | module | 702.3141384 | 703.7232 | 0.20% |
inception | gluon | 7.934331894 | 8.714437 | 8.95% |
inception | module | 5.178928375 | 5.363703 | 3.44% |
Average | gluon | n/a | n/a | 0.69% |
Average | module | n/a | n/a | -0.37% |
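For reference, the Diff (%) values are consistent with a relative change taken against the int32 baseline. A quick sanity check on the resnext101_64x4d (gluon) row (plain Python, not part of the benchmark scripts):

```python
# Reproduce the Diff (%) value for one row of the table
# (assumption: the diff is computed relative to the int32 baseline).
int64_p50 = 47.34253883  # resnext101_64x4d, gluon, int64 build (ms)
int32_p50 = 49.46685     # resnext101_64x4d, gluon, int32 build (ms)

diff_pct = (int32_p50 - int64_p50) / int32_p50 * 100
print(f"{diff_pct:.2f}%")  # 4.29% -> positive, i.e. the int64 build is faster here
```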
Model Training Performance
Model | int64 Samples/Second | int32 Samples/Second | Percentage Change |
---|---|---|---|
xception | 67.51961 | 68.61849 | -1.60% |
resnet50_v2 | 299.0174 | 299.1728 | -0.05% |
gnmt | 7.65 | 7.675 | -0.33% |
vgg16 | 228.4218 | 230.0739 | -0.72% |
bert | 38.1 | 46.7 | -18.42% |
yolo3_darknet53_custom | 31.6145 | 40.65 | -22.23% |
inceptionv3 | 225.4025 | 227.1884 | -0.79% |
se_resnet152_v1 | 123.7371 | 124.1493 | -0.33% |
word_language_model | 15651.19 | 15524.71 | 0.81% |
*mobilenet0.25_cifar10 | 56.6609205 | 60.5992765 | 6.50% |
resnet101_v1 | 176.6355 | 177.3132 | -0.38% |
squeezenet1.0 | 790.7722 | 790.1395 | 0.08% |
mobilenetv2_0.75 | 680.4143 | 672.2202 | 1.22% |
ssd | 66.2365 | 67.56 | -1.96% |
Average | n/a | n/a | -3.44% |
* measures speed instead of throughput
What Caused Performance Drop in BERT
Thanks to @JonTanS for running the profiler, we have pinpointed the performance degradation to the broadcast_axis operator (from 138ms to 177ms) and MXNDArraySyncCopyToCPU (from 592ms to 679ms).
Running the operator-level profiler, we could identify a 2.2x performance drop in the broadcast_axis operator.
w/o USE_INT64_TENSOR_SIZE flag:
```
[{'broadcast_axis': [{'inputs': {'data': (1, 1024, 1), 'axis': (0, 2), 'size': (1024, 8)},
                      'max_storage_mem_alloc_gpu/0': 16777.2168,
                      'avg_time_forward_broadcast_axis': 2.7753}]}]
```
w/ USE_INT64_TENSOR_SIZE flag:
```
[{'broadcast_axis': [{'inputs': {'data': (1, 1024, 1), 'axis': (0, 2), 'size': (1024, 8)},
                      'max_storage_mem_alloc_gpu/0': 16777.2168,
                      'avg_time_forward_broadcast_axis': 6.3178}]}]
```
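For anyone who wants to reproduce the comparison, below is a simple timing sketch of the same broadcast_axis call. This is a hand-rolled micro-benchmark, not the exact opperf invocation that produced the numbers above; run it once on an int32 build and once on an int64 build.

```python
# Hand-rolled micro-benchmark of broadcast_axis on the profiled input shape.
import time
import mxnet as mx

ctx = mx.gpu(0) if mx.context.num_gpus() > 0 else mx.cpu()
data = mx.nd.random.uniform(shape=(1, 1024, 1), ctx=ctx)

# Warm-up so lazy allocation and kernel setup are not measured.
for _ in range(10):
    mx.nd.broadcast_axis(data, axis=(0, 2), size=(1024, 8))
mx.nd.waitall()

runs = 100
start = time.time()
for _ in range(runs):
    out = mx.nd.broadcast_axis(data, axis=(0, 2), size=(1024, 8))
mx.nd.waitall()  # block until all asynchronous kernels have finished
print("avg forward time: %.4f ms" % ((time.time() - start) / runs * 1000))
```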
Why is the broadcast_axis Operator Affected
The kernel performs many div/mul/mod ALU operations on index variables, which changed from int32 to int64 with the flag on; 64-bit integer division and modulo are considerably more expensive than their 32-bit counterparts.
```cpp
template<typename OP>
struct broadcast_kernel {
  template<typename IType, typename OType>
  MSHADOW_XINLINE static void Map(index_t i,
                                  IType *input,
                                  OType *output,
                                  mshadow::Shape<MXNET_SPECIAL_MAX_NDIM> in_shape,
                                  mshadow::Shape<MXNET_SPECIAL_MAX_NDIM> out_shape,
                                  const OpReqType req,
                                  const uint32_t ndim) {
    size_t in_stride = 1;
    size_t out_stride = 1;
    index_t idx = i;     // index_t becomes int64_t when USE_INT64_TENSOR_SIZE=ON
    index_t in_idx = i;
    // Map the flat output index i back to the corresponding input element one
    // dimension at a time; each iteration costs a mod, a div and two muls on
    // the index type, which is where the int64 slowdown comes from.
    for (int iter = ndim - 1; iter >= 0; --iter) {
      size_t dim_idx = idx % out_shape[iter];
      in_idx -= dim_idx * out_stride;
      if (in_shape[iter] != 1) {
        in_idx += dim_idx * in_stride;
      }
      idx /= out_shape[iter];
      in_stride *= in_shape[iter];
      out_stride *= out_shape[iter];
    }
    KERNEL_ASSIGN(output[i], req, OP::Map(input[in_idx]));
  }
};
```
TODO
- (DONE) update MXNet development doc and FAQ for adding new operators (@ChaiBapchya)
- (DONE) turning on nightly tests for large tensor (@access2rohit)
  - skipping tests that cannot fit in nightly CI machine #17450
  - Re-Enabling Large Tensor and Vector Nightly on GPU #16164
  - enabling build stage gpu_int64 to enable large tensor nightly runs #17546
- test performance in npx operators (@access2rohit)
- (DONE) test more operators (@ChaiBapchya)
  - Implement remaining nn_basic ops in opperf #17456
- (DONE) adding end-to-end tests for a list of models (@jonatan1626)
  - Updated PartialSortSmallK for LT support #17462
- Fix training regression in BERT model
- setting the flag to ON and clean up (@apeforest)