# Optimizing Deep Learning Computation Graphs with TensorRT
NVIDIA's TensorRT is a deep learning library that has been shown to provide large speedups when used for network inference. MXNet 1.5.0 and later versions ship with experimental integrated support for TensorRT. This means MXNet users can now make use of this acceleration library to efficiently run their networks. In this tutorial we'll see how to install, enable, and run TensorRT with MXNet. We'll also give some insight into what is happening behind the scenes in MXNet to enable TensorRT graph execution.
## Installation and Prerequisites
Installing MXNet with TensorRT integration is an easy process. First ensure that you are running Ubuntu 18.04, that you have updated your video drivers, and that you have installed CUDA 10.0. You'll need a Pascal or newer generation NVIDIA GPU. You'll also have to download and install the TensorRT libraries ([instructions here](https://docs.nvidia.com/deeplearning/sdk/tensorrt-install-guide/index.html)). Once you have these prerequisites installed and up-to-date, you can install a special build of MXNet with TensorRT support enabled via PyPI and pip. Install the appropriate version by running:
```
pip install mxnet-tensorrt-cu10
```
If you are running an operating system other than Ubuntu 18.04, or just prefer to use a Docker image with all prerequisites installed, you can instead run:
```
nvidia-docker run -ti mxnet/tensorrt python
```
## Sample Models
### Resnet 18
TensorRT is an inference-only library, so for the purposes of this tutorial we will be using a pre-trained network, in this case a Resnet 18. Resnets are a computationally intensive model architecture that is often used as a backbone for various computer vision tasks, and they are also commonly used as a reference for benchmarking deep learning library performance. In this section we'll use a pretrained Resnet 18 from the [Gluon Model Zoo](https://mxnet.incubator.apache.org/versions/master/api/python/gluon/model_zoo.html) and compare its inference speed with TensorRT enabled against a baseline run of MXNet with TensorRT integration turned off.
## Model Initialization
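The model-initialization code itself is collapsed in this diff view. As a rough sketch (the `resnet18_v2` model choice, the file prefix, and the batch shape below are assumptions based on the surrounding text, not the original code), the setup could look like this:

```python
import time  # used by the benchmark loops further down

import mxnet as mx
from mxnet.gluon.model_zoo import vision

# Assumed input shape: a single 224x224 RGB image
batch_shape = (1, 3, 224, 224)

# Download a pretrained Resnet 18 from the Gluon Model Zoo and export it to a
# symbol + params checkpoint so it can be bound with the symbolic executor API
resnet18 = vision.resnet18_v2(pretrained=True)
resnet18.hybridize()
resnet18.forward(mx.nd.zeros(batch_shape))  # one pass so the cached graph exists before export
resnet18.export('resnet18_v2')              # hypothetical file prefix
sym, arg_params, aux_params = mx.model.load_checkpoint('resnet18_v2', 0)
```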
In our first section of code we import the modules needed to run MXNet and to time our benchmark runs.
For this experiment we are strictly interested in inference performance, so to simplify the benchmark we'll pass a tensor filled with zeros as an input. We then bind a symbol as usual, returning a normal MXNet executor, and we run forward on this executor in a loop. To help improve the accuracy of our benchmarks we run a small number of predictions as a warmup before running our timed loop. This will ensure various lazy operations, which do not represent real-world usage, have completed before we measure relative performance improvement. On a modern PC with an RTX 2070 GPU the time taken for our MXNet baseline is **17.20s**. Next we'll run the same model with TensorRT enabled, and see how the performance compares.
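The baseline benchmarking loop itself is not visible in this hunk. A minimal sketch, assuming the `sym`, `arg_params`, `aux_params`, and `batch_shape` defined during model initialization, could look like this:

```python
# Create a sample input filled with zeros
input = mx.nd.zeros(batch_shape)

# Bind the symbol without TensorRT to get the baseline executor
executor = sym.simple_bind(ctx=mx.gpu(0), data=batch_shape, grad_req='null', force_rebind=True)
executor.copy_params_from(arg_params, aux_params)

# Warmup, so one-time lazy initialization does not skew the measurement
print('Warming up MXNet')
for i in range(0, 10):
    y_gen = executor.forward(is_train=False, data=input)
    y_gen[0].wait_to_read()

# Timed run
print('Starting MXNet timed run')
start = time.process_time()
for i in range(0, 10000):
    y_gen = executor.forward(is_train=False, data=input)
    y_gen[0].wait_to_read()
print(time.process_time() - start)
```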
## MXNet with TensorRT Integration Performance
```python
# Execute with TensorRT
print('Building TensorRT engine')
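# --- Illustrative sketch: the original binding code is not visible in this diff. ---
# The helper names below (get_backend_symbol, init_tensorrt_params, set_use_fp16) are
# assumptions based on MXNet 1.5's contrib TensorRT API; verify them against the
# version you have installed.
trt_sym = sym.get_backend_symbol('TensorRT')
arg_params, aux_params = mx.contrib.tensorrt.init_tensorrt_params(trt_sym, arg_params, aux_params)
mx.contrib.tensorrt.set_use_fp16(True)
executor = trt_sym.simple_bind(ctx=mx.gpu(0), data=batch_shape, grad_req='null', force_rebind=True)
executor.copy_params_from(arg_params, aux_params)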
```
We use a few TensorRT-specific API calls from the contrib package here to set up our parameters and indicate that we'd like to run inference in fp16 mode. We then call simple_bind as usual and copy our parameter dictionaries over to our executor.
```python
# Warmup
print('Warming up TensorRT')
for i in range(0, 10):
    # Assumed loop body (elided in the diff): one forward pass, waiting on the output
    y_gen = executor.forward(is_train=False, data=input)
    y_gen[0].wait_to_read()

# Timed run (the line initializing `start` is elided in the diff; assumed here)
start = time.process_time()
for i in range(0, 10000):
    y_gen = executor.forward(is_train=False, data=input)
    y_gen[0].wait_to_read()
end = time.process_time()
print(end - start)
```
We run the timed loop with a warmup once more; on the same machine it completes in **9.83s**, a 1.75x speed improvement! Speed improvements when using libraries like TensorRT can come from a variety of optimizations, but in this case our speedups come from a technique known as [operator fusion](http://dmlc.ml/2016/11/21/fusion-and-runtime-compilation-for-nnvm-and-tinyflow.html).
## Operators and Subgraph Fusion
The examples below show a Gluon implementation of a Wavenet before and after a fusion pass has been applied.
## After

## Subgraph API

As of MXNet 1.5, TensorRT is integrated with MXNet via the Subgraph API. You can read more about the design of the API [here](https://cwiki.apache.org/confluence/display/MXNET/MXNet+Graph+Optimization+and+Quantization+based+on+subgraph+and+MKL-DNN).
## Thanks
Thank you to NVIDIA for contributing this feature, and specifically thanks to Marek Kolodziej and Clement Fuji-Tsang. Thanks to Junyuan Xie and Jun Wu for the code reviews and design feedback, and to Aaron Markham for the copy review.