TensorRT shows no improvement in inference speed #43
Did you try loading the pre-built TensorRT engine file?
By the way, you should skip the first few iterations when timing, to bypass the GPU warm-up process.
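A minimal timing sketch along those lines, assuming a placeholder runInference() that stands in for the actual TensorRT enqueue and stream synchronization:

```cpp
// Minimal warm-up-aware timing sketch (generic, not from the DSVT repo):
// run a few untimed iterations first so CUDA context creation, lazy kernel
// loading, and clock ramp-up do not inflate the reported latency.
#include <chrono>
#include <iostream>

// Placeholder: replace the body with the real TensorRT enqueue followed by
// cudaStreamSynchronize.
void runInference() { /* context->enqueueV2(...); cudaStreamSynchronize(...); */ }

int main() {
    const int warmup = 10, iters = 100;
    for (int i = 0; i < warmup; ++i) runInference();   // excluded from timing

    auto t0 = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < iters; ++i) runInference();
    auto t1 = std::chrono::high_resolution_clock::now();

    double ms = std::chrono::duration<double, std::milli>(t1 - t0).count() / iters;
    std::cout << "average latency: " << ms << " ms" << std::endl;
    return 0;
}
```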
I used the approach of parsing the ONNX file. However, loading the DSVT_TrtEngine file gives the same situation: the elapsed time remains unchanged. Thanks!
I use the Python API of TensorRT, and I have noticed that the FP16 TensorRT engine is nearly twice as fast as the PyTorch version.
The following is the bash script used to generate the TensorRT engine file. Do you find any problems with it? The output log file is attached.
Hi, I have found that improper dynamic shape ranges were used in the trtexec command, which leads to the observed results. After analyzing the distribution of these dynamic shapes on the Waymo validation dataset, I suggest setting the min/opt/max shape ranges according to that distribution for optimal results.
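For reference, a trtexec invocation with explicit dynamic-shape ranges could look roughly like the sketch below; the input tensor name and the min/opt/max values are illustrative placeholders, not the actual command from this thread, and should be taken from the exported ONNX graph and the dataset statistics.

```bash
# Sketch only: "voxel_feat" and the shape values are placeholders.
trtexec \
  --onnx=dsvt_block.onnx \
  --saveEngine=dsvt_block_fp16.engine \
  --fp16 \
  --minShapes=voxel_feat:3000x192 \
  --optShapes=voxel_feat:20000x192 \
  --maxShapes=voxel_feat:35000x192
```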
Subsequently, you will obtain the following results:
This issue seems to have been solved and will be closed.
I think this is due to hardware reasons! On my NVIDIA P4000, I implemented DSVT using the TensorRT C++ definition API and a plugin; similarly, there was no acceleration. You can try other hardware, such as a 2080 Ti.
Will the new trtexec command also be slow on a P4000?
I only tested it on an RTX 3090.
Maybe the reason is FP16. The P4000 has much lower computational power and fewer GPU cores for FP16. Please refer here. The RTX 3090 is about 500x faster than the P4000 in FP16 computation, and our TensorRT deployment mainly targets FP16.
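A quick, generic way to confirm which architecture a card reports (and therefore whether it has Tensor-Core FP16 at all) is to query its compute capability; this is a standalone CUDA sketch, not part of the DSVT code:

```cpp
// Print the compute capability of each visible GPU. Pascal cards such as the
// P4000 report 6.1 (no Tensor Cores, very slow native FP16), while Ampere
// cards such as the RTX 3090 report 8.6 and run FP16 on Tensor Cores.
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int i = 0; i < count; ++i) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        std::printf("GPU %d: %s, compute capability %d.%d\n",
                    i, prop.name, prop.major, prop.minor);
    }
    return 0;
}
```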
Thank you all. There is a huge difference in hardware performance between the P4000 and the 3090 Ti. This issue will be closed.
Nice!
I attempted to deploy the DSVT model to TensorRT according to your deployment code. Following the official TensorRT example code, I used dynamic shapes for the dsvt_block model input. Model inference takes about 260 ms, while the PyTorch version takes less time, about 140 ms. Why does inference take longer with the TensorRT C++ code?
Environment
TensorRT Version: 8.5.1.7
CUDA Version: 11.8
CUDNN Version: 8.6
GPU: P4000
(the rest of the environment is the same as the public setup)
Inference code:
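The poster's actual code is not reproduced here; for orientation, a generic TensorRT 8.5 C++ dynamic-shape inference step typically looks like the sketch below. The tensor name "voxel_feat", the 20000x192 shape, and the engine path are assumptions, not the code used in this issue.

```cpp
// Illustrative sketch only: load a serialized engine, set the dynamic input
// shape, allocate bindings, and time a single enqueue after one warm-up run.
#include <NvInfer.h>
#include <cuda_runtime.h>
#include <chrono>
#include <fstream>
#include <iostream>
#include <iterator>
#include <vector>

using namespace nvinfer1;

// Minimal logger required by the TensorRT runtime.
class Logger : public ILogger {
    void log(Severity s, const char* msg) noexcept override {
        if (s <= Severity::kWARNING) std::cout << msg << std::endl;
    }
} gLogger;

int main() {
    // Load the serialized engine from disk (path is an assumption).
    std::ifstream f("dsvt_block_fp16.engine", std::ios::binary);
    std::vector<char> blob((std::istreambuf_iterator<char>(f)),
                           std::istreambuf_iterator<char>());

    IRuntime* runtime = createInferRuntime(gLogger);
    ICudaEngine* engine = runtime->deserializeCudaEngine(blob.data(), blob.size());
    IExecutionContext* context = engine->createExecutionContext();

    // With dynamic shapes, the input dimensions must be set on the context
    // before every enqueue whose voxel count differs from the previous one.
    const int inputIndex = engine->getBindingIndex("voxel_feat");
    context->setBindingDimensions(inputIndex, Dims2{20000, 192});

    // Allocate device buffers sized from the now-resolved binding dimensions
    // (4 bytes per element assumed for simplicity).
    std::vector<void*> bindings(engine->getNbBindings(), nullptr);
    for (int i = 0; i < engine->getNbBindings(); ++i) {
        Dims d = context->getBindingDimensions(i);
        size_t vol = 1;
        for (int k = 0; k < d.nbDims; ++k) vol *= static_cast<size_t>(d.d[k]);
        cudaMalloc(&bindings[i], vol * sizeof(float));
    }

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Warm-up run, excluded from timing.
    context->enqueueV2(bindings.data(), stream, nullptr);
    cudaStreamSynchronize(stream);

    // Timed run: only enqueue + synchronize.
    auto t0 = std::chrono::high_resolution_clock::now();
    context->enqueueV2(bindings.data(), stream, nullptr);
    cudaStreamSynchronize(stream);
    auto t1 = std::chrono::high_resolution_clock::now();
    std::cout << "latency: "
              << std::chrono::duration<double, std::milli>(t1 - t0).count()
              << " ms" << std::endl;
    return 0;
}
```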
According to the results, the average time cost of each stage is as follows (in seconds):
t1-t0: 0.00860953
t2-t1: 0.0124242
t3-t2: 4.72069e-05
t4-t3: 8.10623e-06
t5-t4: 0.260188
t6-t5: 0.00110817
Why does the C++ code take more time? Are there mistakes in the inference code?