Description
Hello there! Thanks for your great work!
I ran into an issue when I deployed it on my PC. Could anyone help me take a look? Thanks!
I have observed unexpected performance behavior while using the multi-threaded variants in fast_gicp. Specifically, the single-threaded versions of the point cloud alignment algorithms (fgicp_st and vgicp_st) outperform their multi-threaded counterparts (fgicp_mt and vgicp_mt). This was observed while aligning two point clouds of 17047 and 17334 points.
Environment
OS: Ubuntu 20.04 + ROS Noetic
GPU: RTX4090 32GB
CPU: i9-13900KF
RAM: 32GB
I deployed the repo on WSL using Docker.
Details
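I recorded the execution times of the algorithms with gicp_align. In every timing column the single-threaded implementation is faster than the multi-threaded one: fgicp_mt takes 135.2 msec for a single alignment versus 101.4 msec for fgicp_st, and vgicp_mt takes 158.7 msec versus 85.7 msec for vgicp_st. The full output is: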
$ rosrun fast_gicp gicp_align 251370668.pcd 251371071.pcd
target:17047[pts] source:17334[pts]
--- pcl_gicp ---
single:110.186[msec] 100times:11059.9[msec] fitness_score:0.204892
--- pcl_ndt ---
single:39.1375[msec] 100times:4043.5[msec] fitness_score:0.229616
--- fgicp_st ---
single:101.371[msec] 100times:9945.61[msec] 100times_reuse:6586.6[msec] fitness_score:0.204376
--- fgicp_mt ---
single:135.229[msec] 100times:12986.9[msec] 100times_reuse:11950.3[msec] fitness_score:0.204384
--- vgicp_st ---
single:85.6506[msec] 100times:7514.18[msec] 100times_reuse:4194.52[msec] fitness_score:0.205022
--- vgicp_mt ---
single:158.688[msec] 100times:16300.5[msec] 100times_reuse:15309.5[msec] fitness_score:0.205022
--- ndt_cuda (P2D) ---
single:17.4151[msec] 100times:1702.9[msec] 100times_reuse:1340.19[msec] fitness_score:0.197208
--- ndt_cuda (D2D) ---
single:13.5261[msec] 100times:1391.88[msec] 100times_reuse:1119.26[msec] fitness_score:0.199985
--- vgicp_cuda (parallel_kdtree) ---
single:37.8372[msec] 100times:3054.31[msec] 100times_reuse:1987.94[msec] fitness_score:0.205017
--- vgicp_cuda (gpu_bruteforce) ---
single:65.4749[msec] 100times:3064.62[msec] 100times_reuse:2966.4[msec] fitness_score:0.249594
--- vgicp_cuda (gpu_rbf_kernel) ---
single:13.1453[msec] 100times:1515.33[msec] 100times_reuse:1119.99[msec] fitness_score:0.204766
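To put the gap in numbers, the st/mt lines of the output above can be parsed and the mt/st ratio computed per timing field. This is only a minimal sketch: the helper names `parse_timings` and `slowdown` are hypothetical and are not part of fast_gicp, and the log text is simply copied from the run above.

```python
import re

# Benchmark output copied from the run above (only the st/mt pairs).
LOG = """\
--- fgicp_st ---
single:101.371[msec] 100times:9945.61[msec] 100times_reuse:6586.6[msec] fitness_score:0.204376
--- fgicp_mt ---
single:135.229[msec] 100times:12986.9[msec] 100times_reuse:11950.3[msec] fitness_score:0.204384
--- vgicp_st ---
single:85.6506[msec] 100times:7514.18[msec] 100times_reuse:4194.52[msec] fitness_score:0.205022
--- vgicp_mt ---
single:158.688[msec] 100times:16300.5[msec] 100times_reuse:15309.5[msec] fitness_score:0.205022
"""


def parse_timings(text):
    """Map each method name to its {field: value} timings in msec."""
    results, name = {}, None
    for line in text.splitlines():
        header = re.match(r"---\s*(.+?)\s*---", line)
        if header:
            name = header.group(1)
            continue
        if name is not None:
            # Only fields tagged [msec] are timings; fitness_score is skipped.
            fields = re.findall(r"(\w+):([\d.]+)\[msec\]", line)
            if fields:
                results[name] = {key: float(value) for key, value in fields}
    return results


def slowdown(results, base):
    """Ratio mt/st per timing field; a value above 1 means mt is slower."""
    st, mt = results[base + "_st"], results[base + "_mt"]
    return {key: mt[key] / st[key] for key in st if key in mt}


if __name__ == "__main__":
    timings = parse_timings(LOG)
    for base in ("fgicp", "vgicp"):
        for field, ratio in slowdown(timings, base).items():
            print(f"{base} {field}: mt/st = {ratio:.2f}")
```

Every ratio comes out above 1 for both fgicp and vgicp, and the largest gap is in the 100times_reuse column for vgicp.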
Expected Behavior:
Typically, one would expect the multi-threaded implementations to be faster than, or at least as fast as, the single-threaded ones, especially when dealing with large datasets.