feat(BEVFusion): added a positional encoding-like feature extractor #16


Open · wants to merge 3 commits into main

Conversation

@knzo25 knzo25 (Contributor) commented Mar 13, 2025

Summary

Added a positional encoding-like feature extractor before the sparse encoder.
The results, particularly for the online model, show a greatly improved mAP.

Change point

The fundamentals behind this idea can be seen from different angles.
A good study was performed in the first NeRF paper and is explained in more detail here:
https://arxiv.org/pdf/2006.10739

From a more engineering-oriented perspective, we have a signal in the range [-120, 120]. The sparse convolutions, as a consequence of having a kernel size of 3x3, will see at most 0.17m x 3 of difference between the elements involved in each operation, making it difficult to learn features due to the poor signal-to-reference ratio. The reference itself is not really needed, since it is already encoded in the coordinates of the convolution. Simply applying multiple sin/cos functions allows the first layer to learn higher-frequency features, in theory enabling a more discerning detector. This can also be seen through kernel theory and series expansions, but I think that is overkill.
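The idea above can be sketched as NeRF-style Fourier features applied to raw coordinates. This is a minimal NumPy illustration, not the PR's actual implementation; `num_freqs`, `max_range`, and the normalization scheme are illustrative assumptions:

```python
import numpy as np

def fourier_encode(coords, num_freqs=4, max_range=120.0):
    """NeRF-style Fourier features: map coordinates in
    [-max_range, max_range] to sin/cos values at several
    frequencies (num_freqs and max_range are illustrative)."""
    # Normalize so the lowest frequency spans the full range.
    x = coords / max_range
    # Frequencies pi * 2^0 .. pi * 2^(num_freqs - 1): higher
    # frequencies amplify small positional differences.
    freqs = np.pi * 2.0 ** np.arange(num_freqs)
    angles = x[..., None] * freqs                  # (..., D, num_freqs)
    feats = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
    return feats.reshape(*coords.shape[:-1], -1)   # (..., 2 * D * num_freqs)

# Two points 0.51m apart in x (three voxels at 0.17m pitch) are nearly
# identical relative to the 120m range, but clearly separated once encoded.
p = np.array([[100.00, 0.0, 0.0], [100.51, 0.0, 0.0]])
enc = fourier_encode(p)                            # shape (2, 24)
```

The encoded gap between the two nearby points is over an order of magnitude larger (relative to feature magnitude) than the raw normalized gap, which is the effect the first sparse-conv layer benefits from.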

Note

Although this does not impact other projects and by default this new feature is off, the idea itself can be applied to other models. That being said, the idea behind it is already applied to some degree in the pillar features of centerpoint & transfusion.
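Since the feature is off by default, the opt-in behavior might look like the following config fragment. This is a hypothetical sketch: the key and flag names are illustrative, not the actual ones in this PR.

```python
# Hypothetical mmdet3d-style config fragment (names are illustrative).
# The positional-encoding feature extractor sits before the sparse
# encoder and is disabled unless a project enables it explicitly,
# so existing projects are unaffected.
model = dict(
    pts_voxel_encoder=dict(
        use_fourier_features=False,  # illustrative flag name; off by default
        num_frequencies=4,           # illustrative value
        point_cloud_range=[-120.0, -120.0, -5.0, 120.0, 120.0, 3.0],
    )
)
```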

Tests performed

*Note: these are the eval results of the best epoch (they will be replaced with test-split results later)

TIERIV INTERNAL LINK

| model | range | mAP | car | truck | bus | bicycle | pedestrian |
| ----- | ----- | --- | --- | ----- | --- | ------- | ---------- |
| BEVFusion-L base_120m_v1 (baseline) | 122.4m | 61.8 | 77.9 | 64.5 | 57.5 | 53.4 | 55.7 |
| BEVFusion-L base_120m_v1 (encoding) | 122.4m | 68.8 | 81.2 | 65.3 | 65.8 | 70.2 | 61.5 |
| BEVFusion-L-offline base_120m_v1 (baseline) | 122.4m | 66.0 | 85.4 | 57.4 | 51.7 | 68.1 | 67.2 |
| BEVFusion-L-offline base_120m_v1 (encoding) | 122.4m | 70.8 | 85.6 | 65.6 | 67.7 | 67.9 | 67.1 |

Links to the data and evaluation results
  • BEVFusion-L base_120m_v1 (baseline)
    • Training dataset: db_jpntaxi_v1 + db_jpntaxi_v2 + db_jpntaxi_v4 + db_gsm8_v1 + db_j6_v1 + db_j6_v2 + db_j6_v3 + db_j6_v5 +
    • Eval dataset: db_jpntaxi_v1 + db_jpntaxi_v2 + db_jpntaxi_v4 + db_gsm8_v1 + db_j6_v1 + db_j6_v2 + db_j6_v3 + db_j6_v5 +
    • Config file path
    • Results are in internal data.
    • Total mAP to eval dataset (eval range = 120m): 0.618
| class_name | mAP  | [email protected] | [email protected] | [email protected] | [email protected] |
| ---------- | ---- | ------- | ------- | ------- | ------- |
| car        | 77.9 | 64.4    | 78.3    | 83.4    | 85.6    |
| truck      | 64.5 | 40.8    | 62.5    | 75.0    | 79.6    |
| bus        | 57.5 | 38.1    | 55.1    | 63.7    | 73.1    |
| bicycle    | 53.4 | 45.3    | 54.2    | 56.3    | 57.6    |
| pedestrian | 55.7 | 47.5    | 54.2    | 58.4    | 62.7    |
  • BEVFusion-L base_120m_v1 (encoding)
    • Training dataset: db_jpntaxi_v1 + db_jpntaxi_v2 + db_jpntaxi_v4 + db_gsm8_v1 + db_j6_v1 + db_j6_v2 + db_j6_v3 + db_j6_v5 +
    • Eval dataset: db_jpntaxi_v1 + db_jpntaxi_v2 + db_jpntaxi_v4 + db_gsm8_v1 + db_j6_v1 + db_j6_v2 + db_j6_v3 + db_j6_v5 +
    • Config file path
    • Results are in internal data.
    • Total mAP to eval dataset (eval range = 120m): 0.688
| class_name | mAP  | [email protected] | [email protected] | [email protected] | [email protected] |
| ---------- | ---- | ------- | ------- | ------- | ------- |
| car        | 81.2 | 69.5    | 81.4    | 86.1    | 88.0    |
| truck      | 65.3 | 42.8    | 64.5    | 74.4    | 79.6    |
| bus        | 65.8 | 40.7    | 66.5    | 76.2    | 79.9    |
| bicycle    | 70.2 | 66.4    | 70.9    | 71.4    | 72.0    |
| pedestrian | 61.5 | 54.1    | 59.7    | 63.9    | 68.1    |
  • BEVFusion-L-offline base_120m_v1 (baseline)
    • Training dataset: db_jpntaxi_v1 + db_jpntaxi_v2 + db_jpntaxi_v4 + db_gsm8_v1 + db_j6_v1 + db_j6_v2 + db_j6_v3 + db_j6_v5 +
    • Eval dataset: db_jpntaxi_v1 + db_jpntaxi_v2 + db_jpntaxi_v4 + db_gsm8_v1 + db_j6_v1 + db_j6_v2 + db_j6_v3 + db_j6_v5 +
    • Config file path
    • Results are in internal data.
    • Total mAP to eval dataset (eval range = 120m): 0.66
| class_name | mAP  | [email protected] | [email protected] | [email protected] | [email protected] |
| ---------- | ---- | ------- | ------- | ------- | ------- |
| car        | 85.4 | 77.4    | 85.7    | 88.8    | 89.8    |
| truck      | 57.4 | 36.3    | 55.4    | 66.6    | 71.2    |
| bus        | 51.7 | 38.2    | 52.2    | 57.5    | 58.9    |
| bicycle    | 68.1 | 66.3    | 68.3    | 68.6    | 69.4    |
| pedestrian | 67.2 | 64.3    | 66.1    | 67.8    | 70.7    |
  • BEVFusion-L-offline base_120m_v1 (encoding)
    • Training dataset: db_jpntaxi_v1 + db_jpntaxi_v2 + db_jpntaxi_v4 + db_gsm8_v1 + db_j6_v1 + db_j6_v2 + db_j6_v3 + db_j6_v5 +
    • Eval dataset: db_jpntaxi_v1 + db_jpntaxi_v2 + db_jpntaxi_v4 + db_gsm8_v1 + db_j6_v1 + db_j6_v2 + db_j6_v3 + db_j6_v5 +
    • Config file path
    • Results are in internal data.
    • Total mAP to eval dataset (eval range = 120m): 0.708
| class_name | mAP  | [email protected] | [email protected] | [email protected] | [email protected] |
| ---------- | ---- | ------- | ------- | ------- | ------- |
| car        | 85.6 | 77.6    | 86.0    | 88.9    | 89.8    |
| truck      | 65.6 | 45.8    | 65.7    | 73.9    | 77.0    |
| bus        | 67.7 | 49.2    | 71.6    | 74.5    | 75.6   |
| bicycle    | 67.9 | 66.1    | 68.0    | 68.3    | 69.1    |
| pedestrian | 67.1 | 64.0    | 66.0    | 67.7    | 70.8    |

…irst layer of the sparse encoder, which improves mAP quite a bit

Signed-off-by: Kenzo Lobos-Tsunekawa <[email protected]>
@knzo25 knzo25 requested a review from scepter914 March 13, 2025 07:11
Signed-off-by: Kenzo Lobos-Tsunekawa <[email protected]>
@knzo25 knzo25 self-assigned this Mar 13, 2025
@knzo25 knzo25 marked this pull request as ready for review March 13, 2025 07:30
@scepter914 (Collaborator)

Thank you for the contribution and great work 👍

I'd like to know about the results in more detail, so would you write them up like CenterPoint-ConvNeXtPC in the documentation?
In particular, I want to confirm which datasets you used for training and evaluation.

- Evaluation result with test-dataset: DB JPNTAXI v1.0 + DB JPNTAXI v2.0 + DB JPNTAXI v3.0 + DB GSM8 v1.0 + DB J6 v1.0 (total frames: 1,394):
  - Total mAP (eval range = 120m): 0.686

| class_name | Count  | mAP  | [email protected] | [email protected] | [email protected] | [email protected] |
| ---------- | ------ | ---- | ------- | ------- | ------- | ------- |
| car        | 41,133 | 77.9 | 79.8    | 82.2    | 83.0    | 79.5    |
| truck      | 8,890  | 58.6 | 34.7    | 59.7    | 67.7    | 72.2    |
| bus        | 3,275  | 80.9 | 69.2    | 79.6    | 81.1    | 82.6    |
| bicycle    | 3,635  | 53.2 | 52.3    | 53.4    | 53.5    | 53.6    |
| pedestrian | 25,981 | 64.8 | 62.4    | 64.0    | 65.4    | 67.4    |

In addition to the result, would you upload to S3 model zoo? cc. @SamratThapa120

SamratThapa120 pushed a commit that referenced this pull request Mar 27, 2025
@knzo25 knzo25 (Contributor, Author) commented Apr 30, 2025

@scepter914
Apologies for the excessive delay. Formatted the output 🙏

@Shin-kyoto Shin-kyoto self-requested a review May 9, 2025 11:02
@scepter914 (Collaborator) left a comment

The implementation looks great to me.
I have asked @Shin-kyoto to confirm the operational check with pseudo-labeling, so please wait a bit more.

@SamratThapa120 SamratThapa120 (Collaborator) commented Jun 9, 2025

@scepter914 @Shin-kyoto
FYI, these models were trained with the positional encodings introduced in this PR. I merged the changes locally and trained the models.
