feat(BEVFusion): added a positional encoding-like feature extractor #16
base: main
Conversation
…irst layer of the sparse encoder, which improves mAP quite a bit Signed-off-by: Kenzo Lobos-Tsunekawa <[email protected]>
Thank you for the contribution and great work 👍 I would like to know the results in detail, so could you write them up like CenterPoint-ConvNeXtPC in the documentation?
In addition to the results, could you upload the model to the S3 model zoo? cc @SamratThapa120
Signed-off-by: scepter914 <[email protected]>
@scepter914
The implementation looks great to me.
I have asked @Shin-kyoto to confirm the operational check with pseudo-labeling, so please wait a bit more.
@scepter914 @Shin-kyoto
Summary
Added a positional encoding-like feature extractor before the sparse encoder.
The results show a considerable mAP improvement, mainly for the online model.
Change point
The fundamentals behind this idea can be seen from different angles.
A good study was performed in the original NeRF paper and is explained in more detail here:
https://arxiv.org/pdf/2006.10739
From a more engineering-oriented perspective, the input signal lies in the range [-120, 120] m. The sparse convolutions, as a consequence of using a 3x3 kernel, see at most 0.17 m × 3 of difference between the elements in each operation, which makes it difficult to learn features due to the small signal-to-reference ratio. The reference itself is not strictly needed, since it is already encoded in the coordinates of the convolution. Simply applying multiple sin/cos functions allows the first layer to learn higher-frequency features, in theory enabling a more discriminative detector. This can also be framed in terms of kernel theory and Fourier series, but that is probably overkill here.
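As a minimal sketch of the idea (not the PR's actual implementation — the function name, number of frequencies, and normalization range are illustrative assumptions):

```python
import numpy as np

def positional_encoding(coords, num_freqs=4, max_range=120.0):
    """Map coordinates in [-max_range, max_range] to multi-frequency
    sin/cos features, Fourier-feature style. `num_freqs` and the
    normalization are illustrative choices, not the PR's defaults."""
    coords = np.asarray(coords, dtype=np.float64) / max_range  # normalize to [-1, 1]
    freqs = (2.0 ** np.arange(num_freqs)) * np.pi              # pi, 2*pi, 4*pi, ...
    angles = coords[..., None] * freqs                         # (..., D, F)
    enc = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)  # (..., D, 2F)
    return enc.reshape(*coords.shape[:-1], -1)                 # (..., D * 2F)

points = np.array([[10.0, -35.5, 1.2]])  # one (x, y, z) point in meters
features = positional_encoding(points)
print(features.shape)  # (1, 24)
```

These extra channels would be concatenated to the raw voxel features before the first sparse-convolution layer, so the 3x3 kernel sees high-frequency variation between neighboring voxels instead of near-constant absolute coordinates.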
Note
Although this does not impact other projects, and the new feature is off by default, the idea itself can be applied to other models. That said, it is already applied to some degree in the pillar features of CenterPoint and TransFusion.
Tests performed
*Note: these are the eval results of the best epoch (will replace them with test results later)
TIERIV INTERNAL LINK
Link to the data and evaluation results