ProxyTransformation: Preshaping Point Cloud Manifold With Proxy Attention For 3D Visual Grounding

Qihang Peng Henry Zheng Gao Huang
Tsinghua University

🔥 News

[2025-02] We released the ProxyTransformation paper. Please check the webpage for a brief introduction!
[2025-02] Our paper was accepted to CVPR 2025! 🥳

⭐ Motivation

In ego-centric 3D visual grounding, the scene point cloud is reconstructed from multi-view images. Noise in the reconstruction process and large-scale downsampling cause this point cloud to lose a large amount of geometric and semantic information. Previous point cloud enhancement methods rely on the point cloud modality alone: they enhance the geometric structure through point cloud features and leave the multi-modal information available in this setting unused. They also often require preprocessing, which does not meet our online requirements. We therefore use multi-modal information for point cloud enhancement: text prompts and multi-view images are used to generate transformations that optimize the structure of the scene point cloud partition by partition.

📖 Framework

In ego-centric 3D visual grounding, we first generate a uniform grid prior in space and perform an initial clustering. Each cluster is then processed by an offset network to obtain deformable offsets for the cluster centers, allowing the initial grid prior to be shifted toward more important regions and enabling clustering to capture the sub-manifold of the target region. We utilize a proxy block based on proxy attention to process multi-modal information, obtaining a transformation matrix and translation vector for each sub-manifold. This optimizes the relative positions and internal structures of the sub-manifolds, which are subsequently fed into downstream structures for feature learning and fusion, ultimately achieving precise localization of the target object in the scene.
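
The geometric part of the pipeline described above can be illustrated with a minimal sketch. It is not the released implementation: the function names are hypothetical, nearest-center assignment stands in for the clustering step, and the offsets, matrices, and translation vectors are plain inputs here, whereas in ProxyTransformation they are predicted by the offset network and the proxy block. Applying each transformation about its cluster center is also an assumption made for this illustration.

```python
import math
from itertools import product


def grid_centers(lo, hi, n):
    """Uniform grid prior: n centers per axis at the cell midpoints of the box [lo, hi]."""
    axes = [[l + (h - l) * (i + 0.5) / n for i in range(n)] for l, h in zip(lo, hi)]
    return [tuple(c) for c in product(*axes)]


def shift_centers(centers, offsets):
    """Shift every center by its offset (offsets are given here, not learned)."""
    return [tuple(c + o for c, o in zip(center, off))
            for center, off in zip(centers, offsets)]


def assign_clusters(points, centers):
    """Assign each point to its nearest center (a stand-in for the clustering step)."""
    return [min(range(len(centers)), key=lambda k: math.dist(p, centers[k]))
            for p in points]


def transform_clusters(points, labels, centers, mats, vecs):
    """Per-cluster affine map about the cluster center: p' = A (p - c) + c + t."""
    out = []
    for p, k in zip(points, labels):
        c, A, t = centers[k], mats[k], vecs[k]
        d = [pi - ci for pi, ci in zip(p, c)]
        out.append(tuple(sum(A[r][j] * d[j] for j in range(3)) + c[r] + t[r]
                         for r in range(3)))
    return out


# Toy scene: three points in the unit cube, a 2 x 2 x 2 grid prior.
points = [(0.1, 0.1, 0.1), (0.9, 0.9, 0.9), (0.2, 0.8, 0.1)]
centers = grid_centers((0.0, 0.0, 0.0), (1.0, 1.0, 1.0), 2)
centers = shift_centers(centers, [(0.0, 0.0, 0.0)] * len(centers))  # zero offsets
labels = assign_clusters(points, centers)

identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
mats = [identity] * len(centers)
vecs = [(0.0, 0.0, 0.0)] * len(centers)
vecs[labels[0]] = (1.0, 0.0, 0.0)  # translate only the cluster containing the first point

print(labels)
print(transform_clusters(points, labels, centers, mats, vecs))
```

Because every cluster carries its own matrix and translation, the relative position of a sub-manifold (through the translation) and its internal structure (through the matrix) can be adjusted independently of the other clusters.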

📝 TODO List

Clean up the codebase and release our code.
Upload our model weights.
Full release and further updates.

📚 Getting Started

Installation

Clone this repository.

git clone https://github.com/pqh22/ProxyTransformation.git
cd ProxyTransformation

Create the environment and install the packages.

conda create -n pt python=3.8 -y
conda activate pt
conda install pytorch==1.11.0 torchvision==0.12.0 torchaudio==0.11.0 cudatoolkit=11.3 -c pytorch
python install.py all

Installing MinkowskiEngine and PyTorch3D is error-prone. If you run into problems, follow their official installation documentation carefully.

Data Preparation

Please refer to EmbodiedScan for instructions on downloading and organizing the data.

The enhanced detection model and the augmented text used for training were proposed by DenseG. If you would like to obtain this data for your own research or experiments, please contact the authors of that work.

Training and evaluation

We provide some scripts for your reference.

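To train our grounding model with the PyTorch distributed launcher, run the following, where {NUM_NODES} is passed to --nproc_per_node and therefore sets the number of processes (typically one per GPU) on each node: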

torchrun --nproc_per_node={NUM_NODES} tools/train.py {CONFIG} --work-dir={WORK_DIR} --launcher="pytorch"

To run inference and evaluate the model, run:

torchrun --nproc_per_node={NUM_NODES} tools/eval.py {CONFIG} --work-dir={WORK_DIR} --launcher="pytorch" --resume {CKPT}
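
If you launch these commands from a Python script (for example, to sweep over configs), the argument list can be assembled as in the sketch below. The helper name build_command and the example config, work directory, and checkpoint paths are hypothetical placeholders; only the flags shown in the commands above are used.

```python
import shlex


def build_command(script, config, work_dir, nproc_per_node, checkpoint=None):
    """Assemble the torchrun argument list used in the commands above."""
    args = [
        "torchrun",
        f"--nproc_per_node={nproc_per_node}",
        script,
        config,
        f"--work-dir={work_dir}",
        "--launcher=pytorch",
    ]
    if checkpoint is not None:
        # Evaluation loads a checkpoint through --resume.
        args += ["--resume", checkpoint]
    return args


# Hypothetical paths: substitute your own config, work directory, and checkpoint.
train_cmd = build_command("tools/train.py", "configs/example.py", "work_dirs/example", 4)
eval_cmd = build_command("tools/eval.py", "configs/example.py", "work_dirs/example", 4,
                         checkpoint="work_dirs/example/latest.pth")

print(shlex.join(train_cmd))
print(shlex.join(eval_cmd))
```

Either list can be passed directly to subprocess.run(cmd, check=True), which avoids shell quoting issues when paths contain spaces.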

📦 Model & Weights

You can download our model from my Google Drive.

📬 Bugs or questions?

If you have any questions about the code or the paper, please feel free to contact Qihang Peng ([email protected]) or open an issue.

🔗 Citation

If you find our work helpful, please cite:

@inproceedings{peng2025proxytransformation,
  title={ProxyTransformation: Preshaping Point Cloud Manifold With Proxy Attention For 3D Visual Grounding},
  author={Peng, Qihang and Zheng, Henry and Huang, Gao},
  booktitle={Proceedings of the Computer Vision and Pattern Recognition Conference},
  pages={24582--24592},
  year={2025}
}

@misc{peng2025proxytransformationpreshapingpointcloud,
      title={ProxyTransformation: Preshaping Point Cloud Manifold With Proxy Attention For 3D Visual Grounding}, 
      author={Qihang Peng and Henry Zheng and Gao Huang},
      year={2025},
      eprint={2502.19247},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2502.19247}, 
}

👏 Acknowledgements

The development of ProxyTransformation is based on EmbodiedScan and DenseG. We deeply appreciate their contribution to the community.

