GREAT: Geometry-Intention Collaborative Inference for Open-Vocabulary 3D Object Affordance Grounding

1 University of Science and Technology of China · 2 Northeastern University · 3 Institute of Artificial Intelligence, Hefei Comprehensive National Science Center

Abstract

Open-Vocabulary 3D object affordance grounding aims to anticipate "action possibility" regions on 3D objects given arbitrary instructions, which is crucial for robots to generically perceive real-world scenarios and respond to operational changes. Existing methods focus on combining images or language that depict interactions with 3D geometries to introduce external interaction priors. However, they remain confined to a limited semantic space because they fail to leverage the implied invariant geometries and potential interaction intentions. Normally, humans address complex tasks through multi-step reasoning and respond to diverse situations by leveraging associative and analogical thinking. In light of this, we propose GREAT (GeometRy-intEntion collAboraTive inference) for Open-Vocabulary 3D Object Affordance Grounding, a novel framework that mines objects' invariant geometric attributes and reasons analogically over potential interaction scenarios to form affordance knowledge, then fully combines this knowledge with both geometries and visual contents to ground 3D object affordance. Besides, we introduce the Point Image Affordance Dataset v2 (PIADv2), currently the largest 3D object affordance dataset, to support the task. Extensive experiments demonstrate the effectiveness and superiority of GREAT.



Seen Partition

The training and test sets share the same objects and affordances.


Unseen Object Partition

Affordances are consistent between the training and test sets, but some objects in the test set do not appear in the training set.


Unseen Affordance Partition

Affordances in the test set are not present in the training set, and neither are certain objects: "backpack" and "suitcase". A minimal split sketch is shown below.
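Concretely, the three partitions can be viewed as filters over (object, affordance) labels. The sketch below is a simplified illustration with hypothetical field and function names; the actual partitions may also keep seen samples in the test sets.

```python
# Minimal sketch of how the three partitions could be constructed.
# "Sample", its fields, and split() are hypothetical, not the dataset's actual code.
from dataclasses import dataclass

@dataclass
class Sample:
    object_cls: str      # e.g. "backpack", "suitcase", "chair"
    affordance_cls: str  # affordance category label

def split(samples, heldout_objects=frozenset(), heldout_affordances=frozenset()):
    """Seen: both held-out sets empty. Unseen Object: held-out objects only.
    Unseen Affordance: held-out affordances plus objects that only occur with them."""
    train, test = [], []
    for s in samples:
        if s.object_cls in heldout_objects or s.affordance_cls in heldout_affordances:
            test.append(s)
        else:
            train.append(s)
    return train, test
```

For the Unseen Affordance partition, the held-out affordance categories together with the objects "backpack" and "suitcase" would be passed as the held-out sets.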

Difference and Motivation

Difference and Motivation. (a) Object affordance grounding in the seen setting. (b) Open-Vocabulary Affordance Grounding (OVAG) with previous paradigms. (c) When observing interaction images, people brainstorm through memory representations, drawing on prior interaction experiences to perform analogical reasoning and infer appropriate actions. (d) OVAG with our geometry-intention collaborative inference with chain-of-thought, which step by step identifies the interaction part, extracts geometric attributes, reasons about the corresponding interaction, and brainstorms underlying interaction intentions, jointly grounding the 3D object affordance.

Method Overview

GREAT Pipeline. Initially, the framework extracts the respective features Fi and Fp through modality-specific backbones; the results of MHACoT inference are then encoded and aggregated to form the object/affordance knowledge features To and Ta. Next, GREAT utilizes CMAFM to inject the knowledge into Fp, while Fi is directly fused with it, obtaining the fusion features Ftp and Fti. Eventually, Ftp and Fti are sent to the decoder to ground the 3D object affordance ϕ.
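To make the data flow concrete, here is a minimal, shape-level PyTorch sketch of that pipeline. The backbones are assumed to have already produced Fi, Fp, To, and Ta, and the CMAFM/decoder internals below are simple placeholders rather than the released implementation.

```python
import torch
import torch.nn as nn

class GREATSketch(nn.Module):
    """Shape-level sketch of the pipeline in the caption above.
    Module internals and dimensions are placeholders, not the released code."""
    def __init__(self, d=512, heads=8):
        super().__init__()
        # CMAFM stand-in: cross-attention that injects knowledge into point features.
        self.cmafm = nn.MultiheadAttention(d, heads, batch_first=True)
        self.fuse_img = nn.Linear(2 * d, d)   # F_i directly fused with pooled knowledge
        self.decoder = nn.Sequential(nn.Linear(2 * d, d), nn.ReLU(), nn.Linear(d, 1))

    def forward(self, F_i, F_p, T_o, T_a):
        # F_i: (B, Ni, d) image features, F_p: (B, Np, d) point features,
        # T_o / T_a: (B, Nt, d) object / affordance knowledge features from MHACoT.
        K = torch.cat([T_o, T_a], dim=1)                        # aggregated knowledge
        F_tp, _ = self.cmafm(F_p, K, K)                         # knowledge -> points (F_tp)
        pooled = K.mean(dim=1, keepdim=True).expand_as(F_i)
        F_ti = self.fuse_img(torch.cat([F_i, pooled], dim=-1))  # fused image features (F_ti)
        ctx = F_ti.mean(dim=1, keepdim=True).expand_as(F_tp)
        phi = torch.sigmoid(self.decoder(torch.cat([F_tp, ctx], dim=-1)))
        return phi                                              # (B, Np, 1) per-point affordance
```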

PIADv2 Dataset

PIADv2 Dataset. (a) Representative data examples from PIADv2; the red regions in the point clouds are the affordance annotations. (b) Category distribution in PIADv2. (c) Confusion matrix between affordance and object categories, where the horizontal axis represents object categories and the vertical axis represents affordance categories. (d) The ratio of images to point clouds in each affordance category.
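The ratio in (d) suggests images and point clouds are matched at the category level rather than one-to-one. A hypothetical loader, whose file layout and field names are assumptions rather than the official release format, might look like:

```python
import numpy as np
from torch.utils.data import Dataset

class PIADv2Sketch(Dataset):
    """Hypothetical PIADv2-style loader; the on-disk format here is assumed, not official."""
    def __init__(self, pairs):
        # Each entry pairs an interaction image with a point cloud of the same
        # object/affordance category; counts per category need not match (see panel (d)).
        self.pairs = pairs

    def __len__(self):
        return len(self.pairs)

    def __getitem__(self, idx):
        img_path, pc_path = self.pairs[idx]
        image = np.load(img_path)               # interaction image, assumed stored as (H, W, 3)
        cloud = np.load(pc_path)                # assumed (N, 4): xyz + per-point affordance label
        xyz, label = cloud[:, :3], cloud[:, 3]  # label marks the red annotated region
        return image, xyz, label
```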

Experiment Results

Visualization Results Compared with SOTA Methods. The first row shows the interaction images and the last row shows the ground-truth 3D object affordance on the point clouds. The left, middle, and right partitions correspond to visual comparisons for different 3D object affordances in the Seen, Unseen Object, and Unseen Affordance partitions, respectively. The depth of red represents the affordance probability.
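For reference, mapping predicted per-point probabilities to the red shading used in such visualizations can be done with a simple white-to-red blend; the helper below is illustrative only and not the paper's rendering code.

```python
import numpy as np

def affordance_to_rgb(prob):
    """Blend white -> red by per-point affordance probability in [0, 1];
    deeper red means higher probability (illustrative, not the paper's renderer)."""
    prob = np.clip(np.asarray(prob, dtype=np.float32), 0.0, 1.0)[:, None]   # (N, 1)
    white = np.array([1.0, 1.0, 1.0], dtype=np.float32)
    red = np.array([1.0, 0.0, 0.0], dtype=np.float32)
    return (1.0 - prob) * white + prob * red                                # (N, 3) RGB
```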

Multiple Objects

Multiple Objects. GREAT can anticipate the affordances of distinct object categories from the same interaction image.


Multiple Affordances

Multiple Affordances. GREAT can anticipate the affordances of the same object from distinct interaction images.

BibTeX


@article{GREAT_Shao,
  title={GREAT: Geometry-Intention Collaborative Inference for Open-Vocabulary 3D Object Affordance Grounding},
  author={Shao, Yawen and Zhai, Wei and Yang, Yuhang and Luo, Hongchen and Cao, Yang and Zha, Zheng-Jun},
  journal={arXiv preprint arXiv:2411.19626},
  year={2024}
}