ViSPLA: Visual Iterative Self-Prompting for Language-Guided 3D Affordance Learning

Hritam Basak, Zhaozheng Yin
Department of Computer Science, Stony Brook University
NeurIPS 2025

ViSPLA redefines affordance detection as an iterative, language-conditioned segmentation task. Unlike single-pass methods, we use geometric feedback from the current prediction to prompt and refine the next step.

Abstract

We address the problem of language-guided 3D affordance prediction, a core capability for embodied agents interacting with unstructured environments. Existing methods often rely on fixed affordance categories or require external expert prompts, limiting their ability to generalize across different objects and interpret multi-step instructions.

In this work, we introduce ViSPLA, a novel iterative self-prompting framework that leverages the intrinsic geometry of predicted masks for continual refinement. We redefine affordance detection as a language-conditioned segmentation task: given a 3D point cloud and language instruction, our model predicts a sequence of refined affordance masks, each guided by differential geometric feedback including Laplacians, normal derivatives, and curvature fields. This feedback is encoded into visual prompts that drive a multi-stage refinement decoder, enabling the model to self-correct and adapt to complex spatial structures.

To further enhance precision and coherence, we introduce Implicit Neural Affordance Fields (INAFS), which define continuous probabilistic regions over the 3D surface without additional supervision. Additionally, our Spectral Convolutional Self-Prompting (SCSP) module operates in the frequency domain of the point cloud, enabling multi-scale refinement that captures both coarse and fine affordance structures. Extensive experiments demonstrate that ViSPLA achieves state-of-the-art results on both seen and unseen objects on two benchmark datasets.

Methodology

Method Overview

The ViSPLA framework consists of three core components that work in a closed-loop refinement cycle:

IDGSP

Iterative Differential Geometry-based Self-Prompting extracts geometric descriptors (Laplacian, Mean Curvature, Normal Derivatives) from the current mask prediction. These act as "visual prompts" for the next iteration, correcting geometric inconsistencies.
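
As a rough illustration of how such mask-derived prompts could be computed (a simplified sketch, not the released implementation: the brute-force k-NN graph, the curvature and normal-derivative proxies, and the helper names geometric_descriptors / estimate_normals are our own assumptions), the descriptors can be estimated from local neighbourhoods of the point cloud and the current soft mask:

import numpy as np

def estimate_normals(points, nbrs):
    # Per-point PCA normal: right singular vector with the smallest singular
    # value of the centred k-nearest-neighbour patch.
    normals = np.zeros_like(points)
    for i, idx in enumerate(nbrs):
        patch = points[idx] - points[idx].mean(axis=0)
        _, _, vt = np.linalg.svd(patch, full_matrices=False)
        normals[i] = vt[-1]
    return normals

def geometric_descriptors(points, mask, k=16):
    # Brute-force k-NN graph (a KD-tree would replace this at scale).
    dist = np.linalg.norm(points[:, None] - points[None], axis=-1)
    nbrs = np.argsort(dist, axis=1)[:, 1:k + 1]
    normals = estimate_normals(points, nbrs)

    # Mask Laplacian: deviation of the soft mask at a point from its
    # neighbourhood average, which flags ragged prediction boundaries.
    lap = mask - mask[nbrs].mean(axis=1)

    # Mean-curvature proxy: offset of each point from its neighbours'
    # centroid, projected onto the point's own normal.
    offset = points[nbrs].mean(axis=1) - points
    curv = np.einsum('ij,ij->i', offset, normals)

    # Normal-derivative proxy: average angular change of normals in the patch.
    ndiff = 1.0 - np.abs(np.einsum('ij,ikj->ik', normals, normals[nbrs])).mean(axis=1)

    # (N, 3) visual prompt for the next refinement iteration.
    return np.stack([lap, curv, ndiff], axis=1)

# Toy usage: random cloud and a random soft mask.
pts = np.random.rand(512, 3)
soft_mask = np.random.rand(512)
prompt = geometric_descriptors(pts, soft_mask)   # shape (512, 3)

In the full loop, this prompt would be concatenated with point features and fed back into the refinement decoder together with the language embedding.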

INAFS

Implicit Neural Affordance Fields learn a continuous probabilistic occupancy function over the 3D surface. This self-supervised module enforces smoothness and topological consistency, filling in gaps left by discrete point sampling.
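
A minimal sketch of what such an implicit field could look like, assuming a coordinate MLP conditioned on a per-shape latent code and a perturbation-based smoothness regularizer (the architecture, latent_dim, and the loss form are illustrative assumptions, not the paper's exact design):

import torch
import torch.nn as nn

class AffordanceField(nn.Module):
    # Hypothetical implicit field: a 3D query coordinate plus a per-shape
    # latent code is mapped to a continuous affordance probability.
    def __init__(self, latent_dim=256, hidden=128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3 + latent_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, xyz, latent):
        # xyz: (N, 3) query points; latent: (latent_dim,) shape code.
        z = latent.expand(xyz.shape[0], -1)
        return torch.sigmoid(self.mlp(torch.cat([xyz, z], dim=-1))).squeeze(-1)

def smoothness_loss(field, xyz, latent, eps=1e-2):
    # Self-supervised regularizer: the field should vary little between a
    # surface point and a small random perturbation of it, encouraging the
    # smooth, topologically coherent regions described above.
    jitter = xyz + eps * torch.randn_like(xyz)
    return ((field(xyz, latent) - field(jitter, latent)) ** 2).mean()

# Toy usage.
field = AffordanceField()
pts = torch.rand(1024, 3)
code = torch.rand(256)
probs = field(pts, code)                 # (1024,) continuous affordances
reg = smoothness_loss(field, pts, code)  # scalar regularizer

Because the field is continuous in space, it can be queried at arbitrary surface locations rather than only at the sampled points, which is what lets it fill gaps left by discrete sampling.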

SCSP

Spectral Convolutional Self-Prompting operates in the frequency domain of the point cloud graph. It separates coarse shapes from fine details, ensuring multi-scale refinement for complex affordance structures.
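
To make the frequency-domain idea concrete, here is a sketch of band-limited reconstruction of the soft mask on a point-cloud graph (our own simplification: the normalized-Laplacian construction, the band boundaries, and the name spectral_prompts are assumptions, not the paper's exact operator):

import numpy as np

def graph_laplacian(points, k=16):
    # Symmetric normalized Laplacian of a k-NN graph over the point cloud.
    n = len(points)
    dist = np.linalg.norm(points[:, None] - points[None], axis=-1)
    nbrs = np.argsort(dist, axis=1)[:, 1:k + 1]
    W = np.zeros((n, n))
    W[np.repeat(np.arange(n), k), nbrs.ravel()] = 1.0
    W = np.maximum(W, W.T)                       # symmetrize
    d_inv_sqrt = 1.0 / np.sqrt(W.sum(axis=1) + 1e-8)
    return np.eye(n) - (d_inv_sqrt[:, None] * W) * d_inv_sqrt[None, :]

def spectral_prompts(points, mask, bands=((0, 32), (32, 128), (128, 512))):
    # Project the soft mask onto the Laplacian eigenbasis and reconstruct it
    # band by band: low bands capture coarse regions, high bands fine detail.
    _, eigvecs = np.linalg.eigh(graph_laplacian(points))
    coeffs = eigvecs.T @ mask                    # graph Fourier transform
    out = []
    for lo, hi in bands:
        hi = min(hi, len(coeffs))
        out.append(eigvecs[:, lo:hi] @ coeffs[lo:hi])
    return np.stack(out, axis=1)                 # (N, num_bands) prompt

# Toy usage.
pts = np.random.rand(512, 3)
soft_mask = np.random.rand(512)
multi_scale = spectral_prompts(pts, soft_mask)   # shape (512, 3)

Each band then serves as one channel of the multi-scale prompt, so coarse and fine corrections can be applied within the same refinement pass.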

Experimental Results

We evaluate ViSPLA on the LASO and PIAD benchmarks. Our method achieves state-of-the-art performance, outperforming baselines such as GEAL and 3D-AffordanceLLM, particularly in the zero-shot (unseen) setting.

Method            Dataset          aIoU (%)   AUC (%)   SIM
GEAL (CVPR'24)    PIAD (Seen)      22.5       85.0      0.601
ViSPLA (Ours)     PIAD (Seen)      23.1       85.8      0.664
GEAL (CVPR'24)    LASO (Unseen)    16.7       80.9      0.567
ViSPLA (Ours)     LASO (Unseen)    17.1       81.5      0.571

Qualitative Results

Comparison with the state of the art: ViSPLA adheres more sharply to affordance boundaries than GEAL.

BibTeX

@inproceedings{basak2025vispla,
  title={ViSPLA: Visual Iterative Self-Prompting for Language-Guided 3D Affordance Learning},
  author={Basak, Hritam and Yin, Zhaozheng},
  booktitle={Proceedings of the 39th Conference on Neural Information Processing Systems (NeurIPS)},
  year={2025}
}