
# $\color{orange}{\textbf{[NeurIPS 2025]}}$ Fine-Grained Preference Optimization Improves Spatial Reasoning in VLMs

Yifan Shen<sup>1</sup>, Yuanzhe Liu<sup>2</sup>, Jingyuan Zhu<sup>2</sup>, Xu Cao<sup>1</sup>, Xiaofeng Zhang<sup>3</sup>, Yixiao He<sup>1</sup>, Wenming Ye<sup>4</sup>, James Rehg<sup>1</sup>, Ismini Lourentzou<sup>1</sup>

*Teaser figure*


## Overview

Current Vision-Language Models (VLMs) struggle with fine-grained spatial reasoning, particularly when multi-step logic and precise spatial alignment are required. In this work, we introduce SpatialReasoner-R1, a novel VLM designed to address these limitations. First, we propose Multi-LLM Guided Monte Carlo Tree Search (M3CTS) together with fine-grained spatial rewards to construct a high-quality training dataset; by enabling collaborative exploration of diverse reasoning paths, M3CTS addresses the scarcity of multi-step spatial reasoning data and significantly enriches spatial comprehension and logical coherence. Second, we train the model with fine-grained Direct Preference Optimization (fDPO), which introduces segment-specific preference granularity for descriptive grounding and logical reasoning, yielding an average improvement of 4.1% over standard DPO on spatial quality tasks and a 9.0% boost on spatial quantity tasks. Empirical evaluations demonstrate that SpatialReasoner-R1 sets a new state of the art on SpatialRGPT-Bench, outperforming the strongest baseline by 9.4% in average accuracy while maintaining competitive performance on general vision-language tasks.
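For intuition, the minimal PyTorch sketch below shows one way segment-specific preference optimization could look: a standard DPO objective applied per response segment with a segment-dependent scaling term. The segment split, the per-segment beta values, and the function name are illustrative assumptions, not the repository's actual fDPO implementation.

```python
# Minimal sketch of a segment-level DPO loss (illustrative only;
# segment types and beta values are assumptions, not the paper's code).
import torch
import torch.nn.functional as F

def segment_dpo_loss(policy_logps_chosen, policy_logps_rejected,
                     ref_logps_chosen, ref_logps_rejected,
                     segment_betas):
    """Each *_logps_* tensor holds summed token log-probs per segment,
    shape (num_segments,). segment_betas holds one beta per segment,
    e.g. different values for descriptive vs. reasoning segments."""
    # Log-ratio of policy to reference model, per segment.
    chosen_rewards = policy_logps_chosen - ref_logps_chosen
    rejected_rewards = policy_logps_rejected - ref_logps_rejected
    # Bradley-Terry style preference margin, scaled per segment.
    margins = segment_betas * (chosen_rewards - rejected_rewards)
    # Standard DPO objective applied segment by segment, then averaged.
    return -F.logsigmoid(margins).mean()

# Hypothetical usage: 3 segments (description, reasoning, final answer).
betas = torch.tensor([0.1, 0.3, 0.3])  # assumed per-segment weights
loss = segment_dpo_loss(torch.tensor([-12.0, -30.0, -2.0]),
                        torch.tensor([-13.0, -28.0, -2.5]),
                        torch.tensor([-12.5, -31.0, -2.1]),
                        torch.tensor([-13.2, -29.0, -2.4]),
                        betas)
```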


## Training

### 🛠️ Installation

1. Install Python and PyTorch:

```bash
conda create -n vlm python=3.10
conda activate vlm
conda install pytorch==2.3.1 torchvision==0.18.1 pytorch-cuda=12.1 cuda -c pytorch -c "nvidia/label/cuda-12.1.0" -c "nvidia/label/cuda-12.1.1"
```

2. Install mmcv (version 2.1.0 by default):

```bash
pip install mmcv==2.1.0 -f https://download.openmmlab.com/mmcv/dist/cu121/torch2.3/index.html
```

3. Install the remaining dependencies:

```bash
pip install -r requirements.txt
```

Please make sure to use the correct versions of transformers and peft.
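As a quick sanity check after installation, the short snippet below prints the installed versions; the expected values in the comments are simply the versions pinned above, and the exact transformers/peft versions depend on requirements.txt.

```python
# Quick environment sanity check for the pinned versions above.
import torch
import torchvision
import mmcv
import transformers
import peft

print("torch:", torch.__version__)              # expect 2.3.1
print("torchvision:", torchvision.__version__)  # expect 0.18.1
print("CUDA available:", torch.cuda.is_available())
print("mmcv:", mmcv.__version__)                # expect 2.1.0
print("transformers:", transformers.__version__)
print("peft:", peft.__version__)
```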

### 🧩 Model Preparation

Download the required pretrained models and place them in the `./pretrained` directory:

- The InternVL2.5 checkpoints are available from the InternVL2.5 Hugging Face collections.
- The base model is available from the Sa2VA Model Zoo.
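If you prefer to script the downloads, the sketch below uses `huggingface_hub`; this is only one possible approach, and the repository ID shown is an example that should be replaced with the checkpoints your configuration actually needs.

```python
# Example download helper using huggingface_hub (an assumption; any
# download method works). Replace the repo IDs with the checkpoints
# required for your configuration.
from huggingface_hub import snapshot_download

for repo_id in [
    "OpenGVLab/InternVL2_5-4B",   # example InternVL2.5 checkpoint
    # add the Sa2VA base model repo here
]:
    snapshot_download(
        repo_id=repo_id,
        local_dir=f"./pretrained/{repo_id.split('/')[-1]}",
    )
```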

### 🚀 Training Script

For training:

```bash
bash tools/dist.sh train projects/llava_sam2/configs/my_4b.py 2 --deepspeed deepspeed_zero2

bash tools/dist.sh train projects/llava_sam2/configs/my_4b_preference.py 2 --deepspeed deepspeed_zero2
```

## References

⭐ If you find this work useful, please cite our paper:

```bibtex
@article{shen2025fine,
  title={Fine-Grained Preference Optimization Improves Spatial Reasoning in VLMs},
  author={Shen, Yifan and Liu, Yuanzhe and Zhu, Jingyuan and Cao, Xu and Zhang, Xiaofeng and He, Yixiao and Ye, Wenming and Rehg, James Matthew and Lourentzou, Ismini},
  journal={arXiv preprint arXiv:2506.21656},
  year={2025}
}
```
