$\color{orange}{\textbf{{[NeurIPS 2025]}}}$ Fine-Grained Preference Optimization Improves Spatial Reasoning in VLMs
Yifan Shen<sup>1</sup>, Yuanzhe Liu<sup>2</sup>, Jingyuan Zhu<sup>2</sup>, Xu Cao<sup>1</sup>, Xiaofeng Zhang<sup>3</sup>, Yixiao He<sup>1</sup>, Wenming Ye<sup>4</sup>, James Rehg<sup>1</sup>, Ismini Lourentzou<sup>1</sup>
Current Vision-Language Models (VLMs) struggle with fine-grained spatial reasoning, particularly when multi-step logic and precise spatial alignment are required. In this work, we introduce SpatialReasoner-R1, a novel VLM designed to address these limitations. First, we propose Multi-LLM Guided Monte Carlo Tree Search (M3CTS) together with Fine-Grained Spatial Rewards to construct a high-quality training dataset. Second, we train our model with fine-grained Direct Preference Optimization (fDPO), which introduces segment-specific preference granularity for descriptive grounding and logical reasoning, achieving an average improvement of 4.1% over standard DPO on spatial quality tasks and a 9.0% boost on spatial quantity tasks. To address the scarcity of multi-step spatial reasoning data, M3CTS enables collaborative exploration of diverse reasoning paths, significantly enriching spatial comprehension and logical coherence. Empirical evaluations demonstrate that SpatialReasoner-R1 sets a new state of the art on SpatialRGPT-Bench, outperforming the strongest baseline by 9.4% in average accuracy, while maintaining competitive performance on general vision-language tasks.
🛠️ Installation
- Please install Python and PyTorch first:
```shell
conda create -n vlm python=3.10
conda activate vlm
conda install pytorch==2.3.1 torchvision==0.18.1 pytorch-cuda=12.1 cuda -c pytorch -c "nvidia/label/cuda-12.1.0" -c "nvidia/label/cuda-12.1.1"
```
- Install mmcv. We use version 2.1.0 by default:
```shell
pip install mmcv==2.1.0 -f https://download.openmmlab.com/mmcv/dist/cu121/torch2.3/index.html
```
- Install the other dependencies:
```shell
pip install -r requirements.txt
```
Please make sure to use the correct versions of `transformers` and `peft`.
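After installation, you can quickly confirm that the key dependencies are importable with a short Python snippet (a minimal sketch; the package list mirrors the steps above and may not cover everything in `requirements.txt`):

```python
import importlib.util

def is_importable(pkg: str) -> bool:
    """Return True if `pkg` can be found in the current environment."""
    return importlib.util.find_spec(pkg) is not None

# Core packages installed in the steps above
for pkg in ["torch", "torchvision", "mmcv", "transformers", "peft"]:
    print(f"{pkg}: {'OK' if is_importable(pkg) else 'MISSING'}")
```

Any package reported as `MISSING` should be reinstalled before moving on to training.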
🧩 Model Preparation
Download the pretrained models and place them in the `./pretrained` directory:
- Download the base model from the Sa2VA Model Zoo.
- Download the remaining models from the InternVL2.5 Hugging Face collections.
🚀 Training Script
For training:
```shell
bash tools/dist.sh train projects/llava_sam2/configs/my_4b.py 2 --deepspeed deepspeed_zero2
bash tools/dist.sh train projects/llava_sam2/configs/my_4b_preference.py 2 --deepspeed deepspeed_zero2
```
⭐ If you find this work useful, please cite our paper:
```bibtex
@article{shen2025fine,
  title={Fine-Grained Preference Optimization Improves Spatial Reasoning in VLMs},
  author={Shen, Yifan and Liu, Yuanzhe and Zhu, Jingyuan and Cao, Xu and Zhang, Xiaofeng and He, Yixiao and Ye, Wenming and Rehg, James Matthew and Lourentzou, Ismini},
  journal={arXiv preprint arXiv:2506.21656},
  year={2025}
}
```


