$\color{orange}{\textbf{{[NeurIPS 2025]}}}$ Fine-Grained Preference Optimization Improves Spatial Reasoning in VLMs
Yifan Shen<sup>1</sup>, Yuanzhe Liu<sup>2</sup>, Jingyuan Zhu<sup>2</sup>, Xu Cao<sup>1</sup>, Xiaofeng Zhang<sup>3</sup>, Yixiao He<sup>1</sup>, Wenming Ye<sup>4</sup>, James Rehg<sup>1</sup>, Ismini Lourentzou<sup>1</sup>
Current Vision-Language Models (VLMs) struggle with fine-grained spatial reasoning, particularly when multi-step logic and precise spatial alignment are required. In this work, we introduce SpatialReasoner-R1, a novel VLM designed to address these limitations. First, we propose Multi-LLM Guided Monte Carlo Tree Search (M3CTS) together with Fine-Grained Spatial Rewards to construct a high-quality training dataset. Second, we train our model with fine-grained Direct Preference Optimization (fDPO), which introduces segment-specific preference granularity for descriptive grounding and logical reasoning, achieving an average improvement of 4.1% over standard DPO on spatial quality tasks and a 9.0% boost on spatial quantity tasks. To address the scarcity of multi-step spatial reasoning data, M3CTS enables collaborative exploration of diverse reasoning paths, significantly enriching spatial comprehension and logical coherence. Empirical evaluations demonstrate that SpatialReasoner-R1 sets a new state of the art on SpatialRGPT-Bench, outperforming the strongest baseline by 9.4% in average accuracy, while maintaining competitive performance on general vision-language tasks.
🛠️ Installation
- Please install Python and PyTorch first:
```shell
conda create -n vlm python=3.10
conda activate vlm
conda install pytorch==2.3.1 torchvision==0.18.1 pytorch-cuda=12.1 cuda -c pytorch -c "nvidia/label/cuda-12.1.0" -c "nvidia/label/cuda-12.1.1"
```
- Install mmcv. We use version 2.1.0 by default:
```shell
pip install mmcv==2.1.0 -f https://download.openmmlab.com/mmcv/dist/cu121/torch2.3/index.html
```
- Install the other dependencies:
```shell
pip install -r requirements.txt
```
Please make sure to use the correct versions of `transformers` and `peft`.
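After installation, you can quickly confirm that the key dependencies are importable with a short Python snippet (a minimal sketch; the package list mirrors the steps above and may not cover everything in `requirements.txt`):

```python
import importlib.util

def is_importable(pkg: str) -> bool:
    """Return True if `pkg` can be found in the current environment."""
    return importlib.util.find_spec(pkg) is not None

# Core packages installed in the steps above
for pkg in ["torch", "torchvision", "mmcv", "transformers", "peft"]:
    print(f"{pkg}: {'OK' if is_importable(pkg) else 'MISSING'}")
```

Any package reported as `MISSING` should be reinstalled before moving on to training.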
🧩 Model Preparation
Download the pretrained models and place them in the `./pretrained` directory:
- Download the base model from the Sa2VA Model Zoo.
- Download the remaining models from the InternVL2.5 Hugging Face collections.
🚀 Training Script
For training:
```shell
bash tools/dist.sh train projects/llava_sam2/configs/my_4b.py 2 --deepspeed deepspeed_zero2
bash tools/dist.sh train projects/llava_sam2/configs/my_4b_preference.py 2 --deepspeed deepspeed_zero2
```
⭐ If you find this work useful, please cite our paper:
```bibtex
@article{shen2025fine,
  title={Fine-Grained Preference Optimization Improves Spatial Reasoning in VLMs},
  author={Shen, Yifan and Liu, Yuanzhe and Zhu, Jingyuan and Cao, Xu and Zhang, Xiaofeng and He, Yixiao and Ye, Wenming and Rehg, James Matthew and Lourentzou, Ismini},
  journal={arXiv preprint arXiv:2506.21656},
  year={2025}
}
```


