VIEW2SPACE: Studying Multi-View Visual Reasoning from Sparse Observations

ECCV 2026 Accepted Paper

VIEW2SPACE teaser showing sparse drone and robot-dog views aligned for multi-view reasoning.
Sparse observations from heterogeneous viewpoints require matching objects, visibility, and spatial relations that are not fully visible in any single image.

A benchmark and scalable data engine for testing whether vision-language models can integrate sparse, incomplete viewpoints into a shared spatial understanding.

Fucai Ke, Zhixi Cai, Boying Li, Long Chen, Beibei Lin

Weiqing Wang, Pari Delir Haghighi, Gholamreza Haffari, Hamid Rezatofighi

Monash University, Australia · National University of Singapore

Overview

A controlled testbed for sparse multi-view reasoning.

Abstract

Multi-view visual reasoning is essential for intelligent systems that operate from sparse and discrete viewpoints. VIEW2SPACE uses physically grounded simulation to construct diverse, high-fidelity 3D scenes with precise per-view metadata, enabling scalable generation of grounded question-answer pairs for evaluation and training.

The benchmark shows that current vision-language and spatial models remain far from solving sparse multi-view reasoning. Grounded Chain-of-Thought with Visual Evidence substantially improves performance, while difficulty-aware scaling analyses reveal persistent limits in deep compositional reasoning.

3M

grounded QA pairs supported by the engine

2,000

diverse 3D scenes for large-scale generation

40+

scene themes with controllable layouts

3,591

questions in the public VIEW2SPACE-v1 evaluation split

01

Scalable Benchmark

Multi-view spatial reasoning with graded difficulty and strict visual-evidence constraints.

02

Model Analysis

Systematic evaluation reveals weak grounding and shortcut behavior in current models.

03

Grounded CoT

Step-wise visual evidence improves reasoning and synthetic-to-real transfer.

04

Scaling Limits

Compositional reasoning remains difficult even with more data and larger models.

Benchmark

Tasks shrink the answer space until visual grounding is unavoidable.

VIEW2SPACE-v1 spans multiple-choice reasoning, visual counting, and visual grounding detection, with difficulty controlled through view count, reasoning hops, and key-object visibility.

Examples of VIEW2SPACE counting, visual grounding, and multiple-choice task designs.
The benchmark moves from permissive answer formats to grounding-heavy tasks where hallucinated reasoning is least tolerated.

Data Engine

Physically grounded scenes with view-consistent metadata.

The engine curates realistic 3D assets, normalizes object scale, generates diverse themed scenes, renders sparse viewpoints, and extracts metadata including 2D boxes, 3D boxes, object locations, camera poses, occlusion rates, and segmentation.
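The metadata listed above can be pictured as one record per rendered view. The following is a minimal sketch under assumed names: `ObjectAnnotation`, `ViewRecord`, every field name, the quaternion ordering, and the 0.8 occlusion cutoff are hypothetical and are not taken from the released data format.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class ObjectAnnotation:
    """Annotation of one object inside one rendered view (hypothetical schema)."""
    object_id: str                               # ID shared by every view of the scene
    category: str
    box_2d: Tuple[float, float, float, float]    # (x_min, y_min, x_max, y_max) in pixels
    box_3d_center: Tuple[float, float, float]    # world-frame centre of the 3D box
    box_3d_size: Tuple[float, float, float]      # extent of the 3D box along each axis
    occlusion_rate: float                        # 0.0 = fully visible, 1.0 = fully hidden


@dataclass
class ViewRecord:
    """One sparse viewpoint together with its camera pose and object annotations."""
    view_id: str
    camera_position: Tuple[float, float, float]
    camera_rotation: Tuple[float, float, float, float]  # quaternion, assumed (w, x, y, z)
    objects: List[ObjectAnnotation] = field(default_factory=list)

    def visible_ids(self, max_occlusion: float = 0.8) -> List[str]:
        """IDs of objects whose occlusion rate does not exceed the cutoff."""
        return [o.object_id for o in self.objects if o.occlusion_rate <= max_occlusion]


# Illustrative values only; they do not come from any VIEW2SPACE scene.
view = ViewRecord(
    view_id="drone_0",
    camera_position=(0.0, 12.0, 0.0),
    camera_rotation=(1.0, 0.0, 0.0, 0.0),
    objects=[
        ObjectAnnotation("bench_1", "bench", (10, 20, 60, 80), (1.0, 0.4, 2.0), (1.5, 0.8, 0.5), 0.10),
        ObjectAnnotation("bench_2", "bench", (90, 25, 130, 70), (4.0, 0.4, 2.5), (1.5, 0.8, 0.5), 0.95),
    ],
)
print(view.visible_ids())
```

Keeping a shared `object_id` across views is the design choice that matters here: it is what would let a generator relate the same object seen from different cameras.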

RGB: multi-view observations
2D/3D: geometric supervision
Disjoint: train and test scenes
VIEW2SPACE data engine pipeline from asset library and scene design to rendering, metadata extraction, and question generation.
A controllable generation pipeline makes it possible to vary viewpoint geometry, visibility, and reasoning depth while preserving precise supervision.
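With view-consistent metadata, a question generator can derive answers that no single image supports. Below is a toy sketch of cross-view counting, assuming a hypothetical per-view mapping from object ID to category and occlusion rate; the function name, the visibility threshold, and the scene contents are illustrative only and do not describe the actual engine.

```python
from typing import Dict, Tuple

# view_id -> {object_id: (category, occlusion_rate)}; a hypothetical layout.
SceneViews = Dict[str, Dict[str, Tuple[str, float]]]


def count_across_views(views: SceneViews, category: str, max_occlusion: float = 0.8) -> int:
    """Count distinct objects of a category that are visible in at least one view."""
    seen = set()
    for objects in views.values():
        for object_id, (obj_category, occlusion) in objects.items():
            if obj_category == category and occlusion <= max_occlusion:
                seen.add(object_id)  # the set removes duplicates seen from several views
    return len(seen)


# Made-up example scene with two heterogeneous viewpoints.
views: SceneViews = {
    "drone": {
        "chair_1": ("chair", 0.10),
        "chair_2": ("chair", 0.95),   # almost fully hidden from the drone
        "table_1": ("table", 0.00),
    },
    "robot_dog": {
        "chair_2": ("chair", 0.20),   # the same chair, clearly visible from the ground
        "chair_3": ("chair", 0.40),
    },
}

print(count_across_views(views, "chair"))  # 3
```

In this toy scene the drone view alone shows one chair and the robot-dog view alone shows two, so the answer of three is reachable only by matching objects across both views.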
Overview-driven semantic differentiation and tagging pipeline for VIEW2SPACE assets.
Semantic tagging: Overview-driven tags separate asset categories into grounded, queryable semantic differences.
VIEW2SPACE scene thematic categories including aerial, shopping, garden, car park, beach, factory, library, and more.
Everyday themes: More than 40 scene families support broad spatial, semantic, and viewpoint variation.

Results

Sparse multi-view reasoning remains largely unsolved.

VIEW2SPACE-v1 first asks whether current models can reason across sparse views at all. The answer is clear: most general and spatial models remain close to random under strict visual grounding, while training on VIEW2SPACE substantially narrows the gap.

Table 1

Comparison on VIEW2SPACE-v1

Current MLLMs and spatial models struggle on VIEW2SPACE, especially under strict visual grounding. Training helps substantially: Grounded CoT reaches 64.93 MCQ accuracy, 54.99 counting accuracy, and 69.34 grounding mIoU.

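| Method | MCQ ACC | Count MAE | Count ACC | Grounding mIoU | Grounding F1 |
|---|---|---|---|---|---|
| Baseline | | | | | |
| Random (chance) | 28.59 | 11.37 | 3.39 | 0.22 | 0.00 |
| Random (frequency) | 28.59 | 2.46 | 13.05 | 2.01 | 0.27 |
| Open-source MLLMs | | | | | |
| InternVL2.5-8B | 30.93 | 2.04 | 22.50 | 0.79 | 0.00 |
| Mantis-8B (SigLip) | 30.07 | 2.94 | 17.60 | 1.24 | 0.23 |
| DeepSeek-VL2-Small | 29.43 | 2.20 | 24.37 | 2.86 | 0.19 |
| Gemma-3-12B-it | 31.14 | 2.08 | 24.53 | 2.49 | 0.24 |
| Idefics2-8B | 26.64 | 3.92 | 11.34 | 0.39 | 0.00 |
| Qwen2.5-VL-7B | 30.29 | 2.70 | 13.03 | 2.68 | 0.32 |
| Qwen3-VL-2B | 31.43 | 2.63 | 16.34 | 11.20 | 5.07 |
| Qwen3-VL-4B | 35.19 | 2.26 | 21.15 | 16.53 | 18.72 |
| Qwen3-VL-8B | 37.41 | 2.10 | 24.92 | 10.05 | 7.37 |
| Open-source Spatial Models | | | | | |
| Molmo2-4B | 31.36 | 2.21 | 22.17 | 1.17 | 0.13 |
| Molmo2-8B | 34.57 | 5.93 | 21.66 | 1.51 | 0.00 |
| RoboBrain2.0 | 30.79 | 2.22 | 21.49 | 0.00 | 0.00 |
| Spatial-MLLM | 27.93 | 2.10 | 9.98 | 0.00 | 0.00 |
| OpenView | 30.72 | 2.57 | 19.80 | 1.08 | 0.30 |
| MINDCUBE-Aug | 29.00 | 4.52 | 18.10 | 0.00 | 0.00 |
| MINDCUBE-Plain | 30.21 | 4.77 | 19.12 | 0.12 | 0.00 |
| Closed-source MLLMs | | | | | |
| GPT-4o | 33.93 | 1.51 | 29.03 | 3.60 | 0.56 |
| GPT-5-nano | 33.36 | 1.73 | 24.28 | 2.89 | 0.87 |
| GPT-5-mini | 49.00 | 1.20 | 38.71 | 7.42 | 2.77 |
| GPT-5 | 59.86 | 1.31 | 38.10 | 8.18 | 3.43 |
| GPT-5.2 | 37.07 | 1.28 | 36.50 | 8.74 | 2.84 |
| Fine-tuned Models (Qwen3-VL-4B) | | | | | |
| Ours (Instruct-tuning) | 36.50 | 0.71 | 50.08 | 47.58 | 50.07 |
| Ours (Grounded CoT, Direct Answer at Test) | 33.95 | 10.75 | 29.10 | 16.35 | 11.50 |
| Ours (CoT) | 59.57 | 0.68 | 51.43 | 49.77 | 52.34 |
| Ours (Grounded CoT) | 64.93 | 0.58 | 54.99 | 69.34 | 70.92 |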
+52.81

absolute mIoU gain over the base Qwen3-VL-4B (16.53 to 69.34)

69.34

grounding mIoU with Grounded CoT

70.92

grounding F1 with Grounded CoT
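Table 1 reports counting error and accuracy alongside grounding mIoU. A minimal sketch of how such scores are commonly computed, assuming axis-aligned boxes in `(x_min, y_min, x_max, y_max)` form, one predicted box per query, and exact-match counting accuracy. These conventions and all function names are assumptions; the paper's exact protocol may differ. Grounding F1 is left out because it depends on a matching threshold that is not stated here.

```python
from typing import Optional, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def box_iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = max(0.0, a[2] - a[0]) * max(0.0, a[3] - a[1])
    area_b = max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def mean_iou(preds: Sequence[Optional[Box]], targets: Sequence[Box]) -> float:
    """Mean IoU in percent; a missing prediction (None) scores zero."""
    if len(preds) != len(targets) or not targets:
        raise ValueError("preds and targets must be non-empty and equally long")
    scores = [box_iou(p, t) if p is not None else 0.0 for p, t in zip(preds, targets)]
    return 100.0 * sum(scores) / len(scores)


def count_metrics(pred: Sequence[int], true: Sequence[int]) -> Tuple[float, float]:
    """Return (mean absolute error, exact-match accuracy in percent)."""
    if len(pred) != len(true) or not true:
        raise ValueError("pred and true must be non-empty and equally long")
    mae = sum(abs(p - t) for p, t in zip(pred, true)) / len(true)
    acc = 100.0 * sum(p == t for p, t in zip(pred, true)) / len(true)
    return mae, acc


# Toy inputs for illustration; these are not benchmark predictions.
print(box_iou((0, 0, 2, 2), (1, 1, 3, 3)))
print(mean_iou([(0, 0, 2, 2), None], [(0, 0, 2, 2), (0, 0, 1, 1)]))
print(count_metrics([2, 3, 5], [2, 4, 5]))
```

Under these conventions a model that refuses to output a box is penalized exactly like one that outputs a non-overlapping box, which is one way a strict grounding metric can keep scores near zero for models without reliable localization.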

Table 2

Sim-to-real transfer on MINDCUBE-Tiny

The trained model does not merely fit VIEW2SPACE. Without any MINDCUBE fine-tuning, the VIEW2SPACE-trained 4B Grounded CoT model transfers to real-world MINDCUBE-Tiny and surpasses both official MINDCUBE checkpoints overall (70.00 vs. 60.76 and 55.24), while the 2B model (59.92) lands just below MINDCUBE-Plain. The website highlights the key comparison; please see Table 2 in the paper for the full benchmark table.

| Method | Tuning on MINDCUBE | Overall | Rotation | Among | Around |
|---|---|---|---|---|---|
| Baseline | | | | | |
| Random (chance) | - | 32.35 | 36.36 | 32.29 | 30.66 |
| Random (frequency) | - | 33.02 | 38.30 | 32.66 | 35.79 |
| Open-source MLLMs and Spatial Models | | | | | |
| InternVL2.5-8B | w/o Tuning | 18.68 | 36.45 | 47.11 | 26.91 |
| Molmo2-4B | w/o Tuning | 37.33 | 36.00 | 38.83 | 35.75 |
| Molmo2-8B | w/o Tuning | 34.33 | 27.50 | 36.50 | 34.50 |
| Mantis-8B (SigLip) | w/o Tuning | 41.05 | 37.65 | 40.23 | 50.99 |
| DeepSeek-VL2-Small | w/o Tuning | 47.62 | 37.00 | 50.38 | 26.91 |
| Gemma-3-12B-it | w/o Tuning | 46.67 | 38.39 | 48.38 | 34.63 |
| Idefics2-8B | w/o Tuning | 35.86 | 35.15 | 35.94 | 35.49 |
| Qwen2.5-VL-7B | w/o Tuning | 29.26 | 38.76 | 29.50 | 21.35 |
| Qwen3-VL-2B | w/o Tuning | 32.58 | 26.50 | 36.33 | 30.00 |
| Qwen3-VL-4B | w/o Tuning | 36.67 | 41.50 | 34.67 | 37.25 |
| Qwen3-VL-8B | w/o Tuning | 37.42 | 46.00 | 35.83 | 35.50 |
| RoboBrain | w/o Tuning | 37.38 | 35.80 | 38.28 | 29.53 |
| Spatial-MLLM | w/o Tuning | 32.06 | 38.39 | 20.92 | 32.82 |
| OpenView | w/o Tuning | 40.08 | 24.50 | 45.50 | 39.75 |
| Closed-source MLLMs | | | | | |
| Claude-4-Sonnet | w/o Tuning | 44.75 | 48.42 | 44.21 | 47.62 |
| GPT-4o | w/o Tuning | 38.81 | 32.65 | 40.17 | 29.16 |
| GPT-5-nano | w/o Tuning | 38.58 | 44.00 | 38.67 | 35.75 |
| GPT-5-mini | w/o Tuning | 46.64 | 89.00 | 44.67 | 51.75 |
| MINDCUBE Checkpoints | | | | | |
| MINDCUBE-Aug | w/ Tuning | 55.24 | 49.50 | 52.50 | 66.40 |
| MINDCUBE-Plain | w/ Tuning | 60.76 | 47.50 | 62.33 | 67.60 |
| VIEW2SPACE-Trained Models | | | | | |
| Ours 2B (Grounded-CoT) | w/o Tuning | 59.92 | 49.00 | 56.50 | 70.50 |
| Ours 4B (Grounded-CoT) | w/o Tuning | 70.00 | 71.50 | 66.33 | 74.75 |
70.00

overall MINDCUBE-Tiny accuracy without MINDCUBE tuning

+9.24

absolute gain over MINDCUBE-Plain checkpoint
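The headline gain can be re-derived from the Overall column of Table 2. A small sketch of that arithmetic follows; the overall accuracies are copied from the table, while the dictionary layout and the `absolute_gain` helper are illustrative names only.

```python
# Overall MINDCUBE-Tiny accuracy (percent), copied from Table 2.
overall = {
    "MINDCUBE-Aug": 55.24,
    "MINDCUBE-Plain": 60.76,
    "Ours 2B (Grounded-CoT)": 59.92,
    "Ours 4B (Grounded-CoT)": 70.00,
}


def absolute_gain(model: str, baseline: str) -> float:
    """Difference in overall accuracy, in percentage points, rounded to two decimals."""
    return round(overall[model] - overall[baseline], 2)


print(absolute_gain("Ours 4B (Grounded-CoT)", "MINDCUBE-Plain"))  # 9.24
print(absolute_gain("Ours 4B (Grounded-CoT)", "MINDCUBE-Aug"))    # 14.76
print(absolute_gain("Ours 2B (Grounded-CoT)", "MINDCUBE-Plain"))  # -0.84
```

The gains are absolute percentage points rather than relative improvements, which is the convention the highlighted figure uses.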

Scaling Analysis

Scaling helps perception, but harder reasoning and visibility constraints still break models.

Increasing data and model size improves mIoU, especially for lower-difficulty cases. However, the curves fall as reasoning difficulty and key-object visibility difficulty increase, showing that deep compositional multi-view reasoning remains structurally hard.

Scaling curves showing 2B and 4B model performance against reasoning difficulty and key-object visibility difficulty.
Difficulty-aware scaling: larger models and more training data improve performance, but performance declines as reasoning depth and visibility difficulty rise.
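A difficulty-aware analysis of this kind amounts to grouping per-question scores by a difficulty attribute and averaging within each group. The sketch below assumes hypothetical per-question records with `reasoning_hops`, `visibility_level`, and `iou` fields; the field names and the placeholder numbers are invented for illustration and are not results from the paper.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List


def mean_by_level(records: List[dict], key: str, score: str = "iou") -> Dict[int, float]:
    """Average a score within each difficulty level, returned in ascending level order."""
    buckets = defaultdict(list)
    for record in records:
        buckets[record[key]].append(record[score])
    return {level: mean(values) for level, values in sorted(buckets.items())}


# Placeholder records, NOT measured results.
records = [
    {"reasoning_hops": 1, "visibility_level": 1, "iou": 0.8},
    {"reasoning_hops": 1, "visibility_level": 2, "iou": 0.6},
    {"reasoning_hops": 2, "visibility_level": 2, "iou": 0.5},
    {"reasoning_hops": 3, "visibility_level": 3, "iou": 0.2},
    {"reasoning_hops": 3, "visibility_level": 1, "iou": 0.4},
]

print(mean_by_level(records, "reasoning_hops"))
print(mean_by_level(records, "visibility_level"))
```

Running the same grouping for each model size and training-set size would yield one curve per configuration over the difficulty axis.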

Resources

Public release entry points.

Use the evaluation split, training data, and released checkpoint to reproduce or extend the benchmark.

Citation

Cite VIEW2SPACE.

@article{ke2026view2space,
  title={VIEW2SPACE: Studying Multi-View Visual Reasoning from Sparse Observations},
  author={Ke, Fucai and Cai, Zhixi and Li, Boying and Chen, Long and Lin, Beibei and Wang, Weiqing and Haghighi, Pari Delir and Haffari, Gholamreza and Rezatofighi, Hamid},
  journal={arXiv preprint arXiv:2603.16506},
  year={2026}
}