VIEW2SPACE: Studying Multi-View Visual Reasoning from Sparse Observations

ECCV 2026 Accepted Paper

VIEW2SPACE teaser showing sparse drone and robot-dog views aligned for multi-view reasoning. — Sparse observations from heterogeneous viewpoints require matching objects, visibility, and spatial relations that are not fully visible in any single image.

A benchmark and scalable data engine for testing whether vision-language models can integrate sparse, incomplete viewpoints into a shared spatial understanding.

Fucai Ke, Zhixi Cai, Boying Li, Long Chen, Beibei Lin

Weiqing Wang, Pari Delir Haghighi, Gholamreza Haffari, Hamid Rezatofighi

Monash University, Australia · National University of Singapore


Overview

A controlled testbed for sparse multi-view reasoning.

Abstract

Multi-view visual reasoning is essential for intelligent systems that operate from sparse and discrete viewpoints. VIEW2SPACE uses physically grounded simulation to construct diverse, high-fidelity 3D scenes with precise per-view metadata, enabling scalable generation of grounded question-answer pairs for evaluation and training.

The benchmark shows that current vision-language and spatial models remain far from solving sparse multi-view reasoning. Grounded Chain-of-Thought with Visual Evidence substantially improves performance, while difficulty-aware scaling analyses reveal persistent limits in deep compositional reasoning.

grounded QA pairs supported by the engine

2,000 diverse 3D scenes for large-scale generation

40+ scene themes with controllable layouts

3,591 questions in the public VIEW2SPACE-v1 evaluation split

Scalable Benchmark

Multi-view spatial reasoning with graded difficulty and strict visual-evidence constraints.

Model Analysis

Systematic evaluation reveals weak grounding and shortcut behavior in current models.

Grounded CoT

Step-wise visual evidence improves reasoning and synthetic-to-real transfer.

Scaling Limits

Compositional reasoning remains difficult even with more data and larger models.

Benchmark

Tasks shrink the answer space until visual grounding is unavoidable.

VIEW2SPACE-v1 spans multiple-choice reasoning, visual counting, and visual grounding detection, with difficulty controlled through view count, reasoning hops, and key-object visibility.

Examples of VIEW2SPACE counting, visual grounding, and multiple-choice task designs. — The benchmark moves from permissive answer formats to grounding-heavy tasks where hallucinated reasoning is least tolerated.

Data Engine

Physically grounded scenes with view-consistent metadata.

The engine curates realistic 3D assets, normalizes object scale, generates diverse themed scenes, renders sparse viewpoints, and extracts metadata including 2D boxes, 3D boxes, object locations, camera poses, occlusion rates, and segmentation.

RGB multi-view observations

2D/3D geometric supervision

Disjoint train and test scenes

VIEW2SPACE data engine pipeline from asset library and scene design to rendering, metadata extraction, and question generation. — A controllable generation pipeline makes it possible to vary viewpoint geometry, visibility, and reasoning depth while preserving precise supervision.

Rendered RGB, mesh, depth, 2D boxes, 3D boxes, and segmentation metadata samples. — **Metadata samples** RGB, mesh, depth, 2D/3D boxes, occlusion-aware annotations, and segmentation supervision.

Overview-driven semantic differentiation and tagging pipeline for VIEW2SPACE assets. — **Semantic tagging** Overview-driven tags separate asset categories into grounded, queryable semantic differences.

VIEW2SPACE scene thematic categories including aerial, shopping, garden, car park, beach, factory, library, and more. — **Everyday themes** More than 40 scene families support broad spatial, semantic, and viewpoint variation.
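To make the per-view metadata more concrete, here is a minimal sketch of how a 2D box and an occlusion rate could be derived from a 3D box and a camera. It is an illustration under stated assumptions, not the VIEW2SPACE engine's code: it assumes an axis-aligned box already expressed in the camera frame, an ideal pinhole camera with a single focal length, and an occlusion rate defined as the hidden fraction of an object's unoccluded pixel footprint. All function and parameter names are hypothetical.

```python
from itertools import product


def box_corners(center, size):
    """Return the 8 corners of an axis-aligned 3D box (camera frame, z forward)."""
    cx, cy, cz = center
    sx, sy, sz = size
    return [
        (cx + dx * sx / 2, cy + dy * sy / 2, cz + dz * sz / 2)
        for dx, dy, dz in product((-1, 1), repeat=3)
    ]


def project_point(point, focal, principal):
    """Ideal pinhole projection of a camera-frame point to pixel coordinates."""
    x, y, z = point
    if z <= 0:
        raise ValueError("point is behind the camera")
    return (focal * x / z + principal[0], focal * y / z + principal[1])


def box_3d_to_2d(center, size, focal, principal):
    """Tightest 2D box (x_min, y_min, x_max, y_max) around the projected corners."""
    pixels = [project_point(c, focal, principal) for c in box_corners(center, size)]
    us = [u for u, _ in pixels]
    vs = [v for _, v in pixels]
    return (min(us), min(vs), max(us), max(vs))


def occlusion_rate(full_pixels, visible_pixels):
    """Hidden fraction of an object's unoccluded footprint (assumed definition).

    Both arguments are sets of (row, col) pixel coordinates for one object in one view.
    """
    if not full_pixels:
        return 0.0
    return 1.0 - len(full_pixels & visible_pixels) / len(full_pixels)


# Hypothetical example: a 2 m cube centered 4 m in front of a 640x480 camera.
box_2d = box_3d_to_2d(center=(0.0, 0.0, 4.0), size=(2.0, 2.0, 2.0),
                      focal=500.0, principal=(320.0, 240.0))
```

The nearer face of the cube dominates the projected extent, which is why the 2D box is computed from all eight corners rather than from the box center and size alone.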

Results

Sparse multi-view reasoning remains largely unsolved.

VIEW2SPACE-v1 first asks whether current models can reason across sparse views at all. The answer is clear: most general and spatial models remain close to random under strict visual grounding, while training on VIEW2SPACE substantially narrows the gap.

Table 1

Comparison on VIEW2SPACE-v1

Current MLLMs and spatial models struggle on VIEW2SPACE, especially under strict visual grounding. Training helps substantially: Grounded CoT reaches 64.93 MCQ accuracy, 54.99 counting accuracy, and 69.34 grounding mIoU.

Method	MCQ ACC	Count MAE	Count ACC	Grounding mIoU	Grounding F1
Baseline
Random (chance)	28.59	11.37	3.39	0.22	0.00
Random (frequency)	28.59	2.46	13.05	2.01	0.27
Open-source MLLMs
InternVL2.5-8B	30.93	2.04	22.50	0.79	0.00
Mantis-8B (SigLip)	30.07	2.94	17.60	1.24	0.23
DeepSeek-VL2-Small	29.43	2.20	24.37	2.86	0.19
Gemma-3-12B-it	31.14	2.08	24.53	2.49	0.24
Idefics2-8B	26.64	3.92	11.34	0.39	0.00
Qwen2.5-VL-7B	30.29	2.70	13.03	2.68	0.32
Qwen3-VL-2B	31.43	2.63	16.34	11.20	5.07
Qwen3-VL-4B	35.19	2.26	21.15	16.53	18.72
Qwen3-VL-8B	37.41	2.10	24.92	10.05	7.37
Open-source Spatial Models
Molmo2-4B	31.36	2.21	22.17	1.17	0.13
Molmo2-8B	34.57	5.93	21.66	1.51	0.00
RoboBrain2.0	30.79	2.22	21.49	0.00	0.00
Spatial-MLLM	27.93	2.10	9.98	0.00	0.00
OpenView	30.72	2.57	19.80	1.08	0.30
MINDCUBE-Aug	29.00	4.52	18.10	0.00	0.00
MINDCUBE-Plain	30.21	4.77	19.12	0.12	0.00
Closed-source MLLMs
GPT-4o	33.93	1.51	29.03	3.60	0.56
GPT-5-nano	33.36	1.73	24.28	2.89	0.87
GPT-5-mini	49.00	1.20	38.71	7.42	2.77
GPT-5	59.86	1.31	38.10	8.18	3.43
GPT-5.2	37.07	1.28	36.50	8.74	2.84
Fine-tuned Models (Qwen3-VL-4B)
Ours (Instruct-tuning)	36.50	0.71	50.08	47.58	50.07
Ours (Grounded CoT, Direct Answer at Test)	33.95	10.75	29.10	16.35	11.50
Ours (CoT)	59.57	0.68	51.43	49.77	52.34
Ours (Grounded CoT)	64.93	0.58	54.99	69.34	70.92

+52.81

absolute grounding mIoU gain over the base Qwen3-VL-4B (16.53 to 69.34)

69.34

grounding mIoU with Grounded CoT

70.92

grounding F1 with Grounded CoT
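For readers unfamiliar with the column names in Table 1, the sketch below shows standard definitions of counting MAE, counting accuracy, and box IoU. It is a simplified illustration, not the official evaluation script: it assumes exact-match counting accuracy and one predicted box per ground-truth box, whereas the released code may match multiple boxes and apply its own F1 threshold. All function names are hypothetical.

```python
def count_mae(pred_counts, gt_counts):
    """Mean absolute error between predicted and ground-truth object counts."""
    return sum(abs(p - g) for p, g in zip(pred_counts, gt_counts)) / len(gt_counts)


def count_acc(pred_counts, gt_counts):
    """Fraction of questions whose predicted count matches exactly."""
    return sum(p == g for p, g in zip(pred_counts, gt_counts)) / len(gt_counts)


def box_iou(a, b):
    """Intersection over union of two boxes given as (x_min, y_min, x_max, y_max)."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def mean_iou(pred_boxes, gt_boxes):
    """Average IoU over paired predictions (assumes one box per ground truth)."""
    return sum(box_iou(p, g) for p, g in zip(pred_boxes, gt_boxes)) / len(gt_boxes)


# Toy inputs for illustration only; these are not benchmark results.
toy_mae = count_mae([3, 5, 2], [3, 4, 4])
toy_acc = count_acc([3, 5, 2], [3, 4, 4])
toy_iou = box_iou((0, 0, 2, 2), (1, 1, 3, 3))
```

Lower is better for MAE, while accuracy, mIoU, and F1 are higher-is-better. This explains why a model can show a reasonable counting MAE and still score near zero on grounding.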

Table 2

Sim-to-real transfer on MINDCUBE-Tiny

The trained model does more than fit VIEW2SPACE. Without any MINDCUBE fine-tuning, the VIEW2SPACE-trained 4B Grounded-CoT model transfers to real-world MINDCUBE-Tiny and surpasses both official MINDCUBE checkpoints (70.00 overall vs. 60.76 and 55.24), while the 2B model lands between them at 59.92. The website highlights the key comparison; please see Table 2 in the paper for the full benchmark table.

Method	Tuning on MINDCUBE	Overall	Rotation	Among	Around
Baseline
Random (chance)	-	32.35	36.36	32.29	30.66
Random (frequency)	-	33.02	38.30	32.66	35.79
Open-source MLLMs and Spatial Models
InternVL2.5-8B	w/o Tuning	18.68	36.45	47.11	26.91
Molmo2-4B	w/o Tuning	37.33	36.00	38.83	35.75
Molmo2-8B	w/o Tuning	34.33	27.50	36.50	34.50
Mantis-8B (SigLip)	w/o Tuning	41.05	37.65	40.23	50.99
DeepSeek-VL2-Small	w/o Tuning	47.62	37.00	50.38	26.91
Gemma-3-12B-it	w/o Tuning	46.67	38.39	48.38	34.63
Idefics2-8B	w/o Tuning	35.86	35.15	35.94	35.49
Qwen2.5-VL-7B	w/o Tuning	29.26	38.76	29.50	21.35
Qwen3-VL-2B	w/o Tuning	32.58	26.50	36.33	30.00
Qwen3-VL-4B	w/o Tuning	36.67	41.50	34.67	37.25
Qwen3-VL-8B	w/o Tuning	37.42	46.00	35.83	35.50
RoboBrain	w/o Tuning	37.38	35.80	38.28	29.53
Spatial-MLLM	w/o Tuning	32.06	38.39	20.92	32.82
OpenView	w/o Tuning	40.08	24.50	45.50	39.75
Closed-source MLLMs
Claude-4-Sonnet	w/o Tuning	44.75	48.42	44.21	47.62
GPT-4o	w/o Tuning	38.81	32.65	40.17	29.16
GPT-5-nano	w/o Tuning	38.58	44.00	38.67	35.75
GPT-5-mini	w/o Tuning	46.64	89.00	44.67	51.75
MINDCUBE Checkpoints
MINDCUBE-Aug	w Tuning	55.24	49.50	52.50	66.40
MINDCUBE-Plain	w Tuning	60.76	47.50	62.33	67.60
VIEW2SPACE-Trained Models
Ours 2B (Grounded-CoT)	w/o Tuning	59.92	49.00	56.50	70.50
Ours 4B (Grounded-CoT)	w/o Tuning	70.00	71.50	66.33	74.75

70.00

overall MINDCUBE-Tiny accuracy without MINDCUBE tuning

+9.24

absolute gain over MINDCUBE-Plain checkpoint
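The headline gain can be rechecked directly from the Overall column of Table 2. The short snippet below only repeats that arithmetic; the dictionary keys are row labels copied from the table, and the variable names are illustrative.

```python
# Overall MINDCUBE-Tiny accuracy, copied from Table 2.
overall = {
    "MINDCUBE-Aug": 55.24,
    "MINDCUBE-Plain": 60.76,
    "Ours 4B (Grounded-CoT)": 70.00,
}

ours = overall["Ours 4B (Grounded-CoT)"]

# Absolute gain (in accuracy points) of the 4B model over each official checkpoint.
gains = {
    name: round(ours - acc, 2)
    for name, acc in overall.items()
    if name != "Ours 4B (Grounded-CoT)"
}
print(gains)
```

The gain over MINDCUBE-Plain is the +9.24 reported above; the gain over MINDCUBE-Aug is larger because that checkpoint scores lower overall.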

Scaling Analysis

Scaling helps perception, but harder reasoning and visibility constraints still break models.

Increasing data and model size improves mIoU, especially for lower-difficulty cases. However, the curves fall as reasoning difficulty and key-object visibility difficulty increase, showing that deep compositional multi-view reasoning remains structurally hard.

Scaling curves showing 2B and 4B model performance against reasoning difficulty and key-object visibility difficulty. — Difficulty-aware scaling: larger models and more training data improve performance, but performance declines as reasoning depth and visibility difficulty rise.
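A difficulty-aware breakdown of this kind can be sketched as a simple group-by over per-question scores. The example below is an assumption-laden illustration rather than the paper's analysis code: the record fields `hops`, `visibility`, and `score` are hypothetical names, and the toy values are made up purely to show the mechanics.

```python
from collections import defaultdict


def mean_by_level(records, key):
    """Average the per-question score within each level of one difficulty factor."""
    buckets = defaultdict(list)
    for record in records:
        buckets[record[key]].append(record["score"])
    return {level: sum(scores) / len(scores) for level, scores in sorted(buckets.items())}


# Toy per-question records (hypothetical fields and values, not benchmark results).
records = [
    {"hops": 1, "visibility": "easy", "score": 0.75},
    {"hops": 1, "visibility": "hard", "score": 0.50},
    {"hops": 2, "visibility": "easy", "score": 0.50},
    {"hops": 2, "visibility": "hard", "score": 0.25},
]

by_hops = mean_by_level(records, "hops")
by_visibility = mean_by_level(records, "visibility")
```

Reporting one curve per factor, instead of a single pooled average, is what makes it possible to see performance drop as reasoning depth or visibility difficulty rises even when the overall score improves.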

Resources

Public release entry points.

Use the evaluation split, training data, and released checkpoint to reproduce or extend the benchmark.

arXiv arXiv:2603.16506

Read the current VIEW2SPACE preprint and appendix.

Code GitHub repository

Evaluation scripts, prompts, and training utilities.

Dataset view2space-v1

Public evaluation split for count, detection, and MCQ tasks.

Training view2space-train

Released training data for adapting sparse multi-view reasoning models.

Checkpoint view2space_GCoT_4b

Grounded-CoT checkpoint for public evaluation.

Collection VIEW2SPACE on HF

All public VIEW2SPACE resources in one collection.

Citation

Cite VIEW2SPACE.

@article{ke2026view2space,
  title={VIEW2SPACE: Studying Multi-View Visual Reasoning from Sparse Observations},
  author={Ke, Fucai and Cai, Zhixi and Li, Boying and Chen, Long and Lin, Beibei and Wang, Weiqing and Haghighi, Pari Delir and Haffari, Gholamreza and Rezatofighi, Hamid},
  journal={arXiv preprint arXiv:2603.16506},
  year={2026}
}