CubeBench: Diagnosing Interactive, Long-Horizon Spatial Reasoning under Partial Observations

Huan-ang Gao*1, Zikang Zhang*1, Tianwei Luo1, Kaisen Yang1, Xinzhe Juan3, Jiahao Qiu2,
Tianxing Chen4, Bingxiang He1, Hao Zhao1, Hao Zhou1, Shilong Liu†2, Mengdi Wang†2
1 Tsinghua University     2 Princeton University     3 SJTU & UMich     4 HKU

*Indicates Equal Contribution
†Indicates Corresponding Author

Overview of CubeBench Performance. Performance of leading LLMs on CubeBench, broken down by its three diagnostic tiers. Tier 1 (Full Symbolic State) tests foundational state tracking using complete symbolic information, where the best average pass rate is only 37.5%. Tier 2 (Full Visual State) challenges visual and spatial reasoning by requiring agents to interpret a 2D unfolded map, and Tier 3 (Partial Visual State) evaluates active exploration from partial views. Across all tiers, GPT-5 is the top-performing model, though the results highlight a significant performance gap between symbolic and visual reasoning tasks.

Abstract

Large Language Model (LLM) agents, while proficient in the digital realm, face a significant gap in physical-world deployment due to the challenge of forming and maintaining a robust spatial mental model. We identify three core cognitive challenges hindering this transition: spatial reasoning, long-horizon state tracking via mental simulation, and active exploration under partial observation. To isolate and evaluate these faculties, we introduce CubeBench, a novel generative benchmark centered on the Rubik's Cube. CubeBench uses a three-tiered diagnostic framework that progressively assesses agent capabilities, from foundational state tracking with full symbolic information to active exploration with only partial visual data. Our experiments on leading LLMs reveal critical limitations, including a uniform 0.00% pass rate on all long-horizon tasks, exposing a fundamental failure in long-term planning. We further present a diagnostic evaluation framework that isolates these cognitive bottlenecks by equipping agents with external solver tools. By analyzing the failure modes, we provide key insights to guide the development of more physically grounded intelligent agents.

Three Core Cognitive Challenges

Visualization of the three core cognitive challenges hindering physical-world deployment. We identify three critical challenges: (1) Spatial Reasoning - understanding 3D geometry and the consequences of actions, (2) Long-Horizon State Tracking - maintaining and updating the world model over long action sequences, (3) Exploration under Partial Observation - constructing a complete mental model from limited views.

The CubeBench Framework

Three-Tiered Diagnostic Framework

Illustration of the three-tiered task structure. Tier 1 (Full Symbolic State) provides complete state information as a string, making it a fully observable MDP. Tier 2 (Full Visual State) presents the full state as a 2D unfolded map, challenging visual thinking. Tier 3 (Partial Visual State) provides only a partial view (Face view or Vertex view), requiring active exploration.
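
As a concrete illustration, the three tiers can be viewed as progressively lossier observations of the same underlying cube state. The sketch below assumes a standard 54-facelet string in U, R, F, D, L, B order; CubeBench's actual state encoding, unfold layout, and rendering may differ.

    # Minimal sketch of the three observation tiers (assumed facelet encoding,
    # not necessarily CubeBench's exact format).
    SOLVED = "U" * 9 + "R" * 9 + "F" * 9 + "D" * 9 + "L" * 9 + "B" * 9  # 54 stickers

    def tier1_symbolic(state: str) -> str:
        """Tier 1: the agent receives the full facelet string directly."""
        return state

    def tier2_unfolded_map(state: str) -> str:
        """Tier 2: the same state rendered as a 2D cross-shaped unfolded map."""
        faces = {f: state[i * 9:(i + 1) * 9] for i, f in enumerate("URFDLB")}
        rows = lambda f: [faces[f][r * 3:(r + 1) * 3] for r in range(3)]
        lines = ["    " + r for r in rows("U")]
        lines += [" ".join(quad) for quad in zip(rows("L"), rows("F"), rows("R"), rows("B"))]
        lines += ["    " + r for r in rows("D")]
        return "\n".join(lines)

    def tier3_partial_view(state: str, view: str = "face") -> str:
        """Tier 3: only a partial view is visible (one face, or the three faces
        meeting at a vertex), so the agent must explore to see the rest."""
        faces = {f: state[i * 9:(i + 1) * 9] for i, f in enumerate("URFDLB")}
        return faces["F"] if view == "face" else faces["U"] + faces["F"] + faces["R"]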


Interaction Protocol

Agent interaction follows the ReAct paradigm. Each step consists of a Thought-Code-Observation cycle, with a maximum of 20 steps and a 30-minute timeout per run.
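
A minimal sketch of this loop is shown below; only the 20-step cap and the 30-minute timeout come from the protocol above, while the agent and environment interfaces (agent.act, env.execute, make_move) are hypothetical placeholders for illustration.

    import time

    MAX_STEPS = 20              # per-run step budget from the protocol above
    TIMEOUT_SECONDS = 30 * 60   # 30-minute wall-clock limit per run

    def run_episode(agent, env):
        """Thought-Code-Observation loop in the ReAct style (illustrative only)."""
        start = time.time()
        observation = env.reset()
        for step in range(MAX_STEPS):
            if time.time() - start > TIMEOUT_SECONDS:
                return {"pass": False, "reason": "timeout"}
            # Thought + Code: the agent reasons in text, then emits executable
            # code (e.g., make_move calls) conditioned on the latest observation.
            thought, code = agent.act(observation)
            # Observation: the environment runs the code and reports the result.
            observation, solved = env.execute(code)
            if solved:
                return {"pass": True, "steps": step + 1}
        return {"pass": False, "reason": "step budget exhausted"}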


Diagnostic Evaluation Framework

Three-part diagnostic framework for systematically evaluating LLM agents. To answer Q1, we test a basic agent with only fundamental interaction tools. For Q2, we augment the agent with various dense reward signals. Finally, for Q3, we deploy agents with different levels of tool support to diagnose whether failures originate from high-level planning, state reconstruction, or procedural data transformation.
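
One way to picture these configurations is as increasingly rich tool sets handed to the same base agent. The sketch below is only illustrative: make_move appears in the results tables, but the other tool names are assumptions rather than CubeBench's actual API.

    # Illustrative tool sets for the three diagnostic questions (Q1-Q3).
    AGENT_CONFIGS = {
        # Q1: basic agent with only the fundamental interaction tools.
        "basic": ["make_move", "get_observation"],
        # Q2: basic tools plus a dense reward signal after every move
        # (face-, sticker-, or heuristic-based).
        "dense_reward": ["make_move", "get_observation", "get_reward"],
        # Q3: basic tools plus an external solver, to localize failures to
        # planning, state reconstruction, or procedural data transformation.
        "standard_solver": ["make_move", "get_observation", "call_solver"],
        "ideal_solver": ["make_move", "get_observation", "call_ideal_solver"],
    }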

Key Results

Experiment 1: Basic Agent Performance

We evaluated leading LLMs across all four observation modalities on both short- and long-horizon tasks. The results reveal several critical limitations:

  • Universal failure on long-horizon tasks: All models exhibit a 0.00% pass rate on long-horizon tasks across all input modalities, exposing a fundamental deficit in long-horizon state tracking.
  • Sharp decline from symbolic to visual inputs: Non-zero pass rates are achieved almost exclusively with symbolic string input; performance on visual inputs is near or at zero for most models.
  • GPT-5 leads, but struggles: GPT-5 achieves a 0.75 pass rate on symbolic short-horizon tasks, matching the traditional RL baseline and far exceeding every other LLM, yet it still fails on all long-horizon challenges.
Baseline Performance: Pass Rates Across Modalities
Model   Full Symbolic (S / L)   Full Visual (S / L)   Face View (S / L)   Vertex View (S / L)
GPT-5 0.75 0.00 0.20 0.00 0.40 0.00 0.05 0.00
MLP (Policy Gradient) 0.75 0.00 -- -- -- -- -- --
gpt-oss-120b 0.20 0.00 -- -- -- -- -- --
Grok-4 0.20 0.00 0.05 0.00 0.00 0.00 0.00 0.00
Kimi K2 (2024-09-05) 0.15 0.00 -- -- -- -- -- --
Gemini 2.5 Pro 0.10 0.00 0.05 0.00 0.05 0.00 0.00 0.00
DeepSeek-R1 (2025-05-28) 0.05 0.00 -- -- -- -- -- --
Claude Sonnet 4 0.05 0.00 0.00 0.00 0.00 0.00 0.00 0.00
Qwen3-Max 0.05 0.00 -- -- -- -- -- --
DeepSeek-V3.1 0.05 0.00 -- -- -- -- -- --
doubao-seed-1-6-vision 0.05 0.00 0.00 0.00 0.00 0.00 0.00 0.00
InternVL-3 (78B) 0.00 0.00 0.00 0.00 0.05 0.00 0.00 0.00
Qwen2.5-VL-72B-Instruct 0.00 0.00 0.00 0.00 0.05 0.00 0.00 0.00
kimi-vl-a3b-thinking 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
GPT-4o 0.00 0.00 0.00 0.00 0.10 0.00 0.00 0.00
GLM-4.5V 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
Gemma-3-27B-IT 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
Seed-OSS-36B-Instruct 0.00 0.00 -- -- -- -- -- --

S = Short-horizon (depth 1-4), L = Long-horizon (depth 8-20). "--" = model does not support visual inputs. MLP (Policy Gradient) is a traditional RL baseline; the remaining entries are proprietary and open-source LLMs.
Critical finding: all models show a 0.00 pass rate on all long-horizon tasks.
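
Task difficulty in these experiments is parameterized by depth (S: 1-4, L: 8-20), presumably the length of the scramble applied to a solved cube. A depth-d instance might be generated roughly as below, assuming uniform random face turns with a simple same-face filter; the benchmark's actual generation procedure is not described on this page.

    import random

    MOVES = [face + suffix for face in "URFDLB" for suffix in ("", "'", "2")]  # 18 turns

    def generate_scramble(depth: int, seed: int | None = None) -> list[str]:
        """Sample a scramble of the given depth, skipping consecutive turns of
        the same face (illustrative; not necessarily CubeBench's generator)."""
        rng = random.Random(seed)
        scramble = []
        while len(scramble) < depth:
            move = rng.choice(MOVES)
            if scramble and move[0] == scramble[-1][0]:
                continue  # drop trivially redundant same-face turns
            scramble.append(move)
        return scramble

    short_task = generate_scramble(depth=3, seed=0)    # short-horizon: depth 1-4
    long_task = generate_scramble(depth=15, seed=0)    # long-horizon: depth 8-20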

Experiment 2: Impact of Dense Rewards

We tested whether dense reward signals (face, sticker, and heuristic) can guide agent search:

  • Short-horizon improvement: Dense rewards generally increase pass rates on short-horizon tasks, acting as local guides.
  • Long-horizon failure persists: Pass rates on all long-horizon tasks remain at 0.00%, regardless of reward type, indicating that local feedback cannot compensate for fundamental planning deficits.
  • Model-dependent effectiveness: More capable agents such as GPT-5 sometimes perform worse with external rewards, suggesting that the external signals can conflict with the model's internal strategies.
Effect of Dense Rewards on Pass Rates (Short-Horizon Tasks Only)
Model Reward Type Full Symbolic Full Visual Face View Vertex View
GPT-5 no reward 0.75 0.20 0.40 0.05
face 0.85 0.55 0.50 0.40
sticker 0.65 0.55 0.55 0.50
heuristic 0.50 0.45 0.65 0.30
Gemini 2.5 Pro no reward 0.10 0.05 0.05 0.00
face 0.00 0.00 0.00 0.00
sticker 0.10 0.00 0.05 0.00
heuristic 0.05 0.00 0.10 0.00
Claude Sonnet 4 no reward 0.05 0.00 0.00 0.00
face 0.10 0.10 0.05 0.00
sticker 0.25 0.15 0.00 0.05
heuristic 0.20 0.05 0.05 0.10

Note: All long-horizon tasks remain at 0.00 pass rate regardless of reward type.
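
The three reward types could be computed roughly as follows over an assumed 54-facelet string; this is a hedged sketch, and the exact definitions used in CubeBench (in particular the heuristic signal, modeled here as a simple normalized sticker count) may differ.

    def face_reward(state: str) -> int:
        """Number of faces whose nine stickers all match that face's center."""
        faces = [state[i * 9:(i + 1) * 9] for i in range(6)]
        return sum(all(s == face[4] for s in face) for face in faces)

    def sticker_reward(state: str) -> int:
        """Number of stickers already matching their face's center color."""
        faces = [state[i * 9:(i + 1) * 9] for i in range(6)]
        return sum(s == face[4] for face in faces for s in face)

    def heuristic_reward(state: str) -> float:
        """A denser shaped signal, here a normalized distance-to-solved proxy
        (illustrative stand-in; the benchmark's heuristic is not specified here)."""
        return sticker_reward(state) / 54.0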

Experiment 3: Diagnostic with Solver Tools

By equipping agents with optimal solvers, we isolated specific cognitive bottlenecks:

  • Planning as primary bottleneck: Standard-Solver agents show marked improvement, confirming that long-horizon planning can be successfully offloaded to external tools.
  • Spatial reasoning matters: The performance gap between Standard-Solver and Ideal-Solver agents reveals that spatial transformation for tool use is non-trivial.
  • Partial observation is fundamental: Universal failure on Vertex view tasks, even with ideal solvers, isolates exploration under partial observation as the ultimate bottleneck.
  • Emergent tool learning: Agents exhibit remarkable autonomous tool-learning through trial-and-error, suggesting that discovery-oriented environments may be more effective than explicit instruction.
Comparison of Agent Configurations: Basic vs Standard-Solver vs Ideal-Solver
Model   Agent Type   Full Symbolic (S / L)   Full Visual (S / L)   Face View (S / L)   Vertex View (S / L)
GPT-5 Basic 0.75 0.00 0.20 0.00 0.40 0.00 0.05 0.00
Standard-Solver 0.95 0.95 0.65 0.70 1.00 0.95 0.00 0.00
Ideal-Solver 1.00 1.00 0.95 0.80 0.85 1.00 0.00 0.00
Gemini 2.5 Pro Basic 0.10 0.00 0.05 0.00 0.05 0.00 0.00 0.00
Standard-Solver 0.70 0.65 0.25 0.00 0.20 0.00 0.00 0.00
Ideal-Solver 1.00 1.00 0.25 0.00 0.00 0.00 0.00 0.00
Claude Sonnet 4 Basic 0.05 0.00 0.00 0.00 0.00 0.00 0.00 0.00
Standard-Solver 0.35 0.85 0.00 0.00 0.00 0.00 0.00 0.00
Ideal-Solver 1.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00

S = Short-horizon tasks (depth 1-4), L = Long-horizon tasks (depth 8-20). Note the universal failure on Vertex View even with the Ideal-Solver.
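
For reference, the kind of plan offloading evaluated above might look like the sketch below, using the open-source kociemba two-phase solver as a stand-in; the solver tools actually exposed in CubeBench, and the facelet-string convention they expect, may differ, and apply_plan is a hypothetical helper around the make_move interaction tool.

    import kociemba  # pip install kociemba; two-phase solver for the 3x3x3 cube

    def solve_with_external_tool(facelet_string: str) -> list[str]:
        """Offload planning: the agent first reconstructs the 54-facelet string
        (trivial in Tier 1, a spatial-reasoning task in Tiers 2-3), then the
        solver returns a move sequence to the solved state."""
        # kociemba expects faces in U, R, F, D, L, B order, labeled by face letters.
        return kociemba.solve(facelet_string).split()

    def apply_plan(env, plan: list[str]) -> None:
        """Execute the plan through the environment's make_move tool
        (hypothetical interface for illustration)."""
        for move in plan:
            env.make_move(move)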

BibTeX

If you find our work useful in your research, please consider citing:
@article{gao2025cubebench,
  title={CubeBench: Diagnosing Interactive, Long-Horizon Spatial Reasoning under Partial Observations},
  author={Gao, Huan-ang and Zhang, Zikang and Luo, Tianwei and Yang, Kaisen and Juan, Xinzhe and Qiu, Jiahao and Chen, Tianxing and He, Bingxiang and Zhao, Hao and Zhou, Hao and Liu, Shilong and Wang, Mengdi},
  journal={arXiv preprint},
  year={2025}
}