CubeBench: Diagnosing Interactive, Long-Horizon Spatial Reasoning under Partial Observations

Huan-ang Gao*1, Zikang Zhang*1, Tianwei Luo1, Kaisen Yang1, Xinzhe Juan3, Jiahao Qiu2,
Tianxing Chen4, Bingxiang He1, Hao Zhao1, Hao Zhou1, Shilong Liu†2, Mengdi Wang†2
1 Tsinghua University     2 Princeton University     3 SJTU & UMich     4 HKU

*Indicates Equal Contribution
†Indicates Corresponding Author

Overview of CubeBench Performance. Performance of leading LLMs on CubeBench, broken down by its three diagnostic tiers. Tier 1 (Full Symbolic State) tests foundational state tracking using complete symbolic information, where the best average pass rate is only 37.5%. Tier 2 (Full Visual State) challenges visual and spatial reasoning by requiring agents to interpret a 2D unfolded map, and Tier 3 (Partial Visual State) evaluates active exploration from partial views. Across all tiers, GPT-5 is the top-performing model, though the results highlight a significant performance gap between symbolic and visual reasoning tasks.

Abstract

Large Language Model (LLM) agents, while proficient in the digital realm, face a significant gap in physical-world deployment due to the challenge of forming and maintaining a robust spatial mental model. We identify three core cognitive challenges hindering this transition: spatial reasoning, long-horizon state tracking via mental simulation, and active exploration under partial observation. To isolate and evaluate these faculties, we introduce CubeBench, a novel generative benchmark centered on the Rubik's Cube. CubeBench uses a three-tiered diagnostic framework that progressively assesses agent capabilities, from foundational state tracking with full symbolic information to active exploration with only partial visual data. Our experiments on leading LLMs reveal critical limitations, including a uniform 0.00% pass rate on all long-horizon tasks, exposing a fundamental failure in long-term planning. We further present a diagnostic evaluation framework that isolates these cognitive bottlenecks by equipping agents with external solver tools. By analyzing the failure modes, we provide key insights to guide the development of more physically grounded intelligent agents.

Three Core Cognitive Challenges

Visualization of the three core cognitive challenges hindering physical-world deployment. We identify three critical challenges: (1) Spatial Reasoning - understanding 3D geometry and the consequences of actions, (2) Long-Horizon State Tracking - maintaining and updating the world model over long action sequences, (3) Exploration under Partial Observation - constructing a complete mental model from limited views.

The CubeBench Framework

Three-Tiered Diagnostic Framework

Illustration of the three-tiered task structure. Tier 1 (Full Symbolic State) provides complete state information as a string, making it a fully observable MDP. Tier 2 (Full Visual State) presents the full state as a 2D unfolded map, challenging visual thinking. Tier 3 (Partial Visual State) provides only a partial view (Face view or Vertex view), requiring active exploration.
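
As a concrete illustration, the three tiers can be viewed as progressively lossier observations of the same underlying cube state. The sketch below assumes a standard 54-facelet string in U, R, F, D, L, B order; CubeBench's actual state encoding, unfold layout, and rendering may differ.

    # Minimal sketch of the three observation tiers (assumed facelet encoding,
    # not necessarily CubeBench's exact format).
    SOLVED = "U" * 9 + "R" * 9 + "F" * 9 + "D" * 9 + "L" * 9 + "B" * 9  # 54 stickers

    def tier1_symbolic(state: str) -> str:
        """Tier 1: the agent receives the full facelet string directly."""
        return state

    def tier2_unfolded_map(state: str) -> str:
        """Tier 2: the same state rendered as a 2D cross-shaped unfolded map."""
        faces = {f: state[i * 9:(i + 1) * 9] for i, f in enumerate("URFDLB")}
        rows = lambda f: [faces[f][r * 3:(r + 1) * 3] for r in range(3)]
        lines = ["    " + r for r in rows("U")]
        lines += [" ".join(quad) for quad in zip(rows("L"), rows("F"), rows("R"), rows("B"))]
        lines += ["    " + r for r in rows("D")]
        return "\n".join(lines)

    def tier3_partial_view(state: str, view: str = "face") -> str:
        """Tier 3: only a partial view is visible (one face, or the three faces
        meeting at a vertex), so the agent must explore to see the rest."""
        faces = {f: state[i * 9:(i + 1) * 9] for i, f in enumerate("URFDLB")}
        return faces["F"] if view == "face" else faces["U"] + faces["F"] + faces["R"]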


Interaction Protocol

Agent interaction follows the ReAct paradigm. Each step consists of a Thought-Code-Observation cycle, with a maximum of 20 steps and a 30-minute timeout per run.
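
A minimal sketch of this loop is shown below; only the 20-step cap and the 30-minute timeout come from the protocol above, while the agent and environment interfaces (agent.act, env.execute, make_move) are hypothetical placeholders for illustration.

    import time

    MAX_STEPS = 20              # per-run step budget from the protocol above
    TIMEOUT_SECONDS = 30 * 60   # 30-minute wall-clock limit per run

    def run_episode(agent, env):
        """Thought-Code-Observation loop in the ReAct style (illustrative only)."""
        start = time.time()
        observation = env.reset()
        for step in range(MAX_STEPS):
            if time.time() - start > TIMEOUT_SECONDS:
                return {"pass": False, "reason": "timeout"}
            # Thought + Code: the agent reasons in text, then emits executable
            # code (e.g., make_move calls) conditioned on the latest observation.
            thought, code = agent.act(observation)
            # Observation: the environment runs the code and reports the result.
            observation, solved = env.execute(code)
            if solved:
                return {"pass": True, "steps": step + 1}
        return {"pass": False, "reason": "step budget exhausted"}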


Diagnostic Evaluation Framework

Three-part diagnostic framework for systematically evaluating LLM agents. To answer Q1, we test a basic agent with only fundamental interaction tools. For Q2, we augment the agent with various dense reward signals. Finally, for Q3, we deploy agents with different levels of tool support to diagnose whether failures originate from high-level planning, state reconstruction, or procedural data transformation.
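
One way to picture these configurations is as increasingly rich tool sets handed to the same base agent. The sketch below is only illustrative: make_move appears in the results tables, but the other tool names are assumptions rather than CubeBench's actual API.

    # Illustrative tool sets for the three diagnostic questions (Q1-Q3).
    AGENT_CONFIGS = {
        # Q1: basic agent with only the fundamental interaction tools.
        "basic": ["make_move", "get_observation"],
        # Q2: basic tools plus a dense reward signal after every move
        # (face-, sticker-, or heuristic-based).
        "dense_reward": ["make_move", "get_observation", "get_reward"],
        # Q3: basic tools plus an external solver, to localize failures to
        # planning, state reconstruction, or procedural data transformation.
        "standard_solver": ["make_move", "get_observation", "call_solver"],
        "ideal_solver": ["make_move", "get_observation", "call_ideal_solver"],
    }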

Key Results

Experiment 1: Basic Agent Performance

We evaluated leading LLMs across all four observation modalities on both short- and long-horizon tasks. The results reveal several critical limitations:

  • Universal failure on long-horizon tasks: All models exhibit a 0.00% pass rate on long-horizon tasks across all input modalities, exposing a fundamental deficit in long-horizon state tracking.
  • Sharp decline from symbolic to visual inputs: Non-zero pass rates are achieved almost exclusively with symbolic string input; performance on visual inputs is near or at zero for most models.
  • GPT-5 leads, but struggles: GPT-5 achieves a 0.75 pass rate on symbolic short-horizon tasks, matching the traditional RL baseline and far exceeding every other LLM, yet it still fails on all long-horizon challenges.
Baseline Performance: Pass Rates Across Modalities
Model   Full Symbolic (S / L)   Full Visual (S / L)   Face View (S / L)   Vertex View (S / L)
GPT-5 0.75 0.00 0.20 0.00 0.40 0.00 0.05 0.00
MLP (Policy Gradient) 0.75 0.00 -- -- -- -- -- --
gpt-oss-120b 0.20 0.00 -- -- -- -- -- --
Grok-4 0.20 0.00 0.05 0.00 0.00 0.00 0.00 0.00
Kimi K2 (2024-09-05) 0.15 0.00 -- -- -- -- -- --
Gemini 2.5 Pro 0.10 0.00 0.05 0.00 0.05 0.00 0.00 0.00
DeepSeek-R1 (2025-05-28) 0.05 0.00 -- -- -- -- -- --
Claude Sonnet 4 0.05 0.00 0.00 0.00 0.00 0.00 0.00 0.00
Qwen3-Max 0.05 0.00 -- -- -- -- -- --
DeepSeek-V3.1 0.05 0.00 -- -- -- -- -- --
doubao-seed-1-6-vision 0.05 0.00 0.00 0.00 0.00 0.00 0.00 0.00
InternVL-3 (78B) 0.00 0.00 0.00 0.00 0.05 0.00 0.00 0.00
Qwen2.5-VL-72B-Instruct 0.00 0.00 0.00 0.00 0.05 0.00 0.00 0.00
kimi-vl-a3b-thinking 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
GPT-4o 0.00 0.00 0.00 0.00 0.10 0.00 0.00 0.00
GLM-4.5V 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
Gemma-3-27B-IT 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
Seed-OSS-36B-Instruct 0.00 0.00 -- -- -- -- -- --

S = Short-horizon (depth 1-4), L = Long-horizon (depth 8-20). "--" = model does not support visual inputs. MLP (Policy Gradient) is a traditional RL baseline; the remaining entries are proprietary and open-source LLMs.
Critical finding: all models show a 0.00 pass rate on all long-horizon tasks.
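
Task difficulty in these experiments is parameterized by depth (S: 1-4, L: 8-20), presumably the length of the scramble applied to a solved cube. A depth-d instance might be generated roughly as below, assuming uniform random face turns with a simple same-face filter; the benchmark's actual generation procedure is not described on this page.

    import random

    MOVES = [face + suffix for face in "URFDLB" for suffix in ("", "'", "2")]  # 18 turns

    def generate_scramble(depth: int, seed: int | None = None) -> list[str]:
        """Sample a scramble of the given depth, skipping consecutive turns of
        the same face (illustrative; not necessarily CubeBench's generator)."""
        rng = random.Random(seed)
        scramble = []
        while len(scramble) < depth:
            move = rng.choice(MOVES)
            if scramble and move[0] == scramble[-1][0]:
                continue  # drop trivially redundant same-face turns
            scramble.append(move)
        return scramble

    short_task = generate_scramble(depth=3, seed=0)    # short-horizon: depth 1-4
    long_task = generate_scramble(depth=15, seed=0)    # long-horizon: depth 8-20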

Experiment 2: Impact of Dense Rewards

We tested whether dense reward signals (face, sticker, and heuristic) can guide agent search:

  • Short-horizon improvement: Dense rewards generally increase pass rates on short-horizon tasks, acting as local guides.
  • Long-horizon failure persists: Pass rates on all long-horizon tasks remain at 0.00%, regardless of reward type, indicating that local feedback cannot compensate for fundamental planning deficits.
  • Model-dependent effectiveness: More capable agents such as GPT-5 sometimes perform worse with external rewards, suggesting that the external signals can conflict with the model's internal strategies.
Effect of Dense Rewards on Pass Rates (Short-Horizon Tasks Only)
Model Reward Type Full Symbolic Full Visual Face View Vertex View
GPT-5 no reward 0.75 0.20 0.40 0.05
face 0.85 0.55 0.50 0.40
sticker 0.65 0.55 0.55 0.50
heuristic 0.50 0.45 0.65 0.30
Gemini 2.5 Pro no reward 0.10 0.05 0.05 0.00
face 0.00 0.00 0.00 0.00
sticker 0.10 0.00 0.05 0.00
heuristic 0.05 0.00 0.10 0.00
Claude Sonnet 4 no reward 0.05 0.00 0.00 0.00
face 0.10 0.10 0.05 0.00
sticker 0.25 0.15 0.00 0.05
heuristic 0.20 0.05 0.05 0.10

Note: All long-horizon tasks remain at 0.00 pass rate regardless of reward type.
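
The three reward types could be computed roughly as follows over an assumed 54-facelet string; this is a hedged sketch, and the exact definitions used in CubeBench (in particular the heuristic signal, modeled here as a simple normalized sticker count) may differ.

    def face_reward(state: str) -> int:
        """Number of faces whose nine stickers all match that face's center."""
        faces = [state[i * 9:(i + 1) * 9] for i in range(6)]
        return sum(all(s == face[4] for s in face) for face in faces)

    def sticker_reward(state: str) -> int:
        """Number of stickers already matching their face's center color."""
        faces = [state[i * 9:(i + 1) * 9] for i in range(6)]
        return sum(s == face[4] for face in faces for s in face)

    def heuristic_reward(state: str) -> float:
        """A denser shaped signal, here a normalized distance-to-solved proxy
        (illustrative stand-in; the benchmark's heuristic is not specified here)."""
        return sticker_reward(state) / 54.0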

Experiment 3: Diagnostic with Solver Tools

By equipping agents with optimal solvers, we isolated specific cognitive bottlenecks:

  • Planning as primary bottleneck: Standard-Solver agents show marked improvement, confirming that long-horizon planning can be successfully offloaded to external tools.
  • Spatial reasoning matters: The performance gap between Standard-Solver and Ideal-Solver agents reveals that spatial transformation for tool use is non-trivial.
  • Partial observation is fundamental: Universal failure on Vertex view tasks, even with ideal solvers, isolates exploration under partial observation as the ultimate bottleneck.
  • Emergent tool learning: Agents exhibit remarkable autonomous tool-learning through trial-and-error, suggesting that discovery-oriented environments may be more effective than explicit instruction.
Comparison of Agent Configurations: Basic vs Standard-Solver vs Ideal-Solver
Model   Agent Type   Full Symbolic (S / L)   Full Visual (S / L)   Face View (S / L)   Vertex View (S / L)
GPT-5 Basic 0.75 0.00 0.20 0.00 0.40 0.00 0.05 0.00
Standard-Solver 0.95 0.95 0.65 0.70 1.00 0.95 0.00 0.00
Ideal-Solver 1.00 1.00 0.95 0.80 0.85 1.00 0.00 0.00
Gemini 2.5 Pro Basic 0.10 0.00 0.05 0.00 0.05 0.00 0.00 0.00
Standard-Solver 0.70 0.65 0.25 0.00 0.20 0.00 0.00 0.00
Ideal-Solver 1.00 1.00 0.25 0.00 0.00 0.00 0.00 0.00
Claude Sonnet 4 Basic 0.05 0.00 0.00 0.00 0.00 0.00 0.00 0.00
Standard-Solver 0.35 0.85 0.00 0.00 0.00 0.00 0.00 0.00
Ideal-Solver 1.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00

S = Short-horizon tasks (depth 1-4), L = Long-horizon tasks (depth 8-20). Note the universal failure on Vertex View even with the Ideal-Solver.
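
For reference, the kind of plan offloading evaluated above might look like the sketch below, using the open-source kociemba two-phase solver as a stand-in; the solver tools actually exposed in CubeBench, and the facelet-string convention they expect, may differ, and apply_plan is a hypothetical helper around the make_move interaction tool.

    import kociemba  # pip install kociemba; two-phase solver for the 3x3x3 cube

    def solve_with_external_tool(facelet_string: str) -> list[str]:
        """Offload planning: the agent first reconstructs the 54-facelet string
        (trivial in Tier 1, a spatial-reasoning task in Tiers 2-3), then the
        solver returns a move sequence to the solved state."""
        # kociemba expects faces in U, R, F, D, L, B order, labeled by face letters.
        return kociemba.solve(facelet_string).split()

    def apply_plan(env, plan: list[str]) -> None:
        """Execute the plan through the environment's make_move tool
        (hypothetical interface for illustration)."""
        for move in plan:
            env.make_move(move)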

BibTeX

If you find our work useful in your research, please consider citing:
@article{gao2025cubebench,
  title={CubeBench: Diagnosing Interactive, Long-Horizon Spatial Reasoning under Partial Observations},
  author={Gao, Huan-ang and Zhang, Zikang and Luo, Tianwei and Yang, Kaisen and Juan, Xinzhe and Qiu, Jiahao and Chen, Tianxing and He, Bingxiang and Zhao, Hao and Zhou, Hao and Liu, Shilong and Wang, Mengdi},
  journal={arXiv preprint},
  year={2025}
}