Papers Mache

Stateful Python Kernels Lift VLM Spatial Reasoning

A stateful Python kernel turns vision-language models (VLMs) into iterative sketch artists, letting them draw and edit geometric primitives across steps. Because the execution environment is mutable and persists between steps, the model can observe the visual result of each drawing operation before deciding on the next, a capability that prior fixed-API agents lack.

Earlier spatial agents were constrained to single‑pass code execution or rigid tool‑call APIs, which forced a complete analysis plan before any intermediate observation. Those designs commit to a full strategy without seeing the rendered scene, and the structured interfaces limit freedom to compose arbitrary sequences of perception and geometry operations. Consequently, open‑ended 3D and 4D reasoning tasks often exceed the expressiveness of the available toolset.

SpatialClaw reaches 59.9% average accuracy on a suite of 20 3D/4D reasoning benchmarks, 11.2 points above the previous spatial agent. In the paper's words: “SpatialClaw achieves 59.9% average accuracy, outperforming the recent spatial agent by +11.2 points, with consistent gains across six VLM backbones from two model families without any benchmark- or model-specific adaptation.” [1] Because the gains hold across all six backbones with no per-model adaptation, they more plausibly come from the mutable code interface than from model-specific tuning.

The largest performance lifts appear on benchmarks that require multi-step geometric composition, which is consistent with the advantage of a mutable execution environment. The paper compares SpatialClaw against the single-pass code and structured tool-call interfaces: “The results show that SpatialClaw consistently outperforms the other two action interfaces across all benchmarks, with the largest gains on tasks that require multi-step geometric composition.” By writing one executable cell per step and reusing prior outputs, the agent can iteratively refine shapes, a strategy that neither of the other two interfaces can emulate [1].
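The "one cell per step, reuse prior outputs" pattern can be illustrated with a toy kernel. This is a minimal sketch, not SpatialClaw's implementation: the `StatefulKernel` class, its `run_cell` method, and the coordinates are all hypothetical, and plain `exec` stands in for a real kernel.

```python
import contextlib
import io


class StatefulKernel:
    """Toy stand-in for a persistent Python kernel (hypothetical, not the paper's code)."""

    def __init__(self):
        # One namespace shared by every cell, so variables survive between steps.
        self.namespace = {}

    def run_cell(self, source):
        """Execute one cell and return whatever it printed as the observation."""
        buffer = io.StringIO()
        with contextlib.redirect_stdout(buffer):
            exec(source, self.namespace)
        return buffer.getvalue()


kernel = StatefulKernel()

# Step 1: store two 3D points (made-up coordinates) and look at them.
obs1 = kernel.run_cell(
    "a = (0.0, 0.0, 0.0)\n"
    "b = (3.0, 4.0, 0.0)\n"
    "print(a, b)"
)

# Step 2: reuse a and b from step 1 instead of recomputing them.
obs2 = kernel.run_cell(
    "import math\n"
    "dist = math.dist(a, b)\n"
    "print(dist)"
)
```

In an agent loop, each observation (`obs1`, `obs2`) would be fed back to the model before it writes the next cell. A single-pass interface would instead have to emit both steps at once, without seeing the intermediate output.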

The method is training‑free and relies on a pre‑loaded Python kernel with a fixed set of perception and geometry primitives, which may restrict deployment to settings where such a kernel can be provisioned. While avoiding additional fine‑tuning simplifies adoption, the approach presumes that the required primitives are already implemented and that the runtime environment can execute arbitrary Python safely. This dependency could become a bottleneck for domains lacking a curated library of spatial tools [1].
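The dependency on pre-loaded primitives can be made concrete with a small sketch. The primitive names (`centroid`, `distance`, `estimate_depth`) and the helper functions are illustrative assumptions, not the paper's toolset. Note also that plain `exec` is not a sandbox, so a real deployment would need proper isolation.

```python
import math


def centroid(points):
    """Mean of a list of 3D points (hypothetical geometry primitive)."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(3))


def distance(p, q):
    """Euclidean distance between two 3D points (hypothetical geometry primitive)."""
    return math.dist(p, q)


# Fixed toolset injected into the kernel before the agent writes any code.
PRELOADED = {"centroid": centroid, "distance": distance}


def make_namespace():
    """Return a fresh kernel namespace that already contains the primitives."""
    return dict(PRELOADED)


def missing_primitives(required, namespace):
    """List the primitive names a task needs but the kernel does not provide."""
    return sorted(name for name in required if name not in namespace)


ns = make_namespace()

# Agent-written cells can call the primitives directly and build on earlier results.
exec("c = centroid([(0, 0, 0), (2, 0, 0), (2, 2, 0), (0, 2, 0)])", ns)
exec("d = distance(c, (1, 1, 2))", ns)

# A task that also needs a depth estimator exposes the gap in the toolset.
gaps = missing_primitives({"centroid", "distance", "estimate_depth"}, ns)
```

Here `gaps` contains only `estimate_depth`, the kind of missing tool that would block a domain without a curated library, however capable the underlying model is.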

These results argue for treating stateful code interfaces as a default option when benchmark suites and toolkits evaluate spatial reasoning in VLMs. Providing a mutable Python kernel alongside static APIs would let future agents exploit iterative composition and could push performance further on complex 3D/4D tasks.

References

  1. SpatialClaw: Rethinking Action Interface for Agentic Spatial Reasoning
