Generating a consistent whole-house VR tour from a floorplan and style reference requires both photorealistic panoramas and cross-view spatial coherence. Pure 2D generators produce appealing single panoramas but re-imagine geometry and materials when the viewpoint changes, whereas monolithic 3D generation becomes expensive and loses fine texture at multi-room scale. We introduce PanoWorld, a generative spatial world model that treats whole-house synthesis as autoregressive generation of node-based 360-degree panoramas, matching the discrete navigation used by real VR tour products. PanoWorld uses a floorplan-derived 3D shell as a global geometric proxy and a dynamic 3D Gaussian Splatting cache as renderable spatial memory. A feed-forward panoramic LRM designed for metric-scale multi-room 360-degree inputs lifts generated panoramas into local 3DGS updates, while Room-aware Group Attention suppresses cross-room feature interference. A topology-aware progressive caching strategy fuses these local updates without repeatedly reconstructing the full history. By decoupling shell-based geometry guidance from cache-rendered visual memory, PanoWorld preserves high-frequency 2D synthesis quality while improving cross-node layout and material consistency.
Explore PanoWorld's whole-house panoramic results in a VR-style viewer. Given a floorplan and a style reference, PanoWorld synthesizes a set of panoramas covering the entire house that remain consistent in layout and materials across viewpoints.
Room-aware panoramic LRM. Grouped attention allows dense intra-room interaction and restricts cross-room communication to topological boundaries.
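The grouping rule in this caption can be illustrated with a boolean attention mask. The following is a minimal sketch, not the PanoWorld implementation: the function name `room_aware_mask`, the per-token `is_boundary` flag, and the `adjacent_rooms` set are all hypothetical, and it assumes that "topological boundaries" means tokens lying on a shared opening (such as a doorway) between two adjacent rooms.

```python
from typing import List, Set, Tuple


def room_aware_mask(
    room_ids: List[int],
    is_boundary: List[bool],
    adjacent_rooms: Set[Tuple[int, int]],
) -> List[List[bool]]:
    """Return mask[i][j] == True where token i may attend to token j.

    Assumed rule (hypothetical, for illustration only):
      * tokens in the same room attend to each other densely;
      * tokens in different rooms attend only if both are boundary
        tokens and their rooms are adjacent in the floorplan graph.
    """
    n = len(room_ids)
    # Store adjacency as unordered pairs so (0, 1) and (1, 0) are equivalent.
    adjacency = {frozenset(pair) for pair in adjacent_rooms}
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if room_ids[i] == room_ids[j]:
                mask[i][j] = True  # dense intra-room interaction
            elif (
                is_boundary[i]
                and is_boundary[j]
                and frozenset((room_ids[i], room_ids[j])) in adjacency
            ):
                mask[i][j] = True  # restricted cross-room communication
    return mask


# Toy layout: rooms 0 and 1 share a doorway; room 2 is not adjacent to either.
mask = room_aware_mask(
    room_ids=[0, 0, 1, 1, 2],
    is_boundary=[False, True, True, False, True],
    adjacent_rooms={(0, 1)},
)
```

In this toy layout, the interior token of room 0 cannot see any token of room 1, while the two doorway tokens can see each other. Such a mask would typically be passed to an attention layer so that disallowed pairs receive zero weight.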
Progressive 3DGS caching. PanoWorld updates spatial memory through local topology-aware increments instead of full history reconstruction.
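The idea of local, topology-aware increments can be sketched with a small bookkeeping class. This is an illustrative assumption rather than the authors' caching strategy: `ProgressiveCache`, `local_context`, and `add_node` are hypothetical names, Gaussian primitives are stood in for by plain strings, and "topology-aware" is taken to mean that a new node only interacts with cached neighbours in the floorplan graph.

```python
from typing import Dict, List, Set


class ProgressiveCache:
    """Toy spatial memory keyed by tour node (hypothetical sketch)."""

    def __init__(self, adjacency: Dict[str, Set[str]]) -> None:
        self.adjacency = adjacency            # node -> neighbouring nodes
        self.chunks: Dict[str, List[str]] = {}  # node -> cached primitives

    def local_context(self, node: str) -> List[str]:
        """Collect primitives of already-cached topological neighbours."""
        context: List[str] = []
        for neighbour in sorted(self.adjacency.get(node, set())):
            context.extend(self.chunks.get(neighbour, []))
        return context

    def add_node(self, node: str, local_update: List[str]) -> List[str]:
        """Store a local update and report which chunks were involved.

        Only the new node and its cached neighbours are touched; the rest
        of the history is left as-is instead of being reconstructed.
        """
        touched = [
            neighbour
            for neighbour in sorted(self.adjacency.get(node, set()))
            if neighbour in self.chunks
        ]
        self.chunks[node] = list(local_update)
        return touched + [node]


# Toy floorplan graph: living <-> hall <-> bedroom.
cache = ProgressiveCache(
    {"living": {"hall"}, "hall": {"living", "bedroom"}, "bedroom": {"hall"}}
)
first = cache.add_node("living", ["g_living"])
second = cache.add_node("hall", ["g_hall"])
bedroom_context = cache.local_context("bedroom")
third = cache.add_node("bedroom", ["g_bedroom"])
```

Adding the bedroom here involves only the hall chunk and the new bedroom chunk, so the work per step depends on the node's neighbourhood rather than on the number of nodes generated so far.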
We compare PanoWorld with representative adapted baselines on multi-node panorama generation.
PanoWorld preserves cross-room geometry and material identity while generating furnished panoramas under different target styles.
Room-level panorama renderings produced by different reconstruction methods.
HPSv3 measures single-node aesthetic quality, CLIP-I Style measures image-reference style consistency, and cross-node consistency is evaluated by Overlap PSNR (PSNRov). Bar heights are normalized per metric for readability, while exact values are shown above each bar.
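For readers unfamiliar with overlap-restricted PSNR, here is a minimal sketch. It uses the standard PSNR definition and assumes that the overlap between two node views is given as a per-pixel boolean mask after the views have been brought into a common frame. The exact protocol behind PSNRov is not specified in this caption, so the function name `overlap_psnr` and its inputs are hypothetical.

```python
import math
from typing import Sequence


def overlap_psnr(
    view_a: Sequence[float],
    view_b: Sequence[float],
    overlap: Sequence[bool],
    max_val: float = 1.0,
) -> float:
    """PSNR in dB computed only over pixels flagged as overlapping.

    view_a, view_b: flattened pixel intensities in [0, max_val].
    overlap: True where both views observe the same surface point
             (assumed to be precomputed; hypothetical input).
    """
    squared_error = 0.0
    count = 0
    for a, b, inside in zip(view_a, view_b, overlap):
        if inside:
            squared_error += (a - b) ** 2
            count += 1
    if count == 0:
        raise ValueError("the two views share no overlapping pixels")
    mse = squared_error / count
    if mse == 0.0:
        return math.inf  # identical content in the overlap region
    return 10.0 * math.log10(max_val ** 2 / mse)


# Toy pixels: the third pixel differs strongly but lies outside the overlap.
score = overlap_psnr([0.5, 0.5, 0.0], [0.4, 0.6, 1.0], [True, True, False])
```

Restricting the error to the overlap means that regions visible from only one node do not affect the score, which is what makes the metric a measure of cross-node agreement rather than of single-view quality.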
Metrics are computed from panorama renderings of reconstructed 3D representations. LPIPS is a lower-is-better metric, and its bar heights are inverted only for visual comparison.
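The per-metric normalization and the inversion for lower-is-better metrics could be implemented as follows. This is a sketch under stated assumptions: the captions do not give the formula used for the charts, so scaling by the best value (and using `best / value` for the inverted case) is one plausible choice, and `bar_heights` is a hypothetical helper.

```python
from typing import Dict


def bar_heights(
    values: Dict[str, float], lower_is_better: bool = False
) -> Dict[str, float]:
    """Map positive metric values to bar heights in (0, 1].

    The best method always gets height 1.0. For lower-is-better metrics
    such as LPIPS, heights are inverted so that a taller bar still means
    a better result. Exact values should still be printed above each bar.
    """
    if not values:
        return {}
    if any(v <= 0 for v in values.values()):
        raise ValueError("this sketch assumes strictly positive values")
    if lower_is_better:
        best = min(values.values())
        return {name: best / v for name, v in values.items()}
    best = max(values.values())
    return {name: v / best for name, v in values.items()}
```

Normalizing each metric separately keeps metrics with very different ranges readable on one chart, at the cost of making bar heights incomparable across metrics.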
@misc{jia2026panoworldgenerativespatialworld,
  title={PanoWorld: A Generative Spatial World Model for Consistent Whole-House Panorama Synthesis},
  author={Jinrang Jia and Zhenjia Li and Yijiang Hu and Yifeng Shi},
  year={2026},
  eprint={2605.17916},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2605.17916},
}