SS4D: Native 4D Generative Model via Structured Spacetime Latents

¹The Chinese University of Hong Kong, ²Shanghai Artificial Intelligence Laboratory
³Zhejiang University, ⁴Stanford University

Abstract

We present SS4D, a native 4D generative model that synthesizes dynamic 3D objects directly from monocular video.

Unlike prior approaches that construct 4D representations by optimizing over 3D or video generative models, we train a generator directly on 4D data, achieving high fidelity, temporal coherence, and structural consistency. At the core of our method is a compressed set of structured spacetime latents. Specifically: (1) to address the scarcity of 4D training data, we build on a pre-trained single-image-to-3D model, which preserves strong spatial consistency; (2) to enforce temporal consistency, we introduce dedicated temporal layers that reason across frames; (3) to support efficient training and inference over long video sequences, we compress the latent sequence along the temporal axis using factorized 4D convolutions and temporal downsampling blocks. In addition, we employ a carefully designed training strategy that improves robustness to occlusion and motion blur, leading to high-quality generation.

Extensive experiments show that SS4D produces spatio-temporally consistent 4D objects with superior quality and efficiency, significantly outperforming state-of-the-art methods on both synthetic and real-world datasets.
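To make the temporal compression in point (3) concrete, the following is a minimal NumPy sketch of one way to factorize a 4D convolution: a 3D spatial kernel applied per frame, followed by a 1D temporal kernel with stride 2 that halves the number of frames. It is an illustration under stated assumptions, not the SS4D implementation: it uses a dense single-channel grid of shape (T, X, Y, Z) rather than sparse voxel latents, and the function names, kernel sizes, and padding modes are all hypothetical choices.

```python
import numpy as np

def spatial_conv3d(x, k):
    """Apply a 3x3x3 kernel k to every frame of x, shape (T, X, Y, Z).

    Zero padding keeps the spatial size unchanged. As in most deep learning
    code, this is a cross-correlation (the kernel is not flipped).
    """
    _, X, Y, Z = x.shape
    xp = np.pad(x, ((0, 0), (1, 1), (1, 1), (1, 1)))
    out = np.zeros(x.shape, dtype=float)
    for dx in range(3):
        for dy in range(3):
            for dz in range(3):
                out += k[dx, dy, dz] * xp[:, dx:dx + X, dy:dy + Y, dz:dz + Z]
    return out

def temporal_conv1d(x, k, stride=1):
    """Apply a length-3 kernel k along the time axis, then subsample by stride.

    Edge (replicate) padding in time is an assumption made for this sketch.
    """
    T = x.shape[0]
    xp = np.pad(x, ((1, 1), (0, 0), (0, 0), (0, 0)), mode="edge")
    out = sum(k[dt] * xp[dt:dt + T] for dt in range(3))
    return out[::stride]

def factorized_4d_block(x, k_spatial, k_temporal, stride=2):
    """Spatial 3D conv followed by a strided temporal 1D conv (hypothetical block)."""
    return temporal_conv1d(spatial_conv3d(x, k_spatial), k_temporal, stride)

# Usage: 8 frames of a 4x4x4 grid are compressed to 4 frames.
x = np.ones((8, 4, 4, 4))
k_spatial = np.zeros((3, 3, 3))
k_spatial[1, 1, 1] = 1.0                    # identity spatial kernel
k_temporal = np.array([0.25, 0.5, 0.25])    # smoothing temporal kernel
y = factorized_4d_block(x, k_spatial, k_temporal)
print(y.shape)
```

In this single-channel setting the factorized pair has 3·3·3 + 3 = 30 weights, whereas a full 3×3×3×3 kernel would have 81, which is the usual motivation for factorizing.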

Method

Method Overview

Left: SS4D pipeline. Our method takes a monocular video as input. It first extracts a coarse voxel-based structure using a 4D Flow Transformer, then generates spacetime latents with a 4D Sparse Flow Transformer. Both transformers incorporate a Temporal Layer to maintain temporal consistency. The latents are subsequently decoded into a sequence of 3D Gaussians, forming the final 4D content.

Right: Temporal Layer. It combines temporal self-attention with shifted windows and hybrid 1D Rotary Position Embeddings to efficiently model dynamic 4D content and ensure consistency across frames.
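The idea behind the Temporal Layer can be illustrated with a small NumPy sketch of windowed self-attention along the time axis with 1D rotary position embeddings. Several things here are assumptions rather than details from the text above: the sketch handles the features of a single spatial location as a (T, D) array, uses identity query/key/value projections and a single head, realizes the window shift by offsetting the window boundaries with a mask, and applies plain 1D RoPE because the "hybrid" variant is not specified here. The function names are hypothetical.

```python
import numpy as np

def rope_1d(x, positions, base=10000.0):
    """Rotate feature pairs of x, shape (T, D) with even D, by position-dependent angles."""
    half = x.shape[1] // 2
    freqs = base ** (-np.arange(half) / half)
    ang = positions[:, None] * freqs[None, :]
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=1)

def windowed_temporal_attention(x, window, shift=0):
    """Self-attention over frames of x (T, D), restricted to temporal windows.

    Frames attend only to frames with the same window id. A nonzero shift
    moves the window boundaries so that alternating layers can exchange
    information across the boundaries of the unshifted windows.
    """
    T, D = x.shape
    pos = np.arange(T)
    q = rope_1d(x, pos)   # identity projections: an assumption for brevity
    k = rope_1d(x, pos)
    v = x
    win_id = (pos + shift) // window
    mask = win_id[:, None] == win_id[None, :]
    scores = q @ k.T / np.sqrt(D)
    scores = np.where(mask, scores, -np.inf)
    scores = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ v, weights

# Usage: 8 frames, window of 4, with and without a half-window shift.
rng = np.random.default_rng(0)
x = rng.standard_normal((8, 4))
out_plain, w_plain = windowed_temporal_attention(x, window=4, shift=0)
out_shift, w_shift = windowed_temporal_attention(x, window=4, shift=2)
```

With `shift=0`, frames 3 and 4 fall in different windows and do not attend to each other; with `shift=2` they share a window. The cost per frame scales with the window size rather than the full sequence length, which is the efficiency argument for windowing.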

BibTeX

@article{li2025ss4d,
      author = {Li, Zhibing and Zhang, Mengchen and Wu, Tong and Tan, Jing and Wang, Jiaqi and Lin, Dahua},
      title = {SS4D: Native 4D Generative Model via Structured Spacetime Latents},
      year = {2025},
      issue_date = {December 2025},
      publisher = {Association for Computing Machinery},
      address = {New York, NY, USA},
      volume = {44},
      number = {6},
      issn = {0730-0301},
      url = {https://doi.org/10.1145/3763302},
      doi = {10.1145/3763302},
      journal = {ACM Trans. Graph.},
      month = dec,
      articleno = {244},
      numpages = {12},
      keywords = {4D generation, 3D generation, animation, generative model}
}