StreamSplat: Online Dynamic 3D Reconstruction from Uncalibrated Video Streams

Figure 1. Given an uncalibrated video stream, StreamSplat performs instant reconstruction of dynamic 3D Gaussian scenes in an online manner, enabling video reconstruction, interpolation, depth estimation, and novel view synthesis.

Abstract

Real-time reconstruction of dynamic 3D scenes from uncalibrated video streams demands robust online methods that recover scene dynamics from sparse observations under strict latency and memory constraints. Yet most dynamic reconstruction methods rely on hours of per-scene optimization under full-sequence access, limiting practical deployment.

In this work, we introduce StreamSplat, a fully feed-forward framework that instantly transforms uncalibrated video streams of arbitrary length into dynamic 3D Gaussian Splatting (3DGS) representations in an online manner.

It is achieved via three key technical innovations: 1) a probabilistic sampling mechanism that robustly predicts 3D Gaussians from uncalibrated inputs; 2) a bidirectional deformation field that yields reliable associations across frames and mitigates long-term error accumulation; 3) an adaptive Gaussian fusion operation that propagates persistent Gaussians while handling emerging and vanishing ones.

Extensive experiments on standard dynamic and static benchmarks demonstrate that StreamSplat achieves state-of-the-art reconstruction quality and dynamic scene modeling. Uniquely, our method supports the online reconstruction of arbitrarily long video streams with a 1200× speedup over optimization-based methods. 🚀

Key Features

⚡

Feed-Forward

No per-scene optimization. Instant 3D reconstruction in a single forward pass.

📷

Camera-Free

Works directly with uncalibrated monocular video — no camera poses needed.

🌊

Dynamic Scenes

Handles both static and dynamic elements via bidirectional deformation fields.

♾️

Online & Infinite

Processes arbitrarily long video streams with constant memory via online inference.

Video Results

Qualitative comparisons of StreamSplat against competing methods, and demonstrations of 3D Gaussian point tracking.

DAVIS Method Comparison — Dynamic Scenes

Side-by-side comparisons on the DAVIS dynamic video benchmark against competing methods.

RE10K Method Comparison — Static Scenes

Side-by-side comparisons on the RealEstate10K static-scene benchmark against competing methods.

Tracking 3D Gaussian Point Tracking

StreamSplat enables tracking individual 3D Gaussian points across frames, demonstrating temporally consistent scene-level correspondence.

Method Overview

Figure 2. Overview of the StreamSplat framework. Given a pair of frames, we first encode them using the Static Encoder to produce canonical 3D Gaussians, then pass the 3DGS Embeddings to the Dynamic Decoder to predict the deformation field. The resulting dynamic 3D Gaussians can be rendered at arbitrary time to produce RGB images and depth maps.

Key Technical Contributions

1

Probabilistic 3D Gaussian Encoding

Instead of directly regressing 3D positions, we predict a truncated normal distribution for each Gaussian offset. This promotes spatial exploration during early training and stabilizes convergence, yielding a +6.36 dB PSNR improvement over deterministic prediction.

2

Bidirectional Deformation Field

Our dynamic decoder jointly models forward and backward temporal motion between consecutive frames. Each deformation field predicts a 3D velocity and opacity coefficient, enabling smooth transitions and seamless Gaussian fusion across frames with an adaptive soft-matching mechanism.

3

Two-Stage Training Protocol

Stage 1: Train a static encoder on RGBD inputs to produce canonical 3D Gaussians. Stage 2: Freeze the encoder and train the dynamic decoder to predict bidirectional deformation fields from consecutive frames, supervised with photometric, depth, and mask losses.

Qualitative Results

Qualitative comparison on DAVIS. Blue box: given frames; Red box: interpolated frames. StreamSplat produces high-fidelity and temporally coherent interpolations across both 5-frame and 8-frame interval tasks.

Qualitative results on RealEstate10K. StreamSplat produces detailed and consistent 3D reconstructions across diverse indoor and outdoor scenes, whereas other methods often exhibit distortions.

Visualization of reconstructed dynamic scenes from canonical and novel views. Our method captures consistent 3D motion over time, enabling faithful reconstruction at arbitrary time and viewpoints.

Ablation study. w/o sampling: deterministic position prediction; w/o depth: no depth supervision. Our probabilistic sampling provides a +6.36 dB improvement and depth supervision prevents geometric distortions.

Quantitative Results

We evaluate on both static (RealEstate10K) and dynamic (DAVIS) benchmarks. StreamSplat excels especially on dynamic scenes — the core focus of our work.

Table 2: Quantitative results on DAVIS

Method	Scene Rep.	Key Frames			Middle-4 Frames			Time
		PSNR↑	SSIM↑	LPIPS↓	PSNR↑	SSIM↑	LPIPS↓
Omnimotion	NeRF	24.11	0.714	0.371	–	–	–	> 8 hrs
RoDynRF	NeRF	24.79	0.723	0.394	–	–	–	> 24 hrs
CoDeF	NeRF	31.49	0.939	0.088	19.40	0.498	0.400	~10 min
MonST3R	Points	42.33	0.980	0.012	–	–	–	~30 s
4DGS	3DGS	18.12	0.573	0.513	–	–	–	~40 min
Splatter a Video	3DGS	28.63	0.837	0.228	–	–	–	~30 min
DGMarbles	3DGS	28.38	0.879	0.172	21.33	0.619	0.313	~30 min
StreamSplat (Ours)	3DGS	37.83	0.982	0.016	23.66	0.684	0.193	1.48 s

Bold indicates best result among all 3DGS-based methods. StreamSplat is the only method capable of near real-time dynamic 3D reconstruction per frame.

Table 1: Quantitative results on RealEstate10K

Method	Type	Given Views			Novel Views			Average
		PSNR↑	SSIM↑	LPIPS↓	PSNR↑	SSIM↑	LPIPS↓	PSNR↑	SSIM↑	LPIPS↓
pixelSplat	Stat.	30.70	0.952	0.055	28.31	0.905	0.097	28.99	0.918	0.085
MVSplat	Stat.	31.48	0.962	0.046	28.48	0.909	0.091	29.34	0.924	0.078
NoPoSplat	Stat.	29.50	0.939	0.069	28.65	0.913	0.096	28.90	0.920	0.089
CoDeF	Dyn.	35.13	0.943	0.091	20.51	0.591	0.402	21.77	0.625	0.374
DGMarbles	Dyn.	27.48	0.867	0.232	23.40	0.727	0.333	23.73	0.738	0.325
StreamSplat (Ours)	Dyn.	41.60	0.992	0.010	24.68	0.777	0.167	29.51	0.839	0.122

StreamSplat significantly outperforms all baselines on given-view reconstruction and achieves competitive average performance. Best among all dynamic approaches in every setting.

Table 3: Video interpolation on DAVIS-7 (8-frame interval)

Method	Type	PSNR↑	SSIM↑	LPIPS↓
AMT	pixel	21.09	0.544	0.254
RIFE	pixel	20.48	0.511	0.258
FILM	pixel	20.71	0.528	0.270
LDMVFI	pixel	19.98	0.479	0.276
VIDIM	pixel	19.62	0.470	0.257
CoDeF	3D	20.34	0.520	0.365
DGMarbles	3D	19.83	0.548	0.353
StreamSplat (Ours)	3D	22.10	0.613	0.234

StreamSplat outperforms all baselines including pixel-level video interpolation methods, demonstrating the effectiveness of explicit 3D dynamic modeling.

Table 4: Component-wise ablation study

Configuration	Eval.	PSNR↑	SSIM↑	LPIPS↓
w/o Probabilistic Sampling	Key	31.47	0.946	0.073
w/o Depth Supervision	Key	36.68	0.975	0.039
Full Model (Ours)	Key	37.83	0.982	0.016
w/o Bidirectional Deformation	Mid.	18.89	0.582	0.492
Full Model (Ours)	Mid.	23.66	0.684	0.193

Probabilistic sampling yields +6.36 dB PSNR for key frames; bidirectional deformation provides +4.77 dB PSNR for interpolated frames.

Future Directions

🚗

Autonomous Driving

Exploring real-time dynamic reconstruction for autonomous navigation scenarios.

🎬

Video Generation

Leveraging dynamic 3D representations for controllable video synthesis.

🔬

Extended Temporal Context

Adaptive mechanisms for fusing Gaussians across longer frame histories.

BibTeX

@article{wu2025streamsplat,
  title={StreamSplat: Towards Online Dynamic 3D Reconstruction from Uncalibrated Video Streams},
  author={Wu, Zike and Yan, Qi and Yi, Xuanyu and Wang, Lele and Liao, Renjie},
  journal={arXiv preprint arXiv:2506.08862},
  year={2025}
}

Acknowledgements

This work was funded by the NSERC DG Grant, the Vector Institute for AI, Canada CIFAR AI Chair, and a Google Gift Fund. Resources provided by the Province of Ontario, the Digital Research Alliance of Canada, and Advanced Research Computing at UBC.

This project builds upon 3D Gaussian Splatting, DINOv2, Depth Anything V2, Gamba, NutWorld, and EDM.