🏆 ICLR 2026

StreamSplat: Towards Online Dynamic 3D Reconstruction from Uncalibrated Video Streams

¹University of British Columbia, ²Vector Institute for AI, ³Nanyang Technological University, ⁴Canada CIFAR AI Chair

Figure 1. Given an uncalibrated video stream, StreamSplat performs instant reconstruction of dynamic 3D Gaussian scenes in an online manner, enabling video reconstruction, interpolation, depth estimation, and novel view synthesis.

Abstract

Real-time reconstruction of dynamic 3D scenes from uncalibrated video streams demands robust online methods that recover scene dynamics from sparse observations under strict latency and memory constraints. Yet most dynamic reconstruction methods rely on hours of per-scene optimization under full-sequence access, limiting practical deployment.

In this work, we introduce StreamSplat, a fully feed-forward framework that instantly transforms uncalibrated video streams of arbitrary length into dynamic 3D Gaussian Splatting (3DGS) representations in an online manner.

This is achieved through three key technical innovations: (1) a probabilistic sampling mechanism that robustly predicts 3D Gaussians from uncalibrated inputs; (2) a bidirectional deformation field that yields reliable associations across frames and mitigates long-term error accumulation; and (3) an adaptive Gaussian fusion operation that propagates persistent Gaussians while handling emerging and vanishing ones.

Extensive experiments on standard dynamic and static benchmarks demonstrate that StreamSplat achieves state-of-the-art reconstruction quality and dynamic scene modeling. Uniquely, our method supports the online reconstruction of arbitrarily long video streams with a 1200× speedup over optimization-based methods. 🚀

Key Features

Feed-Forward

No per-scene optimization. Instant 3D reconstruction in a single forward pass.

📷

Camera-Free

Works directly with uncalibrated monocular video — no camera poses needed.

🌊

Dynamic Scenes

Handles both static and dynamic elements via bidirectional deformation fields.

♾️

Online & Infinite

Processes arbitrarily long video streams with constant memory via online inference.

Video Results

Qualitative comparisons of StreamSplat against competing methods, and demonstrations of 3D Gaussian point tracking.

DAVIS Method Comparison — Dynamic Scenes

Side-by-side comparisons on the DAVIS dynamic video benchmark against competing methods.

RE10K Method Comparison — Static Scenes

Side-by-side comparisons on the RealEstate10K static-scene benchmark against competing methods.

3D Gaussian Point Tracking

StreamSplat enables tracking individual 3D Gaussian points across frames, demonstrating temporally consistent scene-level correspondence.

Method Overview


Figure 2. Overview of the StreamSplat framework. Given a pair of frames, we first encode them with the Static Encoder to produce canonical 3D Gaussians, then pass the 3DGS embeddings to the Dynamic Decoder to predict the deformation field. The resulting dynamic 3D Gaussians can be rendered at an arbitrary time to produce RGB images and depth maps.
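The pairwise, online flow described in Figure 2 can be sketched as a generator that keeps only the most recent frame in memory. This is a minimal illustration, not the released implementation: `stream_pairs`, `reconstruct_stream`, and the `reconstruct_pair` callback are hypothetical names, and the actual model may carry fused Gaussians between steps rather than raw frames.

```python
from typing import Callable, Iterable, Iterator, Tuple, TypeVar

Frame = TypeVar("Frame")
Output = TypeVar("Output")

_NO_FRAME = object()  # sentinel meaning "no frame has been seen yet"


def stream_pairs(frames: Iterable[Frame]) -> Iterator[Tuple[Frame, Frame]]:
    """Yield consecutive (earlier, later) frame pairs from a stream.

    Only the latest frame is retained between iterations, so memory use
    does not grow with the length of the stream.
    """
    previous = _NO_FRAME
    for current in frames:
        if previous is not _NO_FRAME:
            yield previous, current
        previous = current


def reconstruct_stream(
    frames: Iterable[Frame],
    reconstruct_pair: Callable[[Frame, Frame], Output],
) -> Iterator[Output]:
    """Apply a (hypothetical) pairwise reconstructor to every consecutive pair."""
    for earlier, later in stream_pairs(frames):
        yield reconstruct_pair(earlier, later)


# Usage with placeholder frame labels instead of images.
pairs = list(stream_pairs(["f0", "f1", "f2"]))
# pairs == [("f0", "f1"), ("f1", "f2")]
```

Because both functions are generators, the input can be an unbounded stream; each output becomes available as soon as its second frame arrives.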

Key Technical Contributions

1. Probabilistic 3D Gaussian Encoding

Instead of directly regressing 3D positions, we predict a truncated normal distribution for each Gaussian offset. This promotes spatial exploration during early training and stabilizes convergence, yielding a +6.36 dB PSNR improvement over deterministic prediction.
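To make the idea of sampling an offset from a truncated normal concrete, here is a small standard-library sketch using rejection sampling. The function names, the `[-1, 1]` bounds, and the per-axis parameterization are assumptions for illustration only; the page does not specify how the distribution is parameterized, and a trainable model would also need a differentiable sampler.

```python
import random
from typing import Optional, Sequence, Tuple


def sample_truncated_normal(
    mean: float,
    std: float,
    low: float,
    high: float,
    rng: random.Random,
    max_tries: int = 10_000,
) -> float:
    """Draw one sample from N(mean, std^2) restricted to [low, high].

    Uses plain rejection sampling: draw from the untruncated normal and
    keep the first value that falls inside the interval.
    """
    if std <= 0.0:
        raise ValueError("std must be positive")
    if low >= high:
        raise ValueError("low must be smaller than high")
    for _ in range(max_tries):
        value = rng.gauss(mean, std)
        if low <= value <= high:
            return value
    # Fallback if almost no probability mass lies inside the interval:
    # return the mean clamped to the bounds.
    return min(max(mean, low), high)


def sample_offset(
    means: Sequence[float],
    stds: Sequence[float],
    low: float = -1.0,   # hypothetical bound, not taken from the paper
    high: float = 1.0,   # hypothetical bound, not taken from the paper
    rng: Optional[random.Random] = None,
) -> Tuple[float, ...]:
    """Sample one bounded offset per axis from predicted means and stds."""
    rng = rng or random.Random()
    return tuple(
        sample_truncated_normal(m, s, low, high, rng)
        for m, s in zip(means, stds)
    )


offset = sample_offset((0.0, 0.5, -0.9), (0.3, 0.3, 0.3), rng=random.Random(0))
```

A larger predicted standard deviation spreads samples across more of the allowed interval, which is one way to read the "spatial exploration" described above.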

2. Bidirectional Deformation Field

Our dynamic decoder jointly models forward and backward temporal motion between consecutive frames. Each deformation field predicts a 3D velocity and opacity coefficient, enabling smooth transitions and seamless Gaussian fusion across frames with an adaptive soft-matching mechanism.
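The following sketch shows one way a per-Gaussian 3D velocity and opacity coefficient could place Gaussians at an intermediate time between two frames. It assumes linear motion over normalized time `t` in `[0, 1]` and an exponential opacity fade `exp(-fade * dt)`; both are illustrative assumptions, as are all names here. The page does not give the actual motion model, opacity function, or soft-matching rule.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]


@dataclass(frozen=True)
class DynamicGaussian:
    position: Vec3   # center at the source frame
    opacity: float   # opacity at the source frame, in [0, 1]
    velocity: Vec3   # displacement per unit of normalized time
    fade: float      # hypothetical opacity coefficient, >= 0


def deform(g: DynamicGaussian, dt: float) -> Tuple[Vec3, float]:
    """Return (position, opacity) after moving dt away from the source frame."""
    position = tuple(p + dt * v for p, v in zip(g.position, g.velocity))
    opacity = g.opacity * math.exp(-g.fade * dt)  # assumed fade model
    return position, opacity


def gaussians_at(
    forward: Sequence[DynamicGaussian],
    backward: Sequence[DynamicGaussian],
    t: float,
) -> List[Tuple[Vec3, float]]:
    """Collect deformed Gaussians from both directions at time t in [0, 1].

    Forward Gaussians originate at the earlier frame (t = 0); backward
    Gaussians originate at the later frame (t = 1) and move back in time.
    """
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must lie in [0, 1]")
    moved = [deform(g, t) for g in forward]
    moved += [deform(g, 1.0 - t) for g in backward]
    return moved


fwd = DynamicGaussian((0.0, 0.0, 0.0), 1.0, (1.0, 0.0, 0.0), 0.0)
bwd = DynamicGaussian((1.0, 0.0, 0.0), 0.5, (-1.0, 0.0, 0.0), 0.0)
state = gaussians_at([fwd], [bwd], 0.25)
```

In this toy example both Gaussians land at the same point at `t = 0.25`, which is the situation where a fusion step would treat them as one persistent Gaussian; a nonzero `fade` lets a Gaussian vanish as it moves away from its source frame.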

3. Two-Stage Training Protocol

Stage 1: Train a static encoder on RGBD inputs to produce canonical 3D Gaussians. Stage 2: Freeze the encoder and train the dynamic decoder to predict bidirectional deformation fields from consecutive frames, supervised with photometric, depth, and mask losses.

Qualitative Results

Qualitative comparison on DAVIS. Blue box: given frames; Red box: interpolated frames. StreamSplat produces high-fidelity and temporally coherent interpolations across both 5-frame and 8-frame interval tasks.

Qualitative results on RealEstate10K. StreamSplat produces detailed and consistent 3D reconstructions across diverse indoor and outdoor scenes, whereas other methods often exhibit distortions.

Visualization of reconstructed dynamic scenes from canonical and novel views. Our method captures consistent 3D motion over time, enabling faithful reconstruction at arbitrary times and viewpoints.

Ablation study. w/o sampling: deterministic position prediction; w/o depth: no depth supervision. Our probabilistic sampling provides a +6.36 dB improvement and depth supervision prevents geometric distortions.

Quantitative Results

We evaluate on both static (RealEstate10K) and dynamic (DAVIS) benchmarks. StreamSplat excels especially on dynamic scenes — the core focus of our work.

Table 2: Quantitative results on DAVIS

| Method | Scene Rep. | Key Frames PSNR↑ | Key Frames SSIM↑ | Key Frames LPIPS↓ | Middle-4 Frames PSNR↑ | Middle-4 Frames SSIM↑ | Middle-4 Frames LPIPS↓ | Time |
|---|---|---|---|---|---|---|---|---|
| Omnimotion | NeRF | 24.11 | 0.714 | 0.371 | – | – | – | > 8 hrs |
| RoDynRF | NeRF | 24.79 | 0.723 | 0.394 | – | – | – | > 24 hrs |
| CoDeF | NeRF | 31.49 | 0.939 | 0.088 | 19.40 | 0.498 | 0.400 | ~10 min |
| MonST3R | Points | 42.33 | 0.980 | 0.012 | – | – | – | ~30 s |
| 4DGS | 3DGS | 18.12 | 0.573 | 0.513 | – | – | – | ~40 min |
| Splatter a Video | 3DGS | 28.63 | 0.837 | 0.228 | – | – | – | ~30 min |
| DGMarbles | 3DGS | 28.38 | 0.879 | 0.172 | 21.33 | 0.619 | 0.313 | ~30 min |
| StreamSplat (Ours) | 3DGS | **37.83** | **0.982** | **0.016** | **23.66** | **0.684** | **0.193** | 1.48 s |

Bold indicates best result among all 3DGS-based methods. StreamSplat is the only method capable of near real-time dynamic 3D reconstruction per frame.

Table 1: Quantitative results on RealEstate10K

| Method | Type | Given Views PSNR↑ | Given Views SSIM↑ | Given Views LPIPS↓ | Novel Views PSNR↑ | Novel Views SSIM↑ | Novel Views LPIPS↓ | Average PSNR↑ | Average SSIM↑ | Average LPIPS↓ |
|---|---|---|---|---|---|---|---|---|---|---|
| pixelSplat | Stat. | 30.70 | 0.952 | 0.055 | 28.31 | 0.905 | 0.097 | 28.99 | 0.918 | 0.085 |
| MVSplat | Stat. | 31.48 | 0.962 | 0.046 | 28.48 | 0.909 | 0.091 | 29.34 | 0.924 | 0.078 |
| NoPoSplat | Stat. | 29.50 | 0.939 | 0.069 | 28.65 | 0.913 | 0.096 | 28.90 | 0.920 | 0.089 |
| CoDeF | Dyn. | 35.13 | 0.943 | 0.091 | 20.51 | 0.591 | 0.402 | 21.77 | 0.625 | 0.374 |
| DGMarbles | Dyn. | 27.48 | 0.867 | 0.232 | 23.40 | 0.727 | 0.333 | 23.73 | 0.738 | 0.325 |
| StreamSplat (Ours) | Dyn. | 41.60 | 0.992 | 0.010 | 24.68 | 0.777 | 0.167 | 29.51 | 0.839 | 0.122 |

StreamSplat significantly outperforms all baselines on given-view reconstruction and achieves competitive average performance. It is the best among all dynamic approaches in every setting.

Table 3: Video interpolation on DAVIS-7 (8-frame interval)

| Method | Type | PSNR↑ | SSIM↑ | LPIPS↓ |
|---|---|---|---|---|
| AMT | pixel | 21.09 | 0.544 | 0.254 |
| RIFE | pixel | 20.48 | 0.511 | 0.258 |
| FILM | pixel | 20.71 | 0.528 | 0.270 |
| LDMVFI | pixel | 19.98 | 0.479 | 0.276 |
| VIDIM | pixel | 19.62 | 0.470 | 0.257 |
| CoDeF | 3D | 20.34 | 0.520 | 0.365 |
| DGMarbles | 3D | 19.83 | 0.548 | 0.353 |
| StreamSplat (Ours) | 3D | 22.10 | 0.613 | 0.234 |

StreamSplat outperforms all baselines including pixel-level video interpolation methods, demonstrating the effectiveness of explicit 3D dynamic modeling.

Table 4: Component-wise ablation study

| Configuration | Eval. | PSNR↑ | SSIM↑ | LPIPS↓ |
|---|---|---|---|---|
| w/o Probabilistic Sampling | Key | 31.47 | 0.946 | 0.073 |
| w/o Depth Supervision | Key | 36.68 | 0.975 | 0.039 |
| Full Model (Ours) | Key | 37.83 | 0.982 | 0.016 |
| w/o Bidirectional Deformation | Mid. | 18.89 | 0.582 | 0.492 |
| Full Model (Ours) | Mid. | 23.66 | 0.684 | 0.193 |

Probabilistic sampling yields +6.36 dB PSNR for key frames; bidirectional deformation provides +4.77 dB PSNR for interpolated frames.
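
The headline numbers can be re-derived from the tables with a few lines of arithmetic. The sketch below recomputes the two ablation gains from Table 4, converts a PSNR gain into an approximate error-reduction factor using the standard definition PSNR = 10·log10(MAX² / MSE), and estimates a speedup from the Table 2 timings. Using the "~30 min" baselines as the reference for the 1200× figure is an assumption, since the page does not say which baseline that figure is measured against, and the error-reduction factor is only indicative because the reported PSNR values are averages.

```python
def psnr_gain(full_db: float, ablated_db: float) -> float:
    """PSNR difference in dB, rounded to the two decimals used in the tables."""
    return round(full_db - ablated_db, 2)


def mse_reduction_factor(gain_db: float) -> float:
    """Factor by which MSE shrinks for a given PSNR gain.

    Follows from PSNR = 10 * log10(MAX^2 / MSE): a gain of g dB divides
    the mean squared error by 10 ** (g / 10).
    """
    return 10.0 ** (gain_db / 10.0)


def speedup(baseline_seconds: float, ours_seconds: float) -> float:
    """Ratio of baseline runtime to our runtime."""
    return baseline_seconds / ours_seconds


# Table 4: full model vs. ablated variants.
sampling_gain = psnr_gain(37.83, 31.47)       # key frames
deformation_gain = psnr_gain(23.66, 18.89)    # interpolated frames

# Table 2: approximate "~30 min" baseline vs. 1.48 s for StreamSplat.
speedup_vs_30_min = speedup(30 * 60, 1.48)

print(sampling_gain, deformation_gain, round(speedup_vs_30_min))
```

Under these assumptions the gains match the quoted +6.36 dB and +4.77 dB, and the speedup against a 30-minute baseline comes out slightly above 1200×.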

Future Directions

🚗

Autonomous Driving

Exploring real-time dynamic reconstruction for autonomous navigation scenarios.

🎬

Video Generation

Leveraging dynamic 3D representations for controllable video synthesis.

🔬

Extended Temporal Context

Adaptive mechanisms for fusing Gaussians across longer frame histories.

BibTeX

@article{wu2025streamsplat,
  title={StreamSplat: Towards Online Dynamic 3D Reconstruction from Uncalibrated Video Streams},
  author={Wu, Zike and Yan, Qi and Yi, Xuanyu and Wang, Lele and Liao, Renjie},
  journal={arXiv preprint arXiv:2506.08862},
  year={2025}
}

Acknowledgements

This work was funded by the NSERC DG Grant, the Vector Institute for AI, Canada CIFAR AI Chair, and a Google Gift Fund. Resources were provided by the Province of Ontario, the Digital Research Alliance of Canada, and Advanced Research Computing at UBC.

This project builds upon 3D Gaussian Splatting, DINOv2, Depth Anything V2, Gamba, NutWorld, and EDM.