How to Create Gaussian Splatting from Video
Updated Mar 2026
Capturing 200+ individual photos is tedious. Walking around a scene while recording a 2-minute video is fast and natural. But video introduces challenges that still photos avoid: motion blur, lower per-frame resolution, redundant frames, and temporal compression artifacts. We tested both approaches on 15 scenes — objects, rooms, and outdoor areas — comparing photo captures (80-300 images) against video captures (1-3 minute walks) using the same device and lighting. The conclusion: video produces 85-90% of the quality of well-taken photos, with 10% of the capture effort. For most use cases, that trade-off is worth it. This guide covers the complete video-to-3DGS pipeline: recording settings, frame extraction, COLMAP processing, and common pitfalls.
Step-by-Step Guide
Step 1: When to use video vs photos
Use video when: you need to capture quickly (site visits, events), the scene is large (outdoor, multi-room), or you are not experienced with 3D capture (video's natural movement covers viewpoints automatically). Use photos when: maximum quality matters (product photography, heritage preservation), the scene has complex geometry (intricate objects, fine details), or you need to control exact viewpoints (architectural documentation). Our benchmarks: on a park bench scene, photo capture (83 photos, 15 minutes) scored a PSNR of 28.4 dB. Video capture (2-minute walk, 360 extracted frames) scored 26.8 dB — a visible but small quality difference. On a large outdoor courtyard, video actually performed better because the natural walking motion produced more uniform coverage than ad-hoc photo positions.
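PSNR, the metric behind these benchmarks, is a log-scale measure of the mean squared error between rendered views and held-out ground-truth frames. A minimal sketch of the formula in Python; the example MSE value is hypothetical, chosen only to show the scale of the numbers above:

```python
import math

def psnr(mse: float, max_val: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB for a given mean squared error."""
    if mse == 0:
        return float("inf")  # identical images
    return 10 * math.log10((max_val ** 2) / mse)

# A 1.6 dB gap (28.4 vs 26.8) corresponds to roughly 1.45x higher MSE,
# since 10 ** (1.6 / 10) is about 1.445.
print(round(psnr(94.0), 1))  # -> 28.4 for a hypothetical 8-bit MSE of 94
```

Because the scale is logarithmic, small dB differences understate how close the reconstructions look in practice.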
Step 2: Recording settings
Resolution: 4K (3840x2160) is ideal. 1080p works but produces noticeably softer reconstructions. Frame rate: record at 30 fps; higher frame rates (60/120) add redundant frames without quality benefit and waste processing time. Stabilization: use optical stabilization (OIS) if available, but disable electronic stabilization (EIS). EIS crops and warps the frame, which breaks COLMAP's assumption of a fixed camera model. On iPhone: use the built-in Camera app at 4K 30 fps with Action Mode disabled (Action Mode is aggressive EIS). On Android: use the default camera at 4K 30 fps with video stabilization set to standard (not cinematic or super steady). Lock focus and exposure before recording by tapping and holding on the scene.
Step 3: How to walk (the smooth-and-slow approach)
Move slowly, at about half your normal walking pace. Fast movement causes motion blur even with stabilization. Keep the camera pointed at the center of the scene while you move around it. For objects: walk a complete circle at a constant distance (1-2 meters), then do a half circle slightly above and below. For rooms: walk along each wall slowly, then cross through the center. For outdoor areas: walk a serpentine pattern covering the area. Two critical rules: (1) never change direction abruptly; smooth, gradual turns only, because sudden movements create severely motion-blurred frames that corrupt COLMAP matching. (2) Cover the same area from at least 3 angles; a single pass along a wall provides very little baseline diversity, leaving the geometry poorly constrained from any other viewpoint.
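The object-capture pattern above (a full circle plus offset passes above and below) can be sketched as a simple waypoint generator. `arc_waypoints`, the 1.5 m radius, and the heights are all illustrative choices, not part of any tool:

```python
import math

def arc_waypoints(radius, height, n, span=2 * math.pi):
    """Camera positions spaced along an arc around the origin,
    all at the same distance from the object."""
    return [(radius * math.cos(span * i / n),
             radius * math.sin(span * i / n),
             height) for i in range(n)]

# Full circle at chest height, then half circles slightly above and below.
path = (arc_waypoints(1.5, 1.4, 36)
        + arc_waypoints(1.5, 1.8, 18, span=math.pi)
        + arc_waypoints(1.5, 1.0, 18, span=math.pi))
print(len(path))  # 72 viewpoints in total
```

The point of the three passes is vertical parallax: the circle alone leaves the top and underside of the object weakly observed.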
Step 4: Frame extraction with ffmpeg
You do not want all 30 fps × 120 seconds = 3,600 frames. Most are redundant and will slow COLMAP dramatically. Extract at 2-3 fps for walking-speed captures: `ffmpeg -i video.mp4 -vf "fps=2,mpdecimate" -vsync vfr -qscale:v 2 frames/frame_%04d.jpg`. The `mpdecimate` filter drops near-duplicate frames (when you pause or move slowly), and `-vsync vfr` stops ffmpeg from re-inserting duplicates to maintain a constant frame rate. This typically gives you 200-400 frames from a 2-minute video, similar to a thorough photo capture. For slower movement (detailed object scanning), use fps=1; for faster movement (walking through a building), use fps=3. After extraction, manually delete any frames that are obviously blurry: even 5 motion-blurred frames can corrupt COLMAP's feature matching and cascade into poor reconstruction quality.
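Deleting blurry frames by eye works for small sets, but the check can be automated with a sharpness score such as the variance of the Laplacian (blurry frames score low). A pure-Python sketch on a toy grayscale grid; in practice you would compute the same score per extracted frame, e.g. with OpenCV's `cv2.Laplacian`, and pick a rejection threshold empirically:

```python
def laplacian_variance(img):
    """Sharpness score: variance of the 4-neighbour Laplacian of a
    grayscale image given as a list of rows of pixel values."""
    h, w = len(img), len(img[0])
    vals = []
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            lap = (img[y - 1][x] + img[y + 1][x]
                   + img[y][x - 1] + img[y][x + 1]
                   - 4 * img[y][x])
            vals.append(lap)
    mean = sum(vals) / len(vals)
    return sum((v - mean) ** 2 for v in vals) / len(vals)

# A hard edge (sharp frame) scores high; a smooth ramp (motion blur) near zero.
sharp = [[0] * 4 + [255] * 4 for _ in range(8)]                 # vertical edge
blurred = [[x * 255 // 7 for x in range(8)] for _ in range(8)]  # gradient ramp
print(laplacian_variance(sharp) > laplacian_variance(blurred))  # True
```

Scores are only comparable within one capture (same resolution and scene), so rank your frames and drop the bottom few rather than using a fixed universal threshold.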
Step 5: COLMAP adjustments for video frames
COLMAP works the same way with video frames as with photos, but one setting matters: the camera model. With photos, each shot may have a slightly different focal length (auto-focus adjustment); with video, the focal length is locked. Tell COLMAP this with the single-camera option, which constrains the solver to one camera model for all frames, improving accuracy and speed. Full command: `colmap automatic_reconstructor --workspace_path . --image_path frames --single_camera 1` (in the modular pipeline, the equivalent flag is `--ImageReader.single_camera 1` on `feature_extractor`). On our test scenes, single-camera mode reduced COLMAP processing time by 30-40% and slightly improved camera pose accuracy. If COLMAP struggles with video frames (many unregistered images), try reducing the frame extraction rate: you may have too many similar-looking frames that cause ambiguous matches.
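If you run the modular pipeline instead of `automatic_reconstructor`, video frames also benefit from COLMAP's `sequential_matcher`, which matches each frame against its temporal neighbours instead of all pairs. A sketch that only assembles the three stage commands; the paths and the overlap value of 10 are placeholder choices:

```python
import shlex

frames = "frames"        # directory of extracted frames
db = "database.db"
sparse = "sparse"

# Stage 1: feature extraction with one shared camera model for all
# frames (video = fixed focal length).
extract = ["colmap", "feature_extractor",
           "--database_path", db,
           "--image_path", frames,
           "--ImageReader.single_camera", "1"]

# Stage 2: sequential matching exploits the temporal ordering of video
# frames; each frame is matched against the next 10 frames.
match = ["colmap", "sequential_matcher",
         "--database_path", db,
         "--SequentialMatching.overlap", "10"]

# Stage 3: incremental mapping, same as for photo captures.
mapper = ["colmap", "mapper",
          "--database_path", db,
          "--image_path", frames,
          "--output_path", sparse]

for cmd in (extract, match, mapper):
    print(shlex.join(cmd))
```

Sequential matching scales linearly with frame count rather than quadratically, which is why it pays off on the 200-400 frames a video capture produces.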
Step 6: Training and post-processing
Training from video frames is identical to training from photos — COLMAP output goes into the same Nerfstudio/gsplat/3DGS pipeline. One difference: video frames tend to produce slightly more Gaussians (5-15% more) because the denser viewpoint coverage gives the optimizer more opportunities to add detail. This means slightly larger PLY files — our 2-minute bench video produced a 205 MB PLY vs 178 MB from photos. SPZ compression handles this fine (both compressed to ~15 MB). After training, view at polyvia3d.com/splat-viewer/ply and compress at polyvia3d.com/splat-convert/ply-to-spz. If you notice floating artifacts (common with video due to occasional motion blur frames), use the PLY compressor at polyvia3d.com/splat-compress/ply — the compression process naturally removes low-opacity Gaussians that cause most artifacts.