
How to Create Gaussian Splatting from Video

Updated Mar 2026

Capturing 200+ individual photos is tedious. Walking around a scene while recording a 2-minute video is fast and natural. But video introduces challenges that still photos avoid: motion blur, lower per-frame resolution, redundant frames, and temporal compression artifacts. We tested both approaches on 15 scenes — objects, rooms, and outdoor areas — comparing photo captures (80-300 images) against video captures (1-3 minute walks) using the same device and lighting. The conclusion: video produces 85-90% of the quality of well-taken photos, with 10% of the capture effort. For most use cases, that trade-off is worth it. This guide covers the complete video-to-3DGS pipeline: recording settings, frame extraction, COLMAP processing, and common pitfalls.


Step-by-Step Guide

  1. When to use video vs photos

    Use video when: you need to capture quickly (site visits, events), the scene is large (outdoor, multi-room), or you are not experienced with 3D capture (video's natural movement covers viewpoints automatically). Use photos when: maximum quality matters (product photography, heritage preservation), the scene has complex geometry (intricate objects, fine details), or you need to control exact viewpoints (architectural documentation). Our benchmarks: on a park bench scene, photo capture (83 photos, 15 minutes) scored a PSNR of 28.4 dB. Video capture (2-minute walk, 360 extracted frames) scored 26.8 dB — a visible but small quality difference. On a large outdoor courtyard, video actually performed better because the natural walking motion produced more uniform coverage than ad-hoc photo positions.

  2. Recording settings

    Resolution: 4K (3840x2160) is ideal. 1080p works but produces noticeably softer reconstructions. Frame rate: record at 30 fps — higher frame rates (60/120) add redundant frames without quality benefit and waste processing time. Stabilization: use optical stabilization (OIS) if available, but disable electronic stabilization (EIS). EIS crops and warps the frame, which confuses COLMAP's camera model. On iPhone: use the built-in Camera app, 4K 30fps, with Action Mode disabled (Action Mode is EIS). On Android: use the default camera, 4K 30fps, with "video stabilization" set to standard (not cinematic or hypersteady). Lock focus and exposure before recording by tapping and holding on the scene.
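Before processing, it is worth confirming the clip actually recorded at the intended settings. A minimal check, assuming `ffprobe` (which ships with ffmpeg) is installed and the clip is named `video.mp4`:

```shell
# Print the video stream's resolution and frame rate.
ffprobe -v error -select_streams v:0 \
  -show_entries stream=width,height,avg_frame_rate \
  -of default=noprint_wrappers=1 video.mp4
# For a 4K 30fps clip, expect width=3840, height=2160, and an
# avg_frame_rate near 30/1 (30000/1001 for 29.97 fps recordings).
```

If the reported resolution is smaller than expected, electronic stabilization may be cropping the frame.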

  3. How to walk: the smooth-and-slow approach

    Move slowly — about half your normal walking pace. Fast movement causes motion blur even with stabilization. Keep the camera pointed at the center of the scene while you move around it. For objects: walk a complete circle at a constant distance (1-2 meters), then do a half circle slightly above and below. For rooms: walk along each wall slowly, then cross through the center. For outdoor: walk a serpentine pattern covering the area. Two critical rules: (1) never change direction abruptly — smooth, gradual turns only, because sudden movements create extreme motion blur frames that corrupt COLMAP matching. (2) Ensure the video covers the same area from at least 3 angles — a single pass along a wall gives you zero depth information.

  4. Frame extraction with ffmpeg

    You do not want all 30 fps × 120 seconds = 3,600 frames. Most are redundant and will slow COLMAP dramatically. Extract at 2-3 fps for walking-speed captures: `ffmpeg -i video.mp4 -vf "fps=2,mpdecimate" -qscale:v 2 frames/frame_%04d.jpg`. The `mpdecimate` filter automatically drops near-duplicate frames (when you pause or move slowly). This typically gives you 200-400 frames from a 2-minute video — similar to a thorough photo capture. For slower movements (detailed object scanning): use fps=1. For faster movements (walking through a building): use fps=3. After extraction, manually delete any frames that are obviously blurry — even 5 motion-blurred frames can corrupt COLMAP's feature matching and cascade into poor reconstruction quality.
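A quick sanity check before extracting is to compute the frame budget: video duration times extraction rate. Using the article's 2-minute example:

```shell
# Frame budget = duration (seconds) x extraction fps.
# 120 s at 2 fps caps the output at 240 frames; mpdecimate then drops
# near-duplicates, so the final count is usually lower still.
duration=120
fps=2
echo $(( duration * fps ))
```

If the budget comes out far above ~400 frames, lower the fps before running ffmpeg rather than extracting and deleting afterwards.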

  5. COLMAP adjustments for video frames

    COLMAP works the same way with video frames as with photos, but one setting matters: camera model. With photos, each shot may have slightly different focal length (auto-focus adjustment). With video, the focal length is locked. Tell COLMAP this with `--ImageReader.single_camera 1` — this constrains the solver to use one camera model for all frames, improving accuracy and speed. Full command: `colmap automatic_reconstructor --workspace_path . --image_path frames --ImageReader.single_camera 1`. On our test scenes, single-camera mode reduced COLMAP processing time by 30-40% and slightly improved camera pose accuracy. If COLMAP struggles with video frames (many unregistered cameras), try reducing the frame extraction rate — you may have too many similar-looking frames that cause ambiguous matches.
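Put together, the extraction and reconstruction steps form a short script. This is a sketch assuming `ffmpeg` and `colmap` are on PATH and the input file is named `video.mp4`:

```shell
#!/bin/sh
set -e  # stop on the first failed step

# 1. Extract ~2 fps, dropping near-duplicate frames.
mkdir -p frames
ffmpeg -i video.mp4 -vf "fps=2,mpdecimate" -qscale:v 2 frames/frame_%04d.jpg

# 2. Review frames/ and delete obviously blurred images before continuing.

# 3. Reconstruct with a single shared camera model, since the focal
#    length is locked for the whole recording.
colmap automatic_reconstructor \
  --workspace_path . \
  --image_path frames \
  --ImageReader.single_camera 1
```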

  6. Training and post-processing

    Training from video frames is identical to training from photos — COLMAP output goes into the same Nerfstudio/gsplat/3DGS pipeline. One difference: video frames tend to produce slightly more Gaussians (5-15% more) because the denser viewpoint coverage gives the optimizer more opportunities to add detail. This means slightly larger PLY files — our 2-minute bench video produced a 205 MB PLY vs 178 MB from photos. SPZ compression handles this fine (both compressed to ~15 MB). After training, view at polyvia3d.com/splat-viewer/ply and compress at polyvia3d.com/splat-convert/ply-to-spz. If you notice floating artifacts (common with video due to occasional motion blur frames), use the PLY compressor at polyvia3d.com/splat-compress/ply — the compression process naturally removes low-opacity Gaussians that cause most artifacts.
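With Nerfstudio, for example, the COLMAP output feeds into training roughly like this. A sketch only: the `--skip-colmap` and `--colmap-model-path` flags and the `splatfacto` method name may vary across Nerfstudio versions, so check `ns-process-data --help` before relying on them:

```shell
# Convert the existing COLMAP reconstruction into Nerfstudio's layout,
# skipping Nerfstudio's own COLMAP run (poses already exist).
ns-process-data images --data frames --output-dir processed \
  --skip-colmap --colmap-model-path sparse/0

# Train a Gaussian splatting model on the processed capture.
ns-train splatfacto --data processed
```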

Frequently Asked Questions

Can I use a GoPro or action camera?
Yes, with caveats. GoPro's wide-angle lens introduces barrel distortion that COLMAP handles well (it models radial distortion). However, GoPro's aggressive electronic stabilization warps the frame in ways COLMAP cannot model — use Linear mode (which removes most distortion) and disable HyperSmooth. Record at 4K 30fps Linear. The wider field of view actually helps coverage — you need fewer passes to capture the same area.
What about 360-degree cameras?
Experimental support exists but results are inconsistent. The main issue is that 360 cameras use dual fisheye lenses stitched together — the stitch seam creates feature-matching artifacts in COLMAP. Some researchers have had success with Ricoh Theta Z1 by processing each hemisphere separately and merging point clouds. For reliable results in 2026, standard cameras (phone, DSLR, GoPro) are still the safer choice.
How long should my video be?
For a single object: 30-60 seconds (one slow orbit). For a room: 1-2 minutes (wall-by-wall walk). For a large outdoor area: 2-5 minutes. Longer videos increase processing time without proportional quality improvement — COLMAP scales roughly quadratically with frame count, so a 5-minute video (600 extracted frames) takes 4x longer to process than a 2-minute video (300 frames).
My video has sections with heavy motion blur. Can I still use it?
Delete the blurry frames after extraction. Even 5-10 heavily blurred frames among 300 good frames can degrade the reconstruction. After ffmpeg extraction, sort frames by file size — motion-blurred JPEG frames are typically 20-40% smaller than sharp frames because blur reduces image detail and compresses better. Delete the smallest 5% of frames as a quick quality filter.
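That file-size heuristic is easy to script. A minimal sketch: the `flag_blur_candidates` helper name is made up, the 5% threshold is the rule of thumb above, and the output is a review list, not an automatic delete:

```shell
# List the smallest ~5% of extracted frames as motion-blur candidates.
# Smaller JPEGs at the same resolution usually mean less detail, i.e. blur.
flag_blur_candidates() {
  dir="$1"
  total=$(ls "$dir"/frame_*.jpg 2>/dev/null | wc -l)
  [ "$total" -eq 0 ] && return 0
  n=$(( total / 20 ))              # 5% of the frame count
  [ "$n" -lt 1 ] && n=1
  ls -Sr "$dir"/frame_*.jpg | head -n "$n"   # -Sr sorts smallest first
}

flag_blur_candidates frames
```

Inspect the listed frames by eye before deleting — a genuinely sharp but low-detail view (a blank wall, say) can also compress small.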
