Rendering a 3D Scene
At its core, computer graphics is about creating two-dimensional images from three-dimensional virtual worlds. This process, known as rendering, is a fascinating blend of geometry, linear algebra, and physics simulation. Let's break down how we go from an abstract 3D object to a final, viewable image.
The Building Blocks: 3D Objects as Meshes
Every object you see in a game or an animated film starts as a collection of points in 3D space. These points are called vertices, each having a specific $(x, y, z)$ coordinate. To create a solid surface, we connect these vertices to form flat polygons, which are called faces. Most commonly, these faces are triangles or quadrilaterals (quads).
This collection of vertices and faces is known as a 3D mesh. Think of it as a digital sculpture or a blueprint that defines the shape and structure of any object in the 3D world, whether it's a simple cube or a complex character.

(Image source: Wikipedia)
The Observer: The Virtual Camera
To see our 3D mesh, we need a virtual camera. Just like a real-world camera, its purpose is to capture a specific view of the scene. In computer graphics, a camera's properties are defined by two crucial sets of information, often represented by matrices:
- World-to-Camera (W2C) Matrix: This is an extrinsic matrix that defines the camera's position and orientation in the 3D world. It answers the questions: Where is the camera located? And which direction is it pointing? It transforms the coordinates of all objects from the global "world space" into "camera space," where the camera is effectively at the origin (0,0,0) and looking down a specific axis (like the negative Z-axis).
- Camera-to-View (C2V) Matrix: This is an intrinsic matrix, also known as the projection matrix. It defines the camera's internal properties, like the lens. It dictates how the 3D scene is flattened onto a 2D plane, creating the sense of perspective. Key attributes controlled by this matrix include the field of view (acting like a zoom lens), the aspect ratio (matching the final image dimensions), and the near and far clipping planes (defining the range of visible depths).
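To make these two matrices concrete, here is a minimal NumPy sketch of a simplified pinhole camera (an illustrative model, not any particular engine's API): it builds a world-to-camera matrix from a rotation and translation, an intrinsic matrix from a field of view and image size, and pushes a single vertex through both.

```python
import numpy as np

def world_to_camera(R, t):
    """Build the 4x4 extrinsic (W2C) matrix from a 3x3 rotation and a translation."""
    W2C = np.eye(4)
    W2C[:3, :3] = R
    W2C[:3, 3] = t
    return W2C

def intrinsics(fov_y_deg, width, height):
    """Build a 3x3 pinhole intrinsic matrix from a vertical field of view and image size."""
    f = (height / 2) / np.tan(np.radians(fov_y_deg) / 2)  # focal length in pixels
    return np.array([[f, 0, width / 2],
                     [0, f, height / 2],
                     [0, 0, 1.0]])

# A single vertex in world space, in homogeneous coordinates.
v_world = np.array([0.5, 1.0, -3.0, 1.0])

# For simplicity the camera here looks down +Z (OpenCV convention) and sits 5 units back.
W2C = world_to_camera(np.eye(3), np.array([0.0, 0.0, 5.0]))
K = intrinsics(fov_y_deg=60, width=640, height=480)

v_cam = W2C @ v_world         # world space -> camera space
uvw = K @ v_cam[:3]           # camera space -> image plane
pixel = uvw[:2] / uvw[2]      # perspective divide -> 2D pixel coordinates
print(pixel)                  # roughly [423.9, 447.8]
```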
The Rendering Process: From 3D to 2D
With our 3D mesh and camera in place, the rendering process can begin. It's fundamentally a two-step procedure that uses the matrices we just discussed to transform our 3D data into a 2D image.
As illustrated in the diagram below, a 3D mesh exists in its own coordinate system within the larger 3D space. The rendering pipeline transforms the vertices of this mesh into the 2D pixel grid of the camera's view.

- Projection: This is the geometric heart of rendering. The engine takes the list of 3D vertices from our mesh and multiplies them by the world-to-camera matrix and then by the perspective camera-to-view matrix. This series of matrix multiplications is a powerful mathematical operation that effectively projects each 3D vertex onto the camera's 2D image plane. The result is a new set of 2D vertices that represent the object's silhouette from the camera's point of view.
- Shading (or Rasterization): Simply projecting the vertices gives us a 2D wireframe. To create a solid, realistic image, we need to fill in the pixels. Rasterization determines which pixels are covered by each projected triangle, and shading determines the final color of each of those pixels. The graphics pipeline iterates over the pixels within the projected triangles of the mesh and calculates their color based on various factors: the object's base color (texture), its material properties (e.g., how shiny or rough it is), the location and intensity of lights in the scene, and whether the pixel is in shadow. The result is the final shaded, textured, and lit 2D image that we see on our screen.
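As a toy illustration of the rasterization step, the sketch below fills one projected triangle by testing each pixel in its bounding box with barycentric coordinates and writing a flat color; a real pipeline would interpolate vertex attributes and run a shader per pixel instead of assigning a constant color.

```python
import numpy as np

def rasterize_triangle(image, p0, p1, p2, color):
    """Fill one 2D triangle into `image` using barycentric coordinates."""
    h, w, _ = image.shape
    xs = [p0[0], p1[0], p2[0]]
    ys = [p0[1], p1[1], p2[1]]
    x_min, x_max = max(int(min(xs)), 0), min(int(max(xs)) + 1, w)
    y_min, y_max = max(int(min(ys)), 0), min(int(max(ys)) + 1, h)

    def edge(a, b, p):  # signed area of triangle (a, b, p), used as a barycentric weight
        return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])

    for y in range(y_min, y_max):
        for x in range(x_min, x_max):
            p = (x + 0.5, y + 0.5)  # sample at the pixel center
            w0, w1, w2 = edge(p1, p2, p), edge(p2, p0, p), edge(p0, p1, p)
            # The pixel is inside the triangle if all three weights share the same sign.
            if (w0 >= 0 and w1 >= 0 and w2 >= 0) or (w0 <= 0 and w1 <= 0 and w2 <= 0):
                image[y, x] = color  # a real shader would mix texture, material, and lighting here

img = np.zeros((480, 640, 3), dtype=np.uint8)
rasterize_triangle(img, (100, 100), (400, 150), (250, 400), color=(255, 128, 0))
```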
Reconstructing a 3D Scene
Inputs and Initial Problems
The journey of reconstruction typically begins with a set of overlapping images of a scene, often sourced from the frames of a video 📹. This collection of 2D views is our only input. From this, we must solve two fundamental problems before we can even think about building a 3D model.

By observing the same point in multiple overlapping images, we can determine its 3D position in space.
(Image source: 3D Reconstruction from Multiple Images)
First is camera pose estimation. As a camera moves through a scene, its position and orientation change with every frame. We must accurately calculate this 6-DoF (six degrees of freedom: 3 for position, 3 for rotation) pose for each image to understand where it was taken from.
Second, we must determine the depth for each pixel. Depth estimation tells us how far every point in the scene is from the camera, transforming our flat images into a spatial map.
Solving for camera poses and sparse scene structure across multiple views is a complex geometric puzzle known as Structure from Motion (SfM); recovering dense per-pixel depth from the posed images is the related problem of Multi-View Stereo (MVS).
From Problems to Representations
Once we have a handle on the camera poses and depth maps, we can begin to build the actual 3D representation. The choice of representation dictates the nature and quality of the final model:
- Point Clouds: By taking the depth value of a pixel and projecting it into 3D space using its corresponding camera pose, we can create a point cloud (see the back-projection sketch after this list). This is the most direct output of the initial reconstruction steps - a raw, foundational skeleton of the scene. While useful for understanding the basic geometry, this representation is limited because it consists only of disconnected points and lacks surfaces for realistic rendering.
- Explicit Meshes: To create a more solid and tangible model, we can process a point cloud to generate an explicit mesh. Algorithms connect the individual 3D points to form a continuous surface of polygons (usually triangles). This gives us a watertight model that can be properly textured and lit, making it ideal for applications like games or simulations.
- Implicit Representations (NeRFs): A more modern approach bypasses creating an explicit mesh altogether. A Neural Radiance Field (NeRF) is a neural network that acts as an implicit representation. It learns a continuous function directly from the images and their camera poses. By feeding the network a 3D coordinate and a viewing direction, it outputs the color and density at that point. This allows for the rendering of incredibly photorealistic novel views by effectively learning both the geometry (like depth) and appearance (color and light interaction) of the entire scene at once.
- Hybrid Representations (3D Gaussian Splatting): Occupying a powerful middle ground, 3D Gaussian Splatting has recently emerged. This technique takes the initial point cloud (generated from SfM) and converts each point into a 3D Gaussian - a soft, colored, transparent blob. These Gaussians are an explicit representation that can be "splatted" or projected onto a 2D plane with extreme efficiency. This hybrid approach achieves the photorealism of NeRFs but with the significant advantage of real-time rendering speeds, marking a major leap forward for the field.
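To make the point-cloud step concrete, here is a small NumPy sketch that back-projects a depth map into world-space 3D points, assuming OpenCV-style pinhole intrinsics K and a camera-to-world pose (both hypothetical inputs):

```python
import numpy as np

def backproject(depth, K, cam_to_world):
    """Lift a depth map into a world-space point cloud.

    depth:        (H, W) array of metric depths along the camera's viewing axis
    K:            (3, 3) pinhole intrinsic matrix
    cam_to_world: (4, 4) camera pose (camera-to-world transform)
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w) + 0.5, np.arange(h) + 0.5)   # pixel centers
    pixels = np.stack([u, v, np.ones_like(u)], axis=-1)          # homogeneous pixel coords

    # Pixel -> camera space: x_cam = depth * K^-1 * [u, v, 1]
    rays = pixels @ np.linalg.inv(K).T
    points_cam = rays * depth[..., None]

    # Camera space -> world space using the camera pose.
    points_hom = np.concatenate([points_cam, np.ones_like(depth)[..., None]], axis=-1)
    points_world = points_hom @ cam_to_world.T
    return points_world[..., :3].reshape(-1, 3)

# Example with a flat synthetic depth map and an identity pose.
K = np.array([[500.0, 0, 320], [0, 500.0, 240], [0, 0, 1]])
cloud = backproject(np.full((480, 640), 2.0), K, np.eye(4))
print(cloud.shape)  # (307200, 3)
```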


(Image source: 3D Modelling for Programmers)

(Image source: blakegella.com)

(Image source: A Survey on 3D Gaussian Splatting)
3D Point Clouds with COLMAP
Now that we understand the reconstruction challenge, let's see how COLMAP actually solves it. Think of COLMAP as a geometric detective - it examines a collection of photos and figures out both where each camera was positioned and the 3D structure of the scene.
The Three-Stage Process
Feature Detection: COLMAP scans each image for distinctive visual landmarks - corners, edges, and unique patterns that can be recognized from different angles. Using SIFT features, it creates mathematical "fingerprints" for thousands of points per image.
Feature Matching: Like a matchmaker, COLMAP compares these SIFT (Scale-Invariant Feature Transform) fingerprints across all image pairs. Here's where the computational reality hits - with $N$ images, exhaustive matching requires $N \times (N-1)/2$ pairwise comparisons. For a modest 100-image dataset, that's 4,950 image pairs to analyze. Each comparison involves matching potentially thousands of features, making this stage scale quadratically with the number of input images.

(Image source: SIFT)
Bundle Adjustment: The magic happens here, but it's computationally expensive magic. COLMAP simultaneously estimates camera poses and triangulates 3D point positions through global optimization. This involves solving large sparse systems with thousands of variables, often taking hours for complex scenes.
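If you want to run these three stages yourself, COLMAP's pycolmap bindings expose them as a few high-level calls. The sketch below is a minimal outline; treat the exact function names and arguments as an assumption about your installed pycolmap version and check its documentation before relying on them.

```python
from pathlib import Path
import pycolmap  # COLMAP's Python bindings (assumed installed)

image_dir = "images/"          # folder with the overlapping photos
database_path = "colmap.db"    # SQLite database COLMAP uses for features and matches
output_path = "sparse/"        # where the sparse reconstruction will be written
Path(output_path).mkdir(exist_ok=True)

# 1) Feature detection: SIFT keypoints and descriptors for every image.
pycolmap.extract_features(database_path, image_dir)

# 2) Feature matching: exhaustive pairwise matching, quadratic in the number of images.
pycolmap.match_exhaustive(database_path)

# 3) Incremental mapping with bundle adjustment: camera poses and a sparse 3D point cloud.
reconstructions = pycolmap.incremental_mapping(database_path, image_dir, output_path)
print(reconstructions[0].summary())
```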
The Computational Cost
The quadratic complexity means processing time explodes with dataset size. A 50-image sequence might take 30 minutes, while 200 images could require 8+ hours on the same hardware. The feature matching stage alone can consume 70-80% of the total runtime, as COLMAP exhaustively searches for correspondences across all image pairs.
The Result: A Sparse Foundation
What emerges after this lengthy process is a sparse but highly accurate 3D point cloud - a constellation of reliable points that multiple cameras agree upon. Each point represents a location verified through geometric constraints across multiple views.


(Image source: Peter Falkingham)
This sparse output serves as the geometric backbone for all advanced techniques. Every modern 3D reconstruction method - from NeRFs to Gaussian Splatting - builds upon the camera poses and sparse geometry that COLMAP provides.
COLMAP's reliability comes at a computational cost, driving researchers to seek faster alternatives - which is exactly what we'll explore next with DUSt3R.
Modern Alternatives: DUSt3R and MASt3R-SfM
What if we could skip the entire classical SfM pipeline altogether? Instead of detecting keypoints, matching them, and triangulating - what if a neural network could directly predict 3D structure from raw pixels?
DUSt3R: Direct 3D Prediction
DUSt3R takes a radically different approach. Given two images, it directly regresses dense "pointmaps" - essentially a 3D coordinate for every pixel. The core insight is elegant: rather than building complex pipelines to infer 3D structure, train a transformer to predict it directly.
The network architecture follows a Siamese design where both images are encoded through shared ViT encoders, then processed by twin decoders that output pointmaps expressed in the first camera's coordinate frame. This shared coordinate system is crucial - it means the 3D points from both views are already aligned, eliminating the need for complex pose estimation procedures.

(Image source: DUSt3R)
The magic happens in training: DUSt3R learns geometric priors from massive datasets, developing an intuitive understanding of 3D structure that often surpasses handcrafted algorithms. It handles challenging cases where classical methods fail: minimal overlap between views, textureless regions, or insufficient camera motion.
But here's the catch - DUSt3R processes image pairs. For N images, you need on the order of N² forward passes. With a typical transformer requiring substantial memory, this becomes prohibitive beyond ~50 images. The method trades classical complexity for neural complexity, but the quadratic scaling remains.
MASt3R-SfM: Making Neural SfM Scale
MASt3R-SfM tackles DUSt3R's scalability head-on with two key innovations.
Richer representations: Instead of just predicting 3D coordinates, MASt3R outputs dense feature descriptors for each pixel patch. These learned features capture local geometric and appearance information, enabling robust matching between pointmaps even when 3D coordinates alone might be ambiguous.
Graph-based processing: Rather than exhaustively processing all pairs, MASt3R-SfM builds a sparse scene graph. The frozen MASt3R encoder - originally designed for 3D prediction - turns out to be an excellent image retriever. Using ASMK aggregation on encoder features, it identifies likely overlapping pairs with minimal computational overhead.

(Image source: MASt3R-SfM)
The resulting graph contains only O(N) edges instead of O(N²). A typical approach connects images to k=10 nearest neighbors plus a sparse set of "keyframe" images, creating a connected graph with linear complexity.
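The sketch below shows the idea behind this retrieval step in schematic form. It is not MASt3R-SfM's actual ASMK retrieval; it simply treats each image as one global descriptor (a hypothetical `features` array) and connects every image to its k most similar neighbors, yielding O(N·k) edges instead of O(N²).

```python
import numpy as np

def build_scene_graph(features, k=10):
    """Connect each image to its k most similar images by cosine similarity.

    features: (N, D) array with one global descriptor per image (hypothetical input;
              MASt3R-SfM instead aggregates encoder tokens with ASMK retrieval).
    Returns a set of undirected edges (i, j) with i < j.
    """
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = f @ f.T
    np.fill_diagonal(sim, -np.inf)  # never pair an image with itself

    edges = set()
    for i in range(len(features)):
        for j in np.argsort(sim[i])[-k:]:   # indices of the top-k neighbors of image i
            j = int(j)
            edges.add((min(i, j), max(i, j)))
    return edges

edges = build_scene_graph(np.random.rand(200, 256), k=10)
print(len(edges))  # on the order of N*k edges, far fewer than N*(N-1)/2 = 19,900
```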
Two-stage optimization then aligns these local reconstructions globally. First, pointmaps are rigidly aligned in 3D space using the learned correspondences. Then a refinement stage minimizes 2D reprojection errors, similar to bundle adjustment but operating on the neural predictions rather than classical triangulated points.
Why This Matters
The complexity difference is dramatic:
- COLMAP: O(N²) feature matching with expensive RANSAC iterations
- DUSt3R: O(N²) transformer forward passes, memory-bound
- MASt3R-SfM: O(N) forward passes + O(N²) retrieval (but retrieval is fast)
In practice, MASt3R-SfM handles 200+ images where DUSt3R runs out of memory at 50. The neural approach also shows remarkable consistency - while classical methods degrade significantly with fewer views or challenging motion, MASt3R-SfM maintains steady performance from 3 to 100+ images.
MASt3R-SfM works on truly unconstrained image collections. No assumptions about camera motion patterns, no requirements for high overlap, no manual parameter tuning. You feed it photos, it gives you cameras and 3D structure.


NeRFs: Neural Radiance Fields
From Discrete Points to Continuous Fields
We've successfully extracted camera poses and 3D structure from our image collections. Whether using COLMAP's methodical geometric analysis or MASt3R-SfM's neural predictions, we end up with the same foundation: a constellation of reliable 3D points and precisely calibrated camera positions.
But here's the limitation we now face - our 3D representation is still fundamentally discrete and sparse. Those reconstructed points are just the skeleton of our scene. Between each point lies vast empty space, and our original training images contain rich photometric information that no point cloud can capture: subtle color variations, complex material interactions, view-dependent lighting effects, and fine geometric details that exist in the continuous space between our samples.
This is where we make a conceptual leap from discrete geometry to continuous representation.
Neural Radiance Fields (NeRFs) transform our sparse geometric foundation into something entirely different - a learned, continuous function that describes every possible location in 3D space. Instead of asking "where are the points?", NeRFs answer "what exists here?" for any coordinate you can imagine, even in the seemingly empty regions between our reconstructed points.
The NeRF Architecture
A NeRF represents the entire scene as a continuous function. Feed the network any 3D coordinate $(x,y,z)$ and viewing direction $(\theta,\phi)$, and it outputs the color $(r,g,b)$ and density $(\sigma)$ at that point:
$$F(x,y,z,\theta,\phi) \rightarrow (r,g,b,\sigma)$$
The scene becomes a learned, continuous field of radiance values. This is fundamentally different from discrete representations—every point in 3D space has a potential color and opacity.
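A minimal PyTorch sketch of this function $F$ is shown below. It is a toy version of the NeRF MLP (positional encoding feeding a small fully connected network, with density depending on position only and color also on view direction); the layer sizes are arbitrary and it omits the original paper's skip connections and hierarchical sampling.

```python
import torch
import torch.nn as nn

def positional_encoding(x, num_freqs):
    """Map coordinates to sin/cos features so the MLP can represent fine detail."""
    feats = [x]
    for i in range(num_freqs):
        feats += [torch.sin((2.0 ** i) * x), torch.cos((2.0 ** i) * x)]
    return torch.cat(feats, dim=-1)

class TinyNeRF(nn.Module):
    def __init__(self, pos_freqs=10, dir_freqs=4, hidden=256):
        super().__init__()
        self.pos_freqs, self.dir_freqs = pos_freqs, dir_freqs
        pos_dim = 3 * (1 + 2 * pos_freqs)
        dir_dim = 3 * (1 + 2 * dir_freqs)
        self.trunk = nn.Sequential(
            nn.Linear(pos_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.sigma_head = nn.Linear(hidden, 1)            # density depends on position only
        self.color_head = nn.Sequential(                  # color also depends on view direction
            nn.Linear(hidden + dir_dim, hidden // 2), nn.ReLU(),
            nn.Linear(hidden // 2, 3), nn.Sigmoid(),
        )

    def forward(self, xyz, view_dir):
        # xyz: (B, 3) sample positions; view_dir: (B, 3) unit viewing directions.
        h = self.trunk(positional_encoding(xyz, self.pos_freqs))
        sigma = torch.relu(self.sigma_head(h))
        rgb = self.color_head(torch.cat([h, positional_encoding(view_dir, self.dir_freqs)], dim=-1))
        return rgb, sigma

model = TinyNeRF()
rgb, sigma = model(torch.rand(1024, 3), torch.rand(1024, 3))  # 1024 sampled points
```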
Volume Rendering Process
Training follows a computationally intensive pattern. For each training image, rays are cast from the camera center through every pixel into the 3D scene. Along each ray, multiple 3D points are sampled and queried through the neural network.

Image source: Neural Fields
The key insight is volume rendering. Instead of finding surface intersections, NeRF accumulates color and opacity along the entire ray. Points with high density contribute more to the final pixel color, while transparent regions allow deeper points to show through. This differentiable process enables backpropagation of pixel-level losses directly to the network weights.
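Concretely, the pixel color is an alpha-composited sum over the samples along the ray. Here is a compact NumPy sketch of that accumulation, the standard discrete volume-rendering formula, detached from any network:

```python
import numpy as np

def composite_ray(rgb, sigma, t):
    """Blend per-sample colors along one ray into a single pixel color.

    rgb:   (S, 3) colors predicted at the S samples along the ray
    sigma: (S,)   densities predicted at those samples
    t:     (S,)   distances of the samples from the camera (increasing)
    """
    delta = np.append(np.diff(t), 1e10)                    # spacing between consecutive samples
    alpha = 1.0 - np.exp(-sigma * delta)                   # opacity contributed by each segment
    trans = np.cumprod(np.append(1.0, 1.0 - alpha[:-1]))   # light surviving up to each sample
    weights = trans * alpha                                # contribution of each sample to the pixel
    return (weights[:, None] * rgb).sum(axis=0)            # final pixel color

color = composite_ray(np.random.rand(64, 3), np.random.rand(64), np.linspace(2.0, 6.0, 64))
```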
The Overfitting Strategy
NeRFs deliberately overfit to their training data. The network must memorize how light behaves in this specific environment, learning not just the geometry we reconstructed, but also complex material interactions, shadows, and volumetric effects.
This process requires thousands of training iterations, typically taking 6–12 hours on modern GPUs for a single scene. The network encodes every lighting condition, surface detail, and shadow captured in the training images.
The Payoff: Photorealistic Novel Views
Once trained, the NeRF can render photorealistic images from any camera position within the training distribution. The results often surpass traditional mesh-based rendering, naturally handling view-dependent effects like reflections, volumetric phenomena like smoke, and fine geometric details. The continuous nature enables rendering at any resolution with smooth, artifact-free camera motion.
Modern implementations like Instant-NGP have reduced training time from hours to minutes, but the fundamental trade-off remains. Creating a neural representation that can synthesize photorealistic novel views requires substantial computation to overfit the network to your specific scene.
3D Gaussian Splatting
We've seen how NeRFs achieve photorealistic novel views by learning continuous radiance fields. However, NeRFs have a critical limitation: rendering speed. Even optimized versions require seconds per image, making them unsuitable for interactive applications like gaming or AR.
3D Gaussian Splatting (3DGS) bridges this gap by representing scenes as collections of 3D Gaussian primitives instead of neural networks. This approach maintains NeRF-quality visuals while achieving real-time rendering speeds.
The Gaussian Representation
Instead of querying a neural network along camera rays, 3DGS represents the scene using millions of 3D Gaussians—colored, semi-transparent ellipsoids positioned throughout space. Think of it as replacing NeRF's smooth function with a pointillist painting made of 3D brushstrokes.
Each 3D Gaussian Splat stores:
- Center Position $\mu \in \mathbb{R}^3$: Specifies the central location of the Gaussian in the 3D space.
- Color Representation: Utilizes spherical harmonics (SH) coefficients $c \in \mathbb{R}^k$ to encode the color information, where $k$ represents the degrees of freedom in the color model.
- Rotation Factor $r \in \mathbb{R}^4$: Defined in quaternion terms to manage the orientation of the Gaussian.
- Scale Factor $s \in \mathbb{R}^3$: Determines the size of the Gaussian along each axis, forming an ellipsoidal shape.
- Opacity $\alpha \in \mathbb{R}$: Controls the transparency.

When rendering, these 3D Gaussians project onto the image plane and blend together using alpha compositing—a process that's highly parallelizable on modern GPUs.
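For example, each Gaussian's anisotropic shape comes from combining its quaternion rotation and per-axis scale into a 3x3 covariance matrix, $\Sigma = R S S^T R^T$. Below is a small NumPy sketch of that construction, applied to one hypothetical Gaussian:

```python
import numpy as np

def quat_to_rotation(q):
    """Convert a unit quaternion (w, x, y, z) to a 3x3 rotation matrix."""
    w, x, y, z = q / np.linalg.norm(q)
    return np.array([
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z),     2 * (x * z + w * y)],
        [2 * (x * y + w * z),     1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y),     2 * (y * z + w * x),     1 - 2 * (x * x + y * y)],
    ])

def covariance(rotation_q, scale):
    """Sigma = R S S^T R^T: an ellipsoid oriented by the quaternion, sized by the scales."""
    R = quat_to_rotation(rotation_q)
    S = np.diag(scale)
    return R @ S @ S.T @ R.T

# One hypothetical Gaussian: elongated along x, rotated 45 degrees about the z-axis.
q = np.array([np.cos(np.pi / 8), 0.0, 0.0, np.sin(np.pi / 8)])
sigma = covariance(q, scale=np.array([0.30, 0.05, 0.05]))
print(sigma)
```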
Training: Sculpting with Mathematics
Like NeRFs, 3DGS requires camera poses and benefits enormously from sparse point clouds (from COLMAP or MASt3R-SfM). Each sparse point becomes an initial Gaussian that optimization then refines.
The training objective minimizes reconstruction error:
$$L = (1 - \lambda)L_1 + \lambda L_{D-SSIM}$$
The real innovation is adaptive density control. During optimization, the system continuously monitors reconstruction quality and automatically adjusts the representation:
- Densification: In under-reconstructed regions, large Gaussians are split into smaller ones (like cell division), while small Gaussians are cloned to cover sparse areas.
- Pruning: Gaussians with low opacity or minimal contribution get removed.
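In pseudocode, adaptive density control boils down to a few threshold tests applied at regular intervals during training. The sketch below is schematic; the data structure, attribute names, and threshold values are illustrative rather than the reference implementation's.

```python
from dataclasses import dataclass, replace
from typing import List

@dataclass
class Gaussian:                 # illustrative container, not the real 3DGS data structure
    opacity: float
    scale: tuple                # (sx, sy, sz)
    grad_accum: float           # accumulated positional gradient since the last check

def adaptive_density_control(gaussians: List[Gaussian],
                             grad_thresh=0.0002, scale_thresh=0.01, min_opacity=0.005):
    updated = []
    for g in gaussians:
        if g.opacity < min_opacity:
            continue                                  # prune: contributes almost nothing
        if g.grad_accum > grad_thresh:                # region is still poorly reconstructed
            if max(g.scale) > scale_thresh:
                # Split: replace one large Gaussian by two smaller copies.
                child = replace(g, scale=tuple(s / 1.6 for s in g.scale), grad_accum=0.0)
                updated += [child, replace(child)]
            else:
                # Clone: duplicate a small Gaussian to add coverage in a sparse area.
                updated += [replace(g, grad_accum=0.0), replace(g, grad_accum=0.0)]
        else:
            updated.append(replace(g, grad_accum=0.0))
    return updated

dense = adaptive_density_control([Gaussian(0.8, (0.05, 0.02, 0.02), 0.001)])
print(len(dense))  # 2: the large, high-gradient Gaussian was split
```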

Image source: 3DGS Densification
This creates organic sculpting where complexity adapts to scene requirements. Detailed areas like foliage develop dense clusters of small Gaussians, while smooth surfaces use fewer, larger ones.
Performance Revolution
The performance difference is dramatic. 3DGS trains in 15-30 minutes compared to NeRF's 6-12 hours, a 10-20x speedup. This comes from the explicit representation requiring no expensive neural network evaluations during optimization.
Rendering performance is even more impressive: 30-60+ FPS at high resolutions. The rasterization process (project, sort by depth, alpha blend) maps perfectly to graphics hardware, unlike NeRF's complex ray marching.
Beyond Rendering: Analysis and Editing
The explicit nature of 3D Gaussians unlocks capabilities impossible with NeRFs. Since scenes consist of discrete Gaussians with interpretable parameters, you can:
- Analyze structure by examining Gaussian distributions
- Edit scenes by moving, removing, or modifying individual Gaussians
- Extract semantics by clustering Gaussians with similar properties
- Compress by pruning redundant elements
Researchers have demonstrated removing objects, changing materials, and animating static scenes—all by manipulating the underlying Gaussian representation.
Technical Considerations
3DGS initialization strongly depends on sparse point cloud quality. Poor initial geometry leads to longer training and potential artifacts, especially for camera positions far from training views.
Memory scales with scene complexity: typical outdoor scenes require 1-5 million Gaussians. However, this explicit storage often proves more efficient than the large neural networks needed for comparable NeRF quality.
The method can struggle with complex optical effects. Highly reflective surfaces or intricate volumetric phenomena may not be perfectly captured by discrete Gaussians, though visual quality typically remains high for practical scenes.
From Theory to Practice
Two tools let you experiment with the complete reconstruction pipeline: a web demo and a local installation.
Hugging Face Demo
This space runs the full pipeline from images to 3D models in two steps.
Upload overlapping images of any scene: rooms, objects, or outdoor areas. The first tab processes these through MASt3R:
- Neural feature extraction and matching
- Graph-based scene reconstruction
- Dense pointmap generation
- Camera pose estimation
Within minutes, you get sparse point clouds and camera parameters. Switch to the "3DGS" tab to train a Gaussian Splatting model on your reconstructed scene. The training takes 15-30 minutes and produces real-time renderable results you can download.


Speed Revolution: Minutes to 3D
The combination of MASt3R and 3DGS represents a speed breakthrough in 3D reconstruction. For small datasets (up to 10 images), you can go from raw photos to a fully interactive 3D model in under 10 minutes total.
This dramatic speedup comes from two innovations working together. MASt3R's neural approach skips the expensive feature matching that makes classical methods slow, while 3DGS training converges quickly thanks to the high-quality initialization from MASt3R's dense pointmaps.
Compare this to traditional workflows: COLMAP alone could take hours on the same dataset, and NeRF training would require additional hours even with perfect camera poses. The MASt3R + 3DGS combination achieves comparable visual quality in a fraction of the time.
For larger datasets, processing scales gracefully—20-30 images typically complete in 20-30 minutes, while datasets of 50+ images finish within an hour. This makes iterative experimentation practical, where you can quickly test different capture strategies or scene compositions.
Local Installation
This repository provides the complete pipeline for local experimentation.
Installation follows standard practices:
```bash
git clone https://github.com/nerlfield/wild-gaussian-splatting
cd wild-gaussian-splatting
conda env create -f environment.yml
```
The repository supports multiple reconstruction methods:
- DUSt3R: Neural reconstruction for smaller datasets or memory-constrained scenarios.
- MASt3R: Fast neural reconstruction that scales to larger image collections.
Both methods output camera poses and point clouds converted to COLMAP format, ensuring compatibility with the broader 3D reconstruction ecosystem. This standardized output is fed into the 3DGS training pipeline, enabling direct comparison of reconstruction quality and training time across approaches.
The conversion to COLMAP format is particularly valuable because it allows integration with any downstream method that expects traditional SfM outputs.
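As a quick sanity check of that compatibility, you can load the converted output with the pycolmap bindings and inspect the cameras, images, and 3D points it contains (the path below is hypothetical, and the API assumes a recent pycolmap version):

```python
import pycolmap

# Load a reconstruction written in COLMAP's sparse format (cameras, images, points3D).
rec = pycolmap.Reconstruction("output/colmap/sparse/0")  # hypothetical path

print(f"{len(rec.cameras)} cameras, {len(rec.images)} images, {len(rec.points3D)} 3D points")
for image_id, image in rec.images.items():
    print(image_id, image.name)  # each registered image, with its estimated pose attached
```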
References
- Mildenhall, Ben, et al. "NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis." European Conference on Computer Vision (ECCV), 2020. https://arxiv.org/abs/2003.08934
- Kerbl, Bernhard, et al. "3D Gaussian Splatting for Real-Time Radiance Field Rendering." ACM Transactions on Graphics (SIGGRAPH), 2023. https://arxiv.org/abs/2308.04079
- Schönberger, Johannes L., and Jan-Michael Frahm. "Structure-from-Motion Revisited." IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016. https://arxiv.org/abs/1604.07379
- Wang, Shuzhe, et al. "DUSt3R: Geometric 3D Vision Made Easy." arXiv preprint, 2023. https://arxiv.org/abs/2312.14132
- Leroy, Vincent, et al. "MASt3R: Matching Anything for Stereoscopic Reconstruction." arXiv preprint, 2024. https://arxiv.org/abs/2409.19152
- MĂĽller, Thomas, et al. "Instant Neural Graphics Primitives with a Multiresolution Hash Encoding." ACM Transactions on Graphics (SIGGRAPH), 2022. https://arxiv.org/abs/2201.05989
- Lowe, David G. "Distinctive Image Features from Scale-Invariant Keypoints." International Journal of Computer Vision, 2004. Paper PDF
- Triggs, Bill, et al. "Bundle Adjustment — A Modern Synthesis." Vision Algorithms: Theory and Practice, Springer, 2000. Paper PDF
- Pharr, Matt, Wenzel Jakob, and Greg Humphreys. "Physically Based Rendering: From Theory to Implementation." Third Edition, Morgan Kaufmann, 2016. https://www.pbr-book.org/
- Scratchapixel Team. "Computer Graphics from Scratch." Online Resource. https://www.scratchapixel.com/