Project Overview
This project explores the implementation of Neural Radiance Fields (NeRF) to represent 2D images and 3D scenes. The project is divided into two parts: fitting a neural field to a 2D image and creating a NeRF for multi-view 3D scenes. Techniques like sinusoidal positional encoding, ray sampling, and volume rendering are used to build and optimize these models.
Part 1: Neural Field for 2D Image
In this part, a neural field is optimized to fit a 2D image. A multilayer perceptron (MLP) is used with sinusoidal positional encoding to learn the mapping from pixel coordinates to RGB values. The training process includes random pixel sampling and minimizing mean squared error loss.
The architecture used is the one below:
The hyperparameters we used were:
- Number of layers: 4
- Channel size: 256
- Max frequency for the positional encoding: 10
- Learning rate: 0.01
- Batch Size: 10K
- Number of Epochs: 10
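A minimal sketch of this setup in PyTorch, using the hyperparameters above (the names Field2D, coords, and colors are illustrative, not the exact ones from our code):

```python
import math
import torch
import torch.nn as nn

def positional_encoding(x, num_freqs=10):
    """Sinusoidal positional encoding: append sin(2^k * pi * x) and cos(2^k * pi * x)
    for k = 0 .. num_freqs-1 to the raw coordinates."""
    out = [x]
    for k in range(num_freqs):
        out.append(torch.sin((2.0 ** k) * math.pi * x))
        out.append(torch.cos((2.0 ** k) * math.pi * x))
    return torch.cat(out, dim=-1)

class Field2D(nn.Module):
    """MLP mapping encoded (u, v) pixel coordinates to RGB."""
    def __init__(self, num_freqs=10, hidden=256, num_layers=4):
        super().__init__()
        self.num_freqs = num_freqs
        in_dim = 2 + 2 * 2 * num_freqs            # raw (u, v) plus a sin/cos pair per frequency
        layers, dim = [], in_dim
        for _ in range(num_layers):
            layers += [nn.Linear(dim, hidden), nn.ReLU()]
            dim = hidden
        layers += [nn.Linear(dim, 3), nn.Sigmoid()]  # RGB in [0, 1]
        self.net = nn.Sequential(*layers)

    def forward(self, uv):
        return self.net(positional_encoding(uv, self.num_freqs))

# Training sketch: sample a random batch of pixels and minimize MSE.
# coords holds normalized pixel coordinates, colors the matching RGB values.
model = Field2D()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
# for step in range(num_iters):
#     idx = torch.randint(0, coords.shape[0], (10_000,))          # batch size 10K
#     loss = ((model(coords[idx]) - colors[idx]) ** 2).mean()
#     optimizer.zero_grad(); loss.backward(); optimizer.step()
```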
Training Process Visualization
Iteration 10
Iteration 100
Iteration 500
Iteration 800
Original Image
Visualization of the predicted image across training iterations.
PSNR Curve for Obama
Plot showing the PSNR across training iterations for the picture of Obama. We can see that the PSNR has occasional spikes that don't appear in the fox's PSNR curve.
Here are images of the fox across training iterations, using the task's hyperparameters, along with the original image.
Iteration 10
Iteration 100
Iteration 100
Iteration 500
Original Image
PSNR Curve for the fox
Plot showing the PSNR across training iterations for the Fox.
Part 2: Neural Radiance Field for 3D Scenes
Part 2.1: Create Rays from Cameras
Camera to World Coordinate Conversion
We implemented the camera-to-world coordinate transformation by defining a function x_w = transform(c2w, x_c), which converts points from camera space to world space. The inverse operation was also implemented to verify the correctness of the transformation. We used either NumPy or PyTorch for the matrix multiplication, ensuring support for batched operations to handle multiple points efficiently.
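A minimal NumPy sketch of these two functions (assuming x_c is an (N, 3) array of points and c2w is a 4x4 camera-to-world matrix):

```python
import numpy as np

def transform(c2w, x_c):
    """Convert camera-space points x_c (N, 3) to world space using a 4x4 c2w matrix."""
    x_c_h = np.concatenate([x_c, np.ones((x_c.shape[0], 1))], axis=-1)  # homogeneous (N, 4)
    x_w_h = x_c_h @ c2w.T                                               # batched matrix multiply
    return x_w_h[:, :3]

def inverse_transform(c2w, x_w):
    """World -> camera, used to sanity-check transform() by round-tripping points."""
    w2c = np.linalg.inv(c2w)
    x_w_h = np.concatenate([x_w, np.ones((x_w.shape[0], 1))], axis=-1)
    return (x_w_h @ w2c.T)[:, :3]
```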
Pixel to Camera Coordinate Conversion
To map pixel coordinates (u, v) to camera coordinates (x_c, y_c, z_c), we defined a function x_c = pixel_to_camera(K, uv, s), where K is the camera intrinsic matrix and s represents depth. This function inverts the projection process defined by the intrinsic matrix. We also added batch processing support for scalability during rendering.
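A sketch of this function, assuming K follows the usual [[fx, 0, ox], [0, fy, oy], [0, 0, 1]] layout:

```python
import numpy as np

def pixel_to_camera(K, uv, s):
    """Invert the pinhole projection: pixel coords (N, 2) at depth s -> camera coords (N, 3)."""
    uv_h = np.concatenate([uv, np.ones((uv.shape[0], 1))], axis=-1)  # homogeneous pixels (N, 3)
    K_inv = np.linalg.inv(K)
    s = np.asarray(s).reshape(-1, 1)          # depth, broadcastable over the batch
    return s * (uv_h @ K_inv.T)               # x_c = s * K^{-1} [u, v, 1]^T
```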
Pixel to Ray Conversion
For converting pixel coordinates to rays, we implemented a function ray_o, ray_d = pixel_to_ray(K, c2w, uv), where ray_o is the ray origin and ray_d is the normalized ray direction. This function leverages the earlier transformations, computing the ray origin as the camera position and the ray direction using normalized vector math. The implementation supports batched coordinates for processing multiple rays simultaneously.
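A sketch of this function, building on the transform and pixel_to_camera sketches above:

```python
import numpy as np

def pixel_to_ray(K, c2w, uv):
    """Pixel coords (N, 2) -> ray origins and normalized ray directions in world space."""
    ray_o = np.tile(c2w[:3, 3], (uv.shape[0], 1))      # camera position, repeated per ray
    x_c = pixel_to_camera(K, uv, s=1.0)                # a point at depth 1 along each ray
    x_w = transform(c2w, x_c)                          # lift it to world space
    ray_d = x_w - ray_o
    ray_d = ray_d / np.linalg.norm(ray_d, axis=-1, keepdims=True)  # unit-length directions
    return ray_o, ray_d
```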
Part 2.2: Sampling
Sampling Rays from Images
We extended the random sampling approach from Part 1 to include multi-view images. A dataloader was implemented to handle multiple images and generate rays uniformly or with per-image sampling. The pixel coordinates were converted to rays using the functions from Part 2.1, and each ray was associated with the corresponding pixel color.
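A sketch of how such a sampler might look (the function name sample_rays and the array shapes are illustrative, not our exact dataloader):

```python
import numpy as np

def sample_rays(images, K, c2ws, n_rays):
    """Sample n_rays random pixels across all training images and return rays + colors.
    images: (M, H, W, 3), c2ws: (M, 4, 4)."""
    M, H, W, _ = images.shape
    img_idx = np.random.randint(0, M, size=n_rays)
    u = np.random.randint(0, W, size=n_rays)
    v = np.random.randint(0, H, size=n_rays)
    uv = np.stack([u + 0.5, v + 0.5], axis=-1)     # offset to pixel centers (a common convention)
    colors = images[img_idx, v, u]                 # ground-truth pixel colors, (n_rays, 3)
    rays_o = np.empty((n_rays, 3))
    rays_d = np.empty((n_rays, 3))
    for i in range(n_rays):                        # simple per-ray loop; can be batched per image
        o, d = pixel_to_ray(K, c2ws[img_idx[i]], uv[i:i + 1])
        rays_o[i], rays_d[i] = o[0], d[0]
    return rays_o, rays_d, colors
```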
Sampling Points along Rays
To sample 3D points along each ray, we defined t as a uniform range using np.linspace(near, far, n_samples) and computed each 3D point as x = R_o + R_d * t. To introduce randomness and avoid overfitting, we added a perturbation to t during training, ensuring better generalization. This step was crucial for generating accurate volume rendering results.
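A sketch of this step (near, far, and n_samples here are illustrative defaults, not necessarily the values we used):

```python
import numpy as np

def sample_points_along_rays(rays_o, rays_d, near=2.0, far=6.0, n_samples=64, perturb=True):
    """Sample n_samples 3D points along each ray via x = r_o + r_d * t."""
    t = np.linspace(near, far, n_samples)                        # (n_samples,)
    t = np.broadcast_to(t, (rays_o.shape[0], n_samples)).copy()  # (N, n_samples)
    if perturb:
        # jitter each sample within its bin during training to avoid overfitting to fixed depths
        t += np.random.rand(*t.shape) * (far - near) / n_samples
    points = rays_o[:, None, :] + rays_d[:, None, :] * t[..., None]  # (N, n_samples, 3)
    return points, t
```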
Part 2.3: Putting the Dataloader Together
We created a custom dataloader that processes pixel coordinates from multi-view images, converts them into rays, and samples 3D points along each ray. This dataloader returns the ray origin, ray direction, sampled points, and pixel colors. To verify correctness, we visualized a subset of rays and sampled points to ensure they matched the expected camera frustums.
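Put together, one batch from the dataloader might look roughly like this (composing the sketches above; the real implementation differs in details such as batching and device handling):

```python
def nerf_batch(images, K, c2ws, n_rays=10_000):
    """One training batch: rays, sampled 3D points along them, and target pixel colors."""
    rays_o, rays_d, colors = sample_rays(images, K, c2ws, n_rays)
    points, t = sample_points_along_rays(rays_o, rays_d)
    return {"rays_o": rays_o, "rays_d": rays_d, "points": points, "t": t, "colors": colors}
```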
Part 2.4: Neural Radiance Field
Network Architecture
We modified the neural network from Part 1 to handle 3D inputs instead of 2D. The network now predicts both the density and RGB color for each 3D point. Positional encoding was used for the 3D input coordinates and ray directions, with a reduced frequency for ray direction encoding. We made the MLP deeper and injected the positional encoding features of the input at intermediate layers to improve gradient flow and model capacity.
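A minimal PyTorch sketch of this architecture, reusing the positional_encoding helper from the Part 1 sketch (the exact layer arrangement in our network may differ):

```python
import torch
import torch.nn as nn

class NeRFMLP(nn.Module):
    """Encoded 3D position -> density; density features + encoded view direction -> RGB.
    The encoded position is re-injected at an intermediate layer (skip connection)."""
    def __init__(self, L_x=10, L_dir=4, hidden=256, depth=8, skip_at=4):
        super().__init__()
        self.L_x, self.L_dir, self.skip_at = L_x, L_dir, skip_at
        in_x = 3 + 2 * 3 * L_x          # encoded 3D position
        in_d = 3 + 2 * 3 * L_dir        # encoded view direction (lower frequency)
        layers, dim = [], in_x
        for i in range(depth):
            if i == skip_at:
                dim += in_x             # room for the re-injected positional encoding
            layers.append(nn.Linear(dim, hidden))
            dim = hidden
        self.layers = nn.ModuleList(layers)
        self.density_head = nn.Linear(hidden, 1)
        self.feature = nn.Linear(hidden, hidden)
        self.rgb_head = nn.Sequential(nn.Linear(hidden + in_d, hidden // 2), nn.ReLU(),
                                      nn.Linear(hidden // 2, 3), nn.Sigmoid())

    def forward(self, x, d):
        x_enc = positional_encoding(x, self.L_x)
        d_enc = positional_encoding(d, self.L_dir)
        h = x_enc
        for i, layer in enumerate(self.layers):
            if i == self.skip_at:
                h = torch.cat([h, x_enc], dim=-1)   # skip connection with the encoded input
            h = torch.relu(layer(h))
        sigma = torch.relu(self.density_head(h))    # non-negative density per point
        rgb = self.rgb_head(torch.cat([self.feature(h), d_enc], dim=-1))
        return sigma, rgb
```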
This part extends the neural field to 3D scenes using multi-view images. Rays are sampled from the images, and the volume rendering equation is applied to integrate densities and colors along each ray into a pixel color. Below is a diagram of the structure of the network we implemented.
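The volume rendering step itself can be sketched as follows (a minimal version, assuming per-sample densities of shape (N, n_samples, 1), colors of shape (N, n_samples, 3), and a uniform step size):

```python
import torch

def volrend(sigmas, rgbs, step_size):
    """Discrete volume rendering: composite per-sample densities and colors into pixel colors."""
    alphas = 1.0 - torch.exp(-sigmas * step_size)                # opacity of each sample
    # transmittance T_i = product of (1 - alpha_j) over all samples j before i
    trans = torch.cumprod(torch.cat([torch.ones_like(alphas[:, :1]),
                                     1.0 - alphas[:, :-1]], dim=1), dim=1)
    weights = trans * alphas                                     # contribution of each sample
    return (weights * rgbs).sum(dim=1)                           # (N, 3) rendered colors
```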
Camera and Ray Visualization
Below are images of how the rays travel from the cameras through the image planes. The sampled points along each ray are also shown; because the samples are perturbed during training, they are not spaced uniformly. The first image shows rays from all cameras, and the second shows rays from a single camera.
Visualization of sampled rays and camera frustums.
Volume Rendering Results
Progression of the rendered images during training. The image starts out blurry and becomes sharper as the model learns the scene.
Epoch 0 Batch 500
Epoch 0 Batch 1000
Epoch 1 Batch 500
Epoch 1 Batch 1000
Epoch 2 Batch 500
Epoch 2 Batch 1000
Here are videos of the final rendered 3D scene at both low and high resolution. The high-resolution run used the following hyperparameters:
- Number of positional encoding frequencies (L_x): 10
- Number of directional encoding frequencies (L_r_d): 4
- Hidden layer dimension: 256
- RGB output dimension: 3
- Density output dimension: 1
- Number of hidden layers: 8
Low Resolution
High Resolution
PSNR Curve
As we can see from the plot below, the curve rises rapidly at the beginning and then slowly converges to a PSNR of around 23-24. Our final PSNR value was 23.6.
Plot showing the PSNR for validation images during training.
Bells and Whistles
For the B&W portion, we added a blue background to the video by changing the volume rendering function. We set a threshold on the transmittance: if it fell below a certain value, we set the pixel color to blue. However, this causes some visible aliasing; a cleaner fix would be to blend in the background color in proportion to the transmittance remaining along each ray (sketched after the video below). Given time constraints, below is a video of the final result after 1 epoch.
Custom Background Rendering
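A sketch of that proportional-blending alternative, building on the volrend sketch above (we used the simpler thresholding approach; bg_color here is illustrative):

```python
import torch

def volrend_with_background(sigmas, rgbs, step_size, bg_color=(0.0, 0.0, 1.0)):
    """Like volrend(), but blends in a background color in proportion to the light that
    passes through the whole ray, instead of using a hard transmittance threshold."""
    alphas = 1.0 - torch.exp(-sigmas * step_size)
    trans = torch.cumprod(torch.cat([torch.ones_like(alphas[:, :1]),
                                     1.0 - alphas[:, :-1]], dim=1), dim=1)
    weights = trans * alphas
    color = (weights * rgbs).sum(dim=1)               # scene contribution
    leftover = 1.0 - weights.sum(dim=1)               # transmittance remaining past the last sample
    return color + leftover * torch.tensor(bg_color)  # background fills what the scene leaves
```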
Acknowledgements
This project is part of CS 180 at UC Berkeley. Testing code and datasets were provided by course staff.