LiteGS: a refactored codebase for Gaussian Splatting. The fastest operators to our knowledge (4.7x acceleration), modular, and available in pure Python or as a CUDA extension.
This repository provides a refactored codebase aimed at improving the flexibility and performance of Gaussian splatting.
We’re excited to release a preview of LiteGS’s latest advancements, achieving a 30%-35% speedup over the previous version. Compared to the original 3DGS implementation, LiteGS delivers 4.7x acceleration, making it the fastest 3DGS operator to our knowledge. Note: the preview version has not yet been adapted to older GPU architectures; make sure your device is sm_8.x or higher, or use the stable branch instead. Technical details will be presented in an upcoming paper/report. Thank you for your interest.
Gaussian splatting is a powerful technique used in various computer graphics and vision applications. It represents 3D data as Gaussian distributions in space, allowing for efficient and accurate representation of spatial data. However, the original PyTorch implementation of Gaussian splatting (https://github.com/graphdeco-inria/gaussian-splatting) has several limitations, which LiteGS addresses as follows:
Modular Design: The refactored codebase breaks forward and backward into multiple PyTorch extension functions, significantly improving modularity and enabling easier access to intermediate variables. Additionally, in some cases, leveraging PyTorch Autograd eliminates the need to manually derive gradient formulas.
Flexible: LiteGS provides two modular APIs—one implemented in CUDA and the other in Python. The Python-based API facilitates straightforward modifications to calculation logic without requiring expertise in C code, enabling rapid prototyping. Additionally, tensor dimensions are permuted to maintain competitive training speeds for the Python API. For performance-critical tasks, the CUDA-based API is fully customizable.
Better Performance and Fewer Resources: LiteGS achieves a 4.7x speed improvement over the original 3DGS implementation while reducing GPU memory usage by around 30%. These optimizations improve training efficiency without compromising flexibility or readability.
Algorithm Preservation: LiteGS retains the core 3DGS algorithm, making only minor adjustments to the training logic to accommodate clustering.
pip install lite-gaussian-splatting
Clone
git clone --recursive https://github.com/MooreThreads/LiteGS.git
cd LiteGS
Install simple-knn
cd litegs/submodules/simple-knn
pip install .
Install fused-ssim
cd litegs/submodules/fused-ssim
pip install .
Install litegs_fused
cd litegs/submodules/gaussian_raster
pip install .
If you need the cmake project:
cd litegs/submodules/gaussian_raster
mkdir ./build
cd ./build
#for Windows PowerShell: $env:CMAKE_PREFIX_PATH = (python -c "import torch; print(torch.utils.cmake_prefix_path)")
export CMAKE_PREFIX_PATH=$(python -c "import torch; print(torch.utils.cmake_prefix_path)")
cmake ../
cmake --build . --config Release
Begin training with the following command:
python ./example_train.py --sh_degree 3 -s DATA_SOURCE -i IMAGE_FOLDER -m OUTPUT_PATH
The training results of LiteGS using the Mip-NeRF 360 dataset on an NVIDIA A100 GPU are presented below. The training and evaluation command used is:
python ./full_eval.py --mipnerf360 SOURCE_PATH
Metric | Bicycle | Flowers | Garden | Stump | Treehill | Room | Counter | Kitchen | Bonsai |
---|---|---|---|---|---|---|---|---|---|
SSIM (test) | 0.746 | 0.594 | 0.858 | 0.788 | 0.637 | 0.928 | 0.915 | 0.932 | 0.947 |
PSNR (test) | 25.15 | 21.67 | 27.50 | 27.11 | 22.69 | 32.00 | 29.22 | 32.01 | 32.43 |
LPIPS (test) | 0.251 | 0.357 | 0.121 | 0.227 | 0.350 | 0.199 | 0.186 | 0.119 | 0.181 |
Below is a summary of the training times for LiteGS and the original 3DGS implementations on NVIDIA A100 and NVIDIA RTX3090 GPUs. In the current setup, the original 3DGS implementation includes an integrated acceleration scheme from Taming-GS that can be enabled via training parameters. For comparison purposes, we have also included this TamingGS-integrated version in our evaluation.
Our experimental results demonstrate that even with kernel splitting to achieve modularity, LiteGS still delivers remarkable performance improvements. Specifically, compared to the original 3DGS implementation, LiteGS achieves a 4.7x speedup on RTX3090. Moreover, when compared to the TamingGS-integrated version of 3DGS, LiteGS still offers a 2.0x speedup. These results clearly underscore the effectiveness of our approach in enhancing training efficiency while maintaining high performance.
Method | Bicycle | Flowers | Garden | Stump | Treehill | Room | Counter | Kitchen | Bonsai |
---|---|---|---|---|---|---|---|---|---|
Original-d9fad7b (RTX3090) | 40:06 | 28:46 | 42:18 | 32:40 | 29:39 | 24:53 | 24:16 | 29:30 | 20:15 |
Taming-Integrated (RTX3090) | 15:29 | 11:42 | 16:24 | 11:49 | 11:48 | 10:06 | 12:00 | 16:38 | 09:01 |
LiteGS (RTX3090) | 07:33 | 06:03 | 08:08 | 06:25 | 06:36 | 04:25 | 04:56 | 05:57 | 04:39 |
Unlike the original 3DGS, which encapsulates nearly the entire rendering process into a single PyTorch extension function, LiteGS divides the process into multiple modular functions. This design allows users to access intermediate variables and integrate custom computation logic using Python scripts, eliminating the need to modify C code. The rendering process in LiteGS is broken down into the following steps:
Cluster Culling
LiteGS divides the Gaussian points into chunks of 1,024 points each. The first step in the rendering pipeline is frustum culling, where chunks lying entirely outside the camera’s view are filtered out.
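The chunk-level culling step can be sketched as follows. This is a simplified illustration, not LiteGS’s actual kernel: the function name `cluster_frustum_cull` and the clip-space test are assumptions, and the real implementation runs as a CUDA kernel over packed chunk buffers.

```python
import torch

def cluster_frustum_cull(xyz, view_proj, chunk_size=1024):
    """Sketch: mark a chunk of Gaussian points as visible if any of its
    points lies inside the clip-space frustum (hypothetical helper)."""
    n = xyz.shape[0]
    # Pad so the points split evenly into chunks of `chunk_size`.
    pad = (-n) % chunk_size
    if pad:
        xyz = torch.cat([xyz, xyz[-1:].expand(pad, 3)], dim=0)
    homo = torch.cat([xyz, torch.ones(xyz.shape[0], 1)], dim=1)  # (N, 4)
    clip = homo @ view_proj.T                                    # (N, 4)
    w = clip[:, 3:4]
    inside = (clip[:, :3].abs() <= w).all(dim=1) & (w.squeeze(1) > 0)
    # A chunk survives culling if any of its points is visible.
    return inside.view(-1, chunk_size).any(dim=1)
```

Testing each chunk rather than each point keeps the culling decision coarse but cheap, which is what makes the subsequent compaction step worthwhile.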
Cluster Compact
Similar to mesh rendering, LiteGS compacts visible primitives after frustum culling. Each property of the visible points is reorganized into sequential memory to improve processing efficiency.
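The compaction step amounts to gathering the surviving chunks into contiguous memory. A minimal sketch, assuming per-point properties stored as an (N, C) tensor and the chunk-visibility mask from the culling step (the helper name `compact_chunks` is hypothetical):

```python
import torch

def compact_chunks(props, chunk_visible, chunk_size=1024):
    """Sketch: gather the properties of visible chunks into sequential
    memory, as LiteGS does after frustum culling."""
    # props: (N, C) per-point properties, N a multiple of chunk_size.
    chunked = props.view(-1, chunk_size, props.shape[-1])
    return chunked[chunk_visible].reshape(-1, props.shape[-1])
```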
3DGS Projection
Gaussian points are projected into screen space in this step, with no modifications made compared to the original 3DGS implementation.
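Since this step follows the original 3DGS, it uses the standard EWA-style projection: the 3D covariance is mapped to a 2D screen-space covariance through the Jacobian of the perspective projection. A sketch for a single Gaussian, assuming the covariance is already expressed in camera space (the function name is hypothetical):

```python
import torch

def project_cov(cov3d, mean_cam, focal_x, focal_y):
    """Sketch of the standard 3DGS/EWA projection: map a camera-space 3D
    covariance to a 2x2 screen-space covariance via the projection Jacobian."""
    x, y, z = mean_cam
    J = torch.tensor([[focal_x / z, 0.0, -focal_x * x / (z * z)],
                      [0.0, focal_y / z, -focal_y * y / (z * z)]])
    return J @ cov3d @ J.T  # (2, 2) screen-space covariance
```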
Create Visibility Table
A visibility table is created in this step, mapping tiles to their visible primitives, enabling efficient parallel processing in subsequent stages.
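Conceptually, the table is built by mapping each projected Gaussian’s screen-space bounding box to the 16x16 tiles it overlaps and grouping the resulting pairs by tile. The sketch below is a plain-Python illustration of that mapping (the function name and the list-of-pairs output format are assumptions; the real kernel produces sorted GPU buffers):

```python
import torch

def build_visibility_table(centers, radii, tiles_x, tiles_y, tile_size=16):
    """Sketch: map each 2D Gaussian to the tiles its bounding box overlaps,
    producing (tile_id, primitive_id) pairs sorted by tile."""
    x0 = ((centers[:, 0] - radii) / tile_size).floor().clamp(0, tiles_x - 1).long().tolist()
    x1 = ((centers[:, 0] + radii) / tile_size).floor().clamp(0, tiles_x - 1).long().tolist()
    y0 = ((centers[:, 1] - radii) / tile_size).floor().clamp(0, tiles_y - 1).long().tolist()
    y1 = ((centers[:, 1] + radii) / tile_size).floor().clamp(0, tiles_y - 1).long().tolist()
    pairs = []
    for pid in range(centers.shape[0]):
        for ty in range(y0[pid], y1[pid] + 1):
            for tx in range(x0[pid], x1[pid] + 1):
                pairs.append((ty * tiles_x + tx, pid))
    pairs.sort()  # group primitives by tile id
    return pairs
```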
Rasterization
In the final step, each tile rasterizes its visible primitives in parallel, ensuring high computational efficiency.
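Per pixel, the rasterizer performs the same front-to-back alpha blending as the original 3DGS over the tile’s depth-sorted primitives. A single-pixel sketch (the function name and early-termination threshold are illustrative):

```python
import torch

def blend_tile(colors, alphas):
    """Sketch of front-to-back alpha blending for one pixel over the
    depth-sorted primitives of a tile."""
    out = torch.zeros(3)
    transmittance = 1.0
    for color, alpha in zip(colors, alphas):
        out = out + transmittance * alpha * color
        transmittance = transmittance * (1.0 - alpha)
        if transmittance < 1e-4:  # early termination, as in 3DGS
            break
    return out
```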
LiteGS makes slight adjustments to density control to accommodate its clustering-based approach.
The gaussian_splatting/wrapper.py file contains two sets of APIs, offering flexibility in choosing between Python-based and CUDA-based implementations. The Python-based API is invoked using call_script(), while the CUDA-based API is available via call_fused(). While the CUDA-based API delivers significant performance improvements, it lacks the flexibility of the Python version. The choice between these implementations depends on the specific use case:
Python-based API: Provides greater flexibility, making it ideal for rapid prototyping and development where training speed is less critical.
CUDA-based API: Offers the highest performance and is recommended for production environments where training speed is a priority.
Additionally, an interface validate() and the accompanying check_wrapper.py script are provided to verify that both APIs produce consistent gradients.
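The idea behind such a consistency check can be sketched generically: run both implementations on the same input and compare outputs and gradients. The helper below is an illustration of that pattern, not the actual validate() signature:

```python
import torch

def check_grad_consistency(fn_a, fn_b, x, atol=1e-5):
    """Sketch: verify two implementations of the same op agree on both
    forward outputs and input gradients (hypothetical helper)."""
    xa = x.clone().requires_grad_(True)
    xb = x.clone().requires_grad_(True)
    ya, yb = fn_a(xa), fn_b(xb)
    ya.sum().backward()
    yb.sum().backward()
    out_ok = torch.allclose(ya, yb, atol=atol)
    grad_ok = torch.allclose(xa.grad, xb.grad, atol=atol)
    return out_ok and grad_ok
```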
Here is an example that demonstrates the flexibility of LiteGS. In this instance, our goal is to create a more precise bounding box for a 2D Gaussian when generating visibility tables. In the original 3DGS implementation, the bounding box is determined as three times the length of the major axis of the Gaussian. However, incorporating opacity can allow for a smaller bounding box.
To implement this change in the original 3DGS, you would have to modify the CUDA rasterizer source and recompile the extension.
In LiteGS, the same change can be achieved by simply editing a Python script.
original:
axis_length = (3.0 * eigen_val.abs()).sqrt().ceil()
modified:
coefficient = 2 * ((255 * opacity).log())
axis_length = (coefficient * eigen_val.abs()).sqrt().ceil()
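The coefficient comes from thresholding the Gaussian’s contribution at 8-bit precision: a splat contributes roughly opacity * exp(-r^2 / (2 * sigma^2)) to a pixel, and solving opacity * exp(-r^2 / (2 * eigen_val)) = 1/255 for r gives r^2 = 2 * ln(255 * opacity) * eigen_val. A standalone sketch of the bound (function name hypothetical):

```python
import math

def bbox_axis_length(eigen_val, opacity, threshold=1.0 / 255.0):
    """Sketch: radius beyond which a Gaussian with the given peak opacity
    contributes less than `threshold`, derived from
    opacity * exp(-r^2 / (2 * eigen_val)) = threshold."""
    coefficient = 2.0 * math.log(opacity / threshold)  # = 2 * ln(255 * opacity)
    return math.ceil(math.sqrt(coefficient * abs(eigen_val)))
```

Low-opacity Gaussians thus get tighter bounding boxes, reducing the number of (tile, primitive) pairs the rasterizer has to process.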
If you find this project useful in your research, please consider citing:
@misc{LiteGS,
title={LiteGS},
author={LiteGS Contributors},
howpublished = {\url{https://github.com/MooreThreads/LiteGS}},
year={2024}
}
Coming soon.