Playing around with "Less Slow" coding practices in C++20, C, CUDA, PTX, and Assembly, from numerics and SIMD to coroutines, ranges, exception handling, networking, and user-space IO.
The benchmarks in this repository don’t aim to cover every topic entirely, but they help form a mindset and intuition for performance-oriented software design.
The repository also shows how to use some non-STL but de facto standard C++ libraries, importing them via CMake and compiling them from source.
For higher-level abstractions and languages, check out less_slow.rs and less_slow.py.
I needed many of these measurements to reconsider my own coding habits, but hopefully they’re helpful to others as well.
Most of the code is organized in very long, ordered, and nested #pragma sections, which is not necessarily the preferred form for everyone.
Much of modern code suffers from common pitfalls — bugs, security vulnerabilities, and performance bottlenecks.
University curricula and coding bootcamps tend to stick to traditional coding styles and standard features, rarely exposing the more fun, unusual, and potentially efficient design opportunities.
This repository explores just that.
The code leverages C++20 and CUDA features and is designed primarily for GCC, Clang, and NVCC compilers on Linux, though it may work on other platforms.
The topics range from basic micro-kernels executing in a few nanoseconds to more complex constructs involving parallel algorithms, coroutines, and polymorphism.
Some of the highlights include:
std::sin in just 3 lines of code.
std::ranges and iterators!
-O3: Learn about less obvious flags and techniques for another 2x speedup.
SEGFAULT.
std::error_code or std::variant-like ADTs?
consteval RegEx engines?
io_uring from user-space?
<thrust> and <cub>?
asm, and separate .S files for your performance-critical code?
To read, jump to the less_slow.cpp source file and read the code snippets and comments.
Keep in mind that most modern IDEs have a navigation bar to help you view and jump between #pragma region sections.
Follow the instructions below to run the code in your environment and compare it to the comments as you read through the source.
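As a taste of the micro-kernels mentioned among the highlights, here is a minimal sketch of how a cheap stand-in for std::sin could look, assuming a truncated Maclaurin series is accurate enough for inputs near zero. The names `sin_maclaurin` and `sin_maclaurin_error` are hypothetical and are not taken from less_slow.cpp; the error grows quickly as |x| increases.

```cpp
#include <cmath> // `std::sin` and `std::fabs`, used only for the error check

// Hypothetical helper, not the repository's code: approximates sin(x) with the
// first three Maclaurin terms, x - x^3/3! + x^5/5!. Reasonable only near zero.
inline double sin_maclaurin(double x) noexcept {
    double const x2 = x * x;
    double const x3 = x2 * x;
    return x - x3 / 6.0 + (x3 * x2) / 120.0;
}

// Hypothetical helper: absolute error of the approximation against `std::sin`.
inline double sin_maclaurin_error(double x) noexcept {
    return std::fabs(std::sin(x) - sin_maclaurin(x));
}
```

Whether such a trade of accuracy for speed is acceptable depends entirely on the input range of the application, which is why measuring both the error and the runtime matters.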
The project aims to be compatible with GCC, Clang, and MSVC compilers on Linux, macOS, and Windows.
That said, to cover the broadest functionality, using GCC on Linux is recommended:
If you are familiar with C++ and want to review code and measurements as you read, you can clone the repository and execute the following commands.
git clone https://github.com/ashvardanian/less_slow.cpp.git # Clone the repository
cd less_slow.cpp # Change the directory
pip install cmake --upgrade # PyPI has a newer version of CMake
sudo apt-get install -y build-essential g++ # Install default build tools
sudo apt-get install -y pkg-config liburing-dev # Install liburing for kernel-bypass
sudo apt-get install -y libopenblas-base # Install numerics libraries
cmake -B build_release -D CMAKE_BUILD_TYPE=Release # Generate the build files
cmake --build build_release --config Release # Build the project
build_release/less_slow # Run the benchmarks
The build will pull and compile several third-party dependencies from source:
std::ranges.
std::format.
std::string.
std::regex.
To build without Parallel STL, Intel TBB, and CUDA:
cmake -B build_release -D CMAKE_BUILD_TYPE=Release -D USE_INTEL_TBB=OFF -D USE_NVIDIA_CCCL=OFF
cmake --build build_release --config Release
To build on macOS, pulling key dependencies from Homebrew:
brew install openblas
cmake -B build_release \
-D CMAKE_BUILD_TYPE=Release \
-D CMAKE_C_FLAGS="-I$(brew --prefix openblas)/include" \
-D CMAKE_CXX_FLAGS="-I$(brew --prefix openblas)/include" \
-D CMAKE_EXE_LINKER_FLAGS="-L$(brew --prefix openblas)/lib"
cmake --build build_release --config Release
To control the output or run specific benchmarks, use the following flags:
build_release/less_slow --benchmark_format=json # Output in JSON format
build_release/less_slow --benchmark_out=results.json # Save the results to a file instead of `stdout`
build_release/less_slow --benchmark_filter=std_sort # Run only benchmarks containing `std_sort` in their name
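The --benchmark_filter value is interpreted as a regular expression that is matched against benchmark names, so a plain word like std_sort selects every name containing it. The sketch below mimics that selection with std::regex_search. The helper `select_benchmarks` and the benchmark names used with it are hypothetical, and the regex dialect used by Google Benchmark itself may differ from the std::regex default.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: returns the names that a filter pattern would select,
// using a partial (substring-style) regular-expression match.
inline std::vector<std::string> select_benchmarks( //
    std::vector<std::string> const &names, std::string const &filter) {
    std::regex const pattern(filter);
    std::vector<std::string> selected;
    for (auto const &name : names)
        if (std::regex_search(name, pattern)) selected.push_back(name);
    return selected;
}
```

With made-up names such as "std_sort/1024", "std_sort_parallel/1024", and "f32_addition", the filter "std_sort" would keep the first two, while an anchored pattern like "^f32" would keep only the last one.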
To enhance stability and reproducibility, disable Simultaneous Multi-Threading (SMT) on your CPU and use the --benchmark_enable_random_interleaving=true flag, which shuffles and interleaves benchmarks, as described in the Google Benchmark documentation.
build_release/less_slow --benchmark_enable_random_interleaving=true
Google Benchmark supports User-Requested Performance Counters through libpfm.
Note that collecting these may require sudo privileges.
sudo build_release/less_slow --benchmark_enable_random_interleaving=true --benchmark_format=json --benchmark_perf_counters="CYCLES,INSTRUCTIONS"
Alternatively, use the Linux perf tool for performance counter collection:
sudo perf stat taskset 0xEFFFEFFFEFFFEFFFEFFFEFFFEFFFEFFF build_release/less_slow --benchmark_enable_random_interleaving=true --benchmark_filter=super_sort
The primary file of this repository is less_slow.cpp, which contains the CPU-side code.
Several other files cover hardware-specific optimizations:
$ tree .
.
├── CMakeLists.txt # Build & assembly instructions for all files
├── less_slow.cpp # Primary CPU-side benchmarking code with the majority of examples
├── less_slow_amd64.S # Hand-written Assembly kernels for 64-bit x86 CPUs
├── less_slow_aarch64.S # Hand-written Assembly kernels for 64-bit Arm CPUs
├── less_slow.cu # CUDA C++ examples for parallel algorithms for Nvidia GPUs
├── less_slow_sm70.ptx # Hand-written PTX IR kernels for Nvidia Volta GPUs
└── less_slow_sm90a.ptx # Hand-written PTX IR kernels for Nvidia Hopper GPUs
Educational content without memes?!
Come on!
This benchmark suite uses most of the features provided by Google Benchmark.
If you write a lot of benchmarks and want to avoid going through the full User Guide, here is a condensed list of the most useful features:
->Args({x, y}) - Pass multiple arguments to parameterized benchmarks
BENCHMARK() - Register a basic benchmark function
BENCHMARK_CAPTURE() - Create variants of benchmarks with different captured values
Counter::kAvgThreads - Specify thread-averaged counters
DoNotOptimize() - Prevent the compiler from optimizing away operations
ClobberMemory() - Force memory synchronization
->Complexity(oNLogN) - Specify and validate algorithmic complexity
state.SetComplexityN(n) - Set input size for complexity calculations
->ComputeStatistics("max", ...) - Calculate custom statistics across runs
->Iterations(n) - Control exact number of iterations
->MinTime(n) - Set minimum benchmark duration
->MinWarmUpTime(n) - Warm up the data caches
->Name("...") - Assign custom benchmark names
->Range(start, end) - Profile for a range of input sizes
->RangeMultiplier(n) - Set multiplier between range values
->ReportAggregatesOnly() - Show only aggregated statistics
state.counters["name"] - Create custom performance counters
state.PauseTiming(), ResumeTiming() - Control timing measurement
state.SetBytesProcessed(n) - Record number of bytes processed
state.SkipWithError() - Skip benchmark with error message
->Threads(n) - Run benchmark with specified number of threads
->Unit(kMicrosecond) - Set time unit for reporting
->UseRealTime() - Measure real time instead of CPU time
->UseManualTime() - Feed custom timings for GPU and IO benchmarks
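To make the roles of DoNotOptimize() and manual timing more concrete, here is a rough standard-library sketch of what a benchmarking harness does under the hood. It assumes GCC or Clang inline assembly and a callable that returns a value. The names `keep_alive` and `nanoseconds_per_call` are hypothetical and are not part of Google Benchmark, which handles warm-up, iteration counts, and statistics far more carefully.

```cpp
#include <chrono>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for the idea behind `DoNotOptimize()`: an empty inline
// assembly statement that claims to read `value` and touch memory, so the
// compiler cannot discard the computation that produced it (GCC/Clang only).
template <typename value_type>
inline void keep_alive(value_type const &value) {
    asm volatile("" : : "r,m"(value) : "memory");
}

// Hypothetical helper: runs `callable` a fixed number of times and returns the
// average wall-clock time per call in nanoseconds. The callable must return a
// value, which is passed through `keep_alive` on every iteration.
template <typename callable_type>
double nanoseconds_per_call(callable_type &&callable, std::size_t iterations) {
    auto const start = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i != iterations; ++i) keep_alive(callable());
    auto const stop = std::chrono::steady_clock::now();
    std::chrono::duration<double, std::nano> const elapsed = stop - start;
    return elapsed.count() / static_cast<double>(iterations);
}
```

A duration measured this way is the kind of value one would hand to the library when using ->UseManualTime(), for example when the real work happens on a GPU or in the kernel and CPU time alone would be misleading.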