less_slow.cpp

Playing around "Less Slow" coding practices in C++ 20, C, CUDA, PTX, & Assembly, from numerics & SIMD to coroutines, ranges, exception handling, networking and user-space IO

ashvardanian

Playing Around Less Slow Coding Practices for C++, CUDA, and Assembly Code

The benchmarks in this repository don’t aim to cover every topic entirely, but they help form a mindset and intuition for performance-oriented software design.
The repository also shows how to use several non-STL but de facto standard C++ libraries, importing them via CMake and compiling them from source.
For higher-level abstractions and languages, check out less_slow.rs and less_slow.py.
I needed many of these measurements to reconsider my own coding habits, but hopefully they’re helpful to others as well.
Most of the code is organized in very long, ordered, and nested #pragma region sections, which is not necessarily the preferred form for everyone.

Much of modern code suffers from common pitfalls — bugs, security vulnerabilities, and performance bottlenecks.
University curricula and coding bootcamps tend to stick to traditional coding styles and standard features, rarely exposing the more fun, unusual, and potentially efficient design opportunities.
This repository explores just that.

Less Slow C++

The code leverages C++20 and CUDA features and is designed primarily for GCC, Clang, and NVCC compilers on Linux, though it may work on other platforms.
The topics range from basic micro-kernels executing in a few nanoseconds to more complex constructs involving parallel algorithms, coroutines, and polymorphism.
Some of the highlights include:

100x cheaper random inputs?! Discover how input generation sometimes costs more than the algorithm.
1% error in trigonometry at 1/40 cost: Approximate STL functions like std::sin in just 3 lines of code.
4x faster lazy-logic with custom std::ranges and iterators!
Compiler optimizations beyond -O3: Learn about less obvious flags and techniques for another 2x speedup.
Multiplying matrices? Check how a 3x3x3 GEMM can be 70% slower than 4x4x4, despite 60% fewer ops.
Scaling AI? Measure the gap between theoretical ALU throughput and your BLAS.
How many if conditions are too many? Test your CPU’s branch predictor with just 10 lines of code.
Prefer recursion to iteration? Measure the depth at which your algorithm will SEGFAULT.
Why avoid exceptions? Should you use std::error_code or std::variant-like ADTs instead?
Scaling to many cores? Learn how to use OpenMP, Intel’s oneTBB, or your custom thread pool.
How to handle JSON avoiding memory allocations? Is it easier with C++ 20 or old-school C 99 tools?
How to properly use STL’s associative containers with custom keys and transparent comparators?
How to beat a hand-written parser with consteval RegEx engines?
Is the pointer size really 64 bits and how to exploit pointer-tagging?
How many packets is UDP dropping, and how to serve web requests with io_uring from user space?
Scatter and Gather for 50% faster vectorized disjoint memory operations.
Intel’s oneAPI vs Nvidia’s CCCL? What’s so special about <thrust> and <cub>?
CUDA C++, PTX Intermediate Representations, and SASS, and how do they differ from CPU code?
How to choose between intrinsics, inline asm, and separate .S files for your performance-critical code?
Tensor Cores & Memory differences on CPUs, and Volta, Ampere, Hopper, and Blackwell GPUs!
How coding FPGA differs from GPU and what is High-Level Synthesis, Verilog, and VHDL? 🔜 #36
What are Encrypted Enclaves and what’s the latency of Intel SGX, AMD SEV, and ARM Realm? 🔜 #31
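
As an illustration of the trigonometry item above, here is a minimal sketch of a three-term approximation. It assumes a truncated Maclaurin series, which is only reasonable for small angles (roughly within [-π/2, π/2]). The names sin_maclaurin and sin_maclaurin_error are hypothetical, and the kernel in less_slow.cpp may differ.

```cpp
#include <cmath>

// Hypothetical name: three-term Maclaurin series for sine,
// sin(x) ≈ x - x^3/3! + x^5/5!, reasonable only for small |x|.
inline double sin_maclaurin(double x) noexcept {
    double const x3 = x * x * x;
    double const x5 = x3 * x * x;
    return x - x3 / 6.0 + x5 / 120.0;
}

// Absolute deviation from the standard library at a given angle.
inline double sin_maclaurin_error(double x) noexcept {
    return std::fabs(sin_maclaurin(x) - std::sin(x));
}
```

Calling sin_maclaurin_error for a few angles shows how quickly the error grows away from zero, which is the trade-off behind the cheaper computation.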

To get started, open the less_slow.cpp source file and read the code snippets and comments.
Keep in mind that most modern IDEs have a navigation bar to help you view and jump between #pragma region sections.
Follow the instructions below to run the code in your environment and compare it to the comments as you read through the source.

Running the Benchmarks

The project aims to be compatible with GCC, Clang, and MSVC compilers on Linux, MacOS, and Windows.
That said, to cover the broadest functionality, using GCC on Linux is recommended:

If you are on Windows, it’s recommended that you set up a Linux environment using WSL.
If you are on MacOS, consider using the non-native distribution of Clang from Homebrew or MacPorts.
If you are on Linux, make sure to install CMake and a recent version of GCC or Clang compilers to support C++20 features.

If you are familiar with C++ and want to review code and measurements as you read, you can clone the repository and execute the following commands.

git clone https://github.com/ashvardanian/less_slow.cpp.git # Clone the repository
cd less_slow.cpp                                            # Change the directory

pip install cmake --upgrade                                 # PyPI has a newer version of CMake
sudo apt-get install -y build-essential g++                 # Install default build tools
sudo apt-get install -y pkg-config liburing-dev             # Install liburing for kernel-bypass
sudo apt-get install -y libopenblas-base                    # Install numerics libraries

cmake -B build_release -D CMAKE_BUILD_TYPE=Release          # Generate the build files
cmake --build build_release --config Release                # Build the project
build_release/less_slow                                     # Run the benchmarks

The build will pull and compile several third-party dependencies from the source:

Google’s Benchmark is used for profiling.
Intel’s oneTBB is used as the Parallel STL backend.
Meta’s libunifex is used for senders & executors.
Eric Niebler’s range-v3 replaces std::ranges.
Victor Zverovich’s fmt replaces std::format.
Ash Vardanian’s StringZilla replaces std::string.
Hana Dusíková’s CTRE replaces std::regex.
Niels Lohmann’s json is used for JSON deserialization.
Yaoyuan Guo’s yyjson for faster JSON processing.
Google’s Abseil replaces STL’s associative containers.
Lewis Baker’s cppcoro implements C++20 coroutines.
Jens Axboe’s liburing to simplify Linux kernel-bypass.
Chris Kohlhoff’s ASIO as a networking TS extension.
Nvidia’s CCCL for GPU-accelerated algorithms.
Nvidia’s CUTLASS for GPU-accelerated Linear Algebra.

To build without Parallel STL, Intel TBB, BLAS, and CUDA:

cmake -B build_release -D CMAKE_BUILD_TYPE=Release -D USE_INTEL_TBB=OFF -D USE_NVIDIA_CCCL=OFF -D USE_BLAS=OFF
cmake --build build_release --config Release

To build on MacOS, pulling key dependencies from Homebrew:

brew install openblas
cmake -B build_release \
      -D CMAKE_BUILD_TYPE=Release \
      -D CMAKE_C_FLAGS="-I$(brew --prefix openblas)/include" \
      -D CMAKE_CXX_FLAGS="-I$(brew --prefix openblas)/include" \
      -D CMAKE_EXE_LINKER_FLAGS="-L$(brew --prefix openblas)/lib"
cmake --build build_release --config Release

To control the output or run specific benchmarks, use the following flags:

build_release/less_slow --benchmark_format=json             # Output in JSON format
build_release/less_slow --benchmark_out=results.json        # Save the results to a file instead of `stdout`
build_release/less_slow --benchmark_filter=std_sort         # Run only benchmarks containing `std_sort` in their name

To enhance stability and reproducibility, disable Simultaneous Multi-Threading (SMT) on your CPU and pass the --benchmark_enable_random_interleaving=true flag, which shuffles and interleaves benchmark runs.

build_release/less_slow --benchmark_enable_random_interleaving=true

Google Benchmark supports User-Requested Performance Counters through libpfm.
Note that collecting these may require sudo privileges.

sudo build_release/less_slow --benchmark_enable_random_interleaving=true --benchmark_format=json --benchmark_perf_counters="CYCLES,INSTRUCTIONS"

Alternatively, use the Linux perf tool for performance counter collection:

sudo perf stat taskset 0xEFFFEFFFEFFFEFFFEFFFEFFFEFFFEFFF build_release/less_slow --benchmark_enable_random_interleaving=true --benchmark_filter=super_sort
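
The long hexadecimal argument to taskset is a CPU affinity mask, where the lowest-order bit stands for the first logical CPU. To see which cores such a mask leaves out, a small sketch can decode it. The helper name excluded_cpus is hypothetical and not part of the repository, and it assumes the input contains only valid hexadecimal digits.

```cpp
#include <cstddef>
#include <string_view>
#include <vector>

// Hypothetical helper: lists the logical CPUs whose bits are cleared in a
// hexadecimal affinity mask. The rightmost digit covers CPUs 0 through 3.
// Assumes only valid hex digits follow an optional "0x" prefix.
inline std::vector<int> excluded_cpus(std::string_view hex_mask) {
    if (hex_mask.size() >= 2 && hex_mask[0] == '0' && (hex_mask[1] == 'x' || hex_mask[1] == 'X'))
        hex_mask.remove_prefix(2);

    std::vector<int> excluded;
    for (std::size_t i = 0; i < hex_mask.size(); ++i) {
        char const c = hex_mask[hex_mask.size() - 1 - i]; // i-th digit from the right
        int const nibble = (c >= '0' && c <= '9')   ? c - '0'
                           : (c >= 'a' && c <= 'f') ? c - 'a' + 10
                                                    : c - 'A' + 10;
        for (int bit = 0; bit < 4; ++bit)
            if ((nibble & (1 << bit)) == 0) excluded.push_back(static_cast<int>(i) * 4 + bit);
    }
    return excluded;
}
```

Each 16-bit group 0xEFFF has only bit 12 cleared, so a mask built from repeated EFFF groups skips one CPU in every group of 16: CPU 12, 28, 44, and so on.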

Project Structure

The primary file of this repository is less_slow.cpp, which holds the CPU-side code.
Several other files cover hardware-specific optimizations:

$ tree .
.
├── CMakeLists.txt          # Build & assembly instructions for all files
├── less_slow.cpp           # Primary CPU-side benchmarking code with the majority of examples
├── less_slow_amd64.S       # Hand-written Assembly kernels for 64-bit x86 CPUs
├── less_slow_aarch64.S     # Hand-written Assembly kernels for 64-bit Arm CPUs
├── less_slow.cu            # CUDA C++ examples for parallel algorithms for Nvidia GPUs
├── less_slow_sm70.ptx      # Hand-written PTX IR kernels for Nvidia Volta GPUs
└── less_slow_sm90a.ptx     # Hand-written PTX IR kernels for Nvidia Hopper GPUs

Memes and References

Educational content without memes?!
Come on!

Google Benchmark Functionality

This benchmark suite uses most of the features provided by Google Benchmark.
If you write a lot of benchmarks and want to avoid digging through the full User Guide, here is a condensed list of the most useful features:

->Args({x, y}) - Pass multiple arguments to parameterized benchmarks
BENCHMARK() - Register a basic benchmark function
BENCHMARK_CAPTURE() - Create variants of benchmarks with different captured values
Counter::kAvgThreads - Specify thread-averaged counters
DoNotOptimize() - Prevent compiler from optimizing away operations
ClobberMemory() - Force the compiler to commit all pending writes to memory
->Complexity(oNLogN) - Specify and validate algorithmic complexity
->SetComplexityN(n) - Set input size for complexity calculations
->ComputeStatistics("max", ...) - Calculate custom statistics across runs
->Iterations(n) - Control exact number of iterations
->MinTime(n) - Set minimum benchmark duration
->MinWarmUpTime(n) - To warm up the data caches
->Name("...") - Assign custom benchmark names
->Range(start, end) - Profile for a range of input sizes
->RangeMultiplier(n) - Set multiplier between range values
->ReportAggregatesOnly() - Show only aggregated statistics
state.counters["name"] - Create custom performance counters
state.PauseTiming(), ResumeTiming() - Control timing measurement
state.SetBytesProcessed(n) - Record number of bytes processed
state.SkipWithError() - Skip benchmark with error message
->Threads(n) - Run benchmark with specified number of threads
->Unit(kMicrosecond) - Set time unit for reporting
->UseRealTime() - Measure real time instead of CPU time
->UseManualTime() - To feed custom timings for GPU and IO benchmarks