Memory allocation benchmark
This repository uses the binary trees benchmark of the computer language
benchmark game to compare various memory allocation schemes.
The results should not be generalized for general allocation performance.
Real world programs (especially in languages that support value types) do
not spend nearly as much time allocating and freeing memory, but typically
only 10%-20% of overall CPU time. Performance differences for actual code
will therefore usually be a fraction of the results of this synthetic
benchmark.
The goal of the benchmark is not to make some universal statement about
approaches to allocation, but to dispel generalized preconceptions about
performance of memory management techniques. Other use cases may result in
different performance profiles. For example, data that is not so
pointer-heavy will put less of a burden on GC marking. Performance
can vary greatly by workload.
The code exclusively allocates small objects consisting of two machine
words, which any smart allocator implementation will serve from a pool in
essentially constant time. Calling functions in external libraries and
synchronization can therefore make up a significant part of the allocation
overhead.
The benchmark is specifically designed not to make life easy for garbage
collectors by keeping a large part of the heap alive.
We’re measuring throughput only, not pause times, though there is some
instrumentation in the C code to measure pause times if you want to.
A few notes:
free()
has a per-object overhead just to call the function. ThisBenchmarks were run on Macs, which also accounts for the poor
performance of system malloc() and free() (which have notoriously poorly
performing default implementations in macOS).
Below are the results of sample runs, both on an M1 mini and a 2018
Macbook Pro with a 2.6GHz Core i7 with six cores.
Note that those are a single run each and results can vary somewhat. Again,
the goal is not to say something about typical performance of allocators,
but to show that naive assumptions about the performance of various methods
of memory management may not actually always be true. Therefore, getting
precise measurements was not a priority.
# jemalloc explicit malloc()/free()
tools/bench.sh ./btree-jemalloc 21
12.11 real 12.07 user 0.03 sys
# mimalloc explicit malloc()/free()
tools/bench.sh ./btree-mimalloc 21
4.00 real 3.90 user 0.07 sys
# dlmalloc explicit malloc()/free() (not threadsafe)
tools/bench.sh ./btree-dlmalloc 21
4.09 real 4.06 user 0.03 sys
# dlmalloc explicit malloc()/free() (threadsafe)
tools/bench.sh ./btree-dlmalloc-lock 21
12.66 real 12.62 user 0.04 sys
# C++ shared pointers (with mimalloc as base allocator)
tools/bench.sh ./btree-shared-ptr 21
12.13 real 12.05 user 0.07 sys
# C++ shared pointers + const ref (with mimalloc as base allocator)
tools/bench.sh ./btree-shared-ptr-const-ref 21
9.48 real 9.39 user 0.07 sys
# Boehm GC with four parallel marker threads
GC_MARKERS=4 tools/bench.sh ./btree-gc 21
5.49 real 6.92 user 0.04 sys
# Boehm GC with single-threaded marking
GC_MARKERS=1 tools/bench.sh ./btree-gc 21
6.68 real 6.65 user 0.02 sys
# Boehm GC with four parallel markers and minimal space overhead
GC_MARKERS=4 GC_FREE_SPACE_DIVISOR=10 tools/bench.sh ./btree-gc 21
5.80 real 8.08 user 0.06 sys
# Boehm GC with single-threaded marking and minimal space overhead
GC_MARKERS=1 GC_FREE_SPACE_DIVISOR=10 tools/bench.sh ./btree-gc 21
7.80 real 7.77 user 0.02 sys
# Boehm GC with explicit deallocation
tools/bench.sh ./btree-gc-free 21
8.30 real 8.28 user 0.02 sys
# OCaml
tools/bench.sh ./btree-ml 21
2.09 real 2.06 user 0.02 sys
# Nim reference counting GC
tools/bench.sh ./btree-nim 21
12.14 real 12.07 user 0.04 sys
# Nim mark and sweep GC
tools/bench.sh ./btree-nim-ms 21
11.87 real 11.80 user 0.06 sys
# Nim with ARC and cycle collection enabled
tools/bench.sh ./btree-nim-arc 21
6.95 real 6.90 user 0.04 sys
# Nim with Boehm GC
GC_MARKERS=1 tools/bench.sh ./btree-nim-boehm 21
7.35 real 7.31 user 0.03 sys
# D garbage collector (classes)
tools/bench.sh ./btree-d 21
12.04 real 11.96 user 0.05 sys
# D garbage collector (structs)
tools/bench.sh ./btree-d-struct 21
17.76 real 17.70 user 0.05 sys
# Go garbage collector
GOMAXPROCS=4 tools/bench.sh ./btree-go 21
11.40 real 21.35 user 0.11 sys
# Dart garbage collector
tools/bench.sh ./btree-dart 21
4.71 real 6.40 user 0.40 sys
# Java G1GC garbage collector
tools/bench.sh java -XX:+UseG1GC -XX:ParallelGCThreads=4 -XX:ConcGCThreads=3 btree 21
2.01 real 2.25 user 0.15 sys
# Java ZGC garbage collector
tools/bench.sh java -XX:+UseZGC -XX:ParallelGCThreads=4 -XX:ConcGCThreads=3 btree 21
2.95 real 4.00 user 0.39 sys
# Java Shenandoah garbage collector
tools/bench.sh java -XX:+UseShenandoahGC -XX:ParallelGCThreads=4 -XX:ConcGCThreads=3 btree 21
2.65 real 3.00 user 0.40 sys
# System malloc()/free()
tools/bench.sh ./btree-sysmalloc 21
16.57 real 16.07 user 0.49 sys
# Tiny GC (with dlmalloc as base allocator)
tools/bench.sh ./btree-tiny-gc 21
18.31 real 18.17 user 0.13 sys
Apple clang version 14.0.0 (clang-1400.0.29.102)
jemalloc 5.3.0
mimalloc 2.0.6
Boehm GC 8.2.2
LDC - the LLVM D compiler (1.30.0):
Nim Compiler Version 1.6.8 [MacOSX: arm64]
go version go1.18.1 darwin/arm64
ocamlopt 4.14.0
Dart SDK version: 2.18.3 (stable) (Mon Oct 17 13:23:20 2022 +0000) on "macos_arm64"
javac 17.0.3
# jemalloc explicit malloc()/free()
tools/bench.sh ./btree-jemalloc 21
12.69 real 12.55 user 0.12 sys
# mimalloc explicit malloc()/free()
tools/bench.sh ./btree-mimalloc 21
5.85 real 5.63 user 0.12 sys
# dlmalloc explicit malloc()/free() (not threadsafe)
tools/bench.sh ./btree-dlmalloc 21
6.98 real 6.92 user 0.05 sys
# dlmalloc explicit malloc()/free() (threadsafe)
tools/bench.sh ./btree-dlmalloc-lock 21
23.71 real 23.64 user 0.06 sys
# C++ shared pointers (with mimalloc as base allocator)
tools/bench.sh ./btree-shared-ptr 21
25.38 real 25.22 user 0.14 sys
# C++ shared pointers + const ref (with mimalloc as base allocator)
tools/bench.sh ./btree-shared-ptr-const-ref 21
19.76 real 19.60 user 0.15 sys
# Boehm GC with four parallel marker threads
GC_MARKERS=4 tools/bench.sh ./btree-gc 21
10.28 real 12.50 user 0.08 sys
# Boehm GC with single-threaded marking
GC_MARKERS=1 tools/bench.sh ./btree-gc 21
12.04 real 11.97 user 0.05 sys
# Boehm GC with four parallel markers and minimal space overhead
GC_MARKERS=4 GC_FREE_SPACE_DIVISOR=10 tools/bench.sh ./btree-gc 21
10.75 real 14.05 user 0.08 sys
# Boehm GC with single-threaded marking and minimal space overhead
GC_MARKERS=1 GC_FREE_SPACE_DIVISOR=10 tools/bench.sh ./btree-gc 21
13.19 real 13.14 user 0.04 sys
# Boehm GC with explicit deallocation
tools/bench.sh ./btree-gc-free 21
13.78 real 13.73 user 0.03 sys
# OCaml
tools/bench.sh ./btree-ml 21
3.72 real 3.65 user 0.06 sys
# Nim reference counting GC
tools/bench.sh ./btree-nim 21
19.32 real 19.22 user 0.08 sys
# Nim mark and sweep GC
tools/bench.sh ./btree-nim-ms 21
25.02 real 24.84 user 0.15 sys
# Nim with ARC and cycle collection enabled
tools/bench.sh ./btree-nim-arc 21
9.81 real 9.48 user 0.12 sys
# Nim with Boehm GC
GC_MARKERS=1 tools/bench.sh ./btree-nim-boehm 21
12.12 real 11.91 user 0.05 sys
# D garbage collector (classes)
tools/bench.sh ./btree-d 21
26.36 real 26.29 user 0.85 sys
# D garbage collector (structs)
tools/bench.sh ./btree-d-struct 21
32.06 real 31.99 user 0.85 sys
# Go garbage collector
GOMAXPROCS=4 tools/bench.sh ./btree-go 21
18.60 real 33.64 user 0.16 sys
# Dart garbage collector
tools/bench.sh ./btree-dart 21
10.57 real 14.09 user 1.15 sys
# Java G1GC garbage collector
tools/bench.sh java -XX:+UseG1GC -XX:ParallelGCThreads=4 -XX:ConcGCThreads=3 btree 21
4.25 real 4.21 user 0.63 sys
# Java ZGC garbage collector
tools/bench.sh java -XX:+UseZGC -XX:ParallelGCThreads=4 -XX:ConcGCThreads=3 btree 21
6.31 real 7.24 user 1.62 sys
# Java Shenandoah garbage collector
tools/bench.sh java -XX:+UseShenandoahGC -XX:ParallelGCThreads=4 -XX:ConcGCThreads=3 btree 21
5.11 real 5.45 user 1.08 sys
# System malloc()/free()
tools/bench.sh ./btree-sysmalloc 21
37.08 real 36.31 user 0.58 sys
# Tiny GC (with dlmalloc as base allocator)
tools/bench.sh ./btree-tiny-gc 21
39.32 real 38.84 user 0.26 sys
Apple clang version 13.1.6 (clang-1316.0.21.2.5)
jemalloc 5.3.0
mimalloc 2.0.6
Boehm GC 8.2.2
LDC - the LLVM D compiler (1.30.0):
Nim Compiler Version 1.6.8 [MacOSX: amd64]
go version go1.18.1 darwin/amd64
ocamlopt 4.14.0
Dart SDK version: 2.18.3 (stable) (Mon Oct 17 13:23:20 2022 +0000) on "macos_x64"
javac 17.0.3