# Ruby Performance Optimization - Book Notes
These notes are adapted from the following sources:
- Ruby (especially `<= 2.0`) is slow, and that is largely because of high memory consumption & allocation.
- GC in Ruby `>= 2.1` is 5 times faster than in previous versions.
- Apart from GC, Ruby `1.9 - 2.3` performance is about the same.
- See 001_gc.rb.
Use `GC::Profiler` to get information about GC runs.
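A minimal sketch of what such a `GC::Profiler` session might look like (the allocation loop is just an illustrative workload):

```ruby
GC::Profiler.enable
100_000.times { "some garbage " * 10 }   # illustrative allocation-heavy workload
GC.start
GC::Profiler.report                      # prints one line of timing data per GC run
GC::Profiler.disable

p GC.stat[:count]                        # total number of GC runs in this process so far
```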
See 002_memory.rb
See wrapper.rb for custom wrapper example.
Use the `<<` method instead of `+=` to concatenate; Ruby will not allocate an additional object.
See 004_array_bang.rb (w/ GC)
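For example, with strings (an illustrative micro-benchmark, not taken from the example files):

```ruby
# `+=` creates a brand-new, ever larger String object on every iteration:
s = ""
10_000.times { s += "x" }

# `<<` mutates the existing String in place, so no extra Ruby object is
# allocated per iteration (the underlying buffer may still be realloc'ed):
s = ""
10_000.times { s << "x" }
```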
See 005_files.rb (w/ and w/o GC)
See 006_callbacks_1.rb
See 006_callbacks_2.rb
See 006_callbacks_3.rb
Try to avoid capturing blocks with `&block` and use `yield` instead (the former copies the context stack while the latter does not).
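A minimal sketch of the difference (method names are made up):

```ruby
# Capturing the block with &block materializes it as a Proc object:
def slow_each(array, &block)
  array.each { |item| block.call(item) }
end

# An implicit block with yield avoids that extra work:
def fast_each(array)
  array.each { |item| yield item }
end

slow_each([1, 2, 3]) { |x| puts x * 2 }
fast_each([1, 2, 3]) { |x| puts x * 2 }
```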
Iterators use block arguments, so use them carefully
Note the following:
Solutions:
- the `each!` pattern - see [007_iter_1.rb](007_iter_1.rb)
- see [007_iter_2.rb](007_iter_2.rb) (for Ruby `< 2.3.0`)
Table of `T_NODE` allocations per iterator item for ruby 2.1:
| Iterator | Enum | Array | Range |
| ---------------: | ---- | ----- | ----- |
| all? | 3 | 3 | 3 |
| any? | 2 | 2 | 2 |
| collect | 0 | 1 | 1 |
| cycle | 0 | 1 | 1 |
| delete_if | 0 | — | 0 |
| detect | 2 | 2 | 2 |
| each | 0 | 0 | 0 |
| each_index | 0 | — | — |
| each_key | — | — | 0 |
| each_pair | — | — | 0 |
| each_value | — | — | 0 |
| each_with_index | 2 | 2 | 2 |
| each_with_object | 1 | 1 | 1 |
| fill | 0 | — | — |
| find | 2 | 2 | 2 |
| find_all | 1 | 1 | 1 |
| grep | 2 | 2 | 2 |
| inject | 2 | 2 | 2 |
| map | 0 | 1 | 1 |
| none? | 2 | 2 | 2 |
| one? | 2 | 2 | 2 |
| reduce | 2 | 2 | 2 |
| reject | 0 | 1 | 0 |
| reverse | 0 | — | — |
| reverse_each | 0 | 1 | 1 |
| select | 0 | 1 | 0 |
- Date parsing is slow; use the `strptime` method with an explicit format instead (see the sketch after this list).
- `Object#class`, `Object#is_a?`, `Object#kind_of?` are slow when used inside iterators.
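A hedged illustration of the `strptime` point (date strings are made up; `Date.parse` has to guess the format, `Date.strptime` is told it explicitly):

```ruby
require 'date'
require 'benchmark'

dates = Array.new(10_000) { "2015-03-2#{rand(0..9)}" }

Benchmark.bm(10) do |x|
  x.report("parse")    { dates.each { |d| Date.parse(d) } }
  x.report("strptime") { dates.each { |d| Date.strptime(d, '%Y-%m-%d') } }
end
```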
Use SQL for aggregation & calculation whenever possible.
See 009_db/ (the database query itself takes only 30 ms).
Response-time guidelines:
- For the app server: `<50ms` is good, `<300ms` is acceptable, `>300ms` is too slow.
- For an API: aim to be about 2 times faster than that.
- For the frontend (full page load): `<500ms` is good, `<2s` is acceptable, `>2s` is too slow.
- Use `#pluck` and `#select` to load only the necessary data.
- Use `#find_by_sql` to aggregate associations data.
- Use `#find_each` & `#find_in_batches` to process large tables in batches.
- Use `ActiveRecord::Base.connection.execute`, `ActiveRecord::Base.connection.exec_query`, and `ActiveRecord::Base.connection.select_values` to run raw SQL.
- Use `#update_all` for mass updates instead of updating records one by one (see the sketch after this list).
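A hypothetical sketch of several of these techniques (the `User` / `Order` models are made up for illustration):

```ruby
# Load only the columns you need instead of full ActiveRecord objects:
emails = User.where(active: true).pluck(:email)

# Walk a large table in batches instead of loading everything at once:
User.find_each(batch_size: 1_000) do |user|
  # ...
end

# Aggregate in SQL rather than in Ruby:
totals = Order.group(:user_id).sum(:amount)

# Mass-update without instantiating every record:
User.where(active: false).update_all(archived: true)
```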
Use `render partial: 'a', collection: @col`, which loads the partial template only once.
Use the Rails `cache(obj) {}` method to cache ActiveRecord objects, expired by either:
- `updated_at`
Avoid ‘Russian doll’ (nested) cache blocks, or cache ‘id-arrays’ like the following:
<% cache(["list", items.maximum(:updated_at)]) do %>
<ul>
<% items.each do |item| %>
<% cache(item) do %>
...
<% end %>
<% end %>
</ul>
<% end %>
If you reference relations in views, do not forget to touch the parent objects: `belongs_to :obj, touch: true`.
Main Rails cache stores:
You should avoid swapping on Heroku, so calculate the number of workers carefully
(allow 128 MB for the master process and 256 MB for each worker).
If you have > 60 req/s, use unicorn / puma / passenger (not thin / webrick).
Try derailed_benchmarks to see memory consumption.
Check out 24-hour memory consumption graphs.
Worker killers may be useful if you have unidentified memory leaks. Make sure the killer runs no more often than once an hour.
Serve assets from S3 / CDN or exclude them from monitoring.
Profiling = measuring CPU/Memory usage + interpreting results
For CPU profiling, disable GC!
The ruby-prof gem has both an API (for isolated profiling) and a CLI (for startup/whole-program profiling). It also comes with a Rack middleware for Rails.
See 010_rp_1.rb
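A minimal sketch of the ruby-prof API for isolated profiling (the available printers vary slightly between ruby-prof versions):

```ruby
require 'ruby-prof'

result = RubyProf.profile do
  # code under measurement (ideally with GC disabled around this block)
  10_000.times { (1..100).map(&:to_s).join(',') }
end

RubyProf::FlatPrinter.new(result).print($stdout)
```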
Some programs may spend more time on startup than on actual code execution.
Sometimes `GC.disable` may take a significant amount of time because of lazy GC sweep.
Use the `Rack::RubyProf` middleware to profile Rails apps. Insert it before `Rack::Runtime` to include the other middlewares in the report.
To disable GC, use custom middleware (see 010_rp_rails/config/application.rb).
Rails profiling best practices:
The most useful report types for ruby-prof (see 011_rp_rep.rb):
You should also try rack-mini-profiler with flamegraphs.
Ruby-prof can generate callgrind files with CallTreePrinter (see 011_rp_rep.rb).
Callgrind profiles have a double-counting issue!
Callgrind profiles show loops as recursion.
It is better to start from the bottom of Call Graph and optimize its leaves first.
Always start optimizing by writing tests & benchmarks.
! The profiler adds up to 10x overhead to function calls.
If you optimized individual functions but the whole thing is still slow, look at the code at a higher abstraction level.
Optimization tips:
80% of Ruby performance optimization comes from memory optimization.
You have 3 options for memory profiling:
- `GC#stat` & `GC::Profiler` measurements
To detect whether memory profiling is needed at all, use monitoring and profiling tools.
A good profiling tool is Valgrind Massif, but it shows memory allocations only for C/C++ code.
Another tool is Stackprof, which shows the number of object allocations (proportional to memory consumption); see 014_stackprof.rb. But if your code allocates a small number of large objects, it won't help.
Stackprof can generate flamegraphs, and it is OK to use it in production because it adds almost no overhead.
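A minimal sketch of running Stackprof in object-allocation mode (the output filename is arbitrary):

```ruby
require 'stackprof'

StackProf.run(mode: :object, out: 'stackprof-objects.dump', raw: true) do
  # code under measurement
  100_000.times { "payload" * 10 }
end

# Inspect the dump afterwards from the shell, e.g.:
#   stackprof stackprof-objects.dump --text
# (the raw: true option keeps enough data for flamegraph output as well)
```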
You need a RailsExpress-patched Ruby (google it). Then set the RubyProf measure mode and use one of the printers (see 015_rp_memory.rb). Don't forget to enable memory stats with `GC.enable_stats`.
Modes for memory profiling:
A memory profile shows only new memory allocations (not the total amount of memory in use at that time) and doesn't show GC reclaims.
! Ruby allocates a temporary object for strings longer than 23 chars.
We can measure current memory usage, but it is not very useful.
On Linux we can use OS tools:

```ruby
memory_before = `ps -o rss= -p #{Process.pid}`.to_i / 1024
do_something
memory_after  = `ps -o rss= -p #{Process.pid}`.to_i / 1024
```
`GC#stat` and `GC::Profiler` can reveal some information.
For adequate results, take a number of measurements and use the median value.
A lot of external (CPU, OS, latency, etc.) and internal (GC runs, etc.) factors affect the measured numbers.
It is impossible to exclude them entirely, but external factors can be reduced, e.g. by fixing the CPU frequency (`governor`, `cpupower` on Linux).
Two things inside the application can affect measurements: GC and system calls (including I/O calls).
You may disable GC for measurements or force it before benchmarking with `GC.start`
(but not inside the measurement loop, because the loop itself creates new objects).
On Linux & macOS, forking the process can fix that issue:

```ruby
require 'benchmark'

100.times do
  GC.start
  pid = fork do
    GC.start                         # clean GC state in the fresh child process
    m = Benchmark.realtime { ... }   # the measured code goes here
  end
  Process.waitpid(pid)
end
```
Confidence interval - the interval within which we can confidently state the true optimization lies.
Level of confidence - the probability that the true value lies within the confidence interval.
Analysis algorithm:

```
mx  = sum(xs) / count(xs);  my = sum(ys) / count(ys)        # means
sdx = sqrt(sum(sqr(xi - mx)) / (count(xs) - 1))             # sample standard deviations
sdy = sqrt(sum(sqr(yi - my)) / (count(ys) - 1))
mo  = mx - my                                               # mean optimization
err = sqrt(sqr(sdx) / count(xs) + sqr(sdy) / count(ys))     # standard error of the difference
interval = (mo - 2 * err)..(mo + 2 * err)                   # ~95% confidence interval
```
Both the lower and upper bounds of the confidence interval should be > 0 (see 016_statistics.rb). Always make more than 30 measurements for good results.
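A small Ruby sketch of this procedure (the timing arrays are made-up sample data; compare with 016_statistics.rb):

```ruby
# xs: timings before optimization, ys: timings after (made-up numbers, in seconds)
xs = [1.52, 1.48, 1.50, 1.55, 1.49, 1.51]
ys = [1.21, 1.25, 1.19, 1.23, 1.22, 1.20]

mean  = ->(a) { a.sum / a.size }
stdev = ->(a) { m = mean.call(a); Math.sqrt(a.sum { |v| (v - m)**2 } / (a.size - 1)) }

mo  = mean.call(xs) - mean.call(ys)                                    # mean optimization
err = Math.sqrt(stdev.call(xs)**2 / xs.size + stdev.call(ys)**2 / ys.size)
low, high = mo - 2 * err, mo + 2 * err                                 # ~95% confidence interval

puts format("optimization: %.2fs, interval: %.3f..%.3f", mo, low, high)
puts "statistically significant (both bounds > 0)" if low > 0
```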
For Ruby, round measurements to the hundredth of a second (e.g. 1.23 s). For rounding, use the tie-breaking “round half to even” rule (round 5 to the even neighbour: 1.25 ~= 1.2, 1.35 ~= 1.4).
For better results, you may exclude outliers and the first (cold) measurement.
For Rails performance testing, use special gems (e.g. rails-perftest) or write your own custom assertions.
Tips:
- Set the production log level to `:info` (or higher); verbose logging is slow.
A Ruby program may be optimized not only by optimizing its code: the application also uses various dependencies, services, and third-party software, which can be tuned too.
Sometimes it is OK to restart long-running Ruby processes:
memory consumption grows with time, and GC slows down as more memory is allocated.
! In most cases ruby won’t give objects heap memory back to OS.
Ways to cycle (restart) applications:
Object-heavy calculations should be run in forks, so that when the forked process exits, its heap memory is returned to the OS (see 018_heavy_forks.rb).
There are 3 common ways to return a result from a fork: files, a DB, or an I/O pipe (see the sketch below).
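A minimal sketch of the I/O-pipe variant (Marshal is used here simply as one convenient way to serialize the result):

```ruby
reader, writer = IO.pipe

pid = fork do
  reader.close
  result = (1..1_000_000).reduce(:+)     # some object-heavy calculation
  writer.write(Marshal.dump(result))     # send the result back to the parent
  writer.close
end

writer.close
data = reader.read                       # blocks until the child closes its end
reader.close
Process.waitpid(pid)

puts Marshal.load(data)                  # => 500000500000
```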
! This doesn't work with threads, only with forks (threads share the same ObjectSpace).
For Rails use background jobs (delayed_job) and workers (sidekiq).
! Sidekiq uses threads, so you should monitor and restart Sidekiq workers yourself.
OOBGC (out-of-band GC) - starting GC when the application has a low workload. Not useful since Ruby 2.2.
Unicorn has direct support for OOBGC via the unicorn/oob_gc middleware.
For Ruby 2.1+ you can use the gctools gem. But be careful with threads: starting GC in one thread will affect all other threads of the process.
PostgreSQL configuration best practices:
```
effective_cache_size <RAM * 3/4>
shared_buffers <RAM * 1/4>
# aggregations memory
work_mem <2^(log(RAM / MAX_CONN)/log(2))>
# vacuum & indices creation
maintenance_work_mem <2^(log(RAM / 16)/log(2))>
log_autovacuum_min_duration 1000ms
log_min_duration_statement 1000ms
auto_explain.log_min_duration 1000ms
shared_preload_libraries 'auto_explain'
custom_variable_classes 'auto_explain'
auto_explain.log_analyze off
```
Most important criteria:
Ruby stores objects in its own heap (objects heap) and uses OS heap for data that doesn’t fit into objects.
Every object in Ruby is an `RVALUE` struct. Its size is 40 bytes on a 64-bit system.
Check the `RVALUE` size with the following commands:

```
gdb `rbenv which ruby`
p sizeof(RVALUE)
```
! A medium-sized Rails app allocates ~0.5M objects at startup.
Ruby heap space (objects heap) -> N heap pages -> M heap slots. Heap slot contains one object.
To allocate a new object, Ruby takes an unused slot. If no unused slot is found, the interpreter allocates more heap pages.
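You can watch this from inside a program via `GC.stat`; the key names below are the Ruby 2.2+ ones (older versions use slightly different names):

```ruby
before = GC.stat
objs   = Array.new(100_000) { Object.new }   # keep references so the slots stay occupied
after  = GC.stat

%i[heap_allocated_pages heap_live_slots heap_free_slots].each do |key|
  puts "#{key}: #{before[key]} -> #{after[key]}"
end
```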
Allocates 10_000 slots at startup (1 page) and then adds one page at a time (new page = prev_page * 1.8).
Heap page = 16kB (~ 408 objects).
Allocates 10_000 slots (24 pages) at startup and then adds N 16 kB pages at a time (N = prev_pages * 1.8 - prev_pages).
Some GC stats (`GC.stat`) for Ruby 1.9:
GC_HEAP_GROWTH_FACTOR - growth factor (default = 1.8)
GC_HEAP_GROWTH_MAX_SLOTS - slots growth constraint
Allocates 1 page + 24 pages and then adds N pages at a time (N = prev_nonempty_pages * GC_HEAP_GROWTH_FACTOR - prev_nonempty_pages).
In Ruby 2.1, pages are added on demand (`heap_length` != the number of allocated pages; it is just a counter).
Some GC stats (`GC.stat`) for Ruby 2.1:
Eden - occupied heap pages.
Tomb - empty heap pages.
To allocate a new object Ruby first looks for free space in eden and only then in tomb.
! Ruby frees objects heap memory (gives it back to the OS) by whole pages.
Algorithm to determine the number of pages to free:
- `sw` - number of pages touched during sweep (number of objects / HEAP_OBJ_LIMIT)
- `rem = max(total_heap_pages * 0.8, init_slots)` - pages that should stay
- `fr = total_heap_pages - rem` - pages to free
Usually objects heap growth is 80%, while reduction is only 10%.
Some GC stats (`GC.stat`) for Ruby 2.2:
Growth is the same as in Ruby 2.1, but relative to eden pages rather than allocated pages.
If a Ruby object's data is bigger than half of the 40-byte `RVALUE` (on a 64-bit OS), the data will be stored outside the objects heap. That memory is freed and returned to the OS after GC (see 019_obj_memory.rb).
A Ruby string (`RSTRING` struct) can store only 23 bytes of payload.
`ObjectSpace.memsize_of(obj)` - shows the object's size in memory, in bytes.
For example, a 24-char String has a size of 65 bytes (24 chars + 1 byte of upkeep outside the heap, plus the 40-byte RVALUE inside the heap).
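For illustration (the sizes below are what a 64-bit, pre-3.x MRI typically reports; newer Rubies with variable-width allocation differ):

```ruby
require 'objspace'

puts ObjectSpace.memsize_of("x" * 23)   # => 40: the payload fits inside the RVALUE
puts ObjectSpace.memsize_of("x" * 24)   # => 65: 40-byte RVALUE + 24 chars + 1 byte of upkeep
```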
It may be OK to allocate a big object, because it doesn't affect GC performance (though it may trigger a GC run); what really hurts is allocating a large number of small objects in the objects heap.
Two main purposes are:
If there are no more free slots in the objects heap, Ruby invokes GC to free enough slots, which is `max(allocated_slots * 0.2, GC_HEAP_FREE_SLOTS)`
(see 020_heap_gc.rb).
GC will also be triggered when you allocate more than the current memory limit (in Ruby <= 2.0, `GC_MALLOC_LIMIT` ~= 8M bytes, i.e. 7.63 MB) (see 021_malloc_gc.rb).
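A rough way to observe malloc-triggered GC (the `GC.stat` keys are the Ruby 2.1+ names; exact numbers will vary):

```ruby
GC.start
runs_before = GC.stat[:count]

data = Array.new(5) { "x" * 10_000_000 }    # ~50 MB of malloc'ed string data,
                                            # well past the malloc limit

puts GC.stat[:count] - runs_before          # typically >= 1: GC was triggered by mallocs
puts GC.stat[:malloc_increase_bytes_limit]  # the current malloc limit, in bytes
```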
The malloc limit is adjusted at runtime in proportion to the application's memory usage, but the adjustment is not very good.
Ruby 2.1 introduced generational GC: it divides all objects into new and old (survived a GC) generations, with their own limits `GC_MALLOC_LIMIT_MIN` and `GC_OLDMALLOC_LIMIT_MIN` (both 16 MB initially).
They can grow up to `GC_MALLOC_LIMIT_MAX` and `GC_OLDMALLOC_LIMIT_MAX` (32 MB and 128 MB by default).
Growth factors are `GC_MALLOC_LIMIT_GROWTH_FACTOR` and `GC_OLDMALLOC_LIMIT_GROWTH_FACTOR` (1.4 and 1.2 by default), and the reduction factor is 0.98.
Ruby 2.2 introduced incremental GC: several mark steps followed by several sweep steps (shorter “stop the world” pauses).
Ruby uses mark & sweep GC and stops the world for mark steps.
Generational GC divides all GC invocations to minor (only for new objects) and major (for both new and old ones).
Some related `GC.stat` params:
Tune the following environment variables:
To change other Ruby GC parameters for versions below 2.0, you have to recompile interpreter.