GNU-alike tools for parsing RFC 4180 CSVs at high speed.
As our data gets bigger, so do our CSV files. The CSV format is not exactly pipe-friendly, due to embedded newlines and quoted separators.
onyxfish/csvkit offers a great set of utilities for most tasks you would want to perform on CSVs in a GNU-toolset kind of way. However, it is not fast. For reasonably sized data sets this doesn't matter, but for CSVs of more than a few MBs you start to feel the pain.
This repository contains GNU-alike tools for parsing RFC 4180 CSVs at high speed.
- `csvcut`: a `cut(1)` equivalent to drop columns from a CSV file
- `csvgrep`: a `grep(1)` equivalent to match on one or more columns per row, and only keep the rows matching all or any of the patterns (it uses PCRE for regular-expression goodness)
- `csvawk`: a wrapper for `awk(1)` which correctly recognizes rows and cells (even across newlines). This is comparable to geoffroy-aubry/awk-csv-parser, except that it also supports embedded newlines.
- `csvpipe` and `csvunpipe`: translate the newlines separating rows to `\0`, so that `sort -z`, `uniq -z`, and other null-terminated-line based tools can be used more correctly (see the sketch below).
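To give an idea of what that translation involves, here is a minimal sketch of the `csvpipe` side (not the actual source; `csvunpipe` would do the reverse):

```c
/* Minimal sketch of the csvpipe idea (not the actual source): copy stdin
 * to stdout, replacing each '\n' that terminates a record with '\0', while
 * leaving newlines inside quoted fields untouched. */
#include <stdbool.h>
#include <stdio.h>

int main(void) {
    int c;
    bool in_quotes = false;
    while ((c = getchar()) != EOF) {
        if (c == '"') {
            /* RFC 4180 escapes quotes as "", so a simple toggle still tracks
             * quoting correctly across escaped quote pairs. */
            in_quotes = !in_quotes;
            putchar(c);
        } else if (c == '\n' && !in_quotes) {
            putchar('\0');               /* record separator becomes NUL */
        } else {
            putchar(c);
        }
    }
    return 0;
}
```

With the pair in place, a pipeline like `csvpipe < data.csv | sort -z | csvunpipe > sorted.csv` sorts whole records even when fields contain embedded newlines.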
Benchmarking is complicated; the primary goal is to measure only the thing of interest, by reducing the impact of other factors. Originally csvtools was benchmarked on the Canada 2011 census, but that primarily measured the limits of the SSD and the caches around the file system.

Now we benchmark with a custom tool: `bench/runner.c`. It first generates an in-memory random CSV data set (see `bench/generate.c`) and then pipes it into the applications under test. This at least takes the I/O and file system out of the equation.
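The gist of such a harness looks roughly like the sketch below (a simplification, not the actual `bench/runner.c`; it feeds a dummy in-memory buffer into `wc -l` and reports the write-side throughput):

```c
/* Simplified sketch of the benchmark idea (not the actual bench/runner.c):
 * fork the command under test, feed it an in-memory buffer through a pipe,
 * and report how fast the buffer was consumed, so disk and file-system
 * caches never enter the measurement. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <unistd.h>
#include <sys/wait.h>

static double pipe_throughput(char *const argv[], const char *data, size_t len) {
    int fds[2];
    if (pipe(fds) != 0) { perror("pipe"); exit(1); }

    pid_t pid = fork();
    if (pid == 0) {                      /* child: command under test */
        dup2(fds[0], STDIN_FILENO);
        close(fds[0]);
        close(fds[1]);
        execvp(argv[0], argv);
        _exit(127);
    }
    close(fds[0]);

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);

    size_t off = 0;                      /* parent: stream the buffer */
    while (off < len) {
        ssize_t n = write(fds[1], data + off, len - off);
        if (n <= 0) break;
        off += (size_t)n;
    }
    close(fds[1]);
    waitpid(pid, NULL, 0);

    clock_gettime(CLOCK_MONOTONIC, &t1);
    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    return (double)off / (1024.0 * 1024.0) / secs;  /* MiB/s */
}

int main(void) {
    size_t len = 64u << 20;              /* 64 MiB of dummy data; the real   */
    char *data = malloc(len);            /* harness generates random CSV     */
    memset(data, 'a', len);              /* rows instead (bench/generate.c). */

    char *cmd[] = { "wc", "-l", NULL };
    printf("%.1f MiB/s\n", pipe_throughput(cmd, data, len));
    free(data);
    return 0;
}
```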
We compare csvtools with other solutions. Note that these solutions might not handle CSVs correctly. The reported numbers are median MiB/s.
| command | median speed |
|---|---|
| `cat > /dev/null` | 2042.1 MiB/s |
| `wc -l > /dev/null` | 2149.0 MiB/s |
| `md5sum > /dev/null` | 566.8 MiB/s |
| scenario | csvkit | cut | sed | csvtools |
|---|---|---|---|---|
| first column | 8.0 MiB/s | 278.8 MiB/s | 356.9 MiB/s | 644.1 MiB/s |
| middle column | 8.1 MiB/s | 280.3 MiB/s | 138.6 MiB/s | 555.8 MiB/s |
| last column | 8.0 MiB/s | 280.0 MiB/s | 90.1 MiB/s | 565.0 MiB/s |
| two adjoining columns | 7.3 MiB/s | 359 MiB/s | 59.6 MiB/s | 561.6 MiB/s |
| two distinct columns | 7.3 MiB/s | 449 MiB/s | 59.8 MiB/s | 480.9 MiB/s |
So even compared to `sed` or `cut`, which don't handle quoted separators correctly, our `csvcut` is much faster.
| scenario | csvkit | grep | awk | csvtools |
|---|---|---|---|---|
| first column | 7.6 MiB/s | 347.9 MiB/s | 469.2 MiB/s | 588.0 MiB/s |
| middle column | 7.8 MiB/s | 302.8 MiB/s | 379.3 MiB/s | 579.0 MiB/s |
| last column | 7.7 MiB/s | 392.7 MiB/s | 341.5 MiB/s | 632.5 MiB/s |
| two distinct columns | 9.0 MiB/s | 273.9 MiB/s | 380.0 MiB/s | 569.7 MiB/s |
`csvgrep` is faster than `grep` and `awk` here because the column selection in the `grep` invocation has to be done with repeated negative character classes. Of course, there are regular expressions for which PCRE is slower than `grep`.
| scenario | awk | awk-csv-parser | csvtools |
|---|---|---|---|
| print second column | 428.5 MiB/s | 2.45 MiB/s | 278.5 MiB/s |
| sum last column | 350.5 MiB/s | 2.4 MiB/s | 225.9 MiB/s |
Sadly, `csvawk` is slower than pure `awk`. This is caused by the custom record separator (instead of the normal newline). Benchmarking `csvawk` piped into `awk` shows it runs at around 800 MiB/s, and when newlines are used as separators the whole `csvawk` pipeline performs roughly on par with raw `awk`.
However, newlines are not valid record separators, since they can occur inside quoted fields. For `csvawk` we generate `\x1E` between records (as per ISO 646) and `\x1F` between fields within a record.
The results of the second benchmark differ because plain `awk` does not correctly handle separators embedded in quoted fields.
No malloc & memcpy!
Or as valgrind reports it:
==2473== total heap usage: 18 allocs, 18 frees, 210 bytes allocated
In the critical path of tokenizing the CSV stream and writing it to `stdout`, there are no copies or memory allocations. The programs read into a buffer from `stdin` (or the file passed as the last argument), the tokenizer stores offsets (into that buffer) and lengths in a cell array, and the printer writes from the same buffer, using the offsets and lengths from the cell array.
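A minimal sketch of that idea (not the actual csvtools source, and skipping quoted fields to stay short): the tokenizer only records where each cell starts and how long it is, and the printer writes those slices straight out of the input buffer.

```c
/* Sketch of the zero-copy idea (not the actual csvtools source): tokenize a
 * buffer of simple, unquoted CSV into (offset, length) pairs and print one
 * column without ever copying cell data. */
#include <stddef.h>
#include <stdio.h>

typedef struct { size_t offset; size_t length; } cell;

/* Record where every cell in `buf` starts and how long it is; quoted fields
 * are ignored here to keep the sketch short. Returns the number of cells. */
static size_t tokenize(const char *buf, size_t len, cell *cells, size_t max) {
    size_t n = 0, start = 0;
    for (size_t i = 0; i < len && n < max; i++) {
        if (buf[i] == ',' || buf[i] == '\n') {
            cells[n].offset = start;
            cells[n].length = i - start;
            n++;
            start = i + 1;
        }
    }
    return n;
}

int main(void) {
    static const char buf[] = "id,name\n1,alice\n2,bob\n";
    cell cells[32];
    size_t n = tokenize(buf, sizeof buf - 1, cells, sizeof cells / sizeof cells[0]);

    /* Print the second column of each row (the sample data has two columns):
     * no malloc and no memcpy, just offsets and lengths into `buf`. */
    for (size_t i = 1; i < n; i += 2) {
        fwrite(buf + cells[i].offset, 1, cells[i].length, stdout);
        fputc('\n', stdout);
    }
    return 0;
}
```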
`make install` (or with a prefix: `make install prefix=~/.apps/`)