zsv+lib: tabular data swiss-army knife CLI + world's fastest (simd) CSV parser

zsv+lib: the world’s fastest (simd) CSV parser, with an extensible CLI

zsv+lib is a fast CSV parser library and extensible command-line utility. It
achieves high performance using SIMD operations, efficient memory use and other
optimization techniques, and can also parse generic-delimited and fixed-width
formats, as well as multi-row-span headers.

The ZSV CLI can be compiled to virtually any target, including
WebAssembly, and offers features including select, count,
direct CSV sql, flatten, serialize, 2json conversion, 2db sqlite3
conversion, stack, pretty, 2tsv, compare, paste, overwrite and more.

The ZSV CLI also includes sheet, an in-console interactive grid viewer for
viewing data. (TO DO: allow sheet to be extended with your custom code for
manipulating data.)

Pre-built CLI packages are available via brew and nuget.

A pre-built library package is available for Node (npm install zsv-lib).
Please note, this package is still in alpha and currently only exposes a small
subset of the zsv library capabilities. More to come.

If you like zsv+lib, do not forget to give it a star! 🌟

Performance

Preliminary performance results compare favorably vs other CSV utilities (xsv,
tsv-utils, csvkit, mlr (miller) etc). The results below are from a pre-M1 macOS
MBA; on most platforms zsvlib was 2x faster, though in some cases the advantage
was smaller (e.g. 15-25%). mlr is not shown below as it was about 25x slower.

[Charts: count speed; select speed]

** See 12/19 update re M1 processor at
https://github.com/liquidaty/zsv/blob/main/app/benchmark/README.md

Which “CSV”

“CSV” is an ambiguous term. This library uses the same definition as Excel. In
addition, it provides a row-level (as well as cell-level) API and provides
“normalized” CSV output (e.g. input of this"iscell1,"thisis,"cell2 becomes
"this""iscell1","thisis,cell2"). Each of these three objectives (Excel
compatibility, row-level API and normalized output) has a measurable performance
impact; conversely, much faster parsing speeds are possible, and a number of
other CSV parsers achieve them, if any of these requirements (especially Excel
compatibility) are dropped.
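
To make the “normalized” output rule concrete, here is a minimal sketch of
writing one cell in normalized form. It is not taken from zsv; the function
name csv_normalize_cell and its interface are hypothetical, and it assumes the
usual convention that a cell is quoted only when it contains a comma, a double
quote or a newline, with embedded double quotes doubled.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not part of the zsv API): write one cell in
 * normalized CSV form into out as a NUL-terminated string.
 * Returns 0 on success, or -1 if out is too small. */
static int csv_normalize_cell(const char *cell, size_t len,
                              char *out, size_t out_size) {
  int needs_quote = 0;
  size_t quotes = 0;

  /* First pass: decide whether quoting is needed and count embedded quotes */
  for (size_t i = 0; i < len; i++) {
    char c = cell[i];
    if (c == '"') {
      quotes++;
      needs_quote = 1;
    } else if (c == ',' || c == '\n' || c == '\r') {
      needs_quote = 1;
    }
  }

  /* Output size: cell bytes + doubled quotes + 2 enclosing quotes + NUL */
  size_t needed = len + quotes + (needs_quote ? 2 : 0) + 1;
  if (needed > out_size)
    return -1;

  /* Second pass: copy, doubling each embedded double quote */
  size_t n = 0;
  if (needs_quote)
    out[n++] = '"';
  for (size_t i = 0; i < len; i++) {
    if (cell[i] == '"')
      out[n++] = '"';
    out[n++] = cell[i];
  }
  if (needs_quote)
    out[n++] = '"';
  out[n] = '\0';
  return 0;
}
```

Under these assumptions, the two parsed cell values from the example above
(this"iscell1 and thisis,cell2) are written as "this""iscell1" and
"thisis,cell2", while a cell with no special characters is written unchanged.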

Built-in and extensible features

zsv is an extensible CSV utility, which uses zsvlib, for tasks such as slicing
and dicing, querying with SQL, combining, serializing, flattening,
converting between CSV/JSON/sqlite3 and more.

zsv is streamlined for easy development of custom dynamic extensions.

zsvlib and zsv are written in C, but since zsvlib is a library, and zsv
extensions are just shared libraries, you can extend zsv with your own code in
any programming language, so long as it has been compiled into a shared library
that implements the expected interface.

Key highlights

  • Available as BOTH a library and an application (coming soon: standalone
    zsvutil library for common helper functions such as csv writer)
  • Open-source, permissively licensed
  • Handles real-world CSV the same way that spreadsheet programs do (including
    edge cases). Gracefully handles (and can “clean”) real-world data that may
    be “dirty”.
  • Runs on macOS (tested on clang/gcc), Linux (gcc), Windows (mingw), BSD
    (gcc-only) and in-browser (emscripten/wasm)
  • Fastest (at least, vs all alternatives and on all platforms we’ve benchmarked
    where 256-bit SIMD operations are available). See
    app/benchmark/README.md
  • Low memory usage (regardless of how big your data is) and size footprint for
    both lib (~20k) and CLI executable (< 1MB)
  • Handles general delimited data (e.g. pipe-delimited) and fixed-width input
    (with specified widths or auto-detected widths)
  • Handles multi-row headers
  • Handles input from any stream, including caller-defined streams accessed via a
    single caller-defined fread-like function
  • Easy to use as a library in a few lines of code, via either pull or push
    parsing
  • Includes the zsv CLI with many built-in commands (listed under “Batteries
    included”)
  • CLI is easy to extend/customize with a few lines of code via modular plug-in
    framework. Just write a few custom functions and compile into a distributable
    DLL that any existing zsv installation can use.
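
As an illustration of the “caller-defined fread-like function” item in the list
above, the following is a minimal sketch of a reader over an in-memory buffer
that follows the same parameter order and return convention as fread(). The
struct mem_stream type and the mem_read name are hypothetical; how such a
function and its stream pointer are registered with the parser is not shown
here and should be checked against the zsv headers.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical in-memory stream; names are illustrative only */
struct mem_stream {
  const char *data; /* start of the buffer */
  size_t len;       /* total bytes in the buffer */
  size_t pos;       /* bytes consumed so far */
};

/* Same shape as fread(): copies up to n items of `size` bytes each into buf
 * and returns the number of whole items copied (0 at end of data). */
static size_t mem_read(void *buf, size_t size, size_t n, void *stream) {
  struct mem_stream *s = stream;
  if (size == 0 || n == 0)
    return 0;
  size_t avail_items = (s->len - s->pos) / size;
  size_t items = n < avail_items ? n : avail_items;
  memcpy(buf, s->data + s->pos, items * size);
  s->pos += items * size;
  return items;
}
```

Because the function only sees an opaque stream pointer, the same pattern can
wrap a socket, a decompressor or any other source, provided it returns 0 once
the data is exhausted.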

Installing

Packages

Download pre-built binaries and packages for macOS, Windows, Linux and BSD from
the Releases page.

You can also download pre-built binaries and packages for the latest commits
and PRs from Actions, but these are retained only for a limited number of days.

[!IMPORTANT]

For musl libc static builds, dynamic extensions are not supported!

[!NOTE]

After v0.3.9-alpha, all package artifacts will be properly attested.
To verify, you can use the GitHub CLI like this:

gh attestation verify <downloaded-artifact> --repo liquidaty/zsv

macOS

…via Homebrew:

brew tap liquidaty/zsv
brew install zsv

…via MacPorts:

sudo port install zsv

Linux

For Linux (Debian/Ubuntu - *.deb):

# Install
sudo apt install ./zsv-amd64-linux-gcc.deb

# Uninstall
sudo apt remove zsv

For Linux (RHEL/CentOS - *.rpm):

# Install
sudo yum install ./zsv-amd64-linux-gcc.rpm

# Uninstall
sudo yum remove zsv

Windows

For Windows (*.nupkg), install with nuget.exe:

# Install via nuget custom feed (requires absolute paths)
md nuget-feed
nuget.exe add zsv .\<path>\zsv-amd64-windows-mingw.nupkg -source <path>/nuget-feed
nuget.exe install zsv -version <version> -source <path>/nuget-feed

# Uninstall
nuget.exe delete zsv <version> -source <path>/nuget-feed

For Windows (*.nupkg), install with choco.exe:

# Install
choco.exe install zsv --pre -source <directory containing .nupkg file>

# Uninstall
choco.exe uninstall zsv

Node

The zsv parser library is available for node:

npm install zsv-lib

Please note:

  • This package is still in alpha and currently only exposes a small subset of
    the zsv library capabilities. More to come!
  • The CLI is not yet available as a Node package
  • If you’d like to use additional parser features, or use the CLI as a Node
    package, please feel free to post a request in an issue here.

GHCR (GitHub Container Registry)

zsv CLI is also available as a container image from
Packages.

The container image is published on every release. In addition to the specific
release tag, the image is also tagged as latest, i.e. zsv:latest always
points to the latest released version.

Example:

$ docker pull ghcr.io/liquidaty/zsv
# ...
$ cat worldcitiespop_mil.csv | docker run -i ghcr.io/liquidaty/zsv count
1000000

For image details, see Dockerfile. You may use this as a
baseline for your own use cases as needed.

GitHub Actions

In a GitHub Actions workflow, you can use zsv/setup-action
to set up zsv+zsvlib:

- name: Set up zsv+zsvlib
  uses: liquidaty/zsv/setup-action@main

See zsv/setup-action/README for more details.

From source

See BUILD.md for more details.

Why another CSV parser/utility?

Our objectives, which we were unable to find in a pre-existing project, are:

  • Reasonably high performance
  • Runs on any platform, including web assembly
  • Available as both a library and a standalone executable / command-line
    interface utility (CLI)
  • Memory-efficient, configurable resource limits
  • Handles real-world CSV cases the same way that Excel does, including all edge
    cases (quote handling, newline handling (either \n or \r), embedded
    newlines, abnormal quoting e.g. aaa"aaa,bbb…)
  • Handles other “dirty” data issues:
    • Assumes valid UTF8, but does not misbehave if input contains bad UTF8
    • Option to specify multi-row headers
    • Does not assume or stop working in case of inconsistent numbers of columns
  • Easy to use library or extend/customize CLI

There are several excellent tools that achieve high performance. Among those we
considered were xsv and tsv-utils. While they met our performance objective,
both were designed primarily as utilities rather than libraries, and for our
needs they were not easy enough to customize or to extend with modular
customizations that could be maintained (or licensed) independently of the
related project. They are also written in Rust and D, respectively, languages
with which we lacked deep experience, especially for web assembly targeting.

Others we considered were Miller (mlr), csvkit and Go (csv module), which
did not meet our performance objective. We also considered various other
libraries using SIMD for CSV parsing, but none that we tried met the “real-world
CSV” objective.

Hence, zsv was created as a library and a versatile application, both optimized
for speed and ease of development for extending and/or customizing to your
needs.

Batteries included

zsv comes with several built-in commands:

  • sheet: an in-console, interactive grid viewer
  • echo: read CSV from stdin and write it back out to stdout. This is mostly
    useful for demonstrating how to use the API and also how to create a plug-in,
    and has several uses beyond that including adding/removing BOM, cleaning up
    bad UTF8, whitespace or blank column trimming, limiting output to a contiguous
    data block, skipping leading garbage, and even providing substitution values
    without modifying the underlying source
  • select: re-shape CSV by skipping leading garbage, combining header rows into
    a single header, selecting or excluding specified columns, removing duplicate
    columns, sampling, converting from fixed-width input, searching and more
  • sql: treat one or more CSV files like database tables and query with SQL
  • desc: provide a quick description of your table data
  • pretty: format for console (fixed-width) display, or convert to markdown
    format
  • 2json: convert CSV to JSON. Optionally, output in
    database schema
  • 2tsv: convert to TSV (tab-delimited) format
  • compare: compare two or more tables of data and output the differences
  • paste (alpha): horizontally paste two tables together (given inputs X and Y,
    output 1…N rows where each row contains the entire corresponding
    row in X followed by the entire corresponding row in Y)
  • serialize (inverse of flatten): convert an NxM table to a single 3x (Nx(M-1))
    table with columns: Row, Column Name, Column Value
  • flatten (inverse of serialize): flatten a table by combining rows that share
    a common value in a specified identifier column
  • stack: merge CSV files vertically
  • jq: run a jq filter
  • 2db: convert from JSON to sqlite3 db
  • overwrite: overwrite a cell value; changes will be reflected in any zsv
    command when the --apply-overwrites option is specified
  • prop: view or save parsing options associated with a file, such as initial
    rows to ignore, or header row span. Saved options are applied by default
    when processing that file.

Each of these can also be built as an independent executable named zsv_xxx
where xxx is the command name.
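
To illustrate the reshaping that serialize is described as performing, here is
a small sketch over an in-memory table. It does not use or reproduce the zsv
implementation; serialize_table is a hypothetical name, and the sketch assumes
(based on the M-1 in the description above) that the first column identifies
the row and that no value needs quoting.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Illustrative only: turn a row-major table of strings (row 0 = header) into
 * CSV lines of the form Row,Column Name,Column Value, written to out.
 * Assumption: column 0 identifies the row, so each data row yields cols-1
 * output lines. Returns 0 on success, or -1 if out is too small. */
static int serialize_table(const char *const *cells, size_t rows, size_t cols,
                           char *out, size_t out_size) {
  int n = snprintf(out, out_size, "Row,Column Name,Column Value\n");
  if (n < 0 || (size_t)n >= out_size)
    return -1;
  size_t used = (size_t)n;

  for (size_t r = 1; r < rows; r++) {   /* skip the header row */
    for (size_t c = 1; c < cols; c++) { /* skip the row-id column */
      n = snprintf(out + used, out_size - used, "%s,%s,%s\n",
                   cells[r * cols],      /* row identifier */
                   cells[c],             /* column name from the header */
                   cells[r * cols + c]); /* cell value */
      if (n < 0 || (size_t)n >= out_size - used)
        return -1;
      used += (size_t)n;
    }
  }
  return 0;
}
```

For a 3-column table with header id,name,qty and data rows 1,apple,3 and
2,pear,5, this produces four data lines: 1,name,apple then 1,qty,3 then
2,name,pear then 2,qty,5. Going the other way, flatten groups such lines back
into one row per identifier.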

Running the CLI

After installing, run zsv help to see usage details. The typical syntax is
zsv <command> <parameters> e.g.

zsv sql my_population_data.csv "select * from data where population > 100000"

Using the API

Simple API usage examples include:

Pull parsing:

zsv_parser parser = zsv_new(...);
while (zsv_next_row(parser) == zsv_status_row) { // for each row
  // ...
  size_t cell_count = zsv_cell_count(parser);
  for (size_t i = 0; i < cell_count; i++) { // for each cell
    struct zsv_cell c = zsv_get_cell(parser, i);
    fprintf(stderr, "Cell: %.*s\n", (int)c.len, c.str); // %.*s expects an int precision
    // ...
  }
}

Push parsing:

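#include <zsv.h>

// Called by the parser once for each parsed row
static void my_row_handler(void *ctx) {
  zsv_parser p = ctx;
  size_t cell_count = zsv_cell_count(p);
  for (size_t i = 0; i < cell_count; i++) { // for each cell
    // ...
  }
}

int main() {
  zsv_parser p = zsv_new(NULL);
  zsv_set_row_handler(p, my_row_handler);
  zsv_set_context(p, p); // pass the parser itself as the handler's context
  while (zsv_parse_more(p) == zsv_status_ok) // parse until input is exhausted
    ;
  zsv_finish(p); // process any remaining buffered data
  zsv_delete(p); // release the parser
  return 0;
}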

Full application code examples can be found at
examples/lib/README.md.

An example of using the API, compiled to wasm and called via Javascript, is in
examples/js/README.md.

For more sophisticated (but at this time, only sporadically
commented/documented) use cases, see the various CLI C source files in the app
directory such as app/serialize.c.

Creating your own extension

You can extend zsv by providing a pre-compiled shared or static library that
defines the functions specified in extension_template.h and which zsv loads
in one of three ways:

  • as a static library that is statically linked at compile time
  • as a dynamic library that is linked at compile time and located in any library
    search path
  • as a dynamic library that is located in the same folder as the zsv
    executable and loaded at runtime if/as/when the custom mode is invoked

Example and template

You can build and run a sample extension by running make test from
app/ext_example.

The easiest way to implement your own extension is to copy and customize the
template files in app/ext_template.

Current release limitations

This release does not yet implement the full range of core features that are
planned for implementation prior to beta release. If you are interested in
helping, please post an issue.

Possible enhancements and related developments

  • online “playground” (soon to be released)
  • optimize search; add search with hyperscan or re2 regex matching, possibly
    parallelize?
  • optional OpenMP or other multi-threading for row processing
  • auto-generated documentation, and better documentation in general
  • Additional benchmarking. Would be great to use
    https://bitbucket.org/ewanhiggs/csv-game/src/master/ as a springboard to
    benchmarking a number of various tasks
  • encoding conversion e.g. UTF16 to UTF8

Contribute

  • Fork the project.
  • Check out the latest main branch.
  • Create a feature or bugfix branch from main.
  • Make your required changes.
  • Make sure to run clang-format (version 15 or later) for C source updates.
  • Commit and push your changes.
  • Submit the PR.

License

MIT