Multithreaded header-only C++11 CSV parser
NanoCSV is a fast, multithreaded, header-only C++11 CSV parser that depends only on the STL.
NanoCSV is designed for CSV data with numeric values.
NanoCSV is still in development.
Using it in production is not recommended at the moment.
Requires a C++11 compiler (with thread support).
// Define this in exactly one C++ file.
#define NANOCSV_IMPLEMENTATION
#include "nanocsv.h"

#include <cstdlib>   // std::atoi, EXIT_SUCCESS, EXIT_FAILURE
#include <iostream>
#include <string>

int main(int argc, char **argv)
{
  if (argc < 2) {
    // Print usage, then continue with the default input file below.
    std::cout << "csv_parser_example input.csv (num_threads) (delimiter)\n";
  }

  std::string filename("./data/array-4-5.csv");
  int num_threads = -1; // -1 = use all system threads
  char delimiter = ' '; // delimiter character.

  if (argc > 1) {
    filename = argv[1];
  }
  if (argc > 2) {
    num_threads = std::atoi(argv[2]);
  }
  if (argc > 3) {
    delimiter = argv[3][0];
  }

  nanocsv::ParseOption<float> option;
  option.delimiter = delimiter;
  option.req_num_threads = num_threads;
  option.verbose = true; // verbose messages will be stored in `warn`.
  option.ignore_header = true; // Parse header (the first line). default = true.

  std::string warn;
  std::string err;
  nanocsv::CSV<float> csv;

  bool ret = nanocsv::ParseCSVFromFile(filename, option, &csv, &warn, &err);

  if (!warn.empty()) {
    std::cout << "WARN: " << warn << "\n";
  }

  if (!ret) {
    if (!err.empty()) {
      std::cout << "ERROR: " << err << "\n";
    }
    return EXIT_FAILURE;
  }

  std::cout << "num records(rows) = " << csv.num_records << "\n";
  std::cout << "num fields(columns) = " << csv.num_fields << "\n";

  // values are stored as a 1D array of length [num_records * num_fields]
  // std::cout << csv.values[4 * csv.num_fields + 3] << "\n";

  // header strings are stored in `csv.header`
  if (!option.ignore_header) {
    for (size_t i = 0; i < csv.header.size(); i++) {
      std::cout << csv.header[i] << "\n";
    }
  }

  return EXIT_SUCCESS;
}
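The commented-out access `csv.values[4 * num_fields + 3]` in the example suggests that values are laid out row by row (row-major). The following is a minimal sketch of indexing such a flat array, under that assumption. `FlatTable`, `At` and `ColumnSum` are hypothetical names used only for illustration; they are not part of nanocsv, and the struct only mirrors the three members used in the example above.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a parsed result (NOT a nanocsv type).
// It only mirrors the members used in the example: values,
// num_records and num_fields.
struct FlatTable {
  std::vector<float> values;    // length = num_records * num_fields
  std::size_t num_records = 0;  // rows
  std::size_t num_fields = 0;   // columns
};

// Returns the value at (row, col), assuming row-major layout.
inline float At(const FlatTable &t, std::size_t row, std::size_t col) {
  return t.values[row * t.num_fields + col];
}

// Sums one column by striding through the flat array.
inline double ColumnSum(const FlatTable &t, std::size_t col) {
  double sum = 0.0;
  for (std::size_t r = 0; r < t.num_records; r++) {
    sum += At(t, r, col);
  }
  return sum;
}
```

With this layout, reading a whole row touches contiguous memory, while reading a column strides by `num_fields` elements.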
nanocsv supports parsing
nan, -nan as NaN, -NaN
inf, -inf as Inf, -Inf
By default, missing values (e.g. N/A, including invalid numeric strings, and NaN) are replaced by nan, and null (empty) values (e.g. "") are replaced by nan.
You can control this behavior with the following parameters in ParseOption.
replace_na: Replace N/A and NaN values?
na_value: The value that replaces N/A and NaN values
replace_null: Replace null (empty) values?
null_value: The value that replaces null values
Parsing text CSV (each field is just a string) is also supported.
(Use a different API. See the source code for details.)
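To make the replacement rules for numeric fields concrete, here is a minimal sketch of how a single field could be converted under the four options. This is not nanocsv's implementation: `ReplaceOptions` and `ConvertField` are hypothetical names, the sketch uses `std::strtof` rather than nanocsv's own number parser, and returning `false` when a replacement is disabled is an assumption, since the behavior in that case is not described here.

```cpp
#include <cmath>
#include <cstdlib>
#include <limits>
#include <string>

// Hypothetical mirror of the four options described above
// (NOT the real nanocsv::ParseOption).
struct ReplaceOptions {
  bool replace_na = true;
  float na_value = std::numeric_limits<float>::quiet_NaN();
  bool replace_null = true;
  float null_value = std::numeric_limits<float>::quiet_NaN();
};

// Converts one field to float.
// Assumption: returns false when the field is empty or invalid and
// the matching replacement is disabled.
inline bool ConvertField(const std::string &field, const ReplaceOptions &opt,
                         float *out) {
  if (field.empty()) {  // null (empty) value
    if (!opt.replace_null) return false;
    *out = opt.null_value;
    return true;
  }

  // std::strtof accepts "nan", "-nan", "inf" and "-inf" (case-insensitive).
  char *end = nullptr;
  const float v = std::strtof(field.c_str(), &end);
  const bool fully_parsed = (end != field.c_str()) && (*end == '\0');

  if (!fully_parsed) {  // e.g. "N/A" or another invalid numeric string
    if (!opt.replace_na) return false;
    *out = opt.na_value;
    return true;
  }

  if (std::isnan(v)) {  // a literal NaN in the input
    *out = opt.replace_na ? opt.na_value : v;
    return true;
  }

  *out = v;  // ordinary number, Inf or -Inf
  return true;
}
```

Setting `na_value` and `null_value` to different sentinels (for example -1 and 0) makes it possible to tell missing fields from empty ones after parsing. Note that `std::strtof` is locale-dependent and also accepts leading whitespace, which a dedicated CSV number parser may treat differently.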
nanocsv.h. This is useful when you want to include Ryu header files outside of nanocsv.h.
#)
3.0 + 4.2j)
#INF, #NAN.
pow) dependency.
Dataset is 8192 x 4096, 800 MB in file size (generated by tools/gencsv/gen.py).
total parsing time: 3833.33 ms
line detection : 1264.99 ms
alloc buf : 0.016351 ms
parse : 2508.83 ms
construct : 55.726 ms
total parsing time: 545.646 ms
line detection : 159.078 ms
alloc buf : 0.077979 ms
parse : 337.207 ms
construct : 46.7815 ms
The following run uses 23 threads, since 23 threads are faster than 32 threads on the 1950X.
total parsing time: 494.849 ms
line detection : 127.176 ms
alloc buf : 0.050988 ms
parse : 314.287 ms
construct : 50.7568 ms
Roughly 7.7 times faster than single-threaded parsing (3833.33 ms versus 494.849 ms).
Memory usage has not been measured, but it should not exceed 3 * filesize, so about 2.4 GB for this dataset.
Loading the same data with numpy.loadtxt takes 23.4 secs.
23-threaded nanocsv parsing is more than 40 times faster than numpy.loadtxt (23.4 s versus about 0.49 s).
MIT License