Fast reading of delimited files
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "man/figures/README-",
out.width = "100%"
)
options(tibble.print_min = 3)
tm <- vroom::vroom(system.file("bench", "taxi.tsv", package = "vroom"))
versions <- vroom::vroom(system.file("bench", "session_info.tsv", package = "vroom"))
# Use the base version number for read.delim
versions$package[versions$package == "base"] <- "read.delim"
library(dplyr)
tbl <- tm %>% filter(type == "real", op == "read", reading_package %in% c("data.table", "readr", "read.delim") | manip_package == "base") %>%
rename(package = reading_package) %>%
left_join(versions) %>%
transmute(
package = package,
version = ondiskversion,
"time (sec)" = time,
speedup = max(time) / time,
"throughput" = paste0(prettyunits::pretty_bytes(size / time), "/sec")
) %>%
arrange(desc(speedup))
The fastest delimited reader for R, `r filter(tbl, package == "vroom") %>% pull("throughput") %>% trimws()`.
But that’s impossible! How can it be so fast?
vroom doesn’t stop to actually read all of your data; it simply indexes where each record is located so it can be read later.
The vectors returned use the Altrep framework to lazily load the data on demand when it is accessed, so you only pay for what you use.
This lazy access is done automatically, so no changes to your R data-manipulation code are needed.
vroom also uses multiple threads for indexing, materializing non-character columns, and writing, which further improves performance.
knitr::kable(tbl, digits = 2, align = "lrrrr")
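The lazy reading described above can be sketched with a small file. This is a minimal illustration, not a benchmark: the temporary file and the use of mtcars are assumptions made for the example, and `altrep = TRUE` is passed explicitly so that numeric columns are lazy as well.

```r
library(vroom)

# Write a small example file to a temporary location (illustrative data).
path <- tempfile(fileext = ".tsv")
vroom_write(mtcars, path, delim = "\t")

# Reading indexes the file; with altrep = TRUE the columns are
# Altrep vectors that are parsed only when they are accessed.
df <- vroom(path, delim = "\t", altrep = TRUE, show_col_types = FALSE)

# Touching one column materializes that column only.
mean_mpg <- mean(df$mpg)
```

Ordinary code such as `mean(df$mpg)` works unchanged; the laziness is not visible to the caller.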
vroom has nearly all of the parsing features of
readr for delimited and fixed-width files, including column selection like
dplyr::select().
* these are additional features not in readr.
** requires num_threads = 1.
Install vroom from CRAN with:
install.packages("vroom")
Alternatively, if you need the development version from
GitHub, install it with:
# install.packages("pak")
pak::pak("tidyverse/vroom")
See getting started
to jump start your use of vroom!
vroom uses the same interface as readr to specify column types.
tibble::rownames_to_column(mtcars, "model") %>%
vroom::vroom_write("mtcars.tsv", delim = "\t")
vroom::vroom("mtcars.tsv",
  col_types = list(cyl = "i", gear = "f", hp = "i", disp = "_",
                   drat = "_", vs = "l", am = "l", carb = "i")
)
unlink("mtcars.tsv")
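The same kind of specification can also be written with the readr-style `cols()` and `col_*()` helpers instead of single-letter codes. The sketch below assumes a temporary file and an arbitrary subset of columns chosen for illustration; columns not named in the spec are guessed.

```r
library(vroom)

# Write mtcars with its row names as a "model" column (illustrative data).
path <- tempfile(fileext = ".tsv")
vroom_write(tibble::rownames_to_column(mtcars, "model"), path, delim = "\t")

# Long-form equivalents of "i", "f", "_" and "l".
spec <- cols(
  cyl  = col_integer(),
  gear = col_factor(),
  disp = col_skip(),
  am   = col_logical()
)

# altrep = FALSE reads everything eagerly so the file can be removed safely.
cars <- vroom(path, delim = "\t", col_types = spec, altrep = FALSE)
unlink(path)
```

The long form is more verbose, but it makes room for helper arguments such as factor levels or date formats.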
vroom natively supports reading from multiple files (or even multiple
connections!).
First we generate some files to read by splitting the flights dataset from
nycflights13 by airline.
For the sake of the example, we’ll keep only the first 2 rows for each airline.
library(nycflights13)
purrr::iwalk(
split(flights, flights$carrier),
~ { .x$carrier[[1]]; vroom::vroom_write(head(.x, 2), glue::glue("flights_{.y}.tsv"), delim = "\t") }
)
Then we can efficiently read them into one tibble by passing the filenames directly to vroom.
The id argument can be used to request a column that reveals the filename that each row originated from.
files <- fs::dir_ls(glob = "flights*tsv")
files
vroom::vroom(files, id = "source")
fs::file_delete(files)
The speed quoted above is from a real `r format(fs::fs_bytes(tm$size[[1]]))` dataset with `r format(tm$rows[[1]], big.mark = ",")` rows and `r tm$cols[[1]]` columns.
See the benchmark article for full details of the dataset and bench/ for the code
used to retrieve the data and perform the benchmarks.
In addition to the arguments to the vroom() function, you can control the
behavior of vroom with a few environment variables. Most users will not need
to set these.
VROOM_TEMP_PATH - Path to the directory used to store temporary files when reading from an R connection. If unset, it defaults to the R session’s temporary directory (tempdir()).
VROOM_THREADS - The number of processor threads to use when indexing and parsing. If unset, it defaults to parallel::detectCores().
VROOM_SHOW_PROGRESS - Whether to show the progress bar when indexing.
VROOM_CONNECTION_SIZE - The size (in bytes) of the connection buffer when reading from connections.
VROOM_WRITE_BUFFER_LINES - The number of lines to use for each buffer when writing files.

There is also a family of variables to control use of the Altrep framework.
For versions of R where the Altrep framework is unavailable (R < 3.5.0) they
are automatically turned off and the variables have no effect. The variables
can take one of true, false, TRUE, FALSE, 1, or 0.
VROOM_USE_ALTREP_NUMERICS - If set, use Altrep for all numeric types (default false).

There are also individual variables for each type. Currently only VROOM_USE_ALTREP_CHR defaults to true.
VROOM_USE_ALTREP_CHR
VROOM_USE_ALTREP_FCT
VROOM_USE_ALTREP_INT
VROOM_USE_ALTREP_BIG_INT
VROOM_USE_ALTREP_DBL
VROOM_USE_ALTREP_NUM
VROOM_USE_ALTREP_LGL
VROOM_USE_ALTREP_DTTM
VROOM_USE_ALTREP_DATE
VROOM_USE_ALTREP_TIME
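As a rough sketch of how these variables might be used, they can be set from within an R session before calling vroom. The values below are illustrative assumptions rather than recommendations, and the temporary file exists only to give vroom something to read.

```r
library(vroom)

# Illustrative values only; most users can leave these unset.
Sys.setenv(
  VROOM_THREADS = "2",                # cap the threads used by vroom
  VROOM_SHOW_PROGRESS = "false",      # hide the progress bar
  VROOM_USE_ALTREP_NUMERICS = "true"  # use Altrep for numeric columns too
)

# Calls to vroom made after this point see the settings above.
path <- tempfile(fileext = ".tsv")
vroom_write(mtcars, path, delim = "\t")
cars <- vroom(path, delim = "\t", show_col_types = FALSE)
n_rows <- nrow(cars)

# Unset the variables again to return to the defaults.
Sys.unsetenv(c("VROOM_THREADS", "VROOM_SHOW_PROGRESS", "VROOM_USE_ALTREP_NUMERICS"))
```

To make a setting persistent, the same name and value pairs can instead be placed in an `.Renviron` file.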
RStudio’s environment pane calls object.size() when it refreshes the pane, which
for Altrep objects can be extremely slow. RStudio 1.2.1335+ includes the fixes
(RStudio#4210,
RStudio#4292) for this issue,
so it is recommended you use at least that version.
data.table::fread() is blazing fast and was great motivation to see how fast we could go!