Tools for Data Manipulation in R
An implementation of common higher order functions with syntactic
sugar for anonymous function. Provides also a link to ‘dplyr’ and
‘data.table’ for common transformations on data frames to work around non
standard evaluation by default.
remotes::install_github("wahani/dat")
install.packages("dat")
R CMD check
. And you don’t like that.dplyr
is not respecting the class of the object it operates on; the classdplyr
nor data.table
are playing nice with S4, but you really,rlist
and purrr
.The examples are from the introductory vignette of dplyr
. You still work with
data frames: so you can simply mix in dplyr features whenever you need them.
library("nycflights13")
library("dat")
We can use mutar
to select rows. When you
reference a variable in the data frame, you can indicate this by using a one
sided formula.
mutar(flights, ~ month == 1 & day == 1)
mutar(flights, ~ 1:10)
And for sorting:
mutar(flights, ~ order(year, month, day))
## # A tibble: 336,776 x 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 542 540 2 923 850
## 4 2013 1 1 544 545 -1 1004 1022
## 5 2013 1 1 554 600 -6 812 837
## 6 2013 1 1 554 558 -4 740 728
## 7 2013 1 1 555 600 -5 913 854
## 8 2013 1 1 557 600 -3 709 723
## 9 2013 1 1 557 600 -3 838 846
## 10 2013 1 1 558 600 -2 753 745
## # … with 336,766 more rows, and 11 more variables: arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
You can use characters, logicals, regular expressions and functions to select
columns. Regular expressions are indicated by a leading “^”.
flights %>%
extract(c("year", "month", "day")) %>%
extract("^day$") %>%
extract(is.numeric)
The main difference between dplyr::mutate
and mutar
is that you use a ~
instead of =
.
mutar(
flights,
gain ~ arr_delay - dep_delay,
speed ~ distance / air_time * 60
)
Grouping data is handled within mutar
:
mutar(flights, n ~ .N, by = "month")
mutar(flights, delay ~ mean(dep_delay, na.rm = TRUE), by = "month")
You can also provide additional arguments to a formula. This is especially
helpful when you want to pass arguments from a function to such expressions. The
additional augmentation can be anything which you can use to select columns
(character, regular expression, function) or a named list where each element is
a character.
mutar(
flights,
.n ~ mean(.n, na.rm = TRUE) | "^.*delay$",
.x ~ mean(.x, na.rm = TRUE) | list(.x = "arr_time"),
by = "month"
)
## # A tibble: 336,776 x 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <dbl> <int>
## 1 2013 1 1 517 515 10.0 1523. 819
## 2 2013 1 1 533 529 10.0 1523. 830
## 3 2013 1 1 542 540 10.0 1523. 850
## 4 2013 1 1 544 545 10.0 1523. 1022
## 5 2013 1 1 554 600 10.0 1523. 837
## 6 2013 1 1 554 558 10.0 1523. 728
## 7 2013 1 1 555 600 10.0 1523. 854
## 8 2013 1 1 557 600 10.0 1523. 723
## 9 2013 1 1 557 600 10.0 1523. 846
## 10 2013 1 1 558 600 10.0 1523. 745
## # … with 336,766 more rows, and 11 more variables: arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
Using this package you can create S4 classes to contain a data frame (or a
data.table) and use the interface to dplyr
. Both dplyr
and data.table
do
not support integration with S4. The main function here is mutar
which is
generic enough to link to subsetting of rows and cols as well as mutate and
summarise. In the background dplyr
s ability to work on a data.table
is being
used.
library("data.table")
setClass("DataTable", "data.table")
DataTable <- function(...) {
new("DataTable", data.table::data.table(...))
}
setMethod("[", "DataTable", mutar)
dtflights <- do.call(DataTable, nycflights13::flights)
dtflights[1:10, c("year", "month", "day")]
dtflights[n ~ .N, by = "month"]
dtflights[n ~ .N, sby = "month"]
dtflights %>%
filtar(~month > 6) %>%
mutar(n ~ .N, by = "month") %>%
sumar(n ~ data.table::first(n), by = "month")
Inspired by rlist
and purrr
some low level operations on vectors are
supported. The aim here is to integrate syntactic sugar for anonymous functions.
Furthermore the functions should support the use of pipes.
map
and flatmap
as replacements for the apply functionsextract
for subsettingreplace
for replacing elements in a vectorWhat we can do with map:
map(1:3, ~ .^2)
flatmap(1:3, ~ .^2)
map(1:3 ~ 11:13, c) # zip
dat <- data.frame(x = 1, y = "")
map(dat, x ~ x + 1, is.numeric)
What we can do with extract:
extract(1:10, ~ . %% 2 == 0) %>% sum
extract(1:15, ~ 15 %% . == 0)
l <- list(aList = list(x = 1), aAtomic = "hi")
extract(l, "^aL")
extract(l, is.atomic)
What we can do with replace:
replace(c(1, 2, NA), is.na, 0)
replace(c(1, 2, NA), rep(TRUE, 3), 0)
replace(c(1, 2, NA), 3, 0)
replace(list(x = 1, y = 2), "x", 0)
replace(list(x = 1, y = 2), "^x$", 0)
replace(list(x = 1, y = "a"), is.character, NULL)