yg: an R package for file-based database management.
yg implements a set of database management tools for communicating with SQLite or an ODBC server such as Microsoft Azure SQL.
CJ.dt: apply data.table::CJ to data.table. CJ.dt(X, Y) requires both X and Y to be data.table objects, and generates a data.table with the full cross join of each row of X with each row of Y.
library(yg)
library(data.table)
X <- data.table(a = c(TRUE, FALSE, FALSE), b = c(11L, 22L, 34L))
Y <- data.table(c = c(5.80, 13.21, 34.55), d = c("a", "B", "C"))
(Z <- CJ.dt(X, Y))
dtFillNA: fill NA in data.table in situ, allowing different fill values for numeric NA_real_, integer NA_integer_ and character NA_character_.
X <- data.table(a = c(TRUE, FALSE, FALSE), b = c(11L, NA_integer_, 34L))
Y <- data.table(c = c(5.80, 13.21, NA_real_), d = c("a", NA_character_, "C"))
(Z <- CJ.dt(X, Y))
dtFillNA(dt = Z, fillNA = 34.54, fillNA_I = 23L, fillNA_C = "b")
dtRfrhTb: update a primary data.table (dt) with a reference data.table (tb) based on matching id values on the id columns, updating the primary data.table's vd values on the vd columns using the reference data.table's vd values on the vd columns.
One can specify, when an id in dt does not find a match in tb, whether the vd should stay intact or be set to a pre-defined value. Similarly, one can specify, when an id is not in dt but is found in tb, whether it should be added into dt. Both id and vd can be multiple columns. Value update operations are in situ by default for space and efficiency, i.e., the dt input is changed in place without copying; this can be turned off with in_situ = FALSE. Insertion operations when nofound = TRUE are not in situ, so one still needs to assign the result back (ideally insertion would also be in situ).
This function was inspired by a task of updating sales and inventory. The primary data.table (dt) keeps historical sales and inventory up to yesterday, and the reference data.table (tb) contains today's sales and inventory. Every day, matched ids get updated sales and inventory vd from tb. When an id in dt does not find a match in tb, sales should be set to 0 with nomatch = 0, and inventory should stay intact. When an id not in dt shows up in tb - a new product added to the catalogue - the new id should be inserted into the primary dt with nofound = TRUE.
# dtRfrhTb
# master table (dHist): fulfillment center code (fcc), stock keeping unit (sku),
# begin quantity (bgn), end quantity (end), and sales as shipped quantity (spt).
# note: bgn != end + spt, because inventory changes such as new stock coming in,
# write-offs, etc. can often happen.
set.seed(285714L)
(dHist <- data.table(
fcc = c(rep("NJ", 3L), rep("TX", 4L)),
sku = c(paste0("A", 1L:3L), paste0("A", 1L:4L)),
bgn = sample(40L, 7L),
end = sample(20L, 7L),
spt = sample(10L, 7L)
))
# refresh master table with new-day sales and inventory change data - day by day.
# sales and inventory might not match, as order - packing - shipping may be in process
(dSale <- data.table(
fcc = c("NJ", "NJ", "TX", "TX"),
sku = c("A1", "A4", "A1", "A4"),
spt = c(1L, 2L, 3L, 4L)
))
(dInvt <- data.table(
fcc = c("NJ", "NJ", "TX", "TX"),
sku = c("A1", "A4", "A4", "A5"),
bgn = sample(10L, 4L),
end = sample(10L, 4L)
))
# refresh master tbl: use fcc and sku together as id; when id nomatch, set vd to 0
# as no sale; when id nofound in dt (dHist), it will be added as a new sku in fcc.
(dHist <- dtRfrhTb(
dt = dHist, tb = dSale, id = c("fcc", "sku"), vd = "spt",
nomatch = 0L, nofound = TRUE, in_situ = TRUE
))
(dHist <- dtRfrhTb(
dt = dHist, tb = dInvt, id = c("fcc", "sku"), vd = c("bgn", "end"),
nomatch = NULL, nofound = TRUE, in_situ = TRUE
))
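# new ids inserted with nofound = TRUE leave NA in columns absent from tb - fill with 0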
dtFillNA(dt = dHist, fillNA = 0)
sqlite_<> tools: sqlite_<> tools realize all database management on a file-based concept - one need only provide a path to the SQLite database file.
sqlite_refreshtb and sqlite_subsetidx are worth a separate mention (see the sketch below):
- sqlite_refreshtb: refreshtb updates values when the primary key exists, and inserts a new row when the primary key does not exist.
- sqlite_subsetidx: given an R data.table (dt) with id columns, fetch all matching rows from the SQLite database (db) table (tb) based on id - works efficiently even when dt and tb both have millions or hundreds of millions of rows.
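A minimal sketch of the two calls; the argument names db (SQLite database file path), tb (table name), dt and id are assumptions for illustration - consult the function documentation for the exact signatures.
# argument names below are hypothetical - see the function docs for exact signatures
# update matched primary keys in table "hist", and insert rows whose keys are new
sqlite_refreshtb(db = "yg.db", tb = "hist", dt = dHist)
# fetch from table "hist" only the rows whose (fcc, sku) ids appear in dSale
sqlite_subsetidx(db = "yg.db", tb = "hist", dt = dSale, id = c("fcc", "sku"))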
sqodbc_<> tools: sqodbc_<> tools realize all database management on the same file-based concept - one need only provide a db connection list, as in the prototypes below.
# sqodbc db list prototype
sqodbc_dblist <- function() {
  list(
    # dsn = 'databaseServerNameInWinODBC',
    srv = 'one-sql.database.windows.net',
    usr = 'username',
    pwd = 'password',
    dbn = 'database-in-sql-server'
  )
}
sqodbc_dblist_with_winAuth <- function() {
  list(
    # dsn = 'databaseServerNameInWinODBC',
    srv = 'one-sql.database.windows.net',
    dbn = 'database-in-sql-server',
    winAuth = TRUE
  )
}
# list all tables in database in sql-server
sqodbc_executeqy(
qy = "select * from INFORMATION_SCHEMA.TABLES", db = sqodbc_dblist()
)
# use windows authentication for connection
sqodbc_executeqy(
qy = "select * from INFORMATION_SCHEMA.TABLES", db = sqodbc_dblist_with_winAuth()
)
sqodbc_refreshtb and sqodbc_subsetidx are the corresponding implementations of sqlite_refreshtb and sqlite_subsetidx.
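The calls mirror the sqlite_<> sketch above, with the database file path replaced by the db connection list; the argument names are again assumptions for illustration.
# argument names below are hypothetical, mirroring the sqlite_<> sketch
sqodbc_refreshtb(db = sqodbc_dblist(), tb = "dbo.hist", dt = dHist)
sqodbc_subsetidx(db = sqodbc_dblist(), tb = "dbo.hist", dt = dSale, id = c("fcc", "sku"))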
bcp_azure_<> tools: wrappers around Microsoft's bcp (bulk copy program) utility, complementing their sqodbc_<> equivalents:
- bcp_azure_table: wrapper around bcp out - downloads a SQL table into a local file. The input db should be a list of srv, usr and pwd (or winAuth), and dbn - the same db list used in the sqodbc tools.
- bcp_azure_query: wrapper around bcp queryout - downloads a query result into a local file.
- bcp_azure_inrdt: wrapper around bcp in - uploads an R data.table or data.frame into the SQL server. The data.table or data.frame should not contain list columns. The function auto-detects the R data types - integer, numeric and character - and specifies the corresponding format when uploading.
TODO: add logical to the automatic data type determination in det_col_type.
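A minimal sketch of the three wrappers; the argument names tb, qy, dt and in particular fl (the local file) are assumptions for illustration, and the db list reuses the sqodbc_dblist() prototype above.
# argument names below are hypothetical - fl denotes the local file
bcp_azure_table(db = sqodbc_dblist(), tb = "dbo.hist", fl = "hist.csv")
bcp_azure_query(
  db = sqodbc_dblist(),
  qy = "select * from dbo.hist where fcc = 'NJ'", fl = "hist_nj.csv"
)
bcp_azure_inrdt(db = sqodbc_dblist(), dt = dHist, tb = "dbo.hist")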