yg: an r package on file based database management.


output:
md_document:
variant: markdown_github

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "README-"
)

yg

yg implements a set of database management tools for communicating with SQLite or ODBC server such as Microsoft Azure SQL.

data.table tools:

  • CJ.dt: extends data.table::CJ to data.table. CJ.dt(X, Y) require both X and Y be data.table and generate a data.table with full combination of each row of X w.r.t each row of Y.
library(yg)

X <- data.table(a = c(TRUE, FALSE, FALSE), b = c(11L, 22L, 34L))
Y <- data.table(c = c(5.80, 13.21, 34.55), d = c("a", "B", "C"))

(Z <- CJ.dt(X, Y))
  • dtFillNA: fill NA in data.table in situ, allow fill different values for numeric NA_real_, integer NA_integer_ and character NA_character_.
X <- data.table(a = c(TRUE, FALSE, FALSE), b = c(11L, NA_integer_, 34L))
Y <- data.table(c = c(5.80, 13.21, NA_real_), d = c("a", NA_character_, "C"))

(Z <- CJ.dt(X, Y))

dtFillNA(dt = Z, fillNA = 34.54, fillNA_I = 23L, fillNA_C = "b")
  • dtRfrhTb: update primary data.table (dt) with a reference data.table (tb) based on matching id value on id columns, and update primary data.table vd value on vd columns using refernce data.table vd valude on vd columns.

    One can specify when id in dt not find a match in tb, should the vd keep intact or use some pre-defined value. Similary, one can specify when id not in dt but find in tb, should it be added into dt or not. Both id and vd can be multiple columns. Value update operations are in situ in default for space and efficiency, e.g., the dt input is being changed in place without copying. This feature can be turned off with in_situ = FALSE. Insertion operations when nofound = TRUE are not in situ, so one still need to assign the result back - would prefer also in situ?

    This function is inspired by a task on updating sale and inventory. The primary data.table (dt) keeps historical sale and inventory up to yesterday, and the reference data.table (tb) contains today’s sale and inventory. Every day, matched id should get updated sale and inventory vd from tb. When id in dt does not find a match in tb, sale should be set to 0 with nomatch = 0, and inventory should keep intact. When id not in dt is shown in tb, new products added in catalogue, and new id should be inserted into primary dt with nofound = TRUE.


# dtRfrhTb

# master table (dHist): fulfillment center code (fcc), stock keeping unit (sku), 
# begin quantity (bgn), end quantity (end), and sales as shipped quantity (spt).
# note: bgn != end + spt because change in inventory such as new stock comes in, 
# write off and etc. can often happen.

set.seed(285714L)

(dHist <- data.table(
  fcc = c(rep("NJ", 3L), rep("TX", 4L)),
  sku = c(paste0("A", 1L:3L), paste0("A", 1L:4L)),
  bgn = sample(40L, 7L),
  end = sample(20L, 7L),
  spt = sample(10L, 7L)
))

# refresh master table with new day sale and inventory change data - day by day.
# sale and inventory might not match as order - packing - shipping in processing

(dSale <- data.table(
  fcc = c("NJ", "NJ", "TX", "TX"),
  sku = c("A1", "A4", "A1", "A4"),
  spt = c(1L, 2L, 3L, 4L)
))

(dInvt <- data.table(
  fcc = c("NJ", "NJ", "TX", "TX"),
  sku = c("A1", "A4", "A4", "A5"),
  bgn = sample(10L, 4L),
  end = sample(10L, 4L)
))

# refresh master tbl: use fcc and sku together as id when id nomatch set vd to 0 
# as no sale, and when id nofound in dt (dHist) will be added as new sku in fcc.

(dHist <- dtRfrhTb(
  dt = dHist, tb = dSale, id = c("fcc", "sku"), vd = "spt", 
  nomatch = 0L, nofound = TRUE, in_situ = TRUE
))

(dHist <- dtRfrhTb(
  dt = dHist, tb = dInvt, id = c("fcc", "sku"), vd = c("bgn", "end"),
  nomatch = NULL, nofound = TRUE, in_situ = TRUE
))

dtFillNA(dt = dHist, fillNA = 0)

sqlite_<> tools:

  • sqlite_<> tools realize all database managements on a file based concept - one need only provide a .db file as db for input, instead of a connection to dbfile.

  • sqlite_refreshtb and sqlite_subsetidx worth a separate mention:

    • sqlite_refreshtb: refreshtb will update value when primary key exist, and also insert new row when primary key not exist.

    • sqlite_subsetidx: given a r data.table (dt) with id column, fetch from sqlite database (db) table (tb) based on id for all rows - works efficient when dt and tb both have millions or hundreds of millions of rows.

sqodbc_<> tools:

  • sqodbc_<> tools realize all database managements on file based concept - one need only provide a .db list as db for input, instead of a connection to RODBC server. A .db list should contains all information for establishing a connection, e.g., it should be a list of srv (sql server address/ip), usr (username) and pwd (password), or winAuth (default is FALSE and require usr and pwd, specify TRUE when connection will be established using windows authentication and not require usr and pwd), and dbn (database name in sql server, e.g., master). A dsn (database server name - as in windows odbc data source administrator) was used in earlier version, but removed in current version, instead using srv, usr, pwd (or winAuth) and dbn - in this way, it is self-contained in R, no need to specify dsn in system settings.

# sqodbc db list prototype

sqodbc_dblist <-function() {

  list(
    
    # dsn = 'databaseSeverNameInWinODBC',

    srv = 'one-sql.database.windows.net',

    usr = 'username',

    pwd = 'password',

    dbn = 'database-in-sql-server'

  )

}

sqodbc_dblist_with_winAuth <-function() {

  list(

    # dsn = 'databaseSeverNameInWinODBC',
    
    srv = 'one-sql.database.windows.net',

    dbn = 'database-in-sql-server',
    
    winAuth = TRUE

  )

}


# list all tables in database in sql-server
sqodbc_executeqy(
  qy = "select * from INFORMATION_SCHEMA.TABLES", db = sqodbc_dblist()
)

# use windows authentication for connection
sqodbc_executeqy(
  qy = "select * from INFORMATION_SCHEMA.TABLES", db = sqodbc_dblist_with_winAuth()
)
  • sqodbc_refreshtb and sqodbc_subsetidx are corresponding implementation of sqlite_refreshtb and sqlite_subsetidx.

bcp tools - often much faster than sqodbc_<> equivalents:

  • bcp_azure_table: wrapper around bcp out - download sql table into local file. An input db should be a list of srv, usr and pwd or winAuth, and dbn - same as db list used in sqodbc tools.

  • bcp_azure_query: wrapper around bcp queryout - download query result into local file.

  • bcp_azure_inrdt: wrapper around bcp in - upload r data.table or data.frame into sql server. R data.table and data.frame should not have columns of lists. Function can auto detect R data type, integer, numeric and character and specify corresponding format when uploading.

    TODO: should add logical in auto data type determination: det_col_type.