mvgam

{mvgam} R 📦 to fit Dynamic Bayesian Generalized Additive Models for multivariate modeling and forecasting


output: github_document

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-",
  dev = "png",
  dpi = 150,
  fig.height = 6,
  fig.width = 9,
  out.width = "100%"
)

mvgam R package logoStan Logo

mvgam

MultiVariate (Dynamic) Generalized Additive Models

R-CMD-check
Coverage status
Documentation
Methods in Ecology & Evolution
CRAN Version
CRAN Downloads

The mvgam 📦 fits Bayesian Dynamic Generalized Additive Models (DGAMs) that can include highly flexible nonlinear predictor effects, latent variables and multivariate time series models. The package does this by relying on functionalities from the impressive brms{target=“_blank”} and mgcv{target=“_blank”} packages. Parameters are estimated using the probabilistic programming language Stan, giving users access to the most advanced Bayesian inference algorithms available. This allows mvgam to fit a very wide range of models, including:

Installation

Install the stable package version from CRAN using: install.packages('mvgam'), or install the latest development version from GitHub using: devtools::install_github("nicholasjclark/mvgam"). You will also need a working version of Stan installed (along with either rstan and/or cmdstanr). Please refer to installation links for Stan with rstan here{target=“_blank”}, or for Stan with cmdstandr here{target=“_blank”}.

Cheatsheet

 usage cheatsheet

A simple example

We can explore the package’s primary functions using one of it’s built-in datasets. Use plot_mvgam_series() to inspect features for the four time series from the Portal Project{target=“_blank”}, which represent captures of four desert rodent species over time (see ?portal_data for more details)

library(mvgam)
library(ggplot2)
theme_set(theme_bw(base_size = 12))
library(mvgam)
data(portal_data)
plot_mvgam_series(
  data = portal_data, 
  y = 'captures',
  series = 'all'
)
plot_mvgam_series(
  data = portal_data, 
  y = 'captures',
  series = 1
)
plot_mvgam_series(
  data = portal_data, 
  y = 'captures',
  series = 4
)

These plots show that the time series are count-responses, with missing data, seasonality and temporal autocorrelation all present. These features make time series analysis and forecasting very difficult if using conventional software and models. But mvgam shines in these tasks.

For most forecasting exercises, we’ll want to split the data into training and testing folds

data_train <- portal_data %>%
  dplyr::filter(time <= 60)
data_test <- portal_data %>%
  dplyr::filter(time > 60 &
                  time <= 65)

Formulate an mvgam model; this model fits a State-Space GAM in which each species has its own intercept, linear association with ndvi_ma12 and potentially nonlinear association with mintemp. These effects are estimated jointly with a full time series model for the temporal dynamics (in this case an Vector Autoregressive process). We assume the outcome follows a Poisson distribution and will condition the model in Stan using MCMC sampling with the Cmdstan interface:

mod <- mvgam(
  # Observation model is empty as we don't have any
  # covariates that impact observation error
  formula = captures ~ 0,
  
  # Process model contains varying intercepts, 
  # varying slopes of ndvi_ma12 and varying smooths 
  # of mintemp for each series. 
  # Temporal dynamics are modelled with a Vector 
  # Autoregression (VAR(1))
  trend_formula = ~ 
    trend +
    s(trend, bs = 're', by = ndvi_ma12) +
    s(mintemp, bs = 'bs', by = trend) - 1,
  trend_model = VAR(cor = TRUE),
  
  # Obvservations are conditionally Poisson
  family = poisson(),
  priors = c(prior(normal(0, 2),
                 class = b),
             prior(exponential(2.5),
                   class = sigma)),
  
  # Condition on the training data
  data = data_train,
  control = list(adapt_delta = 0.99),
  burnin = 1500
)
mod <- mvgam(
  # Observation model is empty as we don't have any
  # covariates that impact observation error
  formula = captures ~ 0,
  
  # Process model contains varying intercepts, 
  # varying slopes of ndvi_ma12 and varying smooths 
  # of mintemp for each series. 
  # Temporal dynamics are modelled with a Vector 
  # Autoregression (VAR(1))
  trend_formula = ~ 
    trend +
    s(trend, bs = 're', by = ndvi_ma12) +
    s(mintemp, bs = 'bs', by = trend) - 1,
  trend_model = VAR(cor = TRUE),
  
  # Obvservations are conditionally Poisson
  family = poisson(),

  # Condition on the training data
  data = data_train,
  backend = 'cmdstanr'
)

Using print() will return a quick summary of the object:

mod

Split Rhat and effective sample size diagnostics show good convergence of the model estimates

mcmc_plot(mod, 
          type = 'rhat_hist')
mcmc_plot(mod, 
          type = 'neff_hist')

Use conditional_effects() for a quick visualisation of the main terms in model formulae

conditional_effects(mod, 
                    type = 'link')

If you have the gratia package installed, it can also be used to plot partial effects of smooths

require(gratia)
draw(mod, 
     trend_effects = TRUE)

Or design more targeted plots using plot_predictions() from the marginaleffects package

plot_predictions(
  mod,
  condition = c('ndvi_ma12',
                'series',
                'series'),
  type = 'link'
)
plot_predictions(
  mod,
  condition = c('mintemp',
                'series',
                'series'),
  type = 'link'
)

We can also view the model’s posterior predictions for the entire series (testing and training). These forecasts can be scored using a range of proper scoring rules. See ?score.mvgam_forecast for more details

fcs <- forecast(mod, 
                newdata = data_test)
plot(fcs, series = 1) +
  plot(fcs, series = 2) +
  plot(fcs, series = 3) +
  plot(fcs, series = 4)

For Vector Autoregressions fit in mvgam, we can inspect impulse response functions and forecast error variance decompositions{target=“_blank”}. The irf() function runs an Impulse Response Function (IRF) simulation whereby a positive “shock” is generated for a target process at time t = 0. All else remaining stable, it then monitors how each of the remaining processes in the latent VAR would be expected to respond over the forecast horizon h. The function computes impulse responses for all processes in the object and returns them in an array that can be plotted using the S3 plot() function. Here we will use the generalized IRF, which makes no assumptions about the order in which the series appear in the VAR process, and inspect how each process is expected to respond to a sudden, positive pulse from the other processes over a horizon of 12 timepoints.

irfs <- irf(mod, 
            h = 12, 
            orthogonal = FALSE)
plot(irfs, 
     series = 1)
plot(irfs, 
     series = 3)

Using the same logic as above, we can inspect forecast error variance decompositions (FEVDs) for each process using the fevd() function. This type of analysis asks how orthogonal shocks to all process in the system contribute to the variance of forecast uncertainty for a focal process over increasing horizons. In other words, the proportion of the forecast variance of each latent time series can be attributed to the effects of the other series in the VAR process. FEVDs are useful because some shocks may not be expected to cause variations in the short-term but may cause longer-term fluctuations

fevds <- fevd(mod, 
              h = 12)
plot(fevds)

This plot shows that the variance of forecast uncertainty for each process is initially dominated by contributions from that same process (i.e. self-dependent effects) but that effects from other processes become more important over increasing forecast horizons. Given what we saw from the IRF plots above, these long-term contributions from interactions among the processes makes sense.

Plotting randomized quantile residuals over time for each series can give useful information about what might be missing from the model. We can use the highly versatile pp_check() function to plot these:

pp_check(
  mod, 
  type = 'resid_ribbon_grouped',
  group = 'series',
  x = 'time',
  ndraws = 200
)

When describing the model, it can be helpful to use the how_to_cite() function to generate a scaffold for describing the model and sampling details in scientific communications

description <- how_to_cite(mod)
description
cat("Methods text skeleton\n")
cat(insight::format_message(description$methods_text))
cat("\nPrimary references\n")
for (i in seq_along(description$citations)) {
  cat(insight::format_message(description$citations[[i]]))
  cat('\n')
}
cat("\nOther useful references\n")
for (i in seq_along(description$other_citations)) {
  cat(insight::format_message(description$other_citations[[i]]))
  cat('\n')
}

The post-processing methods we have shown above are just the tip of the iceberg. For a full list of methods to apply on fitted model objects, type methods(class = "mvgam").

Extended observation families

mvgam was originally designed to analyse and forecast non-negative integer-valued data. But further development of mvgam has resulted in support for a growing number of observation families. Currently, the package can handle data for the following:

  • gaussian() for real-valued data
  • student_t() for heavy-tailed real-valued data
  • lognormal() for non-negative real-valued data
  • Gamma() for non-negative real-valued data
  • betar() for proportional data on (0,1)
  • bernoulli() for binary data
  • poisson() for count data
  • nb() for overdispersed count data
  • binomial() for count data with known number of trials
  • beta_binomial() for overdispersed count data with known number of trials
  • nmix() for count data with imperfect detection (unknown number of trials)

See ??mvgam_families for more information. Below is a simple example for simulating and modelling proportional data with Beta observations over a set of seasonal series with independent Gaussian Process dynamic trends:

set.seed(100)
data <- sim_mvgam(
  family = betar(),
  T = 80,
  trend_model = GP(),
  prop_trend = 0.5,
  seasonality = "shared"
)
plot_mvgam_series(
  data = data$data_train, 
  series = "all"
)
mod <- mvgam(
  y ~ s(season, bs = "cc", k = 7) +
    s(season, by = series, m = 1, k = 5),
  trend_model = GP(),
  data = data$data_train,
  newdata = data$data_test,
  family = betar()
)
mod <- mvgam(
  y ~ s(season, bs = "cc", k = 7) +
    s(season, by = series, m = 1, k = 5),
  trend_model = GP(),
  data = data$data_train,
  newdata = data$data_test,
  family = betar()
)

Inspect the summary to see that the posterior now also contains estimates for the Beta precision parameters $\phi$.

summary(mod, 
        include_betas = FALSE)

Plot the hindcast and forecast distributions for each series

library(patchwork)
fc <- forecast(mod)
wrap_plots(
  plot(fc, series = 1),
  plot(fc, series = 2),
  plot(fc, series = 3),
  ncol = 2
)
library(patchwork)
fc <- forecast(mod)
wrap_plots(
  plot(fc, series = 1),
  plot(fc, series = 2),
  plot(fc, series = 3),
  ncol = 2
)

There are many more extended uses of mvgam, including the ability to fit hierarchical State-Space GAMs that include dynamic and spatially varying coefficient models, dynamic factors, Joint Species Distribution Models and much more. See the package documentation{target=“_blank”} for more details. The package can also be used to generate all necessary data structures, initial value functions and modelling code necessary to fit DGAMs using Stan. This can be helpful if users wish to make changes to the model to better suit their own bespoke research / analysis goals. The Stan Discourse{target=“_blank”} is a helpful place to troubleshoot.

Citing mvgam and related software

When using any software please make sure to appropriately acknowledge the hard work that developers and maintainers put into making these packages available. Citations are currently the best way to formally acknowledge this work (but feel free to ⭐ this repo as well), so we highly encourage you to cite any packages that you rely on for your research.

When using mvgam, please cite the following:

Clark, N.J. and Wells, K. (2023). Dynamic Generalized Additive Models (DGAMs) for forecasting discrete ecological time series. Methods in Ecology and Evolution. DOI: https://doi.org/10.1111/2041-210X.13974

As mvgam acts as an interface to Stan, please additionally cite:

Carpenter B., Gelman A., Hoffman M. D., Lee D., Goodrich B., Betancourt M., Brubaker M., Guo J., Li P., and Riddell A. (2017). Stan: A probabilistic programming language. Journal of Statistical Software. 76(1). DOI: https://doi.org/10.18637/jss.v076.i01

mvgam relies on several other R packages and, of course, on R itself. To find out how to cite R and its packages, use citation(). There are some features of mvgam which specifically rely on certain packages. The most important of these is the generation of data necessary to estimate smoothing splines and Gaussian Processes, which rely on the mgcv, brms and splines2 packages. The rstan and cmdstanr packages together with Rcpp makes Stan conveniently accessible in R. If you use some of these features, please also consider citing the related packages.

Getting help

If you encounter a clear bug, please file an issue with a minimal reproducible example on GitHub. Please also feel free to use the mvgam Discussion Board to hunt for or post other discussion topics related to the package, and do check out the mvgam Changelog for any updates about recent upgrades that the package has incorporated.

Other resources

A series of vignettes cover data formatting, forecasting and several extended case studies of DGAMs{target=“_blank”}. A number of other examples, including some step-by-step introductory webinars, have also been compiled:

Interested in contributing?

I’m actively seeking PhD students and other researchers to work in the areas of ecological forecasting, multivariate model evaluation and development of mvgam. Please reach out if you are interested (n.clark’at’uq.edu.au). Other contributions are also very welcome, but please see The Contributor Instructions for general guidelines. Note that by participating in this project you agree to abide by the terms of its Contributor Code of Conduct.

License

The mvgam project is licensed under an MIT open source license