Title: | Sparse Reluctant Interaction Modeling |
---|---|
Description: | An implementation of a computationally efficient method to fit large-scale interaction models based on the reluctant interaction selection principle. The method and its properties are described in greater depth in Yu, G., Bien, J., and Tibshirani, R.J. (2019) "Reluctant interaction modeling", which is available at <arXiv:1907.08414>. |
Authors: | Guo Yu [aut, cre] |
Maintainer: | Guo Yu <[email protected]> |
License: | GPL-3 |
Version: | 0.9.1 |
Built: | 2025-02-01 03:27:10 UTC |
Source: | https://github.com/hugogogo/sprintr |
The main cross-validation function to select the best sprinter fit for a path of tuning parameters.
cv.sprinter(
  x, y,
  square = FALSE,
  num_keep = NULL,
  lambda1 = NULL,
  lambda3 = NULL,
  cv_step1 = FALSE,
  nlam1 = 10,
  nlam3 = 100,
  lam_min_ratio = ifelse(nrow(x) < ncol(x), 0.01, 1e-04),
  nfold = 5,
  foldid = NULL,
  verbose = FALSE,
  ...
)
x |
An |
y |
A response vector of size |
square |
Indicator of whether squared effects should be fitted in Step 1. Defaults to FALSE. |
num_keep |
A user specified number of candidate interactions to keep in Step 2. If |
lambda1 |
Tuning parameter values for Step 1. |
lambda3 |
Tuning parameter values for Step 3. |
cv_step1 |
Indicator of whether cross-validation of |
nlam1 |
the number of values in |
nlam3 |
the number of values in each column of |
lam_min_ratio |
The ratio of the smallest and the largest values in |
nfold |
Number of folds in cross-validation. Default value is 5. If each fold gets too few observations, a warning is thrown and the minimal |
foldid |
A vector of length |
verbose |
If |
... |
other arguments to be passed to the |
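As an illustration of the nfold and foldid arguments, the following base R sketch builds a balanced random fold assignment. It assumes that foldid is a vector of length n holding one fold label per observation, and that labels run from 1 to nfold (the glmnet convention); neither detail is confirmed by the truncated argument descriptions above.

```r
set.seed(1)
n <- 100
nfold <- 5

# Balanced random assignment: each observation gets a fold label in 1..nfold.
foldid <- sample(rep(seq_len(nfold), length.out = n))
table(foldid)

# Indices used for testing and training when fold 1 is held out.
test_idx <- which(foldid == 1)
train_idx <- which(foldid != 1)
```

Supplying the same foldid to several fits keeps their cross-validation errors comparable, since every fit is then evaluated on identical splits.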
An object of S3 class "sprinter".
n
The sample size.
p
The number of main effects.
square
The square parameter passed into sprinter.
a0_step3
Estimate of intercept corresponding to the CV-selected model.
compact
A compact representation of the selected variables. compact has three columns, with the first two columns representing the indices of a selected variable (main effects with first index = 0), and the last column representing the estimate of coefficients.
fit
The whole glmnet fit object.
fitted
fitted value of response corresponding to the CV-selected model.
num_keep
The value of num_keep.
cvm
The averaged estimated prediction error on the test sets over K folds.
cvse
The standard error of the estimated prediction error on the test sets over K folds.
foldid
Fold assignment. A vector of length n.
i_lambda1_best
The index in lambda1 that is chosen by CV by minimizing cvm.
i_lambda3_best
The index in lambda3 that is chosen by CV by minimizing cvm.
lambda1_best
The value of lambda1 that is chosen by CV by minimizing cvm.
lambda3_best
The value of lambda3 that is chosen by CV by minimizing cvm.
call
Function call.
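To make the layout of compact concrete, here is a small base R sketch that splits such a matrix into main effects and interactions. The matrix values, the column names, and the term labels are invented for illustration; only the three-column layout and the "first index = 0 for main effects" rule come from the description above.

```r
# Hypothetical 'compact' matrix, made up for illustration:
# columns are (index 1, index 2, coefficient estimate).
compact <- rbind(
  c(0, 1,  0.9),   # main effect of variable 1
  c(0, 2, -1.8),   # main effect of variable 2
  c(1, 3,  2.7),   # interaction between variables 1 and 3
  c(4, 5, -3.6)    # interaction between variables 4 and 5
)
colnames(compact) <- c("index_1", "index_2", "coefficient")  # assumed names

# Main effects are the rows whose first index is 0.
is_main <- compact[, 1] == 0
main_effects <- compact[is_main, , drop = FALSE]
interactions <- compact[!is_main, , drop = FALSE]

# Readable labels: "x2" for a main effect, "x1:x3" for an interaction.
labels <- ifelse(is_main,
                 paste0("x", compact[, 2]),
                 paste0("x", compact[, 1], ":x", compact[, 2]))
data.frame(term = labels, coefficient = compact[, 3])
```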
n <- 100
p <- 100
x <- matrix(rnorm(n * p), n, p)
y <- x[, 1] - 2 * x[, 2] + 3 * x[, 1] * x[, 3] - 4 * x[, 4] * x[, 5] + rnorm(n)
mod <- cv.sprinter(x = x, y = y)
An implementation of the two-stage lasso studied in Hao et al. (2018).
hier_lasso(
  x, y,
  lambda = NULL,
  nlam = 100,
  lam_choice = "min",
  lam_min_ratio = ifelse(nrow(x) < ncol(x), 0.01, 1e-04),
  nfold = 5,
  foldid = NULL,
  ...
)
x |
An |
y |
A response vector of size |
... |
other arguments to be passed to the |
An object of S3 class "cv.hier".
n
The sample size.
p
The number of main effects.
fit
The whole cv.glmnet fit object.
compact
A compact representation of the selected variables. compact has three columns, with the first two columns representing the indices of a selected variable (main effects with first index = 0), and the last column representing the estimate of coefficients.
set.seed(123)
n <- 100
p <- 200
# dense input
x <- matrix(rnorm(n * p), n, p)
y <- x[, 1] - 2 * x[, 2] + 3 * x[, 1] * x[, 3] - 4 * x[, 4] * x[, 5] + rnorm(n)
mod <- hier_lasso(x = x, y = y)
This function produces plots of cross-validation for cv.sprinter.
## S3 method for class 'cv.sprinter'
plot(fit)
fit |
A " |
The orange pairs at the top of the plot show the number of nonzero (main effects, interactions) selected by each value of lambda. Adopted from the function plot.cv.rgam from package relgam by Kenneth Tay and Robert Tibshirani.
set.seed(123)
n <- 100
p <- 200
# dense input
x <- matrix(rnorm(n * p), n, p)
y <- x[, 1] - 2 * x[, 2] + 3 * x[, 1] * x[, 3] - 4 * x[, 4] * x[, 5] + rnorm(n)
mod <- cv.sprinter(x = x, y = y)
plot(mod)
Produces a two-panel plot of the sprinter object showing coefficient paths for both main effects and interactions.
## S3 method for class 'sprinter'
plot(fit, which = 1, label = TRUE, index = NULL)
fit |
Fitted |
which |
The tuning parameter considered in Step 2. |
label |
If |
index |
Lambda indices to plot |
A two-panel plot is produced that summarizes the main effect (left) and interaction (right) coefficients as a function of lambda. Adopted from the function summary.rgam from package relgam by Kenneth Tay and Robert Tibshirani.
set.seed(123)
n <- 100
p <- 100
# dense input
x <- matrix(rnorm(n * p), n, p)
y <- x[, 1] - 2 * x[, 2] + 3 * x[, 1] * x[, 3] - 4 * x[, 4] * x[, 5] + rnorm(n)
fit <- sprinter(x = x, y = y)
plot(fit)
Calculate prediction from a cv.sprinter object.
## S3 method for class 'cv.sprinter'
predict(object, newdata, ...)
object |
a fitted |
newdata |
a design matrix of all the |
... |
additional argument (not used here, only for S3 generic/method consistency) |
The prediction of newdata by the cv.sprinter fit object.
n <- 100
p <- 200
x <- matrix(rnorm(n * p), n, p)
y <- x[, 1] + 2 * x[, 2] - 3 * x[, 1] * x[, 2] + rnorm(n)
mod <- cv.sprinter(x = x, y = y)
fitted <- predict(mod, newdata = x)
Calculate prediction from an object of class "other".
## S3 method for class 'other'
predict(object, newdata, ...)
object |
a fitted |
newdata |
a design matrix of all the |
... |
additional argument (not used here, only for S3 generic/method consistency) |
The prediction of newdata by the cv.sprinter fit object.
n <- 100
p <- 200
x <- matrix(rnorm(n * p), n, p)
y <- x[, 1] + 2 * x[, 2] - 3 * x[, 1] * x[, 2] + rnorm(n)
mod <- cv.sprinter(x = x, y = y)
fitted <- predict(mod, newdata = x)
Calculate prediction from a sprinter object.
## S3 method for class 'sprinter'
predict(object, newdata, ...)
object |
a fitted |
newdata |
a design matrix of all the |
... |
additional argument (not used here, only for S3 generic/method consistency) |
The prediction of newdata by the sprinter fit object.
n <- 100
p <- 200
x <- matrix(rnorm(n * p), n, p)
y <- x[, 1] + 2 * x[, 2] - 3 * x[, 1] * x[, 2] + rnorm(n)
mod <- sprinter(x = x, y = y)
fitted <- predict(mod, newdata = x)
Print a summary of the cross-validation information for running cv.sprinter.
## S3 method for class 'cv.sprinter'
print(fit, digits = max(3, getOption("digits") - 3), ...)
fit |
A fitted |
digits |
Significant digits in printout. |
This function takes in a cv.sprinter object and produces a summary of the cross-validation information about the tuning parameters (in Step 3) selected by lambda.min and lambda.1se.
Adopted from the function print.cv.rgam from package relgam by Kenneth Tay and Robert Tibshirani.
See also: cv.sprinter, print.sprinter.
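The two selection rules named above can be sketched in base R as follows. This follows the usual glmnet-style definitions of lambda.min and lambda.1se; whether print.cv.sprinter computes them in exactly this way is an assumption, and the lambda, cvm, and cvse values are made up.

```r
# Made-up CV summaries for a decreasing path of five Step 3 tuning parameters.
lambda <- c(1.00, 0.50, 0.25, 0.10, 0.05)
cvm    <- c(4.0, 2.5, 1.9, 1.8, 2.1)       # mean CV error per lambda
cvse   <- c(0.30, 0.25, 0.20, 0.20, 0.25)  # standard error per lambda

# lambda.min: the value minimizing the mean CV error.
i_min <- which.min(cvm)
lambda_min <- lambda[i_min]

# lambda.1se: the largest lambda whose mean error is within one standard
# error of the minimum (the one-standard-error rule).
threshold <- cvm[i_min] + cvse[i_min]
lambda_1se <- max(lambda[cvm <= threshold])

c(lambda.min = lambda_min, lambda.1se = lambda_1se)
```

The one-standard-error choice trades a small increase in estimated error for a larger penalty, which typically yields a sparser model.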
set.seed(123)
n <- 100
p <- 100
x <- matrix(rnorm(n * p), n, p)
y <- x[, 1] - 2 * x[, 2] + 3 * x[, 1] * x[, 3] - 4 * x[, 4] * x[, 5] + rnorm(n)
fit.cv <- cv.sprinter(x = x, y = y)
print(fit.cv)
Print a summary of the sprinter fit at each step along the path of tuning parameters used in Step 3, for any given tuning parameter in Step 1.
## S3 method for class 'sprinter'
print(fit, which = 1, digits = max(3, getOption("digits") - 3), ...)
fit |
A |
which |
Which tuning parameter of Step 1 to print. Default is 1. |
digits |
Significant digits in printout. |
... |
Additional print arguments. |
The function produces a three-column matrix with tuning parameter values (in Step 3), number of nonzero main effects, and the number of nonzero interactions.
Adopted from the function print.rgam from package relgam by Kenneth Tay and Robert Tibshirani.
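The three-column summary described above can be illustrated with a hand-built example. The coefficient matrix, the is_interaction flags, and the column names below are hypothetical and do not reflect how a sprinter object stores its fit; the sketch only shows how the counts of nonzero terms relate to the Step 3 path.

```r
# Hypothetical coefficients: rows are terms, columns are Step 3 lambdas.
lambda3 <- c(0.5, 0.2, 0.1)
coefs <- cbind(
  c(0.0,  0.0, 0.0,  0.0),
  c(0.8,  0.0, 1.5,  0.0),
  c(0.9, -1.7, 2.6, -3.1)
)
# First two rows are main effects, last two are interactions (assumed).
is_interaction <- c(FALSE, FALSE, TRUE, TRUE)

# One row per lambda: its value and the number of nonzero terms of each kind.
summary_mat <- cbind(
  lambda = lambda3,
  main = colSums(coefs[!is_interaction, , drop = FALSE] != 0),
  interaction = colSums(coefs[is_interaction, , drop = FALSE] != 0)
)
print(signif(summary_mat, digits = 3))
```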
set.seed(123)
n <- 100
p <- 100
x <- matrix(rnorm(n * p), n, p)
y <- x[, 1] - 2 * x[, 2] + 3 * x[, 1] * x[, 3] - 4 * x[, 4] * x[, 5] + rnorm(n)
fit <- sprinter(x = x, y = y)
print(fit, which = 3)
Sure Independence Screening in Step 2
screen_cpp(x, y, num_keep, square = FALSE, main_effect = FALSE)
x |
an n-by-p matrix of main effects with i.i.d. rows; each row is a vector of observations of the p main effects |
y |
a vector of length n. In sprinter, y is the residual from step 1 |
num_keep |
the number of candidate interactions in Step 2. Default to be n / [log n] |
square |
An indicator of whether squared effects should be considered in Step 1 (NOT Step 2!). square == TRUE if squared effects have been considered in Step 1, i.e., squared effects will NOT be considered in Step 2. |
main_effect |
An indicator of whether main effects should also be screened. Default to be false. The functionality of main_effect = true is not used in sprinter, but for SIS_lasso. |
A matrix with 3 columns: the first two give the index pair of each selected interaction, and the third gives the corresponding absolute correlation with the residual.
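The screening idea can be sketched naively in base R: form every pairwise product, rank the products by absolute correlation with the residual, and keep the top num_keep. This is a slow illustrative version, not the C++ routine. It reads the default "n / [log n]" as ceiling(n / log(n)), which is an assumption, and it omits squared terms, which corresponds to the square = TRUE case described above.

```r
set.seed(123)
n <- 50
p <- 6
x <- matrix(rnorm(n * p), n, p)
# Stand-in for the Step 1 residual: driven by the x1 * x3 interaction.
r <- 3 * x[, 1] * x[, 3] + rnorm(n)

# Assumed reading of the default "n / [log n]".
num_keep <- ceiling(n / log(n))

# All index pairs j < k (squared terms j == k are left out of this sketch).
pairs <- t(combn(p, 2))
abs_cor <- apply(pairs, 1, function(jk) abs(cor(x[, jk[1]] * x[, jk[2]], r)))

# Keep the num_keep pairs with the largest absolute correlation.
keep <- order(abs_cor, decreasing = TRUE)[seq_len(min(num_keep, nrow(pairs)))]
screened <- cbind(pairs[keep, , drop = FALSE], abs_cor[keep])
colnames(screened) <- c("j", "k", "abs_cor")  # assumed column names
head(screened, 3)
```

Looping over all pairs costs on the order of n * p^2 operations, which is why a compiled implementation matters when p is large.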
Sure Independence Screening in Step 2 for sparse design matrix
screen_sparse_cpp(x, y, num_keep, square = FALSE, main_effect = FALSE)
x |
a n-by-p sparse matrix of main effects |
y |
a vector of length n. In sprinter, y is the residual from step 1 |
num_keep |
the number of candidate interactions in Step 2. Default to be n / [log n] |
square |
An indicator of whether squared effects should be considered in Step 1 (NOT Step 2!). square == TRUE if squared effects have been considered in Step 1, i.e., squared effects will NOT be considered in Step 2. |
main_effect |
An indicator of whether main effects should also be screened. Default to be false. The functionality of main_effect = true is not used in sprinter, but for SIS_lasso. |
A matrix with 3 columns: the first two give the index pair of each selected interaction, and the third gives the corresponding absolute correlation with the residual.
Sure independence screening followed by lasso
sis_lasso(
  x, y,
  num_keep = NULL,
  lam_min_ratio = ifelse(nrow(x) < ncol(x), 0.01, 1e-04),
  nfold = 5,
  foldid = NULL,
  ...
)
x |
An |
y |
A response vector of size |
num_keep |
Number of variables to keep in the screening phase |
... |
other arguments to be passed to the |
An object of S3 class "cv.hier".
n
The sample size.
p
The number of main effects.
fit
The whole cv.glmnet fit object.
compact
A compact representation of the selected variables. compact has three columns, with the first two columns representing the indices of a selected variable (main effects with first index = 0), and the last column representing the estimate of coefficients.
set.seed(123)
n <- 100
p <- 200
# dense input
x <- matrix(rnorm(n * p), n, p)
y <- x[, 1] - 2 * x[, 2] + 3 * x[, 1] * x[, 3] - 4 * x[, 4] * x[, 5] + rnorm(n)
mod <- sis_lasso(x = x, y = y)
This is the main function that fits interaction models with a path of tuning parameters (for Step 3).
sprinter(
  x, y,
  square = FALSE,
  num_keep = NULL,
  lambda1 = NULL,
  lambda3 = NULL,
  cv_step1 = FALSE,
  nlam1 = 10,
  nlam3 = 100,
  lam_min_ratio = ifelse(nrow(x) < ncol(x), 0.01, 1e-04),
  ...
)
x |
An |
y |
A response vector of size |
square |
Indicator of whether squared effects should be fitted in Step 1. Defaults to FALSE. |
num_keep |
A user specified number of candidate interactions to keep in Step 2. If |
lambda1 |
Tuning parameter values for Step 1. |
lambda3 |
Tuning parameter values for Step 3. |
cv_step1 |
Indicator of whether cross-validation of |
nlam1 |
the number of values in |
nlam3 |
the number of values in each column of |
lam_min_ratio |
The ratio of the smallest and the largest values in |
... |
other arguments to be passed to the |
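To show how a count such as nlam1 or nlam3 and lam_min_ratio can jointly determine a path of tuning parameters, the sketch below builds a log-spaced grid in base R. The helper make_lambda_grid is hypothetical, and the log spacing mirrors the glmnet convention; that sprinter constructs its paths in exactly this way is an assumption. The largest value lam_max is set to 1 purely for illustration.

```r
# Hypothetical helper (not part of sprintr): a log-spaced grid of nlam values
# running from lam_max down to lam_max * lam_min_ratio.
make_lambda_grid <- function(lam_max, nlam, lam_min_ratio) {
  exp(seq(log(lam_max), log(lam_max * lam_min_ratio), length.out = nlam))
}

n <- 100
p <- 200
# Same default rule as in the usage above: a larger ratio when n < p.
ratio <- ifelse(n < p, 0.01, 1e-04)
lambda_grid <- make_lambda_grid(lam_max = 1, nlam = 10, lam_min_ratio = ratio)
range(lambda_grid)
```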
An object of S3 class "sprinter".
square
The square parameter passed into sprinter
n
The number of observations in the dataset
p
The number of main effects
step1
The output from fitting Step 1
lambda1
The path of tuning parameters passed into / computed for fitting Step 1
step2
The output from the screening Step 2
num_keep
The path of tuning parameters for Step 2
step3
The output from fitting Step 3
lambda3
The path of tuning parameters passed into / computed for fitting Step 3
main_center
Column centers of the input main effects
main_scale
Column scales of the input main effects
call
Function call.
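The role of stored column centers and scales, such as main_center and main_scale, can be illustrated with base R. The sketch assumes the centers are column means and the scales are sample standard deviations with the n - 1 denominator; the exact definitions used by sprinter are not stated here.

```r
set.seed(123)
x <- matrix(rnorm(20 * 3, mean = 5, sd = 2), 20, 3)

# Column centers and scales of the training main effects (assumed definitions).
main_center <- colMeans(x)
main_scale <- apply(x, 2, sd)

# Apply the stored training statistics to new observations.
newdata <- matrix(rnorm(5 * 3, mean = 5, sd = 2), 5, 3)
newdata_std <- scale(newdata, center = main_center, scale = main_scale)

# Sanity check: the training data itself ends up with mean 0 and sd 1.
x_std <- scale(x, center = main_center, scale = main_scale)
round(colMeans(x_std), 10)
```

Reusing the training statistics, instead of recomputing them on newdata, keeps new observations on the scale on which the coefficients were estimated.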
set.seed(123)
n <- 100
p <- 100
# dense input
x <- matrix(rnorm(n * p), n, p)
y <- x[, 1] - 2 * x[, 2] + 3 * x[, 1] * x[, 3] - 4 * x[, 4] * x[, 5] + rnorm(n)
mod <- sprinter(x = x, y = y)

# sparse input
library(Matrix)
x <- Matrix::Matrix(0, n, p)
idx <- cbind(sample(seq(n), size = 10, replace = TRUE), sample(seq(p), size = 10, replace = TRUE))
x[idx] <- 1
y <- x[, 1] - 2 * x[, 2] + 3 * x[, 1] * x[, 3] - 4 * x[, 4] * x[, 5] + rnorm(n)
mod <- sprinter(x = x, y = y)