The `natural` package implements two methods for estimating the error variance of a high-dimensional linear model: the natural lasso and the organic lasso. Details of the methods can be found in Yu and Bien (2017), "Estimating the error variance in a high-dimensional linear model." In particular, given a data matrix $X \in \mathbb{R}^{n \times p}$, with each row an observation of $p$ features, and a response vector $y \in \mathbb{R}^n$, the package implements two penalized maximum likelihood approaches for jointly estimating $\beta$ and $\sigma^2$ in the linear model
$$
y = X\beta + \varepsilon, \quad \varepsilon \sim N(0, \sigma^2 I_n).
$$
This document serves as an introduction to using the package.
To reproduce the simulation study in the paper, the package also contains a function that generates random samples from a linear model with user-specified parameters. In particular, `make_sparse_model` generates a sparse linear model as above, with $X \sim N(0, \Sigma)$ such that $\Sigma_{jj} = 1$ and $\Sigma_{ij} = \rho$ for $i \neq j$. The columnwise correlation $\rho$ is set by the argument `rho`. To generate $\beta$, the number of nonzero elements is set to $\lceil n^\alpha \rceil$, where $\alpha$ is set by the argument `alpha`, and each nonzero element is drawn from a Laplace distribution with rate 1. For a given signal-to-noise ratio, specified by `snr`, the error variance is $\sigma^2 = \beta^T \Sigma \beta / \mathrm{snr}$.
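A minimal call might look as follows (a sketch: the argument names match `?make_sparse_model`, but the specific values chosen here are illustrative):

```r
library(natural)

set.seed(123)
# n = 100 observations, p = 200 features, ceiling(n^0.3) nonzero coefficients,
# columnwise correlation 0.3, signal-to-noise ratio 2, and 2 response draws.
sim <- make_sparse_model(n = 100, p = 200, alpha = 0.3, rho = 0.3,
                         snr = 2, nsim = 2)
dim(sim$x)  # 100 x 200 design matrix
dim(sim$y)  # 100 x 2 matrix: one column per simulated response
```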
The main functions implementing the natural lasso are `nlasso_path` and `nlasso_cv`: `nlasso_path` computes the natural lasso estimates of the error variance along a path of tuning parameters, and `nlasso_cv` selects the tuning parameter using K-fold cross-validation.

`nlasso_path` takes the design matrix and the response. It also accepts a path of tuning parameters $\lambda$, and it outputs the following three estimates:
`sig_obj_path`:
$$
\hat{\sigma}^2(\lambda) = \frac{1}{n}||y - X \hat{\beta}_\lambda||_2^2
+ 2 \lambda ||\hat\beta_\lambda||_1,
$$
`sig_naive_path`:
$$
\hat{\sigma}^2_{\mathrm{naive}}(\lambda) = \frac{1}{n}||y - X
\hat{\beta}_\lambda||_2^2,
$$
and `sig_df_path` (Reid et al., 2016):
$$
\hat{\sigma}^2_{\mathrm{df}}(\lambda) = \frac{1}{n - \hat{s}_\lambda}||y - X
\hat{\beta}_\lambda||_2^2,
$$
where
$$
\hat{\beta}_\lambda = \arg\min_\beta \frac{1}{n} ||y - X \beta||_2^2 +
2\lambda ||\beta||_1
$$
is the lasso solution with tuning parameter $\lambda$, and $\hat{s}_\lambda$ is the degrees of freedom of the lasso fit.
The tuning parameter path can be specified via the argument `lambda`. If it is not provided, the function automatically generates a path of length `nlam`. The output is an S3 object, which can be printed or plotted.
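For example (a sketch; `sim` is generated by `make_sparse_model()` as above, and the path length of 50 is arbitrary):

```r
library(natural)

set.seed(123)
sim <- make_sparse_model(n = 100, p = 200, alpha = 0.3, rho = 0.3,
                         snr = 2, nsim = 1)
# Fit the natural lasso along an automatically generated path of 50 lambdas:
nl <- nlasso_path(x = sim$x, y = sim$y[, 1], nlam = 50)
print(nl)  # the returned S3 object can be printed ...
plot(nl)   # ... or plotted
```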
The function `nlasso_cv` implements K-fold cross-validation to select the best tuning parameter value. The number of folds K is specified by the argument `nfold`. The following code performs the cross-validation, plots the estimated prediction error on the test folds, and selects the best tuning parameter.
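A sketch of such a call (with `sim` from `make_sparse_model()` as above; the fold count is illustrative):

```r
library(natural)

set.seed(123)
sim <- make_sparse_model(n = 100, p = 200, alpha = 0.3, rho = 0.3,
                         snr = 2, nsim = 1)
# 5-fold cross-validation over an automatically generated lambda path:
nl_cv <- nlasso_cv(x = sim$x, y = sim$y[, 1], nfold = 5)
plot(nl_cv)    # estimated prediction error along the path
nl_cv$sig_obj  # cross-validated natural lasso estimate of the error variance
```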
`nlasso_cv` returns a list of objects; see `?nlasso_cv` for more details. In particular, `sig_obj`, `sig_naive`, and `sig_df` are the cross-validated estimates.
`glmnet` output

The function `nlasso_path` calls `glmnet` internally to solve the lasso problems. In many use cases, one may have already called `glmnet` (and/or `cv.glmnet`) before calling `nlasso_path` and/or `nlasso_cv`.
To avoid redundant computation, one can pass the output of `glmnet` into `nlasso_path` via the argument `glmnet_output`. In that case, arguments such as `lambda`, `nlam`, and `flmin` are ignored, and the function computes the estimates of $\sigma^2$ directly from `glmnet_output`. It is suggested that `glmnet_output` come from a `glmnet` call with `standardize = TRUE` (the default), to align with what `nlasso_path` does internally when `glmnet_output = NULL` (the default).
```r
library(glmnet)
#> Loading required package: Matrix
#> Loaded glmnet 4.1-8

g_o <- glmnet(x = sim$x, y = sim$y[, 1], lambda = nl$lambda)
nl2 <- nlasso_path(x = sim$x, y = sim$y[, 1], glmnet_output = g_o)
```
Similarly, one can pass the output of `cv.glmnet` into `nlasso_cv` via the argument `glmnet_output`.
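A sketch of this reuse (assuming `sim` comes from `make_sparse_model()` as before; `nfolds = 5` is illustrative):

```r
library(natural)
library(glmnet)

set.seed(123)
sim <- make_sparse_model(n = 100, p = 200, alpha = 0.3, rho = 0.3,
                         snr = 2, nsim = 1)
cv_o <- cv.glmnet(x = sim$x, y = sim$y[, 1], nfolds = 5)
# Reuse the cv.glmnet fit instead of re-solving the lasso problems:
nl_cv2 <- nlasso_cv(x = sim$x, y = sim$y[, 1], glmnet_output = cv_o)
```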
The organic lasso is a companion method to the natural lasso. Its main novelty is that the choice of tuning parameter is pivotal, in that it does not depend on any unknown parameter. The organic lasso estimate of the error variance is defined as
$$
\tilde{\sigma}_\lambda^2 = \min_{\beta} \frac{1}{n} ||y - X\beta||_2^2 + 2 \lambda ||\beta||_1^2.
$$
The main functions implementing the organic lasso are `olasso_path`, `olasso_cv`, and `olasso`. In particular, `olasso_path` computes the organic lasso estimates of the error variance along a path of tuning parameters, and `olasso_cv` selects the tuning parameter using K-fold cross-validation. Their usage is the same as that of `nlasso_path` and `nlasso_cv`; see `?olasso_path` and `?olasso_cv` for more details.
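For instance, mirroring the natural lasso calls above (a sketch; `sim` as before, and the fold count is illustrative):

```r
library(natural)

set.seed(123)
sim <- make_sparse_model(n = 100, p = 200, alpha = 0.3, rho = 0.3,
                         snr = 2, nsim = 1)
ol <- olasso_path(x = sim$x, y = sim$y[, 1])   # path of variance estimates
ol_cv <- olasso_cv(x = sim$x, y = sim$y[, 1], nfold = 5)
ol_cv$sig_obj  # cross-validated organic lasso estimate of the error variance
```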
The function `olasso` computes the organic lasso estimate of $\sigma^2$ at two pre-specified tuning parameter values. In particular, it outputs the organic lasso estimates with $\lambda_1 = \frac{\log p}{n}$ and with $\lambda_2$, a Monte-Carlo estimate of $n^{-2} ||X^T e||_\infty^2$, where $e$ is an $n$-dimensional vector of independent standard normals. The following example shows that both give estimates close to the true error variance. For completeness of the comparison, we also include the outputs of `olasso_cv` and `nlasso_cv`.
```r
# Compare squared relative errors of the six variance estimates across the
# nsim replications (sim was generated by make_sparse_model with nsim draws).
err_o_mat <- matrix(NA, nrow = nsim, ncol = 6)
colnames(err_o_mat) <- c("olasso(1)", "olasso(2)", "olasso(cv)",
                         "nlasso", "naive", "df")
for (i in seq(nsim)) {
  cur_ol <- olasso(x = sim$x, y = sim$y[, i])
  err_o_mat[i, 1] <- (cur_ol$sig_obj_1 / sim$sigma - 1)^2
  err_o_mat[i, 2] <- (cur_ol$sig_obj_2 / sim$sigma - 1)^2
  cur_ol_cv <- olasso_cv(x = sim$x, y = sim$y[, i])
  err_o_mat[i, 3] <- (cur_ol_cv$sig_obj / sim$sigma - 1)^2
  cur_nl_cv <- nlasso_cv(x = sim$x, y = sim$y[, i])
  err_o_mat[i, 4] <- (cur_nl_cv$sig_obj / sim$sigma - 1)^2
  err_o_mat[i, 5] <- (cur_nl_cv$sig_naive / sim$sigma - 1)^2
  err_o_mat[i, 6] <- (cur_nl_cv$sig_df / sim$sigma - 1)^2
}
boxplot(err_o_mat, ylim = c(0, 0.4), ylab = "Mean squared error")
```