The Adaptive Lasso and Its Oracle Properties

Hui Zou

Abstract

The lasso is a popular technique for simultaneous estimation and variable selection. Lasso variable selection has been shown to be consistent under certain conditions. In this work we derive a necessary condition for the lasso variable selection to be consistent. Consequently, there exist certain scenarios where the lasso is inconsistent for variable selection. We then propose a new version of the lasso, called the adaptive lasso, where adaptive weights are used for penalizing different coefficients in the penalty. We show that the adaptive lasso enjoys the oracle properties; namely, it performs as well as if the true underlying model were given in advance. Similar to the lasso, the adaptive lasso is shown to be near-minimax optimal. Furthermore, the adaptive lasso can be solved by the same efficient algorithm for solving the lasso. We also discuss the extension of the adaptive lasso in generalized linear models and show that the oracle properties still hold under mild regularity conditions. As a byproduct of our theory, the nonnegative garotte is shown to be consistent for variable selection.

Adaptive LASSO is just the normal lasso with coefficient-specific weights in the penalty, taken as the inverse absolute coefficients (possibly raised to a power γ) from some initial root-n-consistent procedure, such as OLS. This gives the LASSO the oracle property. Equivalently, you can multiply each regressor by the absolute value of its first-stage coefficient, run an ordinary lasso, and then rescale the resulting coefficients by the same factors (this is what the R code below does).
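
In the paper's notation, with β̂ a root-n-consistent initial estimate (e.g. OLS) and a chosen γ > 0, the adaptive lasso estimate is

  β̂*(n) = argmin_β ‖y − Σ_j x_j β_j‖² + λ_n Σ_j ŵ_j |β_j|,   where ŵ_j = 1 / |β̂_j|^γ.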

Standard LASSO tends to use a shrinkage parameter that is tuned for prediction (e.g. by cross-validation), which is typically too small for variable selection purposes, so too many noise variables are retained.
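
A minimal sketch of that point (assuming simulated data in which only 3 of 20 predictors are truly nonzero): compare how many variables survive at cv.glmnet's prediction-optimal lambda.min versus the more conservative lambda.1se:

# illustrate: prediction-tuned lambda often keeps noise variables
library(glmnet)
set.seed(1)
n <- 100; p <- 20
X <- matrix(rnorm(n * p), n, p)
beta.true <- c(3, 2, 1.5, rep(0, p - 3))   # only the first 3 predictors are active
y <- as.numeric(X %*% beta.true + rnorm(n))
cv <- cv.glmnet(X, y, alpha = 1)
sum(coef(cv, s = "lambda.min")[-1] != 0)   # prediction-optimal rule: typically keeps more than 3
sum(coef(cv, s = "lambda.1se")[-1] != 0)   # more conservative rule: typically closer to 3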

The coefficients from standard LASSO tend to be too small (i.e. biased towards zero) for variables that should in fact have large coefficients (over-shrinkage), while small true parameters have a tendency to be zeroed out.
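
This is easiest to see in the orthonormal-design special case, where (writing the penalty as λ Σ_j |β_j|) the lasso reduces to soft-thresholding of the OLS estimates:

  β̂_j(lasso) = sign(β̂_j(ols)) · ( |β̂_j(ols)| − λ/2 )₊

so every surviving coefficient is pulled towards zero by the same constant amount regardless of its size, and coefficients with |β̂_j(ols)| ≤ λ/2 are set exactly to zero. The adaptive weights counteract this by penalizing large first-stage coefficients less and small ones more.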

High multicollinearity between predictors tends to lead to poor variable selection by LASSO: strongly correlated columns can violate the condition needed for selection consistency, and the lasso may keep a correlated noise variable alongside (or instead of) the true one, as in the sketch below.
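
A hedged sketch of this failure mode, assuming one true predictor x1 and a noise predictor x2 that is nearly collinear with it; on many draws the lasso keeps x2 as well (or swaps it in for x1):

# illustrate: a highly correlated noise variable can enter the lasso model
library(glmnet)
set.seed(2)
n  <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)   # nearly collinear with x1, but truly irrelevant
x3 <- rnorm(n)                  # independent noise
X  <- cbind(x1, x2, x3)
y  <- 2 * x1 + rnorm(n)         # only x1 enters the true model
cv <- cv.glmnet(X, y, alpha = 1)
coef(cv, s = "lambda.min")      # x2 often gets a nonzero coefficient too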

asgl is a good Python library for adaptive LASSO (GitHub)

Some example R code from Stack Overflow:

# get data
y <- train[, 11]
x <- train[, -11]
x <- as.matrix(x)
n <- nrow(x)
 
# standardize data
ymean <- mean(y)
y <- y-mean(y)  
xmean <- colMeans(x)
xnorm <- sqrt(n-1)*apply(x,2,sd)
x <- scale(x, center = xmean, scale = xnorm)
 
# fit ols 
lm.fit <- lm(y ~ x)
beta.init <- coef(lm.fit)[-1] # exclude 0 intercept
 
# calculate weights (gamma = 1): the adaptive lasso penalty weight is 1/|beta.init|
w  <- abs(beta.init)
x2 <- scale(x, center = FALSE, scale = 1/w)  # scale() divides by 1/w, i.e. x2_j = x_j * |beta.init_j|
 
# fit adaptive lasso
require(glmnet)
lasso.fit <- cv.glmnet(x2, y, family = "gaussian", alpha = 1, standardize = FALSE, nfolds = 10)
beta <- predict(lasso.fit, x2, type="coefficients", s="lambda.min")[-1]
 
# calculate estimates on the original scale
beta <- as.numeric(beta) * w / xnorm   # undo the weighting and the standardization
b0   <- ymean - sum(beta * xmean)      # intercept (data were centered before fitting)
coef <- c(b0, beta)
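
Note that glmnet also accepts per-coefficient weights directly through its penalty.factor argument, which avoids the manual column rescaling above. A sketch assuming the same x, y, xnorm and beta.init as in the snippet (γ = 1); glmnet rescales penalty.factor internally, which only reparameterizes λ, so the fitted model should be essentially the same:

# alternative: adaptive lasso via glmnet's penalty.factor
alasso.fit <- cv.glmnet(x, y, family = "gaussian", alpha = 1, standardize = FALSE,
                        nfolds = 10, penalty.factor = 1/abs(beta.init))
beta.alt <- coef(alasso.fit, s = "lambda.min")[-1] / xnorm  # back to the original x scale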

“We have shown that the lasso cannot be an oracle procedure.” (Zou, 2006, p. 3)

“We now define the adaptive lasso. Suppose that β̂ is a root-n-consistent estimator to β*; for example, we can use β̂(ols). Pick a γ > 0, and define the weight vector ŵ = 1/|β̂|^γ.” (Zou, 2006, p. 3)

“We have shown that although the lasso variable selection can be inconsistent in some scenarios, the adaptive lasso enjoys the oracle properties by utilizing the adaptively weighted penalty.” (Zou, 2006, p. 8)


References

Zou, H. (2006). The Adaptive Lasso and Its Oracle Properties. Journal of the American Statistical Association, 101(476), 1418–1429.