Lasso Inference for High-Dimensional Time Series
Robert Adamek, Stephan Smeekes, Ines Wilms
Abstract
In this paper we develop valid inference for high-dimensional time series. We extend the desparsified lasso to a time series setting under Near-Epoch Dependence (NED) assumptions allowing for non-Gaussian, serially correlated and heteroskedastic processes, where the number of regressors can possibly grow faster than the time dimension. We first derive an error bound under weak sparsity, which, coupled with the NED assumption, means this inequality can also be applied to the (inherently misspecified) nodewise regressions performed in the desparsified lasso. This allows us to establish the uniform asymptotic normality of the desparsified lasso under general conditions, including for inference on parameters of increasing dimensions. Additionally, we show consistency of a long-run variance estimator, thus providing a complete set of tools for performing inference in high-dimensional linear time series models. Finally, we perform a simulation exercise to demonstrate the small sample properties of the desparsified lasso in common time series settings.
The key aim of the authors is to do inference in a lasso context, which is complicated by the fact that the variable selection feature of the lasso renders standard inference invalid. Basically, when you let the lasso select variables in the outcome equation, you potentially drop covariates that are relevant for properly measuring the coefficient on a "treatment" variable of interest, which invalidates inference on that parameter. One therefore needs post-selection inference methods. @belloniHighDimensionalMethodsInference2014 is an example of this, leveraging orthogonalization via Frisch-Waugh partialling out (post-double-selection), which ensures that covariates relevant to the treatment and/or outcome variable are kept in the regression; a rough sketch of that idea follows below. The debiased/desparsified lasso of @vandegeerAsymptoticallyOptimalConfidence2014 is another approach.
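To fix ideas, here is a minimal sketch of the post-double-selection recipe on simulated iid data. The DGP, penalty values, and variable names are my own illustrative choices, not anything taken from the paper or from Belloni et al.:

```python
# Minimal sketch of post-double-selection: two lasso selection steps, then OLS on the union.
import numpy as np
from sklearn.linear_model import Lasso
import statsmodels.api as sm

rng = np.random.default_rng(1)
T, N = 200, 50
W = rng.standard_normal((T, N))                            # controls
d = W[:, 0] + 0.5 * W[:, 1] + rng.standard_normal(T)       # treatment
y = 1.0 * d + W[:, 1] - W[:, 2] + rng.standard_normal(T)   # outcome, true effect = 1

# Selection step 1: lasso of the outcome on the controls
sel_y = np.flatnonzero(Lasso(alpha=0.1).fit(W, y).coef_)
# Selection step 2: lasso of the treatment on the controls
sel_d = np.flatnonzero(Lasso(alpha=0.1).fit(W, d).coef_)

# Final step: OLS of y on the treatment plus the union of selected controls;
# standard inference on the treatment coefficient is then valid.
keep = np.union1d(sel_y, sel_d)
Z = sm.add_constant(np.column_stack([d, W[:, keep]]))
print(sm.OLS(y, Z).fit().summary().tables[1])
```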
In the desparsified lasso approach, you effectively add a bias-correction term, a small amount of "noise", back to the original lasso estimates, so that coefficients shrunk to exactly zero are no longer exactly zero. This undoes the model selection, but in return you get valid inference with asymptotically normal uncertainty. By construction the estimator is no longer sparse, but now you can make scientific statements about the coefficients of interest. This sits naturally with the paper's notion of "weak sparsity", under which the true coefficients need not be exactly zero, only small enough in aggregate.
Another key tension the authors address is the IID issue: standard lasso inference theory cannot be applied off the shelf to time series data, as the data are not IID. Other research also explores this, so this is not the first entry in the literature, but they extend the desparsified lasso specifically to the time series context under Near-Epoch Dependence. Their approach also allows the errors to be non-Gaussian, serially correlated, and heteroskedastic.
Unfortunately my sense is that the methods in this paper only apply to stationary data, or at least that is an assumption of the method, per the quotes below (non-stationary data does not have constant, or even finite, unconditional moments). One may be able to get around this by first regressing non-stationary variables on a lag and using the residuals as regressors (a rough sketch follows the quotes):
> Assumption 1(i) ensures that the error terms are contemporaneously uncorrelated with each of the regressors, and that the process has finite and constant unconditional moments. … We provide a complete set of tools for uniformly valid inference in high-dimensional stationary time series settings, where the number of regressors N can possibly grow at a faster rate than the time dimension T.
From some example DGPs they provide:
> Also assume that the vector of exogenous variables $w_t$ is stationary and geometrically β-mixing as well with finite $2\bar{m}$ moments. … the $K \times K$ matrices $\Phi_i$ satisfy appropriate stationarity and $2\bar{m}$-th order summability conditions.
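A minimal sketch of the workaround suggested above (my own suggestion, not the paper's recommendation): residualize a non-stationary series on its own lag and use the residuals as an approximately stationary regressor.

```python
# Residualize a random walk on its first lag; the residuals are (approximately) stationary.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
x = np.cumsum(rng.standard_normal(300))   # a random walk, hence non-stationary

x_lag = sm.add_constant(x[:-1])           # regress x_t on x_{t-1} (plus a constant)
ar1 = sm.OLS(x[1:], x_lag).fit()
x_resid = ar1.resid                       # use these residuals as a regressor instead of x
print(ar1.params)                         # slope should be close to 1 for a random walk
```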
Here’s my attempt at explaining desparsified lasso in simple terms:
- The idea is to unsparsify the original lasso estimates, with the goal of removing the bias generated by standard lasso, which enables valid inference
- Take the initial lasso estimates and add to them a correction term equal to the covariance of the residualized covariate in question with the residualized outcome variable (i.e. the regression scores), scaled by a normalizing factor. Both residualizations are done via standard lasso: the outcome is residualized by the initial lasso fit, and the covariate by a "nodewise" lasso regression on the remaining covariates (see the sketch below)
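Here is a minimal sketch of that recipe for a single coefficient, on simulated iid data. The DGP and the hand-picked penalty values are my own illustrative choices; the paper's version additionally handles the time series structure and the long-run variance needed for standard errors.

```python
# Minimal sketch of the desparsified (debiased) lasso for coefficient j.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
T, N, j = 200, 50, 0                          # sample size, regressors, target coefficient
X = rng.standard_normal((T, N))
beta = np.zeros(N); beta[:3] = [1.0, 0.5, -0.5]
y = X @ beta + rng.standard_normal(T)

# Step 1: initial lasso of y on X; its residuals are the "residualized outcome"
fit = Lasso(alpha=0.1).fit(X, y)
resid_y = y - fit.predict(X)

# Step 2: nodewise lasso of x_j on the remaining regressors; its residuals are the
# "residualized covariate"
X_minus_j = np.delete(X, j, axis=1)
nodewise = Lasso(alpha=0.1).fit(X_minus_j, X[:, j])
z_j = X[:, j] - nodewise.predict(X_minus_j)

# Step 3: add the scaled covariance of the two residual series to the lasso estimate
b_j = fit.coef_[j] + z_j @ resid_y / (z_j @ X[:, j])
print(f"lasso: {fit.coef_[j]:.3f}, desparsified: {b_j:.3f}, truth: {beta[j]:.3f}")
```

The correction term undoes the shrinkage bias of the initial lasso estimate, which is what makes a normal approximation (and hence confidence intervals) usable for b_j.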