Sparse Trend Estimation
Richard K. Crump, Nikolay Gospodinov, Hunter Wieman – 2023
Abstract
The low-frequency movements of many economic variables play a prominent role in policy analysis and decision-making. We develop a robust estimation approach for these slow-moving trend processes which is guided by a judicious choice of priors and is characterized by sparsity. We present some novel stylized facts from longer-run survey expectations that inform the structure of the estimation procedure. The general version of the proposed Bayesian estimator with a slab-and-spike prior accounts explicitly for cyclical dynamics. The practical implementation of the method is discussed in detail and we show that it performs well in simulations against some relevant benchmarks. We report empirical estimates of trend growth for U.S. output (and its components), productivity and annual mean temperature. These estimates allow policy makers to assess shortfalls and overshoots in these variables from their economic and ecological targets.
The authors examine the evolution of long-horizon forecasts of the US economy and note that these forecasts change quite slowly over time. Thus the first differences of these forecasts are usually zero, with rarer deviations in either direction when the forecast does change. The same is true of the second differences of these forecasts. These distributions don't appear to be consistent with continuous distributions like the normal distribution; they are more consistent with mixture distributions.
This feels like a relatively “doable” paper to replicate
The March 2024 revision of the paper is definitely cleaner and easier to understand, though I’m biased from having read the first version of the paper.
One thing to notice is that though they omit any explicit multivariate designation for the normals that define the distributions of the observed data and the trend, you can tell they are multivariate from the fact that they are parameterized with vectors for the means and covariance matrices for the variances.
Replication
Some interesting reflections after attempting to replicate the paper multiple times, with the most success coming on this most recent attempt:
- When people say that a Laplace distribution is a scale mixture of normals with an exponential mixing density, what they literally mean is that the variances are exponentially distributed, meaning larger variances are less frequent. How quickly that frequency falls off depends on the rate parameter λ, which is tied to the variance of the original Laplace: the higher the variance of the original Laplace, the lower λ, the higher the variance of the exponential distribution, and the less the high-variance normals are effectively "penalized".
- You are effectively sampling various normals from a single exponential distribution. It's not that there are multiple different exponentials – there is only one, characterized by λ, from which you sample all the variances, i.e. all the normals. It just so happens that when you sample a bunch of numbers from an exponential, call those things "variances" and use them to parameterize a bunch of normals centered around the same mean, you get a Laplace. That just happens by definition, that's just the math. The rarity of higher variances comes for free via the exponential distribution, and when combined with those relative frequencies, you get a Laplace. (The first sketch after this list demonstrates this.)
- The normals are integrated over to get the final Laplace distribution. The exact way this is done is fairly involved in its full form, but it's basically integrating over the normals conditional on their variances, then integrating over the variances themselves. So if the normals are $\beta \mid \tau^2 \sim N(0, \tau^2)$ and the mixing density is $\tau^2 \sim \text{Exp}(\lambda^2/2)$, where $\tau^2$ is the variance of the normals and $\lambda^2/2$ is the rate of the exponential, then the integral for the density is $p(\beta \mid \lambda) = \int_0^\infty N(\beta \mid 0, \tau^2)\,\tfrac{\lambda^2}{2} e^{-\tfrac{\lambda^2}{2}\tau^2}\,d\tau^2$, which evaluates out to the Laplace density $\tfrac{\lambda}{2} e^{-\lambda\lvert\beta\rvert}$. (A numerical check of this identity appears after this list.)
- You need to use a mixture to represent the spike-and-slab density, since the whole point is that a single parameterized Laplace distribution can't accommodate the stylized facts we see in the data: lots of zeros and a small number of larger-magnitude changes. This was originally confusing because I thought there was only one mixture, the one generating the Laplace if you use the hierarchical setup. But no, you need to set up a second mixture, because although there are "two Laplaces" you only pull a single sample for each time period, not a separate sample from each Laplace for each time period. The only way to set that up is via a mixture distribution, which gives you one effective distribution. And because you use a Bernoulli to pick between them, any given time period effectively uses one Laplace or the other. (This is simulated in a sketch after the list.)
- I'm still not sure if it's critical to use the mixture-of-normals approach. My sense is that they use it so that they can run a Gibbs sampler, but I don't know enough about Gibbs sampling to know if I'm right on that. For my purposes it's much simpler to just use Laplace distributions directly, though with my new understanding maybe it wouldn't be so bad to rewrite using normals and exponentials. I need to make sure that the "scale mixture of normals" can be represented in PyMC using the `NormalMixture` function, or whether there's some subtle distinction there. I think the weights in that function are different from what comes out of an exponential distribution: the weights for `NormalMixture` should actually be the relative frequencies of the variances you are sampling from the exponential, not the variances themselves, which is what I was previously doing. (A PyMC sketch using `Mixture` with Laplace components follows this list.)
- Something seems to just completely blow up when you expand the sample size. The time it takes to run the routine grows dramatically if you add just 5 more years of data, something like 50x. Strange.
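A minimal NumPy sketch of the first two bullets above: draw all the variances from a single exponential, draw normals with those variances, and the marginal draws follow a Laplace. The Exp(λ²/2) mixing rate used here is the standard Bayesian-lasso convention and is an assumption on my part, not necessarily the paper's exact parameterization.

```python
import numpy as np

rng = np.random.default_rng(0)
lam = 2.0          # Laplace rate; density (lam/2) * exp(-lam*|x|)
n = 200_000

# One exponential mixing density: variances tau^2 ~ Exp(rate = lam^2 / 2)
tau2 = rng.exponential(scale=2.0 / lam**2, size=n)

# Normals centered at zero with those variances -> marginally Laplace
draws = rng.normal(loc=0.0, scale=np.sqrt(tau2))

# Compare sample moments to the Laplace(rate = lam) moments
print(draws.var(), 2.0 / lam**2)          # both should be about 0.5
print(np.mean(np.abs(draws)), 1.0 / lam)  # both should be about 0.5
```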
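The integral identity in the third bullet can also be checked numerically by integrating the normal density against the exponential mixing density and comparing to the closed-form Laplace density. A sketch with SciPy, under the same assumed Exp(λ²/2) parameterization:

```python
import numpy as np
from scipy import integrate, stats

lam, beta = 2.0, 0.7

def integrand(tau2):
    # N(beta | 0, tau^2) * Exp(tau^2 | rate = lam^2 / 2)
    return stats.norm.pdf(beta, loc=0.0, scale=np.sqrt(tau2)) * \
           stats.expon.pdf(tau2, scale=2.0 / lam**2)

mixture_density, _ = integrate.quad(integrand, 0.0, np.inf)
laplace_density = (lam / 2.0) * np.exp(-lam * abs(beta))
print(mixture_density, laplace_density)  # agree up to quadrature error
```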
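The "one draw per period, routed through one of two Laplaces by a Bernoulli" logic from the spike-and-slab bullet can be simulated directly. The weight and scales below are illustrative values I made up, not the paper's calibration:

```python
import numpy as np

rng = np.random.default_rng(1)
T = 500
w = 0.1                        # probability of the "slab" (large-move) component
b_spike, b_slab = 0.01, 0.5    # illustrative Laplace scales

# One Bernoulli draw per period picks which Laplace generates that period's change
use_slab = rng.random(T) < w
scales = np.where(use_slab, b_slab, b_spike)
d_trend = rng.laplace(loc=0.0, scale=scales)

# Mostly near-zero changes, with occasional larger moves
print(np.mean(np.abs(d_trend) < 0.05), np.abs(d_trend).max())
```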
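On the PyMC question: one way to sidestep the scale-mixture-of-normals representation entirely is to hand `pm.Mixture` two Laplace components, where the weights are component probabilities (as in `NormalMixture`), not variances. This is a sketch of how such a model might be specified, using the simulated `d_trend` from the previous sketch and made-up priors; it is not the paper's Gibbs sampler.

```python
import pymc as pm

# Observed first differences of the trend; here the simulated series from above
d_trend_obs = d_trend

with pm.Model():
    # Probability that a given period's change comes from the wide ("slab") Laplace
    p_slab = pm.Beta("p_slab", alpha=1.0, beta=9.0)

    # Two Laplace components: a tight spike near zero and a wider slab
    b_spike = pm.HalfNormal("b_spike", sigma=0.05)
    b_slab = pm.HalfNormal("b_slab", sigma=1.0)
    components = [
        pm.Laplace.dist(mu=0.0, b=b_spike),
        pm.Laplace.dist(mu=0.0, b=b_slab),
    ]

    # Mixture weights are component probabilities, not variances
    pm.Mixture("d_trend", w=pm.math.stack([1.0 - p_slab, p_slab]),
               comp_dists=components, observed=d_trend_obs)

    idata = pm.sample()
```

Because `pm.Mixture` marginalizes out the discrete component indicator, NUTS can sample this model directly without a Gibbs step for the Bernoulli.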
References
- Really good discussion of why The Bayesian Lasso is not very good in practice
- Similar to why this paper even exists: the Bayesian Lasso can't properly balance between big and small values, so it ends up trying to compromise, which shrinks the larger values too much and doesn't shrink the small values enough
- The core reason for this is that a single Laplace prior distribution doesn't place enough probability mass near the spike at zero, relative to the mass in the tails, to ensure a reasonable level of sparsity (see the numerical illustration below)
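A quick numerical illustration of that trade-off (scales and thresholds chosen purely for illustration): tightening a single Laplace to concentrate mass near zero kills the tail mass needed for occasional large moves, while a two-component spike-and-slab mixture gets both at once.

```python
from scipy import stats

eps, big = 0.05, 1.0   # "near zero" and "large move" thresholds (illustrative)

for b in (0.05, 0.5):
    d = stats.laplace(scale=b)
    near_zero = d.cdf(eps) - d.cdf(-eps)
    tail = 2 * d.sf(big)
    print(f"single Laplace b={b}: P(|x|<{eps})={near_zero:.3f}, P(|x|>{big})={tail:.4f}")

# A 90/10 mixture of a tight and a wide Laplace gets both at once
spike, slab = stats.laplace(scale=0.02), stats.laplace(scale=1.0)
near_zero = 0.9 * (spike.cdf(eps) - spike.cdf(-eps)) + 0.1 * (slab.cdf(eps) - slab.cdf(-eps))
tail = 0.9 * 2 * spike.sf(big) + 0.1 * 2 * slab.sf(big)
print(f"spike-and-slab mixture: P(|x|<{eps})={near_zero:.3f}, P(|x|>{big})={tail:.4f}")
```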