Trend filtering
Trend filtering is the process of extracting a signal from a noisy time series. How “signal” and “noise” are defined depends on the particular trend filtering method in question. Any given method implicitly defines some notion of trend (the deterministic mean of the time series at a given point) and of cycle (typically modeled as an autoregressive process in the general case and as Gaussian white noise in the zero-autocorrelation limit). Importantly, this means the cycle will be stationary, while the trend may or may not be. Most trend filtering methods provide some way of trading off various concerns, primarily the degree of flexibility in the trend, summarized by the magnitude of its second differences, against the magnitude of the noise terms.
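In symbols, and glossing over differences between methods, the implicit model is a decomposition along these lines (the AR order $p$ and the Gaussian noise are stand-ins for whatever a particular method actually assumes):

$$
y_t = \tau_t + c_t, \qquad c_t = \sum_{i=1}^{p} \phi_i \, c_{t-i} + \varepsilon_t, \qquad \varepsilon_t \sim \mathcal{N}(0, \sigma^2),
$$

where $\tau_t$ is the trend and $c_t$ the cycle; setting every $\phi_i = 0$ recovers the Gaussian white noise limit.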
It’s helpful to interpret the trend filtering process in Bayesian terms. We can think of the tradeoff highlighted above as analogous to (in fact, equivalent to) the tradeoff between the likelihood and the prior in computing the Bayes posterior: the likelihood represents the noise term (the data is modeled as the trend plus an autoregressive cycle, so a high likelihood corresponds to small noise), the prior represents our beliefs about the characteristics of the signal (primarily its shape and flexibility), and the posterior represents our belief about the location of the trend after observing the data (the trend is the “parameter” for which we are conducting inference). By enforcing strong priors about the signal, we can extract signals that “look like” what we’d expect based on those prior beliefs.
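Concretely, a penalized trend filtering estimate can be read as the maximum a posteriori (MAP) estimate under the corresponding likelihood and prior:

$$
\hat{\tau} = \arg\max_{\tau} \; p(\tau \mid y) = \arg\max_{\tau} \; p(y \mid \tau)\, p(\tau) = \arg\min_{\tau} \; \big[ -\log p(y \mid \tau) - \log p(\tau) \big],
$$

so the noise model supplies the data-fit term of the objective and the prior on the trend supplies the penalty.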
In econometrics, researchers often reach for the Hodrick-Prescott (HP) filter, which assumes a white noise cycle and offers a penalty parameter that trades off the size of the squared second differences of the trend against the squared cycle terms. It would perhaps be more accurate to say that the HP filter doesn’t so much assume a white noise cycle as remain indifferent to the autocorrelation structure of the cycle, focusing only on its magnitude.
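For reference, the HP filter chooses the trend $\tau_t$ by solving the penalized least-squares problem

$$
\min_{\tau} \; \sum_{t=1}^{T} (y_t - \tau_t)^2 + \lambda \sum_{t=2}^{T-1} \big[ (\tau_{t+1} - \tau_t) - (\tau_t - \tau_{t-1}) \big]^2,
$$

where $\lambda$ is the penalty parameter. The cycle $c_t = y_t - \tau_t$ enters only through its squared magnitude, which is why its autocorrelation structure plays no role in the fit.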
However, this “non-assumption” has important implications for the trend extracted by the HP filter. By treating the cycle as white noise when in fact it may not be, the filter penalizes the whole value of the cycle rather than accounting for its autocorrelation, which would increase the deterministic component of the cycle and reduce its noise component. Penalizing the entire cycle pushes the model to estimate a small cycle, leaving more of the data to be explained by the trend. This increases the variance of the trend, making it less smooth and less stable. It also produces the common artifact of HP-filtered trends in which the trend “wobbles” gently up and down, even during periods of relative linearity in the source time series, creating the risk of spurious cycles (covered from a different angle in @hamiltonWhyYouShould2017). Increasing the penalty can help, but it runs the risk of introducing non-stationarity into the cycle, which is undesirable.
It should also be noted that the HP filter implicitly assumes normally distributed noise terms and normally distributed second differences of the signal, as can be inferred from the sum-of-squared-errors structure of the optimization problem. In that sense it is ridge regression in the time dimension, shrinking the second differences of the trend toward zero.
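To make the ridge analogy concrete, here is a minimal NumPy sketch (the helper name and the simulated series are illustrative, not from any particular source) that computes the HP trend from its closed-form ridge solution $\hat{\tau} = (I + \lambda D^\top D)^{-1} y$, where $D$ is the second-difference operator:

```python
import numpy as np

def hp_trend(y, lam=1600.0):
    """Illustrative helper: HP trend via the closed-form ridge solution
    tau_hat = (I + lam * D'D)^{-1} y, with D the second-difference operator."""
    y = np.asarray(y, dtype=float)
    n = y.size
    D = np.zeros((n - 2, n))
    for i in range(n - 2):
        D[i, i], D[i, i + 1], D[i, i + 2] = 1.0, -2.0, 1.0
    trend = np.linalg.solve(np.eye(n) + lam * D.T @ D, y)
    return trend, y - trend  # trend and residual cycle

# Example: a noisy series whose long-run slope changes once, at t = 100
rng = np.random.default_rng(0)
t = np.arange(200)
mean = np.where(t < 100, 0.5 * t, 50.0 + 0.1 * (t - 100))
y = mean + rng.normal(scale=3.0, size=t.size)
trend, cycle = hp_trend(y, lam=1600.0)
```

The dense solve is only for clarity; the system is sparse and banded, so larger series should use a sparse or banded solver (statsmodels also ships an `hpfilter` implementation).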
Now, it is well known that the ridge penalty shrinks toward zero but not all the way, leaving many non-zero terms. It is also known that the shrinkage happens fairly uniformly, essentially acting as a global shrinkage prior, which can lead to overshrinkage of truly large parameters and undershrinkage of small or zero-magnitude parameters. In the context of trend filtering, the effect is that the second differences will nearly always have some magnitude, even when zero may be more appropriate, and will tend to be too small when large values are in fact the better fit.
Why does this matter? If one takes the slope of the trend to be some measure of the “long-run” growth rate of the time series and expects this long-run growth rate to change infrequently (i.e. to be relatively constant), then the second differences of the trend should be zero most of the time, i.e. sparse. The normal prior on second differences unfortunately fails to achieve this. A Laplace prior / lasso penalization is also insufficient: although it is more likely to produce exact zeros, its tails are not heavy enough, leading to overshrinkage of large values. Further, a fully Bayesian approach under these priors won’t yield exact zeros, since a continuous prior assigns zero probability to the exact value zero.
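To see the contrast with the quadratic penalty, here is a CVXPY sketch of the lasso-penalized variant, i.e. $\ell_1$ trend filtering, the MAP estimate under a Laplace prior on second differences (the helper name and penalty value are illustrative). It does set many second differences to exactly zero (up to solver tolerance), producing a piecewise-linear trend, even though its light tails still overshrink genuinely large slope changes:

```python
import cvxpy as cp
import numpy as np

def l1_trend(y, lam=50.0):
    """Illustrative helper: l1 (lasso) penalty on second differences,
    giving a piecewise-linear trend with sparse changes in slope."""
    y = np.asarray(y, dtype=float)
    n = y.size
    D = np.zeros((n - 2, n))
    for i in range(n - 2):
        D[i, i], D[i, i + 1], D[i, i + 2] = 1.0, -2.0, 1.0
    tau = cp.Variable(n)
    problem = cp.Problem(cp.Minimize(cp.sum_squares(y - tau) + lam * cp.norm1(D @ tau)))
    problem.solve()
    return tau.value
```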
@crumpSparseTrendEstimation2023 suggests alternative priors, opting for a spike-and-slab approach. Alternatives include the normal-gamma prior (of which the Laplace is a special case) and the horseshoe prior (@bhadraHorseshoeEstimatorUltraSparse2015, @bhadraLassoMeetsHorseshoe2019). Any of these will do; the main requirements are a large spike of probability mass at (or near) zero and reasonably heavy tails.
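As one illustration of what such a prior looks like in a model, here is a minimal PyMC sketch with a horseshoe prior on the trend’s second differences. This is not the specification of @crumpSparseTrendEstimation2023; the variable names, hyperparameter scales, and the white noise likelihood are all assumptions made for the example.

```python
import pymc as pm
import pytensor.tensor as pt

def horseshoe_trend_model(y):
    """Illustrative sketch: horseshoe prior on the trend's second differences.
    All scale choices below are arbitrary and should be tuned to the data."""
    n = len(y)
    with pm.Model() as model:
        # Horseshoe: a global scale times per-period local scales, both half-Cauchy
        global_scale = pm.HalfCauchy("global_scale", beta=0.1)
        local_scale = pm.HalfCauchy("local_scale", beta=1.0, shape=n - 2)
        d2 = pm.Normal("d2", mu=0.0, sigma=global_scale * local_scale, shape=n - 2)

        # Rebuild the trend from an initial level, an initial slope, and the
        # cumulative sums of the (mostly near-zero) second differences
        level0 = pm.Normal("level0", mu=float(y[0]), sigma=10.0)
        slope0 = pm.Normal("slope0", mu=0.0, sigma=1.0)
        slopes = slope0 + pt.concatenate([pt.zeros(1), pt.cumsum(d2)])     # length n - 1
        trend = level0 + pt.concatenate([pt.zeros(1), pt.cumsum(slopes)])  # length n
        pm.Deterministic("trend", trend)

        # White-noise cycle / observation noise (an assumption of this sketch)
        sigma = pm.HalfNormal("sigma", sigma=1.0)
        pm.Normal("obs", mu=trend, sigma=sigma, observed=y)
    return model

# Usage: with horseshoe_trend_model(y): idata = pm.sample()
```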