Priors in Time: Missing Inductive Biases for Language Model Interpretability

Ekdeep Singh Lubana, Can Rager, Sai Sumedh R. Hindupur, Valerie Costa, Greta Tuckute, Oam Patel, Sonia Krishna Murthy, Thomas Fel, Daniel Wurgaft, Eric J. Bigelow, Johnny Lin, Demba Ba, Martin Wattenberg, Fernanda Viegas, Melanie Weber, Aaron Mueller – 2025

Seems like a super important acknowledgement that SAEs until this point have assumed iid data, with no temporal dependency or non-stationarity. A handful of plots across representative language datasets disproves this notion entirely. So a core prior embedded within the SAE approach is essentially false. Now, all priors are false to some extent, so that alone is not necessarily a critical error. SAEs themselves embed a prior of sparsity, which is probably correct, but we can’t know the exact right level of that sparsity, so that prior will be “wrong” in some sense. However, that kind of error seems much more benign than falsely assuming the data is iid when it is not.

I think some concepts need to be disentangled here (no pun intended):

  • Whether the data is iid
  • Whether the data is non-stationary

The data could be non-iid, but conditionally iid once you control for past information. Or, the data could have temporal dependency and dynamics but still be stationary (a standard AR(1) process with persistence below 1, for example).
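
To make that second case concrete, here is a quick simulation (my own sketch, nothing from the paper): an AR(1) series with persistence 0.9 is strongly autocorrelated, so it is clearly not iid, yet its marginal mean and variance settle down rather than drifting, so it is stationary.

```python
import numpy as np

# Hedged illustration (mine, not the paper's): an AR(1) process x_t = phi * x_{t-1} + eps_t
# with |phi| < 1 is temporally dependent -- clearly not iid -- yet covariance-stationary.
rng = np.random.default_rng(0)
phi, T, n_series = 0.9, 2000, 500

x = np.zeros((n_series, T))
for t in range(1, T):
    x[:, t] = phi * x[:, t - 1] + rng.standard_normal(n_series)

# Temporal dependence: the lag-1 autocorrelation is near phi, far from the ~0 of an iid series.
lag1_corr = np.corrcoef(x[:, 1:-1].ravel(), x[:, 2:].ravel())[0, 1]
print(f"lag-1 autocorrelation: {lag1_corr:.2f} (iid would be near 0)")

# Stationarity: the cross-sectional variance settles near the theoretical 1 / (1 - phi^2)
# instead of growing without bound (as it would with a unit root, phi = 1).
print(f"variance at t=500:  {x[:, 500].var():.2f}")
print(f"variance at t=1999: {x[:, -1].var():.2f}")
print(f"theoretical stationary variance: {1 / (1 - phi**2):.2f}")
```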

I think this paper is saying both that the data is not iid and that the data is non-stationary. But again, those are two separate and independently important points, so one should be careful not to conflate them.

In the end, the authors settle on a methodology they call Temporal Feature Analysis.

In computational neuroscience, when analyzing data from dynamical domains (e.g., audio, language, or video), a commonly made assumption is that there is contextual information present in the recent history that informs the next state—this part of the signal is deemed predictable, slow-changing, invariant, or dense. Meanwhile, the remaining signal corresponds to new bits of information added by the observed state at the next timestep—this part can be deemed novel, fast-changing, variant, or sparse with respect to the context. We argue LM activations are amenable to a similar generative model.
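
Here is how I picture that generative story in code. This is only a minimal sketch of the general idea, not the paper's actual Temporal Feature Analysis procedure: fit a linear predictor of the current activation from the previous one, call the prediction the predictable/dense part, and call the residual the novel part.

```python
import numpy as np

# My minimal sketch of the decomposition described in the quote -- an illustration only,
# not the paper's Temporal Feature Analysis. Given a sequence of (here simulated) LM
# activations, split each step into a part predictable from the previous step and a
# residual carrying the new information.
rng = np.random.default_rng(0)
T, d = 5000, 32
A = 0.95 * np.linalg.qr(rng.standard_normal((d, d)))[0]   # slow, contractive linear dynamics
acts = np.zeros((T, d))
for t in range(1, T):
    acts[t] = acts[t - 1] @ A.T + 0.3 * rng.standard_normal(d)

X, Y = acts[:-1], acts[1:]                        # previous step -> current step
W = np.linalg.lstsq(X, Y, rcond=None)[0]          # least-squares one-step predictor
predictable = X @ W                               # slow / dense / context-driven component
novel = Y - predictable                           # fast / novel-information component

explained = 1 - novel.var() / Y.var()
print(f"share of activation variance carried by the predictable part: {explained:.2f}")
```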

I love the additional words they use to describe something that I’m mostly familiar with from my reading of the time series econometrics literature. There’s some risk of mixing up concepts that mean different things in different contexts, but in general I think this added richness in vocabulary is helpful.

Various ways of thinking about this in econometrics:

  • Deterministic component vs random component
  • (Slow-moving) State variables
  • Permanent vs transitory component

Again, I’m sensitive to the fact that these aren’t all actually the same thing. But they scratch at similar ideas and help provide more of a 360-degree view on a certain cluster of ideas: within a single step of a time series, there exists a component that is more “durable”, “slow to change”, “predictable”, “temporally dependent”, “endogenous”, “persistent”, potentially “non-stationary”, etc. And there’s another component that is “fast moving”, “exogenous”, “transitory”, “stationary”, “iid”, “white noise”-like, “random”, “unpredictable”, “noisy”, etc.
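
A toy simulation of the permanent/transitory framing (again my own illustration, nothing from the paper): a series built as a random walk plus white noise, where even a crude smoother pulls the durable component apart from the transitory one.

```python
import numpy as np

# Toy illustration (not from the paper) of the permanent/transitory split:
# y_t = tau_t + c_t, where tau_t is a random walk (durable, slow, non-stationary)
# and c_t is white noise (transitory, iid, unpredictable).
rng = np.random.default_rng(1)
T = 3000
tau = np.cumsum(0.1 * rng.standard_normal(T))   # permanent component (random walk)
c = rng.standard_normal(T)                      # transitory component (white noise)
y = tau + c

# Crude recovery of the slow component with an exponential moving average; a real
# treatment would use e.g. Beveridge-Nelson or an unobserved-components model.
alpha = 0.05
tau_hat = np.zeros(T)
for t in range(1, T):
    tau_hat[t] = (1 - alpha) * tau_hat[t - 1] + alpha * y[t]

corr = np.corrcoef(tau_hat[500:], tau[500:])[0, 1]
print(f"correlation of smoothed series with the true permanent component: {corr:.2f}")
```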

Sidebar: the points made in this paper help bring home an idea I’ve encountered elsewhere: applying sparsity methods to time series data is tricky and requires care. See @adamekLassoInferenceHighDimensional2022 (“lasso cannot be applied off the shelf to time series data, as the data are not IID.”)

