Exogeneity

The easiest way to think about exogeneity is in terms of correlation with the error term: to be exogenous is to be uncorrelated with the error term. There are other definitions as well, but I find them confusing in various scenarios. The correlation-with-the-error-term definition is the only one I’ve found to be universally applicable in a non-confusing way. If you can come up with a plausible story for how your regressor of interest is uncorrelated with the error term, you’re fine.

And by error term – you really don’t need to think of anything more complicated than “other things that affect the outcome variable that aren’t included among my regressors”. It could be literally anything. Once you frame it this way, it becomes clear how difficult a standard this really is, and how flawed most research must therefore be. Nothing that correlates with your regressor of interest also correlates with the outcome variable? Damn – must be a very special variable (or just total noise)!

Exogeneity is often written as:

$$E[x_i \varepsilon_i] = 0$$

which I used to find super confusing. An easy way of thinking about this is to remember the formula for covariance:

$$\mathrm{Cov}(x_i, \varepsilon_i) = E[x_i \varepsilon_i] - E[x_i]E[\varepsilon_i]$$

Notice that the second term multiplies the two respective expected values. If either is zero, that term goes to zero, and the covariance is simply the expectation of the product of the two variables. In an OLS context, the error term has an expectation of zero by definition, so $E[x_i \varepsilon_i] = 0$ is equivalent to the covariance between the regressor and the error term being zero. So, exogeneity quite literally means that the covariance of the variable and the error term is zero.
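As a quick sanity check of the covariance identity above (a simulated sketch – all variable names are illustrative), note that OLS residuals are orthogonal to the regressors by construction, so with an intercept included, the expectation of the product and the covariance coincide and both come out at zero:

```python
import numpy as np

# Simulated data; names are illustrative, not from any particular dataset.
rng = np.random.default_rng(0)
n = 10_000
x = rng.normal(size=n)
y = 2.0 + 1.5 * x + rng.normal(size=n)

# Fit OLS with an intercept.
X = np.column_stack([np.ones(n), x])
beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
resid = y - X @ beta_hat

# Residuals have mean ~0 (intercept included), so E[x * resid]
# equals Cov(x, resid), and both are ~0 by the orthogonality of OLS.
mean_resid = resid.mean()
e_x_resid = np.mean(x * resid)
cov_x_resid = np.cov(x, resid, ddof=0)[0, 1]
print(mean_resid, e_x_resid, cov_x_resid)
```

Of course, this only demonstrates the mechanical property of residuals; the substantive exogeneity assumption is about the unobserved error term, which no regression output can verify.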

Note that the exogeneity condition can also be expressed in conditional terms – so it’s fine if the raw regressor is not exogenous, as long as it is after conditioning on controls.

Now, some people will write exogeneity this way:

$$E[\varepsilon_i \mid X_i] = 0$$

where $X_i$ is a set of controls. This is called the conditional mean assumption. What this is saying is that the expectation of the error term does not depend on the conditioning set. If it did, those variables would appear somewhere on the RHS. Instead, the expectation is a constant (zero in this case, as you’d estimate in a regression), which under linearity by definition means it doesn’t depend on these other variables.
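A simulated sketch of conditioning at work (all names and coefficients are made up for illustration): here a confounder `w` drives both the regressor `x` and the outcome, so `x` is endogenous on its own, but after conditioning on `w` the remaining error is unrelated to `x` and OLS recovers the true coefficient:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50_000
w = rng.normal(size=n)             # confounder / control
x = 0.8 * w + rng.normal(size=n)   # x depends on w
y = 1.0 * x + 2.0 * w + rng.normal(size=n)  # true effect of x is 1.0

# Short regression (omit w): the error term contains 2w, which
# correlates with x, so the coefficient on x is biased.
X_short = np.column_stack([np.ones(n), x])
b_short = np.linalg.lstsq(X_short, y, rcond=None)[0][1]

# Long regression (include w): conditional on w, the error is
# plausibly unrelated to x, and the estimate lands near 1.0.
X_long = np.column_stack([np.ones(n), x, w])
b_long = np.linalg.lstsq(X_long, y, rcond=None)[0][1]
print(b_short, b_long)
```

In this setup the short-regression bias is exactly the omitted-variable-bias formula: the coefficient on `x` picks up $2 \cdot \mathrm{Cov}(x, w)/\mathrm{Var}(x)$ on top of the true effect.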

Another thing to note about all these definitions – they deal with the actual (sometimes unobserved), true values of the variables in question, not their estimated quantities, which is why you don’t see any “hats” above the variables. This is of course not verifiable in practice when the variables are unobserved, so much of the time exogeneity is an assumption you are making, and these are ways of being explicit about what that assumption entails. They are not necessarily testable assumptions.

It’s interesting to note that exogeneity is particular to the dependent variable in question. The same RHS variable can be exogenous or endogenous depending on the context, i.e. depending on the outcome variable. Thus exogeneity is “context dependent”. This is important because it means that if your outcome variable is difficult to forecast, or seems to evolve independently of most other things, then most RHS variables you might want to use will be exogenous with respect to it.

It’s interesting to think about exogeneity in the time series context. On some level, not controlling for lags of an outcome variable that has temporal autocorrelation creates an omitted variable bias problem, to the degree that those lags are correlated with the independent variables in your regression.
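A simulated illustration of that point (parameters and names are hypothetical): the outcome follows an AR(1), a regressor `x` tracks the lagged outcome but has no true effect, and omitting the lag pushes its influence into the error term, producing a spurious coefficient on `x`:

```python
import numpy as np

rng = np.random.default_rng(2)
T = 100_000
rho = 0.9        # persistence of the outcome
y = np.zeros(T)
x = np.zeros(T)
for t in range(1, T):
    x[t] = 0.5 * y[t - 1] + rng.normal()  # x correlates with lagged y
    y[t] = rho * y[t - 1] + rng.normal()  # x has NO true effect on y

# Without the lag: the omitted y_{t-1} sits in the error and
# correlates with x, so the coefficient on x is spuriously large.
X_no_lag = np.column_stack([np.ones(T - 1), x[1:]])
b_no_lag = np.linalg.lstsq(X_no_lag, y[1:], rcond=None)[0][1]

# With the lag included: the coefficient on x collapses toward zero.
X_lag = np.column_stack([np.ones(T - 1), x[1:], y[:-1]])
b_lag = np.linalg.lstsq(X_lag, y[1:], rcond=None)[0][1]
print(b_no_lag, b_lag)
```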

One way of thinking about exogeneity that I find quite intuitive is to think about two variables that are causally influenced by various structural shocks. Consider one particular structural shock – if that single shock affects both variables, then those variables are endogenous to one another. Why? They both share underlying structural causes (other than themselves). Whenever two variables share underlying causes, they are endogenous. When you think about it this way, it becomes quite clear that most variables you’d think to analyze will be endogenous, because most things that you would ever think to compare share some underlying causes. It’s very similar to an SVAR, where you have a matrix mapping structural shocks to the observed variables in the system. If that matrix is such that at least one structural shock affects multiple variables in the system, then those variables are endogenous. The only exception to this is when one of the variables in question is itself a structural / exogenous shock, in which case it’s fine that it affects multiple variables (itself plus others). It just needs to be the case that no other structural shock affects it.
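The shared-shock intuition can be sketched numerically (the matrix and shock count here are made up for illustration): three independent structural shocks map to two observed variables through a matrix, SVAR-style, and because one shock enters both rows, the two observables correlate even though neither causes the other:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100_000
shocks = rng.normal(size=(3, n))   # independent structural shocks s0, s1, s2

# Column 1 (the shared shock s1) loads on BOTH observed variables.
A = np.array([[1.0, 0.7, 0.0],     # var1 <- s0 and shared shock s1
              [0.0, 0.7, 1.0]])    # var2 <- shared shock s1 and s2
observed = A @ shocks

# Nonzero correlation with no direct causal link between the variables:
# they are endogenous to one another via the shared underlying cause.
corr = np.corrcoef(observed)[0, 1]
print(corr)
```

Zeroing out one of the two 0.7 loadings breaks the shared cause, and the correlation collapses to (sampling) zero.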


Another way to think about exogeneity is in terms of systems. Things that are exogenous are those that come from outside the system and are not affected by the system itself. Otherwise, the object is endogenous – it comes from the system itself, is affected by the system, is determined within the system, etc.

We can think about any collection of objects as being clustered into two categories – one broad system of related variables where there are direct or indirect two-way cause-and-effect relationships, and another cluster which doesn’t have these characteristics, either among its own members or with respect to the variables in the first cluster. Any relationships between objects in the first cluster (“the system”) and the second cluster are unidirectional – things outside the system affect the things inside the system, but not the other way around. And once again, things outside the system also do not affect each other.

Transclude of Exogeneity-2025-02-09-12.42.12.excalidraw

When we write down a regression equation, we are essentially defining and separating out the system from the things outside the system (assuming we want accurate coefficient estimates):

$$y = X\beta + \varepsilon$$

The $X\beta$ portion represents the system; the $\varepsilon$ portion represents the things outside the system. If we set things up this way, with everything endogenous represented in $X$ and everything exogenous represented in $\varepsilon$, then we get good estimates of $\beta$ via OLS (which is just a simple manipulation of the regression equation, solving for $\beta$). When we fail to do this, we get poor estimates.
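That “simple manipulation” can be sketched directly (simulated data, illustrative coefficients): premultiply $y = X\beta + \varepsilon$ by $X'$, drop the $X'\varepsilon$ term on the strength of exogeneity, and solve the resulting normal equations $(X'X)\beta = X'y$:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 5_000
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
beta_true = np.array([0.5, -1.0, 2.0])
y = X @ beta_true + rng.normal(size=n)   # error exogenous by construction

# Solve (X'X) b = X'y -- the OLS estimator, derived by the
# manipulation described above.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)
```

When the error really is exogenous, as it is by construction here, the solved coefficients land close to the true ones; the endogenous cases simulated earlier are exactly the cases where dropping $X'\varepsilon$ is not justified.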

Note that this is relevant when the $x$ in question is itself endogenous. If we want to estimate the effect of an exogenous variable, we can do so straightforwardly, because by definition that variable is exogenous to everything else (both things within the system and things outside it). We don’t actually need to control for anything in the simplest case, though there can be precision advantages to doing so.

This also relates to the notion of deterministic vs stochastic. In some sense the system itself is deterministic – with precise enough calculations, its evolution over time could be predicted perfectly, in the same way that the movement of a collection of gravitational objects could hypothetically be perfectly predicted from the laws of physics. There exist, from the perspective of the system, random sources of noise, variation, and inputs that perturb it and make its eventual dynamics not totally predictable ex ante. That these things can be separated is to some extent an assumption, most obviously laid out and explicit in the linear context (the $+\,\varepsilon$ term, which leaves $X\beta$ as the “deterministic” portion of $y$).

