The Gauss-Markov Theorem (pt. 1)

Anyone who runs a two-variable linear regression nowadays probably clicks a button or types a command in some application (R, S-PLUS, Gretl, or Microsoft Excel, to name a few) and lets the method of ordinary least squares (OLS) do its magic. Seldom do we remember why we can use OLS to approximate our dependent variable. This post discusses the why.[1]

The math behind what follows comes from Damodar N. Gujarati's Basic Econometrics [2].

First, some background: Let variables Y and X be given such that their population regression function (PRF) is of the form $E(Y|X_i) = \beta_1 + \beta_2 X_i$, where $\beta_1$ is the y-intercept of the line and $\beta_2$ is the slope. For each value of $X_i$, $Y_i$ is found using the PRF for these two variables: $Y_i = E(Y|X_i) + u_i = \beta_1 + \beta_2X_i + u_i.$ That is to say, the value of $Y_i$ is its expected value conditioned on $X_i$ (positioned on the line), plus its disturbance $u_i$ (the distance from the population line to the actual $Y_i$ value).

In reality, we do not deal with entire populations of data. Rather, we spend our time working with samples of populations. Our task then is to estimate the PRF using the sample regression function (SRF), $\hat Y_i = \hat\beta_1 + \hat\beta_2X_i$, where $\hat Y_i$ is an estimate of $E(Y|X_i)$, $\hat \beta_1$ is an estimate of $\beta_1$, and $\hat \beta_2$ is an estimate of $\beta_2$.

Similarly, $Y_i$ can be found: $Y_i = \hat \beta_1 + \hat \beta_2X_i + \hat u_i$, where $\hat u_i$ is the residual value between the SRF and the value of $Y_i$.

Notice that $\hat u_i = Y_i - \hat Y_i$. Thus our goal is to make each $\hat u_i$ as small as possible so that $\hat Y_i$ comes as close as possible to the actual value of $Y_i$. We cannot simply make the plain sum of all residuals as small as possible, however, because large positive and negative residuals can cancel each other out, leaving a small sum even when the fit is poor. But we gain a pretty good estimate of the PRF if $\sum{\hat u_i ^2}$ is as small as possible. Realize that $\sum{\hat u_i ^2} = \sum{(Y_i - \hat Y_i)^2} = \sum{(Y_i - \hat \beta_1 - \hat \beta_2 X_i)^2}$. Therefore, $\sum{\hat u_i^2} = f(\hat \beta_1, \hat \beta_2)$; that is, $\sum{\hat u_i^2}$ is some function of $\hat \beta_1$ and $\hat \beta_2$. So informally, the key to getting a good SRF lies in choosing good values for $\hat \beta_1$ and $\hat \beta_2$.

For reasons omitted from this post (they can be found in [2]), the $\hat \beta_1$ and $\hat \beta_2$ values we choose are: $\hat \beta_1 = \bar Y - \hat \beta_2 \bar X, \hspace{2pc} \hat \beta_2 = \frac{\sum{(X_i - \bar X)(Y_i - \bar Y)}}{\sum{(X_i - \bar X)^2}}$. But how do we know these are the best estimates, the ones that yield the best SRF? This is the "why" aspect I want to illustrate today. The Gauss-Markov theorem answers the question.
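These two formulas are easy to compute directly. Here is a quick sketch in Python using a small made-up sample (the numbers are hypothetical, chosen only for illustration):

```python
# Hypothetical sample data, made up purely for illustration.
X = [1.0, 2.0, 3.0, 4.0, 5.0]
Y = [2.1, 3.9, 6.2, 8.1, 9.8]

n = len(X)
x_bar = sum(X) / n
y_bar = sum(Y) / n

# beta2_hat = sum((X_i - X_bar)(Y_i - Y_bar)) / sum((X_i - X_bar)^2)
beta2_hat = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(X, Y)) \
          / sum((xi - x_bar) ** 2 for xi in X)

# beta1_hat = Y_bar - beta2_hat * X_bar
beta1_hat = y_bar - beta2_hat * x_bar

print(beta1_hat, beta2_hat)  # about 0.14 and 1.96 for this sample
```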

Gauss-Markov Theorem: Under the assumptions below, the estimators $\hat \beta_1$ and $\hat \beta_2$ are, within the class of linear unbiased estimators, the best linear unbiased estimators of $\beta_1$ and $\beta_2$ respectively. That is, they have minimum variance.

The assumptions are as follows:

1. The regression model is linear in the parameters.
2. $X$ values are fixed in repeated sampling.
3. Given the value of $X$, the expected value of the random disturbance term $u_i$ is zero: $E(u_i|X_i)=0$.
4. Given the value of $X$, there is homoscedasticity of $u_i$: $\textnormal{var}(u_i|X_i) = \sigma^2$, where var is variance.
5. There is no autocorrelation between the disturbances: $\textnormal{cov}(u_i, u_j|X_i, X_j) = 0$, where cov is covariance.
6. There is zero covariance between $u_i$ and $X_i$: $\textnormal{cov}(u_i, X_i)=0$.
7. The number of observations must be greater than the number of the parameters to be estimated.
8. There is variability in the values of $X$: $\infty > \textnormal{var}(X)>0$.
9. The regression model is correctly specified—without specification bias or error.
10. There are no perfect linear relationships among the explanatory variables (no perfect multicollinearity).
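To get a feel for what these assumptions buy us, here is a small Python simulation (all parameter values are hypothetical) that holds the $X$ values fixed across repeated samples (assumption 2) and draws mean-zero, homoscedastic, uncorrelated normal disturbances (assumptions 3 through 5). Averaged over many samples, the OLS slope lands very close to the true $\beta_2$, a taste of the unbiasedness the theorem promises:

```python
import random

random.seed(0)

# Hypothetical "true" population parameters for this simulation.
beta1, beta2, sigma = 2.0, 0.5, 1.0

# X values are held fixed across repeated samples (assumption 2).
X = [float(x) for x in range(1, 21)]
x_bar = sum(X) / len(X)
Sxx = sum((xi - x_bar) ** 2 for xi in X)

def ols_slope(Y):
    """OLS slope estimate for the fixed X values above."""
    y_bar = sum(Y) / len(Y)
    return sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(X, Y)) / Sxx

# Repeated sampling: each sample draws fresh disturbances u_i that are
# mean-zero, homoscedastic, and uncorrelated (assumptions 3-5).
slopes = []
for _ in range(5000):
    Y = [beta1 + beta2 * xi + random.gauss(0.0, sigma) for xi in X]
    slopes.append(ols_slope(Y))

avg_slope = sum(slopes) / len(slopes)
print(avg_slope)  # close to the true beta2 = 0.5
```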

Proof: I illustrate the $\hat \beta_2$ case (the $\hat \beta_1$ case follows by similar reasoning). That is, I show $\hat \beta_2$ is linear, unbiased, and has minimum variance. This proof will be broken into multiple posts. I prove linearity today:

Define $k_i = \frac{x_i}{\sum{x_i^2}}$, where $x_i = X_i - \bar X$. Writing the slope formula in terms of these deviations (with $y_i = Y_i - \bar Y$), we have $\hat \beta_2 = \frac{\sum{x_i y_i}}{\sum{x_i^2}} = \sum{k_i y_i} = \sum{k_i Y_i} - \bar Y \sum{k_i} = \sum{k_i Y_i}$, since the deviations $x_i$ (and hence the $k_i$) sum to zero. Thus $\hat \beta_2$ is linear, since it is a linear function of $Y$.

EDIT: I realize that the reason why $\hat \beta_2$ is linear is actually not so clear. I wish to explain this a little further: We can determine the value of each $k_i$, since we know $X_i$ and $\bar X$. But $Y_i$ is assumed to be random, so taken together, we can think of $\hat \beta_2 = \sum k_i Y_i$ as a linear function of $Y$.
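A quick numerical check (reusing the same kind of made-up sample as before) confirms the point: the weights $k_i$ depend only on the $X$ values and sum to zero, and the weighted sum $\sum k_i Y_i$ reproduces the usual slope formula exactly:

```python
# Hypothetical sample data, made up purely for illustration.
X = [1.0, 2.0, 3.0, 4.0, 5.0]
Y = [2.1, 3.9, 6.2, 8.1, 9.8]

x_bar = sum(X) / len(X)
y_bar = sum(Y) / len(Y)
Sxx = sum((xi - x_bar) ** 2 for xi in X)

# Weights k_i = x_i / sum(x_i^2), where x_i = X_i - X_bar.
# They depend only on the (fixed) X values, not on Y.
k = [(xi - x_bar) / Sxx for xi in X]
print(sum(k))  # 0, up to floating-point rounding

# beta2_hat written as a weighted (hence linear) sum of the Y_i...
beta2_linear = sum(ki * yi for ki, yi in zip(k, Y))

# ...agrees with the usual slope formula.
beta2_hat = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(X, Y)) / Sxx
print(beta2_linear, beta2_hat)  # equal, up to rounding
```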

I will continue this proof in my next post!

[1] Familiarity with the basics of linear regression is assumed for this post.

[2] Gujarati, Damodar N. Basic Econometrics, 4th ed. McGraw-Hill, 2003.