The Gauss-Markov Theorem (pt. 1)

Anyone who runs a two-variable linear regression nowadays probably clicks a button or types a line in some application—R, S-PLUS, Gretl, or Microsoft Excel, to name a few—and lets the method of ordinary least squares (OLS) do its magic. Seldom do we remember why we can use OLS to approximate our dependent variable. This post discusses the why.[1]

The math behind the following comes from Damodar N. Gujarati’s Basic Econometrics [2].

First, some background: Let variables Y and X be given such that their population regression function (PRF) is of the form E(Y|X_i) = \beta_1 + \beta_2 X_i, where \beta_1 is the y-intercept of the line and \beta_2 is its slope. For each value of X_i, the corresponding Y_i is given by the PRF: Y_i = E(Y|X_i) + u_i = \beta_1 + \beta_2 X_i + u_i. That is to say, the value of Y_i is its expected value conditioned on X_i (a point on the line), plus the stochastic disturbance u_i (the vertical distance from the population line to the actual value of Y_i).
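As a quick illustration, here is how one might simulate data from such a PRF. This is only a sketch: the parameter values, noise scale, and range of X below are made up, and numpy is assumed.

```python
import numpy as np

# Hypothetical population parameters, chosen for the sketch
beta1, beta2 = 2.0, 0.5

rng = np.random.default_rng(0)
X = np.linspace(0.0, 10.0, 200)
u = rng.normal(0.0, 1.0, size=X.size)  # disturbances with E(u_i | X_i) = 0

# Each Y_i is its conditional expectation E(Y | X_i) plus the disturbance u_i
E_Y_given_X = beta1 + beta2 * X
Y = E_Y_given_X + u
```

Plotting Y against X would show the points scattered around the population line, each point's vertical distance from the line being its u_i.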

In reality, we do not deal with entire populations of data. Rather, we spend our time working with samples of populations. Our task then is to estimate the PRF using the sample regression function (SRF), \hat Y_i = \hat\beta_1 + \hat\beta_2X_i, where \hat Y_i is an estimate of E(Y|X_i), \hat \beta_1 is an estimate of \beta_1, and \hat \beta_2 is an estimate of \beta_2.

Similarly, Y_i can be written Y_i = \hat \beta_1 + \hat \beta_2X_i + \hat u_i, where the residual \hat u_i is the difference between Y_i and the value fitted by the SRF.

Notice that \hat u_i = Y_i - \hat Y_i. Our goal, then, is to make the residuals as small as possible so that \hat Y_i comes as close as possible to the actual value of Y_i. It turns out, however, that we cannot simply minimize the plain sum of the residuals: large positive and negative residuals can cancel, leaving the sum small even when the fit is poor. Instead, we get a pretty good estimate of the PRF if \sum{\hat u_i ^2} is as small as possible. Realize that \sum{\hat u_i ^2} = \sum{(Y_i - \hat Y_i)^2} = \sum{(Y_i - \hat \beta_1 - \hat \beta_2 X_i)^2}. Therefore, \sum{\hat u_i^2} = f(\hat \beta_1, \hat \beta_2); that is, the sum of squared residuals is a function of \hat \beta_1 and \hat \beta_2. So informally, the key to getting a good SRF lies in choosing good values for \hat \beta_1 and \hat \beta_2.
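To see this minimization in action, here is a minimal numpy sketch. The data are made up for illustration, and numpy's off-the-shelf least-squares routine (`polyfit`) stands in for the "button click" from the introduction.

```python
import numpy as np

# Hypothetical toy sample (made-up numbers, purely for illustration)
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])

def ssr(b1, b2):
    """f(b1, b2) = sum of squared residuals for the candidate line b1 + b2*X."""
    return np.sum((Y - b1 - b2 * X) ** 2)

# A library least-squares fit picks the pair (b1, b2) that minimizes f;
# perturbing either coefficient makes the sum of squared residuals worse.
b2_hat, b1_hat = np.polyfit(X, Y, deg=1)
```

Evaluating `ssr` at `(b1_hat, b2_hat)` and at nearby coefficient pairs confirms that the fitted pair attains the smaller value.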

For reasons omitted from this post (they can be found in [2]), the \hat \beta_1 and \hat \beta_2 values we choose are: \hat \beta_1 = \bar Y - \hat \beta_2 \bar X \hspace{2pc} \hat \beta_2 = \frac{\sum{(X_i - \bar X)(Y_i - \bar Y)}}{\sum{(X_i - \bar X)^2}}. But how do we know these estimates yield the best SRF? This is the “why” I want to illustrate today, and the Gauss-Markov theorem answers the question.
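To make the formulas concrete, here is a small numpy sketch (made-up data) that computes \hat \beta_1 and \hat \beta_2 directly from the expressions above and checks them against a library least-squares fit:

```python
import numpy as np

# Hypothetical toy sample (made-up numbers, purely for illustration)
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])

# Closed-form OLS estimators from the text
b2_hat = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
b1_hat = Y.mean() - b2_hat * X.mean()

# They agree with numpy's built-in least-squares fit
slope, intercept = np.polyfit(X, Y, deg=1)
print(np.isclose(b2_hat, slope), np.isclose(b1_hat, intercept))  # True True
```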

Gauss-Markov Theorem: Under the assumptions below, the estimators \hat \beta_1 and \hat \beta_2 are the best linear unbiased estimators (BLUE) of \beta_1 and \beta_2 respectively; that is, among all linear unbiased estimators, they have minimum variance.

The assumptions are as follows:

  1. The regression model is linear in the parameters.
  2. X values are fixed in repeated sampling.
  3. Given the value of X, the expected value of the random disturbance term u_i is zero: E(u_i|X_i)=0.
  4. Given the value of X, there is homoscedasticity of u_i: \textnormal{var}(u_i|X_i) = \sigma^2, where var is variance.
  5. There is no autocorrelation between the disturbances: \textnormal{cov}(u_i, u_j|X_i, X_j) = 0, where cov is covariance.
  6. There is zero covariance between u_i and X_i: \textnormal{cov}(u_i, X_i)=0.
  7. The number of observations must be greater than the number of the parameters to be estimated.
  8. There is variability in the values of X: \infty > \textnormal{var}(X)>0.
  9. The regression model is correctly specified—without specification bias or error.
  10. There is no perfect linear relationships among the explanatory variables (multicollinearity).

Proof: I illustrate the \hat \beta_2 case (the \hat \beta_1 case follows by similar reasoning). That is, I show that \hat \beta_2 is linear, unbiased, and has minimum variance. The proof will be broken into multiple posts; today I prove linearity:

Define k_i = \frac{x_i}{\sum{x_i^2}}, where x_i = X_i - \bar X. Substituting into the formula for \hat \beta_2 above (and using the fact that \sum{x_i} = 0, so the \bar Y term drops out of the numerator), it follows that \hat \beta_2 = \sum{k_i Y_i}. Thus \hat \beta_2 is linear, since it is a linear function of the Y_i.

EDIT: I realize that the reason \hat \beta_2 is linear may not be so clear, so let me explain a little further: each weight k_i is a fixed, known number, since we know X_i and \bar X. The Y_i, on the other hand, are random. Putting these together, \hat \beta_2 = \sum{k_i Y_i} is a linear function of the Y_i.
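The identity \hat \beta_2 = \sum{k_i Y_i} is also easy to check numerically. A minimal sketch with made-up data, assuming numpy:

```python
import numpy as np

# Hypothetical toy sample (made-up numbers, purely for illustration)
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])

x = X - X.mean()         # deviations x_i = X_i - X-bar
k = x / np.sum(x ** 2)   # fixed weights k_i (they depend only on the X's)

# beta2-hat written as a linear function of the Y_i's ...
b2_linear = np.sum(k * Y)

# ... equals the usual deviation formula for beta2-hat
b2_formula = np.sum(x * (Y - Y.mean())) / np.sum(x ** 2)
print(np.isclose(b2_linear, b2_formula))  # True
```

The two expressions coincide exactly because \sum{x_i} = 0, which makes the \bar Y term vanish from the deviation formula.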

I will continue this proof in my next post!

[1] Familiarity with the basics of linear regression is assumed for this post.

[2] Gujarati, Damodar N. Basic Econometrics, 4th ed. McGraw-Hill, 2003.

