library(HistData)
dim(GaltonFamilies)
## [1] 934 8
names(GaltonFamilies)
## [1] "family" "father" "mother" "midparentHeight"
## [5] "children" "childNum" "gender" "childHeight"
# Scatter plot of child height (column 8) against mid-parent height (column 4),
# with the fitted least squares line added
plot(GaltonFamilies[, c(4, 8)], pch = 16, col = "blue")
abline(lm(childHeight ~ midparentHeight, data = GaltonFamilies))
If \(P\) is an \(n \times n\) symmetric matrix satisfying \(P\cdot P=P^2=P\), we call \(P\) a projection matrix.
\(I-P=(I-P)^2\) is also a projection matrix, with \(P(I-P)=(I-P)P=0\), where \(I\) is the \(n\times n\) identity matrix.
Each eigenvalue of \(P\) is either 0 or 1, and the number of nonzero eigenvalues equals the rank of \(P\). Hence the spectral decomposition of \(P\) is \[ P =Q^T \left[ \begin{array}{cc} I_{r\times r} & \mathbf{0} \\ \mathbf{0} & \mathbf{0}_{(n-r)\times(n-r)} \end{array} \right]Q \] where \(r =\mathrm{rank} (P)\), \(I_{r\times r}\) is the \(r \times r\) identity matrix and \(Q\) is an orthogonal matrix with \(Q^TQ= QQ^T=I_{n\times n}\).
For an \(n\times n\) symmetric matrix \(A=A^T\) with eigenvalues \(\lambda_1,\ldots, \lambda_n\), we have \[\lambda_1+\lambda_2+\cdots+\lambda_n= a_{11}+a_{22}+\cdots+a_{nn}\hat{=}\mbox{Trace}(A).\] Hence, for a projection matrix \(P\), the sum of its eigenvalues satisfies \[\lambda_1+\lambda_2+\cdots+\lambda_n = \mbox{Trace}(P) =\mbox{rank}(P)\]
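These eigenvalue and trace properties can be checked numerically. A minimal R sketch, using an arbitrary full-rank \(6\times 2\) matrix (the entries are illustrative, not from the lecture data):

```r
# Build a projection matrix P = X (X^T X)^{-1} X^T from an arbitrary X
set.seed(1)
X <- cbind(1, rnorm(6))                 # n = 6, p = 2, rank(X) = 2
P <- X %*% solve(t(X) %*% X) %*% t(X)

# P is symmetric and idempotent: P = P^T and P^2 = P
stopifnot(isTRUE(all.equal(P, t(P))))
stopifnot(isTRUE(all.equal(P %*% P, P)))

# Eigenvalues are all 0 or 1, and Trace(P) = rank(P) = 2
ev <- eigen(P, symmetric = TRUE)$values
stopifnot(all(abs(ev - round(ev)) < 1e-10))
stopifnot(isTRUE(all.equal(sum(diag(P)), 2)))
```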
Consider \[ \mathbf{X}=\left[\begin{array}{cccc} x_{11} & x_{12} & \cdots & x_{1p} \\ x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & \vdots & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{np} \end{array} \right] =[\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_p] \] where \(n >p\). The projection matrix onto the column space of \(\mathbf{X}\), i.e., the space spanned by the linear combinations of \(\mathbf{x}_1,\mathbf{x}_2,\ldots, \mathbf{x}_p\), is given by \[ P_{\mathbf{x}} =\mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T \]
If \(\mathbf{X}\) is an \(n \times p\) matrix with \(\mbox{rank}(\mathbf{X} )=p\), then \(\mbox{rank}(P_{\mathbf{x}})=p\).
The eigenvalues of \(P_{\mathbf{x}}\) consist of \(p\) ones and \(n-p\) zeros, while the eigenvalues of \(I-P_{\mathbf{x}}\) consist of \(n-p\) ones and \(p\) zeros.
Let \(\mathbf{Z}=\sum_{i=1}^p a_i \mathbf{x}_i\) be any linear combination of \((\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_p)\), where \(a_i,i=1,\ldots,p\) are any constant values. Then \(\mathbf{Z}\) is invariant under \(P_{\mathbf{x}}\): \[P_{\mathbf{x}}\mathbf{Z}=\mathbf{Z}\] and \[ (I_{n\times n}-P_{\mathbf{x}})\mathbf{Z}=0 \]
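The invariance property can also be seen numerically. A small R sketch, with an arbitrary \(\mathbf{X}\) and arbitrary coefficients (all values are illustrative):

```r
# Any vector Z in the column space of X is left unchanged by P_x
set.seed(2)
X <- cbind(1, rnorm(5))                 # arbitrary 5 x 2 full-rank matrix
P <- X %*% solve(t(X) %*% X) %*% t(X)   # projection onto col(X)
Z <- X %*% c(3, -2)                     # Z = 3*x_1 - 2*x_2, a linear combination

stopifnot(isTRUE(all.equal(P %*% Z, Z)))           # P_x Z = Z
stopifnot(max(abs((diag(5) - P) %*% Z)) < 1e-10)   # (I - P_x) Z = 0
```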
Notice that the simple linear regression model can be rewritten in matrix form as in Lecture Note 2, page 15. Then the least squares estimate of \((\beta_0,\beta_1)\) is the solution of \[\hat{\beta}=(\hat{\beta}_0,\hat{\beta}_1)^T =\mbox{argmin}_{\beta_0,\beta_1} \sum\limits_{i=1}^n (y_i-\beta_0-\beta_1 x_i)^2 =\mbox{argmin}_\beta(\mathbf{Y}-\mathbf{X} \beta)^T(\mathbf{Y}-\mathbf{X} \beta)\] Furthermore, by \[\begin{eqnarray*} & \ & (\mathbf{Y}-\mathbf{X} \beta)^T(\mathbf{Y}-\mathbf{X} \beta) \\ &=& (\mathbf{Y}- P_{\mathbf{x}} \mathbf{Y} + P_{\mathbf{x}} \mathbf{Y}- \mathbf{X} \beta)^T(\mathbf{Y}- P_{\mathbf{x}} \mathbf{Y} + P_{\mathbf{x}} \mathbf{Y}-\mathbf{X} \beta) \\ &=& (\mathbf{Y}- P_{\mathbf{x}} \mathbf{Y})^T(\mathbf{Y}- P_{\mathbf{x}} \mathbf{Y}) +(P_{\mathbf{x}} \mathbf{Y}- \mathbf{X} \beta)^T(P_{\mathbf{x}} \mathbf{Y}- \mathbf{X} \beta)\\ & \ & + (\mathbf{Y}- P_{\mathbf{x}} \mathbf{Y})^T(P_{\mathbf{x}} \mathbf{Y}- \mathbf{X} \beta) + (P_{\mathbf{x}} \mathbf{Y}- \mathbf{X} \beta)^T(\mathbf{Y}- P_{\mathbf{x}} \mathbf{Y}) \end{eqnarray*}\]
By the properties of the projection matrix \(P_{\mathbf{x}}\), \[P_{\mathbf{x}}(I_{n\times n}-P_{\mathbf{x}})= 0 \quad \mbox{and} \quad \mathbf{X}^T(I_{n\times n}-P_{\mathbf{x}})=0\]
Then \[ (\mathbf{Y}- P_{\mathbf{x}} \mathbf{Y})^T(P_{\mathbf{x}} \mathbf{Y}- \mathbf{X} \beta) = (P_{\mathbf{x}} \mathbf{Y}- \mathbf{X} \beta)^T(\mathbf{Y}- P_{\mathbf{x}} \mathbf{Y}) =(\mathbf{Y}^T P_{\mathbf{x}}-\beta^T \mathbf{X}^T)(I_{n\times n}-P_{\mathbf{x}})\mathbf{Y} =0 \] Hence
\[\begin{eqnarray*} & \ & (\mathbf{Y}-\mathbf{X} \beta)^T(\mathbf{Y}-\mathbf{X} \beta) =(\mathbf{Y}- P_{\mathbf{x}} \mathbf{Y})^T(\mathbf{Y}- P_{\mathbf{x}} \mathbf{Y}) +(P_{\mathbf{x}} \mathbf{Y}- \mathbf{X} \beta)^T(P_{\mathbf{x}} \mathbf{Y}- \mathbf{X} \beta) \\
&\ge & (\mathbf{Y}- P_{\mathbf{x}} \mathbf{Y})^T(\mathbf{Y}- P_{\mathbf{x}} \mathbf{Y}),
\end{eqnarray*}\] and the equality holds if and only if \(P_{\mathbf{x}}\mathbf{Y}=\mathbf{X}\beta\). That is, when
\[ P_{\mathbf{x}}\mathbf{Y}=\mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T \mathbf{Y} =\mathbf{X}\beta \] or equivalently \[\hat{\beta} =(\mathbf{X}^T\mathbf{X})^{-1} \mathbf{X}^T \mathbf{Y}, \] the objective \[(\mathbf{Y}-\mathbf{X} \beta)^T(\mathbf{Y}-\mathbf{X} \beta)\] attains its minimum.
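As a sanity check, the closed-form estimate \((\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{Y}\) can be compared against the coefficients returned by `lm()`. A sketch on simulated data (the true coefficients 1 and 2 and the sample size are arbitrary choices):

```r
# Closed-form least squares vs. lm() on simulated data
set.seed(3)
n <- 50
x <- runif(n)
y <- 1 + 2 * x + rnorm(n)               # true beta0 = 1, beta1 = 2

X <- cbind(1, x)                         # design matrix with intercept column
beta_hat <- solve(t(X) %*% X, t(X) %*% y)  # (X^T X)^{-1} X^T Y

fit <- lm(y ~ x)
stopifnot(isTRUE(all.equal(as.numeric(beta_hat), unname(coef(fit)))))
```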
The above results are not limited to the simple linear regression model. For the multiple linear regression model in the following lecture notes, \[ Y_i=\beta_0+x_{1i}\beta_1+x_{2i}\beta_2+\cdots+x_{p i}\beta_p +\varepsilon_i, i=1,2,\ldots,n \] the results still hold after replacing \(\mathbf{X}\) with \[\mathbf{X}=\left[\begin{array}{ccccc} 1 & x_{11} & x_{12} & \cdots & x_{1p} \\ 1 & x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & \vdots & \vdots & \vdots \\ 1 & x_{n1} & x_{n2} & \cdots & x_{np} \end{array} \right] \]
Consider the simple linear regression model \[ Y_i=\beta_0+\beta_1 x_i +\varepsilon_i, i=1,\ldots, n\] and notice its matrix form
\[\mathbf{Y}=\left(\begin{array}{c} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{array} \right),
\mathbf{X}= \left(\begin{array}{cc} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\
1 & x_n \end{array} \right), \varepsilon= \left(\begin{array}{c} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{array} \right)
\] we have \[ \mathbf{X}^T\mathbf{X} =\left(\begin{array}{cc} n & \sum\limits_{i=1}^n x_i \\
\sum\limits_{i=1}^n x_i & \sum\limits_{i=1}^n x^2_i \end{array} \right)
\] So by \[ \left(\begin{array}{cc} a & b \\
b & c \end{array} \right)^{-1}=\frac{1}{ac-b^2}\left(\begin{array}{cc} c & -b \\
-b & a \end{array} \right),
\] we have \[(\mathbf{X}^T\mathbf{X})^{-1}=\left(\begin{array}{cc} n & \sum\limits_{i=1}^n x_i \\
\sum\limits_{i=1}^n x_i & \sum\limits_{i=1}^n x^2_i \end{array} \right)^{-1}
=\frac{1}{n\sum\limits_{i=1}^n x^2_i-\left(\sum\limits_{i=1}^n x_i\right)^2}
\left(\begin{array}{cc} \sum\limits_{i=1}^n x^2_i & -\sum\limits_{i=1}^n x_i \\
-\sum\limits_{i=1}^n x_i & n \end{array} \right)
\] and \[ \mathbf{X}^T\mathbf{Y}=\left(\begin{array}{cccc} 1 & 1 & \cdots & 1 \\ x_1 & x_2 & \cdots & x_n \end{array} \right)
\left(\begin{array}{c} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{array}\right)
=\left(\begin{array}{c}\sum\limits_{i=1}^n Y_i \\ \sum\limits_{i=1}^n x_iY_i \end{array}\right).
\] So \[\begin{eqnarray*}\hat{\beta} =\left(\begin{array}{c} \hat{\beta}_0 \\ \hat{\beta}_1 \end{array} \right)
&=&(\mathbf{X}^T\mathbf{X})^{-1} \mathbf{X}^T\mathbf{Y}
=\frac{1}{n\sum\limits_{i=1}^n x^2_i-\left(\sum\limits_{i=1}^n x_i\right)^2}
\left(\begin{array}{cc} \sum\limits_{i=1}^n x^2_i & -\sum\limits_{i=1}^n x_i \\
-\sum\limits_{i=1}^n x_i & n \end{array} \right)\left(\begin{array}{c}\sum\limits_{i=1}^n Y_i \\ \sum\limits_{i=1}^n x_iY_i \end{array}\right) \\
&=&\frac{1}{n\sum\limits_{i=1}^n x^2_i-\left(\sum\limits_{i=1}^n x_i\right)^2} \left(\begin{array}{c}\sum\limits_{i=1}^n x^2_i\sum\limits_{i=1}^n Y_i-\sum\limits_{i=1}^n x_i \sum\limits_{i=1}^n x_iY_i \\
n\sum\limits_{i=1}^n x_i Y_i -\sum\limits_{i=1}^n x_i \sum\limits_{i=1}^n Y_i \end{array} \right)
\end{eqnarray*}\] So \[\hat{\beta}_1=\frac{1}{n\sum\limits_{i=1}^n x^2_i-\left(\sum\limits_{i=1}^n x_i\right)^2} \left \{n\sum\limits_{i=1}^n x_i Y_i -\sum\limits_{i=1}^n x_i \sum\limits_{i=1}^n Y_i\right \}
\]
Notice \[ n\sum\limits_{i=1}^n x_i^2- \left(\sum\limits_{i=1}^n x_i\right)^2=n\left\{ \sum\limits_{i=1}^n x_i^2-n\bar{x}^2\right\}=n\sum\limits_{i=1}^n (x_i-\bar{x})^2=nC_{XX} \] where \(\bar{x}=\frac{1}{n}\sum_{i=1}^n x_i\). Similarly, \[ n\sum\limits_{i=1}^n x_iY_i -\sum\limits_{i=1}^n x_i \sum\limits_{i=1}^n Y_i = n \left\{ \sum\limits_{i=1}^n x_i Y_i-n \bar{x}\bar{Y}\right\}= n \sum\limits_{i=1}^n (x_i-\bar{x})(Y_i-\bar{Y})= n C_{XY} \] where \(\bar{Y}=\frac{1}{n}\sum_{i=1}^n Y_i\).
So by the above two equations, we have \[ \hat{\beta}_1= \frac{C_{XY}}{C_{XX}}\] For \(\hat{\beta}_0\), we have \[\begin{eqnarray*} \hat{\beta}_0&=& \frac{1}{n\sum\limits_{i=1}^n x^2_i-\left(\sum\limits_{i=1}^n x_i\right)^2}\left\{\sum\limits_{i=1}^n x^2_i\sum\limits_{i=1}^n Y_i-\sum\limits_{i=1}^n x_i \sum\limits_{i=1}^n x_iY_i \right\} \\ &=& \frac{1}{n\sum\limits_{i=1}^n x^2_i-\left(\sum\limits_{i=1}^n x_i\right)^2} \left\{\left[\sum\limits_{i=1}^n x^2_i-\frac{1}{n}\left(\sum\limits_{i=1}^n x_i\right)^2+\frac{1}{n}\left(\sum\limits_{i=1}^n x_i\right)^2\right] \sum\limits_{i=1}^n Y_i - n\bar{x}\sum\limits_{i=1}^n x_i Y_i\right\} \\ &=& \frac{1}{n}\sum\limits_{i=1}^n Y_i + \frac{\bar{x}}{n\sum\limits_{i=1}^n x^2_i-\left(\sum\limits_{i=1}^n x_i\right)^2}\left\{\sum\limits_{i=1}^n x_i\sum\limits_{i=1}^n Y_i - n \sum\limits_{i=1}^n x_i Y_i\right\}\\ &=& \bar{Y}-\hat{\beta}_1 \bar{x} \end{eqnarray*}\]
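The summation formulas \(\hat{\beta}_1=C_{XY}/C_{XX}\) and \(\hat{\beta}_0=\bar{Y}-\hat{\beta}_1\bar{x}\) can be verified against `lm()`. A sketch on simulated data (true coefficients are arbitrary):

```r
# beta1_hat = C_XY / C_XX and beta0_hat = Ybar - beta1_hat * xbar vs. lm()
set.seed(4)
x <- rnorm(30)
y <- 0.5 - 1.5 * x + rnorm(30)

Cxx <- sum((x - mean(x))^2)              # C_XX
Cxy <- sum((x - mean(x)) * (y - mean(y)))  # C_XY
b1 <- Cxy / Cxx
b0 <- mean(y) - b1 * mean(x)

stopifnot(isTRUE(all.equal(c(b0, b1), unname(coef(lm(y ~ x))))))
```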
Following the definition of \(\mathbf{X}\) in the lecture note, page 15, we have \[\sum\limits_{i=1}^n e_i= \mathbf{1}^T(\mathbf{Y}-P_{\mathbf{x}}\mathbf{Y})=[1,0]\mathbf{X}^T (I_{n\times n}-P_{\mathbf{x}})\mathbf{Y} \] where \(\mathbf{1}\) is an \(n\times 1\) vector with all elements equal to 1. According to the definition of \(\mathbf{X}\) and the invariance property of \(P_{\mathbf{x}}\), we have \(\mathbf{X}^T(I_{n\times n}-P_{\mathbf{x}})=0\), hence \[\sum\limits_{i=1}^n e_i =0\]
Similarly, \[\sum\limits_{i=1}^n x_ie_i=[0,1]\mathbf{X}^T (\mathbf{Y}-P_{\mathbf{x}}\mathbf{Y})=[0,1]\mathbf{X}^T(I_{n\times n}-P_{\mathbf{x}})\mathbf{Y}=0 \]
Notice that \[\sum\limits_{i=1}^n \hat{Y}_i e_i =(P_{\mathbf{x}}\mathbf{Y})^T (\mathbf{Y}-P_{\mathbf{x}}\mathbf{Y}),\] we have \[\sum\limits_{i=1}^n \hat{Y}_i e_i=\mathbf{Y}^TP_{\mathbf{x}}(\mathbf{Y}-P_{\mathbf{x}}\mathbf{Y})=0.\]
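The three residual identities above hold for any fitted simple linear regression; a quick numerical check on simulated data:

```r
# Residuals sum to zero and are orthogonal to x and to the fitted values
set.seed(5)
x <- rnorm(25)
y <- 2 + x + rnorm(25)
fit <- lm(y ~ x)
e <- resid(fit)
yhat <- fitted(fit)

stopifnot(abs(sum(e)) < 1e-8)          # sum of residuals is 0
stopifnot(abs(sum(x * e)) < 1e-8)      # sum of x_i * e_i is 0
stopifnot(abs(sum(yhat * e)) < 1e-8)   # sum of yhat_i * e_i is 0
```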
First, notice that \[\begin{eqnarray*} \mbox{SSE} &=& \sum\limits_{i=1}^n (Y_i-\hat{Y}_i)^2=\sum\limits_{i=1}^n (Y_i-\hat{\beta}_0-\hat{\beta}_1 x_i)^2= (\mathbf{Y}-P_{\mathbf{x}}\mathbf{Y})^T(\mathbf{Y}-P_{\mathbf{x}}\mathbf{Y}) \\
&=& (\mathbf{X}{\beta}+\varepsilon)^T(I_{n\times n}-P_{\mathbf{x}})(\mathbf{X}{\beta}+\varepsilon) \\
&=& \varepsilon^T(I_{n\times n}-P_{\mathbf{x}})\varepsilon
\end{eqnarray*}\] So \[\mbox{E} \ \mbox{SSE} =\mbox{E}\{\varepsilon^T(I_{n\times n}-P_{\mathbf{x}})\varepsilon\}\] By the trace operation, i.e., the sum of the diagonal elements of a square matrix, and the identity \(\mbox{Trace}(AB)=\mbox{Trace}(BA)\), we have \[ \mbox{Trace}(\varepsilon^T(I_{n\times n}-P_{\mathbf{x}})\varepsilon)=
\mbox{Trace}(\{(I_{n\times n}-P_{\mathbf{x}})\varepsilon\varepsilon^T\})
\] and \[ \mbox{E} \{\mbox{Trace}(\{(I_{n\times n}-P_{\mathbf{x}})\varepsilon\varepsilon^T\})\} =\mbox{Trace}\{\mbox{E}((I_{n\times n}-P_{\mathbf{x}})\varepsilon\varepsilon^T)\}=
\mbox{Trace}((I_{n\times n}-P_{\mathbf{x}})\mbox{E}\{\varepsilon\varepsilon^T\})
\] Since \(\varepsilon=(\varepsilon_1,\ldots,\varepsilon_n)^T\) has independent, identically distributed components with mean zero and variance \(\sigma^2\) under the simple regression model assumptions,
\[\mbox{E}\{\varepsilon\varepsilon^T\}=\sigma^2 I_{n\times n}\] Finally we have \[ \mbox{E} \{\mbox{Trace}(\varepsilon^T(I_{n\times n}-P_{\mathbf{x}})\varepsilon)\}= \mbox{E}\{
\mbox{Trace}(\{(I_{n\times n}-P_{\mathbf{x}})\varepsilon\varepsilon^T\})\}=\sigma^2
\mbox{Trace}(I_{n\times n}-P_{\mathbf{x}})\] By the properties of the projection matrix \(P_{\mathbf{x}}\) and \(I_{ n\times n}-P_{\mathbf{x}}\), and \(\mbox{rank}(\mathbf{X})=2\), we obtained that \[\mbox{E} \{\mbox{Trace}(\varepsilon^T(I_{n\times n}-P_{\mathbf{x}})\varepsilon)\} =\sigma^2
\mbox{Trace}(I_{n\times n}-P_{\mathbf{x}}) =(n-2)\sigma^2.
\] Hence the maximum likelihood estimate \(\frac{SSE}{n}\) of \(\sigma^2\) is not an unbiased estimate of \(\sigma^2\) because \[ \mbox{E} \frac{SSE}{n} =\frac{n-2}{n}\sigma^2 \ne \sigma^2.\] The unbiased estimate of \(\sigma^2\) is \[ \hat{\sigma}^2=\frac{SSE}{n-2}.\] For multiple linear regression, when \(\mbox{rank}(\mathbf{X})=p+1\) where \(p\) is the number of independent variables in the model, the unbiased estimate of \(\sigma^2\) is \[ \hat{\sigma}^2=\frac{SSE}{n-p-1}.\]
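The estimate \(\hat{\sigma}^2=SSE/(n-2)\) is exactly what `summary.lm()` reports as the squared residual standard error; a quick check on simulated data (true \(\sigma=2\) is an arbitrary choice):

```r
# SSE/(n-2) matches the squared residual standard error from summary()
set.seed(6)
n <- 40
x <- rnorm(n)
y <- 1 + 3 * x + rnorm(n, sd = 2)       # true sigma = 2
fit <- lm(y ~ x)

SSE <- sum(resid(fit)^2)
stopifnot(isTRUE(all.equal(SSE / (n - 2), summary(fit)$sigma^2)))
```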
Centering makes the random variation easier to analyze. Standardization removes the effect of scale on the analysis and allows the variation of different observations to be compared on the same scale.
These reparametrizations do not change the linear relationship discovered between the dependent and independent variables, but they make the model and the fitted results easier to interpret.
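For instance, centering or standardizing the predictor changes the individual coefficients but leaves the fitted relationship unchanged. A sketch on simulated data (all values are illustrative):

```r
# Centering / standardizing x reparametrizes (beta0, beta1) but the
# fitted values are identical
set.seed(7)
x <- rnorm(20, mean = 10, sd = 3)
y <- 5 + 0.8 * x + rnorm(20)

f0 <- lm(y ~ x)                          # raw predictor
f1 <- lm(y ~ I(x - mean(x)))             # centered predictor
f2 <- lm(y ~ I((x - mean(x)) / sd(x)))   # standardized predictor

stopifnot(isTRUE(all.equal(fitted(f0), fitted(f1))))
stopifnot(isTRUE(all.equal(fitted(f0), fitted(f2))))
# With a centered predictor the intercept is simply Ybar
stopifnot(isTRUE(all.equal(unname(coef(f1)[1]), mean(y))))
```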