library(HistData)
dim(GaltonFamilies)
## [1] 934 8
names(GaltonFamilies)
## [1] "family" "father" "mother" "midparentHeight"
## [5] "children" "childNum" "gender" "childHeight"
# Scatter plot of child height (column 8) against mid-parent height (column 4),
# with the fitted least squares line added
plot(GaltonFamilies[, c(4, 8)], pch = 16, col = "blue")
abline(lm(childHeight ~ midparentHeight, data = GaltonFamilies))
If \(P\) is an \(n \times n\) symmetric matrix satisfying \(P\cdot P=P^2=P\), we call \(P\) a projection matrix.
\(I-P=(I-P)^2\) is also a projection matrix, with \(P(I-P)=(I-P)P=0\), where \(I\) is the \(n\times n\) identity matrix.
Each eigenvalue of \(P\) is either 0 or 1, and the number of nonzero eigenvalues equals the rank of \(P\). Hence the spectral decomposition of \(P\) is \[ P =Q^T \left[ \begin{array}{cc} I_{r\times r} & \mathbf{0} \\ \mathbf{0} & \mathbf{0}_{(n-r)\times(n-r)} \end{array} \right]Q \] where \(r =\mathrm{rank} (P)\), \(I_{r\times r}\) is the \(r \times r\) identity matrix and \(Q\) is an orthogonal matrix with \(Q^TQ= QQ^T=I_{n\times n}\).
For an \(n\times n\) symmetric matrix \(A=A^T\) with eigenvalues \(\lambda_1,\ldots, \lambda_n\), we have \[\lambda_1+\lambda_2+\cdots+\lambda_n= a_{11}+a_{22}+\cdots+a_{nn}\hat{=}\mbox{Trace}(A).\] Hence, for a projection matrix \(P\), the sum of its eigenvalues satisfies \[\lambda_1+\lambda_2+\cdots+\lambda_n = \mbox{Trace}(P) =\mbox{rank}(P)\]
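These eigenvalue and trace properties can be checked numerically. A minimal R sketch, using an arbitrary full-rank \(6\times 2\) matrix (the entries are illustrative, not from the lecture data):

```r
# Build a projection matrix P = X (X^T X)^{-1} X^T from an arbitrary X
set.seed(1)
X <- cbind(1, rnorm(6))                 # n = 6, p = 2, rank(X) = 2
P <- X %*% solve(t(X) %*% X) %*% t(X)

# P is symmetric and idempotent: P = P^T and P^2 = P
stopifnot(isTRUE(all.equal(P, t(P))))
stopifnot(isTRUE(all.equal(P %*% P, P)))

# Eigenvalues are all 0 or 1, and Trace(P) = rank(P) = 2
ev <- eigen(P, symmetric = TRUE)$values
stopifnot(all(abs(ev - round(ev)) < 1e-10))
stopifnot(isTRUE(all.equal(sum(diag(P)), 2)))
```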
Consider \[ \mathbf{X}=\left[\begin{array}{cccc} x_{11} & x_{12} & \cdots & x_{1p} \\ x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & \vdots & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{np} \end{array} \right] =[\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_p] \] where \(n >p\). The projection matrix onto the column space of \(\mathbf{X}\), i.e., the space spanned by the linear combinations of \(\mathbf{x}_1,\mathbf{x}_2,\ldots, \mathbf{x}_p\), is given by \[ P_{\mathbf{x}} =\mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T \]
If \(\mathbf{X}\) is an \(n \times p\) matrix with \(\mbox{rank}(\mathbf{X} )=p\), then \(\mbox{rank}(P_{\mathbf{x}})=p\).
The eigenvalues of \(P_{\mathbf{x}}\) consist of \(p\) ones and \(n-p\) zeros, while the eigenvalues of \(I-P_{\mathbf{x}}\) consist of \(n-p\) ones and \(p\) zeros.
Let \(\mathbf{Z}=\sum_{i=1}^p a_i \mathbf{x}_i\) be any linear combination of \((\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_p)\), where \(a_i,i=1,\ldots,p\) are any constant values. Then \(\mathbf{Z}\) is invariant under \(P_{\mathbf{x}}\): \[P_{\mathbf{x}}\mathbf{Z}=\mathbf{Z}\] and \[ (I_{n\times n}-P_{\mathbf{x}})\mathbf{Z}=0 \]
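The invariance property can also be seen numerically. A small R sketch, with an arbitrary \(\mathbf{X}\) and arbitrary coefficients (all values are illustrative):

```r
# Any vector Z in the column space of X is left unchanged by P_x
set.seed(2)
X <- cbind(1, rnorm(5))                 # arbitrary 5 x 2 full-rank matrix
P <- X %*% solve(t(X) %*% X) %*% t(X)   # projection onto col(X)
Z <- X %*% c(3, -2)                     # Z = 3*x_1 - 2*x_2, a linear combination

stopifnot(isTRUE(all.equal(P %*% Z, Z)))           # P_x Z = Z
stopifnot(max(abs((diag(5) - P) %*% Z)) < 1e-10)   # (I - P_x) Z = 0
```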
Notice that the simple linear regression model can be rewritten in matrix form as in Lecture Note 2, page 15. Then the least squares estimate of \((\beta_0,\beta_1)\) is the solution of \[\hat{\beta}=(\hat{\beta}_0,\hat{\beta}_1)^T =\mbox{argmin}_{\beta_0,\beta_1} \sum\limits_{i=1}^n (y_i-\beta_0-\beta_1 x_i)^2 =\mbox{argmin}_\beta(\mathbf{Y}-\mathbf{X} \beta)^T(\mathbf{Y}-\mathbf{X} \beta)\] Furthermore, by \[\begin{eqnarray*} & \ & (\mathbf{Y}-\mathbf{X} \beta)^T(\mathbf{Y}-\mathbf{X} \beta) \\ &=& (\mathbf{Y}- P_{\mathbf{x}} \mathbf{Y} + P_{\mathbf{x}} \mathbf{Y}- \mathbf{X} \beta)^T(\mathbf{Y}- P_{\mathbf{x}} \mathbf{Y} + P_{\mathbf{x}} \mathbf{Y}-\mathbf{X} \beta) \\ &=& (\mathbf{Y}- P_{\mathbf{x}} \mathbf{Y})^T(\mathbf{Y}- P_{\mathbf{x}} \mathbf{Y}) +(P_{\mathbf{x}} \mathbf{Y}- \mathbf{X} \beta)^T(P_{\mathbf{x}} \mathbf{Y}- \mathbf{X} \beta)\\ & \ & + (\mathbf{Y}- P_{\mathbf{x}} \mathbf{Y})^T(P_{\mathbf{x}} \mathbf{Y}- \mathbf{X} \beta) + (P_{\mathbf{x}} \mathbf{Y}- \mathbf{X} \beta)^T(\mathbf{Y}- P_{\mathbf{x}} \mathbf{Y}) \end{eqnarray*}\]
By the properties of the projection matrix \(P_{\mathbf{x}}\), \[P_{\mathbf{x}}(I_{n\times n}-P_{\mathbf{x}})= 0 \quad \mbox{and} \quad \mathbf{X}^T(I_{n\times n}-P_{\mathbf{x}})=0\]
Then \[ (\mathbf{Y}- P_{\mathbf{x}} \mathbf{Y})^T(P_{\mathbf{x}} \mathbf{Y}- \mathbf{X} \beta) = (P_{\mathbf{x}} \mathbf{Y}- \mathbf{X} \beta)^T(\mathbf{Y}- P_{\mathbf{x}} \mathbf{Y}) =(\mathbf{Y}^T P_{\mathbf{x}}-\beta^T \mathbf{X}^T)(I_{n\times n}-P_{\mathbf{x}})\mathbf{Y} =0 \] Hence
\[\begin{eqnarray*} & \ & (\mathbf{Y}-\mathbf{X} \beta)^T(\mathbf{Y}-\mathbf{X} \beta) =(\mathbf{Y}- P_{\mathbf{x}} \mathbf{Y})^T(\mathbf{Y}- P_{\mathbf{x}} \mathbf{Y}) +(P_{\mathbf{x}} \mathbf{Y}- \mathbf{X} \beta)^T(P_{\mathbf{x}} \mathbf{Y}- \mathbf{X} \beta) \\
&\ge & (\mathbf{Y}- P_{\mathbf{x}} \mathbf{Y})^T(\mathbf{Y}- P_{\mathbf{x}} \mathbf{Y}),
\end{eqnarray*}\] and the equality holds if and only if \(P_{\mathbf{x}}\mathbf{Y}=\mathbf{X}\beta\). That is, when
\[ P_{\mathbf{x}}\mathbf{Y}=\mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T \mathbf{Y} =\mathbf{X}\beta \] or equivalently \[\hat{\beta} =(\mathbf{X}^T\mathbf{X})^{-1} \mathbf{X}^T \mathbf{Y}, \] the objective \[(\mathbf{Y}-\mathbf{X} \beta)^T(\mathbf{Y}-\mathbf{X} \beta)\] attains its minimum.
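As a sanity check, the closed-form estimate \((\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{Y}\) can be compared against the coefficients returned by `lm()`. A sketch on simulated data (the true coefficients 1 and 2 and the sample size are arbitrary choices):

```r
# Closed-form least squares vs. lm() on simulated data
set.seed(3)
n <- 50
x <- runif(n)
y <- 1 + 2 * x + rnorm(n)               # true beta0 = 1, beta1 = 2

X <- cbind(1, x)                         # design matrix with intercept column
beta_hat <- solve(t(X) %*% X, t(X) %*% y)  # (X^T X)^{-1} X^T Y

fit <- lm(y ~ x)
stopifnot(isTRUE(all.equal(as.numeric(beta_hat), unname(coef(fit)))))
```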
The above results are not limited to the simple linear regression model. For the multiple linear regression model in the following lecture notes, \[ Y_i=\beta_0+x_{1i}\beta_1+x_{2i}\beta_2+\cdots+x_{p i}\beta_p +\varepsilon_i, i=1,2,\ldots,n \] the results still hold after replacing \(\mathbf{X}\) with \[\mathbf{X}=\left[\begin{array}{ccccc} 1 & x_{11} & x_{12} & \cdots & x_{1p} \\ 1 & x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & \vdots & \vdots & \vdots \\ 1 & x_{n1} & x_{n2} & \cdots & x_{np} \end{array} \right] \]
Consider the simple linear regression model \[ Y_i=\beta_0+\beta_1 x_i +\varepsilon_i, i=1,\ldots, n\] and notice its matrix form
\[\mathbf{Y}=\left(\begin{array}{c} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{array} \right),
\mathbf{X}= \left(\begin{array}{cc} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\
1 & x_n \end{array} \right), \varepsilon= \left(\begin{array}{c} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{array} \right)
\] we have \[ \mathbf{X}^T\mathbf{X} =\left(\begin{array}{cc} n & \sum\limits_{i=1}^n x_i \\
\sum\limits_{i=1}^n x_i & \sum\limits_{i=1}^n x^2_i \end{array} \right)
\] So by \[ \left(\begin{array}{cc} a & b \\
b & c \end{array} \right)^{-1}=\frac{1}{ac-b^2}\left(\begin{array}{cc} c & -b \\
-b & a \end{array} \right),
\] we have \[(\mathbf{X}^T\mathbf{X})^{-1}=\left(\begin{array}{cc} n & \sum\limits_{i=1}^n x_i \\
\sum\limits_{i=1}^n x_i & \sum\limits_{i=1}^n x^2_i \end{array} \right)^{-1}
=\frac{1}{n\sum\limits_{i=1}^n x^2_i-\left(\sum\limits_{i=1}^n x_i\right)^2}
\left(\begin{array}{cc} \sum\limits_{i=1}^n x^2_i & -\sum\limits_{i=1}^n x_i \\
-\sum\limits_{i=1}^n x_i & n \end{array} \right)
\] and \[ \mathbf{X}^T\mathbf{Y}=\left(\begin{array}{cccc} 1 & 1 & \cdots & 1 \\ x_1 & x_2 & \cdots & x_n \end{array} \right)
\left(\begin{array}{c} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{array}\right)
=\left(\begin{array}{c}\sum\limits_{i=1}^n Y_i \\ \sum\limits_{i=1}^n x_iY_i \end{array}\right).
\] So \[\begin{eqnarray*}\hat{\beta} =\left(\begin{array}{c} \hat{\beta}_0 \\ \hat{\beta}_1 \end{array} \right)
&=&(\mathbf{X}^T\mathbf{X})^{-1} \mathbf{X}^T\mathbf{Y}
=\frac{1}{n\sum\limits_{i=1}^n x^2_i-\left(\sum\limits_{i=1}^n x_i\right)^2}
\left(\begin{array}{cc} \sum\limits_{i=1}^n x^2_i & -\sum\limits_{i=1}^n x_i \\
-\sum\limits_{i=1}^n x_i & n \end{array} \right)\left(\begin{array}{c}\sum\limits_{i=1}^n Y_i \\ \sum\limits_{i=1}^n x_iY_i \end{array}\right) \\
&=&\frac{1}{n\sum\limits_{i=1}^n x^2_i-\left(\sum\limits_{i=1}^n x_i\right)^2} \left(\begin{array}{c}\sum\limits_{i=1}^n x^2_i\sum\limits_{i=1}^n Y_i-\sum\limits_{i=1}^n x_i \sum\limits_{i=1}^n x_iY_i \\
n\sum\limits_{i=1}^n x_i Y_i -\sum\limits_{i=1}^n x_i \sum\limits_{i=1}^n Y_i \end{array} \right)
\end{eqnarray*}\] So \[\hat{\beta}_1=\frac{1}{n\sum\limits_{i=1}^n x^2_i-\left(\sum\limits_{i=1}^n x_i\right)^2} \left \{n\sum\limits_{i=1}^n x_i Y_i -\sum\limits_{i=1}^n x_i \sum\limits_{i=1}^n Y_i\right \}
\]
Notice \[ n\sum\limits_{i=1}^n x_i^2- \left(\sum\limits_{i=1}^n x_i\right)^2=n\left\{ \sum\limits_{i=1}^n x_i^2-n\bar{x}^2\right\}=n\sum\limits_{i=1}^n (x_i-\bar{x})^2=nC_{XX} \] where \(\bar{x}=\frac{1}{n}\sum_{i=1}^n x_i\). Similarly, \[ n\sum\limits_{i=1}^n x_iY_i -\sum\limits_{i=1}^n x_i \sum\limits_{i=1}^n Y_i = n \left\{ \sum\limits_{i=1}^n x_i Y_i-n \bar{x}\bar{Y}\right\}= n \sum\limits_{i=1}^n (x_i-\bar{x})(Y_i-\bar{Y})= n C_{XY} \] where \(\bar{Y}=\frac{1}{n}\sum_{i=1}^n Y_i\).
So by the above two equations, we have \[ \hat{\beta}_1= \frac{C_{XY}}{C_{XX}}\] For \(\hat{\beta}_0\), we have \[\begin{eqnarray*} \hat{\beta}_0&=& \frac{1}{n\sum\limits_{i=1}^n x^2_i-\left(\sum\limits_{i=1}^n x_i\right)^2}\left\{\sum\limits_{i=1}^n x^2_i\sum\limits_{i=1}^n Y_i-\sum\limits_{i=1}^n x_i \sum\limits_{i=1}^n x_iY_i \right\} \\ &=& \frac{1}{n\sum\limits_{i=1}^n x^2_i-\left(\sum\limits_{i=1}^n x_i\right)^2} \left\{\left[\sum\limits_{i=1}^n x^2_i-\frac{1}{n}\left(\sum\limits_{i=1}^n x_i\right)^2+\frac{1}{n}\left(\sum\limits_{i=1}^n x_i\right)^2\right] \sum\limits_{i=1}^n Y_i - n\bar{x}\sum\limits_{i=1}^n x_i Y_i\right\} \\ &=& \frac{1}{n}\sum\limits_{i=1}^n Y_i + \frac{\bar{x}}{n\sum\limits_{i=1}^n x^2_i-\left(\sum\limits_{i=1}^n x_i\right)^2}\left\{\sum\limits_{i=1}^n x_i\sum\limits_{i=1}^n Y_i - n \sum\limits_{i=1}^n x_i Y_i\right\}\\ &=& \bar{Y}-\hat{\beta}_1 \bar{x} \end{eqnarray*}\]
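The summation formulas \(\hat{\beta}_1=C_{XY}/C_{XX}\) and \(\hat{\beta}_0=\bar{Y}-\hat{\beta}_1\bar{x}\) can be verified against `lm()`. A sketch on simulated data (true coefficients are arbitrary):

```r
# beta1_hat = C_XY / C_XX and beta0_hat = Ybar - beta1_hat * xbar vs. lm()
set.seed(4)
x <- rnorm(30)
y <- 0.5 - 1.5 * x + rnorm(30)

Cxx <- sum((x - mean(x))^2)              # C_XX
Cxy <- sum((x - mean(x)) * (y - mean(y)))  # C_XY
b1 <- Cxy / Cxx
b0 <- mean(y) - b1 * mean(x)

stopifnot(isTRUE(all.equal(c(b0, b1), unname(coef(lm(y ~ x))))))
```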
Following the definition of \(\mathbf{X}\) in the lecture note, page 15, we have \[\sum\limits_{i=1}^n e_i= \mathbf{1}^T(\mathbf{Y}-P_{\mathbf{x}}\mathbf{Y})=[1,0]\mathbf{X}^T (I_{n\times n}-P_{\mathbf{x}})\mathbf{Y} \] where \(\mathbf{1}\) is an \(n\times 1\) vector with all elements equal to 1. According to the definition of \(\mathbf{X}\) and the invariance property of \(P_{\mathbf{x}}\), we have \(\mathbf{X}^T(I_{n\times n}-P_{\mathbf{x}})=0\), hence \[\sum\limits_{i=1}^n e_i =0\]
Similarly, \[\sum\limits_{i=1}^n x_ie_i=[0,1]\mathbf{X}^T (\mathbf{Y}-P_{\mathbf{x}}\mathbf{Y})=[0,1]\mathbf{X}^T(I_{n\times n}-P_{\mathbf{x}})\mathbf{Y}=0 \]
Notice that \[\sum\limits_{i=1}^n \hat{Y}_i e_i =(P_{\mathbf{x}}\mathbf{Y})^T (\mathbf{Y}-P_{\mathbf{x}}\mathbf{Y}),\] we have \[\sum\limits_{i=1}^n \hat{Y}_i e_i=\mathbf{Y}^TP_{\mathbf{x}}(\mathbf{Y}-P_{\mathbf{x}}\mathbf{Y})=0.\]
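The three residual identities above hold for any fitted simple linear regression; a quick numerical check on simulated data:

```r
# Residuals sum to zero and are orthogonal to x and to the fitted values
set.seed(5)
x <- rnorm(25)
y <- 2 + x + rnorm(25)
fit <- lm(y ~ x)
e <- resid(fit)
yhat <- fitted(fit)

stopifnot(abs(sum(e)) < 1e-8)          # sum of residuals is 0
stopifnot(abs(sum(x * e)) < 1e-8)      # sum of x_i * e_i is 0
stopifnot(abs(sum(yhat * e)) < 1e-8)   # sum of yhat_i * e_i is 0
```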
First, notice that \[\begin{eqnarray*} \mbox{SSE} &=& \sum\limits_{i=1}^n (Y_i-\hat{Y}_i)^2=\sum\limits_{i=1}^n (Y_i-\hat{\beta}_0-\hat{\beta}_1 x_i)^2= (\mathbf{Y}-P_{\mathbf{x}}\mathbf{Y})^T(\mathbf{Y}-P_{\mathbf{x}}\mathbf{Y}) \\
&=& (\mathbf{X}{\beta}+\varepsilon)^T(I_{n\times n}-P_{\mathbf{x}})(\mathbf{X}{\beta}+\varepsilon) \\
&=& \varepsilon^T(I_{n\times n}-P_{\mathbf{x}})\varepsilon
\end{eqnarray*}\] So \[\mbox{E} \ \mbox{SSE} =\mbox{E}\{\varepsilon^T(I_{n\times n}-P_{\mathbf{x}})\varepsilon\}\] By the trace operation, i.e., the sum of the diagonal elements of a square matrix, and the identity \(\mbox{Trace}(AB)=\mbox{Trace}(BA)\), we have \[ \mbox{Trace}(\varepsilon^T(I_{n\times n}-P_{\mathbf{x}})\varepsilon)=
\mbox{Trace}(\{(I_{n\times n}-P_{\mathbf{x}})\varepsilon\varepsilon^T\})
\] and \[ \mbox{E} \{\mbox{Trace}(\{(I_{n\times n}-P_{\mathbf{x}})\varepsilon\varepsilon^T\})\} =\mbox{Trace}\{\mbox{E}((I_{n\times n}-P_{\mathbf{x}})\varepsilon\varepsilon^T)\}=
\mbox{Trace}((I_{n\times n}-P_{\mathbf{x}})\mbox{E}\{\varepsilon\varepsilon^T\})
\] Since \(\varepsilon=(\varepsilon_1,\ldots,\varepsilon_n)^T\) has independent, identically distributed components with mean zero and variance \(\sigma^2\) under the simple regression model assumptions,
\[\mbox{E}\{\varepsilon\varepsilon^T\}=\sigma^2 I_{n\times n}\] Finally we have \[ \mbox{E} \{\mbox{Trace}(\varepsilon^T(I_{n\times n}-P_{\mathbf{x}})\varepsilon)\}= \mbox{E}\{
\mbox{Trace}(\{(I_{n\times n}-P_{\mathbf{x}})\varepsilon\varepsilon^T\})\}=\sigma^2
\mbox{Trace}(I_{n\times n}-P_{\mathbf{x}})\] By the properties of the projection matrix \(P_{\mathbf{x}}\) and \(I_{ n\times n}-P_{\mathbf{x}}\), and \(\mbox{rank}(\mathbf{X})=2\), we obtained that \[\mbox{E} \{\mbox{Trace}(\varepsilon^T(I_{n\times n}-P_{\mathbf{x}})\varepsilon)\} =\sigma^2
\mbox{Trace}(I_{n\times n}-P_{\mathbf{x}}) =(n-2)\sigma^2.
\] Hence the maximum likelihood estimate \(\frac{SSE}{n}\) of \(\sigma^2\) is not an unbiased estimate of \(\sigma^2\) because \[ \mbox{E} \frac{SSE}{n} =\frac{n-2}{n}\sigma^2 \ne \sigma^2.\] The unbiased estimate of \(\sigma^2\) is \[ \hat{\sigma}^2=\frac{SSE}{n-2}.\] For multiple linear regression, when \(\mbox{rank}(\mathbf{X})=p+1\) where \(p\) is the number of independent variables in the model, the unbiased estimate of \(\sigma^2\) is \[ \hat{\sigma}^2=\frac{SSE}{n-p-1}.\]
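The estimate \(\hat{\sigma}^2=SSE/(n-2)\) is exactly what `summary.lm()` reports as the squared residual standard error; a quick check on simulated data (true \(\sigma=2\) is an arbitrary choice):

```r
# SSE/(n-2) matches the squared residual standard error from summary()
set.seed(6)
n <- 40
x <- rnorm(n)
y <- 1 + 3 * x + rnorm(n, sd = 2)       # true sigma = 2
fit <- lm(y ~ x)

SSE <- sum(resid(fit)^2)
stopifnot(isTRUE(all.equal(SSE / (n - 2), summary(fit)$sigma^2)))
```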
Centering makes the random variation easier to analyze. Standardization removes the effect of scale on the analysis and allows the variation of different observations to be compared on the same scale.
These reparametrizations do not change the linear relationship discovered between the dependent and independent variables, but they make the model and the fitted results easier to interpret.
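For instance, centering or standardizing the predictor changes the individual coefficients but leaves the fitted relationship unchanged. A sketch on simulated data (all values are illustrative):

```r
# Centering / standardizing x reparametrizes (beta0, beta1) but the
# fitted values are identical
set.seed(7)
x <- rnorm(20, mean = 10, sd = 3)
y <- 5 + 0.8 * x + rnorm(20)

f0 <- lm(y ~ x)                          # raw predictor
f1 <- lm(y ~ I(x - mean(x)))             # centered predictor
f2 <- lm(y ~ I((x - mean(x)) / sd(x)))   # standardized predictor

stopifnot(isTRUE(all.equal(fitted(f0), fitted(f1))))
stopifnot(isTRUE(all.equal(fitted(f0), fitted(f2))))
# With a centered predictor the intercept is simply Ybar
stopifnot(isTRUE(all.equal(unname(coef(f1)[1]), mean(y))))
```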