
Section 3.3 Orthogonal Projections and Least Squares Approximations

We begin with the notion of orthogonal projection introduced in the previous section. We find ways to compute it other than directly from the definition, and give an application to least squares approximations.

Subsection 3.3.1 Orthonormal bases and orthogonal/unitary matrices.

Consider the inner product space \(V=F^n\) where \(F = \R\) or \(\C,\) and denote by \(\overline z\) the complex conjugate of \(z.\)
If \(v=\begin{bmatrix}a_1\\\vdots\\a_n\end{bmatrix}\) and \(w=\begin{bmatrix}b_1\\\vdots\\b_n\end{bmatrix}\) are two vectors in \(F^n\text{,}\) we defined their inner product by:
\begin{equation*} \la v,w\ra := \sum_{i=1}^n a_i \overline {b_i}. \end{equation*}
It is very convenient to recognize values of the inner product via matrix multiplication. In particular, regarding the column vectors \(v,w\) as \(n\times 1\) matrices, we have
\begin{equation*} \la v,w\ra = \sum_{i=1}^n a_i \overline {b_i} = w^*v, \end{equation*}
the \(1\times 1\) matrix product \(w^*v\text{,}\) where \(w^*\) is the \(1\times n\) conjugate transpose of \(w.\)
For vectors \(v,w \) as above, we have seen the meaning of \(w^*v\text{.}\) It is more than idle curiosity to inquire about the meaning of \(vw^*.\) We can certainly compute it, but first we note that while \(w^*v = \la v,w\ra\) is a scalar (a \(1\times 1\) matrix), in the reverse order, \(vw^*\) is an \(n\times n\) matrix, specifically:
\begin{equation*} vw^* =\begin{bmatrix}a_1\\\vdots\\a_n\end{bmatrix} \begin{bmatrix}\overline{b_1}\amp\cdots\amp \overline{b_n}\end{bmatrix}= \begin{bmatrix} a_1\overline{b_1}\amp\cdots\amp a_1\overline{b_n}\\ a_2\overline{b_1}\amp\cdots\amp a_2\overline{b_n}\\ \vdots\amp \vdots\amp\vdots\\ a_n\overline{b_1}\amp\cdots\amp a_n\overline{b_n}\\ \end{bmatrix}. \end{equation*}
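For instance, here is a small Sage sketch of our own (with arbitrary real vectors, so conjugation plays no visible role) contrasting the two products:

# Inner product w^*v (a 1x1 matrix) versus outer product vw^* (an nxn matrix).
v = matrix(QQ, [[1], [2], [3]])        # v as a 3x1 column matrix
w = matrix(QQ, [[4], [5], [6]])        # w as a 3x1 column matrix
print(w.transpose() * v)               # [32]: the inner product <v,w>
print(v * w.transpose())               # 3x3 matrix with (i,j) entry a_i b_j (conjugation is trivial over QQ)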
It is probably a bit more useful to see how this product arises naturally in what we have done so far.
Let's apply the definition of an orthogonal projection to the inner product space \(V=\C^n\text{;}\) what happens for \(V=\R^n\) will be clear.
Let \(W\) be an \(r-\)dimensional subspace of \(V\) with orthonormal basis \(\{w_1, \dots, w_r\}.\) Then Definition 3.2.13 tells us that the orthogonal projection of a vector \(v\) into the subspace \(W\) is given by:
\begin{equation*} \proj_W v := \la v,w_1\ra w_1 + \cdots+ \la v,w_r\ra w_r. \end{equation*}
Now while for a vector space \(V\) over a field \(F\text{,}\) we have defined multiplication of a scalar \(\lambda\) times a vector \(v\) as \(\lambda v\text{,}\) you might ask if we would get into trouble if we defined \(v \lambda := \lambda v\text{.}\) Since multiplication in a field is commutative, this turns out to be just fine, but in more general structures (modules over rings) there can be significant issues. So with that as preamble, let’s consider a summand \(\la v,w_j\ra w_j\) in the expression for an orthogonal projection. First we use that scalar multiplication can be thought of on the right or the left and then we use the specific nature of the inner product on \(\C^n,\) so that
\begin{equation*} \la v,w_j\ra w_j = w_j\la v,w_j\ra = w_j w_j^* v \end{equation*}
Thus as a corollary (Corollary 3.3.1) we obtain a matrix-specific characterization of the orthogonal projection onto a subspace \(W\) of \(\C^n\text{:}\) with \(\{w_1, \dots, w_r\}\) an orthonormal basis of \(W\text{,}\)
\begin{equation*} \proj_W v = \left(\sum_{k=1}^r w_k w_k^*\right) v. \end{equation*}
Our next goal is to give a more intrinsic characterization of the matrix \(\sum_{k=1}^r w_k w_k^*.\) Let \(A\) be the \(n\times r\) matrix whose columns are the orthonormal basis \(\{w_1, \dots, w_r\}\) of the subspace \(W.\) What should the matrix \(A^*A\) look like?
Using our familiar (row-column) method of multiplying two matrices together, the \(ij\) entry of the product is
\begin{equation*} w_i^* w_j = \la w_j, w_i\ra = \delta_{ij} \text{ (Kronecker delta)}, \end{equation*}
so that \(A^*A = I_r\text{,}\) the \(r\times r\) identity matrix.
In the other order we claim that
\begin{equation*} AA^* = \sum_{k=1}^r w_k w_k^* \end{equation*}
which, by Corollary 3.3.1, means that \(AA^*\) is the matrix (with respect to the standard basis) of the orthogonal projection onto the subspace \(W.\)
This claim is most easily justified using the "column-row" expansion of a matrix product as given in [2]. If \(A\) is an \(n\times r\) matrix (as it is for us), and \(B\) is an \(r\times m\) matrix, then
\begin{equation*} AB = \text{col}_1(A)\text{row}_1(B)+ \cdots+ \text{col}_r(A)\text{row}_r(B). \end{equation*}

Proof.

The proof is simply a computation, but it is easy to make an error, so we write it out explicitly. Note that each summand is the product of an \(n\times 1\) matrix with a \(1\times m\) matrix.
\begin{align*} \text{col}_1(A)\text{row}_1(B)\amp + \cdots+ \text{col}_r(A)\text{row}_r(B) =\\ \amp \begin{bmatrix} a_{11}b_{11}\amp \cdots \amp a_{11} b_{1m}\\ a_{21}b_{11}\amp \cdots \amp a_{21} b_{1m}\\ \vdots\amp\vdots\amp \vdots\\ a_{n1}b_{11}\amp \cdots \amp a_{n1} b_{1m}\\ \end{bmatrix} + \cdots+ \begin{bmatrix} a_{1r}b_{r1}\amp \cdots \amp a_{1r} b_{rm}\\ a_{2r}b_{r1}\amp \cdots \amp a_{2r} b_{rm}\\ \vdots\amp\vdots\amp \vdots\\ a_{nr}b_{r1}\amp \cdots \amp a_{nr} b_{rm}\\ \end{bmatrix}. \end{align*}
Now from the row-column rule we know that the \(ij\) entry of \(AB\) is \((AB)_{ij} = \sum_{k=1}^r a_{ik}b_{kj},\) which is exactly the sum of the \(ij\) entries from each of the \(r\) matrices above.
Now we apply this to the product \(AA^*\text{.}\) The column-row rule immediately gives that
\begin{equation*} AA^* = w_1 w_1^* + \cdots + w_r w_r^* \end{equation*}
as claimed. We summarize this as follows: if \(A\) is an \(n\times r\) matrix whose columns form an orthonormal basis of the subspace \(W\text{,}\) then the matrix of \(\proj_W\) with respect to the standard basis is \([\proj_W] = AA^*\text{.}\)
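As a quick sanity check (our own Sage sketch, not part of the text), one can verify the column-row expansion on a random example:

# Verify AB = col_1(A) row_1(B) + ... + col_r(A) row_r(B).
A = random_matrix(QQ, 4, 3)
B = random_matrix(QQ, 3, 5)
S = zero_matrix(QQ, 4, 5)
for k in range(3):
    S = S + A.column(k).column() * B.row(k).row()   # col_k(A) times row_k(B)
print(S == A * B)                                   # True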
While this is a very pretty expression for the orthogonal projection onto a subspace \(W\text{,}\) it is predicated on having an orthonormal basis for the subspace. Of course Gram-Schmidt can be employed, but it is an interesting exercise to produce a matrix representation of the projection starting from an arbitrary basis for the subspace. We reproduce Proposition 4.18 of [3] (Proposition 3.3.3), whose proof contains several interesting ideas: if \(A\) is the \(n\times r\) matrix whose columns are any basis \(\{v_1,\dots,v_r\}\) of \(W\text{,}\) then \(A^*A\) is invertible and \(\proj_W v = A(A^*A)^{-1}A^*v\) for every \(v\in V\text{.}\)

Proof.

Given a vector \(v\text{,}\) we know its orthogonal projection \(\proj_W v\) is an element of \(W\text{,}\) and hence a linear combination of the basis vectors of \(W\text{,}\) say
\begin{equation*} \proj_W v = \lambda_1 v_1 + \cdots + \lambda_r v_r. \end{equation*}
On the other hand this linear combination can be represented as the matrix product
\begin{equation*} \lambda_1 v_1 + \cdots + \lambda_r v_r = A \boldsymbol{\lambda} \end{equation*}
where
\begin{equation*} \boldsymbol\lambda = \begin{bmatrix}\lambda_1\\\vdots\\\lambda_r\end{bmatrix}. \end{equation*}
Thus we begin with the identity
\begin{equation*} \proj_W v = A \boldsymbol{\lambda}. \end{equation*}
By Theorem 3.2.10, we know that \(v - \proj_W v = v- A\boldsymbol\lambda\in W^\perp\) so that for all \(j=1,\dots,r\)
\begin{equation*} \la v - A\boldsymbol\lambda ,v_j\ra = v^*_j (v - A\boldsymbol\lambda) = 0. \end{equation*}
Writing the system of \(r\) equations as a single matrix equation, we have
\begin{equation*} A^*(v-A\boldsymbol\lambda) = 0 \text{ or equivalently } A^*v = A^* A \boldsymbol\lambda. \end{equation*}
Assuming for the moment that \(A^*A\) is invertible, we multiply both sides of \(A^*v = A^* A \boldsymbol\lambda\) by \(A(A^*A)^{-1}\) to obtain
\begin{equation*} A(A^*A)^{-1}A^*v = A(A^*A)^{-1} (A^* A) \boldsymbol\lambda = A\boldsymbol\lambda = \proj_W v, \end{equation*}
as desired.
Finally, we must check that the \(r\times r\) matrix \(A^*A\) is invertible. By the rank-nullity theorem it suffices to know that \(A^*A\) has trivial nullspace. So suppose that \(A^*A x=0\) for some \(x\in\C^r.\) Since \(\la 0,x\ra = 0,\) we can write
\begin{equation*} 0 = \la A^*A x,x\ra= x^* (A^*Ax) = (Ax)^*(Ax) = \|Ax\|^2. \end{equation*}
Thus \(A^*Ax = 0\) implies \(Ax=0\text{,}\) but \(A\) is an \(n\times r\) matrix which defines a linear map from \(\C^r\to \C^n\text{.}\) Since \(A\) has \(r\) linearly independent columns, it has rank \(r.\) By the rank-nullity theorem, it follows that the nullity of \(A\) is zero, so \(Ax=0\) implies \(x=0.\) Thus \(A^*A\) has trivial nullspace and so is invertible.
Let’s work through an example showing an orthogonal projection using the three different characterizations given above. We fix the vector space \(V=\R^3\text{,}\) and let \(w_1=\ba{r}1\\1\\-2\ea\) and \(w_2=\ba{r}5\\-1\\2\ea\text{,}\) \(W = \Span\{w_1, w_2\},\) and \(y=\begin{bmatrix}0\\0\\1\end{bmatrix}\text{.}\) We note that \(w_1\) and \(w_2\) are orthogonal but not orthonormal, and we claim \(y \notin W\text{.}\)

Example 3.3.4. From the definition.

Using Definition 3.2.13, we see that
\begin{equation*} \proj_W y = \frac{\la y,w_1\ra}{\la w_1,w_1\ra} w_1 + \frac{\la y,w_2\ra}{\la w_2,w_2\ra} w_2 =\frac{-2}{6}\ba{r}1\\1\\-2\ea + \frac{2}{30}\ba{r}5\\-1\\2\ea= \ba{r}0\\-2/5\\4/5\ea. \end{equation*}
We also check that
\begin{equation*} y^\perp = y - \proj_W y = \ba{r}0\\2/5\\1/5\ea \in W^\perp. \end{equation*}

Example 3.3.5. Using a matrix with orthonormal columns.

Normalizing the vectors \(w_1\) and \(w_2\text{,}\) we obtain a matrix with orthonormal columns spanning \(W\text{:}\)
\begin{equation*} A = \ba{rr}1/\sqrt6\amp5/\sqrt{30}\\1/\sqrt6\amp-1/\sqrt{30}\\-2/\sqrt6\amp 2/\sqrt{30}\ea \end{equation*}
That \(A\) has orthonormal columns implies that \(A^*A (=A^TA)= I_2 \) (the two-by-two identity matrix), and that the matrix of \(\proj_W\) with respect to the standard basis for \(\R^3\) is
\begin{equation*} [\proj_W] = AA^* = \ba{rrr}1\amp0\amp0\\0\amp1/5\amp -2/5\\0\amp-2/5\amp4/5\ea \end{equation*}
and we check that
\begin{equation*} \proj_W y =\ba{rrr}1\amp0\amp0\\0\amp1/5\amp -2/5\\0\amp-2/5\amp4/5\ea \ba{r}0\\0\\1\ea = \ba{r}0\\-2/5\\4/5\ea. \end{equation*}
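A quick Sage check of this computation (a sketch of our own, done symbolically so that the square roots simplify):

# Example 3.3.5: A has orthonormal columns, so A^T A = I_2 and A A^T = [proj_W].
A = matrix(SR, [[1/sqrt(6),  5/sqrt(30)],
                [1/sqrt(6), -1/sqrt(30)],
                [-2/sqrt(6), 2/sqrt(30)]])
print((A.transpose() * A).simplify_full())   # the 2x2 identity matrix
print((A * A.transpose()).simplify_full())   # the matrix of proj_W above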

Example 3.3.6. Using the given vectors in matrix form.

Now we use Proposition 3.3.3 with the original vectors as the columns of the matrix
\begin{equation*} A = \ba{rr}1\amp 5\\1\amp -1\\-2\amp 2\ea\text{.} \end{equation*}
So the matrix of the projection is
\begin{equation*} [\proj_W] = A (A^*A)^{-1}A^*. \end{equation*}
We note that
\begin{equation*} A^*A = \ba{rr}6\amp0\\0\amp 30\ea \text{ so } (A^*A)^{-1} = \ba{rr}1/6\amp0\\0\amp 1/30\ea \end{equation*}
and
\begin{align*} [\proj_W] \amp= A (A^*A)^{-1}A^* =\\ \amp\ba{rr}1\amp 5\\1\amp -1\\-2\amp 2\ea \ba{rr}1/6\amp0\\0\amp 1/30\ea \ba{rrr}1\amp 1\amp -2\\5\amp -1\amp 2\ea = \ba{rrr}1\amp0\amp0\\0\amp1/5\amp -2/5\\0\amp-2/5\amp4/5\ea \end{align*}
as in the previous computation.

Remark 3.3.7. Which is the better method?

At first blush (maybe second too), it sure looks like the first example gives a method with the least amount of work. So why should we even consider the second or third methods?
The answer depends upon the intended application. If there is a single computation to make, the first method is most likely the most efficient, but if you must compute the orthogonal projection of many vectors into the same subspace, then the matrix method is far superior since you compute the matrix only once.
Examples of multiple projections include writing a computer graphics program which renders a three dimensional image on a flat screen (aka a plane).

Remark 3.3.8.

One final comment of note. Since
\begin{equation*} V = W \boxplus W^\perp, \end{equation*}
we know that the identity operator \(I_V\) can be written as
\begin{equation*} I_V = \proj_W + \proj_{W^\perp}. \end{equation*}
This means that
\begin{equation*} \proj_W v = v - \proj_{W^\perp} v, \end{equation*}
so if the dimension of \(W^\perp\) is smaller than that of \(W,\) it may make more sense to compute \(\proj_{W^\perp}\) and subtract it from the identity to obtain the desired projection.

Example 3.3.9. Point closest to a plane.

Let’s do another example illustrating some of the concepts above. Let \(V=\R^3\) and let \(W\) be the subspace described by \(3x-y-5z=0.\) Let’s find the point on the plane closest to the point \(v=(1,1,1)\text{.}\)
We know that the plane \(W\) is spanned by any two linearly independent vectors in \(W,\) say
\begin{equation*} v_1 =\ba{r}1\\3\\0\ea \text{ and } v_2 = \ba{r}0\\-5\\1\ea. \end{equation*}
We form the matrix whose columns are \(v_1\) and \(v_2,\) and use Proposition 3.3.3 to compute the matrix of the projection (with respect to the standard basis) as
\begin{equation*} [\proj_W] = \ba{rrr} \frac{26}{35} \amp \frac{3}{35} \amp \frac{3}{7} \\ \frac{3}{35} \amp \frac{34}{35} \amp -\frac{1}{7} \\ \frac{3}{7} \amp -\frac{1}{7} \amp \frac{2}{7}\ea \end{equation*}
Thus
\begin{equation*} \proj_W v = \ba{rrr} \frac{26}{35} \amp \frac{3}{35} \amp \frac{3}{7} \\ \frac{3}{35} \amp \frac{34}{35} \amp -\frac{1}{7} \\ \frac{3}{7} \amp -\frac{1}{7} \amp \frac{2}{7}\ea \ba{r}1\\1\\1\ea = \frac{1}{35}\ba{r}44\\32\\20\ea. \end{equation*}
On the other hand, we could arrive at the answer via \(\proj_{W^\perp}\text{.}\) Since \(W^\perp\) is spanned by \(v_3=\ba{r}3\\-1\\-5\ea\text{,}\)
\begin{equation*} \proj_{W^\perp} v = \frac{\la v,v_3\ra}{\la v_3,v_3\ra} v_3 = \frac{1}{35}\ba{r}-9\\3\\15\ea, \end{equation*}
so
\begin{equation*} \proj_W v = v - \proj_{W^\perp} v = \frac{1}{35}\ba{r}44\\32\\20\ea. \end{equation*}
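Both routes are easy to check in Sage; here is a sketch (the variable names are ours):

# Example 3.3.9: project v onto the plane W: 3x - y - 5z = 0, two ways.
v  = vector(QQ, [1, 1, 1])
B  = matrix(QQ, [[1, 0], [3, -5], [0, 1]])          # columns v1, v2 span W
P  = B * (B.transpose() * B)^(-1) * B.transpose()   # [proj_W] via Proposition 3.3.3
v3 = vector(QQ, [3, -1, -5])                        # spans W^perp
print(P * v)                                        # (44/35, 32/35, 20/35)
print(v - (v * v3 / (v3 * v3)) * v3)                # the same vector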

Subsection 3.3.2 Sage Computations

In this section, we use Sage to make some of the computations in the above examples. In those examples, we started with an orthogonal basis spanning the subspace \(W\) in \(V=\R^3\text{,}\) given by \(w_1=\ba{r}1\\1\\-2\ea\) and \(w_2=\ba{r}5\\-1\\2\ea\text{.}\)
Of course, more typically we have an arbitrary basis and need to use Gram-Schmidt to produce an orthogonal one. Also recall that Gram-Schmidt simply accepts the first of the given vectors as the first in the orthogonal basis. So let’s start with the basis \(w_1=\ba{r}1\\1\\-2\ea\) and \(w_2'=\ba{r}6\\0\\0\ea = w_1 + w_2\) (so that the basis is not already orthogonal).
So we build a matrix \(A\) whose row vectors are \(w_1\) and \(w_2'\text{.}\) The Gram-Schmidt algorithm in Sage returns two matrices: \(G\) is the matrix whose rows are an orthogonal basis, and \(M\) is the matrix which records the linear combinations of the given rows used to produce the orthogonal rows. As you will see, we return to our original orthogonal basis.
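The Sage cell might look something like the following sketch (exact arithmetic over the rationals; the original worksheet may differ in details):

# Rows of A are the starting basis w1 and w2'.
A = matrix(QQ, [[1, 1, -2], [6, 0, 0]])
G, M = A.gram_schmidt()      # A == M*G, and the rows of G are orthogonal
print(G)                     # rows (1, 1, -2) and (5, -1, 2): the original orthogonal basis
print(M)                     # records the linear combinations used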
Next we compute the orthogonal projection of the vector \(v = [0,0,1]\) onto the plane \(W\) using the definition of the orthogonal projection. Notice that the rows of the matrix \(A\) are now the orthogonal basis for \(W.\)
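A sketch of that computation:

# Project v onto W from the definition, using the orthogonal basis as the rows of A.
A = matrix(QQ, [[1, 1, -2], [5, -1, 2]])   # i.e. A = G from the previous cell
v = vector(QQ, [0, 0, 1])
w1, w2 = A.rows()
proj_v = (v * w1 / (w1 * w1)) * w1 + (v * w2 / (w2 * w2)) * w2
print(proj_v)                              # (0, -2/5, 4/5)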
Finally, we make \(A\) into a matrix whose columns are the orthogonal basis, so as to coincide with Proposition 3.3.3, and then compute the matrix of \(\proj_W\) with respect to the standard basis.
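A sketch of that last cell:

# Columns of B are the orthogonal basis (B is A transposed); apply Proposition 3.3.3.
B = matrix(QQ, [[1, 5], [1, -1], [-2, 2]])
P = B * (B.transpose() * B)^(-1) * B.transpose()    # the matrix [proj_W]
print(P)                                            # as in Example 3.3.6
print(P * vector(QQ, [0, 0, 1]))                    # (0, -2/5, 4/5)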

Subsection 3.3.3 More on orthogonal projections

We return to a motivating example: how to deal with an inconsistent linear system \(Ax=b.\) Since the system is inconsistent, we know that \(\|Ax-b\| > 0\) for every \(x\) in the domain. What we want to do is find a vector \(\hat x\) which minimizes the error, that is, one for which
\begin{equation*} \|A \hat x - b\| \le \|Ax -b\| \end{equation*}
for every \(x\) in the domain.
So we let \(W\) be the column space of the \(m\times n\) matrix \(A\) and let \(\hat b = \proj_W b.\) Since \(\hat b\) is in the column space, the system \(Ax=\hat b\) is consistent. With \(\hat x\) any solution to \(Ax=\hat b\text{,}\) Corollary 3.2.15 says that
\begin{equation*} \|A \hat x - b\| \le \|Ax -b\| \end{equation*}
for every \(x\) in the domain.
To compute this solution, there are multiple paths. Of course, we could compute the orthogonal projection, \(\hat b\) and solve the consistent system \(Ax=\hat b\text{,}\) but what if we could solve it without finding the orthogonal projection? That would be a significant time-saver.
Let’s start from the premise that we have found the orthogonal projection, \(\hat b\) of \(b\) into \(W = C(A)\text{,}\) and a solution \(\hat x\) to \(Ax=\hat b.\) Now by Theorem 3.2.10, \(b = \hat b + b^\perp\) where
\begin{equation*} b^\perp = b-\hat b = b-A\hat x\in W^\perp. \end{equation*}
Since \(W\text{,}\) the column space of \(A\text{,}\) is the image (range) of the linear map \(x\mapsto Ax,\) we deduce that
\begin{equation*} \la Ax, A\hat x-b\ra_m = 0. \end{equation*}
By Proposition 3.2.16, we deduce
\begin{equation*} \la Ax, A\hat x-b\ra_m = \la x, A^*(A\hat x -b)\ra_n =0, \end{equation*}
for every \(x\in \C^n.\) In particular, taking \(x = A^*(A\hat x -b)\) and using the positivity property of the inner product, we conclude that \(A^*(A\hat x -b) =0.\) Thus to find \(\hat x,\) we need only find a solution to the new linear system
\begin{equation*} A^*A \hat x = A^* b. \end{equation*}
We summarize this as Corollary 3.3.10: the vector \(\hat x\) is a least squares solution of \(Ax=b\) (that is, \(\|A \hat x - b\| \le \|Ax -b\|\) for every \(x\)) if and only if \(\hat x\) is a solution of the linear system \(A^*A \hat x = A^* b\text{,}\) and this latter system is always consistent.
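For instance, here is a minimal Sage sketch (with a small made-up inconsistent system of our own) of solving the normal equations:

# Least squares solution of an inconsistent system Ax = b via A^*A xhat = A^* b.
A = matrix(QQ, [[1, 0], [1, 1], [1, 2]])
b = vector(QQ, [6, 0, 0])
xhat = (A.transpose() * A).solve_right(A.transpose() * b)
print(xhat)                               # (5, -3), a least squares solution
print(A * xhat)                           # the projection of b onto the column space of A
print(A.transpose() * (A * xhat - b))     # the zero vector: the residual lies in W^perp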

Checkpoint 3.3.11. Does it matter which solution \(\hat x\) we pick?

In a theoretical sense the answer is no, but in a computational sense, the answer is probably. Of course if the system has a unique solution, the issue is resolved, but if it has more than one solution, there are infinitely many since any two differ by something in the nullspace of \(A.\) How should one choose?

Subsection 3.3.4 Least Squares Examples

A common problem is determining a curve which best describes a set of observed data points. The curve may be a polynomial, exponential, logarithmic, or something else. Below we investigate how to produce a polynomial which represents a least squares approximation to a set of data points. We begin with the simplest example, linear regression.
Consider the figure below in which two observed data points are plotted at \((x_i,y_i)\) and \((x_j,y_j).\) The goal is to find an equation of a line of the form \(y=mx+b\) which “best describes” the given data, but what does that mean? Since, in general, not all data points will lie on any chosen line, each choice of line will produce some error in approximation. Our first job is to decide on a method to measure the error. Looking at this generally, suppose we have observed data \(\{(x_i,y_i)\mid i=1,\dots,n\}\) and we are trying to find the best function \(y=f(x)\) which fits the data.
Figure 3.3.12. The concept of a least squares approximation
We could set the error to be
\begin{equation*} E = \sum_{i=1}^n (y_i - f(x_i)), \end{equation*}
but we immediately see this is a poor choice: when \(y_i \gt f(x_i)\) the error is counted as positive, while when \(y_i \lt f(x_i)\) it is counted as negative, so a really poor approximation could produce a small total error by having positive errors balanced by negative ones. Of course, one solution would be simply to take absolute values, but they are often a bit challenging to work with, so for this and for reasons connected to the inner product on \(\R^n\text{,}\) we choose the sum of the squares of the errors:
\begin{equation*} E = \sum_{i=1}^n (y_i - f(x_i))^2, \end{equation*}
so for our linear model the error is
\begin{equation*} E = \sum_{i=1}^n (y_i - mx_i-b)^2. \end{equation*}
So where is the linear algebra? It might occur to you, staring at the expression for the error, that if we had two vectors
\begin{equation*} Y=\ba{r}y_1\\\vdots\\y_n\ea \text{ and } Z= \ba{r}mx_1+b\\\vdots\\mx_n+b\ea \end{equation*}
then our error is
\begin{equation*} E = \|Y-Z\|^2 = \la Y-Z,Y-Z\ra. \end{equation*}
It is clear where the vector \(Y\) comes from, but let’s see if we can get a matrix involved to describe \(Z.\) Let
\begin{equation*} A = \ba{rr}x_1\amp 1\\\vdots\amp\vdots\\x_n\amp 1\ea. \end{equation*}
Then
\begin{equation*} Z=\ba{r}mx_1+b\\\vdots\\mx_n+b\ea= A\ba{r}m\\b\ea. \end{equation*}
What would it mean if all the data points were to lie on the line? Of course it would mean the error is zero, but to move us in the direction of work we have already done, it would mean that
\begin{equation*} A\ba{r}m\\b\ea = \ba{r}y_1\\\vdots\\y_n\ea, \end{equation*}
in other words the linear system
\begin{equation*} AX=Y \end{equation*}
is solvable with solution \(X=\ba{r}m\\b\ea.\)
When the data points do not all lie on the line, the original system is inconsistent, but Corollary 3.3.10 tells us how to find the best solution \(\hat X=\ba{r}m\\b\ea\) for which
\begin{equation*} \|A\hat X - Y\| \le \|AX -Y\| \end{equation*}
for all \(X\in \R^2.\) Recalling that our error \(E=\|A\hat X - Y\|^2\text{,}\) this will solve our problem.
A simple example.
Suppose we have collected the following data points \((x,y)\text{:}\)
\begin{equation*} \{(2,1),(5,2),(7,3),(8,3)\}. \end{equation*}
We construct the matrix
\begin{equation*} A = \ba{rr}x_1\amp 1\\\vdots\amp\vdots\\x_n\amp 1\ea=\ba{rr}2\amp 1\\5\amp 1\\7\amp 1\\8\amp 1\ea, \end{equation*}
and
\begin{equation*} Y = \ba{r}y_1\\\vdots\\y_n\ea = \ba{r}1\\2\\3\\3\ea \end{equation*}
Using Corollary 3.3.10, we solve the consistent linear system
\begin{equation*} A^*A \hat X = A^* Y: \end{equation*}
\begin{equation*} A^*A = \ba{rrrr}2\amp 5\amp 7\amp 8\\1\amp 1\amp 1\amp 1 \ea\ba{rr}2\amp 1\\5\amp 1\\7\amp 1\\8\amp 1\ea = \ba{rr}142\amp 22\\22\amp 4\ea \end{equation*}
and
\begin{equation*} A^*Y = \ba{rrrr}2\amp 5\amp 7\amp 8\\1\amp 1\amp 1\amp 1 \ea\ba{r}1\\2\\3\\3\ea=\ba{r}57\\9\ea \end{equation*}
We note that \(A^*A\) is invertible, so that
\begin{equation*} \ba{rr}142\amp 22\\22\amp 4\ea \ba{r}m\\b\ea = \ba{r}57\\9\ea \end{equation*}
has a unique solution:
\begin{equation*} \hat X = \ba{r}m\\b\ea=\ba{r}5/14\\2/7\ea, \end{equation*}
that is, the desired line is \(y = \frac{5}{14}x+\frac{2}{7}.\) We plot the data points and the line of regression below. Note that the first point lies on the line.
Figure 3.3.13. A simple linear regression
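As a cross-check, the normal equations for this example can be set up and solved in Sage in a few lines (a sketch of our own):

# Linear regression for the data {(2,1), (5,2), (7,3), (8,3)}.
xs = [2, 5, 7, 8]
ys = [1, 2, 3, 3]
A = matrix(QQ, [[x, 1] for x in xs])
Y = vector(QQ, ys)
m, b = (A.transpose() * A).solve_right(A.transpose() * Y)
print(m, b)                               # 5/14 2/7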
We now consider higher degree polynomial approximations. For background, we know that two points determine a line, so we need to use linear regression as soon as we have more than two points. Lagrange interpolation tells us that given \(n\) points in the plane, no two on a common vertical line, there is a unique polynomial of degree at most \(n-1\) which passes through them. We consider the quadratic case: as soon as there are more than three points, we are no longer guaranteed any quadratic curve passing through all of them, so we seek a least squares approximation.
Now we are looking for coefficients \(b_0, b_1, b_2\) so that \(y = b_2x^2 + b_1x + b_0\) best approximates the data. As before assume that we have observed data
\begin{equation*} (x_i,y_i) = (60,3.1),(61,3.6),(62,3.8),(63,4),(65,4.1), i = 1\dots 5. \end{equation*}
In our quadratic model we have five equations of the form:
\begin{equation*} y_i = b_2x_i^2 + b_1x_i+b_0+\varepsilon_i \end{equation*}
where \(\varepsilon_i\) is the difference between the observed value and the value predicted by the quadratic. As before we have a matrix equation of the form
\begin{equation*} Y = AX + \varepsilon(X) \end{equation*}
where
\begin{equation*} A = \ba{rrr}60^2\amp 60\amp 1\\61^2\amp 61\amp 1\\62^2\amp 62\amp 1\\63^2\amp 63\amp 1\\65^2\amp 65\amp 1\ea, \quad X=\ba{r}b_2\\b_1\\b_0\ea, \text{ and } Y=\ba{r}3.1\\3.6\\3.8\\4\\4.1\ea. \end{equation*}
Again, we seek an \(\hat X\) so that
\begin{equation*} \|Y-A\hat X\| \le \|Y-AX\|=\|\varepsilon(X)\| \quad (E(X) = \|\varepsilon(X)\|^2). \end{equation*}
So we want to solve the consistent system
\begin{equation*} A^*A \hat X = A^* Y. \end{equation*}
We have
\begin{equation*} A^*A = \ba{rrr}75185763\amp 1205981\amp 19359\\1205981\amp 19359\amp 311\\19359\amp 311\amp 5\ea, A^*Y = \ba{r}\frac{723613}{10}\\\\\frac{11597}{10}\\\\\frac{93}{5}\ea,\text{ and } \hat X = \ba{r}-\frac{141}{2716}\\\\\frac{90733}{13580}\\\\-\frac{715864}{3395}\ea. \end{equation*}
So the quadratic is
\begin{equation*} y = -\frac{141}{2716}x^2 + \frac{90733}{13580}x - \frac{715864}{3395}. \end{equation*}
The points together with the approximating quadratic are displayed below.
Figure 3.3.14. A quadratic least squares approximation
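The same normal-equation computation for the quadratic model can be sketched in Sage as follows (exact rational data, so the coefficients come out as the fractions above):

# Quadratic least squares fit to the five data points via A^*A xhat = A^* Y.
xs = [60, 61, 62, 63, 65]
ys = [31/10, 36/10, 38/10, 4, 41/10]
A = matrix(QQ, [[x^2, x, 1] for x in xs])
Y = vector(QQ, ys)
b2, b1, b0 = (A.transpose() * A).solve_right(A.transpose() * Y)
print(b2, b1, b0)                         # agrees with the coefficients found above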