Section 5.3 Orthogonal Projections and Least Squares Approximations
We begin with the notion of orthogonal projection introduced in the previous section. We find different ways to compute it other than from the definition, and give an application to least squares approximations.
Subsection 5.3.1 Orthonormal bases and orthogonal/unitary matrices.
Consider the inner product space \(V=F^n\) where \(F =
\R\) or \(\C,\) and denote by \(\overline z\) the complex conjugate of \(z.\)
If \(v=\begin{bmatrix}a_1\\\vdots\\a_n\end{bmatrix}\) and \(w=\begin{bmatrix}b_1\\\vdots\\b_n\end{bmatrix}\) are two vectors in \(F^n\text{,}\) we defined their inner product by:
\begin{equation*}
\la v,w\ra = \sum_{k=1}^n a_k\overline{b_k}.
\end{equation*}
It is very convenient to recognize values of the inner product via matrix multiplication. In particular, regarding the column vectors \(v,w\) as \(n\times 1\) matrices, the inner product \(\la v,w\ra\) is the \(1\times 1\) matrix product \(w^*v\) where \(w^*\) is the \(1\times n\) conjugate-transpose matrix to \(w.\)
For vectors \(v,w \) as above, we have seen the meaning of \(w^*v\text{.}\) It is more than idle curiosity to inquire about the meaning of \(vw^*.\) We can certainly compute it, but first we note that while \(w^*v = \la v,w\ra\) is a scalar (a \(1\times 1\) matrix), in the reverse order, \(vw^*\) is an \(n\times n\) matrix, specifically:
\begin{equation*}
vw^* = \ba{ccc}a_1\overline{b_1}\amp \cdots\amp a_1\overline{b_n}\\\vdots\amp \ddots\amp \vdots\\a_n\overline{b_1}\amp \cdots\amp a_n\overline{b_n}\ea,
\end{equation*}
the matrix whose \(ij\) entry is \(a_i\overline{b_j}.\)
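As a quick illustration, here is a minimal Sage sketch with a hypothetical pair of real vectors (so that conjugation is trivial and \(w^* = w^T\)):
# Compare the two products w^*v and vw^* for real column vectors.
v = matrix(QQ, 3, 1, [1, 2, 3])
w = matrix(QQ, 3, 1, [4, 5, 6])
print(w.transpose()*v)    # a 1x1 matrix: the inner product <v,w> = 32
print(v*w.transpose())    # a 3x3 matrix whose ij entry is a_i*b_j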
It is probably a bit more useful to see how this product arises naturally in what we have done so far.
Let’s apply the definition of an orthogonal projection to the inner product space \(V=\C^n\text{;}\) what happens for \(V=\R^n\) will be clear.
Let \(W\) be an \(r\)-dimensional subspace of \(V\) with orthonormal basis \(\{w_1, \dots, w_r\}.\) Then Definition 5.2.13 tells us that the orthogonal projection of a vector \(v\) into the subspace \(W\) is given by:
\begin{equation*}
\proj_W v = \sum_{j=1}^r \la v,w_j\ra w_j.
\end{equation*}
Now while for a vector space \(V\) over a field \(F\text{,}\) we have defined multiplication of a scalar \(\lambda\) times a vector \(v\) as \(\lambda v\text{,}\) you might ask if we would get into trouble if we defined \(v \lambda := \lambda
v\text{.}\) Since multiplication in a field is commutative, this turns out to be just fine, but in more general structures (modules over rings) there can be significant issues. So with that as preamble, let’s consider a summand \(\la v,w_j\ra w_j\) in the expression for an orthogonal projection. First we use that scalar multiplication can be thought of on the right or the left and then we use the specific nature of the inner product on \(\C^n,\) so that
\begin{equation*}
\la v,w_j\ra w_j = w_j\la v,w_j\ra = w_j(w_j^*v) = (w_jw_j^*)v,
\end{equation*}
where we note that the last expression is the matrix multiplication of an \(n\times n\) matrix times the \(n\times 1\) vector \(v.\) Summing over \(j\) gives the content of Corollary 5.3.1: for every \(v\in V\text{,}\)
\begin{equation*}
\proj_W v = \left(\sum_{k=1}^r w_kw_k^*\right)v.
\end{equation*}
Our next goal is to give a more intrinsic characterization of the matrix \(\sum_{k=1}^r w_k w_k^*.\) Let \(A\) be the \(n\times r\) matrix whose columns are the orthonormal basis \(\{w_1, \dots, w_r\}\) of the subspace \(W.\) What should the matrix \(A^*A\) look like?
Using our familiar (row-column) method of multiplying two matrices together, the \(ij\) entry of the product is
\begin{equation*}
(A^*A)_{ij} = w_i^*w_j = \la w_j,w_i\ra,
\end{equation*}
which is \(1\) when \(i=j\) and \(0\) otherwise since the columns are orthonormal; that is, \(A^*A = I_r.\) In the reverse order, we claim that \(AA^*\) is exactly the matrix \(\sum_{k=1}^r w_kw_k^*\) from Corollary 5.3.1, that is, \(AA^*\) is the matrix of the orthogonal projection (with respect to the standard basis) of a vector to the subspace \(W.\)
This claim is most easily justified using the "column-row" expansion of a matrix product as given in [2]. If \(A\) is an \(n\times r\) matrix (as it is for us), and \(B\) is an \(r\times m\) matrix, then
\begin{equation*}
AB = \text{col}_1(A)\text{row}_1(B)+ \cdots+
\text{col}_r(A)\text{row}_r(B).
\end{equation*}
Proof.
The proof is simply a computation, but it is easy to make an error, so we do it out explicitly. Note that each summand is the product of an \(n\times 1\) matrix with a \(1\times m\) matrix, so the \(ij\) entry of \(\text{col}_k(A)\text{row}_k(B)\) is \(a_{ik}b_{kj}.\)
Now from the row-column rule we know that the \(ij\) entry of \(AB\) is \((AB)_{ij} = \sum_{k=1}^r
a_{ik}b_{kj},\) which is exactly the sum of the \(ij\) entries from each of the \(r\) matrices above.
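Here is a quick Sage check of the column-row expansion on a pair of small hypothetical matrices (here \(r=2\)):
# Verify AB = col_1(A)row_1(B) + col_2(A)row_2(B) for a 3x2 and a 2x3 matrix.
A = matrix(QQ, [[1, 2], [3, 4], [5, 6]])
B = matrix(QQ, [[7, 8, 9], [10, 11, 12]])
expansion = A.column(0).column()*B.row(0).row() + A.column(1).column()*B.row(1).row()
print(expansion == A*B)   # True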
Now we apply this to the product of the matrices \(AA^*.\) The column-row rule immediately gives that
\begin{equation*}
AA^* = \text{col}_1(A)\text{row}_1(A^*)+ \cdots+ \text{col}_r(A)\text{row}_r(A^*) = \sum_{k=1}^r w_kw_k^*,
\end{equation*}
the matrix of the orthogonal projection onto \(W\) from Corollary 5.3.1. We record the result:
Let \(W\) be a subspace of \(\C^n\) with orthonormal basis \(\{w_1, \dots, w_r\},\) and let \(A\) be the \(n\times r\) matrix whose columns are those orthonormal basis vectors. Then for any vector \(v\in \C^n,\)
\begin{equation*}
\proj_W v := AA^*v \text{ and } A^*A = I_r\text{.}
\end{equation*}
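As a sanity check, here is a minimal Sage sketch of both statements, using a hypothetical orthonormal basis chosen with rational entries so the arithmetic is exact:
# Columns of A form an orthonormal basis of a plane W in R^3.
A = matrix(QQ, [[3/5, 0], [4/5, 0], [0, 1]])
print(A.transpose()*A)      # the 2x2 identity, A^*A = I_2
P = A*A.transpose()         # the matrix AA^* of proj_W
v = vector(QQ, [1, 1, 1])
print(P*v)                  # the orthogonal projection of v into W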
While this is a very pretty expression for the orthogonal projection onto a subspace \(W\text{,}\) it is predicated on having an orthonormal basis for the subspace. Of course Gram-Schmidt can be employed, but it is an interesting exercise to produce a matrix representation of the projection starting from an arbitrary basis for the subspace. We reproduce Proposition 4.18 of [3], including a proof which contains several interesting ideas.
Proposition 5.3.3.
Let \(W\) be a subspace of \(\C^n\) (or \(\R^n\)) with arbitrary basis \(\{v_1, \dots,
v_r\}.\) Let \(A\) be the \(n\times r\) matrix with columns \(v_1, \dots, v_r.\) Then the matrix of the orthogonal projection, \(\proj_W\text{,}\) with respect to the standard basis is
\begin{equation*}
A (A^*A)^{-1}A^*.
\end{equation*}
Proof.
Given a vector \(v\text{,}\) we know its orthogonal projection \(\proj_W v\) is an element of \(W\text{,}\) so a linear combination of the basis for \(W,\) say
\begin{equation*}
\proj_W v = x_1v_1 + \cdots + x_rv_r = A\hat x, \text{ where } \hat x = \ba{r}x_1\\\vdots\\x_r\ea \in \C^r.
\end{equation*}
Since \(v - \proj_W v\) lies in \(W^\perp\text{,}\) it is orthogonal to each basis vector \(v_j\text{,}\) which says exactly that \(A^*(v - A\hat x) = 0\text{,}\) or \(A^*A\hat x = A^*v.\) Granting for the moment that \(A^*A\) is invertible, we have \(\hat x = (A^*A)^{-1}A^*v\text{,}\) and hence
\begin{equation*}
\proj_W v = A\hat x = A(A^*A)^{-1}A^*v,
\end{equation*}
as claimed.
Finally, we must check that the \(r\times r\) matrix \(A^*A\) is invertible. By the rank-nullity theorem it suffices to know that \(A^*A\) has trivial nullspace. So suppose that \(A^*A v=0.\) Since \(\la 0,v\ra = 0,\) we can write
\begin{equation*}
0 = \la A^*Av, v\ra = \la Av, Av\ra = \|Av\|^2.
\end{equation*}
Thus \(A^*Av = 0\) implies \(Av=0\text{,}\) but \(A\) is an \(n\times r\) matrix which defines a linear map from \(\C^r\to \C^n\text{.}\) Since \(A\) has \(r\) linearly independent columns, it has rank \(r.\) By the rank-nullity theorem, it follows that the nullity of \(A\) is zero, so \(Av=0\) implies \(v=0.\) Thus \(A^*A\) has trivial nullspace and so is invertible.
Let’s work through an example showing an orthogonal projection using the three different characterizations given above. We fix the vector space \(V=\R^3\text{,}\) let \(w_1=\ba{r}1\\1\\-2\ea\) and \(w_2=\ba{r}5\\-1\\2\ea\text{,}\) set \(W = \Span\{w_1, w_2\},\) and let \(y=\begin{bmatrix}0\\0\\1\end{bmatrix}\text{.}\) We note that \(w_1\) and \(w_2\) are orthogonal, but not orthonormal, and claim \(y \notin W\text{.}\) First, using the definition directly (dividing by the squared lengths since the basis is orthogonal rather than orthonormal),
\begin{equation*}
\proj_W y = \frac{\la y,w_1\ra}{\la w_1,w_1\ra}w_1 + \frac{\la y,w_2\ra}{\la w_2,w_2\ra}w_2 = -\frac{1}{3}\ba{r}1\\1\\-2\ea + \frac{1}{15}\ba{r}5\\-1\\2\ea = \ba{r}0\\-2/5\\4/5\ea,
\end{equation*}
so that
\begin{equation*}
y^\perp = y - \proj_W y = \ba{r}0\\2/5\\1/5\ea \in W^\perp.
\end{equation*}
Example 5.3.5. Using a matrix with orthonormal columns.
Normalizing the vectors \(w_1\) and \(w_2\text{,}\) we obtain a matrix with orthonormal columns spanning \(W\text{:}\)
\begin{equation*}
A =
\ba{rr}1/\sqrt6\amp5/\sqrt{30}\\1/\sqrt6\amp-1/\sqrt{30}\\-2/\sqrt6\amp
2/\sqrt{30}\ea
\end{equation*}
That \(A\) has orthonormal columns implies that \(A^*A (=A^TA)= I_2 \) (the two-by-two identity matrix), while the matrix of \(\proj_W\) with respect to the standard basis for \(\R^3\) is
\begin{equation*}
AA^* = AA^T = \frac{1}{5}\ba{rrr}5\amp 0\amp 0\\0\amp 1\amp -2\\0\amp -2\amp 4\ea.
\end{equation*}
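Since the columns of \(A\) are \(w_1/\|w_1\|\) and \(w_2/\|w_2\|\text{,}\) we have \(AA^T = \frac{1}{6}w_1w_1^T + \frac{1}{30}w_2w_2^T\text{,}\) which the following Sage sketch computes exactly over the rationals, avoiding the square roots:
# Compute AA^T = w1 w1^T/6 + w2 w2^T/30 exactly.
w1 = matrix(QQ, 3, 1, [1, 1, -2])
w2 = matrix(QQ, 3, 1, [5, -1, 2])
P = w1*w1.transpose()/6 + w2*w2.transpose()/30
print(P)                          # the matrix of proj_W in the standard basis
print(P*vector(QQ, [0, 0, 1]))    # (0, -2/5, 4/5), as in the first computation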
At first blush (maybe second too), it sure looks like the first example gives a method with the least amount of work. So why should we even consider the second or third methods?
The answer depends upon the intended application. If there is a single computation to make, the first method is most likely the most efficient, but if you must compute the orthogonal projection of many vectors into the same subspace, then the matrix method is far superior since you only compute the matrix once.
Examples of multiple projections include writing a computer graphics program which renders a three dimensional image on a flat screen (aka a plane).
Remark 5.3.8.
One final comment of note. Since
\begin{equation*}
V = W \boxplus W^\perp,
\end{equation*}
we know that the identity operator \(I_V\) can be written as \(I_V = \proj_W + \proj_{W^\perp}\text{,}\) so that
\begin{equation*}
\proj_W v = v - \proj_{W^\perp} v,
\end{equation*}
so if the dimension of \(W^\perp\) is smaller than that of \(W,\) it may make more sense to compute \(\proj_{W^\perp}\) and subtract it from the identity to obtain the desired projection.
Example 5.3.9. Point closest to a plane.
Let’s do another example illustrating some of the concepts above. Let \(V=\R^3\) and \(W\) be the subspace described by \(3x-y-5z=0.\) Let’s find the point on the plane closest to the point \(v=(1,1,1)\text{.}\)
We know that the plane \(W\) is spanned by any two linearly independent vectors in \(W,\) say
\begin{equation*}
v_1 =\ba{r}1\\3\\0\ea \text{ and }
v_2 = \ba{r}0\\-5\\1\ea.
\end{equation*}
We form the matrix \(A\) whose columns are \(v_1\) and \(v_2,\) and use Proposition 5.3.3 to compute the matrix of the projection (with respect to the standard basis) as
\begin{equation*}
A(A^*A)^{-1}A^* = \frac{1}{35}\ba{rrr}26\amp 3\amp 15\\3\amp 34\amp -5\\15\amp -5\amp 10\ea.
\end{equation*}
Applying this matrix to \(v\) (or, in the spirit of Remark 5.3.8, projecting \(v\) onto the normal direction spanning \(W^\perp\) and subtracting) gives
\begin{equation*}
\proj_W v = v - \proj_{W^\perp} v =
\frac{1}{35}\ba{r}44\\32\\20\ea.
\end{equation*}
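A short Sage sketch checking this example with the formula from Proposition 5.3.3:
# Verify the closest-point computation of Example 5.3.9.
A = matrix(QQ, [[1, 0], [3, -5], [0, 1]])          # columns are v1 and v2
P = A*(A.transpose()*A).inverse()*A.transpose()    # matrix of proj_W
v = vector(QQ, [1, 1, 1])
print(P*v)                                         # (44/35, 32/35, 20/35)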
Subsection 5.3.2 Sage Computations
In this section, we use Sage to make some of the computations in the above examples. In those examples, we started with an orthogonal basis spanning the subspace \(W\) in \(V=\R^3\text{,}\) given by \(w_1=\ba{r}1\\1\\-2\ea\) and \(w_2=\ba{r}5\\-1\\2\ea\text{.}\)
Of course, more typically we have an arbitrary basis and need to use Gram-Schmidt to produce an orthogonal one. Also recall that Gram-Schmidt simply accepts the first of the given vectors as the first in the orthogonal basis. So let’s start with the basis \(w_1=\ba{r}1\\1\\-2\ea\) and \(w_2'=\ba{r}6\\0\\0\ea = w_1 + w_2\) (so that the basis is not already orthogonal).
So we build a matrix \(A\) whose row vectors are \(w_1\) and \(w_2'\text{.}\) The Gram-Schmidt algorithm in Sage returns two matrices: \(G\) is the matrix whose rows are an orthogonal basis, and \(M\) is the matrix which tells the linear combinations of the given rows used to produce the orthogonal rows. As you will see, we return to our original orthogonal basis.
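The Sage cell might look like the following sketch:
# Build A with rows w1 and w2', then apply Gram-Schmidt.
A = matrix(QQ, [[1, 1, -2], [6, 0, 0]])
G, M = A.gram_schmidt()
print(G)   # rows form an orthogonal basis of W: (1,1,-2) and (5,-1,2)
print(M)   # records the linear combinations used to produce the rows of G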
Next we compute the orthogonal projection of the vector \(v =
[0,0,1]\) onto the plane \(W\) using the definition of the orthogonal projection. Notice that the rows of the matrix \(A\) are now the orthogonal basis for \(W.\)
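A sketch of that computation (written so that the cell is self-contained, with the rows of \(A\) taken to be the orthogonal basis):
# Projection of v into W via the definition, dividing by the squared lengths.
A = matrix(QQ, [[1, 1, -2], [5, -1, 2]])
v = vector(QQ, [0, 0, 1])
w1, w2 = A.rows()
proj = (v.dot_product(w1)/w1.dot_product(w1))*w1 + (v.dot_product(w2)/w2.dot_product(w2))*w2
print(proj)   # (0, -2/5, 4/5)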
Finally, here we make \(A\) into a matrix with orthogonal columns to coincide with Proposition 5.3.3. We then compute the matrix of \(\proj_W\) with respect to the standard basis.
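A sketch of that final computation, transposing so the basis vectors become columns and applying \(A(A^*A)^{-1}A^*\text{:}\)
# Columns of A are the basis vectors; compute the matrix of proj_W.
A = matrix(QQ, [[1, 1, -2], [5, -1, 2]]).transpose()
P = A*(A.transpose()*A).inverse()*A.transpose()
print(P)                          # the matrix of proj_W in the standard basis
print(P*vector(QQ, [0, 0, 1]))    # agrees with the projection computed above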
Subsection 5.3.3 More on orthogonal projections
We return to a motivating example: how to deal with an inconsistent linear system \(Ax=b.\) Since the system is inconsistent, we know that \(\|Ax-b\| > 0\) for every \(x\) in the domain. What we want to do is find a vector \(\hat x\) which minimizes the error, that is, one for which
\begin{equation*}
\|A \hat x - b\| \le \|Ax -b\|
\end{equation*}
for every \(x\) in the domain.
So we let \(W\) be the column space of the \(m\times n\) matrix \(A\) and let \(\hat b = \proj_W b.\) Since \(\hat b\) is in the column space, the system \(Ax=\hat
b\) is consistent. With \(\hat x\) any solution to \(Ax=\hat b\text{,}\) Corollary 5.2.15 says that
\begin{equation*}
\|A \hat x - b\| \le \|Ax -b\|
\end{equation*}
for every \(x\) in the domain.
To compute this solution, there are multiple paths. Of course, we could compute the orthogonal projection, \(\hat
b\) and solve the consistent system \(Ax=\hat b\text{,}\) but what if we could solve it without finding the orthogonal projection? That would be a significant time-saver.
Let’s start from the premise that we have found the orthogonal projection, \(\hat b\) of \(b\) into \(W =
C(A)\text{,}\) and a solution \(\hat x\) to \(Ax=\hat b.\) Now by Theorem 5.2.10, \(b = \hat b +
b^\perp\) where
\begin{equation*}
b^\perp = b-\hat b = b-A\hat x\in
W^\perp.
\end{equation*}
Since \(W\text{,}\) the column space of \(A\text{,}\) is the image (range) of the linear map \(x\mapsto Ax,\) we deduce that
\begin{equation*}
0 = \la Ax, A\hat x - b\ra = \la x, A^*(A\hat x - b)\ra
\end{equation*}
for every \(x\in \C^n.\) By the positivity property of any inner product, that means that \(A^*(A\hat x -b) =0.\) Thus to find \(\hat x,\) we need only find a solution to the new linear system
\begin{equation*}
A^*A \hat x = A^* b.
\end{equation*}
We summarize this as
Corollary 5.3.10.
Let \(A \in M_{m\times n}(\C),\) and \(b \in \C^m.\) Then there is an \(\hat x \in \C^n\) so that
\begin{equation*}
\|A \hat x - b\| \le \|Ax -b\|
\end{equation*}
for all \(x\in \C^n.\) Moreover, the solution(s) \(\hat x\) are obtained by solving the consistent linear system \(A^*A \hat x = A^* b.\)
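Here is a minimal Sage sketch of the corollary in action, on a small hypothetical inconsistent system over the reals (so \(A^* = A^T\)):
# Solve the normal equations A^T A xhat = A^T b.
A = matrix(QQ, [[1, 1], [1, 2], [1, 3]])
b = vector(QQ, [1, 1, 2])      # chosen so that Ax = b is inconsistent
xhat = (A.transpose()*A).solve_right(A.transpose()*b)
print(xhat)                    # the least squares solution
print(A*xhat - b)              # the nonzero residual, which lies in W^perp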
Checkpoint 5.3.11. Does it matter which solution \(\hat x\) we pick?
In a theoretical sense the answer is no, but in a computational sense, the answer is probably. Of course if the system has a unique solution, the issue is resolved, but if it has more than one solution, there are infinitely many since any two differ by something in the nullspace of \(A.\) How should one choose?
Subsection 5.3.4 Least Squares Examples
A common problem is determining a curve which best describes a set of observed data points. The curve may be a polynomial, exponential, logarithmic, or something else. Below we investigate how to produce a polynomial which represents a least squares approximation to a set of data points. We begin with the simplest example, linear regression.
Consider the figure below in which two observed data points are plotted at \((x_i,y_i)\) and \((x_j,y_j).\) The goal is to find an equation of a line of the form \(y=mx+b\) which “best describes” the given data, but what does that mean? Since in general not all data points will lie on any chosen line, each choice of line will produce some error in approximation. Our first job is to decide on a method to measure the error. Looking at this generally, suppose we have observed data \(\{(x_i,y_i)\mid i=1,\dots, n\}\) and we are trying to find the best function \(y=f(x)\) which fits the data.
Figure 5.3.12. The concept of a least squares approximation
We could set the error to be
\begin{equation*}
E = \sum_{i=1}^n (y_i - f(x_i)),
\end{equation*}
but we immediately see this is a poor choice: when \(y_i \gt f(x_i)\) the error is counted as positive, while when \(y_i \lt f(x_i)\) it is counted as negative, so a really poor approximation could produce a small total error by having positive errors balanced by negative ones. Of course, one solution would be simply to take absolute values, but they are often a bit challenging to work with, so for this reason, and for reasons connected to the inner product on \(\R^n\text{,}\) we choose a sum of squares of the errors:
\begin{equation*}
E = \sum_{i=1}^n (y_i - f(x_i))^2,
\end{equation*}
so for our linear model the error is
\begin{equation*}
E = \sum_{i=1}^n (y_i - mx_i-b)^2.
\end{equation*}
So where is the linear algebra? It might occur to you, staring at the expression for the error, that if we had two vectors
\begin{equation*}
Y=\ba{r}y_1\\\vdots\\y_n\ea \text{ and } Z=
\ba{r}mx_1+b\\\vdots\\mx_n+b\ea
\end{equation*}
that our error is
\begin{equation*}
E = \|Y-Z\|^2 = \la Y-Z,Y-Z\ra.
\end{equation*}
It is clear where the vector \(Y\) comes from, but let’s see if we can get a matrix involved to describe \(Z.\) Let
\begin{equation*}
A = \ba{rr}x_1\amp 1\\\vdots\amp \vdots\\x_n\amp 1\ea.
\end{equation*}
What would it mean if all the data points were to lie on the line? Of course it would mean the error is zero, but to move us in the direction of work we have already done, it would mean that
\begin{equation*}
AX = Y, \text{ where } X = \ba{r}m\\b\ea,
\end{equation*}
that is, that the linear system \(AX=Y\) is consistent.
When the data points do not all lie on the line, the original system is inconsistent, but Corollary 5.3.10 tells us how to find the best solution \(\hat X=\ba{r}m\\b\ea\) for which
\begin{equation*}
\|A\hat X - Y\| \le \|AX -Y\|
\end{equation*}
for all \(X\in \R^2.\) Recalling that our error \(E=\|A\hat X - Y\|^2\text{,}\) this will solve our problem.
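Before turning to the examples, here is a minimal Sage sketch of this recipe on hypothetical data points (the names xs and ys are ours), just to show the mechanics:
# Least squares line y = mx + b via the normal equations.
xs = [0, 1, 2, 3]
ys = [1, 3, 4, 4]
A = matrix(QQ, [[x, 1] for x in xs])
Y = vector(QQ, ys)
mb = (A.transpose()*A).solve_right(A.transpose()*Y)
print(mb)   # the least squares slope m and intercept b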
A simple example.
Suppose we have collected the following data points \((x,y)\text{:}\)
\begin{equation*}
\hat X =
\ba{r}m\\b\ea=\ba{r}5/14\\2/7\ea,
\end{equation*}
that is, the desired line is \(y = \frac{5}{14}x+\frac{2}{7}.\) We plot the data points and the line of regression below. Note that the first point lies on the line.
Figure 5.3.13. A simple linear regression
We now consider higher degree polynomial approximations. For background, we know that two points determine a line, so we need linear regression as soon as we have more than two points. More generally, Lagrange interpolation tells us that given \(n\) points in the plane, no two on a common vertical line, there is a unique polynomial of degree at most \(n-1\) which passes through them. We consider the quadratic case: as soon as there are more than three points, we are no longer guaranteed a quadratic curve passing through all of them, so we seek a least squares approximation.
Now we are looking for coefficients \(b_0, b_1, b_2\) so that \(y = b_2x^2 + b_1x + b_0\) best approximates the data. As before assume that we have observed data
\begin{equation*}
(x_i,y_i) = (60,3.1),(61,3.6),(62,3.8),(63,4),(65,4.1), i = 1\dots
5.
\end{equation*}
In our quadratic model we have five equations of the form:
\begin{equation*}
y_i = b_2x_i^2 + b_1x_i + b_0 + \varepsilon_i,
\end{equation*}
where \(\varepsilon_i\) is the difference between the observed value and the value predicted by the quadratic. As before we have a matrix equation of the form
\begin{equation*}
Y = AX + \varepsilon(X)
\end{equation*}
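To see this concretely, here is a Sage sketch that sets up \(A\) and \(Y\) for the five data points above and solves the normal equations \(A^*AX = A^*Y\text{;}\) the rows of \(A\) are taken here to be \((x_i^2, x_i, 1)\text{,}\) matching the coefficient order \(b_2, b_1, b_0\text{:}\)
# Quadratic least squares fit for the observed data.
xs = [60, 61, 62, 63, 65]
ys = [31/10, 36/10, 38/10, 4, 41/10]   # the observed y-values as exact rationals
A = matrix(QQ, [[x^2, x, 1] for x in xs])
Y = vector(QQ, ys)
coeffs = (A.transpose()*A).solve_right(A.transpose()*Y)
print(coeffs)   # (b2, b1, b0)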