Skip to main content

Worksheet 3.5 Worksheet: Least squares approximation

In many applied scenarios, it is not practical (or even, perhaps, possible) to find an exact solution to a problem. Sometimes we may be working with imperfect data. Other times, we may be dealing with a problem that is overdetermined,  1  such as a system of linear equations with more equations than variables. (Quite often, both of these issues may be present.)
An overdetermined system is quite likely to be inconsistent. That is, our problem requires finding a solution to a system \(A\xx=\mathbf{b}\text{,}\) where no such solution exists! When no solution is possible, we can ask whether it is instead possible to find a best approximation.
What would a best approximation look like in this case? Let \(U=\csp(A)\) denote the column space of \(A\text{,}\) which we know is a subspace of \(\R^n\) (assuming \(A\) is an \(m\times n\) matrix). The subspace \(U\) is precisely the set of all vectors \(\yy\) such that \(A\xx=\yy\) has a solution. Among these vectors, we would like to find the one that is closest to \(\mathbf{b}\text{,}\) in the sense that \(\len{\yy-\mathbf{b}}\) is as small as possible.
But we know exactly what this vector \(\yy\) should be: the orthogonal projection of \(\mathbf{b}\) onto \(U\text{.}\)
The presentation in this worksheet is based on the one given in the text by Nicholson (see Section 5.6). Further details may be found there, or, for the more statistically inclined, on Wikipedia 2 .
Given an inconsistent system \(A\xx = \mathbf{b}\text{,}\) we have two problems to solve:
  1. Find the vector \(\yy = \proj{U}{\mathbf{b}}\text{,}\) where \(U=\csp(A)\)
  2. Find a vector \(\zz\) such that \(A\zz=\yy\)
The vector \(\zz\) is then our approximate solution.
This can be done directly, of course:
  1. Find a basis for \(U\)
  2. Use the Gram-Schmidt Orthonormalization Algorithm to construct an orthogonal basis for \(U\)
  3. Use this orthogonal basis to compute \(\yy=\proj{U}{\mathbf{b}}\)
  4. Solve the system \(A\zz=\yy\text{.}\)
But there is another way to proceed: we know that \(\mathbf{b}-\yy=\mathbf{b}-A\zz\in U^\bot\text{,}\) so for any vector \(A\xx\in U\text{,}\) \((A\xx)\dotp (\mathbf{b}-A\zz)=0\text{.}\) Therefore,
\begin{align*} (A\xx)\dotp (\mathbf{b}-A\zz) \amp =0\\ (A\xx)^T(\mathbf{b}-A\zz) \amp =0\\ \xx^TA^T(\mathbf{b}-A\zz) \amp =0\\ \xx^T(A^T\mathbf{b}-A^TA\zz) \amp =0\text{,} \end{align*}
for any vector \(\xx\in\R^n\text{.}\)
Therefore, \(A^T\mathbf{b}-A^TA\zz=0\text{,}\) or
\begin{equation} A^TA\zz = A^T\mathbf{b}\text{.}\tag{3.5.1} \end{equation}
Solving this system, called the normal equations for \(\zz\text{,}\) will yield our approximate solution.
To begin, let’s compare the two methods discussed above for finding an approximate solution. Consider the system of equations \(A\xx=\mathbf{b}\text{,}\) where
\begin{equation*} A = \bbm 3\amp -1 \amp 0\amp 5\\ -2\amp 7\amp -3\amp 0\\ 4\amp -1\amp 2\amp 3\\ 0\amp 3\amp 9\amp -1\\ 7\amp -2\amp 4\amp -5\\ 1\amp 0\amp 3\amp -8\ebm \text{ and } \mathbf{b} = \bbm 4\\0\\1\\2\\-5\\-1\ebm\text{.} \end{equation*}

1.

Confirm that the system has no solution.

2.

Find an orthogonal basis for the column space of \(A\text{.}\)

3.

Compute the projection \(\yy\) of \(\mathbf{b}\) onto the column space of \(A\text{.}\)

4.

Solve the system \(A\zz=\yy\) for \(\zz\text{.}\)

5.

Solve the normal equations (3.5.1) for \(\zz\text{.}\)
Next, we want to consider a problem found in many introductory science labs: finding a line of best fit. The situation is as follows: in some experiment, data points \((x_1,y_1), (x_2,y_2),\ldots, (x_n,y_n)\) have been found. We would like to find a function \(y=f(x)=ax+b\) such that for each \(i=1,2,\ldots, n\text{,}\) the value of \(f(x_i)\) is as close as possible to \(y_i\text{.}\)
Note that we have only two parameters available to tune: \(a\) and \(b\text{.}\) We assume that some reasoning or experimental evidence has led us to conclude that a linear fit is reasonable. The challenge here is to make precise what we mean by “as close as possible”. We have \(n\) differences (sometimes called residuals) \(r_i=y_i-f(x_i)\) that we want to make small, by adjusting \(a\) and \(b\text{.}\) But making one of the \(r_i\) smaller might make another one larger!
A measure of the overall error in the fit of the line is given by the sum of squares
\begin{equation*} S = r_1^2+r_2^2+\cdots + r_n^2\text{,} \end{equation*}
and this is the quantity that we want to minimize. (Hence the name, “least squares”.)
Let \(\vv=\bbm a\\b\ebm\text{,}\) and note that \(f(x)=a+bx = \bbm 1\amp x\ebm\vv\text{.}\) Set \(\yy = \bbm y_1\\y_2\\\vdots\\y_n\ebm\) and \(\mathbf{r}=\bbm r_1\\r_2\\\vdots\\r_n\ebm\text{.}\) Then
\begin{equation*} \mathbf{r}=\yy - A\vv\text{,} \end{equation*}
where \(A = \bbm 1\amp x_1\\1\amp x_2\\\vdots \amp \vdots \\1\amp x_n\ebm\text{.}\) (Note that we are using \(\yy\) to denote a different sort of vector than on the previous page.)
We can safely assume that an exact solution \(A\vv=\yy\) is impossible, so we search for an approximate one, with \(\mathbf{r}\) as small as possible. (Note that the magnitude of \(\mathbf{r}\) satisfies \(\len{\mathbf{r}}^2=S\text{.}\)) But a solution \(\zz\) that makes \(\yy-A\zz\) as small as possible is exactly the sort of approximate solution that we just learned to calculate! Solving the normal equations for \(\zz=\bbm a\\b\ebm\text{,}\) we find that
\begin{equation*} \zz = (A^TA)^{-1}(A^T\yy)\text{.} \end{equation*}

6.

Find the equation of the best fit line for the following set of data points:
\begin{equation*} (1,2.03), (2, 2.37), (3, 2.91), (4, 3.58), (5, 4.11), (6, 4.55), (7, 4.93), (8, 5.44), (9, 6.18)\text{.} \end{equation*}

7.

Suppose we were instead trying to find the best quadratic fit for a dataset. What would our parameters be? What would the matrix \(A\) look like? Illustrate with an example of your own.
The term “overdetermined” is common in statistics. In other areas, such as physics, the term “over-constrained” is used instead.
en.wikipedia.org/wiki/Least_squares