10  Ordinary Least Squares

“When you arise in the morning think of what a privilege it is to be alive, to think, to enjoy, to love …”
Marcus Aurelius

Ordinary least squares (OLS) is a popular technique for fitting a line (more generally, a hyperplane1) to a collection of possibly noisy data points.

1 A hyperplane generalizes the notion of a two-dimensional plane in three-dimensional space to arbitrary dimension.

10.1 A Motivating Example

We present a univariate (only one predictor) motivating example using the famous Palmer Penguins dataset. The dataset contains m=344 data points. There are 3 different species of penguins in this dataset, collected from 3 islands in the Palmer Archipelago, Antarctica.

Among the many variables available in the data, we select body_weight as our only predictor and flipper_length_mm as our response variable. A scatter plot is shown below.
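For readers who want to reproduce the plot, the following is a minimal sketch using seaborn's bundled copy of the Palmer Penguins data. Note that the body-weight predictor is stored there under the column name body_mass_g, and the plotting details (theme, colors) may differ from the figure in the text.

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Load seaborn's bundled copy of the Palmer Penguins data and drop rows
# with missing values in the two variables we use.
penguins = sns.load_dataset("penguins").dropna(
    subset=["body_mass_g", "flipper_length_mm"]
)

# Scatter plot of the response (flipper length) against the predictor (body mass).
sns.scatterplot(data=penguins, x="body_mass_g", y="flipper_length_mm")
plt.xlabel("Body mass (g)")
plt.ylabel("Flipper length (mm)")
plt.title("Palmer Penguins: flipper length vs. body mass")
plt.show()
```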

10.2 The Mathematical Formulation

Let us assume that each data point has n predictors (or features) and one response2. The predictor vector is written as a row vector \mathbf{x}=\begin{bmatrix}x_1&x_2&\ldots&x_n\end{bmatrix} in \mathbb{R}^n and the corresponding response is a scalar y\in\mathbb{R}. A single data point is then written as (\mathbf{x}, y), with the understanding that the first component is the (row) vector of predictors and the second component is the response scalar. Consequently, a dataset of size m is a collection of such data points: D = \{(\mathbf{x}_1, y_1), (\mathbf{x}_2, y_2), \ldots, (\mathbf{x}_m, y_m)\}. \tag{10.1} For the i^\text{th} data point, \mathbf{x}_i is the row vector of predictors: \mathbf{x}_i=\begin{bmatrix}x_{i1}&x_{i2}&\ldots&x_{in}\end{bmatrix}.

2 Here, all variables are real-valued, as opposed to boolean or ordinal.

Regressor Matrix

For a dataset D, we extract the predictor values to form the regressor matrix \mathbf{X} and the response (column) vector \mathbf{y}. \mathbf{X}=\begin{bmatrix} 1 & x_{11} & x_{12} & \ldots & x_{1n}\\ 1 & x_{21} & x_{22} & \ldots & x_{2n}\\ \vdots & \vdots & \vdots & \ddots & \vdots\\ 1 & x_{m1} & x_{m2} & \ldots & x_{mn}\\ \end{bmatrix}_{m\times(n+1)}\text{ and } \mathbf{y}=\begin{bmatrix} y_1\\ y_2 \\ \vdots\\ y_m \end{bmatrix}_{m\times1}

Note that the size of the regressor matrix is m\times(n+1); a first column of ones has been added to account for the intercept term.
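As a concrete illustration, the regressor matrix can be assembled in NumPy as follows; the function name and the toy data are ours, for illustration only.

```python
import numpy as np

def regressor_matrix(predictors):
    """Prepend a column of ones to an (m, n) array of predictors,
    producing the (m, n+1) regressor matrix X."""
    predictors = np.asarray(predictors, dtype=float)
    ones = np.ones((predictors.shape[0], 1))
    return np.hstack([ones, predictors])

# Illustrative data: m = 4 points, n = 2 predictors.
X = regressor_matrix([[1.0, 2.0], [2.0, 0.5], [3.0, 1.5], [4.0, 3.0]])
y = np.array([2.0, 2.5, 4.0, 5.5])
print(X.shape)  # (4, 3)
```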

The goal is to find an n-dimensional hyperplane in \mathbb{R}^{n+1} that best fits the data. More precisely, we find the hyperplane that minimizes the error incurred when the data points are projected onto the hyperplane along the response direction.

Recall the equation of an n-dimensional hyperplane \mathcal{H} in \mathbb{R}^{n+1}: Y = \beta_0 + \beta_1 X_1 + \ldots + \beta_n X_n, \tag{10.2} where \beta_0 is the intercept and \beta_1,\beta_2,\ldots,\beta_n are called the slopes for the corresponding predictor variables. The \beta’s uniquely determine a hyperplane. So, finding a hyperplane amounts to finding the (n+1) unknowns: \beta_0,\beta_1,\ldots,\beta_n.

We solve for the unknowns so that the following loss function—known as squared error—is minimized.

Loss Function

For any given dataset D (Equation 10.1) and hyperplane \mathcal{H} (Equation 10.2), we define the squared error function as follows: \mathcal{E}\coloneqq\frac{1}{m}\sum_{i=1}^m\left[y_i-(\beta_0+\beta_1x_{i1}+\ldots+\beta_nx_{in})\right]^2. \tag{10.3}

Exercise 10.1 Given a dataset and the equation of a hyperplane, write code to calculate the squared error.
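One possible sketch for Exercise 10.1, following Equation 10.3 term by term; the function name and argument layout are illustrative choices, not a prescribed solution.

```python
import numpy as np

def squared_error(x, y, beta):
    """Mean squared error of Equation 10.3, computed term by term.

    x    : (m, n) array of predictor values (without the column of ones),
    y    : (m,) array of responses,
    beta : (n+1,) coefficients [beta_0, beta_1, ..., beta_n].
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    m, n = x.shape
    total = 0.0
    for i in range(m):
        prediction = beta[0] + sum(beta[j + 1] * x[i, j] for j in range(n))
        total += (y[i] - prediction) ** 2
    return total / m
```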

10.2.1 OLS Solution

OLS is one of the rare problems in data science for which a closed-form, explicit solution is available. Indeed, the coefficients \beta_i of the hyperplane that minimizes the loss function (Equation 10.3) can be written down explicitly in terms of the input data.

In order to solve the minimization problem, we first denote by e_i the error3 (sometimes called the residual) of the i^\text{th} observation in the data. More precisely, e_i\coloneqq y_i-(\beta_0+\beta_1x_{i1}+\ldots+\beta_nx_{in}). We can then “stack” all m of the e_i’s to form a system of equations, \begin{aligned} e_1&=y_1-(\beta_0+\beta_1x_{11}+\ldots+\beta_nx_{1n})\\ e_2&=y_2-(\beta_0+\beta_1x_{21}+\ldots+\beta_nx_{2n})\\ &\quad\quad\vdots\\ e_m&=y_m-(\beta_0+\beta_1x_{m1}+\ldots+\beta_nx_{mn}). \end{aligned} In matrix notation, we write the system as follows: \begin{bmatrix} e_1 \\ e_2 \\ \vdots \\ e_m \end{bmatrix} =\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_m \end{bmatrix} -\begin{bmatrix} 1 & x_{11} & x_{12} & \ldots & x_{1n}\\ 1 & x_{21} & x_{22} & \ldots & x_{2n}\\ \vdots & \vdots & \vdots & \ddots & \vdots\\ 1 & x_{m1} & x_{m2} & \ldots & x_{mn}\\ \end{bmatrix} \begin{bmatrix} \beta_0\\ \beta_1 \\ \vdots\\ \beta_n \end{bmatrix}. Equivalently, \mathbf{e}=\mathbf{y}-\mathbf{X}\boldsymbol{\beta}, where \mathbf{X} and \mathbf{y} are the regressor matrix and the response vector of the given data; \mathbf{e}=\begin{bmatrix}e_1 \\ e_2 \\ \vdots \\ e_m\end{bmatrix} is the residual vector and \boldsymbol{\beta}=\begin{bmatrix}\beta_0 \\ \beta_1 \\ \vdots \\ \beta_n\end{bmatrix} is the vector of unknown coefficients.

3 the red vertical lines in the motivating example above.

Also, note from Equation 10.3 that the loss \mathcal{E} can be written in matrix notation as \frac{1}{m}\sum_{i=1}^m e_i^2=\frac{1}{m}\|\mathbf{e}\|^2=\frac{1}{m}\mathbf{e}^T\mathbf{e}=\frac{1}{m}(\mathbf{y}-\mathbf{X}\boldsymbol{\beta})^T(\mathbf{y}-\mathbf{X}\boldsymbol{\beta}).
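The agreement between the term-by-term and matrix forms is easy to confirm numerically; the following sketch uses randomly generated data purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 50, 3
X = np.hstack([np.ones((m, 1)), rng.normal(size=(m, n))])  # regressor matrix
y = rng.normal(size=m)
beta = rng.normal(size=n + 1)

e = y - X @ beta                          # residual vector e = y - X beta
matrix_form = (e @ e) / m                 # (1/m) e^T e
termwise = np.mean((y - X @ beta) ** 2)   # (1/m) sum_i e_i^2
print(np.isclose(matrix_form, termwise))  # True
```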

So, finding the OLS solution is equivalent to the following minimization problem:

Minimization Problem

Given the regressor matrix \mathbf{X} and response vector \mathbf{y} of the data, find the coefficient vector \boldsymbol\beta that minimizes (\mathbf{y}-\mathbf{X}\boldsymbol{\beta})^T(\mathbf{y}-\mathbf{X}\boldsymbol{\beta}). Note that the constant factor \frac{1}{m} has been dropped, since it does not affect the minimizer.

10.2.2 Normal Equations

The objective function for the minimization problem above (which, by a slight abuse of notation, we continue to denote by \mathcal{E}) is a real-valued function of the n+1 variables inside \boldsymbol{\beta}. The first-order partial derivative w.r.t. \beta_0 is \begin{aligned} \frac{\partial\mathcal{E}}{\partial\beta_0}&=\sum_{i=1}^m2(y_i-(\beta_0+\beta_1x_{i1}+\ldots+\beta_nx_{in}))\frac{\partial}{\partial\beta_0}(y_i-(\beta_0+\beta_1x_{i1}+\ldots+\beta_nx_{in}))\\ &=\sum_{i=1}^m2(y_i-(\beta_0+\beta_1x_{i1}+\ldots+\beta_nx_{in}))(-1)\\ &=-2\sum_{i=1}^m(y_i-(\beta_0+\beta_1x_{i1}+\ldots+\beta_nx_{in})). \end{aligned} And, for any other 1\leq j\leq n, the partial derivative w.r.t. \beta_j is \begin{aligned} \frac{\partial\mathcal{E}}{\partial\beta_j}&=\sum_{i=1}^m2(y_i-(\beta_0+\beta_1x_{i1}+\ldots+\beta_nx_{in}))\frac{\partial}{\partial\beta_j}(y_i-(\beta_0+\beta_1x_{i1}+\ldots+\beta_nx_{in}))\\ &=\sum_{i=1}^m2(y_i-(\beta_0+\beta_1x_{i1}+\ldots+\beta_nx_{in}))(-x_{ij})\\ &=-2\sum_{i=1}^mx_{ij}(y_i-(\beta_0+\beta_1x_{i1}+\ldots+\beta_nx_{in})). \end{aligned} Let \widehat{\boldsymbol{\beta}}=\begin{bmatrix}\widehat\beta_0\\\widehat\beta_1\\\vdots\\\widehat\beta_n\end{bmatrix} be an optimal solution. By the necessary condition4 for a local minimizer, all the partial derivatives \frac{\partial\mathcal{E}}{\partial\beta_j} must vanish at the solution, i.e., \begin{aligned} -2\sum_{i=1}^m(y_i-(\widehat\beta_0+\widehat\beta_1x_{i1}+\ldots+\widehat\beta_nx_{in}))&=0\\ -2\sum_{i=1}^mx_{i1}(y_i-(\widehat\beta_0+\widehat\beta_1x_{i1}+\ldots+\widehat\beta_nx_{in}))&=0\\ -2\sum_{i=1}^mx_{i2}(y_i-(\widehat\beta_0+\widehat\beta_1x_{i1}+\ldots+\widehat\beta_nx_{in}))&=0\\ \vdots\quad\quad&\\ -2\sum_{i=1}^mx_{in}(y_i-(\widehat\beta_0+\widehat\beta_1x_{i1}+\ldots+\widehat\beta_nx_{in}))&=0. \end{aligned} In matrix notation, we summarize the above n+1 equations as follows: -2\mathbf{X}^T(\mathbf{y}-\mathbf{X}\widehat{\boldsymbol{\beta}})=\mathbf{0}. These are called the normal equations. Solving for \widehat{\boldsymbol{\beta}} yields \begin{aligned} -\mathbf{X}^T(\mathbf{y}-\mathbf{X}\widehat{\boldsymbol{\beta}})&=\mathbf{0}\\ -\mathbf{X}^T\mathbf{y}+\mathbf{X}^T\mathbf{X}\widehat{\boldsymbol{\beta}}&=\mathbf{0}\\ \mathbf{X}^T\mathbf{X}\widehat{\boldsymbol{\beta}}&=\mathbf{X}^T\mathbf{y}. \end{aligned} Now, multiplying both sides by (\mathbf{X}^T\mathbf{X})^{-1} (assuming it exists; see Section 10.2.3), we get \widehat{\boldsymbol{\beta}}=(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}. In order to verify that this is indeed a minimizer, we can check the sufficient condition5 that the Hessian matrix \frac{\partial^2\mathcal{E}}{\partial\boldsymbol\beta\,\partial\boldsymbol\beta^T}=2\mathbf{X}^T\mathbf{X} is positive-definite.
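As a quick numerical sanity check of the derivation, the gradient -2\mathbf{X}^T(\mathbf{y}-\mathbf{X}\widehat{\boldsymbol{\beta}}) should vanish at the closed-form solution. The sketch below uses random data for illustration only, and the explicit inverse is used purely for exposition (see Section 10.2.4 for a better route).

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 100, 2
X = np.hstack([np.ones((m, 1)), rng.normal(size=(m, n))])  # regressor matrix
y = rng.normal(size=m)

# Closed-form solution beta_hat = (X^T X)^{-1} X^T y.
beta_hat = np.linalg.inv(X.T @ X) @ (X.T @ y)

# The gradient -2 X^T (y - X beta_hat) should vanish at the minimizer.
gradient = -2 * X.T @ (y - X @ beta_hat)
print(np.allclose(gradient, 0.0, atol=1e-8))  # True (up to round-off)
```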

4 If a point \widehat x is a local minimum of a (differentiable) real-valued function f:\mathbb R\to\mathbb R, then f'(\widehat{x})=0.

5 If a point \widehat x satisfies f'(\widehat x)=0 and f''(\widehat x)>0, then it is a local minimum of a (differentiable) real-valued function f:\mathbb R\to\mathbb R.

The Normal Equations

The solution hyperplane is obtained by solving the normal equations for the unknown coefficients \widehat{\boldsymbol{\beta}}: \mathbf{X}^T\mathbf{X}\widehat{\boldsymbol{\beta}}=\mathbf{X}^T\mathbf{y}. \tag{10.4} This is a square system of n+1 equations in n+1 unknowns.
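In code, the square system of Equation 10.4 can be handed to a general linear solver rather than forming the inverse explicitly; a minimal NumPy sketch on random illustrative data:

```python
import numpy as np

rng = np.random.default_rng(2)
m, n = 200, 4
X = np.hstack([np.ones((m, 1)), rng.normal(size=(m, n))])  # regressor matrix
y = rng.normal(size=m)

# Solve the (n+1) x (n+1) system X^T X beta = X^T y directly.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Cross-check against NumPy's built-in least-squares routine.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(beta_hat, beta_lstsq))  # True
```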

Exercise 10.2 Show that the coefficient matrix of Equation 10.4 is square and symmetric.

10.2.3 Solvability

In order to have a unique solution to OLS, the normal equations (Equation 10.4) must be uniquely solvable. In other words, the coefficient matrix \mathbf{X}^T\mathbf{X} of the system must be invertible, i.e., have linearly independent columns.

We note the following guarantee for the solvability of the normal equations.

Non-Singularity of \mathbf{X}^T\mathbf{X}

The matrix \mathbf{X}^T\mathbf{X} has the same nullspace as \mathbf{X}: if \mathbf{X}^T\mathbf{X}\mathbf{v}=\mathbf{0}, then \mathbf{v}^T\mathbf{X}^T\mathbf{X}\mathbf{v}=\|\mathbf{X}\mathbf{v}\|^2=0, so \mathbf{X}\mathbf{v}=\mathbf{0}; the converse is immediate.

As a consequence, if \mathbf{X} has independent columns, then \mathbf{X}^T\mathbf{X} is square, symmetric, and invertible.

If the regressor matrix \mathbf{X} has dependent columns, then \mathbf{X}^T\mathbf{X} is singular and the normal equations do not have a unique solution. This situation is known as multicollinearity.
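The effect of dependent columns is easy to see numerically. In the sketch below (random data, for illustration only) the second predictor is an exact multiple of the first, so \mathbf{X}^T\mathbf{X} is rank-deficient and the minimizer is no longer unique; np.linalg.lstsq then returns the minimum-norm minimizer, one among infinitely many.

```python
import numpy as np

rng = np.random.default_rng(3)
m = 50
x1 = rng.normal(size=m)
x2 = 2.0 * x1                           # second predictor is a multiple of the first
X = np.column_stack([np.ones(m), x1, x2])
y = rng.normal(size=m)

# X has dependent columns, so X^T X is rank-deficient and not invertible.
print(np.linalg.matrix_rank(X.T @ X))   # 2, not 3

# The squared error no longer has a unique minimizer; lstsq picks the
# minimum-norm solution among the infinitely many with the same error.
beta_min_norm, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_min_norm)
```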

10.2.4 Implementation

The normal equations (Equation 10.4) can be solved using any of the solvers we have seen so far in the course. However, the \mathbf{QR} factorization facilitates an effective implementation that avoids forming \mathbf{X}^T\mathbf{X} explicitly.

If \mathbf{X}_{m\times(n+1)} has linearly independent column vectors, then recall from the previous chapter that \mathbf{X} can be factored as \mathbf{QR}, where \mathbf{Q} is an m\times(n+1) matrix with orthonormal columns and \mathbf{R} is an upper triangular, square, invertible matrix of size (n+1)\times(n+1).

Thus, from the factorization \mathbf{X}=\mathbf{QR}, Equation 10.4 becomes \begin{aligned} (\mathbf{QR})^T(\mathbf{QR})\widehat{\boldsymbol{\beta}}&=(\mathbf{QR})^T\mathbf{y}\\ \mathbf{R}^T(\mathbf{Q}^T\mathbf{Q})\mathbf{R}\widehat{\boldsymbol{\beta}}&=\mathbf{R}^T\mathbf{Q}^T\mathbf{y}. \end{aligned} Since \mathbf{Q} has orthonormal columns, we have \mathbf{Q}^T\mathbf{Q}=I. So, we can further simplify the normal equations: \begin{aligned} \mathbf{R}^T\mathbf{R}\widehat{\boldsymbol{\beta}}&=\mathbf{R}^T\mathbf{Q}^T\mathbf{y}\\ \mathbf{R}\widehat{\boldsymbol{\beta}}&=\mathbf{Q}^T\mathbf{y}. \end{aligned} The last step follows by cancelling \mathbf{R}^T from both sides, which is valid because \mathbf{R} (and hence \mathbf{R}^T) is non-singular.

Pseudocode

If \mathbf X has independent columns, the unique solution to OLS can be obtained using the following steps (a code sketch is given after the complexity note below):

  1. Factorize \mathbf{X} into \mathbf{X}=\mathbf{QR}.
    • Complexity: \mathcal{O}(mn^2)
  2. Solve the triangular system \mathbf{R}\widehat{\boldsymbol{\beta}}=\mathbf{Q}^T\mathbf{y} for \widehat{\boldsymbol{\beta}} using back substitution.
    • Complexity: \mathcal{O}(mn) to form \mathbf{Q}^T\mathbf{y}, plus \mathcal{O}(n^2) for the back substitution

The overall computational complexity is \mathcal{O}(mn^2).
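The two steps above can be sketched as follows; the sketch assumes SciPy is available for the triangular back substitution and uses random data for illustration only.

```python
import numpy as np
from scipy.linalg import solve_triangular

rng = np.random.default_rng(4)
m, n = 300, 5
X = np.hstack([np.ones((m, 1)), rng.normal(size=(m, n))])  # regressor matrix
y = rng.normal(size=m)

# Step 1: thin QR factorization. Q is m x (n+1) with orthonormal columns,
#         R is (n+1) x (n+1) upper triangular.
Q, R = np.linalg.qr(X, mode="reduced")

# Step 2: back substitution on the triangular system R beta = Q^T y.
beta_hat = solve_triangular(R, Q.T @ y, lower=False)

# Sanity check against the normal-equations solution.
print(np.allclose(beta_hat, np.linalg.solve(X.T @ X, X.T @ y)))  # True
```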