12 Singular Value Decomposition
“Study hard what interests you the most in the most undisciplined, irreverent and original manner possible.”
– Richard Feynman
The singular value decomposition (SVD) is the most widely used factorization in numerical linear algebra, with applications across many domains of science and engineering. The SVD also provides the foundation for several important techniques, such as Principal Component Analysis (PCA). PCA, in turn, is one of many methods in the broader field of dimensionality reduction, which seeks low-dimensional approximations to very high-dimensional data.
In this chapter, we discuss the basic algorithm behind the SVD factorization and some of its applications, such as low-rank approximation and PCA.
12.1 Definition
Let us assume that we are given a large data set in the form of an m\times n matrix: \mathbf{X}=\begin{bmatrix}\begin{matrix}\Big|\\\mathbf{x}_1\\\Big|\end{matrix} & \begin{matrix}\Big|\\\mathbf{x}_2\\\Big|\end{matrix} &\cdots& \begin{matrix}\Big|\\\mathbf{x}_n\\\Big|\end{matrix}\end{bmatrix}. Each column \mathbf{x}_i of \mathbf{X} is a vector in \mathbb{R}^m.
The SVD is a unique matrix decomposition that exists for any matrix \mathbf{X} such that \mathbf{X}=\mathbf{U}\mathbf{\Sigma}\mathbf{V}^T, where
- \mathbf{U}_{m\times m} and \mathbf{V}_{n\times n} are square, orthogonal matrices1
- the columns of \mathbf{U} and \mathbf{V} are called left singular vectors and right singular vectors, respectively
- \mathbf{\Sigma}_{m\times n} is a diagonal matrix with non-negative diagonal entries
- the diagonal entries of \mathbf{\Sigma} are called the singular values, and they are ordered from largest to smallest.
1 \mathbf{U}\mathbf{U}^T=\mathbf{U}^T\mathbf{U}=\mathbf{I}.
The rank of \mathbf{X} is equal to the number of its non-zero singular values.
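As a quick sanity check, the sketch below (assuming NumPy is available) factorizes a small matrix and confirms that the number of non-zero singular values matches the rank; the matrix and the tolerance are illustrative choices, not part of the text.

```python
import numpy as np

# A 4x3 matrix whose third column is the sum of the first two, so its rank is 2.
X = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 9.0],
              [7.0, 8.0, 15.0],
              [2.0, 0.0, 2.0]])

# Full SVD: U is m-by-m, Vt is n-by-n, s holds the min(m, n) singular values.
U, s, Vt = np.linalg.svd(X)

rank_from_svd = np.sum(s > 1e-10)     # count singular values above a small tolerance

print(s)                              # singular values, largest to smallest
print(rank_from_svd)                  # 2
print(np.linalg.matrix_rank(X))       # 2, for comparison
```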
12.1.1 Reduced or Economy SVD
Note that the sandwiched matrix \mathbf{\Sigma} holding the singular values is very sparse2. In order to save space, without losing any information about \mathbf{X}, numerical software typically offers the economy SVD as an output option.
2 Most of the elements are zero; Wiki.
12.1.1.1 Short-fat
In this case, the matrix \mathbf{X} for decomposition has more columns than rows, i.e., m\leq n. \mathbf{X} = \mathbf{U}\begin{bmatrix}\hat{\mathbf{\Sigma}} & \mathbf{0}\end{bmatrix}\begin{bmatrix}\hat{\mathbf{V}} & \hat{\mathbf{V}}^{\perp}\end{bmatrix}^T = \mathbf{U}\hat{\mathbf{\Sigma}}\hat{\mathbf{V}}^T. In the economy version, the SVD outputs \hat{\mathbf{V}}, whose size is only n\times m (as opposed to n\times n).
12.1.1.2 Tall-skinny
In this case, the matrix \mathbf{X} for decomposition has more rows than columns, i.e., m\geq n. \mathbf{X} = \begin{bmatrix}\hat{\mathbf{U}} & \hat{\mathbf{U}}^{\perp}\end{bmatrix}\begin{bmatrix}\hat{\mathbf{\Sigma}} \\ \mathbf{0}\end{bmatrix}\mathbf{V}^T = \hat{\mathbf{U}}\hat{\mathbf{\Sigma}}\mathbf{V}^T. In the economy version, the SVD outputs \hat{\mathbf{U}}, whose size is only m\times n (as opposed to m\times m).
12.1.2 NumPy Implementation
Since the algorithm for computing the SVD is quite involved, we skip it here. Instead, we describe how to perform the factorization using the Python implementation numpy.linalg.svd3.
3 read documentation.
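A minimal sketch of the call, assuming NumPy is installed; the matrix sizes below are arbitrary and chosen only to show the full versus economy shapes discussed above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Tall-skinny example: m > n.
X = rng.standard_normal((6, 3))

# Full SVD: U is m-by-m, Vt is n-by-n, s holds the min(m, n) singular values.
U, s, Vt = np.linalg.svd(X, full_matrices=True)
print(U.shape, s.shape, Vt.shape)            # (6, 6) (3,) (3, 3)

# Economy SVD: only the first n columns of U are returned (U-hat is m-by-n).
Uh, s, Vt = np.linalg.svd(X, full_matrices=False)
print(Uh.shape, s.shape, Vt.shape)           # (6, 3) (3,) (3, 3)

# Reconstruction from the economy factors is exact (up to round-off).
print(np.allclose(X, Uh @ np.diag(s) @ Vt))  # True

# Short-fat example: m < n; now V-hat (returned as the rows of Vt) is trimmed instead.
Y = rng.standard_normal((3, 6))
Uh, s, Vth = np.linalg.svd(Y, full_matrices=False)
print(Uh.shape, s.shape, Vth.shape)          # (3, 3) (3,) (3, 6)
```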
12.2 Matrix Approximation
We now present a significant application of the SVD: approximating a (very large) matrix by an optimal low-rank matrix. For any desired rank r, a rank-r approximation is obtained by keeping the leading r singular values and vectors, while disregarding the rest.
Recall that the SVD arranges the singular values in \mathbf{\Sigma} in decreasing order of importance: \sigma_1\geq\sigma_2\geq\ldots\geq\sigma_{\min\{m,n\}}\geq0. If the matrix \mathbf{X} is not full-rank to begin with, some singular values are zero. Often the magnitude of the singular values decreases so rapidly that only the first few singular values contain most of the information in \mathbf{X}.
A notable property of the SVD is that it allows us to write the input matrix as a sum of rank-1 matrices: \mathbf{X}=\sum_{k=1}^{\min\{m,n\}}\sigma_k\mathbf{u}_k\mathbf{v}_k^T, \tag{12.1} where \sigma_k is the k^{\text{th}} diagonal entry of \mathbf{\Sigma}, and \mathbf{u}_k and \mathbf{v}_k are the k^{\text{th}} columns of \mathbf{U} and \mathbf{V}, respectively.
For any given rank r\leq\operatorname{rank}(\mathbf{X}), we can form a rank-r approximation, denoted \tilde{\mathbf{X}}, by truncating the summation (Equation 12.1) after r terms. This is, in fact, the optimal4 rank-r approximation of \mathbf{X} in the \ell^2 sense.
4 Eckart-Young Theorem; find more.
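A sketch of the truncation, assuming NumPy; the rank r and the test matrix below are arbitrary illustrations.

```python
import numpy as np

def rank_r_approx(X, r):
    """Best rank-r approximation of X, built from the truncated SVD (Eckart-Young)."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    # Keep only the leading r singular values and vectors.
    return U[:, :r] @ np.diag(s[:r]) @ Vt[:r, :]

rng = np.random.default_rng(1)
X = rng.standard_normal((100, 40))

X_tilde = rank_r_approx(X, r=10)
print(np.linalg.matrix_rank(X_tilde))   # 10
print(np.linalg.norm(X - X_tilde, 2))   # equals sigma_11, the first discarded singular value
```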
12.2.1 Image Compression
We demonstrate the idea of matrix approximation using image compression. A gray-scale image can be modeled as an m\times n matrix, where m and n are the numbers of pixels in the vertical and horizontal directions, respectively. Each of the mn pixels holds a gray-scale value whose range depends on the chosen color depth: 8-bit, 16-bit, etc.
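The sketch below compresses a gray-scale image by keeping only the leading r rank-1 terms; it reuses the truncation idea above, and the file name image.png is a placeholder rather than a file supplied with the text.

```python
import numpy as np
import matplotlib.pyplot as plt

# Load an image and convert it to a gray-scale matrix (placeholder file name).
img = plt.imread("image.png")
if img.ndim == 3:                      # average the color channels if present
    img = img[..., :3].mean(axis=2)

U, s, Vt = np.linalg.svd(img, full_matrices=False)
m, n = img.shape

for r in (5, 20, 100):
    approx = U[:, :r] @ np.diag(s[:r]) @ Vt[:r, :]
    # Storing the truncated factors takes r*(m + n + 1) numbers instead of m*n.
    print(f"r = {r}: storage ratio = {r * (m + n + 1) / (m * n):.3f}")
    plt.imshow(approx, cmap="gray")
    plt.title(f"rank-{r} approximation")
    plt.show()
```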
12.3 Principal Component Analysis (PCA)
PCA is one of the central applications of the SVD. Let \mathbf{X} be an m\times n matrix in which each row contains one observation of some number of features; that is, the columns of \mathbf{X} represent features and the rows represent observations5. The principal components provide an alternative, orthogonal coordinate system centered at the mean of the data; along these components, the data exhibit their maximum variation.
5 This is how data are entered in a spreadsheet.
12.3.1 The Algorithm
We first compute the average row \overline{\mathbf{x}}, i.e., the mean of all m rows: \overline{x}_j=\frac{1}{m}\sum_{i=1}^m x_{ij}.
The mean matrix is \overline{\mathbf{X}}=\begin{bmatrix}1\\1\\\vdots\\1\end{bmatrix}_{m\times 1}\overline{\mathbf{x}}_{1\times n}= \begin{bmatrix} \overline{x}_1 & \overline{x}_2 & \ldots & \overline{x}_n\\ \overline{x}_1 & \overline{x}_2 & \ldots & \overline{x}_n\\ \vdots & \vdots & \ddots & \vdots\\ \overline{x}_1 & \overline{x}_2 & \ldots & \overline{x}_n\\ \end{bmatrix}_{m\times n}. We then subtract the mean matrix \overline{\mathbf{X}} from the data matrix \mathbf{X} to get the mean-subtracted data \mathbf{B}=\mathbf{X}-\overline{\mathbf{X}}.
The matrix \mathbf{B} is then factorized using the SVD: \mathbf{B}=\mathbf{U}\mathbf{\Sigma}\mathbf{V}^T. In applications, the matrix \mathbf{B} is typically tall-skinny, i.e., m\geq n. If the singular values are \sigma_1\geq\sigma_2\geq\ldots\geq\sigma_n, then the principal values are \frac{\sigma_k^2}{m-1}, and the corresponding columns of \mathbf{V} are called the principal components. The principal components provide an orthogonal coordinate system for the data, and the principal values are the variances of the data along these coordinates.
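A minimal sketch of the algorithm above, assuming NumPy; the synthetic data and variable names are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic data: m = 500 observations (rows) of n = 3 features (columns),
# scaled so that different directions carry different variances.
m, n = 500, 3
X = rng.standard_normal((m, n)) @ np.diag([5.0, 2.0, 0.5]) + np.array([10.0, -3.0, 1.0])

x_bar = X.mean(axis=0)        # average row (mean of each feature)
B = X - x_bar                 # mean-subtracted data (X minus the mean matrix, via broadcasting)

U, s, Vt = np.linalg.svd(B, full_matrices=False)

principal_values = s**2 / (m - 1)    # variances along the principal components
principal_components = Vt.T          # columns of V

print(principal_values)
# Sanity check: the principal values match the eigenvalues of the covariance matrix.
print(np.sort(np.linalg.eigvalsh(np.cov(B, rowvar=False)))[::-1])
```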