Published on
28/08/2019

Linear Algebra Concepts for Data Science and Machine Learning

Linear algebra is a broad topic that one blog post cannot cover in its entirety. We will start with a review of matrix operations, including multiplication, vector spaces, and other related concepts, and then explore the aspects of linear algebra most relevant to machine learning. The point of this blog is not to present formal mathematics; there are plenty of books and resources out there that will do that for you if that is your interest.

Vector and Matrix Multiplication:

Some key properties of matrix multiplication worth remembering before looking at the individual vector/matrix operations: it is non-commutative, i.e. $\mathbf{A}\mathbf{B} \neq \mathbf{B}\mathbf{A}$ in general; associative, i.e. $(\mathbf{A}\mathbf{B})\mathbf{C} = \mathbf{A}(\mathbf{B}\mathbf{C})$; and distributive over matrix addition, i.e. $\mathbf{A}(\mathbf{B} + \mathbf{C}) = \mathbf{A}\mathbf{B} + \mathbf{A}\mathbf{C}$.

  1. Vector-Vector Multiplication:

Given $\mathbf{u}, \mathbf{v} \in \mathbb{R}^n$ with the same dimension, $\mathbf{u}^\top \mathbf{v} = \mathbf{v}^\top \mathbf{u}$, and both are guaranteed to result in the same scalar value.

However, when the vectors are of different dimensions ($\mathbf{u} \in \mathbb{R}^m$ and $\mathbf{v} \in \mathbb{R}^n$), their product $\mathbf{u}\mathbf{v}^\top$ results in a matrix, referred to as the outer product of the vectors.

$$\mathbf{u}\mathbf{v}^\top \in \mathbb{R}^{m \times n} = \begin{bmatrix} u_1 v_1 & u_1 v_2 & \cdots & u_1 v_n \\ u_2 v_1 & u_2 v_2 & \cdots & u_2 v_n \\ \vdots & \vdots & \ddots & \vdots \\ u_m v_1 & u_m v_2 & \cdots & u_m v_n \end{bmatrix}$$

These two operations, together with the elementwise product, are summarised in the table below:

$$\begin{array}{cccc} \text{Operation} & \text{Expression} & \text{Result} & \text{Shape} \\ \hline \text{Inner Product} & \mathbf{u}^\top \mathbf{v} & \text{Scalar} & 1 \times 1 \\ \text{Outer Product} & \mathbf{u}\mathbf{v}^\top & \text{Matrix} & m \times n \\ \text{Elementwise Product} & \mathbf{u} \odot \mathbf{v} & \text{Vector} & n \times 1 \end{array}$$
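To make the table concrete, here is a quick NumPy check of the three products (the example vectors are just for illustration):

```python
import numpy as np

u = np.array([1.0, 2.0, 3.0])
v = np.array([4.0, 5.0, 6.0])

inner = u @ v                # scalar: sum of elementwise products
outer = np.outer(u, v)       # 3x3 matrix with entries u_i * v_j
elementwise = u * v          # vector of the same length

print(inner)        # 32.0
print(outer.shape)  # (3, 3)
print(elementwise)  # [ 4. 10. 18.]
```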
  2. Matrix-Vector Multiplication:

Given a matrix $\mathbf{A} \in \mathbb{R}^{m \times n}$ and a vector $\mathbf{x} \in \mathbb{R}^n$, their product $\mathbf{A}\mathbf{x}$ results in a new vector $\mathbf{b} \in \mathbb{R}^m$. Each element $b_i$ is computed as the dot product of the $i^{th}$ row of $\mathbf{A}$ (denoted $\mathbf{a}_{i:}$) with the vector $\mathbf{x}$. This operation is fundamental to linear transformations:

$$b_i = \mathbf{a}_{i:}^\top \mathbf{x} = \sum_{j = 1}^{n} A_{ij} x_j \tag{1}$$

Which we can further visualize as the full transformation from $\mathbb{R}^n$ to $\mathbb{R}^m$:

$$\begin{bmatrix} A_{11} & \cdots & A_{1n} \\ \vdots & \ddots & \vdots \\ A_{m1} & \cdots & A_{mn} \end{bmatrix} \begin{bmatrix} x_1 \\ \vdots \\ x_n \end{bmatrix} = \begin{bmatrix} b_1 \\ \vdots \\ b_m \end{bmatrix} \tag{2}$$
  3. Matrix-Matrix Multiplication:

Given two matrices $\mathbf{A} \in \mathbb{R}^{m \times n}$ and $\mathbf{B} \in \mathbb{R}^{n \times p}$, their product $\mathbf{C} = \mathbf{A}\mathbf{B}$ results in a new matrix $\mathbf{C} \in \mathbb{R}^{m \times p}$. Each element $C_{ij}$ is computed as the dot product of the $i^{th}$ row of $\mathbf{A}$ and the $j^{th}$ column of $\mathbf{B}$:

$$C_{ij} = \mathbf{a}_{i:}^\top \mathbf{b}_{:j} = \sum_{k = 1}^{n} A_{ik} B_{kj} \tag{3}$$
  4. Kronecker Product:

Given matrix $\mathbf{A} \in \mathbb{R}^{m \times n}$ and $\mathbf{B} \in \mathbb{R}^{p \times q}$, their Kronecker product $\mathbf{A} \otimes \mathbf{B}$ is an $mp \times nq$ matrix. For example:

$$\begin{bmatrix} 1 & 3 \\ 2 & 4 \end{bmatrix} \otimes \begin{bmatrix} 5 & 7 \\ 6 & 8 \end{bmatrix} = \begin{bmatrix} 1 \begin{bmatrix} 5 & 7 \\ 6 & 8 \end{bmatrix} & 3 \begin{bmatrix} 5 & 7 \\ 6 & 8 \end{bmatrix} \\ 2 \begin{bmatrix} 5 & 7 \\ 6 & 8 \end{bmatrix} & 4 \begin{bmatrix} 5 & 7 \\ 6 & 8 \end{bmatrix} \end{bmatrix} = \begin{bmatrix} 5 & 7 & 15 & 21 \\ 6 & 8 & 18 & 24 \\ 10 & 14 & 20 & 28 \\ 12 & 16 & 24 & 32 \end{bmatrix} \in \mathbb{R}^{4 \times 4}$$

This is usually formally generalized as:

$$\mathbf{A} \otimes \mathbf{B} = \begin{bmatrix} a_{11}\mathbf{B} & \dots & a_{1n}\mathbf{B} \\ \vdots & \ddots & \vdots \\ a_{m1}\mathbf{B} & \dots & a_{mn}\mathbf{B} \end{bmatrix} \in \mathbb{R}^{mp \times nq} \tag{4}$$
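NumPy ships this operation as np.kron, so we can verify the worked example directly:

```python
import numpy as np

A = np.array([[1, 3], [2, 4]])
B = np.array([[5, 7], [6, 8]])

K = np.kron(A, B)   # (2*2) x (2*2) = 4 x 4 block matrix [a_ij * B]
print(K)
```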

A closely related concept is vectorization. For a matrix $\mathbf{A} \in \mathbb{R}^{m \times n}$, the vectorization of $\mathbf{A}$ is the $mn \times 1$ column vector obtained by stacking the columns of $\mathbf{A}$ on top of each other:

$$\text{vec}(\mathbf{A}) = (a_{11} \dots a_{m1} \dots a_{1n} \dots a_{mn})^\top \tag{5}$$

Alternatively, the row vectorization of an $n \times m$ matrix $\mathbf{X}$ is the $nm$-vector that simply flattens $\mathbf{X}$ by concatenating its rows into a single vector:

$$\text{rvec}(\mathbf{X}) = [x_{11}, x_{12}, \dots, x_{1m}, x_{21}, \dots, x_{2m}, \dots, x_{nm}]^\top \tag{6}$$

This is the same as NumPy's np.reshape(X, [-1]), and this relationship can be equivalently expressed in terms of a Kronecker product with basis vectors $\mathbf{e}_i \in \mathbb{R}^{n \times 1}$:

$$\text{rvec}(\mathbf{X}) = \sum_{i=1}^n \mathbf{e}_i \otimes \mathbf{X}^\top \mathbf{e}_i \tag{7}$$
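A short NumPy sketch of both flavours of vectorization, including a check of identity (7) on an arbitrary example matrix:

```python
import numpy as np

X = np.array([[1, 2, 3],
              [4, 5, 6]])   # n=2 rows, m=3 columns

vec  = X.reshape(-1, order='F')  # column-stacking: vec(X)
rvec = X.reshape(-1)             # row-stacking: rvec(X), same as np.reshape(X, [-1])

print(vec)   # [1 4 2 5 3 6]
print(rvec)  # [1 2 3 4 5 6]

# Identity (7): rvec(X) = sum_i e_i kron (X^T e_i)
n = X.shape[0]
E = np.eye(n)                    # rows of E are the basis vectors e_i
rvec_kron = sum(np.kron(E[i], X.T @ E[i]) for i in range(n))
print(np.array_equal(rvec, rvec_kron))  # True
```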

Some of the common identities involving Kronecker products are summarized below:

$$\begin{array}{lc} \text{Property Name} & \text{Mathematical Expression} \\ \hline \text{Transpose} & (\mathbf{A} \otimes \mathbf{B})^\top = \mathbf{A}^\top \otimes \mathbf{B}^\top \\ \text{Inverse} & (\mathbf{A} \otimes \mathbf{B})^{-1} = \mathbf{A}^{-1} \otimes \mathbf{B}^{-1} \\ \text{Factorizing} & (\mathbf{A} \otimes \mathbf{B})(\mathbf{C} \otimes \mathbf{D}) = \mathbf{AC} \otimes \mathbf{BD} \\ \text{Distributive (Left)} & (\mathbf{A} \otimes \mathbf{B}) + (\mathbf{A} \otimes \mathbf{C}) = \mathbf{A} \otimes (\mathbf{B} + \mathbf{C}) \\ \text{Distributive (Right)} & (\mathbf{A} \otimes \mathbf{B}) + (\mathbf{C} \otimes \mathbf{B}) = (\mathbf{A} + \mathbf{C}) \otimes \mathbf{B} \\ \text{Associative} & (\mathbf{A} \otimes \mathbf{B}) \otimes \mathbf{C} = \mathbf{A} \otimes (\mathbf{B} \otimes \mathbf{C}) \end{array}$$
  5. Einstein Summation:

Assuming we had a sequence of numbers indexed $1, 2, \dots, n$, we could represent their sum as $\sum_{k=1}^{n} X_k$, meaning we keep summing over our index from $1$ to $n$. Also, if there were a second sequence $Y$, the dot product of $X$ and $Y$ would be:

$$\langle X, Y \rangle = \sum_{k=1}^{n} x_k y_k \tag{8}$$

Instead of writing the equation with the $\sum$ sign, Einstein notation is a simplification that makes the summation sign implicit. That is, given an expression, we sum over all values of any repeated index that occurs in a pair (one in the upper index position and the other in the lower index position).

So, with this implicit removal of the summation sign, we use $x^i$, also called a contravariant vector, when the components transform in the "opposite" way to the basis vectors. The $x_i$ notation is called a covariant vector, since its components transform in the "same" way as the basis vectors. We use lowering ($x_i = g_{ij} x^j$) and raising ($x^i = g^{ij} x_j$) to convert one into the other.

$$\begin{array}{lc} \text{Vector Type} & \text{Notation} \\ \hline \text{Contravariant Vector} & x^i \text{ (Upper Index)} \\ \text{Covariant Vector} & x_i \text{ (Lower Index)} \\ \text{Tensor} & A^i_j \text{ (Upper \& Lower)} \end{array}$$

Some of the common einsum operations for machine learning are summarized below:

$$\begin{array}{lcc} \text{Operation} & \text{Summation Notation} & \text{Einstein Notation} \\ \hline \text{Inner Product} & \langle x, y \rangle = \sum_{i=1}^n x_i y_i & x^i y_i \\ \text{Bilinear Form} & A(x, y) = \sum_{i=1}^n \sum_{j=1}^n A_{ij} x_i y_j & A_{ij} x^i y^j \\ \text{Linear Transformation} & (Ax)_i = \sum_{j=1}^n A_{ij} x_j & (Ax)^i = A^i_j x^j \\ \text{Matrix Multiplication} & (AB)_{ij} = \sum_{k=1}^n A_{ik} B_{kj} & (AB)^i_j = A^i_k B^k_j \\ \text{Trace of Matrix} & \text{tr} A = \sum_{i=1}^n A_{ii} & A^i_i \end{array}$$
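NumPy exposes this notation through np.einsum; a minimal sketch of the table's operations on small illustrative arrays:

```python
import numpy as np

x = np.array([1.0, 2.0])
y = np.array([3.0, 4.0])
A = np.array([[1.0, 2.0], [3.0, 4.0]])
B = np.array([[5.0, 6.0], [7.0, 8.0]])

inner    = np.einsum('i,i->', x, y)        # x^i y_i
bilinear = np.einsum('ij,i,j->', A, x, y)  # A_ij x^i y^j
Ax       = np.einsum('ij,j->i', A, x)      # (Ax)^i = A^i_j x^j
AB       = np.einsum('ik,kj->ij', A, B)    # (AB)^i_j = A^i_k B^k_j
trace    = np.einsum('ii->', A)            # A^i_i

print(inner)  # 11.0
print(trace)  # 5.0
```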

Vector Spaces

To understand the fundamentals of vector spaces and projection matrices, let's first review what linear independence, span, basis, dimension, and linear maps are.

Linear independence holds when $c_1 \mathbf{v}_1 + c_2 \mathbf{v}_2 + \dots + c_n \mathbf{v}_n = \mathbf{0}$ only when all $c_i = 0$. Meaning, if there are other values of $c_i$ that result in the zero vector, then the vectors are said to be linearly dependent. In general, given the vectors $\mathbf{v}_1, \mathbf{v}_2, \ldots, \mathbf{v}_n$, we can check whether they are linearly independent by merging the vectors into a column matrix $\mathbf{A}$. For example:

$$\mathbf{v}_1 = \begin{bmatrix}1 \\ 2 \\ 3 \end{bmatrix}, \quad \mathbf{v}_2 = \begin{bmatrix}4 \\ 5 \\ 6 \end{bmatrix}, \quad \mathbf{v}_3 = \begin{bmatrix}7 \\ 8 \\ 9 \end{bmatrix}, \qquad \mathbf{A} = \begin{bmatrix}1 & 4 & 7 \\ 2 & 5 & 8 \\ 3 & 6 & 9 \end{bmatrix}$$

We can check whether the columns of $\mathbf{A}$ are linearly independent by solving the equation $\mathbf{A}\mathbf{c} = \mathbf{0}$ via the Reduced Row Echelon Form (RREF), resulting in:

$$\text{RREF}(\mathbf{A}) = \begin{bmatrix}1 & 0 & -1 \\ 0 & 1 & 2 \\ 0 & 0 & 0 \end{bmatrix}$$

The pivot positions (the leading 1s in each row) appear in columns 1 and 2, so those are the pivot columns. The third column does not contain a pivot, so it is a free column, meaning $\mathbf{v}_3$ can be written as a linear combination of the pivot columns:

$$\mathbf{v}_3 = -1\mathbf{v}_1 + 2\mathbf{v}_2 \tag{9}$$

Since there are only two pivot columns, the rank of $\mathbf{A}$ is 2, not 3. This means that $\text{rank}(\mathbf{A}) < n$, confirming that the vectors are linearly dependent.

By contrast, for example, given the vectors:

$$\mathbf{v}_1 = \begin{bmatrix}1 \\ 0 \\ 0 \end{bmatrix}, \quad \mathbf{v}_2 = \begin{bmatrix}0 \\ 1 \\ 0 \end{bmatrix}, \quad \mathbf{v}_3 = \begin{bmatrix}0 \\ 0 \\ 1 \end{bmatrix}, \qquad \mathbf{A} = \begin{bmatrix}1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}$$

Here, each column contains a pivot, there are no free columns, and $\text{rank}(\mathbf{A}) = 3$. The null space contains only the zero vector $\mathbf{c} = \mathbf{0}$, meaning the columns of $\mathbf{A}$ are linearly independent.
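We can confirm both examples numerically with np.linalg.matrix_rank:

```python
import numpy as np

A_dep = np.array([[1, 4, 7],
                  [2, 5, 8],
                  [3, 6, 9]])
A_ind = np.eye(3)

print(np.linalg.matrix_rank(A_dep))  # 2 -> columns are linearly dependent
print(np.linalg.matrix_rank(A_ind))  # 3 -> columns are linearly independent

# The free column is a combination of the pivot columns: v3 = -1*v1 + 2*v2
v1, v2, v3 = A_dep[:, 0], A_dep[:, 1], A_dep[:, 2]
print(np.allclose(-1 * v1 + 2 * v2, v3))  # True
```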

Span, Basis, and Dimension

I will use one of the examples of the book to explain what a span is:

$$\begin{bmatrix} 3 \\ 1 \\ 1 \end{bmatrix} \text{ in } \text{Span}\left\{\begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix}, \begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix}\right\}?$$

Consider the above: can we find scalars $x_1$ and $x_2$ that make the equation consistent?

$$\begin{bmatrix} 3 \\ 1 \\ 1 \end{bmatrix} = x_1 \begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix} + x_2 \begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix} \tag{10}$$

With $x_1 = 2$ and $x_2 = 1$ being a consistent solution, therefore:

$$\begin{bmatrix} 3 \\ 1 \\ 1 \end{bmatrix} \text{ is in } \text{Span}\left\{ \begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix}, \begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix} \right\}$$

A basis is a set of vectors $v_1, v_2, \dots, v_n$ that are linearly independent and span the space. Every basis of a given space contains the same number of vectors, called the dimension of the space.

Fundamental subspaces

  1. Null Space

Given a matrix $\mathbf{A} \in \mathbb{R}^{m \times n}$, the null space is the set of all vectors $\mathbf{v} \in \mathbb{R}^n$ that $\mathbf{A}$ maps to the zero vector, i.e. $\mathbf{Av} = \mathbf{0}$.

  2. Column Space

Given a matrix $\mathbf{A}$, the column space of $\mathbf{A}$ is the vector space spanned by the column vectors of $\mathbf{A}$: the set of all possible linear combinations of the columns. That is, given $\mathbf{A} \in \mathbb{R}^{m \times n}$, we can view it as a function that maps vectors from $\mathbb{R}^n$ to vectors in $\mathbb{R}^m$ via $\mathbf{Ax} = \mathbf{b}$. The interesting fact is that the column space is simply the range of the mapping $\mathbf{Ax}$, i.e. the set of all possible output vectors $\mathbf{b}$.

  3. Row Space

Given a matrix $\mathbf{A} \in \mathbb{R}^{m \times n}$ with $m$ rows and $n$ columns, its row space is the vector space spanned by the row vectors of $\mathbf{A}$. If we denote the rows of $\mathbf{A}$ as $\mathbf{r}_1, \mathbf{r}_2, \ldots, \mathbf{r}_m \in \mathbb{R}^n$, then $\text{Row}(\mathbf{A}) = \text{span}\{\mathbf{r}_1, \mathbf{r}_2, \ldots, \mathbf{r}_m\}$, meaning the row space consists of all possible linear combinations of the row vectors:

$$\text{Row}(\mathbf{A}) = \left\{ c_1\mathbf{r}_1 + c_2\mathbf{r}_2 + \cdots + c_m\mathbf{r}_m \mid c_1, c_2, \ldots, c_m \in \mathbb{R} \right\} \tag{11}$$

The dimension of the row space is the rank of the matrix, $\text{rank}(\mathbf{A})$, which is the maximum number of linearly independent rows in $\mathbf{A}$; also, $\text{Row}(\mathbf{A}) = \text{Col}(\mathbf{A}^\top)$. For example, consider $\mathbf{A} = \begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \end{bmatrix} \in \mathbb{R}^{2 \times 3}$. The row vectors are $\mathbf{r}_1 = [1, 2, 3]$ and $\mathbf{r}_2 = [4, 5, 6]$, and the row space is $\text{span}\{[1, 2, 3], [4, 5, 6]\} \subseteq \mathbb{R}^3$.

Matrix Inverses

Generally, given two numbers $x = 5$ and $y = 1/5$, multiplying them results in $xy = 1$. In this case, we call $y$ the multiplicative inverse of $x$. While division plays an important role in scalar arithmetic, there is no such operation as matrix division. Instead, the matrix inverse serves as its counterpart. For a square matrix, such as a $2 \times 2$ matrix $\mathbf{A}$, finding the inverse is straightforward:

$$\mathbf{A} = \begin{bmatrix} a & c \\ b & d \end{bmatrix} \tag{12}$$

Its inverse, denoted $\mathbf{A}^{-1}$, is calculated as:

$$\mathbf{A}^{-1} = \frac{1}{\det(\mathbf{A})} \begin{bmatrix} d & -c \\ -b & a \end{bmatrix} \quad \text{or} \quad \frac{1}{|\mathbf{A}|} \begin{bmatrix} d & -c \\ -b & a \end{bmatrix} \tag{13}$$

There are some relationships between a matrix $\mathbf{A}$ and its inverse $\mathbf{A}^{-1}$. The identity property states that the product of a square matrix and its inverse is always the identity matrix $\mathbf{I}$: $\mathbf{A}^{-1}\mathbf{A} = \mathbf{I}$ and $\mathbf{A}\mathbf{A}^{-1} = \mathbf{I}$. Other relationships are summarized below:

$$\begin{array}{ccc} \text{Condition} & \text{Determinant} & \text{Name} \\ \hline |\mathbf{A}| \neq 0 & \text{Non-zero} & \text{Invertible (Non-singular)} \\ |\mathbf{A}| = 0 & \text{Zero} & \text{Singular (Non-invertible)} \end{array}$$

Deciding when to use matrix inversion is generally a tricky question; however, in machine learning, because we have to take derivatives of inverses and determinants during optimization, inversion is common. While high-speed inversion algorithms exist, the general guideline is to avoid direct inversion whenever possible. To solve a linear system of the form $\mathbf{A}\mathbf{x} = \mathbf{b}$ for a square matrix $\mathbf{A}$, matrix factorization methods are preferred because they are more numerically stable and accurate than computing the inverse. However, when one has to solve $\mathbf{x} = \mathbf{A}^{-1}\mathbf{b}$ across many different vectors $\mathbf{b}$, precomputing and storing $\mathbf{A}^{-1}$ can become computationally efficient.
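A small sketch contrasting the two approaches (the random system is illustrative and assumed well-conditioned, which holds for this fixed seed):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
b = rng.standard_normal(4)

x_solve = np.linalg.solve(A, b)   # factorization-based, the preferred route
x_inv   = np.linalg.inv(A) @ b    # explicit inverse, usually avoided

print(np.allclose(x_solve, x_inv))   # both solve the same system here
print(np.allclose(A @ x_solve, b))   # residual check
```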

Some Special Matrices

  1. Diagonal Matrix

This is a matrix in which all off-diagonal elements are 0. It is usually denoted $\mathbf{D} = \text{diag}(d_1, d_2, \dots, d_n)$ or in matrix notation as:

$$\mathbf{D} = \begin{pmatrix} d_1 & & & \\ & d_2 & & \\ & & \ddots & \\ & & & d_n \end{pmatrix} \tag{14}$$

We could also go the other way and write $\mathbf{d} = \text{diag}(\mathbf{D})$ by extracting the diagonal vector. One special example of a diagonal matrix is the identity matrix, usually denoted $\mathbf{I} \in \mathbb{R}^{n \times n}$.

  2. Symmetric and Skew-Symmetric Matrices

First, the concept of symmetry only applies to square matrices. A matrix $\mathbf{A}$ is said to be symmetric if and only if its transpose equals the original matrix, that is:

$$\mathbf{A} = \mathbf{A}^T \tag{15}$$

Meaning, given a symmetric $\mathbf{A}$, the element at row $i$, column $j$ must equal the element at row $j$, column $i$ ($a_{ij} = a_{ji}$).

$$\mathbf{A} = \begin{bmatrix} 5 & -2 & 1 & 3 \\ -2 & 0 & 4 & 7 \\ 1 & 4 & 9 & -5 \\ 3 & 7 & -5 & 6 \end{bmatrix}, \quad \text{the transpose } \mathbf{A}^T = \begin{bmatrix} 5 & -2 & 1 & 3 \\ -2 & 0 & 4 & 7 \\ 1 & 4 & 9 & -5 \\ 3 & 7 & -5 & 6 \end{bmatrix}$$

Since $\mathbf{A} = \mathbf{A}^T$, we can conclude that the square matrix $\mathbf{A}$ is symmetric.

A simple tweak: when $\mathbf{A}^T = -\mathbf{A}$, we say that matrix $\mathbf{A}$ is skew-symmetric. In a skew-symmetric matrix, the main diagonal elements must be $0$, and $a_{ij} = -a_{ji}$, as shown below:

$$\mathbf{A} = \begin{bmatrix} 0 & -2 & 1 & 3 \\ 2 & 0 & -4 & -7 \\ -1 & 4 & 0 & 5 \\ -3 & 7 & -5 & 0 \end{bmatrix}, \quad \text{the transpose } \mathbf{A}^T = \begin{bmatrix} 0 & 2 & -1 & -3 \\ -2 & 0 & 4 & 7 \\ 1 & -4 & 0 & -5 \\ 3 & -7 & 5 & 0 \end{bmatrix}$$
  3. Orthogonal Matrix

This is a square matrix whose columns (and also rows) are orthonormal vectors. Vectors $q_1, \dots, q_n$, the columns of $Q = \begin{bmatrix} q_1 & q_2 & \dots & q_n \end{bmatrix}$, form an orthonormal basis if and only if each vector is a unit vector (length 1) and the vectors are mutually orthogonal, hence independent. Therefore, given a square matrix $Q \in \mathbb{R}^{n \times n}$, the matrix is orthogonal if its transpose is equal to its inverse:

$$Q^T = Q^{-1} \tag{16}$$

Additionally, orthogonal matrices satisfy $Q^T Q = Q Q^T = I$, where $Q^T Q = I$ implies that the columns of $Q$ are orthonormal and $Q Q^T = I$ implies that the rows of $Q$ are also orthonormal.
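A quick check using a 2D rotation matrix, a classic orthogonal matrix:

```python
import numpy as np

theta = np.pi / 4
Q = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])  # rotation by 45 degrees

print(np.allclose(Q.T @ Q, np.eye(2)))      # columns are orthonormal
print(np.allclose(Q @ Q.T, np.eye(2)))      # rows are orthonormal
print(np.allclose(Q.T, np.linalg.inv(Q)))   # Q^T = Q^{-1}
```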

  4. Positive Definite Matrices

4.1 Quadratic Forms and Eigenvalues

Given a quadratic form like $ax_1^2 + bx_1x_2 + cx_2^2$, we can write its associated symmetric matrix, where the squared terms form the diagonal and the cross-term coefficient is divided by 2:

$$\begin{bmatrix} a & b/2 \\ b/2 & c \end{bmatrix} \tag{17}$$

The same rule also applies for more complex quadratic forms, e.g. for $ax_1^2 + bx_2^2 - cx_3^2 + dx_1x_2 - fx_1x_3 + gx_2x_3$, the associated symmetric matrix is constructed by placing the coefficients of the squared terms ($x_1^2, x_2^2, x_3^2$) on the main diagonal and splitting the cross-term coefficients across the corresponding symmetric positions:

$$\begin{bmatrix} a & d/2 & -f/2 \\ d/2 & b & g/2 \\ -f/2 & g/2 & -c \end{bmatrix} \tag{18}$$

So, generally, the rules for converting any quadratic form to a symmetric matrix $\mathbf{A}$ are: the diagonal elements ($a_{ii}$) are the coefficients of the squared terms $x_i^2$; the off-diagonal elements ($a_{ij}$) are half the coefficient of the $x_i x_j$ term, ensuring the matrix remains symmetric ($a_{ij} = a_{ji}$).

Therefore, we can formally define the relationship between the quadratic form $Q(\mathbf{x})$ and the symmetric matrix $\mathbf{A}$ as:

$$Q(\mathbf{x}) = \mathbf{x}^T \mathbf{A} \mathbf{x} \tag{19}$$

We can prove this is correct by expanding $\mathbf{x}^T \mathbf{A} \mathbf{x}$ for our first example of the quadratic form $ax_1^2 + bx_1x_2 + cx_2^2$.

Given the vector $\mathbf{x} = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}$ and symmetric matrix $\mathbf{A} = \begin{bmatrix} a & b/2 \\ b/2 & c \end{bmatrix}$:

$$\mathbf{Ax} = \begin{bmatrix} a & b/2 \\ b/2 & c \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} ax_1 + \frac{b}{2}x_2 \\ \frac{b}{2}x_1 + cx_2 \end{bmatrix}$$

$$\begin{aligned} \mathbf{x}^T(\mathbf{Ax}) &= \begin{bmatrix} x_1 & x_2 \end{bmatrix} \begin{bmatrix} ax_1 + \frac{b}{2}x_2 \\ \frac{b}{2}x_1 + cx_2 \end{bmatrix} \\ &= x_1\left(ax_1 + \frac{b}{2}x_2\right) + x_2\left(\frac{b}{2}x_1 + cx_2\right) \\ &= ax_1^2 + \frac{b}{2}x_1x_2 + \frac{b}{2}x_2x_1 + cx_2^2 \\ &= ax_1^2 + bx_1x_2 + cx_2^2 \end{aligned} \tag{20}$$

This has an interesting connection with eigenvalues: we can calculate them directly, e.g. given:

$$\mathbf{A} = \begin{bmatrix} 3 & -2 \\ -2 & 7 \end{bmatrix}$$

To find the eigenvalues $\lambda$, we solve the characteristic equation $\det(\mathbf{A} - \lambda\mathbf{I}) = 0$:

$$\det \begin{bmatrix} 3 - \lambda & -2 \\ -2 & 7 - \lambda \end{bmatrix} = 0$$

$$\begin{aligned} (3 - \lambda)(7 - \lambda) - (-2)(-2) &= 0 \\ 21 - 3\lambda - 7\lambda + \lambda^2 - 4 &= 0 \\ \lambda^2 - 10\lambda + 17 &= 0 \\ \lambda = \frac{10 \pm \sqrt{(-10)^2 - 4(1)(17)}}{2(1)} &= \frac{10 \pm \sqrt{32}}{2} \\ \lambda &= 5 \pm 2\sqrt{2} \end{aligned}$$

$$\lambda_1 = 5 + 2\sqrt{2} \approx 7.828, \quad \lambda_2 = 5 - 2\sqrt{2} \approx 2.172$$

Another interesting property of the calculated eigenvalues, useful in machine learning, is that we can find a basis for the eigenspace associated with each eigenvalue. Given our previously calculated eigenvalues $\lambda_1 = 5 + 2\sqrt{2}$ and $\lambda_2 = 5 - 2\sqrt{2}$, we find the basis vectors by solving the homogeneous system $(\mathbf{A} - \lambda\mathbf{I})\mathbf{v} = \mathbf{0}$.

$$\left[ \begin{array}{cc|c} 3 - \lambda_1 & -2 & 0 \\ -2 & 7 - \lambda_1 & 0 \end{array} \right]$$

$$(3 - (5 + 2\sqrt{2}))x_1 - 2x_2 = 0 \implies (-2 - 2\sqrt{2})x_1 - 2x_2 = 0$$

If we let x1=1x_1 = 1, then:

$$\mathbf{v}_1 = \begin{bmatrix} 1 \\ -1 - \sqrt{2} \end{bmatrix}$$

With the same approach, $\lambda_2 = 5 - 2\sqrt{2}$ results in:

$$\mathbf{v}_2 = \begin{bmatrix} 1 \\ \sqrt{2} - 1 \end{bmatrix}$$

Given the fact that matrix A\mathbf{A} is symmetric, these basis vectors are guaranteed to be orthogonal to each other.
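We can verify the eigenvalues and the orthogonality claim with NumPy's symmetric eigensolver:

```python
import numpy as np

A = np.array([[3.0, -2.0],
              [-2.0, 7.0]])

# eigh is specialized for symmetric matrices; eigenvalues come back in ascending order
vals, vecs = np.linalg.eigh(A)

print(vals)  # approximately [2.172, 7.828], i.e. 5 -/+ 2*sqrt(2)
print(np.isclose(vecs[:, 0] @ vecs[:, 1], 0.0))  # eigenvectors are orthogonal
```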

4.2 Positive Definite

A symmetric matrix $\mathbf{A} \in \mathbb{S}^n$ is positive definite iff $\mathbf{x}^T \mathbf{A} \mathbf{x} > 0$ for all non-zero vectors $\mathbf{x} \in \mathbb{R}^n$; it is positive semidefinite (psd) if $\mathbf{x}^T \mathbf{A} \mathbf{x} \geq 0$ for all $\mathbf{x}$. Similarly, a symmetric matrix $\mathbf{A} \in \mathbb{S}^n$ is negative definite iff $\mathbf{x}^T \mathbf{A} \mathbf{x} < 0$ for all non-zero $\mathbf{x} \in \mathbb{R}^n$, and negative semidefinite if $\mathbf{x}^T \mathbf{A} \mathbf{x} \leq 0$ for all $\mathbf{x}$. Finally, a symmetric matrix $\mathbf{A} \in \mathbb{S}^n$ is indefinite if there exist $\mathbf{x}_1, \mathbf{x}_2 \in \mathbb{R}^n$ such that $\mathbf{x}_1^T \mathbf{A} \mathbf{x}_1 > 0$ and $\mathbf{x}_2^T \mathbf{A} \mathbf{x}_2 < 0$, as summarized below:

$$\begin{array}{lcl} \text{Classification} & \text{Notation} & \text{Condition (for non-zero } \mathbf{x}\text{)} \\ \hline \text{Positive Definite} & \mathbf{A} \succ 0 & \mathbf{x}^T \mathbf{A} \mathbf{x} > 0 \\ \text{Positive Semidefinite} & \mathbf{A} \succeq 0 \text{ (psd)} & \mathbf{x}^T \mathbf{A} \mathbf{x} \geq 0 \\ \text{Negative Definite} & \mathbf{A} \prec 0 & \mathbf{x}^T \mathbf{A} \mathbf{x} < 0 \\ \text{Negative Semidefinite} & \mathbf{A} \preceq 0 & \mathbf{x}^T \mathbf{A} \mathbf{x} \leq 0 \\ \text{Indefinite} & & \exists\, \mathbf{x}_1, \mathbf{x}_2 : \mathbf{x}_1^T \mathbf{A} \mathbf{x}_1 > 0,\ \mathbf{x}_2^T \mathbf{A} \mathbf{x}_2 < 0 \end{array}$$
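Two practical ways to test positive definiteness for the earlier example matrix: check that all eigenvalues are positive, or attempt a Cholesky factorization, which succeeds only for positive definite matrices:

```python
import numpy as np

A = np.array([[3.0, -2.0],
              [-2.0, 7.0]])

# A symmetric matrix is positive definite iff all its eigenvalues are > 0
print(np.all(np.linalg.eigvalsh(A) > 0))  # True

# Equivalently, Cholesky succeeds only for positive definite matrices
try:
    L = np.linalg.cholesky(A)  # A = L L^T with L lower triangular
    print("positive definite")
except np.linalg.LinAlgError:
    print("not positive definite")
```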

Matrix Factorizations

Selecting a matrix factorization method for a given problem is mainly dictated by the dimensions and properties of the matrix. Matrices are usually categorized into square and rectangular matrices. For square matrices that are invertible and well-conditioned, methods such as Gauss-Jordan elimination or LU decomposition are preferred. However, a square invertible matrix may be ill-conditioned, meaning that small changes in the input data result in disproportionately large changes in the output; in these instances, Singular Value Decomposition (SVD) or regularization techniques, such as Tikhonov regularization, are used to stabilize the solution. For square singular matrices that are non-invertible, as well as rectangular matrices, the standard approach is the Moore-Penrose pseudoinverse ($\mathbf{A}^+$). This is particularly useful for overdetermined systems, where the matrix has more rows than columns ($m > n$), and underdetermined systems, where the matrix has more columns than rows ($n > m$). In these cases, the pseudoinverse provides the best-fit solution in a least-squares sense, calculated as $\mathbf{x} = \mathbf{A}^+ \mathbf{b}$. For an overdetermined system, this often involves the normal equations, where $\mathbf{x} = (\mathbf{A}^\top \mathbf{A})^{-1} \mathbf{A}^\top \mathbf{b}$.

$$\begin{array}{ccc} \text{Matrix Type} & \text{Property} & \text{Recommended Method} \\ \hline \text{Square} & \text{Invertible (Well-conditioned)} & \text{LU Decomposition / Gauss-Jordan} \\ \text{Square} & \text{Invertible (Ill-conditioned)} & \text{SVD / Tikhonov Regularization} \\ \text{Square} & \text{Singular (Non-invertible)} & \text{Pseudoinverse } (\mathbf{A}^+) \\ \text{Rectangular} & \text{Over- / Underdetermined} & \text{Pseudoinverse } (\mathbf{A}^+) \end{array}$$

With that in mind, we shall now cover the key matrix factorization methods:

  1. LU Decomposition

Given a square matrix $\mathbf{A}$, we want to rewrite it as $\mathbf{A} = \mathbf{L}\mathbf{U}$, where $\mathbf{L} \in \mathbb{R}^{n \times n}$ is a lower triangular matrix and $\mathbf{U} \in \mathbb{R}^{n \times n}$ is an upper triangular matrix.

We follow an elimination process that transforms the original matrix into an upper triangular form while recording the operational history in a lower triangular matrix. For example, given:

$$\mathbf{A} = \begin{bmatrix} 2 & 1 & 1 \\ 4 & 3 & 3 \\ 8 & 7 & 9 \end{bmatrix}$$

First, we focus on the first column and use the pivot element $A_{11} = 2$ to eliminate the entries in the rows below it. We calculate the multipliers by dividing the leading element of each target row by the pivot, resulting in $l_{21} = 4 / 2 = 2$ for the second row and $l_{31} = 8 / 2 = 4$ for the third row; these multipliers are used to construct the $\mathbf{L}$ matrix later. We then execute the row operations by subtracting two times the first row from the second row ($R_2 \leftarrow R_2 - 2R_1$) and four times the first row from the third row ($R_3 \leftarrow R_3 - 4R_1$). This results in the first intermediate matrix, where the first column below the diagonal consists of zeros:

$$\mathbf{A}^{(1)} = \begin{bmatrix} 2 & 1 & 1 \\ 0 & 1 & 1 \\ 0 & 3 & 5 \end{bmatrix}$$

We move to the second column and use the new pivot at $A_{22} = 1$. To eliminate the value below this pivot in the third row, we compute the multiplier $l_{32} = 3 / 1 = 3$ and perform the row operation $R_3 \leftarrow R_3 - 3R_2$, subtracting three times the current second row from the third row. This calculation gives $5 - (3 \times 1) = 2$ for the final element, completing the transformation into the upper triangular matrix $\mathbf{U}$:

$$\mathbf{U} = \begin{bmatrix} 2 & 1 & 1 \\ 0 & 1 & 1 \\ 0 & 0 & 2 \end{bmatrix}$$

We then construct the lower triangular matrix $\mathbf{L}$, which stores the record of the elimination steps. We begin with an identity matrix and populate the positions below the main diagonal using the multipliers from the previous steps. That is, we place $l_{21} = 2$ and $l_{31} = 4$ in the first column and $l_{32} = 3$ in the second column, resulting in the lower triangular matrix:

$$\mathbf{L} = \begin{bmatrix} 1 & 0 & 0 \\ 2 & 1 & 0 \\ 4 & 3 & 1 \end{bmatrix}$$
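The elimination above can be sketched as a small Doolittle-style routine without pivoting (production code, e.g. scipy.linalg.lu, adds partial pivoting for stability, so its factors may be row-permuted versions of these):

```python
import numpy as np

def lu_no_pivot(A):
    """Doolittle LU without pivoting; assumes no zero pivots are encountered."""
    A = A.astype(float)
    n = A.shape[0]
    L = np.eye(n)
    U = A.copy()
    for k in range(n - 1):
        for i in range(k + 1, n):
            L[i, k] = U[i, k] / U[k, k]     # multiplier l_ik
            U[i, :] -= L[i, k] * U[k, :]    # row operation R_i <- R_i - l_ik * R_k
    return L, U

A = np.array([[2, 1, 1],
              [4, 3, 3],
              [8, 7, 9]])
L, U = lu_no_pivot(A)
print(L)  # matches the worked example's L
print(U)  # matches the worked example's U
```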
  2. QR Decomposition

For QR decomposition, our objective is to factorize a rectangular matrix $\mathbf{A} \in \mathbb{R}^{m \times n}$ or a square matrix $\mathbf{A} \in \mathbb{R}^{n \times n}$ into the product of an orthogonal matrix $\mathbf{Q}$ and an upper triangular matrix $\mathbf{R}$. As a reminder, a square matrix $\mathbf{B}$ is orthogonal if $\mathbf{B}^T\mathbf{B} = \mathbf{I}$ and $\mathbf{B}\mathbf{B}^T = \mathbf{I}$, and hence $\mathbf{B}^{-1} = \mathbf{B}^T$. The columns and rows of an orthogonal matrix are orthonormal, meaning their 2-norm (Euclidean length) is 1, as shown below:

$$\begin{bmatrix} a & b \\ -b & a \end{bmatrix}, \quad a^2 + b^2 = 1 \tag{21}$$

While the Gram-Schmidt process is often used in maths textbooks to introduce the concept of orthogonalization, it is numerically unstable because orthogonality quickly degrades due to rounding errors. NumPy implements QR factorization using Householder reflections, since each reflection preserves the Euclidean norm of the matrix columns, and the "sign choice" avoids numerical cancellation.

The general idea is to iteratively apply these reflections to smaller and smaller submatrices until only the upper triangular form remains. First, we isolate the first column of $\mathbf{A}$, denoted $\mathbf{x}$; our goal is to find a reflection that maps $\mathbf{x}$ to a multiple of the first unit vector $\mathbf{e}_1$, thus zeroing out all entries below the first row. We compute the Householder vector $\mathbf{v}_1 = \mathbf{x} + \text{sign}(x_1)\|\mathbf{x}\|_2 \mathbf{e}_1$. Using this vector, we construct the first Householder matrix $\mathbf{H}_1 = \mathbf{I} - 2\frac{\mathbf{v}_1 \mathbf{v}_1^\top}{\mathbf{v}_1^\top \mathbf{v}_1}$.

Multiplying the original matrix by this reflection, the first column of the resulting matrix H1A\mathbf{H}_1 \mathbf{A} becomes:

H1A=[sign(x1)x00](22)\mathbf{H}_1 \mathbf{A} = \begin{bmatrix} -\text{sign}(x_1)\|\mathbf{x}\| & \dots \\ 0 & \dots \\ 0 & \dots \end{bmatrix} \tag{22}

We repeat the process on the submatrix formed by ignoring the first row and first column of the transformed matrix, resulting in an (m-1) \times (n-1) submatrix. We calculate a new Householder vector \mathbf{v}_2 and a corresponding reflection matrix \hat{\mathbf{H}}_2. To apply this to the full matrix, we embed it in a larger identity structure \mathbf{H}_2 = \text{diag}(1, \hat{\mathbf{H}}_2), which ensures that the second reflection zeros out the elements below the diagonal in the second column while leaving the zeros created in the first column intact.

This process continues for all columns until the matrix is fully reduced to an upper triangular form R\mathbf{R}. The final matrix R\mathbf{R} is the result of applying the sequence of all Householder reflections to the original matrix, such that R=HnH2H1A\mathbf{R} = \mathbf{H}_n \dots \mathbf{H}_2 \mathbf{H}_1 \mathbf{A}.

Because Householder matrices are orthogonal and symmetric (Hk=Hk=Hk1\mathbf{H}_k = \mathbf{H}_k^\top = \mathbf{H}_k^{-1}), we can solve for Q\mathbf{Q} by multiplying the reflections in reverse order. The orthogonal matrix Q\mathbf{Q} is thus defined as Q=H1H2Hn\mathbf{Q} = \mathbf{H}_1 \mathbf{H}_2 \dots \mathbf{H}_n. This guarantees that QQ=I\mathbf{Q}^\top \mathbf{Q} = \mathbf{I} and that the final product QR\mathbf{QR} reconstructs the original matrix A\mathbf{A}.
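The whole sequence of reflections can be sketched in NumPy. This is a textbook implementation for illustration only (the example matrix is an arbitrary choice; `numpy.linalg.qr` itself calls LAPACK):

```python
import numpy as np

def householder_qr(A):
    """Factor A = Q R using Householder reflections (textbook sketch)."""
    m, n = A.shape
    R = A.astype(float).copy()
    Q = np.eye(m)
    for k in range(min(m, n)):
        x = R[k:, k]
        # Householder vector; copysign implements the cancellation-avoiding sign choice
        v = x.copy()
        v[0] += np.copysign(np.linalg.norm(x), x[0])
        v /= np.linalg.norm(v)
        # Apply H_k = I - 2 v v^T to the trailing submatrix of R ...
        R[k:, k:] -= 2.0 * np.outer(v, v @ R[k:, k:])
        # ... and accumulate Q = H_1 H_2 ... H_k on the right
        Q[:, k:] -= 2.0 * np.outer(Q[:, k:] @ v, v)
    return Q, R

A = np.array([[12., -51.,   4.],
              [ 6., 167., -68.],
              [-4.,  24., -41.]])
Q, R = householder_qr(A)
```

After the loop, `Q` is orthogonal, `R` is upper triangular (up to rounding), and `Q @ R` reconstructs `A`.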

Eigenvalues and Eigenvectors

Given a square matrix \mathbf{A}, our goal is to find directions in which the linear transformation acts as simple scaling, without any change of direction. That is, \mathbf{A}\mathbf{v} = \lambda\mathbf{v}. The vector \mathbf{v} is the eigenvector (also referred to as a characteristic or latent vector): the direction that remains invariant under the transformation. The scalar \lambda is the eigenvalue, the factor by which that vector is scaled. To solve for these values, we rearrange the equation into a homogeneous linear system:

(\mathbf{A} - \lambda\mathbf{I})\mathbf{v} = \mathbf{0} \tag{23}

For a non-trivial solution \mathbf{v} \neq \mathbf{0}, the matrix (\mathbf{A} - \lambda\mathbf{I}) must be singular. That is, its determinant is 0, resulting in the characteristic equation:

\det(\mathbf{A} - \lambda\mathbf{I}) = 0 \tag{24}

Solving this polynomial equation provides the eigenvalues. Once these are known, each corresponding eigenvector is obtained by solving the resulting linear system for \mathbf{v}.

Consider matrix A\mathbf{A}:

A=[4123]\mathbf{A} = \begin{bmatrix} 4 & 1 \\ 2 & 3 \end{bmatrix}

First, we subtract λ\lambda from the main diagonal elements of A\mathbf{A}:

AλI=[4λ123λ]\mathbf{A} - \lambda\mathbf{I} = \begin{bmatrix} 4 - \lambda & 1 \\ 2 & 3 - \lambda \end{bmatrix}

We follow by calculating the determinant of this matrix:

det(AλI)=(4λ)(3λ)(1)(2)\det(\mathbf{A} - \lambda\mathbf{I}) = (4 - \lambda)(3 - \lambda) - (1)(2)

This results in the characteristic polynomial:

(12 - 4\lambda - 3\lambda + \lambda^2) - 2 = 0

\lambda^2 - 7\lambda + 10 = 0

We then factor by looking for two numbers that multiply to 1010 and add to 7-7:

(λ5)(λ2)=0(\lambda - 5)(\lambda - 2) = 0

Setting each factor to zero, we find the eigenvalues for the matrix A\mathbf{A}:

λ1=5,λ2=2\lambda_1 = 5, \quad \lambda_2 = 2

We can verify this with the trace (4 + 3 = 7 and \lambda_1 + \lambda_2 = 5 + 2 = 7) or the determinant ((4 \times 3) - (1 \times 2) = 10 and \lambda_1 \times \lambda_2 = 5 \times 2 = 10).
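We can also confirm the hand calculation numerically with `numpy.linalg.eig`:

```python
import numpy as np

A = np.array([[4., 1.],
              [2., 3.]])
eigvals, eigvecs = np.linalg.eig(A)

# eig returns the eigenvalues in no guaranteed order, and each
# column of eigvecs is a normalized eigenvector for the matching eigenvalue
for lam, v in zip(eigvals, eigvecs.T):
    assert np.allclose(A @ v, lam * v)   # A v = lambda v holds for each pair
```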

  1. Eigen Decomposition

Eigendecomposition (also known as eigenvalue decomposition, spectral decomposition, or diagonalization) of a square matrix is a factorization of the form \mathbf{A} = \mathbf{Q}\mathbf{\Lambda}\mathbf{Q}^{-1}, which gives \mathbf{A}^n = \mathbf{Q}\mathbf{\Lambda}^n\mathbf{Q}^{-1}. Reading from right to left, \mathbf{Q}^{-1} is an invertible change of basis into eigenvector coordinates, \mathbf{\Lambda} is a scaling operation, and \mathbf{Q} is the inverse of that initial transformation, mapping back to the original basis. The matrix \mathbf{Q} must be invertible to allow for the change of basis, and \mathbf{\Lambda} is a diagonal matrix \text{diag}(\vec{\lambda}) whose diagonal entries are the eigenvalues. Unless \mathbf{A} is symmetric, the eigenvectors in \mathbf{Q} are not required to be orthogonal.

Recall the eigenvalues \lambda_1 = 5 and \lambda_2 = 2 from our previous example matrix \mathbf{A} = \begin{bmatrix} 4 & 1 \\ 2 & 3 \end{bmatrix}; we can use them and their corresponding eigenvectors to calculate \mathbf{A}^{10}. We first solve for the eigenvectors by substituting the eigenvalues back into (\mathbf{A} - \lambda\mathbf{I})\mathbf{v} = \mathbf{0}. For \lambda_1 = 5, this results in \begin{bmatrix} -1 & 1 \\ 2 & -2 \end{bmatrix} \mathbf{v}_1 = \mathbf{0}, which yields \mathbf{v}_1 = \begin{bmatrix} 1 \\ 1 \end{bmatrix}. For \lambda_2 = 2, \begin{bmatrix} 2 & 1 \\ 2 & 1 \end{bmatrix} \mathbf{v}_2 = \mathbf{0} yields \mathbf{v}_2 = \begin{bmatrix} 1 \\ -2 \end{bmatrix}. Therefore the matrices \mathbf{Q} and \mathbf{\Lambda} are:

Q=[1112],Λ=[5002]\mathbf{Q} = \begin{bmatrix} 1 & 1 \\ 1 & -2 \end{bmatrix}, \quad \mathbf{\Lambda} = \begin{bmatrix} 5 & 0 \\ 0 & 2 \end{bmatrix}

Applying the power to the diagonal elements of Λ\mathbf{\Lambda} would result in:

A10=Q[51000210]Q1\mathbf{A}^{10} = \mathbf{Q} \begin{bmatrix} 5^{10} & 0 \\ 0 & 2^{10} \end{bmatrix} \mathbf{Q}^{-1}
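The same computation can be done in NumPy (using `numpy.linalg.eig` to obtain Q and \Lambda, since it returns normalized eigenvectors rather than the hand-derived ones):

```python
import numpy as np

A = np.array([[4., 1.],
              [2., 3.]])
lam, Q = np.linalg.eig(A)

# Raise only the diagonal eigenvalue matrix to the 10th power,
# then change basis back with Q and its inverse
A10 = Q @ np.diag(lam ** 10) @ np.linalg.inv(Q)
```

This is far cheaper than ten repeated matrix multiplications once the decomposition is known, and it agrees with `numpy.linalg.matrix_power(A, 10)`.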
  1. Singular Value Decomposition (SVD)

Let us first start by stating the Wikipedia definition of SVD: a factorization of an m \times n complex matrix \mathbf{M} into the form \mathbf{M} = \mathbf{U\Sigma V^*} for complex matrices, or \mathbf{M} = \mathbf{U\Sigma V^\top} if the matrix consists of real numbers. This definition generalizes eigendecomposition to any matrix, regardless of its shape or properties. A matrix is considered unitary if its conjugate transpose is also its inverse, satisfying the property \mathbf{U}^*\mathbf{U} = \mathbf{I}. The matrix \mathbf{U} is an m \times m unitary matrix whose columns, the left singular vectors, form an orthonormal basis of the output space. \mathbf{V} is an n \times n unitary matrix, and its columns (equivalently, the rows of \mathbf{V}^\top) are the right singular vectors, forming an orthonormal basis of the input space. \mathbf{\Sigma} \in \mathbb{R}^{m \times n} is a diagonal matrix containing the non-negative singular values, ordered as \sigma_1 \ge \sigma_2 \ge \cdots \ge 0.

ComponentDimensionPropertyRoleUm×mOrthogonalLeft-singular vectors (Output Basis)Σm×nDiagonalSingular values (Scaling factors)Vn×nOrthogonalRight-singular vectors (Input Basis)\begin{array}{cccc} \text{Component} & \text{Dimension} & \text{Property} & \text{Role} \\ \hline \mathbf{U} & m \times m & \text{Orthogonal} & \text{Left-singular vectors (Output Basis)} \\ \mathbf{\Sigma} & m \times n & \text{Diagonal} & \text{Singular values (Scaling factors)} \\ \mathbf{V}^\top & n \times n & \text{Orthogonal} & \text{Right-singular vectors (Input Basis)} \end{array}

NumPy performs the decomposition using a LAPACK routine whose internals I don't fully understand; if you are interested in working it out on paper, the power method is much simpler and is the basis for most of the advanced SVD algorithms.
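A minimal sketch of that power method for the dominant singular triple, iterating with \mathbf{M}^\top\mathbf{M} (the example matrix and iteration count are arbitrary choices for illustration, not what LAPACK does):

```python
import numpy as np

def top_singular_triple(M, iters=500, seed=0):
    """Dominant singular value/vectors of M via the power method on M^T M."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(M.shape[1])
    for _ in range(iters):
        v = M.T @ (M @ v)          # one power iteration on M^T M
        v /= np.linalg.norm(v)     # renormalize to prevent overflow
    sigma = np.linalg.norm(M @ v)  # largest singular value
    u = (M @ v) / sigma            # corresponding left singular vector
    return u, sigma, v

M = np.array([[3., 1.],
              [1., 3.],
              [0., 2.]])
u, s, v = top_singular_triple(M)
```

The iteration converges to the top right singular vector because \mathbf{M}^\top\mathbf{M} has eigenvalues \sigma_i^2; repeatedly subtracting the found component (deflation) yields the remaining triples.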

  1. Cholesky Decomposition

Given a symmetric positive definite matrix \mathbf{A}, the Cholesky decomposition or "matrix square root" is a factorization \mathbf{A} = \mathbf{R}^\top\mathbf{R}, where \mathbf{R} is an upper triangular matrix. This factorization can also be rewritten as \mathbf{A} = \mathbf{LL}^\top where \mathbf{L} = \mathbf{R}^\top. It is commonly used when working with the multivariate Gaussian, for example to sample from it or to factor its covariance matrix.

Consider for example:

A=[3.38210.87840.36132.03490.87842.00680.55870.11690.36130.55873.66560.78072.03490.11690.78072.5397]\mathbf{A} = \begin{bmatrix} 3.3821 & 0.8784 & 0.3613 & -2.0349 \\ 0.8784 & 2.0068 & 0.5587 & 0.1169 \\ 0.3613 & 0.5587 & 3.6656 & 0.7807 \\ -2.0349 & 0.1169 & 0.7807 & 2.5397 \end{bmatrix}

We start by iteratively solving for the lower triangular matrix L\mathbf{L} by calculating the square root of the first diagonal element, L11=3.38211.8390L_{11} = \sqrt{3.3821} \approx 1.8390 and the remaining entries in this column are found by dividing the corresponding elements of A\mathbf{A} by L11L_{11}, resulting in L21=0.8784/1.83900.4776L_{21} = 0.8784 / 1.8390 \approx 0.4776, L31=0.3613/1.83900.1965L_{31} = 0.3613 / 1.8390 \approx 0.1965, and L41=2.0349/1.83901.1065L_{41} = -2.0349 / 1.8390 \approx -1.1065.

For the second column, we take into consideration what has happened in the first column before solving for the new diagonal. We calculate L22=2.0068(0.4776)21.3337L_{22} = \sqrt{2.0068 - (0.4776)^2} \approx 1.3337. The off-diagonal elements are then determined by L32=(0.5587(0.1965)(0.4776))/1.33370.3486L_{32} = (0.5587 - (0.1965)(0.4776)) / 1.3337 \approx 0.3486 and L42=(0.1169(1.1065)(0.4776))/1.33370.4839L_{42} = (0.1169 - (-1.1065)(0.4776)) / 1.3337 \approx 0.4839.

In the third column, we continue the subtraction process to find L33=3.6656(0.19652+0.34862)1.8722L_{33} = \sqrt{3.6656 - (0.1965^2 + 0.3486^2)} \approx 1.8722 and the subsequent value L43=(0.7807((1.1065)(0.1965)+(0.4839)(0.3486)))/1.87220.4441L_{43} = (0.7807 - ((-1.1065)(0.1965) + (0.4839)(0.3486))) / 1.8722 \approx 0.4441. The final element is derived by subtracting the squares of all preceding row elements in L\mathbf{L} from the last diagonal of A\mathbf{A}, giving L44=2.5397((1.1065)2+0.48392+0.44412)0.9413L_{44} = \sqrt{2.5397 - ((-1.1065)^2 + 0.4839^2 + 0.4441^2)} \approx 0.9413.

This results in the lower triangular matrix L\mathbf{L}:

[1.83900000.47761.3337000.19650.34861.872201.10650.48390.44410.9413]\begin{bmatrix} 1.8390 & 0 & 0 & 0 \\ 0.4776 & 1.3337 & 0 & 0 \\ 0.1965 & 0.3486 & 1.8722 & 0 \\ -1.1065 & 0.4839 & 0.4441 & 0.9413 \end{bmatrix}

Multiplying the lower triangular matrix \mathbf{L} by its transpose, \mathbf{LL}^\top, we can verify that we recover the original matrix \mathbf{A}:

LL=[1.83900000.47761.3337000.19650.34861.872201.10650.48390.44410.9413][1.83900.47760.19651.106501.33370.34860.4839001.87220.44410000.9413]\mathbf{L}\mathbf{L}^\top = \begin{bmatrix} 1.8390 & 0 & 0 & 0 \\ 0.4776 & 1.3337 & 0 & 0 \\ 0.1965 & 0.3486 & 1.8722 & 0 \\ -1.1065 & 0.4839 & 0.4441 & 0.9413 \end{bmatrix} \begin{bmatrix} 1.8390 & 0.4776 & 0.1965 & -1.1065 \\ 0 & 1.3337 & 0.3486 & 0.4839 \\ 0 & 0 & 1.8722 & 0.4441 \\ 0 & 0 & 0 & 0.9413 \end{bmatrix} =[3.38210.87840.36132.03490.87842.00680.55870.11690.36130.55873.66560.78072.03490.11690.78072.5397]= \begin{bmatrix} 3.3821 & 0.8784 & 0.3613 & -2.0349 \\ 0.8784 & 2.0068 & 0.5587 & 0.1169 \\ 0.3613 & 0.5587 & 3.6656 & 0.7807 \\ -2.0349 & 0.1169 & 0.7807 & 2.5397 \end{bmatrix}
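The hand computation above can be checked against `numpy.linalg.cholesky`, which returns the lower triangular factor directly:

```python
import numpy as np

A = np.array([
    [ 3.3821, 0.8784, 0.3613, -2.0349],
    [ 0.8784, 2.0068, 0.5587,  0.1169],
    [ 0.3613, 0.5587, 3.6656,  0.7807],
    [-2.0349, 0.1169, 0.7807,  2.5397],
])
L = np.linalg.cholesky(A)   # lower triangular factor, so A = L @ L.T
```

The entries of `L` match the hand-computed values (e.g. `L[0, 0]` is approximately 1.8390), and `L @ L.T` reconstructs `A`.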

Some of the NumPy functions for the linear algebra operations covered in this post include:

numpy.dot, numpy.inner, numpy.outer, numpy.matmul, numpy.tensordot, numpy.einsum, numpy.kron,
numpy.linalg.cholesky, numpy.linalg.qr, numpy.linalg.svd, numpy.linalg.eig,
numpy.linalg.eigvals, numpy.linalg.norm, numpy.linalg.cond, numpy.linalg.det,
numpy.linalg.matrix_rank, numpy.trace, numpy.linalg.solve, numpy.linalg.tensorsolve,
numpy.linalg.lstsq, numpy.linalg.inv, numpy.linalg.pinv, numpy.linalg.tensorinv,
numpy.diagonal

References:

  1. Margalit, D., & Rabinoff, J. (2019). Interactive Linear Algebra. Georgia Institute of Technology. Retrieved from https://textbooks.math.gatech.edu/ila/ila.pdf
  2. Mathuranathan. (2013). Cholesky factorization and MATLAB code. Gaussian Waves. Retrieved from https://www.gaussianwaves.com/2013/05/cholesky-factorization-and-matlab-code/

For comments, please send me an email.