Linear algebra is a broad topic that one blog post cannot cover in its entirety. We will start with a review of matrix operations, including multiplication, vector spaces, and other related concepts, and then explore the aspects of linear algebra most relevant to machine learning. The point of this blog is not to present formal mathematics; there are plenty of books and resources out there that will do that for you if that is your interest.
Vector and Matrix Multiplication:
Some key properties of matrices worth remembering before looking at the multiplication operations are: non-commutativity, i.e. $AB \neq BA$ in general; associativity, i.e. $(AB)C = A(BC)$; and distributivity over matrix addition, i.e. $A(B+C) = AB + AC$.
Vector-Vector Multiplication:
Given $u, v \in \mathbb{R}^n$ with the same dimension, the inner product satisfies $u^\top v = v^\top u$, and both are guaranteed to result in the same scalar value.
Even when the vectors have different dimensions ($u \in \mathbb{R}^m$ and $v \in \mathbb{R}^n$), their product $uv^\top$ results in an $m \times n$ matrix, referred to as the outer product of the vectors.
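As a quick sketch of the two products above, using small made-up vectors (the values are illustrative, not from the text):

```python
import numpy as np

u = np.array([1.0, 2.0, 3.0])
v = np.array([4.0, 5.0, 6.0])

inner = u @ v            # u^T v: a scalar, equal to v^T u
outer = np.outer(u, v)   # u v^T: a 3x3 matrix (m x n in general)

print(inner)        # 32.0
print(outer.shape)  # (3, 3)
```

Note that `np.outer` flattens its inputs, so it also works when `u` and `v` have different lengths.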
Given a matrix $A \in \mathbb{R}^{m \times n}$ and a vector $x \in \mathbb{R}^n$, their product $Ax$ results in a new vector $b \in \mathbb{R}^m$. Each element $b_i$ is computed as the dot product of the $i$th row of $A$ (denoted $a_{i:}$) with the vector $x$. This operation is the foundation of a linear transformation:
$$b_i = a_{i:}^\top x = \sum_{j=1}^{n} A_{ij} x_j \tag{1}$$
We can further visualize this as a full transformation from $\mathbb{R}^n$ to $\mathbb{R}^m$:
Given two matrices A∈Rm×n and B∈Rn×p, their product C=AB results in a new matrix C∈Rm×p. Each element Cij is computed as the dot product of the ith row of A and the jth column of B:
$$C_{ij} = a_{i:}^\top b_{:j} = \sum_{k=1}^{n} A_{ik} B_{kj} \tag{3}$$
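Equations (1) and (3) can be sketched directly as loops and checked against NumPy's `@` operator (the matrices here are small made-up examples):

```python
import numpy as np

A = np.array([[1., 2.], [3., 4.], [5., 6.]])   # A in R^{3x2}
x = np.array([1., 1.])                          # x in R^2
B = np.array([[1., 0., 2.], [0., 1., 3.]])      # B in R^{2x3}

# b_i = sum_j A_ij x_j, written as an explicit loop
b_loop = np.array([sum(A[i, j] * x[j] for j in range(A.shape[1]))
                   for i in range(A.shape[0])])

# C_ij = sum_k A_ik B_kj
C = A @ B

print(b_loop)      # identical to A @ x
print(C.shape)     # (3, 3): (m x n)(n x p) -> (m x p)
```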
Kronecker Product:
Given matrices $A \in \mathbb{R}^{m \times n}$ and $B \in \mathbb{R}^{p \times q}$, their Kronecker product $A \otimes B$ is an $mp \times nq$ matrix.
A closely related concept is vectorization. For a matrix $A \in \mathbb{R}^{m \times n}$, the vectorization of $A$ is the $mn \times 1$ column vector obtained by stacking the columns of $A$ on top of each other:
$$\mathrm{vec}(A) = (a_{11}, \ldots, a_{m1}, \ldots, a_{1n}, \ldots, a_{mn})^\top \tag{5}$$
Alternatively, the row vectorization of an $n \times m$ matrix $X$ is the $nm$-vector that flattens $X$ by concatenating its rows into a single vector:
This is the same as numpy's np.reshape(X, [-1]), and the relationship can be equivalently expressed in terms of a Kronecker product with basis vectors $e_i \in \mathbb{R}^{n \times 1}$:
$$\mathrm{rvec}(X) = \sum_{i=1}^{n} e_i \otimes X^\top e_i \tag{7}$$
Some of the common identities involving Kronecker products are summarized below:
Assuming we had a sequence of numbers indexed as 1,2,…,n, we could represent their sum as ∑k=1nXk, meaning we keep summing using our index from 1 to n. Also, if there was a second sequence Y, the dot product of X and Y would be:
$$\langle X, Y \rangle = \sum_{k=1}^{n} x_k y_k \tag{8}$$
Instead of writing the equation with the $\sum$ sign, Einstein notation is a simplification that drops the summation sign: given an expression, we implicitly sum over every repeated index that occurs in a pair (once in the upper index position and once in the lower index position).
With the summation sign removed implicitly, $x^i$ denotes a contravariant vector, whose components transform in the "opposite" way to the basis vectors, while $x_i$ denotes a covariant vector, whose components transform in the "same" way as the basis vectors. We use lowering ($x_i = g_{ij} x^j$) and raising ($x^i = g^{ij} x_j$) to convert a column vector into a row vector and vice versa.
Some of the common einsum operations for machine learning are summarized below:
| Operation | Summation Notation | Einstein Notation |
|---|---|---|
| Inner product | $\langle x, y \rangle = \sum_{i=1}^n x_i y_i$ | $x^i y_i$ |
| Bilinear form | $A(x,y) = \sum_{i=1}^n \sum_{j=1}^n A_{ij} x_i y_j$ | $A_{ij} x^i y^j$ |
| Linear transformation | $(Ax)_i = \sum_{j=1}^n A_{ij} x_j$ | $(Ax)^i = A^i_{\ j} x^j$ |
| Matrix multiplication | $(AB)_{ij} = \sum_{k=1}^n A_{ik} B_{kj}$ | $(AB)^i_{\ j} = A^i_{\ k} B^k_{\ j}$ |
| Trace of matrix | $\mathrm{tr}\,A = \sum_{i=1}^n A_{ii}$ | $A^i_{\ i}$ |
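Each row of the table maps directly onto an `np.einsum` subscript string; a minimal sketch with random data:

```python
import numpy as np

rng = np.random.default_rng(0)
x, y = rng.normal(size=4), rng.normal(size=4)
A, B = rng.normal(size=(4, 4)), rng.normal(size=(4, 4))

inner = np.einsum("i,i->", x, y)         # <x, y>
bilin = np.einsum("ij,i,j->", A, x, y)   # x^T A y
Ax    = np.einsum("ij,j->i", A, x)       # linear transformation
AB    = np.einsum("ik,kj->ij", A, B)     # matrix multiplication
trace = np.einsum("ii->", A)             # trace

print(np.allclose(AB, A @ B))            # True
```

Repeated indices on the left of `->` are summed over, exactly as the Einstein convention prescribes.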
Vector Spaces
To understand the fundamentals of vector spaces and projection matrices, let's first review what linear independence, span, basis, dimension, and linear maps are.
Vectors $v_1, v_2, \ldots, v_n$ are linearly independent when $c_1 v_1 + c_2 v_2 + \cdots + c_n v_n = 0$ only if all $c_i = 0$. If there are other values of $c_i$ that produce the zero vector, the vectors are said to be linearly dependent. In general, we can check whether the vectors are linearly independent by merging them as the columns of a matrix $A$. For example:
$$v_1 = \begin{pmatrix} 1 \\ 2 \\ 3 \end{pmatrix}, \quad v_2 = \begin{pmatrix} 4 \\ 5 \\ 6 \end{pmatrix}, \quad v_3 = \begin{pmatrix} 7 \\ 8 \\ 9 \end{pmatrix}, \quad A = \begin{pmatrix} 1 & 4 & 7 \\ 2 & 5 & 8 \\ 3 & 6 & 9 \end{pmatrix}$$
We can check whether the columns of $A$ are linearly independent by solving the equation $Ac = 0$ using the Reduced Row Echelon Form (RREF), resulting in:
$$\mathrm{RREF}(A) = \begin{pmatrix} 1 & 0 & -1 \\ 0 & 1 & 2 \\ 0 & 0 & 0 \end{pmatrix}$$
The pivot positions (the leading 1s in each row) appear in columns 1 and 2, so those are the pivot columns. The third column does not contain a pivot, so it is a free column, meaning $v_3$ can be written as a linear combination of the pivot columns:
v3=−1v1+2v2(9)
Since there are only two pivot columns, the rank of A is 2, not 3. This means that rank(A)<n, confirming that the vectors are linearly dependent.
By contrast, for example, given the vectors:
$$v_1 = \begin{pmatrix} 1 \\ 0 \\ 0 \end{pmatrix}, \quad v_2 = \begin{pmatrix} 0 \\ 1 \\ 0 \end{pmatrix}, \quad v_3 = \begin{pmatrix} 0 \\ 0 \\ 1 \end{pmatrix}, \quad A = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}$$
Here, each column contains a pivot, there are no free columns, and rank(A)=3. The null space contains only the zero vector c=0, meaning the columns of A are linearly independent.
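Both examples above can be verified numerically with `np.linalg.matrix_rank`, which is a practical stand-in for the RREF pivot count:

```python
import numpy as np

A_dep = np.array([[1., 4., 7.],
                  [2., 5., 8.],
                  [3., 6., 9.]])
A_ind = np.eye(3)

rank_dep = np.linalg.matrix_rank(A_dep)   # 2: columns are dependent
rank_ind = np.linalg.matrix_rank(A_ind)   # 3: columns are independent

# The free column satisfies v3 = -1*v1 + 2*v2, as the RREF showed
v1, v2, v3 = A_dep.T
print(np.allclose(v3, -1 * v1 + 2 * v2))  # True
```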
Span, Basis, and Dimension
I will use one of the examples of the book to explain what a span is:
Is $\begin{pmatrix} 3 \\ 1 \\ 1 \end{pmatrix}$ in $\mathrm{Span}\left\{ \begin{pmatrix} 1 \\ 0 \\ 0 \end{pmatrix}, \begin{pmatrix} 1 \\ 1 \\ 1 \end{pmatrix} \right\}$?
Consider the above, can we find scalars x1 and x2 that make the equation consistent?
$$\begin{pmatrix} 3 \\ 1 \\ 1 \end{pmatrix} = x_1 \begin{pmatrix} 1 \\ 0 \\ 0 \end{pmatrix} + x_2 \begin{pmatrix} 1 \\ 1 \\ 1 \end{pmatrix} \tag{10}$$
The system is consistent with $x_1 = 2$ and $x_2 = 1$, therefore:
$$\begin{pmatrix} 3 \\ 1 \\ 1 \end{pmatrix} \text{ is in } \mathrm{Span}\left\{ \begin{pmatrix} 1 \\ 0 \\ 0 \end{pmatrix}, \begin{pmatrix} 1 \\ 1 \\ 1 \end{pmatrix} \right\}$$
A basis is a set of vectors $v_1, v_2, \ldots, v_n$ that are linearly independent and span the space. Every basis of a given space contains the same number of vectors, and this number is called the dimension of the space.
Fundamental subspaces
Null Space
Given a matrix $A \in \mathbb{R}^{m \times n}$, the null space is the set of all vectors $v \in \mathbb{R}^n$ that $A$ maps to the zero vector, i.e. $Av = 0$.
Column Space
Given a matrix $A$, the column space of $A$ is the vector space spanned by the column vectors of $A$ -- the set of all possible linear combinations of the columns. That is, given $A \in \mathbb{R}^{m \times n}$, we can view it as a function that maps vectors from $\mathbb{R}^n$ to vectors in $\mathbb{R}^m$ via $Ax = b$. The interesting fact is that the column space is simply the range of the mapping $Ax$, meaning it is the set of all possible output vectors $b$.
Row Space
Given a matrix A∈Rm×n with m rows and n columns, its row space is the vector space spanned by the row vectors of A. If we denote the rows of A as r1,r2,…,rm∈Rn, therefore Row(A)=span{r1,r2,…,rm}. Meaning the row space consists of all possible linear combinations of the row vectors:
The dimension of the row space is the rank of the matrix, $\mathrm{rank}(A)$, which is the maximum number of linearly independent rows in $A$; also $\mathrm{Row}(A) = \mathrm{Col}(A^\top)$. For example, consider $A = \begin{pmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \end{pmatrix} \in \mathbb{R}^{2 \times 3}$. The row vectors are $r_1 = [1, 2, 3]$ and $r_2 = [4, 5, 6]$, and the row space is $\mathrm{span}\{[1,2,3], [4,5,6]\} \subseteq \mathbb{R}^3$.
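The subspace dimensions for the example matrix above can be checked numerically; one common trick (an assumption of this sketch, not mentioned in the text) is to read a null-space basis off the SVD's trailing right singular vectors:

```python
import numpy as np

A = np.array([[1., 2., 3.],
              [4., 5., 6.]])          # 2x3 example from the text

r = np.linalg.matrix_rank(A)          # dim Row(A) = dim Col(A) = rank(A)

# Null space basis: right singular vectors for the zero singular values
_, s, Vt = np.linalg.svd(A)
null_basis = Vt[r:].T                 # shape (3, n - r)

print(r)                              # 2
print(np.allclose(A @ null_basis, 0)) # True: A maps the basis to zero
```

This also illustrates the rank-nullity relation: $\dim \mathrm{Null}(A) = n - \mathrm{rank}(A) = 1$ here.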
Matrix Inverses
Generally, given two numbers $x = 5$ and $y = 1/5$, multiplying them results in $xy = 1$. In this case, we call $y$ the multiplicative inverse of $x$. While division plays an important role in scalar arithmetic, there is no such operation as matrix division. Instead, the matrix inverse serves as its counterpart. For a small square matrix, such as a $2 \times 2$ matrix $A$, finding the inverse is straightforward:
$$A = \begin{bmatrix} a & b \\ c & d \end{bmatrix} \tag{12}$$
Its inverse, denoted $A^{-1}$, is calculated as:
$$A^{-1} = \frac{1}{\det(A)} \begin{bmatrix} d & -b \\ -c & a \end{bmatrix} \quad \text{or} \quad \frac{1}{|A|} \begin{bmatrix} d & -b \\ -c & a \end{bmatrix} \tag{13}$$
There are several relationships between a matrix $A$ and its inverse $A^{-1}$. The identity property states that the product of a square matrix and its inverse always results in the identity matrix $I$, such that $A^{-1}A = I$ and $AA^{-1} = I$. Other relationships are summarized below:
Deciding when to use matrix inversion is generally a tricky question; however, in machine learning, because we take derivatives of inverses and determinants during optimization, inversion is common. While high-speed inversion algorithms exist, the general guideline is to avoid direct inversion whenever possible. To solve a linear system of the form $Ax = b$ for a square matrix $A$, matrix factorization methods are preferred because they are more numerically stable and accurate than computing the inverse. However, when one has to solve $x = A^{-1}b$ across many different vectors $b$, precomputing and storing $A^{-1}$ can become computationally efficient.
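The guideline above can be sketched with a small, deliberately well-conditioned example matrix (made up for illustration): both routes agree here, but `solve` is the numerically safer habit.

```python
import numpy as np

A = np.array([[4., 1., 0., 0.],
              [1., 3., 1., 0.],
              [0., 1., 2., 1.],
              [0., 0., 1., 3.]])
b = np.array([1., 2., 3., 4.])

x_solve = np.linalg.solve(A, b)    # preferred: factorization-based
x_inv = np.linalg.inv(A) @ b       # explicit inverse: avoid when possible

print(np.allclose(x_solve, x_inv)) # True for a well-conditioned A
```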
Some Special Matrices
Diagonal Matrix
This is a matrix for which all the non-diagonal elements are 0. It is usually denoted D=diag(d1,d2,…,dn) or in matrix notation as:
$$D = \begin{bmatrix} d_1 & & & \\ & d_2 & & \\ & & \ddots & \\ & & & d_n \end{bmatrix} \tag{14}$$
We could also rewrite the above as d=diag(D) by just extracting the diagonal vector. One special example of a diagonal matrix is the identity matrix that is usually denoted I∈Rn×n.
Symmetric and Skew-Symmetric Matrices
First, the concept of symmetry only applies to square matrices. Matrix A is said to be symmetric if and only if its transpose equals the original matrix A, that is:
A=AT(15)
Meaning given A symmetric, the element at row i, column j must equal the element at row j, column i (aij=aji).
In practice, we check symmetry by comparing a matrix with its transpose: if $A = A^\top$ holds entrywise, we can conclude that the square matrix $A$ is symmetric.
A simple tweak: when $A^\top = -A$, we say that matrix $A$ is skew-symmetric. In a skew-symmetric matrix, the main diagonal elements must be 0, and $a_{ij} = -a_{ji}$.
Orthogonal Matrix
This is a square matrix whose columns (and also rows) are orthonormal vectors. The vectors $q_1, \ldots, q_n$, taken as the columns of $Q = \begin{bmatrix} q_1 & q_2 & \cdots & q_n \end{bmatrix}$, form an orthonormal basis if and only if each vector is a unit vector (length 1) and the vectors are mutually orthogonal, hence linearly independent. Therefore, given a square matrix $Q \in \mathbb{R}^{n \times n}$, the matrix is orthogonal if its transpose is equal to its inverse:
QT=Q−1(16)
Additionally, orthogonal matrices satisfy $Q^\top Q = QQ^\top = I$, where $Q^\top Q = I$ implies that the columns of $Q$ are orthonormal and $QQ^\top = I$ implies that the rows of $Q$ are also orthonormal.
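A rotation matrix is a simple concrete orthogonal matrix on which all three properties can be checked (the angle is arbitrary):

```python
import numpy as np

theta = 0.7
Q = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

print(np.allclose(Q.T @ Q, np.eye(2)))      # columns are orthonormal
print(np.allclose(Q @ Q.T, np.eye(2)))      # rows are orthonormal
print(np.allclose(Q.T, np.linalg.inv(Q)))   # Q^T = Q^{-1}
```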
4. Positive Definite Matrices
4.1 Quadratic Forms and Eigenvalues
Given a quadratic form like $ax_1^2 + bx_1x_2 + cx_2^2$, we can write its associated symmetric matrix, where the squared terms form the diagonal and the cross-term coefficient is divided by 2:
$$\begin{bmatrix} a & b/2 \\ b/2 & c \end{bmatrix} \tag{17}$$
The same rule applies to more complex quadratic forms. For $ax_1^2 + bx_2^2 - cx_3^2 + dx_1x_2 - fx_1x_3 + gx_2x_3$, the associated symmetric matrix is constructed by placing the coefficients of the squared terms ($x_1^2, x_2^2, x_3^2$) on the main diagonal and splitting the cross-term coefficients across the corresponding symmetric positions:
$$\begin{bmatrix} a & d/2 & -f/2 \\ d/2 & b & g/2 \\ -f/2 & g/2 & -c \end{bmatrix} \tag{18}$$
So, generally, the rules for converting any quadratic form to a symmetric matrix $A$ are: the diagonal elements ($a_{ii}$) are the coefficients of the squared terms $x_i^2$; the off-diagonal elements ($a_{ij}$) are half of the coefficient of the $x_i x_j$ term, to ensure the matrix remains symmetric ($a_{ij} = a_{ji}$).
Therefore, we can formally define the relationship between the quadratic form Q(x) and symmetric matrix A as:
Q(x)=xTAx(19)
We can prove this is correct by rewriting xTAx for our first example of the quadratic form ax12+bx1x2+cx22.
Given vector $x = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}$ and symmetric matrix $A = \begin{bmatrix} a & b/2 \\ b/2 & c \end{bmatrix}$:
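The identity $Q(x) = x^\top A x$ can be checked numerically with made-up coefficients for $a$, $b$, $c$ and a random point $x$:

```python
import numpy as np

# Hypothetical coefficients for a*x1^2 + b*x1*x2 + c*x2^2
a, b, c = 2.0, 3.0, 4.0
A = np.array([[a, b / 2],
              [b / 2, c]])

rng = np.random.default_rng(2)
x = rng.normal(size=2)

quad_direct = a * x[0]**2 + b * x[0] * x[1] + c * x[1]**2
quad_matrix = x @ A @ x   # x^T A x

print(np.allclose(quad_direct, quad_matrix))  # True
```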
Another interesting property of the calculated eigenvalues, which is useful in machine learning, is that we can find a basis for the eigenspace associated with each eigenvalue. Given the previously calculated eigenvalues $\lambda_1 = 5 + 2\sqrt{2}$ and $\lambda_2 = 5 - 2\sqrt{2}$, we find the basis vectors by solving the homogeneous system $(A - \lambda I)v = 0$.
With the same approach, $\lambda_2 = 5 - 2\sqrt{2}$ results in:
$$v_2 = \begin{bmatrix} 1 \\ \sqrt{2} - 1 \end{bmatrix}$$
Because matrix $A$ is symmetric, these basis vectors are guaranteed to be orthogonal to each other.
4.2 Positive Definite
A symmetric matrix $A \in \mathbb{S}^n$ is positive definite iff $x^\top A x > 0$ for all non-zero vectors $x \in \mathbb{R}^n$; it is positive semidefinite (psd) if $x^\top A x \geq 0$ for all $x$. A symmetric matrix $A \in \mathbb{S}^n$ is negative definite iff $x^\top A x < 0$ for all non-zero $x \in \mathbb{R}^n$; it is negative semidefinite if $x^\top A x \leq 0$ for all $x$. Finally, a symmetric matrix $A \in \mathbb{S}^n$ is indefinite if there exist $x_1, x_2 \in \mathbb{R}^n$ such that $x_1^\top A x_1 > 0$ and $x_2^\top A x_2 < 0$, as summarized below:
Selecting a matrix factorization method for a given problem is mainly dictated by the dimensions and properties of the matrix. Matrices are usually categorized into square and rectangular matrices. For square matrices that are invertible and well-conditioned, methods such as Gauss-Jordan elimination or LU decomposition are preferred. However, a square invertible matrix may be ill-conditioned, meaning that small changes in the input data result in disproportionately large changes in the output. In these instances, Singular Value Decomposition (SVD) or regularization techniques, such as Tikhonov regularization, are used to stabilize the solution. For square singular (non-invertible) matrices, as well as rectangular matrices, the standard approach is the Moore-Penrose pseudoinverse ($A^+$). This is particularly useful for overdetermined systems, where the matrix has more rows than columns ($m > n$), and underdetermined systems, where the matrix has more columns than rows ($n > m$). In these cases, the pseudoinverse provides the best-fit solution in a least-squares sense, calculated as $x = A^+ b$. For an overdetermined system, this often involves the normal equations, where $x = (A^\top A)^{-1} A^\top b$.
With that in mind, we shall now cover the key matrix factorization methods:
LU Decomposition
Given a square matrix $A$, we want to rewrite it as $A = LU$, where $L \in \mathbb{R}^{n \times n}$ is a lower triangular matrix and $U \in \mathbb{R}^{n \times n}$ is an upper triangular matrix.
We follow an elimination process that transforms the original matrix into an upper triangular form while recording the operational history in a lower triangular matrix. For example given:
$$A = \begin{bmatrix} 2 & 1 & 1 \\ 4 & 3 & 3 \\ 8 & 7 & 9 \end{bmatrix}$$
First, we focus on the first column and use the pivot element A11=2 to eliminate the entries in the rows below it. We first calculate the multipliers by dividing the leading element of each target row by the pivot, resulting in l21=4/2=2 for the second row and l31=8/2=4 for the third row. We use multipliers to construct the L matrix later. We then execute the row operations by subtracting two times the first row from the second row (R2←R2−2R1) and four times the first row from the third row (R3←R3−4R1). This results in the first intermediate matrix, where the first column below the diagonal consists of zeros:
$$A^{(1)} = \begin{bmatrix} 2 & 1 & 1 \\ 0 & 1 & 1 \\ 0 & 3 & 5 \end{bmatrix}$$
We move to the second column and use the new pivot at $A_{22} = 1$. To eliminate the value below this pivot in the third row, we compute the multiplier $l_{32} = 3/1 = 3$. We then perform the row operation $R_3 \leftarrow R_3 - 3R_2$, subtracting three times the current second row from the third row. This results in $5 - (3 \times 1) = 2$ for the final element, completing the transformation into the upper triangular matrix $U$:
$$U = \begin{bmatrix} 2 & 1 & 1 \\ 0 & 1 & 1 \\ 0 & 0 & 2 \end{bmatrix}$$
We then construct the lower triangular matrix $L$, which stores the record of the elimination steps. We begin with an identity matrix and populate the positions below the main diagonal using the multipliers from the previous steps. That is, we place $l_{21} = 2$ and $l_{31} = 4$ in the first column and $l_{32} = 3$ in the second column, resulting in the lower triangular matrix:
$$L = \begin{bmatrix} 1 & 0 & 0 \\ 2 & 1 & 0 \\ 4 & 3 & 1 \end{bmatrix}$$
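The elimination walkthrough above can be sketched as a small Doolittle-style LU routine without pivoting (a teaching sketch assuming nonzero pivots, not what production libraries do), reproducing the $L$ and $U$ we just derived:

```python
import numpy as np

def lu_no_pivot(A):
    """Doolittle LU without pivoting; assumes all pivots are nonzero."""
    A = A.astype(float)
    n = A.shape[0]
    L, U = np.eye(n), A.copy()
    for k in range(n - 1):
        for i in range(k + 1, n):
            L[i, k] = U[i, k] / U[k, k]   # record the multiplier
            U[i] -= L[i, k] * U[k]        # eliminate below the pivot
    return L, U

A = np.array([[2., 1., 1.],
              [4., 3., 3.],
              [8., 7., 9.]])
L, U = lu_no_pivot(A)
print(L)  # [[1,0,0],[2,1,0],[4,3,1]]
print(U)  # [[2,1,1],[0,1,1],[0,0,2]]
```

Production routines (e.g. `scipy.linalg.lu`) add row pivoting for numerical stability, returning $PA = LU$.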
QR Decomposition
For QR decomposition, our objective is to factorize a rectangular matrix $A \in \mathbb{R}^{m \times n}$ or a square matrix $A \in \mathbb{R}^{n \times n}$ into the product of an orthogonal matrix $Q$ and an upper triangular matrix $R$. As a reminder, a square matrix $B$ is orthogonal if $B^\top B = BB^\top = I$, which implies $B^{-1} = B^\top$. The columns (and also the rows) of an orthogonal matrix are orthonormal: mutually orthogonal with a 2-norm (Euclidean length) of 1, as in the example below:
$$\begin{bmatrix} a & -b \\ b & a \end{bmatrix}, \quad a^2 + b^2 = 1 \tag{21}$$
While the Gram-Schmidt process is often used in math textbooks to introduce the concept of orthogonalization, it is numerically unstable because it quickly degrades due to rounding errors. NumPy implements QR factorization using Householder reflections, since each reflection preserves the Euclidean norm of the matrix columns, and a "sign choice" avoids numerical cancellation.
The general idea is to iteratively apply these reflections to smaller and smaller submatrices until only the upper triangular form remains. First, we isolate the first column of $A$, denoted $x$, and our goal is to find a reflection that maps $x$ to a multiple of the first unit vector $e_1$, thus zeroing out all entries below the first row. We compute the Householder vector $v_1 = x + \operatorname{sign}(x_1)\lVert x \rVert_2 \, e_1$. Using this vector, we construct the first Householder matrix $H_1 = I - 2\dfrac{v_1 v_1^\top}{v_1^\top v_1}$.
Multiplying the original matrix by this reflection, the first column of the resulting matrix H1A becomes:
$$H_1 A = \begin{bmatrix} -\operatorname{sign}(x_1)\lVert x \rVert & \ast & \cdots & \ast \\ 0 & \ast & \cdots & \ast \\ \vdots & \vdots & & \vdots \\ 0 & \ast & \cdots & \ast \end{bmatrix} \tag{22}$$
We repeat the process on the submatrix formed by ignoring the first row and first column of the transformed matrix, resulting in an $(m-1) \times (n-1)$ submatrix. We calculate a new Householder vector $v_2$ and a corresponding reflection matrix $\hat{H}_2$. To apply this to the full matrix, we embed it into a larger identity structure $H_2 = \operatorname{diag}(1, \hat{H}_2)$, which ensures that the second reflection zeros out the elements below the diagonal in the second column while leaving the zeros created in the first column intact.
This process continues for all columns until the matrix is fully reduced to an upper triangular form R. The final matrix R is the result of applying the sequence of all Householder reflections to the original matrix, such that R=Hn…H2H1A.
Because Householder matrices are orthogonal and symmetric (Hk=Hk⊤=Hk−1), we can solve for Q by multiplying the reflections in reverse order. The orthogonal matrix Q is thus defined as Q=H1H2…Hn. This guarantees that Q⊤Q=I and that the final product QR reconstructs the original matrix A.
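The whole procedure can be sketched in a few lines of NumPy. This is a teaching sketch of Householder QR (it assumes the leading entry of each working column is nonzero so the sign choice is well defined), not the LAPACK implementation NumPy actually calls:

```python
import numpy as np

def householder_qr(A):
    """QR via Householder reflections; assumes nonzero leading entries."""
    A = A.astype(float)
    m, n = A.shape
    Q, R = np.eye(m), A.copy()
    for k in range(min(m - 1, n)):
        x = R[k:, k]
        v = x.copy()
        v[0] += np.sign(x[0]) * np.linalg.norm(x)  # sign choice avoids cancellation
        v = v / np.linalg.norm(v)
        H_k = np.eye(m)
        H_k[k:, k:] -= 2.0 * np.outer(v, v)        # embed H_hat in the identity
        R = H_k @ R
        Q = Q @ H_k                                # H_k is symmetric orthogonal
    return Q, R

A = np.array([[2., 1.], [1., 3.], [0., 1.]])
Q, R = householder_qr(A)
print(np.allclose(Q @ R, A))   # True
```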
Eigenvalues and Eigenvectors
Given a square matrix $A$, our goal is to find directions in which the linear transformation acts as simple scaling without a change of direction; that is, $Av = \lambda v$. Here $v$ is the eigenvector (also referred to as a characteristic or latent vector), the direction that remains invariant under the transformation, and the scalar $\lambda$ is the eigenvalue, which determines the factor by which that vector is scaled. To solve for these values, we rearrange the equation into a homogeneous linear system:
(A−λI)v=0(22)
For a non-trivial solution $v \neq 0$, the matrix $(A - \lambda I)$ must be singular; that is, its determinant is 0, resulting in the characteristic equation:
det(A−λI)=0(23)
Solving this polynomial equation provides the eigenvalues. Once these are known, each corresponding eigenvector is obtained by solving the resulting linear system for $v$.
Consider matrix A:
$$A = \begin{bmatrix} 4 & 1 \\ 2 & 3 \end{bmatrix}$$
First, we subtract λ from the main diagonal elements of A:
$$A - \lambda I = \begin{bmatrix} 4 - \lambda & 1 \\ 2 & 3 - \lambda \end{bmatrix}$$
We follow by calculating the determinant of this matrix:
det(A−λI)=(4−λ)(3−λ)−(1)(2)
This results in the characteristic polynomial:
$$(12 - 4\lambda - 3\lambda + \lambda^2) - 2 = 0 \quad\Longrightarrow\quad \lambda^2 - 7\lambda + 10 = 0$$
We then factor by looking for two numbers that multiply to 10 and add to −7:
(λ−5)(λ−2)=0
Setting each factor to zero, we find the eigenvalues for the matrix A:
λ1=5,λ2=2
We can verify this with the trace and determinant checks: the trace equals the sum of the eigenvalues ($4 + 3 = 7$ and $\lambda_1 + \lambda_2 = 5 + 2 = 7$), and the determinant equals their product ($(4)(3) - (1)(2) = 10$ and $\lambda_1 \lambda_2 = 5 \times 2 = 10$).
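The same example, verified with `np.linalg.eig` and the trace/determinant checks:

```python
import numpy as np

A = np.array([[4., 1.],
              [2., 3.]])

eigvals, eigvecs = np.linalg.eig(A)

print(sorted(eigvals))                               # [2.0, 5.0]
print(np.isclose(eigvals.sum(), np.trace(A)))        # True: sum = trace = 7
print(np.isclose(eigvals.prod(), np.linalg.det(A)))  # True: product = det = 10
```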
Eigen Decomposition
Eigendecomposition (also known as eigenvalue decomposition, spectral decomposition, or diagonalization) of a square matrix is a factorization of the form $A = Q\Lambda Q^{-1}$, which gives $A^n = Q\Lambda^n Q^{-1}$. Here $Q^{-1}$ applies a change of basis, $\Lambda$ is a scaling operation, and $Q$ undoes the initial change of basis. The matrix $Q$ must be invertible to allow for the change of basis, and $\Lambda$ is a diagonal matrix $\operatorname{diag}(\lambda)$ whose diagonal entries are the eigenvalues. Unless $A$ is symmetric, the eigenvectors in $Q$ are not required to be orthogonal.
Recall the eigenvalues $\lambda_1 = 5$ and $\lambda_2 = 2$ of our previous example matrix $A = \begin{bmatrix} 4 & 1 \\ 2 & 3 \end{bmatrix}$; let's use them and their corresponding eigenvectors to calculate $A^{10}$. We first solve for the eigenvectors by substituting the eigenvalues back into $(A - \lambda I)v = 0$. For $\lambda_1 = 5$, this results in $\begin{bmatrix} -1 & 1 \\ 2 & -2 \end{bmatrix} v_1 = 0$, which yields $v_1 = \begin{bmatrix} 1 \\ 1 \end{bmatrix}$. For $\lambda_2 = 2$, $\begin{bmatrix} 2 & 1 \\ 2 & 1 \end{bmatrix} v_2 = 0$ yields $v_2 = \begin{bmatrix} 1 \\ -2 \end{bmatrix}$. Therefore the matrices $Q$ and $\Lambda$ are:
$$Q = \begin{bmatrix} 1 & 1 \\ 1 & -2 \end{bmatrix}, \quad \Lambda = \begin{bmatrix} 5 & 0 \\ 0 & 2 \end{bmatrix}$$
Applying the power to the diagonal elements of Λ would result in:
$$A^{10} = Q \begin{bmatrix} 5^{10} & 0 \\ 0 & 2^{10} \end{bmatrix} Q^{-1}$$
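We can check this computation against repeated multiplication with `np.linalg.matrix_power`:

```python
import numpy as np

A = np.array([[4., 1.],
              [2., 3.]])
Q = np.array([[1., 1.],
              [1., -2.]])
Lam = np.diag([5.0, 2.0])

# A^10 via the eigendecomposition: only the diagonal gets powered
A10 = Q @ np.diag([5.0**10, 2.0**10]) @ np.linalg.inv(Q)

print(np.allclose(Q @ Lam @ np.linalg.inv(Q), A))        # True: A = Q Lam Q^-1
print(np.allclose(A10, np.linalg.matrix_power(A, 10)))   # True
```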
Singular Value Decomposition (SVD)
Let's first state the Wikipedia definition of SVD: it is a factorization of an $m \times n$ complex matrix $M$ into the form $M = U\Sigma V^\ast$, or $M = U\Sigma V^\top$ if the matrix consists of real numbers. This definition generalizes eigendecomposition to any matrix, regardless of its shape or properties. A matrix is unitary if its conjugate transpose is also its inverse, satisfying $U^\ast U = I$. The matrix $U$ is an $m \times m$ unitary matrix whose columns, the left singular vectors, form an orthonormal basis of the output space. $V$ is an $n \times n$ unitary matrix whose columns (equivalently, the rows of $V^\top$) are the right singular vectors, forming an orthonormal basis of the input space. $\Sigma \in \mathbb{R}^{m \times n}$ is a diagonal matrix containing the non-negative singular values, ordered as $\sigma_1 \geq \sigma_2 \geq \cdots \geq 0$.
The numpy decomposition is performed using a LAPACK routine, which I don't fully understand; if you are interested in doing it on paper, the power method is much simpler and is the basis for most of the advanced SVD algorithms.
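A minimal sketch of the power method for the leading singular triplet, iterating on $M^\top M$ (the example matrix and iteration count are arbitrary choices, and this recovers only the top triplet, not the full SVD):

```python
import numpy as np

def top_singular_triplet(M, iters=500, seed=0):
    """Power iteration on M^T M to estimate the leading singular triplet."""
    rng = np.random.default_rng(seed)
    v = rng.normal(size=M.shape[1])
    v /= np.linalg.norm(v)
    for _ in range(iters):
        v = M.T @ (M @ v)          # one step of power iteration on M^T M
        v /= np.linalg.norm(v)
    sigma = np.linalg.norm(M @ v)  # leading singular value
    u = M @ v / sigma              # leading left singular vector
    return u, sigma, v

M = np.array([[3., 1.], [1., 3.], [0., 2.]])
u, sigma, v = top_singular_triplet(M)
print(np.isclose(sigma, np.linalg.svd(M, compute_uv=False)[0]))  # True
```

Subsequent triplets can be found by deflation, i.e. repeating the iteration on $M - \sigma u v^\top$.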
Cholesky Decomposition
Given a symmetric positive definite matrix $A$, the Cholesky decomposition, or "matrix square root," is a factorization $A = R^\top R$, where $R$ is an upper triangular matrix. This can also be written as $A = LL^\top$, where $L = R^\top$ is lower triangular. It is commonly used when working with the multivariate Gaussian.
We start by iteratively solving for the lower triangular matrix $L$. The first diagonal entry is the square root of the first diagonal element of $A$, $L_{11} = \sqrt{3.3821} \approx 1.8390$, and the remaining entries in this column are found by dividing the corresponding elements of $A$ by $L_{11}$: $L_{21} = 0.8784 / 1.8390 \approx 0.4776$, $L_{31} = 0.3613 / 1.8390 \approx 0.1965$, and $L_{41} = -2.0349 / 1.8390 \approx -1.1065$.
For the second column, we account for the contribution of the first column before solving for the new diagonal: $L_{22} = \sqrt{2.0068 - (0.4776)^2} \approx 1.3337$. The off-diagonal elements are then $L_{32} = (0.5587 - (0.1965)(0.4776)) / 1.3337 \approx 0.3486$ and $L_{42} = (0.1169 - (-1.1065)(0.4776)) / 1.3337 \approx 0.4839$.
In the third column, we continue the subtraction process to find $L_{33} = \sqrt{3.6656 - (0.1965^2 + 0.3486^2)} \approx 1.8722$ and the subsequent value $L_{43} = (0.7807 - ((-1.1065)(0.1965) + (0.4839)(0.3486))) / 1.8722 \approx 0.4441$. The final element is derived by subtracting the squares of all preceding elements in the last row of $L$ from the last diagonal of $A$: $L_{44} = \sqrt{2.5397 - ((-1.1065)^2 + 0.4839^2 + 0.4441^2)} \approx 0.9413$.
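Assembling the entries used in the walkthrough above into the full symmetric matrix, we can check the hand computation against `np.linalg.cholesky`:

```python
import numpy as np

# Symmetric positive definite matrix assembled from the entries above
A = np.array([[ 3.3821, 0.8784, 0.3613, -2.0349],
              [ 0.8784, 2.0068, 0.5587,  0.1169],
              [ 0.3613, 0.5587, 3.6656,  0.7807],
              [-2.0349, 0.1169, 0.7807,  2.5397]])

L = np.linalg.cholesky(A)       # lower triangular factor

print(np.isclose(L[0, 0], np.sqrt(3.3821)))  # True: matches L11 above
print(np.allclose(L @ L.T, A))               # True: L L^T reconstructs A
```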