This is going to be a long read, so why take the trouble to go through it? These concepts are the backbone of computing gradients and Hessians of an objective function in classical machine learning, and of automatic differentiation, the key technique for calculating the derivatives of neural networks that makes learning possible. So, to smoothly understand backpropagation and gradient descent, one needs a good grasp of matrix calculus. Before going through matrix calculus and automatic differentiation, we will first cover a high-level overview of analytical, numerical, and symbolic techniques for calculating derivatives and why they are inefficient for neural networks and for machine learning in general.
1. Analytical Derivatives
If you have taken any calculus class or read a calculus textbook, this is certainly what you were introduced to. The main idea is to apply the rules of scalar differentiation to an input expression or function to derive its derivative by hand. These rules are summarized below:
| Rule | Input function | Derivative | Example |
|---|---|---|---|
| Constant | $c$ | $0$ | $\frac{d}{dx}10 = 0$ |
| Multiplication by constant | $cf(x)$ | $c\frac{df}{dx}$ | $\frac{d}{dx}7x = 7$ |
| Power rule | $x^n$ | $nx^{n-1}$ | $\frac{d}{dx}x^6 = 6x^5$ |
| Sum rule | $f(x) + g(x)$ | $\frac{df}{dx} + \frac{dg}{dx}$ | $\frac{d}{dx}(x^5 + x) = 5x^4 + 1$ |
| Product rule | $f(x)g(x)$ | $f\frac{dg}{dx} + \frac{df}{dx}g$ | $\frac{d}{dx}(x^2 \cdot x) = x^2 + x \cdot 2x = 3x^2$ |
| Chain rule | $f(g(x))$ | $\frac{df}{du}\frac{du}{dx}$, where $u = g(x)$ | $\frac{d}{dx}\ln(x^2) = \frac{1}{x^2} \cdot 2x = \frac{2}{x}$ |
Scalar rules are an excellent introduction to the concept of differentiation. However, deriving gradients by hand becomes impractical for complex, high-dimensional functions. Even when an analytical solution exists, manual derivation is time-consuming and highly prone to human error. Modern applications therefore rely on computational methods, specifically numerical, symbolic, and automatic differentiation, to automate this process. We will finish the high-level overview with symbolic and numerical differentiation, then build up our understanding of matrix calculus for automatic differentiation.
2. Numerical Differentiation
Before we explore what numerical differentiation is, let's first highlight the cardinal sins of numerical analysis; we shall later see how numerical differentiation violates them.
Thou shalt not add small numbers to a big number and Thou shalt not subtract numbers which are approximately equal.
With that in mind, numerical differentiation offers a way to calculate approximate derivative values at discrete intervals, deriving its procedure from the limit definition of derivatives:
$$f'(x) = \lim_{h \to 0} \frac{f(x+h) - f(x)}{h} \tag{1}$$
Finite difference methods approximate the derivative $f'(x_i)$ using the values of $f(x)$ at discrete points separated by a uniform step size $h$, where $x_{i+1} = x_i + h$ and $x_{i-1} = x_i - h$. The truncation error results from the lowest-order term neglected in the Taylor series expansion.
Numerical methods for differentiation can be subdivided into forward, backward, and centered differences.
Forward Difference
This uses the value of the function at $x_i$ and the point immediately ahead, $x_{i+1}$, to approximate the derivative. From the Taylor series expansion above, we can derive the first derivative, $f'(x_i)$:

$$f'(x_i) \approx \frac{f(x_{i+1}) - f(x_i)}{h} + O(h)$$
More advanced techniques such as higher-order finite differences, Richardson extrapolation, and differential quadrature methods further reduce the truncation error. However, because they subtract nearly equal function values, they amplify floating-point round-off error. So while these advanced methods buy accuracy in the truncation term at higher computational cost, they remain highly prone to round-off error.
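To see both effects concretely, here is a minimal Python sketch (the test function $e^x$ and the step sizes are arbitrary choices of ours) comparing forward and centered differences. As $h$ shrinks, truncation error falls, but once $f(x+h) - f(x)$ subtracts nearly equal numbers, round-off error takes over:

```python
import numpy as np

def forward_diff(f, x, h):
    # f'(x) ~ (f(x + h) - f(x)) / h, truncation error O(h)
    return (f(x + h) - f(x)) / h

def centered_diff(f, x, h):
    # f'(x) ~ (f(x + h) - f(x - h)) / (2h), truncation error O(h^2)
    return (f(x + h) - f(x - h)) / (2 * h)

f, df = np.exp, np.exp  # exact derivative of e^x is e^x, so errors are measurable
x = 1.0
for h in (1e-1, 1e-4, 1e-8, 1e-12):
    print(f"h={h:.0e}  forward err={abs(forward_diff(f, x, h) - df(x)):.2e}"
          f"  centered err={abs(centered_diff(f, x, h) - df(x)):.2e}")
```

With a moderate $h$ both estimates improve as $h$ decreases; at $h = 10^{-12}$ the errors blow up again, which is exactly the second cardinal sin at work.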
3. Symbolic Differentiation
Symbolic differentiation is similar to calculating derivatives by hand, except that the function is first represented in a formal symbolic structure, in most cases an expression tree. In this representation, operands such as constants and variables form the leaf nodes, while operators and functions (+, -, /, *, sin, cos, log, exp) form the internal nodes of the tree. Differentiation is then carried out by applying a collection of symbolic transformation rules, just as in analytical differentiation. The computer applies these rules recursively to the tree, producing a new expression tree that represents the derivative of the original function. This is how Mathematica and the SymPy library work. While symbolic differentiation can produce exact derivatives, it suffers from expression swell, where the size of the derivative expression grows rapidly, in the worst case exponentially, with nesting and repeated differentiation, leading to inefficiencies in both computation time and memory usage. This makes symbolic differentiation impractical for complex functions or higher-order derivatives in machine learning applications.
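A small SymPy sketch illustrates both the mechanism and the swell; the particular expression is just an illustration of ours, and `count_ops` reports the size of the resulting expression tree:

```python
import sympy as sp

x = sp.symbols('x')
expr = sp.exp(x) * sp.sin(x) / (1 + x**2)

d = expr
for order in range(1, 5):
    d = sp.diff(d, x)                 # recursively rewrite the expression tree
    print(order, sp.count_ops(d))     # operation count grows with each derivative
```

Each call to `diff` returns an exact derivative, but the operation count climbs with every order, which is the expression swell described above.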
Partial Derivatives
1. Scalar-Valued Multivariable Functions
Given a scalar-valued multivariable function, we calculate its partial derivatives by applying the same rules as for ordinary derivatives. For example, given a function $f(x, y) = x^2y^2$, computing the partial derivative with respect to $x$ goes as follows:
We treat $y$ as a constant and take the derivative with respect to $x$:

$$\frac{\partial}{\partial x} x^2 y^2 = 2xy^2 \tag{17}$$
For the partial derivative with respect to $y$, we likewise treat $x$ as a constant, giving $\frac{\partial}{\partial y} x^2 y^2 = 2x^2y$. The general idea, given multiple variables, is to differentiate with respect to each variable while treating all other variables as constants.
In some use cases, it is common to take a second derivative of an ordinary derivative. For partial derivatives, it is likewise valid to take a partial derivative of a partial derivative (e.g., a second partial derivative):

$$\frac{\partial}{\partial x}\left(\frac{\partial f}{\partial x}\right) = \frac{\partial^2 f}{\partial x^2}, \quad \frac{\partial}{\partial x}\left(\frac{\partial f}{\partial y}\right) = \frac{\partial^2 f}{\partial x \partial y}, \quad \frac{\partial}{\partial y}\left(\frac{\partial f}{\partial x}\right) = \frac{\partial^2 f}{\partial y \partial x}, \quad \frac{\partial}{\partial y}\left(\frac{\partial f}{\partial y}\right) = \frac{\partial^2 f}{\partial y^2}$$
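To make this concrete, here are all four second partials of the running example $f(x, y) = x^2y^2$; note that the two mixed partials agree, as Clairaut's theorem guarantees for smooth functions:

$$\frac{\partial^2 f}{\partial x^2} = 2y^2, \qquad \frac{\partial^2 f}{\partial x \partial y} = \frac{\partial^2 f}{\partial y \partial x} = 4xy, \qquad \frac{\partial^2 f}{\partial y^2} = 2x^2$$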
We can put the resulting partial derivatives into a vector. For a scalar-valued function, this vector is commonly denoted $\nabla f$ and is formally called the gradient; for our example, $\nabla f(x, y) = (2xy^2,\ 2x^2y)$.
The gradient of $f$ points in the direction of steepest ascent: moving along the gradient increases $f$ fastest and, followed repeatedly, leads toward a local maximum of the function. The magnitude of the vector $\nabla f(x, y, \ldots)$ tells you the slope of the function in that direction.
2. Directional Derivatives
The first partial derivatives of $f(x, y)$ tell us the slope of the tangent line at a given point in the positive $x$ direction and the positive $y$ direction, i.e., parallel to the coordinate axes. When we need the slope in an arbitrary direction, we use directional derivatives, a generalization that measures the slope along any direction. We specify the direction with a vector, e.g., $\mathbf{v} = (v_1, v_2, \ldots, v_n)$, and the common mathematical notation for the directional derivative is $\nabla_{\mathbf{v}} f$.
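For a differentiable $f$, the directional derivative reduces to a dot product with the gradient (a standard identity); with the running example $f(x, y) = x^2y^2$, the unit direction $\mathbf{v} = \left(\tfrac{1}{\sqrt{2}}, \tfrac{1}{\sqrt{2}}\right)$, and the point $(1, 1)$:

$$\nabla_{\mathbf{v}} f = \nabla f \cdot \mathbf{v}, \qquad \nabla_{\mathbf{v}} f(1, 1) = (2,\ 2) \cdot \left(\tfrac{1}{\sqrt{2}},\ \tfrac{1}{\sqrt{2}}\right) = \frac{4}{\sqrt{2}} = 2\sqrt{2}$$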
A vector-valued function is a mapping $f: \mathbb{R}^n \to \mathbb{R}^m$. Instead of producing a single scalar value, the function maps an input vector $\mathbf{x} = (x_1, x_2, \ldots, x_n)$ to an output vector $f(\mathbf{x}) = (f_1(\mathbf{x}), f_2(\mathbf{x}), \ldots, f_m(\mathbf{x}))$. When the partial derivatives of a vector-valued function are collected together, the result is called the Jacobian. For an $m$-dimensional output $\mathbf{y} = f(\mathbf{x})$, the Jacobian expands into:

$$\mathbf{J} = \frac{\partial \mathbf{y}}{\partial \mathbf{x}} = \begin{bmatrix} \frac{\partial y_1}{\partial x_1} & \cdots & \frac{\partial y_1}{\partial x_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial y_m}{\partial x_1} & \cdots & \frac{\partial y_m}{\partial x_n} \end{bmatrix}$$
For scalar-valued functions, i.e., $f: \mathbb{R}^n \to \mathbb{R}$ taking $\mathbf{x} \in \mathbb{R}^n$ and outputting a scalar $f(\mathbf{x}) \in \mathbb{R}$, the Hessian matrix $\mathbf{H}$ of $f$ is a square $n \times n$ matrix generated from the second-order partial derivatives, summarized as:

$$\mathbf{H}_{ij} = \frac{\partial^2 f}{\partial x_i \partial x_j}, \qquad \mathbf{H} = \begin{bmatrix} \frac{\partial^2 f}{\partial x_1^2} & \cdots & \frac{\partial^2 f}{\partial x_1 \partial x_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial^2 f}{\partial x_n \partial x_1} & \cdots & \frac{\partial^2 f}{\partial x_n^2} \end{bmatrix}$$
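A quick SymPy sketch builds both objects for small symbolic functions; the particular functions are just illustrations of ours:

```python
import sympy as sp

x, y = sp.symbols('x y')

# Jacobian of a vector-valued function f: R^2 -> R^2
F = sp.Matrix([x**2 * y, sp.sin(x) + y**3])
J = F.jacobian([x, y])          # 2 x 2 matrix of first partials
print(J)

# Hessian of a scalar-valued function f: R^2 -> R
f = x**2 * y**2
H = sp.hessian(f, [x, y])       # 2 x 2 matrix of second partials
print(H)
```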
For Equation 27, each output dimension depends exclusively on the corresponding input dimension, i.e., $y_i = w_i x_i$. This independence means that no coordinate influences any other across dimensions. As a result, the associated linear operator is diagonal, and each dimension is processed independently:

$$y_1 = w_1 x_1, \quad y_2 = w_2 x_2, \quad \ldots, \quad y_n = w_n x_n$$
Each output dimension is influenced exclusively by the corresponding input dimension, with no cross-dimensional interactions. As a result, the Jacobian matrix is diagonal:

$$\frac{\partial \mathbf{y}}{\partial \mathbf{x}} = \begin{bmatrix} w_1 & 0 & \cdots & 0 \\ 0 & w_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & w_n \end{bmatrix}$$
Given a function $y = \mathbf{w} \cdot \mathbf{x}$, we can rewrite the dot product as $\mathbf{w}^T\mathbf{x}$ or $\mathbf{x}^T\mathbf{w}$ when the two vectors $\mathbf{w}$ and $\mathbf{x}$ have equal length. We can also write it as $\sum_{i=1}^{n} w_i x_i$, which equals $\sum(\mathbf{w} \odot \mathbf{x})$, where $\odot$ denotes the element-wise product.
This can finally be rewritten as $y = \sum(\mathbf{v})$, with $\mathbf{v} = \mathbf{w} \odot \mathbf{x}$ as an intermediate simplification, which lets us easily apply the chain rule to find the derivative.
If $y = \sum_{i=1}^{n} v_i = \sum(\mathbf{v})$, then we first need to find $\frac{\partial \mathbf{v}}{\partial \mathbf{x}}$:

$$\frac{\partial v_i}{\partial x_j} = \frac{\partial}{\partial x_j}(w_i x_i) = \begin{cases} w_i & i = j \\ 0 & i \neq j \end{cases} \quad\Rightarrow\quad \frac{\partial \mathbf{v}}{\partial \mathbf{x}} = \operatorname{diag}(\mathbf{w})$$

Then, since $\frac{\partial y}{\partial v_i} = 1$ for every $i$, the chain rule gives

$$\frac{\partial y}{\partial \mathbf{x}} = \frac{\partial y}{\partial \mathbf{v}} \frac{\partial \mathbf{v}}{\partial \mathbf{x}} = \mathbf{1}^T \operatorname{diag}(\mathbf{w}) = \mathbf{w}^T$$
In general, to compute derivatives involving scalar-vector or vector-vector expressions, one first identifies the dimensions of all input vectors and matrices to determine the form of the resulting expression. The function is then written explicitly in terms of its scalar components, noting that operations such as vector multiplication may yield either a scalar or a matrix depending on the context. Differentiation is performed component-wise with respect to each element of the variable vector, and the resulting partial derivatives are finally assembled into a vector or matrix that matches the appropriate derivative structure (e.g., a gradient or Jacobian).
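As a sanity check, a short sketch (using JAX's `grad` purely as an illustration; the values of `w` and `x` are our own choices) confirms that the gradient of $y = \mathbf{w} \cdot \mathbf{x}$ with respect to $\mathbf{x}$ is $\mathbf{w}$:

```python
import jax
import jax.numpy as jnp

w = jnp.array([2.0, -1.0, 3.0])
x = jnp.array([0.5, 4.0, 1.0])

y = lambda x: jnp.dot(w, x)     # y = sum(w ⊙ x), a scalar

grad_x = jax.grad(y)(x)         # reverse-mode autodiff
print(grad_x)                   # [ 2. -1.  3.] — matches w, i.e., dy/dx = wᵀ
```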
5. Matrix Derivatives
As you might have observed above, scalars and vectors are denoted using lower-case symbols ($a$, $x$, etc.). For matrices, however, we usually use bold upper-case symbols ($\mathbf{X}$, $\mathbf{A}$, $\mathbf{B}$, etc.), with each element indexed using subscripts, such as $x_i$ $(i \in 1, \ldots, n)$ for vectors or $X_{ij}$ for matrices. The rules of vector calculus from the previous sections carry over to matrix differentials to a great extent, though some, like the differentials of determinants and inverses, are quite different and unfamiliar unless one has had prior exposure to matrix calculus.
Some basic rules for matrix differentials: for a constant matrix $\mathbf{A}$, $d\mathbf{A} = \mathbf{0}$, and $d(x\mathbf{A}) = (dx)\,\mathbf{A}$ for a scalar variable $x$; $d(\mathbf{A}^T) = (d\mathbf{A})^T$; $d\,\mathrm{tr}(\mathbf{A}) = \mathrm{tr}(d\mathbf{A})$; if the matrix product $\mathbf{A}\mathbf{B}$ is defined, then $d(\mathbf{A}\mathbf{B}) = (d\mathbf{A})\mathbf{B} + \mathbf{A}\,d\mathbf{B}$; and if $\mathbf{A}$ and $\mathbf{B}$ are of the same order, then $d(\mathbf{A} + \mathbf{B}) = d\mathbf{A} + d\mathbf{B}$.
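As a quick application of these rules (a standard identity, stated here as a worked example): for a variable matrix $\mathbf{X}$,

$$d(\mathbf{X}^T\mathbf{X}) = d(\mathbf{X}^T)\,\mathbf{X} + \mathbf{X}^T\,d\mathbf{X} = (d\mathbf{X})^T\mathbf{X} + \mathbf{X}^T\,d\mathbf{X}$$

which combines the product rule with the transpose rule.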
5.1 Matrix Products
5.1.1 Derivative of Scalar by Matrix
Given $f(\mathbf{X})$ that maps a matrix $\mathbf{X} \in \mathbb{R}^{m \times n}$ to a scalar value, computing the derivative requires us to consider how the scalar output changes with respect to each individual element of the input matrix. Since $f(\mathbf{X})$ results in a scalar, we can compute the partial derivative with respect to each individual element $X_{i,j}$: $\frac{\partial f}{\partial X_{1,1}}$ represents how $f$ changes when only $X_{1,1}$ changes, and likewise for every element in the matrix.
All the partial derivatives are collected into a gradient matrix:

$$\frac{\partial f}{\partial \mathbf{X}} = \begin{bmatrix} \frac{\partial f}{\partial X_{1,1}} & \cdots & \frac{\partial f}{\partial X_{1,n}} \\ \vdots & \ddots & \vdots \\ \frac{\partial f}{\partial X_{m,1}} & \cdots & \frac{\partial f}{\partial X_{m,n}} \end{bmatrix}$$
The gradient has the same dimensions as the input matrix, i.e. m×n. This dimensional consistency is particularly useful in machine learning applications where we need to update matrix parameters using their gradients.
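A minimal sketch of this shape consistency, again using JAX's `grad` as an illustration; the function $f(\mathbf{X}) = \sum_{i,j} X_{i,j}^2$, whose gradient is $2\mathbf{X}$, is our own choice:

```python
import jax
import jax.numpy as jnp

X = jnp.array([[1.0, 2.0, 3.0],
               [4.0, 5.0, 6.0]])        # X ∈ R^{2×3}

f = lambda X: jnp.sum(X ** 2)           # scalar-valued function of a matrix

G = jax.grad(f)(X)                      # gradient has the same 2×3 shape as X
print(G)                                # equals 2 * X
```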
5.1.2 Derivative of Matrix by Scalar
Given $\mathbf{F}(x)$ that maps a scalar $x \in \mathbb{R}$ to a matrix $\mathbf{F}(x) \in \mathbb{R}^{p \times q}$, computing the derivative with respect to the scalar input requires us to consider how each individual element of the output matrix changes:

$$\frac{\partial \mathbf{F}}{\partial x} = \begin{bmatrix} \frac{\partial F_{1,1}}{\partial x} & \cdots & \frac{\partial F_{1,q}}{\partial x} \\ \vdots & \ddots & \vdots \\ \frac{\partial F_{p,1}}{\partial x} & \cdots & \frac{\partial F_{p,q}}{\partial x} \end{bmatrix}$$
The derivative of a matrix by a scalar always matches the dimensions of the output matrix $\mathbf{F}(x)$, i.e., $p \times q$. This is the matrix generalization of taking the derivative of a vector-valued function with respect to a scalar.
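A tiny worked example (the particular entries are just an illustration of ours): for

$$\mathbf{F}(x) = \begin{bmatrix} x^2 & \sin x \\ e^x & 1 \end{bmatrix}, \qquad \frac{\partial \mathbf{F}}{\partial x} = \begin{bmatrix} 2x & \cos x \\ e^x & 0 \end{bmatrix}$$

each entry is differentiated independently, and the result keeps the $2 \times 2$ shape of $\mathbf{F}$.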
5.1.3 Derivative of Vector by Matrix
Given $f(\mathbf{X})$ that maps a matrix $\mathbf{X} \in \mathbb{R}^{n \times m}$ to a vector $f(\mathbf{X}) \in \mathbb{R}^{p \times 1}$, its derivative is taken with respect to each element of the input matrix, resulting in a 3rd-order tensor with dimensions $p \times n \times m$. We take the individual derivatives and collect them together like this:

$$\frac{\partial f}{\partial \mathbf{X}} = \begin{bmatrix} \frac{\partial f}{\partial X_{1,1}} & \cdots & \frac{\partial f}{\partial X_{1,m}} \\ \vdots & \ddots & \vdots \\ \frac{\partial f}{\partial X_{n,1}} & \cdots & \frac{\partial f}{\partial X_{n,m}} \end{bmatrix}, \qquad \text{each } \frac{\partial f}{\partial X_{i,j}} \in \mathbb{R}^p$$
The derivative structure maintains the same n×m layout as the input matrix, but each entry is now a vector rather than a scalar. This creates a 3rd-order tensor with dimensions p×n×m, though it's often more intuitive to think of it as p separate n×m gradient matrices stacked together.
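A short sketch of the resulting tensor shape, using JAX's `jacobian` as an illustration; the row-sum function is our own choice:

```python
import jax
import jax.numpy as jnp

n, m = 3, 2
X = jnp.arange(1.0, n * m + 1.0).reshape(n, m)   # X ∈ R^{3×2}

f = lambda X: jnp.sum(X, axis=1)                 # maps R^{3×2} -> R^3, so p = 3

J = jax.jacobian(f)(X)
print(J.shape)                                   # (3, 3, 2) — a p × n × m tensor
```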
5.1.4 Derivative of Matrix by Vector
With a function $\mathbf{F}(\mathbf{x})$ that maps a vector $\mathbf{x} \in \mathbb{R}^{m \times 1}$ to a matrix $\mathbf{F}(\mathbf{x}) \in \mathbb{R}^{p \times q}$, we calculate the derivative by considering how each element of the output matrix changes with respect to each component of the input vector. The result is again a 3rd-order tensor, with dimensions $p \times q \times m$.
$$\frac{\partial \mathbf{F}}{\partial \mathbf{x}} = \begin{bmatrix} \frac{\partial \mathbf{F}}{\partial x_1} & \frac{\partial \mathbf{F}}{\partial x_2} & \cdots & \frac{\partial \mathbf{F}}{\partial x_m} \end{bmatrix} \tag{40}$$
Each entry $\frac{\partial \mathbf{F}}{\partial x_k}$ is itself a $p \times q$ matrix showing how all elements of $\mathbf{F}$ change with respect to the $k$-th component of $\mathbf{x}$. We can thus view the derivative as a sequence of $m$ matrices, each of size $p \times q$, laid out in a row.
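A small worked example: for $\mathbf{x} \in \mathbb{R}^2$ and $\mathbf{F}(\mathbf{x}) = \mathbf{x}\mathbf{x}^T$ (our own choice of function),

$$\mathbf{F}(\mathbf{x}) = \begin{bmatrix} x_1^2 & x_1 x_2 \\ x_2 x_1 & x_2^2 \end{bmatrix}, \qquad \frac{\partial \mathbf{F}}{\partial x_1} = \begin{bmatrix} 2x_1 & x_2 \\ x_2 & 0 \end{bmatrix}, \qquad \frac{\partial \mathbf{F}}{\partial x_2} = \begin{bmatrix} 0 & x_1 \\ x_1 & 2x_2 \end{bmatrix}$$

two $2 \times 2$ slices that together form the $2 \times 2 \times 2$ derivative tensor.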
Conclusion
This blog has covered the foundational concepts of derivatives and their applications in machine learning, from single-variable differentiation and the limitations of numerical and symbolic methods to partial derivatives, gradients, Jacobians, Hessians, and the complexities of matrix and tensor derivatives, laying the groundwork for understanding automatic differentiation.