Published on
30/10/2019

Derivatives, Partial Derivatives, Vector and Matrix Calculus

This is going to be a long read, so why should one take the trouble to go through it? These concepts are the backbone of computing gradients and Hessians of an objective function in classical machine learning, and of automatic differentiation, a key technique for calculating the derivatives of neural networks that is essential for the networks to learn. So, to understand backpropagation and gradient descent smoothly, one needs a good understanding of matrix calculus. Before going through matrix calculus and automatic differentiation, we shall first cover a high-level overview of analytical, numerical, and symbolic techniques for calculating derivatives, and why they are inefficient for neural networks and for machine learning in general.

1. Analytical Derivatives

If you have taken any calculus class or read a calculus textbook, this is certainly what you were introduced to. The main idea is to apply the rules of scalar derivatives to an input expression or function to derive its derivative by hand. These rules are summarized below:

\begin{array}{lcll} & \textbf{Input Function} & \textbf{Rule} & \textbf{Example} \\ \text{Constant} & c & 0 & \frac{d}{dx}10 = 0 \\ \\ \text{Multiplication} & cf(x) & c\frac{df}{dx} & \frac{d}{dx}7x = 7 \\ \text{by constant} & & & \\ \\ \text{Power Rule} & x^n & nx^{n-1} & \frac{d}{dx}x^6 = 6x^5 \\ \\ \text{Sum Rule} & f(x) + g(x) & \frac{df}{dx} + \frac{dg}{dx} & \frac{d}{dx}(x^5 + x) = 5x^4 + 1 \\ \\ \text{Product Rule} & f(x)g(x) & f\frac{dg}{dx} + \frac{df}{dx}g & \frac{d}{dx}(x^2 \cdot x) = x^2 \cdot 1 + 2x \cdot x = 3x^2 \\ \\ \text{Chain Rule} & f(g(x)) & \frac{df}{du}\frac{du}{dx}, \text{ let } u = g(x) & \frac{d}{dx}\ln(x^2) = \frac{1}{x^2}2x = \frac{2}{x} \\ \end{array}

Scalar rules are an excellent introduction to the concept of differentiation. However, deriving gradients by hand becomes impractical for complex, high-dimensional functions. Even if an analytical solution exists, manual derivation is time-consuming and highly prone to human error. Modern applications rely on computational methods, specifically numerical, symbolic, and automatic differentiation, to automate this process. We will finish the high-level overview with symbolic and numerical differentiation, then build up our understanding of matrix calculus for automatic differentiation.

2. Numerical Differentiation

Before we explore what numerical differentiation is, let's first highlight the cardinal sins of numerical analysis; we shall later see how numerical differentiation violates them.

Thou shalt not add small numbers to a big number and Thou shalt not subtract numbers which are approximately equal.
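To see why these commandments matter, here is a minimal Python sketch of both failure modes in standard IEEE-754 double precision (the specific constants are just illustrative choices):

```python
# Sin #1: adding a small number to a big one. Near 1e16 the spacing between
# adjacent doubles is 2.0, so the added 1.0 is absorbed entirely.
big = 1e16
absorbed = (big + 1.0 == big)

# Sin #2: subtracting nearly equal numbers. The leading digits cancel,
# leaving mostly rounding noise in the result.
tiny = 1e-15
recovered = (1.0 + tiny) - 1.0
relative_error = abs(recovered - tiny) / tiny  # roughly 10% error here
```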

With that in mind, numerical differentiation offers a way to calculate approximate derivative values at discrete intervals, deriving its procedure from the limit definition of derivatives:

f'(x) = \lim_{h \to 0} \frac{f(x+h) - f(x)}{h} \tag{1}

Finite difference methods approximate the derivative f'(x_i) using the values of f(x) at discrete points separated by a uniform step size h, where x_{i+1} = x_i + h and x_{i-1} = x_i - h. The truncation error results from the lowest-order terms neglected from the Taylor series expansion:

f(x_{i+1}) = f(x_i) + hf'(x_i) + \frac{h^2}{2!}f''(x_i) + \frac{h^3}{3!}f'''(x_i) + O(h^4) \tag{2}

f(x_{i-1}) = f(x_i) - hf'(x_i) + \frac{h^2}{2!}f''(x_i) - \frac{h^3}{3!}f'''(x_i) + O(h^4) \tag{3}

Numerical methods for differentiation can be subdivided into Forward, Backward, and Centered Differences.

Forward Difference

This uses the value of the function at x_i and the point immediately ahead, x_{i+1}, to approximate the derivative. Solving Equation 2 above for the first derivative, f'(x_i):

hf'(x_i) = f(x_{i+1}) - f(x_i) - \frac{h^2}{2}f''(x_i) - O(h^3) \tag{4}

f'(x_i) = \frac{f(x_{i+1}) - f(x_i)}{h} - \underbrace{\left(\frac{h}{2}f''(x_i) + O(h^2)\right)}_{\text{Truncation Error}} \tag{5}

f'(x_i) \approx \frac{f(x_{i+1}) - f(x_i)}{h} \tag{6}
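As a quick sanity check, the forward-difference formula in equation (6) is a one-liner; the test function and step size below are arbitrary illustrative choices:

```python
def forward_diff(f, x, h=1e-6):
    """Forward-difference approximation of f'(x), as in equation (6)."""
    return (f(x + h) - f(x)) / h

# f(x) = x^3 has f'(2) = 12; the approximation should be close to that.
approx = forward_diff(lambda t: t**3, 2.0)
```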

Backward Difference

This instead uses x_i and the preceding point x_{i-1}. Its approximation is derived by solving Equation 3 for f'(x_i):

hf'(x_i) = f(x_i) - f(x_{i-1}) + \frac{h^2}{2}f''(x_i) + O(h^3) \tag{7}

f'(x_i) = \frac{f(x_i) - f(x_{i-1})}{h} + \underbrace{\left( \frac{h}{2}f''(x_i) + O(h^2) \right)}_{\text{Truncation Error}} \tag{8}

f'(x_i) \approx \frac{f(x_i) - f(x_{i-1})}{h} \tag{9}

Centered Difference

We use symmetric information from both sides of x_i, using x_{i-1} and x_{i+1} (subtracting Equation 3 from Equation 2) to eliminate the even-order terms, most notably the f''(x_i) term:

f(x_{i+1}) - f(x_{i-1}) = 2hf'(x_i) + 2\frac{h^3}{3!}f'''(x_i) + O(h^5) \tag{10}

2hf'(x_i) = f(x_{i+1}) - f(x_{i-1}) - \frac{h^3}{3}f'''(x_i) - O(h^5) \tag{11}

f'(x_i) = \frac{f(x_{i+1}) - f(x_{i-1})}{2h} - \underbrace{\left( \frac{h^2}{6}f'''(x_i) + O(h^4) \right)}_{\text{Truncation Error}} \tag{12}

f'(x_i) \approx \frac{f(x_{i+1}) - f(x_{i-1})}{2h} \tag{13}
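A small Python sketch (with sin as an arbitrary test function) makes the accuracy gap visible: at the same step size, the centered difference's O(h^2) truncation error beats the forward difference's O(h) error:

```python
import math

def forward_diff(f, x, h):
    return (f(x + h) - f(x)) / h

def centered_diff(f, x, h):
    return (f(x + h) - f(x - h)) / (2 * h)

f, x, h = math.sin, 1.0, 1e-4
exact = math.cos(x)
err_forward = abs(forward_diff(f, x, h) - exact)    # O(h) truncation error
err_centered = abs(centered_diff(f, x, h) - exact)  # O(h^2) truncation error
```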

Taylor series expansions can be extended to higher-order derivatives. For example, adding Equations 2 and 3 yields the second derivative approximation:

f(x_{i+1}) + f(x_{i-1}) = 2f(x_i) + h^2f''(x_i) + O(h^4) \tag{14}

f''(x_i) = \frac{f(x_{i+1}) - 2f(x_i) + f(x_{i-1})}{h^2} + O(h^2) \tag{15}

f''(x_i) \approx \frac{f(x_{i+1}) - 2f(x_i) + f(x_{i-1})}{h^2} \tag{16}
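Equation (16) translates directly to code; the test function x^4 below is an arbitrary choice with a known second derivative:

```python
def second_diff(f, x, h=1e-4):
    """Centered approximation of f''(x), as in equation (16)."""
    return (f(x + h) - 2 * f(x) + f(x - h)) / h**2

# f(x) = x^4 has f''(x) = 12 x^2, so f''(2) = 48.
approx = second_diff(lambda t: t**4, 2.0)
```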

More advanced techniques such as higher-order finite differences, Richardson extrapolation, and differential quadrature methods further reduce the truncation error. However, they involve more function evaluations and more subtractions of nearly equal quantities, which increases floating-point round-off error. Therefore, while advanced methods cost more computationally, they also remain highly prone to round-off error.
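The tension between truncation and round-off error can be seen empirically by sweeping the step size; the choice of exp at x = 0 below is just a convenient test case whose exact derivative is 1:

```python
import math

# Forward difference of exp at x = 0 (exact derivative: 1.0) as h shrinks.
errors = []
for k in range(1, 15):
    h = 10.0 ** (-k)
    errors.append(abs((math.exp(h) - 1.0) / h - 1.0))

# The error first falls with h (truncation-dominated), then rises again once
# exp(h) - 1.0 cancels most significant digits (round-off-dominated).
```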

3. Symbolic Differentiation

Symbolic differentiation is similar to calculating derivatives by hand, except that the function is first represented in a formal symbolic structure, in most cases an expression tree. In this representation, operands such as constants and variables form the leaf nodes, while operators and functions (+, -, /, *, sin, cos, log, exp) form the internal nodes of the tree. Differentiation is then carried out by applying a collection of symbolic transformation rules, as in analytical differentiation. These rules are applied recursively to the tree by the computer, producing a new expression tree that represents the derivative of the original function. This is how Mathematica and SymPy work. While symbolic differentiation can produce exact derivatives, it suffers from expression swell, where the size of the derivative expression can grow exponentially as the rules are applied, leading to inefficiencies in both computation time and memory usage. This makes symbolic differentiation impractical for complex functions or higher-order derivatives in machine learning applications.
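To make the idea concrete, here is a toy (and deliberately simplified) recursive differentiator over tuple-encoded expression trees, supporting only constants, one variable, addition, and multiplication. Note how the product rule duplicates subtrees, which is exactly the source of expression swell:

```python
# Expressions are tuples: ('const', c), ('var',), ('add', a, b), ('mul', a, b).

def diff(e):
    """Recursively apply the symbolic rules from the table in section 1."""
    op = e[0]
    if op == 'const':
        return ('const', 0)
    if op == 'var':
        return ('const', 1)
    if op == 'add':                      # sum rule
        return ('add', diff(e[1]), diff(e[2]))
    if op == 'mul':                      # product rule: (fg)' = f'g + fg'
        f, g = e[1], e[2]
        return ('add', ('mul', diff(f), g), ('mul', f, diff(g)))
    raise ValueError('unknown node: %r' % (op,))

def evaluate(e, x):
    op = e[0]
    if op == 'const':
        return e[1]
    if op == 'var':
        return x
    if op == 'add':
        return evaluate(e[1], x) + evaluate(e[2], x)
    if op == 'mul':
        return evaluate(e[1], x) * evaluate(e[2], x)

def size(e):
    """Number of nodes in the expression tree."""
    return 1 + sum(size(c) for c in e[1:] if isinstance(c, tuple))

# d/dx (x*x + x) = 2x + 1, so the derivative evaluated at x = 3 is 7.
expr = ('add', ('mul', ('var',), ('var',)), ('var',))
deriv = diff(expr)
```

Even for this tiny input, `size(deriv)` already exceeds `size(expr)`; repeated differentiation compounds the growth.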

Partial Derivatives

1. Scalar-Valued Multivariable Functions

Given a scalar-valued multivariable function, to calculate its derivative we simply apply the same rules as for ordinary derivatives. For example, given a function f(x, y) = x^2 y^2, computing the partial derivative goes as follows:

\underbrace{\frac{\partial}{\partial x} x^2 y^2}_{\substack{\text{We treat } y \text{ as a constant } \\ \text{and take the derivative of } x}} = 2xy^2 \tag{17}

For the partial derivative with respect to y, we likewise treat x as a constant. Therefore, the general idea given multiple variables is to take the derivative with respect to each variable while treating all other variables as constants.

In some use cases, it is common to take a second derivative of an ordinary derivative. For partial derivatives, it is likewise valid to take a partial derivative of a partial derivative (i.e., a second partial derivative): \frac{\partial}{\partial x} \left( \frac{\partial f}{\partial x} \right) = \frac{\partial^2 f}{\partial x^2}, \quad \frac{\partial}{\partial x} \left( \frac{\partial f}{\partial y} \right) = \frac{\partial^2 f}{\partial x \partial y}, \quad \frac{\partial}{\partial y} \left( \frac{\partial f}{\partial x} \right) = \frac{\partial^2 f}{\partial y \partial x}, \quad \frac{\partial}{\partial y} \left( \frac{\partial f}{\partial y} \right) = \frac{\partial^2 f}{\partial y^2}.

We can put the resultant partial derivatives into a vector. This vector is commonly denoted \nabla f for a scalar-valued function and is formally called the gradient.

\nabla f(x, y, z, \dots) = \begin{bmatrix} \frac{\partial f}{\partial x}(x, y, z, \dots) \\ \frac{\partial f}{\partial y}(x, y, z, \dots) \\ \frac{\partial f}{\partial z}(x, y, z, \dots) \\ \vdots \end{bmatrix} \tag{18}

The gradient of f points in the direction of steepest ascent. Moving along the gradient takes you toward a local maximum of the function, and the magnitude of \nabla f(x, y, \dots) tells you the slope of the function in that direction.
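A minimal NumPy sketch of a numerical gradient, applying the centered difference component-wise (the function f below is the x^2 y^2 example from equation (17)):

```python
import numpy as np

def gradient(f, x, h=1e-6):
    """Centered-difference approximation of the gradient in equation (18)."""
    g = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = h
        g[i] = (f(x + e) - f(x - e)) / (2 * h)
    return g

# f(x, y) = x^2 y^2 has gradient [2 x y^2, 2 x^2 y], so at (1, 2) it is [8, 4].
f = lambda v: v[0]**2 * v[1]**2
g = gradient(f, np.array([1.0, 2.0]))
```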

2. Directional Derivatives

The partial derivatives of f(x, y) tell us the slope of the tangent line at a given point only in the positive x direction or the positive y direction, i.e. parallel to the coordinate axes. When we need the slope in an arbitrary direction, we use directional derivatives, which generalize partial derivatives to calculate the slope along any direction or axis. We specify the direction with a vector, e.g. \vec{v} = (v_1, v_2, \dots, v_n), and the common notation in math for the directional derivative is \nabla_{\vec{v}} f.

\nabla_{\vec{v}} f = \nabla f \cdot \vec{v} \tag{19}

If f(v, w, x, y, z) and \vec{v} = [v_1, v_2, v_3, v_4, v_5], then:

\nabla_{\vec{v}} f = v_1 \frac{\partial f}{\partial v} + v_2 \frac{\partial f}{\partial w} + v_3 \frac{\partial f}{\partial x} + v_4 \frac{\partial f}{\partial y} + v_5 \frac{\partial f}{\partial z}. \tag{20}

And in vector notation:

\begin{bmatrix} \frac{\partial f}{\partial v} \\ \frac{\partial f}{\partial w} \\ \frac{\partial f}{\partial x} \\ \frac{\partial f}{\partial y} \\ \frac{\partial f}{\partial z} \end{bmatrix} \cdot \begin{bmatrix} v_1 \\ v_2 \\ v_3 \\ v_4 \\ v_5 \end{bmatrix}. \tag{21}
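As a small worked example (using the gradient of f = x^2 y^2 at (1, 2), computed analytically), the directional derivative is just a dot product:

```python
import numpy as np

grad = np.array([8.0, 4.0])            # gradient of x^2 y^2 at (1, 2)
v = np.array([1.0, 1.0]) / np.sqrt(2)  # unit vector halfway between +x and +y

slope = np.dot(grad, v)                # equation (19)
```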

3. Jacobian, Hessian and Laplacian

3.1 Jacobian

A vector-valued function is a mapping f : \mathbb{R}^n \to \mathbb{R}^m. Instead of producing a single scalar value, the function maps an input vector \mathbf{x} = (x_1, x_2, \dots, x_n) to an output vector f(\mathbf{x}) = (f_1(\mathbf{x}), f_2(\mathbf{x}), \dots, f_m(\mathbf{x})). When the partial derivatives of a vector-valued function are collected together, the result is called the Jacobian. Writing the m-dimensional output as \vec{y} = \vec{f}(\vec{x}), we can expand:

\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_m \end{bmatrix} = \begin{bmatrix} f_1(\vec{x}) \\ f_2(\vec{x}) \\ \vdots \\ f_m(\vec{x}) \end{bmatrix} = \begin{bmatrix} f_1(x_1, x_2, \dots, x_n) \\ f_2(x_1, x_2, \dots, x_n) \\ \vdots \\ f_m(x_1, x_2, \dots, x_n) \end{bmatrix} \tag{22}

Therefore the Jacobian matrix collects the partial derivatives of each component:

\frac{\partial \vec{y}}{\partial \vec{x}} = \begin{bmatrix} \frac{\partial f_1}{\partial \vec{x}} \\ \frac{\partial f_2}{\partial \vec{x}} \\ \vdots \\ \frac{\partial f_m}{\partial \vec{x}} \end{bmatrix} = \begin{bmatrix} \frac{\partial f_1}{\partial x_1} & \frac{\partial f_1}{\partial x_2} & \cdots & \frac{\partial f_1}{\partial x_n} \\ \frac{\partial f_2}{\partial x_1} & \frac{\partial f_2}{\partial x_2} & \cdots & \frac{\partial f_2}{\partial x_n} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial f_m}{\partial x_1} & \frac{\partial f_m}{\partial x_2} & \cdots & \frac{\partial f_m}{\partial x_n} \end{bmatrix} \tag{23}
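A straightforward (if inefficient) way to approximate the Jacobian is one centered difference per input coordinate; the test function below is an arbitrary illustrative choice:

```python
import numpy as np

def jacobian(f, x, h=1e-6):
    """Row i holds the partial derivatives of f_i, as in equation (23)."""
    y0 = np.asarray(f(x))
    J = np.zeros((y0.size, x.size))
    for j in range(x.size):
        e = np.zeros_like(x)
        e[j] = h
        J[:, j] = (np.asarray(f(x + e)) - np.asarray(f(x - e))) / (2 * h)
    return J

# f(x) = [x0*x1, sin(x0)] has J = [[x1, x0], [cos(x0), 0]].
f = lambda v: np.array([v[0] * v[1], np.sin(v[0])])
J = jacobian(f, np.array([1.0, 2.0]))
```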

3.2 Hessian

For scalar-valued functions, i.e. f : \mathbb{R}^n \to \mathbb{R}, which take in \vec{x} \in \mathbb{R}^n and output a scalar f(\vec{x}) \in \mathbb{R}, the Hessian matrix \mathbf{H} of f is a square n \times n matrix generated from the second-order partial derivatives:

\begin{bmatrix} \frac{\partial^2 f}{\partial x_1^2} & \frac{\partial^2 f}{\partial x_1 \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_1 \partial x_n} \\ \frac{\partial^2 f}{\partial x_2 \partial x_1} & \frac{\partial^2 f}{\partial x_2^2} & \cdots & \frac{\partial^2 f}{\partial x_2 \partial x_n} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial^2 f}{\partial x_n \partial x_1} & \frac{\partial^2 f}{\partial x_n \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_n^2} \end{bmatrix} \tag{24}

3.3 Laplacian

When we sum the diagonal elements of the Hessian matrix, i.e. take the trace of the Hessian, we get the Laplacian operator \nabla^2 f, which is a scalar value:

\nabla^2 f = \operatorname{trace}(\mathbf{H}) = \frac{\partial^2 f}{\partial x_1^2} + \frac{\partial^2 f}{\partial x_2^2} + \cdots + \frac{\partial^2 f}{\partial x_n^2} \tag{25}
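A brute-force finite-difference sketch of the Hessian and its trace; the quadratic test function below is chosen so the exact answer is known:

```python
import numpy as np

def hessian(f, x, h=1e-4):
    """Second-order centered differences for every (i, j) pair."""
    n = x.size
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            ei = np.zeros(n); ei[i] = h
            ej = np.zeros(n); ej[j] = h
            H[i, j] = (f(x + ei + ej) - f(x + ei - ej)
                       - f(x - ei + ej) + f(x - ei - ej)) / (4 * h**2)
    return H

# f(x) = x0^2 + 3*x0*x1 + x1^2 has H = [[2, 3], [3, 2]] everywhere,
# so the Laplacian (trace of H) is 4.
f = lambda v: v[0]**2 + 3 * v[0] * v[1] + v[1]**2
H = hessian(f, np.array([0.5, -1.0]))
laplacian = np.trace(H)
```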

4. Derivatives of Vector Products

4.1 Element-wise (Hadamard) Product

Given \vec{y} = \vec{f}(\vec{w}) \odot \vec{g}(\vec{x}), we can further expand it as:

\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix} = \begin{bmatrix} f_1(\vec{w}) \odot g_1(\vec{x}) \\ f_2(\vec{w}) \odot g_2(\vec{x}) \\ \vdots \\ f_n(\vec{w}) \odot g_n(\vec{x}) \end{bmatrix} = \begin{bmatrix} f_1(w_1, w_2, \ldots, w_n) \odot g_1(x_1, x_2, \ldots, x_n) \\ f_2(w_1, w_2, \ldots, w_n) \odot g_2(x_1, x_2, \ldots, x_n) \\ \vdots \\ f_n(w_1, w_2, \ldots, w_n) \odot g_n(x_1, x_2, \ldots, x_n) \end{bmatrix} \tag{26}

This can be generalized to a Jacobian matrix of partial derivatives:

\frac{\partial \vec{y}}{\partial \vec{x}} = \begin{bmatrix} \frac{\partial(f_1 \odot g_1)}{\partial \vec{x}} \\ \frac{\partial(f_2 \odot g_2)}{\partial \vec{x}} \\ \vdots \\ \frac{\partial(f_n \odot g_n)}{\partial \vec{x}} \end{bmatrix} = \begin{bmatrix} \frac{\partial(f_1 \odot g_1)}{\partial x_1} & \frac{\partial(f_1 \odot g_1)}{\partial x_2} & \cdots & \frac{\partial(f_1 \odot g_1)}{\partial x_n} \\ \frac{\partial(f_2 \odot g_2)}{\partial x_1} & \frac{\partial(f_2 \odot g_2)}{\partial x_2} & \cdots & \frac{\partial(f_2 \odot g_2)}{\partial x_n} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial(f_n \odot g_n)}{\partial x_1} & \frac{\partial(f_n \odot g_n)}{\partial x_2} & \cdots & \frac{\partial(f_n \odot g_n)}{\partial x_n} \end{bmatrix} \tag{27}

In equation 27, when each f_i depends only on w_i and each g_i only on x_i (for instance, y_i = w_i x_i), each output dimension depends exclusively on the corresponding input dimension; no coordinate influences any other. As a result, the associated linear operator is diagonal, meaning that each dimension is processed independently, as below:

\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix} = \begin{bmatrix} f_1(\vec{w}) \odot g_1(\vec{x}) \\ f_2(\vec{w}) \odot g_2(\vec{x}) \\ \vdots \\ f_n(\vec{w}) \odot g_n(\vec{x}) \end{bmatrix} = \begin{bmatrix} w_1 \odot x_1 \\ w_2 \odot x_2 \\ \vdots \\ w_n \odot x_n \end{bmatrix} \tag{28}

Each output dimension is influenced exclusively by the corresponding input dimension, with no cross-dimensional interactions. As a result, the Jacobian matrix is diagonal:

\frac{\partial \vec{y}}{\partial \vec{x}} = \begin{bmatrix} \frac{\partial(w_1 \odot x_1)}{\partial \vec{x}} \\ \frac{\partial(w_2 \odot x_2)}{\partial \vec{x}} \\ \vdots \\ \frac{\partial(w_n \odot x_n)}{\partial \vec{x}} \end{bmatrix} = \begin{bmatrix} \frac{\partial(w_1 \odot x_1)}{\partial x_1} & 0 & \cdots & 0 \\ 0 & \frac{\partial(w_2 \odot x_2)}{\partial x_2} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \frac{\partial(w_n \odot x_n)}{\partial x_n} \end{bmatrix} = \begin{bmatrix} w_1 & 0 & \cdots & 0 \\ 0 & w_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & w_n \end{bmatrix} \tag{29}
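We can verify the diagonal Jacobian of equation (29) numerically with arbitrary example vectors:

```python
import numpy as np

w = np.array([2.0, -1.0, 3.0])
x = np.array([0.5, 4.0, -2.0])

y = lambda t: w * t   # y_i = w_i * x_i (element-wise product)

# Centered-difference Jacobian of y with respect to x, one column per input.
h = 1e-6
J = np.zeros((3, 3))
for j in range(3):
    e = np.zeros(3)
    e[j] = h
    J[:, j] = (y(x + e) - y(x - e)) / (2 * h)
```

The off-diagonal entries come out (numerically) zero, and the diagonal reproduces w.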

4.2 Dot Product

Given a function y = \vec{w} \cdot \vec{x}, we can rewrite the dot product as \vec{w}^T\vec{x} (or \vec{x}^T\vec{w}) when the two vectors \vec{w} and \vec{x} are of equal length. We can also write it as \sum_i^n w_i x_i, which is \sum(\vec{w} \odot \vec{x}).

This can finally be rewritten as y = \sum(\vec{v}), with \vec{v} = \vec{w} \odot \vec{x} as an intermediate variable, to which we can easily apply the chain rule to find the derivative.

If y = \sum_{i=1}^{n} v_i = \sum(\vec{v}), then we first need to find \frac{\partial \vec{v}}{\partial \vec{x}}:

\frac{\partial \vec{v}}{\partial \vec{x}} = \frac{\partial(\vec{w} \odot \vec{x})}{\partial \vec{x}} = diag(\vec{w}) = \begin{bmatrix} w_1 & 0 & \cdots & 0 \\ 0 & w_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & w_n \end{bmatrix} \tag{30}

We also need to find \frac{\partial y}{\partial \vec{v}}:

\frac{\partial y}{\partial \vec{v}} = \frac{\partial}{\partial \vec{v}}\sum_{i=1}^{n} v_i = \vec{1}^T = [1, 1, \ldots, 1] \tag{31}

With the chain rule, the product of the two intermediate results gives the final derivative:

\frac{\partial y}{\partial \vec{x}} = \frac{\partial y}{\partial \vec{v}} \frac{\partial \vec{v}}{\partial \vec{x}} = \vec{1}^T \, diag(\vec{w}) = [w_1, w_2, \ldots, w_n] = \vec{w}^T \tag{32}

Similarly, the derivative with respect to \vec{w} is:

\frac{\partial \vec{v}}{\partial \vec{w}} = diag(\vec{x}) = \begin{bmatrix} x_1 & 0 & \cdots & 0 \\ 0 & x_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & x_n \end{bmatrix} \tag{33}

\frac{\partial y}{\partial \vec{w}} = \vec{1}^T \, diag(\vec{x}) = [x_1, x_2, \ldots, x_n] = \vec{x}^T \tag{34}
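The results in equations (32) and (34) can be checked directly with NumPy on arbitrary example vectors:

```python
import numpy as np

w = np.array([1.0, 2.0, 3.0])
x = np.array([4.0, 5.0, 6.0])

y = w @ x                      # the dot product itself: 1*4 + 2*5 + 3*6 = 32

ones = np.ones(3)
grad_x = ones @ np.diag(w)     # 1^T diag(w) = w^T
grad_w = ones @ np.diag(x)     # 1^T diag(x) = x^T
```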

In general, to compute derivatives involving scalar-vector or vector-vector expressions, one first identifies the dimensions of all input vectors and matrices to determine the form of the resulting expression. The function is then written explicitly in terms of its scalar components, noting that operations such as vector multiplication may yield either a scalar or a matrix depending on the context. Differentiation is performed component-wise with respect to each element of the variable vector, and the resulting partial derivatives are finally assembled into a vector or matrix that matches the appropriate derivative structure (e.g., a gradient or Jacobian).

5. Matrix Derivatives

As you might have observed from the above, scalars and vectors are denoted using lower-case symbols (a, x, etc.), while for matrices we usually use bold upper-case symbols (\mathbf{X}, \mathbf{A}, \mathbf{B}, etc.), with each element indexed using subscripts such as X_{i,j}. The rules of vector calculus from the previous sections to a great extent carry over to matrix differentials, although some, like the differentials of determinants and inverses, are quite different and unfamiliar unless one has had exposure to matrix calculus.

Given a constant matrix \mathbf{A}, a constant scalar \alpha, and matrix functions \mathbf{X} and \mathbf{Y}, the basic differential rules are: d\mathbf{A} = 0; d(\alpha\mathbf{X}) = \alpha\,d\mathbf{X}; d(\mathbf{X}^T) = (d\mathbf{X})^T; d\,tr\,\mathbf{X} = tr\,d\mathbf{X}; if the matrix product \mathbf{X}\mathbf{Y} is defined, d(\mathbf{X}\mathbf{Y}) = (d\mathbf{X})\mathbf{Y} + \mathbf{X}\,d\mathbf{Y}; and if \mathbf{X} and \mathbf{Y} are of the same order, d(\mathbf{X}+\mathbf{Y}) = d\mathbf{X} + d\mathbf{Y}.

5.1 Matrix Products

5.1.1 Derivative of Scalar by Matrix

Given f(\mathbf{X}) that maps a matrix \mathbf{X} \in \mathbb{R}^{m \times n} to a scalar value, computing the derivative requires us to consider how the scalar output changes with respect to each individual element of the input matrix. For instance, given:

\mathbf{X} = \begin{bmatrix} X_{1,1} & X_{1,2} & \cdots & X_{1,n} \\ X_{2,1} & X_{2,2} & \cdots & X_{2,n} \\ \vdots & \vdots & \ddots & \vdots \\ X_{m,1} & X_{m,2} & \cdots & X_{m,n} \end{bmatrix} \tag{35}

Since f(\mathbf{X}) results in a scalar, we can compute the partial derivative with respect to each individual element X_{i,j}; for example, \frac{\partial f}{\partial X_{1,1}} represents how f changes when only X_{1,1} changes, and the same holds for every element in the matrix.

All the partial derivatives are collected into a gradient matrix:

\frac{\partial f}{\partial \mathbf{X}} = \begin{bmatrix} \frac{\partial f}{\partial X_{1,1}} & \frac{\partial f}{\partial X_{1,2}} & \cdots & \frac{\partial f}{\partial X_{1,n}} \\ \frac{\partial f}{\partial X_{2,1}} & \frac{\partial f}{\partial X_{2,2}} & \cdots & \frac{\partial f}{\partial X_{2,n}} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial f}{\partial X_{m,1}} & \frac{\partial f}{\partial X_{m,2}} & \cdots & \frac{\partial f}{\partial X_{m,n}} \end{bmatrix} \tag{36}

The gradient has the same dimensions as the input matrix, i.e. m \times n. This dimensional consistency is particularly useful in machine learning applications where we need to update matrix parameters using their gradients.
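A minimal sketch of a scalar-by-matrix gradient via element-wise centered differences; the sum-of-squares test function is an illustrative choice with the known gradient 2X:

```python
import numpy as np

def grad_wrt_matrix(f, X, h=1e-6):
    """Centered difference for each entry; the result has the same shape as X."""
    G = np.zeros_like(X)
    for i in range(X.shape[0]):
        for j in range(X.shape[1]):
            E = np.zeros_like(X)
            E[i, j] = h
            G[i, j] = (f(X + E) - f(X - E)) / (2 * h)
    return G

# f(X) = sum of squared entries, so df/dX = 2X.
f = lambda M: np.sum(M**2)
X = np.array([[1.0, 2.0], [3.0, 4.0]])
G = grad_wrt_matrix(f, X)
```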

5.1.2 Derivative of Matrix by Scalar

Given \mathbf{F}(x) that maps a scalar x \in \mathbb{R} to a matrix \mathbf{F}(x) \in \mathbb{R}^{p \times q}, computing the derivative with respect to the scalar input requires us to consider how each individual element of the output matrix changes. For instance, with:

\mathbf{F}(x) = \begin{bmatrix} F_{1,1}(x) & F_{1,2}(x) & \cdots & F_{1,q}(x) \\ F_{2,1}(x) & F_{2,2}(x) & \cdots & F_{2,q}(x) \\ \vdots & \vdots & \ddots & \vdots \\ F_{p,1}(x) & F_{p,2}(x) & \cdots & F_{p,q}(x) \end{bmatrix} \tag{37}

The individual derivatives with respect to x are \frac{\partial F_{1,1}}{\partial x}, \frac{\partial F_{2,3}}{\partial x}, etc. We collect all the resultant derivatives into:

\frac{\partial \mathbf{F}}{\partial x} = \begin{bmatrix} \frac{\partial F_{1,1}}{\partial x} & \frac{\partial F_{1,2}}{\partial x} & \cdots & \frac{\partial F_{1,q}}{\partial x} \\ \frac{\partial F_{2,1}}{\partial x} & \frac{\partial F_{2,2}}{\partial x} & \cdots & \frac{\partial F_{2,q}}{\partial x} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial F_{p,1}}{\partial x} & \frac{\partial F_{p,2}}{\partial x} & \cdots & \frac{\partial F_{p,q}}{\partial x} \end{bmatrix} \tag{38}

The derivative of a matrix by a scalar will always match the dimensions of the output matrix \mathbf{F}(x). This is the matrix generalization of taking the derivative of a vector-valued function with respect to a scalar.
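A sketch of a matrix-by-scalar derivative: differentiate each entry of an example matrix-valued function (chosen arbitrarily here) with respect to x:

```python
import numpy as np

def F(x):
    # An arbitrary 2x2 matrix-valued function of a scalar x.
    return np.array([[x**2, np.sin(x)],
                     [np.exp(x), 3.0]])

def dF_dx(x, h=1e-6):
    """Centered difference applied element-wise, matching equation (38)."""
    return (F(x + h) - F(x - h)) / (2 * h)

# Analytically dF/dx = [[2x, cos x], [e^x, 0]].
D = dF_dx(0.7)
```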

5.1.3 Derivative of Vector by Matrix

Given \mathbf{f}(\mathbf{X}) that maps a matrix \mathbf{X} \in \mathbb{R}^{n \times m} to a vector \mathbf{f}(\mathbf{X}) \in \mathbb{R}^{p \times 1}, the derivative is taken with respect to each element of the input matrix, resulting in a third-order tensor with dimensions p \times n \times m. We take the individual derivatives and collect them together like this:

\frac{\partial \mathbf{f}}{\partial \mathbf{X}} = \begin{bmatrix} \frac{\partial \mathbf{f}}{\partial X_{1,1}} & \cdots & \frac{\partial \mathbf{f}}{\partial X_{1,m}} \\ \vdots & \ddots & \vdots \\ \frac{\partial \mathbf{f}}{\partial X_{n,1}} & \cdots & \frac{\partial \mathbf{f}}{\partial X_{n,m}} \end{bmatrix} \tag{39}

The derivative structure maintains the same n \times m layout as the input matrix, but each entry is now a vector rather than a scalar. This creates a third-order tensor with dimensions p \times n \times m, though it is often more intuitive to think of it as p separate n \times m gradient matrices stacked together.

5.1.4 Derivative of Matrix by Vector

Consider a function \mathbf{F}(\vec{x}) that maps a vector \vec{x} \in \mathbb{R}^{m \times 1} to a matrix \mathbf{F}(\vec{x}) \in \mathbb{R}^{p \times q}. To calculate the derivative, we consider how each element of the output matrix changes with respect to each component of the input vector. The result is likewise a third-order tensor, with dimensions p \times q \times m.

\frac{\partial \mathbf{F}}{\partial \vec{x}} = \begin{bmatrix} \frac{\partial \mathbf{F}}{\partial x_1} & \frac{\partial \mathbf{F}}{\partial x_2} & \cdots & \frac{\partial \mathbf{F}}{\partial x_m} \end{bmatrix} \tag{40}

Each entry \frac{\partial \mathbf{F}}{\partial x_k} is itself a p \times q matrix showing how all elements of \mathbf{F} change with respect to the k-th component of \vec{x}. So, we can view the result as a sequence of m matrices, each of size p \times q, arranged in a row.
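A sketch that assembles these m matrices into a (p, q, m) NumPy array; the outer-product function below is an arbitrary example whose derivative is known analytically:

```python
import numpy as np

def dF_dvec(F, x, h=1e-6):
    """Stack the matrices dF/dx_k along a trailing axis: shape (p, q, m)."""
    T = np.zeros(F(x).shape + (x.size,))
    for k in range(x.size):
        e = np.zeros_like(x)
        e[k] = h
        T[..., k] = (F(x + e) - F(x - e)) / (2 * h)
    return T

# F(x) = x x^T (outer product); d(x_i x_j)/dx_k = δ_ik x_j + δ_jk x_i.
F = lambda v: np.outer(v, v)
T = dF_dvec(F, np.array([1.0, 2.0, 3.0]))
```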

Conclusion

This blog has covered the foundational concepts of derivatives and their applications in machine learning, from single-variable differentiation to the complexities of vector, matrix, and tensor derivatives.

References:

  1. Olah, C. (2015). Calculus on computational graphs: Backpropagation. Retrieved from https://colah.github.io/posts/2015-08-Backprop/

For comments, please send me an email.