Published on
23/09/2019

Entropy, Cross-entropy, KL divergence and Beyond

As a reminder, entropy measures the level of uncertainty or randomness in data, i.e. the unpredictability of a random variable drawn from some distribution. It shows how mixed or varied the data is: high entropy means high uncertainty and hard-to-predict outcomes, while low entropy means low uncertainty. Mathematically, information entropy is calculated as:

H(X) = -\sum_{i=1}^n p_i \log_2(p_i) \tag{1}

where $n$ is the number of classes and $p_i$ is the probability of class $i$.

The uniform distribution has the highest entropy, since all outcomes are equally likely; in this case, the entropy is:

H(X) = -\sum_{i=1}^n \frac{1}{n} \log_2\left(\frac{1}{n}\right) = \log_2(n) \tag{2}
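As a quick sanity check, Eq. 1 and Eq. 2 can be computed directly. A minimal Python sketch (the `entropy` helper is hand-rolled, not from any library):

```python
import math

def entropy(probs):
    """Shannon entropy in bits (Eq. 1); zero-probability terms contribute 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# A fair coin carries 1 bit of entropy; a certain outcome carries none.
print(entropy([0.5, 0.5]))      # 1.0
print(entropy([1.0]) == 0)      # True

# Uniform over n outcomes gives the maximum, log2(n) (Eq. 2).
n = 8
print(entropy([1 / n] * n))     # 3.0, i.e. log2(8)
```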

Let's go through an example to solidify the concept. Assume we had a hypothetical country of 6 people from 2 tribes, 3 people in each. Without any split, the entropy would be:

-\left(\frac{3}{6} \log_2\frac{3}{6} + \frac{3}{6} \log_2\frac{3}{6}\right) = 1
[Figure: the country split in two by an ocean]

Updated: Jan 9, 2025 - Thanks to a kind reader for pointing out an error in the previous image.

However, with that split, the right side with two people would have an entropy of 0, and the left side would have an entropy of:

-\left(\frac{1}{4} \log_2\frac{1}{4} + \frac{3}{4} \log_2\frac{3}{4}\right) \approx 0.811

The weighted entropy for both sides would be:

\left(\frac{2}{6} \times 0\right) + \left(\frac{4}{6} \times 0.811\right) \approx 0.541

giving an information gain of $1 - 0.541 \approx 0.459$ relative to the unsplit entropy.
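The whole split calculation can be reproduced in a few lines; a minimal sketch (variable names are ours):

```python
import math

def entropy(probs):
    """Shannon entropy in bits (Eq. 1); zero-probability terms contribute 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

h_before = entropy([3/6, 3/6])   # 1.0: the unsplit country
h_right = entropy([2/2])         # 0.0: two people, one tribe
h_left = entropy([1/4, 3/4])     # ~0.811: one person vs three

# Entropy of each side weighted by its share of the population
h_split = (2/6) * h_right + (4/6) * h_left
gain = h_before - h_split
print(round(h_split, 3), round(gain, 3))   # 0.541 0.459
```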

Entropy variations

Cross-Entropy

Cross-entropy extends the idea of entropy to two distributions: the actual observed distribution and the predicted distribution. It is mostly used to track the training loss of classification ML models. Mathematically, if the actual distribution is $P$ and the predicted one is $Q$, the cross-entropy is:

H_{ce}(P,Q) = -\sum_{i=1}^{n} P_i \log_2 Q_i \tag{3}

With our six-people example from the previous section, let's assume we had a road and some ML algorithm to predict who among the six people will cross it. This gives two probability distributions:

\begin{array}{cc} \text{Actual probabilities } (P) & \text{Predicted probabilities } (Q) \\ \hline 0.3 & 0.1 \\ 0.2 & 0.05 \\ 0.15 & 0.15 \\ 0.15 & 0.2 \\ 0.1 & 0.25 \\ 0.1 & 0.25 \end{array}

The cross-entropy for the two distributions would be:

H_{ce}(P,Q) = -(0.3\log_2 0.1 + 0.2\log_2 0.05 + 0.15\log_2 0.15 + 0.15\log_2 0.2 + 0.1\log_2 0.25 + 0.1\log_2 0.25) \approx 3.02

However, if the actual and predicted distributions were similar, as in:

\begin{array}{cc} \text{Actual probabilities } (P) & \text{Predicted probabilities } (Q) \\ \hline 0.3 & 0.28 \\ 0.2 & 0.22 \\ 0.15 & 0.14 \\ 0.15 & 0.16 \\ 0.1 & 0.11 \\ 0.1 & 0.09 \end{array}

The cross-entropy would be:

H_{ce}(P,Q) = -(0.3\log_2 0.28 + 0.2\log_2 0.22 + 0.15\log_2 0.14 + 0.15\log_2 0.16 + 0.1\log_2 0.11 + 0.1\log_2 0.09) \approx 2.48

which is only slightly above the entropy of $P$ itself, $H(P) \approx 2.47$, the smallest value the cross-entropy can take.
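Both cross-entropy values can be checked numerically; a short sketch (the `cross_entropy` helper is ours):

```python
import math

def cross_entropy(p, q):
    """Cross-entropy H_ce(P, Q) in bits (Eq. 3)."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q))

p  = [0.3, 0.2, 0.15, 0.15, 0.1, 0.1]      # actual distribution
q1 = [0.1, 0.05, 0.15, 0.2, 0.25, 0.25]    # poor prediction
q2 = [0.28, 0.22, 0.14, 0.16, 0.11, 0.09]  # close prediction

print(round(cross_entropy(p, q1), 2))  # 3.02
print(round(cross_entropy(p, q2), 2))  # 2.48
print(round(cross_entropy(p, p), 2))   # 2.47, the floor: H(P) itself
```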

Joint Entropy

Consider selecting a number $n \in \{1, \dots, 12\}$ uniformly at random. Let $X(n)$ be 1 if $n$ is even, and $Y(n)$ be 1 if $n$ is divisible by 3.

\begin{array}{c|cccccccccccc} \text{Number} & 1 & 2 & 3 & 4 & 5 & 6 & 7 & 8 & 9 & 10 & 11 & 12 \\ \hline X & 0 & 1 & 0 & 1 & 0 & 1 & 0 & 1 & 0 & 1 & 0 & 1 \\ Y & 0 & 0 & 1 & 0 & 0 & 1 & 0 & 0 & 1 & 0 & 0 & 1 \end{array}

$(X=0, Y=0)$ occurs at $n \in \{1, 5, 7, 11\}$, $(X=0, Y=1)$ at $n \in \{3, 9\}$, $(X=1, Y=0)$ at $n \in \{2, 4, 8, 10\}$, and $(X=1, Y=1)$ at $n \in \{6, 12\}$. We can therefore summarize the joint distribution of $X$ and $Y$ as:

\begin{array}{ccc} P(X, Y) & Y=0 & Y=1 \\ \hline X=0 & \frac{4}{12} & \frac{2}{12} \\ X=1 & \frac{4}{12} & \frac{2}{12} \end{array}

And our joint entropy would be:

H(X,Y) = -\left(\frac{4}{12}\log_2 \frac{4}{12} + \frac{2}{12}\log_2 \frac{2}{12} + \frac{4}{12}\log_2 \frac{4}{12} + \frac{2}{12}\log_2 \frac{2}{12}\right) \approx 1.918

Joint entropy measures the total uncertainty associated with a set of random variables. The formula used above can be written as:

H(X, Y) = - \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} P(x, y) \log_2 P(x, y) \tag{4}

or, more compactly:

H(X, Y) = - \sum_{x, y} P(x, y) \log_2 P(x, y) \tag{5}
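Eq. 4 applied to the joint table above; a minimal sketch:

```python
import math

# Joint distribution P(X, Y) from the table: rows are X=0 and X=1
joint = [[4/12, 2/12],
         [4/12, 2/12]]

h_xy = -sum(p * math.log2(p) for row in joint for p in row)
print(round(h_xy, 3))   # 1.918
```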

Perplexity

This is a measure of predictability. The lowest perplexity is 1, which means the outcome is certain. Higher perplexity values indicate more uncertainty. It is calculated as:

\text{Perplexity}(P) = 2^{H(P)} \tag{6}
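Perplexity is just two raised to the entropy, which can be read as the effective number of equally likely outcomes. A small sketch with a hand-rolled `entropy` helper:

```python
import math

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

def perplexity(probs):
    """2**H(P): the effective number of equally likely outcomes (Eq. 6)."""
    return 2 ** entropy(probs)

print(perplexity([1.0]))             # 1.0: a certain outcome
print(perplexity([0.5, 0.5]))        # 2.0: a fair coin
print(round(perplexity([1/6] * 6)))  # 6: a fair die
```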

Kullback-Leibler (KL) Divergence

KL divergence is an extension of cross-entropy for statistically comparing two probability distributions. It fixes an inconvenience of cross-entropy: since $H_{ce}(P,Q) = H(P) + D_{KL}(P\|Q)$, we have $H_{ce}(P,P) = H(P) > 0$ in general, so cross-entropy does not vanish even when $Q = P$. KL divergence therefore subtracts the entropy of the actual distribution from the cross-entropy, resulting in:

D_{KL}(P\|Q) = H_{ce}(P,Q) - H(P) \tag{7}

= -\left(\sum_{j=1}^{n} p_j \log_2 q_j - \sum_{j=1}^{n} p_j \log_2 p_j\right) = \sum_{j=1}^{n} p_j \log_2 \frac{p_j}{q_j} \tag{8}
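Applying Eq. 7 to the road-crossing distributions from the cross-entropy section makes the point concrete; a sketch with our own helpers:

```python
import math

def entropy(p):
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q))

def kl_divergence(p, q):
    """D_KL(P || Q) = H_ce(P, Q) - H(P)  (Eq. 7)."""
    return cross_entropy(p, q) - entropy(p)

p  = [0.3, 0.2, 0.15, 0.15, 0.1, 0.1]
q1 = [0.1, 0.05, 0.15, 0.2, 0.25, 0.25]
q2 = [0.28, 0.22, 0.14, 0.16, 0.11, 0.09]

print(round(kl_divergence(p, q1), 3))  # 0.549: dissimilar distributions
print(round(kl_divergence(p, q2), 3))  # 0.005: nearly identical ones
print(kl_divergence(p, p) == 0)        # True: D_KL vanishes when Q = P
```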

Mutual Information

Mutual information is closely related to KL divergence. While KL divergence compares how similar two distributions are, mutual information measures the dependency between two random variables: it is the KL divergence between their joint distribution $P(X, Y)$ and the product of their marginals $P(X)P(Y)$:

I(X; Y) = \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log_2 \left( \frac{p(x, y)}{p(x)p(y)} \right) \tag{9}

It can equivalently be written as:

I(X; Y) = H(X,Y) - H(X\mid Y) - H(Y\mid X) \tag{10}

= H(X) + H(Y) - H(X,Y) \tag{11}

Reconsider the joint entropy example above with $X$ and $Y$. First, the conditional distribution $P(Y \mid X)$ is obtained by normalizing each row of the joint table:

\begin{array}{ccc} P(Y \mid X) & Y=0 & Y=1 \\ \hline X=0 & \frac{4/12}{6/12} = \frac{2}{3} & \frac{2/12}{6/12} = \frac{1}{3} \\ X=1 & \frac{4/12}{6/12} = \frac{2}{3} & \frac{2/12}{6/12} = \frac{1}{3} \end{array}

Then the conditional entropy $H(Y \mid X)$ averages each row's entropy, weighted by $P(X=x) = \frac{1}{2}$; since both rows are identical, this is simply:

H(Y \mid X) = -\left(\frac{2}{3}\log_2 \frac{2}{3} + \frac{1}{3}\log_2 \frac{1}{3}\right) \approx 0.918

Hence the mutual information between $X$ and $Y$ would be:

I(X; Y) = H(Y) - H(Y \mid X) \approx 0.918 - 0.918 = 0

noting that $P(Y=1) = \frac{4}{12} = \frac{1}{3}$, so $H(Y) = -\left(\frac{1}{3}\log_2 \frac{1}{3} + \frac{2}{3}\log_2 \frac{2}{3}\right) \approx 0.918$. The mutual information is zero because $X$ and $Y$ are in fact independent: knowing whether $n$ is even tells us nothing about whether it is divisible by 3.

And we can verify the joint entropy with the mutual information formula:

H(X,Y) = H(X) + H(Y) - I(X; Y) \tag{12}

\approx 1 + 0.918 - 0 = 1.918

which matches the value computed directly from the joint table.
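The whole example can be verified from the raw numbers 1 to 12; a minimal sketch computing Eq. 9 and Eq. 12 directly (variable names are ours):

```python
import math
from collections import Counter

# Build the joint distribution of (X, Y) over n = 1..12
counts = Counter(((n % 2 == 0), (n % 3 == 0)) for n in range(1, 13))
pxy = {xy: c / 12 for xy, c in counts.items()}
px = {x: sum(p for (a, _), p in pxy.items() if a == x) for x in (False, True)}
py = {y: sum(p for (_, b), p in pxy.items() if b == y) for y in (False, True)}

# Eq. 9: mutual information from the joint and the marginals
mi = sum(p * math.log2(p / (px[x] * py[y])) for (x, y), p in pxy.items())
h_xy = -sum(p * math.log2(p) for p in pxy.values())
h_x = -sum(p * math.log2(p) for p in px.values())
h_y = -sum(p * math.log2(p) for p in py.values())

print(round(mi, 6))              # 0.0: X and Y are independent
print(round(h_xy, 3))            # 1.918
print(round(h_x + h_y - mi, 3))  # 1.918, consistent with Eq. 12
```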

Conclusion

Understanding these concepts is crucial, since most machine learning algorithms are built around minimizing some form of a loss function, which is often a variation of the concepts above.

For comments, please send me an email.