Published on
17/12/2019

Understanding and Quantifying Uncertainties Related to Random Events

Probability

1. Basics

In this section, we start by defining some of the basic concepts that prepare us for conditional probability, Bayes' theorem, and more advanced concepts like distributions.

1.1 Sample Space and Events

A sample space \Omega is the set of all possible outcomes (results) of an experiment. For example, a coin flip has \Omega = \{H, T\}, and rolling a die has \Omega = \{1, 2, 3, 4, 5, 6\}. An event is a group of one or more outcomes of an experiment, that is, a subset of the sample space, e.g. E = \{1, 2, 5\} after rolling a die 3 times. The complement of an event A is the event that A does not occur, usually denoted \overline{A}.

1.2 Sets and Set Operations

A set is a collection of members where order and repetition do not matter; for example, \{a, b, c\} and \{a, b, b, c, a\} are considered equal because they contain the same elements. Commonly, sets are described by listing their members in braces, or by a rule such as \{x : 2 < x < 3\} (shorthand \{2 < x < 3\}). The symbol \in denotes that an element belongs to a set, while \notin denotes non-membership. A set X is a subset of Y, denoted X \subset Y, if every element of X is also in Y. This relationship can be reversed to Y \supset X, meaning Y is a superset of X. By convention, every set is considered a subset of itself.

Given sets A = \{1, 2\}, B = \{2, 3\}, C = \{2, 4\}, and a sample space S = \{1, 2, 3, 4\}, we can define the operations summarized below:

\begin{array}{lcl} \text{Property} & \text{Expression} & \text{Example} \\ \hline \text{Commutative} & A \cup B = B \cup A & \{1, 2\} \cup \{2, 3\} = \{2, 3\} \cup \{1, 2\} = \{1, 2, 3\} \\ & AB = BA & \{1, 2\} \cap \{2, 3\} = \{2, 3\} \cap \{1, 2\} = \{2\} \\ \text{Associative} & A \cup (B \cup C) = (A \cup B) \cup C & A \cup \{2, 3, 4\} = \{1, 2, 3\} \cup C = \{1, 2, 3, 4\} \\ & A(BC) = (AB)C & A \cap \{2\} = \{2\} \cap C = \{2\} \\ \text{Distributive} & A \cup (BC) = (A \cup B)(A \cup C) & A \cup \{2\} = \{1, 2, 3\} \cap \{1, 2, 4\} = \{1, 2\} \\ & A(B \cup C) = AB \cup AC & A \cap \{2, 3, 4\} = \{2\} \cup \{2\} = \{2\} \\ \text{Idempotent} & A \cup A = A, \ AA = A & \{1, 2\} \cup \{1, 2\} = \{1, 2\}; \ \{1, 2\} \cap \{1, 2\} = \{1, 2\} \\ \text{Rules for } \emptyset & A \cup \emptyset = A, \ A\emptyset = \emptyset & \{1, 2\} \cup \emptyset = \{1, 2\}; \ \{1, 2\} \cap \emptyset = \emptyset \\ \text{Rules for } S & A \cup S = S, \ AS = A & \{1, 2\} \cup S = S; \ \{1, 2\} \cap S = \{1, 2\} \\ \text{Rules for } \overline{A} & A \cup \overline{A} = S, \ A\overline{A} = \emptyset & \{1, 2\} \cup \{3, 4\} = S; \ \{1, 2\} \cap \{3, 4\} = \emptyset \\ \text{DeMorgan's} & \overline{A \cap B} = \overline{A} \cup \overline{B} & \overline{\{2\}} = \{1, 3, 4\}; \ \{3, 4\} \cup \{1, 4\} = \{1, 3, 4\} \\ & \overline{A \cup B} = \overline{A} \cap \overline{B} & \overline{\{1, 2, 3\}} = \{4\}; \ \{3, 4\} \cap \{1, 4\} = \{4\} \\ \text{Double Complement} & \overline{\overline{A}} = A & \overline{\{3, 4\}} = \{1, 2\} \\ \end{array}

Indexed family of sets

Similar to the unions (\cup) and intersections (\cap) above, an indexed family of sets is a collection whose elements are themselves sets. Their unions and intersections are commonly written as:

\bigcup_{i=1}^{n} A_i \quad \text{which is the same as} \quad A_1 \cup A_2 \cup \dots \cup A_n \tag{1}

\bigcap_{i=1}^{n} A_i \quad \text{which is the same as} \quad A_1 \cap A_2 \cap \dots \cap A_n \tag{2}

The definitions of union and intersection can be expressed naturally using logical quantifiers. An element x belongs to the union if there exists at least one index i \in I such that x \in A_i. In contrast, an element x belongs to the intersection only if for every index i \in I, x \in A_i. Thus we have:

\bigcup_{i \in I} A_i = \{x : \exists i \in I, \ x \in A_i\} \tag{3}

\bigcap_{i \in I} A_i = \{x : \forall i \in I, \ x \in A_i\} \tag{4}

1.3 Enumerative combinatorics

1.3.1 Additive and Multiplicative Principles

Let's imagine you are visiting your home area after a few years away, and you have very limited time for your visit. Assume you have 10 friends, each living at a different address or sub-area, and 20 relatives, each living in their own area. Problem 1: If your time allows you to see only one person (either a friend or a relative), how many choices do you have? Problem 2: If your time allows you to visit one friend and one relative (meaning you visit two people), how many choices do you have?

We could agree that the first problem is not that hard. We just add 10 friends + 20 relatives, and your choice of whom to visit is from a pool of 30 people. We generally call this the Additive principle, and it states:

If an event A occurs in X ways, event B occurs in Y ways, and the two events A and B are disjoint (both events can't happen at the same time, aka mutually exclusive), then the event A or B occurs in X + Y ways.

Formally, the Additive principle given two sets A and B implies that if A \cap B = \emptyset, then |A \cap B| = 0. Therefore, we can express the union of these sets in terms of cardinality and probability:

|A \cup B| = |A| + |B| \tag{5}

P(A \cup B) = P(A) + P(B) \tag{6}

Both identities extend to three or more disjoint events. When the events overlap, however, we must correct for double counting using inclusion-exclusion; for three events, P(A \cup B \cup C) = P(A) + P(B) + P(C) - P(A \cap B) - P(A \cap C) - P(B \cap C) + P(A \cap B \cap C).
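The additive principle and its inclusion-exclusion extension are easy to sanity-check numerically. The sketch below uses a toy die sample space with events of my own choosing (not from the text above), comparing both sides of each identity with exact fractions.

```python
# Numeric check of the additive principle and inclusion-exclusion,
# using a fair six-sided die as the sample space (toy example).
from fractions import Fraction

omega = set(range(1, 7))          # sample space of a fair die
A = {1, 2}                        # roll is 1 or 2
B = {3, 4}                        # roll is 3 or 4 (disjoint from A)
C = {2, 3}                        # overlaps both A and B

def P(event):
    """Probability of an event under equally likely outcomes."""
    return Fraction(len(event), len(omega))

# Disjoint events: P(A or B) = P(A) + P(B)
assert P(A | B) == P(A) + P(B)

# Overlapping events need inclusion-exclusion:
lhs = P(A | B | C)
rhs = (P(A) + P(B) + P(C)
       - P(A & B) - P(A & C) - P(B & C)
       + P(A & B & C))
assert lhs == rhs
print(lhs)  # 2/3
```

Using `Fraction` keeps the comparison exact, so the two sides match with `==` rather than within a floating-point tolerance.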

On the contrary, the second problem is not quite as direct because it involves a Cartesian product. We call this the Multiplicative principle, and it states that:

If an event A occurs in X ways and each possibility of A allows for Y ways of event B, then events A and B can occur together in X \cdot Y ways.

For our two sets A and B, we formally write this relationship as:

|A \times B| = |A| \cdot |B| \tag{7}

P(A \cap B) = P(A) \times P(B) \quad \text{(when } A \text{ and } B \text{ are independent)} \tag{8}

The key distinction is that while the first problem focused on the occurrence of event A or B, the second problem focuses on the occurrence of both events A and B.

1.3.2 Permutations and Combinations

A permutation is the number of different ways we can pick/select/arrange some or all items from a set when order is important. That is, we order k items from n choices, counting every distinct ordering, and stop after we have placed k items. The formula is summarized as:

P(n, k) = \frac{n!}{(n - k)!} \tag{9}

Let's assume we had a set of 5 books (A, B, C, D, E); a permutation of 2 books from the five would be P(5, 2).

P(5, 2) = \frac{5!}{(5 - 2)!} = \frac{5!}{3!} = \frac{5 \times 4 \times 3 \times 2 \times 1}{3 \times 2 \times 1} = 20

Similarly, if we wanted the total number of ways to order a selection of 3 books:

P(5, 3) = \frac{5!}{(5 - 3)!} = \frac{5!}{2!} = \frac{5 \times 4 \times 3 \times 2 \times 1}{2 \times 1} = 60

However, when we want to select a number of items and the order of selection does not matter, we call this a combination. It is usually summarized with the formula:

C(n, k) = \frac{P(n, k)}{k!} \tag{10}

= \frac{n!}{(n - k)! \, k!} \tag{11}

For example, using the same set of 5 books (A, B, C, D, E), if we wanted to choose a group of 2 books where the order doesn't matter:

C(5, 2) = \frac{5!}{(5 - 2)! \, 2!} = \frac{5!}{3! \, 2!} = \frac{5 \times 4}{2 \times 1} = 10

In this case, a selection of \{A, B\} is considered the same as \{B, A\}, which is why the total number of ways is half that of the permutation P(5, 2).
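The five-book counts above can be verified by brute-force enumeration. The sketch below computes equations (9) and (11) from factorials and checks them against `itertools`, which generates the arrangements explicitly.

```python
# Verify P(5, 2) = 20, P(5, 3) = 60, and C(5, 2) = 10 by enumerating
# the five-book example (A..E) with itertools.
from itertools import permutations, combinations
from math import factorial

books = ["A", "B", "C", "D", "E"]

def P(n, k):
    """Ordered selections of k items from n: n! / (n - k)!"""
    return factorial(n) // factorial(n - k)

def C(n, k):
    """Unordered selections of k items from n: n! / ((n - k)! k!)"""
    return factorial(n) // (factorial(n - k) * factorial(k))

assert P(5, 2) == len(list(permutations(books, 2))) == 20
assert P(5, 3) == len(list(permutations(books, 3))) == 60
assert C(5, 2) == len(list(combinations(books, 2))) == 10
print("all counts match")
```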

1.3.3 Permutation and Combination with Repetition

For permutations with repetition, the order in which you select the items remains important, but you are free to pick the same item multiple times. Since each of the k selections you make has n possible choices, the total number of ways to arrange them is simply n raised to the power of k.

Using our previous example of the 5 books (A, B, C, D, E), if you were to select 3 books where you could pick the same book more than once (for example, the sequence A, A, B), you would have 5 choices for the first position, 5 for the second, and 5 for the third. This results in 5^3 or 125 total permutations.

Combinations with repetition (also called multisets) occur when we are selecting a group of items where the order does not matter and we can include multiples of the same item, summarized as:

C_{rep}(n, k) = \binom{n + k - 1}{k} \tag{12}

With our 5 books example, choosing a set of 3 books with repetition allowed would be:

C_{rep}(5, 3) = \binom{5 + 3 - 1}{3} = \binom{7}{3} = \frac{7 \times 6 \times 5}{3 \times 2 \times 1} = 35
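Both repetition-allowed counts can also be checked by enumeration; `itertools.product` generates ordered selections with repetition and `combinations_with_replacement` generates the multisets.

```python
# Counting with repetition allowed, checked against brute-force
# enumeration of the five-book example.
from itertools import product, combinations_with_replacement
from math import comb

books = ["A", "B", "C", "D", "E"]

# Permutations with repetition: n^k
assert len(list(product(books, repeat=3))) == 5 ** 3 == 125

# Combinations with repetition (multisets): C(n + k - 1, k)
assert len(list(combinations_with_replacement(books, 3))) == comb(5 + 3 - 1, 3) == 35

print("repetition counts match")
```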

In summary, the key points to remember are:

\begin{array}{lll} \text{Method} & \text{Repetition NOT allowed} & \text{Repetition allowed} \\ \hline \text{Permutations} & P(n, k) = \frac{n!}{(n-k)!} & P(n, k) = n^k \\ \text{Combinations} & C(n, k) = \frac{n!}{(n-k)! \, k!} & C(n, k) = \frac{(n+k-1)!}{k! \, (n-1)!} \\ \end{array}

2. Random Variables

Let's use a coin flip and a six-faced die to explain what random variables are and how they differ from a sample space. We know the sample space of the die is S = \{1, 2, 3, 4, 5, 6\}, and S = \{H, T\} for the coin flip. We could define the random variable X for the die as the number rolled, taking values \{1, 2, 3, 4, 5, 6\}, and the probability P(X = x) when a fair die is rolled would be 1/6 for each value of the random variable:

P(X=1) = 1/6, \ P(X=2) = 1/6, \ \dots, \ P(X=6) = 1/6

For the coin, let's assume you have 2 chances to flip and our interest is to observe the number of heads. We could summarize the possible outcomes as:

\begin{array}{ccc} \text{Flip 1} & \text{Flip 2} & \text{Total Number of Heads} \\ \hline H & H & 2 \\ H & T & 1 \\ T & H & 1 \\ T & T & 0 \\ \end{array}

We could further summarize the above as a probability distribution, since the random variable X takes the values \{0, 1, 2\}:

\begin{array}{lccc} \text{Total number of heads } X & 0 & 1 & 2 \\ \text{Total occurrences} & 1 & 2 & 1 \\ P(X=x) & 1/4 & 2/4 & 1/4 \\ \end{array}

With the above two examples, we can clearly see the difference between the sample space and random variables. The random variable provides information for expressing the probability distribution of the possible outcomes.
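The two-flip distribution above can be reproduced programmatically: enumerate the sample space, map each outcome to the value of X (number of heads), and tally.

```python
# Build the probability distribution of X = number of heads in two flips
# by enumerating the sample space, as in the tables above.
from itertools import product
from collections import Counter
from fractions import Fraction

outcomes = list(product("HT", repeat=2))             # HH, HT, TH, TT
counts = Counter(flips.count("H") for flips in outcomes)

dist = {x: Fraction(c, len(outcomes)) for x, c in sorted(counts.items())}
print(dist)  # {0: Fraction(1, 4), 1: Fraction(1, 2), 2: Fraction(1, 4)}
```

Note how the random variable collapses the four equally likely outcomes into three values with unequal probabilities, which is exactly what the sample space alone does not express.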

2.1 Discrete and Continuous Random Variables

Both examples in the previous section are discrete random variable scenarios. When a random variable X takes on a finite (or countable) set of distinct values, X is a discrete random variable. However, if we cannot list a set of distinct possible outcomes and instead have to rely on intervals to represent the probability distribution, we call X a continuous random variable.

2.2 Expected Value

Given a range of numbers, say 1 through 10, the average is a single value that we could use to describe that range. For a random variable, the analogous quantity is the Expected Value. The Expected Value, denoted E[X], is a weighted average of all possible values, where each outcome is multiplied by its corresponding probability.

For a discrete random variable, we express it as:

E[X] = \sum_{i=1}^{n} x_i \times P(x_i) \tag{13}
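As a quick check of equation (13), the expected value of a fair die works out to 3.5:

```python
# Expected value of a fair die via equation (13): sum of x * P(X = x).
from fractions import Fraction

pmf = {x: Fraction(1, 6) for x in range(1, 7)}   # P(X = x) = 1/6 for x = 1..6
E = sum(x * p for x, p in pmf.items())
print(E)  # 7/2, i.e. 3.5
```

Notice that 3.5 is not itself a possible outcome; the expected value is a weighted average, not a value the die can show.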

For continuous random variables, where the values exist across an infinite range of intervals, the expected value is calculated by integrating over the probability density function (we shall cover what this means in later sections):

E[X] = \int_{-\infty}^{\infty} x f(x) \, dx \tag{14}

2.3 Joint probability, Conditional Probability, Independence and Conditional Independence

Given events A and B, the probability of them happening together is called the joint probability; if A and B are random variables, we call it a joint probability distribution. Formally, the joint probability of events A and B is defined as:

P(A \land B) = P(A, B) \tag{15}

P(A \land B) = \underbrace{P(A|B)}_{\text{conditional probability}} \times P(B) \tag{16}

Two events are independent if the occurrence of one event does not influence the probability of the occurrence of the other. If the events are independent, we denote the joint probability as:

P(A, B) = P(A) \times P(B) \tag{17}

Independent events differ from mutually exclusive events, though they may look like the same concept. Mutually exclusive events are actually dependent: if one event occurs, the other cannot, meaning the occurrence of one directly affects the probability of the other.
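The independence/mutual-exclusivity contrast is easy to see numerically. In this toy two-dice example of my own, A = "first die is even" and B = "second die is even" are independent, while M = "first die is odd" is mutually exclusive with A and therefore dependent on it.

```python
# Independence vs. mutual exclusivity on a two-dice sample space.
from fractions import Fraction
from itertools import product

omega = list(product(range(1, 7), repeat=2))     # all 36 (die1, die2) pairs

def P(pred):
    """Probability of the event described by predicate pred."""
    return Fraction(sum(1 for w in omega if pred(w)), len(omega))

A = lambda w: w[0] % 2 == 0      # first die even
B = lambda w: w[1] % 2 == 0      # second die even
M = lambda w: w[0] % 2 == 1      # first die odd: mutually exclusive with A

# Independent: P(A and B) = P(A) * P(B)
assert P(lambda w: A(w) and B(w)) == P(A) * P(B)

# Mutually exclusive: P(A and M) = 0, yet P(A) * P(M) != 0,
# so A and M cannot be independent.
assert P(lambda w: A(w) and M(w)) == 0
assert P(A) * P(M) != 0
print("independence checks pass")
```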

With that in mind, it's worth noting that the joint probability is symmetric, in that P(A \land B) is the same as P(B \land A), as shown below:

P(A \land B) = P(A|B) \times P(B) = P(B|A) \times P(A) \tag{18}

And for independent events:

P(A \land B) = P(A) \times P(B) = P(B) \times P(A) \tag{19}

We say two events are conditionally independent if they are independent given the occurrence of a third event. Formally, events A and B are conditionally independent given event C if:

P(A \land B \mid C) = P(A|C) \times P(B|C) \tag{20}

3. Bayes Theorem

Bayes' theorem is a special case of conditional probability, denoted as:

P(A|B) = \frac{\overbrace{P(B|A)}^{\text{Likelihood}} P(A)}{\underbrace{P(B)}_{\text{marginal}}} \quad \text{where } P(B) \neq 0 \tag{21}

To better understand the above formula, we shall consider the Wikipedia example for when the specificity, sensitivity, and prevalence of a disease are known.

P(\text{user} \mid \text{positive}) = \frac{P(\text{positive} \mid \text{user}) P(\text{user})}{P(\text{positive})} \tag{22}

= \frac{P(\text{positive} \mid \text{user}) P(\text{user})}{P(\text{positive} \mid \text{user}) P(\text{user}) + P(\text{positive} \mid \text{non-user}) P(\text{non-user})} \tag{23}

Suppose a rare, serious flu has been reported overseas, and the flu you just contracted seems unusual. You visit the hospital, and the lab test shows you are positive for the overseas flu. The doctors note that the lab test correctly reports 95% of positive cases and 95% of negative cases, that is:

P(+ \mid \text{rare flu}) = 0.95

P(- \mid \text{no rare flu}) = 0.95

But P(+ \mid \text{rare flu}) is not the same as P(\text{rare flu} \mid +); the latter is what Bayes' theorem gives us and what we are interested in. So, as in the Wikipedia example, we calculate it as:

P(\text{rare flu} \mid +) = \frac{P(+ \mid \text{rare flu}) P(\text{rare flu})}{P(+ \mid \text{rare flu}) P(\text{rare flu}) + P(+ \mid \text{no rare flu}) P(\text{no rare flu})} \tag{24}

Assuming the incidence rate of the new overseas flu is 1 in 10,000 people, P(\text{rare flu}) would be 0.0001 and P(\text{no rare flu}) = 0.9999; therefore P(\text{rare flu} \mid +) would be:

= \frac{0.95 \times 0.0001}{0.95 \times 0.0001 + 0.05 \times 0.9999} \approx 0.0019

This translates to roughly a 0.19% chance of you actually having the overseas flu, not 95%. Equating the test's accuracy with the probability of having the disease is a common misconception.
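The rare-flu calculation of equation (24) written out in code:

```python
# Bayes' theorem for the rare-flu example, equation (24).
sens = 0.95          # P(+ | rare flu), sensitivity
spec = 0.95          # P(- | no rare flu), specificity
prior = 0.0001       # P(rare flu): 1 in 10,000

posterior = (sens * prior) / (sens * prior + (1 - spec) * (1 - prior))
print(round(posterior, 4))  # 0.0019, i.e. about a 0.19% chance
```

The tiny prior (0.0001) dominates the result: almost all positive results come from the huge pool of healthy people being tested, which is why the posterior is nowhere near the 95% test accuracy.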

Statistics

1. Basics

1.1 Variance, standard deviation, covariance and correlation

The variance measures the dispersion of the data points of a random variable X from its mean \bar{x}. A small variance indicates that the data is clustered around the mean, while higher values occur when the data points are far from the mean. We summarize variance with:

s^2 = \frac{\sum (X - \bar{x})^2}{N - 1} \quad \text{(divide by } N \text{ if it's a population)} \tag{25}

It can also be re-written as:

\text{Var}(X) = E\left[(X - \bar{x})^2\right] \quad \text{where } \bar{x} = E[X] \tag{26}

The standard deviation is the square root of the variance for sample data and it is denoted as:

s = \sqrt{\frac{1}{N - 1} \sum_{i=1}^{N} (x_i - \bar{x})^2} \tag{27}

Also re-written as:

\sigma = \sqrt{E[(X - \bar{x})^2]} \tag{28}

The main difference is that the standard deviation expresses the dispersion in the same units as the data points of the random variable, while the variance does not.
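Equations (25) and (27) can be computed by hand and compared against the standard library's `statistics` module, using a small made-up data set:

```python
# Sample variance and standard deviation, equations (25) and (27),
# checked against the statistics module on a made-up data set.
import math
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]
mean = statistics.mean(data)

# Sample variance: divide by N - 1
s2 = sum((x - mean) ** 2 for x in data) / (len(data) - 1)
assert math.isclose(s2, statistics.variance(data))
assert math.isclose(math.sqrt(s2), statistics.stdev(data))

# Population variance: divide by N instead
pop_var = sum((x - mean) ** 2 for x in data) / len(data)
assert math.isclose(pop_var, statistics.pvariance(data))
print(round(s2, 3), round(math.sqrt(s2), 3))
```

`statistics.variance`/`stdev` use the N - 1 (sample) denominator, while `pvariance`/`pstdev` use N, mirroring the note under equation (25).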

The covariance generalizes the concept of variance to two random variables. For two random variables X and Y, the covariance is the expected product of their deviations from their respective expected values, denoted as:

\text{Cov}(X, Y) = E\left[(X - E[X])(Y - E[Y])\right] \tag{29}

The sign of the resulting value shows the direction of the linear relationship between the variables. A positive result, \text{cov}(X, Y) > 0, happens when Y tends to increase as X increases, and vice versa. A negative result, \text{cov}(X, Y) < 0, happens when Y tends to decrease as X increases, and vice versa. A zero covariance means the variables are uncorrelated. For multiple variables, the square matrix that summarizes the covariances between all pairs of variables is the covariance matrix, usually represented as:

[cov(x1,x1)cov(x1,x2)cov(x1,xn)cov(x2,x1)cov(x2,x2)cov(x2,xn)cov(xn,x1)cov(xn,x2)cov(xn,xn)]\begin{bmatrix} \text{cov}(x_1, x_1) & \text{cov}(x_1, x_2) & \cdots & \text{cov}(x_1, x_n) \\ \text{cov}(x_2, x_1) & \text{cov}(x_2, x_2) & \cdots & \text{cov}(x_2, x_n) \\ \vdots & \vdots & \ddots & \vdots \\ \text{cov}(x_n, x_1) & \text{cov}(x_n, x_2) & \cdots & \text{cov}(x_n, x_n) \end{bmatrix}

The challenge with covariance is that its magnitude is unbounded and depends on the units of the variables, which makes it hard to interpret. For better interpretability, we can normalize the covariance to a fixed range of values. The Pearson correlation is a normalized covariance in the range [-1, 1], denoted as:

\text{Corr}(X, Y) = \frac{\text{cov}(X, Y)}{\sigma_X \sigma_Y} \tag{30}

= \frac{E\left[(X - E[X])(Y - E[Y])\right]}{\sqrt{E\left[(X - E[X])^2\right]} \sqrt{E\left[(Y - E[Y])^2\right]}} \tag{31}

The coefficients +1 and -1 show a perfect linear relationship between the variables, and 0 indicates no linear relationship. A positive coefficient indicates that when one variable increases, the other also increases.
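Covariance and Pearson correlation can be computed directly from their definitions in equations (29)-(31). The data set below is made up for illustration:

```python
# Covariance and Pearson correlation from their definitions,
# equations (29)-(31), on a small made-up data set.
import math

X = [1, 2, 3, 4, 5]
Y = [2, 4, 5, 4, 5]

n = len(X)
mx, my = sum(X) / n, sum(Y) / n       # E[X] and E[Y]

# Cov(X, Y): mean product of deviations
cov = sum((x - mx) * (y - my) for x, y in zip(X, Y)) / n

# Standard deviations (population form, to match the expectations above)
sx = math.sqrt(sum((x - mx) ** 2 for x in X) / n)
sy = math.sqrt(sum((y - my) ** 2 for y in Y) / n)

corr = cov / (sx * sy)                # equation (30)
print(round(cov, 3), round(corr, 3))  # 1.2 0.775
```

The positive covariance (1.2) says Y tends to rise with X, and the correlation (about 0.775) puts that tendency on the interpretable [-1, 1] scale.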

1.2 Sampling Distribution

First, the sampling distribution is a different concept from the probability distribution, though they are closely linked. Let's explore what it is with an example. Imagine you wanted to know the average height of the adults of a Nilo-Hamitic tribe in a rural village in Eastern Uganda. The village has 10,000 adults; you have limited time to ask or measure everyone, so instead you draw a sample of 100 adults and record their mean \bar{x}_1 = 192\text{ cm}. \bar{x}_1 is a sample statistic that summarizes your first sample. You continue drawing samples of 100 adults, say another 99 times (\bar{x}_1, \bar{x}_2, \dots, \bar{x}_{100}), meaning 10,000 people in total. The sampling distribution is then the distribution of these sample means across the repeated samples.

The Central Limit Theorem (CLT) states that if you take sufficiently large samples from a population with mean \mu and standard deviation \sigma, the distribution of the sample means will be approximately normal, regardless of the population's original distribution. For the CLT and subsequent inference to hold, three key conditions must be met: Randomness, where the samples are collected using a random process; Independence, typically satisfied if the sample size n is less than 10% of the population (the 10% Rule); and the Normal/Large Sample condition, which requires the population itself to be normal or the sample size to be at least n \ge 30.

Notably, sampling distributions can be constructed for various statistics, not just the sample mean; for instance, we can generate the sampling distribution of the sample proportion (\hat{p}) or the sample variance (s^2). However, the most important concept with the sampling distribution is the Standard Error (SE), which measures the standard deviation of a sampling distribution. While the population standard deviation (\sigma) describes the spread of individual data points, the Standard Error describes the spread of the sample statistic itself (e.g., SE = \sigma / \sqrt{n} for the mean). As the sample size n increases, the Standard Error decreases, meaning our sample estimate becomes more precise and more tightly clustered around the true population parameter.
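A small simulation makes the Standard Error concrete. The sketch below (a toy setup with a synthetic population of 10,000 "heights", not real data) repeatedly draws samples, measures the spread of the sample means, and compares it to the \sigma / \sqrt{n} prediction; the agreement is approximate because of simulation noise.

```python
# Simulating the sampling distribution of the mean: the spread of the
# sample means should shrink roughly like sigma / sqrt(n).
import random
import statistics

def sampling_se(population, n, draws=2000, seed=42):
    """Empirical standard error: std. dev. of many sample means of size n."""
    rng = random.Random(seed)
    means = [statistics.mean(rng.sample(population, n)) for _ in range(draws)]
    return statistics.pstdev(means)

rng = random.Random(0)
population = [rng.uniform(150, 210) for _ in range(10_000)]  # synthetic heights, cm
sigma = statistics.pstdev(population)

for n in (25, 100):
    observed = sampling_se(population, n)
    predicted = sigma / n ** 0.5
    print(f"n={n}: observed SE {observed:.2f}, predicted {predicted:.2f}")
```

Quadrupling the sample size from 25 to 100 roughly halves the Standard Error, exactly as the \sqrt{n} in the denominator predicts.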

2. Maximum Likelihood Estimation (MLE)

Maximum Likelihood Estimation (MLE) is a method used to estimate the parameters of a statistical model based on observed sample data. Essentially, it measures how well a model with specific parameter values explains the data that has been collected.

One fundamental idea with MLE is the assumption of i.i.d. (independent and identically distributed) observations. Independent means that the occurrence of one observation does not affect the probability of another. For example, in the height example, measuring one person's height gives you no information about the height of the next person selected. Identically distributed means every observation is drawn from the same underlying probability distribution, i.e., they all share the same parameters, such as the same population mean (\mu) and standard deviation (\sigma).

To calculate the likelihood of an entire sample, we assume that the individual observations are i.i.d. Under this assumption, the likelihood of observing the whole sample is the product of the individual probability densities:

L(x) = \prod_{i=1}^{n} f(x_i) \tag{32}

Where n represents the sample size. If we assume the data is sampled from a normal distribution, the likelihood function can be rewritten as:

L(x) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x_i - \mu)^2}{2\sigma^2}} \tag{33}

The fundamental goal of MLE is to find the parameter values (like the mean \mu or variance \sigma^2) that maximize L(x). In other words, it identifies the population parameters most likely to have generated the observed data given a specific distribution type.

The downside with Maximum Likelihood Estimation (MLE) is that the value of L(x) can become extremely small as the sample size grows. Additionally, when the variance is small, probability densities can exceed 1, which can make the likelihood function produce extremely large numbers. This is problematic because computers have a limited capacity for the number of digits they can store per number, meaning very large or very small numbers might be recorded as 0 or infinity instead of their actual values. To handle this, it is standard practice to use the natural logarithm of the likelihood function, denoted \log(L(x)). A useful property of logarithms is that the logarithm of a product of values equals the sum of the logarithms of those individual values, as summarized below:

\log(L(x)) = \sum_{i=1}^{n} \log(f(x_i)) \tag{34}

Apart from simple, trivial models, MLE often cannot be solved analytically. One approach is a grid search, where you try a sequence of parameter values to find the one that yields the maximum log-likelihood. However, this becomes inefficient when there are many parameters to estimate. Instead, non-linear optimizers are used. By convention, these optimizers are designed to minimize functions rather than maximize them, so the standard practice is to recast the problem as minimizing the Negative Log-Likelihood (NLL).
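The grid-search idea can be sketched in a few lines. Below, a toy data set of my own is fitted with a normal model whose \sigma is fixed at 1, and a grid search over candidate means minimizes the NLL; for this model the minimum lands at the sample mean, which is also the known closed-form MLE of \mu.

```python
# Grid search over the Negative Log-Likelihood (NLL) of a normal model
# with sigma fixed at 1 (toy example). The minimum should sit at the
# sample mean, the closed-form MLE of mu.
import math

data = [4.8, 5.1, 5.3, 4.9, 5.4]   # made-up observations

def nll(mu, xs, sigma=1.0):
    """NLL of a normal model: sum over -log f(x_i), cf. equation (34)."""
    return sum(0.5 * math.log(2 * math.pi * sigma ** 2)
               + (x - mu) ** 2 / (2 * sigma ** 2) for x in xs)

grid = [i / 100 for i in range(400, 601)]        # candidate means 4.00 .. 6.00
best = min(grid, key=lambda mu: nll(mu, data))

print(best, round(sum(data) / len(data), 2))     # 5.1 5.1
```

A real optimizer (e.g. gradient-based minimization of the same `nll` function) replaces the grid when there are several parameters, but the objective being minimized is exactly this NLL.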

It's key to note that, while the probability density function f(x) can represent any statistical distribution, e.g., a Normal, Bernoulli, or Poisson distribution, among others, the underlying objective of Maximum Likelihood Estimation (MLE) is to identify the specific parameter values that maximize the likelihood function L(x), thereby finding the parameters most likely to have generated the observed sample data.

For comments, please send me an email.