- Published on 17/12/2019
Understanding and Quantifying Uncertainties Related to Random Events
Probability
1. Basics
In this section, we start by defining some of the basic concepts that prepare us for conditional probability, Bayes' theorem, and more advanced concepts like distributions.
1.1 Sample Space and Events
A sample space is the set of all possible outcomes (the results) of an experiment. For example, a coin flip has sample space $\{H, T\}$, and rolling a die has $\{1, 2, 3, 4, 5, 6\}$. Events are a group of one or more outcomes of an experiment, that is, a subset of the sample space, e.g. observing at least one six after rolling a die 3 times. The complement of an event $A$ is the event that $A$ does not occur, and it is usually denoted as $A^c$ (or $A'$) for a set $A$.
1.2 Sets and Set Operations
A set is a collection of members where order and repetition do not matter; for example, $\{1, 2, 3\}$ and $\{3, 2, 1\}$ are considered equal because they contain the same elements. Commonly, sets are described by listing their members in braces, for example $\{1, 2, 3, 4, 5\}$, or with the set-builder shorthand $\{x \mid x \text{ is an integer and } 1 \le x \le 5\}$. Here, the symbol $\in$ denotes that an element belongs to a set, while $\notin$ is for non-membership. A set $A$ is a subset of $B$, denoted as $A \subseteq B$, if every element of $A$ is also in $B$. This relationship can be reversed to $B \supseteq A$, meaning $B$ is a superset of $A$. By convention, every set is considered a subset of itself.
Given sets $A$, $B$, and a sample space $S$, we can define some of the operations summarized below:
Indexed family of sets
Similar to the unions and intersections above, an indexed family of sets is a set whose elements are themselves sets, written $\{A_i\}_{i \in I}$. Their unions and intersections are commonly written as:

$$\bigcup_{i=1}^{n} A_i = A_1 \cup A_2 \cup \dots \cup A_n, \qquad \bigcap_{i=1}^{n} A_i = A_1 \cap A_2 \cap \dots \cap A_n$$
The definitions of union and intersection can be expressed naturally using logical quantifiers. An element $x$ belongs to the union if there exists at least one index $i$ such that $x \in A_i$. In contrast, $x$ belongs to the intersection only if for every index $i$, $x \in A_i$. Thus we have:

$$x \in \bigcup_{i=1}^{n} A_i \iff \exists i : x \in A_i, \qquad x \in \bigcap_{i=1}^{n} A_i \iff \forall i : x \in A_i$$
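These operations map directly onto Python's built-in `set` type; here is a minimal sketch (the element values are arbitrary, chosen only for illustration):

```python
# Two sets and a sample space (arbitrary illustrative values)
A = {1, 2, 3}
B = {3, 4, 5}
S = {1, 2, 3, 4, 5, 6}

union = A | B          # {1, 2, 3, 4, 5}
intersection = A & B   # {3}
difference = A - B     # {1, 2}
complement = S - A     # complement of A relative to S: {4, 5, 6}
is_subset = A <= S     # True: every element of A is in S

# Indexed family of sets: union/intersection over a collection
family = [{1, 2}, {2, 3}, {2, 4}]
family_union = set().union(*family)              # {1, 2, 3, 4}
family_intersection = set.intersection(*family)  # {2}
```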
1.3 Enumerative combinatorics
1.3.1 Additive and Multiplicative Principles
Let's imagine you are visiting your home area after a few years away, and you have very limited time for your visit. Assume you have 10 friends, each living at a different address or sub-area, and you also have 20 relatives, each living in their own area. Problem 1: if your time only allows you to see one person (either one friend or one relative), how many possibilities do you have to choose from? Problem 2: if your time allows you to visit one friend and one relative (meaning you visit two people), how many possibilities do you have?
We could agree that the first problem is not that hard. We just add 10 friends + 20 relatives, and your choice of whom to visit is from a pool of 30 people. We generally call this the Additive Principle, and it states:
If an event $A$ occurs in $m$ ways, an event $B$ occurs in $n$ ways, and the two events $A$ and $B$ are disjoint (both events can't happen at the same time, aka mutually exclusive), then the event $A$ or $B$ occurs in $m + n$ ways.
Formally, the Additive Principle given two sets $A$ and $B$ implies that if $A \cap B = \emptyset$, then $|A \cup B| = |A| + |B|$. Therefore, we can express the union of these sets in terms of cardinality and probability:

$$|A \cup B| = |A| + |B|, \qquad P(A \cup B) = P(A) + P(B)$$
For both of these scenarios, we could extend the logic to three or more events. For example, for three pairwise disjoint events, $|A \cup B \cup C| = |A| + |B| + |C|$.
On the contrary, the second problem is not quite as direct because it involves a Cartesian product. We call this the Multiplicative Principle, and it states that:
If an event $A$ occurs in $m$ ways and each possibility of $A$ allows for $n$ ways of event $B$, then events $A$ and $B$ can occur together in $m \times n$ ways.
For our two sets $A$ and $B$, we formally write this relationship as:

$$|A \times B| = |A| \cdot |B|$$
The key distinction is that while the first problem focused on the occurrence of event $A$ or $B$, the second problem focuses on the occurrence of both event $A$ and event $B$.
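Both principles are easy to check numerically; a small sketch using the friends-and-relatives scenario (the names are made up for illustration):

```python
from itertools import product

friends = [f"friend_{i}" for i in range(10)]      # 10 friends
relatives = [f"relative_{i}" for i in range(20)]  # 20 relatives

# Additive principle: visit ONE person from two disjoint pools
ways_either = len(friends) + len(relatives)       # 30

# Multiplicative principle: visit one friend AND one relative
ways_both = len(list(product(friends, relatives)))  # 10 * 20 = 200
```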
1.3.2 Permutations and Combinations
A permutation is the number of different ways we can pick/select/arrange some or all items from a set when order is important. That is, we order $r$ items from $n$ choices (the set), counting every distinct ordering and stopping after we have placed $r$ items. The formula is summarized as:

$$P(n, r) = \frac{n!}{(n - r)!}$$
Let's assume we had a set of 5 books $\{a, b, c, d, e\}$; a permutation of 2 books from the five would be $P(5, 2) = \frac{5!}{3!} = 20$.
Similarly, if we wanted the total number of possible orderings of the first 3 books: $P(5, 3) = \frac{5!}{2!} = 60$.
However, when we want to select a number of items and their order or way of selecting does not matter, we call this a combination. It is usually summarized with the formula:

$$C(n, r) = \binom{n}{r} = \frac{n!}{r!\,(n - r)!}$$
For example, using the same set of 5 books $\{a, b, c, d, e\}$, if we wanted to choose a group of 2 books where the order doesn't matter: $C(5, 2) = \frac{5!}{2!\,3!} = 10$.
In this case, a selection of $\{a, b\}$ is considered the same as $\{b, a\}$, which is why the total number of ways is half that of the permutation ($20 / 2! = 10$).
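Python's standard library exposes these counts directly (`math.perm` and `math.comb` require Python 3.8+); sketching the book example:

```python
from math import comb, perm

n = 5  # books {a, b, c, d, e}

p_5_2 = perm(n, 2)  # ordered selections of 2 books: 20
p_5_3 = perm(n, 3)  # ordered selections of 3 books: 60
c_5_2 = comb(n, 2)  # unordered selections of 2 books: 10

# A combination is the permutation divided by the 2! orderings of each pair
assert c_5_2 == p_5_2 // 2
```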
1.3.3 Permutation and Combination with Repetition
For permutations with repetition, the order in which you select the items remains important, but you are free to pick the same item multiple times. Since each of the $r$ selections you make has $n$ possible choices, the total number of arrangements is simply $n$ raised to the power of $r$, i.e. $n^r$.
Using our previous example of the 5 books $\{a, b, c, d, e\}$, if you were to select 3 books where you could pick the same book more than once (for example, the sequence $(a, a, b)$), you would have 5 choices for the first position, 5 for the second, and 5 for the third. This results in $5^3 = 125$ total permutations.
Combinations with repetition, also called multisets, happen when we are selecting a group of items where the order does not matter and we can include multiples of the same item, summarized as:

$$\binom{n + r - 1}{r} = \frac{(n + r - 1)!}{r!\,(n - 1)!}$$
With our 5 books example, choosing a set of 3 books with repetition allowed would be: $\binom{5 + 3 - 1}{3} = \binom{7}{3} = 35$.
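We can verify both counts by brute-force enumeration with `itertools`; a sketch with the 5 books:

```python
from itertools import combinations_with_replacement, product
from math import comb

books = ["a", "b", "c", "d", "e"]

# Permutations with repetition: n ** r ordered sequences
perms_rep = len(list(product(books, repeat=3)))  # 5 ** 3 = 125

# Combinations with repetition (multisets): C(n + r - 1, r)
combs_rep = len(list(combinations_with_replacement(books, 3)))
assert combs_rep == comb(5 + 3 - 1, 3)  # 35
```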
In summary, the key points to remember are:

- Order matters, no repetition (permutation): $P(n, r) = \frac{n!}{(n-r)!}$
- Order matters, with repetition: $n^r$
- Order doesn't matter, no repetition (combination): $\binom{n}{r} = \frac{n!}{r!\,(n-r)!}$
- Order doesn't matter, with repetition (multiset): $\binom{n+r-1}{r}$
2. Random Variables
Let's use a coin flip and a six-faced die to explain what random variables are and how they differ from a sample space. We know the sample space of the die will be $\{1, 2, 3, 4, 5, 6\}$, and $\{H, T\}$ for the coin flip. We could define the random variable for the die as $X = \text{the number rolled}$, and the probability of each outcome when a fair die is rolled would be $\frac{1}{6}$ for each value of the random variable:

$$P(X = x) = \frac{1}{6}, \quad x \in \{1, 2, 3, 4, 5, 6\}$$
For the coin, let's assume you have 2 chances to flip and our interest is to observe the number of heads. We could summarize the possible outcomes as $\{HH, HT, TH, TT\}$, with the random variable $X$ (the number of heads) taking the values $\{0, 1, 2\}$.
We could further summarize the above as a probability distribution over the values of the random variable $X$:

$$P(X = 0) = \frac{1}{4}, \quad P(X = 1) = \frac{1}{2}, \quad P(X = 2) = \frac{1}{4}$$
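This distribution can be derived by enumerating the sample space and counting heads; a small sketch:

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Sample space of two coin flips: HH, HT, TH, TT
outcomes = list(product("HT", repeat=2))

# Random variable X = number of heads in each outcome
counts = Counter(outcome.count("H") for outcome in outcomes)
dist = {x: Fraction(c, len(outcomes)) for x, c in counts.items()}
# dist maps 0 -> 1/4, 1 -> 1/2, 2 -> 1/4
```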
With the above two examples, we can clearly see the difference between the sample space and random variables. The random variable provides information for expressing the probability distribution of the possible outcomes.
2.1 Discrete and Continuous Random Variables
Both examples in the previous section are discrete random variable scenarios. When the random variable takes values in a finite (or countable) set, then $X$ is a discrete random variable. However, if we cannot create a finite set of distinct possible outcomes but have to rely on intervals to represent the probability distribution, we call this a continuous random variable.
2.2 Expected Value
Given a range of numbers, say 1 through 10, the average is a single value that we could use to describe that range. For a random variable, the analogous concept is called the Expected Value. The Expected Value, denoted as $E[X]$, is a weighted average of all possible values, where each possible outcome is multiplied by its corresponding probability.
For a discrete random variable, we express it as:

$$E[X] = \sum_{x} x \, P(X = x)$$
For continuous random variables, where the values exist across an infinite range of intervals, the expected value is calculated by integrating over the probability density function $f(x)$ (we shall cover what this means in later sections):

$$E[X] = \int_{-\infty}^{\infty} x \, f(x) \, dx$$
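For the discrete case, the fair-die example makes the weighted average concrete:

```python
from fractions import Fraction

# Fair six-sided die: P(X = x) = 1/6 for each face
die = {x: Fraction(1, 6) for x in range(1, 7)}

# E[X] = sum over x of x * P(X = x)
expected = sum(x * p for x, p in die.items())
# expected is 7/2, i.e. 3.5
```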
2.3 Joint probability, Conditional Probability, Independence and Conditional Independence
Given events $A$ and $B$, the probability of them happening together is called the joint probability; however, if $A$ and $B$ are random variables, we call it a joint probability distribution. Formally, the joint probability of events $A$ and $B$ is written as:

$$P(A \cap B), \text{ also denoted } P(A, B)$$
Two events are independent if the occurrence of one event does not influence the probability of the occurrence of the other. If the events are independent, we denote the joint probability as:

$$P(A \cap B) = P(A) \, P(B)$$
Independent events differ from mutually exclusive events, though they may look like the same concept. Mutually exclusive events are actually dependent: if one event occurs, the other cannot, meaning the occurrence of one directly affects the probability of the other event.
With that in mind, it's worth noting that the joint probability is symmetric, in that $P(A \cap B)$ is the same as $P(B \cap A)$, as shown below:

$$P(A \cap B) = P(B \cap A)$$
And for independent events: $P(A)\,P(B) = P(B)\,P(A)$.
We say two events are conditionally independent if they are independent given the occurrence of a third event. Formally, events $A$ and $B$ are conditionally independent given event $C$ if:

$$P(A \cap B \mid C) = P(A \mid C) \, P(B \mid C)$$
3. Bayes Theorem
Bayes' theorem is a case of conditional probability that is denoted as:

$$P(A \mid B) = \frac{P(B \mid A) \, P(A)}{P(B)}$$
To better understand the above formula, we shall consider the Wikipedia example for when the specificity, sensitivity, and prevalence of a disease are known.
There is currently a rare, serious flu reported overseas, and let's assume the flu you just contracted seems unusual, so you suspect it. You visit the hospital, and the lab test shows you are positive for the overseas flu. The doctors reiterate that the lab test reports 95% of the positive cases accurately, and likewise 95% when the test is negative, that is:

$$P(\text{Positive} \mid \text{Flu}) = 0.95, \qquad P(\text{Negative} \mid \text{No Flu}) = 0.95$$
But $P(\text{Positive} \mid \text{Flu})$ is not the same as $P(\text{Flu} \mid \text{Positive})$; the latter is what Bayes' theorem gives us and what we are interested in. So, as in the Wikipedia example, we shall calculate it as:

$$P(\text{Flu} \mid \text{Positive}) = \frac{P(\text{Positive} \mid \text{Flu}) \, P(\text{Flu})}{P(\text{Positive} \mid \text{Flu}) \, P(\text{Flu}) + P(\text{Positive} \mid \text{No Flu}) \, P(\text{No Flu})}$$
Assuming the incidence ratio for the new overseas flu is 1 in 10,000 people, $P(\text{Flu})$ would be $0.0001$, and therefore $P(\text{Flu} \mid \text{Positive})$ would be:

$$P(\text{Flu} \mid \text{Positive}) = \frac{0.95 \times 0.0001}{0.95 \times 0.0001 + 0.05 \times 0.9999} \approx 0.0019$$
This translates to roughly a 0.2% chance of you actually having the overseas flu, not 95%; conflating the test's accuracy with the probability of having the disease is a common misconception (the base rate fallacy).
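The calculation can be sketched as a small helper (the function name and parameters are mine, not a standard API):

```python
def flu_posterior(prevalence, sensitivity, specificity):
    """P(flu | positive test) via Bayes' theorem.

    P(positive) = P(pos | flu) P(flu) + P(pos | no flu) P(no flu),
    where P(pos | no flu) is the false-positive rate, 1 - specificity.
    """
    p_positive = sensitivity * prevalence + (1 - specificity) * (1 - prevalence)
    return sensitivity * prevalence / p_positive

posterior = flu_posterior(prevalence=1 / 10_000, sensitivity=0.95, specificity=0.95)
# posterior is roughly 0.0019, i.e. about 0.2%
```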
Statistics
1. Basics
1.1 Variance, standard deviation, covariance and correlation
The variance measures the dispersion of data points from the mean $\mu$ of a random variable $X$. A small variance indicates that the data is clustered around the mean, with higher values when the data points are far from the mean. We summarize variance with:

$$\operatorname{Var}(X) = E\big[(X - \mu)^2\big]$$
It can also be re-written as:

$$\operatorname{Var}(X) = E[X^2] - (E[X])^2$$
The standard deviation is the square root of the variance; for sample data it is denoted as:

$$s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2}$$
Also re-written, for a random variable $X$, as: $\sigma = \sqrt{\operatorname{Var}(X)}$.
The main difference is that the standard deviation expresses the dispersion in the same units as the data points of the random variable, while the variance does not (it is in squared units).
The covariance generalizes the concept of variance to two random variables. For two random variables $X$ and $Y$, the covariance is the expected product of their deviations from their respective expected values, denoted as:

$$\operatorname{Cov}(X, Y) = E\big[(X - E[X])(Y - E[Y])\big]$$
The sign of the resulting value shows the direction of the linear relationship between the variables. A positive result happens if $Y$ tends to increase when $X$ increases, and vice-versa; a negative result happens if $Y$ tends to decrease when $X$ increases, and vice-versa. A zero covariance means the variables are uncorrelated. For multiple variables, the square matrix that summarizes the covariances between every pair of variables is the covariance matrix, usually represented as $\Sigma$, with entries $\Sigma_{ij} = \operatorname{Cov}(X_i, X_j)$.
The challenge with the covariance is that it can become arbitrarily large, positive or negative. For better interpretability, we can normalize the covariance to a fixed range of values. The Pearson correlation is a normalized covariance in the range $[-1, 1]$, denoted as:

$$\rho_{X,Y} = \frac{\operatorname{Cov}(X, Y)}{\sigma_X \sigma_Y}$$
The coefficients $-1$ and $+1$ show a perfect linear relationship between the variables, and $0$ indicates no linear relationship. Positive coefficients indicate that when one variable increases, the other also increases.
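A quick numeric check of covariance and Pearson correlation using the standard library (the data values are made up; `y` is roughly `2 * x`, so the correlation should be close to $+1$):

```python
from statistics import mean, pstdev

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 8.0, 9.8]  # roughly 2 * x

n = len(x)
mx, my = mean(x), mean(y)

# Population covariance: average product of deviations from the means
cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n

# Pearson correlation: covariance normalized by the standard deviations
corr = cov / (pstdev(x) * pstdev(y))
# cov is positive and corr sits just under 1.0 for this near-linear data
```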
1.2 Sampling Distribution
First, the sampling distribution is a different concept from the probability distribution, though they are closely linked. Let's explore what it is with an example. Imagine you wanted to know the average height of adults of a nilo-hermite tribe in a rural village in Eastern Uganda. The village has 10,000 adults; you have limited time to measure everyone, so instead you decide to draw a sample of 100 adults and record their mean $\bar{x}_1$. $\bar{x}_1$ is a sample statistic that summarizes your first sample. You continue drawing samples of 100 adults, say another 99 times ($\bar{x}_2, \dots, \bar{x}_{100}$), meaning 10,000 measurements in total. The sampling distribution is then the distribution of these 100 sample means, not of the individual adults drawn from the population.
The Central Limit Theorem (CLT) states that if you take sufficiently large samples from a population with mean $\mu$ and standard deviation $\sigma$, the distribution of the sample means will be approximately normal, regardless of the population's original distribution. For the CLT and subsequent inference to hold, three key conditions must be met: Randomness, where the samples are collected using a random process; Independence, typically satisfied if the sample size is less than 10% of the population (the 10% Rule); and the Normal/Large Sample condition, which requires the population itself to be normal or the sample size to be at least $n \ge 30$.
Notably, sampling distributions can be constructed for various statistics, not just the sample mean; for instance, we can generate the sampling distribution of the sample proportion ($\hat{p}$) or the sample variance ($s^2$). However, the most important concept with the sampling distribution is the Standard Error (SE), which is the standard deviation of a sampling distribution. While the population standard deviation ($\sigma$) describes the spread of individual data points, the Standard Error describes the spread of the sample statistic itself (e.g., $SE = \frac{\sigma}{\sqrt{n}}$ for the mean). As the sample size increases, the Standard Error decreases, meaning our sample estimate becomes more precise and more tightly clustered around the true population parameter.
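A simulation makes the Standard Error tangible; a sketch with a made-up population of 10,000 heights (the distribution parameters are assumptions for illustration only):

```python
import random
from statistics import mean, pstdev

random.seed(42)  # fixed seed so the run is reproducible

# Hypothetical population: 10,000 adult heights in cm
population = [random.gauss(170, 8) for _ in range(10_000)]

n = 100  # sample size (well under 10% of the population)
sample_means = [mean(random.sample(population, n)) for _ in range(100)]

# The SE of the mean is the spread of the sample means, approx sigma / sqrt(n)
se_observed = pstdev(sample_means)
se_theory = pstdev(population) / n ** 0.5  # roughly 0.8 here
```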
2. Maximum Likelihood Estimation (MLE)
Maximum Likelihood Estimation (MLE) is a method used to estimate the parameters of a statistical model based on observed sample data. Essentially, it measures how well a model with specific parameter values explains the data that has been collected.
One fundamental idea with MLE is the assumption of i.i.d. (independent and identically distributed) observations. Independent means that the occurrence of one observation does not affect the probability of another. For example, in the height example, measuring one person's height gives you no information about the height of the next person selected. Identically Distributed means every observation is drawn from the same underlying probability distribution, i.e., they all share the same parameters, such as the same population mean () and standard deviation ().
To calculate the likelihood of an entire sample, we assume that the individual observations are i.i.d. Under this assumption, the likelihood of observing the whole sample is the product of the individual probability densities:

$$L(\theta) = \prod_{i=1}^{n} f(x_i \mid \theta)$$
Where $n$ represents the sample size. If we assume the data is sampled from a normal distribution, the likelihood function can be rewritten as:

$$L(\mu, \sigma^2) = \prod_{i=1}^{n} \frac{1}{\sigma \sqrt{2\pi}} \exp\!\left(-\frac{(x_i - \mu)^2}{2\sigma^2}\right)$$
The fundamental goal of MLE is to find the parameter values (like the mean $\mu$ or variance $\sigma^2$) that maximize $L(\theta)$. In other words, it identifies the population parameters most likely to have generated the observed data given a specific distribution type.
The downside with Maximum Likelihood Estimation (MLE) is that the value of $L(\theta)$ can become extremely small as the sample size grows. Additionally, when the variance is small, it is possible to have probability densities higher than 1, which can result in the likelihood function producing extremely large numbers. This is problematic because computers have a limited capacity for the number of digits they can store for each given number, meaning very large or very small numbers might be recorded as 0 or Infinity instead of their actual values. To handle this, it is standard practice to use the natural logarithm of the likelihood function, denoted as $\ell(\theta) = \log L(\theta)$. A useful property of logarithms is that the logarithm of a product of values is equal to the sum of the logarithms of those individual values, as summarized below:

$$\log \prod_{i=1}^{n} f(x_i \mid \theta) = \sum_{i=1}^{n} \log f(x_i \mid \theta)$$
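The product-versus-sum identity is easy to verify numerically; a sketch with made-up data under a normal model:

```python
import math

data = [4.8, 5.1, 5.0, 4.9, 5.2]  # made-up observations
mu, sigma = 5.0, 0.2              # candidate parameter values

def normal_pdf(x, mu, sigma):
    """Normal probability density function."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

# Likelihood: product of densities (can under/overflow for large samples)
likelihood = math.prod(normal_pdf(x, mu, sigma) for x in data)

# Log-likelihood: sum of log densities (numerically stable)
log_likelihood = sum(math.log(normal_pdf(x, mu, sigma)) for x in data)

assert math.isclose(math.log(likelihood), log_likelihood)
```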
Apart from simple, trivial models, MLE often cannot be solved analytically. One approach is a grid search, where you try a sequence of parameter values to find the one that yields the maximum log-likelihood. However, this becomes inefficient when there are many parameters to estimate. Instead, non-linear optimizers are used. By convention, these optimizers are designed to minimize functions rather than maximize them. Therefore, the standard practice is to recast the problem as minimizing the Negative Log-Likelihood (NLL).
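A minimal grid search over the NLL, assuming a normal model with known $\sigma$ and estimating only $\mu$ (the data and grid are made up for illustration); the minimizer should land on the sample mean:

```python
import math
from statistics import mean

data = [4.8, 5.1, 5.0, 4.9, 5.2]  # made-up observations
sigma = 0.2                       # assumed known

def nll(mu):
    """Negative log-likelihood of the data under Normal(mu, sigma)."""
    const = math.log(sigma * math.sqrt(2 * math.pi))
    return sum((x - mu) ** 2 / (2 * sigma ** 2) + const for x in data)

# Grid search: evaluate candidate values of mu, keep the minimizer
grid = [4.5 + 0.01 * i for i in range(101)]  # 4.50, 4.51, ..., 5.50
mu_hat = min(grid, key=nll)
# For a normal model, the MLE of mu coincides with the sample mean
assert math.isclose(mu_hat, mean(data))
```

In practice one would hand `nll` to a numerical optimizer instead of a grid, but the objective being minimized is the same.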
It's key to note that, while the probability density function $f$ can represent any statistical distribution, e.g. the Normal, Bernoulli, or Poisson distribution, among others, the underlying objective of Maximum Likelihood Estimation (MLE) remains the same: to identify the specific population parameter values that maximize the likelihood function $L(\theta)$, thereby finding the parameters most likely to have generated the observed sample data.
For comments, please send me an email.