Published on
15/05/2020

A Sample of Probability Distributions and Their Properties

Probability distributions can be broadly categorised into discrete and continuous distributions based on the nature of the random variable they model. Discrete probability distributions model random variables that take on a finite or countably infinite set of values, whereas continuous distributions apply to random variables that can take any value within a specified interval of the real number line. This article surveys common distributions of both types, detailing their probability mass or density functions and key statistical properties.

Discrete Probability Distributions

Binomial Distribution

The binomial distribution is used when an experiment consists of a fixed number of independent trials, where each trial results in only two possible outcomes, often categorized as success or failure. If the probability of success in each individual trial is denoted by p and there are n total trials, we say the random variable X, which represents the total number of successes, follows a binomial distribution, denoted X \sim \text{Bin}(n, p). To determine the probability of observing exactly x successes across those n trials, we use the probability mass function. This formula accounts for both the probability of the successes and failures occurring, as well as the different sequences in which they can appear:

P(X = x) = \binom{n}{x} p^x (1 - p)^{n - x} \tag{1}

The term \binom{n}{x} is known as the binomial coefficient, calculated as \frac{n!}{x!(n - x)!}, and it represents the total number of distinct ways to choose x successes from n available trials. The rest of the equation, p^x (1 - p)^{n - x}, calculates the probability of one specific sequence of x successes and n - x failures.

Two key properties describe the center and spread of this distribution. The expected value of X, which represents the mean number of successes we would expect over many repetitions of the n trials, is simply the product of the number of trials and the probability of success as below:

E(X) = np \tag{2}

Additionally, the variance, which quantifies how much the number of successes typically deviates from that mean, is given by:

V(X) = np(1 - p) \tag{3}

As the probability of failure (1 - p) increases, or as n increases, the potential spread of outcomes changes accordingly.

As an example, consider a biased coin that lands heads with probability p = 0.1 on each toss. To find the probability of seeing exactly 2 heads in n = 6 tosses, we use the probability mass function. The binomial coefficient \binom{6}{2}, which represents the number of ways to arrange 2 successes in 6 trials, is:

\binom{6}{2} = \frac{6!}{2!(6 - 2)!} = \frac{6 \times 5}{2 \times 1} = 15

With this, we get:

f(2, 6, 0.1) = 15 \times 0.1^2 \times 0.9^4 = 0.098415

The expected number of heads is E(X) = 6 \times 0.1 = 0.6. The variance, which measures the spread of these outcomes, is V(X) = 6 \times 0.1 \times 0.9 = 0.54.

Let's consider another scenario, where a manufacturing plant produces light bulbs with a known defect rate of 5%. If we randomly select a batch of n = 10 bulbs to test, the probability p of finding a defective bulb is 0.05. Observing exactly 2 defective bulbs means 2 successes out of 10 trials, which we calculate as:

\binom{10}{2} = \frac{10!}{2!(10 - 2)!} = \frac{10 \times 9}{2 \times 1} = 45

f(2, 10, 0.05) = 45 \times 0.05^2 \times 0.95^8 = 0.074635

The expected number of defective bulbs is E(X) = 10 \times 0.05 = 0.5. The variance for this distribution is V(X) = 10 \times 0.05 \times 0.95 = 0.475.
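Both worked examples can be verified with a few lines of code. The sketch below uses only the standard library; `binom_pmf` is a helper name chosen here for illustration, not a library function.

```python
from math import comb

def binom_pmf(x, n, p):
    """P(X = x) for X ~ Bin(n, p): number of arrangements times the
    probability of one specific success/failure sequence."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

# Biased coin: exactly 2 heads in 6 tosses with p = 0.1
print(round(binom_pmf(2, 6, 0.1), 6))    # 0.098415

# Light bulbs: exactly 2 defects in 10 bulbs with p = 0.05
print(round(binom_pmf(2, 10, 0.05), 6))  # 0.074635

# Mean and variance of the bulb example: E(X) = np, V(X) = np(1 - p)
n, p = 10, 0.05
print(n * p, n * p * (1 - p))            # 0.5 0.475
```

The pmf values sum to 1 over x = 0, ..., n, which is a quick sanity check when adapting the helper to other parameters.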

Geometric Distribution

The geometric distribution models the number of independent trials required to achieve the first success. We assume that each trial is independent and has the same probability of success p. If we define the random variable X as the trial number on which the first success occurs, we say that X follows a geometric distribution, denoted X \sim \text{Geom}(p). To find the probability that the first success occurs exactly on trial x, we use the probability mass function: we must observe exactly x - 1 consecutive failures before finally reaching a success on the x-th attempt:

P(X = x) = (1 - p)^{x - 1} p \tag{4}

The term (1 - p)^{x - 1} calculates the probability of the initial failures, while p accounts for the success that stops the sequence. This distribution is particularly useful for "waiting-time" problems, for example predicting the number of coin tosses needed to see the first head or the number of attempts required to pass a quality control test. The expected value, which provides the average number of trials one would expect to perform before succeeding, is given by:

E(X) = \frac{1}{p} \tag{5}

This inverse relationship means that as the probability of success decreases, the expected number of attempts increases proportionally. The variance, which measures the spread or uncertainty of when that first success will happen, is calculated as:

V(X) = \frac{1 - p}{p^2} \tag{6}

This distribution also has a unique "memoryless" property, implying that the probability of success on the next trial does not depend on how many failures have already occurred.

Let's consider an example of rolling a fair six-sided die repeatedly until a "1" is seen. Since there is one success out of six possible outcomes, the probability of success is p = 1/6. Following the expected value formula, the average number of rolls needed to see that first "1" is E(X) = \frac{1}{1/6} = 6. It is also often useful to distinguish between the total number of trials and the number of failures that occur before the first success. In this case, the average number of failures is calculated as \frac{1 - p}{p}, which simplifies to \frac{1 - 1/6}{1/6} = 5.
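The die example translates directly to code. This is a minimal sketch; `geom_pmf` is an illustrative helper name, not a library call.

```python
def geom_pmf(x, p):
    """P(X = x) for X ~ Geom(p): x - 1 failures, then one success."""
    return (1 - p)**(x - 1) * p

p = 1 / 6  # probability of rolling a "1" on a fair die

# Average number of rolls until the first "1", and average failures before it
mean_trials = 1 / p            # E(X) = 1/p, i.e. 6 rolls on average
mean_failures = (1 - p) / p    # 5 failures on average before the first "1"

# Probability the first "1" appears exactly on roll 3: two misses, then a hit
print(geom_pmf(3, p))
print(mean_trials, mean_failures)
```

Summing `geom_pmf(x, p)` over x = 1, 2, 3, ... converges to 1, confirming it is a valid pmf.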

Negative Binomial Distribution

This models the number of trials needed to achieve a specific outcome, but shifts the focus from a fixed number of attempts to a fixed number of successes. While a standard binomial distribution counts how many successes occur in n trials, the negative binomial distribution reverses this logic by continuing the trials until the r-th success is observed. This makes it an extension of the geometric distribution: where the geometric models the trials until the first success, the negative binomial models the journey toward the r-th success. This is invaluable in fields like sales or quality control, where one needs to predict how many "failures" or "rejections" will be faced before hitting a specific target of successful outcomes.

If we define X as the number of failures that occur before reaching the r-th success, we denote this as X \sim \text{NegBin}(r, p). The probability mass function for this distribution is:

P(X = x) = \binom{x + r - 1}{r - 1} p^r (1 - p)^x \tag{7}

The binomial coefficient \binom{x + r - 1}{r - 1} is calculated as \frac{(x + r - 1)!}{x!(r - 1)!}. It represents the total number of ways to arrange the first r - 1 successes within the initial x + r - 1 trials, ensuring the last trial is the r-th success.

The mean number of failures before obtaining rr successes is:

E(X) = \frac{r(1 - p)}{p} \tag{8}

And the variance is:

V(X) = \frac{r(1 - p)}{p^2} \tag{9}

The negative binomial distribution can be expressed in several ways depending on which variable you are solving for and which outcome serves as the stopping criterion, as shown below:

\begin{array}{ll} \text{Goal} & \text{Formula } P(X=x) \\ \hline k \text{ failures, given } r \text{ successes} & \binom{k+r-1}{k} p^r (1-p)^k \\ n \text{ trials, given } r \text{ successes} & \binom{n-1}{r-1} p^r (1-p)^{n-r} \\ n \text{ trials, given } r \text{ failures} & \binom{n-1}{r-1} (1-p)^r p^{n-r} \\ k \text{ successes, given } r \text{ failures} & \binom{k+r-1}{k} (1-p)^r p^k \\ \end{array}

For the second alternative, the objective is to determine the total number of trials needed to reach a success threshold; the formula changes slightly to account for the fact that the total sample size is now the random variable. The fourth version models the number of successes achieved before a failure threshold is reached.

The key difference is that in a binomial distribution you have a fixed number of trials n and count how many successes k occur, whereas in a negative binomial distribution the number of successes (or failures) is fixed, and the number of trials is the random variable that continues until that target is met.
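The counting logic of equation (7), and its relation to the geometric distribution, can be sketched as follows. `negbin_pmf` is a hypothetical helper name; with r = 1 it should reduce to a geometric count of failures before the first success.

```python
from math import comb

def negbin_pmf(x, r, p):
    """P(X = x) failures before the r-th success: the last trial must be
    the r-th success, so we arrange r - 1 successes among the first
    x + r - 1 trials."""
    return comb(x + r - 1, r - 1) * p**r * (1 - p)**x

p = 0.3

# With r = 1 the negative binomial reduces to the geometric case
for k in range(5):
    assert abs(negbin_pmf(k, 1, p) - (1 - p)**k * p) < 1e-12

# Mean and variance of failures before the 3rd success: r(1-p)/p, r(1-p)/p^2
r = 3
print(r * (1 - p) / p, r * (1 - p) / p**2)
```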

Poisson Distribution

This models how many times an event occurs within a specific window of time or space. It is built on the assumption that events happen independently and at a constant average rate \lambda, with k denoting the number of occurrences. We say that X \sim \text{Poisson}(\lambda), and the probability of observing exactly k events is:

P(X = k) = \frac{\lambda^k e^{-\lambda}}{k!} \tag{10}

For the Poisson distribution, the mean and variance are identical, both being equal to the average rate \lambda, which implies that as the average number of events increases, the spread or uncertainty of the distribution increases at the exact same rate.
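Equation (10) and the mean-equals-variance property can be checked numerically. A minimal sketch; the rate of 4 events per interval is a hypothetical choice, and `poisson_pmf` is an illustrative name.

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    """P(X = k) for X ~ Poisson(lam)."""
    return lam**k * exp(-lam) / factorial(k)

lam = 4.0  # hypothetical average rate, e.g. 4 arrivals per hour

# Probability of observing exactly 2 events in the interval
print(poisson_pmf(2, lam))

# Mean and variance both equal lam; check numerically over a long tail
mean = sum(k * poisson_pmf(k, lam) for k in range(100))
var = sum((k - mean)**2 * poisson_pmf(k, lam) for k in range(100))
print(round(mean, 6), round(var, 6))  # both approximately 4.0
```

The tail is truncated at k = 100, where the remaining Poisson(4) mass is negligible.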

Here is the summary table for all the common discrete distributions above:

\begin{array}{lllll} \text{Distribution} & \text{Random Variable } X & \text{PMF} & \text{Mean } E(X) & \text{Variance } V(X) \\ \hline \text{Binomial} & \text{\# of successes in } n \text{ trials} & \binom{n}{x}p^x(1-p)^{n-x} & np & np(1-p) \\ \text{Geometric} & \text{Trials until 1st success} & (1-p)^{x-1}p & \frac{1}{p} & \frac{1-p}{p^2} \\ \text{Negative Binomial} & \text{Failures before } r\text{th success} & \binom{x+r-1}{r-1}p^r(1-p)^x & \frac{r(1-p)}{p} & \frac{r(1-p)}{p^2} \\ \text{Poisson} & \text{\# of events in interval} & \frac{e^{-\lambda}\lambda^x}{x!} & \lambda & \lambda \end{array}

Continuous Random Variables and Probability Distributions

Uniform Distribution

This continuous distribution models scenarios where every outcome within a range is equally probable. When a continuous random variable X is defined over an interval [A, B] and no particular value is more likely than another, we say that X \sim \text{Uniform}(A, B). Its density is defined as:

f(x) = \begin{cases} \frac{1}{B - A}, & A \leq x \leq B \\ 0, & \text{otherwise} \end{cases}

Because the probability is spread evenly, the PDF is a flat rectangle with a height of 1/(B - A), and the total area under the curve is exactly 1. The expected value is:

E(X) = \frac{A + B}{2} \tag{11}

The spread of the data, or the variance, is calculated with:

V(X) = \frac{(B - A)^2}{12} \tag{12}

A classic example of this distribution is random number generation, where we must pick a value, e.g. between 0 and 1, with perfect neutrality. Another common case is modeling "maximum uncertainty" scenarios, such as estimating waiting times with no prior data regarding a system's variability.
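The mean and variance formulas above are easy to confirm by simulation. A rough sketch using the standard library `random` module, seeded for reproducibility:

```python
import random

A, B = 0.0, 1.0
random.seed(42)

# Draw many Uniform(0, 1) samples and compare to the theory
samples = [random.uniform(A, B) for _ in range(100_000)]
sample_mean = sum(samples) / len(samples)
sample_var = sum((x - sample_mean)**2 for x in samples) / len(samples)

print(sample_mean)   # close to (A + B)/2 = 0.5
print(sample_var)    # close to (B - A)^2 / 12, about 0.0833
```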

Normal (Gaussian) Distribution

The normal or Gaussian distribution is the most common continuous probability distribution in statistics, primarily due to the Central Limit Theorem (CLT). The CLT states that the sum or average of a large number of independent random variables will tend toward a normal distribution, regardless of the original distribution's shape.

When a continuous random variable X follows a normal distribution with a mean \mu and a variance \sigma^2, we denote it as X \sim \mathcal{N}(\mu, \sigma^2). Its probability density function (PDF) is the familiar "bell curve," given by:

f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right) \tag{13}

The expected value of a normal distribution is E(X) = \mu and the variance is V(X) = \sigma^2. Its standard version is denoted Z \sim \mathcal{N}(0, 1), where the mean is 0 and the variance is 1.

We use the cumulative distribution function (CDF) to calculate probabilities for the normal distribution. Because the CDF has no closed form, this typically requires standardizing the variable into a Z-score with:

Z = \frac{X - \mu}{\sigma} \tag{14}
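The standardization step can be sketched in code. The standard normal CDF has no elementary closed form, but it can be expressed via the error function in the math module; the height example and the names `z_score` and `phi` are illustrative choices, not library APIs.

```python
from math import erf, sqrt

def z_score(x, mu, sigma):
    """Standardize X ~ N(mu, sigma^2) into Z ~ N(0, 1)."""
    return (x - mu) / sigma

def phi(z):
    """Standard normal CDF via the error function: P(Z <= z)."""
    return 0.5 * (1 + erf(z / sqrt(2)))

# Hypothetical example: heights with mu = 170, sigma = 10.
# P(X <= 180) = P(Z <= 1)
z = z_score(180, 170, 10)
print(z)                  # 1.0
print(round(phi(z), 4))   # 0.8413
```

By symmetry, phi(-z) = 1 - phi(z), which is a handy check when working with left-tail probabilities.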

Exponential Distribution

This models the time between independent events that occur continuously at a fixed average rate. When a random variable X follows an exponential distribution with rate parameter \lambda, it is denoted as X \sim \text{Exp}(\lambda). Its density is defined as:

f(x) = \begin{cases} \lambda e^{-\lambda x}, & x \geq 0 \\ 0, & x < 0 \end{cases} \tag{15}

The probability density starts at its maximum value \lambda and decays exponentially as x increases. Its expected value, the mean time between events, is:

E(X) = \frac{1}{\lambda} \tag{16}

The variance is calculated with:

V(X) = \frac{1}{\lambda^2} \tag{17}

Common examples of this distribution include survival analysis, reliability modeling, and calculating the failure rates of components. It is also the standard model for the time between arrivals in a Poisson process, such as the time between phone calls at a service center.
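As a sketch of the service-center example, suppose calls arrive at a hypothetical rate of 2 per minute, so the waiting time follows Exp(2). The exponential distribution also shares the memoryless property mentioned for the geometric distribution, which we can verify numerically:

```python
from math import exp

lam = 2.0  # hypothetical rate: 2 calls per minute

def exp_cdf(t, lam):
    """P(X <= t) for X ~ Exp(lam): 1 - e^(-lam * t) for t >= 0."""
    return 1 - exp(-lam * t) if t >= 0 else 0.0

mean = 1 / lam     # E(X) = 1/lam = 0.5 minutes between calls
var = 1 / lam**2   # V(X) = 1/lam^2 = 0.25

# Probability the next call arrives within one mean waiting time
print(exp_cdf(mean, lam))   # 1 - e^-1, about 0.6321

# Memorylessness: P(X > s + t | X > s) equals P(X > t)
s, t = 0.3, 0.7
lhs = (1 - exp_cdf(s + t, lam)) / (1 - exp_cdf(s, lam))
rhs = 1 - exp_cdf(t, lam)
print(abs(lhs - rhs) < 1e-12)   # True
```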

Weibull Distribution

This generalizes the exponential distribution by introducing a shape parameter, allowing for more flexible modeling of lifetimes and failure rates. When a continuous random variable X follows a Weibull distribution with shape parameter \alpha and scale parameter \beta, it is denoted as X \sim \text{Weibull}(\alpha, \beta). The density is defined as:

f(x) = \begin{cases} \frac{\alpha}{\beta} \left(\frac{x}{\beta}\right)^{\alpha - 1} e^{-(x/\beta)^\alpha}, & x > 0 \\ 0, & x \leq 0 \end{cases} \tag{19}

By adjusting the shape parameter \alpha, we can model problems with decreasing, constant, or increasing failure rates. As a result, the calculations for the mean and variance involve the Gamma function \Gamma(n).

The expected value is:

E(X) = \beta \Gamma\left(1 + \frac{1}{\alpha}\right) \tag{20}

Its variance is calculated as:

V(X) = \beta^2 \left[ \Gamma\left(1 + \frac{2}{\alpha}\right) - \left(\Gamma\left(1 + \frac{1}{\alpha}\right)\right)^2 \right] \tag{21}

This distribution is very common in survival modeling. For example, when \alpha < 1, it models "infant mortality," where failure rates decrease over time; when \alpha = 1, it reduces to the exponential distribution (constant failure rate); and when \alpha > 1, it models "wear-out" periods, where the probability of failure increases as the component ages.
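The Gamma-function formulas (20) and (21) can be evaluated with math.gamma. As a sanity check, setting alpha = 1 should recover the exponential case, with mean beta and variance beta squared. A sketch:

```python
from math import gamma

def weibull_mean(alpha, beta):
    """E(X) = beta * Gamma(1 + 1/alpha)."""
    return beta * gamma(1 + 1 / alpha)

def weibull_var(alpha, beta):
    """V(X) = beta^2 * [Gamma(1 + 2/alpha) - Gamma(1 + 1/alpha)^2]."""
    g1 = gamma(1 + 1 / alpha)
    g2 = gamma(1 + 2 / alpha)
    return beta**2 * (g2 - g1**2)

# alpha = 1 reduces to Exp with mean beta = 2 and variance beta^2 = 4
print(weibull_mean(1, 2.0), weibull_var(1, 2.0))

# alpha > 1 (wear-out regime): mean is beta * Gamma(1.5), about 0.8862
print(weibull_mean(2, 1.0))
```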

Chi-Squared Distribution

The Chi-squared (\chi^2) distribution is a statistical inference tool, used particularly for hypothesis testing and the construction of confidence intervals. Given k independent standard normal variables X_1, X_2, \dots, X_k, the sum of their squares follows a chi-squared distribution with k degrees of freedom. This is denoted as X \sim \chi^2(k), where:

X = \sum_{i=1}^{k} X_i^2 \tag{22}

Its distribution is defined as:

f(x) = \begin{cases} \frac{1}{2^{k/2} \Gamma(k/2)} x^{(k/2) - 1} e^{-x/2}, & x > 0 \\ 0, & x \leq 0 \end{cases} \tag{23}

The shape of the distribution depends entirely on the degrees of freedom. It is strongly right-skewed for small k and becomes more symmetric as k increases.

The expected value is:

E(X) = k \tag{24}

And its variance is calculated with:

V(X) = 2k \tag{25}

This distribution provides a framework for goodness-of-fit tests, variance estimation, and testing the independence of categorical variables in contingency tables. It is also the basis for the F-distribution used in ANOVA.
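The defining construction in equation (22) can be checked by simulation: square and sum k standard normal draws, then compare the sample mean and variance with k and 2k. A seeded sketch, with sample sizes chosen here just for illustration:

```python
import random

random.seed(0)
k = 3            # degrees of freedom
n = 100_000      # number of simulated chi-squared draws

# Each draw is the sum of k squared standard normal variables
draws = [sum(random.gauss(0, 1)**2 for _ in range(k)) for _ in range(n)]

mean = sum(draws) / n
var = sum((x - mean)**2 for x in draws) / n

print(mean)   # close to k = 3
print(var)    # close to 2k = 6
```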

Beta Distribution

This probability distribution is defined on the interval [0, 1]. Because its domain is bounded, it is a common choice for modeling variables that represent proportions, probabilities, or percentages. When a random variable X is parameterized by two shape parameters \alpha and \beta, we denote it as X \sim \text{Beta}(\alpha, \beta). Its density is defined as:

f(x) = \frac{x^{\alpha - 1} (1 - x)^{\beta - 1}}{B(\alpha, \beta)}, \quad 0 \leq x \leq 1 \tag{26}

The denominator B(\alpha, \beta) is the Beta function, a normalization constant that ensures the total area under the PDF is 1:

B(\alpha, \beta) = \int_0^1 t^{\alpha - 1} (1 - t)^{\beta - 1} \, dt \tag{27}

By adjusting \alpha and \beta, the distribution can take on a variety of shapes, including uniform, U-shaped, bell-shaped, or skewed, making it highly adaptable.

The expected value is:

E(X) = \frac{\alpha}{\alpha + \beta} \tag{28}

Its variance is calculated with:

V(X) = \frac{\alpha \beta}{(\alpha + \beta)^2 (\alpha + \beta + 1)} \tag{29}
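Equations (28) and (29) are easy to sanity-check in code: with alpha = beta = 1 the Beta distribution is just Uniform(0, 1), so the mean and variance should match 1/2 and 1/12. The A/B-testing counts below are hypothetical numbers for illustration.

```python
def beta_mean(a, b):
    """E(X) = a / (a + b)."""
    return a / (a + b)

def beta_var(a, b):
    """V(X) = ab / ((a + b)^2 (a + b + 1))."""
    return a * b / ((a + b)**2 * (a + b + 1))

# Beta(1, 1) is Uniform(0, 1): mean 1/2, variance 1/12
print(beta_mean(1, 1), beta_var(1, 1))

# Hypothetical A/B test: 30 successes and 70 failures observed,
# starting from a Beta(1, 1) prior, gives a Beta(31, 71) posterior
print(round(beta_mean(31, 71), 4))   # 0.3039
```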

The Beta distribution is key in Bayesian statistics, where it often serves as a conjugate prior for Bernoulli, Binomial, and Geometric distributions. It is also widely applied in A/B testing, election forecasting, and quality control to model the uncertainty surrounding the true success rate of a process.

Conclusion

We have explored the different discrete and continuous probability distributions, which are the foundation for most statistical analyses. There’s no need to memorize every detail; instead, keep these concepts in mind for future reference. Over time, you'll naturally learn the most appropriate distribution for a given problem.

For comments, please send me an email.