Published on
06/08/2019

Theory Behind Some Classical Machine Learning Algorithms

This blog post is motivated by a tutorial session and follow-up questions I got last year at Outbox Hub. It should be self-contained for anyone who did not attend the session. We will revisit the slides and provide more informal theoretical notes on the maths of classical machine learning algorithms.

First, all of these classical algorithms fall under a branch of ML called supervised learning, where a model is trained on labelled datasets/examples (each training input is already associated with a known output) and the primary objective is to learn a mapping function/generalisation that reasonably maps any input(s) to the corresponding output(s).

Linear Regression

Given some dataset $X$ and the corresponding response variable $Y$ such that:

$$X = \begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ \vdots \\ x_n \end{bmatrix}, \quad Y = \begin{bmatrix} y_1 \\ y_2 \\ y_3 \\ \vdots \\ y_n \end{bmatrix}$$

The goal is to develop the best-fitting linear model $\hat{y}$ that predicts the value of $y$ given a value of $x$ such that:

$$\hat{y} = \beta_0 + \beta_1 x \tag{1}$$

The best-fitting or generalized model is obtained by minimizing a cost function for the linear regression. Common cost functions are the Residual Sum of Squares (RSS), $\sum (y_i - \hat{y}(x_i))^2$, and the Mean Squared Error (MSE), $\frac{1}{n} \text{RSS}$. These functions differ only in scale: RSS is the total sum of squared residuals, while MSE is their average.

From equation 1, since the $\beta$ parameters control the resultant predictions, our cost function of choice will depend heavily on these two parameters. Choosing RSS as our cost function $J$ results in:

$$J(\beta_0, \beta_1) = \sum (y_i - \hat{y}(x_i))^2 \tag{2}$$

$$= \sum (y_i - (\beta_0 + \beta_1 x_i))^2 \tag{3}$$

To find the parameters, we have two options for linear regression: gradient descent optimization or the closed-form solution with ordinary least squares (OLS).

Gradient Descent Optimization

We want to compute the $\beta_0, \beta_1$ that minimize $J(\beta_0, \beta_1)$ using Gradient Descent, a learning/optimization algorithm in machine learning. First, the gradient of a function is defined as:

$$\nabla f(x, y) = \begin{bmatrix} \frac{\partial f}{\partial x} \\ \frac{\partial f}{\partial y} \end{bmatrix} \tag{4}$$

The simple version of the Gradient Descent algorithm works as follows:

  1. Select a random starting value for the variables.
  2. Calculate the gradient of the cost function at that point.
  3. Move your point in the direction opposite to the gradient, i.e., subtract $\alpha \nabla f$ from your variables (the learning rate $\alpha$ can be tuned).
  4. Repeat steps 2 and 3 until convergence.
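As a quick illustration, the four steps above can be sketched in a few lines of Python. The toy data, starting values, and learning rate below are invented for the example, not part of any standard recipe:

```python
import numpy as np

# Toy data generated from y = 2 + 3x, so we know the target parameters.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 + 3.0 * x

b0, b1 = 0.0, 0.0   # step 1: starting values (random in general; fixed here)
alpha = 0.01        # learning rate, tunable

for _ in range(20000):            # step 4: repeat until (approximate) convergence
    resid = y - (b0 + b1 * x)
    # step 2: gradient of the RSS cost J(b0, b1) at the current point
    grad_b0 = -2.0 * resid.sum()
    grad_b1 = -2.0 * (resid * x).sum()
    # step 3: move against the gradient
    b0 -= alpha * grad_b0
    b1 -= alpha * grad_b1
```

After the loop, `b0` and `b1` should sit very close to the true intercept 2 and slope 3 of the toy line.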

After convergence, we have the values of the $\beta$ parameters, into which we can plug any $x$ value to get the corresponding predicted $y$ value.

For both the closed-form solution and the gradient descent solution, before estimating the best linear unbiased estimates of the coefficients, the following assumptions must first be met:

  1. Linearity: The relationship between the independent and dependent variables is linear.
  2. Independence: The residuals (errors) are independent of one another.
  3. Homoscedasticity: The residuals have constant variance across all levels of the independent variables.
  4. No perfect multicollinearity: The independent variables are not perfectly linearly related to each other.
  5. Normality of Errors: The residuals are normally distributed (this is primarily important for hypothesis testing and calculating confidence intervals).

If all the above are satisfied, the OLS variance-covariance matrix is defined as:

$$\text{Var}(\hat{\beta}) = \sigma^2 (X^\top X)^{-1} \tag{5}$$

In this matrix, $\sigma^2$ represents the variance of the error terms. Since the true population variance is typically unknown, it is estimated from the residuals. To estimate the variance of the error terms ($\hat{\sigma}^2$), we use the RSS:

$$\hat{\sigma}^2 = \frac{\text{RSS}}{n - k} = \frac{\sum_{i=1}^n (y_i - \hat{y}_i)^2}{n - k} \tag{6}$$

Where $n$ is the number of observations and $k$ is the number of estimated parameters, including the intercept. Finally, to assess the goodness of fit (how well the model explains the variability of the data), we use the Coefficient of Determination ($R^2$):

$$R^2 = 1 - \frac{\text{RSS}}{\text{TSS}} \tag{7}$$

where TSS is the Total Sum of Squares, $\sum (y_i - \bar{y})^2$.
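For concreteness, here is a small numpy sketch of the closed-form route: the OLS estimate, the error-variance estimate of equation 6, the variance-covariance matrix of equation 5, and the coefficient of determination of equation 7. The five data points are made up for illustration:

```python
import numpy as np

# Made-up data: five points lying close to the line y = 2x.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

X = np.column_stack([np.ones_like(x), x])       # design matrix with intercept
beta = np.linalg.solve(X.T @ X, X.T @ y)        # closed-form OLS estimate
y_hat = X @ beta

rss = np.sum((y - y_hat) ** 2)                  # residual sum of squares
tss = np.sum((y - y.mean()) ** 2)               # total sum of squares
n, k = X.shape
sigma2_hat = rss / (n - k)                      # eq. 6
var_beta = sigma2_hat * np.linalg.inv(X.T @ X)  # eq. 5
r2 = 1.0 - rss / tss                            # eq. 7
```

On this toy data the fit is nearly perfect, so `r2` comes out just below 1.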

Logistic Regression

Logistic regression is used when the parameters to estimate are for a binary classification problem. While the core concept is straightforward, implementing logistic regression is slightly more complicated than the linear regression model because of how the loss is penalized. At the root of this method is the logistic function, often called the sigmoid function, defined as:

$$f(x) = \frac{1}{1 + e^{-x}} \tag{8}$$

It creates a characteristic S-like shape. We can further expand this function by incorporating specific parameters known as weights, denoted as $\theta$, along with the input features $x$. This results in a model where the input is a linear combination of features:

$$f(x) = \frac{1}{1 + e^{-(\theta_0 x_0 + \theta_1 x_1 + \dots + \theta_n x_n)}} \tag{9}$$

The key advantage of the logistic function is that it squashes any real-valued input, from negative to positive infinity, into the tight range (0, 1).
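A direct translation of equation 8 into Python, just to make the squashing behaviour concrete:

```python
import math

def sigmoid(x):
    # eq. 8: squashes any real input into the open interval (0, 1)
    return 1.0 / (1.0 + math.exp(-x))

print(sigmoid(0.0))   # 0.5, exactly on the decision boundary
```

Large positive inputs give values near 1, large negative inputs values near 0.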

Choosing a Loss Function for Binary Classification

Similar to linear regression, we define a loss function and then train the classifier to minimize this function using an optimization algorithm such as gradient descent. However, using the standard squared loss from linear regression is seriously problematic for classification tasks. One significant issue is that squared loss penalizes a confidently correct classification whose predicted value lies far from the decision boundary on the correct side, which is a behavior we want to avoid. Squared loss is also highly sensitive to outliers: a single badly misclassified sample can shift the decision boundary drastically, potentially degrading the performance of the model.

One alternative loss function for binary classification is the Hinge Loss. Unlike squared loss, hinge loss does not penalize correct classifications that fall outside the margin. It penalizes incorrect classifications, though not as severely as squared loss. Hinge loss is convex (though not differentiable at the hinge point), which makes it suitable for (sub)gradient descent optimization.

$$L(y, f(x)) = \max(0, 1 - y \cdot f(x)) \tag{10}$$
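Equation 10 is simple enough to write down directly; for labels $y \in \{-1, +1\}$ and a raw score $f(x)$:

```python
def hinge_loss(y, fx):
    # eq. 10: zero for correct predictions outside the margin,
    # linear penalty otherwise
    return max(0.0, 1.0 - y * fx)
```

For example, `hinge_loss(1, 2.5)` is 0 (correct and outside the margin), while `hinge_loss(-1, 1.0)` is 2 (confidently wrong).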

The Log Loss (also known as Cross-Entropy Loss) is the most common loss function used to train binary classifiers because it provides a probabilistic interpretation of the model's output, penalizing confident but wrong predictions. To compute the gradient for this loss function, we first define the cost for a single sample $x^{(i)}$. For the case where the target $y = +1$, the cost is $C_{+1} = -\log(f(x^{(i)}))$. By applying the chain rule and the identity $f'(z) = f(z)(1 - f(z))$, we find that the contribution to the gradient for $\theta_j$ is:

$$\frac{\partial C_{+1}}{\partial \theta_j} = (f(x^{(i)}) - 1)x_j^{(i)} \tag{11}$$

Gradient Descent Optimization

With the loss gradient derived, we can use gradient descent to optimize our classifier given the training data. The process is similar to linear regression, where we compute the total gradient and update our parameters $\theta_j$ iteratively. The general update rule is defined as:

$$\theta_j = \theta_j - \eta \frac{\partial L}{\partial \theta_j} \tag{12}$$

To ensure the model generalizes well to unseen data, we often compute gradients for the loss function with a regularization term, e.g. $L_2$ regularization (also known as Ridge regularization), which serves as a means of preventing overfitting by penalizing large weights. The $L_2$ regularization term is defined as the sum of the squares of the weights:

$$R(\theta) = \lambda \sum_{j=1}^{n} \theta_j^2 \tag{13}$$

Updating the Log Loss with this $L_2$ regularization term results in a modified cost function that balances empirical error against model complexity. This combined approach ensures that the gradient descent steps not only reduce classification error but also maintain a simpler, more generalizable model. The gradient descent update, given our updated log loss and regularization, is as follows:

$$\theta_j = \theta_j - \eta \left( \sum_{i=1}^{k} (f(x^{(i)}) - y^{(i)})x_j^{(i)} + 2\lambda\theta_j \right) \tag{14}$$
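Putting equation 14 together, here is a minimal batch-gradient-descent sketch for an L2-regularized logistic classifier. The four-point dataset, the learning rate `eta`, and the regularization strength `lam` are all invented for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy dataset: column 0 is the bias feature x0 = 1, column 1 a single input.
X = np.array([[1.0, 0.5], [1.0, -1.0], [1.0, 2.0], [1.0, -2.5]])
y = np.array([1.0, 0.0, 1.0, 0.0])

theta = np.zeros(2)
eta, lam = 0.1, 0.01   # learning rate and regularization strength

for _ in range(500):
    # eq. 14: batch gradient of the log loss plus the L2 term 2*lambda*theta
    grad = X.T @ (sigmoid(X @ theta) - y) + 2.0 * lam * theta
    theta -= eta * grad
```

After training, the learned weight on the input feature is positive, and the model separates the two classes on this toy data.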

k-nearest neighbours (k-NN)

The k-Nearest Neighbors (k-NN) algorithm is a non-parametric algorithm for both classification and regression tasks, particularly when working with smaller datasets. Given a dataset containing $n$ observations, where each observation is represented by a feature vector $\mathbf{x}_i \in \mathbb{R}^d$ and an associated label $y_i$, the algorithm predicts the outcome for a new input $\mathbf{x}$ through proximity-based steps. The process begins by computing the distance between the new input and all samples in the training set. For continuous variables, the Euclidean distance is a standard choice for this calculation:

$$d(\mathbf{x}, \mathbf{x}_i) = \sqrt{\sum_{j=1}^d (x_j - x_{ij})^2} \tag{15}$$

Once distances are calculated, the algorithm identifies the nearest neighbors by selecting the $k$ samples with the smallest distances to the input $\mathbf{x}$. In classification tasks, the model assigns the class label that appears most frequently among these $k$ neighbors using a majority voting scheme. If we let $\mathcal{N}_k(\mathbf{x})$ represent the set of indices of these neighbors, the predicted class $\hat{y}$ is the class label $c$ with the highest frequency:

$$\hat{y} = \arg\max_{c} \sum_{i \in \mathcal{N}_k(\mathbf{x})} \mathbb{I}(y_i = c) \tag{16}$$

Here, $\mathbb{I}(\cdot)$ serves as the indicator function, returning 1 if the condition is met and 0 otherwise. For regression problems, the approach shifts from voting to averaging: the predicted output $\hat{y}$ is simply the mean of the values associated with the $k$ nearest neighbors:

$$\hat{y} = \frac{1}{k} \sum_{i \in \mathcal{N}_k(\mathbf{x})} y_i \tag{17}$$
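Equations 15 through 17 translate almost line for line into Python. The helper below is a plain, unoptimized sketch (real implementations usually use spatial indexes such as k-d trees rather than a full sort):

```python
import math
from collections import Counter

def euclidean(a, b):
    # eq. 15: Euclidean distance between two feature vectors
    return math.sqrt(sum((aj - bj) ** 2 for aj, bj in zip(a, b)))

def knn_predict(X_train, y_train, x, k, task="classify"):
    # indices of the k training samples closest to x
    nearest = sorted(range(len(X_train)),
                     key=lambda i: euclidean(X_train[i], x))[:k]
    if task == "classify":
        # eq. 16: majority vote over the neighbours' labels
        return Counter(y_train[i] for i in nearest).most_common(1)[0][0]
    # eq. 17: mean of the neighbours' values
    return sum(y_train[i] for i in nearest) / k
```

For example, with training points clustered at the origin (label "a") and near (5, 5) (label "b"), a query near the origin is voted "a" even when one far neighbour sneaks into the top k.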

Model Selection and Performance Optimization

Picking the most appropriate value of $k$ is vital for the model's performance, as it balances the trade-off between bias and variance. A very small $k$ often leads to overfitting due to high sensitivity to noise in the data, whereas a large $k$ can smooth the decision boundary to the point of underfitting. Cross-validation is fundamental for identifying the optimal $k$ that minimizes prediction error. Aside from the value of $k$, the choice of distance metric must align with the data type: while Euclidean distance is standard, Manhattan distance is often more useful in high-dimensional spaces, and Hamming distance is useful for categorical data.

Another option for optimization, especially in regression, is to assign weights to neighbors based on their proximity to the input $\mathbf{x}$. A common strategy uses the inverse distance to give closer neighbors more influence over the prediction:

$$w_i = \frac{1}{d(\mathbf{x}, \mathbf{x}_i)} \tag{18}$$

Under this weighting scheme, the regression prediction is adjusted to a weighted average, ensuring that the influence of a neighbor decays as its distance from the query point increases:

$$\hat{y} = \frac{\sum_{i \in \mathcal{N}_k(\mathbf{x})} w_i y_i}{\sum_{i \in \mathcal{N}_k(\mathbf{x})} w_i} \tag{19}$$
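A sketch of the distance-weighted variant of equations 18 and 19; the small `eps` is my own addition to avoid division by zero when a query coincides exactly with a training point:

```python
import math

def weighted_knn_regress(X_train, y_train, x, k, eps=1e-12):
    dists = [math.dist(x, xi) for xi in X_train]
    nearest = sorted(range(len(X_train)), key=lambda i: dists[i])[:k]
    # eq. 18: inverse-distance weights (eps guards against division by zero)
    weights = [1.0 / (dists[i] + eps) for i in nearest]
    # eq. 19: weighted average of the neighbours' values
    return sum(w * y_train[i] for w, i in zip(weights, nearest)) / sum(weights)
```

With 1-D training points at 0, 1, and 3 carrying values 0, 1, and 3, a query at 0.9 with k = 2 lands much closer to 1 than to 0, as the closer neighbour dominates.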

While k-NN is easy to implement and benefits from having no explicit training phase, it remains computationally intensive for large datasets because distance calculations must be performed for every new prediction. Furthermore, the algorithm is susceptible to the curse of dimensionality, where its performance may degrade as the number of features increases significantly.

Naive Bayes

Naive Bayes is a probabilistic algorithm for classification tasks. It is based on Bayes' Theorem, with a specific focus on conditional independence between features. Bayes' Theorem provides a way to update our beliefs based on observed evidence, expressed as:

$$P(A|B) = \frac{P(B|A)P(A)}{P(B)} \tag{20}$$

Here, $P(A|B)$ represents the posterior probability, which is the likelihood of class $A$ occurring given the observed features $B$. The term $P(B|A)$ is the likelihood, indicating the probability of observing the evidence $B$ given that the class is $A$. $P(A)$ represents the prior probability, quantifying our initial belief about class $A$ before any evidence is presented, while $P(B)$ is the evidence, representing the total probability of observing $B$ across all possible classes.

The "naive" part of the algorithm comes from the simplifying assumption that all features are conditionally independent given the class. This assumption allows us to break the likelihood into a product of individual feature probabilities, simplifying the calculation to:

$$P(B|A) = \prod_{i = 1}^{n} P(b_i|A) \tag{21}$$

By treating individual features $b_i$ as independent, the algorithm reduces the computational complexity, allowing it to handle high-dimensional data. That is, for any given classification task, the algorithm calculates the posterior for every possible class and selects the one that results in the highest value:

$$A^* = \arg\max_{A} P(A|B) \tag{22}$$
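The argmax in equation 22 can be sketched directly. Working in log space turns the product of equation 21 into a sum and avoids numerical underflow on long feature vectors; the spam/ham priors and word likelihoods below are a toy example invented here:

```python
import math

# Toy numbers, invented for illustration: priors P(A) and
# per-word likelihoods P(b_i | A).
priors = {"spam": 0.4, "ham": 0.6}
likelihoods = {
    "spam": {"offer": 0.8, "meeting": 0.1},
    "ham":  {"offer": 0.1, "meeting": 0.7},
}

def naive_bayes_predict(priors, likelihoods, features):
    best, best_score = None, -math.inf
    for c, prior in priors.items():
        # log prior plus the log of eq. 21's product;
        # the evidence P(B) is the same for every class, so it cancels in the argmax
        score = math.log(prior) + sum(math.log(likelihoods[c][f]) for f in features)
        if score > best_score:
            best, best_score = c, score
    return best
```

A message containing "offer" scores higher under the spam class, while one containing "meeting" comes out ham.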

Error Rates and Practical Limitations

The theoretical performance limit of Naive Bayes is closely tied to the concept of the Bayes Error Rate. This metric represents the minimum possible error rate that any classifier can achieve, given the inherent uncertainty in the underlying data distributions. Because different classes often have overlapping probability densities, no decision rule can perfectly separate them. For a binary classification problem, the Bayes Error Rate is computed by integrating the minimum posterior probability over the feature space:

$$\text{Error Rate} = \int \min(P(A_1|B), P(A_2|B)) \, p(B) \, dB \tag{23}$$

Here, $p(B)$ is the density function of the features. For multi-class scenarios, this formula generalizes by considering the probability that the most likely class is still incorrect:

$$\text{Error Rate} = \int \left(1 - \max_{A} P(A|B)\right) p(B) \, dB \tag{24}$$

While Naive Bayes is an efficient algorithm, its primary drawback is the independence assumption, which can limit performance when features are strongly correlated. Despite this theoretical limitation, the algorithm often performs remarkably well in practice. It is particularly good in domains like text classification and spam detection, where the occurrences of various words can often approximate independence well enough to provide highly accurate results.

Decision Trees

Tree-based algorithms are among the most intuitive machine learning methods because they model relationships within the data using a flowchart-like structure. Each internal node represents a decision based on a specific feature, while each branch corresponds to the outcome of that decision. The division process continues until it reaches a leaf node, which represents the final prediction or outcome. Decision trees can be used for both classification and regression tasks and serve as a foundation for more advanced ensemble methods like Random Forests and gradient-boosted trees.

The construction of a tree begins with selecting the feature on which to split the data. The primary goal is to partition the dataset into subsets that are as homogeneous as possible with respect to the target variable, based on a measure of impurity or information gain. Information gain measures the reduction in uncertainty, or entropy, achieved by splitting a dataset on a particular feature. Entropy is defined as:

$$H(S) = -\sum_{i = 1}^{c} p_i \log_2(p_i) \tag{25}$$

When there is no uncertainty in the classification, i.e., all samples belong to a single class, the entropy is 0. Conversely, maximum uncertainty occurs at a perfect class balance. For a binary classification problem, maximum entropy occurs when $p_1 = p_2 = 0.5$, resulting in a value of 1. More generally, for a problem with $c$ classes, maximum entropy occurs when all classes have an equal probability of $1/c$, resulting in an entropy value of $\log_2(c)$.
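Equation 25 in code, with the usual convention that classes with zero proportion contribute nothing to the sum:

```python
import math

def entropy(ps):
    # eq. 25; terms with p == 0 contribute nothing by convention
    return -sum(p * math.log2(p) for p in ps if p > 0)

print(entropy([0.5, 0.5]))   # 1.0, the binary maximum
```

A pure node gives 0, a 50/50 binary split gives 1, and four equally likely classes give log₂(4) = 2.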

Another commonly used metric for assessing potential splits is Gini Impurity. This metric provides an alternative measure of the "purity" of a node and is defined as:

$$Gini(S) = 1 - \sum_{i = 1}^{c} p_i^2 \tag{26}$$

Lower Gini impurity values indicate that the resulting subsets are purer. By applying these metrics at each node, the algorithm iteratively refines the data into distinct, predictable groups until it satisfies a stopping criterion, such as a maximum tree depth or a minimum number of samples per leaf.
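Equation 26 is even shorter to sketch; like entropy, Gini impurity is 0 for a pure node and maximal for a uniform class mix:

```python
def gini(ps):
    # eq. 26: 1 minus the sum of squared class proportions
    return 1.0 - sum(p * p for p in ps)
```

For example, `gini([1.0])` is 0 for a pure node, while `gini([0.5, 0.5])` is 0.5, the binary maximum.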

How Does the Tree Continue to Grow?

Once the best feature is selected, the dataset is split, and the process is applied recursively to each resulting subset. This recursive partitioning continues until certain stopping conditions are met, such as when a node becomes pure, a maximum depth is reached, or a minimum number of samples per node is enforced. These stopping criteria help prevent the tree from becoming too complex and overfitting the training data.

While growing a decision tree, overfitting is a common concern. Overfitted trees tend to memorize the training data, leading to poor generalization. Pruning helps mitigate this problem by simplifying the tree. It can be implemented through two main approaches: pre-pruning, also known as early stopping, where the growth of the tree is halted early based on predefined conditions, such as limiting the depth or requiring a minimum number of samples per leaf; and post-pruning, where a fully grown tree is simplified by removing branches that contribute little to predictive accuracy. The latter is usually guided by validation data or statistical tests.

Conclusion

These are foundational algorithms that can be extended to more complex problems with either ensemble methods or deep learning. They serve as a basis for understanding more complex algorithms and are widely used in practice due to their simplicity and interpretability.


For comments, please send me an email.