Tuesday, November 7, 2023

Supervised Learning Algorithms - Tree and Ensemble

Decision Trees

  • Data analysis - ask a series of questions to narrow down on the label (Color? Brown; Has wings? Yes; Answer - Bird)
  • Splits - the goal is to:
    • Minimize the number of splits, making the tree as efficient as possible
    • Maximize Information Gain (IG) at each split (i.e. ask the right questions)
      • Impurity - how heavily data from multiple classes is mixed together within a node
      • IG = decrease in impurity = Impurity(parent) - weighted average Impurity(children)
  • Tree depth - hyperparameter; max number of splits that can occur
    • Analyze the data set - each split divides it in two; each leaf is labeled according to the majority class present in that leaf
    • Shallow tree - can underfit the data
    • Deep tree - can overfit (i.e. every example ends up in its own leaf, introducing a lot of noise)
  • To predict, trace through the tree until a leaf is reached, then use the training points at that leaf (majority class for classification, mean target value for regression) as the prediction - see the sketch below
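
The impurity arithmetic above can be made concrete with a short Python sketch. This is a minimal illustration, assuming Gini impurity and a single two-way split; the function names and the toy labels are mine, not from the class.

import numpy as np

def gini(labels):
    # Gini impurity: 1 minus the sum of squared class proportions.
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def information_gain(parent, left, right):
    # Decrease in impurity: Impurity(parent) minus the size-weighted
    # average impurity of the child nodes.
    n = len(parent)
    children = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(parent) - children

# Toy split: "Has wings?" separates a mixed node into two pure children.
parent = np.array(["bird", "bird", "bird", "cat", "cat", "cat"])
left = np.array(["bird", "bird", "bird"])   # answered "yes"
right = np.array(["cat", "cat", "cat"])     # answered "no"
print(information_gain(parent, left, right))  # 0.5 - 0.0 = 0.5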

Regression trees

  • Instead of predicting the majority class, predict the mean target value of the sample (i.e. calculate the mean of the leaf's constituents and base the prediction on that mean) - see the sketch after this list
  • Benefits:
    • Very interpretable
    • Train quickly
    • Can handle non-linear relationships
    • No need for feature scaling
  • Challenges:
    • Sensitive to depth
    • Prone to overfitting
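
A minimal scikit-learn sketch of a regression tree, assuming synthetic data; the max_depth values are arbitrary and only illustrate the shallow-vs-deep trade-off noted above.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.2, size=200)  # non-linear target

shallow = DecisionTreeRegressor(max_depth=2).fit(X, y)     # may underfit
deep = DecisionTreeRegressor(max_depth=None).fit(X, y)     # may overfit

# Prediction = mean target value of the training points in the reached leaf.
print(shallow.predict([[5.0]]), deep.predict([[5.0]]))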

Ensemble Models

  • Combine multiple models into a meta-model for better performance
  • Averaging multiple models - less likely to overfit if the models are independent
    • Feed full dataset (or a slice) into multiple models
    • Use an aggregation function (majority vote, average, weighted average, etc.) to determine the overall prediction from the individual predictions of each model (see the sketch below)
  • Challenges:
    • Time / compute resources to train
    • Cost of running multiple models
    • Decrease in interpretability
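
A minimal sketch of averaging-based ensembling, assuming three dissimilar regressors trained on the same data; the plain mean used here is just one possible aggregation function.

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)

# Three independent-ish models trained on the same data.
models = [LinearRegression(), KNeighborsRegressor(), DecisionTreeRegressor(max_depth=4)]
for m in models:
    m.fit(X, y)

# Aggregation: average the individual predictions to get the meta-prediction.
predictions = np.column_stack([m.predict(X[:5]) for m in models])
ensemble_prediction = predictions.mean(axis=1)
print(ensemble_prediction)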

Bagging

  • Bootstrap aggregation - sampling with replacement
  • Pull a sample out of the data set for use in a model, then put it back (replace it) before the next draw
  • Can end up pulling the same sample multiple times
  • Because each model is trained on a different bootstrapped data set, the results can be treated as approximately independent
  • The average of the predictions is then taken
    • Build a series of trees, then average their outputs (see the sketch below)
    • Reduces overfitting
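
A minimal sketch of bagging done by hand, assuming decision-tree base models; scikit-learn's BaggingRegressor packages the same idea.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=200)

trees = []
for _ in range(25):
    # Sample with replacement: the same row can appear more than once.
    idx = rng.integers(0, len(X), size=len(X))
    trees.append(DecisionTreeRegressor().fit(X[idx], y[idx]))

# Average the predictions of the bootstrapped trees.
bagged_prediction = np.mean([t.predict([[5.0]]) for t in trees])
print(bagged_prediction)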

Random Forest

  • Use bagging to train multiple trees
  • Use multiple trees and take the majority vote
  • Each member model may use a different subset of both features and observations to train
  • Key decisions driving this (see the sketch below):
    • Number of trees
    • Sampling strategy for bagging
      • Bagging sample size as % of the total
      • Max number of features represented in the bagging sample
    • Depth of trees
      • Max depth
      • Min samples per leaf
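
A minimal random forest sketch, assuming the iris dataset; the hyperparameter values are placeholders that map onto the decisions listed above (number of trees, sampling strategy, tree depth).

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

forest = RandomForestClassifier(
    n_estimators=100,      # number of trees
    max_samples=0.8,       # bagging sample size as a fraction of the data
    max_features="sqrt",   # features considered at each split
    max_depth=6,           # max depth of each tree
    min_samples_leaf=2,    # min samples per leaf
    random_state=0,
)
forest.fit(X, y)

# The forest aggregates the trees' predictions
# (scikit-learn averages their class probabilities).
print(forest.predict(X[:3]))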

Clustering

  • Unsupervised learning technique
  • Organize data points into logical groups
  • Sort similar data points into the same cluster and dissimilar data points into different clusters
    • Ex: dividing potential customers into groups and targeting each group with specific marketing
    • Key decision - determining the basis for similarity in data

K-Means Clustering

  • Sort data based on distance from the nearest cluster center
    • Ex: data points near each other go into one cluster; far from each other - into another cluster
  • Goal - minimize the sum of distances from each data point to its cluster center
    1. Select k cluster centers at random
    2. Assign each data point to the closest center
    3. Move each center to the mean of the data points assigned to it (re-center)
    4. Repeat steps 2 and 3 until the centers stop moving (see the sketch below)
  • Strengths:
    • Easy to implement
    • Quick to run
    • Good starting point for working with clustering
  • Weaknesses:
    • Requires identifying how many clusters will be used ahead of time
    • Does not work well for geometrically complex (non-spherical) clusters
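
A minimal from-scratch sketch of the k-means loop described above, assuming Euclidean distance and k chosen ahead of time; scikit-learn's KMeans is the production version of the same algorithm.

import numpy as np

def k_means(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # 1. Pick k random data points as the initial centers.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # 2. Assign each point to its closest center.
        distances = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # 3. Re-center: move each center to the mean of its assigned points.
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # 4. Stop once the centers no longer move.
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels

# Three well-separated blobs around 0, 5, and 10.
X = np.vstack([np.random.default_rng(1).normal(loc=c, size=(50, 2)) for c in (0, 5, 10)])
centers, labels = k_means(X, k=3)
print(centers)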
