Decision Trees
- Data analysis - ask a series of questions to narrow down to the label (Color? Brown; Has wings? Yes; Answer - Bird)
- Splits - the goal is to:
- Minimize the number of splits, keeping the tree as efficient as possible
- Maximize Information Gain (IG) at each split (i.e. ask the right questions)
- Impurity - how heavily the classes are mixed together within a node
- IG = decrease in impurity = Impurity(parent) - weighted average of Impurity(children) (worked example after this list)
- Tree depth - hyperparameter; the max number of successive splits from the root down to a leaf
- At each split the data is divided in two; each leaf predicts the majority class of the training examples that end up in it
- Shallow tree - can underfit the data
- Deep tree - can overfit (i.e. every example ends up in its own leaf, capturing a lot of noise)
- To predict, trace through the tree until a leaf is reached, then use the mean target value of the training points at that leaf as the prediction
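A minimal worked example of the information-gain arithmetic above, assuming Gini impurity as the impurity measure (entropy is the other common choice); the class labels and the split are made up for illustration:

```python
import numpy as np

def gini(labels):
    """Gini impurity: 1 - sum of squared class proportions."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def information_gain(parent, left, right):
    """IG = Impurity(parent) - weighted average impurity of the children."""
    n = len(parent)
    weighted_children = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(parent) - weighted_children

# Made-up example: a perfectly mixed parent split into two purer children.
parent = np.array([0, 0, 0, 0, 1, 1, 1, 1])
left   = np.array([0, 0, 0, 1])   # mostly class 0
right  = np.array([1, 1, 1, 0])   # mostly class 1

print(gini(parent))                           # 0.5 (maximally mixed for 2 classes)
print(information_gain(parent, left, right))  # 0.125
```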
Regression trees
- Instead of predicting the majority class, predict the mean target value of the leaf (i.e. calculate the mean of the leaf's constituents and base the prediction on that mean; sketch after this list)
- Benefits:
- Very interpretable
- Train quickly
- Can handle non-linear relationships
- No need for feature scaling
- Challenges:
- Sensitive to depth
- Prone to overfitting
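A small sketch of a regression tree, assuming scikit-learn's DecisionTreeRegressor; the toy data and the max_depth values are arbitrary and only illustrate the depth hyperparameter and the mean-of-leaf prediction:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy non-linear data (made up for illustration): y = sin(x) + noise
rng = np.random.RandomState(0)
X = np.sort(rng.uniform(0, 6, size=(80, 1)), axis=0)
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=80)

# Depth controls the fit: too shallow underfits, too deep overfits.
shallow = DecisionTreeRegressor(max_depth=2).fit(X, y)
deep = DecisionTreeRegressor(max_depth=None, min_samples_leaf=1).fit(X, y)

# Each prediction is the mean target value of the training points in the leaf.
print(shallow.predict([[3.0]]), deep.predict([[3.0]]))
```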
Ensemble Models
- Combine multiple models into a meta-model for better performance
- Averaging multiple models - less likely to overfit if the models are independent
- Feed full dataset (or a slice) into multiple models
- Use an aggregation function (majority vote, average, weighted average, etc.) to determine the overall prediction from the individual predictions of each model (sketch after this list)
- Challenges:
- Time / compute resources to train
- Cost of running multiple models
- Decrease in interpretability
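A sketch of a simple ensemble, assuming scikit-learn's VotingClassifier handles the aggregation step; the member models and dataset are arbitrary choices for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Toy dataset (made up for illustration).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Feed the same data into several different models, then aggregate
# their individual predictions with a majority vote ("hard" voting).
ensemble = VotingClassifier(
    estimators=[
        ("logreg", LogisticRegression(max_iter=1000)),
        ("tree", DecisionTreeClassifier(max_depth=4)),
        ("knn", KNeighborsClassifier()),
    ],
    voting="hard",
)
ensemble.fit(X, y)
print(ensemble.predict(X[:5]))
```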
Bagging
- Bootstrap aggregation - sampling with replacement
- Pull out a sample from a set for use in a model; replace the pulled out sample
- Can end up pulling the same sample multiple times
- Because each model is trained on a different bootstrap sample, the resulting models can be treated as roughly independent
- The average of the predictions is then taken
- Build a series of trees, then average their outputs (sketch after this list)
- Reduces overfitting
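A hand-rolled bagging sketch, assuming scikit-learn trees and NumPy; the number of trees and the dataset are arbitrary, the point is the sampling-with-replacement loop and the averaging step:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
rng = np.random.RandomState(0)

# Bootstrap aggregation: each tree is trained on a sample drawn WITH replacement,
# so the same row can appear more than once in a given sample.
trees = []
for _ in range(25):
    idx = rng.randint(0, len(X), size=len(X))   # sampling with replacement
    trees.append(DecisionTreeRegressor().fit(X[idx], y[idx]))

# Average the individual predictions to get the bagged prediction.
preds = np.mean([t.predict(X[:3]) for t in trees], axis=0)
print(preds)
```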
Random Forest
- Use bagging to train multiple trees
- Use multiple trees and take the majority vote
- Each member model may use a different subset of both features and observations to train
- Decisions driving this:
- Number of trees
- Sampling strategy for bagging
- Bagging sample size as a % of the total
- Max number of features represented in each bagging sample
- Depth of trees
- Max depth
- Min samples per leaf
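A sketch that maps the decisions above onto scikit-learn's RandomForestClassifier parameters, assuming a reasonably recent scikit-learn (max_samples requires version 0.22 or later); the specific values are arbitrary:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

forest = RandomForestClassifier(
    n_estimators=200,      # number of trees
    bootstrap=True,        # bagging: sample with replacement
    max_samples=0.8,       # bagging sample size as a fraction of the data
    max_features="sqrt",   # features considered at each split
    max_depth=10,          # depth of trees
    min_samples_leaf=2,    # min samples per leaf
    random_state=0,
).fit(X, y)

print(forest.predict(X[:5]))   # majority vote across the trees
```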
Clustering
- Unsupervised learning technique
- Organize data points into logical groups
- Sort similar data points into the same cluster and dissimilar data points into different clusters
- Ex: dividing potential customers into groups and targeting each group with specific marketing
- Key decision - determining the basis for similarity in data
K-Means Clustering
- Sort data based on distance from the nearest cluster center
- Ex: data points near each other go into one cluster; far from each other - into another cluster
- Goal - minimize the sum of distances from each data point to its assigned cluster center
- Select initial cluster centers at random
- Assign each data point to the closest center
- Move each center to the mean of the data points assigned to it (re-center)
- Repeat the assignment and re-centering steps until the centers stop moving (sketch after this list)
- Strengths:
- Easy to implement
- Quick to run
- Good starting point for working with clustering
- Weaknesses:
- Requires specifying the number of clusters ahead of time
- Does not work well for geometrically complex data (e.g. non-spherical clusters)
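A minimal K-Means sketch, assuming scikit-learn's KMeans; the blob data and n_clusters=3 are made up for illustration:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Toy data with 3 well-separated blobs (made up for illustration).
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# n_clusters must be chosen up front -- the weakness noted above.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

print(kmeans.cluster_centers_)   # final centers after re-centering converges
print(kmeans.labels_[:10])       # cluster assignment for each data point
print(kmeans.inertia_)           # sum of squared distances to the nearest center
```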