Decision Trees
- Data analysis - ask a series of questions to narrow down to the label (Color? Brown; Has wings? Yes; Answer - Bird)
- Splits - the goal is to:
- Minimize the number of splits, keeping the tree as efficient as possible
- Maximize Information Gain (IG) at each split (i.e. ask the right questions)
- Impurity - how heavily the classes are mixed together within a node
- IG = decrease in impurity = Impurity(parent) - weighted average of Impurity(children) (worked example after this list)
- Tree depth - hyperparameter; the max number of successive splits from the root down to a leaf
- At each split the data is divided in two; each leaf predicts the majority class of the training examples that end up in it
- Shallow tree - can underfit the data
- Deep tree - can overfit (i.e. every example ends up in its own leaf, capturing a lot of noise)
- To predict, trace through the tree until a leaf is reached, then use the mean target value of the training points at that leaf as the prediction
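A minimal worked example of the information-gain arithmetic above, assuming Gini impurity as the impurity measure (entropy is the other common choice); the class labels and the split are made up for illustration:

```python
import numpy as np

def gini(labels):
    """Gini impurity: 1 - sum of squared class proportions."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def information_gain(parent, left, right):
    """IG = Impurity(parent) - weighted average impurity of the children."""
    n = len(parent)
    weighted_children = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(parent) - weighted_children

# Made-up example: a perfectly mixed parent split into two purer children.
parent = np.array([0, 0, 0, 0, 1, 1, 1, 1])
left   = np.array([0, 0, 0, 1])   # mostly class 0
right  = np.array([1, 1, 1, 0])   # mostly class 1

print(gini(parent))                           # 0.5 (maximally mixed for 2 classes)
print(information_gain(parent, left, right))  # 0.125
```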
Regression trees
- Instead of predicting the majority class, predict the mean target value of the leaf (i.e. calculate the mean of the leaf's constituents and base the prediction on that mean; sketch after this list)
- Benefits:
- Very interpretable
- Train quickly
- Can handle non-linear relationships
- No need for feature scaling
- Challenges:
- Sensitive to depth
- Prone to overfitting
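A small sketch of a regression tree, assuming scikit-learn's DecisionTreeRegressor; the toy data and the max_depth values are arbitrary and only illustrate the depth hyperparameter and the mean-of-leaf prediction:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy non-linear data (made up for illustration): y = sin(x) + noise
rng = np.random.RandomState(0)
X = np.sort(rng.uniform(0, 6, size=(80, 1)), axis=0)
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=80)

# Depth controls the fit: too shallow underfits, too deep overfits.
shallow = DecisionTreeRegressor(max_depth=2).fit(X, y)
deep = DecisionTreeRegressor(max_depth=None, min_samples_leaf=1).fit(X, y)

# Each prediction is the mean target value of the training points in the leaf.
print(shallow.predict([[3.0]]), deep.predict([[3.0]]))
```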
Ensemble Models
- Combine multiple models into a meta-model for better performance
- Averaging multiple models - less likely to overfit if the models are independent
- Feed full dataset (or a slice) into multiple models
- Use an aggregation function (majority vote, average, weighted average, etc.) to determine the overall prediction from the individual predictions of each model (sketch after this list)
- Challenges:
- Time / compute resources to train
- Cost of running multiple models
- Decrease in interpretability
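A sketch of a simple ensemble, assuming scikit-learn's VotingClassifier handles the aggregation step; the member models and dataset are arbitrary choices for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Toy dataset (made up for illustration).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Feed the same data into several different models, then aggregate
# their individual predictions with a majority vote ("hard" voting).
ensemble = VotingClassifier(
    estimators=[
        ("logreg", LogisticRegression(max_iter=1000)),
        ("tree", DecisionTreeClassifier(max_depth=4)),
        ("knn", KNeighborsClassifier()),
    ],
    voting="hard",
)
ensemble.fit(X, y)
print(ensemble.predict(X[:5]))
```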
Bagging
- Bootstrap aggregation - sampling with replacement
- Pull out a sample from a set for use in a model; replace the pulled out sample
- Can end up pulling the same sample multiple times
- Because each model is trained on a different bootstrap sample, the resulting models can be treated as roughly independent
- The average of the predictions is then taken
- Build a series of trees, then average their outputs (sketch after this list)
- Reduces overfitting
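A hand-rolled bagging sketch, assuming scikit-learn trees and NumPy; the number of trees and the dataset are arbitrary, the point is the sampling-with-replacement loop and the averaging step:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
rng = np.random.RandomState(0)

# Bootstrap aggregation: each tree is trained on a sample drawn WITH replacement,
# so the same row can appear more than once in a given sample.
trees = []
for _ in range(25):
    idx = rng.randint(0, len(X), size=len(X))   # sampling with replacement
    trees.append(DecisionTreeRegressor().fit(X[idx], y[idx]))

# Average the individual predictions to get the bagged prediction.
preds = np.mean([t.predict(X[:3]) for t in trees], axis=0)
print(preds)
```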
Random Forest
- Use bagging to train multiple trees
- Use multiple trees and take the majority vote
- Each member model may use a different subset of both features and observations to train
- Decisions driving this:
- Number of trees
- Sampling strategy for bagging
- Bagging sample size as a % of the total
- Max number of features represented in each bagging sample
- Depth of trees
- Max depth
- Min samples per leaf
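A sketch that maps the decisions above onto scikit-learn's RandomForestClassifier parameters, assuming a reasonably recent scikit-learn (max_samples requires version 0.22 or later); the specific values are arbitrary:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

forest = RandomForestClassifier(
    n_estimators=200,      # number of trees
    bootstrap=True,        # bagging: sample with replacement
    max_samples=0.8,       # bagging sample size as a fraction of the data
    max_features="sqrt",   # features considered at each split
    max_depth=10,          # depth of trees
    min_samples_leaf=2,    # min samples per leaf
    random_state=0,
).fit(X, y)

print(forest.predict(X[:5]))   # majority vote across the trees
```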
Clustering
- Unsupervised learning technique
- Organize data points into logical groups
- Sort similar data points into the same cluster and dissimilar data points into different clusters
- Ex: dividing potential customers into groups and targeting each group with specific marketing
- Key decision - determining the basis for similarity in data
K-Means Clustering
- Sort data based on distance from the nearest cluster center
- Ex: data points near each other go into one cluster; far from each other - into another cluster
- Goal - minimize the sum of distances from each data point to its assigned cluster center
- Select initial cluster centers at random
- Assign each data point to the closest center
- Move each center to the mean of the data points assigned to it (re-center)
- Repeat the assignment and re-centering steps until the centers stop moving (sketch after this list)
- Strengths:
- Easy to implement
- Quick to run
- Good starting point for working with clustering
- Weaknesses:
- Requires specifying the number of clusters ahead of time
- Does not work well for geometrically complex data (e.g. non-spherical clusters)
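A minimal K-Means sketch, assuming scikit-learn's KMeans; the blob data and n_clusters=3 are made up for illustration:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Toy data with 3 well-separated blobs (made up for illustration).
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# n_clusters must be chosen up front -- the weakness noted above.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

print(kmeans.cluster_centers_)   # final centers after re-centering converges
print(kmeans.labels_[:10])       # cluster assignment for each data point
print(kmeans.inertia_)           # sum of squared distances to the nearest center
```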