Testing a Model
- Divide data into
- Training Set - build / train the model
- Test Set - model performance evaluation set; data different from Training set
- Need to be careful with data leakage - test data getting into the model-building process
- This causes the performance estimate to be overly optimistic compared to performance on genuinely new data
- Further split Training set into:
- Training Set
- Validation Set - used for tuning and improving the model and for monitoring performance during development, without touching the Test set
- And only then test using the Test set - an unbiased check on a separate set of data the model has never seen (see the hold-out split sketch after this list)
- K-Folds Cross Validation
- Another test strategy
- Split the data into subsets ("folds")
- Run multiple iterations of the model, each time training on the other folds and validating on a different held-out fold
- Calculate the error as the average error across the K validation runs
- This approach is industry standard as it:
- Maximizes the use of the training data
- Provides better insight into performance
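A minimal sketch of the hold-out split described above, assuming scikit-learn; the toy regression data stands in for a real dataset:

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

# Toy data standing in for a real feature matrix X and target y (hypothetical).
X, y = make_regression(n_samples=1000, n_features=10, noise=10.0, random_state=42)

# Carve out the Test set (20%) first and never touch it while building the model.
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Split the remainder into Training (60% of total) and Validation (20% of total) sets.
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, random_state=42)

# Train on the Training set, tune on the Validation set,
# and report final performance once on the Test set.
```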
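And a sketch of K-fold cross-validation on the same kind of data; the linear-regression model is just a placeholder estimator:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = make_regression(n_samples=1000, n_features=10, noise=10.0, random_state=42)

# 5 folds: each run trains on 4 folds and validates on the held-out fold.
kfold = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(LinearRegression(), X, y,
                         scoring="neg_mean_squared_error", cv=kfold)

# Report the error as the average across the K validation runs.
print("Average MSE across folds:", -scores.mean())
```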
Model Evaluation
- Metrics used
- Outcome - business impact, usually $
- Output - the model's raw output (predictions); usually remains internal, not customer-facing
- Regression Metrics (computation sketches follow this list):
- MSE - Mean Squared Error
- Most popularly used
- Influenced by large errors (these are penalized heavily) and by the scale of the data
- This is the metric to use if concerned with minimizing large errors on outlier data points
- MAE - Mean Absolute Error
- Influenced by scale
- Can be easier to interpret
- Doesn't penalize outlier errors as heavily as MSE
- MAPE - Mean Absolute Percent Error
- Converts an error to %
- Easier for non-technical audiences to understand
- An error could be small - but when compared to the value of the target, the % could be large
- Coefficient of Determination (R-squared)
- Definition - the proportion of the variation in the dependent variable that is predictable from the independent variable(s)
- Displays how much of the variability in the target variable is explained by the model
- Total squared deviation from the mean (SST) = Sum of squares due to regression (SSR) + Sum of squared errors (SSE)
- R-squared = SSR/SST = 1 - SSE/SST
- Usually between 0 and 1; the closer it is to 1, the better the model explains variability in the target variable
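A minimal sketch of the three error metrics above, assuming scikit-learn; the targets and predictions are made-up numbers:

```python
import numpy as np
from sklearn.metrics import (mean_absolute_error,
                             mean_absolute_percentage_error,
                             mean_squared_error)

# Hypothetical actual targets and model predictions.
y_true = np.array([100.0, 150.0, 200.0, 250.0])
y_pred = np.array([110.0, 140.0, 260.0, 255.0])

print("MSE :", mean_squared_error(y_true, y_pred))              # penalizes the large 60-unit miss heavily
print("MAE :", mean_absolute_error(y_true, y_pred))             # same units as the target, easier to read
print("MAPE:", mean_absolute_percentage_error(y_true, y_pred))  # error as a fraction of the target value
```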
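And a sketch of R-squared computed by hand from the sums of squares, checked against scikit-learn's r2_score:

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([100.0, 150.0, 200.0, 250.0])
y_pred = np.array([110.0, 140.0, 260.0, 255.0])

sst = np.sum((y_true - y_true.mean()) ** 2)   # total squared deviation from the mean
sse = np.sum((y_true - y_pred) ** 2)          # sum of squared errors (residuals)

r_squared = 1 - sse / sst                     # R-squared = 1 - SSE/SST
print(r_squared, r2_score(y_true, y_pred))    # the two values agree
```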
Accuracy
- To aim at better accuracy, be careful with class imbalance in the input data (e.g., how many days a year is a healthy person sick? A model that always predicts "healthy" would score very high accuracy while being useless)
- Confusion Matrix - True/False Positives vs True/False Negatives (code sketches for these classification metrics follow this list)
- FPR - False Positive Rate, i.e. False Positives / (False Positives + True Negatives) - the share of actual negatives incorrectly flagged as positive
- Precision - True Positives / (True Positives + False Positives); not the same as the True Positive Rate (TPR, also called Recall) = True Positives / (True Positives + False Negatives)
- Receiver Operating Characteristic (ROC) Curves
- FPR and TPR across different threshold values for the given model
- I.e. set a threshold - a prediction below the threshold is a Negative, above - Positive
- Typical threshold is 0.5
- Use the Positives/Negatives to generate TPR/FPR by comparing these to the actual targets
- Re-run this across different thresholds and plot TPR (y-axis) vs FPR (x-axis) on a graph - the ROC Curve
- Areas Under ROC (AUROC)
- Area under ROC Curve
- Higher AUROC - better quality model
- Precision Recall Curve (PR)
- Measures Precision vs Recall values across multiple thresholds
- Used in situations of high class imbalance (many 0's, few 1's)
- Does not factor True Negatives
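A minimal sketch of the confusion-matrix metrics (Precision, Recall/TPR, FPR), assuming scikit-learn; the labels and predictions are made up:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical actual labels and predicted labels (1 = positive, 0 = negative).
y_true = np.array([1, 0, 1, 1, 0, 0, 0, 1, 0, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 0, 1, 0, 1])

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)   # of everything predicted positive, how much was actually positive
recall_tpr = tp / (tp + fn)  # True Positive Rate: share of actual positives the model found
fpr = fp / (fp + tn)         # False Positive Rate: share of actual negatives flagged as positive

print(precision, precision_score(y_true, y_pred))  # manual and library values agree
print(recall_tpr, recall_score(y_true, y_pred))
print("FPR:", fpr)
```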
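A sketch of sweeping the threshold to build a ROC curve and computing AUROC; the predicted probabilities are hypothetical:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Hypothetical actual labels and predicted probabilities of the positive class.
y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1, 1, 0])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.5, 0.9, 0.6, 0.3])

# roc_curve sweeps the threshold and returns matching FPR (x-axis) and TPR (y-axis) values;
# plotting tpr against fpr gives the ROC curve.
fpr, tpr, thresholds = roc_curve(y_true, y_score)

print("AUROC:", roc_auc_score(y_true, y_score))  # area under the ROC curve; higher is better
```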
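And a sketch of the precision-recall curve, which is more informative when positives are rare; again the data is made up:

```python
import numpy as np
from sklearn.metrics import average_precision_score, precision_recall_curve

# Imbalanced hypothetical labels (few positives) and predicted probabilities.
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1])
y_score = np.array([0.1, 0.2, 0.15, 0.3, 0.25, 0.05, 0.4, 0.85, 0.6, 0.35, 0.2, 0.7])

# precision_recall_curve sweeps the threshold and returns precision/recall pairs;
# true negatives never enter the calculation.
precision, recall, thresholds = precision_recall_curve(y_true, y_score)

print("Average precision:", average_precision_score(y_true, y_score))
```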
Errors
- Common error causes:
- Problem framing and metric selection
- Data quality
- Feature selection
- Model fit
- Inherent error