Testing a Model
- Divide data into
- Training Set - build / train the model
- Test Set - model performance evaluation set; data different from Training set
- Need to be careful with data leakage - test data getting into the model-building process
- This causes the performance estimate to be overly optimistic compared to performance on genuinely new data
- Further split Training set into:
- Training Set
- Validation Set - used for tuning and improving the model and for monitoring performance during development, without touching the Test set
- And only then test using the Test set - an unbiased check on a separate set of data the model has never seen (see the hold-out split sketch after this list)
- K-Folds Cross Validation
- Another test strategy
- Split the data into subsets ("folds")
- Run multiple iterations of the model, each time training on the other folds and validating on a different held-out fold
- Calculate the error as the average error across the K validation runs
- This approach is industry standard as it:
- Maximizes the use of the training data
- Provides better insight into performance
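A minimal sketch of the hold-out split described above, assuming scikit-learn; the toy regression data stands in for a real dataset:

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

# Toy data standing in for a real feature matrix X and target y (hypothetical).
X, y = make_regression(n_samples=1000, n_features=10, noise=10.0, random_state=42)

# Carve out the Test set (20%) first and never touch it while building the model.
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Split the remainder into Training (60% of total) and Validation (20% of total) sets.
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, random_state=42)

# Train on the Training set, tune on the Validation set,
# and report final performance once on the Test set.
```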
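And a sketch of K-fold cross-validation on the same kind of data; the linear-regression model is just a placeholder estimator:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = make_regression(n_samples=1000, n_features=10, noise=10.0, random_state=42)

# 5 folds: each run trains on 4 folds and validates on the held-out fold.
kfold = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(LinearRegression(), X, y,
                         scoring="neg_mean_squared_error", cv=kfold)

# Report the error as the average across the K validation runs.
print("Average MSE across folds:", -scores.mean())
```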
Model Evaluation
- Metrics used
- Outcome - business impact, usually $
- Output - the model's raw output (predictions); usually remains internal, not customer-facing
- Regression Metrics (computation sketches follow this list):
- MSE - Mean Squared Error
- Most popularly used
- Influenced by large errors (these are penalized heavily) and by the scale of the data
- This is the metric to use if concerned with minimizing large errors on outlier data points
- MAE - Mean Absolute Error
- Influenced by scale
- Can be easier to interpret
- Doesn't penalize outlier errors as heavily as MSE
- MAPE - Mean Absolute Percent Error
- Converts an error to %
- Easier for non-technical audiences to understand
- An error could be small - but when compared to the value of the target, the % could be large
- Coefficient of Determination (R-squared)
- Definition - the proportion of the variation in the dependent variable that is predictable from the independent variable(s)
- Displays how much of the variability in the target variable is explained by the model
- Total squared deviation from the mean (SST) = Sum of squares due to regression (SSR) + Sum of squared errors (SSE)
- R-squared = SSR/SST = 1 - SSE/SST
- Usually between 0 and 1; the closer it is to 1, the better the model explains variability in the target variable
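A minimal sketch of the three error metrics above, assuming scikit-learn; the targets and predictions are made-up numbers:

```python
import numpy as np
from sklearn.metrics import (mean_absolute_error,
                             mean_absolute_percentage_error,
                             mean_squared_error)

# Hypothetical actual targets and model predictions.
y_true = np.array([100.0, 150.0, 200.0, 250.0])
y_pred = np.array([110.0, 140.0, 260.0, 255.0])

print("MSE :", mean_squared_error(y_true, y_pred))              # penalizes the large 60-unit miss heavily
print("MAE :", mean_absolute_error(y_true, y_pred))             # same units as the target, easier to read
print("MAPE:", mean_absolute_percentage_error(y_true, y_pred))  # error as a fraction of the target value
```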
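And a sketch of R-squared computed by hand from the sums of squares, checked against scikit-learn's r2_score:

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([100.0, 150.0, 200.0, 250.0])
y_pred = np.array([110.0, 140.0, 260.0, 255.0])

sst = np.sum((y_true - y_true.mean()) ** 2)   # total squared deviation from the mean
sse = np.sum((y_true - y_pred) ** 2)          # sum of squared errors (residuals)

r_squared = 1 - sse / sst                     # R-squared = 1 - SSE/SST
print(r_squared, r2_score(y_true, y_pred))    # the two values agree
```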
Accuracy
- To aim at better accuracy, be careful with class imbalance in the input data (e.g., how many days a year is a healthy person sick? A model that always predicts "healthy" would score very high accuracy while being useless)
- Confusion Matrix - True/False Positives vs True/False Negatives (code sketches for these classification metrics follow this list)
- FPR - False Positive Rate, i.e. False Positives / (False Positives + True Negatives) - the share of actual negatives incorrectly flagged as positive
- Precision - True Positives / (True Positives + False Positives); not the same as the True Positive Rate (TPR, also called Recall) = True Positives / (True Positives + False Negatives)
- Receiver Operating Characteristic (ROC) Curves
- FPR and TPR across different threshold values for the given model
- I.e. set a threshold - a prediction below the threshold is a Negative, above - Positive
- Typical threshold is 0.5
- Use the Positives/Negatives to generate TPR/FPR by comparing these to the actual targets
- Re-run this across different thresholds and plot TPR (y-axis) vs FPR (x-axis) on a graph - the ROC Curve
- Areas Under ROC (AUROC)
- Area under ROC Curve
- Higher AUROC - better quality model
- Precision Recall Curve (PR)
- Measures Precision vs Recall values across multiple thresholds
- Used in situations of high class imbalance (many 0's, few 1's)
- Does not factor True Negatives
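A minimal sketch of the confusion-matrix metrics (Precision, Recall/TPR, FPR), assuming scikit-learn; the labels and predictions are made up:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical actual labels and predicted labels (1 = positive, 0 = negative).
y_true = np.array([1, 0, 1, 1, 0, 0, 0, 1, 0, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 0, 1, 0, 1])

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)   # of everything predicted positive, how much was actually positive
recall_tpr = tp / (tp + fn)  # True Positive Rate: share of actual positives the model found
fpr = fp / (fp + tn)         # False Positive Rate: share of actual negatives flagged as positive

print(precision, precision_score(y_true, y_pred))  # manual and library values agree
print(recall_tpr, recall_score(y_true, y_pred))
print("FPR:", fpr)
```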
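A sketch of sweeping the threshold to build a ROC curve and computing AUROC; the predicted probabilities are hypothetical:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Hypothetical actual labels and predicted probabilities of the positive class.
y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1, 1, 0])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.5, 0.9, 0.6, 0.3])

# roc_curve sweeps the threshold and returns matching FPR (x-axis) and TPR (y-axis) values;
# plotting tpr against fpr gives the ROC curve.
fpr, tpr, thresholds = roc_curve(y_true, y_score)

print("AUROC:", roc_auc_score(y_true, y_score))  # area under the ROC curve; higher is better
```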
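And a sketch of the precision-recall curve, which is more informative when positives are rare; again the data is made up:

```python
import numpy as np
from sklearn.metrics import average_precision_score, precision_recall_curve

# Imbalanced hypothetical labels (few positives) and predicted probabilities.
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1])
y_score = np.array([0.1, 0.2, 0.15, 0.3, 0.25, 0.05, 0.4, 0.85, 0.6, 0.35, 0.2, 0.7])

# precision_recall_curve sweeps the threshold and returns precision/recall pairs;
# true negatives never enter the calculation.
precision, recall, thresholds = precision_recall_curve(y_true, y_score)

print("Average precision:", average_precision_score(y_true, y_score))
```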
Errors
- Common error causes:
- Problem framing and metric selection
- Data quality
- Feature selection
- Model fit
- Inherent error