Thursday, November 9, 2023

Neural Networks

Neural Networks

  • Multilayer perceptron (MLP)
    • Multiple perceptrons are stacked in layers; the outputs of those are fed into another perceptron
    • Multiclass - if multiple output classes are possible, can use multiple units in the output layer
  • Neural Networks
    • Instead of a (linear) perceptron in each unit, can use units with non-linear activation functions in each layer
      • Sigmoid, Hyperbolic tangent (Tanh), ReLU
    • This will improve modeling for non-linear relationships
    • Network architecture - Forward propagation
      • Input Layer - input data, features
      • Hidden Layer - receives the features, which are multiplied by weights; passes the results through the Activation Function
      • Output Layer - the outputs of the hidden layer are fed in, multiplied by weights, and passed through an Activation Function; the prediction is produced (see the NumPy sketch at the end of this section)
      • In general - there is one input layer, one output layer; the rest are hidden layers
      • If we have more than two classes, need one output unit per class to get a probability score for each class
    • Strengths
      • Can model complicated relationships with large number of features
      • No or little feature engineering needed
      • Modern tools make these accessible - multiple models exist and are ready to use
    • Weaknesses
      • Computationally expensive
      • Large power consumption
      • Outputs are difficult to interpret - black box producing results
      • Can easily be overfitted, especially if input data set is small
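
A minimal NumPy sketch of forward propagation through one hidden layer (the layer sizes and the ReLU/sigmoid choices are illustrative assumptions, not prescribed by these notes):

    import numpy as np

    def relu(z):
        return np.maximum(0.0, z)            # hidden-layer activation

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))      # output activation (binary case)

    rng = np.random.default_rng(0)
    x = rng.normal(size=3)                           # input layer: 3 features
    W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)    # hidden layer: 4 units
    W2, b2 = rng.normal(size=(1, 4)), np.zeros(1)    # output layer: 1 unit

    h = relu(W1 @ x + b1)            # hidden layer: weights, then activation
    y_hat = sigmoid(W2 @ h + b2)     # output layer: the prediction (a probability)
    print(y_hat)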

Training Neural Networks

  • Backpropagation
    • Working in reverse to distribute the total output error among layers
    • Calculate the gradient of the cost function, i.e. the gradient of the error with respect to each weight; perform gradient descent on the weights
  • Approaches to network design
    • Stretch pants - start with a network too large for the problem; work to reduce overfitting
    • Transfer learning - use a pre-built / pre-trained neural network built for a relevant problem
      • Fine-tune - cut off the final layers; add new ones and re-train those
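
A hedged Keras sketch of this transfer-learning recipe (MobileNetV2, the input shape, and the 10-class head are arbitrary illustrative choices):

    from tensorflow import keras

    # Load a network pre-trained on ImageNet with its final classification layers cut off
    base = keras.applications.MobileNetV2(
        input_shape=(224, 224, 3), include_top=False, weights="imagenet")
    base.trainable = False    # freeze the pre-trained weights

    # Add new final layers and re-train only those on the new problem
    model = keras.Sequential([
        base,
        keras.layers.GlobalAveragePooling2D(),
        keras.layers.Dense(10, activation="softmax"),   # assume 10 target classes
    ])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")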

Common Use Cases

Computer Vision
  • Image classification - facial recognition, x-ray analysis
  • Object detection - find objects within an image by drawing boxes around objects and analyzing the inside (ex: self-driving cars)
  • Semantic segmentation - analyze each pixel; find where exactly where the object begins and ends (no boxes)
  • Image generation - anatomical images

Convolutional neural network

  • These are most commonly used in image recognition models
  • In a typical setup, layers are fully interconnected
    • Every value in the previous layer is connected by a weight to every value in the following layer
  • The number of features can grow and the number of potential weights can become unmanageable
  • This can weigh on performance and maintenance
  • Convolutional neural networks (CNNs) use additional types of layers:
    • Convolutional Layers
      • Act as filters; a set of weights is applied across the entire data set
      • A node is connected only to a local patch of nodes in the previous layer, and the same weights are shared across patches; i.e. not all nodes connect to all nodes
      • In a nutshell, the way this works is (sketched in NumPy at the end of this section):
        • Apply a filter (a set of weights) across a section of values; multiply each value by its weight; add up the results
        • Shift over and apply the same filter to the next section within the dataset; sum the result again
        • Combine the summed-up results into a feature map
    • Pooling Layers
      • As the set of feature maps is combined into a layer, an additional pooling layer can be introduced to flatten the data further
        • A section of a feature map is taken and its mean or max is calculated; the resulting value represents the original section and is passed on to the next level
        • The aim is to minimize the number of weights the model needs to train
  • ImageNet - a database of images that most image recognition models are trained on
    • 14 million images; 20K categories
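
A minimal NumPy sketch of the filter-and-pool mechanics above (the 6×6 input, 3×3 filter, and 2×2 max pooling are illustrative assumptions):

    import numpy as np

    image = np.arange(36, dtype=float).reshape(6, 6)   # toy single-channel "image"
    kernel = np.array([[1., 0., -1.],
                       [1., 0., -1.],
                       [1., 0., -1.]])                 # one 3x3 filter (shared weights)

    # Convolution: slide the same filter across the image; multiply and sum each patch
    fmap = np.zeros((4, 4))                            # feature map (6 - 3 + 1 = 4)
    for i in range(4):
        for j in range(4):
            fmap[i, j] = np.sum(image[i:i+3, j:j+3] * kernel)

    # Max pooling: replace each 2x2 section of the feature map with its maximum
    pooled = fmap.reshape(2, 2, 2, 2).max(axis=(1, 3))
    print(fmap.shape, pooled.shape)                    # (4, 4) (2, 2)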

Natural Language Processing

  • Text classification (ex: spam/not-spam)
  • Sentiment analysis (ex: determine positive/negative sentiment of a consumer towards a product)
  • Search (ex: search for the answer given a question)
  • Machine translation (ex: language translation)
  • Text generation (ex: generated automated email response to an email)

Text Representation

  • Techniques for fitting text into models - converting text to numerical values for use as inputs to a model:
  • Vocabulary - bag of words; each word is assigned an index, and we count how many times each word appears in the text (see the sketch at the end of this section)
  • Embedding - a word or a document is assigned a numerical value aimed to capture the meaning of the word/text
    • Word2Vec, GloVe
  • Attention - used in Transformer models. Transformers use:
    • Word embedding
    • Positional encoding - determining where in a sentence a word is placed
    • Attention - measure of how strongly words within a sentence are related, regardless of their position
      • Ex: "The man ate food, because he was hungry" - attention links "he" strongly to "man" regardless of the distance between them
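
A small scikit-learn sketch of the bag-of-words idea (the example sentences are invented). Note how the counts ignore word order, which is exactly what positional encoding and attention add back:

    from sklearn.feature_extraction.text import CountVectorizer

    docs = ["the man ate food", "food ate the man"]   # same words, different order
    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(docs)        # assign each word a vocabulary index
    print(vectorizer.get_feature_names_out()) # the vocabulary
    print(X.toarray())                        # word counts per document (identical rows)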

Wednesday, November 8, 2023

Deep Learning

Deep learning

  • "A subfield of machine learning that focuses on artificial neural networks and deep neural networks."
  • Neural Network - "computational model that is inspired by the structure and functioning of the human brain" (ChatGPT definition)
  • Neural network with many layers - Deep neural network
  • Pre-history:
    • 1943, Warren McCulloch and Walter Pitts - publish a model for how neurons work together to perform computations
    • 2000's saw a boom in deep learning as more computation power and data became available
  • Deep learning shines when
    • Great amount of data is available
    • Large amount features (ex: unstructured data)
    • Complex relationships b/w features and targets
    • Explainability is not highly required (black box!)

Artificial Neurons

  • Perceptron
    • A simple model - a set of inputs is multiplied by a set of weights
    • The result is passed through a threshold function
    • If the result is >= the threshold, the neuron is activated
    • The output is a 0/1 prediction; no probability values for the classes are calculated
    • Used for binary classification tasks; in reality - it is a simple linear model
  • Logistic Regression
    • Same as above, but the result of the weights times the inputs is passed through an Activation (sigmoid) Function prior to going into the Threshold Function
    • The model output is a probability from 0 to 1; it is then converted to a 0 or 1 prediction
    • If probability is >=0.5, prediction is 1, otherwise 0
    • Goal - find the weight values that minimize the cost (error)
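
A NumPy sketch contrasting the two units above (the weights and input values are arbitrary):

    import numpy as np

    w = np.array([0.4, -0.2, 0.1])    # weights
    b = 0.05                          # bias
    x = np.array([1.0, 2.0, 3.0])     # input features
    z = w @ x + b                     # weighted sum

    # Perceptron: hard threshold, 0/1 output, no probability
    perceptron_pred = int(z >= 0)

    # Logistic regression: sigmoid first, giving a probability, then threshold at 0.5
    p = 1 / (1 + np.exp(-z))
    logistic_pred = int(p >= 0.5)
    print(z, p, perceptron_pred, logistic_pred)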

Training a Neuron

  • Forward Propagation - first a set of weights is propagated through the model, a prediction is calculated
  • Calculate the cost (error) and the gradient of that cost with respect to each weight; go through each weight and update its value using gradient descent
  • Eventually the values of weight that result in minimal cost are found - these weights will be used in the final model
  • Need to focus on the Learning rate - how big of a step is taken in going through gradient descent
    • If too small - the algorithm will be too slow
    • If too large - the updates will jump all over the place and may never converge
  • Gradient descent methods
    • Stochastic (SGD)
      • Iterate through observations one at a time; calculate the gradient and update the weights for each
      • Pros - works well for large data sets, online learning
      • Cons - cannot use vectorized linear-algebra operations
    • Batch
      • Use the entire data set in each update; calculate the gradient and update the weights based on all observations in each iteration
      • Pros - can use vectorized matrix operations; use those efficiently
      • Cons - may be impossible to achieve on large data sets due to the compute power required
    • Mini-batch
      • Divide the data set into smaller batches; perform batch gradient descent on each
      • Pros - works well for large data sets, uses vectorized operations
      • Cons - not as good as SGD for online learning (when observations come in one at a time)
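
A minimal NumPy sketch of mini-batch gradient descent for logistic regression (synthetic data; the learning rate, batch size, and epoch count are arbitrary assumptions):

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 2))                    # 200 observations, 2 features
    y = (X[:, 0] + X[:, 1] > 0).astype(float)        # synthetic 0/1 targets
    w, b, lr, batch = np.zeros(2), 0.0, 0.1, 32      # weights, bias, learning rate

    for _ in range(50):                              # epochs
        idx = rng.permutation(len(X))                # shuffle, then split into batches
        for start in range(0, len(X), batch):
            i = idx[start:start + batch]
            p = 1 / (1 + np.exp(-(X[i] @ w + b)))    # forward propagation
            grad_w = X[i].T @ (p - y[i]) / len(i)    # gradient of cost w.r.t. weights
            grad_b = np.mean(p - y[i])
            w -= lr * grad_w                         # gradient descent update
            b -= lr * grad_b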

Tuesday, November 7, 2023

Supervised Learning Algorithms - Tree and Ensemble

Decision Trees

  • Data analysis - ask a series of questions to narrow down on the label (Color? Brown; Has wings? Yes; Answer - Bird)
  • Splits - the goal is to:
    • Minimize the number of splits, thus making the tree maximally efficient
    • Maximize Information Gain (IG) at each split (i.e. ask the right questions)
      • Impurity - how heavily the data across multiple classes is mixed
      • IG = decrease in impurity = Impurity(parent) - weighted average of Impurity(children)
  • Tree depth - hyperparameter; max number of splits that can occur
    • Analyze the data set - each split divides it in two; a leaf predicts according to the majority class present in it
    • Shallow tree - can underfit the data
    • Deep tree - can overfit (i.e. every example ends up in a separate leaf, capturing a lot of noise)
  • To predict, trace through the tree until a leaf is reached; then use the training points at that leaf - their majority class, or their mean target value in the regression case below - as the prediction

Regression trees

  • Instead of going by the majority, predict by the mean target value of the sample (i.e. calculate the mean of the leaf constituents and make the prediction based on that mean); see the sketch after this list
  • Benefits:
    • Very interpretable
    • Train quickly
    • Can handle non-linear relationships
    • No need for scaling
  • Challenges:
    • Sensitive to depth
    • Prone to overfitting
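
A scikit-learn sketch of the depth trade-off on a regression tree (the sine-wave data is synthetic):

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    rng = np.random.default_rng(0)
    X = rng.uniform(0, 10, size=(100, 1))
    y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=100)   # non-linear target

    shallow = DecisionTreeRegressor(max_depth=2).fit(X, y)  # risks underfitting
    deep = DecisionTreeRegressor(max_depth=None).fit(X, y)  # risks overfitting
    print(shallow.predict([[5.0]]), deep.predict([[5.0]]))  # leaf means as predictions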

Ensemble Models

  • Combine multiple models into a meta-model for better performance
  • Averaging multiple models - less likely to overfit if the models are independent
    • Feed full dataset (or a slice) into multiple models
    • Use an Aggregation function (majority, average, weighted average, etc.) to determine the overall prediction based on the individual predictions of each model
  • Challenges:
    • Time / compute resources to train
    • Cost of running multiple models
    • Decrease in interpretability

Bagging

  • Bootstrap aggregation - sampling with replacement
  • Pull out a sample from a set for use in a model; replace the pulled out sample
  • Can end up pulling the same sample multiple times
  • Because each model is trained using a different data set, the results can be considered independent
  • The average of the predictions is then taken
    • Build a series of trees, then average the outputs
    • Reduces overfitting

Random Forest

  • Use bagging to train multiple trees
  • Use multiple trees and take the majority vote
  • Each member model may use a different subset of both features and observations to train
  • Decisions driving this:
    • Number of trees
    • Sampling strategy for bagging
      • Bagging sample size as % of total
      • Max number of features represented in the bagging sample
    • Depth of trees
      • Max depth
      • Min samples per leaf
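
These decisions map onto scikit-learn's RandomForestClassifier roughly as follows (parameter values are arbitrary; note scikit-learn samples features per split rather than per bagging sample):

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier

    X, y = make_classification(n_samples=500, n_features=10, random_state=0)
    forest = RandomForestClassifier(
        n_estimators=100,       # number of trees
        max_samples=0.8,        # bagging sample size as a fraction of the total
        max_features="sqrt",    # max features considered at each split
        max_depth=8,            # max depth of each tree
        min_samples_leaf=5,     # min samples per leaf
        random_state=0,
    ).fit(X, y)
    print(forest.predict(X[:3]))   # majority vote across the trees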

Clustering

  • Unsupervised learning technique
  • Organize data points into logical groups
  • Sort similar data points into the same cluster - and different data points into other clusters
    • Ex: dividing potential customers into groups and targeting each with specific marketing
    • Key decision - determining the basis for similarity in data

K-Means Clustering

  • Sort data based on distance from the nearest cluster center
    • Ex: data points near each other go into one cluster; far from each other - into another cluster
  • Goal - minimize the sum of distances from each data point to its cluster center (steps sketched in scikit-learn at the end of this section)
    1. Select cluster centers at random
    2. Assign each data point to the closest center
    3. Move each center to the mean of the data points assigned to it (re-center)
    4. Repeat steps 2 and 3 until the centers are no longer moving
  • Strengths:
    • Easy to implement
    • Quick to run
    • Good starting point for working with clustering
  • Weakness:
    • Requires identifying how many clusters will be used ahead of time
    • Does not work well for geometrically complex data
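
A scikit-learn sketch of steps 1-4 (the blob data is synthetic; n_clusters must be given up front, per the weakness above):

    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs

    X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
    km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)  # steps 1-4 above
    print(km.cluster_centers_)   # final (re-centered) cluster centers
    print(km.inertia_)           # sum of squared distances to the nearest center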

Monday, November 6, 2023

Linear Regression and Regularization

Types of Algorithms

  • Parametric algorithm (Linear model)
    • Based on a mathematical model that defines the relationship between inputs and outputs
    • Fixed and pre-determined set of parameters and coefficients
    • Can learn quickly
    • Can be too simple for real world - prone to underfitting
  • Non-Parametric algorithm
    • Does not make assumptions about relationship b/w input and output ahead of building a model
    • Flexible and adapts well to non-linear data relationships
    • Can perform better than linear models when relationships are non-linear
    • Requires more data to train; prone to overfitting

Linear Regression

  • Linear relations b/w features and targets (input and the output that we are looking to predict), defined by set of coefficients
  • Simple, easy to interpret and see the relationship between inputs and outputs
  • Yet is a basis for many more complex models
  • Example: Home Price as a function of # of Bedrooms
  • Y = W0 + W1*X; W0 is the Bias; W1 is the Coefficient/weight
  • Multiple Linear regression
    • y = W0 + W1*X1 + W2*X2 +…+Wp*Xp
    • Additional W's - ex: House Size; Location, etc.
  • Total Error of a Model
    • SSE - Cost function
      • Sum of Squared Errors - the sum of (Prediction minus Actual) squared over all observations
    • The goal is to minimize it
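
A NumPy sketch of the model and its SSE cost (the house data and weight values are invented):

    import numpy as np

    bedrooms = np.array([1., 2., 3., 4.])          # feature X
    price = np.array([100., 150., 210., 240.])     # target Y (in $k)

    w0, w1 = 50.0, 50.0                            # bias and weight (guessed values)
    pred = w0 + w1 * bedrooms                      # Y = W0 + W1*X
    sse = np.sum((pred - price) ** 2)              # cost: sum of squared errors
    print(pred, sse)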

Polynomial Regression

  • Non-linear relationships can be modeled as well
  • Ex: use a non-linear function to create a feature, such as x² or log(x); then use it as an input (see the sketch below)
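
A scikit-learn sketch of this feature trick (the quadratic data and the degree are assumptions):

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.preprocessing import PolynomialFeatures

    X = np.linspace(1, 10, 20).reshape(-1, 1)
    y = 3 * X[:, 0] ** 2 + 5                                 # quadratic relationship

    X_poly = PolynomialFeatures(degree=2).fit_transform(X)   # adds x² as a feature
    model = LinearRegression().fit(X_poly, y)                # still a linear model
    print(model.predict(PolynomialFeatures(degree=2).fit_transform([[4.0]])))  # ~53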

Regularization

  • When a complex model does not predict well on new data
    • Add a penalty factor to the cost function to penalize feature complexity
    • This would reduce the complexity in the model and decrease the probability of overfitting thus helping the model performance
    • I.e. there is a coefficient per feature - the more features there are, the more coefficients are added to the penalty term, which weighs on the overall regression cost
    • Use another variable, lambda, to multiply the sum of the coefficients - this controls the severity of the penalty
    • Penalty factor
      • LASSO regression
        • Penalty factor is the sum of the absolute value of the coefficients, multiplied by lambda
        • Reduces the coefficients of irrelevant features to 0
        • Suitable for a simpler and more interpretable model
      • Ridge regression
        • Penalty factor is the sum of the coefficients squared, multiplied by lambda
        • Reduces them close to 0 but does not completely remove them
        • Suitable when there is complex relationship of target to many features with collinearity/correlation
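
A scikit-learn sketch contrasting the two penalties (alpha plays the role of lambda here; the data is synthetic, with three irrelevant features):

    import numpy as np
    from sklearn.linear_model import Lasso, Ridge

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 5))
    y = 2 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=100)

    lasso = Lasso(alpha=0.1).fit(X, y)   # L1: drives irrelevant coefficients to exactly 0
    ridge = Ridge(alpha=0.1).fit(X, y)   # L2: shrinks coefficients toward (not to) 0
    print(lasso.coef_)                   # expect zeros on the irrelevant features
    print(ridge.coef_)                   # expect small but non-zero values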

Logistic Regression

  • Mainly used for classification tasks
  • Linear regression is not the best solution when an outcome of 0 or 1 is required
  • Linear regression produces an actual prediction value, not 1 or 0
  • Solution option - predict probability of p(y=1)
  • Use the logistic/sigmoid function - its input is the output of the linear model
  • The output of the sigmoid would be between 0 and 1; this can then be interpreted as the probability of either 0 or 1
  • Gradient descent - "an iterative first-order optimization algorithm, used to find a local minimum/maximum of a given function."
  • Use gradient descent to determine the values of the weights that minimize the cost
    • Update the values of the weights by a small amount to move closer to the weights which minimize our loss/cost function
    • Need: a variable used in every step called the learning rate; the gradient of the cost/loss function itself; and the weight values from the previous iteration
  • To visualize on a graph - iterate against the gradient slowly until reaching the bottom (i.e. the minimum cost point)
  • What if there are more than two classes - predicting multiple classes?
    • Use the softmax function instead of sigmoid - it provides the probability of belonging to each class, normalized to sum to 1 (see the sketch at the end of this section)
    • Then take the class with the highest probability value as our prediction
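
A small NumPy sketch of softmax for the multi-class case (the three scores are arbitrary):

    import numpy as np

    def softmax(z):
        e = np.exp(z - np.max(z))       # subtract the max for numerical stability
        return e / e.sum()

    scores = np.array([2.0, 1.0, 0.1])  # linear-model outputs, one per class
    probs = softmax(scores)
    print(probs, probs.sum())           # probabilities, normalized to sum to 1
    print(np.argmax(probs))             # predicted class: the highest probability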


Sunday, November 5, 2023

Testing and Evaluation of a Model

Testing a Model

  • Divide data into
    • Training Set - build / train the model
    • Test Set - model performance evaluation set; data different from Training set
  • Need to be careful with data leakage - test data getting into model building process
  • This causes the performance estimate to be overly optimistic vs. performance on new data
    • Further split Training set into:
      • Training Set
      • Validation Set - used for improving the model and monitoring performance while tuning
    • And only then test using the Test set - as an unbiased exercise on a separate set of data
  • K-Folds Cross Validation
    • Another test strategy
    • Split the data into subsets ("folds")
    • Run multiple iterations of the model, each time holding out a different "fold" for validation
    • Calculate the Error as the average error across the K validation runs
    • This approach is industry standard as it:
      • Maximizes the use of training data
      • Provides better insight into performance
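
A scikit-learn sketch of K-fold cross-validation (5 folds and logistic regression are illustrative choices):

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    X, y = make_classification(n_samples=300, random_state=0)
    scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)  # 5 folds
    print(scores, scores.mean())   # per-fold scores and their average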

Model Evaluation

  • Metrics used
    • Outcome - business impact, usually $
    • Output - model output; usually remains internal, non-customer
  • Regression Metrics:
    • MSE - Mean Squared Error
      • Most popularly used
      • Influenced by large errors (these are penalized heavily) and by the scale of the data
      • This is the metric to use if concerned with minimizing large errors on outlier datapoints (these metrics are sketched in code at the end of this list)
    • MAE - Mean Absolute Error
      • Influenced by scale
      • Can be easier to interpret
      • Doesn't penalize outlier results as heavily as MSE
    • MAPE - Mean Absolute Percent Error
      • Converts an error to %
      • More easily understood by non-technical clientele
      • An error could be small - but when compared to the value of the target, the % could be large
    • Coefficient of Determination (R-squared)
      • Definition - the proportion of the variation in the dependent variable that is predictable from the independent variable(s)
      • Displays how much of the variability in the target variable is explained by the model
      • Total sum of squares (SST) = Sum of squares due to regression (SSR) + Sum of squared errors (SSE)
      • R-squared = SSR/SST = 1 - SSE/SST
      • Usually b/w 0 and 1; the closer it is to 1, the better the model explains variability in the target variable
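
A sketch computing these metrics with scikit-learn (the actual/predicted values are made up):

    import numpy as np
    from sklearn.metrics import (mean_absolute_error,
                                 mean_absolute_percentage_error,
                                 mean_squared_error, r2_score)

    actual = np.array([100., 150., 200., 250.])
    pred = np.array([110., 145., 190., 270.])
    print(mean_squared_error(actual, pred))              # MSE: penalizes large errors
    print(mean_absolute_error(actual, pred))             # MAE
    print(mean_absolute_percentage_error(actual, pred))  # MAPE (as a fraction)
    print(r2_score(actual, pred))                        # R-squared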

Accuracy

  • When aiming at better accuracy, be careful with class imbalance in the input data (ex: how many days a year is a healthy person sick? A model that always predicts "healthy" scores high accuracy while catching nothing)
  • Confusion Matrix - True/False Positives vs True/False Negatives
    • FPR - False Positive Rate, i.e. False Positives / (False Positives + True Negatives)
    • TPR (Recall) - True Positive Rate, i.e. True Positives / (True Positives + False Negatives); Precision is True Positives / (True Positives + False Positives)
  • Receiver Operating Characteristic (ROC) Curves
    • FPR and TPR across different threshold values for the given model
    • I.e. set a threshold - a prediction below the threshold is a Negative, above - Positive
    • Typical threshold is 0.5
    • Use the Positives/Negatives to generate TPR/FPR by comparing these to the actual targets
    • Re-run this across different thresholds and plot TPR (y-axis) vs FPR (x-axis) on a graph - the ROC Curve
  • Areas Under ROC (AUROC)
    • Area under ROC Curve
    • Higher AUROC - better quality model
  • Precision Recall Curve (PR)
    • Measures Precision vs Recall values across multiple thresholds
    • Used in situations of high class imbalance (many 0's, few 1's)
    • Does not factor True Negatives
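
A scikit-learn sketch of the ROC/AUROC computation described above (the labels and scores are synthetic):

    import numpy as np
    from sklearn.metrics import roc_auc_score, roc_curve

    y_true = np.array([0, 0, 1, 1, 0, 1])
    y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9])  # predicted probabilities

    fpr, tpr, thresholds = roc_curve(y_true, y_score)  # TPR/FPR across thresholds
    print(fpr, tpr)
    print(roc_auc_score(y_true, y_score))              # area under the ROC curve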

Errors

  • Common error causes:
    • Improper problem framing and metric selection
    • Data quality
    • Feature selection
    • Model fit
    • Inherent error

Thursday, November 2, 2023

Building a Machine Learning Model

Building a Model

  • Creating a model is an iterative process:
    • Gather the data needed
    • Selection of features - collect past observations (houses for sale) and targets (sale prices)
    • Choice of algorithm - template for relationship between the input and the output
    • Selection of values for hyperparameters (tuning the dials of the model to make sure it functions)
    • Selection of Loss (cost) function
    • Train the model using past data collected
    • Evaluate the performance

Feature Selection

  • Features - characteristics of the input data
  • Important to define features - the factors that:
    • Might influence the solution (year the house was built, city, neighborhood)
    • The data that we might be able to collect
  • Methods of selection:
    • Speak w experts - domain expertise
    • Collect data - Visualization
    • Past data collection - statistical correlation
    • Collect as much as you can, but narrow down the data - Modeling

Algorithm Selection

  • Dependent on the task that needs to be solved
  • Best approach - employ different algorithms and train them all
  • Criteria for algorithm selection:
    • Performance Accuracy - expected performance of the model
    • Interpretability - how easy/hard it is to interpret and predict the result of the model computation
    • Computational Efficiency - the computational power required relative to the results generated

Model Complexity

Depends on:
  • Number of Features
  • Algorithm - linear regression vs neural networks
  • Hyperparameter Values

Bias-Variance Trade off

  • Bias - modeling a complex problem using a simple model; the model is incapable of fully capturing the depth and underlying patterns in the data.
  • Variance - sensitivity of the model to small fluctuations in the data; ex: interpreting noise as actual patterns
Typically
  • Simpler model = higher bias; lower variance
  • Complex model = lower bias; higher variance
  • Total Error = Bias² + Variance + Inherent Error (noise)
  • Need to be careful about Underfitting / Overfitting the model

Wednesday, November 1, 2023

What is Machine Learning?

What is Machine Learning?

“Field of study that gives computers the ability to learn without being explicitly programmed” – Arthur Samuel, IBM, 1959

  • Machine learning - set of methods & tools which help realize the goal of the field of artificial intelligence
  • Deep learning, or the use of neural networks containing many layers - a subfield of machine learning
  • Computer vision, natural language processing, recommendation systems etc. - sub-fields of AI which rely on machine learning methods

Terminology
  • “Data are characteristics or information, usually numerical, that are collected through observation.” - OECD Glossary of Statistical Terms, https://www.oecd-ilibrary.org/economics/oecd-glossary-of-statistical-terms_9789264055087-en
  • Observations - ex: House
  • Features - ex: Neighborhood, School district, Square footage, Number of bedrooms, Year built
  • Targets - ex: Market sale price
  • Model - an approximation of the relationship between two variables: 
    • Input Var -> Model -> Output Var
    • A model needs:
      • Features - inputs
      • Algorithm - formula
      • Hyperparameter - formula tweaks
      • Loss function - used to optimize the model