Thursday, November 9, 2023

Neural Networks

Neural Networks

  • Multilayer perceptron (MLP)
    • Multiple perceptrons are stacked in layers; the outputs of those are fed into another perceptron
    • Multiclass - if multiple output classes are possible, can use multiple units in the output layer
  • Neural Networks
    • Instead of Perceptron in each unit (linear), can use units with activation functions in each layer
      • Sigmoid, Hyperbolic tangent (Tanh), ReLU
    • This will improve modeling for non-linear relationships
    • Network architecture - Forward propagation (see the sketch after this list)
      • Input Layer - input data, features
      • Hidden Layer - receives the features, multiplies them by weights, and passes the results through the activation function
      • Output layer - the outputs of the hidden layer are fed in, multiplied by weights, and passed through an activation function; the prediction is produced
      • In general - there is one input layer, one output layer; the rest are hidden layers
      • If we have more than two classes, need one output unit per class to get a probability score for each
    • Strengths
      • Can model complicated relationships with large number of features
      • No or little feature engineering needed
      • Modern tools make these accessible - multiple pre-built models exist and are ready to use
    • Weaknesses
      • Computationally expensive
      • Large power consumption
      • Outputs are difficult to interpret - black box producing results
      • Can easily be overfitted, especially if input data set is small
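
A minimal forward-propagation sketch for the architecture above, assuming NumPy; the layer sizes, random weights, and the ReLU/sigmoid pairing are illustrative assumptions, not from the notes:

    import numpy as np

    def relu(z):
        return np.maximum(0, z)

    def sigmoid(z):
        return 1 / (1 + np.exp(-z))

    # Illustrative sizes: 3 input features, 4 hidden units, 1 output unit
    rng = np.random.default_rng(0)
    W1, b1 = rng.normal(size=(3, 4)), np.zeros(4)   # input -> hidden weights
    W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)   # hidden -> output weights

    x = np.array([0.5, -1.2, 3.0])                  # one observation's features

    # Forward propagation: multiply by weights, pass through the activation function
    h = relu(x @ W1 + b1)           # hidden layer
    y_hat = sigmoid(h @ W2 + b2)    # output layer - prediction is produced
    print(y_hat)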

Training Neural Networks

  • Backpropagation
    • Working in reverse to distribute the total output error among layers
    • Calculate the gradient of the cost function, i.e. the gradient of the error with respect to each weight; perform gradient descent on the weights
  • Approaches to network design
    • Stretch pants - start with a network too large for the problem; work to reduce overfitting
    • Transfer learning - use a pre-built / pre-trained neural network built for a relevant problem
      • Fine-tune - cut off the final layers; add new ones and re-train those
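
A hedged backpropagation sketch continuing the forward-propagation example, assuming NumPy; the squared-error cost, learning rate, and synthetic data are assumptions for illustration:

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(8, 3))                        # 8 observations, 3 features
    y = rng.integers(0, 2, size=(8, 1)).astype(float)  # binary targets
    W1, b1 = rng.normal(size=(3, 4)) * 0.1, np.zeros(4)
    W2, b2 = rng.normal(size=(4, 1)) * 0.1, np.zeros(1)
    lr = 0.1                                           # learning rate (assumed)

    for _ in range(100):
        # Forward pass
        h = np.maximum(0, X @ W1 + b1)                # ReLU hidden layer
        y_hat = 1 / (1 + np.exp(-(h @ W2 + b2)))      # sigmoid output
        # Backward pass: distribute the output error among the layers (chain rule)
        d_out = (y_hat - y) * y_hat * (1 - y_hat)     # gradient at the output pre-activation
        dW2 = h.T @ d_out
        d_hidden = (d_out @ W2.T) * (h > 0)           # gradient flowing back through ReLU
        dW1 = X.T @ d_hidden
        # Gradient descent step on every weight
        W2 -= lr * dW2; b2 -= lr * d_out.sum(axis=0)
        W1 -= lr * dW1; b1 -= lr * d_hidden.sum(axis=0)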

Common Use Cases

Computer Vision
  • Image classification - facial recognition, x-ray analysis
  • Object detection - find objects within an image by drawing boxes around objects and analyzing the inside (ex: self-driving cars)
  • Semantic segmentation - analyze each pixel; find exactly where the object begins and ends (no boxes)
  • Image generation - anatomical images

Convolutional neural network

  • These are most commonly used in image recognition models
  • In a typical fully connected setup, layers are fully interconnected
    • Every value in a previous layer is connected by a weight to every value in the following layer
  • The number of features can grow and the number of potential weights can become unmanageable
  • This can weigh on performance and maintenance
  • Convolutional neural networks (CNNs) use additional types of layers:
    • Convolutional Layers
      • Act as filters; a set of weights is applied across the entire data set
      • A node is connected only to a local section of nodes in the layer before, and the same weights are reused across sections; i.e. not all nodes connect to all nodes
      • In a nutshell, the way this works is (see the sketch after this list):
        • Apply a filter (set of weights) across a section of values; multiply each value by its weight; add up the result
        • Shift over and apply the same filter to the next section within the dataset; sum the result again
        • Combine the summed-up results into a feature map
    • Pooling Layers
      • The feature maps can then be condensed further - as the set of feature maps is combined into a layer, an additional pooling layer is introduced: a section of a feature map is taken and its mean or max is calculated; the resulting value represents the original section and is passed on to the next level. The aim is to minimize the number of weights the model needs to train
      • ImageNet - a database of images that most image recognition models are trained on
        • 14 million images; 20K categories
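
A minimal sketch of one convolutional filter plus max pooling over a 2-D input, assuming NumPy; the 6x6 input, the 3x3 filter values, and the 2x2 pool size are illustrative assumptions:

    import numpy as np

    image = np.arange(36, dtype=float).reshape(6, 6)   # toy 6x6 "image"
    kernel = np.array([[1., 0., -1.],
                       [1., 0., -1.],
                       [1., 0., -1.]])                 # one 3x3 filter (shared weights)

    # Convolution: slide the same filter across the input; multiply and sum each section
    fh, fw = kernel.shape
    feature_map = np.array([[np.sum(image[i:i+fh, j:j+fw] * kernel)
                             for j in range(image.shape[1] - fw + 1)]
                            for i in range(image.shape[0] - fh + 1)])  # 4x4 feature map

    # Max pooling: represent each 2x2 section of the feature map by its maximum
    pooled = feature_map.reshape(2, 2, 2, 2).max(axis=(1, 3))          # 2x2 result
    print(feature_map.shape, pooled.shape)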

Natural Language Processing

  • Text classification (ex: spam/not-spam)
  • Sentiment analysis (ex: determine positive/negative sentiment of a consumer towards a product)
  • Search (ex: search for the answer given a question)
  • Machine translation (ex: language translation)
  • Text generation (ex: generate an automated response to an email)

Text Representation

  • Techniques for fitting text into models - converting text to numerical values for use as inputs to a model:
  • Vocabulary - bag of words; each word is assigned a position, and we count how many times each word appears in the text (see the sketch after this list)
  • Embedding - a word or a document is assigned a numerical value aimed to capture the meaning of the word/text
    • Word2Vec, GloVe
  • Attention - used in Transformer models. Transformers use:
    • Word embedding
    • Positional encoding - determining where in a sentence a word is placed
    • Attention - measure of how strongly words within a sentence are related, regardless of their position
      • Ex: "The man ate food, because he was hungry" - attention ties "he" back to "the man" regardless of the distance between them
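
A minimal bag-of-words sketch in plain Python; the two toy sentences are made up for illustration:

    from collections import Counter

    docs = ["the man ate food", "the man was hungry"]   # toy corpus (assumed)
    vocab = sorted({w for d in docs for w in d.split()})

    # Each document becomes a vector of word counts over the shared vocabulary
    vectors = [[Counter(d.split())[w] for w in vocab] for d in docs]
    print(vocab)    # ['ate', 'food', 'hungry', 'man', 'the', 'was']
    print(vectors)  # [[1, 1, 0, 1, 1, 0], [0, 0, 1, 1, 1, 1]]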

Wednesday, November 8, 2023

Deep Learning

Deep learning

  • "A subfield of machine learning that focuses on artificial neural networks and deep neural networks."
  • Neural Network - "computational model that is inspired by the structure and functioning of the human brain" (ChatGTP definition)
  • Neural network with many layer - Deep neural network
  • Pre-history:
    • 1943, Warren McCulloch and Walter Pitts - publish a model for how neurons work together to perform computations
    • The 2000s saw a boom in deep learning as more computing power and data became available
  • Deep learning shines when
    • A great amount of data is available
    • A large number of features (ex: unstructured data)
    • Complex relationships b/w the features and the target
    • Explainability is not highly required (black box!)

Artificial Neurons

  • Perceptron
    • A simple model - a set of inputs is multiplied by a set of weights
    • The result is passed through a threshold function
    • If the result is >= the threshold, the neuron is activated
    • The output is a 0/1 prediction; no probability value for each class is calculated
    • Used for binary classification tasks; in reality - it is a simple linear model
  • Logistic Regression
    • Same as above, but the result of the weights times the inputs is passed through an activation (sigmoid) function prior to going into the threshold function
    • The model output is a probability from 0 to 1; it can then be converted to a 0 or 1 prediction
    • If probability is >=0.5, prediction is 1, otherwise 0
    • Goal - find the weight values that minimize the cost (error)
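
A side-by-side sketch of the two neurons above, assuming NumPy; the input, weight, and bias values are made up, while the 0.5 cutoff follows the notes:

    import numpy as np

    def perceptron(x, w, b):
        # Weighted sum straight into a threshold: hard 0/1 output, no probability
        return int(x @ w + b >= 0)

    def logistic_neuron(x, w, b):
        # Weighted sum -> sigmoid activation -> probability -> threshold at 0.5
        p = 1 / (1 + np.exp(-(x @ w + b)))
        return p, int(p >= 0.5)

    x = np.array([2.0, -1.0])   # made-up inputs
    w = np.array([0.8, 0.4])    # made-up weights
    b = -0.5
    print(perceptron(x, w, b))       # 1
    print(logistic_neuron(x, w, b))  # (~0.67, 1)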

Training a Neuron

  • Forward Propagation - first, the input is propagated through the model using the current set of weights and a prediction is calculated
  • Calculate the cost (error) and the gradient of that cost with respect to each weight; go through each weight and update its value using gradient descent
  • Eventually the weight values that result in minimal cost are found - these weights will be used in the final model
  • Need to focus on the learning rate - how big of a step is taken in each gradient descent update
    • If too small - the algorithm will be too slow
    • If too large - the updates will jump all over the place and may never settle
  • Gradient descent methods
    • Stochastic (SGD)
      • Iterate through observations one at a time; calculate the gradient and update the weights for each
      • Pros - works well for large data sets, online learning
      • Cons - cannot use vectorized algebra operations
    • Batch
      • Use the entire data set in each update; calculate the gradient and update the weights based on all observations in each iteration
      • Pros - can use vectorized matrix operations; use those efficiently
      • Cons - may be impossible to achieve on large data sets due to the compute power required
    • Mini-batch
      • Divide the data set into smaller batches; perform batch gradient descent on each
      • Pros - works well for large data sets, uses vectorized operations
      • Cons - not as good as SGD for online learning (when observations come in one at a time)
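
A hedged mini-batch gradient descent sketch for a logistic-regression-style neuron, assuming NumPy; the batch size, learning rate, and synthetic data are assumptions:

    import numpy as np

    rng = np.random.default_rng(1)
    X = rng.normal(size=(1000, 3))                          # synthetic features
    y = (X @ np.array([1.5, -2.0, 0.5]) > 0).astype(float)  # synthetic targets
    w, lr, batch = np.zeros(3), 0.1, 32                     # assumed hyperparameters

    for epoch in range(10):
        idx = rng.permutation(len(X))             # shuffle each epoch
        for start in range(0, len(X), batch):
            b = idx[start:start + batch]          # one mini-batch
            p = 1 / (1 + np.exp(-(X[b] @ w)))     # vectorized predictions for the batch
            grad = X[b].T @ (p - y[b]) / len(b)   # gradient of the log-loss
            w -= lr * grad                        # gradient descent step
    print(w)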

Tuesday, November 7, 2023

Supervised Learning Algorithms - Tree and Ensemble

Decision Trees

  • Data analysis - ask a series of questions to narrow down on the label (Color? Brown. Has wings? Yes. Answer - bird)
  • Splits - the goal is to:
    • Minimize the number of splits, thus making the tree maximally efficient
    • Maximize Information Gain (IG) at each split (i.e. ask the right questions)
      • Impurity - how heavily data from multiple classes is mixed together
      • IG = decrease in impurity = Impurity(parent) - weighted sum of Impurity(children) (see the sketch after this list)
  • Tree depth - hyperparameter; max number of splits that can occur
    • Each split divides the data set in two; a leaf's prediction follows the majority class present in that leaf
    • Shallow tree - can underfit the data
    • Deep tree - can overfit (i.e. every example ends up in a separate leaf, introducing a lot of noise)
  • Prediction - trace through the tree until a leaf is reached, then use the mean target value of the training points at that leaf as the prediction
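
A small worked example of impurity and information gain for one candidate split, assuming NumPy; the class labels are made up:

    import numpy as np

    def gini(labels):
        # Gini impurity: 1 - sum of squared class proportions
        _, counts = np.unique(labels, return_counts=True)
        p = counts / counts.sum()
        return 1 - np.sum(p ** 2)

    parent = np.array([0, 0, 0, 0, 1, 1, 1, 1])   # evenly mixed -> impurity 0.5
    left   = np.array([0, 0, 0, 1])               # mostly class 0
    right  = np.array([0, 1, 1, 1])               # mostly class 1

    # IG = parent impurity minus the size-weighted impurity of the children
    weighted_children = (len(left) * gini(left) + len(right) * gini(right)) / len(parent)
    print(gini(parent) - weighted_children)       # information gain of this split: 0.125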

Regression trees

  • Instead of going by the majority, predict by the mean target value of the sample (i.e. calculate the mean of the leaf constituents, make the prediction based on that mean)
  • Benefits:
    • Very interpretable
    • Train quickly
    • Can handle non-linear relationships
    • No need for scaling
  • Challenges:
    • Sensitive to depth
    • Prone to overfitting

Ensemble Models

  • Combine multiple models into a meta-model for better performance
  • Averaging multiple models - less likely to overfit if the models are independent
    • Feed full dataset (or a slice) into multiple models
    • Use Aggregation function (majority, average, weighted averaged, etc.) to determine the overall prediction based on the individual predictions of each model
  • Challenges:
    • Time / compute resources to train
    • Cost of running multiple models
    • Decrease in interpretability

Bagging

  • Bootstrap aggregation - sampling with replacement
  • Pull out a sample from a set for use in a model; replace the pulled out sample
  • Can end up pulling the same sample multiple times
  • Because each model is trained using a different data set, the results can be considered independent
  • The average of the predictions is then taken
    • Build a series of trees, then average the outputs
    • Reduces overfitting
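
A minimal sketch of bootstrap sampling, assuming NumPy; the toy dataset and the choice of three samples are illustrative:

    import numpy as np

    rng = np.random.default_rng(2)
    data = np.arange(10)   # toy dataset of 10 observations

    # Sampling with replacement: the same observation can be pulled multiple times,
    # and each member model trains on its own bootstrap sample
    samples = [rng.choice(data, size=len(data), replace=True) for _ in range(3)]
    for s in samples:
        print(sorted(s))   # some values repeated, some missing entirely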

Random Forest

  • Use bagging to train multiple trees
  • Use multiple trees and take the majority vote
  • Each member model may use a different subset of both features and observations to train
  • Decisions driving this:
    • Number of trees
    • Sampling strategy for bagging
      • Bagging samples size as % of total
      • Max number of features represented in the bagging sample
    • Depth of trees
      • Max depth
      • Min samples per leaf
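
A hedged scikit-learn sketch wiring the design decisions above to hyperparameters; the specific values and the synthetic data are illustrative assumptions:

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier

    X, y = make_classification(n_samples=500, n_features=10, random_state=0)
    model = RandomForestClassifier(
        n_estimators=100,      # number of trees
        max_samples=0.8,       # bagging sample size as % of total
        max_features="sqrt",   # max features considered per split
        max_depth=8,           # depth of trees: max depth
        min_samples_leaf=5,    # depth of trees: min samples per leaf
        random_state=0,
    ).fit(X, y)
    print(model.score(X, y))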

Clustering

  • Unsupervised learning technique
  • Organize data points into logical groups
  • Sort similar data points into the same cluster - and dissimilar data points into other clusters
    • Ex: dividing potential customers into groups and targeting each with specific marketing
    • Key decision - determining the basis for similarity in data

K-Means Clustering

  • Sort data based on distance from the nearest cluster center
    • Ex: data points near each other go into one cluster; far from each other - into another cluster
  • Goal - minimize the sum of distances from each data point to its cluster center (see the sketch after this list)
    1. Select cluster centers (at random)
    2. Assign each data point to the closest center
    3. Move each center to the mean of the data points assigned to it (re-center)
    4. Repeat steps 2 and 3 until the centers are no longer moving
  • Strengths:
    • Easy to implement
    • Quick to run
    • Good starting point for working with clustering
  • Weaknesses:
    • Requires identifying how many clusters will be used ahead of time
    • Does not work well for geometrically complex data
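
A compact NumPy sketch of the four steps above; k and the synthetic two-blob data are assumptions:

    import numpy as np

    rng = np.random.default_rng(3)
    points = np.vstack([rng.normal(0, 1, (50, 2)),    # blob around (0, 0)
                        rng.normal(5, 1, (50, 2))])   # blob around (5, 5)
    k = 2

    # 1. Select cluster centers at random from the data points
    centers = points[rng.choice(len(points), k, replace=False)]
    for _ in range(100):
        # 2. Assign each point to its closest center
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        # 3. Re-center: move each center to the mean of its assigned points
        new_centers = np.array([points[assign == j].mean(axis=0) for j in range(k)])
        # 4. Stop once the centers are no longer moving
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    print(centers)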

Monday, November 6, 2023

Linear Regression and Regularization

Types of Algorithms

  • Parametric algorithm (Linear model)
    • Based on a mathematical model that defines the relationship between inputs and outputs
    • Fixed and pre-determined set of parameters and coefficients
    • Can learn quickly
    • Can be too simple for real world - prone to underfitting
  • Non-Parametric algorithm
    • Does not make assumptions about relationship b/w input and output ahead of building a model
    • Flexible and adapts well to non-linear data relationships
    • Often performs better than linear models on complex data
    • Requires more data to train; prone to overfitting

Linear Regression

  • Linear relationship b/w features and targets (the inputs and the output that we are looking to predict), defined by a set of coefficients
  • Simple, easy to interpret and see the relationship between inputs and outputs
  • Yet it is the basis for many more complex models
  • Example: Home Price as a function of # of bedrooms
  • Y = W0 + W1*X; W0 is the bias; W1 is the coefficient/weight
  • Multiple Linear regression
    • y = W0 + W1*X1 + W2*X2 +…+Wp*Xp
    • Additional feature/weight pairs - ex: house size, location, etc.
  • Total Error of a Model
    • SSE - Cost function
      • Sum of Squared Errors - sum of (prediction minus actual) squared
    • The goal is to minimize it
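
A small sketch fitting W0 and W1 by minimizing SSE with least squares, assuming NumPy; the toy bedrooms/price data is made up:

    import numpy as np

    bedrooms = np.array([1, 2, 3, 4, 5], dtype=float)
    price = np.array([150, 200, 260, 310, 350])   # in $1000s (made up)

    X = np.column_stack([np.ones_like(bedrooms), bedrooms])  # bias column + feature
    w, *_ = np.linalg.lstsq(X, price, rcond=None)            # weights minimizing SSE
    w0, w1 = w                                               # bias, coefficient
    pred = X @ w
    print(w0, w1, np.sum((pred - price) ** 2))               # SSE = sum of (prediction - actual)^2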

Polynomial Regression

  • Non-linear relationships can be modeled as well
  • Ex: use a non-linear function to create a feature, such as x² or log(x); then use it as an input
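
A minimal sketch of creating non-linear features before fitting a linear model, assuming NumPy; the x² and log(x) transforms follow the example above:

    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0])
    X_poly = np.column_stack([x, x ** 2, np.log(x)])  # derived non-linear features
    # X_poly can now be fed into ordinary linear regression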

Regularization

  • When a complex model does not predict well on new data
    • Add a penalty factor to the cost function to penalize feature complexity
    • This would reduce the complexity in the model and decrease the probability of overfitting thus helping the model performance
    • I.e. there is a coefficient per feature - the more features there are, the more coefficients are added to the penalty factor, which weighs on the overall regression cost
    • Use another variable, lambda, to multiply the sum of the coefficients - this controls the severity of the penalty
    • Penalty factor
      • LASSO regression
        • Penalty factor is the sum of the absolute value of the coefficients, multiplied by lambda
        • Reduces the coefficients of irrelevant features to 0
        • Suitable for a simpler and more interpretable model
      • Ridge regression
        • Penalty factor is the sum of the coefficients squared, multiplied by lambda
        • Reduces them close to 0 but does not completely remove them
        • Suitable when there is complex relationship of target to many features with collinearity/correlation
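
A hedged scikit-learn comparison of the two penalties; lambda is called alpha in scikit-learn, and its value plus the synthetic data are assumptions:

    import numpy as np
    from sklearn.linear_model import Lasso, Ridge

    rng = np.random.default_rng(4)
    X = rng.normal(size=(100, 5))
    y = 3 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(0, 0.1, 100)  # only 2 relevant features

    lasso = Lasso(alpha=0.1).fit(X, y)  # |w| penalty: irrelevant coefficients -> exactly 0
    ridge = Ridge(alpha=0.1).fit(X, y)  # w^2 penalty: coefficients shrink close to 0
    print(lasso.coef_)  # expect zeros on the three irrelevant features
    print(ridge.coef_)  # expect small but non-zero values instead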

Logistic Regression

  • Mainly used for classification tasks
  • Linear regression is not the best solution when an outcome of 0 or 1 is required
  • Linear regression produces an actual prediction value, not 1 or 0
  • Solution option - predict probability of p(y=1)
  • Use the logistic/sigmoid function - the input for this would be the output of the linear model
  • The output of the sigmoid would be between 0 and 1; this can then be interpreted as the probability of either 0 or 1
  • Gradient descent - "an iterative first-order optimization algorithm, used to find a local minimum/maximum of a given function."
  • Use gradient descent to determine the values of the weights that minimize the cost
    • Update the values of the weights by a small amount to move closer to the weights which minimize our loss/cost function
    • Need - a variable used in every step called the learning rate; the gradient of the cost/loss function itself; the weight values from the previous iteration
  • To visualize in a graph - iterate against the gradient slowly until reaching the bottom (i.e. the minimum cost point)
  • What if there are more than two classes - predicting multiple classes?
    • Use the softmax function instead of sigmoid - it provides the probability of belonging to each class, normalized to sum to 1
    • Then take the class with the highest probability value as our prediction
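
A minimal sigmoid vs softmax sketch, assuming NumPy; the score values are made up:

    import numpy as np

    def sigmoid(z):
        return 1 / (1 + np.exp(-z))

    def softmax(z):
        e = np.exp(z - z.max())   # subtract the max for numerical stability
        return e / e.sum()

    print(sigmoid(0.7))                 # binary: one probability, threshold at 0.5
    scores = np.array([2.0, 1.0, 0.1])  # linear-model outputs for 3 classes (made up)
    p = softmax(scores)
    print(p, p.sum(), p.argmax())       # per-class probabilities summing to 1; take the max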


Sunday, November 5, 2023

Testing and Evaluation of a Model

Testing a Model

  • Divide data into
    • Training Set - build / train the model
    • Test Set - model performance evaluation set; data different from Training set
  • Need to be careful with data leakage - test data getting into the model building process
  • This causes the performance estimate to be overly optimistic vs. when using new data
    • Further split Training set into:
      • Training Set
      • Validation Set - used for improving the model and monitoring performance while tuning, without touching the Test set
    • And only then test using the Test set - as an unbiased exercise on a separate set of data
  • K-Folds Cross Validation
    • Another test strategy
    • Split the data into subsets ("folds")
    • Run multiple iterations of the model, each time holding out a different "fold" for validation
    • Calculate the error as the average error across the K validation runs
    • This approach is industry standard as it:
      • Maximizes the use of training data
      • Provides better insight into performance
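
A hedged scikit-learn sketch of K-fold cross-validation; K=5, the model choice, and the synthetic data are assumptions:

    from sklearn.datasets import make_regression
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import cross_val_score

    X, y = make_regression(n_samples=200, n_features=5, noise=10, random_state=0)

    # Each of the 5 runs holds out a different fold; the errors are then averaged
    scores = cross_val_score(LinearRegression(), X, y,
                             cv=5, scoring="neg_mean_squared_error")
    print(-scores.mean())  # average MSE across the K validation folds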

Model Evaluation

  • Metrics used
    • Outcome - business impact, usually $
    • Output - model output; usually remains internal, non-customer
  • Regression Metrics:
    • MSE - Mean Squared Error
      • Most popularly used
      • Influenced by large errors (these are penalized heavily), scale of data
      • This is the metric to use if concerned with minimizing large errors on outlier data points
    • MAE - Mean Absolute Error
      • Influenced by scale
      • Can be easier to interpret
      • Doesn’t penalize outlier results as heavily as MSE
    • MAPE - Mean Absolute Percent Error
      • Converts an error to %
      • More easily understood by non-technical clientele
      • An error could be small - but when compared to the value of the target, the % could be large
    • Coefficient of Determination (R-squared)
      • Definition - the proportion of the variation in the dependent variable that is predictable from the independent variable(s)
      • Displays how much of the variability in the target variable is explained by the model
      • Total squared deviation from the mean (SST) = sum of squares due to regression (SSR) + sum of squared errors (SSE)
      • R-squared = SSR/SST = 1 - SSE/SST
      • Usually b/w 0 and 1; the closer it is to 1, the better the model explains variability in the target variable
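
A small sketch computing the regression metrics above, assuming NumPy; the toy actual/predicted values are made up:

    import numpy as np

    actual = np.array([100., 150., 200., 250.])
    pred   = np.array([110., 140., 210., 230.])

    mse  = np.mean((pred - actual) ** 2)                    # penalizes large errors heavily
    mae  = np.mean(np.abs(pred - actual))                   # easier to interpret
    mape = np.mean(np.abs((pred - actual) / actual)) * 100  # error as a %
    sse  = np.sum((actual - pred) ** 2)
    sst  = np.sum((actual - actual.mean()) ** 2)
    r2   = 1 - sse / sst                                    # R-squared = 1 - SSE/SST
    print(mse, mae, mape, r2)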

Accuracy

  • When aiming at better accuracy, be careful with class imbalance in the input data (ex: how many days a year is a healthy person sick?)
  • Confusion Matrix - True/False Positives vs True/False Negatives
    • False Positive Rate (FPR) = False Positives / (False Positives + True Negatives)
    • Precision = True Positives / (True Positives + False Positives); Recall, a.k.a. True Positive Rate (TPR) = True Positives / (True Positives + False Negatives)
  • Receiver Operating Characteristic (ROC) Curves
    • FPR and TPR across different threshold values for the given model
    • I.e. set a threshold - a prediction below the threshold is a Negative, above - Positive
    • Typical threshold is 0.5
    • Use the Positives/Negatives to generate TPR/FPR by comparing these to the actual targets
    • Re-run this across different thresholds and plot TPR (y-axis) vs FPR (x-axis) on a graph - the ROC Curve
  • Areas Under ROC (AUROC)
    • Area under ROC Curve
    • Higher AUROC - better quality model
  • Precision Recall Curve (PR)
    • Measures Precision vs Recall values across multiple thresholds
    • Used in situations of high class imbalance (many 0's, few 1's)
    • Does not factor in True Negatives
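
A hedged scikit-learn sketch of the ROC/AUROC computation; the toy labels and model scores are made up:

    import numpy as np
    from sklearn.metrics import roc_auc_score, roc_curve

    y_true  = np.array([0, 0, 1, 1, 0, 1, 1, 0])
    y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.6, 0.55])  # predicted probabilities

    # FPR and TPR are computed across many thresholds, then plotted as the ROC curve
    fpr, tpr, thresholds = roc_curve(y_true, y_score)
    print(fpr, tpr)
    print(roc_auc_score(y_true, y_score))  # area under the ROC curve (AUROC)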

Errors

  • Common error causes:
    • Improper problem framing and metric selection
    • Data quality
    • Feature selection
    • Model fit
    • Inherent error

Thursday, November 2, 2023

Building a Machine Learning Model

Building a Model

  • Creating a model is an iterative process:
    • Gather the data needed
    • Selection of features - collect past observations (houses for sale) and targets (sale prices)
    • Choice of algorithm - template for relationship between the input and the output
    • Selection of values for hyperparameters (tuning the dials of the model to make sure it functions)
    • Selection of Loss (cost) function
    • Train the model using past data collected
    • Evaluate the performance

Feature Selection

  • Features - characteristics of the input data
  • Important to define features - an interplay of factors:
    • Might influence the solution (year the house was built, city, neighborhood)
    • The data that we might be able to collect
  • Methods of selection:
    • Speak w experts - domain expertise
    • Collect data - Visualization
    • Past data collection - statistical correlation
    • Collect as much as you can, but narrow down the data - Modeling

Algorithm Selection

  • Dependent on the task that needs to be solved
  • Best approach - employ different algorithms and train all
  • Criteria for algorithm selection:
    • Performance Accuracy - expected performance of the model
    • Interpretability - how easy/hard it is to interpret and predict the result of the model computation
    • Computational Efficiency - the compute power required in relation to the results generated

Model Complexity

Depends on:
  • Number of Features
  • Algorithm - linear regression vs neural networks
  • Hyperparameter Values

Bias-Variance Trade off

  • Bias - modeling a complex problem using a simple model; the model is incapable of fully capturing the depth and underlying patterns in the data.
  • Variance - sensitivity of the model to small fluctuations in the data; ex: interpreting noise as actual patterns
Typically
  • Simpler model = higher bias; lower variance
  • Complex model = lower bias; higher variance
  • Total Error = Bias² + Variance + Inherent Error (noise)
  • Need to be careful about underfitting / overfitting the model

Wednesday, November 1, 2023

What is Machine Learning?

What is Machine Learning?

“Field of study that gives computers the ability to learn without being explicitly programmed” – Arthur Samuel, IBM, 1959

  • Machine learning - set of methods & tools which help realize the goal of the field of artificial intelligence
  • Deep learning, or the use of neural networks containing many layers - a subfield of machine learning
  • Computer vision, natural language processing, recommendation systems etc. - sub-fields of AI which rely on machine learning methods

Terminology
  • “Data are characteristics or information, usually numerical, that are collected through observation.” - OECD Glossary of Statistical Terms, https://www.oecd-ilibrary.org/economics/oecd-glossary-of-statistical-terms_9789264055087-en
  • Observations - ex: House
  • Features - ex: Neighborhood, School district, Square footage, Number of bedrooms, Year built
  • Targets - ex: Market sale price
  • Model - an approximation of the relationship between two variables: 
    • Input Var -> Model -> Output Var
    • A model needs:
      • Features - inputs
      • Algorithm - formula
      • Hyperparameter - formula tweaks
      • Loss function - to use to optimize the model

Wednesday, July 19, 2023

Snowflake on AWS - Quick Summary

Snowflake - an SQL-compatible database:

  • Originally developed on AWS, 2012-2014
  • Added other cloud providers
  • AWS's customer AND partner
  • Designed to be a cloud data warehouse - not a data lake
  • Not aimed at analytics use - doesn't have pipeline features

Consists of:

  • Database Layer - lives on S3 / Azure Blob storage
  • Query Processing Layer - a virtual warehouse; doesn’t mean a physical warehouse, just compute
  • Cloud Services Layer - its own metadata store


Key features:

  • Multi-tenant - all customers live in the Snowflake VPC
  • Not available in EVERY region for EVERY cloud provider
  • Data resides in .fdn format - proprietary, columnar, micro-partitions
  • Any data loaded gets converted into this format; it gets split into columnar micro-partitions
  • This is done for query optimization purposes - to scan only related portions and get results faster
  • Multiple Editions: Standard, Enterprise, Business Critical, Virtual Private Snowflake
  • Higher-level editions include all features of the lower ones
  • Pay As You go on demand - credits; there are discounts for pay-upfront
  • Storage - temp, transient, and permanent tables all incur cost
  • Credits are T-shirt sized according to the size of the virtual warehouse - M, XL, etc.