Deep learning
- "A subfield of machine learning that focuses on artificial neural networks and deep neural networks."
- Neural Network - "computational model that is inspired by the structure and functioning of the human brain" (ChatGPT definition)
- A neural network with many layers is a deep neural network
- Pre-history:
- 1943, Warren McCulloch and Walter Pitts - published a model for how neurons work together to perform computations
- The 2000s saw a boom in deep learning as more computation power and data became available
- Deep learning shines when
- A large amount of data is available
- There is a large number of features (e.g., unstructured data)
- The relationships between the features and the target are complex
- Explainability is not a hard requirement (neural networks are black boxes!)
Artificial Neurons
- Perceptron
- A simple model - a set of inputs is multiplied by a set of weights and the products are summed
- The result is passed through a threshold function
- If it is >= the threshold, the neuron is activated
- The output is a 0/1 prediction; no class probabilities are calculated
- Used for binary classification tasks; in reality, it is a simple linear model (see the sketch below)
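As an illustration, here is a minimal NumPy sketch of a perceptron (not from the original notes; the weights and bias are hypothetical, chosen so the neuron computes a logical AND):

```python
import numpy as np

def perceptron_predict(x, weights, bias, threshold=0.0):
    """Weighted sum of inputs passed through a step (threshold) function."""
    z = np.dot(weights, x) + bias
    return 1 if z >= threshold else 0

# Illustrative weights: with these values the neuron behaves like logical AND
weights = np.array([1.0, 1.0])
bias = -1.5
print(perceptron_predict(np.array([1, 1]), weights, bias))  # 1
print(perceptron_predict(np.array([1, 0]), weights, bias))  # 0
```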
- Logistic Regression
- Same as above, but the weighted sum of the inputs is passed through an activation (sigmoid) function before going into the threshold function
- The model output is a probability from 0 to 1; it is then converted to a 0 or 1 prediction
- If the probability is >= 0.5, the prediction is 1, otherwise 0
- Goal - find the weight values that minimize the cost (error)
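A minimal sketch of the logistic neuron described above, assuming NumPy; `logistic_predict` is an illustrative name, not a library function:

```python
import numpy as np

def sigmoid(z):
    """Squash a weighted sum into a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def logistic_predict(x, weights, bias):
    """Weighted sum -> sigmoid activation -> threshold at 0.5."""
    p = sigmoid(np.dot(weights, x) + bias)  # probability of class 1
    return p, int(p >= 0.5)                 # probability and 0/1 prediction
```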
Training a Neuron
- Forward Propagation - first the inputs are propagated through the model using the current set of weights, and a prediction is calculated
- Calculate the cost (error) and the gradient of that cost with respect to each weight; go through each weight and update its value using gradient descent
- Eventually the weight values that result in minimal cost are found - these weights are used in the final model
- Pay attention to the learning rate - how big a step is taken at each gradient descent update
- If too small - convergence will be too slow
- If too large - the updates may overshoot the minimum and bounce around instead of converging
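Putting the steps together, a sketch of one possible training loop for a logistic neuron; it assumes a cross-entropy cost (not stated in the notes), and `train_neuron` is a hypothetical name:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_neuron(X, y, lr=0.1, epochs=1000):
    """Gradient descent on an (assumed) cross-entropy cost."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        p = sigmoid(X @ w + b)       # forward propagation: predictions
        error = p - y                # gradient of the cost wrt the weighted sum
        w -= lr * (X.T @ error) / n  # step each weight against its gradient
        b -= lr * error.mean()       # same update for the bias
    return w, b
```

The learning rate `lr` controls the step size discussed above; shrinking it slows convergence, while growing it risks overshooting.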
- Gradient descent methods
- Stochastic (SGD)
- Iterate through observations one at a time; calculate the gradient and update the weights after each one
- Pros - works well for large data sets and online learning
- Cons - cannot take advantage of vectorized linear algebra operations
- Batch
- Use the entire data set in each update; calculate the gradient and update the weights based on all observations in each iteration
- Pros - can use vectorized matrix operations efficiently
- Cons - may be infeasible on large data sets due to the compute power and memory required
- Mini-batch
- Divide the data set into smaller batches; perform batch gradient descent on each (see the sketch after this list)
- Pros - works well for large data sets, uses vectorized operations
- Cons - not as good as SGD for online learning (when observations come in one at a time)
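A sketch of mini-batch gradient descent under the same logistic-neuron assumptions as above; note that `batch_size=1` recovers SGD and `batch_size=len(X)` recovers batch gradient descent:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def minibatch_gd(X, y, lr=0.1, batch_size=32, epochs=100, seed=0):
    """Mini-batch gradient descent for a logistic neuron (illustrative)."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    rng = np.random.default_rng(seed)
    for _ in range(epochs):
        order = rng.permutation(n)                 # shuffle each epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            Xb, yb = X[idx], y[idx]
            p = sigmoid(Xb @ w + b)                # vectorized forward pass
            err = p - yb
            w -= lr * (Xb.T @ err) / len(idx)      # vectorized weight update
            b -= lr * err.mean()
    return w, b
```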