Deep learning
- "A subfield of machine learning that focuses on artificial neural networks and deep neural networks."
- Neural Network - "computational model that is inspired by the structure and functioning of the human brain" (ChatGPT definition)
- A neural network with many layers is a deep neural network
- Pre-history:
  - 1943, Warren McCulloch and Walter Pitts publish a model for how neurons work together to perform computations
  - The 2000s saw a boom in deep learning as more computation power and data became available
- Deep learning shines when:
  - A great amount of data is available
  - There is a large number of features (ex: unstructured data)
  - There are complex relationships between the features and the target
  - Explainability is not highly required (black box!)
 
Artificial Neurons
- Perceptron (see the sketch after this list)
  - A simple model - a set of inputs is multiplied by a set of weights
  - The result is passed through a threshold function
  - If it is >= the threshold, the neuron is activated
  - The output is a 0/1 prediction; no probability values are calculated for the classes
  - Used for binary classification tasks; in reality it is a simple linear model
- Logistic Regression
  - Same as above, but the weighted sum of the inputs is passed through an Activation (sigmoid) Function before going into the Threshold Function
  - The model output is a probability from 0 to 1; it is then converted to a 0 or 1 prediction
  - If the probability is >= 0.5, the prediction is 1, otherwise 0
  - Goal - find the weight values that minimize the cost (error)
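A minimal NumPy sketch of the two models above, assuming hand-picked example weights `w` and bias `b` (in practice these would be learned): the perceptron applies a hard threshold directly, while logistic regression passes the weighted sum through a sigmoid first and thresholds the resulting probability at 0.5.

```python
import numpy as np

def perceptron_predict(x, w, b, threshold=0.0):
    """Weighted sum passed straight through a hard threshold: output is 0 or 1."""
    z = np.dot(w, x) + b
    return 1 if z >= threshold else 0

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_predict(x, w, b):
    """Weighted sum passed through a sigmoid first: output is a probability."""
    p = sigmoid(np.dot(w, x) + b)      # probability between 0 and 1
    return p, (1 if p >= 0.5 else 0)   # thresholded at 0.5 for the 0/1 class

x = np.array([2.0, -1.0])   # made-up input features
w = np.array([0.5, 0.25])   # made-up weights
b = 0.1                     # made-up bias

print(perceptron_predict(x, w, b))  # hard 0/1 prediction
print(logistic_predict(x, w, b))    # (probability, 0/1 prediction)
```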
 
Training a Neuron
- Forward Propagation - first a set of weights is propagated through the model and a prediction is calculated
- Calculate the cost (error) and the gradient of that cost with respect to each weight; go through each weight and update its value using gradient descent (see the first sketch at the end of this section)
- Eventually the weight values that result in minimal cost are found - these weights are used in the final model
- Need to focus on the learning rate - how big a step is taken in each gradient descent update
  - If too small - the algorithm will be too slow
  - If too large - the updates may overshoot the minimum and jump all over the place
- Gradient descent methods (compared in the second sketch at the end of this section)
  - Stochastic (SGD)
    - Iterate through observations one at a time; calculate the gradient and update the weights after each observation
    - Pros - works well for large data sets and online learning
    - Cons - cannot use vectorized linear algebra operations
  - Batch
    - Use the entire data set in each update; calculate the gradient and update the weights based on all observations in each iteration
    - Pros - can use vectorized matrix operations, and use them efficiently
    - Cons - may be impossible on large data sets due to the compute power required
  - Mini-batch
    - Divide the data set into smaller batches; perform batch gradient descent on each
    - Pros - works well for large data sets, uses vectorized operations
    - Cons - not as good as SGD for online learning (when observations come in one at a time)
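As a concrete illustration of the training loop described above, here is a minimal NumPy sketch of training a single logistic neuron with batch gradient descent. The data set, learning rate, and epoch count are made-up assumptions for the sketch.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))              # 100 made-up observations, 2 features
y = (X[:, 0] + X[:, 1] > 0).astype(float)  # a linearly separable made-up target

w = np.zeros(2)        # weights, initialized to zero
b = 0.0                # bias term
learning_rate = 0.1    # how big a step each update takes

for epoch in range(200):
    # Forward propagation: the current weights produce predictions.
    p = sigmoid(X @ w + b)
    # Cost: average log loss (cross-entropy) over the whole data set.
    cost = -np.mean(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12))
    # Gradient of the cost with respect to each weight and the bias.
    grad_w = X.T @ (p - y) / len(y)
    grad_b = np.mean(p - y)
    # Gradient descent step: move each weight against its gradient.
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

print(w, b, cost)  # weights that (approximately) minimize the cost
```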
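The three variants in the list above differ only in how many observations feed each update. A sketch, assuming the same kind of made-up data: `batch_size=1` gives stochastic gradient descent, `batch_size=len(X)` gives batch gradient descent, and anything in between gives mini-batch.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train(X, y, batch_size, learning_rate=0.1, epochs=50, seed=0):
    """Logistic-neuron training where batch_size selects the descent variant."""
    rng = np.random.default_rng(seed)
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        order = rng.permutation(len(X))            # shuffle once per epoch
        for start in range(0, len(X), batch_size):
            idx = order[start:start + batch_size]  # the next batch of rows
            p = sigmoid(X[idx] @ w + b)            # vectorized over the batch
            w -= learning_rate * X[idx].T @ (p - y[idx]) / len(idx)
            b -= learning_rate * np.mean(p - y[idx])
    return w, b

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

print(train(X, y, batch_size=1))       # stochastic: one observation per update
print(train(X, y, batch_size=len(X)))  # batch: the whole data set per update
print(train(X, y, batch_size=16))      # mini-batch: a compromise of the two
```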
 