Deep learning
- "A subfield of machine learning that focuses on artificial neural networks and deep neural networks."
- Neural Network - "computational model that is inspired by the structure and functioning of the human brain" (ChatGPT definition)
- A neural network with many layers is a deep neural network
- Pre-history:
- 1943, Warren McCulloch and Walter Pitts - published a model for how neurons work together to perform computations
- The 2000s saw a boom in deep learning as more computation power and data became available
- Deep learning shines when
- A large amount of data is available
- There is a large number of features (e.g., unstructured data)
- The relationships between the features and the target are complex
- Explainability is not a hard requirement (neural networks are black boxes!)
Artificial Neurons
- Perceptron
- A simple model - a set of inputs is multiplied by a set of weights and the products are summed
- The result is passed through a threshold function
- If it is >= the threshold, the neuron is activated
- The output is a 0/1 prediction; no class probabilities are calculated
- Used for binary classification tasks; in reality, it is a simple linear model (see the sketch below)
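As an illustration, here is a minimal NumPy sketch of a perceptron (not from the original notes; the weights and bias are hypothetical, chosen so the neuron computes a logical AND):

```python
import numpy as np

def perceptron_predict(x, weights, bias, threshold=0.0):
    """Weighted sum of inputs passed through a step (threshold) function."""
    z = np.dot(weights, x) + bias
    return 1 if z >= threshold else 0

# Illustrative weights: with these values the neuron behaves like logical AND
weights = np.array([1.0, 1.0])
bias = -1.5
print(perceptron_predict(np.array([1, 1]), weights, bias))  # 1
print(perceptron_predict(np.array([1, 0]), weights, bias))  # 0
```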
- Logistic Regression
- Same as above, but the weighted sum of the inputs is passed through an activation (sigmoid) function before going into the threshold function
- The model output is a probability from 0 to 1; it is then converted to a 0 or 1 prediction
- If the probability is >= 0.5, the prediction is 1, otherwise 0
- Goal - find the weight values that minimize the cost (error)
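A minimal sketch of the logistic neuron described above, assuming NumPy; `logistic_predict` is an illustrative name, not a library function:

```python
import numpy as np

def sigmoid(z):
    """Squash a weighted sum into a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def logistic_predict(x, weights, bias):
    """Weighted sum -> sigmoid activation -> threshold at 0.5."""
    p = sigmoid(np.dot(weights, x) + bias)  # probability of class 1
    return p, int(p >= 0.5)                 # probability and 0/1 prediction
```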
Training a Neuron
- Forward Propagation - first the inputs are propagated through the model using the current set of weights, and a prediction is calculated
- Calculate the cost (error) and the gradient of that cost with respect to each weight; go through each weight and update its value using gradient descent
- Eventually the weight values that result in minimal cost are found - these weights are used in the final model
- Pay attention to the learning rate - how big a step is taken at each gradient descent update
- If too small - convergence will be too slow
- If too large - the updates may overshoot the minimum and bounce around instead of converging
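Putting the steps together, a sketch of one possible training loop for a logistic neuron; it assumes a cross-entropy cost (not stated in the notes), and `train_neuron` is a hypothetical name:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_neuron(X, y, lr=0.1, epochs=1000):
    """Gradient descent on an (assumed) cross-entropy cost."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        p = sigmoid(X @ w + b)       # forward propagation: predictions
        error = p - y                # gradient of the cost wrt the weighted sum
        w -= lr * (X.T @ error) / n  # step each weight against its gradient
        b -= lr * error.mean()       # same update for the bias
    return w, b
```

The learning rate `lr` controls the step size discussed above; shrinking it slows convergence, while growing it risks overshooting.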
- Gradient descent methods
- Stochastic (SGD)
- Iterate through observations one at a time; calculate the gradient and update the weights after each one
- Pros - works well for large data sets and online learning
- Cons - cannot take advantage of vectorized linear algebra operations
- Batch
- Use the entire data set in each update; calculate the gradient and update the weights based on all observations in each iteration
- Pros - can use vectorized matrix operations efficiently
- Cons - may be infeasible on large data sets due to the compute power and memory required
- Mini-batch
- Divide the data set into smaller batches; perform batch gradient descent on each (see the sketch after this list)
- Pros - works well for large data sets, uses vectorized operations
- Cons - not as good as SGD for online learning (when observations come in one at a time)
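A sketch of mini-batch gradient descent under the same logistic-neuron assumptions as above; note that `batch_size=1` recovers SGD and `batch_size=len(X)` recovers batch gradient descent:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def minibatch_gd(X, y, lr=0.1, batch_size=32, epochs=100, seed=0):
    """Mini-batch gradient descent for a logistic neuron (illustrative)."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    rng = np.random.default_rng(seed)
    for _ in range(epochs):
        order = rng.permutation(n)                 # shuffle each epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            Xb, yb = X[idx], y[idx]
            p = sigmoid(Xb @ w + b)                # vectorized forward pass
            err = p - yb
            w -= lr * (Xb.T @ err) / len(idx)      # vectorized weight update
            b -= lr * err.mean()
    return w, b
```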