Introduction
- In deep learning, we have the concept of loss, which tells us how poorly the model is performing at the current instant.
- Now we need to use this loss to train our network such that it performs better.
- Essentially what we need to do is to take the loss and try to minimize it, because a lower loss means our model is going to perform better.
- The process of minimizing (or maximizing) any mathematical expression is called optimization.
- Optimizers are algorithms or methods used to change the attributes of the neural network, such as the weights and the learning rate, in order to reduce the loss.
- In other words, optimizers solve the optimization problem of minimizing the loss function.
We’ll compare the following types of optimizers:
- Gradient Descent
- Stochastic Gradient Descent (SGD)
- Mini Batch Stochastic Gradient Descent (MB-SGD)
- SGD with momentum
- Nesterov Accelerated Gradient (NAG)
- Adaptive Gradient (AdaGrad)
- AdaDelta
- RMSprop
- Adaptive Moment Estimation (Adam)
1. Gradient Descent
- The most basic and most widely used optimizer.
- Used in backpropagation.
- Depends on the first-order derivative of the loss function.
- Update rule: θ = θ − α⋅∇J(θ) (see the code sketch after the lists below).
Advantages:
- Easy computation.
- Easy to implement.
- Easy to understand.
Disadvantages:
- May get trapped at local minima.
- Weights are changed only after calculating the gradient on the whole dataset. So, if the dataset is too large, it may take a very long time to converge to the minimum.
- Requires large memory to calculate gradient on the whole dataset.
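Here is a minimal, runnable sketch of plain gradient descent in NumPy. The toy least-squares problem and its loss are assumptions for illustration, not part of the discussion above:

```python
import numpy as np

# Toy least-squares problem: J(theta) = 0.5 * mean((X @ theta - y)^2)  (an assumption)
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
true_theta = np.array([2.0, -1.0, 0.5])
y = X @ true_theta + 0.01 * rng.normal(size=1000)

theta = np.zeros(3)
alpha = 0.1                                  # learning rate

for epoch in range(200):
    grad = X.T @ (X @ theta - y) / len(y)    # gradient over the WHOLE dataset
    theta = theta - alpha * grad             # theta = theta - alpha * grad J(theta)

print(theta)                                 # close to true_theta
```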
2. Stochastic Gradient Descent (SGD)
- An attempt to solve the slow convergence of simple gradient descent.
- The loss is calculated and the parameters are updated after every single training observation (1,000 updates for 1,000 rows instead of 1 update in gradient descent).
- θ = θ − α⋅∇J(θ; x(i); y(i)), where {x(i), y(i)} is a single training example (see the code sketch after the lists below).
Advantages:
- Frequent updates of the model parameters, hence it converges in less time.
- Requires less memory, as the gradient is computed on one example at a time rather than over the whole dataset.
- The noise in the updates may help it find new (and possibly better) minima.
Disadvantages:
- High variance in the model parameter updates.
- May overshoot even after reaching the global minimum.
- To get the same convergence as gradient descent, the learning rate needs to be reduced slowly (e.g. with a decay schedule).
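A minimal sketch of SGD on the same kind of toy least-squares problem (again an assumption for illustration): the parameters are updated once per training example.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 0.01 * rng.normal(size=1000)

theta = np.zeros(3)
alpha = 0.01                                 # smaller learning rate than batch GD

for epoch in range(5):
    for i in rng.permutation(len(y)):        # shuffle, then visit every example
        x_i, y_i = X[i], y[i]
        grad_i = x_i * (x_i @ theta - y_i)   # gradient for a single example
        theta = theta - alpha * grad_i       # 1000 updates per pass over the data

print(theta)                                 # noisy, but close to [2, -1, 0.5]
```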
3. Mini Batch Stochastic Gradient Descent (MB-SGD)
- An attempt to solve the slow convergence of simple GD and the high variance of SGD.
- The dataset is divided into batches, and the parameters are updated after every batch.
- θ = θ − α⋅∇J(θ; B(i)), where B(i) is a batch of training examples (see the code sketch after the list below).
Advantages:
- Considered the best of the gradient-descent-based algorithms.
- Updates the model parameters frequently while also having lower variance than SGD.
- Requires a medium amount of memory.
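A minimal sketch of mini-batch SGD on the same toy problem (an assumption for illustration): the data is split into batches B(i) and the parameters are updated once per batch.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 0.01 * rng.normal(size=1000)

theta = np.zeros(3)
alpha, batch_size = 0.05, 32

for epoch in range(20):
    order = rng.permutation(len(y))
    for start in range(0, len(y), batch_size):
        idx = order[start:start + batch_size]
        Xb, yb = X[idx], y[idx]
        grad_b = Xb.T @ (Xb @ theta - yb) / len(yb)  # gradient over one batch B(i)
        theta = theta - alpha * grad_b

print(theta)                                 # close to [2, -1, 0.5]
```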
All types of gradient descent share some challenges:
- Choosing an optimum value of the learning rate. If the learning rate is too small, gradient descent may take ages to converge.
- They use a constant learning rate for all parameters. There may be some parameters we do not want to change at the same rate.
- May get trapped at local minima.
4. SGD with momentum
- Another attempt to reduce the high variance of SGD.
- One new hyperparameter, momentum (‘γ’), is introduced.
- It accelerates convergence in the relevant direction and reduces fluctuation in irrelevant directions.
- V(t) = γV(t−1) + α⋅∇J(θ), and the weights are updated by θ = θ − V(t) (see the code sketch after the lists below).
Advantages:
- Reduces the oscillations and the high variance of the parameter updates.
- Converges faster than gradient descent.
Disadvantages:
- One more hyperparameter is added, which needs to be selected manually and accurately.
- If the momentum is too high, the algorithm may overshoot the minima and keep moving past them.
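A minimal sketch of the momentum update rule V(t) = γV(t−1) + α⋅∇J(θ), θ = θ − V(t). The toy quadratic loss is an assumption for illustration:

```python
import numpy as np

def grad_fn(theta):
    """Gradient of the toy loss J(theta) = ||theta||^2 (an assumption)."""
    return 2.0 * theta

theta = np.array([5.0, -3.0])
V = np.zeros_like(theta)
alpha, gamma = 0.1, 0.9                      # learning rate and momentum

for t in range(200):
    V = gamma * V + alpha * grad_fn(theta)   # accumulate velocity
    theta = theta - V                        # parameter update

print(theta)                                 # approaches the minimum at 0
```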
5. Nesterov Accelerated Gradient
- Invented to solve the issue of overshooting minima (when the momentum is too high) in momentum-based optimization.
- We will be using γV(t−1) to modify the weights, so θ − γV(t−1) approximately tells us the future location of the parameters. It is a look-ahead method.
- In NAG, we calculate the gradient of the cost at this future position rather than at the current one.
- V(t) = γV(t−1) + α⋅∇J(θ − γV(t−1)), and then the parameters are updated using θ = θ − V(t) (see the code sketch after the lists below).
Advantages:
- Less likely to overshoot the minima than plain momentum.
- Slows down as it approaches a minimum.
Disadvantages:
- Still, the hyperparameter needs to be selected manually.
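A minimal sketch of NAG on the same assumed toy loss; the only change from plain momentum is that the gradient is evaluated at the look-ahead point θ − γV(t−1):

```python
import numpy as np

def grad_fn(theta):
    """Gradient of the toy loss J(theta) = ||theta||^2 (an assumption)."""
    return 2.0 * theta

theta = np.array([5.0, -3.0])
V = np.zeros_like(theta)
alpha, gamma = 0.1, 0.9

for t in range(200):
    lookahead = theta - gamma * V            # approximate future position
    V = gamma * V + alpha * grad_fn(lookahead)
    theta = theta - V

print(theta)                                 # approaches the minimum at 0
```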
6. Adaptive Gradient (AdaGrad)
- An attempt to solve the problem of manually tuning the learning rate.
- In AdaGrad, the learning rate ‘η’ is adapted for each parameter and at every time step ‘t’.
- η is modified for a given parameter θ(i) at a given time step based on the previous gradients calculated for that parameter θ(i).
- We store the sum of the squares of the gradients w.r.t. θ(i) up to time step t, while ϵ is a smoothing term that avoids division by zero (see the code sketch after the lists below).
- It makes big updates for infrequently updated parameters and small steps for frequently updated parameters.
Advantages:
- Learning rate changes for each training parameter.
- Don’t need to manually tune the learning rate.
- Able to train on sparse data.
Disadvantages:
- Computationally expensive, as squared gradients need to be accumulated and stored for every parameter.
- The learning rate is always decreasing, which results in slow training and can eventually stall it.
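A minimal sketch of AdaGrad on an assumed toy loss; the per-parameter learning rate is η divided by the root of the accumulated squared gradients:

```python
import numpy as np

def grad_fn(theta):
    """Gradient of the toy loss J(theta) = ||theta||^2 (an assumption)."""
    return 2.0 * theta

theta = np.array([5.0, -3.0])
G = np.zeros_like(theta)                     # running sum of squared gradients
eta, eps = 0.5, 1e-8

for t in range(500):
    g = grad_fn(theta)
    G = G + g ** 2                           # accumulate g^2 per parameter
    theta = theta - (eta / (np.sqrt(G) + eps)) * g   # per-parameter, decaying step

print(theta)                                 # moves toward 0 with a shrinking step
```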
7. AdaDelta
- It is an extension of AdaGrad that removes its decaying learning rate problem.
- Instead of accumulating all previously squared gradients, Adadelta limits the window of accumulated past gradients to some fixed size w.
- An exponentially decaying moving average of the squared gradients is used in place of the sum of all past gradients (see the code sketch after the lists below).
Advantages:
- Now the learning rate does not decay and the training does not stop.
Disadvantages:
- Computationally expensive.
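A minimal sketch of the AdaDelta update on an assumed toy loss: two exponentially decaying averages (of squared gradients and of squared updates) replace both AdaGrad's growing sum and the global learning rate:

```python
import numpy as np

def grad_fn(theta):
    """Gradient of the toy loss J(theta) = ||theta||^2 (an assumption)."""
    return 2.0 * theta

theta = np.array([5.0, -3.0])
Eg2 = np.zeros_like(theta)                   # decaying average of g^2
Edx2 = np.zeros_like(theta)                  # decaying average of update^2
rho, eps = 0.95, 1e-6

for t in range(3000):
    g = grad_fn(theta)
    Eg2 = rho * Eg2 + (1 - rho) * g ** 2
    dx = -(np.sqrt(Edx2 + eps) / np.sqrt(Eg2 + eps)) * g   # no global learning rate
    Edx2 = rho * Edx2 + (1 - rho) * dx ** 2
    theta = theta + dx

print(theta)                                 # parameters after the AdaDelta updates
```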
8. RMSprop
- RMSprop is in fact identical to the first update vector of AdaDelta.
- RMSprop likewise divides the learning rate by an exponentially decaying average of squared gradients. Hinton suggests setting γ to 0.9, while a good default value for the learning rate η is 0.001 (see the code sketch below).
- RMSprop and AdaDelta were both developed independently around the same time, stemming from the need to resolve AdaGrad’s radically diminishing learning rates.
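A minimal sketch of RMSprop on an assumed toy loss, using the defaults mentioned above (γ = 0.9, η = 0.001):

```python
import numpy as np

def grad_fn(theta):
    """Gradient of the toy loss J(theta) = ||theta||^2 (an assumption)."""
    return 2.0 * theta

theta = np.array([5.0, -3.0])
Eg2 = np.zeros_like(theta)                   # decaying average of squared gradients
eta, gamma, eps = 0.001, 0.9, 1e-8

for t in range(20000):
    g = grad_fn(theta)
    Eg2 = gamma * Eg2 + (1 - gamma) * g ** 2
    theta = theta - (eta / (np.sqrt(Eg2) + eps)) * g

print(theta)                                 # ends up close to the minimum at 0
```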
9. Adaptive Moment Estimation (Adam)
- Adam (Adaptive Moment Estimation) works with estimates of the first and second moments of the gradients.
- Like AdaDelta and RMSprop, Adam stores an exponentially decaying average of past squared gradients, and in addition it stores an exponentially decaying average of the past gradients themselves. It keeps track of:
- first moment: the mean of the past gradients, M(t)
- second moment: the uncentered variance of the gradients, V(t).
To update the parameters, bias-corrected estimates M̂(t) = M(t)/(1 − β1^t) and V̂(t) = V(t)/(1 − β2^t) are computed, and then θ = θ − η⋅M̂(t)/(√V̂(t) + ϵ), as in the code sketch below.
Advantages:
- The method is fast and converges rapidly.
- Rectifies the vanishing learning rate and the high variance of the updates.
Disadvantages:
- Computationally costly.
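A minimal sketch of Adam on an assumed toy loss, with the usual defaults β1 = 0.9, β2 = 0.999 and the bias-corrected moments described above:

```python
import numpy as np

def grad_fn(theta):
    """Gradient of the toy loss J(theta) = ||theta||^2 (an assumption)."""
    return 2.0 * theta

theta = np.array([5.0, -3.0])
M = np.zeros_like(theta)                     # first moment (mean of gradients)
V = np.zeros_like(theta)                     # second moment (uncentered variance)
eta, beta1, beta2, eps = 0.1, 0.9, 0.999, 1e-8

for t in range(1, 1001):
    g = grad_fn(theta)
    M = beta1 * M + (1 - beta1) * g
    V = beta2 * V + (1 - beta2) * g ** 2
    M_hat = M / (1 - beta1 ** t)             # bias correction
    V_hat = V / (1 - beta2 ** t)
    theta = theta - eta * M_hat / (np.sqrt(V_hat) + eps)

print(theta)                                 # approaches the minimum at 0
```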
Conclusions
From the comparison animations commonly shown for these optimizers, together with the theory above, we can draw the following points:
- The SGD algorithm tends to get stuck at a saddle point, so plain SGD is only suitable for shallow networks.
- All the other algorithms finally converge one after the other, with AdaDelta being the fastest, followed by the momentum-based algorithms.
- The AdaGrad and AdaDelta algorithms can be used for sparse data.
- Momentum and NAG work well for most cases but are slower.
- An animation for Adam is not available, but from comparison plots it is observed to be the fastest algorithm to converge to the minima.
- Adam is considered the best algorithm amongst all the algorithms discussed above.