Annealing the learning rate

In the previous post, learning rate schedulers such as the CosineAnnealing method were introduced, which adjust the learning rate on a predefined schedule. However, when it comes to sparse data, those methods might not be appropriate since they apply the same schedule to all parameters.
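
As a reminder, a minimal sketch of such a predefined schedule in PyTorch is shown below; the linear model and the epoch count are placeholders, not part of the original post.

```python
import torch
from torch import nn
from torch.optim.lr_scheduler import CosineAnnealingLR

model = nn.Linear(10, 1)  # placeholder model, just to have parameters
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# Anneal the learning rate along a cosine curve over 100 epochs.
scheduler = CosineAnnealingLR(optimizer, T_max=100, eta_min=1e-5)

for epoch in range(100):
    # ... training loop over batches would go here ...
    optimizer.step()
    scheduler.step()  # the same predefined schedule is applied to every parameter
```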

Per-parameter adaptive learning rates

AdaGrad, RMSProp, and Adam are optimization methods with adaptive learning rates, which adapt the learning rate for each parameter individually, helping the model avoid getting stuck in poor local optima. These methods are designed to be robust to such sparse data and are therefore more applicable to practical problems.
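
To illustrate the idea, here is a simplified AdaGrad-style update for a single parameter tensor; it is a sketch of the mechanism, not the exact library implementation.

```python
import torch

def adagrad_step(param, grad, state_sum, lr=0.01, eps=1e-10):
    """Simplified AdaGrad update: the effective step size differs per element."""
    state_sum += grad ** 2                       # accumulate squared gradients per element
    adjusted_lr = lr / (state_sum.sqrt() + eps)  # frequently-updated elements get smaller steps
    param -= adjusted_lr * grad                  # rarely-updated (sparse) elements keep larger steps
    return param, state_sum

# Tiny usage example with a sparse-looking gradient.
w = torch.zeros(3)
g = torch.tensor([0.0, 1.0, 0.5])
s = torch.zeros(3)
w, s = adagrad_step(w, g, s)
```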

Among those optimization algorithms, Adam stands out since:

Adam combines the best properties of the AdaGrad and RMSProp algorithms to provide an optimization algorithm that can handle sparse gradients on noisy problems.
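
For intuition, the sketch below follows the standard Adam formulation: a first moment (momentum-like smoothing of the gradient) plus a second moment (per-element scaling, as in AdaGrad/RMSProp), both bias-corrected. It is a simplified illustration, not the torch.optim.Adam source.

```python
import torch

def adam_step(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Simplified Adam update for one parameter tensor; t is the step count starting at 1."""
    m = beta1 * m + (1 - beta1) * grad       # first moment estimate (momentum-like)
    v = beta2 * v + (1 - beta2) * grad ** 2  # second moment estimate (per-element scaling)
    m_hat = m / (1 - beta1 ** t)             # bias correction for the early steps
    v_hat = v / (1 - beta2 ** t)
    param -= lr * m_hat / (v_hat.sqrt() + eps)
    return param, m, v
```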

Rectified Adaptive Learning Rate (RAdam)

However, the adaptive learning rate methods tend to have large variance, especially in the early stage of training when the optimizer has seen too little data; this can mislead the model into poor local optima.

Here, warmup (training the early stage with a lower learning rate) has been widely used as a variance-reduction technique.
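
A common way to do this by hand is a linear warmup schedule; the sketch below uses PyTorch's LambdaLR, and the warmup length of 1000 steps is an assumed value that would normally have to be tuned.

```python
import torch
from torch import nn
from torch.optim.lr_scheduler import LambdaLR

model = nn.Linear(10, 1)  # placeholder model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

warmup_steps = 1000  # heuristic, tunable warmup length

def warmup_factor(step):
    # Scale the base lr linearly from ~0 up to 1 over the first warmup_steps updates.
    return min(1.0, (step + 1) / warmup_steps)

scheduler = LambdaLR(optimizer, lr_lambda=warmup_factor)

for step in range(5000):
    # ... forward/backward on a batch would go here ...
    optimizer.step()
    scheduler.step()
```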

Beyond such heuristic warmup strategies, RAdam, which acts as a dynamic warmup with no tunable parameters, has been proposed.

RAdam deactivates the adaptive learning rate when its variance is divergent, thus avoiding undesired instability in the first few updates. Besides, our method does not require an additional hyperparameter (i.e., Tw) and can automatically adapt to different moving average rules.

In addition, RAdam is shown to be more robust to learning rate variations (the most important hyperparameter) and provides better training accuracy and generalization on a variety of datasets and across a variety of AI architectures.

In PyTorch, the torch-optimizer package provides a RAdam optimizer, so we can easily apply it to our own data.
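
A minimal usage sketch with torch-optimizer is shown below; the model and hyperparameter values are placeholders. Recent PyTorch releases also ship torch.optim.RAdam, which can be used in the same way.

```python
import torch
from torch import nn
import torch_optimizer as optim  # pip install torch_optimizer

model = nn.Linear(10, 1)  # placeholder model

# RAdam rectifies the adaptive term automatically, so no warmup length needs to be tuned.
optimizer = optim.RAdam(
    model.parameters(),
    lr=1e-3,
    betas=(0.9, 0.999),
    eps=1e-8,
    weight_decay=0,
)

# ... standard training loop: loss.backward(); optimizer.step(); optimizer.zero_grad() ...
```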

Reference