Optimizers
Say you have a parameter W initialized for your model and its gradient stored in ∇ (perhaps obtained from the AutoGrad APIs). Here is a minimal snippet showing how to update W with SGD.
```julia
julia> using MXNet

julia> opt = SGD(η = 10)
MXNet.mx.SGD(10, 0.0, 0, 0, 0.0001, MXNet.mx.LearningRate.Fixed(10.0), MXNet.mx.Momentum.Null())

julia> descend! = getupdater(opt)
(::updater) (generic function with 1 method)

julia> W = NDArray(Float32[1, 2, 3, 4]);

julia> ∇ = NDArray(Float32[.1, .2, .3, .4]);

julia> descend!(1, ∇, W)
4-element mx.NDArray{Float32,1} @ CPU0:
 -0.00100005
 -0.00200009
 -0.00300002
 -0.00400019
```
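This single step is consistent with the vanilla SGD rule with weight decay documented below, W ← W - η (∇ + λ W), using η = 10 and the default λ = 0.0001. A quick plain-Julia check of the printed values (no MXNet state involved):

```julia
W = Float32[1, 2, 3, 4]
∇ = Float32[0.1, 0.2, 0.3, 0.4]
η, λ = 10f0, 1f-4
W .- η .* (∇ .+ λ .* W)   # ≈ [-0.001, -0.002, -0.003, -0.004]
```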
# MXNet.mx.AbstractOptimizer — Type.
AbstractOptimizer
Base type for all optimizers.
source
# MXNet.mx.getupdater — Method.
getupdater(optimizer)
A utility function to create an updater function (like the one used by KVStore) that uses its closure to store all the state needed for each weight.
The returned function has the following signature:
descend!(index::Int, ∇::NDArray, x::NDArray)
If the optimizer is stateful and needs to access or store state during updating, index will be the key used to access or store that state.
source
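As a hedged sketch building only on the calls shown in the snippet above, the same pattern works for a stateful optimizer such as ADAM (documented below); distinct index values keep the per-parameter state separate:

```julia
using MXNet

opt = ADAM(η = 0.001)          # stateful: keeps running moment estimates
descend! = getupdater(opt)

W₁ = NDArray(Float32[1, 2]);  ∇₁ = NDArray(Float32[0.1, 0.2]);
W₂ = NDArray(Float32[3, 4]);  ∇₂ = NDArray(Float32[0.3, 0.4]);

descend!(1, ∇₁, W₁)   # state for the first parameter lives under index 1
descend!(2, ∇₂, W₂)   # state for the second parameter lives under index 2
```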
# MXNet.mx.normgrad! — Method.
normgrad!(optimizer, W, ∇)
Get the properly normalized gradient (re-scaled and clipped if necessary).
- `optimizer`: the optimizer; it should contain the fields `scale`, `clip` and `λ`.
- `W::NDArray`: the trainable weights.
- `∇::NDArray`: the original gradient of the weights.
source
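The library's internals are not reproduced here, but conceptually the normalization can be pictured as the following plain-Julia sketch, an assumption pieced together from the `scale`, `clip` and `λ` descriptions in the optimizer sections below (the real `normgrad!` operates on `NDArray`s in place):

```julia
# Conceptual sketch only, not the MXNet.jl implementation.
function normalized_grad(scale, clip, λ, W, ∇)
    g = scale != 0 ? scale .* ∇ : copy(∇)   # optional rescaling
    if clip > 0
        g = clamp.(g, -clip, clip)          # optional clipping
    end
    g .+ λ .* W                             # weight decay as an L2 gradient term
end
```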
# MXNet.mx.AbstractLearningRateScheduler — Type.
AbstractLearningRateScheduler
Base type for all learning rate schedulers.
source
# MXNet.mx.AbstractMomentumScheduler — Type.
AbstractMomentumScheduler
Base type for all momentum schedulers.
source
# MXNet.mx.OptimizationState — Type.
OptimizationState
Attributes
- `batch_size`: The size of the mini-batch used in stochastic training.
- `curr_epoch`: The current epoch count. Epoch 0 means no training yet; during the first pass through the data the epoch count is 1, during the second pass it is 2, and so on.
- `curr_batch`: The current mini-batch count. The batch count is reset during every epoch. A batch count of 0 means the beginning of an epoch, with no mini-batch seen yet; during the first mini-batch the count is 1.
- `curr_iter`: The current iteration count. One iteration corresponds to one mini-batch, but unlike the mini-batch count, the iteration count does not reset in each epoch, so it tracks the total number of mini-batches seen so far.
source
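A hypothetical bit of bookkeeping (not MXNet.jl internals) showing how the three counters described above evolve during training:

```julia
function count_progress(n_epochs, n_batches)
    curr_epoch, curr_batch, curr_iter = 0, 0, 0
    for epoch in 1:n_epochs
        curr_epoch = epoch      # 1 on the first pass, 2 on the second, ...
        curr_batch = 0          # the mini-batch count resets every epoch
        for _ in 1:n_batches
            curr_batch += 1
            curr_iter  += 1     # never resets: total mini-batches seen so far
        end
    end
    (curr_epoch, curr_batch, curr_iter)
end

count_progress(2, 3)   # → (2, 3, 6)
```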
# MXNet.mx.LearningRate.Exp — Type.
LearningRate.Exp(η₀; γ = 0.9)
ηₜ = η₀ γ^t, where t is the epoch count (or the iteration count).
source
# MXNet.mx.LearningRate.Fixed — Type.
LearningRate.Fixed(η)
The fixed learning rate scheduler always returns the same learning rate.
source
# MXNet.mx.LearningRate.Inv — Type.
LearningRate.Inv(η₀; γ = 0.9, p = 0.5)
ηₜ = η₀ (1 + γ t)^(-p), where t is the epoch count (or the iteration count).
source
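Assuming the two formulas above, the following plain-Julia helpers (illustrative only, not library functions) show how the schedules decay with t:

```julia
η₀, γ, p = 0.1, 0.9, 0.5
exp_sched(t) = η₀ * γ^t              # LearningRate.Exp
inv_sched(t) = η₀ * (1 + γ*t)^(-p)   # LearningRate.Inv
[exp_sched.(0:4) inv_sched.(0:4)]    # 5×2 matrix of decaying learning rates
```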
# Base.get — Method.
get(sched::AbstractLearningRateScheduler)
Returns the current learning rate.
source
# MXNet.mx.Momentum.Fixed — Type.
Momentum.Fixed
The fixed momentum scheduler always returns the same value.
source
# MXNet.mx.Momentum.NadamScheduler — Type.
NadamScheduler(; μ = 0.99, δ = 0.004, γ = 0.5, α = 0.96)
Nesterov-accelerated adaptive momentum scheduler.
Described in Incorporating Nesterov Momentum into Adam.
μₜ = μ₀ (1 - γ α^(t δ))
Where
- t: the iteration count.
- `μ`: default `0.99`, the μ₀ in the formula above.
- `δ`: default `0.004`, the scheduler decay.
- `γ`: default `0.5`.
- `α`: default `0.96`.
source
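Assuming the schedule μₜ = μ₀ (1 - γ α^(t δ)) given above, the momentum ramps up from roughly μ₀/2 toward μ₀ as the iteration count grows; an illustrative evaluation with the default constants:

```julia
μ₀, δ, γ, α = 0.99, 0.004, 0.5, 0.96
μ(t) = μ₀ * (1 - γ * α^(t * δ))
μ.([1, 10, 100, 1000])   # ≈ [0.495, 0.496, 0.503, 0.570]
```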
# MXNet.mx.Momentum.Null — Type.
Momentum.Null
The null momentum scheduler always returns 0 for momentum. It is also used to explicitly indicate momentum should not be used.
source
# Base.get — Method.
get(n::NadamScheduler, t)
Where t is the iteration count.
source
Built-in optimizers
Stochastic Gradient Descent
# MXNet.mx.SGD — Type.
SGD(; kwargs...)
Stochastic gradient descent optimizer.
Vanilla SGD:
θ ← θ - η ∇
SGD with momentum:
ν ← μ ν - η ∇
θ ← θ + ν
Arguments
- `η`: default `0.01`, learning rate.
- `μ`: default `0`, the momentum, usually set to `0.9` in this implementation.
- `λ`: default `0.0001`, weight decay is equivalent to adding a global L2 regularizer to the parameters.
- `clip`: default `0`, gradient clipping. If positive, the gradient is clipped to the bounded range `[-clip, clip]`.
- `scale`: default `0`, gradient rescaling. If != 0, the gradient is multiplied by `scale` before updating, often chosen to be `1.0 / batch_size`. If left at the default, high-level APIs like `fit!` will set it to `1.0 / batch_size`, since `fit!` knows the `batch_size`.
- `μ_sched::AbstractMomentumScheduler`: default `Momentum.Null()`, a dynamic momentum scheduler. If set, it will overwrite the momentum parameter.
- `η_sched::AbstractLearningRateScheduler`: default `LearningRate.Fixed(η)`, a dynamic learning rate scheduler. If set, it will overwrite the `η` parameter.
source
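A usage sketch combining the arguments above with the exponential scheduler documented earlier. It assumes the `LearningRate` submodule is in scope after `using MXNet`; otherwise qualify it as `mx.LearningRate`:

```julia
using MXNet

opt = SGD(η = 0.01, μ = 0.9, λ = 0.0001,
          η_sched = LearningRate.Exp(0.01; γ = 0.9))
descend! = getupdater(opt)

W = NDArray(Float32[1, 2, 3, 4])
∇ = NDArray(Float32[0.1, 0.2, 0.3, 0.4])
descend!(1, ∇, W)   # momentum state for W is kept under index 1
```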
ADAM
# MXNet.mx.ADAM — Type.
ADAM
The solver described in Diederik Kingma, Jimmy Ba: Adam: A Method for Stochastic Optimization. arXiv:1412.6980 [cs.LG].
ADAM(; kwargs...)
Arguments
- `η`: default `0.001`, learning rate.
- `β1`: default `0.9`.
- `β2`: default `0.999`.
- `ϵ`: default `1e-8`.
- `clip`: default `0`, gradient clipping. If positive, the gradient is clipped to the range `[-clip, clip]`.
- `scale`: default `0`, gradient rescaling. If != 0, the gradient is multiplied by `scale` before updating, often chosen to be `1.0 / batch_size`. If left at the default, high-level APIs like `fit!` will set it to `1.0 / batch_size`, since `fit!` knows the `batch_size`.
- `λ`: default `0.00001`, weight decay is equivalent to adding a global L2 regularizer for all the parameters.
- `η_sched::AbstractLearningRateScheduler`: default `LearningRate.Fixed(η)`, a dynamic learning rate scheduler. If set, it will overwrite the `η` parameter.
source
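For reference, the textbook update from the cited paper looks like the following plain-Julia sketch. It is conceptual only; the library may fold in weight decay, rescaling and clipping differently:

```julia
function adam_step!(θ, g, m, v, t; η = 0.001, β1 = 0.9, β2 = 0.999, ϵ = 1e-8)
    @. m = β1 * m + (1 - β1) * g      # first moment estimate
    @. v = β2 * v + (1 - β2) * g^2    # second moment estimate
    m̂ = m ./ (1 - β1^t)               # bias correction
    v̂ = v ./ (1 - β2^t)
    @. θ -= η * m̂ / (sqrt(v̂) + ϵ)
    θ
end
```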
AdaGrad
# MXNet.mx.AdaGrad — Type.
AdaGrad(; kwargs...)
Scale learning rates by dividing with the square root of accumulated squared gradients. See [1] for further description.
Arguments
- `η`: default `0.1`, learning rate.
- `ϵ`: default `1e-6`, small value added for numerical stability.
- `clip`: default `0`, gradient clipping. If positive, the gradient is clipped to the range `[-clip, clip]`.
- `scale`: default `0`, gradient rescaling. If != 0, the gradient is multiplied by `scale` before updating, often chosen to be `1.0 / batch_size`. If left at the default, high-level APIs like `fit!` will set it to `1.0 / batch_size`, since `fit!` knows the `batch_size`.
- `λ`: default `0.00001`, weight decay is equivalent to adding a global L2 regularizer for all the parameters.
Notes
Using step size η, AdaGrad calculates the learning rate for feature i at time step t as
η_{t,i} = η / √(∑_{t′=1}^{t} g²_{t′,i} + ϵ)
As such, the learning rate is monotonically decreasing. Epsilon is not included in the typical formula, see [2].
References
- Duchi, J., Hazan, E., & Singer, Y. (2011): Adaptive subgradient methods for online learning and stochastic optimization. JMLR, 12:2121-2159.
- Chris Dyer: Notes on AdaGrad. http://www.ark.cs.cmu.edu/cdyer/adagrad.pdf
source
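A conceptual plain-Julia step following the note above (not the library code; the exact placement of ϵ varies between implementations):

```julia
function adagrad_step!(θ, g, acc; η = 0.1, ϵ = 1e-6)
    @. acc += g^2                      # accumulate squared gradients
    @. θ -= η * g / (sqrt(acc) + ϵ)    # per-coordinate rate shrinks over time
    θ
end
```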
AdaDelta
# MXNet.mx.AdaDelta — Type.
AdaDelta(; kwargs...)
Scale learning rates by the ratio of accumulated gradients to accumulated updates, see [1] and notes for further description.
Attributes
- `η`: default `1.0`, learning rate.
- `ρ`: default `0.95`, squared gradient moving average decay factor.
- `ϵ`: default `1e-6`, small value added for numerical stability.
- `clip`: default `0`, gradient clipping. If positive, the gradient is clipped to the range `[-clip, clip]`.
- `scale`: default `0`, gradient rescaling. If != 0, the gradient is multiplied by `scale` before updating, often chosen to be `1.0 / batch_size`. If left at the default, high-level APIs like `fit!` will set it to `1.0 / batch_size`, since `fit!` knows the `batch_size`.
- `λ`: default `0.00001`, weight decay is equivalent to adding a global L2 regularizer for all the parameters.
Notes
ρ should be between 0 and 1. A value of ρ close to 1 will decay the moving average slowly and a value close to 0 will decay the moving average fast.
ρ = 0.95 and ϵ = 1e-6 are suggested in the paper and reported to work for multiple datasets (MNIST, speech). In the paper, no learning rate is considered (so η = 1.0). Probably best to keep it at this value.
ϵ is important for the very first update (so the numerator does not become 0).
Using the step size η and a decay factor ρ, the learning rate is calculated as
rₜ = ρ rₜ₋₁ + (1 - ρ) g²
ηₜ = η √(sₜ₋₁ + ϵ) / √(rₜ + ϵ)
sₜ = ρ sₜ₋₁ + (1 - ρ) (ηₜ g)²
References
- Zeiler, M. D. (2012): ADADELTA: An Adaptive Learning Rate Method. arXiv Preprint arXiv:1212.5701.
source
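A plain-Julia sketch of the update rules written out above (conceptual only, not the MXNet.jl implementation):

```julia
function adadelta_step!(θ, g, r, s; η = 1.0, ρ = 0.95, ϵ = 1e-6)
    @. r = ρ * r + (1 - ρ) * g^2               # running average of squared gradients
    Δ = @. η * sqrt(s + ϵ) / sqrt(r + ϵ) * g   # scaled update
    @. s = ρ * s + (1 - ρ) * Δ^2               # running average of squared updates
    @. θ -= Δ
    θ
end
```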
AdaMax
# MXNet.mx.AdaMax — Type.
AdaMax(; kwargs...)
This is a variant of the Adam algorithm based on the infinity norm. See [1] for further description.
Arguments
- `η`: default `0.002`, learning rate.
- `β1`: default `0.9`, exponential decay rate for the first moment estimates.
- `β2`: default `0.999`, exponential decay rate for the weighted infinity norm estimates.
- `ϵ`: default `1e-8`, small value added for numerical stability.
- `clip`: default `0`, gradient clipping. If positive, the gradient is clipped to the range `[-clip, clip]`.
- `scale`: default `0`, gradient rescaling. If != 0, the gradient is multiplied by `scale` before updating, often chosen to be `1.0 / batch_size`. If left at the default, high-level APIs like `fit!` will set it to `1.0 / batch_size`, since `fit!` knows the `batch_size`.
- `λ`: default `0.00001`, weight decay is equivalent to adding a global L2 regularizer for all the parameters.
References
- Kingma, Diederik, and Jimmy Ba (2014): Adam: A Method for Stochastic Optimization. Section 7. http://arxiv.org/abs/1412.6980.
source
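A conceptual plain-Julia step per Section 7 of the cited paper, where the second moment estimate is replaced by an exponentially weighted infinity norm (a sketch, not the library code):

```julia
function adamax_step!(θ, g, m, u, t; η = 0.002, β1 = 0.9, β2 = 0.999, ϵ = 1e-8)
    @. m = β1 * m + (1 - β1) * g     # first moment estimate
    @. u = max(β2 * u, abs(g))       # exponentially weighted infinity norm
    @. θ -= (η / (1 - β1^t)) * m / (u + ϵ)
    θ
end
```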
RMSProp
# MXNet.mx.RMSProp — Type.
RMSProp(; kwargs...)
Scale learning rates by dividing with the moving average of the root mean squared (RMS) gradients. See [1] for further description.
Arguments
- `η`: default `0.1`, learning rate.
- `ρ`: default `0.9`, gradient moving average decay factor.
- `ϵ`: default `1e-8`, small value added for numerical stability.
- `clip`: default `0`, gradient clipping. If positive, the gradient is clipped to the range `[-clip, clip]`.
- `scale`: default `0`, gradient rescaling. If != 0, the gradient is multiplied by `scale` before updating, often chosen to be `1.0 / batch_size`. If left at the default, high-level APIs like `fit!` will set it to `1.0 / batch_size`, since `fit!` knows the `batch_size`.
- `λ`: default `0.00001`, weight decay is equivalent to adding a global L2 regularizer for all the parameters.
Notes
ρ should be between 0 and 1. A value of ρ close to 1 will decay the moving average slowly and a value close to 0 will decay the moving average fast.
Using the step size η and a decay factor ρ, the learning rate ηₜ is calculated as
rₜ = ρ rₜ₋₁ + (1 - ρ) g²
ηₜ = η / √(rₜ + ϵ)
References
- Tieleman, T. and Hinton, G. (2012): Neural Networks for Machine Learning, Lecture 6.5 - rmsprop. Coursera. http://www.youtube.com/watch?v=O3sxAc4hxZU (formula @5:20)
source
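The same rule written as a plain-Julia sketch (conceptual only; ϵ placement varies between implementations):

```julia
function rmsprop_step!(θ, g, r; η = 0.1, ρ = 0.9, ϵ = 1e-8)
    @. r = ρ * r + (1 - ρ) * g^2     # moving average of squared gradients
    @. θ -= η * g / (sqrt(r) + ϵ)
    θ
end
```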
Nadam
# MXNet.mx.Nadam — Type.
Nadam(; kwargs...)
Nesterov Adam optimizer: Adam with Nesterov momentum; see [1] and the notes for further description.
Arguments
- `η`: default `0.001`, learning rate.
- `β1`: default `0.99`.
- `β2`: default `0.999`.
- `ϵ`: default `1e-8`, small value added for numerical stability.
- `clip`: default `0`, gradient clipping. If positive, the gradient is clipped to the range `[-clip, clip]`.
- `scale`: default `0`, gradient rescaling. If != 0, the gradient is multiplied by `scale` before updating, often chosen to be `1.0 / batch_size`. If left at the default, high-level APIs like `fit!` will set it to `1.0 / batch_size`, since `fit!` knows the `batch_size`.
- `λ`: default `0.00001`, weight decay is equivalent to adding a global L2 regularizer for all the parameters.
- `η_sched::AbstractLearningRateScheduler`: default `nothing`, a dynamic learning rate scheduler. If set, it will overwrite the `η` parameter.
- `μ_sched::NadamScheduler`: default `NadamScheduler()`, a dynamic momentum scheduler of the form μₜ = μ₀ (1 - γ α^(t δ)) (see `Momentum.NadamScheduler` above).
Notes
Default parameters follow those provided in the paper. It is recommended to leave the parameters of this optimizer at their default values.
References
- Incorporating Nesterov Momentum into Adam.
- On the importance of initialization and momentum in deep learning.
source
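A usage sketch combining Nadam with its momentum scheduler. It assumes `Nadam` and the `Momentum` submodule are in scope after `using MXNet` (otherwise qualify them with `mx.`):

```julia
using MXNet

opt = Nadam(η = 0.001,
            μ_sched = Momentum.NadamScheduler(μ = 0.99, γ = 0.5))
descend! = getupdater(opt)

W = NDArray(Float32[1, 2, 3, 4])
∇ = NDArray(Float32[0.1, 0.2, 0.3, 0.4])
descend!(1, ∇, W)
```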