Optimizers
Say you have a parameter W initialized for your model and its gradient stored in ∇ (perhaps obtained from the AutoGrad APIs). Here is a minimal snippet showing how to update W with SGD.
julia> using MXNet
julia> opt = SGD(η = 10)
MXNet.mx.SGD(10, 0.0, 0, 0, 0.0001, MXNet.mx.LearningRate.Fixed(10.0), MXNet.mx.Momentum.Null())
julia> descend! = getupdater(opt)
(::updater) (generic function with 1 method)
julia> W = NDArray(Float32[1, 2, 3, 4]);
julia> ∇ = NDArray(Float32[.1, .2, .3, .4]);
julia> descend!(1, ∇, W)
4-element mx.NDArray{Float32,1} @ CPU0:
0.00100005
0.00200009
0.00300002
0.00400019
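For a manual training loop, the same updater can be called repeatedly. A minimal sketch, reusing only the API shown above and assuming the gradient ∇ would normally be recomputed (e.g. via the AutoGrad APIs) before each call:

```julia
using MXNet

opt      = SGD(η = 0.1)
descend! = getupdater(opt)

W = NDArray(Float32[1, 2, 3, 4])
∇ = NDArray(Float32[0.1, 0.2, 0.3, 0.4])   # placeholder gradient; recompute it each step in real training

for step in 1:3
    descend!(1, ∇, W)                  # the index (here 1) keys any per-weight optimizer state
    println("step $step: ", copy(W))   # copy pulls the NDArray back into a Julia Array
end
```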
# MXNet.mx.AbstractOptimizer
— Type.
AbstractOptimizer
Base type for all optimizers.
source
# MXNet.mx.getupdater
— Method.
getupdater(optimizer)
A utility function to create an updater function for KVStore, which uses its closure to store all the state needed for each weight.
The returned function has the following signature:
descend!(index::Int, ∇::NDArray, x::NDArray)
If the optimizer is stateful and needs to access or store state during updating, index will be the key used to access or store that state.
source
# MXNet.mx.normgrad!
— Method.
normgrad!(optimizer, W, ∇)
Get the properly normalized gradient (rescaled and clipped if necessary).

* optimizer: the optimizer; it should contain the fields scale, clip, and λ.
* W::NDArray: the trainable weights.
* ∇::NDArray: the original gradient of the weights.
source
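A plain-Julia sketch of this kind of gradient normalization, assuming scale, clip, and λ carry the meanings documented for the optimizer arguments below (rescaling factor, clipping bound, weight decay). The function name is hypothetical and this is illustrative only, not the library's implementation:

```julia
# Illustrative sketch only; not MXNet's actual normgrad! code.
function normalized_grad(∇::Vector{Float32}, W::Vector{Float32};
                         scale = 0.0f0, clip = 0.0f0, λ = 0.0f0)
    g = scale != 0 ? scale .* ∇ : copy(∇)   # optional rescaling, e.g. by 1/batch_size
    if clip > 0
        g = clamp.(g, -clip, clip)          # clip into the bounded range [-clip, clip]
    end
    λ > 0 ? g .+ λ .* W : g                 # add the l2 weight-decay contribution
end

normalized_grad(Float32[10, -10, 0.5], Float32[1, 2, 3]; scale = 0.1f0, clip = 0.8f0)
```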
# MXNet.mx.AbstractLearningRateScheduler
— Type.
AbstractLearningRateScheduler
Base type for all learning rate schedulers.
source
# MXNet.mx.AbstractMomentumScheduler
— Type.
AbstractMomentumScheduler
Base type for all momentum schedulers.
source
# MXNet.mx.OptimizationState
— Type.
OptimizationState
Attributes

* batch_size: The size of the minibatch used in stochastic training.
* curr_epoch: The current epoch count. Epoch 0 means no training yet; during the first pass through the data the epoch count will be 1, during the second pass it will be 2, and so on.
* curr_batch: The current minibatch count. The batch count is reset at the start of every epoch. A batch count of 0 means the beginning of an epoch, with no minibatch seen yet; during the first minibatch the batch count will be 1.
* curr_iter: The current iteration count. One iteration corresponds to one minibatch, but unlike the minibatch count, the iteration count is not reset in each epoch, so it tracks the total number of minibatches seen so far.
source
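A toy illustration of how the three counters relate (plain Julia, not the actual OptimizationState bookkeeping): the batch count restarts every epoch while the iteration count keeps growing.

```julia
# Toy counters only, to illustrate the attribute semantics above.
let curr_iter = 0
    for curr_epoch in 1:2, curr_batch in 1:3
        curr_iter += 1
        println("epoch=$curr_epoch  batch=$curr_batch  iter=$curr_iter")
    end
end
# curr_iter reaches 6 here: it counts all minibatches across both epochs.
```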
# MXNet.mx.LearningRate.Exp
— Type.
LearningRate.Exp(η₀; γ = 0.9)
ηₜ = η₀ γᵗ
Where t is the epoch count, or the iteration count.
source
# MXNet.mx.LearningRate.Fixed
— Type.
LearningRate.Fixed(η)
Fixed learning rate scheduler, which always returns the same learning rate.
source
# MXNet.mx.LearningRate.Inv
— Type.
LearningRate.Inv(η₀; γ = 0.9, p = 0.5)
ηₜ = η₀ (1 + γ t)⁻ᵖ
Where t is the epoch count, or the iteration count.
source
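As a quick illustration of the two decay curves above, here is a sketch that evaluates the Exp and Inv formulas directly in plain Julia (it does not use the scheduler objects themselves):

```julia
# Compare exponential and inverse decay for the first few steps.
η₀, γ, p = 0.1, 0.9, 0.5
for t in 0:4
    println("t=$t  Exp: ", η₀ * γ^t, "  Inv: ", η₀ * (1 + γ * t)^(-p))
end
```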
# Base.get
— Method.
get(sched::AbstractLearningRateScheduler)
Returns the current learning rate.
source
# MXNet.mx.Momentum.Fixed
— Type.
Momentum.Fixed
Fixed momentum scheduler always returns the same value.
source
# MXNet.mx.Momentum.NadamScheduler
— Type.
NadamScheduler(; μ = 0.99, δ = 0.004, γ = 0.5, α = 0.96)
Nesterov-accelerated adaptive momentum scheduler.
Description in Incorporating Nesterov Momentum into Adam.
Where

* t: the iteration count
* μ: default 0.99, μ₀
* δ: default 0.004, the scheduler decay
* γ: default 0.5
* α: default 0.96
source
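The defaults above match the momentum warm-up schedule used in the Nadam paper. Assuming the schedule takes the commonly used form μₜ = μ₀ (1 − γ αᵗ·ᵟ), which is an assumption based on those defaults and is not stated on this page, it can be sketched as:

```julia
# Assumed Nadam-style momentum warm-up schedule; illustrative only,
# not guaranteed to match the library's NadamScheduler internals.
nadam_μ(t; μ₀ = 0.99, δ = 0.004, γ = 0.5, α = 0.96) = μ₀ * (1 - γ * α^(t * δ))

for t in (0, 100, 1000, 10_000)
    println("t=$t  μ=", nadam_μ(t))   # momentum warms up from ~0.495 toward μ₀
end
```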
# MXNet.mx.Momentum.Null
— Type.
Momentum.Null
The null momentum scheduler always returns 0 for momentum. It is also used to explicitly indicate momentum should not be used.
source
# Base.get
— Method.
get(n::NadamScheduler, t)
Where t
is the iteration count.
source
Built-in optimizers
Stochastic Gradient Descent
# MXNet.mx.SGD
— Type.
SGD(; kwargs...)
Stochastic gradient descent optimizer.
Vanilla SGD:
θ ← θ − η ∇
SGD with momentum:
νₜ ← μ νₜ₋₁ − η ∇
θ ← θ + νₜ
Arguments

* η: default 0.01, learning rate.
* μ: default 0, the momentum, usually set to 0.9 in this implementation.
* λ: default 0.0001, weight decay is equivalent to adding a global l2 regularizer to the parameters.
* clip: default 0, gradient clipping. If positive, will clip the gradient into the bounded range [-clip, clip].
* scale: default 0, gradient rescaling. If != 0, multiply the gradient with scale before updating. Often chosen to be 1.0 / batch_size. If left at the default, a high-level API like fit! will set it to 1.0 / batch_size, since fit! knows the batch_size.
* μ_sched::AbstractMomentumScheduler: default Momentum.Null(), a dynamic momentum scheduler. If set, will overwrite the momentum (μ) parameter.
* η_sched::AbstractLearningRateScheduler: default LearningRate.Fixed(η), a dynamic learning rate scheduler. If set, will overwrite the η parameter.
source
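A construction sketch using the keyword arguments documented above; the specific values are illustrative only.

```julia
using MXNet

# Momentum SGD with weight decay, gradient clipping, per-batch rescaling,
# and an exponentially decaying learning rate.
opt = SGD(η     = 0.01,
          μ     = 0.9,
          λ     = 0.0001,
          clip  = 5.0,
          scale = 1.0 / 64,
          η_sched = mx.LearningRate.Exp(0.01; γ = 0.95))

descend! = getupdater(opt)   # then call descend!(idx, ∇, W) as in the snippet at the top
```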
ADAM
# MXNet.mx.ADAM
— Type.
ADAM
The solver described in Diederik Kingma, Jimmy Ba: Adam: A Method for Stochastic Optimization. arXiv:1412.6980 [cs.LG].
ADAM(; kwargs...)
Arguments

* η: default 0.001, learning rate.
* β1: default 0.9.
* β2: default 0.999.
* ϵ: default 1e-8.
* clip: default 0, gradient clipping. If positive, will clip the gradient into the range [-clip, clip].
* scale: default 0, gradient rescaling. If != 0, multiply the gradient with scale before updating. Often chosen to be 1.0 / batch_size. If left at the default, a high-level API like fit! will set it to 1.0 / batch_size, since fit! knows the batch_size.
* λ: default 0.00001, weight decay is equivalent to adding a global l2 regularizer for all the parameters.
* η_sched::AbstractLearningRateScheduler: default LearningRate.Fixed(η), a dynamic learning rate scheduler. If set, will overwrite the η parameter.
source
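For reference, a plain-Julia sketch of the Adam update from the cited paper; it is illustrative only (not necessarily the library's exact implementation) and ignores clip, scale, and λ.

```julia
# One Adam step on plain Julia arrays; m and v are the running moment estimates.
function adam_step!(W, ∇, m, v, t; η = 0.001, β1 = 0.9, β2 = 0.999, ϵ = 1e-8)
    @. m = β1 * m + (1 - β1) * ∇        # biased first-moment estimate
    @. v = β2 * v + (1 - β2) * ∇^2      # biased second-moment estimate
    m_hat = m ./ (1 - β1^t)             # bias-corrected first moment
    v_hat = v ./ (1 - β2^t)             # bias-corrected second moment
    @. W -= η * m_hat / (sqrt(v_hat) + ϵ)
    return W
end

W, ∇ = [1.0, 2.0], [0.1, 0.2]
m, v = zeros(2), zeros(2)
adam_step!(W, ∇, m, v, 1)
```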
AdaGrad
# MXNet.mx.AdaGrad
— Type.
AdaGrad(; kwargs...)
Scale learning rates by dividing with the square root of accumulated squared gradients. See [1] for further description.
Arguments

* η: default 0.1, learning rate.
* ϵ: default 1e-6, small value added for numerical stability.
* clip: default 0, gradient clipping. If positive, will clip the gradient into the range [-clip, clip].
* scale: default 0, gradient rescaling. If != 0, multiply the gradient with scale before updating. Often chosen to be 1.0 / batch_size. If left at the default, a high-level API like fit! will set it to 1.0 / batch_size, since fit! knows the batch_size.
* λ: default 0.00001, weight decay is equivalent to adding a global l2 regularizer for all the parameters.
Notes
Using step size η, AdaGrad calculates the learning rate for feature i at time step t as:
ηₜ,ᵢ = η / √(∑ₜ′ gₜ′,ᵢ² + ϵ)
where the sum runs over the squared gradients of all steps t′ ≤ t; as such the learning rate is monotonically decreasing. Epsilon is not included in the typical formula, see [2].
References
1. Duchi, J., Hazan, E., & Singer, Y. (2011): Adaptive subgradient methods for online learning and stochastic optimization. JMLR, 12:2121-2159.
2. Chris Dyer: Notes on AdaGrad. http://www.ark.cs.cmu.edu/cdyer/adagrad.pdf
source
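A plain-Julia sketch of the accumulation rule in the Notes above; illustrative only, and it ignores clip, scale, and λ.

```julia
# One AdaGrad step; acc accumulates squared gradients across all steps.
function adagrad_step!(W, ∇, acc; η = 0.1, ϵ = 1e-6)
    @. acc += ∇^2                     # accumulated squared gradients (never resets)
    @. W   -= η * ∇ / sqrt(acc + ϵ)   # per-coordinate, monotonically shrinking step
    return W
end

W, ∇, acc = [1.0, 2.0], [0.5, -0.5], zeros(2)
adagrad_step!(W, ∇, acc)
```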
AdaDelta
# MXNet.mx.AdaDelta
— Type.
AdaDelta(; kwargs...)
Scale learning rates by the ratio of accumulated gradients to accumulated updates, see [1] and notes for further description.
Attributes

* η: default 1.0, learning rate.
* ρ: default 0.95, squared gradient moving average decay factor.
* ϵ: default 1e-6, small value added for numerical stability.
* clip: default 0, gradient clipping. If positive, will clip the gradient into the range [-clip, clip].
* scale: default 0, gradient rescaling. If != 0, multiply the gradient with scale before updating. Often chosen to be 1.0 / batch_size. If left at the default, a high-level API like fit! will set it to 1.0 / batch_size, since fit! knows the batch_size.
* λ: default 0.00001, weight decay is equivalent to adding a global l2 regularizer for all the parameters.
Notes
ρ should be between 0 and 1. A value of ρ close to 1 will decay the moving average slowly, and a value close to 0 will decay the moving average fast.
ρ = 0.95 and ϵ = 1e-6 are suggested in the paper and reported to work for multiple datasets (MNIST, speech). In the paper, no learning rate is considered (so η = 1.0); it is probably best to keep it at this value.
ϵ is important for the very first update (so the numerator does not become 0).
Using the step size η and a decay factor ρ, the learning rate ηₜ is calculated as:
rₜ = ρ rₜ₋₁ + (1 − ρ) ∇ₜ²
ηₜ = η √(sₜ₋₁ + ϵ) / √(rₜ + ϵ)
sₜ = ρ sₜ₋₁ + (1 − ρ) (ηₜ ∇ₜ)²
References
1. Zeiler, M. D. (2012): ADADELTA: An Adaptive Learning Rate Method. arXiv preprint arXiv:1212.5701.
source
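A plain-Julia sketch of the rule in the Notes above; illustrative only, and it ignores clip, scale, and λ.

```julia
# One AdaDelta step; acc_g and acc_Δ are the running averages of squared
# gradients and squared updates respectively.
function adadelta_step!(W, ∇, acc_g, acc_Δ; η = 1.0, ρ = 0.95, ϵ = 1e-6)
    @. acc_g = ρ * acc_g + (1 - ρ) * ∇^2            # squared-gradient average
    Δ = @. sqrt(acc_Δ + ϵ) / sqrt(acc_g + ϵ) * ∇    # rescaled update direction
    @. acc_Δ = ρ * acc_Δ + (1 - ρ) * Δ^2            # squared-update average
    @. W -= η * Δ
    return W
end

W, ∇ = [1.0, 2.0], [0.5, -0.5]
acc_g, acc_Δ = zeros(2), zeros(2)
adadelta_step!(W, ∇, acc_g, acc_Δ)
```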
AdaMax
# MXNet.mx.AdaMax
— Type.
AdaMax(; kwargs...)
This is a variant of the Adam algorithm based on the infinity norm. See [1] for further description.
Arguments

* η: default 0.002, learning rate.
* β1: default 0.9, exponential decay rate for the first moment estimates.
* β2: default 0.999, exponential decay rate for the weighted infinity norm estimates.
* ϵ: default 1e-8, small value added for numerical stability.
* clip: default 0, gradient clipping. If positive, will clip the gradient into the range [-clip, clip].
* scale: default 0, gradient rescaling. If != 0, multiply the gradient with scale before updating. Often chosen to be 1.0 / batch_size. If left at the default, a high-level API like fit! will set it to 1.0 / batch_size, since fit! knows the batch_size.
* λ: default 0.00001, weight decay is equivalent to adding a global l2 regularizer for all the parameters.
References
1. Kingma, Diederik, and Jimmy Ba (2014): Adam: A Method for Stochastic Optimization. Section 7. http://arxiv.org/abs/1412.6980.
source
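A plain-Julia sketch of the infinity-norm variant described in Section 7 of [1]; illustrative only, and it ignores clip, scale, and λ.

```julia
# One AdaMax step; m is the first-moment estimate, u the exponentially
# weighted infinity norm of past gradients.
function adamax_step!(W, ∇, m, u, t; η = 0.002, β1 = 0.9, β2 = 0.999, ϵ = 1e-8)
    @. m = β1 * m + (1 - β1) * ∇      # first-moment estimate
    @. u = max(β2 * u, abs(∇))        # infinity-norm accumulator
    @. W -= (η / (1 - β1^t)) * m / (u + ϵ)
    return W
end

W, ∇ = [1.0, 2.0], [0.5, -0.5]
m, u = zeros(2), zeros(2)
adamax_step!(W, ∇, m, u, 1)
```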
RMSProp
# MXNet.mx.RMSProp
— Type.
RMSProp(; kwargs...)
Scale learning rates by dividing with the moving average of the root mean squared (RMS) gradients. See [1] for further description.
Arguments

* η: default 0.1, learning rate.
* ρ: default 0.9, gradient moving average decay factor.
* ϵ: default 1e-8, small value added for numerical stability.
* clip: default 0, gradient clipping. If positive, will clip the gradient into the range [-clip, clip].
* scale: default 0, gradient rescaling. If != 0, multiply the gradient with scale before updating. Often chosen to be 1.0 / batch_size. If left at the default, a high-level API like fit! will set it to 1.0 / batch_size, since fit! knows the batch_size.
* λ: default 0.00001, weight decay is equivalent to adding a global l2 regularizer for all the parameters.
Notes
ρ should be between 0 and 1. A value of ρ close to 1 will decay the moving average slowly, and a value close to 0 will decay the moving average fast.
Using the step size η and a decay factor ρ, the learning rate ηₜ is calculated as:
rₜ = ρ rₜ₋₁ + (1 − ρ) ∇ₜ²
ηₜ = η / √(rₜ + ϵ)
References
1. Tieleman, T. and Hinton, G. (2012): Neural Networks for Machine Learning, Lecture 6.5 - rmsprop. Coursera. http://www.youtube.com/watch?v=O3sxAc4hxZU (formula @5:20)
source
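A plain-Julia sketch of the rule in the Notes above; illustrative only, and it ignores clip, scale, and λ.

```julia
# One RMSProp step; acc is the moving average of squared gradients.
function rmsprop_step!(W, ∇, acc; η = 0.1, ρ = 0.9, ϵ = 1e-8)
    @. acc = ρ * acc + (1 - ρ) * ∇^2   # decaying average of squared gradients
    @. W  -= η * ∇ / sqrt(acc + ϵ)     # scale the step by the RMS gradient
    return W
end

W, ∇, acc = [1.0, 2.0], [0.5, -0.5], zeros(2)
rmsprop_step!(W, ∇, acc)
```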
Nadam
# MXNet.mx.Nadam
— Type.
Nadam(; kwargs...)
Nesterov Adam optimizer: Adam with Nesterov momentum; see [1] and the notes for further description.
Arguments

* η: default 0.001, learning rate.
* β1: default 0.99.
* β2: default 0.999.
* ϵ: default 1e-8, small value added for numerical stability.
* clip: default 0, gradient clipping. If positive, will clip the gradient into the range [-clip, clip].
* scale: default 0, gradient rescaling. If != 0, multiply the gradient with scale before updating. Often chosen to be 1.0 / batch_size. If left at the default, a high-level API like fit! will set it to 1.0 / batch_size, since fit! knows the batch_size.
* λ: default 0.00001, weight decay is equivalent to adding a global l2 regularizer for all the parameters.
* η_sched::AbstractLearningRateScheduler: default nothing, a dynamic learning rate scheduler. If set, will overwrite the η parameter.
* μ_sched::NadamScheduler: default NadamScheduler(), a dynamic Nesterov momentum scheduler (see Momentum.NadamScheduler).
Notes
Default parameters follow those provided in the paper. It is recommended to leave the parameters of this optimizer at their default values.
References
1. Incorporating Nesterov Momentum into Adam.
2. On the importance of initialization and momentum in deep learning.
source
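A construction sketch using the documented keyword arguments together with the Nadam momentum scheduler; the values shown are just the documented defaults spelled out and are illustrative only.

```julia
using MXNet

# Nadam with an explicit momentum schedule.
opt = mx.Nadam(η  = 0.001,
               β1 = 0.99,
               β2 = 0.999,
               μ_sched = mx.Momentum.NadamScheduler(μ = 0.99, δ = 0.004))

descend! = getupdater(opt)   # then call descend!(idx, ∇, W) as in the snippet at the top
```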