Optimizers
Say you have a parameter W initialized for your model and its gradient stored in ∇ (perhaps obtained from the AutoGrad APIs). Here is a minimal snippet showing how to update W with SGD.
julia> using MXNet
julia> opt = SGD(η = 10)
MXNet.mx.SGD(10, 0.0, 0, 0, 0.0001, MXNet.mx.LearningRate.Fixed(10.0), MXNet.mx.Momentum.Null())
julia> descend! = getupdater(opt)
(::updater) (generic function with 1 method)
julia> W = NDArray(Float32[1, 2, 3, 4]);
julia> ∇ = NDArray(Float32[.1, .2, .3, .4]);
julia> descend!(1, ∇, W)
4-element mx.NDArray{Float32,1} @ CPU0:
0.00100005
0.00200009
0.00300002
0.00400019
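For a manual training loop, the same updater can be called repeatedly. A minimal sketch, reusing only the API shown above and assuming the gradient ∇ would normally be recomputed (e.g. via the AutoGrad APIs) before each call:

```julia
using MXNet

opt      = SGD(η = 0.1)
descend! = getupdater(opt)

W = NDArray(Float32[1, 2, 3, 4])
∇ = NDArray(Float32[0.1, 0.2, 0.3, 0.4])   # placeholder gradient; recompute it each step in real training

for step in 1:3
    descend!(1, ∇, W)                  # the index (here 1) keys any per-weight optimizer state
    println("step $step: ", copy(W))   # copy pulls the NDArray back into a Julia Array
end
```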
# MXNet.mx.AbstractOptimizer
— Type.
AbstractOptimizer
Base type for all optimizers.
source
# MXNet.mx.getupdater
— Method.
getupdater(optimizer)
A utility function to create an updater function for KVStore, which uses its closure to store all the state needed for each weight.
The returned function has the following signature:
descend!(index::Int, ∇::NDArray, x::NDArray)
If the optimizer is stateful and needs to access or store state during updating, index will be the key used to access or store that state.
source
# MXNet.mx.normgrad!
— Method.
normgrad!(optimizer, W, ∇)
Get the properly normalized gradient (rescaled and clipped if necessary).

* optimizer: the optimizer; it should contain the fields scale, clip, and λ.
* W::NDArray: the trainable weights.
* ∇::NDArray: the original gradient of the weights.
source
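A plain-Julia sketch of this kind of gradient normalization, assuming scale, clip, and λ carry the meanings documented for the optimizer arguments below (rescaling factor, clipping bound, weight decay). The function name is hypothetical and this is illustrative only, not the library's implementation:

```julia
# Illustrative sketch only; not MXNet's actual normgrad! code.
function normalized_grad(∇::Vector{Float32}, W::Vector{Float32};
                         scale = 0.0f0, clip = 0.0f0, λ = 0.0f0)
    g = scale != 0 ? scale .* ∇ : copy(∇)   # optional rescaling, e.g. by 1/batch_size
    if clip > 0
        g = clamp.(g, -clip, clip)          # clip into the bounded range [-clip, clip]
    end
    λ > 0 ? g .+ λ .* W : g                 # add the l2 weight-decay contribution
end

normalized_grad(Float32[10, -10, 0.5], Float32[1, 2, 3]; scale = 0.1f0, clip = 0.8f0)
```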
# MXNet.mx.AbstractLearningRateScheduler
— Type.
AbstractLearningRateScheduler
Base type for all learning rate schedulers.
source
# MXNet.mx.AbstractMomentumScheduler
— Type.
AbstractMomentumScheduler
Base type for all momentum schedulers.
source
# MXNet.mx.OptimizationState
— Type.
OptimizationState
Attributes

* batch_size: The size of the minibatch used in stochastic training.
* curr_epoch: The current epoch count. Epoch 0 means no training yet; during the first pass through the data the epoch count will be 1, during the second pass it will be 2, and so on.
* curr_batch: The current minibatch count. The batch count is reset at the start of every epoch. A batch count of 0 means the beginning of an epoch, with no minibatch seen yet; during the first minibatch the batch count will be 1.
* curr_iter: The current iteration count. One iteration corresponds to one minibatch, but unlike the minibatch count, the iteration count is not reset in each epoch, so it tracks the total number of minibatches seen so far.
source
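A toy illustration of how the three counters relate (plain Julia, not the actual OptimizationState bookkeeping): the batch count restarts every epoch while the iteration count keeps growing.

```julia
# Toy counters only, to illustrate the attribute semantics above.
let curr_iter = 0
    for curr_epoch in 1:2, curr_batch in 1:3
        curr_iter += 1
        println("epoch=$curr_epoch  batch=$curr_batch  iter=$curr_iter")
    end
end
# curr_iter reaches 6 here: it counts all minibatches across both epochs.
```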
# MXNet.mx.LearningRate.Exp
— Type.
LearningRate.Exp(η₀; γ = 0.9)
ηₜ = η₀ γᵗ
Where t is the epoch count, or the iteration count.
source
# MXNet.mx.LearningRate.Fixed
— Type.
LearningRate.Fixed(η)
Fixed learning rate scheduler, which always returns the same learning rate.
source
# MXNet.mx.LearningRate.Inv
— Type.
LearningRate.Inv(η₀; γ = 0.9, p = 0.5)
ηₜ = η₀ (1 + γ t)⁻ᵖ
Where t is the epoch count, or the iteration count.
source
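As a quick illustration of the two decay curves above, here is a sketch that evaluates the Exp and Inv formulas directly in plain Julia (it does not use the scheduler objects themselves):

```julia
# Compare exponential and inverse decay for the first few steps.
η₀, γ, p = 0.1, 0.9, 0.5
for t in 0:4
    println("t=$t  Exp: ", η₀ * γ^t, "  Inv: ", η₀ * (1 + γ * t)^(-p))
end
```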
# Base.get
— Method.
get(sched::AbstractLearningRateScheduler)
Returns the current learning rate.
source
# MXNet.mx.Momentum.Fixed
— Type.
Momentum.Fixed
Fixed momentum scheduler always returns the same value.
source
# MXNet.mx.Momentum.NadamScheduler
— Type.
NadamScheduler(; μ = 0.99, δ = 0.004, γ = 0.5, α = 0.96)
Nesterov-accelerated adaptive momentum scheduler.
Description in Incorporating Nesterov Momentum into Adam.
Where

* t: the iteration count
* μ: default 0.99, μ₀
* δ: default 0.004, the scheduler decay
* γ: default 0.5
* α: default 0.96
source
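The defaults above match the momentum warm-up schedule used in the Nadam paper. Assuming the schedule takes the commonly used form μₜ = μ₀ (1 − γ αᵗ·ᵟ), which is an assumption based on those defaults and is not stated on this page, it can be sketched as:

```julia
# Assumed Nadam-style momentum warm-up schedule; illustrative only,
# not guaranteed to match the library's NadamScheduler internals.
nadam_μ(t; μ₀ = 0.99, δ = 0.004, γ = 0.5, α = 0.96) = μ₀ * (1 - γ * α^(t * δ))

for t in (0, 100, 1000, 10_000)
    println("t=$t  μ=", nadam_μ(t))   # momentum warms up from ~0.495 toward μ₀
end
```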
# MXNet.mx.Momentum.Null
— Type.
Momentum.Null
The null momentum scheduler always returns 0 for momentum. It is also used to explicitly indicate momentum should not be used.
source
# Base.get
— Method.
get(n::NadamScheduler, t)
Where t
is the iteration count.
source
Built-in optimizers
Stochastic Gradient Descent
# MXNet.mx.SGD
— Type.
SGD(; kwargs...)
Stochastic gradient descent optimizer.
Vanilla SGD:
θ ← θ − η ∇
SGD with momentum:
νₜ ← μ νₜ₋₁ − η ∇
θ ← θ + νₜ
Arguments

* η: default 0.01, learning rate.
* μ: default 0, the momentum, usually set to 0.9 in this implementation.
* λ: default 0.0001, weight decay is equivalent to adding a global l2 regularizer to the parameters.
* clip: default 0, gradient clipping. If positive, will clip the gradient into the bounded range [-clip, clip].
* scale: default 0, gradient rescaling. If != 0, multiply the gradient with scale before updating. Often chosen to be 1.0 / batch_size. If left at the default, a high-level API like fit! will set it to 1.0 / batch_size, since fit! knows the batch_size.
* μ_sched::AbstractMomentumScheduler: default Momentum.Null(), a dynamic momentum scheduler. If set, will overwrite the momentum (μ) parameter.
* η_sched::AbstractLearningRateScheduler: default LearningRate.Fixed(η), a dynamic learning rate scheduler. If set, will overwrite the η parameter.
source
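A construction sketch using the keyword arguments documented above; the specific values are illustrative only.

```julia
using MXNet

# Momentum SGD with weight decay, gradient clipping, per-batch rescaling,
# and an exponentially decaying learning rate.
opt = SGD(η     = 0.01,
          μ     = 0.9,
          λ     = 0.0001,
          clip  = 5.0,
          scale = 1.0 / 64,
          η_sched = mx.LearningRate.Exp(0.01; γ = 0.95))

descend! = getupdater(opt)   # then call descend!(idx, ∇, W) as in the snippet at the top
```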
ADAM
# MXNet.mx.ADAM
— Type.
ADAM
The solver described in Diederik Kingma, Jimmy Ba: Adam: A Method for Stochastic Optimization. arXiv:1412.6980 [cs.LG].
ADAM(; kwargs...)
Arguments

* η: default 0.001, learning rate.
* β1: default 0.9.
* β2: default 0.999.
* ϵ: default 1e-8.
* clip: default 0, gradient clipping. If positive, will clip the gradient into the range [-clip, clip].
* scale: default 0, gradient rescaling. If != 0, multiply the gradient with scale before updating. Often chosen to be 1.0 / batch_size. If left at the default, a high-level API like fit! will set it to 1.0 / batch_size, since fit! knows the batch_size.
* λ: default 0.00001, weight decay is equivalent to adding a global l2 regularizer for all the parameters.
* η_sched::AbstractLearningRateScheduler: default LearningRate.Fixed(η), a dynamic learning rate scheduler. If set, will overwrite the η parameter.
source
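For reference, a plain-Julia sketch of the Adam update from the cited paper; it is illustrative only (not necessarily the library's exact implementation) and ignores clip, scale, and λ.

```julia
# One Adam step on plain Julia arrays; m and v are the running moment estimates.
function adam_step!(W, ∇, m, v, t; η = 0.001, β1 = 0.9, β2 = 0.999, ϵ = 1e-8)
    @. m = β1 * m + (1 - β1) * ∇        # biased first-moment estimate
    @. v = β2 * v + (1 - β2) * ∇^2      # biased second-moment estimate
    m_hat = m ./ (1 - β1^t)             # bias-corrected first moment
    v_hat = v ./ (1 - β2^t)             # bias-corrected second moment
    @. W -= η * m_hat / (sqrt(v_hat) + ϵ)
    return W
end

W, ∇ = [1.0, 2.0], [0.1, 0.2]
m, v = zeros(2), zeros(2)
adam_step!(W, ∇, m, v, 1)
```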
AdaGrad
# MXNet.mx.AdaGrad
— Type.
AdaGrad(; kwargs...)
Scale learning rates by dividing with the square root of accumulated squared gradients. See [1] for further description.
Arguments

* η: default 0.1, learning rate.
* ϵ: default 1e-6, small value added for numerical stability.
* clip: default 0, gradient clipping. If positive, will clip the gradient into the range [-clip, clip].
* scale: default 0, gradient rescaling. If != 0, multiply the gradient with scale before updating. Often chosen to be 1.0 / batch_size. If left at the default, a high-level API like fit! will set it to 1.0 / batch_size, since fit! knows the batch_size.
* λ: default 0.00001, weight decay is equivalent to adding a global l2 regularizer for all the parameters.
Notes
Using step size η, AdaGrad calculates the learning rate for feature i at time step t as:
ηₜ,ᵢ = η / √(∑ₜ′ gₜ′,ᵢ² + ϵ)
where the sum runs over the squared gradients of all steps t′ ≤ t; as such the learning rate is monotonically decreasing. Epsilon is not included in the typical formula, see [2].
References
1. Duchi, J., Hazan, E., & Singer, Y. (2011): Adaptive subgradient methods for online learning and stochastic optimization. JMLR, 12:2121-2159.
2. Chris Dyer: Notes on AdaGrad. http://www.ark.cs.cmu.edu/cdyer/adagrad.pdf
source
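A plain-Julia sketch of the accumulation rule in the Notes above; illustrative only, and it ignores clip, scale, and λ.

```julia
# One AdaGrad step; acc accumulates squared gradients across all steps.
function adagrad_step!(W, ∇, acc; η = 0.1, ϵ = 1e-6)
    @. acc += ∇^2                     # accumulated squared gradients (never resets)
    @. W   -= η * ∇ / sqrt(acc + ϵ)   # per-coordinate, monotonically shrinking step
    return W
end

W, ∇, acc = [1.0, 2.0], [0.5, -0.5], zeros(2)
adagrad_step!(W, ∇, acc)
```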
AdaDelta
# MXNet.mx.AdaDelta
— Type.
AdaDelta(; kwargs...)
Scale learning rates by the ratio of accumulated gradients to accumulated updates, see [1] and notes for further description.
Attributes

* η: default 1.0, learning rate.
* ρ: default 0.95, squared gradient moving average decay factor.
* ϵ: default 1e-6, small value added for numerical stability.
* clip: default 0, gradient clipping. If positive, will clip the gradient into the range [-clip, clip].
* scale: default 0, gradient rescaling. If != 0, multiply the gradient with scale before updating. Often chosen to be 1.0 / batch_size. If left at the default, a high-level API like fit! will set it to 1.0 / batch_size, since fit! knows the batch_size.
* λ: default 0.00001, weight decay is equivalent to adding a global l2 regularizer for all the parameters.
Notes
ρ should be between 0 and 1. A value of ρ close to 1 will decay the moving average slowly, and a value close to 0 will decay the moving average fast.
ρ = 0.95 and ϵ = 1e-6 are suggested in the paper and reported to work for multiple datasets (MNIST, speech). In the paper, no learning rate is considered (so η = 1.0); it is probably best to keep it at this value.
ϵ is important for the very first update (so the numerator does not become 0).
Using the step size η and a decay factor ρ, the learning rate ηₜ is calculated as:
rₜ = ρ rₜ₋₁ + (1 − ρ) ∇ₜ²
ηₜ = η √(sₜ₋₁ + ϵ) / √(rₜ + ϵ)
sₜ = ρ sₜ₋₁ + (1 − ρ) (ηₜ ∇ₜ)²
References
1. Zeiler, M. D. (2012): ADADELTA: An Adaptive Learning Rate Method. arXiv preprint arXiv:1212.5701.
source
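A plain-Julia sketch of the rule in the Notes above; illustrative only, and it ignores clip, scale, and λ.

```julia
# One AdaDelta step; acc_g and acc_Δ are the running averages of squared
# gradients and squared updates respectively.
function adadelta_step!(W, ∇, acc_g, acc_Δ; η = 1.0, ρ = 0.95, ϵ = 1e-6)
    @. acc_g = ρ * acc_g + (1 - ρ) * ∇^2            # squared-gradient average
    Δ = @. sqrt(acc_Δ + ϵ) / sqrt(acc_g + ϵ) * ∇    # rescaled update direction
    @. acc_Δ = ρ * acc_Δ + (1 - ρ) * Δ^2            # squared-update average
    @. W -= η * Δ
    return W
end

W, ∇ = [1.0, 2.0], [0.5, -0.5]
acc_g, acc_Δ = zeros(2), zeros(2)
adadelta_step!(W, ∇, acc_g, acc_Δ)
```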
AdaMax
# MXNet.mx.AdaMax
— Type.
AdaMax(; kwargs...)
This is a variant of the Adam algorithm based on the infinity norm. See [1] for further description.
Arguments

* η: default 0.002, learning rate.
* β1: default 0.9, exponential decay rate for the first moment estimates.
* β2: default 0.999, exponential decay rate for the weighted infinity norm estimates.
* ϵ: default 1e-8, small value added for numerical stability.
* clip: default 0, gradient clipping. If positive, will clip the gradient into the range [-clip, clip].
* scale: default 0, gradient rescaling. If != 0, multiply the gradient with scale before updating. Often chosen to be 1.0 / batch_size. If left at the default, a high-level API like fit! will set it to 1.0 / batch_size, since fit! knows the batch_size.
* λ: default 0.00001, weight decay is equivalent to adding a global l2 regularizer for all the parameters.
References
1. Kingma, Diederik, and Jimmy Ba (2014): Adam: A Method for Stochastic Optimization. Section 7. http://arxiv.org/abs/1412.6980.
source
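A plain-Julia sketch of the infinity-norm variant described in Section 7 of [1]; illustrative only, and it ignores clip, scale, and λ.

```julia
# One AdaMax step; m is the first-moment estimate, u the exponentially
# weighted infinity norm of past gradients.
function adamax_step!(W, ∇, m, u, t; η = 0.002, β1 = 0.9, β2 = 0.999, ϵ = 1e-8)
    @. m = β1 * m + (1 - β1) * ∇      # first-moment estimate
    @. u = max(β2 * u, abs(∇))        # infinity-norm accumulator
    @. W -= (η / (1 - β1^t)) * m / (u + ϵ)
    return W
end

W, ∇ = [1.0, 2.0], [0.5, -0.5]
m, u = zeros(2), zeros(2)
adamax_step!(W, ∇, m, u, 1)
```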
RMSProp
# MXNet.mx.RMSProp
— Type.
RMSProp(; kwargs...)
Scale learning rates by dividing with the moving average of the root mean squared (RMS) gradients. See [1] for further description.
Arguments

* η: default 0.1, learning rate.
* ρ: default 0.9, gradient moving average decay factor.
* ϵ: default 1e-8, small value added for numerical stability.
* clip: default 0, gradient clipping. If positive, will clip the gradient into the range [-clip, clip].
* scale: default 0, gradient rescaling. If != 0, multiply the gradient with scale before updating. Often chosen to be 1.0 / batch_size. If left at the default, a high-level API like fit! will set it to 1.0 / batch_size, since fit! knows the batch_size.
* λ: default 0.00001, weight decay is equivalent to adding a global l2 regularizer for all the parameters.
Notes
ρ should be between 0 and 1. A value of ρ close to 1 will decay the moving average slowly, and a value close to 0 will decay the moving average fast.
Using the step size η and a decay factor ρ, the learning rate ηₜ is calculated as:
rₜ = ρ rₜ₋₁ + (1 − ρ) ∇ₜ²
ηₜ = η / √(rₜ + ϵ)
References
1. Tieleman, T. and Hinton, G. (2012): Neural Networks for Machine Learning, Lecture 6.5 - rmsprop. Coursera. http://www.youtube.com/watch?v=O3sxAc4hxZU (formula @5:20)
source
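A plain-Julia sketch of the rule in the Notes above; illustrative only, and it ignores clip, scale, and λ.

```julia
# One RMSProp step; acc is the moving average of squared gradients.
function rmsprop_step!(W, ∇, acc; η = 0.1, ρ = 0.9, ϵ = 1e-8)
    @. acc = ρ * acc + (1 - ρ) * ∇^2   # decaying average of squared gradients
    @. W  -= η * ∇ / sqrt(acc + ϵ)     # scale the step by the RMS gradient
    return W
end

W, ∇, acc = [1.0, 2.0], [0.5, -0.5], zeros(2)
rmsprop_step!(W, ∇, acc)
```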
Nadam
# MXNet.mx.Nadam
— Type.
Nadam(; kwargs...)
Nesterov Adam optimizer: Adam with Nesterov momentum; see [1] and the notes for further description.
Arguments

* η: default 0.001, learning rate.
* β1: default 0.99.
* β2: default 0.999.
* ϵ: default 1e-8, small value added for numerical stability.
* clip: default 0, gradient clipping. If positive, will clip the gradient into the range [-clip, clip].
* scale: default 0, gradient rescaling. If != 0, multiply the gradient with scale before updating. Often chosen to be 1.0 / batch_size. If left at the default, a high-level API like fit! will set it to 1.0 / batch_size, since fit! knows the batch_size.
* λ: default 0.00001, weight decay is equivalent to adding a global l2 regularizer for all the parameters.
* η_sched::AbstractLearningRateScheduler: default nothing, a dynamic learning rate scheduler. If set, will overwrite the η parameter.
* μ_sched::NadamScheduler: default NadamScheduler(), a dynamic Nesterov momentum scheduler (see Momentum.NadamScheduler).
Notes
Default parameters follow those provided in the paper. It is recommended to leave the parameters of this optimizer at their default values.
References
1. Incorporating Nesterov Momentum into Adam.
2. On the importance of initialization and momentum in deep learning.
source
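A construction sketch using the documented keyword arguments together with the Nadam momentum scheduler; the values shown are just the documented defaults spelled out and are illustrative only.

```julia
using MXNet

# Nadam with an explicit momentum schedule.
opt = mx.Nadam(η  = 0.001,
               β1 = 0.99,
               β2 = 0.999,
               μ_sched = mx.Momentum.NadamScheduler(μ = 0.99, δ = 0.004))

descend! = getupdater(opt)   # then call descend!(idx, ∇, W) as in the snippet at the top
```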