colossalai.nn.optimizer
- class colossalai.nn.optimizer.FusedLAMB(params, lr=0.001, bias_correction=True, betas=(0.9, 0.999), eps=1e-06, weight_decay=0.01, amsgrad=False, adam_w_mode=True, grad_averaging=True, set_grad_none=True, max_grad_norm=1.0, use_nvlamb=False)
Implements LAMB algorithm.
Currently GPU-only. Requires ColossalAI to be installed via
pip install -v --no-cache-dir --global-option="--cuda_ext" ./
This version of fused LAMB implements 2 fusions:
Fusion of the LAMB update’s elementwise operations
A multi-tensor apply launch that batches the elementwise updates applied to all the model’s parameters into one or a few kernel launches.
colossalai.nn.optimizer.FusedLAMB’s usage is identical to any ordinary PyTorch optimizer.
colossalai.nn.optimizer.FusedLAMB may be used with or without Amp.
LAMB was proposed in Large Batch Optimization for Deep Learning: Training BERT in 76 minutes.
- Parameters
params (iterable) – iterable of parameters to optimize or dicts defining parameter groups.
lr (float, optional) – learning rate. (default: 1e-3)
betas (Tuple[float, float], optional) – coefficients used for computing running averages of gradient and its norm. (default: (0.9, 0.999))
eps (float, optional) – term added to the denominator to improve numerical stability. (default: 1e-6)
weight_decay (float, optional) – weight decay (L2 penalty) (default: 0.01)
amsgrad (boolean, optional) – whether to use the AMSGrad variant of this algorithm from the paper On the Convergence of Adam and Beyond. Not currently supported. (default: False)
adam_w_mode (boolean, optional) – selects how weight decay is applied: True for decoupled weight decay (also known as AdamW), False for L2 regularization. (default: True)
grad_averaging (bool, optional) – whether to apply (1-beta2) to grad when calculating running averages of gradient. (default: True)
set_grad_none (bool, optional) – whether to set grad to None when the zero_grad() method is called. (default: True)
max_grad_norm (float, optional) – value used to clip global grad norm (default: 1.0)
use_nvlamb (boolean, optional) – apply the adaptive learning rate to parameters whose weight decay is 0.0 (default: False)
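As an illustration of what max_grad_norm controls, the following is a minimal pure-Python sketch of clipping by the global gradient norm. It is not the fused CUDA kernel: the helper name clip_by_global_norm and the list-of-lists gradient layout are assumptions made for this example, and the exact clipping rule inside FusedLAMB may differ.

```python
import math


def clip_by_global_norm(grads, max_grad_norm=1.0):
    """Rescale all gradients so their combined L2 norm is at most max_grad_norm.

    `grads` is a list of gradients, one flat list of floats per parameter
    (a stand-in for tensors; hypothetical layout for illustration only).
    Returns the rescaled gradients and the norm measured before clipping.
    """
    # The global norm is taken over every entry of every gradient at once.
    global_norm = math.sqrt(sum(g * g for grad in grads for g in grad))
    if global_norm <= max_grad_norm:
        # Already within the limit: return unchanged copies.
        return [list(grad) for grad in grads], global_norm
    scale = max_grad_norm / global_norm
    return [[g * scale for g in grad] for grad in grads], global_norm


# Two "parameters" whose gradients have a combined norm of 5.0.
grads = [[3.0, 0.0], [0.0, 4.0]]
clipped, norm_before = clip_by_global_norm(grads, max_grad_norm=1.0)
norm_after = math.sqrt(sum(g * g for grad in clipped for g in grad))
```

Because every gradient is multiplied by the same factor, the direction of the overall update is preserved and only its length changes.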
- step(closure=None)
Performs a single optimization step.
- Parameters
closure (callable, optional) – A closure that reevaluates the model and returns the loss.
- class colossalai.nn.optimizer.FusedAdam(params, lr=0.001, bias_correction=True, betas=(0.9, 0.999), eps=1e-08, adamw_mode=True, weight_decay=0.0, amsgrad=False, set_grad_none=True)
Implements Adam algorithm.
Currently GPU-only. Requires ColossalAI to be installed via
pip install .
This version of fused Adam implements 2 fusions:
Fusion of the Adam update’s elementwise operations
A multi-tensor apply launch that batches the elementwise updates applied to all the model’s parameters into one or a few kernel launches.
colossalai.nn.optimizer.FusedAdam may be used as a drop-in replacement for torch.optim.AdamW, or torch.optim.Adam with adamw_mode=False.
colossalai.nn.optimizer.FusedAdam may be used with or without Amp.
Adam was proposed in Adam: A Method for Stochastic Optimization.
- Parameters
params (iterable) – iterable of parameters to optimize or dicts defining parameter groups.
lr (float, optional) – learning rate. (default: 1e-3)
betas (Tuple[float, float], optional) – coefficients used for computing running averages of gradient and its square. (default: (0.9, 0.999))
eps (float, optional) – term added to the denominator to improve numerical stability. (default: 1e-8)
weight_decay (float, optional) – weight decay (L2 penalty) (default: 0)
amsgrad (boolean, optional) – whether to use the AMSGrad variant of this algorithm from the paper On the Convergence of Adam and Beyond. Not supported in FusedAdam. (default: False)
adamw_mode (boolean, optional) – selects how weight decay is applied: True for decoupled weight decay (also known as AdamW), False for L2 regularization. (default: True)
set_grad_none (bool, optional) – whether to set grad to None when the zero_grad() method is called. (default: True)
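To make the adamw_mode distinction concrete, here is a scalar sketch of one Adam step in plain Python. The function adam_step is hypothetical and follows the textbook update with bias correction; it is meant to show where the decay term enters, not to reproduce the fused kernel.

```python
import math


def adam_step(p, g, m, v, step, lr=1e-3, betas=(0.9, 0.999), eps=1e-8,
              weight_decay=0.0, adamw_mode=True):
    """One scalar Adam update; returns the new (p, m, v)."""
    beta1, beta2 = betas
    if not adamw_mode:
        # L2 regularization: the decay is folded into the gradient, so it
        # also passes through the moment estimates below.
        g = g + weight_decay * p
    m = beta1 * m + (1 - beta1) * g          # running average of gradient
    v = beta2 * v + (1 - beta2) * g * g      # running average of its square
    m_hat = m / (1 - beta1 ** step)          # bias correction
    v_hat = v / (1 - beta2 ** step)
    update = m_hat / (math.sqrt(v_hat) + eps)
    if adamw_mode:
        # Decoupled weight decay: added after the adaptive scaling.
        update = update + weight_decay * p
    return p - lr * update, m, v


# Same starting point, same gradient, different decay handling.
p_decoupled, _, _ = adam_step(1.0, 0.5, 0.0, 0.0, step=1,
                              weight_decay=0.1, adamw_mode=True)
p_l2, _, _ = adam_step(1.0, 0.5, 0.0, 0.0, step=1,
                       weight_decay=0.1, adamw_mode=False)
```

With weight_decay=0 the two modes produce the same result; they diverge only when the decay term is non-zero.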
- step(closure=None, grads=None, output_params=None, scale=None, grad_norms=None)
Performs a single optimization step.
- Parameters
closure (callable, optional) – A closure that reevaluates the model and returns the loss.
The remaining arguments are deprecated, and are only retained (for the moment) for error-checking purposes.
- class colossalai.nn.optimizer.FusedSGD(params, lr=<required parameter>, momentum=0, dampening=0, weight_decay=0, nesterov=False, wd_after_momentum=False, materialize_master_grads=True, set_grad_none=False)
Implements stochastic gradient descent (optionally with momentum).
Currently GPU-only. Requires ColossalAI to be installed via
pip install -v --no-cache-dir --global-option="--cuda_ext" ./
This version of fused SGD implements 2 fusions:
Fusion of the SGD update’s elementwise operations
A multi-tensor apply launch that batches the elementwise updates applied to all the model’s parameters into one or a few kernel launches.
colossalai.nn.optimizer.FusedSGD may be used as a drop-in replacement for torch.optim.SGD.
colossalai.nn.optimizer.FusedSGD may be used with or without Amp.
Nesterov momentum is based on the formula from On the importance of initialization and momentum in deep learning.
- Parameters
params (iterable) – iterable of parameters to optimize or dicts defining parameter groups
lr (float) – learning rate
momentum (float, optional) – momentum factor (default: 0)
weight_decay (float, optional) – weight decay (L2 penalty) (default: 0)
dampening (float, optional) – dampening for momentum (default: 0)
nesterov (bool, optional) – enables Nesterov momentum (default: False)
Note
The implementation of SGD with Momentum/Nesterov subtly differs from Sutskever et al. and implementations in some other frameworks. Considering the specific case of Momentum, the update can be written as
\[\begin{split}v = \rho * v + g \\ p = p - lr * v\end{split}\]
where p, g, v and \(\rho\) denote the parameters, gradient, velocity, and momentum respectively. This is in contrast to Sutskever et al. and other frameworks which employ an update of the form
\[\begin{split}v = \rho * v + lr * g \\ p = p - v\end{split}\]
The Nesterov version is analogously modified.
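The practical effect of the two momentum formulations in this note can be checked with a small scalar sketch. The function names and the per-step learning-rate list are assumptions introduced for illustration. With a constant learning rate both forms trace the same parameter trajectory; they diverge once the learning rate changes between steps.

```python
def momentum_lr_outside(p, grads, lrs, rho):
    """v = rho * v + g ; p = p - lr * v  (the first form in the note)."""
    v = 0.0
    for g, lr in zip(grads, lrs):
        v = rho * v + g
        p = p - lr * v
    return p


def momentum_lr_inside(p, grads, lrs, rho):
    """v = rho * v + lr * g ; p = p - v  (the Sutskever et al. form)."""
    v = 0.0
    for g, lr in zip(grads, lrs):
        v = rho * v + lr * g
        p = p - v
    return p


grads = [1.0, 1.0]

# Constant learning rate: the velocities differ only by the factor lr.
const_a = momentum_lr_outside(0.0, grads, [0.1, 0.1], rho=0.9)
const_b = momentum_lr_inside(0.0, grads, [0.1, 0.1], rho=0.9)

# Learning rate dropped after the first step: the results no longer match,
# because the second form keeps the old learning rate baked into v.
sched_a = momentum_lr_outside(0.0, grads, [0.1, 0.01], rho=0.9)
sched_b = momentum_lr_inside(0.0, grads, [0.1, 0.01], rho=0.9)
```

This is why the difference matters mainly when a learning-rate schedule is in use.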
- step(closure=None)
Performs a single optimization step.
- Parameters
closure (callable, optional) – A closure that reevaluates the model and returns the loss.
- class colossalai.nn.optimizer.Lamb(params, lr=0.001, betas=(0.9, 0.999), eps=1e-06, weight_decay=0, adam=False)
Implements the Lamb algorithm, proposed in Large Batch Optimization for Deep Learning: Training BERT in 76 minutes.
- Parameters
params (iterable) – iterable of parameters to optimize or dicts defining parameter groups
lr (float, optional) – learning rate (default: 1e-3)
betas (Tuple[float, float], optional) – coefficients used for computing running averages of gradient and its square (default: (0.9, 0.999))
eps (float, optional) – term added to the denominator to improve numerical stability (default: 1e-6)
weight_decay (float, optional) – weight decay (L2 penalty) (default: 0)
adam (bool, optional) – always use trust ratio = 1, which turns this into Adam. Useful for comparison purposes.
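The role of the trust ratio, and of the adam flag that pins it to 1, can be sketched in plain Python. This is a simplified illustration under stated assumptions: the Adam-style update direction is taken as already computed, the helper names are hypothetical, and any clamping the library applies to the norms is omitted.

```python
import math


def l2_norm(xs):
    return math.sqrt(sum(x * x for x in xs))


def lamb_like_step(w, adam_update, lr=1e-3, weight_decay=0.0, adam=False):
    """Apply one layer-wise update; returns (new_w, trust_ratio).

    `adam_update` stands for the Adam direction m / (sqrt(v) + eps) of this
    layer, passed in precomputed to keep the sketch short.
    """
    # Full direction: Adam update plus the weight decay term.
    direction = [u + weight_decay * wi for u, wi in zip(adam_update, w)]
    w_norm = l2_norm(w)
    d_norm = l2_norm(direction)
    if adam or w_norm == 0.0 or d_norm == 0.0:
        trust_ratio = 1.0  # plain Adam-sized step
    else:
        trust_ratio = w_norm / d_norm  # scale step to the layer's weight norm
    new_w = [wi - lr * trust_ratio * di for wi, di in zip(w, direction)]
    return new_w, trust_ratio


w = [3.0, 4.0]            # weight norm 5.0
adam_update = [0.6, 0.8]  # update norm 1.0
_, ratio_lamb = lamb_like_step(w, adam_update)
_, ratio_adam = lamb_like_step(w, adam_update, adam=True)
```

Here the layer-wise scaling makes the step five times larger than the Adam-only step, which is the behaviour the adam=True setting switches off for comparison runs.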
- step(closure=None)
Performs a single optimization step.
- Parameters
closure (callable, optional) – A closure that reevaluates the model and returns the loss.
- class colossalai.nn.optimizer.Lars(params, lr=0.001, momentum=0, eeta=0.001, weight_decay=0, epsilon=0.0)
Implements the LARS optimizer from “Large batch training of convolutional networks”.
- Parameters
params (iterable) – iterable of parameters to optimize or dicts defining parameter groups
lr (float, optional) – learning rate (default: 1e-3)
momentum (float, optional) – momentum factor (default: 0)
eeta (float, optional) – LARS coefficient as used in the paper (default: 1e-3)
weight_decay (float, optional) – weight decay (L2 penalty) (default: 0)
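For orientation, the layer-wise learning-rate scaling described in the LARS paper can be written as a short function. This sketch follows the paper's formula; the function name is hypothetical, and adding epsilon to the denominator is an assumption about how that argument is used, since it is not documented above.

```python
import math


def l2_norm(xs):
    return math.sqrt(sum(x * x for x in xs))


def lars_local_lr(w, g, eeta=1e-3, weight_decay=0.0, epsilon=0.0):
    """Local learning-rate multiplier for one layer.

    local_lr = eeta * ||w|| / (||g|| + weight_decay * ||w|| + epsilon)
    Falls back to 1.0 when either norm is zero (an assumed convention).
    """
    w_norm = l2_norm(w)
    g_norm = l2_norm(g)
    if w_norm == 0.0 or g_norm == 0.0:
        return 1.0
    return eeta * w_norm / (g_norm + weight_decay * w_norm + epsilon)


w = [3.0, 4.0]   # weight norm 5.0
g = [0.6, 0.8]   # gradient norm 1.0
local_lr = lars_local_lr(w, g, eeta=1e-3)
```

The multiplier grows with the ratio of weight norm to gradient norm, so layers with small gradients relative to their weights take proportionally larger steps.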
- step(closure=None)
Performs a single optimization step.
- Parameters
closure (callable, optional) – A closure that reevaluates the model and returns the loss.