colossalai.nn.optimizer.fused_lamb

class colossalai.nn.optimizer.fused_lamb.FusedLAMB(params, lr=0.001, bias_correction=True, betas=(0.9, 0.999), eps=1e-06, weight_decay=0.01, amsgrad=False, adam_w_mode=True, grad_averaging=True, set_grad_none=True, max_grad_norm=1.0, use_nvlamb=False)

Implements the LAMB algorithm.

Currently GPU-only. Requires ColossalAI to be installed with its CUDA extension: pip install -v --no-cache-dir --global-option="--cuda_ext" ./

This version of fused LAMB implements two fusions:

  • Fusion of the LAMB update’s elementwise operations.

  • A multi-tensor apply launch that batches the elementwise updates applied to all of the model’s parameters into one or a few kernel launches.

colossalai.nn.optimizer.FusedLAMB’s usage is identical to that of any ordinary PyTorch optimizer.

colossalai.nn.optimizer.FusedLAMB may be used with or without Amp.
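Since the usage mirrors an ordinary PyTorch optimizer, a training step might look like the minimal sketch below. The helper names build_fused_lamb and train_step are hypothetical and exist only for illustration; the sketch assumes a CUDA machine with ColossalAI installed with its CUDA extension, and a model whose parameters already live on the GPU.

```python
def build_fused_lamb(model, lr=1e-3, weight_decay=0.01):
    """Hypothetical helper: construct FusedLAMB for a model on the GPU."""
    # Imported lazily because the class needs ColossalAI's CUDA extension.
    from colossalai.nn.optimizer import FusedLAMB

    return FusedLAMB(model.parameters(), lr=lr, weight_decay=weight_decay)


def train_step(model, optimizer, loss_fn, inputs, targets):
    """Run one forward/backward/update cycle and return the loss."""
    optimizer.zero_grad()                   # clear gradients from the last step
    loss = loss_fn(model(inputs), targets)  # forward pass
    loss.backward()                         # populate parameter gradients
    optimizer.step()                        # apply the optimizer update
    return loss
```

Because train_step only relies on zero_grad() and step(), the same loop should work unchanged if the optimizer is swapped for another PyTorch optimizer.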

LAMB was proposed in the paper Large Batch Optimization for Deep Learning: Training BERT in 76 minutes.

Parameters
  • params (iterable) – iterable of parameters to optimize or dicts defining parameter groups.

  • lr (float, optional) – learning rate. (default: 1e-3)

  • betas (Tuple[float, float], optional) – coefficients used for computing running averages of gradient and its norm. (default: (0.9, 0.999))

  • eps (float, optional) – term added to the denominator to improve numerical stability. (default: 1e-6)

  • weight_decay (float, optional) – weight decay (L2 penalty) (default: 0.01)

  • amsgrad (boolean, optional) – whether to use the AMSGrad variant of this algorithm from the paper On the Convergence of Adam and Beyond. Not supported at present. (default: False)

  • adam_w_mode (boolean, optional) – whether to apply L2 regularization or decoupled weight decay. True selects decoupled weight decay (also known as AdamW). (default: True)

  • grad_averaging (bool, optional) – whether to apply (1-beta2) to grad when calculating running averages of gradient. (default: True)

  • set_grad_none (bool, optional) – whether to set grad to None when the zero_grad() method is called. (default: True)

  • max_grad_norm (float, optional) – value used to clip global grad norm (default: 1.0)

  • use_nvlamb (boolean, optional) – whether to apply the adaptive learning rate to parameters whose weight decay is 0.0. (default: False)
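To make the role of these parameters concrete, the following is an unfused, pure-Python sketch of one LAMB update for a single parameter tensor, following the general form of the update in the LAMB paper. The function lamb_step is hypothetical and is not the fused kernel; the mapping of adam_w_mode, bias_correction and use_nvlamb onto the update is an assumption based on the descriptions above, and grad_averaging and max_grad_norm are left out for brevity.

```python
import math


def lamb_step(param, grad, m, v, step, lr=1e-3, betas=(0.9, 0.999), eps=1e-6,
              weight_decay=0.01, bias_correction=True, adam_w_mode=True,
              use_nvlamb=False):
    """One reference LAMB update on lists of floats; `step` counts from 1.

    Returns (new_param, new_m, new_v). Illustrative only, not the fused kernel.
    """
    beta1, beta2 = betas
    if not adam_w_mode:
        # L2 regularization: fold the penalty into the gradient.
        grad = [g + weight_decay * p for g, p in zip(grad, param)]

    # Running averages of the gradient and of its square.
    new_m = [beta1 * mi + (1 - beta1) * g for mi, g in zip(m, grad)]
    new_v = [beta2 * vi + (1 - beta2) * g * g for vi, g in zip(v, grad)]

    # Optional bias correction for the zero-initialised averages.
    c1 = 1 - beta1 ** step if bias_correction else 1.0
    c2 = 1 - beta2 ** step if bias_correction else 1.0
    update = [(mi / c1) / (math.sqrt(vi / c2) + eps)
              for mi, vi in zip(new_m, new_v)]

    if adam_w_mode:
        # Decoupled weight decay: added to the update, not to the gradient.
        update = [u + weight_decay * p for u, p in zip(update, param)]

    # Layer-wise trust ratio ||param|| / ||update||. It is skipped for
    # parameters with zero weight decay unless use_nvlamb is set.
    p_norm = math.sqrt(sum(p * p for p in param))
    u_norm = math.sqrt(sum(u * u for u in update))
    adaptive = use_nvlamb or weight_decay != 0.0
    trust = p_norm / u_norm if adaptive and p_norm > 0 and u_norm > 0 else 1.0

    new_param = [p - lr * trust * u for p, u in zip(param, update)]
    return new_param, new_m, new_v
```

With weight_decay=0.0 and use_nvlamb=False the trust ratio stays at 1 and the step reduces to a plain Adam-style update; setting use_nvlamb=True rescales the step by the ratio of the parameter norm to the update norm.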

step(closure=None)

Performs a single optimization step.

Parameters

closure (callable, optional) – A closure that reevaluates the model and returns the loss.
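A closure passed to step() typically clears the gradients, recomputes the loss, and runs the backward pass before returning the loss. The sketch below shows that pattern; step_with_closure is a hypothetical helper, and it assumes that step() returns the value produced by the closure, as standard PyTorch optimizers do.

```python
def step_with_closure(optimizer, model, loss_fn, inputs, targets):
    """Perform one optimization step, letting the optimizer call the closure."""

    def closure():
        optimizer.zero_grad()                   # start from clean gradients
        loss = loss_fn(model(inputs), targets)  # reevaluate the model
        loss.backward()                         # recompute gradients
        return loss                             # the closure must return the loss

    # Assumption: step() forwards the closure's return value to the caller.
    return optimizer.step(closure)
```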