colossalai.nn.optimizer.fused_sgd

class colossalai.nn.optimizer.fused_sgd.FusedSGD(params, lr=<required parameter>, momentum=0, dampening=0, weight_decay=0, nesterov=False, wd_after_momentum=False, materialize_master_grads=True, set_grad_none=False)

Implements stochastic gradient descent (optionally with momentum).

Currently GPU-only. Requires ColossalAI to be built with its CUDA extensions, for example by running pip install -v --no-cache-dir --global-option="--cuda_ext" ./ from the ColossalAI source directory.

This version of fused SGD implements two fusions:

  • Fusion of the SGD update’s elementwise operations

  • A multi-tensor apply launch that batches the elementwise updates applied to all the model’s parameters into one or a few kernel launches.

colossalai.nn.optimizer.FusedSGD may be used as a drop-in replacement for torch.optim.SGD.

colossalai.nn.optimizer.FusedSGD may be used with or without Amp.

Nesterov momentum is based on the formula from "On the importance of initialization and momentum in deep learning".

Parameters
  • params (iterable) – iterable of parameters to optimize or dicts defining parameter groups

  • lr (float) – learning rate

  • momentum (float, optional) – momentum factor (default: 0)

  • weight_decay (float, optional) – weight decay (L2 penalty) (default: 0)

  • dampening (float, optional) – dampening for momentum (default: 0)

  • nesterov (bool, optional) – enables Nesterov momentum (default: False)
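To make the roles of these hyperparameters concrete, here is a minimal pure-Python sketch of the per-element update. It assumes the optimizer follows torch.optim.SGD semantics (as the drop-in claim suggests) with weight decay applied before momentum. The function name sgd_step and the list-based state are illustrative only and are not part of ColossalAI; the real optimizer performs this arithmetic in fused CUDA kernels.

```python
def sgd_step(params, grads, bufs, lr, momentum=0.0, dampening=0.0,
             weight_decay=0.0, nesterov=False):
    """Apply one SGD step to flat lists of floats, updating params and bufs in place.

    Illustrative reference only (assumed torch.optim.SGD semantics).
    """
    for i, (p, g) in enumerate(zip(params, grads)):
        if weight_decay != 0.0:
            g = g + weight_decay * p              # L2 penalty folded into the gradient
        if momentum != 0.0:
            if bufs[i] is None:
                bufs[i] = g                       # first step: buffer starts as the gradient
            else:
                bufs[i] = momentum * bufs[i] + (1.0 - dampening) * g
            # Nesterov looks ahead along the buffer; plain momentum uses the buffer itself
            g = g + momentum * bufs[i] if nesterov else bufs[i]
        params[i] = p - lr * g
    return params


params = [1.0, -2.0]
grads = [0.5, 0.25]
bufs = [None, None]                               # one momentum buffer per parameter

sgd_step(params, grads, bufs, lr=0.1, momentum=0.9)
after_first = list(params)
sgd_step(params, grads, bufs, lr=0.1, momentum=0.9)
after_second = list(params)
```

With momentum=0 the buffers are never touched and the update reduces to p = p - lr * g.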

Note

The implementation of SGD with Momentum/Nesterov subtly differs from Sutskever et al. and implementations in some other frameworks. Considering the specific case of Momentum, the update can be written as

\[\begin{split}v = \rho * v + g \\ p = p - lr * v\end{split}\]

where p, g, v and \(\rho\) denote the parameters, gradient, velocity, and momentum respectively. This is in contrast to Sutskever et al. and other frameworks, which employ an update of the form

\[\begin{split}v = \rho * v + lr * g \\ p = p - v\end{split}\]

The Nesterov version is analogously modified.
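The practical consequence of the two forms can be checked numerically. The sketch below uses plain Python floats and hypothetical helper names (framework_form, sutskever_form). Starting from zero velocity, both forms yield the same parameters when the learning rate is constant, because the velocities differ only by a factor of lr. They diverge once the learning rate changes between steps, since the second form bakes the old learning rate into the velocity.

```python
def framework_form(p, grads, lrs, rho):
    """v = rho * v + g;  p = p - lr * v  (the form described in this note)."""
    v = 0.0
    for g, lr in zip(grads, lrs):
        v = rho * v + g
        p = p - lr * v
    return p


def sutskever_form(p, grads, lrs, rho):
    """v = rho * v + lr * g;  p = p - v  (the form used by Sutskever et al.)."""
    v = 0.0
    for g, lr in zip(grads, lrs):
        v = rho * v + lr * g
        p = p - v
    return p


grads = [1.0, 0.5, -0.25, 2.0]
constant_lr = [0.1, 0.1, 0.1, 0.1]
decayed_lr = [0.1, 0.1, 0.01, 0.01]               # learning rate drops after two steps

same_a = framework_form(1.0, grads, constant_lr, rho=0.9)
same_b = sutskever_form(1.0, grads, constant_lr, rho=0.9)

diff_a = framework_form(1.0, grads, decayed_lr, rho=0.9)
diff_b = sutskever_form(1.0, grads, decayed_lr, rho=0.9)
```

Here same_a and same_b agree up to floating-point rounding, while diff_a and diff_b do not. This matters mainly when combining momentum with a learning-rate schedule.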

step(closure=None)

Performs a single optimization step.

Parameters

closure (callable, optional) – A closure that reevaluates the model and returns the loss.
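The closure contract can be illustrated with a toy stand-in. ToySGD below is a hypothetical pure-Python class, not the ColossalAI implementation. It assumes the usual torch.optim convention, in which step calls the closure first (so the closure can recompute the loss and refresh the gradients) and then returns the closure's loss.

```python
class ToySGD:
    """Hypothetical stand-in that mirrors the step(closure) calling convention."""

    def __init__(self, params, grads, lr):
        self.params = params    # list of floats, updated in place
        self.grads = grads      # list of floats, refreshed by the closure
        self.lr = lr

    def step(self, closure=None):
        loss = None
        if closure is not None:
            loss = closure()    # re-evaluate the model before updating
        for i, g in enumerate(self.grads):
            self.params[i] -= self.lr * g
        return loss


params = [2.0, -4.0]
grads = [0.0, 0.0]
opt = ToySGD(params, grads, lr=0.25)


def closure():
    # Toy "model": loss = sum(p^2), so the gradient of each p is 2 * p.
    for i, p in enumerate(params):
        grads[i] = 2.0 * p
    return sum(p * p for p in params)


loss = opt.step(closure)
```

With a real PyTorch-style optimizer the closure would typically zero the gradients, run the forward pass, call backward on the loss, and return it.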