colossalai.utils.gradient_accumulation

colossalai.utils.gradient_accumulation.accumulate_gradient(model, optimizer, dataloader, accumulate_size, gradient_handlers=None, lr_scheduler=None)[source]

Turns the model, optimizer, and dataloader into their corresponding objects for gradient accumulation.

Parameters
  • model (torch.nn.Module) – your model object for gradient accumulation.

  • optimizer (torch.optim.Optimizer) – your optimizer object for gradient accumulation.

  • dataloader (torch.utils.data.DataLoader or iterable objects) – your dataloader object; it is iterated via iter(dataloader).

  • accumulate_size (int) – the number of steps to accumulate gradients

  • gradient_handlers (List[colossalai.engine.BaseGradientHandler]) – list of gradient handler objects. Defaults to None.

  • lr_scheduler (torch.optim.lr_scheduler or colossalai.nn.lr_scheduler) – your lr_scheduler object for gradient accumulation. Defaults to None.

More details about gradient_handlers can be found in Gradient_handler.

More details about lr_scheduler can be found in lr_scheduler and how to adjust learning rate.

class colossalai.utils.gradient_accumulation.GradAccumDataloader(dataloader, accumulate_size)[source]

A wrapper for dataloader to enable gradient accumulation by dropping the last incomplete steps.

Note

The dataloader drops the last incomplete steps for gradient accumulation. For example, if a dataloader has 10 batches of data and the accumulate size is 4, the model parameters are updated only twice, at step 4 and step 8. The last two batches do not form a complete 4-step cycle, so this class skips them automatically. If the dataloader is not a standard PyTorch dataloader (e.g. a DALI dataloader), this class automatically consumes the remaining 2 batches (loading the data without using it).

Parameters
  • dataloader (Iterable) – Your dataloader object for gradient accumulation.

  • accumulate_size (int) – The number of steps to accumulate gradients.
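
The arithmetic in the note above can be reproduced with a small pure-Python sketch. It does not use GradAccumDataloader itself; the helper names usable_steps and iterate_complete_cycles are hypothetical and only illustrate the "drop the last incomplete steps" rule under the assumption that the number of batches is known in advance.

```python
from typing import Iterator, List


def usable_steps(num_batches: int, accumulate_size: int) -> int:
    """Number of batches that belong to a complete accumulation cycle."""
    return num_batches - num_batches % accumulate_size


def iterate_complete_cycles(batches: List[int], accumulate_size: int) -> Iterator[int]:
    """Yield only the batches that fall inside a complete cycle (hypothetical helper)."""
    limit = usable_steps(len(batches), accumulate_size)
    for index, batch in enumerate(batches):
        if index >= limit:
            break  # trailing batches cannot complete a cycle, so they are skipped
        yield batch


# 10 batches with an accumulate size of 4, as in the note above.
batches = list(range(10))
kept = list(iterate_complete_cycles(batches, accumulate_size=4))

# Parameter updates happen at the end of every complete 4-step cycle.
update_steps = [step for step in range(1, len(kept) + 1) if step % 4 == 0]
```

Here kept holds the first 8 batches and update_steps is [4, 8], matching the two updates described in the note.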

class colossalai.utils.gradient_accumulation.GradAccumOptimizer(optim, accumulate_size, model=None)[source]

A wrapper for the optimizer to enable gradient accumulation by skipping the steps before accumulation size is reached.

Parameters
  • optim (torch.optim.Optimizer) – Your optimizer object for gradient accumulation.

  • accumulate_size (int) – The number of steps to accumulate gradients.

  • model (torch.nn.Module) – Your model object to check if it is DistributedDataParallel for special handling of no_sync() context.
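
The "skip the steps before the accumulation size is reached" behaviour can be sketched without any framework. The classes below are hypothetical stand-ins, not the GradAccumOptimizer implementation: CountingOptimizer only records calls, and AccumulatingOptimizer forwards step() once per cycle. Guarding zero_grad() in the same way is an assumption made here so that accumulated gradients are not cleared mid-cycle.

```python
class CountingOptimizer:
    """Stand-in for a real optimizer; it only records how often it is called."""

    def __init__(self) -> None:
        self.updates = 0
        self.zero_grad_calls = 0

    def step(self) -> None:
        self.updates += 1

    def zero_grad(self) -> None:
        self.zero_grad_calls += 1


class AccumulatingOptimizer:
    """Hypothetical wrapper: forwards step() only once per accumulate_size calls."""

    def __init__(self, optim: CountingOptimizer, accumulate_size: int) -> None:
        self.optim = optim
        self.accumulate_size = accumulate_size
        self.accumulate_step = 0

    def step(self) -> None:
        self.accumulate_step += 1
        if self.accumulate_step == self.accumulate_size:
            self.optim.step()  # a full cycle has been accumulated
            self.accumulate_step = 0

    def zero_grad(self) -> None:
        # Assumption: gradients are only cleared at a cycle boundary,
        # otherwise the partially accumulated values would be lost.
        if self.accumulate_step == 0:
            self.optim.zero_grad()


inner = CountingOptimizer()
wrapped = AccumulatingOptimizer(inner, accumulate_size=4)

# A training loop body would compute the loss and call backward() between these calls.
for _ in range(8):
    wrapped.zero_grad()
    wrapped.step()
```

After 8 iterations with an accumulate size of 4, the inner optimizer has been stepped twice and its gradients cleared twice.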

class colossalai.utils.gradient_accumulation.GradAccumLrSchedulerByStep(lr_scheduler, accumulate_size)[source]

A wrapper for the LR scheduler to enable gradient accumulation by skipping the steps before accumulation size is reached.

Parameters
  • lr_scheduler (torch.optim.lr_scheduler._LRScheduler) – Your lr_scheduler object for gradient accumulation.

  • accumulate_size (int) – The number of steps to accumulate gradients.
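
The effect on the learning rate can be traced with a similar sketch. HalvingScheduler and AccumulatingScheduler are hypothetical classes written for illustration, not part of colossalai or torch; the point is only that the wrapped scheduler advances once per accumulation cycle rather than once per batch.

```python
class HalvingScheduler:
    """Stand-in scheduler that halves the learning rate on every step()."""

    def __init__(self, lr: float) -> None:
        self.lr = lr

    def step(self) -> None:
        self.lr *= 0.5


class AccumulatingScheduler:
    """Hypothetical wrapper: advances the scheduler once per accumulate_size calls."""

    def __init__(self, lr_scheduler: HalvingScheduler, accumulate_size: int) -> None:
        self.lr_scheduler = lr_scheduler
        self.accumulate_size = accumulate_size
        self.accumulate_step = 0

    def step(self) -> None:
        self.accumulate_step += 1
        if self.accumulate_step == self.accumulate_size:
            self.lr_scheduler.step()
            self.accumulate_step = 0


scheduler = HalvingScheduler(lr=1.0)
wrapped_scheduler = AccumulatingScheduler(scheduler, accumulate_size=4)

# Record the learning rate seen after each of 8 per-batch step() calls.
lr_trace = []
for _ in range(8):
    wrapped_scheduler.step()
    lr_trace.append(scheduler.lr)
```

With an accumulate size of 4, the learning rate stays constant within a cycle and changes only on the 4th and 8th calls.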

class colossalai.utils.gradient_accumulation.GradAccumGradientHandler(grad_handler, accumulate_size)[source]

A wrapper for the gradient handler to enable gradient accumulation by skipping the steps before accumulation size is reached.

Parameters
  • grad_handler (colossalai.engine.BaseGradientHandler) – Your gradient_handler object for gradient accumulation; it is called once accumulate_size steps have been reached.

  • accumulate_size (int) – The number of steps to accumulate gradients.

More details about gradient_handlers can be found in Gradient_handler.