colossalai.engine.gradient_handler

class colossalai.engine.gradient_handler.BaseGradientHandler(model, optimizer)

A basic helper class to handle all-reduce operations of gradients across different parallel groups before optimization.

Parameters
  • model (Module) – Model where the gradients accumulate

  • optimizer (Optimizer) – Optimizer for updating the parameters

abstract handle_gradient()

An abstract method that accumulates (all-reduces) gradients across different parallel groups. Every subclass must implement it: users can write their own subclass or use one of the pre-defined subclasses.
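The subclassing pattern described above can be sketched in plain Python. This is only an illustration, not the library's source: `SketchGradientHandler`, `LoggingGradientHandler`, and the dict-based `toy_model` are hypothetical names, and no real collective communication takes place.

```python
from abc import ABC, abstractmethod


class SketchGradientHandler(ABC):
    """Hypothetical stand-in for a base handler; not the library's source."""

    def __init__(self, model, optimizer):
        self._model = model
        self._optimizer = optimizer

    @abstractmethod
    def handle_gradient(self):
        """Synchronize gradients before the optimizer step."""


class LoggingGradientHandler(SketchGradientHandler):
    """Toy subclass: records which parameters would be synchronized."""

    def __init__(self, model, optimizer):
        super().__init__(model, optimizer)
        self.seen = []

    def handle_gradient(self):
        # A real handler would issue collective communication here;
        # this sketch only collects names of parameters that have gradients.
        for name, grad in self._model.items():
            if grad is not None:
                self.seen.append(name)


# The model is a plain dict of name -> gradient, purely for illustration.
toy_model = {"weight": [0.1, 0.2], "bias": None}
handler = LoggingGradientHandler(toy_model, optimizer=None)
handler.handle_gradient()
```

Because `handle_gradient()` is abstract in this sketch, instantiating the base class directly raises `TypeError`, so every concrete handler has to supply its own synchronization logic.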

class colossalai.engine.gradient_handler.DataParallelGradientHandler(model, optimizer)

A helper class to handle all-reduce operations in a data parallel group. handle_gradient() performs an all-reduce collective communication across the data parallel group. To make communication more efficient, it groups the gradients of all parameters of the same type into buckets.

handle_gradient()

A method that runs an all-reduce operation in a data parallel group.
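The bucketing idea can be illustrated with a small simulation in plain Python. It is a sketch under stated assumptions, not the handler's implementation: gradients are lists of floats, "type" is taken to mean a dtype label, the all-reduce is simulated by an element-wise mean across ranks (averaging is an assumption; the text above only says all-reduce), and the helper names `bucket_by_dtype`, `simulated_all_reduce_mean`, and `sync_gradients` are hypothetical.

```python
from collections import defaultdict


def bucket_by_dtype(params):
    """Group (name, dtype, grad) entries by dtype, skipping missing grads."""
    buckets = defaultdict(list)
    for name, dtype, grad in params:
        if grad is not None:
            buckets[dtype].append((name, grad))
    return buckets


def simulated_all_reduce_mean(per_rank_flat):
    """Element-wise mean over ranks; stands in for a real all-reduce."""
    world_size = len(per_rank_flat)
    return [sum(vals) / world_size for vals in zip(*per_rank_flat)]


def sync_gradients(per_rank_params):
    """Return averaged gradients, using one simulated all-reduce per dtype."""
    per_rank_buckets = [bucket_by_dtype(p) for p in per_rank_params]
    synced = {}
    for dtype, entries in per_rank_buckets[0].items():
        # Flatten every gradient in the bucket into one buffer per rank.
        flats = [
            [x for _, grad in rank_buckets[dtype] for x in grad]
            for rank_buckets in per_rank_buckets
        ]
        reduced = simulated_all_reduce_mean(flats)
        # Copy slices of the reduced buffer back to each parameter.
        offset = 0
        for name, grad in entries:
            synced[name] = reduced[offset:offset + len(grad)]
            offset += len(grad)
    return synced


# Two simulated ranks holding the same parameters with different gradients.
rank0 = [("w", "fp16", [1.0, 2.0]), ("b", "fp16", [3.0]), ("s", "fp32", [10.0])]
rank1 = [("w", "fp16", [3.0, 4.0]), ("b", "fp16", [5.0]), ("s", "fp32", [20.0])]
result = sync_gradients([rank0, rank1])
```

Here three parameters need only two simulated collective calls, one per dtype bucket, which is the communication saving that bucketing aims for.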

class colossalai.engine.gradient_handler.ZeROGradientHandler(model, optimizer)

A helper class to handle all-reduce operations in a data parallel group. handle_gradient() performs an all-reduce collective communication across the data parallel group. This class is specialized for ZeRO optimization.

handle_gradient()

A method that runs an all-reduce operation in a data parallel group.

class colossalai.engine.gradient_handler.PipelineSharedModuleGradientHandler(model, optimizer)

A helper class to handle all-reduce operations in sub parallel groups. handle_gradient() performs an all-reduce collective communication across all sub pipeline parallel groups. To make communication more efficient, it groups the gradients of all parameters of the same type into buckets.

handle_gradient()

A method that runs an all-reduce operation in sub pipeline parallel groups.

class colossalai.engine.gradient_handler.MoeGradientHandler(model, optimizer)

A helper class to handle all-reduce operations in a data parallel group and in the MoE model parallel group. handle_gradient() performs an all-reduce collective communication across the data parallel group. To make communication more efficient, it groups the gradients of all parameters of the same type into buckets.

handle_gradient()

A method that runs an all-reduce operation in a data parallel group, then runs an all-reduce operation for all expert parameters across the MoE model parallel group.

class colossalai.engine.gradient_handler.SequenceParallelGradientHandler(model, optimizer)

A helper class to handle all-reduce operations in a data parallel group. handle_gradient() performs an all-reduce collective communication across the data parallel group. To make communication more efficient, it groups the gradients of all parameters of the same type into buckets.

handle_gradient()

A method that runs an all-reduce operation in a data parallel group.