colossalai.engine.gradient_handler
- class colossalai.engine.gradient_handler.BaseGradientHandler(model, optimizer)
A basic helper class to handle all-reduce operations of gradients across different parallel groups before optimization.
- Parameters
model (Module) – Model where the gradients accumulate
optimizer (Optimizer) – Optimizer for updating the parameters
- abstract handle_gradient()
A method to accumulate gradients across different parallel groups. Users should write their own functions or just use the functions in pre-defined subclasses.
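The subclassing pattern described above can be sketched in plain Python. This is only an illustration under stated assumptions: `GradientHandlerSketch` is a hypothetical stand-in that mirrors the documented `(model, optimizer)` constructor and the abstract `handle_gradient()` method. It is not the library class. The `AveragingHandler` subclass and its `replica_grads` argument are also hypothetical, and the cross-rank reduction is simulated with lists instead of a real collective call.

```python
from abc import ABC, abstractmethod


class GradientHandlerSketch(ABC):
    """Hypothetical stand-in for a base gradient handler (not the library class)."""

    def __init__(self, model, optimizer):
        self._model = model
        self._optimizer = optimizer

    @abstractmethod
    def handle_gradient(self):
        """Subclasses decide how gradients are synchronized."""


class AveragingHandler(GradientHandlerSketch):
    """Hypothetical subclass that averages gradients over simulated ranks."""

    def __init__(self, model, optimizer, replica_grads):
        super().__init__(model, optimizer)
        # One list of gradient values per simulated rank (assumption for the demo).
        self._replica_grads = replica_grads

    def handle_gradient(self):
        world_size = len(self._replica_grads)
        # Element-wise sum across ranks, standing in for an all-reduce.
        summed = [sum(values) for values in zip(*self._replica_grads)]
        return [s / world_size for s in summed]


handler = AveragingHandler(
    model=None,
    optimizer=None,
    replica_grads=[[1.0, 2.0], [3.0, 4.0]],
)
averaged = handler.handle_gradient()
```

Because `handle_gradient()` is abstract in this sketch, instantiating the base class directly raises `TypeError`, which forces each subclass to define its own synchronization strategy.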
- class colossalai.engine.gradient_handler.DataParallelGradientHandler(model, optimizer)
A helper class to handle all-reduce operations in a data parallel group. An all-reduce collective communication is performed in handle_gradient() among a data parallel group. To make communication more efficient, it buckets the gradients of all parameters that share the same type.
- handle_gradient()
A method that runs an all-reduce operation in a data parallel group.
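The bucketing idea above (group gradients of the same type, then issue one collective per group) can be sketched without any distributed runtime. This is a simplified simulation, not the library's implementation. Every function name here is hypothetical, gradients are plain lists tagged with a type name, the all-reduce is simulated by summing across in-memory "ranks", and dividing by the world size to obtain a mean is an assumption of this sketch.

```python
from collections import defaultdict


def bucketize_by_type(grads):
    """Map each type name to the indices of the gradients that have that type."""
    buckets = defaultdict(list)
    for idx, (dtype, _values) in enumerate(grads):
        buckets[dtype].append(idx)
    return dict(buckets)


def flatten(grads, indices):
    """Concatenate the selected gradients into one flat buffer."""
    flat = []
    for i in indices:
        flat.extend(grads[i][1])
    return flat


def simulated_all_reduce_mean(per_rank_flat):
    """Stand-in for an all-reduce: element-wise sum over ranks, then average."""
    world_size = len(per_rank_flat)
    return [sum(values) / world_size for values in zip(*per_rank_flat)]


def handle_gradient_sketch(per_rank_grads):
    """Synchronize gradients with one simulated collective per type bucket."""
    reference = per_rank_grads[0]
    buckets = bucketize_by_type(reference)
    result = [(dtype, list(values)) for dtype, values in reference]
    num_collectives = 0
    for dtype, indices in buckets.items():
        flats = [flatten(grads, indices) for grads in per_rank_grads]
        reduced = simulated_all_reduce_mean(flats)
        num_collectives += 1
        # Copy slices of the reduced buffer back into per-parameter gradients.
        offset = 0
        for i in indices:
            size = len(reference[i][1])
            result[i] = (dtype, reduced[offset:offset + size])
            offset += size
    return result, num_collectives


rank0 = [("float32", [1.0, 2.0]), ("float16", [4.0]), ("float32", [6.0])]
rank1 = [("float32", [3.0, 4.0]), ("float16", [8.0]), ("float32", [10.0])]
synced, num_collectives = handle_gradient_sketch([rank0, rank1])
```

In this example three gradients are synchronized with only two simulated collectives, because the two `float32` gradients share one flat buffer. Fewer, larger messages are the efficiency gain that bucketing aims for.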
- class colossalai.engine.gradient_handler.ZeROGradientHandler(model, optimizer)
A helper class to handle all-reduce operations in a data parallel group. An all-reduce collective communication is performed in handle_gradient() among a data parallel group. This class is specialized for ZeRO optimization.
- handle_gradient()
A method that runs an all-reduce operation in a data parallel group.
A helper class to handle all-reduce operations in sub parallel groups. An all-reduce collective communication is performed in handle_gradient() among all sub pipeline parallel groups. To make communication more efficient, it buckets the gradients of all parameters that share the same type.
- handle_gradient()
A method that runs an all-reduce operation in sub pipeline parallel groups.
- class colossalai.engine.gradient_handler.MoeGradientHandler(model, optimizer)
A helper class to handle all-reduce operations in a data parallel group and an MoE model parallel group. An all-reduce collective communication is performed in handle_gradient() among a data parallel group. To make communication more efficient, it buckets the gradients of all parameters that share the same type.
- handle_gradient()
A method that runs an all-reduce operation in a data parallel group, then runs an all-reduce operation for all expert parameters across the MoE model parallel group.
- class colossalai.engine.gradient_handler.SequenceParallelGradientHandler(model, optimizer)
A helper class to handle all-reduce operations in a data parallel group. An all-reduce collective communication is performed in handle_gradient() among a data parallel group. To make communication more efficient, it buckets the gradients of all parameters that share the same type.
- handle_gradient()
A method that runs an all-reduce operation in a data parallel group.