colossalai.zero.sharded_optim

class colossalai.zero.sharded_optim.ShardedOptimizerV2(sharded_model, optimizer, cpu_offload=False, gpu_margin_mem_ratio=0.0, initial_scale=4294967296, min_scale=1, growth_factor=2, backoff_factor=0.5, growth_interval=1000, hysteresis=2, max_scale=4294967296, dp_process_group=None, mp_process_group=None)

A wrapper for an optimizer. ShardedOptimizerV2 and ShardedModelV2 together implement the Zero Redundancy Optimizer (ZeRO).

By default, ZeRO stage 3 offloads the optimizer states (OS) to the CPU. For OS placement, we apply the Device-aware Operator Placement technique from the following paper: PatrickStar: Parallel Training of Pre-trained Models via Chunk-based Memory Management https://arxiv.org/abs/2108.05818

GPU margin space is the space remaining after the peak non-model data is subtracted from the overall GPU memory; the peak is detected by a runtime memory tracer. We place as many OS chunks in the margin space as possible. The size of the margin space is controlled by gpu_margin_mem_ratio. If it is set to 0.0, the behavior is the same as the classical ZeRO optimizer.
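The margin-space description can be made concrete with a small calculation. The sketch below is illustrative only and is not ColossalAI code: the helper names margin_space_bytes and place_chunks are hypothetical, the ratio is assumed to lie in [0, 1], and the greedy placement is just one simple way to read "as many OS chunks as possible".

```python
from typing import List


def margin_space_bytes(total_gpu_bytes: int,
                       peak_non_model_bytes: int,
                       gpu_margin_mem_ratio: float) -> int:
    """Bytes of GPU memory made available for optimizer-state chunks.

    Hypothetical helper: the remaining memory is the overall GPU memory
    minus the peak non-model data, and the ratio selects a share of it.
    """
    if not 0.0 <= gpu_margin_mem_ratio <= 1.0:
        raise ValueError("gpu_margin_mem_ratio is assumed to be in [0, 1]")
    remaining = max(total_gpu_bytes - peak_non_model_bytes, 0)
    return int(remaining * gpu_margin_mem_ratio)


def place_chunks(chunk_sizes: List[int], budget: int) -> List[int]:
    """Return indices of chunks kept on the GPU (greedy, in list order).

    Assumption: a chunk is kept on the GPU whenever it still fits in the
    budget; every other chunk would stay on the CPU.
    """
    on_gpu = []
    used = 0
    for index, size in enumerate(chunk_sizes):
        if used + size <= budget:
            on_gpu.append(index)
            used += size
    return on_gpu
```

With a ratio of 0.0 the budget is zero, so no chunk is placed on the GPU, which matches the classical behavior described above.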

Note: You must use ShardedOptimizerV2 with ShardedModelV2.

Parameters
  • sharded_model (ShardedModelV2) – A sharded model initialized by the ShardedModelV2 class. The optimizer uses the shard strategy provided by the sharded model to shard the fp32 parameter tensors.

  • optimizer (Optimizer) – An Optimizer instance.

  • cpu_offload (bool, optional) – Whether to offload the optimizer states to the CPU. Defaults to False.

  • gpu_margin_mem_ratio (float, optional) – The fraction of the GPU memory remaining after the first forward-backward pass that is used when the hybrid CPU optimizer is enabled. Defaults to 0.0.

  • initial_scale (float, optional) – Initial scale used by DynamicGradScaler. Defaults to 2**32.

  • min_scale (float, optional) – Min scale used by DynamicGradScaler. Defaults to 1.

  • growth_factor (float, optional) – growth_factor used by DynamicGradScaler. Defaults to 2.

  • backoff_factor (float, optional) – backoff_factor used by DynamicGradScaler. Defaults to 0.5.

  • growth_interval (float, optional) – growth_interval used by DynamicGradScaler. Defaults to 1000.

  • hysteresis (float, optional) – hysteresis used by DynamicGradScaler. Defaults to 2.

  • max_scale (int, optional) – max_scale used by DynamicGradScaler. Defaults to 2**32.

  • dp_process_group (Optional[ProcessGroup], optional) – Data parallel process group. Defaults to None.

  • mp_process_group (Optional[ProcessGroup], optional) – Model parallel process group. Defaults to None.
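The scaling parameters above (initial_scale, min_scale, max_scale, growth_factor, backoff_factor, growth_interval, hysteresis) interact in a way that is easier to see in code. The following is a minimal sketch of a common dynamic loss-scaling schedule, written under assumptions: ToyDynamicScaler is a hypothetical class for illustration, not ColossalAI's DynamicGradScaler, and the exact update rules of the library may differ.

```python
from dataclasses import dataclass


@dataclass
class ToyDynamicScaler:
    """Illustrative loss-scale schedule (hypothetical, not the library class)."""

    scale: float = 2.0 ** 32          # plays the role of initial_scale
    min_scale: float = 1.0
    max_scale: float = 2.0 ** 32
    growth_factor: float = 2.0
    backoff_factor: float = 0.5
    growth_interval: int = 1000
    hysteresis: int = 2
    _good_steps: int = 0              # overflow-free steps since last change
    _overflows: int = 0               # overflows seen since last growth

    def update(self, overflow: bool) -> None:
        """Adjust the scale after one optimizer step."""
        if overflow:
            # An overflow resets the growth counter; the scale shrinks only
            # once `hysteresis` overflows have accumulated (assumed rule).
            self._good_steps = 0
            self._overflows += 1
            if self._overflows >= self.hysteresis:
                self.scale = max(self.scale * self.backoff_factor,
                                 self.min_scale)
        else:
            # After `growth_interval` clean steps in a row, grow the scale
            # and clear the overflow count.
            self._good_steps += 1
            if self._good_steps == self.growth_interval:
                self._good_steps = 0
                self._overflows = 0
                self.scale = min(self.scale * self.growth_factor,
                                 self.max_scale)
```

In this sketch, min_scale and max_scale clamp the scale, so with the defaults listed above (initial_scale equal to max_scale) the scale can only decrease from its starting value and later recover up to that limit.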