colossalai.zero.sharded_model

class colossalai.zero.sharded_model.ShardedModelV2(module, shard_strategy, process_group=None, reduce_scatter_process_group=None, reduce_scatter_bucket_size_mb=25, fp32_reduce_scatter=False, offload_config=None, gradient_predivide_factor=1.0, use_memory_tracer=False, reuse_fp16_shard=False)

A wrapper for a PyTorch module that shards the model parameters across the memory of multiple GPUs. Each process keeps only 1/#nproc of the parameters and gradients in local CUDA memory, so the forward and backward passes can run within a limited CUDA memory budget.
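The 1/#nproc storage idea can be illustrated without any framework. The following is a minimal sketch in plain Python, not ColossalAI's implementation: the helper names `shard` and `gather` are hypothetical, lists stand in for flattened tensors, and `gather` stands in for an all-gather collective. It assumes the flat parameter is zero-padded so every rank holds a chunk of equal size.

```python
def shard(flat_params, world_size, rank):
    """Return the chunk of flat_params owned by `rank` (hypothetical helper).

    The flat list is zero-padded so that every rank holds the same number
    of elements.
    """
    chunk = -(-len(flat_params) // world_size)  # ceiling division
    padding = [0.0] * (chunk * world_size - len(flat_params))
    padded = list(flat_params) + padding
    return padded[rank * chunk:(rank + 1) * chunk]


def gather(shards, numel):
    """Concatenate per-rank shards and drop the padding.

    Stand-in for an all-gather; `numel` is the original element count.
    """
    full = [x for s in shards for x in s]
    return full[:numel]


params = [0.1 * i for i in range(10)]   # 10 "parameters"
world_size = 4                           # pretend there are 4 processes

# Each rank stores only its own chunk.
shards = [shard(params, world_size, r) for r in range(world_size)]

# The full parameter is rebuilt only when needed for compute.
restored = gather(shards, len(params))
```

With 10 elements and 4 ranks, each rank stores 3 elements (the last chunk carries two padding zeros), and gathering the chunks recovers the original list.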

Note that you must use ShardedModelV2 with ShardedOptimizerV2.

Parameters
  • module (nn.Module) – A sharded module, which must be initialized by ZeroInitContext.

  • shard_strategy (BaseShardStrategy) – A shard strategy to manage shard behavior.

  • process_group (Optional[ProcessGroup], optional) – Data parallel process group. Defaults to None.

  • reduce_scatter_process_group (Optional[ProcessGroup], optional) – Process group used for reduce-scatter. It should generally be left as None, in which case it is the same as process_group. Defaults to None.

  • reduce_scatter_bucket_size_mb (int, optional) – Reduce-scatter bucket size in MB. Defaults to 25.

  • fp32_reduce_scatter (bool, optional) – If set to True, gradients are forced to FP32 before reduce-scatter. Defaults to False.

  • offload_config (Optional[dict], optional) – We currently only support CPU offload. Set to {"device": "cpu"} to enable CPU offload. Defaults to None.

  • gradient_predivide_factor (Optional[float], optional) – Gradients are divided by this value before reduce-scatter. Defaults to 1.0.

  • use_memory_tracer (bool, optional) – Whether to use the memory tracer. Defaults to False.

  • reuse_fp16_shard (bool, optional) – Whether to reuse the fp16 shard for both parameters and gradients. Enabling this can reduce GPU memory usage, but you must disable it when using gradient accumulation. In this mode, gradients are fp16, so make sure your optimizer supports mixed precision (fp32 parameters and fp16 gradients). We find that PyTorch’s optimizers don’t support mixed precision, so we recommend enabling this only when using our CPUAdam with CPU offload. Defaults to False.
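To make the effect of gradient_predivide_factor concrete, here is a small framework-free sketch of the arithmetic. It is not ColossalAI's code: the function name is hypothetical, a plain `sum` stands in for the reduce-scatter sum, and it assumes (this is not stated above) that the reduced value is afterwards divided by world_size / gradient_predivide_factor so the final result is the mean across ranks.

```python
def reduce_mean_with_predivide(per_rank_grads, predivide_factor=1.0):
    """Average one gradient value across ranks (hypothetical helper).

    per_rank_grads holds the same gradient element as computed on each rank.
    """
    world_size = len(per_rank_grads)

    # Step 1: divide by the pre-divide factor before the reduction,
    # which keeps the intermediate sum smaller.
    pre_divided = [g / predivide_factor for g in per_rank_grads]

    # Step 2: stand-in for the reduce-scatter sum over ranks.
    reduced = sum(pre_divided)

    # Step 3 (ASSUMPTION): divide by the remaining factor so that the
    # overall result equals the mean over ranks.
    postdivide_factor = world_size / predivide_factor
    return reduced / postdivide_factor


grads = [1.0, 2.0, 3.0, 4.0]  # one gradient element on each of 4 ranks

avg_default = reduce_mean_with_predivide(grads)            # factor 1.0
avg_predivided = reduce_mean_with_predivide(grads, 2.0)    # factor 2.0
```

Under that assumption both calls return the same mean (2.5 here). Only the size of the intermediate sum changes: 10.0 with the default factor versus 5.0 with a factor of 2.0, which is the reason pre-division can help when gradients are reduced in fp16.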