colossalai.zero.sharded_model
colossalai.zero.sharded_model.sharded_model_v2
- class colossalai.zero.sharded_model.sharded_model_v2.ShardedModelV2(module, shard_strategy, process_group=None, reduce_scatter_process_group=None, reduce_scatter_bucket_size_mb=25, fp32_reduce_scatter=False, offload_config=None, gradient_predivide_factor=1.0, use_memory_tracer=False, reuse_fp16_shard=False)
A wrapper for a PyTorch module that shards the model parameters across the memory of multiple GPUs. Only 1/#nproc of the parameters and gradients are stored in local CUDA memory, so forward and backward passes can be executed within a limited CUDA memory budget.
Note that you must use ShardedModelV2 with ShardedOptimizerV2.
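The 1/#nproc storage claim can be illustrated with a minimal pure-Python sketch. It is not the library's implementation: the helper names (shard_flat_params, gather_flat_params) are hypothetical, plain lists stand in for tensors, and zero-padding the last shard is an assumption about how uneven sizes might be handled.

```python
import math


def shard_flat_params(flat_params, world_size):
    """Split a flat parameter list into world_size equal shards (hypothetical helper).

    The tail is zero-padded so every rank holds a shard of the same length.
    """
    shard_len = math.ceil(len(flat_params) / world_size)
    pad = shard_len * world_size - len(flat_params)
    padded = list(flat_params) + [0.0] * pad
    return [padded[r * shard_len:(r + 1) * shard_len] for r in range(world_size)]


def gather_flat_params(shards, numel):
    """Concatenate the shards of all ranks and drop the padding (an all-gather in miniature)."""
    full = [x for shard in shards for x in shard]
    return full[:numel]


params = [float(i) for i in range(10)]
shards = shard_flat_params(params, world_size=4)   # each rank keeps one entry of this list
restored = gather_flat_params(shards, len(params))
```

Each simulated rank stores only ceil(10 / 4) = 3 values instead of all 10; the full parameter list is reassembled only when it is needed.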
- Parameters
module (nn.Module) – A sharded module, which must be initialized by ZeroInitContext.
shard_strategy (BaseShardStrategy) – A shard strategy to manage shard behavior.
process_group (Optional[ProcessGroup], optional) – Data parallel process group. Defaults to None.
reduce_scatter_process_group (Optional[ProcessGroup], optional) – Reduce-scatter process group. This should generally be left as None, in which case it is the same as process_group. Defaults to None.
reduce_scatter_bucket_size_mb (int, optional) – Reduce-scatter bucket size in MB. Defaults to 25.
fp32_reduce_scatter (bool, optional) – If set to True, gradients are forced to FP32 before reduce-scatter. Defaults to False.
offload_config (Optional[dict], optional) – We currently only support CPU offload. Set to {"device": "cpu"} to enable CPU offload. Defaults to None.
gradient_predivide_factor (Optional[float], optional) – Gradients are divided by this value before reduce-scatter. Defaults to 1.0.
use_memory_tracer (bool, optional) – Whether to use the memory tracer. Defaults to False.
reuse_fp16_shard (bool, optional) – Whether to reuse the fp16 shard for both parameters and gradients. Enabling this can reduce GPU memory usage, but you must disable it when using gradient accumulation. In this mode, gradients are fp16, so make sure your optimizer supports mixed precision (fp32 parameters with fp16 gradients). We find that PyTorch’s optimizers don’t support mixed precision, so we recommend enabling this only when using our CPUAdam with CPU offload. Defaults to False.
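The effect of gradient_predivide_factor can be sketched with a small simulation. This is an illustration under stated assumptions, not the library's code: reduce_scatter_mean is a hypothetical helper, plain lists stand in for gradient tensors, and the post-division by world_size / gradient_predivide_factor is assumed from the way FSDP-style wrappers commonly average gradients.

```python
def reduce_scatter_mean(grads_per_rank, predivide_factor=1.0):
    """Simulate an averaged reduce-scatter over all ranks (hypothetical helper).

    grads_per_rank[r] is the full flat gradient computed on rank r.
    Returns (per-rank averaged shards, largest magnitude seen in the summed buffer).
    """
    world_size = len(grads_per_rank)
    # Assumption: the remaining division happens after the reduction.
    postdivide_factor = world_size / predivide_factor
    chunk = len(grads_per_rank[0]) // world_size

    # Divide before summing, so the intermediate sum stays smaller.
    pre = [[g / predivide_factor for g in grads] for grads in grads_per_rank]
    summed = [sum(column) for column in zip(*pre)]
    peak = max(abs(s) for s in summed)

    # Each rank keeps only its own chunk of the averaged gradient.
    shards = [
        [s / postdivide_factor for s in summed[r * chunk:(r + 1) * chunk]]
        for r in range(world_size)
    ]
    return shards, peak


grads = [[2.0, 4.0], [6.0, 8.0]]          # two ranks, two gradient values each
plain, peak_plain = reduce_scatter_mean(grads, predivide_factor=1.0)
pre, peak_pre = reduce_scatter_mean(grads, predivide_factor=2.0)
```

Both calls yield the same averaged shards, but the summed buffer peaks at a smaller value when the gradients are pre-divided, which is the usual motivation for using this option with low-precision gradients.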
colossalai.zero.sharded_model.reduce_scatter
- class colossalai.zero.sharded_model.reduce_scatter.ReduceScatterBucketer(bucket_size_mb=25)
Helper for bucketing multiple reduce-scatter operations on small tensors into larger reduce-scatter ops to improve communication efficiency.
Usage:
    bucketer = ReduceScatterBucketer()
    bucketer.reduce_scatter_async(
        small_tensors, group, callback_fn=lambda result: print("small")
    )
    bucketer.reduce_scatter_async(
        big_tensors, group, callback_fn=lambda result: print("big")
    )
    bucketer.reduce_scatter_async(
        more_small_tensors, group, callback_fn=lambda result: print("small2")
    )
    bucketer.flush()  # callbacks only guaranteed to be called after flush()
    # Example output (note that it is out of order, due to bucketing):
    # big
    # small
    # small2
- Parameters
bucket_size_mb (int, Optional) – bucket size for communicating. Buckets are sub-divided based on world_size. Values <= 0 disable bucketing.
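The out-of-order callbacks in the usage above can be reproduced with a toy model of bucketing. This is a sketch only: ToyBucketer is a hypothetical class, a plain sum stands in for the collective operation, and the capacity is counted in elements rather than megabytes.

```python
class ToyBucketer:
    """Toy stand-in for a reduce-scatter bucketer (hypothetical, single process)."""

    def __init__(self, bucket_capacity):
        self.bucket_capacity = bucket_capacity  # capacity in elements, not MB
        self.pending = []                       # queued (values, callback) pairs
        self.used = 0

    def reduce_async(self, values, callback_fn=None):
        size = len(values)
        if size > self.bucket_capacity:
            # Too large to bucket: reduce immediately.
            if callback_fn is not None:
                callback_fn(sum(values))
            return
        if self.used + size > self.bucket_capacity:
            # Make room by running everything queued so far.
            self.flush()
        self.pending.append((values, callback_fn))
        self.used += size

    def flush(self):
        # Run all queued reductions and their callbacks in queue order.
        for values, callback_fn in self.pending:
            if callback_fn is not None:
                callback_fn(sum(values))
        self.pending = []
        self.used = 0


log = []
bucketer = ToyBucketer(bucket_capacity=4)
bucketer.reduce_async([1, 2], callback_fn=lambda r: log.append(("small", r)))
bucketer.reduce_async(list(range(10)), callback_fn=lambda r: log.append(("big", r)))
bucketer.reduce_async([3], callback_fn=lambda r: log.append(("small2", r)))
bucketer.flush()
```

The large input bypasses the bucket and fires its callback first, while the two small inputs wait until flush(), so the log order is big, small, small2.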
- reduce_scatter_async(input_list, group, callback_fn=None)
Reduce-scatter a list of tensors asynchronously, so smaller reductions can be bucketed together. The given callback (callback_fn) will be called with the reduced result at some later time. Call flush() to force all queued ops and callbacks to be executed.
Note that large inputs will be reduced immediately, and this function may also flush the relevant bucket to make room for input_list.
- Parameters
input_list (List[Tensor]) – list of tensors to reduce-scatter. List should contain group.size() tensors and each tensor should have identical shape, dtype and device.
group (ProcessGroup) – process group for reduction
callback_fn (Callable, Optional) – callback function to call after the reduction executes. Function will be called with a single argument corresponding to the reduced result.
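The requirement that input_list contain group.size() tensors follows from reduce-scatter semantics: rank r receives the element-wise sum of entry r from every rank's input_list. A single-process sketch of that behavior, where simulate_reduce_scatter is a hypothetical helper and nested lists stand in for tensors:

```python
def simulate_reduce_scatter(inputs_per_rank):
    """Simulate a sum reduce-scatter across ranks (hypothetical helper).

    inputs_per_rank[src] is the input_list passed on rank src and must hold
    one equally shaped chunk per rank. Returns the result each rank receives.
    """
    world_size = len(inputs_per_rank)
    for input_list in inputs_per_rank:
        assert len(input_list) == world_size, "need one chunk per rank"

    results = []
    for r in range(world_size):
        # Collect chunk r from every rank and sum it element-wise.
        chunks = [inputs_per_rank[src][r] for src in range(world_size)]
        results.append([sum(values) for values in zip(*chunks)])
    return results


rank0_inputs = [[1, 2], [3, 4]]
rank1_inputs = [[10, 20], [30, 40]]
reduced = simulate_reduce_scatter([rank0_inputs, rank1_inputs])
```

In this two-rank example, rank 0 ends up with the sum of the first chunks and rank 1 with the sum of the second chunks; this per-rank result is what callback_fn would receive.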
colossalai.zero.sharded_model.utils
colossalai.zero.sharded_model.sharded_model
- class colossalai.zero.sharded_model.ShardedModelV2(module, shard_strategy, process_group=None, reduce_scatter_process_group=None, reduce_scatter_bucket_size_mb=25, fp32_reduce_scatter=False, offload_config=None, gradient_predivide_factor=1.0, use_memory_tracer=False, reuse_fp16_shard=False)
A wrapper for a PyTorch module that shards the model parameters across the memory of multiple GPUs. Only 1/#nproc of the parameters and gradients are stored in local CUDA memory, so forward and backward passes can be executed within a limited CUDA memory budget.
Note that you must use ShardedModelV2 with ShardedOptimizerV2.
- Parameters
module (nn.Module) – A sharded module, which must be initialized by ZeroInitContext.
shard_strategy (BaseShardStrategy) – A shard strategy to manage shard behavior.
process_group (Optional[ProcessGroup], optional) – Data parallel process group. Defaults to None.
reduce_scatter_process_group (Optional[ProcessGroup], optional) – Reduce-scatter process group. This should generally be left as None, in which case it is the same as process_group. Defaults to None.
reduce_scatter_bucket_size_mb (int, optional) – Reduce-scatter bucket size in MB. Defaults to 25.
fp32_reduce_scatter (bool, optional) – If set to True, gradients are forced to FP32 before reduce-scatter. Defaults to False.
offload_config (Optional[dict], optional) – We currently only support CPU offload. Set to {"device": "cpu"} to enable CPU offload. Defaults to None.
gradient_predivide_factor (Optional[float], optional) – Gradients are divided by this value before reduce-scatter. Defaults to 1.0.
use_memory_tracer (bool, optional) – Whether to use the memory tracer. Defaults to False.
reuse_fp16_shard (bool, optional) – Whether to reuse the fp16 shard for both parameters and gradients. Enabling this can reduce GPU memory usage, but you must disable it when using gradient accumulation. In this mode, gradients are fp16, so make sure your optimizer supports mixed precision (fp32 parameters with fp16 gradients). We find that PyTorch’s optimizers don’t support mixed precision, so we recommend enabling this only when using our CPUAdam with CPU offload. Defaults to False.