colossalai.zero.sharded_model

colossalai.zero.sharded_model.sharded_model_v2

class colossalai.zero.sharded_model.sharded_model_v2.ShardedModelV2(module, shard_strategy, process_group=None, reduce_scatter_process_group=None, reduce_scatter_bucket_size_mb=25, fp32_reduce_scatter=False, offload_config=None, gradient_predivide_factor=1.0, use_memory_tracer=False, reuse_fp16_shard=False)[source]

A wrapper for a PyTorch module that shards the model parameters across the memory of multiple GPUs. Each process stores only 1/#nproc of the parameters and gradients in local CUDA memory, so forward and backward passes can run within a limited CUDA memory budget.

Note that you must use ShardedModelV2 with ShardedOptimizerV2.
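The 1/#nproc claim can be illustrated with a minimal pure-Python sketch. It is not the library's implementation: the helper name `shard_flat_params` and the zero-padding rule are assumptions made for illustration, and plain lists stand in for tensors.

```python
def shard_flat_params(flat_params, world_size):
    """Split a flat list of parameter values into world_size equal shards.

    Assumption of this sketch: the list is zero-padded so that its length
    divides evenly. A real shard strategy may handle remainders differently.
    """
    shard_len = -(-len(flat_params) // world_size)  # ceiling division
    padding = [0.0] * (shard_len * world_size - len(flat_params))
    padded = flat_params + padding
    return [padded[r * shard_len:(r + 1) * shard_len] for r in range(world_size)]


params = [float(i) for i in range(10)]  # 10 parameter values in total
shards = shard_flat_params(params, world_size=4)

# Rank 1 keeps only its own shard: 3 values instead of all 10.
local_shard = shards[1]
```

With 10 values and 4 processes, each process holds 3 values (the last shard is padded), which is roughly a quarter of the full parameter set.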

Parameters
  • module (nn.Module) – A sharded module, which must be initialized by ZeroInitContext.

  • shard_strategy (BaseShardStrategy) – A shard strategy to manage shard behavior.

  • process_group (Optional[ProcessGroup], optional) – Data parallel process group. Defaults to None.

  • reduce_scatter_process_group (Optional[ProcessGroup], optional) – Reduce-scatter process group. This should generally be left as None, in which case it is the same as process_group. Defaults to None.

  • reduce_scatter_bucket_size_mb (int, optional) – Reduce-scatter bucket size in MB. Defaults to 25.

  • fp32_reduce_scatter (bool, optional) – If set to True, gradients are forced to FP32 before reduce-scatter. Defaults to False.

  • offload_config (Optional[dict], optional) – Offload configuration. Only CPU offload is currently supported; set to {"device": "cpu"} to enable it. Defaults to None.

  • gradient_predivide_factor (Optional[float], optional) – Gradients are divided by this value before reduce-scatter. Defaults to 1.0.

  • use_memory_tracer (bool, optional) – Whether to use the memory tracer. Defaults to False.

  • reuse_fp16_shard (bool, optional) – Whether to reuse fp16 shard for param and grad. Enabling this can reduce GPU memory usage, but you have to make sure you disable it when using gradient accumulation. In this mode, grad will be fp16. Make sure your optimizer supports mixed precision (fp32 param and fp16 grad). We find that PyTorch’s optimizers don’t support mixed precision, so we recommend you enable this only when using our CPUAdam with CPU offload. Defaults to False.
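The effect of gradient_predivide_factor can be sketched with plain arithmetic. The source only states that gradients are divided by this value before reduce-scatter. The post-division by world_size / predivide_factor shown here is an assumption (a common convention in sharded data-parallel implementations), and `averaged_gradient` is a hypothetical helper, not part of the library.

```python
def averaged_gradient(per_rank_grads, predivide_factor):
    """Average one gradient value across ranks using pre- and post-division.

    Assumption: after the reduction, the sum is divided by
    world_size / predivide_factor so the overall result is a plain mean.
    """
    world_size = len(per_rank_grads)
    postdivide_factor = world_size / predivide_factor  # assumed, not from the source
    prescaled = [g / predivide_factor for g in per_rank_grads]
    reduced = sum(prescaled)  # stand-in for the summing reduce-scatter
    return reduced / postdivide_factor


grads = [2.0, 4.0, 6.0, 8.0]  # one gradient value from each of 4 ranks
mean_default = averaged_gradient(grads, predivide_factor=1.0)
mean_prescaled = averaged_gradient(grads, predivide_factor=2.0)
```

Under this assumption both calls give the same mean. Dividing before the reduction only changes the size of the intermediate sum, which can help keep low-precision gradients from overflowing.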

dump_memory_stats(filename='dump_mem_stats.log')[source]

Dump the information collected by the memory tracer to a file. For example:

try:
    # forward: model(inputs)
    # backward: optimizer.backward()
    pass
except Exception as e:
    model.dump_memory_stats()
    exit(0)

colossalai.zero.sharded_model.reduce_scatter

class colossalai.zero.sharded_model.reduce_scatter.ReduceScatterBucketer(bucket_size_mb=25)[source]

Helper for bucketing multiple reduce-scatter operations on small tensors into larger reduce-scatter ops to improve communication efficiency.

Usage:

bucketer = ReduceScatterBucketer()
bucketer.reduce_scatter_async(
    small_tensors, callback_fn=lambda result: print("small")
)
bucketer.reduce_scatter_async(
    big_tensors, callback_fn=lambda result: print("big")
)
bucketer.reduce_scatter_async(
    more_small_tensors, callback_fn=lambda result: print("small2")
)
bucketer.flush()  # callbacks only guaranteed to be called after flush()
# Example output (note that it is out of order, due to bucketing):
# big
# small
# small2
Parameters

bucket_size_mb (int, Optional) – Bucket size in MB for communication. Buckets are sub-divided based on world_size. Values <= 0 disable bucketing.
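The out-of-order callback behavior in the usage example can be reproduced with a toy model. This is a sketch of the ordering only: `ToyBucketer`, `submit`, and the integer sizes are hypothetical stand-ins, and no tensors or communication are involved.

```python
class ToyBucketer:
    """Queues small work items and runs oversized ones immediately (toy model)."""

    def __init__(self, bucket_capacity):
        self.bucket_capacity = bucket_capacity  # max total size held in the bucket
        self.pending = []                       # (size, callback) pairs awaiting a flush
        self.used = 0                           # total size currently queued

    def submit(self, size, callback):
        if size > self.bucket_capacity:
            callback()  # too large to bucket: run right away
            return
        if self.used + size > self.bucket_capacity:
            self.flush()  # make room for the new item
        self.pending.append((size, callback))
        self.used += size

    def flush(self):
        # Run every queued callback in submission order, then empty the bucket.
        for _, callback in self.pending:
            callback()
        self.pending.clear()
        self.used = 0


order = []
bucketer = ToyBucketer(bucket_capacity=10)
bucketer.submit(3, lambda: order.append("small"))
bucketer.submit(50, lambda: order.append("big"))
bucketer.submit(4, lambda: order.append("small2"))
bucketer.flush()
# order is now ["big", "small", "small2"]
```

In this toy model a capacity of 0 or less makes every item run immediately, which mirrors the statement that values <= 0 disable bucketing.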

reduce_scatter_async(input_list, group, callback_fn=None)[source]

Reduce-scatter a list of tensors asynchronously, so smaller reductions can be bucketed together. The given callback (callback_fn) will be called with the reduced result at some later time. Call flush() to force all queued ops and callbacks to be executed.

Note that large inputs will be reduced immediately, and this function may also flush the relevant bucket to make room for input_list.

Parameters
  • input_list (List[Tensor]) – list of tensors to reduce-scatter. List should contain group.size() tensors and each tensor should have identical shape, dtype and device.

  • group (ProcessGroup) – process group for reduction

  • callback_fn (Callable, Optional) – callback function to call after the reduction executes. Function will be called with a single argument corresponding to the reduced result.
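The shape requirement on input_list follows from what a summing reduce-scatter computes: each rank contributes one chunk per rank in the group and receives the sum of the chunks addressed to it. A minimal single-process simulation, with nested lists standing in for tensors and `simulate_reduce_scatter` as a hypothetical helper:

```python
def simulate_reduce_scatter(inputs_per_rank):
    """Simulate a summing reduce-scatter across all ranks in one process.

    inputs_per_rank[r][k] is the chunk that rank r contributes for rank k.
    Returns one output per rank k: the element-wise sum over r of
    inputs_per_rank[r][k].
    """
    world_size = len(inputs_per_rank)
    outputs = []
    for k in range(world_size):
        chunks = [inputs_per_rank[r][k] for r in range(world_size)]
        outputs.append([sum(values) for values in zip(*chunks)])
    return outputs


inputs = [
    [[1, 2], [3, 4]],      # rank 0's input_list: one chunk per rank
    [[10, 20], [30, 40]],  # rank 1's input_list
]
outputs = simulate_reduce_scatter(inputs)
# outputs[0] is what rank 0 receives, outputs[1] is what rank 1 receives
```

Here rank 0 ends up with [11, 22] and rank 1 with [33, 44]. In the asynchronous API described above, that per-rank result is what would be passed to callback_fn.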

flush()[source]

Reduce-scatter any partial buckets.

free()[source]

Free buffers from all buckets.

colossalai.zero.sharded_model.utils

colossalai.zero.sharded_model.utils.col_model_deepcopy(sharded_model, other_model)[source]

Copy the parameters of sharded_model (a ShardedModelV2) to other_model. Note that other_model must have the same structure as sharded_model.

colossalai.zero.sharded_model.sharded_model

class colossalai.zero.sharded_model.ShardedModelV2(module, shard_strategy, process_group=None, reduce_scatter_process_group=None, reduce_scatter_bucket_size_mb=25, fp32_reduce_scatter=False, offload_config=None, gradient_predivide_factor=1.0, use_memory_tracer=False, reuse_fp16_shard=False)[source]

A wrapper for a PyTorch module that shards the model parameters across the memory of multiple GPUs. Each process stores only 1/#nproc of the parameters and gradients in local CUDA memory, so forward and backward passes can run within a limited CUDA memory budget.

Note that you must use ShardedModelV2 with ShardedOptimizerV2.

Parameters
  • module (nn.Module) – A sharded module, which must be initialized by ZeroInitContext.

  • shard_strategy (BaseShardStrategy) – A shard strategy to manage shard behavior.

  • process_group (Optional[ProcessGroup], optional) – Data parallel process group. Defaults to None.

  • reduce_scatter_process_group (Optional[ProcessGroup], optional) – Reduce-scatter process group. This should generally be left as None, in which case it is the same as process_group. Defaults to None.

  • reduce_scatter_bucket_size_mb (int, optional) – Reduce-scatter bucket size in MB. Defaults to 25.

  • fp32_reduce_scatter (bool, optional) – If set to True, gradients are forced to FP32 before reduce-scatter. Defaults to False.

  • offload_config (Optional[dict], optional) – Offload configuration. Only CPU offload is currently supported; set to {"device": "cpu"} to enable it. Defaults to None.

  • gradient_predivide_factor (Optional[float], optional) – Gradients are divided by this value before reduce-scatter. Defaults to 1.0.

  • use_memory_tracer (bool, optional) – Whether to use the memory tracer. Defaults to False.

  • reuse_fp16_shard (bool, optional) – Whether to reuse fp16 shard for param and grad. Enabling this can reduce GPU memory usage, but you have to make sure you disable it when using gradient accumulation. In this mode, grad will be fp16. Make sure your optimizer supports mixed precision (fp32 param and fp16 grad). We find that PyTorch’s optimizers don’t support mixed precision, so we recommend you enable this only when using our CPUAdam with CPU offload. Defaults to False.

dump_memory_stats(filename='dump_mem_stats.log')[source]

Dump the information collected by the memory tracer to a file. For example:

try:
    # forward: model(inputs)
    # backward: optimizer.backward()
    pass
except Exception as e:
    model.dump_memory_stats()
    exit(0)

colossalai.zero.sharded_model.sharded_grad

colossalai.zero.sharded_model.param_manager