colossalai.engine.schedule

class colossalai.engine.schedule.BaseSchedule(batch_data_process_func=None)[source]

A basic helper class that controls the process of training or evaluation. It mainly consists of forward_backward_step for the backward pass and optimizer_step for parameter updates. To make FP16 easier to enable, all code that controls FP16 is aggregated in the schedule class.

Parameters
  • batch_data_process_func (Callable, optional) – The preprocessing function which receives a batch of data, and it will be executed in load_batch.

load_batch(data_iter, to_gpu=True)[source]

Loads a batch from the data iterator. It returns the data and labels, which are already on the same GPU as the model.

Parameters
  • data_iter (Iterable) – Data iterator from which to get a batch of data, obtained by calling iter(dataloader).

  • to_gpu (bool, optional) – Whether the data should be moved to the GPU.

Returns

A tuple of (data, label).

Return type

Tuple[torch.Tensor, torch.Tensor]
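
As a rough illustration of the contract described for load_batch, the sketch below shows how a batch preprocessing function might be applied when a batch is pulled from the iterator. It uses plain Python lists instead of tensors and skips device placement entirely. The names load_batch_sketch and split_pair are hypothetical and are not part of Colossal-AI; the fallback for a missing preprocessing function is also an assumption.

```python
from typing import Any, Callable, Iterator, Optional, Tuple


def load_batch_sketch(
    data_iter: Iterator[Any],
    batch_data_process_func: Optional[Callable[[Any], Tuple[Any, Any]]] = None,
) -> Tuple[Any, Any]:
    """Hypothetical stand-in for load_batch: fetch one batch and return (data, label)."""
    batch = next(data_iter)  # raises StopIteration once the iterator is exhausted
    if batch_data_process_func is not None:
        # The user-supplied function decides how a raw batch maps to (data, label).
        return batch_data_process_func(batch)
    # Assumed fallback: the batch is already a (data, label) pair.
    data, label = batch
    return data, label


def split_pair(batch: dict) -> Tuple[list, list]:
    """Hypothetical preprocessing function for dict-style batches."""
    return batch["x"], batch["y"]


dataloader = [{"x": [1, 2], "y": [0, 1]}, {"x": [3, 4], "y": [1, 0]}]
data_iter = iter(dataloader)  # the iterator is obtained by calling iter(dataloader)
data, label = load_batch_sketch(data_iter, split_pair)
```

Each call consumes exactly one batch, so the caller keeps ownership of the iterator and decides when an epoch ends.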

pre_processing(engine)[source]

Performs actions before running the schedule.

abstract forward_backward_step(engine, data_iter, forward_only, return_loss=True, return_output_label=True)[source]

Processes one batch of the dataset for training or evaluation.

Parameters
  • engine (colossalai.engine.Engine) – Colossalai engine for training and inference.

  • data_iter (Iterable) – Data iterator from which to get a batch of data, obtained by calling iter(dataloader).

  • forward_only (bool) – If True, the backward pass is skipped.

  • return_loss (bool, optional) – If False, the loss won’t be returned.

  • return_output_label (bool, optional) – If False, the output and label won’t be returned.

class colossalai.engine.schedule.NonPipelineSchedule(batch_data_process_func=None)[source]

A helper schedule class for running environments without pipeline parallelism. In one step, it loads a batch from the dataset and feeds it to the model. After getting the output and calculating the loss, it uses step() to update the parameters if it is in training mode.

Parameters
  • batch_data_process_func (Callable, optional) – The preprocessing function which receives a batch of data, and it will be executed in load_batch.

forward_backward_step(engine, data_iter, forward_only=False, return_loss=True, return_output_label=True)[source]

Loads a batch from the dataset and feeds it to the model. The returned labels and loss will be None if return_loss is False.

Parameters
  • engine (colossalai.engine.Engine) – Colossalai engine for training and inference.

  • data_iter (Iterable) – Data loader in the form of an iterator, obtained by calling iter(dataloader).

  • forward_only (bool, optional) – If True, only the forward pass is run; otherwise, backpropagation is also executed.

  • return_loss (bool, optional) – Loss will be returned if True.

  • return_output_label (bool, optional) – Output and label will be returned if True.

Returns

A tuple of (output, label, loss); loss and label could be None.

Return type

Tuple[torch.Tensor]
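
The control flow described for this method can be sketched in plain Python. This is a minimal sketch under stated assumptions, not the library's implementation: ToyEngine and forward_backward_step_sketch are hypothetical names, scalars stand in for tensors, and the requirement that a loss be computed whenever the backward pass runs is an assumption of the sketch.

```python
from typing import Callable, Iterator, Optional, Tuple


class ToyEngine:
    """Hypothetical stand-in for an engine; it only counts backward calls."""

    def __init__(self, model: Callable[[float], float],
                 criterion: Callable[[float, float], float]) -> None:
        self.model = model
        self.criterion = criterion
        self.backward_calls = 0

    def __call__(self, data: float) -> float:
        return self.model(data)

    def backward(self, loss: float) -> None:
        self.backward_calls += 1


def forward_backward_step_sketch(
    engine: ToyEngine,
    data_iter: Iterator[Tuple[float, float]],
    forward_only: bool = False,
    return_loss: bool = True,
    return_output_label: bool = True,
) -> Tuple[Optional[float], Optional[float], Optional[float]]:
    # Assumption of this sketch: the backward pass needs a loss.
    assert forward_only or return_loss, "return_loss must be True when training"

    data, label = next(data_iter)   # load one batch
    output = engine(data)           # forward pass

    loss = engine.criterion(output, label) if return_loss else None
    if not forward_only:
        engine.backward(loss)       # backward pass only in training mode

    if return_output_label:
        return output, label, loss
    return None, None, loss


engine = ToyEngine(model=lambda x: 2 * x, criterion=lambda out, y: (out - y) ** 2)
batches = iter([(1.0, 3.0), (2.0, 4.0)])
train_result = forward_backward_step_sketch(engine, batches)
eval_result = forward_backward_step_sketch(engine, batches, forward_only=True, return_loss=False)
```

The training call runs one backward pass, while the evaluation call leaves the backward counter untouched and returns None for the loss.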

class colossalai.engine.schedule.PipelineSchedule(num_microbatches, batch_data_process_func=None, tensor_shape=None, scatter_gather_tensors=False)[source]

A helper schedule class for running environments with pipeline parallelism. It uses the non-interleaved 1F1B strategy. Other properties are similar to those of NonPipelineSchedule.

Parameters
  • num_microbatches (int) – The number of microbatches.

  • batch_data_process_func (Callable, optional) – The preprocessing function which receives a batch of data, and it will be executed in load_batch.

  • tensor_shape (torch.Size, optional) – Specified shape in pipeline communication.

  • scatter_gather_tensors (bool, optional) – If set to True, communication between pipeline stages is reduced when using 1D tensor parallelism.
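
To make the role of num_microbatches concrete, the sketch below splits one batch into equally sized microbatches. It is only an illustration: split_into_microbatches is a hypothetical helper, a Python list stands in for a tensor, and the requirement that the batch size be divisible by num_microbatches is an assumption of the sketch.

```python
from typing import List, Sequence


def split_into_microbatches(batch: Sequence, num_microbatches: int) -> List[Sequence]:
    """Hypothetical helper: cut a batch into num_microbatches equal slices."""
    if num_microbatches <= 0:
        raise ValueError("num_microbatches must be positive")
    if len(batch) % num_microbatches != 0:
        # Assumption: uneven microbatches are rejected rather than padded.
        raise ValueError("batch size must be divisible by num_microbatches")
    size = len(batch) // num_microbatches
    return [batch[i * size:(i + 1) * size] for i in range(num_microbatches)]


microbatches = split_into_microbatches(list(range(8)), num_microbatches=4)
```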

forward_step(engine, input_tensor, return_tensors, return_output_label=True, accum_loss=None)[source]

Forward step for the passed-in model. If it is the first stage, the input tensor is obtained from data_iterator; otherwise, the passed-in input_tensor is used. Returns the output tensor. This is a helper function and can be ignored by users.

Parameters
  • engine (colossalai.engine.Engine) – Colossalai engine for training and inference.

  • input_tensor (torch.Tensor) – Input tensor for this pipeline stage.

  • return_tensors (List[torch.Tensor]) – A list of tensors to return.

  • return_output_label (bool, optional) – Whether to return the output and label.

  • accum_loss (optional) – Where the accumulated loss is stored.

Returns

The output or the loss value of the current pipeline stage.

Return type

torch.Tensor

backward_step(engine, input_tensor, output_tensor, output_tensor_grad)[source]

Backward step through the passed-in output tensor. If it is the last stage, output_tensor_grad is None; otherwise, it is the gradient with respect to the stage’s output tensor. Returns the gradient with respect to the input tensor (None if it is the first stage). This is a helper function and can be ignored by users.

Parameters
  • engine (colossalai.engine.Engine) – Colossalai engine for training and inference.

  • input_tensor (torch.Tensor) – Input tensor for this pipeline stage.

  • output_tensor (torch.Tensor) – Output tensor for this pipeline stage.

  • output_tensor_grad (torch.Tensor) – Gradient of the output tensor for this pipeline stage.

Returns

Gradient of the input tensor.

Return type

torch.Tensor

forward_backward_step(engine, data_iter, forward_only=False, return_loss=True, return_output_label=True)[source]

Runs the non-interleaved 1F1B schedule, with communication between pipeline stages. Returns a tuple with losses if this is the last stage, and an empty tuple otherwise.

Parameters
  • engine (colossalai.engine.Engine) – Colossalai engine for training and inference.

  • data_iter (Iterable) – Data loader in the form of an iterator, obtained by calling iter(dataloader).

  • forward_only (bool, optional) – Whether to run the forward step only. Default is False. If True, no backward pass is run.

  • return_loss (bool, optional) – Whether to return the loss value. Default is True.

  • return_output_label (bool, optional) – If False, the output and label won’t be returned.

Returns

A tuple of (output, label, loss); loss and label could be None.

Return type

Tuple[torch.Tensor]
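
The order of forward and backward steps in a non-interleaved 1F1B schedule can be sketched without any communication or tensors. The function below follows the ordering commonly described for this strategy (a warm-up of forward passes, a steady phase of one forward followed by one backward, then a cool-down of backward passes). It is a sketch: one_f_one_b_order is a hypothetical helper, and the details of Colossal-AI's own implementation may differ.

```python
from typing import List, Tuple


def one_f_one_b_order(stage: int, num_stages: int, num_microbatches: int) -> List[Tuple[str, int]]:
    """List the ("F" or "B", microbatch index) steps one stage runs under 1F1B."""
    # Earlier stages run more warm-up forwards so that later stages have work to do.
    num_warmup = min(num_stages - stage - 1, num_microbatches)
    num_steady = num_microbatches - num_warmup

    steps: List[Tuple[str, int]] = []
    fwd = bwd = 0
    for _ in range(num_warmup):      # warm-up: forward passes only
        steps.append(("F", fwd))
        fwd += 1
    for _ in range(num_steady):      # steady state: one forward, then one backward
        steps.append(("F", fwd))
        fwd += 1
        steps.append(("B", bwd))
        bwd += 1
    for _ in range(num_warmup):      # cool-down: drain the remaining backward passes
        steps.append(("B", bwd))
        bwd += 1
    return steps


first_stage = one_f_one_b_order(stage=0, num_stages=2, num_microbatches=3)
last_stage = one_f_one_b_order(stage=1, num_stages=2, num_microbatches=3)
```

In this ordering the last stage alternates forward and backward from the first microbatch, which is why it is the stage that can compute and return the losses.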