colossalai.utils.checkpointing

colossalai.utils.checkpointing.get_checkpoint_path(checkpoint_dir, epoch, suffix='')

Generates the checkpoint path from the (checkpoint_dir, epoch, suffix, gpu_parallel_rank) tuple. This is useful both when saving a checkpoint and when restoring one.

Parameters
  • checkpoint_dir (str) – Directory for saving checkpoints

  • epoch (int) – Epoch number (how many epochs the model has been trained for)

  • suffix (str, optional) – Additional notation to specify the model or checkpoint, defaults to ''

Returns

The generated checkpoint path

Return type

path
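
As a rough illustration of how a path could be derived from the tuple above, here is a minimal sketch. The helper name `sketch_checkpoint_path`, the `rank_tag` argument and the `epoch{epoch}-{rank_tag}{suffix}.pt` naming scheme are all assumptions made for this example; the file name the library actually produces, and how it obtains the parallel rank, may differ.

```python
import os


def sketch_checkpoint_path(checkpoint_dir, epoch, suffix="", rank_tag="rank0"):
    """Hypothetical stand-in for get_checkpoint_path, not the library's code.

    Assumes the file name is 'epoch{epoch}-{rank_tag}{suffix}.pt'. In the
    library the rank component comes from the parallel context, not from
    an argument.
    """
    filename = f"epoch{epoch}-{rank_tag}{suffix}.pt"
    return os.path.join(checkpoint_dir, filename)


# Each rank writes to its own file, so ranks never overwrite each other.
path_rank0 = sketch_checkpoint_path("ckpt", epoch=3, suffix="-bert")
path_rank1 = sketch_checkpoint_path("ckpt", epoch=3, suffix="-bert", rank_tag="rank1")
print(path_rank0)
print(path_rank1)
```

Encoding the rank in the file name is what lets the same (checkpoint_dir, epoch, suffix) arguments resolve to a different file on each parallel process.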

colossalai.utils.checkpointing.get_latest_checkpoint_path(checkpoint_dir, suffix='')

Retrieves the latest checkpoint path from the (checkpoint_dir, suffix, gpu_parallel_rank) tuple. This is useful when restoring a checkpoint, especially when the epoch number is not known.

Parameters
  • checkpoint_dir (str) – Directory for saving checkpoints

  • suffix (str, optional) – Additional notation to specify the model or checkpoint, defaults to ''

Raises

FileNotFoundError – Raised when no latest checkpoint file can be found for the given inputs

Returns

The latest checkpoint path

Return type

path
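
One plausible way to locate the latest checkpoint is to scan the directory, match file names against a pattern, and keep the highest epoch. The sketch below does this under an assumed `epoch{N}-{rank_tag}{suffix}.pt` naming scheme; `sketch_latest_checkpoint_path` and `rank_tag` are hypothetical names, and the library may select the latest file differently.

```python
import os
import re
import tempfile


def sketch_latest_checkpoint_path(checkpoint_dir, suffix="", rank_tag="rank0"):
    """Hypothetical illustration: return the file with the highest epoch number."""
    # Escape the user-supplied parts so they are matched literally.
    pattern = re.compile(
        rf"epoch(\d+)-{re.escape(rank_tag)}{re.escape(suffix)}\.pt"
    )
    best_epoch, best_name = -1, None
    for name in os.listdir(checkpoint_dir):
        match = pattern.fullmatch(name)
        # Compare epochs as integers so that epoch 10 beats epoch 2.
        if match and int(match.group(1)) > best_epoch:
            best_epoch, best_name = int(match.group(1)), name
    if best_name is None:
        raise FileNotFoundError(
            f"no checkpoint matching {pattern.pattern!r} in {checkpoint_dir!r}"
        )
    return os.path.join(checkpoint_dir, best_name)


# Usage: create a few empty files and ask for the newest one.
with tempfile.TemporaryDirectory() as tmp:
    for epoch in (1, 2, 10):
        open(os.path.join(tmp, f"epoch{epoch}-rank0.pt"), "w").close()
    latest_name = os.path.basename(sketch_latest_checkpoint_path(tmp))

print(latest_name)
```

Parsing the epoch as an integer matters here: a plain string sort would rank `epoch2` above `epoch10`.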

colossalai.utils.checkpointing.get_latest_checkpoint_pattern(suffix='')

Generates the regular expression pattern that matches the latest checkpoint.

Parameters

suffix (str, optional) – Additional notation to specify the model or checkpoint, defaults to ''

Returns

Checkpoint pattern

Return type

regular expression

colossalai.utils.checkpointing.save_checkpoint(checkpoint_path, epoch, model, optimizer, lr_scheduler=None, **kwargs)

Given a directory to store the checkpoints, saves the parameters and buffers of all training components, such as model, optimizer and lr_scheduler, into a checkpoint dictionary.

This method can be used for both colossalai nn.BaseModel and normal PyTorch nn.Module.

Parameters
  • checkpoint_path (str) – Directory for saving checkpoints

  • epoch (int) – Epoch number (how many epochs the model has been trained for)

  • model (torch.nn.Module) – Model to be registered

  • optimizer (torch.optim.Optimizer) – Optimizer to be registered

  • lr_scheduler (torch.optim.lr_scheduler._LRScheduler, optional) – lr_scheduler to be registered, defaults to None

colossalai.utils.checkpointing.load_checkpoint(checkpoint_path, model, optimizer, lr_scheduler=None, finetune=False, strict=True)

Loads the checkpoint file. If finetune is False, the intent is to resume training from the given checkpoint, so parameters and buffers are copied from state_dict into these modules (model, optimizer, lr_scheduler) and their descendants.

If finetune is True, only the weights and buffers of model are reloaded. If strict is True, the keys of state_dict must exactly match the keys returned by this module’s state_dict() function.

Parameters
  • checkpoint_path (str) – The exact, matching checkpoint_path directory from which to retrieve the appropriate state_dict

  • model (torch.nn.Module) – Model whose parameters and buffers are reloaded

  • optimizer (torch.optim.Optimizer) – Optimizer to restore

  • lr_scheduler (torch.optim.lr_scheduler._LRScheduler, optional) – lr_scheduler to restore, defaults to None

  • finetune (bool, optional) – Whether to fine-tune the model on a new dataset (True) or continue the pre-training (False), defaults to False

  • strict (bool, optional) – Whether to strictly enforce that the keys in state_dict of the checkpoint match the names of parameters and buffers in model, defaults to True

Raises

ValueError – Raised if the model or optimizer cannot be restored successfully

Returns

(the epoch number of the checkpoint retrieved, the checkpoint retrieved)

Return type

Tuple
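
The effect of the finetune and strict flags can be sketched as follows. This is a simplified illustration operating on an already-loaded checkpoint dictionary: the function `sketch_restore`, the `_Stub` class and the dictionary keys are hypothetical, and returning the stored epoch in the fine-tuning case is an assumption rather than documented behaviour.

```python
def sketch_restore(checkpoint, model, optimizer, lr_scheduler=None,
                   finetune=False, strict=True):
    """Hypothetical sketch of how finetune and strict steer the restore."""
    # The model weights are reloaded in both modes; strict is forwarded as-is.
    model.load_state_dict(checkpoint["model"], strict=strict)
    if not finetune:
        # Resuming training: also restore optimizer and scheduler state.
        optimizer.load_state_dict(checkpoint["optimizer"])
        if lr_scheduler is not None and "lr_scheduler" in checkpoint:
            lr_scheduler.load_state_dict(checkpoint["lr_scheduler"])
    return checkpoint["epoch"], checkpoint


class _Stub:
    """Stand-in for objects exposing load_state_dict(), e.g. torch modules."""

    def __init__(self):
        self.state = None

    def load_state_dict(self, state, strict=True):
        self.state = dict(state)


saved = {"epoch": 7, "model": {"w": 1.0}, "optimizer": {"lr": 0.1}}

# Resume: model and optimizer are both restored.
resume_model, resume_optim = _Stub(), _Stub()
resume_epoch, _ = sketch_restore(saved, resume_model, resume_optim)

# Fine-tune: only the model weights are restored; the optimizer starts fresh.
tune_model, tune_optim = _Stub(), _Stub()
sketch_restore(saved, tune_model, tune_optim, finetune=True)

print(resume_epoch, resume_optim.state, tune_optim.state)
```

With a real `torch.nn.Module`, passing `strict=False` to `load_state_dict` tolerates missing or unexpected keys, which is often what fine-tuning with a modified head needs.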