colossalai.utils.checkpointing
- colossalai.utils.checkpointing.get_checkpoint_path(checkpoint_dir, epoch, suffix='')
Generates the checkpoint path from the (checkpoint_dir, epoch, suffix, gpu_parallel_rank) tuple. This is useful both when saving a checkpoint and when restoring one.
- Parameters
checkpoint_dir (str) – Directory for saving checkpoints
epoch (int) – Epoch number (how many epochs the model has been trained for)
suffix (str, optional) – Additional notation to specify the model or checkpoint, defaults to ''
- Returns
The generated checkpoint path
- Return type
path
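As a rough illustration of how such a path might be composed, here is a minimal sketch. The function name `sketch_checkpoint_path`, the explicit `rank` argument, and the file name layout `epoch<N>-rank<R><suffix>.pt` are all assumptions made for this example; the layout actually used by colossalai may differ.

```python
import os


def sketch_checkpoint_path(checkpoint_dir: str, epoch: int,
                           suffix: str = "", rank: int = 0) -> str:
    """Hypothetical stand-in for get_checkpoint_path.

    The real function derives the parallel rank internally; here it is an
    explicit argument so the sketch runs without a distributed setup.
    """
    # Assumed file name layout, not taken from the library.
    filename = f"epoch{epoch}-rank{rank}{suffix}.pt"
    return os.path.join(checkpoint_dir, filename)


path = sketch_checkpoint_path("ckpt", epoch=3, suffix="-bert")
```

Keeping the epoch and rank in the file name is what lets a later lookup find the right file without opening any checkpoint.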
- colossalai.utils.checkpointing.get_latest_checkpoint_path(checkpoint_dir, suffix='')
Retrieves the latest checkpoint path from the (checkpoint_dir, suffix, gpu_parallel_rank) tuple. This is useful when restoring a checkpoint, especially when the epoch number is not known.
- Parameters
checkpoint_dir (str) – Directory for saving checkpoints
suffix (str, optional) – Additional notation to specify the model or checkpoint, defaults to ''
- Raises
FileNotFoundError – Raised when no latest checkpoint file can be found for the given inputs
- Returns
The path of the latest checkpoint
- Return type
path
- colossalai.utils.checkpointing.get_latest_checkpoint_pattern(suffix='')
Generates the regular expression for the latest checkpoint's pattern.
- Parameters
suffix (str, optional) – Additional notation to specify the model or checkpoint, defaults to ''
- Returns
Checkpoint pattern
- Return type
regular expression
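The two lookup helpers above can be sketched together: build a pattern for checkpoint file names, scan the directory, and keep the match with the highest epoch. This is only a sketch under assumptions. The names `sketch_latest_pattern` and `sketch_latest_path`, the explicit `rank` argument, and the `epoch<N>-rank<R><suffix>.pt` layout are hypothetical, and the sketch returns a compiled pattern for convenience.

```python
import os
import re
import tempfile


def sketch_latest_pattern(suffix: str = "", rank: int = 0) -> re.Pattern:
    """Hypothetical pattern for file names like 'epoch12-rank0<suffix>.pt'."""
    # re.escape keeps characters in the suffix from acting as regex syntax.
    return re.compile(rf"epoch(\d+)-rank{rank}{re.escape(suffix)}\.pt")


def sketch_latest_path(checkpoint_dir: str, suffix: str = "", rank: int = 0) -> str:
    """Return the matching file with the highest epoch number."""
    pattern = sketch_latest_pattern(suffix, rank)
    best_epoch, best_name = -1, None
    for name in os.listdir(checkpoint_dir):
        match = pattern.fullmatch(name)
        # Compare epochs as integers so that epoch 10 beats epoch 2.
        if match and int(match.group(1)) > best_epoch:
            best_epoch, best_name = int(match.group(1)), name
    if best_name is None:
        raise FileNotFoundError(f"no checkpoint matching the pattern in {checkpoint_dir}")
    return os.path.join(checkpoint_dir, best_name)


with tempfile.TemporaryDirectory() as tmp:
    for epoch in (1, 2, 10):
        open(os.path.join(tmp, f"epoch{epoch}-rank0.pt"), "w").close()
    latest = os.path.basename(sketch_latest_path(tmp))
```

Comparing epochs numerically rather than sorting file names avoids picking `epoch2` over `epoch10`, which a plain string sort would do.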
- colossalai.utils.checkpointing.save_checkpoint(checkpoint_path, epoch, model, optimizer, lr_scheduler=None, **kwargs)
Given a directory in which to store checkpoints, saves the parameters and buffers of all training components (model, optimizer, lr_scheduler, etc.) into a checkpoint dictionary.
This method can be used with both the colossalai nn.BaseModel and a regular PyTorch nn.Module.
- Parameters
checkpoint_path (str) – Directory for saving checkpoints
epoch (int) – Epoch number (how many epochs the model has been trained for)
model (torch.nn.Module) – Model whose state is saved
optimizer (torch.optim.Optimizer) – Optimizer whose state is saved
lr_scheduler (torch.optim.lr_scheduler._LRScheduler, optional) – lr_scheduler whose state is saved, defaults to None
- colossalai.utils.checkpointing.load_checkpoint(checkpoint_path, model, optimizer, lr_scheduler=None, finetune=False, strict=True)
Loads the checkpoint file. If finetune is False, the intent is to resume training from the given checkpoint, so parameters and buffers are copied from state_dict into the model, optimizer and lr_scheduler, and into their descendants.
If finetune is True, only the weights and buffers of the model are reloaded. If strict is True, the keys of state_dict must exactly match the keys returned by the module's state_dict() function.
- Parameters
checkpoint_path (str) – Exact checkpoint path from which the state_dict is retrieved
model (torch.nn.Module) – Model whose parameters and buffers are reloaded
optimizer (torch.optim.Optimizer) – Optimizer to restore
lr_scheduler (torch.optim.lr_scheduler._LRScheduler, optional) – lr_scheduler to restore, defaults to None
finetune (bool, optional) – Whether to fine-tune the model on a new dataset (True) or continue pre-training (False), defaults to False
strict (bool, optional) – Whether to strictly enforce that the keys in the state_dict of the checkpoint match the names of parameters and buffers in model, defaults to True
- Raises
ValueError – Raised if the model or optimizer cannot be restored successfully
- Returns
(the epoch number of the checkpoint retrieved, the checkpoint retrieved)
- Return type
Tuple
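The difference between resuming and fine-tuning may be easier to see in a small round trip. The sketch below is not the library's implementation: `Stateful`, `sketch_save` and `sketch_load` are hypothetical stand-ins, `pickle` replaces the PyTorch serializer so the example runs with the standard library alone, and the `strict` option is left out.

```python
import os
import pickle
import tempfile


class Stateful:
    """Stand-in for any object exposing state_dict() / load_state_dict()."""

    def __init__(self, **state):
        self.state = dict(state)

    def state_dict(self):
        return dict(self.state)

    def load_state_dict(self, state):
        self.state = dict(state)


def sketch_save(path, epoch, model, optimizer, lr_scheduler=None):
    # Collect every component's state into a single checkpoint dictionary.
    checkpoint = {"epoch": epoch,
                  "model": model.state_dict(),
                  "optimizer": optimizer.state_dict()}
    if lr_scheduler is not None:
        checkpoint["lr_scheduler"] = lr_scheduler.state_dict()
    with open(path, "wb") as f:
        pickle.dump(checkpoint, f)


def sketch_load(path, model, optimizer, lr_scheduler=None, finetune=False):
    with open(path, "rb") as f:
        checkpoint = pickle.load(f)
    # The model weights are restored in both modes.
    model.load_state_dict(checkpoint["model"])
    if not finetune:
        # Resuming: also restore the optimizer and scheduler state.
        optimizer.load_state_dict(checkpoint["optimizer"])
        if lr_scheduler is not None and "lr_scheduler" in checkpoint:
            lr_scheduler.load_state_dict(checkpoint["lr_scheduler"])
    return checkpoint["epoch"], checkpoint


with tempfile.TemporaryDirectory() as tmp:
    ckpt_file = os.path.join(tmp, "demo.pt")
    sketch_save(ckpt_file, 5, Stateful(w=1.0), Stateful(lr=0.1))

    # Fine-tuning: the optimizer keeps its fresh settings.
    ft_model, ft_opt = Stateful(w=0.0), Stateful(lr=0.5)
    ft_epoch, _ = sketch_load(ckpt_file, ft_model, ft_opt, finetune=True)

    # Resuming: the optimizer state comes back from the checkpoint.
    rs_model, rs_opt = Stateful(w=0.0), Stateful(lr=0.5)
    rs_epoch, _ = sketch_load(ckpt_file, rs_model, rs_opt, finetune=False)
```

In the fine-tuning case only the model state changes, which matches the description above: a new training run usually wants the pretrained weights but its own optimizer and learning-rate schedule.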