colossalai.nn.layer.parallel_3d
- colossalai.nn.layer.parallel_3d.reduce_by_batch_3d(tensor, input_parallel_mode, weight_parallel_mode, reduce_mean=False)[source]
All-reduce the input from the model parallel region.
- Parameters
input_parallel_mode (colossalai.context.parallel_mode.ParallelMode) – input parallel mode.
weight_parallel_mode (colossalai.context.parallel_mode.ParallelMode) – weight parallel mode.
reduce_mean (bool, optional) – If set to True, the output is divided by (input parallel size * weight parallel size), defaults to False.
Note
The parallel_mode should be a member of ParallelMode. More details about ParallelMode can be found in parallel_mode.
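The effect of reduce_mean can be illustrated without a distributed setup. The following is a minimal sketch in plain Python, not ColossalAI code: the function name `simulate_reduce_by_batch` and the grid layout of ranks are assumptions made for illustration. It sums the values held by every rank in an input-parallel by weight-parallel grid and, if requested, divides by the product of the two group sizes, as described for reduce_mean above.

```python
def simulate_reduce_by_batch(values, input_size, weight_size, reduce_mean=False):
    """Simulate the reduction over an input_size x weight_size grid of ranks.

    values[i][w] is the scalar held by the (hypothetical) rank with input
    index i and weight index w. Every rank would end up with the returned value.
    """
    if len(values) != input_size or any(len(row) != weight_size for row in values):
        raise ValueError("values must have shape (input_size, weight_size)")
    # Step 1: sum across the input-parallel group, separately for each weight index.
    after_input = [sum(values[i][w] for i in range(input_size))
                   for w in range(weight_size)]
    # Step 2: sum the partial results across the weight-parallel group.
    total = sum(after_input)
    if reduce_mean:
        # Divide by (input parallel size * weight parallel size).
        total /= input_size * weight_size
    return total


grid = [[1.0, 2.0], [3.0, 4.0]]
print(simulate_reduce_by_batch(grid, 2, 2))                    # 10.0
print(simulate_reduce_by_batch(grid, 2, 2, reduce_mean=True))  # 2.5
```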
- colossalai.nn.layer.parallel_3d.split_tensor_3d(tensor, dim, parallel_mode)[source]
Splits a 3D parallel tensor along the specified dimension.
- Parameters
tensor (torch.tensor) – Input tensor.
dim (int) – Specified dimension in which to split.
parallel_mode (colossalai.context.parallel_mode.ParallelMode, optional) – Parallel mode.
- Returns
The split tensor.
- Return type
torch.tensor
Note
The parallel_mode should be a member of ParallelMode. More details about ParallelMode can be found in parallel_mode.
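As a rough illustration of what splitting along a dimension means for one rank, the sketch below chunks a 2D nested list into equal parts and keeps the part belonging to a given rank. It is plain Python rather than ColossalAI code; `split_along_dim`, `world_size` and `rank` are hypothetical names standing in for the size of the parallel group and the local rank within it, and even divisibility is assumed.

```python
def split_along_dim(matrix, dim, world_size, rank):
    """Return the rank-th of world_size equal chunks of a 2D nested list."""
    if dim not in (0, 1):
        raise ValueError("this sketch only handles dim 0 or 1")
    size = len(matrix) if dim == 0 else len(matrix[0])
    if size % world_size != 0:
        raise ValueError("dimension size must be divisible by world_size")
    step = size // world_size
    start, stop = rank * step, (rank + 1) * step
    if dim == 0:
        # Keep a contiguous block of rows.
        return [row[:] for row in matrix[start:stop]]
    # Keep a contiguous block of columns from every row.
    return [row[start:stop] for row in matrix]


full = [[0, 1, 2, 3], [4, 5, 6, 7]]
print(split_along_dim(full, dim=1, world_size=2, rank=1))  # [[2, 3], [6, 7]]
print(split_along_dim(full, dim=0, world_size=2, rank=0))  # [[0, 1, 2, 3]]
```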
- colossalai.nn.layer.parallel_3d.split_batch_3d(input_, dim=0, input_parallel_mode=ParallelMode.PARALLEL_3D_INPUT, weight_parallel_mode=ParallelMode.PARALLEL_3D_WEIGHT)[source]
Splits a 3D tensor along the batch dimension.
- Parameters
input_ (torch.tensor) – Input tensor.
dim (int) – Specified dimension in which to split.
input_parallel_mode (colossalai.context.parallel_mode.ParallelMode, optional) – input parallel mode.
weight_parallel_mode (colossalai.context.parallel_mode.ParallelMode, optional) – weight parallel mode.
- Returns
The split tensor.
- Return type
torch.tensor
Note
The parallel_mode should be a member of ParallelMode. More details about ParallelMode can be found in parallel_mode.
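Because two parallel modes are involved, the batch ends up divided twice. The sketch below shows this with a plain Python list; it is an illustration only, and the order of the two splits (weight group first, then input group), the helper names, and the use of a single `depth` for both group sizes are all assumptions rather than documented behaviour.

```python
def chunk(seq, parts, index):
    """Return the index-th of `parts` equal, contiguous chunks of seq."""
    if len(seq) % parts != 0:
        raise ValueError("length must be divisible by the number of parts")
    step = len(seq) // parts
    return seq[index * step:(index + 1) * step]


def split_batch(batch, depth, weight_rank, input_rank):
    """Split a batch across two groups of size `depth` each."""
    # Assumed order: split across the weight group, then the input group.
    part = chunk(batch, depth, weight_rank)
    return chunk(part, depth, input_rank)


batch = list(range(8))
depth = 2
shards = [split_batch(batch, depth, w, i)
          for w in range(depth) for i in range(depth)]
print(shards)  # [[0, 1], [2, 3], [4, 5], [6, 7]]
```

Each of the depth * depth (here 4) rank combinations receives a distinct slice holding 1/4 of the batch.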
- class colossalai.nn.layer.parallel_3d.Linear3D(in_features, out_features, bias=True, dtype=None, weight_initializer=<function kaiming_uniform_.<locals>.initializer>, bias_initializer=<function xavier_uniform_.<locals>.initializer>)[source]
Linear layer for 3D parallelism.
- Parameters
in_features (int) – size of each input sample.
out_features (int) – size of each output sample.
bias (bool, optional) – If set to False, the layer will not learn an additive bias, defaults to True.
dtype (torch.dtype, optional) – The dtype of parameters, defaults to None.
weight_initializer (typing.Callable, optional) – The initializer of weight, defaults to kaiming uniform initializer.
bias_initializer (typing.Callable, optional) – The initializer of bias, defaults to xavier uniform initializer.
For more details about initializer, please refer to init.
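The default values in the signature above print as `<function kaiming_uniform_.<locals>.initializer>`, which is how Python displays a function defined inside another function. The sketch below illustrates that factory pattern in plain Python. It is not ColossalAI's implementation: the name `kaiming_uniform_sketch`, the `(values, fan_in)` call signature and the parameter `a` are assumptions for illustration, and the bound uses the standard Kaiming uniform formula.

```python
import math
import random


def kaiming_uniform_sketch(a=0.0):
    """Hypothetical factory that returns an `initializer` closure."""

    def initializer(values, fan_in):
        # Standard Kaiming uniform bound: gain * sqrt(3 / fan_in),
        # with gain = sqrt(2 / (1 + a^2)) for a leaky ReLU of slope a.
        gain = math.sqrt(2.0 / (1.0 + a * a))
        bound = gain * math.sqrt(3.0 / fan_in)
        for k in range(len(values)):
            values[k] = random.uniform(-bound, bound)  # fill in place
        return bound

    return initializer


init = kaiming_uniform_sketch(a=math.sqrt(5))
weights = [0.0] * 16
bound = init(weights, fan_in=4)
print(init.__qualname__)  # kaiming_uniform_sketch.<locals>.initializer
```

Passing a factory result rather than a bare function lets the caller fix hyperparameters such as `a` once and leaves the layer to supply shape information when it creates its parameters.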
- class colossalai.nn.layer.parallel_3d.LayerNorm3D(normalized_shape, eps=1e-12, dtype=None)[source]
Layer Normalization for 3D parallelism.
- Parameters
normalized_shape (int) – input shape from an expected input of size \([* \times \text{normalized_shape}[0] \times \text{normalized_shape}[1] \times \ldots \times \text{normalized_shape}[-1]]\). If a single integer is used, it is treated as a singleton list, and this module will normalize over the last dimension, which is expected to be of that specific size.
eps (float, optional) – a value added to the denominator for numerical stability, defaults to 1e-12.
dtype (torch.dtype, optional) – The dtype of parameters, defaults to None.
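To make the role of eps and of normalizing over the last dimension concrete, here is a minimal single-process sketch in plain Python. It shows only the standard layer normalization arithmetic for an integer normalized_shape; it omits the learnable scale and shift and does not model how the 3D-parallel layer distributes the computation. The function name `layer_norm_last_dim` is hypothetical.

```python
import math


def layer_norm_last_dim(rows, eps=1e-12):
    """Normalize each row (the last dimension) to zero mean and unit variance."""
    out = []
    for row in rows:
        mean = sum(row) / len(row)
        # Biased variance, as is usual for layer normalization.
        var = sum((x - mean) ** 2 for x in row) / len(row)
        # eps is added to the denominator for numerical stability.
        out.append([(x - mean) / math.sqrt(var + eps) for x in row])
    return out


normalized = layer_norm_last_dim([[1.0, 2.0, 3.0, 4.0]])
row_mean = sum(normalized[0]) / 4
row_var = sum(v * v for v in normalized[0]) / 4
print(row_mean, row_var)  # approximately 0 and 1
```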
- class colossalai.nn.layer.parallel_3d.PatchEmbedding3D(img_size, patch_size, in_chans, embed_size, flatten=True, dtype=None, weight_initializer=<function kaiming_uniform_.<locals>.initializer>, bias_initializer=<function xavier_uniform_.<locals>.initializer>, position_embed_initializer=<function zeros_.<locals>.initializer>)[source]
2D Image to Patch Embedding.
- Parameters
img_size (int) – image size.
patch_size (int) – patch size.
in_chans (int) – number of channels of input image.
embed_size (int) – size of embedding.
dtype (torch.dtype, optional) – The dtype of parameters, defaults to None.
flatten (bool, optional) – whether to flatten output tensor, defaults to True.
weight_initializer (typing.Callable, optional) – The initializer of weight, defaults to kaiming uniform initializer.
bias_initializer (typing.Callable, optional) – The initializer of bias, defaults to xavier uniform initializer.
position_embed_initializer (typing.Callable, optional) – The initializer of position embedding, defaults to zeros initializer.
For more details about initializer, please refer to init.
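The relationship between img_size, patch_size and the number of patches can be worked out with simple arithmetic. The sketch below is an illustration of generic image-to-patch embedding shapes for square images, not a description of this class's exact output: it ignores any extra tokens the layer may add and any partitioning across devices, and the helper names are hypothetical.

```python
def patch_grid(img_size, patch_size):
    """Return (patches per side, total patches) for a square image."""
    if img_size % patch_size != 0:
        raise ValueError("img_size must be divisible by patch_size")
    side = img_size // patch_size
    return side, side * side


def embedded_shape(batch, img_size, patch_size, embed_size, flatten=True):
    """Shape of the patch embeddings before any extra tokens are added."""
    side, num_patches = patch_grid(img_size, patch_size)
    if flatten:
        # One embedding vector per patch.
        return (batch, num_patches, embed_size)
    # Unflattened: the patch grid is kept as two spatial dimensions.
    return (batch, embed_size, side, side)


print(patch_grid(224, 16))               # (14, 196)
print(embedded_shape(8, 224, 16, 768))   # (8, 196, 768)
```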
- class colossalai.nn.layer.parallel_3d.Classifier3D(in_features, num_classes, weight=None, bias=True, dtype=None, weight_initializer=<function kaiming_uniform_.<locals>.initializer>, bias_initializer=<function xavier_uniform_.<locals>.initializer>)[source]
Classifier for 3D parallelism.
- Parameters
in_features (int) – size of each input sample.
num_classes (int) – number of classes.
weight (torch.nn.Parameter, optional) – weight of the classifier, defaults to None.
bias (bool, optional) – If set to False, the layer will not learn an additive bias, defaults to True.
dtype (torch.dtype, optional) – The dtype of parameters, defaults to None.
weight_initializer (typing.Callable, optional) – The initializer of weight, defaults to kaiming uniform initializer.
bias_initializer (typing.Callable, optional) – The initializer of bias, defaults to xavier uniform initializer.
For more details about initializer, please refer to init.
- class colossalai.nn.layer.parallel_3d.Embedding3D(num_embeddings, embedding_dim, padding_idx=None, dtype=None, weight_initializer=<function normal_.<locals>.initializer>, *args, **kwargs)[source]
Embedding for 3D parallelism.
- Parameters
num_embeddings (int) – number of embeddings.
embedding_dim (int) – dimension of embedding.
padding_idx (int, optional) – If specified, the entries at padding_idx do not contribute to the gradient; therefore, the embedding vector at padding_idx is not updated during training, i.e. it remains as a fixed “pad”, defaults to None.
dtype (torch.dtype, optional) – The dtype of parameters, defaults to None.
weight_initializer (typing.Callable, optional) – The initializer of weight, defaults to normal initializer.
The args and kwargs are passed to torch.nn.functional.embedding and may include:
max_norm (float, optional): If given, each embedding vector with norm larger than max_norm is renormalized to have norm max_norm. Note: this will modify weight in-place.
norm_type (float, optional): The p of the p-norm to compute for the max_norm option. Default 2.
scale_grad_by_freq (bool, optional): If given, this will scale gradients by the inverse of frequency of the words in the mini-batch. Default False.
sparse (bool, optional): If True, gradient w.r.t. weight will be a sparse tensor. Default False.
More details about args and kwargs can be found in Embedding.
For more details about initializer, please refer to init.
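The max_norm and norm_type options described above can be illustrated with a small single-process sketch. This is plain Python, not torch or ColossalAI code; `embedding_lookup` is a hypothetical helper, and the exact scaling factor used by torch may differ slightly (for example by a small epsilon in the denominator).

```python
def embedding_lookup(weight, indices, max_norm=None, norm_type=2.0):
    """Look up rows of `weight`, renormalizing looked-up rows in place."""
    if max_norm is not None:
        for idx in set(indices):
            row = weight[idx]
            # p-norm of the embedding vector, with p = norm_type.
            norm = sum(abs(v) ** norm_type for v in row) ** (1.0 / norm_type)
            if norm > max_norm:
                scale = max_norm / norm
                weight[idx] = [v * scale for v in row]  # modifies the table
    return [weight[idx][:] for idx in indices]


table = [[3.0, 4.0], [0.3, 0.4]]
out = embedding_lookup(table, [0, 1, 0], max_norm=1.0)
print(out)    # row 0 is rescaled to norm 1, row 1 is unchanged
print(table)  # the table itself now holds the rescaled row 0
```

Row 0 has 2-norm 5, so it is scaled down to roughly [0.6, 0.8]; row 1 has norm 0.5 and is left alone.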
- class colossalai.nn.layer.parallel_3d.VocabParallelEmbedding3D(num_embeddings, embedding_dim, padding_idx=None, dtype=None, weight_initializer=<function normal_.<locals>.initializer>, *args, **kwargs)[source]
Embedding parallelized in the vocabulary dimension.
- Parameters
num_embeddings (int) – number of embeddings.
embedding_dim (int) – dimension of embedding.
padding_idx (int, optional) – If specified, the entries at padding_idx do not contribute to the gradient; therefore, the embedding vector at padding_idx is not updated during training, i.e. it remains as a fixed “pad”, defaults to None.
dtype (torch.dtype, optional) – The dtype of parameters, defaults to None.
weight_initializer (typing.Callable, optional) – The initializer of weight, defaults to normal initializer.
The args and kwargs are passed to torch.nn.functional.embedding and may include:
max_norm (float, optional): If given, each embedding vector with norm larger than max_norm is renormalized to have norm max_norm. Note: this will modify weight in-place.
norm_type (float, optional): The p of the p-norm to compute for the max_norm option. Default 2.
scale_grad_by_freq (bool, optional): If given, this will scale gradients by the inverse of frequency of the words in the mini-batch. Default False.
sparse (bool, optional): If True, gradient w.r.t. weight will be a sparse tensor. Default False.
More details about args and kwargs can be found in Embedding.
For more details about initializer, please refer to init.
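The idea of parallelizing an embedding in the vocabulary dimension can be sketched in a single process. The code below is a generic illustration in plain Python, not ColossalAI's 3D implementation: each simulated rank owns a contiguous range of vocabulary rows, returns zeros for indices outside its range, and the partial results are summed (standing in for a reduction across ranks). All function names and the even partitioning are assumptions.

```python
def vocab_range(num_embeddings, world_size, rank):
    """Contiguous [start, stop) range of vocabulary rows owned by `rank`."""
    if num_embeddings % world_size != 0:
        raise ValueError("num_embeddings must be divisible by world_size")
    per_rank = num_embeddings // world_size
    return rank * per_rank, (rank + 1) * per_rank


def partial_lookup(shard, start, stop, indices, dim):
    """Lookup on one rank: zeros for indices this rank does not own."""
    return [shard[idx - start][:] if start <= idx < stop else [0.0] * dim
            for idx in indices]


def vocab_parallel_lookup(weight, indices, world_size):
    """Sum the partial lookups of all simulated ranks."""
    dim = len(weight[0])
    total = [[0.0] * dim for _ in indices]
    for rank in range(world_size):
        start, stop = vocab_range(len(weight), world_size, rank)
        part = partial_lookup(weight[start:stop], start, stop, indices, dim)
        total = [[a + b for a, b in zip(t, p)] for t, p in zip(total, part)]
    return total


weight = [[float(r), float(r) + 0.5] for r in range(4)]
ids = [3, 0, 2]
result = vocab_parallel_lookup(weight, ids, world_size=2)
print(result == [weight[i] for i in ids])  # True
```

Since exactly one rank owns each index, the sum of the partial lookups reproduces the ordinary lookup on the full table.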
- class colossalai.nn.layer.parallel_3d.VocabParallelClassifier3D(in_features, num_classes, weight=None, bias=True, dtype=None, weight_initializer=<function kaiming_uniform_.<locals>.initializer>, bias_initializer=<function xavier_uniform_.<locals>.initializer>)[source]
Vocab parallel classifier layer for 3D parallelism.
- Parameters
in_features (int) – size of each input sample.
num_classes (int) – number of classes.
weight (torch.nn.Parameter, optional) – weight of the classifier, defaults to None.
bias (bool, optional) – If set to False, the layer will not learn an additive bias, defaults to True.
dtype (torch.dtype, optional) – The dtype of parameters, defaults to None.
weight_initializer (typing.Callable, optional) – The initializer of weight, defaults to kaiming uniform initializer.
bias_initializer (typing.Callable, optional) – The initializer of bias, defaults to xavier uniform initializer.
For more details about initializer, please refer to init.