colossalai.nn.layer.parallel_1d
- class colossalai.nn.layer.parallel_1d.Linear1D(in_features, out_features, bias=True, dtype=None, gather_output=False, skip_bias_add=False, weight_initializer=<function kaiming_uniform_.<locals>.initializer>, bias_initializer=<function xavier_uniform_.<locals>.initializer>)[source]
Linear layer for 1D parallelism.
- Parameters
in_features (int) – size of each input sample.
out_features (int) – size of each output sample.
bias (bool, optional) – If set to False, the layer will not learn an additive bias, defaults to True.
dtype (torch.dtype, optional) – The dtype of parameters, defaults to None.
gather_output (bool, optional) – Whether to call all-gather on output, defaults to False.
skip_bias_add (bool, optional) – If set to True, the bias add is skipped in the linear layer and the bias is preserved for kernel fusion, defaults to False.
weight_initializer (typing.Callable, optional) – The initializer of weight, defaults to kaiming uniform initializer.
bias_initializer (typing.Callable, optional) – The initializer of bias, defaults to xavier uniform initializer.
For more details about initializer, please refer to init.
- class colossalai.nn.layer.parallel_1d.Linear1D_Col(in_features, out_features, bias=True, dtype=None, gather_output=False, skip_bias_add=False, weight_initializer=<function kaiming_uniform_.<locals>.initializer>, bias_initializer=<function xavier_uniform_.<locals>.initializer>)[source]
Linear layer with column parallelism.
The linear layer is defined as \(Y = XA + b\). A is parallelized along its second dimension as \(A = [A_1, ..., A_p]\).
- Parameters
in_features (int) – size of each input sample.
out_features (int) – size of each output sample.
bias (bool, optional) – If set to False, the layer will not learn an additive bias, defaults to True.
dtype (torch.dtype, optional) – The dtype of parameters, defaults to None.
gather_output (bool, optional) – If True, call all-gather on the output so that Y is available on every GPU; otherwise each GPU keeps only its own output \(Y_i = XA_i\), defaults to False.
skip_bias_add (bool, optional) – If set to True, the bias add is skipped in the linear layer and the bias is preserved for kernel fusion, defaults to False.
weight_initializer (typing.Callable, optional) – The initializer of weight, defaults to kaiming uniform initializer.
bias_initializer (typing.Callable, optional) – The initializer of bias, defaults to xavier uniform initializer.
For more details about initializer, please refer to init.
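The column split \(A = [A_1, ..., A_p]\) described for Linear1D_Col can be illustrated without any GPUs. The following is a minimal pure-Python sketch, not the library's implementation: the helpers `matmul`, `split_columns` and `column_parallel_linear` are hypothetical names introduced here, the bias is assumed to be split the same way as the columns of A, and list concatenation stands in for the all-gather that `gather_output=True` would trigger.

```python
def matmul(x, a):
    """Multiply an (n x k) matrix by a (k x m) matrix given as nested lists."""
    return [[sum(row[t] * a[t][j] for t in range(len(a)))
             for j in range(len(a[0]))] for row in x]


def split_columns(a, parts):
    """Split a matrix into `parts` equal column blocks [A_1, ..., A_p]."""
    width = len(a[0]) // parts
    return [[row[i * width:(i + 1) * width] for row in a] for i in range(parts)]


def column_parallel_linear(x, a, b, parts, gather_output=False):
    """Compute Y_i = X A_i + b_i per shard; optionally concatenate the shards."""
    width = len(a[0]) // parts
    shards = []
    for i, a_i in enumerate(split_columns(a, parts)):
        b_i = b[i * width:(i + 1) * width]  # assumption: bias is split like A
        y_i = [[v + b_i[j] for j, v in enumerate(row)] for row in matmul(x, a_i)]
        shards.append(y_i)
    if not gather_output:
        return shards  # each "GPU" keeps only its own Y_i
    # Concatenation along the last dimension stands in for all-gather.
    return [sum((shard[r] for shard in shards), []) for r in range(len(x))]


X = [[1.0, 2.0], [3.0, 4.0]]
A = [[1.0, 0.0, 2.0, 1.0], [0.0, 1.0, 1.0, 3.0]]
b = [0.5, 0.5, 0.5, 0.5]

# Reference result computed without any splitting.
full = [[v + b[j] for j, v in enumerate(row)] for row in matmul(X, A)]
gathered = column_parallel_linear(X, A, b, parts=2, gather_output=True)
shards = column_parallel_linear(X, A, b, parts=2)
```

With `gather_output=False` the call returns one partial output per shard, which mirrors the documented behaviour that every GPU holds only its own \(Y_i\).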
- class colossalai.nn.layer.parallel_1d.Linear1D_Row(in_features, out_features, bias=True, dtype=None, parallel_input=True, skip_bias_add=False, weight_initializer=<function kaiming_uniform_.<locals>.initializer>, bias_initializer=<function xavier_uniform_.<locals>.initializer>)[source]
Linear layer with row parallelism.
- Parameters
in_features (int) – size of each input sample.
out_features (int) – size of each output sample.
bias (bool, optional) – If set to False, the layer will not learn an additive bias, defaults to True.
dtype (torch.dtype, optional) – The dtype of parameters, defaults to None.
parallel_input (bool, optional) – If set to True, it is assumed that the input is already split, defaults to True.
skip_bias_add (bool, optional) – If set to True, the bias add is skipped in the linear layer and the bias is preserved for kernel fusion, defaults to False.
weight_initializer (typing.Callable, optional) – The initializer of weight, defaults to kaiming uniform initializer.
bias_initializer (typing.Callable, optional) – The initializer of bias, defaults to xavier uniform initializer.
For more details about initializer, please refer to init.
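The entry for Linear1D_Row does not spell out the arithmetic, so the sketch below assumes the usual row-parallel scheme: A is split along its first dimension into row blocks \(A_i\), the input is split along its last dimension into matching slices \(X_i\), and the partial products are summed, \(Y = \sum_i X_i A_i + b\). The helper names are hypothetical, and the plain sum stands in for the all-reduce a distributed run would perform.

```python
def matmul(x, a):
    """Multiply an (n x k) matrix by a (k x m) matrix given as nested lists."""
    return [[sum(row[t] * a[t][j] for t in range(len(a)))
             for j in range(len(a[0]))] for row in x]


def row_parallel_linear(x, a, b, parts):
    """Sum the partial products X_i A_i over all shards, then add the bias once."""
    rows = len(a) // parts
    partials = []
    for i in range(parts):
        # With parallel_input=True each rank would already hold its slice X_i;
        # here the slice is taken explicitly for illustration.
        x_i = [row[i * rows:(i + 1) * rows] for row in x]
        a_i = a[i * rows:(i + 1) * rows]
        partials.append(matmul(x_i, a_i))
    # Summation across shards stands in for all-reduce.
    return [[sum(p[r][c] for p in partials) + b[c] for c in range(len(b))]
            for r in range(len(x))]


X = [[1.0, 2.0], [3.0, 4.0]]
A = [[1.0, 0.0, 2.0], [0.0, 1.0, 1.0]]
b = [0.5, 0.5, 0.5]

# Reference result computed without any splitting.
full = [[v + b[j] for j, v in enumerate(row)] for row in matmul(X, A)]
parallel = row_parallel_linear(X, A, b, parts=2)
```

Under this assumption the bias is added only once, after the reduction, because adding it on every shard would count it `parts` times.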
- class colossalai.nn.layer.parallel_1d.Embedding1D(num_embeddings, embedding_dim, padding_idx=None, dtype=None, weight_initializer=<function normal_.<locals>.initializer>, *args, **kwargs)[source]
Embedding for 1D parallelism.
- Parameters
num_embeddings (int) – number of embeddings.
embedding_dim (int) – dimension of embedding.
padding_idx (int, optional) – If specified, the entries at padding_idx do not contribute to the gradient; therefore, the embedding vector at padding_idx is not updated during training, i.e. it remains as a fixed “pad”, defaults to None.
dtype (torch.dtype, optional) – The dtype of parameters, defaults to None.
weight_initializer (typing.Callable, optional) – The initializer of weight, defaults to normal initializer.
The args and kwargs used in torch.nn.functional.embedding may contain:
max_norm (float, optional): If given, each embedding vector with norm larger than max_norm is renormalized to have norm max_norm. Note: this will modify weight in-place.
norm_type (float, optional): The p of the p-norm to compute for the max_norm option. Default 2.
scale_grad_by_freq (bool, optional): If given, this will scale gradients by the inverse of frequency of the words in the mini-batch. Default False.
sparse (bool, optional): If True, gradient w.r.t. weight will be a sparse tensor. Default False.
More details about args and kwargs can be found in Embedding.
For more details about initializer, please refer to init.
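The effect of max_norm and norm_type on a single embedding vector can be sketched in a few lines. This is an illustrative approximation only: `renorm` is a hypothetical helper, it returns a new list rather than modifying the weight in-place, and it omits the small epsilon a real implementation may add to the denominator.

```python
def renorm(vector, max_norm, norm_type=2.0):
    """Rescale `vector` so that its p-norm does not exceed `max_norm`."""
    norm = sum(abs(v) ** norm_type for v in vector) ** (1.0 / norm_type)
    if norm <= max_norm:
        return list(vector)  # already within the limit, left unchanged
    scale = max_norm / norm
    return [v * scale for v in vector]


clipped = renorm([3.0, 4.0], max_norm=1.0)    # 2-norm is 5, so it is scaled down
untouched = renorm([0.3, 0.4], max_norm=1.0)  # 2-norm is about 0.5, kept as is
```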
- class colossalai.nn.layer.parallel_1d.Dropout1D(p=0.5, inplace=False)[source]
Dropout layer of 1D parallelism.
- Parameters
p (float, optional) – probability of an element to be zeroed, defaults to 0.5.
inplace (bool, optional) – whether to do dropout in-place, defaults to False.
- class colossalai.nn.layer.parallel_1d.Classifier1D(in_features, num_classes, weight=None, bias=True, dtype=None, weight_initializer=<function kaiming_uniform_.<locals>.initializer>, bias_initializer=<function xavier_uniform_.<locals>.initializer>)[source]
RowLinear with given weight. Classifier of 1D parallelism.
- Parameters
in_features (int) – size of each input sample.
num_classes (int) – number of classes.
weight (torch.nn.Parameter, optional) – weight of the classifier, defaults to None.
bias (bool, optional) – If set to False, the layer will not learn an additive bias, defaults to True.
dtype (torch.dtype, optional) – The dtype of parameters, defaults to None.
weight_initializer (typing.Callable, optional) – The initializer of weight, defaults to kaiming uniform initializer.
bias_initializer (typing.Callable, optional) – The initializer of bias, defaults to xavier uniform initializer.
For more details about initializer, please refer to init.
- class colossalai.nn.layer.parallel_1d.VocabParallelClassifier1D(in_features, num_classes, weight=None, bias=True, dtype=None, weight_initializer=<function kaiming_uniform_.<locals>.initializer>, bias_initializer=<function xavier_uniform_.<locals>.initializer>)[source]
ColLinear with given weight. Classifier of 1D parallelism.
- Parameters
in_features (int) – size of each input sample.
num_classes (int) – number of classes.
weight (torch.nn.Parameter, optional) – weight of the classifier, defaults to None.
bias (bool, optional) – If set to False, the layer will not learn an additive bias, defaults to True.
dtype (torch.dtype, optional) – The dtype of parameters, defaults to None.
weight_initializer (typing.Callable, optional) – The initializer of weight, defaults to kaiming uniform initializer.
bias_initializer (typing.Callable, optional) – The initializer of bias, defaults to xavier uniform initializer.
For more details about initializer, please refer to init.
- class colossalai.nn.layer.parallel_1d.VocabParallelEmbedding1D(num_embeddings, embedding_dim, padding_idx=None, dtype=None, weight_initializer=<function normal_.<locals>.initializer>, *args, **kwargs)[source]
Embedding parallelized in the vocabulary dimension.
- Parameters
num_embeddings (int) – number of embeddings.
embedding_dim (int) – dimension of embedding.
padding_idx (int, optional) – If specified, the entries at padding_idx do not contribute to the gradient; therefore, the embedding vector at padding_idx is not updated during training, i.e. it remains as a fixed “pad”, defaults to None.
dtype (torch.dtype, optional) – The dtype of parameters, defaults to None.
weight_initializer (typing.Callable, optional) – The initializer of weight, defaults to normal initializer.
The args and kwargs used in torch.nn.functional.embedding may contain:
max_norm (float, optional): If given, each embedding vector with norm larger than max_norm is renormalized to have norm max_norm. Note: this will modify weight in-place.
norm_type (float, optional): The p of the p-norm to compute for the max_norm option. Default 2.
scale_grad_by_freq (bool, optional): If given, this will scale gradients by the inverse of frequency of the words in the mini-batch. Default False.
sparse (bool, optional): If True, gradient w.r.t. weight will be a sparse tensor. Default False.
More details about args and kwargs can be found in Embedding.
For more details about initializer, please refer to init.
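To make "parallelized in the vocabulary dimension" concrete, the sketch below assumes the common scheme in which each rank stores a contiguous range of rows of the embedding table, returns zeros for token ids outside its range, and the per-rank results are summed. The function `vocab_parallel_lookup` is a hypothetical name, and the sum stands in for an all-reduce; the library's actual implementation may differ in detail.

```python
def vocab_parallel_lookup(token_ids, table, parts):
    """Look up embeddings from a table whose rows are split across `parts` ranks."""
    per_rank = len(table) // parts
    dim = len(table[0])
    partials = []
    for rank in range(parts):
        start, end = rank * per_rank, (rank + 1) * per_rank
        shard = table[start:end]  # rows owned by this rank
        out = []
        for t in token_ids:
            if start <= t < end:
                out.append(list(shard[t - start]))  # local hit
            else:
                out.append([0.0] * dim)  # masked: another rank owns this id
        partials.append(out)
    # Summation across ranks stands in for all-reduce.
    return [[sum(p[i][d] for p in partials) for d in range(dim)]
            for i in range(len(token_ids))]


table = [[0.0, 0.1], [1.0, 1.1], [2.0, 2.1], [3.0, 3.1]]
ids = [3, 0, 2]
result = vocab_parallel_lookup(ids, table, parts=2)
```

Because exactly one rank owns each token id, the sum reproduces an ordinary lookup into the full table.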