colossalai.nn.layer.parallel_sequence

class colossalai.nn.layer.parallel_sequence.TransformerSelfAttentionRing(hidden_size, num_attention_heads, attention_dropout, attention_mask_func, layer_number, apply_query_key_layer_scaling=False, convert_fp16_to_fp32_in_softmax=False, attn_mask_type=AttnMaskType.padding, masked_softmax_fusion=True, fp16=False, bf16=False)

Abstract class for a parallel self-attention layer. The self-attention layer takes an input of size [b, s, h] and returns an output of the same size.

Parameters

hidden_size (int) – hidden size.
num_attention_heads (int) – number of attention heads.
attention_dropout (float) – dropout probability for attention layer.
attention_mask_func (typing.Callable) – Mask function to be applied.
layer_number (int) – layer number, i.e. the index of this layer within the model.
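
As a rough illustration of the shape contract above, the following NumPy sketch runs plain (non-parallel) multi-head self-attention on an input of size [b, s, h] and returns an output of the same size. It is not the Colossal-AI implementation: the function name `self_attention` and the weight matrices `w_qkv` and `w_out` are hypothetical, and dropout, masking, and the ring communication are left out.

```python
import numpy as np

def self_attention(x, w_qkv, w_out, num_attention_heads):
    """Plain multi-head self-attention; x has shape [b, s, h]."""
    b, s, h = x.shape
    head_dim = h // num_attention_heads

    # Project to queries, keys and values, each of shape [b, s, h].
    q, k, v = np.split(x @ w_qkv, 3, axis=-1)

    def to_heads(t):
        # [b, s, h] -> [b, heads, s, head_dim]
        return t.reshape(b, s, num_attention_heads, head_dim).transpose(0, 2, 1, 3)

    q, k, v = to_heads(q), to_heads(k), to_heads(v)

    # Scaled dot-product scores of shape [b, heads, s, s].
    scores = q @ k.transpose(0, 1, 3, 2) / np.sqrt(head_dim)

    # Numerically stable softmax over the key dimension.
    scores = scores - scores.max(axis=-1, keepdims=True)
    probs = np.exp(scores)
    probs = probs / probs.sum(axis=-1, keepdims=True)

    # Weighted sum of values, then merge heads back to [b, s, h].
    context = (probs @ v).transpose(0, 2, 1, 3).reshape(b, s, h)
    return context @ w_out

# Hypothetical sizes chosen only for the example.
rng = np.random.default_rng(0)
hidden_size, num_heads = 8, 2
x = rng.standard_normal((2, 5, hidden_size))
w_qkv = rng.standard_normal((hidden_size, 3 * hidden_size))
w_out = rng.standard_normal((hidden_size, hidden_size))
out = self_attention(x, w_qkv, w_out, num_heads)
```

Here `out.shape` equals `x.shape`, which is the only property the description above promises.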

class colossalai.nn.layer.parallel_sequence.RingAV(*args, **kwargs)

Calculate AV (the product of the attention scores A and the values V) in a ring-exchange style.
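
One way to picture a ring-exchange AV computation is the single-process NumPy simulation below. It assumes each rank holds its own rows of the attention matrix and one chunk of the values, and that value chunks are passed to the next rank at every step. The function `ring_av` and its arguments are hypothetical names for illustration, and the real class would use distributed communication rather than a Python loop.

```python
import numpy as np

def ring_av(prob_chunks, v_chunks):
    """Simulate a ring-exchange A @ V on one process.

    prob_chunks[r]: [sub_seq, world_size * sub_seq] attention rows held by rank r.
    v_chunks[r]:    [sub_seq, d] value rows initially held by rank r.
    Returns one [sub_seq, d] output block per rank.
    """
    world_size = len(v_chunks)
    sub_seq, d = v_chunks[0].shape
    outputs = [np.zeros((sub_seq, d)) for _ in range(world_size)]
    held = list(range(world_size))  # held[r]: index of the value chunk on rank r

    for _ in range(world_size):
        for r in range(world_size):
            src = held[r]
            cols = slice(src * sub_seq, (src + 1) * sub_seq)
            # Accumulate the contribution of the value chunk currently on rank r.
            outputs[r] += prob_chunks[r][:, cols] @ v_chunks[src]
        # Ring step: rank r receives the chunk that rank r - 1 was holding.
        held = [held[(r - 1) % world_size] for r in range(world_size)]
    return outputs

# Hypothetical sizes: sequence length 8 split over 4 simulated ranks.
rng = np.random.default_rng(0)
A = rng.random((8, 8))
V = rng.standard_normal((8, 3))
av_blocks = ring_av(np.split(A, 4), np.split(V, 4))
```

Concatenating `av_blocks` along the sequence axis should match the full product `A @ V`, because every rank sees every value chunk exactly once over the course of the ring.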

class colossalai.nn.layer.parallel_sequence.RingQK(*args, **kwargs)

Calculate QK (the product of the queries Q and the keys K) in a ring-exchange style.
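
The ring-exchange idea for QK can be sketched in the same single-process way. The simulation below assumes each rank keeps its query chunk fixed while key chunks travel around the ring, so that after as many steps as there are ranks every rank has scores against the full sequence. `ring_qk` and its arguments are hypothetical names, and this is an illustration under those assumptions rather than the library's code.

```python
import numpy as np

def ring_qk(q_chunks, k_chunks):
    """Simulate a ring-exchange Q @ K^T on one process.

    q_chunks[r], k_chunks[r]: [sub_seq, d] query and key rows held by rank r.
    Returns one [sub_seq, world_size * sub_seq] score block per rank.
    """
    world_size = len(q_chunks)
    sub_seq = q_chunks[0].shape[0]
    scores = [np.zeros((sub_seq, world_size * sub_seq)) for _ in range(world_size)]
    held = list(range(world_size))  # held[r]: index of the key chunk on rank r

    for _ in range(world_size):
        for r in range(world_size):
            src = held[r]
            cols = slice(src * sub_seq, (src + 1) * sub_seq)
            # Local queries against the key chunk currently on rank r.
            scores[r][:, cols] = q_chunks[r] @ k_chunks[src].T
        # Ring step: rank r receives the chunk that rank r - 1 was holding.
        held = [held[(r - 1) % world_size] for r in range(world_size)]
    return scores

# Hypothetical sizes: sequence length 8 split over 4 simulated ranks.
rng = np.random.default_rng(0)
Q = rng.standard_normal((8, 4))
K = rng.standard_normal((8, 4))
qk_blocks = ring_qk(np.split(Q, 4), np.split(K, 4))
```

Stacking `qk_blocks` along the sequence axis should reproduce `Q @ K.T`, while no simulated rank ever holds more than one key chunk at a time.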