colossalai.kernel.cuda_native.multihead_attention

class colossalai.kernel.cuda_native.multihead_attention.Config(max_batch_tokens: int, max_seq_len: int, hidden_size: int, nhead: int, attn_prob_dropout_ratio: float, hidden_dropout_ratio: float, norm_first: bool, fp16: bool)

class colossalai.kernel.cuda_native.multihead_attention.MultiHeadAttention(hidden_size, nhead, batch_size, max_seq_len, dropout=0.0, norm_first=False, fp16=True, pg=None)

Initialize the MultiHeadAttention.

Static variable:

layer_id: A layer-index counter that starts at 0 and increments by 1 each time a layer object is instantiated, e.g. if a model has 24 transformer layers, layer_id goes from 0 to 23.
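
The counter behavior described for layer_id can be sketched in plain Python. This is a minimal illustration of the class-level counter pattern, not the library's code: the class name CountedLayer and the instance attribute index are hypothetical.

```python
class CountedLayer:
    """Hypothetical layer class; only the counter pattern matters here."""

    layer_id = 0  # class-level (static) counter shared by every instance

    def __init__(self):
        # Store the current counter value on the instance,
        # then advance the shared counter for the next layer.
        self.index = CountedLayer.layer_id
        CountedLayer.layer_id += 1


# Building 24 layers assigns indices 0 through 23, in creation order.
layers = [CountedLayer() for _ in range(24)]
ids = [layer.index for layer in layers]
```

Because the counter lives on the class, it keeps counting across every instance created in the same process; a second model built afterwards would not start again from 0 unless the counter is reset.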

Parameters

hidden_size – Total hidden dimension of the model.
nhead – Number of parallel attention heads.
batch_size – Batch size for one forward pass.
max_seq_len – Maximum length of the input sequence.
dropout – Dropout probability.
norm_first – Whether to apply layer normalization before attention.
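
To show how these parameters could relate to the Config fields listed above, here is a hedged sketch using a stand-in dataclass with the same field names as the documented Config signature. The helper build_config is hypothetical, and three things are assumptions rather than documented behavior: that max_batch_tokens equals batch_size * max_seq_len, that the single dropout value feeds both dropout ratios, and that hidden_size must be divisible by nhead.

```python
from dataclasses import dataclass


@dataclass
class Config:
    """Stand-in with the same fields as the documented Config signature."""
    max_batch_tokens: int
    max_seq_len: int
    hidden_size: int
    nhead: int
    attn_prob_dropout_ratio: float
    hidden_dropout_ratio: float
    norm_first: bool
    fp16: bool


def build_config(hidden_size, nhead, batch_size, max_seq_len,
                 dropout=0.0, norm_first=False, fp16=True):
    """Hypothetical helper mapping constructor arguments onto Config fields."""
    # Assumption: each head gets an equal slice of the hidden dimension.
    if hidden_size % nhead != 0:
        raise ValueError("hidden_size must be divisible by nhead")
    return Config(
        max_batch_tokens=batch_size * max_seq_len,  # assumed token budget per forward pass
        max_seq_len=max_seq_len,
        hidden_size=hidden_size,
        nhead=nhead,
        attn_prob_dropout_ratio=dropout,  # assumed: one dropout value used for both ratios
        hidden_dropout_ratio=dropout,
        norm_first=norm_first,
        fp16=fp16,
    )


cfg = build_config(hidden_size=1024, nhead=16, batch_size=8, max_seq_len=512)
head_dim = cfg.hidden_size // cfg.nhead  # per-head dimension
```

With these example values, the assumed token budget is 8 * 512 = 4096 and each of the 16 heads works on a 64-dimensional slice.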