colossalai.kernel.cuda_native.scaled_softmax

This code is adapted from NVIDIA Megatron with some changes.

class colossalai.kernel.cuda_native.scaled_softmax.AttnMaskType(value): An enumeration.

class colossalai.kernel.cuda_native.scaled_softmax.ScaledUpperTriangMaskedSoftmax(*args, **kwargs)

Fused operation that performs the following three operations in sequence:

Scale the tensor.

Apply an upper triangular mask (typically used in GPT models).

Perform softmax.
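
The three steps above can be sketched as an unfused reference in plain Python. This is only an illustration of the arithmetic, not the CUDA kernel itself; the function name `scaled_causal_softmax` and the nested-list input are assumptions made for this example.

```python
import math

def scaled_causal_softmax(scores, scale):
    """Unfused reference: scale, apply a causal mask, then softmax per row.

    scores is a square list of lists, where scores[i][j] is the score of
    query position i against key position j (hypothetical layout).
    """
    out = []
    for i, row in enumerate(scores):
        # Steps 1 and 2: scale, and keep only keys j <= i
        # (entries above the diagonal are masked out).
        visible = [x * scale for x in row[: i + 1]]
        # Step 3: numerically stable softmax over the visible entries.
        row_max = max(visible)
        exps = [math.exp(x - row_max) for x in visible]
        total = sum(exps)
        # Masked positions receive probability 0.
        out.append([e / total for e in exps] + [0.0] * (len(row) - i - 1))
    return out

causal_probs = scaled_causal_softmax(
    [[1.0, 2.0, 3.0],
     [1.0, 1.0, 5.0],
     [0.0, 0.0, 0.0]],
    scale=0.5,
)
```

Each output row sums to 1, and row `i` assigns zero probability to every position after `i`.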

class colossalai.kernel.cuda_native.scaled_softmax.ScaledMaskedSoftmax(*args, **kwargs)

Fused operation that performs the following three operations in sequence:

Scale the tensor.

Apply the mask.

Perform softmax.
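
A minimal unfused sketch of the same sequence with an explicit mask, in plain Python. It assumes that `True` in the mask marks a position to exclude and that excluded scores are replaced by a large negative value before the softmax; both conventions, and the names `scaled_masked_softmax` and `MASK_FILL`, are assumptions for this illustration rather than the kernel's documented behaviour.

```python
import math

MASK_FILL = -10000.0  # assumed fill value for masked positions

def scaled_masked_softmax(scores, mask, scale):
    """Unfused reference: scale, apply an explicit mask, then softmax per row.

    scores and mask are lists of lists with the same shape;
    mask[i][j] == True means position (i, j) is excluded (assumption).
    """
    out = []
    for row, mask_row in zip(scores, mask):
        # Steps 1 and 2: scale the visible scores, overwrite the masked ones.
        scaled = [MASK_FILL if hidden else x * scale
                  for x, hidden in zip(row, mask_row)]
        # Step 3: numerically stable softmax over the row.
        row_max = max(scaled)
        exps = [math.exp(x - row_max) for x in scaled]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

masked_probs = scaled_masked_softmax(
    [[2.0, 2.0, 4.0]],
    [[False, False, True]],
    scale=0.5,
)
```

With a finite fill value, a row whose entries are all masked still yields a valid (uniform) distribution instead of NaN, which is one reason a sketch like this might prefer it over negative infinity.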

class colossalai.kernel.cuda_native.scaled_softmax.FusedScaleMaskSoftmax(input_in_fp16, input_in_bf16, attn_mask_type, scaled_masked_softmax_fusion, mask_func, softmax_in_fp32, scale)

Fused operation: scaling + mask + softmax

Parameters

input_in_fp16 – Flag indicating whether the input is in fp16 format.
input_in_bf16 – Flag indicating whether the input is in bf16 format.
attn_mask_type – Attention mask type (padding or causal).
scaled_masked_softmax_fusion – Flag indicating whether the user wants to use softmax fusion.
mask_func – Mask function to be applied.
softmax_in_fp32 – If True, softmax is performed at fp32 precision.
scale – Scaling factor applied to the input tensor.
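
To show how `scale` and `mask_func` fit together, here is a plain-Python sketch of the unfused order of operations (scaling, then mask, then softmax). The `mask_func(scores, mask)` call signature, the treatment of `scale=None` as "no scaling", and the helper `fill_masked` are all assumptions for this example; precision handling (`input_in_fp16`, `input_in_bf16`, `softmax_in_fp32`) is omitted because plain Python floats do not model it.

```python
import math

def fill_masked(scores, mask, fill=-10000.0):
    """Hypothetical mask function: overwrite masked scores with a large negative value."""
    return [[fill if hidden else x for x, hidden in zip(row, mask_row)]
            for row, mask_row in zip(scores, mask)]

def scale_mask_softmax(scores, mask, mask_func, scale=None):
    """Unfused sketch: optional scaling, optional masking, softmax over each row."""
    if scale is not None:  # assumption: None means "do not scale"
        scores = [[x * scale for x in row] for row in scores]
    if mask is not None:
        scores = mask_func(scores, mask)
    out = []
    for row in scores:
        # Numerically stable softmax over the last dimension.
        row_max = max(row)
        exps = [math.exp(x - row_max) for x in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

with_mask = scale_mask_softmax(
    [[4.0, 4.0, 0.0]], [[False, False, True]], fill_masked, scale=0.25
)
without_mask = scale_mask_softmax([[0.0, 0.0]], None, fill_masked)
```

Passing the mask function as an argument keeps the masking convention (fill value, meaning of `True`) outside the softmax routine, so callers can swap it without touching the rest.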