colossalai.kernel.cuda_native.scaled_softmax

This code is adapted from NVIDIA Megatron, with some changes.

class colossalai.kernel.cuda_native.scaled_softmax.AttnMaskType(value)

An enumeration of attention mask types.

class colossalai.kernel.cuda_native.scaled_softmax.ScaledUpperTriangMaskedSoftmax(*args, **kwargs)

Fused operation that performs the following three operations in sequence:

  1. Scale the tensor.

  2. Apply an upper triangular (causal) mask (typically used in GPT models).

  3. Perform softmax.
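The three steps above can be sketched as a plain-Python reference for a single square score matrix. This is not the CUDA kernel and not a ColossalAI API; the function name `scaled_upper_triang_masked_softmax` is hypothetical, and it assumes that masked (strict upper triangle) positions end up with probability 0, which is how a causal mask is usually interpreted.

```python
import math


def scaled_upper_triang_masked_softmax(scores, scale):
    """Hypothetical reference (non-fused) version for one square [sq][sk] matrix.

    Query row i may attend only to key columns 0..i; columns j > i
    (the strict upper triangle) are masked out.
    """
    out = []
    for i, row in enumerate(scores):
        # 1. Scale the row.
        scaled = [scale * x for x in row]
        # 2. Causal mask: keep columns j <= i, drop the strict upper triangle.
        kept = scaled[: i + 1]
        # 3. Numerically stable softmax over the kept entries.
        m = max(kept)
        exps = [math.exp(x - m) for x in kept]
        total = sum(exps)
        # Masked positions receive probability exactly 0.
        out.append([e / total for e in exps] + [0.0] * (len(row) - i - 1))
    return out


probs = scaled_upper_triang_masked_softmax([[1.0, 2.0], [3.0, 4.0]], 0.5)
```

The first row can only see itself, so it becomes `[1.0, 0.0]`; the fused kernel performs the same math in one pass over GPU memory instead of three.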

class colossalai.kernel.cuda_native.scaled_softmax.ScaledMaskedSoftmax(*args, **kwargs)

Fused operation that performs the following three operations in sequence:

  1. Scale the tensor.

  2. Apply the mask.

  3. Perform softmax.
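For comparison, here is a plain-Python sketch of scale, mask, softmax with an explicit mask. It assumes the Megatron-style convention that a mask entry of `True` means "hide this position" and that hidden scores are overwritten with a large negative value (`-10000.0` here) before the softmax; the function name and the fill constant are illustrative assumptions, not the kernel's documented interface.

```python
import math

# Assumption: large negative fill value for masked scores (Megatron-style).
MASK_FILL = -10000.0


def scaled_masked_softmax(scores, mask, scale):
    """Hypothetical reference version for one [sq][sk] score matrix.

    mask[i][j] is True where key j must be hidden from query i.
    """
    out = []
    for row, mask_row in zip(scores, mask):
        # 1. Scale each score, then 2. overwrite masked ones with MASK_FILL.
        vals = [MASK_FILL if masked else scale * x
                for x, masked in zip(row, mask_row)]
        # 3. Numerically stable softmax over the whole row.
        m = max(vals)
        exps = [math.exp(v - m) for v in vals]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


probs = scaled_masked_softmax([[1.0, 2.0, 3.0]], [[False, False, True]], 1.0)
```

Unlike the causal variant, the mask here is an arbitrary input (for example, padding positions), so the same routine covers any masking pattern.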

class colossalai.kernel.cuda_native.scaled_softmax.FusedScaleMaskSoftmax(input_in_fp16, input_in_bf16, attn_mask_type, scaled_masked_softmax_fusion, mask_func, softmax_in_fp32, scale)

Fused operation: scaling + mask + softmax

Parameters
  • input_in_fp16 – Flag indicating whether the input is in fp16 format.

  • input_in_bf16 – Flag indicating whether the input is in bf16 format.

  • attn_mask_type – Attention mask type (padding or causal).

  • scaled_masked_softmax_fusion – Flag indicating whether to use the fused scale-mask-softmax kernel.

  • mask_func – Mask function to be applied.

  • softmax_in_fp32 – If True, softmax is performed in fp32 precision.

  • scale – Scaling factor applied to the input tensor.
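The parameters above interact to decide which code path runs. The sketch below is a hypothetical model of that decision, assuming Megatron-like behaviour: fp16 and bf16 are mutually exclusive, scaling requires fp32 softmax, the fused kernels are used only for half-precision input with fusion enabled, a causal mask selects ScaledUpperTriangMaskedSoftmax and any other mask selects ScaledMaskedSoftmax. The real class may impose further conditions (for example on tensor shapes); the `AttnMaskType` member names and the `choose_softmax_path` helper are assumptions for illustration.

```python
from enum import Enum


class AttnMaskType(Enum):
    # Hypothetical mirror of the documented enum; member names are assumptions.
    padding = 1
    causal = 2


def choose_softmax_path(input_in_fp16, input_in_bf16, attn_mask_type,
                        scaled_masked_softmax_fusion, softmax_in_fp32, scale):
    """Return a label for the path these flags would select (sketch only)."""
    if input_in_fp16 and input_in_bf16:
        raise ValueError("input cannot be both fp16 and bf16")
    if scale is not None and not softmax_in_fp32:
        # Assumption carried over from Megatron: scaling requires fp32 softmax.
        raise ValueError("softmax should be in fp32 when scaled")
    input_in_float16 = input_in_fp16 or input_in_bf16
    # Assumption: fused kernels handle only half-precision input.
    if scaled_masked_softmax_fusion and input_in_float16:
        if attn_mask_type is AttnMaskType.causal:
            return "ScaledUpperTriangMaskedSoftmax"
        return "ScaledMaskedSoftmax"
    # Fallback: optional fp32 upcast, scale, mask_func, softmax, cast back.
    return "unfused"


print(choose_softmax_path(True, False, AttnMaskType.causal, True, True, None))
```

With fp16 input, fusion enabled, and a causal mask, this prints `ScaledUpperTriangMaskedSoftmax`; fp32 input always falls back to the unfused path, where `mask_func` and `softmax_in_fp32` take effect.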