colossalai.nn.layer.moe

class colossalai.nn.layer.moe.Experts(expert_cls, num_experts, **expert_args)

A wrapper class to create experts. It creates E experts across the MoE model parallel group, where E is the number of experts. Every expert is an instance of the class passed as expert_cls at initialization.

Parameters
  • expert_cls – The class of all experts

  • num_experts (int) – The number of experts

  • expert_args – Args used to initialize experts
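
As a rough illustration of the wrapper idea, the sketch below builds several independent instances of one expert class from shared keyword arguments. It is plain Python: the `build_experts` helper and the toy `ScaleExpert` class are hypothetical names, and the sketch ignores how experts are spread across the MoE model parallel group, so it should not be read as the library's implementation.

```python
class ScaleExpert:
    """Toy expert (hypothetical): multiplies every input value by a constant."""

    def __init__(self, scale=1.0):
        self.scale = scale

    def __call__(self, values):
        return [v * self.scale for v in values]


def build_experts(expert_cls, num_experts, **expert_args):
    """Create num_experts separate instances of expert_cls.

    Assumption: every expert receives the same keyword arguments.
    """
    return [expert_cls(**expert_args) for _ in range(num_experts)]


experts = build_experts(ScaleExpert, 4, scale=2.0)
outputs = [expert([1.0, 2.0]) for expert in experts]
```

Each entry of `experts` is a distinct object, so in a real model every expert would hold its own parameters.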

class colossalai.nn.layer.moe.FFNExperts(num_experts, d_model, d_ff, activation=None, drop_rate=0)

Uses torch.bmm to speed up computation across multiple experts.
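
A batched matrix multiplication handles one matrix product per expert in a single call instead of a Python loop over experts. The plain-Python sketch below only spells out the result such a call produces; the `batched_matmul` helper and the shapes are illustrative assumptions, not the FFNExperts code.

```python
def matmul(a, b):
    """Multiply an (n x k) matrix by a (k x m) matrix given as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def batched_matmul(inputs, weights):
    """Reference semantics of a batched matmul: out[e] = inputs[e] @ weights[e]."""
    if len(inputs) != len(weights):
        raise ValueError("inputs and weights must have the same batch size")
    return [matmul(x, w) for x, w in zip(inputs, weights)]


# Two experts, each with its own tokens (2 x 2) and its own weight matrix (2 x 1).
tokens_per_expert = [[[1.0, 2.0], [3.0, 4.0]], [[1.0, 0.0], [0.0, 1.0]]]
weights_per_expert = [[[1.0], [1.0]], [[2.0], [3.0]]]
result = batched_matmul(tokens_per_expert, weights_per_expert)
```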

class colossalai.nn.layer.moe.TPExperts(num_experts, d_model, d_ff, activation=None, drop_rate=0)

Uses tensor parallelism to split each expert evenly, so that experts can be deployed when the number of experts can’t be divided by the maximum expert parallel size or the maximum expert parallel size can’t be divided by the number of experts.

class colossalai.nn.layer.moe.Top1Router(capacity_factor_train=1.25, capacity_factor_eval=2.0, min_capacity=4, select_policy='first', noisy_func=None, drop_tks=True)

Top-1 router that returns the dispatch mask [s, e, c] and the combine weight [s, e, c] used for routing. More details can be found in Google's Switch Transformer paper (https://arxiv.org/abs/2101.03961).

Parameters
  • capacity_factor_train (float, optional) – Capacity factor in routing during training

  • capacity_factor_eval (float, optional) – Capacity factor in routing during evaluation

  • min_capacity (int, optional) – The minimum capacity of each expert

  • select_policy (str, optional) – The token selection policy

  • noisy_func (Callable, optional) – Noise function applied to the logits

  • drop_tks (bool, optional) – Whether to drop tokens during evaluation
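
To make the capacity parameters and the [s, e, c] outputs more concrete, here is a minimal plain-Python sketch of top-1 routing. Several things in it are assumptions rather than the library's behavior: the capacity formula (a common formulation from the MoE literature), the reading of [s, e, c] as tokens, experts, and capacity slots, and the interpretation of the 'first' policy as serving tokens in order. The helper names are hypothetical.

```python
import math


def expert_capacity(num_tokens, num_experts, capacity_factor, min_capacity, k=1):
    """Slots per expert. Assumed formula: floor(k * factor * s / e), at least min_capacity."""
    capacity = math.floor(k * capacity_factor * num_tokens / num_experts)
    return max(capacity, min_capacity)


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [x / total for x in exps]


def top1_route(logits, capacity):
    """Return (dispatch_mask, combine_weight), both shaped [s][e][c].

    Tokens are served in order (the assumed meaning of 'first'); a token whose
    chosen expert is already full is dropped and keeps all-zero entries.
    """
    num_experts = len(logits[0])
    used = [0] * num_experts  # slots already taken per expert
    dispatch, combine = [], []
    for token_logits in logits:
        probs = softmax(token_logits)
        expert = max(range(num_experts), key=probs.__getitem__)
        mask = [[0] * capacity for _ in range(num_experts)]
        weight = [[0.0] * capacity for _ in range(num_experts)]
        if used[expert] < capacity:
            slot = used[expert]
            mask[expert][slot] = 1
            weight[expert][slot] = probs[expert]
            used[expert] += 1
        dispatch.append(mask)
        combine.append(weight)
    return dispatch, combine


# Three tokens all prefer expert 0, but capacity is 2, so the third is dropped.
logits = [[2.0, 0.0], [3.0, 1.0], [1.5, 0.5]]
dispatch_mask, combine_weight = top1_route(logits, capacity=2)
```

With the default training settings (capacity factor 1.25, minimum capacity 4), `expert_capacity(64, 4, 1.25, 4)` gives 20 slots per expert under the assumed formula, while a small batch of 8 tokens falls back to the minimum of 4.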

class colossalai.nn.layer.moe.Top2Router(capacity_factor_train=1.25, capacity_factor_eval=2.0, min_capacity=4, noisy_func=None, drop_tks=True)

Top-2 router that returns the dispatch mask [s, e, c] and the combine weight [s, e, c] used for routing. More details can be found in the ViT-MoE paper (https://arxiv.org/abs/2106.05974).

Parameters
  • capacity_factor_train (float, optional) – Capacity factor in routing during training

  • capacity_factor_eval (float, optional) – Capacity factor in routing during evaluation

  • min_capacity (int, optional) – The minimum capacity of each expert

  • noisy_func (Callable, optional) – Noise function applied to the logits

  • drop_tks (bool, optional) – Whether to drop tokens during evaluation
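
The core of top-2 routing is choosing the two highest-scoring experts for each token. The plain-Python sketch below shows that selection for a single token and leaves out capacity handling. The `top2_select` helper is hypothetical, and renormalizing the two gate probabilities so they sum to 1 is an assumption of this sketch, not a documented property of Top2Router.

```python
import math


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [x / total for x in exps]


def top2_select(token_logits):
    """Return [(expert_index, weight), (expert_index, weight)] for one token.

    Assumption: the two selected gate probabilities are renormalized to sum to 1.
    """
    probs = softmax(token_logits)
    ranked = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)
    first, second = ranked[0], ranked[1]
    denom = probs[first] + probs[second]
    return [(first, probs[first] / denom), (second, probs[second] / denom)]


# Expert 1 has the largest logit and expert 2 the second largest.
choices = top2_select([0.1, 2.0, 1.0, -1.0])
```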

class colossalai.nn.layer.moe.MoeLayer(dim_model, num_experts, router, experts)

A MoE layer that passes its input tensor through a gate and uses the output logits to route all tokens. It exchanges tokens among the experts across the MoE tensor group by all-to-all communication, collects the output of every expert, exchanges the outputs back, and finally returns the output of the MoE system.

Parameters
  • dim_model (int) – Dimension of model

  • num_experts (int) – The number of experts

  • router (nn.Module) – Instance of router used in routing

  • experts (nn.Module) – Instance of experts generated by Experts
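
The dispatch, compute, and combine steps can be sketched on a single process, with plain nested lists standing in for tensors and ordinary indexing standing in for the all-to-all exchange. The `moe_forward` helper, the toy experts, and the reading of [s, e, c] as tokens, experts, and capacity slots are assumptions made for illustration only.

```python
def moe_forward(tokens, dispatch_mask, combine_weight, experts):
    """Single-process stand-in for dispatch -> experts -> combine.

    tokens:         [s][d] token vectors
    dispatch_mask:  [s][e][c] 0/1 entries placing token s in slot c of expert e
    combine_weight: [s][e][c] gate weight for the same placement
    experts:        list of e callables mapping a [d] vector to a [d] vector
    """
    num_experts = len(experts)
    capacity = len(dispatch_mask[0][0])
    dim = len(tokens[0])

    # Dispatch: fill each expert's slots (an all-to-all exchange when distributed).
    slots = [[[0.0] * dim for _ in range(capacity)] for _ in range(num_experts)]
    for s, token in enumerate(tokens):
        for e in range(num_experts):
            for c in range(capacity):
                if dispatch_mask[s][e][c]:
                    slots[e][c] = list(token)

    # Each expert processes its own slots.
    expert_out = [[experts[e](slots[e][c]) for c in range(capacity)]
                  for e in range(num_experts)]

    # Combine: weighted sum of the expert outputs routed back to each token.
    outputs = []
    for s in range(len(tokens)):
        out = [0.0] * dim
        for e in range(num_experts):
            for c in range(capacity):
                w = combine_weight[s][e][c]
                if w:
                    out = [o + w * y for o, y in zip(out, expert_out[e][c])]
        outputs.append(out)
    return outputs


def double(vector):
    return [2.0 * x for x in vector]


def negate(vector):
    return [-x for x in vector]


tokens = [[1.0, 2.0], [3.0, 4.0]]
dispatch_mask = [[[1], [0]], [[0], [1]]]  # token 0 -> expert 0, token 1 -> expert 1
combine_weight = [[[0.5], [0.0]], [[0.0], [1.0]]]
outputs = moe_forward(tokens, dispatch_mask, combine_weight, [double, negate])
```

A token that was dropped by the router has all-zero combine weights, so its output in this sketch is a zero vector.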

class colossalai.nn.layer.moe.NormalNoiseGenerator(num_experts)

Generates a random noisy mask for the logits tensor.

All noise is generated from a normal distribution (0, 1 / E^2), where E = the number of experts.

Parameters

num_experts (int) – The number of experts
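
A minimal sketch of this kind of noise, using only the standard library. Two points are assumptions: the noise is added to the logits, and 1 / E^2 is treated as the standard deviation of the normal distribution (the description above does not say whether it is the variance or the standard deviation). The `add_normal_noise` helper is hypothetical.

```python
import random


def add_normal_noise(logits, num_experts, rng=random):
    """Add zero-mean Gaussian noise to each logit.

    Assumption: 1 / E^2 is used as the standard deviation of the noise.
    """
    std = 1.0 / num_experts ** 2
    return [x + rng.gauss(0.0, std) for x in logits]


rng = random.Random(0)  # seeded so the example is reproducible
logits = [0.5, 1.5, -0.2, 0.0]
noisy = add_normal_noise(logits, num_experts=4, rng=rng)
```

With more experts the noise shrinks quickly: 4 experts give a scale of 1/16, while 16 experts give 1/256.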

class colossalai.nn.layer.moe.UniformNoiseGenerator(eps=0.01)

Generates a random noisy mask for the logits tensor. Copied from Mesh TensorFlow: multiplies values by a random number between 1-epsilon and 1+epsilon. This makes models more resilient to rounding errors introduced by bfloat16, which seems particularly important for logits.

Parameters

eps (float) – Epsilon in generator
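
The multiplicative jitter described above can be sketched in a few lines of plain Python. The `jitter` helper is a hypothetical name, and the sketch works on a list of floats rather than a tensor.

```python
import random


def jitter(logits, eps=0.01, rng=random):
    """Multiply each logit by a random factor drawn uniformly from [1 - eps, 1 + eps]."""
    return [x * rng.uniform(1.0 - eps, 1.0 + eps) for x in logits]


rng = random.Random(0)  # seeded so the example is reproducible
logits = [2.0, -1.0, 0.5]
jittered = jitter(logits, eps=0.01, rng=rng)
```

Because the noise is multiplicative, each value keeps its sign and changes by at most eps in relative terms.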

class colossalai.nn.layer.moe.MoeModule(dim_model, num_experts, top_k=1, capacity_factor_train=1.25, capacity_factor_eval=2.0, min_capacity=4, noisy_policy=None, drop_tks=True, use_residual=False, residual_instance=None, expert_instance=None, expert_cls=None, **expert_args)

A class for users to create MoE modules in their models.

Parameters
  • dim_model (int) – Hidden dimension of training model

  • num_experts (int) – The number of experts

  • top_k (int, optional) – The number of experts each token is dispatched to

  • capacity_factor_train (float, optional) – Capacity factor in routing during training

  • capacity_factor_eval (float, optional) – Capacity factor in routing during evaluation

  • min_capacity (int, optional) – The minimum capacity of each expert

  • noisy_policy (str, optional) – The noise function policy. ‘Jitter’ and ‘Gaussian’ are currently supported. ‘Jitter’ is described in the Switch Transformer paper (https://arxiv.org/abs/2101.03961). ‘Gaussian’ is described in the ViT-MoE paper (https://arxiv.org/abs/2106.05974).

  • drop_tks (bool, optional) – Whether to drop tokens during evaluation

  • use_residual (bool, optional) – Makes this MoE layer a Residual MoE. More information can be found in the Microsoft paper (https://arxiv.org/abs/2201.05596).

  • residual_instance (nn.Module, optional) – The instance of the residual module in Residual MoE

  • expert_instance (MoeExperts, optional) – The instance of experts module in MoeLayer

  • expert_cls (Type[nn.Module], optional) – The class of each expert when no instance is given

  • expert_args (optional) – The args of expert when no instance is given
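
One way to read the top_k and noisy_policy options is as selectors for the router and noise generator classes documented in this module. The plain-Python sketch below records that reading as a lookup that returns class names as strings. The mapping is an assumption based on the descriptions here (jitter as multiplicative uniform noise, Gaussian as normal noise), and `plan_moe_components` is a hypothetical helper, not part of the library.

```python
def plan_moe_components(top_k=1, noisy_policy=None):
    """Return the names of the router and noise generator matching the options.

    Assumption: top_k selects between Top1Router and Top2Router, 'Jitter'
    corresponds to UniformNoiseGenerator, and 'Gaussian' to NormalNoiseGenerator.
    """
    routers = {1: "Top1Router", 2: "Top2Router"}
    noise_generators = {
        None: None,
        "Jitter": "UniformNoiseGenerator",
        "Gaussian": "NormalNoiseGenerator",
    }
    if top_k not in routers:
        raise ValueError(f"unsupported top_k: {top_k!r}")
    if noisy_policy not in noise_generators:
        raise ValueError(f"unsupported noisy_policy: {noisy_policy!r}")
    return routers[top_k], noise_generators[noisy_policy]


plan = plan_moe_components(top_k=2, noisy_policy="Gaussian")
```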