colossalai.nn.layer.moe.layers
- class colossalai.nn.layer.moe.layers.Top1Router(capacity_factor_train=1.25, capacity_factor_eval=2.0, min_capacity=4, select_policy='first', noisy_func=None, drop_tks=True)
Top-1 router that returns the dispatch mask [s, e, c] and the combine weight [s, e, c] (tokens, experts, capacity slots) used for routing. For details, see Google's Switch Transformer paper (https://arxiv.org/abs/2101.03961).
- Parameters
capacity_factor_train (float, optional) – Capacity factor in routing during training
capacity_factor_eval (float, optional) – Capacity factor in routing during evaluation
min_capacity (int, optional) – Minimum capacity of each expert
select_policy (str, optional) – Policy for selecting tokens
noisy_func (Callable, optional) – Noise function applied to the logits
drop_tks (bool, optional) – Whether to drop tokens during evaluation
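The following is a minimal, pure-Python sketch of the idea behind top-1 routing, not the library's implementation. The helper names (`expert_capacity`, `top1_route`), the capacity formula, and the reading of the 'first' policy as "earlier tokens win a slot" are assumptions made for illustration.

```python
import math

def expert_capacity(num_tokens, num_experts, capacity_factor=1.25, min_capacity=4):
    """Slots per expert. This formula is a common formulation, assumed here."""
    return max(min_capacity, math.floor(capacity_factor * num_tokens / num_experts))

def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [v / total for v in exps]

def top1_route(logits, capacity):
    """Return (dispatch_mask, combine_weight), each a nested list shaped [s][e][c]."""
    s, e = len(logits), len(logits[0])
    dispatch = [[[0] * capacity for _ in range(e)] for _ in range(s)]
    combine = [[[0.0] * capacity for _ in range(e)] for _ in range(s)]
    used = [0] * e  # slots already taken per expert
    for t, row in enumerate(logits):
        probs = softmax(row)
        best = max(range(e), key=lambda j: probs[j])
        slot = used[best]
        if slot < capacity:  # assumed 'first' policy: overflow tokens are dropped
            dispatch[t][best][slot] = 1
            combine[t][best][slot] = probs[best]
            used[best] += 1
    return dispatch, combine
```

With a capacity of 2 and three tokens that all prefer the same expert, the third token receives an all-zero row in both outputs, which is what dropping a token means in this sketch.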
- class colossalai.nn.layer.moe.layers.Top2Router(capacity_factor_train=1.25, capacity_factor_eval=2.0, min_capacity=4, noisy_func=None, drop_tks=True)
Top-2 router that returns the dispatch mask [s, e, c] and the combine weight [s, e, c] used for routing. For details, see the ViT-MoE paper (https://arxiv.org/abs/2106.05974).
- Parameters
capacity_factor_train (float, optional) – Capacity factor in routing during training
capacity_factor_eval (float, optional) – Capacity factor in routing during evaluation
min_capacity (int, optional) – Minimum capacity of each expert
noisy_func (Callable, optional) – Noise function applied to the logits
drop_tks (bool, optional) – Whether to drop tokens during evaluation
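As a rough illustration of what "top-2" means for a single token, the sketch below picks the two most probable experts and renormalizes their gate weights so they sum to 1. The function name `top2_choices` is hypothetical, and the renormalization step is an assumption of this sketch; the reference above does not state how the library weights the two experts.

```python
import math

def top2_choices(logits_row):
    """Return [(expert_index, weight), (expert_index, weight)] for one token."""
    m = max(logits_row)
    exps = [math.exp(x - m) for x in logits_row]
    total = sum(exps)
    probs = [v / total for v in exps]
    # Rank experts by probability, highest first.
    order = sorted(range(len(probs)), key=lambda j: probs[j], reverse=True)
    first, second = order[0], order[1]
    norm = probs[first] + probs[second]  # assumed renormalization over the pair
    return [(first, probs[first] / norm), (second, probs[second] / norm)]
```

Each chosen expert would then consume one capacity slot for the token, so a top-2 router needs roughly twice the slots of a top-1 router for the same batch.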
- class colossalai.nn.layer.moe.layers.FP32LinearGate(d_model, num_experts)
Gate module used in the MoE layer. It is a linear layer without bias whose weight is always kept in fp32.
- Parameters
d_model (int) – Hidden dimension of training model
num_experts (int) – The number of experts
- weight
The weight of linear gate
- Type
ForceFP32Parameter
- class colossalai.nn.layer.moe.layers.MoeLayer(dim_model, num_experts, router, experts)
An MoE layer that passes its input tensor through its gate and uses the output logits to route all tokens. Tokens are exchanged among the experts across the MoE tensor group by all-to-all communication. The layer then collects the outputs of all experts, exchanges them back, and finally returns the output of the MoE system.
- Parameters
dim_model (int) – Dimension of model
num_experts (int) – The number of experts
router (nn.Module) – Instance of router used in routing
experts (nn.Module) – Instance of experts generated by Expert
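The dispatch, compute, and combine steps described above can be sketched on a single process with plain Python lists. This is an illustration only: `moe_forward` is a hypothetical helper, experts are modeled as plain callables, and the all-to-all exchanges of the real layer are omitted.

```python
def moe_forward(tokens, dispatch, combine, experts):
    """tokens: [s][d]; dispatch and combine: [s][e][c]; experts: list of callables."""
    s, d = len(tokens), len(tokens[0])
    e, c = len(dispatch[0]), len(dispatch[0][0])
    # Dispatch: copy each routed token into its expert's buffer, shaped [e][c][d].
    buffers = [[[0.0] * d for _ in range(c)] for _ in range(e)]
    for t in range(s):
        for j in range(e):
            for k in range(c):
                if dispatch[t][j][k]:
                    buffers[j][k] = list(tokens[t])
    # Expert computation. In a distributed run, an all-to-all exchange would
    # happen before and after this step.
    outputs = [[experts[j](vec) for vec in buffers[j]] for j in range(e)]
    # Combine: weighted sum of expert outputs, restored to token order.
    result = [[0.0] * d for _ in range(s)]
    for t in range(s):
        for j in range(e):
            for k in range(c):
                w = combine[t][j][k]
                if w:
                    result[t] = [r + w * o for r, o in zip(result[t], outputs[j][k])]
    return result
```

A token whose rows in `dispatch` and `combine` are all zero (a dropped token) comes back as a zero vector in this sketch.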
- class colossalai.nn.layer.moe.layers.MoeModule(dim_model, num_experts, top_k=1, capacity_factor_train=1.25, capacity_factor_eval=2.0, min_capacity=4, noisy_policy=None, drop_tks=True, use_residual=False, residual_instance=None, expert_instance=None, expert_cls=None, **expert_args)
A class for users to create MoE modules in their models.
- Parameters
dim_model (int) – Hidden dimension of training model
num_experts (int) – The number of experts
top_k (int, optional) – The number of experts each token is dispatched to
capacity_factor_train (float, optional) – Capacity factor in routing during training
capacity_factor_eval (float, optional) – Capacity factor in routing during evaluation
min_capacity (int, optional) – Minimum capacity of each expert
noisy_policy (str, optional) – The policy of the noise function. Supported values are ‘Jitter’ and ‘Gaussian’. ‘Jitter’ is described in the Switch Transformer paper (https://arxiv.org/abs/2101.03961). ‘Gaussian’ is described in the ViT-MoE paper (https://arxiv.org/abs/2106.05974).
drop_tks (bool, optional) – Whether to drop tokens during evaluation
use_residual (bool, optional) – Makes this MoE layer a Residual MoE. More information can be found in Microsoft paper (https://arxiv.org/abs/2201.05596).
residual_instance (nn.Module, optional) – The instance of the residual module in a Residual MoE
expert_instance (MoeExperts, optional) – The instance of experts module in MoeLayer
expert_cls (Type[nn.Module], optional) – The class of each expert when no instance is given
expert_args (optional) – Keyword arguments for each expert when no instance is given
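To give a feel for the two noise policies named in `noisy_policy`, here is a small pure-Python sketch. The function names, the default `eps`, and the choice of standard deviation `1 / num_experts` are assumptions based on a reading of the cited papers; this reference does not state what the library does internally for each policy.

```python
import random

def jitter(values, eps=1e-2, rng=random):
    """Multiplicative noise: scale each value by a factor drawn from U(1 - eps, 1 + eps)."""
    return [v * rng.uniform(1.0 - eps, 1.0 + eps) for v in values]

def gaussian(logits, num_experts, rng=random):
    """Additive noise: add a sample from N(0, (1 / num_experts)^2) to each logit."""
    std = 1.0 / num_experts  # assumed scale
    return [x + rng.gauss(0.0, std) for x in logits]
```

Passing a seeded `random.Random` instance as `rng` makes the noise reproducible, which is convenient when comparing routing decisions between runs.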