colossalai.nn.layer.moe

class colossalai.nn.layer.moe.Experts(expert, num_experts, **expert_args)

A wrapper class that creates experts. It creates E experts across the MoE model parallel group, where E is the number of experts. Every expert is an instance of the class passed as the expert argument at initialization.

Parameters

expert – The class of all experts
num_experts (int) – The number of experts
expert_args – Args used to initialize experts
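
The construction pattern described above can be sketched in plain Python. This is only an illustration of "one instance of the expert class per expert, all built from the same arguments"; ToyExpert and build_experts are hypothetical names, and the sketch ignores the distribution of experts across the parallel group.

```python
class ToyExpert:
    """Hypothetical expert class used only for this illustration."""

    def __init__(self, d_in, d_out):
        self.d_in = d_in
        self.d_out = d_out


def build_experts(expert, num_experts, **expert_args):
    # One independent instance of `expert` per slot, all built from the same arguments.
    return [expert(**expert_args) for _ in range(num_experts)]


experts = build_experts(ToyExpert, 4, d_in=8, d_out=16)
```

Each entry in the resulting list is a separate object, so the experts do not share state.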

class colossalai.nn.layer.moe.Top1Router(capacity_factor_train=1.25, capacity_factor_eval=2.0, min_capacity=4, select_policy='first', noisy_func=None, drop_tks=True)

Top-1 router that returns the dispatch mask [s, e, c] and the combine weight [s, e, c] used for routing. More details can be found in Google's Switch Transformer paper.

Parameters

capacity_factor_train (float, optional) – Capacity factor used for routing during training
capacity_factor_eval (float, optional) – Capacity factor used for routing during evaluation
min_capacity (int, optional) – The minimum capacity of each expert
select_policy (str, optional) – The policy for token selection
noisy_func (Callable, optional) – Noise function applied to the logits
drop_tks (bool, optional) – Whether to drop tokens in evaluation
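
A minimal sketch of top-1 routing with a capacity limit may help to read the parameters above. It makes several assumptions that are not stated in this reference: [s, e, c] is read as (tokens, experts, capacity slots), the capacity is taken as the capacity factor times the even share of tokens per expert with min_capacity as a lower bound, and tokens are served in order until an expert is full. The function names are hypothetical and this is not the library's implementation.

```python
import math


def expert_capacity(num_tokens, num_experts, capacity_factor, min_capacity):
    # Assumed formula: scale the even share of tokens per expert, then apply the lower bound.
    return max(min_capacity, math.floor(capacity_factor * num_tokens / num_experts))


def top1_dispatch(logits, capacity):
    """Return a [s][e][c] 0/1 dispatch mask for a list of per-token logit rows."""
    num_experts = len(logits[0])
    mask = [[[0] * capacity for _ in range(num_experts)] for _ in logits]
    used = [0] * num_experts  # slots already taken in each expert
    for s, row in enumerate(logits):
        e = max(range(num_experts), key=row.__getitem__)  # index of the top-1 expert
        if used[e] < capacity:  # tokens arriving after the expert is full are dropped
            mask[s][e][used[e]] = 1
            used[e] += 1
    return mask


logits = [[2.0, 0.1], [1.5, 0.3], [0.2, 0.9], [3.0, 0.0]]
capacity = expert_capacity(num_tokens=4, num_experts=2, capacity_factor=1.0, min_capacity=1)
mask = top1_dispatch(logits, capacity)
```

In this toy input three tokens prefer expert 0 but only two slots exist, so the last token receives an all-zero mask and is dropped.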

class colossalai.nn.layer.moe.Top2Router(capacity_factor_train=1.25, capacity_factor_eval=2.0, min_capacity=4, noisy_func=None, drop_tks=True)

Top-2 router that returns the dispatch mask [s, e, c] and the combine weight [s, e, c] used for routing. More details can be found in the ViT-MoE paper.

Parameters

capacity_factor_train (float, optional) – Capacity factor used for routing during training
capacity_factor_eval (float, optional) – Capacity factor used for routing during evaluation
min_capacity (int, optional) – The minimum capacity of each expert
noisy_func (Callable, optional) – Noise function applied to the logits
drop_tks (bool, optional) – Whether to drop tokens in evaluation
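
To illustrate how top-2 routing differs from top-1, the sketch below computes combine weights for a single token. It assumes the gate applies a softmax over the logits and that the two selected gate values are renormalised to sum to 1; both are common choices but are assumptions here, and top2_weights is a hypothetical helper rather than part of the library.

```python
import math


def softmax(row):
    m = max(row)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [x / total for x in exps]


def top2_weights(logits_row):
    """Return {expert_index: weight} for the two highest-probability experts."""
    probs = softmax(logits_row)
    first, second = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)[:2]
    norm = probs[first] + probs[second]
    # Assumption: the two gate values are renormalised so that they sum to 1.
    return {first: probs[first] / norm, second: probs[second] / norm}


weights = top2_weights([2.0, 0.5, 1.0, -1.0])
```

The token's output would then be the weighted sum of the outputs of the two selected experts.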

class colossalai.nn.layer.moe.MoeLayer(dim_model, num_experts, router, experts)

A MoE layer that feeds its input tensor to its gate and uses the output logits to route all tokens. It is mainly used to exchange tokens among the experts across the MoE tensor group via all-to-all communication. It then collects the outputs of all experts, exchanges them back, and finally returns the output of the MoE system.

Parameters

dim_model (int) – Dimension of model
num_experts (int) – The number of experts
router (nn.Module) – Instance of router used in routing
experts (nn.Module) – Instance of experts generated by Experts
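
The dispatch, compute, and combine steps described for this layer can be sketched in a single process. The example below replaces the all-to-all exchange with a local grouping of tokens by expert, uses scalar tokens and a hard-coded gate, and ignores capacity and combine weights; moe_forward and the toy gate are hypothetical and only show the data flow.

```python
def moe_forward(tokens, gate, experts):
    """tokens: list of floats; gate: token -> expert index; experts: list of callables."""
    # Dispatch: group token positions by the expert chosen by the gate.
    buckets = {e: [] for e in range(len(experts))}
    for pos, tok in enumerate(tokens):
        buckets[gate(tok)].append(pos)
    # Expert computation, then combine: write each result back to its original position.
    out = [None] * len(tokens)
    for e, positions in buckets.items():
        for pos in positions:
            out[pos] = experts[e](tokens[pos])
    return out


experts = [lambda x: x * 2, lambda x: x + 100]
gate = lambda x: 0 if x < 0 else 1  # hypothetical gate: negative tokens go to expert 0
result = moe_forward([-1.0, 3.0, -2.5, 0.0], gate, experts)
```

In a distributed setting the grouping step corresponds to the first all-to-all exchange and the write-back step to the second.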

class colossalai.nn.layer.moe.NormalNoiseGenerator(num_experts)

Generates a random noise mask for the logits tensor.

All noise is generated from a normal distribution (0, 1 / E^2), where E = the number of experts.

Parameters: num_experts (int) – The number of experts
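
A small sketch of additive normal noise on a row of logits follows. The notation (0, 1 / E^2) above does not say whether 1 / E^2 is the variance or the standard deviation; this sketch assumes it is the standard deviation, and with the other reading the value would be 1 / E. The helper normal_noise is hypothetical and works on plain Python lists rather than tensors.

```python
import random


def normal_noise(logits, num_experts, rng):
    # Assumption: 1 / E**2 is used as the standard deviation of the noise.
    std = 1.0 / num_experts ** 2
    return [x + rng.gauss(0.0, std) for x in logits]


rng = random.Random(0)  # seeded generator so the example is reproducible
noisy = normal_noise([0.5, 1.0, -0.3, 2.0], num_experts=4, rng=rng)
```

Because the noise scale shrinks as the number of experts grows, the perturbation stays small relative to typical logit differences.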

colossalai.nn.layer.moe.layers