Math for Megatron Mixture-of-Experts (MoE)
Notations:
- \(s\) - sequence length
- \(b\) - micro-batch size
- \(h\) - hidden dimension size
- \(L\) - number of transformer layers
- \(P\) - number of parameters
- \(p_{etp}\) - degree of expert tensor parallelism
- \(p_{tp}\) - degree of tensor parallelism
- \(p_{dp}\) - degree of data parallelism
- \(p_{pp}\) - degree of pipeline parallelism
- \(p_{ep}\) - degree of expert parallelism
- \(e_{local\_n}\) - number of local experts
- \(e_i\) - interval of MoE layers among the transformer layers (one MoE layer every \(e_i\) layers)
- \(e_{top-k}\) - the top-k value configured in the MoE algorithm
- \(v\) - vocabulary size
- \(h_{ff}\) - feed-forward (MLP) hidden size; the simplified expressions below assume \(h_{ff} = 4h\)
Memory Estimation
The following memory estimates are based on the Megatron-LM GPT model with experts and the distributed optimizer.
The activation memory consumption of the model is:
\[M_{full\_activation} = sbhL * (13 + 19\frac{e_{local\_n} + e_i - 1}{e_i}) + 2sbh + 4sbv\]
With tensor parallelism, sequence parallelism and expert tensor parallelism, the activation memory consumption of the model is:
\[M_{activation} = \frac{M_{full\_activation}}{p_{tp}}\]
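As a quick sanity check, here is a minimal Python sketch of these two activation formulas. The function name, its arguments, and the interpretation of the result as bytes per GPU are assumptions for illustration; this is not a Megatron-LM API.

```python
def moe_activation_memory(s, b, h, L, v, e_local_n, e_i, p_tp):
    """Hypothetical helper: activation memory from the formulas above (assumed to be in bytes)."""
    # M_full_activation = sbhL * (13 + 19 * (e_local_n + e_i - 1) / e_i) + 2sbh + 4sbv
    m_full = (s * b * h * L * (13 + 19 * (e_local_n + e_i - 1) / e_i)
              + 2 * s * b * h + 4 * s * b * v)
    # With tensor parallelism, sequence parallelism and expert tensor
    # parallelism, the activations are sharded across the TP group.
    return m_full / p_tp
```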
With tensor parallelism, sequence parallelism and expert tensor parallelism, the static memory consumption of the model is:
\[\begin{aligned} M_{static} &= M_{grad} + M_{model\_state} + M_{optimizer\_state} \\ &= \frac{P_{dense} + P_{MoE}}{p_{tp}} * 4 + \frac{P_{dense} + p_{ep}P_{MoE}}{p_{tp}p_{dp}} * 16 \\ &= \frac{P_{dense}}{p_{tp}p_{dp}}(4p_{dp}+16) + \frac{P_{MoE}}{p_{tp}p_{dp}}(4p_{dp}+16p_{ep}) \end{aligned}\]
\[\begin{aligned} P_{dense} &= M_{embedding} + (M_{attn} + M_{mlp} * \frac{e_i - 1}{e_i}) * L \\ &= hv + (4h^2 + 3hh_{ff}\frac{e_i - 1}{e_i})L \\ &= 12h^2L(\frac{v}{12hL} + \frac{1}{3} + \frac{e_i - 1}{e_i}) \end{aligned}\]
\[\begin{aligned} P_{MoE} &= M_{mlp} * \frac{e_{local\_n}}{e_i} * L \\ &= 2hh_{ff} * \frac{e_{local\_n}}{e_i} * L \\ &= 8h^2L * \frac{e_{local\_n}}{e_i} \end{aligned}\]
\[\begin{aligned} P &= M_{embedding} + (M_{attn} + M_{mlp}) * L \\ &= hv + (4h^2 + 3hh_{ff})L \\ &= hv + 16h^2L \end{aligned}\]
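A corresponding sketch for the static (weights, gradients, optimizer-state) memory. The function and argument names are hypothetical; the byte factors (4 and 16 per parameter) are taken directly from the formula above.

```python
def moe_static_memory(h, v, L, h_ff, e_local_n, e_i, p_tp, p_dp, p_ep):
    """Hypothetical helper: static memory (assumed bytes) with the distributed optimizer."""
    # Dense parameters: embedding plus attention in every layer, plus the
    # dense MLP in (e_i - 1) out of every e_i layers.
    p_dense = h * v + (4 * h ** 2 + 3 * h * h_ff * (e_i - 1) / e_i) * L
    # Expert parameters: e_local_n local experts in every e_i-th layer.
    p_moe = 2 * h * h_ff * (e_local_n / e_i) * L
    # 4 bytes/param for weights + gradients, plus 16 bytes/param of
    # optimizer state sharded over the data-parallel group.
    m_weights_grads = (p_dense + p_moe) / p_tp * 4
    m_optimizer = (p_dense + p_ep * p_moe) / (p_tp * p_dp) * 16
    return m_weights_grads + m_optimizer
```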
The total memory consumption of the model is:
\[M_{total} = M_{activation} + M_{static}\]
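Putting the two estimates together for one illustrative configuration (all numeric values below are made up for the example, not taken from the text; the result is interpreted as bytes):

```python
# Illustrative configuration (assumed values, not from the text).
s, b, h, L, v, h_ff = 4096, 1, 4096, 32, 128000, 4 * 4096
e_local_n, e_i = 2, 1                 # every layer is an MoE layer, 2 local experts
p_tp, p_dp, p_ep = 4, 8, 8

m_activation = (s * b * h * L * (13 + 19 * (e_local_n + e_i - 1) / e_i)
                + 2 * s * b * h + 4 * s * b * v) / p_tp
p_dense = h * v + (4 * h ** 2 + 3 * h * h_ff * (e_i - 1) / e_i) * L
p_moe = 2 * h * h_ff * (e_local_n / e_i) * L
m_static = ((p_dense + p_moe) / p_tp * 4
            + (p_dense + p_ep * p_moe) / (p_tp * p_dp) * 16)

m_total = m_activation + m_static
print(f"estimated memory per GPU: {m_total / 2**30:.1f} GiB")
```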
FLOPs Calculation
The model FLOPs per iteration are:
\[48sbh^2L(\frac{e_{top-k} + e_i - 1}{e_i} + \frac{1}{2} + \frac{s}{4h} + \frac{v}{8hL})\]
To explain this formula: the FLOPs of each transformer layer in a dense GPT model are \(72sbh^2 + 12s^2bh\), with \(48sbh^2\) for the MLP and \(24sbh^2 + 12s^2bh\) for attention.
For the MoE model, the MLP computation changes: the FLOPs of each MoE layer are multiplied by \(e_{top-k}\), so the formula becomes
\[\begin{aligned} C &= C_{MoE} \frac{L}{e_i} * e_{top-k} + C_{dense\_mlp} \frac{L(e_i - 1)}{e_i} + C_{attention}L + C_{embedding} \\ &= 48sbh^2L \frac{e_{top-k}}{e_i} + 48sbh^2L \frac{e_i-1}{e_i} + 24sbh^2L + 12s^2bhL + 6sbhv \\ &= 48sbh^2L(\frac{e_{top-k} + e_i - 1}{e_i} + \frac{1}{2} + \frac{s}{4h} + \frac{v}{8hL}) \\ \end{aligned}\]
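The same result as a short Python sketch. The function name is hypothetical, and the result is assumed to count both the forward and backward passes, consistent with "per iteration" above.

```python
def moe_model_flops_per_iteration(s, b, h, L, v, e_top_k, e_i):
    """Hypothetical helper: model FLOPs per iteration from the formula above."""
    return 48 * s * b * h ** 2 * L * (
        (e_top_k + e_i - 1) / e_i     # MoE layers + dense MLP layers
        + 1 / 2                       # attention projections (24sbh^2 L)
        + s / (4 * h)                 # attention score/context matmuls (12s^2bhL)
        + v / (8 * h * L)             # logits / output embedding (6sbhv)
    )
```

Dividing this value by the measured iteration time and by the aggregate peak FLOP/s of the GPUs gives an estimate of model FLOPs utilization (MFU).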