TL;DR: Despite their success, MoEs suffer from a number of issues: training instability, token dropping, inability to scale the number of experts, or ineffective finetuning. In this work, we propose Soft MoE, a fully-differentiable sparse Transformer that addresses these challenges while maintaining the benefits of MoEs.
From Sparse to Soft Mixtures of Experts http://t.cn/A60WVfhz ChatPaper summary: This paper proposes a method called Soft MoE to address these issues while preserving the advantages of MoEs. Soft MoE works by passing a different weighted combination of all input tokens to each expert...
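For concreteness, here is a minimal sketch of the soft routing described above, written in JAX. The function name `soft_moe`, the parameter names, and the two-layer MLP experts are illustrative assumptions, not the paper's reference implementation: token-slot logits are computed once, a softmax over tokens builds each expert slot's input as a weighted combination of all tokens, each expert processes only its own slots, and a softmax over slots combines the slot outputs back into per-token outputs.

```python
import jax
import jax.numpy as jnp

def soft_moe(x, phi, w1, w2, slots_per_expert):
    """Sketch of a Soft MoE layer (shapes and expert form are assumptions).

    x:   (m, d)        input tokens
    phi: (d, n * p)    learnable slot parameters (n experts, p slots per expert)
    w1:  (n, d, h)     first weight of each expert MLP
    w2:  (n, h, d)     second weight of each expert MLP
    """
    m, d = x.shape
    n, p = w1.shape[0], slots_per_expert

    logits = x @ phi                               # (m, n*p) token-to-slot affinities
    dispatch = jax.nn.softmax(logits, axis=0)      # normalize over tokens: each slot is a convex mix of ALL tokens
    combine  = jax.nn.softmax(logits, axis=1)      # normalize over slots: each token mixes ALL slot outputs

    slot_in = dispatch.T @ x                       # (n*p, d) weighted combinations of input tokens
    slot_in = slot_in.reshape(n, p, d)

    # Each expert (a small MLP here) processes only its own p slots.
    hidden   = jax.nn.gelu(jnp.einsum('npd,ndh->nph', slot_in, w1))
    slot_out = jnp.einsum('nph,nhd->npd', hidden, w2)
    slot_out = slot_out.reshape(n * p, d)

    return combine @ slot_out                      # (m, d) one output per input token
```

Because both softmaxes are dense and differentiable, no tokens are dropped and no discrete top-k routing decision is made, which is what makes the layer fully differentiable while keeping per-expert compute fixed by the number of slots.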