Mixture-of-Experts
464 papers with code • 0 benchmarks • 0 datasets
Most implemented papers
Distilling the Knowledge in a Neural Network
A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions.
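A minimal sketch of the distillation objective the paper proposes, assuming a PyTorch setup; the temperature and mixing weight below are illustrative defaults, not values prescribed by the paper.

import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    """Combine soft targets from the teacher (softened by temperature T)
    with the usual cross-entropy on the hard labels."""
    # KL divergence between temperature-softened distributions, scaled by T^2
    # so the soft-target gradients keep a comparable magnitude.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard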
Modeling Task Relationships in Multi-task Learning with Multi-gate Mixture-of-Experts
In this work, we propose a novel multi-task learning approach, Multi-gate Mixture-of-Experts (MMoE), which explicitly learns to model task relationships from data.
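A minimal PyTorch sketch of the multi-gate idea: all tasks share one pool of experts, but each task gets its own softmax gate over those experts. Layer sizes, the ReLU experts, and the two-task default are illustrative choices, not details taken from the paper.

import torch
import torch.nn as nn

class MMoE(nn.Module):
    """Shared experts with one softmax gate per task (multi-gate MoE)."""
    def __init__(self, in_dim, expert_dim, n_experts=4, n_tasks=2):
        super().__init__()
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(in_dim, expert_dim), nn.ReLU())
             for _ in range(n_experts)]
        )
        self.gates = nn.ModuleList(
            [nn.Linear(in_dim, n_experts) for _ in range(n_tasks)]
        )
        self.towers = nn.ModuleList(
            [nn.Linear(expert_dim, 1) for _ in range(n_tasks)]
        )

    def forward(self, x):
        # Every expert sees every input; the per-task gates decide the mixing.
        expert_out = torch.stack([e(x) for e in self.experts], dim=1)  # (B, E, D)
        outputs = []
        for gate, tower in zip(self.gates, self.towers):
            w = torch.softmax(gate(x), dim=-1).unsqueeze(-1)           # (B, E, 1)
            mixed = (w * expert_out).sum(dim=1)                        # (B, D)
            outputs.append(tower(mixed))
        return outputs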
Gated Multimodal Units for Information Fusion
The Gated Multimodal Unit (GMU) model is intended to be used as an internal unit in a neural network architecture whose purpose is to find an intermediate representation based on a combination of data from different modalities.
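A minimal PyTorch sketch of a bimodal GMU along the lines the paper describes: a sigmoid gate interpolates, per dimension, between tanh projections of the two modalities. Dimensions and names here are illustrative.

import torch
import torch.nn as nn

class GatedMultimodalUnit(nn.Module):
    """Bimodal GMU: a learned gate decides how much of each modality's
    hidden representation flows into the fused output."""
    def __init__(self, dim_a, dim_b, dim_out):
        super().__init__()
        self.proj_a = nn.Linear(dim_a, dim_out)
        self.proj_b = nn.Linear(dim_b, dim_out)
        self.gate = nn.Linear(dim_a + dim_b, dim_out)

    def forward(self, x_a, x_b):
        h_a = torch.tanh(self.proj_a(x_a))
        h_b = torch.tanh(self.proj_b(x_b))
        z = torch.sigmoid(self.gate(torch.cat([x_a, x_b], dim=-1)))
        return z * h_a + (1.0 - z) * h_b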
No Language Left Behind: Scaling Human-Centered Machine Translation
Driven by the goal of eradicating language barriers on a global scale, machine translation has solidified itself as a key focus of artificial intelligence research today.
Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity
We design models based on T5-Base and T5-Large to obtain up to 7x increases in pre-training speed with the same computational resources.
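A minimal PyTorch sketch of the switch (top-1) routing that gives these models their sparsity: each token is dispatched to a single expert chosen by a router, and the expert output is scaled by the router probability. The paper's load-balancing auxiliary loss and capacity factor are omitted here, and sizes are illustrative.

import torch
import torch.nn as nn

class SwitchLayer(nn.Module):
    """Top-1 (switch) routing: one expert per token."""
    def __init__(self, d_model, d_ff, n_experts=8):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                           nn.Linear(d_ff, d_model))
             for _ in range(n_experts)]
        )

    def forward(self, x):                       # x: (tokens, d_model)
        probs = torch.softmax(self.router(x), dim=-1)
        top_p, top_idx = probs.max(dim=-1)      # single expert per token
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = top_idx == i
            if mask.any():
                # Scale each routed token by its router probability.
                out[mask] = top_p[mask].unsqueeze(-1) * expert(x[mask])
        return out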
Qwen2 Technical Report
This report introduces the Qwen2 series, the latest addition to our large language models and large multimodal models.
Qwen2.5 Technical Report
In addition, for hosted solutions, the proprietary models currently include two mixture-of-experts (MoE) variants: Qwen2.5-Turbo and Qwen2.5-Plus, both available from Alibaba Cloud Model Studio.
Mixtral of Experts
In particular, Mixtral vastly outperforms Llama 2 70B on mathematics, code generation, and multilingual benchmarks.
Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer
In this work, we address these challenges and finally realize the promise of conditional computation, achieving greater than 1000x improvements in model capacity with only minor losses in computational efficiency on modern GPU clusters.
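A minimal PyTorch sketch of the noisy top-k gating at the heart of this layer: learned Gaussian noise is added to the router logits, only the k largest survive, and a softmax over the survivors yields sparse mixture weights. The paper's importance and load-balancing losses are omitted, and sizes are illustrative.

import torch
import torch.nn as nn
import torch.nn.functional as F

class NoisyTopKGate(nn.Module):
    """Noisy top-k gating for a sparsely-gated MoE layer."""
    def __init__(self, d_model, n_experts, k=2):
        super().__init__()
        self.w_gate = nn.Linear(d_model, n_experts, bias=False)
        self.w_noise = nn.Linear(d_model, n_experts, bias=False)
        self.k = k

    def forward(self, x):
        clean = self.w_gate(x)
        noise_std = F.softplus(self.w_noise(x))
        logits = clean + torch.randn_like(clean) * noise_std
        topk_val, topk_idx = logits.topk(self.k, dim=-1)
        # Mask everything outside the top k before the softmax,
        # so all other experts get exactly zero weight.
        masked = torch.full_like(logits, float("-inf"))
        masked.scatter_(-1, topk_idx, topk_val)
        return torch.softmax(masked, dim=-1)    # sparse mixture weights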
Robust Federated Learning by Mixture of Experts
We present a novel weighted average model based on the mixture of experts (MoE) concept to provide robustness in federated learning (FL) against poisoned, corrupted, or outdated local models.
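A hedged sketch of the general aggregation idea, not the paper's exact method: instead of uniformly averaging client models, weight them by a per-client score so that suspect updates contribute little. The scoring signal (client_scores) is a hypothetical placeholder for whatever gating the MoE-based scheme produces.

import torch

def weighted_average(client_states, client_scores):
    """Aggregate client state_dicts with a softmax over per-client scores
    instead of a uniform FedAvg, so low-scoring (e.g. poisoned or stale)
    updates receive small mixture weights."""
    weights = torch.softmax(torch.tensor(client_scores, dtype=torch.float32), dim=0)
    keys = client_states[0].keys()
    return {
        k: sum(w * s[k].float() for w, s in zip(weights, client_states))
        for k in keys
    }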