Mixture-of-Experts

DSelect-k is a continuously differentiable and sparse gate for Mixture-of-Experts (MoE), based on a novel binary encoding formulation. Given a user-specified parameter $k$, the gate selects at most $k$ of the $n$ experts. The gate can be trained with first-order methods, such as stochastic gradient descent (SGD), and offers explicit control over the number of experts selected. This explicit control over sparsity leads to a cardinality-constrained optimization problem, which is computationally challenging. To circumvent this challenge, the authors use an equivalent unconstrained reformulation that enforces the cardinality constraint implicitly through a binary encoding scheme. By carefully smoothing the binary encoding variables, the reformulated problem can be optimized effectively with first-order methods such as SGD.

The motivation for this method is that existing sparse gates, such as Top-k, are not smooth. The lack of smoothness can lead to convergence and statistical performance issues when training with gradient-based methods.
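
To make the binary encoding construction concrete, below is a minimal, illustrative sketch of the static (input-independent) gate in PyTorch; it is not the authors' reference implementation. Each of the $k$ selectors holds $\lceil \log_2 n \rceil$ binary-encoding variables that are passed through a cubic smooth-step, and the resulting single-expert selections are mixed by a softmax over $k$ combination weights. The class and parameter names (`DSelectKGate`, `z`, `w`, `gamma`), the initialization scale, and the assumption that the number of experts is a power of two are all illustrative choices; the entropy regularizer the paper uses to push the smoothed bits toward {0, 1} is omitted.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F


def smooth_step(t, gamma=1.0):
    """Cubic smooth-step: 0 below -gamma/2, 1 above gamma/2, smooth in between."""
    poly = (-2.0 / gamma**3) * t**3 + (3.0 / (2.0 * gamma)) * t + 0.5
    return torch.where(t <= -gamma / 2, torch.zeros_like(t),
                       torch.where(t >= gamma / 2, torch.ones_like(t), poly))


class DSelectKGate(nn.Module):
    """Sketch of a static DSelect-k gate: selects at most k of n experts.

    Assumes n_experts is a power of two for simplicity.
    """

    def __init__(self, n_experts, k, gamma=1.0):
        super().__init__()
        self.m = math.ceil(math.log2(n_experts))  # bits per single-expert selector
        self.gamma = gamma
        # z: k selectors, each with m binary-encoding variables (small init so S(z) ~ 0.5).
        self.z = nn.Parameter(torch.empty(k, self.m).uniform_(-gamma / 100, gamma / 100))
        # w: logits for the convex combination of the k selectors.
        self.w = nn.Parameter(torch.zeros(k))
        # Precompute the binary code of each expert index, shape (n_experts, m).
        codes = torch.tensor(
            [[(i >> b) & 1 for b in range(self.m)] for i in range(n_experts)],
            dtype=torch.float32)
        self.register_buffer("codes", codes)

    def forward(self):
        s = smooth_step(self.z, self.gamma)  # (k, m), entries in [0, 1]
        # probs[q, i] = prod_b [ s[q, b] if bit b of i is 1 else (1 - s[q, b]) ]
        probs = torch.prod(
            s.unsqueeze(1) * self.codes + (1 - s.unsqueeze(1)) * (1 - self.codes),
            dim=-1)                          # (k, n_experts)
        alpha = F.softmax(self.w, dim=0)     # combination weights over the k selectors
        return alpha @ probs                 # (n_experts,) gate weights summing to 1
```

During training the gate output is dense but becomes exactly sparse once the smooth-step saturates each selector's bits to 0 or 1; for example, `DSelectKGate(n_experts=8, k=2)()` returns an 8-dimensional weight vector with at most 2 nonzeros at convergence.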

Source: DSelect-k: Differentiable Selection in the Mixture of Experts with Applications to Multi-Task Learning

Tasks


Task                     Papers  Share
Recommendation Systems   2       50.00%
Language Modelling       1       25.00%
Multi-Task Learning      1       25.00%


Categories