1 code implementation • 14 Mar 2025 • Rachel S. Y. Teo, Tan M. Nguyen
We propose the Mixture of Layer Experts (MoLEx), a novel sparse mixture of experts (SMoE) whose experts are layers in the pre-trained model.
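As a rough illustration of the idea of layers-as-experts, here is a minimal sketch in which the layers of a pre-trained model sit behind a learned top-1 gate and the routed layer's output is blended with the usual sequential output. The pooled gating input, top-1 routing, and equal-weight blend are assumptions for illustration, not the paper's exact MoLEx design.

```python
import torch
import torch.nn as nn

class LayerExpertMixture(nn.Module):
    """Sketch only: treat pre-trained layers as experts behind a learned gate.
    Gating input, top-1 routing, and the 0.5 blend are illustrative assumptions."""

    def __init__(self, pretrained_layers, d_model):
        super().__init__()
        self.layers = nn.ModuleList(pretrained_layers)          # pre-trained layers act as experts
        self.gate = nn.Linear(d_model, len(pretrained_layers))  # router over layer-experts

    def forward(self, x, layer_idx):                  # x: [batch, seq, d_model]
        h = self.layers[layer_idx](x)                 # usual sequential layer output
        scores = self.gate(x.mean(dim=1))             # route on a pooled representation
        chosen = scores.argmax(dim=-1)                # top-1 layer-expert per example
        routed = torch.stack([self.layers[int(i)](x[b:b + 1]).squeeze(0)
                              for b, i in enumerate(chosen)])
        return 0.5 * (h + routed)                     # blend sequential and routed outputs
```

In a setup like this, only the gate (and any blending weights) would be trained while the pre-trained layers stay frozen, which keeps the added parameter count small.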
1 code implementation • 26 Feb 2025 • Dung V. Nguyen, Minh H. Nguyen, Luc Q. Nguyen, Rachel S. Y. Teo, Tan M. Nguyen, Linh Duy Tran
In this paper, we introduce CAMEx (Curvature-Aware Merging of Experts), a novel expert merging protocol that incorporates natural gradients to account for the non-Euclidean curvature of the parameter manifold.
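To make the notion of curvature-aware merging concrete, the sketch below weights each expert's parameter delta by a diagonal Fisher estimate, a common stand-in for natural-gradient preconditioning. The function name, the diagonal-Fisher approximation, and the merging rule are illustrative assumptions; the exact CAMEx protocol is given in the paper.

```python
import torch

def curvature_aware_merge(base, experts, fishers, alphas):
    """Illustrative curvature-weighted merging of expert parameters.
    base, experts[i], fishers[i]: dicts mapping parameter name -> tensor;
    alphas: per-expert scalars. Not the exact CAMEx update."""
    merged = {}
    for name, theta0 in base.items():
        num = torch.zeros_like(theta0)
        den = torch.zeros_like(theta0)
        for expert, fisher, alpha in zip(experts, fishers, alphas):
            delta = expert[name] - theta0            # task-vector-style delta
            num += alpha * fisher[name] * delta      # curvature-weighted delta
            den += alpha * fisher[name]
        merged[name] = theta0 + num / (den + 1e-8)   # precondition by approximate curvature
    return merged
```

Compared with plain parameter averaging, weighting deltas by an estimated curvature down-weights directions in which the loss surface is sharp, which is the intuition behind using natural gradients for merging.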
1 code implementation • 21 Feb 2025 • Stefan K. Nielsen, Rachel S. Y. Teo, Laziz U. Abdullaev, Tan M. Nguyen
Our AC router enables the MoE model to achieve three connected benefits: 1) faster convergence, 2) better robustness to data corruption, and 3) overall performance improvement, as experts are specialized in semantically distinct regions of the input space.
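For context, the sketch below is the standard top-k softmax router used in sparse MoE layers; it is shown only as the baseline reference point, not the AC router itself, whose cluster-aware token assignment is described in the paper.

```python
import torch
import torch.nn as nn

class TopKRouter(nn.Module):
    """Baseline top-k softmax router for an SMoE layer (reference only)."""

    def __init__(self, d_model, n_experts, k=2):
        super().__init__()
        self.w_gate = nn.Linear(d_model, n_experts)   # token-to-expert affinities
        self.k = k

    def forward(self, tokens):                        # tokens: [n_tokens, d_model]
        logits = self.w_gate(tokens)
        topk_vals, topk_idx = logits.topk(self.k, dim=-1)
        gates = torch.softmax(topk_vals, dim=-1)      # renormalize over selected experts
        return gates, topk_idx                        # dispatch weights and expert indices
```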
1 code implementation • 18 Oct 2024 • Rachel S. Y. Teo, Tan M. Nguyen
Sparse Mixture of Experts (SMoE) has become the key to unlocking unparalleled scalability in deep learning.
1 code implementation • 19 Jun 2024 • Rachel S. Y. Teo, Tan M. Nguyen
In our work, we derive self-attention from kernel principal component analysis (kernel PCA) and show that self-attention projects its query vectors onto the principal component axes of its key matrix in a feature space.
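For reference, the standard self-attention output that the kernel-PCA derivation reinterprets is the softmax-weighted sum below; the projection reading restated in the comment follows the abstract, while the specific feature map and value parameterization are given in the paper, not here.

```latex
% Scaled dot-product self-attention for query q_i over keys k_j and values v_j:
\[
  \mathbf{h}_i \;=\; \sum_{j=1}^{N}
  \operatorname{softmax}\!\Big(\tfrac{\mathbf{q}_i^{\top}\mathbf{k}_j}{\sqrt{D}}\Big)\,\mathbf{v}_j .
\]
% Per the abstract, h_i can be read as the projection of the query, mapped into a
% feature space, onto the principal component axes of the key matrix in that space
% (the kernel-PCA view of self-attention).
```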
1 code implementation • 19 Jun 2024 • Stefan K. Nielsen, Laziz U. Abdullaev, Rachel S. Y. Teo, Tan M. Nguyen
Pairwise dot-product self-attention is key to the success of transformers that achieve state-of-the-art performance across a variety of applications in language and vision.
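As a baseline reference, here is a minimal implementation of the pairwise scaled dot-product self-attention the abstract refers to; this is the standard mechanism the paper builds on, not the paper's proposed variant, and the function name and argument shapes are chosen for illustration.

```python
import torch

def dot_product_self_attention(X, Wq, Wk, Wv):
    """Standard pairwise (scaled) dot-product self-attention.
    X: [batch, seq, d_model]; Wq, Wk, Wv: [d_model, d_head]."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv                        # project tokens to queries/keys/values
    scores = Q @ K.transpose(-2, -1) / K.shape[-1] ** 0.5   # pairwise similarities
    weights = torch.softmax(scores, dim=-1)                 # attention distribution per query
    return weights @ V                                      # weighted sum of values
```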