Search Results for author: Rachel S. Y. Teo

Found 6 papers, 6 papers with code

MoLEx: Mixture of Layer Experts for Finetuning with Sparse Upcycling

1 code implementation • 14 Mar 2025 • Rachel S. Y. Teo, Tan M. Nguyen

We then propose the Mixture of Layer Experts (MoLEx), a novel sparse mixture of experts (SMoE) whose experts are layers in the pre-trained model.

Mixture-of-Experts • parameter-efficient fine-tuning • +1
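
The snippet above describes an SMoE whose experts are the layers of a pre-trained backbone. As a rough illustration only (the class name, top-1 per-sequence routing, and gating below are assumptions, not details from the paper), a layer-as-expert mixture could be sketched like this:

```python
import torch
import torch.nn as nn

class LayerAsExpertMoE(nn.Module):
    """Sketch of a sparse MoE whose experts are pre-trained layers.

    Hypothetical names and top-1, per-sequence routing; the paper's actual
    routing granularity and gating are not given in the snippet.
    """

    def __init__(self, pretrained_layers, d_model):
        super().__init__()
        self.experts = nn.ModuleList(pretrained_layers)          # backbone layers reused as experts
        self.gate = nn.Linear(d_model, len(pretrained_layers))   # learned router

    def forward(self, x):                                        # x: (batch, seq, d_model)
        scores = self.gate(x.mean(dim=1))                        # one routing score vector per sequence
        top1 = scores.argmax(dim=-1)                             # sparse: pick a single layer-expert
        outputs = [self.experts[int(i)](x[b:b + 1]) for b, i in enumerate(top1)]
        return torch.cat(outputs, dim=0)                         # (batch, seq, d_model)
```

Here `pretrained_layers` could be, for example, the frozen blocks of an existing transformer, each mapping (batch, seq, d_model) to the same shape.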

CAMEx: Curvature-aware Merging of Experts

1 code implementation • 26 Feb 2025 • Dung V. Nguyen, Minh H. Nguyen, Luc Q. Nguyen, Rachel S. Y. Teo, Tan M. Nguyen, Linh Duy Tran

In this paper, we introduce CAMEx (Curvature-Aware Merging of Experts), a novel expert merging protocol that incorporates natural gradients to account for the non-Euclidean curvature of the parameter manifold.
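
The merging idea in the snippet can be pictured as a parameter average in which each expert's contribution is preconditioned by a curvature estimate. The sketch below uses a diagonal Fisher approximation as that estimate; this is a generic curvature-aware merge for illustration, not the paper's exact protocol, and the function name and diagonal-Fisher choice are assumptions.

```python
import torch

def curvature_aware_merge(expert_params, fisher_diags):
    """Merge same-shaped expert weight tensors with diagonal-Fisher preconditioning.

    expert_params: list of tensors with identical shapes (one per expert).
    fisher_diags:  list of same-shaped tensors approximating parameter curvature.
    Illustrative only: a diagonal natural-gradient-style weighting, not CAMEx itself.
    """
    weighted = sum(f * p for f, p in zip(fisher_diags, expert_params))
    total = sum(fisher_diags) + 1e-8   # avoid division by zero
    return weighted / total            # curvature-weighted average of the experts
```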

Tight Clusters Make Specialized Experts

1 code implementation • 21 Feb 2025 • Stefan K. Nielsen, Rachel S. Y. Teo, Laziz U. Abdullaev, Tan M. Nguyen

Our AC router enables the MoE model to obtain three connected benefits: 1) faster convergence, 2) better robustness to data corruption, and 3) overall performance improvement, as experts are specialized in semantically distinct regions of the input space.

Clustering • Language Modeling • +2

MomentumSMoE: Integrating Momentum into Sparse Mixture of Experts

1 code implementation • 18 Oct 2024 • Rachel S. Y. Teo, Tan M. Nguyen

Sparse Mixture of Experts (SMoE) has become the key to unlocking unparalleled scalability in deep learning.

Language Modeling • Language Modelling • +2
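
Judging from the title alone, one way to picture "integrating momentum" into an SMoE stack is to treat each layer's output as an update direction and accumulate it with a heavy-ball term. The recursion below is an illustrative guess under that assumption, not the paper's formulation; `smoe_layers` and `beta` are hypothetical.

```python
import torch

def momentum_smoe_stack(x, smoe_layers, beta=0.9):
    """Illustrative heavy-ball recursion over a stack of SMoE layers.

    Each layer's output is mixed into an exponentially accumulated momentum
    buffer before the residual step. A guess at what "integrating momentum"
    could look like, not the paper's method.
    """
    momentum = torch.zeros_like(x)
    for layer in smoe_layers:
        update = layer(x)                    # SMoE layer output as an update direction
        momentum = beta * momentum + update  # heavy-ball accumulation
        x = x + momentum                     # residual step with momentum
    return x
```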

Unveiling the Hidden Structure of Self-Attention via Kernel Principal Component Analysis

1 code implementation • 19 Jun 2024 • Rachel S. Y. Teo, Tan M. Nguyen

In our work, we derive self-attention from kernel principal component analysis (kernel PCA) and show that self-attention projects its query vectors onto the principal component axes of its key matrix in a feature space.

Image Segmentation • Language Modeling • +2
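
The claim that self-attention projects query vectors onto the principal component axes of the key matrix can be read against the standard attention formula. The sketch below paraphrases the snippet with generic notation (a feature map φ and principal axes u_j are placeholders, not the paper's symbols):

```latex
% Standard scaled dot-product self-attention.
\mathrm{Attn}(Q, K, V) \;=\; \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d}}\right) V

% Kernel-PCA reading of the snippet: with a feature map \phi induced by the
% attention kernel, the output h_i for query q_i collects its coordinates
% along the principal component axes u_1, \dots, u_d of the keys
% \{\phi(k_t)\}_{t=1}^{N} in feature space:
h_i \;=\; \bigl(\langle \phi(q_i),\, u_1\rangle,\; \dots,\; \langle \phi(q_i),\, u_d\rangle\bigr)
```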

Elliptical Attention

1 code implementation • 19 Jun 2024 • Stefan K. Nielsen, Laziz U. Abdullaev, Rachel S. Y. Teo, Tan M. Nguyen

Pairwise dot-product self-attention is key to the success of transformers that achieve state-of-the-art performance across a variety of applications in language and vision.

Image Segmentation • Language Modeling • +2
