1 code implementation • 4 Feb 2024 • Matteo Pagliardini, Amirkeivan Mohtashami, Francois Fleuret, Martin Jaggi
The transformer architecture by Vaswani et al. (2017) is now ubiquitous across application domains, from natural language processing to speech processing and image understanding.
no code implementations • 18 Dec 2023 • Amirkeivan Mohtashami, Florian Hartmann, Sian Gooding, Lukas Zilka, Matt Sharifi, Blaise Aguera y Arcas
We present and evaluate two approaches for knowledge transfer between LLMs.
1 code implementation • 27 Nov 2023 • Zeming Chen, Alejandro Hernández Cano, Angelika Romanou, Antoine Bonnet, Kyle Matoba, Francesco Salvi, Matteo Pagliardini, Simin Fan, Andreas Köpf, Amirkeivan Mohtashami, Alexandre Sallinen, Alireza Sakhaeirad, Vinitra Swamy, Igor Krawczuk, Deniz Bayazit, Axel Marmet, Syrielle Montariol, Mary-Anne Hartley, Martin Jaggi, Antoine Bosselut
Large language models (LLMs) can potentially democratize access to medical knowledge.
Ranked #1 on Multiple Choice Question Answering (MCQA) on MedMCQA (Dev set, accuracy %)
Tasks: Conditional Text Generation, Multiple Choice Question Answering (MCQA)
no code implementations • 16 Oct 2023 • Amirkeivan Mohtashami, Matteo Pagliardini, Martin Jaggi
The race to continually develop ever larger and deeper foundational models is underway.
2 code implementations • 25 May 2023 • Amirkeivan Mohtashami, Martin Jaggi
While Transformers have shown remarkable success in natural language processing, their attention mechanism's large memory requirements have limited their ability to handle longer contexts.
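The limitation this abstract refers to is the quadratic memory cost of standard attention in the context length. The sketch below is only an illustration of that cost, not the paper's method; the sequence lengths, head dimension, and dtype are arbitrary assumptions.

```python
import numpy as np

# Minimal sketch (assumed setup, not the paper's method): shows why the memory of
# vanilla attention grows quadratically with context length.

def naive_attention(q, k, v):
    # q, k, v: (seq_len, d) arrays; the score matrix is (seq_len, seq_len)
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

for seq_len in (1024, 4096):
    d = 64  # assumed head dimension
    q = k = v = np.random.randn(seq_len, d).astype(np.float32)
    _, weights = naive_attention(q, k, v)
    print(f"context {seq_len:>5}: attention matrix alone takes {weights.nbytes / 2**20:.0f} MiB")
```

Quadrupling the context length multiplies the attention matrix's footprint by sixteen, which is why longer contexts quickly become impractical for the naive mechanism.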
no code implementations • 7 Feb 2023 • Amirkeivan Mohtashami, Mauro Verzetti, Paul K. Rubenstein
Learned metrics such as BLEURT have in recent years become widely employed to evaluate the quality of machine translation systems.
no code implementations • 30 May 2022 • Amirkeivan Mohtashami, Martin Jaggi, Sebastian Stich
However, we show through a novel set of experiments that the stochastic noise is not sufficient to explain good non-convex training, and that instead the effect of a large learning rate itself is essential for obtaining the best performance. We demonstrate the same effects in the noiseless case, i.e., for full-batch GD.
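As a rough illustration of the noiseless claim, the toy sketch below runs full-batch (deterministic) gradient descent on an arbitrary scalar non-convex function; the objective, starting point, and step sizes are assumptions, not the paper's experimental setup.

```python
# Toy illustration (assumed setup): full-batch gradient descent on the scalar
# non-convex function f(w) = w^4 - 3*w^2 + w. A small step size converges to the
# shallow minimum near w ~ +1.13, while a larger step size overshoots it and
# settles in the deeper minimum near w ~ -1.30, despite there being no noise.

def f(w):
    return w**4 - 3 * w**2 + w

def grad(w):
    return 4 * w**3 - 6 * w + 1

def full_batch_gd(w0, lr, steps=200):
    w = w0
    for _ in range(steps):
        w -= lr * grad(w)
    return w

for lr in (0.01, 0.12):
    w = full_batch_gd(w0=2.0, lr=lr)
    print(f"lr={lr:<5} -> w={w:+.2f}, f(w)={f(w):+.2f}")
```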
no code implementations • 16 Jun 2021 • Amirkeivan Mohtashami, Martin Jaggi, Sebastian U. Stich
State-of-the-art training algorithms for deep learning models are based on stochastic gradient descent (SGD).
no code implementations • 3 Mar 2021 • Sebastian U. Stich, Amirkeivan Mohtashami, Martin Jaggi
It has been experimentally observed that the efficiency of distributed training with stochastic gradient descent (SGD) depends decisively on the batch size and, in asynchronous implementations, on the gradient staleness.
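To make the notion of gradient staleness concrete, here is a hedged toy simulation (an assumption, not the paper's analysis) in which the update at step t uses a gradient evaluated on parameters from several steps earlier, mimicking an asynchronous worker.

```python
import random

# Toy simulation of gradient staleness (assumed setup): the server applies
# gradients that were computed on parameters from `staleness` iterations ago.
# Objective, delay, and step size are arbitrary and chosen so both runs stay stable.

def stochastic_grad(w, x):
    # gradient of the per-sample loss 0.5 * (w - x)^2
    return w - x

def delayed_sgd(lr, staleness, steps=200, seed=0):
    rng = random.Random(seed)
    data = [rng.gauss(1.0, 0.5) for _ in range(100)]  # samples centred around 1.0
    w, history = 5.0, []
    for t in range(steps):
        history.append(w)
        stale_w = history[max(0, t - staleness)]  # parameters `staleness` steps old
        w -= lr * stochastic_grad(stale_w, rng.choice(data))
    return w

print("synchronous (staleness 0):", round(delayed_sgd(lr=0.1, staleness=0), 3))
print("asynchronous (staleness 5):", round(delayed_sgd(lr=0.1, staleness=5), 3))
```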
1 code implementation • RC 2020 • Amirkeivan Mohtashami, Ehsan Pajouheshgar, Klim Kireev
We reproduce the results of the paper "On Warm-Starting Neural Network Training." In many real-world applications, the training data is not readily available and is accumulated over time.
no code implementations • 25 Sep 2019 • Amir Ali Moinfar, Amirkeivan Mohtashami, Mahdieh Soleymani, Ali Sharifi-Zarchi
Designing the architecture of deep neural networks (DNNs) requires human expertise and is a cumbersome task.