4 code implementations • 5 Sep 2024 • Matteo Pagliardini, Pierre Ablin, David Grangier
This work questions the use of a single EMA to accumulate past gradients and empirically demonstrates how this choice can be sub-optimal: a single EMA cannot simultaneously give a high weight to the immediate past and a non-negligible weight to older gradients.
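As a minimal sketch of this argument (illustrative only, not the paper's optimizer): an EMA with decay $\beta$ gives the gradient from $k$ steps ago a weight of $(1-\beta)\beta^k$, so a small $\beta$ forgets old gradients almost immediately while a large $\beta$ barely weights recent ones; mixing a fast and a slow EMA keeps both properties. The mixing factor `alpha` below is a made-up illustration, not the paper's parameterization.

```python
import numpy as np

# m_t = beta * m_{t-1} + (1 - beta) * g_t  =>  weight on the gradient from k steps ago:
def ema_weights(beta, horizon):
    k = np.arange(horizon)
    return (1 - beta) * beta**k

fast = ema_weights(0.9, 10_000)     # heavy weight on the immediate past, old gradients vanish
slow = ema_weights(0.9999, 10_000)  # old gradients keep non-negligible weight, recent ones get little

# A single beta cannot do both at once; a mixture of two EMAs can (alpha is illustrative).
alpha = 0.5
mixed = (1 - alpha) * fast + alpha * slow

print(fast[0], fast[1000])    # ~1e-1 vs ~2e-47: recent heavy, old negligible
print(slow[0], slow[1000])    # ~1e-4 vs ~9e-5: old non-negligible, recent tiny
print(mixed[0], mixed[1000])  # retains both behaviors
```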
no code implementations • 4 Feb 2024 • Matteo Pagliardini, Amirkeivan Mohtashami, Francois Fleuret, Martin Jaggi
The transformer architecture by Vaswani et al. (2017) is now ubiquitous across application domains, from natural language processing to speech processing and image understanding.
1 code implementation • 27 Nov 2023 • Zeming Chen, Alejandro Hernández Cano, Angelika Romanou, Antoine Bonnet, Kyle Matoba, Francesco Salvi, Matteo Pagliardini, Simin Fan, Andreas Köpf, Amirkeivan Mohtashami, Alexandre Sallinen, Alireza Sakhaeirad, Vinitra Swamy, Igor Krawczuk, Deniz Bayazit, Axel Marmet, Syrielle Montariol, Mary-Anne Hartley, Martin Jaggi, Antoine Bosselut
Large language models (LLMs) can potentially democratize access to medical knowledge.
Ranked #1 for Multiple Choice Question Answering (MCQA) on MedMCQA (dev set, accuracy %)
no code implementations • 23 Oct 2023 • Simin Fan, Matteo Pagliardini, Martin Jaggi
Moreover, when aiming to generalize to out-of-domain target tasks that are unseen in the pretraining corpus (the OOD setting), DoGE effectively identifies inter-domain dependencies and consistently achieves better test perplexity on the target domain.
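For intuition, here is a hypothetical sketch of gradient-alignment-based domain reweighting: source domains whose gradients align with the target-domain gradient get upweighted. The function name, the mirror-ascent-style update, and the `lr` parameter are assumptions for illustration, not the exact DoGE algorithm.

```python
import torch

def reweight_domains(per_domain_grads, target_grad, weights, lr=0.1):
    """Hypothetical reweighting step: score each source domain by how well its
    (flattened) gradient aligns with the target-domain gradient, then renormalize."""
    scores = torch.stack([torch.dot(g, target_grad) for g in per_domain_grads])
    logits = torch.log(weights) + lr * scores   # multiplicative / mirror-ascent style update
    return torch.softmax(logits, dim=0)         # keep the domain weights on the simplex

# Toy usage: 3 source domains, dummy 5-dimensional gradients.
g = [torch.randn(5) for _ in range(3)]
gt = torch.randn(5)
w = torch.full((3,), 1 / 3)
print(reweight_domains(g, gt, w))
```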
no code implementations • 16 Oct 2023 • Amirkeivan Mohtashami, Matteo Pagliardini, Martin Jaggi
Scaling language models to larger and deeper sizes has led to significant boosts in performance.
2 code implementations • 1 Jun 2023 • Matteo Pagliardini, Daniele Paliotta, Martin Jaggi, François Fleuret
While many works have proposed schemes to sparsify the attention patterns and reduce the computational overhead of self-attention, these are often limited by implementation concerns and end up imposing a simple, static structure on the attention matrix.
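To illustrate the "simple, static structure" being criticized (a sketch, not code from the paper): a typical fixed sparsity pattern such as a sliding-window mask is chosen ahead of time and is identical for every input.

```python
import torch

def local_attention_mask(seq_len, window):
    """Fixed sliding-window mask: position i may only attend to positions j with
    i - window < j <= i. The pattern is static and input-independent."""
    i = torch.arange(seq_len).unsqueeze(1)
    j = torch.arange(seq_len).unsqueeze(0)
    return (j <= i) & (j > i - window)

# Toy usage on random attention scores for a sequence of length 8.
scores = torch.randn(8, 8)
mask = local_attention_mask(8, window=3)
attn = torch.softmax(scores.masked_fill(~mask, float("-inf")), dim=-1)
print(mask.int())
```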
2 code implementations • 27 Oct 2022 • Tatjana Chavdarova, Tong Yang, Matteo Pagliardini, Michael I. Jordan
We prove the convergence of this method and show that the gap function of the last iterate of the method decreases at a rate of $O(\frac{1}{\sqrt{K}})$ when the operator is $L$-Lipschitz and monotone.
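For reference, under one standard formulation (an assumption here, not necessarily the exact definition used in the paper), the gap function of a monotone operator $F$ over a closed convex set $\mathcal{X}$ is $\mathrm{Gap}(x) = \max_{y \in \mathcal{X}} \langle F(y), x - y \rangle$, so the statement is that the last iterate $x_K$ satisfies $\mathrm{Gap}(x_K) = O(\frac{1}{\sqrt{K}})$ whenever $F$ is $L$-Lipschitz and monotone.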
no code implementations • 11 Feb 2022 • Matteo Pagliardini, Gilberto Manunza, Martin Jaggi, Michael I. Jordan, Tatjana Chavdarova
We show that UDP is guaranteed to achieve the maximum-margin decision boundary on linear models and that it notably increases the margin on challenging simulated datasets.
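For context, the standard notion being referenced (not specific to this paper): for a linear classifier $f(x) = \langle w, x \rangle$ on a separable dataset $\{(x_i, y_i)\}$ with $y_i \in \{-1, +1\}$, the margin is $\min_i \frac{y_i \langle w, x_i \rangle}{\|w\|}$, and the maximum-margin decision boundary is the one maximizing this quantity over all $w \neq 0$.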
1 code implementation • 9 Feb 2022 • Matteo Pagliardini, Martin Jaggi, François Fleuret, Sai Praneeth Karimireddy
This behavior can hinder the transferability of trained models by (i) favoring the learning of simpler but spurious features -- present in the training data but absent from the test data -- and (ii) leveraging only a small subset of the predictive features.
1 code implementation • 9 Dec 2021 • Yehao Liu, Matteo Pagliardini, Tatjana Chavdarova, Sebastian U. Stich
Second, we show on a 2D toy example that neither BNNs nor MCDropout give high uncertainty estimates on OOD samples.
no code implementations • 29 Sep 2021 • Matteo Pagliardini, Gilberto Manunza, Martin Jaggi, Tatjana Chavdarova
The sensitivity of deep learning models to small input perturbations raises security concerns and limits their use in applications where reliability is critical.
1 code implementation • ICLR 2021 • Tatjana Chavdarova, Matteo Pagliardini, Sebastian U. Stich, Francois Fleuret, Martin Jaggi
Generative Adversarial Networks are notoriously challenging to train.
1 code implementation • NAACL 2019 • Prakhar Gupta, Matteo Pagliardini, Martin Jaggi
Pre-trained word vectors are ubiquitous in Natural Language Processing applications.
5 code implementations • NAACL 2018 • Matteo Pagliardini, Prakhar Gupta, Martin Jaggi
The recent tremendous success of unsupervised word embeddings in a multitude of applications raises the obvious question of whether similar methods could be derived to improve embeddings (i.e., semantic representations) of word sequences as well.
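As a rough illustration of the compositional idea behind this line of work (a sketch only: the tokens and vectors are dummies, and the learned n-gram features and training objective are omitted), a sentence embedding can be formed by averaging the vectors of its words.

```python
import numpy as np

# Dummy word vectors standing in for a learned embedding table.
word_vectors = {
    "unsupervised": np.array([0.1, 0.3, -0.2]),
    "sentence":     np.array([0.4, -0.1, 0.0]),
    "embeddings":   np.array([-0.2, 0.2, 0.5]),
}

def embed_sentence(tokens, vectors, dim=3):
    """Average the vectors of the in-vocabulary tokens of a sentence."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return np.zeros(dim)  # fallback for fully out-of-vocabulary sentences
    return np.mean(known, axis=0)

print(embed_sentence(["unsupervised", "sentence", "embeddings"], word_vectors))
```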