1 code implementation • Findings (EMNLP) 2021 • Seyed Ali Bahrainian, Martin Jaggi, Carsten Eickhoff
Topic models are useful tools for analyzing and interpreting the main underlying themes of large corpora of text.
no code implementations • 31 Oct 2024 • Atli Kosson, Bettina Messmer, Martin Jaggi
Warmup decreases the update size $\Delta \mathbf{w}_t = \eta_t \mathbf{u}_t$ early in training by using lower values for the learning rate $\eta_t$.
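As a concrete illustration of this schedule, here is a minimal sketch of linear learning-rate warmup; the base rate and warmup length are illustrative assumptions, not values from the paper.

```python
# Minimal sketch of linear learning-rate warmup (illustrative values, not from the paper).
# For the first `warmup_steps` updates, eta_t grows linearly toward the base rate,
# which shrinks the early update size delta_w_t = eta_t * u_t.

def warmup_lr(step: int, base_lr: float = 1e-3, warmup_steps: int = 1000) -> float:
    """Return the learning rate eta_t at a given step under linear warmup."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    return base_lr

# The effective update at step t would then be warmup_lr(t) * update_direction.
print([round(warmup_lr(s), 6) for s in (0, 499, 999, 5000)])
```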
no code implementations • 25 Oct 2024 • El Mahdi Chayti, Nikita Doikov, Martin Jaggi
We propose using a special version of momentum to stabilize the stochastic gradient and Hessian estimates in Newton's method.
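The following is a minimal sketch of that idea: exponential moving averages of the stochastic gradient and Hessian estimates feed a damped Newton step. The quadratic test objective, noise levels, momentum coefficient, and damping are illustrative assumptions, not the paper's exact algorithm.

```python
import numpy as np

# Toy quadratic f(w) = 0.5 w^T A w with noisy gradient/Hessian oracles (assumed setup).
rng = np.random.default_rng(0)
d = 5
A = np.diag(np.linspace(1.0, 10.0, d))
w = rng.normal(size=d)

g_bar, H_bar = np.zeros(d), np.zeros((d, d))   # momentum (moving-average) buffers
beta, damping = 0.9, 1e-3                      # assumed hyperparameters

for t in range(200):
    g = A @ w + 0.1 * rng.normal(size=d)           # noisy gradient estimate
    H = A + 0.1 * rng.normal(size=(d, d))          # noisy Hessian estimate
    H = 0.5 * (H + H.T)                            # symmetrize the estimate
    g_bar = beta * g_bar + (1 - beta) * g          # momentum on the gradient
    H_bar = beta * H_bar + (1 - beta) * H          # momentum on the Hessian
    w -= np.linalg.solve(H_bar + damping * np.eye(d), g_bar)   # damped Newton step

print(np.linalg.norm(w))   # small: the iterate fluctuates near the minimizer at 0
```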
1 code implementation • 7 Oct 2024 • Xinyu Zhou, Simin Fan, Martin Jaggi
The family of hyperpower methods is well known for its rigorous convergence guarantees on matrix inverse approximation, but the required matrix multiplications can incur intractable memory and computation costs on large-scale models.
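For reference, the order-2 hyperpower (Newton-Schulz) iteration is the prototypical member of this family; below is a minimal sketch on a small, well-conditioned test matrix (the matrix, its size, and the iteration count are illustrative assumptions).

```python
import numpy as np

rng = np.random.default_rng(0)
d = 50
A = np.eye(d) + 0.01 * rng.normal(size=(d, d))    # a well-conditioned test matrix (assumed)

# Order-2 hyperpower (Newton-Schulz) iteration: X_{k+1} = X_k (2I - A X_k),
# using the classical initialization X_0 = A^T / (||A||_1 ||A||_inf).
X = A.T / (np.linalg.norm(A, 1) * np.linalg.norm(A, np.inf))
for _ in range(20):
    X = X @ (2 * np.eye(d) - A @ X)

print(np.linalg.norm(X @ A - np.eye(d)))          # residual near machine precision
```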
1 code implementation • 20 Sep 2024 • Dongyang Fan, Bettina Messmer, Martin Jaggi
Our approach distinguishes generalists and specialists by aggregating certain experts across end users while keeping others localized to specialize in user-specific datasets.
1 code implementation • 9 Sep 2024 • Diba Hashemi, Lie He, Martin Jaggi
Collaborative learning is an important tool for training models across multiple clients more effectively by enabling communication among them.
no code implementations • 5 Sep 2024 • El Mahdi Chayti, Martin Jaggi
Learning new tasks by drawing on prior experience gathered from other (related) tasks is a core property of any intelligent system.
no code implementations • 7 Aug 2024 • Beatriz Borges, Negar Foroutan, Deniz Bayazit, Anna Sotnikova, Syrielle Montariol, Tanya Nazaretzky, Mohammadreza Banaei, Alireza Sakhaeirad, Philippe Servant, Seyed Parsa Neshaei, Jibril Frej, Angelika Romanou, Gail Weiss, Sepideh Mamooler, Zeming Chen, Simin Fan, Silin Gao, Mete Ismayilzada, Debjit Paul, Alexandre Schöpfer, Andrej Janchevski, Anja Tiede, Clarence Linden, Emanuele Troiani, Francesco Salvi, Freya Behrens, Giacomo Orsi, Giovanni Piccioli, Hadrien Sevel, Louis Coulon, Manuela Pineros-Rodriguez, Marin Bonnassies, Pierre Hellich, Puck van Gerwen, Sankalp Gambhir, Solal Pirelli, Thomas Blanchard, Timothée Callens, Toni Abi Aoun, Yannick Calvino Alonso, Yuri Cho, Alberto Chiappa, Antonio Sclocchi, Étienne Bruno, Florian Hofhammer, Gabriel Pescia, Geovani Rizk, Leello Dadi, Lucas Stoffl, Manoel Horta Ribeiro, Matthieu Bovel, Yueyang Pan, Aleksandra Radenovic, Alexandre Alahi, Alexander Mathis, Anne-Florence Bitbol, Boi Faltings, Cécile Hébert, Devis Tuia, François Maréchal, George Candea, Giuseppe Carleo, Jean-Cédric Chappelier, Nicolas Flammarion, Jean-Marie Fürbringer, Jean-Philippe Pellet, Karl Aberer, Lenka Zdeborová, Marcel Salathé, Martin Jaggi, Martin Rajman, Mathias Payer, Matthieu Wyart, Michael Gastpar, Michele Ceriotti, Ola Svensson, Olivier Lévêque, Paolo Ienne, Rachid Guerraoui, Robert West, Sanidhya Kashyap, Valerio Piazza, Viesturs Simanis, Viktor Kuncak, Volkan Cevher, Philippe Schwaller, Sacha Friedli, Patrick Jermann, Tanja Käser, Antoine Bosselut
We investigate the potential scale of this vulnerability by measuring the degree to which AI assistants can complete assessment questions in standard university-level STEM courses.
no code implementations • 31 May 2024 • Simla Burcu Harma, Ayan Chakraborty, Elizaveta Kostenok, Danila Mishin, Dongho Ha, Babak Falsafi, Martin Jaggi, Ming Liu, Yunho Oh, Suvinay Subramanian, Amir Yazdanbakhsh
In addition, through rigorous analysis, we demonstrate that sparsity and quantization are not orthogonal; their interaction can significantly harm model accuracy, with quantization error playing a dominant role in this degradation.
no code implementations • 29 May 2024 • Simin Fan, Razvan Pascanu, Martin Jaggi
Grokking refers to a sharp rise in a network's generalization accuracy on the test set that occurs long after an extended overfitting phase during which the network perfectly fits the training set.
3 code implementations • 28 May 2024 • Alexander Hägele, Elie Bakouch, Atli Kosson, Loubna Ben allal, Leandro von Werra, Martin Jaggi
Scale has become a main ingredient in obtaining strong machine learning models.
1 code implementation • 2 May 2024 • Youssef Allouah, Anastasia Koloskova, Aymane El Firdoussi, Martin Jaggi, Rachid Guerraoui
Decentralized learning is appealing as it enables the scalable usage of large amounts of distributed data and resources (without resorting to any central entity), while promoting privacy since every user minimizes the direct exposure of their data.
1 code implementation • 15 Apr 2024 • Nicolas Wagner, Dongyang Fan, Martin Jaggi
We explore on-device self-supervised collaborative fine-tuning of large language models with limited local data availability.
3 code implementations • 30 Mar 2024 • Saleh Ashkboos, Amirkeivan Mohtashami, Maximilian L. Croci, Bo Li, Pashmina Cameron, Martin Jaggi, Dan Alistarh, Torsten Hoefler, James Hensman
We introduce QuaRot, a new Quantization scheme based on Rotations, which is able to quantize LLMs end-to-end, including all weights, activations, and KV cache in 4 bits.
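A heavily simplified sketch of the rotation-before-quantization idea: multiplying activations and weights by a shared orthogonal matrix leaves the layer output unchanged in exact arithmetic while spreading outliers across dimensions, which tends to make low-bit symmetric quantization less lossy. The random QR-based rotation, tensor shapes, and per-tensor 4-bit quantizer below are assumptions for illustration, not QuaRot's exact procedure.

```python
import numpy as np

rng = np.random.default_rng(0)

def quantize_sym(x, bits=4):
    """Symmetric per-tensor quantization to `bits` bits (illustrative, not QuaRot's quantizer)."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(x).max() / qmax
    return np.round(x / scale).clip(-qmax, qmax) * scale

d_in, d_out, batch = 64, 32, 8
W = rng.normal(size=(d_out, d_in))            # weights of a linear layer y = x W^T
x = rng.normal(size=(batch, d_in))
x[:, 0] *= 20.0                               # an outlier channel that hurts per-tensor quantization

Q, _ = np.linalg.qr(rng.normal(size=(d_in, d_in)))   # random orthogonal rotation

y_ref = x @ W.T
# Rotation invariance: (x Q)(W Q)^T == x W^T, so the rotated tensors can be quantized instead.
y_plain = quantize_sym(x) @ quantize_sym(W).T
y_rot = quantize_sym(x @ Q) @ quantize_sym(W @ Q).T

print("quantization error without rotation:", np.abs(y_plain - y_ref).mean())
print("quantization error with rotation:   ", np.abs(y_rot - y_ref).mean())
```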
no code implementations • 20 Feb 2024 • Dongyang Fan, Bettina Messmer, Martin Jaggi
In this study, we systematically evaluate the impact of common design choices in Mixture of Experts (MoEs) on validation performance, uncovering distinct influences at token and sequence levels.
1 code implementation • 6 Feb 2024 • Ashok Vardhan Makkuva, Marco Bondaschi, Adway Girish, Alliot Nagle, Martin Jaggi, Hyeji Kim, Michael Gastpar
Inspired by the Markovianity of natural languages, we model the data as a Markovian source and utilize this framework to systematically study the interplay between the data-distributional properties, the transformer architecture, the learnt distribution, and the final model performance.
1 code implementation • 5 Feb 2024 • Vinitra Swamy, Syrielle Montariol, Julian Blackwell, Jibril Frej, Martin Jaggi, Tanja Käser
Interpretability for neural networks is a trade-off between three key requirements: 1) faithfulness of the explanation (i.e., how perfectly it explains the prediction), 2) understandability of the explanation by humans, and 3) model performance.
no code implementations • 4 Feb 2024 • Matteo Pagliardini, Amirkeivan Mohtashami, Francois Fleuret, Martin Jaggi
The transformer architecture by Vaswani et al. (2017) is now ubiquitous across application domains, from natural language processing to speech processing and image understanding.
1 code implementation • 27 Nov 2023 • Zeming Chen, Alejandro Hernández Cano, Angelika Romanou, Antoine Bonnet, Kyle Matoba, Francesco Salvi, Matteo Pagliardini, Simin Fan, Andreas Köpf, Amirkeivan Mohtashami, Alexandre Sallinen, Alireza Sakhaeirad, Vinitra Swamy, Igor Krawczuk, Deniz Bayazit, Axel Marmet, Syrielle Montariol, Mary-Anne Hartley, Martin Jaggi, Antoine Bosselut
Large language models (LLMs) can potentially democratize access to medical knowledge.
no code implementations • 12 Nov 2023 • Seyed Ali Bahrainian, Martin Jaggi, Carsten Eickhoff
We show that our model sets a new state of the art on the NEWTS dataset in terms of topic-focused abstractive summarization as well as a topic-prevalence score.
no code implementations • 23 Oct 2023 • Simin Fan, Matteo Pagliardini, Martin Jaggi
Moreover, when aiming to generalize to out-of-domain target tasks that are unseen in the pretraining corpus (the OOD setting), DoGE effectively identifies inter-domain dependencies and consistently achieves better test perplexity on the target domain.
no code implementations • 23 Oct 2023 • Simin Fan, Martin Jaggi
Automatic data selection and curriculum design for training large language models is challenging, with only a few existing methods showing improvements over standard training.
no code implementations • 19 Oct 2023 • Ashok Vardhan Makkuva, Marco Bondaschi, Thijs Vogels, Martin Jaggi, Hyeji Kim, Michael C. Gastpar
On the latter, we obtain a $50$-$64\%$ improvement in perplexity over our baselines for noisy channels.
no code implementations • 16 Oct 2023 • Amirkeivan Mohtashami, Matteo Pagliardini, Martin Jaggi
Scaling language models to larger and deeper sizes has led to significant boosts in performance.
1 code implementation • 25 Sep 2023 • Vinitra Swamy, Malika Satayeva, Jibril Frej, Thierry Bossy, Thijs Vogels, Martin Jaggi, Tanja Käser, Mary-Anne Hartley
Predicting multiple real-world tasks in a single model often requires a particularly diverse feature space.
1 code implementation • 13 Jul 2023 • Linara Adilova, Maksym Andriushchenko, Michael Kamp, Asja Fischer, Martin Jaggi
Averaging neural network parameters is an intuitive method for fusing the knowledge of two independent models.
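A minimal sketch of this operation: averaging is linear interpolation of the two parameter sets at alpha = 0.5, and sweeping alpha traces the path along which a loss barrier may appear. The toy architecture below is an illustrative assumption.

```python
import copy
import torch.nn as nn

def interpolate(model_a: nn.Module, model_b: nn.Module, alpha: float) -> nn.Module:
    """Return a model with parameters (1 - alpha) * theta_a + alpha * theta_b."""
    merged = copy.deepcopy(model_a)
    sd_a, sd_b = model_a.state_dict(), model_b.state_dict()
    merged.load_state_dict({k: (1 - alpha) * sd_a[k] + alpha * sd_b[k] for k in sd_a})
    return merged

# alpha = 0.5 is plain parameter averaging; sweeping alpha over [0, 1] and evaluating the
# loss along the path reveals whether a barrier separates the two models.
net_a = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 2))
net_b = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 2))
averaged = interpolate(net_a, net_b, alpha=0.5)
print(sum(p.numel() for p in averaged.parameters()))
```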
1 code implementation • 14 Jun 2023 • Mariel Werner, Lie He, Michael Jordan, Martin Jaggi, Sai Praneeth Karimireddy
Identifying clients with similar objectives and learning a model-per-cluster is an intuitive and interpretable approach to personalization in federated learning.
2 code implementations • 1 Jun 2023 • Matteo Pagliardini, Daniele Paliotta, Martin Jaggi, François Fleuret
While many works have proposed schemes to sparsify the attention patterns and reduce the computational overhead of self-attention, those are often limited by implementation concerns and end up imposing a simple and static structure over the attention matrix.
no code implementations • 30 May 2023 • Anastasia Koloskova, Nikita Doikov, Sebastian U. Stich, Martin Jaggi
In machine learning and neural network optimization, algorithms such as incremental gradient and shuffle SGD are popular because they minimize the number of cache misses and exhibit good practical convergence behavior.
1 code implementation • NeurIPS 2023 • Atli Kosson, Martin Jaggi
Finally, we show that we can eliminate all multiplications in the entire training process, including operations in the forward pass, backward pass and optimizer update, demonstrating the first successful training of modern neural network architectures in a fully multiplication-free fashion.
2 code implementations • 26 May 2023 • Atli Kosson, Bettina Messmer, Martin Jaggi
This study investigates how weight decay affects the update behavior of individual neurons in deep neural networks through a combination of applied analysis and experimentation.
1 code implementation • 26 May 2023 • Atli Kosson, Dongyang Fan, Martin Jaggi
Batch Normalization (BN) is widely used to stabilize the optimization process and improve the test performance of deep neural networks.
2 code implementations • 25 May 2023 • Amirkeivan Mohtashami, Martin Jaggi
While Transformers have shown remarkable success in natural language processing, their attention mechanism's large memory requirements have limited their ability to handle longer contexts.
no code implementations • 24 Feb 2023 • Maria-Luiza Vladarean, Nikita Doikov, Martin Jaggi, Nicolas Flammarion
This paper studies first-order algorithms for solving fully composite optimization problems over convex and compact sets.
no code implementations • 23 Feb 2023 • El Mahdi Chayti, Nikita Doikov, Martin Jaggi
Our helper framework offers the algorithm designer high flexibility for constructing and analyzing stochastic Cubic Newton methods, allowing arbitrary batch sizes and the use of noisy and possibly biased estimates of the gradients and Hessians, and incorporating both variance reduction and lazy Hessian updates.
1 code implementation • 5 Jan 2023 • Thijs Vogels, Hadrien Hendrikx, Martin Jaggi
This paper aims to paint an accurate picture of sparsely-connected distributed optimization.
no code implementations • 1 Dec 2022 • Nikita Doikov, El Mahdi Chayti, Martin Jaggi
This provably improves the total arithmetical complexity of second-order algorithms by a factor $\sqrt{d}$.
no code implementations • 20 Nov 2022 • Frédéric Berdoz, Abhishek Singh, Martin Jaggi, Ramesh Raskar
To do so, each client releases averaged last hidden layer activations of similar labels to a central server that only acts as a relay (i.e., is not involved in the training or aggregation of the models).
no code implementations • 19 Nov 2022 • Simla Burcu Harma, Ayan Chakraborty, Nicholas Sperry, Babak Falsafi, Martin Jaggi, Yunho Oh
Based on our findings, we propose Accuracy Booster, a mixed-mantissa HBFP technique that uses 4-bit mantissas for over 99% of all arithmetic operations in training and 6-bit mantissas only in the last epoch and first/last layers.
1 code implementation • 12 Nov 2022 • Cécile Trottet, Thijs Vogels, Martin Jaggi, Mary-Anne Hartley
Data-driven Clinical Decision Support Systems (CDSS) have the potential to improve and standardise care with personalised probabilistic guidance.
1 code implementation • 10 Oct 2022 • Jean Ogier du Terrail, Samy-Safwan Ayed, Edwige Cyffers, Felix Grimberg, Chaoyang He, Regis Loeb, Paul Mangold, Tanguy Marchand, Othmane Marfoq, Erum Mushtaq, Boris Muzellec, Constantin Philippenko, Santiago Silva, Maria Teleńczuk, Shadi Albarqouni, Salman Avestimehr, Aurélien Bellet, Aymeric Dieuleveut, Martin Jaggi, Sai Praneeth Karimireddy, Marco Lorenzi, Giovanni Neglia, Marc Tommasi, Mathieu Andreux
In this work, we propose a novel cross-silo dataset suite focused on healthcare, FLamby (Federated Learning AMple Benchmark of Your cross-silo strategies), to bridge the gap between theory and practice of cross-silo FL.
no code implementations • 16 Jun 2022 • Anastasia Koloskova, Sebastian U. Stich, Martin Jaggi
In this work (i) we obtain a tighter convergence rate of $\mathcal{O}\!\left(\sigma^2\epsilon^{-2}+ \sqrt{\tau_{\max}\tau_{\mathrm{avg}}}\,\epsilon^{-1}\right)$ without any change in the algorithm, where $\tau_{\mathrm{avg}}$ is the average delay, which can be significantly smaller than $\tau_{\max}$.
1 code implementation • 7 Jun 2022 • Thijs Vogels, Hadrien Hendrikx, Martin Jaggi
In data-parallel optimization of machine learning models, workers collaborate to improve their estimates of the model: more accurate gradients allow them to use larger learning rates and optimize faster.
no code implementations • 30 May 2022 • Amirkeivan Mohtashami, Martin Jaggi, Sebastian Stich
However, we show through a novel set of experiments that the stochastic noise is not sufficient to explain good non-convex training, and that instead the effect of a large learning rate itself is essential for obtaining the best performance. We demonstrate the same effects also in the noiseless case, i.e., for full-batch GD.
no code implementations • NAACL 2022 • Fedor Moiseev, Zhe Dong, Enrique Alfonseca, Martin Jaggi
The models pre-trained on factual triples compare competitively with the ones pre-trained on natural language sentences that contain the same knowledge.
no code implementations • 13 Apr 2022 • Yatin Dandi, Anastasia Koloskova, Martin Jaggi, Sebastian U. Stich
Decentralized learning provides an effective framework to train machine learning models with data distributed over arbitrary communication graphs.
no code implementations • 11 Feb 2022 • Matteo Pagliardini, Gilberto Manunza, Martin Jaggi, Michael I. Jordan, Tatjana Chavdarova
We show that UDP is guaranteed to achieve the maximum margin decision boundary on linear models and that it notably increases it on challenging simulated datasets.
1 code implementation • 9 Feb 2022 • Matteo Pagliardini, Martin Jaggi, François Fleuret, Sai Praneeth Karimireddy
This behavior can hinder the transferability of trained models by (i) favoring the learning of simpler but spurious features (present in the training data but absent from the test data) and (ii) only leveraging a small subset of the predictive features.
1 code implementation • 3 Feb 2022 • Lie He, Sai Praneeth Karimireddy, Martin Jaggi
In this paper, we study the challenging task of Byzantine-robust decentralized training on arbitrary communication graphs.
no code implementations • NeurIPS 2021 • Sai Praneeth Karimireddy, Martin Jaggi, Satyen Kale, Mehryar Mohri, Sashank Reddi, Sebastian U. Stich, Ananda Theertha Suresh
Federated learning (FL) is a challenging setting for optimization due to the heterogeneity of the data across different clients which gives rise to the client drift phenomenon.
1 code implementation • 16 Nov 2021 • Vinitra Swamy, Angelika Romanou, Martin Jaggi
In this paper, we compare BERT-based language models through snapshots of acquired knowledge at sequential stages of the training process.
1 code implementation • 10 Nov 2021 • El Mahdi Chayti, Sai Praneeth Karimireddy, Sebastian U. Stich, Nicolas Flammarion, Martin Jaggi
Collaborative training can improve the accuracy of a model for a user by trading off the model's bias (introduced by using data from other users who are potentially different) against its variance (due to the limited amount of data on any single user).
no code implementations • 25 Oct 2021 • Felix Grimberg, Mary-Anne Hartley, Sai P. Karimireddy, Martin Jaggi
In federated learning, differences in the data or objectives between the participating nodes motivate approaches to train a personalized machine learning model for each node.
no code implementations • 13 Oct 2021 • Martin Beaussart, Felix Grimberg, Mary-Anne Hartley, Martin Jaggi
Through a series of experiments, we compare our new approach to two recent personalized federated learning methods (Weight Erosion and APFL) as well as two general FL methods (Federated Averaging and SCAFFOLD).
1 code implementation • NeurIPS 2021 • Thijs Vogels, Lie He, Anastasia Koloskova, Tao Lin, Sai Praneeth Karimireddy, Sebastian U. Stich, Martin Jaggi
A key challenge, primarily in decentralized deep learning, remains the handling of differences between the workers' local data distributions.
no code implementations • 29 Sep 2021 • Matteo Pagliardini, Gilberto Manunza, Martin Jaggi, Tatjana Chavdarova
The deep learning models' sensitivity to small input perturbations raises security concerns and limits their use for applications where reliability is critical.
no code implementations • 6 Sep 2021 • Sebastian Bischoff, Stephan Günnemann, Martin Jaggi, Sebastian U. Stich
We consider federated learning (FL), where the training data is distributed across a large number of clients.
1 code implementation • ICCV 2021 • Oguz Kaan Yuksel, Sebastian U. Stich, Martin Jaggi, Tatjana Chavdarova
We find that our latent adversarial perturbations, adaptive to the classifier throughout its training, are most effective, yielding the first test accuracy improvement results on real-world datasets (CIFAR-10/100) via latent-space perturbations.
1 code implementation • 14 Jul 2021 • David Roschewitz, Mary-Anne Hartley, Luca Corinzia, Martin Jaggi
This enables the detection of outlier datasets in the federation, as well as learning to compensate for local data distribution shifts without sharing any original data.
2 code implementations • 14 Jul 2021 • Jianyu Wang, Zachary Charles, Zheng Xu, Gauri Joshi, H. Brendan McMahan, Blaise Aguera y Arcas, Maruan Al-Shedivat, Galen Andrew, Salman Avestimehr, Katharine Daly, Deepesh Data, Suhas Diggavi, Hubert Eichner, Advait Gadhikar, Zachary Garrett, Antonious M. Girgis, Filip Hanzely, Andrew Hard, Chaoyang He, Samuel Horvath, Zhouyuan Huo, Alex Ingerman, Martin Jaggi, Tara Javidi, Peter Kairouz, Satyen Kale, Sai Praneeth Karimireddy, Jakub Konecny, Sanmi Koyejo, Tian Li, Luyang Liu, Mehryar Mohri, Hang Qi, Sashank J. Reddi, Peter Richtarik, Karan Singhal, Virginia Smith, Mahdi Soltanolkotabi, Weikang Song, Ananda Theertha Suresh, Sebastian U. Stich, Ameet Talwalkar, Hongyi Wang, Blake Woodworth, Shanshan Wu, Felix X. Yu, Honglin Yuan, Manzil Zaheer, Mi Zhang, Tong Zhang, Chunxiang Zheng, Chen Zhu, Wennan Zhu
Federated learning and analytics are a distributed approach for collaboratively learning models (or statistics) from decentralized data, motivated by and designed for privacy protection.
no code implementations • 25 Jun 2021 • Yatin Dandi, Luis Barba, Martin Jaggi
A major obstacle to achieving global convergence in distributed and federated learning is the misalignment of gradients across clients or mini-batches, due to the heterogeneity and stochasticity of the distributed data.
no code implementations • 16 Jun 2021 • Amirkeivan Mohtashami, Martin Jaggi, Sebastian U. Stich
State-of-the-art training algorithms for deep learning models are based on stochastic gradient descent (SGD).
1 code implementation • ACL 2021 • Prakhar Gupta, Martin Jaggi
The advent of contextual word embeddings (representations of words that incorporate semantic and syntactic information from their context) has led to tremendous improvements on a wide variety of NLP tasks.
1 code implementation • ACL 2021 • Zhuoyuan Mao, Prakhar Gupta, Pei Wang, Chenhui Chu, Martin Jaggi, Sadao Kurohashi
Large-scale models for learning fixed-dimensional cross-lingual sentence representations like LASER (Artetxe and Schwenk, 2019b) lead to significant improvement in performance on downstream tasks.
1 code implementation • 15 Apr 2021 • Valerian Rey, Pedro Miguel Sánchez Sánchez, Alberto Huertas Celdrán, Gérôme Bovet, Martin Jaggi
In this context, a framework that uses federated learning to detect malware affecting IoT devices is presented.
no code implementations • 3 Mar 2021 • Sebastian U. Stich, Amirkeivan Mohtashami, Martin Jaggi
It has been experimentally observed that the efficiency of distributed training with stochastic gradient (SGD) depends decisively on the batch size and -- in asynchronous implementations -- on the gradient staleness.
1 code implementation • 9 Feb 2021 • Tao Lin, Sai Praneeth Karimireddy, Sebastian U. Stich, Martin Jaggi
In this paper, we investigate and identify the limitation of several decentralized optimization algorithms for different degrees of data heterogeneity.
no code implementations • 9 Feb 2021 • Lingjing Kong, Tao Lin, Anastasia Koloskova, Martin Jaggi, Sebastian U. Stich
Decentralized training of deep learning models enables on-device learning over networks, as well as efficient scaling to large compute clusters.
1 code implementation • 5 Feb 2021 • Giovanni Cherubin, Konstantinos Chatzikokolakis, Martin Jaggi
We evaluate our findings empirically, and discuss when methods are suitable for CP optimization.
no code implementations • 1 Jan 2021 • Eliza Wszola, Martin Jaggi, Markus Püschel
Word embeddings have gained increasing popularity in recent years due to the Word2vec library and its extension fastText, which uses subword information.
1 code implementation • 18 Dec 2020 • Sai Praneeth Karimireddy, Lie He, Martin Jaggi
Secondly, we prove that even if the aggregation rules may succeed in limiting the influence of the attackers in a single round, the attackers can couple their attacks across time eventually leading to divergence.
1 code implementation • NeurIPS 2020 • Thijs Vogels, Sai Praneeth Karimireddy, Martin Jaggi
Lossy gradient compression has become a practical tool to overcome the communication bottleneck in centrally coordinated distributed training of machine learning models.
no code implementations • 3 Nov 2020 • Dmitry Kovalev, Anastasia Koloskova, Martin Jaggi, Peter Richtarik, Sebastian U. Stich
Decentralized optimization methods enable on-device training of machine learning models without a central coordinator.
no code implementations • 28 Sep 2020 • Lie He, Sai Praneeth Karimireddy, Martin Jaggi
In Byzantine-robust distributed optimization, a central server wants to train a machine learning model over data distributed across multiple workers.
no code implementations • 19 Sep 2020 • Negar Foroutan Eghlidi, Martin Jaggi
Although distributed training reduces the computation time, the communication overhead associated with the gradient exchange forms a scalability bottleneck for the algorithm.
2 code implementations • 29 Jun 2020 • Jean-Baptiste Cordonnier, Andreas Loukas, Martin Jaggi
We also show that it is possible to re-parametrize a pre-trained multi-head attention layer into our collaborative attention layer.
1 code implementation • ICLR 2021 • Tatjana Chavdarova, Matteo Pagliardini, Sebastian U. Stich, Francois Fleuret, Martin Jaggi
Generative Adversarial Networks are notoriously challenging to train.
1 code implementation • ICLR 2022 • Sai Praneeth Karimireddy, Lie He, Martin Jaggi
In Byzantine robust distributed or federated learning, a central server wants to train a machine learning model over data distributed across multiple workers.
1 code implementation • NeurIPS 2020 • Tao Lin, Lingjing Kong, Sebastian U. Stich, Martin Jaggi
In most current training schemes, the central model is refined by averaging the parameters of the server model and the updated parameters from the client side.
no code implementations • ICLR 2020 • Tao Lin, Sebastian U. Stich, Luis Barba, Daniil Dmitriev, Martin Jaggi
Deep neural networks often have millions of parameters.
no code implementations • ICML 2020 • Tao Lin, Lingjing Kong, Sebastian U. Stich, Martin Jaggi
Deep learning networks are typically trained by Stochastic Gradient Descent (SGD) methods that iteratively improve the model parameters by estimating a gradient on a very small fraction of the training data.
no code implementations • 8 Jun 2020 • Lie He, Sai Praneeth Karimireddy, Martin Jaggi
Increasingly, machine learning systems are being deployed to edge servers and devices (e.g., mobile phones) and trained in a collaborative manner.
no code implementations • EMNLP 2020 • Mengjie Zhao, Tao Lin, Fei Mi, Martin Jaggi, Hinrich Schütze
We present an efficient method of utilizing pretrained language models, where we learn selective binary masks for pretrained weights in lieu of modifying them through finetuning.
no code implementations • ICLR 2021 • Namhoon Lee, Thalaiyasingam Ajanthan, Philip H. S. Torr, Martin Jaggi
As a result, across various workloads of dataset, network model, and optimization algorithm, we find a general scaling trend between batch size and the number of training steps to convergence, characterizing both the effect of data parallelism and the difficulty of training under sparsity.
no code implementations • ICML 2020 • Anastasia Koloskova, Nicolas Loizou, Sadra Boreiri, Martin Jaggi, Sebastian U. Stich
Decentralized stochastic optimization methods have gained a lot of attention recently, mainly because of their cheap per iteration cost, data locality, and their communication-efficiency.
2 code implementations • 28 Dec 2019 • Ali Sabet, Prakhar Gupta, Jean-Baptiste Cordonnier, Robert West, Martin Jaggi
Recent advances in cross-lingual word embeddings have primarily relied on mapping-based methods, which project pretrained word embeddings from different languages into a shared space through a linear transformation.
9 code implementations • 10 Dec 2019 • Peter Kairouz, H. Brendan McMahan, Brendan Avent, Aurélien Bellet, Mehdi Bennis, Arjun Nitin Bhagoji, Kallista Bonawitz, Zachary Charles, Graham Cormode, Rachel Cummings, Rafael G. L. D'Oliveira, Hubert Eichner, Salim El Rouayheb, David Evans, Josh Gardner, Zachary Garrett, Adrià Gascón, Badih Ghazi, Phillip B. Gibbons, Marco Gruteser, Zaid Harchaoui, Chaoyang He, Lie He, Zhouyuan Huo, Ben Hutchinson, Justin Hsu, Martin Jaggi, Tara Javidi, Gauri Joshi, Mikhail Khodak, Jakub Konečný, Aleksandra Korolova, Farinaz Koushanfar, Sanmi Koyejo, Tancrède Lepoint, Yang Liu, Prateek Mittal, Mehryar Mohri, Richard Nock, Ayfer Özgür, Rasmus Pagh, Mariana Raykova, Hang Qi, Daniel Ramage, Ramesh Raskar, Dawn Song, Weikang Song, Sebastian U. Stich, Ziteng Sun, Ananda Theertha Suresh, Florian Tramèr, Praneeth Vepakomma, Jianyu Wang, Li Xiong, Zheng Xu, Qiang Yang, Felix X. Yu, Han Yu, Sen Zhao
FL embodies the principles of focused data collection and minimization, and can mitigate many of the systemic privacy risks and costs resulting from traditional, centralized machine learning and data science approaches.
1 code implementation • ICLR 2020 • Jean-Baptiste Cordonnier, Andreas Loukas, Martin Jaggi
This work provides evidence that attention layers can perform convolution and, indeed, they often learn to do so in practice.
no code implementations • ICML 2020 • Prabhu Teja Sivaprasad, Florian Mai, Thijs Vogels, Martin Jaggi, François Fleuret
The performance of optimizers, particularly in deep learning, depends considerably on their chosen hyperparameter configuration.
2 code implementations • NeurIPS 2020 • Sidak Pal Singh, Martin Jaggi
Finally, our approach also provides a principled way to combine the parameters of neural networks with different widths, and we explore its application for model compression.
no code implementations • 25 Sep 2019 • Prabhu Teja S*, Florian Mai*, Thijs Vogels, Martin Jaggi, Francois Fleuret
There is no consensus yet on whether adaptive gradient methods like Adam are easier to use than non-adaptive optimization methods like SGD.
1 code implementation • ICLR 2020 • Anastasia Koloskova, Tao Lin, Sebastian U. Stich, Martin Jaggi
Decentralized training of deep learning models is a key element for enabling data privacy and on-device learning over networks, as well as for efficient scaling to large compute clusters.
1 code implementation • WS 2019 • Arno Schneuwly, Ralf Grubenmann, Séverine Rion Logean, Mark Cieliebak, Martin Jaggi
We study how language on social media is linked to diseases such as atherosclerotic heart disease (AHD), diabetes and various types of cancer.
1 code implementation • NeurIPS 2019 • Thijs Vogels, Sai Praneeth Karimireddy, Martin Jaggi
We study gradient compression methods to alleviate the communication bottleneck in data-parallel distributed optimization.
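A minimal sketch of the low-rank compression idea studied in this line of work: a single power-iteration step factorizes the gradient matrix into two thin matrices that are cheap to communicate. The rank, gradient shape, and the absence of error feedback and warm-started factors are simplifying assumptions.

```python
import numpy as np

def low_rank_compress(M, rank, rng):
    """One power-iteration step: return thin factors P, Q with M approximately P @ Q.T."""
    n, m = M.shape
    Q = rng.normal(size=(m, rank))
    P = M @ Q                      # (n, rank); P and Q are all that needs to be communicated
    P, _ = np.linalg.qr(P)         # orthonormalize the left factor
    Q = M.T @ P                    # (m, rank)
    return P, Q

rng = np.random.default_rng(0)
# A gradient matrix with approximately low-rank structure plus noise (assumed for illustration).
grad = rng.normal(size=(256, 4)) @ rng.normal(size=(4, 128)) + 0.1 * rng.normal(size=(256, 128))
P, Q = low_rank_compress(grad, rank=4, rng=rng)
approx = P @ Q.T
ratio = (P.size + Q.size) / grad.size
rel_err = np.linalg.norm(approx - grad) / np.linalg.norm(grad)
print(f"compression ratio {ratio:.3f}, relative error {rel_err:.3f}")
```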
1 code implementation • 2 May 2019 • Eliza Wszola, Celestine Mendler-Dünner, Martin Jaggi, Markus Püschel
A new generation of manycore processors is on the rise that offers dozens and more cores on a chip and, in a sense, fuses host processor and accelerator.
1 code implementation • NAACL 2019 • Prakhar Gupta, Matteo Pagliardini, Martin Jaggi
Pre-trained word vectors are ubiquitous in Natural Language Processing applications.
1 code implementation • 8 Apr 2019 • Martin Josifoski, Ivan S. Paskov, Hristo S. Paskov, Martin Jaggi, Robert West
Finally, although not trained for embedding sentences and words, it also achieves competitive performance on cross-lingual sentence and word retrieval tasks.
no code implementations • 29 Mar 2019 • Alexander Ratner, Dan Alistarh, Gustavo Alonso, David G. Andersen, Peter Bailis, Sarah Bird, Nicholas Carlini, Bryan Catanzaro, Jennifer Chayes, Eric Chung, Bill Dally, Jeff Dean, Inderjit S. Dhillon, Alexandros Dimakis, Pradeep Dubey, Charles Elkan, Grigori Fursin, Gregory R. Ganger, Lise Getoor, Phillip B. Gibbons, Garth A. Gibson, Joseph E. Gonzalez, Justin Gottschlich, Song Han, Kim Hazelwood, Furong Huang, Martin Jaggi, Kevin Jamieson, Michael. I. Jordan, Gauri Joshi, Rania Khalaf, Jason Knight, Jakub Konečný, Tim Kraska, Arun Kumar, Anastasios Kyrillidis, Aparna Lakshmiratan, Jing Li, Samuel Madden, H. Brendan McMahan, Erik Meijer, Ioannis Mitliagkas, Rajat Monga, Derek Murray, Kunle Olukotun, Dimitris Papailiopoulos, Gennady Pekhimenko, Theodoros Rekatsinas, Afshin Rostamizadeh, Christopher Ré, Christopher De Sa, Hanie Sedghi, Siddhartha Sen, Virginia Smith, Alex Smola, Dawn Song, Evan Sparks, Ion Stoica, Vivienne Sze, Madeleine Udell, Joaquin Vanschoren, Shivaram Venkataraman, Rashmi Vinayak, Markus Weimer, Andrew Gordon Wilson, Eric Xing, Matei Zaharia, Ce Zhang, Ameet Talwalkar
Machine learning (ML) techniques are enjoying rapidly increasing adoption.
no code implementations • 26 Feb 2019 • Khalil Mrini, Claudiu Musat, Michael Baeriswyl, Martin Jaggi
We show our model's interpretability by visualizing how our model distributes attention inside a document.
no code implementations • 25 Feb 2019 • Matthias Hüser, Adrian Kündig, Walter Karlen, Valeria De Luca, Martin Jaggi
Approach: We developed a prediction framework that forecasts onsets of acute intracranial hypertension in the next 8 hours.
no code implementations • ICLR 2019 • Yassine Benyahia, Kaicheng Yu, Kamil Bennani-Smires, Martin Jaggi, Anthony Davison, Mathieu Salzmann, Claudiu Musat
We identify a phenomenon, which we refer to as multi-model forgetting, that occurs when sequentially training multiple deep networks with partially-shared parameters; the performance of previously-trained models degrades as one optimizes a subsequent one, due to the overwriting of shared parameters.
1 code implementation • ICLR 2020 • Kaicheng Yu, Christian Sciuto, Martin Jaggi, Claudiu Musat, Mathieu Salzmann
Neural Architecture Search (NAS) aims to facilitate the design of deep networks for new tasks.
3 code implementations • 1 Feb 2019 • Anastasia Koloskova, Sebastian U. Stich, Martin Jaggi
We (i) propose a novel gossip-based stochastic gradient descent algorithm, CHOCO-SGD, that converges at rate $\mathcal{O}\left(1/(nT) + 1/(T \delta^2 \omega)^2\right)$ for strongly convex objectives, where $T$ denotes the number of iterations and $\delta$ the eigengap of the connectivity matrix.
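For intuition, the sketch below shows the plain gossip-averaging step that underlies such decentralized methods, without the compressed communication that is the paper's contribution; the ring topology and mixing weights are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 8, 20

# Ring topology with a doubly stochastic mixing matrix W (assumed for illustration).
W = np.zeros((n, n))
for i in range(n):
    W[i, i], W[i, (i - 1) % n], W[i, (i + 1) % n] = 0.5, 0.25, 0.25

x = rng.normal(size=(n, d))        # one parameter vector per node
avg = x.mean(axis=0)

for _ in range(100):
    # (In CHOCO-SGD, a local SGD step and a compressed exchange replace this exact exchange.)
    x = W @ x                      # gossip step: each node mixes with its neighbors

print(np.abs(x - avg).max())       # all nodes approach the network-wide average
```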
1 code implementation • NeurIPS 2019 • Jean-Yves Franceschi, Aymeric Dieuleveut, Martin Jaggi
Time series constitute a challenging data type for machine learning algorithms, due to their highly variable lengths and sparse labeling in practice.
1 code implementation • 28 Jan 2019 • Sai Praneeth Karimireddy, Quentin Rebjock, Sebastian U. Stich, Martin Jaggi
These issues arise because of the biased nature of the sign compression operator.
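A minimal sketch of the error-feedback remedy studied here: the dropped compression error is stored in a memory vector and added back before the next compression. The toy quadratic objective, the mean-magnitude scaling of the sign compressor, and the step size are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 10
A = np.diag(np.linspace(1.0, 5.0, d))     # toy quadratic f(w) = 0.5 w^T A w (assumed)
w = rng.normal(size=d)
memory = np.zeros(d)                      # accumulates what the compressor dropped
lr = 0.05

def scaled_sign(v):
    """Biased sign compressor: per-coordinate sign, rescaled by the mean magnitude."""
    return np.sign(v) * np.abs(v).mean()

for _ in range(2000):
    grad = A @ w
    corrected = lr * grad + memory        # add back the previously dropped error
    update = scaled_sign(corrected)
    memory = corrected - update           # store the new compression error
    w = w - update

print(np.linalg.norm(w))                  # the iterate approaches the minimizer at 0
```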
no code implementations • 16 Oct 2018 • Sai Praneeth Karimireddy, Anastasia Koloskova, Sebastian U. Stich, Martin Jaggi
For these problems we provide (i) the first linear rates of convergence independent of $n$, and show that our greedy update rule provides speedups similar to those obtained in the smooth case.
1 code implementation • NeurIPS 2018 • Sebastian U. Stich, Jean-Baptiste Cordonnier, Martin Jaggi
Huge-scale machine learning problems are nowadays tackled by distributed optimization algorithms, i.e., algorithms that leverage the compute power of many devices for training.
2 code implementations • 29 Aug 2018 • Sidak Pal Singh, Andreas Hug, Aymeric Dieuleveut, Martin Jaggi
We present a framework for building unsupervised representations of entities and their compositions, where each entity is viewed as a probability distribution rather than a vector embedding.
2 code implementations • ICLR 2020 • Tao Lin, Sebastian U. Stich, Kumar Kshitij Patel, Martin Jaggi
Mini-batch stochastic gradient methods (SGD) are state of the art for distributed training of deep neural networks.
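A minimal sketch of local SGD, the alternative studied here: each worker performs several local steps before the models are synchronized by parameter averaging. The linear-regression objective, the number of workers, and the synchronization interval are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_workers, d, local_steps, lr = 4, 20, 8, 0.05
w_true = rng.normal(size=d)

# Each worker holds its own shard of a shared linear-regression problem (assumed setup).
shards = []
for _ in range(n_workers):
    X = rng.normal(size=(100, d))
    y = X @ w_true + 0.01 * rng.normal(size=100)
    shards.append((X, y))

w = np.zeros(d)
for _round in range(50):
    worker_models = []
    for X, y in shards:
        w_local = w.copy()
        for _ in range(local_steps):                    # H local SGD steps between syncs
            idx = rng.integers(0, len(y), size=10)      # mini-batch indices
            grad = X[idx].T @ (X[idx] @ w_local - y[idx]) / len(idx)
            w_local -= lr * grad
        worker_models.append(w_local)
    w = np.mean(worker_models, axis=0)                  # synchronize by averaging

print(np.linalg.norm(w - w_true))                       # close to the true weights
```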
1 code implementation • NeurIPS 2018 • Lie He, An Bian, Martin Jaggi
Decentralized machine learning is a promising emerging paradigm in view of global challenges of data ownership and privacy.
no code implementations • ICML 2018 • Celestine Dünner, Aurelien Lucchi, Matilde Gargiani, An Bian, Thomas Hofmann, Martin Jaggi
Due to the rapid growth of data and computational resources, distributed optimization has become an active research area in recent years.
no code implementations • 5 Jun 2018 • Sidak Pal Singh, Andreas Hug, Aymeric Dieuleveut, Martin Jaggi
We propose a unified framework for building unsupervised representations of individual objects or entities (and their compositions), by associating with each object both a distributional as well as a point estimate (vector embedding).
no code implementations • 1 Jun 2018 • Sai Praneeth Karimireddy, Sebastian U. Stich, Martin Jaggi
We show that Newton's method converges globally at a linear rate for objective functions whose Hessians are stable.
no code implementations • NeurIPS 2018 • Mario Drumond, Tao Lin, Martin Jaggi, Babak Falsafi
We identify block floating point (BFP) as a promising alternative representation since it exhibits wide dynamic range and enables the majority of DNN operations to be performed with fixed-point logic.
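A minimal sketch of the representation: within each block, values share a single exponent and store short fixed-point mantissas. The block size and mantissa width below are illustrative assumptions, not the paper's chosen format.

```python
import numpy as np

def to_bfp(x, block=16, mantissa_bits=8):
    """Quantize a 1-D array to block floating point and return the dequantized values."""
    out = np.empty_like(x, dtype=np.float64)
    qmax = 2 ** (mantissa_bits - 1) - 1
    for start in range(0, len(x), block):
        chunk = x[start:start + block]
        shared_exp = np.ceil(np.log2(np.abs(chunk).max() + 1e-30))   # one exponent per block
        scale = 2.0 ** shared_exp / qmax
        out[start:start + block] = np.round(chunk / scale).clip(-qmax, qmax) * scale
    return out

x = np.random.default_rng(0).normal(size=64)
print(np.abs(to_bfp(x) - x).max())    # error bounded by half the per-block scale
```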
no code implementations • ICML 2018 • Francesco Locatello, Anant Raj, Sai Praneeth Karimireddy, Gunnar Rätsch, Bernhard Schölkopf, Sebastian U. Stich, Martin Jaggi
Exploiting the connection between the two algorithms, we present a unified analysis of both, providing affine invariant sublinear $\mathcal{O}(1/t)$ rates on smooth objectives and linear convergence on strongly convex objectives.
3 code implementations • CONLL 2018 • Kamil Bennani-Smires, Claudiu Musat, Andreea Hossmann, Michael Baeriswyl, Martin Jaggi
EmbedRank achieves higher F-scores than graph-based state of the art systems on standard datasets and is suitable for real-time processing of large amounts of Web data.
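A minimal sketch of the embedding-based ranking idea behind EmbedRank: candidate phrases are scored by cosine similarity between their embedding and the document embedding. The bag-of-words averaging and random word vectors below are stand-in assumptions for a trained sentence-embedding model.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = {"neural": 0, "networks": 1, "learn": 2, "keyphrase": 3, "extraction": 4,
         "from": 5, "documents": 6, "using": 7, "sentence": 8, "embeddings": 9}
E = rng.normal(size=(len(vocab), 50))          # stand-in word vectors (random, not trained)

def embed(text):
    """Average word vectors: a crude stand-in for a sentence-embedding model."""
    vecs = [E[vocab[w]] for w in text.lower().split() if w in vocab]
    return np.mean(vecs, axis=0)

def rank_candidates(document, candidates):
    doc_vec = embed(document)
    cos = lambda a, b: float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    return sorted(candidates, key=lambda c: -cos(embed(c), doc_vec))

doc = "neural networks learn keyphrase extraction from documents using sentence embeddings"
print(rank_candidates(doc, ["keyphrase extraction", "sentence embeddings", "neural networks"]))
```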
1 code implementation • 14 Nov 2017 • Chenxin Ma, Martin Jaggi, Frank E. Curtis, Nathan Srebro, Martin Takáč
In this paper, an accelerated variant of CoCoA+ is proposed and shown to possess a convergence rate of $\mathcal{O}(1/t^2)$ in terms of reducing suboptimality.
no code implementations • NeurIPS 2017 • Sebastian U. Stich, Anant Raj, Martin Jaggi
Importance sampling has become an indispensable strategy to speed up optimization algorithms for large-scale applications.
1 code implementation • NeurIPS 2017 • Celestine Dünner, Thomas Parnell, Martin Jaggi
We propose a generic algorithmic building block to accelerate training of machine learning models on heterogeneous compute systems.
2 code implementations • 21 Jul 2017 • Pascal Kaiser, Jan Dirk Wegner, Aurelien Lucchi, Martin Jaggi, Thomas Hofmann, Konrad Schindler
We adapt a state-of-the-art CNN architecture for semantic segmentation of buildings and roads in aerial images, and compare its performance when using different training data sets, ranging from manually labeled, pixel-accurate ground truth of the same city to automatic training data derived from OpenStreetMap data from distant locations.
no code implementations • 11 Jul 2017 • Mikhail A. Langovoy, Akhilesh Gotmare, Martin Jaggi
We consider learning of fundamental properties of communities in large noisy networks, in the prototypical situation where the nodes or users are split into two classes according to a binary property, e.g., according to their opinions or preferences on a topic.
no code implementations • ICML 2017 • Sebastian U. Stich, Anant Raj, Martin Jaggi
We propose a new selection rule for the coordinate selection in coordinate descent methods for huge-scale optimization.
no code implementations • NeurIPS 2017 • Francesco Locatello, Michael Tschannen, Gunnar Rätsch, Martin Jaggi
Greedy optimization methods such as Matching Pursuit (MP) and Frank-Wolfe (FW) algorithms regained popularity in recent years due to their simplicity, effectiveness and theoretical guarantees.
1 code implementation • ACL 2017 • Tina Fang, Martin Jaggi, Katerina Argyraki
Motivated by concerns for user privacy, we design a steganographic system ("stegosystem") that enables two users to exchange encrypted messages without an adversary detecting that such an exchange is taking place.
5 code implementations • NAACL 2018 • Matteo Pagliardini, Prakhar Gupta, Martin Jaggi
The recent tremendous success of unsupervised word embeddings in a multitude of applications raises the obvious question of whether similar methods could be derived to improve embeddings (i.e., semantic representations) of word sequences as well.
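A minimal sketch of one way to compose such sequence embeddings: average the vectors of a sentence's unigrams and bigrams. The random placeholder vectors below stand in for embeddings that would be learned with an unsupervised objective; they are an assumption for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 50
table = {}                                     # one vector per unigram and bigram

def features(sentence):
    words = sentence.lower().split()
    return words + [f"{a}_{b}" for a, b in zip(words, words[1:])]   # unigrams + bigrams

def sentence_embedding(sentence):
    vecs = []
    for f in features(sentence):
        if f not in table:                     # placeholder vectors; a trained model learns these
            table[f] = rng.normal(size=dim)
        vecs.append(table[f])
    return np.mean(vecs, axis=0)               # sentence vector = average of its n-gram vectors

v = sentence_embedding("unsupervised sentence embeddings compose word and bigram vectors")
print(v.shape)
```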
no code implementations • 7 Mar 2017 • Dmytro Perekrestenko, Volkan Cevher, Martin Jaggi
Coordinate descent methods employ random partial updates of decision variables in order to solve huge-scale convex optimization problems.
1 code implementation • 7 Mar 2017 • Jan Deriu, Aurelien Lucchi, Valeria De Luca, Aliaksei Severyn, Simon Müller, Mark Cieliebak, Thomas Hofmann, Martin Jaggi
This paper presents a novel approach for multi-lingual sentiment classification in short texts.
no code implementations • 21 Feb 2017 • Francesco Locatello, Rajiv Khanna, Michael Tschannen, Martin Jaggi
Two of the most fundamental prototypes of greedy optimization are the matching pursuit and Frank-Wolfe algorithms.
2 code implementations • 7 Nov 2016 • Virginia Smith, Simone Forte, Chenxin Ma, Martin Takac, Michael. I. Jordan, Martin Jaggi
The scale of modern datasets necessitates the development of efficient distributed optimization methods for machine learning.
no code implementations • 23 Sep 2016 • Anant Raj, Jakob Olbrich, Bernd Gärtner, Bernhard Schölkopf, Martin Jaggi
We propose a new framework for deriving screening rules for convex optimization problems.
no code implementations • 16 Feb 2016 • Celestine Dünner, Simone Forte, Martin Takáč, Martin Jaggi
We propose an algorithm-independent framework to equip existing optimization methods with primal-dual certificates.
no code implementations • 12 Feb 2016 • Rajiv Khanna, Michael Tschannen, Martin Jaggi
Efficiently representing real world data in a succinct and parsimonious manner is of central importance in many fields.
2 code implementations • 13 Dec 2015 • Virginia Smith, Simone Forte, Michael. I. Jordan, Martin Jaggi
Despite the importance of sparsity in many large-scale applications, there are few methods for distributed optimization of sparsity-inducing objectives.
1 code implementation • 13 Dec 2015 • Chenxin Ma, Jakub Konečný, Martin Jaggi, Virginia Smith, Michael. I. Jordan, Peter Richtárik, Martin Takáč
To this end, we present a framework for distributed optimization that both allows the flexibility of arbitrary solvers to be used on each (single) machine locally, and yet maintains competitive performance against other state-of-the-art special-purpose distributed methods.
1 code implementation • NeurIPS 2015 • Simon Lacoste-Julien, Martin Jaggi
In this paper, we highlight and clarify several variants of the Frank-Wolfe optimization algorithm that have been successfully applied in practice: away-steps FW, pairwise FW, fully-corrective FW and Wolfe's minimum norm point algorithm, and prove for the first time that they all enjoy global linear convergence, under a weaker condition than strong convexity of the objective.
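For reference, a minimal sketch of the vanilla Frank-Wolfe step over the probability simplex, the baseline on which the away-step and pairwise variants build; the quadratic objective is an illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 20
target = rng.dirichlet(np.ones(d))            # the minimizer lies inside the simplex (assumed)
x = np.ones(d) / d                            # start at the barycenter of the simplex

def grad(x):
    return x - target                         # gradient of f(x) = 0.5 ||x - target||^2

for t in range(200):
    g = grad(x)
    s = np.zeros(d)
    s[np.argmin(g)] = 1.0                     # linear minimization oracle over the simplex
    gamma = 2.0 / (t + 2)                     # standard step-size schedule
    x = (1 - gamma) * x + gamma * s           # FW update stays inside the simplex

print(np.linalg.norm(x - target))             # sublinear O(1/t) convergence toward the target
```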
1 code implementation • 12 Feb 2015 • Chenxin Ma, Virginia Smith, Martin Jaggi, Michael. I. Jordan, Peter Richtárik, Martin Takáč
Distributed optimization methods for large-scale machine learning suffer from a communication bottleneck.
no code implementations • NeurIPS 2014 • Martin Jaggi, Virginia Smith, Martin Takáč, Jonathan Terhorst, Sanjay Krishnan, Thomas Hofmann, Michael. I. Jordan
Communication remains the most significant bottleneck in the performance of distributed optimization algorithms for large-scale machine learning.
no code implementations • 5 Mar 2013 • Martin Jaggi
As a consequence, many existing optimization algorithms for both SVMs and Lasso can also be applied to the respective other problem instances.