Search Results for author: Martin Jaggi

Found 131 papers, 68 papers with code

Self-Supervised Neural Topic Modeling

1 code implementation Findings (EMNLP) 2021 Seyed Ali Bahrainian, Martin Jaggi, Carsten Eickhoff

Topic models are useful tools for analyzing and interpreting the main underlying themes of large corpora of text.

Clustering Topic Models

Towards an empirical understanding of MoE design choices

no code implementations 20 Feb 2024 Dongyang Fan, Bettina Messmer, Martin Jaggi

In this study, we systematically evaluate the impact of common design choices in Mixture of Experts (MoEs) on validation performance, uncovering distinct influences at token and sequence levels.

Attention with Markov: A Framework for Principled Analysis of Transformers via Markov Chains

1 code implementation 6 Feb 2024 Ashok Vardhan Makkuva, Marco Bondaschi, Adway Girish, Alliot Nagle, Martin Jaggi, Hyeji Kim, Michael Gastpar

Inspired by the Markovianity of natural languages, we model the data as a Markovian source and utilize this framework to systematically study the interplay between the data-distributional properties, the transformer architecture, the learnt distribution, and the final model performance.

InterpretCC: Conditional Computation for Inherently Interpretable Neural Networks

1 code implementation 5 Feb 2024 Vinitra Swamy, Julian Blackwell, Jibril Frej, Martin Jaggi, Tanja Käser

Real-world interpretability for neural networks is a tradeoff between three concerns: 1) it requires humans to trust the explanation approximation (e.g., post-hoc approaches), 2) it compromises the understandability of the explanation (e.g., automatically identified feature masks), and 3) it compromises the model performance (e.g., decision trees).

News Classification

DenseFormer: Enhancing Information Flow in Transformers via Depth Weighted Averaging

1 code implementation 4 Feb 2024 Matteo Pagliardini, Amirkeivan Mohtashami, Francois Fleuret, Martin Jaggi

The transformer architecture by Vaswani et al. (2017) is now ubiquitous across application domains, from natural language processing to speech processing and image understanding.

Controllable Topic-Focused Abstractive Summarization

no code implementations12 Nov 2023 Seyed Ali Bahrainian, Martin Jaggi, Carsten Eickhoff

We show that our model sets a new state of the art on the NEWTS dataset in terms of topic-focused abstractive summarization as well as a topic-prevalence score.

Abstractive Text Summarization

Irreducible Curriculum for Language Model Pretraining

no code implementations 23 Oct 2023 Simin Fan, Martin Jaggi

Automatic data selection and curriculum design for training large language models is challenging, with only a few existing methods showing improvements over standard training.

Language Modelling

DoGE: Domain Reweighting with Generalization Estimation

no code implementations 23 Oct 2023 Simin Fan, Matteo Pagliardini, Martin Jaggi

Moreover, when aiming to generalize to out-of-domain target tasks that are unseen in the pretraining corpus (OOD domain), DoGE can effectively identify inter-domain dependencies and consistently achieves better test perplexity on the target domain.

Domain Generalization Language Modelling

CoTFormer: More Tokens With Attention Make Up For Less Depth

no code implementations 16 Oct 2023 Amirkeivan Mohtashami, Matteo Pagliardini, Martin Jaggi

The race to continually develop ever larger and deeper foundational models is underway.

MultiModN - Multimodal, Multi-Task, Interpretable Modular Networks

1 code implementation 25 Sep 2023 Vinitra Swamy, Malika Satayeva, Jibril Frej, Thierry Bossy, Thijs Vogels, Martin Jaggi, Tanja Käser, Mary-Anne Hartley

Predicting multiple real-world tasks in a single model often requires a particularly diverse feature space.

Layer-wise Linear Mode Connectivity

1 code implementation 13 Jul 2023 Linara Adilova, Maksym Andriushchenko, Michael Kamp, Asja Fischer, Martin Jaggi

Averaging neural network parameters is an intuitive method for fusing the knowledge of two independent models.

Federated Learning Linear Mode Connectivity
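
The excerpt above describes parameter averaging of two independently trained networks. As a purely illustrative sketch (not code from the paper), layer-wise averaging of two models with identical architecture can be written as follows, assuming parameters are stored as name-to-array dicts:

```python
import numpy as np

def average_models(params_a, params_b, alpha=0.5):
    """Layer-wise convex combination of two models with identical structure.

    params_a, params_b: dicts mapping layer names to NumPy weight arrays.
    alpha: interpolation coefficient (0.5 gives the plain average).
    """
    assert params_a.keys() == params_b.keys(), "models must share the same layers"
    return {name: alpha * params_a[name] + (1.0 - alpha) * params_b[name]
            for name in params_a}

# toy usage: two 'models' with one weight matrix and one bias each
model_a = {"fc.weight": np.random.randn(4, 3), "fc.bias": np.zeros(4)}
model_b = {"fc.weight": np.random.randn(4, 3), "fc.bias": np.ones(4)}
fused = average_models(model_a, model_b)
print(fused["fc.bias"])  # -> [0.5 0.5 0.5 0.5]
```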

Provably Personalized and Robust Federated Learning

1 code implementation 14 Jun 2023 Mariel Werner, Lie He, Michael Jordan, Martin Jaggi, Sai Praneeth Karimireddy

Identifying clients with similar objectives and learning a model-per-cluster is an intuitive and interpretable approach to personalization in federated learning.

Clustering Personalized Federated Learning +1

Faster Causal Attention Over Large Sequences Through Sparse Flash Attention

1 code implementation 1 Jun 2023 Matteo Pagliardini, Daniele Paliotta, Martin Jaggi, François Fleuret

While many works have proposed schemes to sparsify the attention patterns and reduce the computational overhead of self-attention, those are often limited by implementation concerns and end up imposing a simple and static structure over the attention matrix.

Language Modelling

On Convergence of Incremental Gradient for Non-Convex Smooth Functions

no code implementations 30 May 2023 Anastasia Koloskova, Nikita Doikov, Sebastian U. Stich, Martin Jaggi

In machine learning and neural network optimization, algorithms like incremental gradient and shuffle SGD are popular because they minimize the number of cache misses and show good practical convergence behavior.

Multiplication-Free Transformer Training via Piecewise Affine Operations

1 code implementation NeurIPS 2023 Atli Kosson, Martin Jaggi

Finally, we show that we can eliminate all multiplications in the entire training process, including operations in the forward pass, backward pass and optimizer update, demonstrating the first successful training of modern neural network architectures in a fully multiplication-free fashion.

Ghost Noise for Regularizing Deep Neural Networks

1 code implementation 26 May 2023 Atli Kosson, Dongyang Fan, Martin Jaggi

Batch Normalization (BN) is widely used to stabilize the optimization process and improve the test performance of deep neural networks.

Rotational Equilibrium: How Weight Decay Balances Learning Across Neural Networks

2 code implementations 26 May 2023 Atli Kosson, Bettina Messmer, Martin Jaggi

This study investigates how weight decay affects the update behavior of individual neurons in deep neural networks through a combination of applied analysis and experimentation.

L2 Regularization

Landmark Attention: Random-Access Infinite Context Length for Transformers

2 code implementations 25 May 2023 Amirkeivan Mohtashami, Martin Jaggi

While Transformers have shown remarkable success in natural language processing, their attention mechanism's large memory requirements have limited their ability to handle longer contexts.

Retrieval

Linearization Algorithms for Fully Composite Optimization

no code implementations 24 Feb 2023 Maria-Luiza Vladarean, Nikita Doikov, Martin Jaggi, Nicolas Flammarion

This paper studies first-order algorithms for solving fully composite optimization problems over convex and compact sets.

Unified Convergence Theory of Stochastic and Variance-Reduced Cubic Newton Methods

no code implementations 23 Feb 2023 El Mahdi Chayti, Nikita Doikov, Martin Jaggi

Our helper framework offers the algorithm designer high flexibility for constructing and analyzing the stochastic Cubic Newton methods, allowing arbitrary size batches, and the use of noisy and possibly biased estimates of the gradients and Hessians, incorporating both the variance reduction and the lazy Hessian updates.

Auxiliary Learning

Second-order optimization with lazy Hessians

no code implementations 1 Dec 2022 Nikita Doikov, El Mahdi Chayti, Martin Jaggi

This provably improves the total arithmetical complexity of second-order algorithms by a factor $\sqrt{d}$.

Scalable Collaborative Learning via Representation Sharing

no code implementations 20 Nov 2022 Frédéric Berdoz, Abhishek Singh, Martin Jaggi, Ramesh Raskar

To do so, each client releases averaged last-hidden-layer activations of similar labels to a central server that only acts as a relay (i.e., is not involved in the training or aggregation of the models).

Federated Learning Knowledge Distillation +1
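
To make the representation-sharing idea above concrete, here is a hedged sketch of the client-side step, with hypothetical shapes and names rather than the authors' actual protocol: activations of the last hidden layer are averaged per label before being released to the relay server.

```python
import numpy as np

def averaged_activations_per_label(activations, labels):
    """Average last-hidden-layer activations over all samples sharing a label.

    activations: (n_samples, hidden_dim) array of last-layer activations.
    labels: (n_samples,) array of integer class labels.
    Returns a dict {label: mean activation vector}, which is what the client
    would share instead of raw data or model weights.
    """
    summaries = {}
    for label in np.unique(labels):
        summaries[int(label)] = activations[labels == label].mean(axis=0)
    return summaries

# toy usage
acts = np.random.randn(6, 8)          # 6 samples, hidden size 8
labs = np.array([0, 0, 1, 1, 1, 2])
shared = averaged_activations_per_label(acts, labs)
print({k: v.shape for k, v in shared.items()})
```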

Accuracy Boosters: Epoch-Driven Mixed-Mantissa Block Floating-Point for DNN Training

no code implementations 19 Nov 2022 Simla Burcu Harma, Ayan Chakraborty, Babak Falsafi, Martin Jaggi, Yunho Oh

The unprecedented growth in DNN model complexity, size, and amount of training data has led to a commensurate increase in demand for computing and a search for minimal encoding.

Modular Clinical Decision Support Networks (MoDN) -- Updatable, Interpretable, and Portable Predictions for Evolving Clinical Environments

1 code implementation 12 Nov 2022 Cécile Trottet, Thijs Vogels, Martin Jaggi, Mary-Anne Hartley

Data-driven Clinical Decision Support Systems (CDSS) have the potential to improve and standardise care with personalised probabilistic guidance.

Privacy Preserving

Sharper Convergence Guarantees for Asynchronous SGD for Distributed and Federated Learning

no code implementations 16 Jun 2022 Anastasia Koloskova, Sebastian U. Stich, Martin Jaggi

In this work (i) we obtain a tighter convergence rate of $\mathcal{O}\!\left(\sigma^2\epsilon^{-2}+ \sqrt{\tau_{\max}\tau_{avg}}\epsilon^{-1}\right)$ without any change in the algorithm where $\tau_{avg}$ is the average delay, which can be significantly smaller than $\tau_{\max}$.

Avg Federated Learning

Beyond spectral gap: The role of the topology in decentralized learning

1 code implementation 7 Jun 2022 Thijs Vogels, Hadrien Hendrikx, Martin Jaggi

In data-parallel optimization of machine learning models, workers collaborate to improve their estimates of the model: more accurate gradients allow them to use larger learning rates and optimize faster.

Distributed Optimization

Special Properties of Gradient Descent with Large Learning Rates

no code implementations 30 May 2022 Amirkeivan Mohtashami, Martin Jaggi, Sebastian Stich

However, we show through a novel set of experiments that the stochastic noise is not sufficient to explain good non-convex training, and that instead the effect of a large learning rate itself is essential for obtaining the best performance. We demonstrate the same effects also in the noiseless case, i.e., for full-batch GD.

SKILL: Structured Knowledge Infusion for Large Language Models

no code implementations NAACL 2022 Fedor Moiseev, Zhe Dong, Enrique Alfonseca, Martin Jaggi

The models pre-trained on factual triples compare competitively with the ones on natural language sentences that contain the same knowledge.

Knowledge Graphs TriviaQA

Data-heterogeneity-aware Mixing for Decentralized Learning

no code implementations 13 Apr 2022 Yatin Dandi, Anastasia Koloskova, Martin Jaggi, Sebastian U. Stich

Decentralized learning provides an effective framework to train machine learning models with data distributed over arbitrary communication graphs.

Improving Generalization via Uncertainty Driven Perturbations

no code implementations 11 Feb 2022 Matteo Pagliardini, Gilberto Manunza, Martin Jaggi, Michael I. Jordan, Tatjana Chavdarova

We show that UDP is guaranteed to achieve the maximum margin decision boundary on linear models and that it notably increases it on challenging simulated datasets.

Agree to Disagree: Diversity through Disagreement for Better Transferability

1 code implementation 9 Feb 2022 Matteo Pagliardini, Martin Jaggi, François Fleuret, Sai Praneeth Karimireddy

This behavior can hinder the transferability of trained models by (i) favoring the learning of simpler but spurious features, present in the training data but absent from the test data, and (ii) leveraging only a small subset of predictive features.

Out of Distribution (OOD) Detection

Byzantine-Robust Decentralized Learning via ClippedGossip

1 code implementation 3 Feb 2022 Lie He, Sai Praneeth Karimireddy, Martin Jaggi

In this paper, we study the challenging task of Byzantine-robust decentralized training on arbitrary communication graphs.

Federated Learning

Breaking the centralized barrier for cross-device federated learning

no code implementations NeurIPS 2021 Sai Praneeth Karimireddy, Martin Jaggi, Satyen Kale, Mehryar Mohri, Sashank Reddi, Sebastian U. Stich, Ananda Theertha Suresh

Federated learning (FL) is a challenging setting for optimization due to the heterogeneity of the data across different clients which gives rise to the client drift phenomenon.

Federated Learning

Interpreting Language Models Through Knowledge Graph Extraction

1 code implementation 16 Nov 2021 Vinitra Swamy, Angelika Romanou, Martin Jaggi

In this paper, we compare BERT-based language models through snapshots of acquired knowledge at sequential stages of the training process.

Language Modelling

Linear Speedup in Personalized Collaborative Learning

1 code implementation 10 Nov 2021 El Mahdi Chayti, Sai Praneeth Karimireddy, Sebastian U. Stich, Nicolas Flammarion, Martin Jaggi

Collaborative training can improve the accuracy of a model for a user by trading off the model's bias (introduced by using data from other users who are potentially different) against its variance (due to the limited amount of data on any single user).

Federated Learning Stochastic Optimization

Optimal Model Averaging: Towards Personalized Collaborative Learning

no code implementations 25 Oct 2021 Felix Grimberg, Mary-Anne Hartley, Sai P. Karimireddy, Martin Jaggi

In federated learning, differences in the data or objectives between the participating nodes motivate approaches to train a personalized machine learning model for each node.

Federated Learning

WAFFLE: Weighted Averaging for Personalized Federated Learning

no code implementations 13 Oct 2021 Martin Beaussart, Felix Grimberg, Mary-Anne Hartley, Martin Jaggi

Through a series of experiments, we compare our new approach to two recent personalized federated learning methods (Weight Erosion and APFL) as well as two general FL methods (Federated Averaging and SCAFFOLD).

Personalized Federated Learning

RelaySum for Decentralized Deep Learning on Heterogeneous Data

1 code implementation NeurIPS 2021 Thijs Vogels, Lie He, Anastasia Koloskova, Tao Lin, Sai Praneeth Karimireddy, Sebastian U. Stich, Martin Jaggi

A key challenge, primarily in decentralized deep learning, remains the handling of differences between the workers' local data distributions.

Improved Generalization-Robustness Trade-off via Uncertainty Targeted Attacks

no code implementations 29 Sep 2021 Matteo Pagliardini, Gilberto Manunza, Martin Jaggi, Tatjana Chavdarova

Deep learning models' sensitivity to small input perturbations raises security concerns and limits their use in applications where reliability is critical.

On Second-order Optimization Methods for Federated Learning

no code implementations 6 Sep 2021 Sebastian Bischoff, Stephan Günnemann, Martin Jaggi, Sebastian U. Stich

We consider federated learning (FL), where the training data is distributed across a large number of clients.

Federated Learning Specificity

Semantic Perturbations with Normalizing Flows for Improved Generalization

1 code implementation ICCV 2021 Oguz Kaan Yuksel, Sebastian U. Stich, Martin Jaggi, Tatjana Chavdarova

We find that our latent adversarial perturbations, adaptive to the classifier throughout its training, are most effective, yielding the first test accuracy improvement results on real-world datasets (CIFAR-10/100) via latent-space perturbations.

Data Augmentation

IFedAvg: Interpretable Data-Interoperability for Federated Learning

1 code implementation 14 Jul 2021 David Roschewitz, Mary-Anne Hartley, Luca Corinzia, Martin Jaggi

This enables the detection of outlier datasets in the federation as well as the compensation of local data distribution shifts, without sharing any original data.

Federated Learning

Implicit Gradient Alignment in Distributed and Federated Learning

no code implementations 25 Jun 2021 Yatin Dandi, Luis Barba, Martin Jaggi

A major obstacle to achieving global convergence in distributed and federated learning is the misalignment of gradients across clients or mini-batches, due to the heterogeneity and stochasticity of the distributed data.

Federated Learning

Masked Training of Neural Networks with Partial Gradients

no code implementations 16 Jun 2021 Amirkeivan Mohtashami, Martin Jaggi, Sebastian U. Stich

State-of-the-art training algorithms for deep learning models are based on stochastic gradient descent (SGD).

Model Compression

Obtaining Better Static Word Embeddings Using Contextual Embedding Models

1 code implementation ACL 2021 Prakhar Gupta, Martin Jaggi

The advent of contextual word embeddings, representations of words that incorporate semantic and syntactic information from their context, has led to tremendous improvements on a wide variety of NLP tasks.

Computational Efficiency Word Embeddings

Lightweight Cross-Lingual Sentence Representation Learning

1 code implementation ACL 2021 Zhuoyuan Mao, Prakhar Gupta, Pei Wang, Chenhui Chu, Martin Jaggi, Sadao Kurohashi

Large-scale models for learning fixed-dimensional cross-lingual sentence representations like LASER (Artetxe and Schwenk, 2019b) lead to significant improvement in performance on downstream tasks.

Contrastive Learning Document Classification +4

Critical Parameters for Scalable Distributed Learning with Large Batches and Asynchronous Updates

no code implementations 3 Mar 2021 Sebastian U. Stich, Amirkeivan Mohtashami, Martin Jaggi

It has been experimentally observed that the efficiency of distributed training with stochastic gradient descent (SGD) depends decisively on the batch size and, in asynchronous implementations, on the gradient staleness.

Consensus Control for Decentralized Deep Learning

no code implementations 9 Feb 2021 Lingjing Kong, Tao Lin, Anastasia Koloskova, Martin Jaggi, Sebastian U. Stich

Decentralized training of deep learning models enables on-device learning over networks, as well as efficient scaling to large compute clusters.

Quasi-Global Momentum: Accelerating Decentralized Deep Learning on Heterogeneous Data

1 code implementation 9 Feb 2021 Tao Lin, Sai Praneeth Karimireddy, Sebastian U. Stich, Martin Jaggi

In this paper, we investigate and identify the limitation of several decentralized optimization algorithms for different degrees of data heterogeneity.

Faster Training of Word Embeddings

no code implementations 1 Jan 2021 Eliza Wszola, Martin Jaggi, Markus Püschel

Word embeddings have gained increasing popularity in recent years due to the Word2vec library and its extension fastText, which uses subword information.

Word Embeddings

On the Effect of Consensus in Decentralized Deep Learning

no code implementations 1 Jan 2021 Tao Lin, Lingjing Kong, Anastasia Koloskova, Martin Jaggi, Sebastian U Stich

Decentralized training of deep learning models enables on-device learning over networks, as well as efficient scaling to large compute clusters.

Learning from History for Byzantine Robust Optimization

1 code implementation 18 Dec 2020 Sai Praneeth Karimireddy, Lie He, Martin Jaggi

Secondly, we prove that even if the aggregation rules may succeed in limiting the influence of the attackers in a single round, the attackers can couple their attacks across time eventually leading to divergence.

Federated Learning Stochastic Optimization

Practical Low-Rank Communication Compression in Decentralized Deep Learning

1 code implementation NeurIPS 2020 Thijs Vogels, Sai Praneeth Karimireddy, Martin Jaggi

Lossy gradient compression has become a practical tool to overcome the communication bottleneck in centrally coordinated distributed training of machine learning models.

Byzantine-Robust Learning on Heterogeneous Datasets via Resampling

no code implementations 28 Sep 2020 Lie He, Sai Praneeth Karimireddy, Martin Jaggi

In Byzantine-robust distributed optimization, a central server wants to train a machine learning model over data distributed across multiple workers.

Distributed Optimization

Sparse Communication for Training Deep Networks

no code implementations 19 Sep 2020 Negar Foroutan Eghlidi, Martin Jaggi

Although distributed training reduces the computation time, the communication overhead associated with the gradient exchange forms a scalability bottleneck for the algorithm.

Mime: Mimicking Centralized Stochastic Algorithms in Federated Learning

1 code implementation 8 Aug 2020 Sai Praneeth Karimireddy, Martin Jaggi, Satyen Kale, Mehryar Mohri, Sashank J. Reddi, Sebastian U. Stich, Ananda Theertha Suresh

Federated learning (FL) is a challenging setting for optimization due to the heterogeneity of the data across different clients which gives rise to the client drift phenomenon.

Federated Learning

PowerGossip: Practical Low-Rank Communication Compression in Decentralized Deep Learning

2 code implementations 4 Aug 2020 Thijs Vogels, Sai Praneeth Karimireddy, Martin Jaggi

Lossy gradient compression has become a practical tool to overcome the communication bottleneck in centrally coordinated distributed training of machine learning models.

Multi-Head Attention: Collaborate Instead of Concatenate

2 code implementations 29 Jun 2020 Jean-Baptiste Cordonnier, Andreas Loukas, Martin Jaggi

We also show that it is possible to re-parametrize a pre-trained multi-head attention layer into our collaborative attention layer.

Machine Translation Translation

Byzantine-Robust Learning on Heterogeneous Datasets via Bucketing

1 code implementation ICLR 2022 Sai Praneeth Karimireddy, Lie He, Martin Jaggi

In Byzantine robust distributed or federated learning, a central server wants to train a machine learning model over data distributed across multiple workers.

Distributed Optimization Federated Learning

Ensemble Distillation for Robust Model Fusion in Federated Learning

1 code implementation NeurIPS 2020 Tao Lin, Lingjing Kong, Sebastian U. Stich, Martin Jaggi

In most of the current training schemes the central model is refined by averaging the parameters of the server model and the updated parameters from the client side.

BIG-bench Machine Learning Federated Learning +1

Extrapolation for Large-batch Training in Deep Learning

no code implementations ICML 2020 Tao Lin, Lingjing Kong, Sebastian U. Stich, Martin Jaggi

Deep learning networks are typically trained by Stochastic Gradient Descent (SGD) methods that iteratively improve the model parameters by estimating a gradient on a very small fraction of the training data.

Secure Byzantine-Robust Machine Learning

no code implementations 8 Jun 2020 Lie He, Sai Praneeth Karimireddy, Martin Jaggi

Increasingly, machine learning systems are being deployed to edge servers and devices (e.g., mobile phones) and trained in a collaborative manner.

BIG-bench Machine Learning

Masking as an Efficient Alternative to Finetuning for Pretrained Language Models

no code implementations EMNLP 2020 Mengjie Zhao, Tao Lin, Fei Mi, Martin Jaggi, Hinrich Schütze

We present an efficient method of utilizing pretrained language models, where we learn selective binary masks for pretrained weights in lieu of modifying them through finetuning.

Understanding the Effects of Data Parallelism and Sparsity on Neural Network Training

no code implementations ICLR 2021 Namhoon Lee, Thalaiyasingam Ajanthan, Philip H. S. Torr, Martin Jaggi

As a result, we find, across various workloads of dataset, network model, and optimization algorithm, that there exists a general scaling trend between batch size and the number of training steps to convergence for the effect of data parallelism and, further, a difficulty of training under sparsity.

Network Pruning

A Unified Theory of Decentralized SGD with Changing Topology and Local Updates

no code implementations ICML 2020 Anastasia Koloskova, Nicolas Loizou, Sadra Boreiri, Martin Jaggi, Sebastian U. Stich

Decentralized stochastic optimization methods have gained a lot of attention recently, mainly because of their cheap per iteration cost, data locality, and their communication-efficiency.

Stochastic Optimization

Robust Cross-lingual Embeddings from Parallel Sentences

2 code implementations 28 Dec 2019 Ali Sabet, Prakhar Gupta, Jean-Baptiste Cordonnier, Robert West, Martin Jaggi

Recent advances in cross-lingual word embeddings have primarily relied on mapping-based methods, which project pretrained word embeddings from different languages into a shared space through a linear transformation.

Cross-Lingual Document Classification Cross-Lingual Word Embeddings +7

On the Relationship between Self-Attention and Convolutional Layers

1 code implementation ICLR 2020 Jean-Baptiste Cordonnier, Andreas Loukas, Martin Jaggi

This work provides evidence that attention layers can perform convolution and, indeed, they often learn to do so in practice.

Image Classification

Optimizer Benchmarking Needs to Account for Hyperparameter Tuning

no code implementations ICML 2020 Prabhu Teja Sivaprasad, Florian Mai, Thijs Vogels, Martin Jaggi, François Fleuret

The performance of optimizers, particularly in deep learning, depends considerably on their chosen hyperparameter configuration.

Benchmarking

Model Fusion via Optimal Transport

2 code implementations NeurIPS 2020 Sidak Pal Singh, Martin Jaggi

Finally, our approach also provides a principled way to combine the parameters of neural networks with different widths, and we explore its application for model compression.

Continual Learning Model Compression +2

On the Tunability of Optimizers in Deep Learning

no code implementations 25 Sep 2019 Prabhu Teja S*, Florian Mai*, Thijs Vogels, Martin Jaggi, Francois Fleuret

There is no consensus yet on whether adaptive gradient methods like Adam are easier to use than non-adaptive optimization methods like SGD.

Decentralized Deep Learning with Arbitrary Communication Compression

1 code implementation ICLR 2020 Anastasia Koloskova, Tao Lin, Sebastian U. Stich, Martin Jaggi

Decentralized training of deep learning models is a key element for enabling data privacy and on-device learning over networks, as well as for efficient scaling to large compute clusters.

Correlating Twitter Language with Community-Level Health Outcomes

1 code implementation WS 2019 Arno Schneuwly, Ralf Grubenmann, Séverine Rion Logean, Mark Cieliebak, Martin Jaggi

We study how language on social media is linked to diseases such as atherosclerotic heart disease (AHD), diabetes and various types of cancer.

Clustering regression +2

PowerSGD: Practical Low-Rank Gradient Compression for Distributed Optimization

1 code implementation NeurIPS 2019 Thijs Vogels, Sai Praneeth Karimireddy, Martin Jaggi

We study gradient compression methods to alleviate the communication bottleneck in data-parallel distributed optimization.

Distributed Optimization
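
As an illustration of the low-rank compression idea behind PowerSGD, the sketch below performs a single power-iteration step on a gradient reshaped as a matrix; the full algorithm additionally uses error feedback, warm-started factors, and an all-reduce of the two factors across workers, none of which is shown here.

```python
import numpy as np

def low_rank_compress(grad_matrix, rank=2, rng=np.random.default_rng(0)):
    """One power-iteration step producing a rank-`rank` approximation M ~ P @ Q.T.

    Only P and Q (n*rank + m*rank numbers) would need to be communicated,
    instead of the full n*m gradient matrix.
    """
    n, m = grad_matrix.shape
    q = rng.standard_normal((m, rank))
    p = grad_matrix @ q                      # (n, rank)
    p, _ = np.linalg.qr(p)                   # orthonormalize the columns of P
    q = grad_matrix.T @ p                    # (m, rank)
    return p, q

grad = np.random.randn(256, 128)
p, q = low_rank_compress(grad, rank=4)
approx = p @ q.T
print(np.linalg.norm(grad - approx) / np.linalg.norm(grad))  # relative error
```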

On Linear Learning with Manycore Processors

1 code implementation 2 May 2019 Eliza Wszola, Celestine Mendler-Dünner, Martin Jaggi, Markus Püschel

A new generation of manycore processors is on the rise, offering dozens of cores or more on a chip and, in a sense, fusing host processor and accelerator.

Crosslingual Document Embedding as Reduced-Rank Ridge Regression

1 code implementation 8 Apr 2019 Martin Josifoski, Ivan S. Paskov, Hristo S. Paskov, Martin Jaggi, Robert West

Finally, although not trained for embedding sentences and words, it also achieves competitive performance on crosslingual sentence and word retrieval tasks.

Document Embedding regression +2

Forecasting intracranial hypertension using multi-scale waveform metrics

no code implementations 25 Feb 2019 Matthias Hüser, Adrian Kündig, Walter Karlen, Valeria De Luca, Martin Jaggi

Approach: We developed a prediction framework that forecasts onsets of acute intracranial hypertension in the next 8 hours.

Time Series Analysis

Overcoming Multi-Model Forgetting

no code implementations ICLR 2019 Yassine Benyahia, Kaicheng Yu, Kamil Bennani-Smires, Martin Jaggi, Anthony Davison, Mathieu Salzmann, Claudiu Musat

We identify a phenomenon, which we refer to as multi-model forgetting, that occurs when sequentially training multiple deep networks with partially-shared parameters; the performance of previously-trained models degrades as one optimizes a subsequent one, due to the overwriting of shared parameters.

Neural Architecture Search

Decentralized Stochastic Optimization and Gossip Algorithms with Compressed Communication

3 code implementations 1 Feb 2019 Anastasia Koloskova, Sebastian U. Stich, Martin Jaggi

We (i) propose a novel gossip-based stochastic gradient descent algorithm, CHOCO-SGD, that converges at rate $\mathcal{O}\left(1/(nT) + 1/(T \delta^2 \omega)^2\right)$ for strongly convex objectives, where $T$ denotes the number of iterations and $\delta$ the eigengap of the connectivity matrix.

Stochastic Optimization
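
The gossip-averaging primitive underlying CHOCO-SGD and related decentralized methods can be illustrated with the minimal sketch below; the actual CHOCO-SGD additionally compresses the exchanged messages and interleaves gossip with local gradient steps, which this toy example omits.

```python
import numpy as np

def gossip_step(params, mixing_matrix):
    """One gossip-averaging round: each node mixes its parameters with its neighbors'.

    params: (n_nodes, dim) array, one parameter vector per node.
    mixing_matrix: (n_nodes, n_nodes) doubly stochastic matrix W; W[i, j] > 0
    only if nodes i and j are connected in the communication graph.
    """
    return mixing_matrix @ params

# ring of 4 nodes: each keeps 1/2 of its own value and takes 1/4 from each neighbor
W = np.array([[0.5, 0.25, 0.0, 0.25],
              [0.25, 0.5, 0.25, 0.0],
              [0.0, 0.25, 0.5, 0.25],
              [0.25, 0.0, 0.25, 0.5]])
x = np.random.randn(4, 3)
for _ in range(50):
    x = gossip_step(x, W)
print(x.std(axis=0))  # nodes approach consensus: per-coordinate spread shrinks
```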

Unsupervised Scalable Representation Learning for Multivariate Time Series

2 code implementations NeurIPS 2019 Jean-Yves Franceschi, Aymeric Dieuleveut, Martin Jaggi

Time series constitute a challenging data type for machine learning algorithms, due to their highly variable lengths and sparse labeling in practice.

BIG-bench Machine Learning Representation Learning +2

Efficient Greedy Coordinate Descent for Composite Problems

no code implementations 16 Oct 2018 Sai Praneeth Karimireddy, Anastasia Koloskova, Sebastian U. Stich, Martin Jaggi

For these problems we provide (i) the first linear rates of convergence independent of $n$, and show that our greedy update rule provides speedups similar to those obtained in the smooth case.

Sparsified SGD with Memory

1 code implementation NeurIPS 2018 Sebastian U. Stich, Jean-Baptiste Cordonnier, Martin Jaggi

Huge-scale machine learning problems are nowadays tackled by distributed optimization algorithms, i.e., algorithms that leverage the compute power of many devices for training.

Distributed Optimization Quantization
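
A minimal single-worker sketch of the memory (error-feedback) mechanism studied in this paper: only the top-k coordinates of the accumulated gradient are transmitted, and the untransmitted residual is kept locally and added back at the next step. Shapes and the value of k here are illustrative.

```python
import numpy as np

def sparsify_with_memory(grad, memory, k):
    """Keep only the k largest-magnitude entries of grad + memory.

    The part that is not transmitted stays in `memory` and is added back
    to the next gradient, so no information is permanently discarded.
    """
    corrected = grad + memory
    idx = np.argpartition(np.abs(corrected), -k)[-k:]   # indices of the top-k entries
    sparse = np.zeros_like(corrected)
    sparse[idx] = corrected[idx]
    new_memory = corrected - sparse                      # residual kept locally
    return sparse, new_memory

mem = np.zeros(10)
for step in range(3):
    g = np.random.randn(10)
    msg, mem = sparsify_with_memory(g, mem, k=2)
    print(f"step {step}: sent {np.count_nonzero(msg)} of {g.size} coordinates")
```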

Context Mover's Distance & Barycenters: Optimal Transport of Contexts for Building Representations

2 code implementations 29 Aug 2018 Sidak Pal Singh, Andreas Hug, Aymeric Dieuleveut, Martin Jaggi

We present a framework for building unsupervised representations of entities and their compositions, where each entity is viewed as a probability distribution rather than a vector embedding.

Sentence Sentence Embedding +1

Don't Use Large Mini-Batches, Use Local SGD

2 code implementations ICLR 2020 Tao Lin, Sebastian U. Stich, Kumar Kshitij Patel, Martin Jaggi

Mini-batch stochastic gradient methods (SGD) are state of the art for distributed training of deep neural networks.
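
The local SGD pattern contrasted here with large mini-batches can be sketched on a toy quadratic objective: workers run several local SGD steps and then average their models. The objective, learning rate, and schedule below are illustrative assumptions, not the paper's experimental setup.

```python
import numpy as np

rng = np.random.default_rng(0)
n_workers, dim, lr, local_steps, rounds = 4, 5, 0.1, 8, 10

# each worker holds its own data mean; the global optimum is the average of means
data_means = rng.standard_normal((n_workers, dim))
models = np.zeros((n_workers, dim))

for _ in range(rounds):
    for w in range(n_workers):
        for _ in range(local_steps):                      # local SGD steps
            sample = data_means[w] + 0.1 * rng.standard_normal(dim)
            grad = models[w] - sample                     # gradient of 0.5*||x - sample||^2
            models[w] -= lr * grad
    models[:] = models.mean(axis=0)                       # periodic model averaging

print(np.linalg.norm(models[0] - data_means.mean(axis=0)))  # close to the global optimum
```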

COLA: Decentralized Linear Learning

1 code implementation NeurIPS 2018 Lie He, An Bian, Martin Jaggi

Decentralized machine learning is a promising emerging paradigm in view of global challenges of data ownership and privacy.

BIG-bench Machine Learning CoLA +2

A Distributed Second-Order Algorithm You Can Trust

no code implementations ICML 2018 Celestine Dünner, Aurelien Lucchi, Matilde Gargiani, An Bian, Thomas Hofmann, Martin Jaggi

Due to the rapid growth of data and computational resources, distributed optimization has become an active research area in recent years.

Distributed Optimization Second-order methods

Wasserstein is all you need

no code implementations 5 Jun 2018 Sidak Pal Singh, Andreas Hug, Aymeric Dieuleveut, Martin Jaggi

We propose a unified framework for building unsupervised representations of individual objects or entities (and their compositions), by associating with each object both a distributional as well as a point estimate (vector embedding).

Sentence

Global linear convergence of Newton's method without strong-convexity or Lipschitz gradients

no code implementations 1 Jun 2018 Sai Praneeth Karimireddy, Sebastian U. Stich, Martin Jaggi

We show that Newton's method converges globally at a linear rate for objective functions whose Hessians are stable.

regression

Training DNNs with Hybrid Block Floating Point

no code implementations NeurIPS 2018 Mario Drumond, Tao Lin, Martin Jaggi, Babak Falsafi

We identify block floating point (BFP) as a promising alternative representation since it exhibits wide dynamic range and enables the majority of DNN operations to be performed with fixed-point logic.

On Matching Pursuit and Coordinate Descent

no code implementations ICML 2018 Francesco Locatello, Anant Raj, Sai Praneeth Karimireddy, Gunnar Rätsch, Bernhard Schölkopf, Sebastian U. Stich, Martin Jaggi

Exploiting the connection between the two algorithms, we present a unified analysis of both, providing affine invariant sublinear $\mathcal{O}(1/t)$ rates on smooth objectives and linear convergence on strongly convex objectives.

Simple Unsupervised Keyphrase Extraction using Sentence Embeddings

3 code implementations CONLL 2018 Kamil Bennani-Smires, Claudiu Musat, Andreea Hossmann, Michael Baeriswyl, Martin Jaggi

EmbedRank achieves higher F-scores than graph-based state of the art systems on standard datasets and is suitable for real-time processing of large amounts of Web data.

Keyphrase Extraction Sentence +1
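
The core ranking step of an EmbedRank-style extractor can be sketched as follows: candidate phrases are scored by cosine similarity between their embedding and the document embedding. The random vectors below stand in for real sentence embeddings, and the diversity-promoting selection used in the paper is omitted.

```python
import numpy as np

def rank_candidates(doc_vec, candidate_vecs):
    """Rank candidate phrases by cosine similarity to the document embedding."""
    doc = doc_vec / np.linalg.norm(doc_vec)
    cands = candidate_vecs / np.linalg.norm(candidate_vecs, axis=1, keepdims=True)
    scores = cands @ doc
    return np.argsort(-scores), scores

rng = np.random.default_rng(0)
doc_embedding = rng.standard_normal(16)              # stand-in for a document embedding
phrase_embeddings = rng.standard_normal((5, 16))     # stand-ins for candidate phrase vectors
order, scores = rank_candidates(doc_embedding, phrase_embeddings)
print("best candidate index:", order[0], "score:", round(float(scores[order[0]]), 3))
```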

An Accelerated Communication-Efficient Primal-Dual Optimization Framework for Structured Machine Learning

1 code implementation 14 Nov 2017 Chenxin Ma, Martin Jaggi, Frank E. Curtis, Nathan Srebro, Martin Takáč

In this paper, an accelerated variant of CoCoA+ is proposed and shown to possess a convergence rate of $\mathcal{O}(1/t^2)$ in terms of reducing suboptimality.

BIG-bench Machine Learning Distributed Optimization

Safe Adaptive Importance Sampling

no code implementations NeurIPS 2017 Sebastian U. Stich, Anant Raj, Martin Jaggi

Importance sampling has become an indispensable strategy to speed up optimization algorithms for large-scale applications.

Efficient Use of Limited-Memory Accelerators for Linear Learning on Heterogeneous Systems

1 code implementation NeurIPS 2017 Celestine Dünner, Thomas Parnell, Martin Jaggi

We propose a generic algorithmic building block to accelerate training of machine learning models on heterogeneous compute systems.

BIG-bench Machine Learning

Learning Aerial Image Segmentation from Online Maps

2 code implementations 21 Jul 2017 Pascal Kaiser, Jan Dirk Wegner, Aurelien Lucchi, Martin Jaggi, Thomas Hofmann, Konrad Schindler

We adapt a state-of-the-art CNN architecture for semantic segmentation of buildings and roads in aerial images, and compare its performance when using different training data sets, ranging from manually labeled, pixel-accurate ground truth of the same city to automatic training data derived from OpenStreetMap data from distant locations.

General Classification Image Segmentation +2

Unsupervised robust nonparametric learning of hidden community properties

no code implementations 11 Jul 2017 Mikhail A. Langovoy, Akhilesh Gotmare, Martin Jaggi

We consider learning of fundamental properties of communities in large noisy networks, in the prototypical situation where the nodes or users are split into two classes according to a binary property, e.g., according to their opinions or preferences on a topic.

Approximate Steepest Coordinate Descent

no code implementations ICML 2017 Sebastian U. Stich, Anant Raj, Martin Jaggi

We propose a new selection rule for the coordinate selection in coordinate descent methods for huge-scale optimization.

Computational Efficiency regression

Greedy Algorithms for Cone Constrained Optimization with Convergence Guarantees

no code implementations NeurIPS 2017 Francesco Locatello, Michael Tschannen, Gunnar Rätsch, Martin Jaggi

Greedy optimization methods such as Matching Pursuit (MP) and Frank-Wolfe (FW) algorithms regained popularity in recent years due to their simplicity, effectiveness and theoretical guarantees.

Generating Steganographic Text with LSTMs

1 code implementation ACL 2017 Tina Fang, Martin Jaggi, Katerina Argyraki

Motivated by concerns for user privacy, we design a steganographic system ("stegosystem") that enables two users to exchange encrypted messages without an adversary detecting that such an exchange is taking place.

Unsupervised Learning of Sentence Embeddings using Compositional n-Gram Features

5 code implementations NAACL 2018 Matteo Pagliardini, Prakhar Gupta, Martin Jaggi

The recent tremendous success of unsupervised word embeddings in a multitude of applications raises the obvious question of whether similar methods could be derived to improve embeddings (i.e., semantic representations) of word sequences as well.

Sentence Sentence Embeddings +1
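
The compositional idea behind Sent2Vec-style sentence embeddings, averaging learned unigram and n-gram vectors, can be sketched as below; the lookup table here holds random placeholder vectors rather than trained embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 8
vocab = {w: rng.standard_normal(dim) for w in
         ["the", "cat", "sat", "on", "mat", "the_cat", "cat_sat", "sat_on", "on_mat"]}

def sentence_embedding(tokens, table):
    """Average the vectors of all unigrams and bigrams found in the lookup table."""
    bigrams = [f"{a}_{b}" for a, b in zip(tokens, tokens[1:])]
    vecs = [g for g in tokens + bigrams if g in table]
    return np.mean([table[g] for g in vecs], axis=0)

emb = sentence_embedding(["the", "cat", "sat", "on", "the", "mat"], vocab)
print(emb.shape)  # (8,)
```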

Faster Coordinate Descent via Adaptive Importance Sampling

no code implementations 7 Mar 2017 Dmytro Perekrestenko, Volkan Cevher, Martin Jaggi

Coordinate descent methods employ random partial updates of decision variables in order to solve huge-scale convex optimization problems.

A Unified Optimization View on Generalized Matching Pursuit and Frank-Wolfe

no code implementations 21 Feb 2017 Francesco Locatello, Rajiv Khanna, Michael Tschannen, Martin Jaggi

Two of the most fundamental prototypes of greedy optimization are the matching pursuit and Frank-Wolfe algorithms.

Screening Rules for Convex Problems

no code implementations 23 Sep 2016 Anant Raj, Jakob Olbrich, Bernd Gärtner, Bernhard Schölkopf, Martin Jaggi

We propose a new framework for deriving screening rules for convex optimization problems.

Primal-Dual Rates and Certificates

no code implementations 16 Feb 2016 Celestine Dünner, Simone Forte, Martin Takáč, Martin Jaggi

We propose an algorithm-independent framework to equip existing optimization methods with primal-dual certificates.

BIG-bench Machine Learning

Pursuits in Structured Non-Convex Matrix Factorizations

no code implementations 12 Feb 2016 Rajiv Khanna, Michael Tschannen, Martin Jaggi

Efficiently representing real world data in a succinct and parsimonious manner is of central importance in many fields.

L1-Regularized Distributed Optimization: A Communication-Efficient Primal-Dual Framework

2 code implementations 13 Dec 2015 Virginia Smith, Simone Forte, Michael I. Jordan, Martin Jaggi

Despite the importance of sparsity in many large-scale applications, there are few methods for distributed optimization of sparsity-inducing objectives.

Distributed Optimization

Distributed Optimization with Arbitrary Local Solvers

1 code implementation 13 Dec 2015 Chenxin Ma, Jakub Konečný, Martin Jaggi, Virginia Smith, Michael I. Jordan, Peter Richtárik, Martin Takáč

To this end, we present a framework for distributed optimization that both allows the flexibility of arbitrary solvers to be used on each (single) machine locally, and yet maintains competitive performance against other state-of-the-art special-purpose distributed methods.

Distributed Optimization

On the Global Linear Convergence of Frank-Wolfe Optimization Variants

1 code implementation NeurIPS 2015 Simon Lacoste-Julien, Martin Jaggi

In this paper, we highlight and clarify several variants of the Frank-Wolfe optimization algorithm that have been successfully applied in practice: away-steps FW, pairwise FW, fully-corrective FW and Wolfe's minimum norm point algorithm, and prove for the first time that they all enjoy global linear convergence, under a weaker condition than strong convexity of the objective.
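
For readers unfamiliar with the baseline that these variants improve on, here is a minimal sketch of vanilla Frank-Wolfe on the probability simplex, where the linear minimization oracle simply picks the vertex with the smallest gradient coordinate; the away-step, pairwise, and fully-corrective variants discussed in the paper change how the direction is chosen and are not shown.

```python
import numpy as np

def frank_wolfe_simplex(grad_fn, dim, iters=200):
    """Vanilla Frank-Wolfe for min f(x) over the probability simplex."""
    x = np.ones(dim) / dim                       # start at the simplex barycenter
    for t in range(iters):
        g = grad_fn(x)
        s = np.zeros(dim)
        s[np.argmin(g)] = 1.0                    # LMO: best vertex of the simplex
        gamma = 2.0 / (t + 2.0)                  # standard step-size schedule
        x = (1 - gamma) * x + gamma * s
    return x

# toy objective: 0.5 * ||x - b||^2 with b inside the simplex
b = np.array([0.6, 0.3, 0.1])
x_star = frank_wolfe_simplex(lambda x: x - b, dim=3)
print(np.round(x_star, 3))  # approaches b
```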

Adding vs. Averaging in Distributed Primal-Dual Optimization

1 code implementation 12 Feb 2015 Chenxin Ma, Virginia Smith, Martin Jaggi, Michael I. Jordan, Peter Richtárik, Martin Takáč

Distributed optimization methods for large-scale machine learning suffer from a communication bottleneck.

Distributed Optimization

An Equivalence between the Lasso and Support Vector Machines

no code implementations 5 Mar 2013 Martin Jaggi

As a consequence, many existing optimization algorithms for both SVMs and Lasso can also be applied to the respective other problem instances.
