Search Results for author: Amirkeivan Mohtashami

Found 11 papers, 4 papers with code

DenseFormer: Enhancing Information Flow in Transformers via Depth Weighted Averaging

1 code implementation • 4 Feb 2024 • Matteo Pagliardini, Amirkeivan Mohtashami, Francois Fleuret, Martin Jaggi

The transformer architecture by Vaswani et al. (2017) is now ubiquitous across application domains, from natural language processing to speech processing and image understanding.
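The title names the key mechanism, depth weighted averaging. As a purely illustrative sketch of that idea (a generic reading of the title, not the paper's exact module), each block's output can be mixed with the outputs of all earlier depths using learned scalar weights:

```python
import torch
import torch.nn as nn

class DepthWeightedAverage(nn.Module):
    """Illustrative sketch: after a block, combine its output with the
    outputs of all earlier depths (including the token embeddings) using
    learned scalar weights. A generic reading of the title, not
    necessarily the paper's exact module."""
    def __init__(self, num_inputs: int):
        super().__init__()
        w = torch.zeros(num_inputs)
        w[-1] = 1.0  # initially pass the newest output straight through
        self.weights = nn.Parameter(w)

    def forward(self, outputs):
        # outputs: [embeddings, block_1_out, ..., block_i_out]
        return sum(wi * x for wi, x in zip(self.weights, outputs))

dwa = DepthWeightedAverage(num_inputs=3)
mixed = dwa([torch.randn(2, 5, 16) for _ in range(3)])  # (batch, seq, dim)
```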

CoTFormer: More Tokens With Attention Make Up For Less Depth

no code implementations • 16 Oct 2023 • Amirkeivan Mohtashami, Matteo Pagliardini, Martin Jaggi

The race to continually develop ever larger and deeper foundational models is underway.

Landmark Attention: Random-Access Infinite Context Length for Transformers

2 code implementations • 25 May 2023 • Amirkeivan Mohtashami, Martin Jaggi

While Transformers have shown remarkable success in natural language processing, their attention mechanism's large memory requirements have limited their ability to handle longer contexts.

Retrieval
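The memory limitation mentioned in the Landmark Attention snippet comes from standard attention materializing an n x n score matrix per head, so memory grows quadratically with context length n. A back-of-the-envelope sketch (batch size, head count, and fp16 scores are illustrative assumptions, not figures from the paper):

```python
# Rough memory needed just for the attention score matrices.
# Assumptions (illustrative): batch size 1, 32 heads, fp16 scores (2 bytes).
def attention_score_memory_gib(context_length, num_heads=32,
                               bytes_per_score=2, batch_size=1):
    scores = batch_size * num_heads * context_length ** 2
    return scores * bytes_per_score / 2 ** 30

for n in (2_048, 8_192, 32_768):
    print(f"n={n:>6}: ~{attention_score_memory_gib(n):.1f} GiB of scores")
# Quadratic growth in n is what makes very long contexts costly without
# mechanisms that attend to only the relevant parts of the input.
```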

Learning Translation Quality Evaluation on Low Resource Languages from Large Language Models

no code implementations • 7 Feb 2023 • Amirkeivan Mohtashami, Mauro Verzetti, Paul K. Rubenstein

In recent years, learned metrics such as BLEURT have become widely employed to evaluate the quality of machine translation systems.

Machine Translation • Translation

Special Properties of Gradient Descent with Large Learning Rates

no code implementations • 30 May 2022 • Amirkeivan Mohtashami, Martin Jaggi, Sebastian Stich

However, we show through a novel set of experiments that the stochastic noise is not sufficient to explain good non-convex training, and that instead the effect of a large learning rate itself is essential for obtaining the best performance. We demonstrate the same effects in the noise-less case as well, i.e., for full-batch GD.
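The noise-less comparison mentioned above is easy to set up in principle: run full-batch gradient descent on the same non-convex objective with a small and with a large step size and compare where each run ends up. A minimal sketch on a toy two-dimensional non-convex function (the objective and step sizes are illustrative choices, not the paper's experimental setup):

```python
import numpy as np

def loss(w):
    # Toy non-convex objective with several local minima.
    return np.sin(3.0 * w[0]) * np.cos(3.0 * w[1]) + 0.1 * np.sum(w ** 2)

def grad(w):
    g0 = 3.0 * np.cos(3.0 * w[0]) * np.cos(3.0 * w[1]) + 0.2 * w[0]
    g1 = -3.0 * np.sin(3.0 * w[0]) * np.sin(3.0 * w[1]) + 0.2 * w[1]
    return np.array([g0, g1])

def full_batch_gd(lr, steps=500, w0=(1.0, 1.0)):
    w = np.array(w0, dtype=float)
    for _ in range(steps):
        w -= lr * grad(w)  # no stochastic noise: plain full-batch GD
    return w, loss(w)

for lr in (0.01, 0.3):  # "small" vs. "large" step size (illustrative values)
    w, final = full_batch_gd(lr)
    print(f"lr={lr:<4} final loss={final:.4f} at w={np.round(w, 3)}")
```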

Masked Training of Neural Networks with Partial Gradients

no code implementations • 16 Jun 2021 • Amirkeivan Mohtashami, Martin Jaggi, Sebastian U. Stich

State-of-the-art training algorithms for deep learning models are based on stochastic gradient descent (SGD).

Model Compression
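The title describes training with partial gradients. A hedged illustration of that general idea (a generic masking scheme, not necessarily the paper's exact method) is to zero out the gradients of a randomly chosen subset of parameter tensors before each optimizer step:

```python
import torch
import torch.nn as nn

# Generic sketch of "partial gradient" training: each step, only a random
# subset of parameter tensors is updated; the rest are left untouched.
model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))
opt = torch.optim.SGD(model.parameters(), lr=0.1)
params = list(model.parameters())

for step in range(100):
    x, y = torch.randn(16, 32), torch.randint(0, 10, (16,))
    loss = nn.functional.cross_entropy(model(x), y)
    opt.zero_grad()
    loss.backward()
    keep = torch.rand(len(params)) < 0.5  # mask over parameter tensors
    for p, k in zip(params, keep):
        if not k and p.grad is not None:
            p.grad.zero_()  # this tensor receives no update this step
    opt.step()
```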

Critical Parameters for Scalable Distributed Learning with Large Batches and Asynchronous Updates

no code implementations • 3 Mar 2021 • Sebastian U. Stich, Amirkeivan Mohtashami, Martin Jaggi

It has been experimentally observed that the efficiency of distributed training with stochastic gradient descent (SGD) depends decisively on the batch size and, in asynchronous implementations, on the gradient staleness.
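Gradient staleness, the second factor named above, can be simulated on a single machine by applying at step t a gradient that was computed on the parameters from step t - tau. A minimal sketch (the toy quadratic objective, step size, and staleness values are illustrative):

```python
import numpy as np
from collections import deque

def stale_sgd(staleness, lr=0.02, steps=300, dim=10, seed=0):
    """Gradient descent on ||w||^2 where the applied gradient was computed
    on parameters that are `staleness` steps old (as in asynchronous SGD)."""
    rng = np.random.default_rng(seed)
    w = rng.normal(size=dim)
    history = deque([w.copy()] * (staleness + 1), maxlen=staleness + 1)
    for _ in range(steps):
        w_old = history[0]        # parameters from `staleness` steps ago
        w = w - lr * 2.0 * w_old  # gradient of ||w||^2 at the stale point
        history.append(w.copy())
    return float(np.sum(w ** 2))  # final loss

for tau in (0, 8, 32):
    print(f"staleness={tau:>2}: final loss {stale_sgd(tau):.2e}")
```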

[Re] Warm-Starting Neural Network Training

1 code implementation • RC 2020 • Amirkeivan Mohtashami, Ehsan Pajouheshgar, Klim Kireev

We reproduce the results of the paper "On Warm-Starting Neural Network Training." In many real-world applications, the training data is not readily available and is accumulated over time.

Data Augmentation
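The setting described in the [Re] snippet, data accumulating over time, is exactly where warm starting comes in: when new data arrives, training either continues from the previously trained weights or restarts from a fresh initialization. A minimal sketch of the two options (the toy model, data, and hyperparameters are illustrative):

```python
import torch
import torch.nn as nn

def train(model, xs, ys, epochs=50, lr=0.1):
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        nn.functional.mse_loss(model(xs), ys).backward()
        opt.step()
    return model

torch.manual_seed(0)
x_old, y_old = torch.randn(256, 8), torch.randn(256, 1)
x_new, y_new = torch.randn(256, 8), torch.randn(256, 1)
x_all, y_all = torch.cat([x_old, x_new]), torch.cat([y_old, y_new])

# Warm start: reuse the weights trained on the old data, then continue
# training once the new data arrives.
warm = train(nn.Linear(8, 1), x_old, y_old)
warm = train(warm, x_all, y_all)

# Cold start: retrain from a fresh initialization on all the data.
cold = train(nn.Linear(8, 1), x_all, y_all)
# The reproduced paper compares the generalization of these two choices.
```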

A Gradient-Based Approach to Neural Networks Structure Learning

no code implementations • 25 Sep 2019 • Amir Ali Moinfar, Amirkeivan Mohtashami, Mahdieh Soleymani, Ali Sharifi-Zarchi

Designing the architecture of deep neural networks (DNNs) requires human expertise and is a cumbersome task.
