Search Results for author: Sebastian Jaszczur

Found 6 papers, 3 papers with code

Scaling Laws for Fine-Grained Mixture of Experts

1 code implementation • 12 Feb 2024 • Jakub Krajewski, Jan Ludziejewski, Kamil Adamczewski, Maciej Pióro, Michał Krutul, Szymon Antoniak, Kamil Ciebiera, Krystian Król, Tomasz Odrzygóźdź, Piotr Sankowski, Marek Cygan, Sebastian Jaszczur

Our findings not only show that MoE models consistently outperform dense Transformers but also highlight that the efficiency gap between dense and MoE models widens as we scale up the model size and training budget.
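As a back-of-the-envelope illustration of the granularity knob studied here (assuming the common construction of splitting each expert into G smaller ones and letting each token activate G times as many of them), the arithmetic below shows compute per token staying fixed while routing flexibility grows. All names and sizes are illustrative, not the paper's notation.

```python
# Illustrative arithmetic only: how "granularity" changes a MoE layer's routing
# combinatorics while keeping parameter and FLOP counts fixed.
from math import comb

def moe_ffn_stats(d_model, d_ff, n_experts, top_k, granularity):
    """Split each expert into `granularity` smaller experts and route each token
    to proportionally more of them, so compute per token is unchanged."""
    expert_hidden = d_ff // granularity          # each expert shrinks ...
    n_experts_g = n_experts * granularity        # ... but there are more of them
    top_k_g = top_k * granularity                # and more are active per token
    params_per_expert = 2 * d_model * expert_hidden   # up- and down-projection
    return {
        "total_params": n_experts_g * params_per_expert,
        "active_params_per_token": top_k_g * params_per_expert,
        "routing_combinations": comb(n_experts_g, top_k_g),
    }

for g in (1, 2, 4, 8):
    print(g, moe_ffn_stats(d_model=512, d_ff=2048, n_experts=8, top_k=1, granularity=g))
```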

MoE-Mamba: Efficient Selective State Space Models with Mixture of Experts

1 code implementation • 8 Jan 2024 • Maciej Pióro, Kamil Ciebiera, Krystian Król, Jan Ludziejewski, Michał Krutul, Jakub Krajewski, Szymon Antoniak, Piotr Miłoś, Marek Cygan, Sebastian Jaszczur

State Space Models (SSMs) have become serious contenders in the field of sequential modeling, challenging the dominance of Transformers.
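A minimal PyTorch sketch of the layer pattern suggested by the title and abstract: alternate a selective state-space (Mamba-style) sequence mixer with a sparse Mixture-of-Experts feed-forward block. The `SSMBlock` below is a GRU stand-in rather than a real Mamba kernel, and the expert count and sizes are made up for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SSMBlock(nn.Module):
    """Placeholder for a Mamba block: any sequence mixer with the same interface."""
    def __init__(self, d_model):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.rnn = nn.GRU(d_model, d_model, batch_first=True)

    def forward(self, x):                                # x: (batch, seq, d_model)
        out, _ = self.rnn(self.norm(x))
        return x + out                                   # residual connection

class MoEFeedForward(nn.Module):
    """Switch-style MoE: each token is routed to its top-1 expert."""
    def __init__(self, d_model, d_ff, n_experts):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x):
        h = self.norm(x)
        gates = F.softmax(self.router(h), dim=-1)        # (batch, seq, n_experts)
        weight, expert_idx = gates.max(dim=-1)           # top-1 routing
        out = torch.zeros_like(h)
        for i, expert in enumerate(self.experts):        # loop for clarity, not speed
            mask = expert_idx == i
            if mask.any():
                out[mask] = weight[mask].unsqueeze(-1) * expert(h[mask])
        return x + out                                   # residual connection

class MoEMambaBlock(nn.Module):
    """One interleaved unit: sequence mixing followed by sparse feed-forward."""
    def __init__(self, d_model=256, d_ff=1024, n_experts=8):
        super().__init__()
        self.ssm = SSMBlock(d_model)
        self.moe = MoEFeedForward(d_model, d_ff, n_experts)

    def forward(self, x):
        return self.moe(self.ssm(x))

x = torch.randn(2, 16, 256)
print(MoEMambaBlock()(x).shape)                          # torch.Size([2, 16, 256])
```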

Mixture of Tokens: Efficient LLMs through Cross-Example Aggregation

1 code implementation • 24 Oct 2023 • Szymon Antoniak, Sebastian Jaszczur, Michał Krutul, Maciej Pióro, Jakub Krajewski, Jan Ludziejewski, Tomasz Odrzygóźdź, Marek Cygan

The operation of matching experts and tokens is discrete, which makes MoE models prone to issues like training instability and uneven expert utilization.

Language Modelling • Large Language Model
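A hedged sketch of the continuous alternative hinted at in the abstract above: rather than discretely matching tokens to experts, tokens from different examples in a group are softly mixed into one aggregate per expert, processed, and redistributed with the same soft weights, so the whole layer stays differentiable. The grouping scheme and shapes are illustrative, not the paper's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixtureOfTokens(nn.Module):
    def __init__(self, d_model, d_ff, n_experts, group_size):
        super().__init__()
        self.group_size = group_size
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x):        # x: (n_tokens, d_model), tokens pooled across examples
        n, d = x.shape
        g = self.group_size
        assert n % g == 0, "pad so the pooled tokens divide evenly into groups"
        groups = x.view(n // g, g, d)                             # (G, g, d)
        weights = F.softmax(self.router(groups), dim=1)           # soft mix within a group
        out = torch.zeros_like(groups)
        for e, expert in enumerate(self.experts):
            w = weights[..., e]                                   # (G, g)
            mixed = torch.einsum("Gg,Ggd->Gd", w, groups)         # one mixed token per group
            processed = expert(mixed)                             # (G, d)
            out = out + w.unsqueeze(-1) * processed.unsqueeze(1)  # redistribute to tokens
        return out.view(n, d)

layer = MixtureOfTokens(d_model=64, d_ff=256, n_experts=4, group_size=8)
print(layer(torch.randn(32, 64)).shape)                           # torch.Size([32, 64])
```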

Sparse is Enough in Scaling Transformers

no code implementations • NeurIPS 2021 • Sebastian Jaszczur, Aakanksha Chowdhery, Afroz Mohiuddin, Łukasz Kaiser, Wojciech Gajewski, Henryk Michalewski, Jonni Kanerva

We study sparse variants for all layers in the Transformer and propose Scaling Transformers, a family of next generation Transformer models that use sparse layers to scale efficiently and perform unbatched decoding much faster than the standard Transformer as we scale up the model size.

Text Summarization
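A simplified sketch of the sparse feed-forward idea described in the abstract above: a small controller picks one active unit out of each block of FFN units, so a single unbatched decoding step only touches the selected columns and rows of the weight matrices. The plain linear controller and the block size below are illustrative, and the training-time relaxation of the argmax is omitted.

```python
import torch
import torch.nn as nn

class SparseFFN(nn.Module):
    def __init__(self, d_model=64, d_ff=256, block_size=4):
        super().__init__()
        assert d_ff % block_size == 0
        self.n_blocks, self.block_size = d_ff // block_size, block_size
        self.controller = nn.Linear(d_model, d_ff)                  # scores every FFN unit
        self.w_in = nn.Parameter(torch.randn(d_model, d_ff) / d_model ** 0.5)
        self.w_out = nn.Parameter(torch.randn(d_ff, d_model) / d_ff ** 0.5)

    def forward(self, x):                          # x: (d_model,), one unbatched decode step
        scores = self.controller(x).view(self.n_blocks, self.block_size)
        active = scores.argmax(dim=-1)             # pick one unit in each block
        idx = torch.arange(self.n_blocks) * self.block_size + active  # flat unit indices
        h = torch.relu(x @ self.w_in[:, idx])      # only n_blocks columns are used ...
        return h @ self.w_out[idx]                 # ... and only n_blocks rows here

ffn = SparseFFN()
print(ffn(torch.randn(64)).shape)                  # torch.Size([64])
```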

Neural heuristics for SAT solving

no code implementations • 27 May 2020 • Sebastian Jaszczur, Michał Łuszczyk, Henryk Michalewski

We use neural graph networks with a message-passing architecture and an attention mechanism to enhance the branching heuristic in two SAT-solving algorithms.
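A hedged sketch of the general setup: encode a CNF formula as a variable-clause bipartite graph, run a few rounds of message passing, and read out a per-variable score to guide the solver's branching decision. The paper's attention mechanism and feature choices are not reproduced here, and literal signs are ignored for brevity.

```python
import torch
import torch.nn as nn

class BranchingGNN(nn.Module):
    def __init__(self, d=32, rounds=3):
        super().__init__()
        self.rounds = rounds
        self.var_init = nn.Parameter(torch.randn(d))
        self.cls_init = nn.Parameter(torch.randn(d))
        self.var_update = nn.GRUCell(d, d)       # variables aggregate clause messages
        self.cls_update = nn.GRUCell(d, d)       # clauses aggregate variable messages
        self.score = nn.Linear(d, 1)             # per-variable branching score

    def forward(self, n_vars, clauses):
        """`clauses` is a list of clauses, each a list of DIMACS-style literals, e.g. [1, -2]."""
        # Incidence matrix between variables and clauses (literal signs ignored for brevity).
        inc = torch.zeros(n_vars, len(clauses))
        for c_idx, lits in enumerate(clauses):
            for lit in lits:
                inc[abs(lit) - 1, c_idx] = 1.0
        v = self.var_init.expand(n_vars, -1).contiguous()
        c = self.cls_init.expand(len(clauses), -1).contiguous()
        for _ in range(self.rounds):
            c = self.cls_update(inc.t() @ v, c)  # clause state <- sum of its variables
            v = self.var_update(inc @ c, v)      # variable state <- sum of its clauses
        return self.score(v).squeeze(-1)         # higher score = branch on this variable first

gnn = BranchingGNN()
scores = gnn(n_vars=3, clauses=[[1, -2], [2, 3], [-1, -3]])
print(scores.argmax().item() + 1)                # variable chosen for branching
```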
