Papers with Code Newsletter #16

Welcome to the 16th issue of the Papers with Code newsletter. In this edition, we cover:

some of the latest developments in language modeling,
efficient Transformer models for long text modeling,
advancements in code understanding and generation,
top trending ML papers of August 2021,
... and much more.

The Latest in Language Modeling 💥

In this special edition of the newsletter, we cover a few of the biggest and most recent developments in language modeling, from improving the zero-shot learning of language models to making them more efficient for long text modeling.

Finetuned Language Models are Zero-Shot Learners

TOP: overview of instruction tuning and the FLAN model. BOTTOM: performance of zero-shot FLAN on unseen tasks, compared to GPT-3 zero-shot and few-shot. Figure source: Wei et al. (2021)

Models like GPT-3 attain remarkable results on few-shot learning. However, on tasks such as reading comprehension, GPT-3's zero-shot performance is much worse than few-shot performance. One reason could be that without few-shot exemplars it is more difficult for models to perform well on prompts that are dissimilar to the pretraining data. Wei et al. (2021) propose a simple method to improve zero-shot performance of large language models.

The proposed model, FLAN, takes a pretrained language model of 137B parameters and performs instruction tuning. Instruction tuning is a method for fine-tuning the model on a mixture of NLP tasks described via natural language instructions. Tasks are grouped into clusters by type, and each cluster is held out for evaluation while instruction tuning is performed on all other clusters. The idea is that instruction tuning should improve the model's ability to respond to NLP instructions: by teaching the LM to perform tasks described via instructions, it could learn to follow instructions even on unseen tasks (example illustrated in the figure above). The proposed FLAN model outperforms zero-shot GPT-3 on 19 of 25 tasks and outperforms few-shot GPT-3 by a large margin on a number of tasks such as StoryCloze and BoolQ.
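To make the hold-out setup concrete, here is a minimal sketch of how an instruction-tuning mixture could be assembled. The cluster names, templates, and examples are hypothetical placeholders, not the ones used by Wei et al. (2021), and `build_split` is an illustrative helper rather than part of any released code.

```python
# Hypothetical task clusters and instruction templates (illustration only).
CLUSTERS = {
    "sentiment": [
        {"template": "Is the sentiment of this review positive or negative?\n{text}",
         "examples": [{"text": "A wonderful film.", "target": "positive"}]},
    ],
    "translation": [
        {"template": "Translate this sentence to French:\n{text}",
         "examples": [{"text": "Good morning.", "target": "Bonjour."}]},
    ],
    "nli": [
        {"template": ("Premise: {premise}\nHypothesis: {hypothesis}\n"
                      "Does the premise entail the hypothesis?"),
         "examples": [{"premise": "A dog runs.",
                       "hypothesis": "An animal moves.",
                       "target": "yes"}]},
    ],
}


def build_split(clusters, held_out):
    """Return (train, evaluation) lists of (prompt, target) pairs.

    Every cluster except `held_out` goes into the tuning mixture;
    the held-out cluster is kept only for zero-shot evaluation.
    """
    train, evaluation = [], []
    for name, tasks in clusters.items():
        bucket = evaluation if name == held_out else train
        for task in tasks:
            for ex in task["examples"]:
                # Fill the instruction template with every field but the label.
                fields = {k: v for k, v in ex.items() if k != "target"}
                bucket.append((task["template"].format(**fields), ex["target"]))
    return train, evaluation


train_pairs, eval_pairs = build_split(CLUSTERS, held_out="nli")
```

Repeating the call with a different `held_out` value gives one split per evaluation cluster, so no task of the evaluated type is ever seen during tuning.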

Paper, Code, and Results

Efficient Transformers for Long Text Modeling

Architecture of Fastformer. Figure source: Wu et al. (2021)

While Transformers have proven effective at text understanding, they are still inefficient or ineffective on long sequences because their complexity is quadratic in the input sequence length. Wu et al. (2021) recently proposed an efficient Transformer, Fastformer, based on additive attention.

The additive attention mechanism is used to model global contexts and then transform each token representation based on its interaction with the global context representations. Several operations in Fastformer, as shown in the figure above, ensure that contextual information in the input sequence is captured effectively. This mechanism achieves effective context modeling with linear complexity. The results show that Fastformer is more efficient and achieves competitive performance on several datasets, including high performance for news recommendation on the MIND benchmark.
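The linear-time idea can be sketched in a few lines of plain Python. This is a simplified, single-head illustration under stated assumptions: the query, key, and value sequences `Q`, `K`, `V` are given directly (input projections omitted), `w_q` and `w_k` stand in for learned attention vectors, and the final learned linear transformation is left out. The function names are hypothetical and this is not the authors' implementation.

```python
import math


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def additive_pool(vectors, w):
    """Pool a sequence into one global vector with additive attention.

    Each vector gets a scalar score w . v / sqrt(d); the result is the
    softmax-weighted sum of the vectors. Cost is linear in sequence length.
    """
    d = len(w)
    weights = softmax([dot(w, v) / math.sqrt(d) for v in vectors])
    return [sum(a * v[j] for a, v in zip(weights, vectors)) for j in range(d)]


def fastformer_like(Q, K, V, w_q, w_k):
    # 1. Summarize all queries into a single global query vector.
    global_q = additive_pool(Q, w_q)
    # 2. Let every key interact with the global query (element-wise product).
    P = [[g * k for g, k in zip(global_q, k_i)] for k_i in K]
    # 3. Summarize the interacted keys into a single global key vector.
    global_k = additive_pool(P, w_k)
    # 4. Let every value interact with the global key, then add the
    #    original query as a residual connection.
    U = [[g * v for g, v in zip(global_k, v_i)] for v_i in V]
    return [[u + q for u, q in zip(u_i, q_i)] for u_i, q_i in zip(U, Q)]


# Toy inputs: 3 tokens, hidden size 2. Zero attention vectors give
# uniform pooling weights, which keeps the example easy to check by hand.
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]]
V = [[3.0, 3.0], [3.0, 3.0], [3.0, 3.0]]
out = fastformer_like(Q, K, V, w_q=[0.0, 0.0], w_k=[0.0, 0.0])
```

No step ever builds an n-by-n score matrix: each token only interacts with a pooled global vector, which is where the linear complexity comes from.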

You may also like:

PermuteFormer - Chen et al. (2021) recently proposed PermuteFormer, a Performer-based model with relative position encoding that scales linearly on long sequences. The model applies a position-dependent transformation to queries and keys to encode positional information into the attention module. The efficient relative position encoding allows PermuteFormer to run with almost no computational overhead while outperforming vanilla Transformers on many tasks.

ALiBi - Press et al. (2021) introduced a simple and efficient method, Attention with Linear Biases (ALiBi), that enables extrapolation to inputs longer than the ones observed during training. This is done simply by replacing the position method with one that allows for extrapolation. ALiBi does not add position embeddings to the word embeddings; instead, it biases the query-key attention scores with a term proportional to their distance. Results show that a 1.3B parameter model trained on input sequences of length 1024 can extrapolate effectively to input sequences of length 2048. It achieves similar performance to a model trained on inputs of length 2048 that uses sinusoidal position embeddings.

∞-former - Several recent works that aim to effectively model long-term memories have a finite memory capacity, which forces them to drop older information. Martins et al. (2021) recently proposed a method called ∞-former that extends the vanilla Transformer with an unbounded long-term memory. It uses a continuous-space attention mechanism, which makes the attention complexity independent of the length of the context. This allows the model to attend to arbitrarily long contexts while keeping a fixed computation budget.
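Of the three methods above, the ALiBi bias is the easiest to write down. The sketch below builds the distance-proportional penalty for one attention head in a causal model. The helper names are hypothetical, and the geometric slope schedule assumes the number of heads is a power of two; treat it as an illustration rather than the reference implementation.

```python
def alibi_slopes(num_heads):
    """Per-head slopes forming a geometric sequence.

    Assumes num_heads is a power of two; for 8 heads this gives
    1/2, 1/4, ..., 1/256.
    """
    start = 2 ** (-8 / num_heads)
    return [start ** (h + 1) for h in range(num_heads)]


def alibi_bias(seq_len, slope):
    """Bias added to the query-key scores of one head before the softmax.

    Entry [i][j] is -slope * (i - j) for keys at or before query i,
    and -inf for future keys (the usual causal mask).
    """
    return [[-slope * (i - j) if j <= i else float("-inf")
             for j in range(seq_len)]
            for i in range(seq_len)]


slopes = alibi_slopes(8)
bias = alibi_bias(4, slopes[0])
```

Because the penalty depends only on the distance `i - j`, the same function works for any `seq_len`, including lengths never seen in training, which is what makes extrapolation possible.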

T5 for Code Understanding and Generation

CodeT5 architecture for code understanding and generation tasks. Figure source: Wang et al. (2021)

Systems for code understanding and generation are being widely researched as language models' capabilities keep improving. Due in part to architectural limitations, models like BERT and GPT are suboptimal for generation and understanding tasks, respectively. Models designed specifically for code-related tasks, like CodeBERT, employ conventional NLP pretraining techniques that may not be optimal for capturing the rich structural information in code. Wang et al. (2021) recently proposed CodeT5, a model that better captures rich code semantics and supports both code-related understanding and generation tasks.

CodeT5 is built on top of the capabilities of T5 and represents a pretrained encoder-decoder model that considers token type in code. CodeT5 leverages informative identifiers by training the model, via an identifier-aware objective, to distinguish identifier tokens and recover them when masked. Developers use these identifiers to make code more understandable and preserve rich code semantics. The model also leverages the natural language - programming language (i.e., comment - code) pairs available in source code to learn better cross-modal alignment. Results show that CodeT5 attains state-of-the-art results on several sub-tasks in the CodeXGLUE benchmark.
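As a rough illustration of the masked-identifier idea, the sketch below replaces each distinct identifier in a Python snippet with its own sentinel, so that a model could be trained to recover the names. It uses Python's standard `tokenize` module as a stand-in tokenizer; the `<MASKn>` sentinel format and the `mask_identifiers` helper are assumptions made for this example, not CodeT5's actual preprocessing.

```python
import io
import keyword
import tokenize


def mask_identifiers(source):
    """Replace every distinct identifier with its own sentinel token.

    Returns the masked token list and a sentinel -> identifier mapping,
    i.e. the names a model would be asked to recover. All occurrences
    of the same identifier share one sentinel.
    """
    sentinel_for = {}
    masked = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        text = tok.string
        if tok.type == tokenize.NAME and not keyword.iskeyword(text):
            # Reuse the sentinel if this identifier was seen before.
            text = sentinel_for.setdefault(text, f"<MASK{len(sentinel_for)}>")
        if text.strip():  # drop newline, indent, and end-of-file tokens
            masked.append(text)
    targets = {sentinel: name for name, sentinel in sentinel_for.items()}
    return masked, targets


code = "def add(a, b):\n    return a + b\n"
masked_tokens, targets = mask_identifiers(code)
```

Keywords such as `def` and `return` are left untouched because they carry no developer-chosen meaning; only the names `add`, `a`, and `b` are hidden.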

Paper & Code

Top Trending Papers of August 2021 🏆

Below we highlight the top trending papers of August 2021 on Papers with Code:

📄 Paint Transformer - Liu et al. (2021) - 332 ★

📄 Task-aligned One-stage Object Detection - Feng et al. (2021) - 150 ★

📄 Image Restoration Using Swin Transformer - Liang et al. (2021) - 385 ★

📄 Sketch Your Own GAN - Wang et al. (2021) - 422 ★

📄 You Only Look Once for Panoptic Driving Perception - Wu et al. (2021) - 556 ★

📄 Segmenting Objects with Transformers - Guo et al. (2021) - 85 ★

📄 Open-source Codebase for Cross-modal Analytics - Li et al. (2021) - 722 ★

📄 An Automated Video Action Recognition System - Zha et al. (2021) - 212 ★

📄 Improving NLP Accuracy over OCR Documents - Gupte et al. (2021) - 159 ★

📄 High-Resolution Video Matting with Temporal Guidance - Lin et al. (2021) - 716 ★

Trending Libraries and Datasets 🛠

Trending datasets

MOD - a large-scale open-domain multimodal dialogue dataset incorporating internet memes into ~606K utterances.

Common Objects in 3D - a large-scale dataset with real multi-view images of object categories annotated with camera poses and ground truth 3D point clouds.

FinQA - a new large-scale dataset with question-answering pairs over financial reports, written by financial experts.

Trending libraries/tools

Cockpit - a collection of instruments that enable a closer look into the inner workings of learning systems, and provide a more informative and meaningful status report for practitioners.

SummerTime - a new toolkit for text summarization, including various models, datasets, and evaluation metrics, for a full spectrum of summarization-related tasks.

CARLA - a Python library to benchmark algorithmic recourse and counterfactual explanation algorithms.

MWPToolkit - an open-source framework for deep learning-based math word problem solvers.

Community Highlights ✍️

We would like to thank:

@EvgeniiZh for contributing to several benchmarks, including adding results for the Panoptic SegFormer model that achieves state-of-the-art results on panoptic segmentation for COCO minival.
@daniel.koguciuk for contributing to benchmarks and datasets, including the addition of the PDS-COCO dataset for homography estimation learning.
@j3soon for contributing to several datasets and benchmarks, including the addition of the StarCraft Multi-Agent Challenge dataset.
@rafaellcampos for indexing ESPADA, a new aerial image dataset for depth image estimation from a single aerial image.
@rohand24, @arne, and @picekl for contributing several datasets, new papers, and results to Papers with Code.

Special thanks to the hundreds of other contributors to Papers with Code.