February 17, 2022

Papers with Code Newsletter #25

Welcome to the 25th issue of the Papers with Code newsletter. In this edition, we cover:

  • a novel method for improving OOD detection,
  • a unified multimodal pretraining framework,
  • a summary of how vision transformers work,
  • some of the latest state-of-the-art results on Papers with Code,
  • ...and much more.

Improving OOD Detection

VOS framework. Figure source: Du et al. (2022)

Deploying neural networks in real-world applications demands safety guarantees, which has driven considerable interest in improving out-of-distribution (OOD) detection. One of the main challenges is that models lack supervision signals from unknown data, and collecting real outlier datasets is often infeasible in practice. To address this challenge, Du et al. (2022) present VOS, a framework for OOD detection in the vision domain that works by adaptively synthesizing virtual outliers. 

The objective is for the synthesized virtual outliers to help regularize the model's decision boundary during training. The approach can be applied to any in-distribution (ID) data without manual data collection or cleaning, which differs from previous methods that require diverse auxiliary image datasets. VOS ultimately helps estimate a compact decision boundary between ID and OOD data. Feature representations of ID objects are modelled as class-conditional Gaussians, and virtual outliers are sampled from the low-likelihood region of this feature space. Both are used to compute an uncertainty loss for regularization, trained jointly with the object detection loss. Results suggest that VOS is more effective than synthesizing outliers directly or using noise as outliers. It achieves state-of-the-art results on object detection and is further evaluated on common OOD detection benchmarks.
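
To make the sampling step concrete, here is a minimal sketch of the idea in PyTorch. It is an illustration rather than the authors' implementation: the function name, candidate counts, ridge term, and the omission of the joint uncertainty/detection loss are our own simplifications.

    # Minimal sketch (illustrative, not the authors' code): fit class-conditional
    # Gaussians to ID features with a shared covariance, then keep the sampled
    # candidates that fall in the low-likelihood region.
    import torch
    from torch.distributions import MultivariateNormal

    def sample_virtual_outliers(features, labels, num_classes,
                                n_candidates=10000, n_keep=100):
        """features: (N, D) ID feature vectors; labels: (N,) class ids. Values are illustrative."""
        means = torch.stack([features[labels == c].mean(0) for c in range(num_classes)])
        centered = features - means[labels]
        cov = centered.T @ centered / features.shape[0] \
              + 1e-4 * torch.eye(features.shape[1])            # shared covariance + small ridge

        outliers = []
        for c in range(num_classes):
            dist = MultivariateNormal(means[c], covariance_matrix=cov)
            candidates = dist.sample((n_candidates,))          # draw many candidate features
            log_prob = dist.log_prob(candidates)               # likelihood under the class Gaussian
            outliers.append(candidates[log_prob.argsort()[:n_keep]])  # keep the least likely
        return torch.cat(outliers)   # virtual outliers, fed into the uncertainty loss with ID features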

Paper & Code

How Do Vision Transformers Work?

MSAs flatten loss landscapes, shown from different viewpoints. Source: Park and Kim (2022) 

One of the biggest trends in computer vision research is the adoption of Transformer-based models for a broad range of vision tasks. In particular, multi-head self-attention (MSA) has been applied to computer vision with a lot of success. Vision transformers, including the original ViT, multi-scale ViTs, hybrid ViTs with convolutions, and self-supervised ViTs, have already shown remarkable progress in many areas of computer vision research. Driven by this trend and the success of MSAs and ViTs, Park and Kim (2022) present new research that investigates more closely how MSAs work. The following is a short summary of the findings:

  • MSAs improve accuracy and generalization by flattening the loss landscape; the improvement is mainly attributable to their data specificity.
  • ViTs suffer from non-convex losses; large datasets and smoothing methods alleviate the problem.
  • MSAs and Convs exhibit opposite behaviours that are beneficial in computer vision, which may explain why the two have been shown to be complementary in previous work.
  • Multi-stage neural networks benefit from MSA at the end of the stage and outperform alternatives that use Conv blocks instead. The effect is seen in both large and small data regimes. 
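
As a rough illustration of the last point, the sketch below assembles a single convolutional stage that ends with an MSA block instead of another Conv block. It is a simplified toy, not the architecture proposed in the paper; the block definitions and sizes are our own assumptions.

    import torch
    import torch.nn as nn

    class ConvBlock(nn.Module):
        """A plain residual 3x3 conv block (stand-in for e.g. a ResNet block)."""
        def __init__(self, dim):
            super().__init__()
            self.net = nn.Sequential(
                nn.Conv2d(dim, dim, 3, padding=1), nn.BatchNorm2d(dim), nn.ReLU(),
            )
        def forward(self, x):
            return x + self.net(x)

    class MSABlock(nn.Module):
        """Multi-head self-attention over spatial positions, with a residual connection."""
        def __init__(self, dim, heads=4):
            super().__init__()
            self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.norm = nn.LayerNorm(dim)
        def forward(self, x):                        # x: (B, C, H, W)
            b, c, h, w = x.shape
            tokens = self.norm(x.flatten(2).transpose(1, 2))   # (B, H*W, C)
            out, _ = self.attn(tokens, tokens, tokens)
            return x + out.transpose(1, 2).reshape(b, c, h, w)

    # A "stage": several Conv blocks with an MSA block at the end, per the finding above.
    stage = nn.Sequential(ConvBlock(64), ConvBlock(64), MSABlock(64))
    y = stage(torch.randn(2, 64, 32, 32))            # -> (2, 64, 32, 32)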

Paper & Code

A Unified Multimodal Model

The OFA Architecture. Figure source: Wang et al. (2022)

A grand goal of AI is to build unified systems that can support and generalize to many tasks and modalities. Several developments, such as the Transformer architecture, the pretrain-finetune paradigm, and prompt/instruction tuning for zero/few-shot capabilities, offer opportunities to build such unified systems. Wang et al. (2022) define a set of properties that these unified models should have and present an omni-model, called OFA ("One for All"), for multimodal pretraining.

OFA aims to unify architectures, tasks, and modalities, with three target properties: task agnosticism, modality agnosticism, and task comprehensiveness. It uses a simple sequence-to-sequence learning framework with instruction-based training, unifying understanding and generation tasks (e.g., visual grounding and image generation) and supporting both multimodal and unimodal tasks. Experimental results show that OFA attains competitive performance on multimodal benchmarks such as image captioning and text-to-image generation. It also performs competitively in zero-shot settings and transfers well to unseen tasks given new task instructions. 
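
To make the instruction-based, sequence-to-sequence framing concrete, here is a small illustrative sketch. It is not OFA's actual data pipeline; the make_example helper, instruction wordings, and placeholder tokens are our own assumptions.

    # Illustrative only: different tasks are cast into one "instruction + inputs -> target"
    # format so that a single encoder-decoder model can be trained on all of them.
    def make_example(instruction, image=None, text=None, target=""):
        # In practice, images become visual/patch tokens and text becomes subword tokens;
        # here we just bundle everything into one record.
        return {"source": {"instruction": instruction, "image": image, "text": text},
                "target": target}

    examples = [
        make_example("What does the image describe?", image="dogs.jpg",
                     target="two dogs running on a beach"),                    # image captioning
        make_example("Which region does the text describe?", image="street.jpg",
                     text="the red car", target="<loc_12> <loc_87>"),          # visual grounding
        make_example("What is the answer to the question?", image="dogs.jpg",
                     text="how many dogs are there?", target="two"),           # VQA
        make_example("What is the image that describes the text?",
                     text="a watercolor house", target="<img_5> <img_91>"),    # text-to-image
    ]
    # A single sequence-to-sequence model then learns to map every "source" to its "target".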

Paper & Code

New Results on Papers with Code

MMLU - a trending benchmark this week for evaluating multi-task language understanding. Methods evaluated on this benchmark include OpenAI's GPT-3, DeepMind's Gopher, and EleutherAI's recently released GPT-NeoX-20B.

cpt-text XL - uses contrastive pretraining on unsupervised data at scale to learn high-quality vector representations of text and code. These representations attain competitive results across many tasks, including zero-shot text search on BEIR, code search on CodeSearchNet, and linear-probe classification on SentEval.
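
For context, contrastive pretraining of this kind typically optimizes an InfoNCE-style objective over paired examples. The sketch below is a generic illustration of that objective, not the cpt-text training code; the function name and temperature value are assumptions.

    import torch
    import torch.nn.functional as F

    def info_nce_loss(x_emb, y_emb, temperature=0.07):
        """Generic InfoNCE-style contrastive loss over a batch of paired embeddings
        (e.g. two views of the same text, or a text query and its paired document)."""
        x = F.normalize(x_emb, dim=-1)
        y = F.normalize(y_emb, dim=-1)
        logits = x @ y.T / temperature                  # (B, B) cosine-similarity matrix
        labels = torch.arange(x.shape[0])               # positives sit on the diagonal
        # Symmetric cross-entropy: each x should match its paired y, and vice versa.
        return (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels)) / 2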

MaskGIT - proposes a novel image synthesis paradigm using a bidirectional transformer decoder. It outperforms other generative models on ImageNet and accelerates autoregressive decoding by up to 64x. 

SimVLM - a simple pretraining framework that reduces training complexity by exploiting large-scale weak supervision for visual language model pretraining. It achieves state-of-the-art results on several benchmarks, including VQA, SNLI-VE, and NLVR2.

AlphaCode - introduces a new code generation system that can create novel solutions to complex programming problems requiring reasoning. It proposes a new evaluation dataset (CodeContests) and outperforms other code generation approaches on APPS.

Trending Research Datasets and Tools

Datasets


Data Science Problems - a new dataset for evaluating natural language to code generation models on real data science pedagogical notebooks. 

ProteinKG25 - a large-scale knowledge graph (KG) dataset with descriptions aligned to GO terms and protein sequences aligned to protein entities. It contains ~4.9M triplets, ~612K entities, and 31 relations.

Met - a large-scale dataset for instance-level recognition in the artwork domain. It consists of 400K images from more than 224K classes. 

Tools

EvoJAX - a scalable, general-purpose, hardware-accelerated neuroevolution toolkit. 

Ivy - a unified machine learning framework to enable portability of ML codebases. 

PromptSource - a new system for creating, sharing, and using natural language prompts. 

OMLT - a new optimization and machine learning toolkit.

textless-lib - a library for textless spoken language processing.

Community Highlights ✍️

We would like to thank:

  • @ysharma1126 for contributing several datasets like KITTI-Masks and 3DIdent.
  • @Elias_Ramzi for several contributions including many new updates to the Image Retrieval on iNaturalist benchmark.
  • @rainjiang11 and @JonasGeiping for several paper and code contributions.
  • @ZiuxuanKe for significant changes to the Continual Learning on ASC leaderboard.
  • @damirko, @jiachens, @czhang017, and all our contributors for their ongoing contributions to Papers with Code.

Papers with Code Partners with OpenReview


The team is excited to announce a new partnership with OpenReview: you can now find links to code and datasets directly from OpenReview papers. Click here for an example.

---

See previous issues

Join us on Slack, LinkedIn, and Twitter