August 12, 2021

Papers with Code Newsletter #15

Welcome to the 15th issue of the Papers with Code newsletter. In this edition, we cover:

  • recent papers that propose general-purpose neural networks,
  • a technique to improve contrastive learning based on feature-level data transformations,
  • a method for improving NLP accuracy over OCR documents,
  • other trending research papers of the past week,
  • ... and much more.

Trending Papers with Code 📄

Designing General-Purpose Neural Networks [DL]

There has been a lot of interest in developing general-purpose neural networks over the past few years. In this edition of the newsletter, we highlight a few recent papers focusing on building more general-purpose ML systems that are capable of handling diverse inputs and output tasks. 

Perceiver IO

The Perceiver IO architecture. Figure source: Jaegle et al. (2021)

Many of the machine learning systems built today are designed to handle a specific type of input and output associated with a single task. This constrains these systems and causes them to grow in complexity as support for more types of inputs and outputs is added. An ideal system would be a single neural network architecture that can handle diverse input modalities and output tasks. Jaegle et al. (2021) propose such a general-purpose architecture, Perceiver IO, which performs well across structured input modalities and output tasks.

Perceiver IO is built on top of Perceiver and is designed to easily integrate and transform arbitrary information for arbitrary tasks. The original Perceiver architecture supports a wide range of input modalities but can only handle simple output spaces such as classification. This follow-up work develops a new mechanism to decode structured outputs (e.g., language and audiovisual sequences) directly from the Perceiver latent space, allowing support for a variety of new domains without sacrificing the processing capabilities of the original architecture. A cross-attention mechanism maps from the latents to outputs of arbitrary size and structure, driven by a query system that specifies the semantics needed for outputs across a wide range of domains. Perceiver IO achieves strong results on several tasks with highly structured outputs: it matches a BERT baseline on the GLUE benchmark without needing input tokenization and attains state-of-the-art results on Sintel optical flow estimation.
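To make the decoding mechanism concrete, below is a minimal PyTorch sketch of query-based cross-attention decoding in the spirit of Perceiver IO. All dimensions, module choices, and the `QueryDecoder` name are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class QueryDecoder(nn.Module):
    """Minimal sketch of Perceiver IO-style decoding: a learned set of
    output queries cross-attends to the latent array, producing outputs
    of arbitrary size and structure. Dimensions are illustrative."""

    def __init__(self, num_outputs=196, query_dim=256, latent_dim=512, out_dim=10):
        super().__init__()
        # One learned query per desired output element (e.g., per pixel or token).
        self.queries = nn.Parameter(torch.randn(num_outputs, query_dim))
        self.cross_attn = nn.MultiheadAttention(
            embed_dim=query_dim, num_heads=8,
            kdim=latent_dim, vdim=latent_dim, batch_first=True)
        self.out_proj = nn.Linear(query_dim, out_dim)

    def forward(self, latents):
        # latents: (batch, num_latents, latent_dim), produced by the encoder.
        b = latents.shape[0]
        q = self.queries.unsqueeze(0).expand(b, -1, -1)
        attended, _ = self.cross_attn(q, latents, latents)
        return self.out_proj(attended)  # (batch, num_outputs, out_dim)

decoder = QueryDecoder()
latents = torch.randn(2, 128, 512)  # stand-in latent array
print(decoder(latents).shape)       # torch.Size([2, 196, 10])
```

Changing the number and meaning of the queries is what lets the same latent processing serve different output spaces.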

🔗 Paper & Code

Multimodal Self-Supervised Learning

Overview of the VATT architecture and the self-supervised, multimodal learning strategy. Figure source: Akbari et al. (2021)

A similar line of work aims to push the capabilities of modern neural network architectures by learning multimodal representations. Like Perceiver IO, such models aim to solve tasks in several domains. As an example, one recent architecture, called VATT, is trained to learn multimodal representations from unlabeled data using Transformer architectures.

VATT takes raw signals as inputs and extracts multimodal representations, and the model is trained end-to-end with multimodal contrastive losses. It performs well on a variety of downstream tasks such as video action recognition, audio event classification, and text-to-video retrieval. An interesting idea proposed in the paper is to share weights among the three modalities (video, audio, and text), testing whether a single, general-purpose model can serve all modalities; the architecture still includes modality-specific tokenization and linear projection layers. The results show that the modality-agnostic models perform comparably to the modality-specific ones.
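As an illustration of this style of training objective, here is a minimal sketch of a symmetric InfoNCE-style contrastive loss between two modalities. The function name, temperature value, and exact formulation are assumptions for illustration, not VATT's precise losses.

```python
import torch
import torch.nn.functional as F

def multimodal_nce(video_emb, audio_emb, temperature=0.07):
    """Sketch of an InfoNCE-style loss between two modalities: embeddings
    from the same clip are positives; other clips in the batch serve as
    negatives. The temperature and symmetric form are illustrative."""
    v = F.normalize(video_emb, dim=-1)
    a = F.normalize(audio_emb, dim=-1)
    logits = v @ a.t() / temperature      # (batch, batch) similarity matrix
    targets = torch.arange(v.shape[0])    # positives lie on the diagonal
    # Symmetric loss: video-to-audio and audio-to-video directions.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

loss = multimodal_nce(torch.randn(8, 512), torch.randn(8, 512))
```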

VATT focuses on a fixed, predefined set of modalities handled by domain-specific networks. This constrains the model's ability to handle diverse inputs and output tasks, making it difficult to adapt to new settings. The Perceiver IO architecture, on the other hand, can map arbitrary inputs to arbitrary outputs in a domain-agnostic and scalable way. Perceiver IO and VATT are just two of several recent works that explore the general-purpose capabilities of modern neural networks.

🔗 Paper & Code

Improving Contrastive Learning [ML]

The feature transformation contrastive learning pipeline. Figure source: Zhu et al. (2021)

Contrastive learning has been successful for unsupervised feature learning. The design of positive and negative (pos/neg) pairs is key in contrastive learning, and most existing approaches use data augmentation to acquire them. Although effective, these augmentation strategies are hand-designed and may lack interpretability. Zhu et al. (2021) recently proposed feature-level data transformations (i.e., feature transformations) to enhance generic contrastive self-supervised learning.

The feature transformations aim to provide more explainable and effective pos/neg pairs and to enhance the feature embedding. The pos/neg pair score distributions are visualized during training, which helps explain how parameter values affect model performance and reveals characteristics of the pairs that inspire even more effective feature transformations. This visualization tool led the authors to the proposed novel feature transformations, including positive extrapolation, which creates more hard positives for training. The feature transformations allow the system to learn more view-invariant and discriminative representations, improving accuracy on ImageNet-100 by at least 6.0% over MoCo. The visualization tools and code are open-sourced.
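To give a flavor of positive extrapolation, below is a rough sketch of how such a feature-level transformation could look. The Beta sampling, `alpha` value, and re-normalization step are assumptions that follow the paper only loosely.

```python
import torch
import torch.nn.functional as F

def positive_extrapolation(z_q, z_k, alpha=2.0):
    """Sketch of positive extrapolation: push a positive pair apart in
    feature space to create harder positives. Sampling details are
    assumptions, not necessarily the paper's exact scheme."""
    lam = torch.distributions.Beta(alpha, alpha).sample() + 1.0  # lam in (1, 2)
    z_q_hat = lam * z_q + (1.0 - lam) * z_k
    z_k_hat = lam * z_k + (1.0 - lam) * z_q
    # Re-normalize so the transformed features stay on the unit sphere.
    return F.normalize(z_q_hat, dim=-1), F.normalize(z_k_hat, dim=-1)

q, k = torch.randn(8, 128), torch.randn(8, 128)
q_hat, k_hat = positive_extrapolation(q, k)
```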

🔗 Paper & Code

Improving NLP Accuracy over OCR documents [NLP]

Action prediction model architecture. Figure source: Gupte et al. (2021)

Digitization of documents is crucial for digital transformation, but an essential component, optical character recognition (OCR), is not perfect. The low fidelity of many scanned documents makes even commercial OCR systems inaccurate. Gupte et al. (2021) recently proposed an effective system for mitigating OCR errors in downstream tasks like Named Entity Recognition (NER).

The authors contribute a document synthesis pipeline that addresses the data scarcity problem by producing realistic but degraded data with NER labels. A text restoration model is then trained to predict the actions required for restoration, similar to approaches that predict characters at each time step (see the architecture overview above). The proposed action prediction model can restore clean text from OCR output and mitigate the downstream NER accuracy degradation. Overall, the results show that the text restoration model significantly closes the NER accuracy gap caused by OCR errors. The document synthesis pipeline has been open-sourced (find the link below).
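For intuition, here is a tiny sketch of how a predicted action sequence could be applied to restore OCR output. The action vocabulary (`KEEP`, `DELETE`, `REPLACE`) is hypothetical and not necessarily the paper's exact action set.

```python
def apply_actions(ocr_text, actions):
    """Apply one predicted action per input character to restore text.
    Hypothetical actions: ('KEEP',) copies the character, ('DELETE',)
    drops it, and ('REPLACE', c) substitutes it with c."""
    out = []
    for ch, action in zip(ocr_text, actions):
        if action[0] == 'KEEP':
            out.append(ch)
        elif action[0] == 'REPLACE':
            out.append(action[1])
        # ('DELETE',) simply skips the character.
    return ''.join(out)

# Example: the OCR engine misread the 'm' in "modern" as 'rn'.
noisy = 'rnodern'
actions = [('REPLACE', 'm'), ('DELETE',), ('KEEP',), ('KEEP',),
           ('KEEP',), ('KEEP',), ('KEEP',)]
print(apply_actions(noisy, actions))  # modern
```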

🔗 Paper & Code

Other Trending ML Papers

Below are other notable trending papers published in the past week:

✍️ Sketch Your Own GAN - presents a method for rewriting GANs with one or more sketches, making GAN training easier for novice users.

🎨 Paint Transformer - proposes a novel Transformer-based framework that predicts the parameters of a stroke set with a feed-forward network.

🔮 Unifying Nonlocal Blocks for Neural Networks - provides a new perspective for interpreting nonlocal-based blocks and theoretically analyzes their properties.

🩺 Domain Generalization via Gradient Surgery - presents a method to characterize the conflicting gradients that emerge in domain shift scenarios and ways to alleviate their effect.

💡 Mitigating dataset harms requires stewardship: Lessons from 1000 papers - reports that the creation of derivative datasets and models, lack of clarity around licenses, and other factors can introduce a wide range of ethical concerns.

See Top 📈 and Hot 🔥 for the most recent trending papers. 

ML Benchmark Analysis


A recent paper uses the Papers with Code public benchmarks dataset to highlight several findings on how collaboration plays a key role in improving the state of the art in machine learning. Check it out.

Trending Libraries and Datasets 🛠

Trending datasets

Q-Pain - a question-answering dataset to assess social bias in pain management.

AGAR - a dataset of microbial colonies cultured on agar plates, for training deep learning detection models.

CIRR - a dataset of open-domain, real-life images with human-generated modification sentences, supporting research on one-shot composed image retrieval, dialogue systems, and fine-grained visiolinguistic reasoning.

Trending libraries/tools

Summary Explorer - a tool that allows manual inspection of text summarization systems by compiling the outputs of 55 single-document summarization approaches.

Solo-learn - a Python library of self-supervised methods for visual representation learning, implemented with PyTorch and PyTorch Lightning.

iART - an open web platform for art-historical research that provides a search engine to facilitate the process of comparative vision.

Community Highlights ✍️

We would like to thank the hundreds of community contributors for their contributions to Papers with Code.

---

We would be happy to hear your thoughts and suggestions on the newsletter. Please reply to elvis@paperswithcode.com.

🔗 See previous issues

Join us on Slack and Twitter