December 02, 2021

Papers with Code Newsletter #21

Welcome to the 21st issue of the Papers with Code newsletter. This week, we cover:

  • a unified approach for visual synthesis tasks,
  • techniques for scaling vision models,
  • top trending papers of November 2021,
  • new state-of-the-art results,
  • ... and much more.

Improving Visual Synthesis 🏞

Examples of typical visual generation and manipulation tasks supported by the NÜWA model.

As visual data becomes more available and popular on the Web, there is a growing need for systems that can generate new visual content or manipulate existing visual data across a variety of scenarios. Wu et al. (2021) propose a new multimodal pre-trained model called NÜWA, a general 3D transformer that supports multiple modalities at the same time for visual synthesis tasks.

What it is: NÜWA consists of an adaptive encoder that takes either text or visual input, and a pre-trained decoder shared across 8 visual tasks. To reduce computational complexity and improve the visual quality of results, a 3D Nearby Attention mechanism (3DNA) is proposed. 3DNA exploits locality along both the spatial and temporal axes to better match the nature of visual data. (See the full architecture below.) NÜWA achieves state-of-the-art results on text-to-image generation, text-to-video generation, and other visual tasks. It also shows good zero-shot capabilities for both text-guided image manipulation and text-guided video manipulation.

Overview of NÜWA. Figure source: Wu et al. (2021)    
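To make the 3DNA idea more concrete, here is a minimal PyTorch sketch of a nearby-attention mask over a (time, height, width) grid of tokens, where each query attends only to keys within a small local 3D extent. The function name, extent sizes, and flattening scheme are illustrative assumptions, not the paper's implementation.

```python
import torch

def nearby_attention_mask(T, H, W, extent=(1, 3, 3)):
    """Boolean mask allowing each (t, h, w) query to attend only to keys
    within a local 3D neighbourhood (illustrative sketch, not NÜWA's code)."""
    coords = torch.stack(torch.meshgrid(
        torch.arange(T), torch.arange(H), torch.arange(W), indexing="ij"
    ), dim=-1).reshape(-1, 3)                               # (N, 3) token positions
    diff = (coords[:, None, :] - coords[None, :, :]).abs()  # pairwise offsets
    et, eh, ew = extent
    mask = (diff[..., 0] <= et) & (diff[..., 1] <= eh) & (diff[..., 2] <= ew)
    return mask                                             # (N, N), True = attend

# Usage: mask out non-local pairs before the softmax
T, H, W, d = 4, 8, 8, 64
x = torch.randn(T * H * W, d)
scores = x @ x.T / d ** 0.5
scores = scores.masked_fill(~nearby_attention_mask(T, H, W), float("-inf"))
attn = scores.softmax(dim=-1)
```

Restricting attention to a local 3D window keeps the cost far below full attention over all video tokens, which is what makes this kind of model tractable at higher resolutions.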

Also good to know: Previous methods based on VQ-VAE, such as DALL-E and CogView, have already shown that large-scale pretraining can be applied to visual synthesis tasks. However, one limitation of these models is that they treat modalities separately. NÜWA, on the other hand, benefits from both image and video data, as shown in the figure above. Another difference is that NÜWA leverages VQ-GAN instead of VQ-VAE for visual tokenization, which the authors report leads to better generation quality. The unified model provides a glimpse into a future where AI-enabled platforms empower content creators and enable visual world creation.
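For readers less familiar with visual tokenization, the sketch below shows the generic idea shared by VQ-VAE- and VQ-GAN-style tokenizers: continuous patch features are mapped to discrete codebook indices via nearest-neighbour lookup. The shapes and names are assumptions for illustration; this is not the VQ-GAN tokenizer used in NÜWA.

```python
import torch

def vq_tokenize(features, codebook):
    """Map continuous patch features to discrete codebook indices
    (generic VQ tokenization sketch, not NÜWA's exact VQ-GAN code)."""
    # features: (N, D) patch embeddings, codebook: (K, D) learned codes
    dists = torch.cdist(features, codebook)   # (N, K) pairwise distances
    return dists.argmin(dim=1)                # nearest code index per patch

# Toy usage: 64 patch features quantized against a 1024-entry codebook
feats = torch.randn(64, 256)
codebook = torch.randn(1024, 256)
tokens = vq_tokenize(feats, codebook)         # (64,) discrete visual tokens
```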

Scaling Up Vision Models ⚡️

Overview of adaptations proposed in Swin Transformer V2. Figure source: Liu et al. (2021)

In previous issues, we have regularly discussed new techniques for scaling up large NLP models. The scaling of vision models, in contrast, has lagged behind. A few works have attempted to scale vision Transformers using large-scale labelled image datasets, but only for image classification. Some reports in the literature point to instability issues when training at scale, and it is also unclear how such models can be effectively transferred across window resolutions. To address some of these issues, Liu et al. (2021) recently presented techniques for effectively scaling up vision models.

Why it matters: To improve the capacity and stability of a large vision model like Swin Transformer, a post-normalization technique and a scaled cosine attention approach are employed. To effectively transfer models pre-trained on low-resolution images to higher-resolution inputs, a log-spaced continuous relative position bias technique is used. (See the adaptations in the figure above.) In short, several techniques are presented for scaling Swin Transformer up to 3B parameters and enabling training with higher-resolution images. The resulting architecture, Swin Transformer V2, achieves new records on various vision benchmarks. (See the summary of results here.)
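As a rough illustration of the scaled cosine attention idea, the sketch below replaces dot-product attention scores with cosine similarity divided by a learnable per-head temperature. Tensor shapes, the clamping value, and the function name are assumptions for illustration and will differ from the official Swin Transformer V2 code.

```python
import torch
import torch.nn.functional as F

def scaled_cosine_attention(q, k, v, tau):
    """Cosine-similarity attention with a learnable temperature `tau`
    (a sketch of the idea, not the official Swin V2 implementation)."""
    q = F.normalize(q, dim=-1)                         # unit-length queries
    k = F.normalize(k, dim=-1)                         # unit-length keys
    scores = (q @ k.transpose(-2, -1)) / tau.clamp(min=0.01)
    return scores.softmax(dim=-1) @ v

# Toy usage: batch of 2, 4 heads, 16 tokens, 32 dims per head
q = torch.randn(2, 4, 16, 32)
k = torch.randn(2, 4, 16, 32)
v = torch.randn(2, 4, 16, 32)
tau = torch.nn.Parameter(torch.ones(4, 1, 1))          # one temperature per head
out = scaled_cosine_attention(q, k, v, tau)            # (2, 4, 16, 32)
```

Because cosine similarity is bounded, the attention logits no longer blow up with the magnitude of the activations, which is one of the reported sources of instability when training very large models.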

You might also like: Another recent paper aims to train scalable vision learners through masked autoencoding (MAE). The authors propose a simple MAE approach: during pretraining, a large random subset of image patches is masked out and the missing pixels are reconstructed. In this encoder-decoder framework, the encoder is applied only to the visible subset of patches, while the decoder processes the encoded patches and mask tokens to reconstruct the original image in pixels. The masking of patches yields a self-supervised task and allows the efficient and effective training of large vision models. After pre-training, only the encoder is used to produce representations for several recognition tasks, where the model achieves high performance. (See the summary of results here.)

Overview of MAE architecture for scaling vision learners. Figure source: He et al. (2021)
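The core of the MAE recipe, randomly dropping a large fraction of patch tokens so the encoder only processes the visible ones, can be sketched in a few lines of PyTorch. The masking ratio, shapes, and helper name below are illustrative assumptions rather than the authors' implementation.

```python
import torch

def random_patch_masking(patches, mask_ratio=0.75):
    """Randomly keep a subset of patch tokens (MAE-style masking sketch;
    names and details are illustrative, not the paper's code)."""
    B, N, D = patches.shape
    n_keep = int(N * (1 - mask_ratio))
    noise = torch.rand(B, N)                        # per-patch random scores
    keep_idx = noise.argsort(dim=1)[:, :n_keep]     # indices of visible patches
    visible = torch.gather(
        patches, 1, keep_idx.unsqueeze(-1).expand(-1, -1, D)
    )
    return visible, keep_idx                        # encoder sees only `visible`

# Toy usage: 196 patches (14x14) per image, 768-dim tokens
tokens = torch.randn(8, 196, 768)
visible, keep_idx = random_patch_masking(tokens)    # visible: (8, 49, 768)
```

Since the encoder only ever sees roughly a quarter of the patches, most of the compute is saved during pre-training, which is what makes the approach scale to large models.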

Top 10 Trending Papers of November 🏆

Here are the top ten trending papers of November 2021 on Papers with Code:

📄 Swin Transformer V2

📄 MetaFormer is Actually What You Need for Vision

📄 Neural Visual World Creation

📄 Masked Autoencoders are Scalable Vision Learners

📄 Attention Mechanisms in Computer Vision: A Survey

📄 Florence: A New Foundation Model for Computer Vision

📄 Restormer

📄 StyleGAN of all Trades

📄 FastFlow

📄 Rethinking KeyPoint Representations 

New Results on Papers with Code 📈

🏞 BASIC: Presents a combined scaling method that achieves state-of-the-art zero-shot transfer image classification on ImageNet.

🔨 Restormer: Introduces an efficient Transformer model for high-resolution image restoration. It outperforms previous models on several image restoration tasks such as defocus deblurring and image denoising.

⚙️ ML-Decoder: Proposes a new attention-based classification head built on a redesigned decoder architecture, achieving state-of-the-art multi-label classification on MS-COCO and other image datasets.

Browse all state-of-the-art results reported on Papers with Code here.

Trending Research Datasets and Tools 🛠

Datasets

RedCaps - a large-scale dataset of 12M image-text pairs collected from Reddit.

CytoImageNet - a large-scale pretraining dataset for bioimage transfer learning.

LSUI - a large-scale underwater image dataset of 5K image pairs covering a rich variety of underwater scenes.

Follow more of our latest datasets on Twitter.

Tools

TorchGeo - a Python library for integrating spatial data into the PyTorch deep learning ecosystem.

tsflex - a domain-independent Python toolkit for processing and feature extraction for time series.

Community Highlights ✍️

We would like to thank:

  • @LintaoPeng for contributing the new Large-Scale Underwater Image Dataset for underwater image restoration.
  • @kjan for contributing several community implementations of papers like "PonderNet: Learning to Ponder".
  • @lclissa for indexing a new paper and dataset on automatic cell counting in fluorescence microscopy using deep learning.
  • @aprimpeli for contributing to several Datasets and Leaderboards.

Special thanks to all of our contributors for their ongoing contributions to Papers with Code.

More from Papers with Code 🗣


Papers with Code now Maintaining aideadlin.es

We are now helping to maintain AI Conference Deadlines which makes it easy to find and follow conferences in areas such as machine learning and NLP. Thanks to Abhishek Das for this great initiative.

---

We would be happy to hear your thoughts and suggestions on the newsletter. Please reply to elvis@paperswithcode.com.

See previous issues

Join us on Slack, LinkedIn, and Twitter