May 18, 2022

Papers with Code Newsletter #30

Welcome to the 30th issue of the Papers with Code newsletter. This week we cover:

  • a large open pre-trained transformer language model
  • a generalist agent
  • a new approach for few-shot parameter efficient finetuning
  • state-of-the-art ML methods of the week and more.

OPT

(Left) Multi-shot performance. (Right) Zero-shot evaluation averages. Source: Zhang et al. (2022)

Large language models (LLMs) show great potential to improve performance on many diverse and new tasks and are demonstrating strong zero-shot and few-shot learning capabilities. However, most previous models of this scale are either inaccessible or costly to use and study. To address this accessibility challenge, Zhang et al. (2022) release OPT, a new open pre-trained transformer-based language model.

OPT follows other efforts to open-source LLMs, such as GPT-Neo. OPT includes a suite of decoder-only pre-trained transformers ranging from 125M to 175B parameters. OPT-175B was developed with roughly 1/7th the carbon footprint of GPT-3. The release also includes a logbook documenting the challenges faced when training models at this scale. The paper reports several zero-shot and few-shot learning experiments, along with a comprehensive bias and toxicity evaluation to understand the potential harms of LLMs. The openly available models give the research community the opportunity to further study the capabilities and limitations of LLMs.
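
The smaller checkpoints are light enough to try locally. As a rough illustration (a minimal sketch, assuming the weights are pulled from the Hugging Face Hub under the facebook/opt-* names), zero-shot generation looks something like this:

```python
# Minimal sketch: zero-shot text generation with a small OPT checkpoint.
# Assumes the weights are available on the Hugging Face Hub as "facebook/opt-125m".
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "facebook/opt-125m"  # swap for a larger checkpoint as resources allow
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "The capital of France is"
inputs = tokenizer(prompt, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=20, do_sample=False)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```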

Paper & Code

Gato 🐈

Training phase of Gato. Source: Reed et al. (2022)

Reed et al. (2022) propose a new approach, inspired by large-scale language models, that acts as a single generalist agent. The agent, called Gato, is built to work as a multi-modal, multi-task, multi-embodiment generalist policy.

Gato can perform a wide range of tasks, from playing Atari games to chatting to stacking blocks with a real robot arm. Data is serialized into tokens and processed by a transformer network, much like a large language model. The interleaved observation tokens and previously sampled actions are consumed to produce the next action autoregressively. Each new action is applied to the environment to obtain a new observation, and the process repeats; this loop constitutes the control policy. Gato is trained on 604 distinct tasks with varying modalities and performs well on robotic stacking as well as on generation tasks such as image captioning and chitchat.
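
The control loop itself is simple to state. The sketch below is only a schematic of that autoregressive loop; the policy, tokenizers, and environment interface here are hypothetical stand-ins, not the released Gato components:

```python
# Schematic of the autoregressive control loop described above.
# `policy`, `tokenize_observation`, `detokenize_action` and `env` are hypothetical
# stand-ins; Gato's actual tokenization and model are described in Reed et al. (2022).

def run_episode(env, policy, tokenize_observation, detokenize_action, max_steps=1000):
    context = []                                    # interleaved observation/action tokens
    obs = env.reset()
    for _ in range(max_steps):
        context.extend(tokenize_observation(obs))            # serialize the new observation
        action_tokens = policy.sample_action_tokens(context)  # autoregressive decoding
        context.extend(action_tokens)                         # sampled actions rejoin the context
        obs, reward, done = env.step(detokenize_action(action_tokens))
        if done:
            break
```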

Paper & Results

Few-shot Parameter-Efficient Finetuning

Diagram of $(IA)^3$ and the loss terms in the T-Few recipe. Source: Liu et al. (2022)

Liu et al. (2022) recently published a report comparing few-shot in-context learning (ICL) and parameter-efficient fine-tuning (PEFT) in terms of effectiveness and computational cost. The main finding is that PEFT can offer better accuracy at a lower computational cost. This is not surprising, given that with ICL the in-context examples must be processed every time the model makes a prediction.

The paper proposes a new recipe, T-Few, that allows a model to reach high accuracy while incurring lower computation and storage costs. The recipe attains strong performance on new tasks without per-task manual tuning, which is important in few-shot settings. T0 is chosen as the base model, and a new PEFT method, $(IA)^3$, is proposed that outperforms fine-tuning the full model. $(IA)^3$ rescales inner activations with learned vectors and introduces only a "tiny amount of additional parameters", resulting in better computational efficiency. T-Few uses 1000x fewer FLOPs during inference than few-shot ICL with GPT-3 and achieves "super-human" performance on the RAFT benchmark.
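
Conceptually, $(IA)^3$ multiplies selected intermediate activations by learned vectors while the base model stays frozen. The PyTorch-style sketch below illustrates the idea on a single frozen linear layer; it is a simplification rather than the exact placement used in the paper (which rescales attention keys, values, and the feed-forward inner activations):

```python
import torch
import torch.nn as nn

class IA3Scaler(nn.Module):
    """Wrap a frozen layer and elementwise-rescale its output with a learned vector."""
    def __init__(self, frozen_layer: nn.Linear):
        super().__init__()
        self.layer = frozen_layer
        for p in self.layer.parameters():
            p.requires_grad = False                                  # base model stays frozen
        self.scale = nn.Parameter(torch.ones(frozen_layer.out_features))  # only new parameters

    def forward(self, x):
        return self.layer(x) * self.scale                            # learned per-dimension rescaling
```

Only the small `scale` vectors are updated during fine-tuning, which is why the number of added parameters stays tiny relative to the base model.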

Paper, Code & Results

New on Papers with Code 📈

New Papers & Results


CoCa framework for pretraining image-text foundation models. Source: Yu et al. (2022)

CoCa - a new foundation model that achieves new SoTA on ImageNet (91% top-1 accuracy); proposes a minimal strategy to jointly pretrain an image-text encoder-decoder foundation model with contrastive loss and captioning loss.
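
As a rough illustration of how the two CoCa objectives combine (a schematic only; the embeddings, decoder logits, and loss weights below are placeholders rather than the CoCa implementation):

```python
import torch
import torch.nn.functional as F

def coca_style_loss(image_emb, text_emb, caption_logits, caption_targets,
                    temperature=0.07, w_contrastive=1.0, w_caption=1.0):
    """Schematic combination of a contrastive loss and a captioning loss."""
    image_emb = F.normalize(image_emb, dim=-1)        # (batch, dim)
    text_emb = F.normalize(text_emb, dim=-1)          # (batch, dim)
    logits = image_emb @ text_emb.t() / temperature   # pairwise image-text similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    contrastive = (F.cross_entropy(logits, targets) +
                   F.cross_entropy(logits.t(), targets)) / 2
    captioning = F.cross_entropy(                     # token-level cross-entropy for the decoder
        caption_logits.reshape(-1, caption_logits.size(-1)),
        caption_targets.reshape(-1))
    return w_contrastive * contrastive + w_caption * captioning
```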

Sequencer - an LSTM-based architecture for image classification that serves as an alternative to ViT; models long-range dependencies using LSTMs rather than self-attention layers; a 54M-parameter Sequencer model achieves 84.6% top-1 accuracy on ImageNet-1K.
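
A simplified way to picture the idea is a transformer-style block with the self-attention swapped for a BiLSTM. The sketch below is a 1-D simplification over a flattened token sequence, not the exact Sequencer2D block (which applies BiLSTMs along the vertical and horizontal axes separately):

```python
import torch.nn as nn

class LSTMMixerBlock(nn.Module):
    """Transformer-style block with a BiLSTM in place of self-attention (simplified)."""
    def __init__(self, dim, hidden):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.lstm = nn.LSTM(dim, hidden, batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden, dim)        # map BiLSTM output back to model width
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x):                             # x: (batch, tokens, dim)
        mixed, _ = self.lstm(self.norm1(x))
        x = x + self.proj(mixed)                      # token mixing via BiLSTM
        return x + self.mlp(self.norm2(x))            # channel mixing via MLP
```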

Unifying Language Learning Paradigms - proposes a pretraining objective that combines diverse pre-training paradigms, resulting in a more generalized and unified approach to self-supervision in NLP; a 20B parameter model achieves competitive results across well-established NLP tasks, including high performance on the SCROLLS long-text reasoning benchmark.

Wav2Seq framework for pretraining to transcribe audio inputs into sequences of pseudo language tokens. Source: Wu et al. (2022) 

Wav2Seq - a self-supervised approach for pre-training encoder-decoder models on speech data; the model is pre-trained to transcribe audio inputs into pseudo subword sequences; Wav2Seq achieves competitive results compared to approaches like CTC, including high performance on benchmarks like SLUE-VoxPopuli for spoken NER.

ConvMAE - introduces a new framework involving a multi-scale hybrid convolution-transformer to improve representations via a masked auto-encoding scheme; instead of the original masking strategy, a masked convolution is introduced to improve computational efficiency and fine-tuning accuracy.

MuQAR - a multimodal quasi-autoregressive deep learning architecture for forecasting the visual popularity of new fashion products that lack historical data; MuQAR is evaluated on fashion image datasets and surpasses the current state-of-the-art on the VISUELLE dataset.

DoubleMatch - a new semi-supervised learning algorithm that combines a pseudo-labeling technique with a self-supervised loss, enabling the model to use all unlabeled data during training; this reduces training times while achieving state-of-the-art performance on several benchmark datasets.
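
As a rough sketch of the kind of objective involved (schematic only; the particular self-supervised term, feature heads, and thresholding details below are placeholder assumptions, not the paper's exact formulation):

```python
import torch.nn.functional as F

def doublematch_style_loss(logits_labeled, labels,
                           logits_weak, logits_strong,
                           feats_weak, feats_strong,
                           threshold=0.95, w_pseudo=1.0, w_self=1.0):
    """Schematic: supervised CE + confidence-thresholded pseudo-label CE
    + a self-supervised consistency term that uses ALL unlabeled examples."""
    supervised = F.cross_entropy(logits_labeled, labels)

    # Pseudo-labeling: only confident weak-augmentation predictions supervise the strong view.
    probs = logits_weak.softmax(dim=-1)
    conf, pseudo = probs.max(dim=-1)
    mask = (conf >= threshold).float()
    pseudo_loss = (F.cross_entropy(logits_strong, pseudo, reduction="none") * mask).mean()

    # Self-supervised term (placeholder: cosine consistency) uses every unlabeled example.
    self_sup = 1 - F.cosine_similarity(feats_weak.detach(), feats_strong, dim=-1).mean()

    return supervised + w_pseudo * pseudo_loss + w_self * self_sup
```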

RepSurf - a novel representation for point clouds; it targets a challenge present in previous work: learning local structural information; it can be used as a plug-and-play module in point cloud models and achieves high performance on several point cloud benchmarks.

Benchmark, Datasets & Tools


WebVidVQA3M dataset preview. Source: Yang et al. (2022)

WebVidVQA3M - a new dataset with 3M video-question-answer triplets, generated using neural question generation models.

CiteSum - a new benchmark for citation text-guided scientific extreme summarization.

CLUES - a new benchmark for classifier learning using language explanations.

---

📈  Browse SoTA 

🏆  Trending

🔥  Top Social

---

See previous issues

Join us on Slack, LinkedIn, and Twitter