Papers with Code Newsletter #36
Welcome to the 36th issue of the Papers with Code newsletter. This week, we cover:
- extending transformers for long input summarization,
- a conversational agent that continually learns,
- a method for personalizing text-to-image generation,
- new state-of-the-art results,
- ... and much more.
Extending Transformers for Long Input Summarization
Model scores on SCROLLS summarization tasks. The proposed model (PEGASUS-X) outperforms other models at comparable sizes. Figure source: Phang et al. (2022)
While transformer-based approaches are effective at tackling natural language tasks, long input sequences remain a challenge. Several approaches have previously been proposed to handle longer sequences more effectively and efficiently (e.g., BigBird). Phang et al. (2022) investigate more closely which architectural changes and pretraining paradigms can efficiently adapt a pre-trained Transformer for long input summarization. Here is a summary of the findings and results:
- Local attention provides a strong baseline, and adding global tokens significantly improves performance; both types of models are found to be resource-efficient.
- Staggering local attention blocks, which allows for cross-block interactions, improves performance with minimal additional computational cost or complexity.
- Larger block sizes and/or more global tokens also lead to performance improvements.
- Sinusoidal position encodings are still a good choice for dealing with long input sequences.
- Allocating some portion of training to long-input pretraining was observed to improve performance. However, pretraining exclusively on long inputs was not found to be beneficial.
- Cross-attention can be dropped in a fraction of decoder layers to reduce memory consumption, though at the cost of some performance regression.
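The local-attention variants above can be illustrated with a toy attention mask (a minimal sketch, not the paper's implementation; the sequence length, `block_size`, and the half-block stagger offset are illustrative assumptions):

```python
import numpy as np

def local_attention_mask(n, block_size, stagger=False):
    """Boolean mask: entry (i, j) is True if query i may attend to key j.
    Each token attends only within its own block; staggering shifts the
    block boundaries by half a block, so alternating staggered and
    aligned layers lets information flow across block boundaries."""
    offset = block_size // 2 if stagger else 0
    block_id = (np.arange(n) + offset) // block_size
    return block_id[:, None] == block_id[None, :]

def masked_softmax(scores, mask):
    # Disallowed positions get a large negative score before the softmax.
    scores = np.where(mask, scores, -1e9)
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# With n=8 and block_size=4, tokens 3 and 4 sit in different blocks
# under the aligned partition, but share a block once staggered.
aligned = local_attention_mask(8, 4)
staggered = local_attention_mask(8, 4, stagger=True)
attn = masked_softmax(np.zeros((8, 8)), aligned)
```

Each attention row is then a distribution over only the keys in the same block; global tokens (which every position can attend to) are added on top of this local pattern.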
A Conversational Agent that Continually Learns

BlenderBot 3 module execution flow. Figure source: Shuster et al. (2022)
One research direction that could lead to improved conversational agents is building systems that continually improve over time by incorporating user feedback. Shuster et al. (2022) recently proposed BlenderBot 3 (BB3), a 175B parameter dialogue model for open-domain conversations that learns continually to responsibly engage.
BB3 is built on top of OPT, composed of several modules, and fine-tuned on different tasks that enable it to complete goals through feedback. Many other models of its kind are trained with human-annotated supervised targets, which can lead to a mismatch with what organic users actually want. BB3, on the other hand, aims to collect feedback and continually learn from organic users through interactions. Learning from feedback helps the system align more closely with users and improves results. The chatbot is available in the US at this link: https://blenderbot.ai/. Findings, detailed results, model design, model safety, and future plans/releases are discussed further in the paper.
An Image is Worth One Word
Discovered pseudo words for concepts (left) are used to compose new sentences leading to creative novel scenes. Figure source: Gal et al. (2022)
Recently, we have seen the rise of text-to-image models that allow users to synthesize novel scenes and rich images in different styles. When using these generative models for artistic creation, however, coming up with text descriptions that effectively render a desired target remains a challenge. It is also unclear how to generate images of specific unique concepts, modify their appearance, or compose them in new roles and novel scenes. Gal et al. (2022) recently proposed a new approach to tackle these challenges and allow for more creative freedom with these generative systems.
In summary, this work takes a few images of a concept and learns to represent it through new "words" in the embedding space of a frozen text-to-image model. Through a process referred to as "textual inversion", the goal is to find new pseudo-words in the embedding space that capture high-level semantics and fine visual details. These pseudo-words can then be used to compose new sentences that guide novel personalized creations. Results demonstrate that this approach to personalizing text-to-image generation provides high visual fidelity and enables robust editing of scenes.
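The optimization behind textual inversion can be sketched in a toy setting (purely illustrative: a fixed linear map stands in for the frozen generator, and the dimensions and learning rate are arbitrary assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# Frozen "generator": a fixed linear map standing in for the
# text-to-image model; it is never updated during inversion.
W = rng.normal(size=(4, 3))

# A few "images" of the target concept, reduced to feature vectors.
concept_images = rng.normal(size=(5, 4))
target = concept_images.mean(axis=0)

# The only trainable parameter: the embedding of the new pseudo-word.
v = np.zeros(3)
initial_loss = float((target ** 2).sum())  # loss at v = 0

for _ in range(500):
    pred = W @ v                        # "render" the pseudo-word
    grad = 2.0 * W.T @ (pred - target)  # gradient of squared error w.r.t. v
    v -= 0.02 * grad                    # update only the embedding

final_loss = float(((W @ v - target) ** 2).sum())
```

The real method instead backpropagates the generative model's training loss into a single new token embedding while all model weights stay frozen; the learned pseudo-word can then be dropped into ordinary prompts.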
New on Papers with Code
Below are some of the new papers and results on Papers with Code.
Papers & Results
MinVIS architecture for video instance segmentation. Figure source: Huang et al. (2022)
MinVIS - a minimal video instance segmentation framework that requires no video-based training yet achieves state-of-the-art performance, comparable to fully-supervised approaches.
LLM.int8() - a new quantization procedure that allows large-scale model checkpoints (16/32-bit) to be loaded and converted to Int8. This gives access to large language models (LLMs) that previously could not be run due to limited GPU memory; it enables LLMs with 175B parameters, such as OPT, to be used effectively for inference without any performance degradation.
XCLIP - proposes an approach for expanding language-image pretrained models for general video recognition; it can generalize to different video recognition scenarios and achieves top performance on the Kinetics benchmark.
Prompt Tuning for Generative Multimodal Pretrained Models - demonstrates that prompt tuning can achieve comparable performance with fine-tuning on generative multimodal pretrained models.
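The core idea behind int8 quantization of checkpoints can be sketched with symmetric absmax scaling (a simplified illustration only: LLM.int8() actually quantizes vector-wise and keeps outlier feature dimensions in 16-bit, which this sketch omits):

```python
import numpy as np

def absmax_quantize(x):
    """Symmetric absmax quantization of a float vector to int8.
    The scale maps the largest magnitude to 127; dequantize by
    dividing back. (Minimal sketch, not the LLM.int8() procedure.)"""
    scale = 127.0 / np.max(np.abs(x))
    q = np.round(x * scale).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) / scale

# Toy weight vector: storage drops from 32-bit floats to 8-bit ints,
# at the cost of a small rounding error on dequantization.
w = np.array([0.1, -0.5, 2.0, -0.05], dtype=np.float32)
q, scale = absmax_quantize(w)
w_hat = dequantize(q, scale)
```

One large outlier (here 2.0) stretches the scale and coarsens the grid for all other entries, which is exactly why the paper treats outlier dimensions separately in higher precision.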
Benchmark, Datasets & Tools
OpenMedIA - an open-source medical image analysis toolbox and benchmark containing deep learning methods for medical image analysis across heterogeneous AI computing platforms.
ferret - a new framework for benchmarking explanation methods on Transformer-based models.