Search Results for author: Mohit Bansal

Found 269 papers, 177 papers with code

An Overview of Uncertainty Calibration for Text Classification and the Role of Distillation

no code implementations ACL (RepL4NLP) 2021 Han Guo, Ramakanth Pasunuru, Mohit Bansal

Many recalibration methods have been proposed in the literature for quantifying predictive uncertainty and calibrating model outputs, with varying degrees of complexity.

text-classification Text Classification

Enhancing Knowledge Selection for Grounded Dialogues via Document Semantic Graphs

no code implementations NAACL 2022 Sha Li, Mahdi Namazifar, Di Jin, Mohit Bansal, Heng Ji, Yang Liu, Dilek Hakkani-Tur

In this work, we propose to automatically convert the background knowledge documents into document semantic graphs and then perform knowledge selection over such graphs.

Multi-Task Learning Response Generation +1

Continual Few-Shot Learning for Text Classification

1 code implementation EMNLP 2021 Ramakanth Pasunuru, Veselin Stoyanov, Mohit Bansal

In this work, we propose a continual few-shot learning (CFL) task, in which a system is challenged with a difficult phenomenon and asked to learn to correct mistakes with only a few (10 to 15) training examples.

continual few-shot learning Few-Shot Learning +4

Inducing Transformer’s Compositional Generalization Ability via Auxiliary Sequence Prediction Tasks

1 code implementation EMNLP 2021 Yichen Jiang, Mohit Bansal

Motivated by the failure of a Transformer model on the SCAN compositionality challenge (Lake and Baroni, 2018), which requires parsing a command into actions, we propose two auxiliary sequence prediction tasks as additional training supervision.

NDH-Full: Learning and Evaluating Navigational Agents on Full-Length Dialogue

1 code implementation EMNLP 2021 Hyounghun Kim, Jialu Li, Mohit Bansal

In this paper, we explore the Navigation from Dialogue History (NDH) task, which is based on the Cooperative Vision-and-Dialogue Navigation (CVDN) dataset, and present a state-of-the-art model which is built upon Vision-Language transformers.

Data Augmentation Dynamic Time Warping +1

Integrating Visuospatial, Linguistic, and Commonsense Structure into Story Visualization

1 code implementation EMNLP 2021 Adyasha Maharana, Mohit Bansal

Such information is even more important for story visualization since its inputs have an explicit narrative structure that needs to be translated into an image sequence (or visual story).

Dense Captioning Image Generation +1

On Curriculum Learning for Commonsense Reasoning

1 code implementation NAACL 2022 Adyasha Maharana, Mohit Bansal

Hence, we examine the effect of a human-like easy-to-difficult curriculum during finetuning of language models for commonsense reasoning tasks.

Learning-To-Rank Natural Language Understanding +1

GraDA: Graph Generative Data Augmentation for Commonsense Reasoning

1 code implementation COLING 2022 Adyasha Maharana, Mohit Bansal

Recent advances in commonsense reasoning have been fueled by the availability of large-scale human annotated datasets.

Data Augmentation Knowledge Graphs

Multimodal Intent Discovery from Livestream Videos

no code implementations Findings (NAACL) 2022 Adyasha Maharana, Quan Tran, Franck Dernoncourt, Seunghyun Yoon, Trung Bui, Walter Chang, Mohit Bansal

We construct and present a new multimodal dataset consisting of software instructional livestreams and containing manual annotations for both detailed and abstract procedural intent that enable training and evaluation of joint video and text understanding models.

Intent Discovery Video Summarization +1

Ctrl-Adapter: An Efficient and Versatile Framework for Adapting Diverse Controls to Any Diffusion Model

no code implementations15 Apr 2024 Han Lin, Jaemin Cho, Abhay Zala, Mohit Bansal

Ctrl-Adapter provides diverse capabilities including image control, video control, video control with sparse frames, multi-condition control, compatibility with different backbones, adaptation to unseen control conditions, and video editing.

Image Generation Video Editing +1

Rethinking Interactive Image Segmentation with Low Latency, High Quality, and Diverse Prompts

1 code implementation31 Mar 2024 Qin Liu, Jaemin Cho, Mohit Bansal, Marc Niethammer

In light of this, we reintroduce this dense design into the generalist models, to facilitate the development of generalist models with high segmentation quality.

Image Segmentation Interactive Segmentation +2

EnvGen: Generating and Adapting Environments via LLMs for Training Embodied Agents

no code implementations18 Mar 2024 Abhay Zala, Jaemin Cho, Han Lin, Jaehong Yoon, Mohit Bansal

Instead of directly employing LLMs as agents, can we use LLMs' reasoning capabilities to adaptively create training environments to help smaller embodied RL agents learn useful skills that they are weak at?

Reinforcement Learning (RL) World Knowledge

DAM: Dynamic Adapter Merging for Continual Video QA Learning

1 code implementation13 Mar 2024 Feng Cheng, Ziyang Wang, Yi-Lin Sung, Yan-Bo Lin, Mohit Bansal, Gedas Bertasius

Our DAM model outperforms prior state-of-the-art continual learning approaches by 9. 1% while exhibiting 1. 9% less forgetting on 6 VidQA datasets spanning various domains.

Continual Learning Image Classification +2

SELMA: Learning and Merging Skill-Specific Text-to-Image Experts with Auto-Generated Data

no code implementations11 Mar 2024 Jialu Li, Jaemin Cho, Yi-Lin Sung, Jaehong Yoon, Mohit Bansal

In this paper, we introduce SELMA: Skill-Specific Expert Learning and Merging with Auto-Generated Data, a novel paradigm to improve the faithfulness of T2I models by fine-tuning models on automatically generated, multi-skill image-text datasets, with skill-specific expert learning and merging.

In-Context Learning

Contrastive Region Guidance: Improving Grounding in Vision-Language Models without Training

no code implementations4 Mar 2024 David Wan, Jaemin Cho, Elias Stengel-Eskin, Mohit Bansal

Highlighting particularly relevant regions of an image can improve the performance of vision-language models (VLMs) on various vision-language (VL) tasks by guiding the model to attend more closely to these regions of interest.

Math Phrase Grounding +2

Evaluating Very Long-Term Conversational Memory of LLM Agents

no code implementations27 Feb 2024 Adyasha Maharana, Dong-Ho Lee, Sergey Tulyakov, Mohit Bansal, Francesco Barbieri, Yuwei Fang

Using this pipeline, we collect LoCoMo, a dataset of very long-term conversations, each encompassing 300 turns and 9K tokens on avg., over up to 35 sessions.

Avg Multi-modal Dialogue Generation +1

Soft Self-Consistency Improves Language Model Agents

1 code implementation20 Feb 2024 Han Wang, Archiki Prasad, Elias Stengel-Eskin, Mohit Bansal

Current "sample and select" methods such as self-consistency (SC) rely on majority voting to score answers.

Language Modelling valid

GTBench: Uncovering the Strategic Reasoning Limitations of LLMs via Game-Theoretic Evaluations

1 code implementation19 Feb 2024 Jinhao Duan, Renming Zhang, James Diffenderfer, Bhavya Kailkhura, Lichao Sun, Elias Stengel-Eskin, Mohit Bansal, Tianlong Chen, Kaidi Xu

As Large Language Models (LLMs) are integrated into critical real-world applications, their strategic and logical reasoning abilities are increasingly crucial.

Card Games Logical Reasoning

Inducing Systematicity in Transformers by Attending to Structurally Quantized Embeddings

1 code implementation9 Feb 2024 Yichen Jiang, Xiang Zhou, Mohit Bansal

Transformers generalize to novel compositions of structures and entities after being trained on a complex dataset, but easily overfit on datasets of insufficient complexity.

Machine Translation Quantization +2

CREMA: Multimodal Compositional Video Reasoning via Efficient Modular Adaptation and Fusion

1 code implementation8 Feb 2024 Shoubin Yu, Jaehong Yoon, Mohit Bansal

Furthermore, we propose a fusion module designed to compress multimodal queries, maintaining computational efficiency in the LLM while combining additional modalities.

Computational Efficiency Optical Flow Estimation +2

VLN-Video: Utilizing Driving Videos for Outdoor Vision-and-Language Navigation

no code implementations5 Feb 2024 Jialu Li, Aishwarya Padmakumar, Gaurav Sukhatme, Mohit Bansal

Outdoor Vision-and-Language Navigation (VLN) requires an agent to navigate through realistic 3D outdoor environments based on natural language instructions.

Language Modelling Masked Language Modeling +2

MAGDi: Structured Distillation of Multi-Agent Interaction Graphs Improves Reasoning in Smaller Language Models

1 code implementation2 Feb 2024 Justin Chih-Yao Chen, Swarnadeep Saha, Elias Stengel-Eskin, Mohit Bansal

Experiments on seven widely-used commonsense and math reasoning benchmarks show that MAGDi improves the reasoning capabilities of smaller models, outperforming several methods that distill from a single teacher and multiple teachers.

Language Modelling Large Language Model +1

ReGAL: Refactoring Programs to Discover Generalizable Abstractions

1 code implementation29 Jan 2024 Elias Stengel-Eskin, Archiki Prasad, Mohit Bansal

While large language models (LLMs) are increasingly being used for program synthesis, they lack the global view needed to develop useful abstractions; they generally predict programs one at a time, often repeating the same functionality.

Date Understanding Program Synthesis

Mementos: A Comprehensive Benchmark for Multimodal Large Language Model Reasoning over Image Sequences

1 code implementation19 Jan 2024 Xiyao Wang, YuHang Zhou, Xiaoyu Liu, Hongjin Lu, Yuancheng Xu, Feihong He, Jaehong Yoon, Taixi Lu, Gedas Bertasius, Mohit Bansal, Huaxiu Yao, Furong Huang

However, current MLLM benchmarks are predominantly designed to evaluate reasoning based on static information about a single image, and the ability of modern MLLMs to extrapolate from image sequences, which is essential for understanding our ever-changing world, has been less investigated.

Language Modelling Large Language Model

The Unreasonable Effectiveness of Easy Training Data for Hard Tasks

1 code implementation12 Jan 2024 Peter Hase, Mohit Bansal, Peter Clark, Sarah Wiegreffe

In this paper, we present the surprising conclusion that current language models often generalize relatively well from easy to hard data, even performing as well as "oracle" models trained on hard data.

General Knowledge In-Context Learning +1

A Simple LLM Framework for Long-Range Video Question-Answering

1 code implementation28 Dec 2023 Ce Zhang, Taixi Lu, Md Mohaiminul Islam, Ziyang Wang, Shoubin Yu, Mohit Bansal, Gedas Bertasius

Furthermore, we show that a specialized prompt that asks the LLM first to summarize the noisy short-term visual captions and then answer a given input question leads to a significant LVQA performance boost.

Large Language Model Long-range modeling +2

Merging by Matching Models in Task Parameter Subspaces

1 code implementation7 Dec 2023 Derek Tam, Mohit Bansal, Colin Raffel

Model merging aims to cheaply combine individual task-specific models into a single multitask model.

CoDi-2: In-Context, Interleaved, and Interactive Any-to-Any Generation

no code implementations30 Nov 2023 Zineng Tang, ZiYi Yang, Mahmoud Khademi, Yang Liu, Chenguang Zhu, Mohit Bansal

We present CoDi-2, a versatile and interactive Multimodal Large Language Model (MLLM) that can follow complex multimodal interleaved instructions, conduct in-context learning (ICL), reason, chat, edit, etc., in an any-to-any input-output modality paradigm.

Image Generation In-Context Learning +3

Debiasing Multimodal Models via Causal Information Minimization

1 code implementation28 Nov 2023 Vaidehi Patil, Adyasha Maharana, Mohit Bansal

In this paper, we study bias arising from confounders in a causal graph for multimodal data and examine a novel approach that leverages causally-motivated information minimization to learn the confounder representations.

Visual Question Answering (VQA)

ComPEFT: Compression for Communicating Parameter Efficient Updates via Sparsification and Quantization

1 code implementation22 Nov 2023 Prateek Yadav, Leshem Choshen, Colin Raffel, Mohit Bansal

Despite the efficiency of PEFT methods, the size of expert models can make it onerous to retrieve expert models per query over high-latency networks like the Internet or serve multiple experts on a single GPU.

Language Modelling Quantization

Multimodal Representation Learning by Alternating Unimodal Adaptation

1 code implementation17 Nov 2023 Xiaohui Zhang, Jaehong Yoon, Mohit Bansal, Huaxiu Yao

This optimization process is controlled by a gradient modification mechanism to prevent the shared head from losing previously acquired information.

Representation Learning

ADaPT: As-Needed Decomposition and Planning with Language Models

no code implementations8 Nov 2023 Archiki Prasad, Alexander Koller, Mareike Hartmann, Peter Clark, Ashish Sabharwal, Mohit Bansal, Tushar Khot

Large Language Models (LLMs) are increasingly being used for interactive decision-making tasks requiring planning and adapting to the environment.

Decision Making

Data Factors for Better Compositional Generalization

1 code implementation8 Nov 2023 Xiang Zhou, Yichen Jiang, Mohit Bansal

However, in contrast to this poor performance, state-of-the-art models trained on larger and more general datasets show better generalization ability.


Davidsonian Scene Graph: Improving Reliability in Fine-grained Evaluation for Text-to-Image Generation

no code implementations27 Oct 2023 Jaemin Cho, Yushi Hu, Roopal Garg, Peter Anderson, Ranjay Krishna, Jason Baldridge, Mohit Bansal, Jordi Pont-Tuset, Su Wang

With extensive experimentation and human evaluation on a range of model configurations (LLM, VQA, and T2I), we empirically demonstrate that DSG addresses the challenges noted above.

Question Answering Question Generation +3

Branch-Solve-Merge Improves Large Language Model Evaluation and Generation

no code implementations23 Oct 2023 Swarnadeep Saha, Omer Levy, Asli Celikyilmaz, Mohit Bansal, Jason Weston, Xian Li

Large Language Models (LLMs) are frequently used for multi-faceted language generation and evaluation tasks that involve satisfying intricate user constraints or taking into account multiple aspects and criteria.

Language Modelling Large Language Model +1

DiagrammerGPT: Generating Open-Domain, Open-Platform Diagrams via LLM Planning

no code implementations18 Oct 2023 Abhay Zala, Han Lin, Jaemin Cho, Mohit Bansal

In the first stage, we use LLMs to generate and iteratively refine 'diagram plans' (in a planner-auditor feedback loop) which describe all the entities (objects and text labels), their relationships (arrows or lines), and their bounding box layouts.

D2 Pruning: Message Passing for Balancing Diversity and Difficulty in Data Pruning

1 code implementation11 Oct 2023 Adyasha Maharana, Prateek Yadav, Mohit Bansal

There are two dominant approaches: (1) geometry-based data selection for maximizing data diversity in the coreset, and (2) functions that assign difficulty scores to samples based on training dynamics.

Rephrase, Augment, Reason: Visual Grounding of Questions for Vision-Language Models

1 code implementation9 Oct 2023 Archiki Prasad, Elias Stengel-Eskin, Mohit Bansal

An increasing number of vision-language tasks can be handled with little to no training, i. e., in a zero and few-shot manner, by marrying large language models (LLMs) to vision encoders, resulting in large vision-language models (LVLMs).

Language Modelling Question Answering +2

ECoFLaP: Efficient Coarse-to-Fine Layer-Wise Pruning for Vision-Language Models

no code implementations4 Oct 2023 Yi-Lin Sung, Jaehong Yoon, Mohit Bansal

We first determine the sparsity ratios of different layers or blocks by leveraging the global importance score, which is efficiently computed based on the zeroth-order approximation of the global model gradients.

Model Compression

Merge, Then Compress: Demystify Efficient SMoE with Hints from Its Routing Policy

1 code implementation2 Oct 2023 Pingzhi Li, Zhenyu Zhang, Prateek Yadav, Yi-Lin Sung, Yu Cheng, Mohit Bansal, Tianlong Chen

Sparsely activated Mixture-of-Experts (SMoE) has shown promise to scale up the learning capacity of neural networks, however, they have issues like (a) High Memory Usage, due to duplication of the network layers into multiple copies as experts; and (b) Redundancy in Experts, as common learning-based routing policies suffer from representational collapse.

Can Sensitive Information Be Deleted From LLMs? Objectives for Defending Against Extraction Attacks

1 code implementation29 Sep 2023 Vaidehi Patil, Peter Hase, Mohit Bansal

Experimentally, we show that even state-of-the-art model editing methods such as ROME struggle to truly delete factual information from models like GPT-J, as our whitebox and blackbox attacks can recover "deleted" information from an edited model 38% of the time.

Model Editing

VideoDirectorGPT: Consistent Multi-scene Video Generation via LLM-Guided Planning

no code implementations26 Sep 2023 Han Lin, Abhay Zala, Jaemin Cho, Mohit Bansal

Our experiments demonstrate that VideoDirectorGPT framework substantially improves layout and movement control in both single- and multi-scene video generation and can generate multi-scene videos with visual consistency across scenes, while achieving competitive performance with SOTAs in open-domain single-scene T2V generation.

Image Generation Video Generation

Scaling Data Generation in Vision-and-Language Navigation

1 code implementation ICCV 2023 Zun Wang, Jialu Li, Yicong Hong, Yi Wang, Qi Wu, Mohit Bansal, Stephen Gould, Hao Tan, Yu Qiao

Recent research in language-guided visual navigation has demonstrated a significant demand for the diversity of traversable environments and the quantity of supervision for training generalizable agents.

Imitation Learning Vision and Language Navigation +1

On Conditional and Compositional Language Model Differentiable Prompting

no code implementations4 Jul 2023 Jonathan Pilault, Can Liu, Mohit Bansal, Markus Dreyer

Prompts have been shown to be an effective method to adapt a frozen Pretrained Language Model (PLM) to perform well on downstream tasks.

Few-Shot Learning Language Modelling +1

Can Language Models Teach Weaker Agents? Teacher Explanations Improve Students via Personalization

1 code implementation15 Jun 2023 Swarnadeep Saha, Peter Hase, Mohit Bansal

We first show that teacher LLMs can indeed intervene on student reasoning to improve their performance.

TIES-Merging: Resolving Interference When Merging Models

2 code implementations NeurIPS 2023 Prateek Yadav, Derek Tam, Leshem Choshen, Colin Raffel, Mohit Bansal

To address this, we propose our method, TRIM, ELECT SIGN & MERGE (TIES-Merging), which introduces three novel steps when merging models: (1) resetting parameters that only changed a small amount during fine-tuning, (2) resolving sign conflicts, and (3) merging only the parameters that are in alignment with the final agreed-upon sign.

Transfer Learning

Non-Sequential Graph Script Induction via Multimedia Grounding

1 code implementation27 May 2023 Yu Zhou, Sha Li, Manling Li, Xudong Lin, Shih-Fu Chang, Mohit Bansal, Heng Ji

To automate the induction of such graph scripts for given tasks, we propose to take advantage of loosely aligned videos of people performing the tasks.

MixCE: Training Autoregressive Language Models by Mixing Forward and Reverse Cross-Entropies

1 code implementation26 May 2023 Shiyue Zhang, Shijie Wu, Ozan Irsoy, Steven Lu, Mohit Bansal, Mark Dredze, David Rosenberg

Autoregressive language models are trained by minimizing the cross-entropy of the model distribution Q relative to the data distribution P -- that is, minimizing the forward cross-entropy, which is equivalent to maximum likelihood estimation (MLE).

Visual Programming for Text-to-Image Generation and Evaluation

no code implementations24 May 2023 Jaemin Cho, Abhay Zala, Mohit Bansal

First, we introduce VPGen, an interpretable step-by-step T2I generation framework that decomposes T2I generation into three steps: object/count generation, layout generation, and image generation.

Text-to-Image Generation World Knowledge

Any-to-Any Generation via Composable Diffusion

1 code implementation NeurIPS 2023 Zineng Tang, ZiYi Yang, Chenguang Zhu, Michael Zeng, Mohit Bansal

We present Composable Diffusion (CoDi), a novel generative model capable of generating any combination of output modalities, such as language, image, video, or audio, from any combination of input modalities.

Audio Generation

Self-Chained Image-Language Model for Video Localization and Question Answering

1 code implementation NeurIPS 2023 Shoubin Yu, Jaemin Cho, Prateek Yadav, Mohit Bansal

SeViLA framework consists of two modules: Localizer and Answerer, where both are parameter-efficiently fine-tuned from BLIP-2.

Ranked #3 on Zero-Shot Video Question Answer on IntentQA (using extra training data)

Language Modelling Representation Learning +2

HistAlign: Improving Context Dependency in Language Generation by Aligning with History

1 code implementation8 May 2023 David Wan, Shiyue Zhang, Mohit Bansal

Cache-LMs, which augment LMs with a memory of recent history, can increase context dependency and have shown remarkable performance in diverse language generation tasks.

Abstractive Text Summarization Text Generation

An Empirical Study of Multimodal Model Merging

1 code implementation28 Apr 2023 Yi-Lin Sung, Linjie Li, Kevin Lin, Zhe Gan, Mohit Bansal, Lijuan Wang

In this paper, we expand on this concept to a multimodal setup by merging transformers trained on different modalities.

Retrieval Visual Question Answering (VQA)

ReCEval: Evaluating Reasoning Chains via Correctness and Informativeness

1 code implementation21 Apr 2023 Archiki Prasad, Swarnadeep Saha, Xiang Zhou, Mohit Bansal

Multi-step reasoning ability is fundamental to many natural language tasks, yet it is unclear what constitutes a good reasoning chain and how to evaluate them.

Informativeness Natural Language Inference +1

Diagnostic Benchmark and Iterative Inpainting for Layout-Guided Image Generation

1 code implementation13 Apr 2023 Jaemin Cho, Linjie Li, Zhengyuan Yang, Zhe Gan, Lijuan Wang, Mohit Bansal

In this paper, we propose LayoutBench, a diagnostic benchmark for layout-guided image generation that examines four categories of spatial control skills: number, position, size, and shape.

Layout-to-Image Generation

Improving Vision-and-Language Navigation by Generating Future-View Image Semantics

no code implementations CVPR 2023 Jialu Li, Mohit Bansal

We then fine-tune the agent on the VLN task with an auxiliary loss that minimizes the difference between the view semantics generated by the agent and the ground truth view semantics of the next step.

Image Generation Navigate +3

Hierarchical Video-Moment Retrieval and Step-Captioning

1 code implementation CVPR 2023 Abhay Zala, Jaemin Cho, Satwik Kottur, Xilun Chen, Barlas Oğuz, Yasher Mehdad, Mohit Bansal

Our hierarchical benchmark consists of video retrieval, moment retrieval, and two novel moment segmentation and step captioning tasks.

Information Retrieval Moment Retrieval +4

Exposing and Addressing Cross-Task Inconsistency in Unified Vision-Language Models

1 code implementation28 Mar 2023 Adyasha Maharana, Amita Kamath, Christopher Clark, Mohit Bansal, Aniruddha Kembhavi

As general purpose vision models get increasingly effective at a wide set of tasks, it is imperative that they be consistent across the tasks they support.

Faithfulness-Aware Decoding Strategies for Abstractive Summarization

1 code implementation6 Mar 2023 David Wan, Mengwen Liu, Kathleen McKeown, Markus Dreyer, Mohit Bansal

We present a systematic study of the effect of generation techniques such as beam search and nucleus sampling on faithfulness in abstractive summarization.

Abstractive Text Summarization

Vision Transformers are Parameter-Efficient Audio-Visual Learners

1 code implementation CVPR 2023 Yan-Bo Lin, Yi-Lin Sung, Jie Lei, Mohit Bansal, Gedas Bertasius

To do so, we propose a latent audio-visual hybrid (LAVISH) adapter that adapts pretrained ViTs to audio-visual tasks by injecting a small number of trainable parameters into every layer of a frozen ViT.

Audio-visual Question Answering

VindLU: A Recipe for Effective Video-and-Language Pretraining

1 code implementation CVPR 2023 Feng Cheng, Xizi Wang, Jie Lei, David Crandall, Mohit Bansal, Gedas Bertasius

Furthermore, our model also obtains state-of-the-art video question-answering results on ActivityNet-QA, MSRVTT-QA, MSRVTT-MC and TVQA.

Ranked #2 on Video Retrieval on Condensed Movies (using extra training data)

Question Answering Retrieval +3

Unifying Vision, Text, and Layout for Universal Document Processing

2 code implementations CVPR 2023 Zineng Tang, ZiYi Yang, Guoxin Wang, Yuwei Fang, Yang Liu, Chenguang Zhu, Michael Zeng, Cha Zhang, Mohit Bansal

UDOP leverages the spatial correlation between textual content and document image to model image, text, and layout modalities with one uniform representation.

Ranked #5 on Visual Question Answering (VQA) on InfographicVQA (using extra training data)

document understanding Image Reconstruction +1

Mutual Exclusivity Training and Primitive Augmentation to Induce Compositionality

1 code implementation28 Nov 2022 Yichen Jiang, Xiang Zhou, Mohit Bansal

Recent datasets expose the lack of the systematic generalization ability in standard sequence-to-sequence models.

Data Augmentation Inductive Bias +1

Perceiver-VL: Efficient Vision-and-Language Modeling with Iterative Latent Attention

1 code implementation21 Nov 2022 Zineng Tang, Jaemin Cho, Jie Lei, Mohit Bansal

We present Perceiver-VL, a vision-and-language framework that efficiently handles high-dimensional multimodal inputs such as long videos and text.

Cross-Modal Retrieval Language Modelling +1

Evaluating the Factual Consistency of Large Language Models Through News Summarization

1 code implementation15 Nov 2022 Derek Tam, Anisha Mascarenhas, Shiyue Zhang, Sarah Kwan, Mohit Bansal, Colin Raffel

To generate summaries that are factually inconsistent, we generate summaries from a suite of summarization models that we have manually annotated as factually inconsistent.

News Summarization

Are Hard Examples also Harder to Explain? A Study with Human and Model-Generated Explanations

1 code implementation14 Nov 2022 Swarnadeep Saha, Peter Hase, Nazneen Rajani, Mohit Bansal

We observe that (1) GPT-3 explanations are as grammatical as human explanations regardless of the hardness of the test samples, (2) for easy examples, GPT-3 generates highly supportive explanations but human explanations are more generalizable, and (3) for hard examples, human explanations are significantly better than GPT-3 explanations both in terms of label-supportiveness and generalizability judgements.

Evaluating and Improving Factuality in Multimodal Abstractive Summarization

1 code implementation4 Nov 2022 David Wan, Mohit Bansal

Current metrics for evaluating factuality for abstractive document summarization have achieved high correlations with human judgment, but they do not account for the vision modality and thus are not adequate for vision-and-language summarization.

Abstractive Text Summarization Document Summarization

Exclusive Supermask Subnetwork Training for Continual Learning

1 code implementation18 Oct 2022 Prateek Yadav, Mohit Bansal

Although there is no forgetting, the performance of SupSup is sub-optimal because fixed weights restrict its representational power.

Continual Learning Text Classification +1

TVLT: Textless Vision-Language Transformer

2 code implementations28 Sep 2022 Zineng Tang, Jaemin Cho, Yixin Nie, Mohit Bansal

In this work, we present the Textless Vision-Language Transformer (TVLT), where homogeneous transformer blocks take raw visual and audio inputs for vision-and-language representation learning with minimal modality-specific design, and do not use text-specific modules such as tokenization or automatic speech recognition (ASR).

Automatic Speech Recognition (ASR) Image Retrieval +6

Summarization Programs: Interpretable Abstractive Summarization with Neural Modular Trees

1 code implementation21 Sep 2022 Swarnadeep Saha, Shiyue Zhang, Peter Hase, Mohit Bansal

We demonstrate that SP-Search effectively represents the generative process behind human summaries using modules that are typically faithful to their intended behavior.

Abstractive Text Summarization Sentence +1

StoryDALL-E: Adapting Pretrained Text-to-Image Transformers for Story Continuation

1 code implementation13 Sep 2022 Adyasha Maharana, Darryl Hannan, Mohit Bansal

Hence, we first propose the task of story continuation, where the generated visual story is conditioned on a source image, allowing for better generalization to narratives with new characters.

Image Generation Story Continuation +2

Extractive is not Faithful: An Investigation of Broad Unfaithfulness Problems in Extractive Summarization

1 code implementation8 Sep 2022 Shiyue Zhang, David Wan, Mohit Bansal

Though extractive summarization is less prone to the common unfaithfulness issues of abstractive summaries, does that mean extractive is equal to faithful?

Abstractive Text Summarization Extractive Summarization

WinoGAViL: Gamified Association Benchmark to Challenge Vision-and-Language Models

1 code implementation25 Jul 2022 Yonatan Bitton, Nitzan Bitton Guetta, Ron Yosef, Yuval Elovici, Mohit Bansal, Gabriel Stanovsky, Roy Schwartz

While vision-and-language models perform well on tasks such as visual question answering, they struggle when it comes to basic human commonsense reasoning skills.

Common Sense Reasoning General Knowledge +4

SETSum: Summarization and Visualization of Student Evaluations of Teaching

1 code implementation NAACL (ACL) 2022 Yinuo Hu, Shiyue Zhang, Viji Sathy, A. T. Panter, Mohit Bansal

Ten university professors from diverse departments serve as evaluators of the system and all agree that SETSum helps them interpret SET results more efficiently; and 6 out of 10 instructors prefer our system over the standard static PDF report (while the remaining 4 would like to have both).

Aspect Extraction Sentiment Analysis

CoSIm: Commonsense Reasoning for Counterfactual Scene Imagination

1 code implementation NAACL 2022 Hyounghun Kim, Abhay Zala, Mohit Bansal

Next, a counterfactual imagined scene change (in textual form) is applied, and the model has to predict the new response to the initial question based on this scene change.


CLEAR: Improving Vision-Language Navigation with Cross-Lingual, Environment-Agnostic Representations

1 code implementation Findings (NAACL) 2022 Jialu Li, Hao Tan, Mohit Bansal

Empirically, on the Room-Across-Room dataset, we show that our multilingual agent gets large improvements in all metrics over the strong baseline model when generalizing to unseen environments with the cross-lingual language representation and the environment-agnostic visual representation.

Navigate Representation Learning +2

Masked Part-Of-Speech Model: Does Modeling Long Context Help Unsupervised POS-tagging?

1 code implementation NAACL 2022 Xiang Zhou, Shiyue Zhang, Mohit Bansal

MPoSM can model arbitrary tag dependency and perform POS induction through the objective of masked POS reconstruction.

POS POS Tagging +1

VisFIS: Visual Feature Importance Supervision with Right-for-the-Right-Reason Objectives

1 code implementation22 Jun 2022 Zhuofan Ying, Peter Hase, Mohit Bansal

In this paper, we show that model FI supervision can meaningfully improve VQA model accuracy as well as performance on several Right-for-the-Right-Reason (RRR) metrics by optimizing for four key model objectives: (1) accurate predictions given limited but sufficient information (Sufficiency); (2) max-entropy predictions given no important information (Uncertainty); (3) invariance of predictions to changes in unimportant features (Invariance); and (4) alignment between model FI explanations and human FI explanations (Plausibility).

Feature Importance Question Answering +2

Enhanced Knowledge Selection for Grounded Dialogues via Document Semantic Graphs

no code implementations15 Jun 2022 Sha Li, Mahdi Namazifar, Di Jin, Mohit Bansal, Heng Ji, Yang Liu, Dilek Hakkani-Tur

Providing conversation models with background knowledge has been shown to make open-domain dialogues more informative and engaging.

Multi-Task Learning Response Generation +1

LST: Ladder Side-Tuning for Parameter and Memory Efficient Transfer Learning

2 code implementations13 Jun 2022 Yi-Lin Sung, Jaemin Cho, Mohit Bansal

LST saves 69% of the memory costs to fine-tune the whole network, while other methods only save 26% of that in similar parameter usages (hence, 2. 7x more memory savings).

Transfer Learning Visual Question Answering (VQA)

Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models

3 code implementations9 Jun 2022 Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, Abubakar Abid, Adam Fisch, Adam R. Brown, Adam Santoro, Aditya Gupta, Adrià Garriga-Alonso, Agnieszka Kluska, Aitor Lewkowycz, Akshat Agarwal, Alethea Power, Alex Ray, Alex Warstadt, Alexander W. Kocurek, Ali Safaya, Ali Tazarv, Alice Xiang, Alicia Parrish, Allen Nie, Aman Hussain, Amanda Askell, Amanda Dsouza, Ambrose Slone, Ameet Rahane, Anantharaman S. Iyer, Anders Andreassen, Andrea Madotto, Andrea Santilli, Andreas Stuhlmüller, Andrew Dai, Andrew La, Andrew Lampinen, Andy Zou, Angela Jiang, Angelica Chen, Anh Vuong, Animesh Gupta, Anna Gottardi, Antonio Norelli, Anu Venkatesh, Arash Gholamidavoodi, Arfa Tabassum, Arul Menezes, Arun Kirubarajan, Asher Mullokandov, Ashish Sabharwal, Austin Herrick, Avia Efrat, Aykut Erdem, Ayla Karakaş, B. Ryan Roberts, Bao Sheng Loe, Barret Zoph, Bartłomiej Bojanowski, Batuhan Özyurt, Behnam Hedayatnia, Behnam Neyshabur, Benjamin Inden, Benno Stein, Berk Ekmekci, Bill Yuchen Lin, Blake Howald, Bryan Orinion, Cameron Diao, Cameron Dour, Catherine Stinson, Cedrick Argueta, César Ferri Ramírez, Chandan Singh, Charles Rathkopf, Chenlin Meng, Chitta Baral, Chiyu Wu, Chris Callison-Burch, Chris Waites, Christian Voigt, Christopher D. Manning, Christopher Potts, Cindy Ramirez, Clara E. Rivera, Clemencia Siro, Colin Raffel, Courtney Ashcraft, Cristina Garbacea, Damien Sileo, Dan Garrette, Dan Hendrycks, Dan Kilman, Dan Roth, Daniel Freeman, Daniel Khashabi, Daniel Levy, Daniel Moseguí González, Danielle Perszyk, Danny Hernandez, Danqi Chen, Daphne Ippolito, Dar Gilboa, David Dohan, David Drakard, David Jurgens, Debajyoti Datta, Deep Ganguli, Denis Emelin, Denis Kleyko, Deniz Yuret, Derek Chen, Derek Tam, Dieuwke Hupkes, Diganta Misra, Dilyar Buzan, Dimitri Coelho Mollo, Diyi Yang, Dong-Ho Lee, Dylan Schrader, Ekaterina Shutova, Ekin Dogus Cubuk, Elad Segal, Eleanor Hagerman, Elizabeth Barnes, Elizabeth Donoway, Ellie Pavlick, Emanuele Rodola, Emma Lam, Eric Chu, Eric Tang, Erkut Erdem, Ernie Chang, Ethan A. Chi, Ethan Dyer, Ethan Jerzak, Ethan Kim, Eunice Engefu Manyasi, Evgenii Zheltonozhskii, Fanyue Xia, Fatemeh Siar, Fernando Martínez-Plumed, Francesca Happé, Francois Chollet, Frieda Rong, Gaurav Mishra, Genta Indra Winata, Gerard de Melo, Germán Kruszewski, Giambattista Parascandolo, Giorgio Mariani, Gloria Wang, Gonzalo Jaimovitch-López, Gregor Betz, Guy Gur-Ari, Hana Galijasevic, Hannah Kim, Hannah Rashkin, Hannaneh Hajishirzi, Harsh Mehta, Hayden Bogar, Henry Shevlin, Hinrich Schütze, Hiromu Yakura, Hongming Zhang, Hugh Mee Wong, Ian Ng, Isaac Noble, Jaap Jumelet, Jack Geissinger, Jackson Kernion, Jacob Hilton, Jaehoon Lee, Jaime Fernández Fisac, James B. Simon, James Koppel, James Zheng, James Zou, Jan Kocoń, Jana Thompson, Janelle Wingfield, Jared Kaplan, Jarema Radom, Jascha Sohl-Dickstein, Jason Phang, Jason Wei, Jason Yosinski, Jekaterina Novikova, Jelle Bosscher, Jennifer Marsh, Jeremy Kim, Jeroen Taal, Jesse Engel, Jesujoba Alabi, Jiacheng Xu, Jiaming Song, Jillian Tang, Joan Waweru, John Burden, John Miller, John U. Balis, Jonathan Batchelder, Jonathan Berant, Jörg Frohberg, Jos Rozen, Jose Hernandez-Orallo, Joseph Boudeman, Joseph Guerr, Joseph Jones, Joshua B. Tenenbaum, Joshua S. Rule, Joyce Chua, Kamil Kanclerz, Karen Livescu, Karl Krauth, Karthik Gopalakrishnan, Katerina Ignatyeva, Katja Markert, Kaustubh D. Dhole, Kevin Gimpel, Kevin Omondi, Kory Mathewson, Kristen Chiafullo, Ksenia Shkaruta, Kumar Shridhar, Kyle McDonell, Kyle Richardson, Laria Reynolds, Leo Gao, Li Zhang, Liam Dugan, Lianhui Qin, Lidia Contreras-Ochando, Louis-Philippe Morency, Luca Moschella, Lucas Lam, Lucy Noble, Ludwig Schmidt, Luheng He, Luis Oliveros Colón, Luke Metz, Lütfi Kerem Şenel, Maarten Bosma, Maarten Sap, Maartje ter Hoeve, Maheen Farooqi, Manaal Faruqui, Mantas Mazeika, Marco Baturan, Marco Marelli, Marco Maru, Maria Jose Ramírez Quintana, Marie Tolkiehn, Mario Giulianelli, Martha Lewis, Martin Potthast, Matthew L. Leavitt, Matthias Hagen, Mátyás Schubert, Medina Orduna Baitemirova, Melody Arnaud, Melvin McElrath, Michael A. Yee, Michael Cohen, Michael Gu, Michael Ivanitskiy, Michael Starritt, Michael Strube, Michał Swędrowski, Michele Bevilacqua, Michihiro Yasunaga, Mihir Kale, Mike Cain, Mimee Xu, Mirac Suzgun, Mitch Walker, Mo Tiwari, Mohit Bansal, Moin Aminnaseri, Mor Geva, Mozhdeh Gheini, Mukund Varma T, Nanyun Peng, Nathan A. Chi, Nayeon Lee, Neta Gur-Ari Krakover, Nicholas Cameron, Nicholas Roberts, Nick Doiron, Nicole Martinez, Nikita Nangia, Niklas Deckers, Niklas Muennighoff, Nitish Shirish Keskar, Niveditha S. Iyer, Noah Constant, Noah Fiedel, Nuan Wen, Oliver Zhang, Omar Agha, Omar Elbaghdadi, Omer Levy, Owain Evans, Pablo Antonio Moreno Casares, Parth Doshi, Pascale Fung, Paul Pu Liang, Paul Vicol, Pegah Alipoormolabashi, Peiyuan Liao, Percy Liang, Peter Chang, Peter Eckersley, Phu Mon Htut, Pinyu Hwang, Piotr Miłkowski, Piyush Patil, Pouya Pezeshkpour, Priti Oli, Qiaozhu Mei, Qing Lyu, Qinlang Chen, Rabin Banjade, Rachel Etta Rudolph, Raefer Gabriel, Rahel Habacker, Ramon Risco, Raphaël Millière, Rhythm Garg, Richard Barnes, Rif A. Saurous, Riku Arakawa, Robbe Raymaekers, Robert Frank, Rohan Sikand, Roman Novak, Roman Sitelew, Ronan LeBras, Rosanne Liu, Rowan Jacobs, Rui Zhang, Ruslan Salakhutdinov, Ryan Chi, Ryan Lee, Ryan Stovall, Ryan Teehan, Rylan Yang, Sahib Singh, Saif M. Mohammad, Sajant Anand, Sam Dillavou, Sam Shleifer, Sam Wiseman, Samuel Gruetter, Samuel R. Bowman, Samuel S. Schoenholz, Sanghyun Han, Sanjeev Kwatra, Sarah A. Rous, Sarik Ghazarian, Sayan Ghosh, Sean Casey, Sebastian Bischoff, Sebastian Gehrmann, Sebastian Schuster, Sepideh Sadeghi, Shadi Hamdan, Sharon Zhou, Shashank Srivastava, Sherry Shi, Shikhar Singh, Shima Asaadi, Shixiang Shane Gu, Shubh Pachchigar, Shubham Toshniwal, Shyam Upadhyay, Shyamolima, Debnath, Siamak Shakeri, Simon Thormeyer, Simone Melzi, Siva Reddy, Sneha Priscilla Makini, Soo-Hwan Lee, Spencer Torene, Sriharsha Hatwar, Stanislas Dehaene, Stefan Divic, Stefano Ermon, Stella Biderman, Stephanie Lin, Stephen Prasad, Steven T. Piantadosi, Stuart M. Shieber, Summer Misherghi, Svetlana Kiritchenko, Swaroop Mishra, Tal Linzen, Tal Schuster, Tao Li, Tao Yu, Tariq Ali, Tatsu Hashimoto, Te-Lin Wu, Théo Desbordes, Theodore Rothschild, Thomas Phan, Tianle Wang, Tiberius Nkinyili, Timo Schick, Timofei Kornev, Titus Tunduny, Tobias Gerstenberg, Trenton Chang, Trishala Neeraj, Tushar Khot, Tyler Shultz, Uri Shaham, Vedant Misra, Vera Demberg, Victoria Nyamai, Vikas Raunak, Vinay Ramasesh, Vinay Uday Prabhu, Vishakh Padmakumar, Vivek Srikumar, William Fedus, William Saunders, William Zhang, Wout Vossen, Xiang Ren, Xiaoyu Tong, Xinran Zhao, Xinyi Wu, Xudong Shen, Yadollah Yaghoobzadeh, Yair Lakretz, Yangqiu Song, Yasaman Bahri, Yejin Choi, Yichi Yang, Yiding Hao, Yifu Chen, Yonatan Belinkov, Yu Hou, Yufang Hou, Yuntao Bai, Zachary Seid, Zhuoye Zhao, Zijian Wang, Zijie J. Wang, ZiRui Wang, Ziyi Wu

BIG-bench focuses on tasks that are believed to be beyond the capabilities of current language models.

Common Sense Reasoning Math +1

Revealing Single Frame Bias for Video-and-Language Learning

2 code implementations7 Jun 2022 Jie Lei, Tamara L. Berg, Mohit Bansal

Training an effective video-and-language model intuitively requires multiple frames as model inputs.

Ranked #5 on Video Retrieval on SSv2-template retrieval (using extra training data)

Fine-grained Action Recognition Language Modelling +6

Fine-grained Image Captioning with CLIP Reward

1 code implementation Findings (NAACL) 2022 Jaemin Cho, Seunghyun Yoon, Ajinkya Kale, Franck Dernoncourt, Trung Bui, Mohit Bansal

Toward more descriptive and distinctive caption generation, we propose using CLIP, a multimodal encoder trained on huge image-text pairs from web, to calculate multimodal similarity and use it as a reward function.

Caption Generation Descriptive +5

Language Models with Image Descriptors are Strong Few-Shot Video-Language Learners

1 code implementation22 May 2022 Zhenhailong Wang, Manling Li, Ruochen Xu, Luowei Zhou, Jie Lei, Xudong Lin, Shuohang Wang, ZiYi Yang, Chenguang Zhu, Derek Hoiem, Shih-Fu Chang, Mohit Bansal, Heng Ji

The goal of this work is to build flexible video-language models that can generalize to various video-to-text tasks from few examples, such as domain-specific captioning, question answering, and future event prediction.

Attribute Automatic Speech Recognition +6

On the Limits of Evaluating Embodied Agent Model Generalization Using Validation Sets

no code implementations insights (ACL) 2022 Hyounghun Kim, Aishwarya Padmakumar, Di Jin, Mohit Bansal, Dilek Hakkani-Tur

Natural language guided embodied task completion is a challenging problem since it requires understanding natural language instructions, aligning them with egocentric visual observations, and choosing appropriate actions to execute in the environment to produce desired changes.

FactPEGASUS: Factuality-Aware Pre-training and Fine-tuning for Abstractive Summarization

1 code implementation NAACL 2022 David Wan, Mohit Bansal

We present FactPEGASUS, an abstractive summarization model that addresses the problem of factuality during pre-training and fine-tuning: (1) We augment the sentence selection strategy of PEGASUS's (Zhang et al., 2020) pre-training objective to create pseudo-summaries that are both important and factual; (2) We introduce three complementary components for fine-tuning.

Abstractive Text Summarization Contrastive Learning +1

Few-Shot Parameter-Efficient Fine-Tuning is Better and Cheaper than In-Context Learning

2 code implementations11 May 2022 Haokun Liu, Derek Tam, Mohammed Muqeeth, Jay Mohta, Tenghao Huang, Mohit Bansal, Colin Raffel

ICL incurs substantial computational, memory, and storage costs because it involves processing all of the training examples every time a prediction is made.

Few-Shot Text Classification In-Context Learning

How can NLP Help Revitalize Endangered Languages? A Case Study and Roadmap for the Cherokee Language

1 code implementation ACL 2022 Shiyue Zhang, Ben Frey, Mohit Bansal

We hope that our work serves not only to inform the NLP community about Cherokee, but also to provide inspiration for future work on endangered languages in general.

FactGraph: Evaluating Factuality in Summarization with Semantic Graph Representations

2 code implementations NAACL 2022 Leonardo F. R. Ribeiro, Mengwen Liu, Iryna Gurevych, Markus Dreyer, Mohit Bansal

Despite recent improvements in abstractive summarization, most current approaches generate summaries that are not factually consistent with the source document, severely restricting their trust and usage in real-world applications.

Abstractive Text Summarization

Explanation Graph Generation via Pre-trained Language Models: An Empirical Study with Contrastive Learning

1 code implementation ACL 2022 Swarnadeep Saha, Prateek Yadav, Mohit Bansal

In this work, we study pre-trained language models that generate explanation graphs in an end-to-end manner and analyze their ability to learn the structural constraints and semantics of such graphs.

Contrastive Learning Graph Generation +1

EnvEdit: Environment Editing for Vision-and-Language Navigation

1 code implementation CVPR 2022 Jialu Li, Hao Tan, Mohit Bansal

Training on these edit-augmented environments prevents the agent from overfitting to existing environments and helps generalize better to new, unseen environments.

Ranked #2 on Vision and Language Navigation on RxR (using extra training data)

Data Augmentation Navigate +1

GrIPS: Gradient-free, Edit-based Instruction Search for Prompting Large Language Models

2 code implementations14 Mar 2022 Archiki Prasad, Peter Hase, Xiang Zhou, Mohit Bansal

Providing natural language instructions in prompts is a useful new paradigm for improving task performance of large language models in a zero-shot setting.

CAISE: Conversational Agent for Image Search and Editing

1 code implementation24 Feb 2022 Hyounghun Kim, Doo Soon Kim, Seunghyun Yoon, Franck Dernoncourt, Trung Bui, Mohit Bansal

To our knowledge, this is the first dataset that provides conversational image search and editing annotations, where the agent holds a grounded conversation with users and helps them to search and edit images according to their requests.

Image Retrieval

DALL-Eval: Probing the Reasoning Skills and Social Biases of Text-to-Image Generation Models

2 code implementations ICCV 2023 Jaemin Cho, Abhay Zala, Mohit Bansal

In this work, we investigate the visual reasoning capabilities and social biases of different text-to-image models, covering both multimodal transformer language models and diffusion models.

Image Captioning Image Classification +9

MuMuQA: Multimedia Multi-Hop News Question Answering via Cross-Media Knowledge Extraction and Grounding

2 code implementations20 Dec 2021 Revanth Gangi Reddy, Xilin Rui, Manling Li, Xudong Lin, Haoyang Wen, Jaemin Cho, Lifu Huang, Mohit Bansal, Avirup Sil, Shih-Fu Chang, Alexander Schwing, Heng Ji

Specifically, the task involves multi-hop questions that require reasoning over image-caption pairs to identify the grounded visual object being referred to and then predicting a span from the news body text to answer the question.

Answer Generation Data Augmentation +2

Analyzing the Limits of Self-Supervision in Handling Bias in Language

no code implementations16 Dec 2021 Lisa Bauer, Karthik Gopalakrishnan, Spandana Gella, Yang Liu, Mohit Bansal, Dilek Hakkani-Tur

We define three broad classes of task descriptions for these tasks: statement, question, and completion, with numerous lexical variants within each class.

Learning and Analyzing Generation Order for Undirected Sequence Models

1 code implementation Findings (EMNLP) 2021 Yichen Jiang, Mohit Bansal

On examples with a maximum source and target length of 30 from De-En, WMT'16 English-Romanian, and WMT'21 English-Chinese translation tasks, our learned order outperforms all heuristic generation orders on four out of six tasks.

Machine Translation Translation

Proposition-Level Clustering for Multi-Document Summarization

2 code implementations NAACL 2022 Ori Ernst, Avi Caciularu, Ori Shapira, Ramakanth Pasunuru, Mohit Bansal, Jacob Goldberger, Ido Dagan

Text clustering methods were traditionally incorporated into multi-document summarization (MDS) as a means for coping with considerable information repetition.

Clustering Document Summarization +3

VL-Adapter: Parameter-Efficient Transfer Learning for Vision-and-Language Tasks

1 code implementation CVPR 2022 Yi-Lin Sung, Jaemin Cho, Mohit Bansal

Our results demonstrate that training the adapter with the weight-sharing technique (4. 18% of total parameters for image-text tasks and 3. 39% for video-text tasks) can match the performance of fine-tuning the entire model.

Image Captioning Transfer Learning

MLP Architectures for Vision-and-Language Modeling: An Empirical Study

1 code implementation8 Dec 2021 Yixin Nie, Linjie Li, Zhe Gan, Shuohang Wang, Chenguang Zhu, Michael Zeng, Zicheng Liu, Mohit Bansal, Lijuan Wang

Based on this, we ask an even bolder question: can we have an all-MLP architecture for VL modeling, where both VL fusion and the vision encoder are replaced with MLPs?

Language Modelling Visual Question Answering (VQA)

Do Language Models Have Beliefs? Methods for Detecting, Updating, and Visualizing Model Beliefs

1 code implementation26 Nov 2021 Peter Hase, Mona Diab, Asli Celikyilmaz, Xian Li, Zornitsa Kozareva, Veselin Stoyanov, Mohit Bansal, Srinivasan Iyer

In this paper, we discuss approaches to detecting when models have beliefs about the world, and we improve on methods for updating model beliefs to be more truthful, with a focus on methods based on learned optimizers or hypernetworks.

Low-Cost Algorithmic Recourse for Users With Uncertain Cost Functions

1 code implementation1 Nov 2021 Prateek Yadav, Peter Hase, Mohit Bansal

Current approaches try to optimize for the cost incurred by users when adopting a recourse, but they assume that all users share the same cost function.


Integrating Visuospatial, Linguistic and Commonsense Structure into Story Visualization

1 code implementation21 Oct 2021 Adyasha Maharana, Mohit Bansal

Prior work in this domain has shown that there is ample room for improvement in the generated image sequence in terms of visual quality, consistency and relevance.

Dense Captioning Image Generation +1

Inducing Transformer's Compositional Generalization Ability via Auxiliary Sequence Prediction Tasks

1 code implementation30 Sep 2021 Yichen Jiang, Mohit Bansal

Motivated by the failure of a Transformer model on the SCAN compositionality challenge (Lake and Baroni, 2018), which requires parsing a command into actions, we propose two auxiliary sequence prediction tasks that track the progress of function and argument semantics, as additional training supervision.

Finding a Balanced Degree of Automation for Summary Evaluation

1 code implementation EMNLP 2021 Shiyue Zhang, Mohit Bansal

In this work, we propose flexible semiautomatic to automatic summary evaluation metrics, following the Pyramid human evaluation method.

Natural Language Inference Semantic Role Labeling

Continuous Language Generative Flow

1 code implementation ACL 2021 Zineng Tang, Shiyue Zhang, Hyounghun Kim, Mohit Bansal

Recent years have witnessed various types of generative models for natural language generation (NLG), especially RNNs or transformer based sequence-to-sequence models, as well as variational autoencoder (VAE) and generative adversarial network (GAN) based models.

Data Augmentation Density Estimation +9

EmailSum: Abstractive Email Thread Summarization

1 code implementation ACL 2021 Shiyue Zhang, Asli Celikyilmaz, Jianfeng Gao, Mohit Bansal

Furthermore, we find that widely used automatic evaluation metrics (ROUGE, BERTScore) are weakly correlated with human judgments on this email thread summarization task.

Abstractive Text Summarization Email Thread Summarization

ChrEnTranslate: Cherokee-English Machine Translation Demo with Quality Estimation and Corrective Feedback

2 code implementations ACL 2021 Shiyue Zhang, Benjamin Frey, Mohit Bansal

The quantitative evaluation demonstrates that our backbone translation models achieve state-of-the-art translation performance and our quality estimation well correlates with both BLEU and human judgment.

Machine Translation NMT +3

MTVR: Multilingual Moment Retrieval in Videos

1 code implementation ACL 2021 Jie Lei, Tamara L. Berg, Mohit Bansal

We introduce mTVR, a large-scale multilingual video moment retrieval dataset, containing 218K English and Chinese queries from 21. 8K TV show video clips.

Moment Retrieval Retrieval

QVHighlights: Detecting Moments and Highlights in Videos via Natural Language Queries

4 code implementations20 Jul 2021 Jie Lei, Tamara L. Berg, Mohit Bansal

Each video in the dataset is annotated with: (1) a human-written free-form NL query, (2) relevant moments in the video w. r. t.

Highlight Detection Moment Retrieval +2

How Much Can CLIP Benefit Vision-and-Language Tasks?

4 code implementations13 Jul 2021 Sheng Shen, Liunian Harold Li, Hao Tan, Mohit Bansal, Anna Rohrbach, Kai-Wei Chang, Zhewei Yao, Kurt Keutzer

Most existing Vision-and-Language (V&L) models rely on pre-trained visual encoders, using a relatively small set of manually-annotated data (as compared to web-crawled data), to perceive the visual world.

Ranked #4 on Vision and Language Navigation on RxR (using extra training data)