Search Results for author: Hao Tan

Found 19 papers, 14 papers with code

How Much Can CLIP Benefit Vision-and-Language Tasks?

2 code implementations 13 Jul 2021 Sheng Shen, Liunian Harold Li, Hao Tan, Mohit Bansal, Anna Rohrbach, Kai-Wei Chang, Zhewei Yao, Kurt Keutzer

Most existing Vision-and-Language (V&L) models rely on pre-trained visual encoders, using a relatively small set of manually-annotated data (as compared to web-crawled data), to perceive the visual world.

 Ranked #1 on Visual Entailment on SNLI-VE val (using extra training data)

Question Answering Visual Entailment +1
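
As a rough illustration of the paper's central idea (using CLIP's image encoder as the visual backbone for V&L tasks), here is a minimal feature-extraction sketch with the public openai/CLIP package; the image path is a placeholder and the downstream task head is omitted.

```python
import torch
import clip                      # pip install git+https://github.com/openai/CLIP.git
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# CLIP's image encoder stands in for the usual region-based visual encoder; a
# task-specific head (e.g., for VQA or visual entailment) would consume these features.
image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)
with torch.no_grad():
    visual_features = model.encode_image(image)   # (1, 512) for ViT-B/32
```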

VidLanKD: Improving Language Understanding via Video-Distilled Knowledge Transfer

1 code implementation 6 Jul 2021 Zineng Tang, Jaemin Cho, Hao Tan, Mohit Bansal

We train a multi-modal teacher model on a video-text dataset, and then transfer its knowledge to a student language model with a text dataset.

Image Retrieval Knowledge Distillation +4
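
VidLanKD explores several distillation objectives; as a generic illustration of transferring a video-grounded teacher's knowledge to a text-only student, here is a standard soft-label distillation loss, a sketch rather than the paper's exact objective.

```python
import torch.nn.functional as F

def soft_label_kd_loss(student_logits, teacher_logits, temperature=2.0):
    """Train the student to match the (video-grounded) teacher's output
    distribution; the temperature value is illustrative."""
    t = temperature
    log_student = F.log_softmax(student_logits / t, dim=-1)
    soft_teacher = F.softmax(teacher_logits / t, dim=-1)
    return F.kl_div(log_student, soft_teacher, reduction="batchmean") * (t * t)
```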

VIMPAC: Video Pre-Training via Masked Token Prediction and Contrastive Learning

1 code implementation 21 Jun 2021 Hao Tan, Jie Lei, Thomas Wolf, Mohit Bansal

Unlike language, where the text tokens are more independent, neighboring video tokens typically have strong correlations (e.g., consecutive video frames usually look very similar), and hence uniformly masking individual tokens will make the task too trivial to learn useful representations.

Action Classification Action Recognition +2
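
Since neighboring video tokens are highly correlated, the abstract argues for masking contiguous regions rather than independent tokens. Below is a minimal block-masking sketch over a (T, H, W) token grid; the grid shape, block size, and mask ratio are illustrative, not the paper's settings.

```python
import torch

def block_mask(grid=(5, 8, 8), block=(1, 4, 4), mask_ratio=0.5):
    """Mask contiguous spatio-temporal blocks so the prediction task cannot be
    solved by copying an unmasked neighbor."""
    T, H, W = grid
    bt, bh, bw = block
    mask = torch.zeros(grid, dtype=torch.bool)
    n_blocks = int(mask_ratio * (T // bt) * (H // bh) * (W // bw))
    for _ in range(n_blocks):
        t0 = torch.randint(0, T - bt + 1, (1,)).item()
        h0 = torch.randint(0, H - bh + 1, (1,)).item()
        w0 = torch.randint(0, W - bw + 1, (1,)).item()
        mask[t0:t0 + bt, h0:h0 + bh, w0:w0 + bw] = True
    return mask
```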

Improving Cross-Modal Alignment in Vision Language Navigation via Syntactic Information

1 code implementation NAACL 2021 Jialu Li, Hao Tan, Mohit Bansal

One key challenge in this task is to ground instructions with the current visual information that the agent perceives.

Vision-Language Navigation

Unifying Vision-and-Language Tasks via Text Generation

1 code implementation 4 Feb 2021 Jaemin Cho, Jie Lei, Hao Tan, Mohit Bansal

On 7 popular vision-and-language benchmarks, including visual question answering, referring expression comprehension, and visual commonsense reasoning, most of which have previously been modeled as discriminative tasks, our generative approach (with a single unified architecture) reaches performance comparable to recent task-specific state-of-the-art vision-and-language models.

Conditional Text Generation Image Captioning +6
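
A text-only sketch of the unified text-to-text framing, using an off-the-shelf T5 from Hugging Face transformers: every task becomes "task prefix + input text -> output text". The actual model additionally feeds visual region features into the encoder, the task prefix shown is hypothetical, and without fine-tuning the generated answer is meaningless; the point is the shared generative interface.

```python
from transformers import T5ForConditionalGeneration, T5TokenizerFast

tokenizer = T5TokenizerFast.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# VQA, referring expressions, captioning, etc. all share this interface.
prompt = "vqa: question: what is the man holding?"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids
output_ids = model.generate(input_ids, max_new_tokens=8)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```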

ArraMon: A Joint Navigation-Assembly Instruction Interpretation Task in Dynamic Environments

no code implementations Findings of the Association for Computational Linguistics 2020 Hyounghun Kim, Abhay Zala, Graham Burri, Hao Tan, Mohit Bansal

During this task, the agent (similar to a PokeMON GO player) is asked to find and collect different target objects one-by-one by navigating based on natural language instructions in a complex, realistic outdoor environment, but then also ARRAnge the collected objects part-by-part in an egocentric grid-layout environment.

Referring Expression Comprehension Vision and Language Navigation

Vokenization: Improving Language Understanding with Contextualized, Visual-Grounded Supervision

1 code implementation EMNLP 2020 Hao Tan, Mohit Bansal

We find that the main reason hindering this exploration is the large divergence in magnitude and distributions between the visually-grounded language datasets and pure-language corpora.

Image Captioning Language Modelling

RelativeNAS: Relative Neural Architecture Search via Slow-Fast Learning

2 code implementations 14 Sep 2020 Hao Tan, Ran Cheng, Shihua Huang, Cheng He, Changxiao Qiu, Fan Yang, Ping Luo

Despite the remarkable successes of Convolutional Neural Networks (CNNs) in computer vision, it is time-consuming and error-prone to manually design a CNN.

Keypoint Detection Neural Architecture Search +2

Diagnosing the Environment Bias in Vision-and-Language Navigation

1 code implementation 6 May 2020 Yubo Zhang, Hao Tan, Mohit Bansal

Vision-and-Language Navigation (VLN) requires an agent to follow natural-language instructions, explore the given environments, and reach the desired target locations.

Vision and Language Navigation

The Curse of Performance Instability in Analysis Datasets: Consequences, Source, and Suggestions

1 code implementation EMNLP 2020 Xiang Zhou, Yixin Nie, Hao Tan, Mohit Bansal

For the first question, we conduct a thorough empirical study over analysis sets and find that in addition to the unstable final performance, the instability exists all along the training curve.

Model Selection Natural Language Inference +1

Modality-Balanced Models for Visual Dialogue

no code implementations 17 Jan 2020 Hyounghun Kim, Hao Tan, Mohit Bansal

The Visual Dialog task requires a model to exploit both image and conversational context information to generate the next response to the dialogue.

Visual Dialog

LXMERT: Learning Cross-Modality Encoder Representations from Transformers

6 code implementations IJCNLP 2019 Hao Tan, Mohit Bansal

In LXMERT, we build a large-scale Transformer model that consists of three encoders: an object relationship encoder, a language encoder, and a cross-modality encoder.

Language Modelling Question Answering +2
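
A schematic PyTorch sketch of the three-encoder layout named in the abstract (language encoder, object-relationship encoder, cross-modality encoder); the dimensions, layer counts, and single cross-attention step are placeholders, not the paper's configuration.

```python
import torch.nn as nn

class MiniLXMERT(nn.Module):
    """Schematic three-encoder layout; sizes are illustrative only."""
    def __init__(self, dim=768, heads=12, n_lang=2, n_vis=2):
        super().__init__()
        make_layer = lambda: nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.language_encoder = nn.TransformerEncoder(make_layer(), num_layers=n_lang)
        self.object_relationship_encoder = nn.TransformerEncoder(make_layer(), num_layers=n_vis)
        # Cross-modality encoder: each stream attends to the other.
        self.lang_to_vis = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.vis_to_lang = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, lang_embeds, vis_embeds):
        lang = self.language_encoder(lang_embeds)            # (B, L, dim)
        vis = self.object_relationship_encoder(vis_embeds)   # (B, N, dim)
        lang_x, _ = self.lang_to_vis(lang, vis, vis)         # language attends to objects
        vis_x, _ = self.vis_to_lang(vis, lang, lang)         # objects attend to language
        return lang_x, vis_x
```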

Expressing Visual Relationships via Language

1 code implementation ACL 2019 Hao Tan, Franck Dernoncourt, Zhe Lin, Trung Bui, Mohit Bansal

To push forward the research in this direction, we first introduce a new language-guided image editing dataset that contains a large number of real image pairs with corresponding editing instructions.

Image Captioning

Enabling Robots to Understand Incomplete Natural Language Instructions Using Commonsense Reasoning

no code implementations 29 Apr 2019 Haonan Chen, Hao Tan, Alan Kuntz, Mohit Bansal, Ron Alterovitz

Our results show the feasibility of a robot learning commonsense knowledge automatically from web-based textual corpora, and the power of learned commonsense reasoning models in enabling a robot to autonomously perform tasks based on incomplete natural language instructions.

Common Sense Reasoning Language Modelling

Learning to Navigate Unseen Environments: Back Translation with Environmental Dropout

1 code implementation NAACL 2019 Hao Tan, Licheng Yu, Mohit Bansal

Next, we apply semi-supervised learning (via back-translation) on these dropped-out environments to generate new paths and instructions.

Translation Vision-Language Navigation
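
A hypothetical sketch of the semi-supervised loop described in the abstract: apply environmental dropout to the visual features, sample new paths in the altered environment, have a trained speaker "back-translate" them into instructions, and train the follower on the synthetic pairs. The environment/speaker/follower API below is invented for illustration only.

```python
import torch

def augment_with_back_translation(envs, speaker, follower, feat_dropout=0.4):
    """Generate synthetic (environment, path, instruction) triples.
    `envs`, `speaker`, and `follower` are hypothetical objects."""
    synthetic = []
    for env in envs:
        # Environmental dropout: randomly zero visual feature channels so the
        # augmented environment behaves like a new, unseen one.
        keep = torch.rand(env.feat_dim) > feat_dropout
        dropped_env = env.with_feature_mask(keep)
        path = dropped_env.sample_path()
        instruction = speaker.generate(dropped_env, path)   # back-translation step
        synthetic.append((dropped_env, path, instruction))
    follower.train_on(synthetic)                             # semi-supervised update
    return synthetic
```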

Object Ordering with Bidirectional Matchings for Visual Reasoning

no code implementations NAACL 2018 Hao Tan, Mohit Bansal

Visual reasoning with compositional natural language instructions, e.g., based on the newly-released Cornell Natural Language Visual Reasoning (NLVR) dataset, is a challenging task, where the model needs to have the ability to create an accurate mapping between the diverse phrases and the several objects placed in complex arrangements in the image.

Visual Reasoning

Source-Target Inference Models for Spatial Instruction Understanding

no code implementations 12 Jul 2017 Hao Tan, Mohit Bansal

Models that can execute natural language instructions for situated robotic tasks such as assembly and navigation have several useful applications in homes, offices, and remote scenarios.

Representation Learning

A Joint Speaker-Listener-Reinforcer Model for Referring Expressions

2 code implementations CVPR 2017 Licheng Yu, Hao Tan, Mohit Bansal, Tamara L. Berg

The speaker generates referring expressions, the listener comprehends referring expressions, and the reinforcer introduces a reward function to guide sampling of more discriminative expressions.

Referring Expression Comprehension
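
As a rough sketch of how the three modules could share one objective (a generation loss for the speaker, a ranking loss for the listener, and a reward-weighted sampling term from the reinforcer); the weights and argument names are hypothetical, not the paper's exact formulation.

```python
def joint_loss(speaker_nll, listener_rank_loss, sample_logprob, reward,
               w_listener=1.0, w_reinforce=1.0):
    """speaker_nll: NLL of the ground-truth referring expression.
    listener_rank_loss: margin loss for matching expressions to regions.
    sample_logprob, reward: log-probability and discriminability reward of a
    sampled expression (REINFORCE-style term). Weights are illustrative."""
    reinforcer_term = -reward * sample_logprob
    return speaker_nll + w_listener * listener_rank_loss + w_reinforce * reinforcer_term
```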
