Search Results for author: Jiasen Lu

Found 25 papers, 18 papers with code

Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action

no code implementations • 28 Dec 2023 • Jiasen Lu, Christopher Clark, Sangho Lee, Zichen Zhang, Savya Khosla, Ryan Marten, Derek Hoiem, Aniruddha Kembhavi

We present Unified-IO 2, the first autoregressive multimodal model that is capable of understanding and generating image, text, audio, and action.

Image Generation Natural Language Understanding

Unified-IO: A Unified Model for Vision, Language, and Multi-Modal Tasks

no code implementations • 17 Jun 2022 • Jiasen Lu, Christopher Clark, Rowan Zellers, Roozbeh Mottaghi, Aniruddha Kembhavi

We propose Unified-IO, a model that performs a large variety of AI tasks spanning classical computer vision tasks, including pose estimation, object detection, depth estimation and image generation, vision-and-language tasks such as region captioning and referring expression, and natural language processing tasks such as question answering and paraphrasing.

Ranked #1 on Object Segmentation on GRIT

Depth Estimation Image Generation +12

ASC me to Do Anything: Multi-task Training for Embodied AI

no code implementations • 14 Feb 2022 • Jiasen Lu, Jordi Salvador, Roozbeh Mottaghi, Aniruddha Kembhavi

We propose Atomic Skill Completion (ASC), an approach for multi-task training for Embodied AI, where a set of atomic skills shared across multiple tasks are composed together to perform the tasks.

MERLOT Reserve: Neural Script Knowledge through Vision and Language and Sound

no code implementations • CVPR 2022 • Rowan Zellers, Jiasen Lu, Ximing Lu, Youngjae Yu, Yanpeng Zhao, Mohammadreza Salehi, Aditya Kusupati, Jack Hessel, Ali Farhadi, Yejin Choi

Given a video, we replace snippets of text and audio with a MASK token; the model learns by choosing the correct masked-out snippet.

Ranked #6 on Action Classification on Kinetics-600 (using extra training data)

Action Classification Navigate +2

Container: Context Aggregation Networks

2 code implementations • NeurIPS 2021 • Peng Gao, Jiasen Lu, Hongsheng Li, Roozbeh Mottaghi, Aniruddha Kembhavi

Convolutional neural networks (CNNs) are ubiquitous in computer vision, with a myriad of effective and efficient variations.

Inductive Bias Instance Segmentation +4

A Simple Long-Tailed Recognition Baseline via Vision-Language Model

1 code implementation • 29 Nov 2021 • Teli Ma, Shijie Geng, Mengmeng Wang, Jing Shao, Jiasen Lu, Hongsheng Li, Peng Gao, Yu Qiao

Recent advances in large-scale contrastive visual-language pretraining shed light on a new pathway for visual recognition.

Ranked #4 on Long-tail Learning on Places-LT (using extra training data)

Contrastive Learning Language Modelling +3

Container: Context Aggregation Network

4 code implementations • 2 Jun 2021 • Peng Gao, Jiasen Lu, Hongsheng Li, Roozbeh Mottaghi, Aniruddha Kembhavi

Convolutional neural networks (CNNs) are ubiquitous in computer vision, with a myriad of effective and efficient variations.

Ranked #465 on Image Classification on ImageNet

Image Classification Inductive Bias +5

Multi-Modal Answer Validation for Knowledge-Based VQA

1 code implementation • 23 Mar 2021 • Jialin Wu, Jiasen Lu, Ashish Sabharwal, Roozbeh Mottaghi

Instead of searching for the answer in a vast collection of often irrelevant facts as most existing approaches do, MAVEx aims to learn how to extract relevant knowledge from noisy sources, which knowledge source to trust for each answer candidate, and how to validate the candidate using that source.

Question Answering Retrieval +1

Transferable Feature Learning on Graphs Across Visual Domains

no code implementations • 1 Jan 2021 • Ronghang Zhu, Xiaodong Jiang, Jiasen Lu, Sheng Li

In this paper, we propose a novel Transferable Feature Learning approach on Graphs (TFLG) for unsupervised adversarial domain adaptation, which jointly incorporates sample-level and class-level structure information across two domains.

Unsupervised Domain Adaptation

X-LXMERT: Paint, Caption and Answer Questions with Multi-Modal Transformers

1 code implementation • EMNLP 2020 • Jaemin Cho, Jiasen Lu, Dustin Schwenk, Hannaneh Hajishirzi, Aniruddha Kembhavi

X-LXMERT's image generation capabilities rival state-of-the-art generative models, while its question answering and captioning abilities remain comparable to LXMERT.

Image Captioning Image Generation +3

Dialog without Dialog Data: Learning Visual Dialog Agents from VQA Data

1 code implementation • NeurIPS 2020 • Michael Cogswell, Jiasen Lu, Rishabh Jain, Stefan Lee, Devi Parikh, Dhruv Batra

Can we develop visually grounded dialog agents that can efficiently adapt to new tasks without forgetting how to talk to people?

Visual Dialog Visual Question Answering (VQA)

Spatially Aware Multimodal Transformers for TextVQA

1 code implementation • ECCV 2020 • Yash Kant, Dhruv Batra, Peter Anderson, Alex Schwing, Devi Parikh, Jiasen Lu, Harsh Agrawal

Further, each head in our multi-head self-attention layer focuses on a different subset of relations.

Optical Character Recognition (OCR) Visual Grounding +1

12-in-1: Multi-Task Vision and Language Representation Learning

5 code implementations • CVPR 2020 • Jiasen Lu, Vedanuj Goswami, Marcus Rohrbach, Devi Parikh, Stefan Lee

Much of vision-and-language research focuses on a small but diverse set of independent tasks and supporting datasets often studied in isolation; however, the visually-grounded language understanding skills required for success at these tasks overlap significantly.

Image Retrieval Question Answering +3

ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks

11 code implementations • NeurIPS 2019 • Jiasen Lu, Dhruv Batra, Devi Parikh, Stefan Lee

We present ViLBERT (short for Vision-and-Language BERT), a model for learning task-agnostic joint representations of image content and natural language.

Ranked #5 on Referring Expression Comprehension on Talk2Car

Image Retrieval Question Answering +5

Emergence of Compositional Language with Deep Generational Transmission

1 code implementation • ICLR 2020 • Michael Cogswell, Jiasen Lu, Stefan Lee, Devi Parikh, Dhruv Batra

In this paper, we introduce these cultural evolutionary dynamics into language emergence by periodically replacing agents in a population to create a knowledge gap, implicitly inducing cultural transmission of language.

Reinforcement Learning (RL)

Self-Monitoring Navigation Agent via Auxiliary Progress Estimation

2 code implementations • ICLR 2019 • Chih-Yao Ma, Jiasen Lu, Zuxuan Wu, Ghassan AlRegib, Zsolt Kira, Richard Socher, Caiming Xiong

The Vision-and-Language Navigation (VLN) task entails an agent following navigational instructions in photo-realistic unknown environments.

Ranked #115 on Vision and Language Navigation on VLN Challenge

Natural Language Visual Grounding Vision and Language Navigation +2

Visual Curiosity: Learning to Ask Questions to Learn Visual Recognition

no code implementations • 1 Oct 2018 • Jianwei Yang, Jiasen Lu, Stefan Lee, Dhruv Batra, Devi Parikh

Our question generation policy generalizes to new environments and a new pair of eyes, i.e., a new visual system.

Question Generation Question-Generation

Graph R-CNN for Scene Graph Generation

3 code implementations • ECCV 2018 • Jianwei Yang, Jiasen Lu, Stefan Lee, Dhruv Batra, Devi Parikh

We propose a novel scene graph generation model called Graph R-CNN, that is both effective and efficient at detecting objects and their relations in images.

Ranked #12 on Scene Graph Generation on Visual Genome

Graph Generation Scene Graph Generation

Neural Baby Talk

1 code implementation • CVPR 2018 • Jiasen Lu, Jianwei Yang, Dhruv Batra, Devi Parikh

We introduce a novel framework for image captioning that can produce natural language explicitly grounded in entities that object detectors find in the image.

Image Captioning Object +3

Best of Both Worlds: Transferring Knowledge from Discriminative Learning to a Generative Visual Dialog Model

1 code implementation • NeurIPS 2017 • Jiasen Lu, Anitha Kannan, Jianwei Yang, Devi Parikh, Dhruv Batra

In contrast, discriminative dialog models (D) that are trained to rank a list of candidate human responses outperform their generative counterparts in terms of automatic metrics, diversity, and informativeness of the responses.

Ranked #8 on Visual Dialog on VisDial v0.9 val

Informativeness Metric Learning +2

ParlAI: A Dialog Research Software Platform

22 code implementations • EMNLP 2017 • Alexander H. Miller, Will Feng, Adam Fisch, Jiasen Lu, Dhruv Batra, Antoine Bordes, Devi Parikh, Jason Weston

We introduce ParlAI (pronounced "par-lay"), an open-source software platform for dialog research implemented in Python, available at http://parl.ai.

reinforcement-learning Reinforcement Learning (RL) +1

Knowing When to Look: Adaptive Attention via A Visual Sentinel for Image Captioning

7 code implementations • CVPR 2017 • Jiasen Lu, Caiming Xiong, Devi Parikh, Richard Socher

The model decides whether to attend to the image and where, in order to extract meaningful information for sequential word generation.

Image Captioning Language Modelling

Hierarchical Question-Image Co-Attention for Visual Question Answering

9 code implementations • NeurIPS 2016 • Jiasen Lu, Jianwei Yang, Dhruv Batra, Devi Parikh

In addition, our model reasons about the question (and consequently the image via the co-attention mechanism) in a hierarchical fashion via novel 1-dimensional convolutional neural networks (CNNs).

Ranked #3 on Visual Question Answering (VQA) on VQA v1 test-std

Visual Dialog Visual Question Answering

Human Action Segmentation With Hierarchical Supervoxel Consistency

no code implementations • CVPR 2015 • Jiasen Lu, Ran Xu, Jason J. Corso

Detailed analysis of human action, such as action classification, detection and localization has received increasing attention from the community; datasets like JHMDB have made it plausible to conduct studies analyzing the impact that such deeper information has on the greater action understanding problem.

Action Classification Action Segmentation +3

VQA: Visual Question Answering

21 code implementations • ICCV 2015 • Aishwarya Agrawal, Jiasen Lu, Stanislaw Antol, Margaret Mitchell, C. Lawrence Zitnick, Dhruv Batra, Devi Parikh

Given an image and a natural language question about the image, the task is to provide an accurate natural language answer.

Ranked #1 on Visual Question Answering (VQA) on COCO Visual Question Answering (VQA) real images 2.0 open ended

Image Captioning Multiple-choice +1
