no code implementations • 9 Oct 2024 • Hanrong Ye, Haotian Zhang, Erik Daxberger, Lin Chen, Zongyu Lin, Yanghao Li, BoWen Zhang, Haoxuan You, Dan Xu, Zhe Gan, Jiasen Lu, Yinfei Yang
First, as there is a lack of QA data for egocentric video understanding, we develop a data engine that efficiently generates 7M high-quality QA samples for egocentric videos ranging from 30 seconds to one hour in length, based on human-annotated data.
no code implementations • 25 Sep 2024 • Matt Deitke, Christopher Clark, Sangho Lee, Rohun Tripathi, Yue Yang, Jae Sung Park, Mohammadreza Salehi, Niklas Muennighoff, Kyle Lo, Luca Soldaini, Jiasen Lu, Taira Anderson, Erin Bransom, Kiana Ehsani, Huong Ngo, YenSung Chen, Ajay Patel, Mark Yatskar, Chris Callison-Burch, Andrew Head, Rose Hendrix, Favyen Bastani, Eli VanderBilt, Nathan Lambert, Yvonne Chou, Arnavi Chheda, Jenna Sparks, Sam Skjonsberg, Michael Schmitz, Aaron Sarnat, Byron Bischoff, Pete Walsh, Chris Newell, Piper Wolters, Tanmay Gupta, Kuo-Hao Zeng, Jon Borchardt, Dirk Groeneveld, Jen Dumas, Crystal Nam, Sophie Lebrecht, Caitlin Wittlif, Carissa Schoenick, Oscar Michel, Ranjay Krishna, Luca Weihs, Noah A. Smith, Hannaneh Hajishirzi, Ross Girshick, Ali Farhadi, Aniruddha Kembhavi
We will be releasing all of our model weights, captioning and fine-tuning data, and source code in the near future.
no code implementations • 11 Jul 2024 • Yue Bai, Zichen Zhang, Jiasen Lu, Yun Fu
Training large language models (LLMs) and multimodal LLMs necessitates significant computing resources, and existing publicly available LLMs are typically pre-trained on diverse, privately curated datasets spanning various tasks.
no code implementations • 13 Jun 2024 • Duong H. Le, Tuan Pham, Aniruddha Kembhavi, Stephan Mandt, Wei-Chiu Ma, Jiasen Lu
We present Piva (Preserving Identity with Variational Score Distillation), a novel optimization-based method for editing images and 3D models based on diffusion models.
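To make the "editing via score distillation plus identity preservation" idea concrete, here is a toy sketch of that pattern: a score-distillation-style update nudges a representation using a (stand-in) denoiser, while a regularizer keeps it close to the original. The denoiser, shapes, and regularizer weight are all illustrative assumptions, not the paper's actual model or objective.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Stand-in denoiser: predicts the noise added to a noisy input given the time t.
denoiser = nn.Sequential(nn.Linear(65, 128), nn.ReLU(), nn.Linear(128, 64))

x = torch.randn(1, 64, requires_grad=True)  # the representation being edited
x_ref = x.detach().clone()                  # frozen copy of the original
opt = torch.optim.Adam([x], lr=1e-2)

for step in range(200):
    t = torch.rand(1, 1)                    # random diffusion time in (0, 1)
    noise = torch.randn_like(x)
    x_noisy = (1 - t).sqrt() * x + t.sqrt() * noise
    eps_pred = denoiser(torch.cat([x_noisy, t], dim=-1))
    # Score-distillation-style gradient: push x where the denoiser disagrees
    # with the injected noise; gradients do not flow through the denoiser.
    sds_grad = (eps_pred - noise).detach()
    # Identity-preservation term keeps the edit close to the original content.
    loss = (sds_grad * x).sum() + 0.1 * (x - x_ref).pow(2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```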
no code implementations • CVPR 2024 • Jiasen Lu, Christopher Clark, Sangho Lee, Zichen Zhang, Savya Khosla, Ryan Marten, Derek Hoiem, Aniruddha Kembhavi
We present Unified-IO 2, a multimodal and multi-skill unified model capable of following novel instructions.
1 code implementation • 28 Dec 2023 • Jiasen Lu, Christopher Clark, Sangho Lee, Zichen Zhang, Savya Khosla, Ryan Marten, Derek Hoiem, Aniruddha Kembhavi
We present Unified-IO 2, the first autoregressive multimodal model that is capable of understanding and generating image, text, audio, and action.
no code implementations • 17 Jun 2022 • Jiasen Lu, Christopher Clark, Rowan Zellers, Roozbeh Mottaghi, Aniruddha Kembhavi
We propose Unified-IO, a model that performs a large variety of AI tasks, spanning classical computer vision tasks such as pose estimation, object detection, depth estimation, and image generation; vision-and-language tasks such as region captioning and referring expression comprehension; and natural language processing tasks such as question answering and paraphrasing.
Ranked #1 on Object Segmentation on GRIT
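The unifying idea behind Unified-IO-style models is to cast every task as mapping one token sequence to another, so a single sequence-to-sequence model can serve them all. The sketch below illustrates that interface only; the task prompts and token names are invented placeholders, not the model's actual vocabulary.

```python
# One model, one interface, many tasks: each task is just (input tokens -> output tokens).
examples = [
    # (task prompt + encoded input, target token sequence)
    ("[vqa] what color is the car? <img_tok_17> <img_tok_94>", "red"),
    ("[depth] <img_tok_17> <img_tok_94>", "<depth_tok_3> <depth_tok_8>"),
    ("[caption] <img_tok_17> <img_tok_94>", "a red car parked outside"),
]
for src, tgt in examples:
    print(f"{src!r} -> {tgt!r}")
```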
no code implementations • 14 Feb 2022 • Jiasen Lu, Jordi Salvador, Roozbeh Mottaghi, Aniruddha Kembhavi
We propose Atomic Skill Completion (ASC), an approach for multi-task training for Embodied AI, where a set of atomic skills shared across multiple tasks are composed together to perform the tasks.
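A toy illustration of the composition idea (not the authors' implementation): each atomic skill is a small policy shared across tasks, and a task is executed by chaining skills. The skill names and state format here are hypothetical.

```python
from typing import Callable, Dict, List

Skill = Callable[[dict], dict]  # maps an environment state to a new state

def goto(state: dict) -> dict:
    state["position"] = state["goal"]
    return state

def pickup(state: dict) -> dict:
    state["holding"] = state.get("object", "item")
    return state

SKILLS: Dict[str, Skill] = {"goto": goto, "pickup": pickup}

def run_task(plan: List[str], state: dict) -> dict:
    """Execute a task by chaining atomic skills shared across tasks."""
    for name in plan:
        state = SKILLS[name](state)
    return state

print(run_task(["goto", "pickup"], {"position": 0, "goal": 3, "object": "mug"}))
```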
no code implementations • CVPR 2022 • Rowan Zellers, Jiasen Lu, Ximing Lu, Youngjae Yu, Yanpeng Zhao, Mohammadreza Salehi, Aditya Kusupati, Jack Hessel, Ali Farhadi, Yejin Choi
Given a video, we replace snippets of text and audio with a MASK token; the model learns by choosing the correct masked-out snippet.
Ranked #6 on Action Classification on Kinetics-600 (using extra training data)
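The learning signal described above is contrastive: the model's prediction at a MASKed position should match the embedding of the snippet that was masked out, against other candidates. A minimal sketch of that objective follows; the dimensions, temperature, and in-batch-negatives setup are illustrative, not the released model's.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d = 32
pred = torch.randn(4, d)        # contextual predictions at 4 masked positions
candidates = torch.randn(4, d)  # embeddings of the true masked-out snippets

# In-batch negatives: each masked position should match its own snippet.
logits = F.normalize(pred, dim=-1) @ F.normalize(candidates, dim=-1).T / 0.07
loss = F.cross_entropy(logits, torch.arange(4))
print(float(loss))
```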
2 code implementations • NeurIPS 2021 • Peng Gao, Jiasen Lu, Hongsheng Li, Roozbeh Mottaghi, Aniruddha Kembhavi
Convolutional neural networks (CNNs) are ubiquitous in computer vision, with a myriad of effective and efficient variations.
1 code implementation • 29 Nov 2021 • Teli Ma, Shijie Geng, Mengmeng Wang, Jing Shao, Jiasen Lu, Hongsheng Li, Peng Gao, Yu Qiao
Recent advances in large-scale contrastive visual-language pretraining shed light on a new pathway for visual recognition.
Ranked #5 on Long-tail Learning on Places-LT (using extra training data)
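The recipe that contrastive visual-language pretraining enables, in its simplest form, is zero-shot recognition: embed class-name prompts with the text encoder, embed the image, and classify by similarity. This sketch shows only that general recipe, with random tensors standing in for trained encoders; it is not this paper's specific method.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
classes = ["dog", "cat", "car"]
text_emb = F.normalize(torch.randn(len(classes), 512), dim=-1)  # stand-in text encoder output
image_emb = F.normalize(torch.randn(1, 512), dim=-1)            # stand-in image encoder output

probs = (100.0 * image_emb @ text_emb.T).softmax(dim=-1)
print(classes[int(probs.argmax())], probs.tolist())
```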
4 code implementations • 2 Jun 2021 • Peng Gao, Jiasen Lu, Hongsheng Li, Roozbeh Mottaghi, Aniruddha Kembhavi
Convolutional neural networks (CNNs) are ubiquitous in computer vision, with a myriad of effective and efficient variations.
Ranked #478 on Image Classification on ImageNet
1 code implementation • 23 Mar 2021 • Jialin Wu, Jiasen Lu, Ashish Sabharwal, Roozbeh Mottaghi
Instead of searching for the answer in a vast collection of often irrelevant facts as most existing approaches do, MAVEx aims to learn how to extract relevant knowledge from noisy sources, which knowledge source to trust for each answer candidate, and how to validate the candidate using that source.
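A heavily simplified sketch of that validation pattern: each answer candidate is scored against several knowledge sources, and a per-source trust weight decides which evidence to believe. All quantities here are random stand-ins for learned components, not MAVEx's actual architecture.

```python
import torch

torch.manual_seed(0)
num_candidates, num_sources = 3, 2
support = torch.rand(num_candidates, num_sources)  # how well each source supports each candidate
trust = torch.softmax(torch.randn(num_candidates, num_sources), dim=-1)  # learned source trust

scores = (support * trust).sum(dim=-1)             # validated score per candidate
print("best candidate:", int(scores.argmax()))
```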
no code implementations • 1 Jan 2021 • Ronghang Zhu, Xiaodong Jiang, Jiasen Lu, Sheng Li
In this paper, we propose a novel Transferable Feature Learning approach on Graphs (TFLG) for unsupervised adversarial domain adaptation, which jointly incorporates sample-level and class-level structure information across two domains.
1 code implementation • EMNLP 2020 • Jaemin Cho, Jiasen Lu, Dustin Schwenk, Hannaneh Hajishirzi, Aniruddha Kembhavi
X-LXMERT's image generation capabilities rival state-of-the-art generative models, while its question answering and captioning abilities remain comparable to those of LXMERT.
1 code implementation • NeurIPS 2020 • Michael Cogswell, Jiasen Lu, Rishabh Jain, Stefan Lee, Devi Parikh, Dhruv Batra
Can we develop visually grounded dialog agents that can efficiently adapt to new tasks without forgetting how to talk to people?
1 code implementation • ECCV 2020 • Yash Kant, Dhruv Batra, Peter Anderson, Alex Schwing, Devi Parikh, Jiasen Lu, Harsh Agrawal
Further, each head in our multi-head self-attention layer focuses on a different subset of relations.
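The "each head attends over a different subset of relations" idea can be sketched as per-head boolean masks over token pairs, as below. The masks, sizes, and missing learned projections make this a toy illustration rather than the paper's exact layer.

```python
import torch

torch.manual_seed(0)
n, d, heads = 5, 16, 2
x = torch.randn(n, heads, d)
# Toy per-head relation masks, e.g. head 0 sees only "left-of" pairs.
rel_masks = torch.rand(heads, n, n) > 0.5
rel_masks |= torch.eye(n, dtype=torch.bool)  # always allow self-attention

q = k = v = x  # learned projections omitted for brevity
attn = torch.einsum("ihd,jhd->hij", q, k) / d ** 0.5
attn = attn.masked_fill(~rel_masks, float("-inf"))
out = torch.einsum("hij,jhd->ihd", attn.softmax(dim=-1), v)
print(out.shape)  # (5, 2, 16)
```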
5 code implementations • CVPR 2020 • Jiasen Lu, Vedanuj Goswami, Marcus Rohrbach, Devi Parikh, Stefan Lee
Much of vision-and-language research focuses on a small but diverse set of independent tasks and supporting datasets often studied in isolation; however, the visually-grounded language understanding skills required for success at these tasks overlap significantly.
Ranked #1 on 10-shot image generation on MEAD
11 code implementations • NeurIPS 2019 • Jiasen Lu, Dhruv Batra, Devi Parikh, Stefan Lee
We present ViLBERT (short for Vision-and-Language BERT), a model for learning task-agnostic joint representations of image content and natural language.
Ranked #5 on Referring Expression Comprehension on Talk2Car
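ViLBERT's central mechanism is co-attention between two streams: language queries attend to visual keys/values and vice versa. The sketch below shows that exchange with a single `nn.MultiheadAttention` module; the real model uses separate, stacked co-attentional layers per direction.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d = 64
vision = torch.randn(1, 36, d)    # e.g., image region features
language = torch.randn(1, 12, d)  # e.g., token features

# Cross-attention: query from one stream, key/value from the other.
cross = nn.MultiheadAttention(d, num_heads=4, batch_first=True)
lang_attended, _ = cross(language, vision, vision)   # language attends to vision
vis_attended, _ = cross(vision, language, language)  # vision attends to language
print(lang_attended.shape, vis_attended.shape)
```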
1 code implementation • ICLR 2020 • Michael Cogswell, Jiasen Lu, Stefan Lee, Devi Parikh, Dhruv Batra
In this paper, we introduce these cultural evolutionary dynamics into language emergence by periodically replacing agents in a population to create a knowledge gap, implicitly inducing cultural transmission of language.
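A toy sketch of that population dynamic (names hypothetical, training elided): periodically re-initializing one agent creates a knowledge gap that the remaining agents must bridge by re-teaching the emergent language.

```python
import random

random.seed(0)

def new_agent():
    return {"lexicon": {}}  # a fresh agent with no language knowledge

population = [new_agent() for _ in range(4)]

for epoch in range(10):
    # ... train all agents on the emergent-communication game here ...
    if epoch % 3 == 2:                    # periodic replacement
        idx = random.randrange(len(population))
        population[idx] = new_agent()     # creates a knowledge gap
        print(f"epoch {epoch}: replaced agent {idx}")
```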
2 code implementations • ICLR 2019 • Chih-Yao Ma, Jiasen Lu, Zuxuan Wu, Ghassan AlRegib, Zsolt Kira, Richard Socher, Caiming Xiong
The Vision-and-Language Navigation (VLN) task entails an agent following navigational instructions in photo-realistic unknown environments.
Ranked #115 on Vision and Language Navigation on VLN Challenge
no code implementations • 1 Oct 2018 • Jianwei Yang, Jiasen Lu, Stefan Lee, Dhruv Batra, Devi Parikh
Our question generation policy generalizes to new environments and a new pair of eyes, i.e., a new visual system.
3 code implementations • ECCV 2018 • Jianwei Yang, Jiasen Lu, Stefan Lee, Dhruv Batra, Devi Parikh
We propose Graph R-CNN, a novel scene graph generation model that is both effective and efficient at detecting objects and their relations in images.
Ranked #13 on Scene Graph Generation on Visual Genome
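One ingredient of efficient scene-graph generation is pruning the quadratic space of object pairs before classifying relations. The sketch below scores all pairs for "relatedness" and keeps the top-k as candidate relations; the scoring function here is a toy dot product, and Graph R-CNN's actual relation proposal network differs in detail.

```python
import torch

torch.manual_seed(0)
num_objects, d, k = 6, 32, 5
feats = torch.randn(num_objects, d)

scores = feats @ feats.T                    # pairwise relatedness (toy)
scores.fill_diagonal_(float("-inf"))        # no self-relations
topk = torch.topk(scores.flatten(), k).indices
pairs = [(int(i // num_objects), int(i % num_objects)) for i in topk]
print("candidate relation pairs:", pairs)
```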
1 code implementation • CVPR 2018 • Jiasen Lu, Jianwei Yang, Dhruv Batra, Devi Parikh
We introduce a novel framework for image captioning that can produce natural language explicitly grounded in entities that object detectors find in the image.
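The grounding idea can be pictured as captions with slots that are filled by detector outputs. The sketch below is a deliberately trivial template version of that interface; the paper's actual decoder generates the template and decides slot placement neurally.

```python
# Toy slotted caption filled with detected object categories (labels invented).
detections = [{"label": "dog", "score": 0.92}, {"label": "frisbee", "score": 0.81}]

template = "A {slot0} is catching a {slot1}."
slots = {f"slot{i}": det["label"] for i, det in enumerate(detections)}
print(template.format(**slots))  # -> "A dog is catching a frisbee."
```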
1 code implementation • NeurIPS 2017 • Jiasen Lu, Anitha Kannan, Jianwei Yang, Devi Parikh, Dhruv Batra
In contrast, discriminative dialog models (D), which are trained to rank a list of candidate human responses, outperform their generative counterparts in terms of automatic metrics, diversity, and informativeness of the responses.
Ranked #8 on Visual Dialog on VisDial v0.9 val
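A minimal sketch of the discriminative setup described above: score a list of candidate responses against the encoded dialog context and rank them. The encoders are random stand-ins for trained networks, and 100 candidates mirrors the usual VisDial protocol.

```python
import torch

torch.manual_seed(0)
d = 64
context = torch.randn(1, d)       # encoded dialog history + image
candidates = torch.randn(100, d)  # encoded candidate human responses

scores = (context @ candidates.T).squeeze(0)
ranking = scores.argsort(descending=True)
print("top candidate index:", int(ranking[0]))
```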
22 code implementations • EMNLP 2017 • Alexander H. Miller, Will Feng, Adam Fisch, Jiasen Lu, Dhruv Batra, Antoine Bordes, Devi Parikh, Jason Weston
We introduce ParlAI (pronounced "par-lay"), an open-source software platform for dialog research implemented in Python, available at http://parl.ai.
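For a quick start, ParlAI exposes its scripts as Python classes; the snippet below follows the pattern from the ParlAI tutorials, but the API has evolved across versions, so treat the exact entry point and arguments as something to verify against the current docs.

```python
# pip install parlai
from parlai.scripts.display_data import DisplayData

# Print a few examples from a bAbI task to verify the install.
DisplayData.main(task="babi:task10k:1", num_examples=5)
```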
7 code implementations • CVPR 2017 • Jiasen Lu, Caiming Xiong, Devi Parikh, Richard Socher
The model decides whether to attend to the image and where, in order to extract meaningful information for sequential word generation.
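The "whether and where to look" mechanism can be sketched with a visual sentinel: attention is computed over image regions plus one sentinel vector, and the attention mass on the sentinel acts as a gate deciding how much the next word relies on the image at all. This is simplified relative to the paper's exact formulation.

```python
import torch

torch.manual_seed(0)
k, d = 36, 64
regions = torch.randn(k, d)   # spatial image features ("where")
sentinel = torch.randn(1, d)  # visual sentinel from the LSTM ("whether")
h = torch.randn(1, d)         # current decoder hidden state

logits = torch.cat([regions, sentinel]) @ h.T  # (k+1, 1)
alpha = logits.softmax(dim=0)
beta = alpha[-1]                               # attention mass on the sentinel
context = (alpha[:-1] * regions).sum(0, keepdim=True) + beta * sentinel
print("sentinel gate beta =", float(beta))
```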
9 code implementations • NeurIPS 2016 • Jiasen Lu, Jianwei Yang, Dhruv Batra, Devi Parikh
In addition, our model reasons about the question (and consequently the image via the co-attention mechanism) in a hierarchical fashion via a novel 1-dimensional convolutional neural network (CNN).
Ranked #3 on Visual Question Answering (VQA) on VQA v1 test-std
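The core of co-attention is an affinity matrix between question and image features that yields attention maps over both modalities at once. The sketch below uses a simplified max-pooling variant with toy dimensions and random weights; the paper's full parallel co-attention adds learned projection vectors.

```python
import torch

torch.manual_seed(0)
T, N, d = 10, 49, 64       # question words, image regions, feature dim
Q = torch.randn(T, d)      # question features (word/phrase/question level)
V = torch.randn(N, d)      # image features
W = torch.randn(d, d) * 0.1

C = torch.tanh(Q @ W @ V.T)                # affinity matrix, (T, N)
a_v = C.max(dim=0).values.softmax(dim=0)   # attention over image regions
a_q = C.max(dim=1).values.softmax(dim=0)   # attention over question words
v_hat = a_v @ V                            # attended image feature
q_hat = a_q @ Q                            # attended question feature
print(v_hat.shape, q_hat.shape)
```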
no code implementations • CVPR 2015 • Jiasen Lu, Ran Xu, Jason J. Corso
Detailed analysis of human action, such as action classification, detection, and localization, has received increasing attention from the community; datasets like JHMDB have made it plausible to conduct studies analyzing the impact that such deeper information has on the broader action understanding problem.
21 code implementations • ICCV 2015 • Aishwarya Agrawal, Jiasen Lu, Stanislaw Antol, Margaret Mitchell, C. Lawrence Zitnick, Dhruv Batra, Devi Parikh
Given an image and a natural language question about the image, the task is to provide an accurate natural language answer.
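The standard formulation this dataset established treats VQA as classification over a fixed vocabulary of frequent answers, given fused image and question encodings. Below is a minimal sketch of that pipeline with untrained stand-in modules; the feature dimensions and answer-vocabulary size are typical choices, not prescribed by the paper.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
img_feat = torch.randn(1, 2048)  # e.g., CNN pooled image feature
q_feat = torch.randn(1, 512)     # e.g., LSTM question encoding
num_answers = 1000               # top-K most frequent answers

fuse = nn.Sequential(nn.Linear(2048 + 512, 512), nn.ReLU(), nn.Linear(512, num_answers))
logits = fuse(torch.cat([img_feat, q_feat], dim=-1))
print("predicted answer id:", int(logits.argmax()))
```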