Search Results for author: Jiasen Lu

Found 30 papers, 19 papers with code

MM-Ego: Towards Building Egocentric Multimodal LLMs

no code implementations 9 Oct 2024 Hanrong Ye, Haotian Zhang, Erik Daxberger, Lin Chen, Zongyu Lin, Yanghao Li, BoWen Zhang, Haoxuan You, Dan Xu, Zhe Gan, Jiasen Lu, Yinfei Yang

First, as there is a lack of QA data for egocentric video understanding, we develop a data engine that efficiently generates 7M high-quality QA samples for egocentric videos ranging from 30 seconds to one hour long, based on human-annotated data.

Video Understanding

SoupLM: Model Integration in Large Language and Multi-Modal Models

no code implementations 11 Jul 2024 Yue Bai, Zichen Zhang, Jiasen Lu, Yun Fu

Training large language models (LLMs) and multimodal LLMs necessitates significant computing resources, and existing publicly available LLMs are typically pre-trained on diverse, privately curated datasets spanning various tasks.

Chatbot

Preserving Identity with Variational Score for General-purpose 3D Editing

no code implementations 13 Jun 2024 Duong H. Le, Tuan Pham, Aniruddha Kembhavi, Stephan Mandt, Wei-Chiu Ma, Jiasen Lu

We present Piva (Preserving Identity with Variational Score Distillation), a novel optimization-based method for editing images and 3D models based on diffusion models.

Denoising

Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action

1 code implementation 28 Dec 2023 Jiasen Lu, Christopher Clark, Sangho Lee, Zichen Zhang, Savya Khosla, Ryan Marten, Derek Hoiem, Aniruddha Kembhavi

We present Unified-IO 2, the first autoregressive multimodal model that is capable of understanding and generating image, text, audio, and action.

Decoder Image Generation +1

Unified-IO: A Unified Model for Vision, Language, and Multi-Modal Tasks

no code implementations 17 Jun 2022 Jiasen Lu, Christopher Clark, Rowan Zellers, Roozbeh Mottaghi, Aniruddha Kembhavi

We propose Unified-IO, a model that performs a large variety of AI tasks, from classical computer vision tasks (pose estimation, object detection, depth estimation, and image generation) and vision-and-language tasks such as region captioning and referring expression, to natural language processing tasks such as question answering and paraphrasing.

Depth Estimation Image Generation +12

ASC me to Do Anything: Multi-task Training for Embodied AI

no code implementations 14 Feb 2022 Jiasen Lu, Jordi Salvador, Roozbeh Mottaghi, Aniruddha Kembhavi

We propose Atomic Skill Completion (ASC), an approach for multi-task training for Embodied AI, where a set of atomic skills shared across multiple tasks are composed together to perform the tasks.

Container: Context Aggregation Networks

2 code implementations NeurIPS 2021 Peng Gao, Jiasen Lu, Hongsheng Li, Roozbeh Mottaghi, Aniruddha Kembhavi

Convolutional neural networks (CNNs) are ubiquitous in computer vision, with a myriad of effective and efficient variations.

Inductive Bias Instance Segmentation +4

A Simple Long-Tailed Recognition Baseline via Vision-Language Model

1 code implementation 29 Nov 2021 Teli Ma, Shijie Geng, Mengmeng Wang, Jing Shao, Jiasen Lu, Hongsheng Li, Peng Gao, Yu Qiao

Recent advances in large-scale contrastive visual-language pretraining shed light on a new pathway for visual recognition.

Ranked #5 on Long-tail Learning on Places-LT (using extra training data)

Contrastive Learning Language Modelling +3

Container: Context Aggregation Network

4 code implementations 2 Jun 2021 Peng Gao, Jiasen Lu, Hongsheng Li, Roozbeh Mottaghi, Aniruddha Kembhavi

Convolutional neural networks (CNNs) are ubiquitous in computer vision, with a myriad of effective and efficient variations.

Image Classification Inductive Bias +5

Multi-Modal Answer Validation for Knowledge-Based VQA

1 code implementation 23 Mar 2021 Jialin Wu, Jiasen Lu, Ashish Sabharwal, Roozbeh Mottaghi

Instead of searching for the answer in a vast collection of often irrelevant facts as most existing approaches do, MAVEx aims to learn how to extract relevant knowledge from noisy sources, which knowledge source to trust for each answer candidate, and how to validate the candidate using that source.

Question Answering Retrieval +1

Transferable Feature Learning on Graphs Across Visual Domains

no code implementations 1 Jan 2021 Ronghang Zhu, Xiaodong Jiang, Jiasen Lu, Sheng Li

In this paper, we propose a novel Transferable Feature Learning approach on Graphs (TFLG) for unsupervised adversarial domain adaptation, which jointly incorporates sample-level and class-level structure information across two domains.

Unsupervised Domain Adaptation

X-LXMERT: Paint, Caption and Answer Questions with Multi-Modal Transformers

1 code implementation EMNLP 2020 Jaemin Cho, Jiasen Lu, Dustin Schwenk, Hannaneh Hajishirzi, Aniruddha Kembhavi

X-LXMERT's image generation capabilities rival state-of-the-art generative models while its question answering and captioning abilities remain comparable to LXMERT.

Image Captioning Image Generation +3

12-in-1: Multi-Task Vision and Language Representation Learning

5 code implementations CVPR 2020 Jiasen Lu, Vedanuj Goswami, Marcus Rohrbach, Devi Parikh, Stefan Lee

Much of vision-and-language research focuses on a small but diverse set of independent tasks and supporting datasets often studied in isolation; however, the visually-grounded language understanding skills required for success at these tasks overlap significantly.

Image Retrieval Question Answering +3

Emergence of Compositional Language with Deep Generational Transmission

1 code implementation ICLR 2020 Michael Cogswell, Jiasen Lu, Stefan Lee, Devi Parikh, Dhruv Batra

In this paper, we introduce these cultural evolutionary dynamics into language emergence by periodically replacing agents in a population to create a knowledge gap, implicitly inducing cultural transmission of language.

Reinforcement Learning (RL)

Visual Curiosity: Learning to Ask Questions to Learn Visual Recognition

no code implementations 1 Oct 2018 Jianwei Yang, Jiasen Lu, Stefan Lee, Dhruv Batra, Devi Parikh

Our question generation policy generalizes to new environments and a new pair of eyes, i.e., a new visual system.

Question Generation +1

Graph R-CNN for Scene Graph Generation

3 code implementations ECCV 2018 Jianwei Yang, Jiasen Lu, Stefan Lee, Dhruv Batra, Devi Parikh

We propose a novel scene graph generation model called Graph R-CNN, that is both effective and efficient at detecting objects and their relations in images.

Graph Generation Scene Graph Generation

Neural Baby Talk

1 code implementation CVPR 2018 Jiasen Lu, Jianwei Yang, Dhruv Batra, Devi Parikh

We introduce a novel framework for image captioning that can produce natural language explicitly grounded in entities that object detectors find in the image.

Image Captioning Object +3

Best of Both Worlds: Transferring Knowledge from Discriminative Learning to a Generative Visual Dialog Model

1 code implementation NeurIPS 2017 Jiasen Lu, Anitha Kannan, Jianwei Yang, Devi Parikh, Dhruv Batra

In contrast, discriminative dialog models (D) that are trained to rank a list of candidate human responses outperform their generative counterparts in terms of automatic metrics, diversity, and informativeness of the responses.

Informativeness Metric Learning +2

ParlAI: A Dialog Research Software Platform

22 code implementations EMNLP 2017 Alexander H. Miller, Will Feng, Adam Fisch, Jiasen Lu, Dhruv Batra, Antoine Bordes, Devi Parikh, Jason Weston

We introduce ParlAI (pronounced "par-lay"), an open-source software platform for dialog research implemented in Python, available at http://parl.ai.

Reinforcement Learning +2

Knowing When to Look: Adaptive Attention via A Visual Sentinel for Image Captioning

7 code implementations CVPR 2017 Jiasen Lu, Caiming Xiong, Devi Parikh, Richard Socher

The model decides whether to attend to the image and where, in order to extract meaningful information for sequential word generation.

Decoder Image Captioning +1

Hierarchical Question-Image Co-Attention for Visual Question Answering

9 code implementations NeurIPS 2016 Jiasen Lu, Jianwei Yang, Dhruv Batra, Devi Parikh

In addition, our model reasons about the question (and consequently the image via the co-attention mechanism) in a hierarchical fashion via novel 1-dimensional convolutional neural networks (CNNs).

Visual Dialog Visual Question Answering

Human Action Segmentation With Hierarchical Supervoxel Consistency

no code implementations CVPR 2015 Jiasen Lu, Ran Xu, Jason J. Corso

Detailed analysis of human action, such as action classification, detection and localization has received increasing attention from the community; datasets like JHMDB have made it plausible to conduct studies analyzing the impact that such deeper information has on the greater action understanding problem.

Action Classification Action Segmentation +3
