no code implementations • 15 Apr 2024 • Han Lin, Jaemin Cho, Abhay Zala, Mohit Bansal
Ctrl-Adapter provides diverse capabilities including image control, video control, video control with sparse frames, multi-condition control, compatibility with different backbones, adaptation to unseen control conditions, and video editing.
1 code implementation • 31 Mar 2024 • Qin Liu, Jaemin Cho, Mohit Bansal, Marc Niethammer
In light of this, we reintroduce this dense design into the generalist models, to facilitate the development of generalist models with high segmentation quality.
no code implementations • 18 Mar 2024 • Abhay Zala, Jaemin Cho, Han Lin, Jaehong Yoon, Mohit Bansal
Instead of directly employing LLMs as agents, can we use LLMs' reasoning capabilities to adaptively create training environments to help smaller embodied RL agents learn useful skills that they are weak at?
no code implementations • 11 Mar 2024 • Jialu Li, Jaemin Cho, Yi-Lin Sung, Jaehong Yoon, Mohit Bansal
In this paper, we introduce SELMA: Skill-Specific Expert Learning and Merging with Auto-Generated Data, a novel paradigm to improve the faithfulness of T2I models by fine-tuning models on automatically generated, multi-skill image-text datasets, with skill-specific expert learning and merging.
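The entry above describes learning skill-specific experts on auto-generated data and then merging them. Below is a minimal, hypothetical sketch of that idea, reduced to per-skill weight copies merged by uniform averaging; the function names and the averaging strategy are illustrative assumptions, not SELMA's actual implementation.

```python
import torch

def train_skill_expert(base_state, skill_dataset):
    """Placeholder: fine-tune a copy of the base weights on one automatically
    generated, skill-specific image-text dataset and return the expert weights."""
    expert_state = {k: v.clone() for k, v in base_state.items()}
    # ... skill-specific fine-tuning on skill_dataset would go here ...
    return expert_state

def merge_experts(expert_states):
    """Uniformly average the expert weights (one simple merging strategy)."""
    return {
        key: torch.stack([state[key] for state in expert_states]).mean(dim=0)
        for key in expert_states[0]
    }
```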
no code implementations • 4 Mar 2024 • David Wan, Jaemin Cho, Elias Stengel-Eskin, Mohit Bansal
Highlighting particularly relevant regions of an image can improve the performance of vision-language models (VLMs) on various vision-language (VL) tasks by guiding the model to attend more closely to these regions of interest.
no code implementations • 27 Oct 2023 • Jaemin Cho, Yushi Hu, Roopal Garg, Peter Anderson, Ranjay Krishna, Jason Baldridge, Mohit Bansal, Jordi Pont-Tuset, Su Wang
With extensive experimentation and human evaluation on a range of model configurations (LLM, VQA, and T2I), we empirically demonstrate that DSG addresses the challenges noted above.
no code implementations • 18 Oct 2023 • Abhay Zala, Han Lin, Jaemin Cho, Mohit Bansal
In the first stage, we use LLMs to generate and iteratively refine 'diagram plans' (in a planner-auditor feedback loop) which describe all the entities (objects and text labels), their relationships (arrows or lines), and their bounding box layouts.
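As a rough illustration of what an intermediate "diagram plan" and a planner-auditor refinement loop could look like, here is a hypothetical sketch; the field names and the llm_planner / llm_auditor callables are assumptions, not the paper's API.

```python
from dataclasses import dataclass, field

@dataclass
class Entity:
    name: str            # object or text label
    bbox: tuple          # (x, y, w, h) layout, e.g. in normalized coordinates

@dataclass
class DiagramPlan:
    entities: list = field(default_factory=list)
    relationships: list = field(default_factory=list)  # e.g. ("cell", "arrow", "nucleus")

def refine_plan(prompt, llm_planner, llm_auditor, max_rounds=3):
    """Planner drafts a plan; auditor critiques it; planner revises until clean."""
    plan = llm_planner(prompt, feedback=None)
    for _ in range(max_rounds):
        feedback = llm_auditor(prompt, plan)   # e.g. missing entities, overlapping boxes
        if not feedback:                       # auditor is satisfied
            break
        plan = llm_planner(prompt, feedback=feedback)
    return plan
```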
no code implementations • 26 Sep 2023 • Han Lin, Abhay Zala, Jaemin Cho, Mohit Bansal
Our experiments demonstrate that the VideoDirectorGPT framework substantially improves layout and movement control in both single- and multi-scene video generation and can generate multi-scene videos with visual consistency across scenes, while achieving performance competitive with state-of-the-art models in open-domain single-scene T2V generation.
no code implementations • 24 May 2023 • Jaemin Cho, Abhay Zala, Mohit Bansal
First, we introduce VPGen, an interpretable step-by-step T2I generation framework that decomposes T2I generation into three steps: object/count generation, layout generation, and image generation.
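A minimal sketch of the three-step decomposition described above; the three generator callables are placeholders (e.g., a fine-tuned LM for the first two steps and a layout-conditioned image generator for the last), not VPGen's actual interface.

```python
def vpgen_style_generate(prompt, object_generator, layout_generator, image_generator):
    """Three interpretable steps: objects/counts -> layouts -> image."""
    objects = object_generator(prompt)            # e.g. {"dog": 2, "frisbee": 1}
    layouts = layout_generator(prompt, objects)   # one bounding box per object instance
    return image_generator(prompt, layouts)       # layout-guided image synthesis
```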
1 code implementation • NeurIPS 2023 • Zhenhailong Wang, Ansel Blume, Sha Li, Genglin Liu, Jaemin Cho, Zineng Tang, Mohit Bansal, Heng Ji
Action knowledge involves the understanding of textual, visual, and temporal aspects of actions.
Ranked #19 on Video Question Answering on NExT-QA (using extra training data)
1 code implementation • NeurIPS 2023 • Shoubin Yu, Jaemin Cho, Prateek Yadav, Mohit Bansal
The SeViLA framework consists of two modules, a Localizer and an Answerer, both parameter-efficiently fine-tuned from BLIP-2.
Ranked #3 on Zero-Shot Video Question Answer on IntentQA (using extra training data)
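A minimal sketch of the localize-then-answer pipeline from the SeViLA entry above: score frames for question relevance, keep the top keyframes, then answer from those frames only. The localizer and answerer here are placeholders, not BLIP-2 code, and the top-k selection is an illustrative simplification.

```python
def localize_then_answer(frames, question, localizer, answerer, top_k=4):
    """Score frames for relevance, keep the top-k keyframes, then answer."""
    scores = [localizer(frame, question) for frame in frames]
    ranked = sorted(zip(scores, frames), key=lambda pair: pair[0], reverse=True)
    keyframes = [frame for _, frame in ranked[:top_k]]
    return answerer(keyframes, question)
```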
1 code implementation • 13 Apr 2023 • Jaemin Cho, Linjie Li, Zhengyuan Yang, Zhe Gan, Lijuan Wang, Mohit Bansal
In this paper, we propose LayoutBench, a diagnostic benchmark for layout-guided image generation that examines four categories of spatial control skills: number, position, size, and shape.
Ranked #1 on Layout-to-Image Generation on LayoutBench
1 code implementation • CVPR 2023 • Abhay Zala, Jaemin Cho, Satwik Kottur, Xilun Chen, Barlas Oğuz, Yashar Mehdad, Mohit Bansal
Our hierarchical benchmark consists of video retrieval, moment retrieval, and two novel moment segmentation and step captioning tasks.
1 code implementation • 21 Nov 2022 • Zineng Tang, Jaemin Cho, Jie Lei, Mohit Bansal
We present Perceiver-VL, a vision-and-language framework that efficiently handles high-dimensional multimodal inputs such as long videos and text.
2 code implementations • 28 Sep 2022 • Zineng Tang, Jaemin Cho, Yixin Nie, Mohit Bansal
In this work, we present the Textless Vision-Language Transformer (TVLT), where homogeneous transformer blocks take raw visual and audio inputs for vision-and-language representation learning with minimal modality-specific design, and do not use text-specific modules such as tokenization or automatic speech recognition (ASR).
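To make the "textless" design above concrete, here is a rough, hypothetical sketch in which video frames and audio spectrograms are both patchified and fed as one joint sequence to a shared transformer stack. Patch sizes, dimensions, and the absence of masking/positional details are simplifying assumptions, not TVLT's exact architecture.

```python
import torch
import torch.nn as nn

class TextlessVLSketch(nn.Module):
    def __init__(self, dim=256, depth=4, heads=4):
        super().__init__()
        self.vis_patch = nn.Conv2d(3, dim, kernel_size=16, stride=16)  # image patches
        self.aud_patch = nn.Conv2d(1, dim, kernel_size=16, stride=16)  # spectrogram patches
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)             # shared (homogeneous) blocks

    def forward(self, image, spectrogram):
        v = self.vis_patch(image).flatten(2).transpose(1, 2)           # (B, Nv, dim)
        a = self.aud_patch(spectrogram).flatten(2).transpose(1, 2)     # (B, Na, dim)
        tokens = torch.cat([v, a], dim=1)                              # one joint sequence
        return self.encoder(tokens)
```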
2 code implementations • 13 Jun 2022 • Yi-Lin Sung, Jaemin Cho, Mohit Bansal
LST saves 69% of the memory cost of fine-tuning the whole network, while other methods save only 26% at a similar parameter usage (hence, 2.7x more memory savings).
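The 2.7x figure follows from the two quoted savings percentages, assuming both are measured relative to the memory of full fine-tuning:

```python
# Ratio of memory saved by LST vs. the baseline methods (assumption: both
# percentages are relative to the memory of fine-tuning the whole network).
lst_savings, baseline_savings = 0.69, 0.26
print(round(lst_savings / baseline_savings, 1))  # 2.7 -> "2.7x more memory savings"
```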
1 code implementation • Findings (NAACL) 2022 • Jaemin Cho, Seunghyun Yoon, Ajinkya Kale, Franck Dernoncourt, Trung Bui, Mohit Bansal
Toward more descriptive and distinctive caption generation, we propose using CLIP, a multimodal encoder trained on huge sets of image-text pairs from the web, to compute multimodal similarity and use it as a reward function.
Ranked #26 on Image Captioning on COCO Captions
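A hypothetical sketch of the reward idea in the entry above: an image-caption similarity score serves as the reward inside a self-critical policy-gradient objective. The feature inputs stand in for CLIP embeddings, and the loss form is one common choice rather than the paper's exact training recipe.

```python
import torch.nn.functional as F

def similarity_reward(image_features, caption_features):
    """Image-caption cosine similarity used as the caption-level reward."""
    return F.cosine_similarity(image_features, caption_features, dim=-1)

def self_critical_loss(log_probs, sampled_reward, greedy_reward):
    """REINFORCE with a greedy-decoding baseline (self-critical training)."""
    advantage = (sampled_reward - greedy_reward).detach()
    return -(advantage * log_probs.sum(dim=-1)).mean()
```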
2 code implementations • ICCV 2023 • Jaemin Cho, Abhay Zala, Mohit Bansal
In this work, we investigate the visual reasoning capabilities and social biases of different text-to-image models, covering both multimodal transformer language models and diffusion models.
2 code implementations • 20 Dec 2021 • Revanth Gangi Reddy, Xilin Rui, Manling Li, Xudong Lin, Haoyang Wen, Jaemin Cho, Lifu Huang, Mohit Bansal, Avirup Sil, Shih-Fu Chang, Alexander Schwing, Heng Ji
Specifically, the task involves multi-hop questions that require reasoning over image-caption pairs to identify the grounded visual object being referred to and then predicting a span from the news body text to answer the question.
1 code implementation • CVPR 2022 • Yi-Lin Sung, Jaemin Cho, Mohit Bansal
Our results demonstrate that training the adapter with the weight-sharing technique (4.18% of total parameters for image-text tasks and 3.39% for video-text tasks) can match the performance of fine-tuning the entire model.
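A minimal sketch of the adapter-with-weight-sharing idea above: a small residual bottleneck module whose single instance is reused across layers, while the backbone stays frozen. The dimensions and the sharing granularity are illustrative assumptions, not the paper's exact configuration.

```python
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Small residual bottleneck inserted into a frozen backbone layer."""
    def __init__(self, hidden_dim=768, bottleneck_dim=64):
        super().__init__()
        self.down = nn.Linear(hidden_dim, bottleneck_dim)
        self.act = nn.ReLU()
        self.up = nn.Linear(bottleneck_dim, hidden_dim)

    def forward(self, hidden_states):
        return hidden_states + self.up(self.act(self.down(hidden_states)))

# Weight sharing: the same adapter instance is reused across layers/tasks,
# which is what keeps the trainable-parameter fraction at a few percent.
shared_adapter = BottleneckAdapter()
```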
1 code implementation • NeurIPS 2021 • Zineng Tang, Jaemin Cho, Hao Tan, Mohit Bansal
We train a multi-modal teacher model on a video-text dataset, and then transfer its knowledge to a student language model with a text dataset.
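A generic sketch of the teacher-to-student transfer described above, written as a standard distillation loss on output distributions; the temperature and scaling are illustrative choices, not necessarily the objective used in the paper.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence between softened teacher and student token distributions."""
    t = temperature
    student_logp = F.log_softmax(student_logits / t, dim=-1)
    teacher_prob = F.softmax(teacher_logits / t, dim=-1)
    return F.kl_div(student_logp, teacher_prob, reduction="batchmean") * (t * t)
```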
2 code implementations • 4 Feb 2021 • Jaemin Cho, Jie Lei, Hao Tan, Mohit Bansal
On 7 popular vision-and-language benchmarks, including visual question answering, referring expression comprehension, and visual commonsense reasoning, most of which have previously been modeled as discriminative tasks, our generative approach (with a single unified architecture) reaches performance comparable to recent task-specific state-of-the-art vision-and-language models.
Ranked #3 on Image Captioning on nocaps val
1 code implementation • EMNLP 2020 • Jaemin Cho, Jiasen Lu, Dustin Schwenk, Hannaneh Hajishirzi, Aniruddha Kembhavi
X-LXMERT's image generation capabilities rival state-of-the-art generative models, while its question answering and captioning abilities remain comparable to LXMERT.
1 code implementation • IJCNLP 2019 • Jaemin Cho, Minjoon Seo, Hannaneh Hajishirzi
The diversification stage uses a mixture of experts to sample different binary masks on the source sequence for diverse content selection.
Ranked #10 on Question Generation on SQuAD1.1
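A hypothetical sketch of the mixture-of-experts content selection from the entry above: each expert produces per-token keep probabilities over the source, and a sampled binary mask determines which content the generator focuses on. Shapes, dimensions, and the number of experts are illustrative.

```python
import torch
import torch.nn as nn

class SelectorExpert(nn.Module):
    """One expert: per-token keep probabilities over the source sequence."""
    def __init__(self, hidden_dim=256):
        super().__init__()
        self.scorer = nn.Linear(hidden_dim, 1)

    def forward(self, source_states):                                       # (B, T, H)
        keep_prob = torch.sigmoid(self.scorer(source_states)).squeeze(-1)   # (B, T)
        return torch.bernoulli(keep_prob)                                   # sampled binary focus mask

# A small mixture: each expert tends to focus on different source content,
# which is what drives diversity in the generated questions.
experts = nn.ModuleList(SelectorExpert() for _ in range(3))
```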
4 code implementations • NAACL 2018 • Yookoon Park, Jaemin Cho, Gunhee Kim
To solve the degeneration problem, we propose a novel model named Variational Hierarchical Conversation RNNs (VHCR), involving two key ideas of (1) using a hierarchical structure of latent variables, and (2) exploiting an utterance drop regularization.
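A rough sketch of the utterance drop regularizer mentioned above: during training, some utterance representations in the context are randomly zeroed out, so the model has to rely on the higher-level conversation latent rather than copying local context. The drop rate and the zero placeholder are illustrative assumptions.

```python
import torch

def utterance_drop(utterance_embs, drop_prob=0.25):
    """utterance_embs: (num_utterances, dim). Randomly zeroes whole utterance
    embeddings so the decoder must lean on the conversation-level latent."""
    keep = (torch.rand(utterance_embs.size(0)) > drop_prob).float().unsqueeze(-1)
    return utterance_embs * keep
```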