no code implementations • 15 Apr 2024 • Han Lin, Jaemin Cho, Abhay Zala, Mohit Bansal
Ctrl-Adapter provides diverse capabilities including image control, video control, video control with sparse frames, multi-condition control, compatibility with different backbones, adaptation to unseen control conditions, and video editing.
1 code implementation • 31 Mar 2024 • Qin Liu, Jaemin Cho, Mohit Bansal, Marc Niethammer
In light of this, we reintroduce this dense design into the generalist models, to facilitate the development of generalist models with high segmentation quality.
no code implementations • 18 Mar 2024 • Abhay Zala, Jaemin Cho, Han Lin, Jaehong Yoon, Mohit Bansal
Instead of directly employing LLMs as agents, can we use LLMs' reasoning capabilities to adaptively create training environments to help smaller embodied RL agents learn useful skills that they are weak at?
no code implementations • 11 Mar 2024 • Jialu Li, Jaemin Cho, Yi-Lin Sung, Jaehong Yoon, Mohit Bansal
In this paper, we introduce SELMA: Skill-Specific Expert Learning and Merging with Auto-Generated Data, a novel paradigm to improve the faithfulness of T2I models by fine-tuning models on automatically generated, multi-skill image-text datasets, with skill-specific expert learning and merging.
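The entry above describes learning skill-specific experts on auto-generated data and then merging them. Below is a minimal, hypothetical sketch of that idea, reduced to per-skill weight copies merged by uniform averaging; the function names and the averaging strategy are illustrative assumptions, not SELMA's actual implementation.

```python
import torch

def train_skill_expert(base_state, skill_dataset):
    """Placeholder: fine-tune a copy of the base weights on one automatically
    generated, skill-specific image-text dataset and return the expert weights."""
    expert_state = {k: v.clone() for k, v in base_state.items()}
    # ... skill-specific fine-tuning on skill_dataset would go here ...
    return expert_state

def merge_experts(expert_states):
    """Uniformly average the expert weights (one simple merging strategy)."""
    return {
        key: torch.stack([state[key] for state in expert_states]).mean(dim=0)
        for key in expert_states[0]
    }
```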
no code implementations • 4 Mar 2024 • David Wan, Jaemin Cho, Elias Stengel-Eskin, Mohit Bansal
Highlighting particularly relevant regions of an image can improve the performance of vision-language models (VLMs) on various vision-language (VL) tasks by guiding the model to attend more closely to these regions of interest.
no code implementations • 27 Oct 2023 • Jaemin Cho, Yushi Hu, Roopal Garg, Peter Anderson, Ranjay Krishna, Jason Baldridge, Mohit Bansal, Jordi Pont-Tuset, Su Wang
With extensive experimentation and human evaluation on a range of model configurations (LLM, VQA, and T2I), we empirically demonstrate that DSG addresses the challenges noted above.
no code implementations • 18 Oct 2023 • Abhay Zala, Han Lin, Jaemin Cho, Mohit Bansal
In the first stage, we use LLMs to generate and iteratively refine 'diagram plans' (in a planner-auditor feedback loop) which describe all the entities (objects and text labels), their relationships (arrows or lines), and their bounding box layouts.
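As a rough illustration of what an intermediate "diagram plan" and a planner-auditor refinement loop could look like, here is a hypothetical sketch; the field names and the llm_planner / llm_auditor callables are assumptions, not the paper's API.

```python
from dataclasses import dataclass, field

@dataclass
class Entity:
    name: str            # object or text label
    bbox: tuple          # (x, y, w, h) layout, e.g. in normalized coordinates

@dataclass
class DiagramPlan:
    entities: list = field(default_factory=list)
    relationships: list = field(default_factory=list)  # e.g. ("cell", "arrow", "nucleus")

def refine_plan(prompt, llm_planner, llm_auditor, max_rounds=3):
    """Planner drafts a plan; auditor critiques it; planner revises until clean."""
    plan = llm_planner(prompt, feedback=None)
    for _ in range(max_rounds):
        feedback = llm_auditor(prompt, plan)   # e.g. missing entities, overlapping boxes
        if not feedback:                       # auditor is satisfied
            break
        plan = llm_planner(prompt, feedback=feedback)
    return plan
```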
no code implementations • 26 Sep 2023 • Han Lin, Abhay Zala, Jaemin Cho, Mohit Bansal
Our experiments demonstrate that the VideoDirectorGPT framework substantially improves layout and movement control in both single- and multi-scene video generation and can generate multi-scene videos with visual consistency across scenes, while achieving performance competitive with state-of-the-art models in open-domain single-scene T2V generation.
no code implementations • 24 May 2023 • Jaemin Cho, Abhay Zala, Mohit Bansal
First, we introduce VPGen, an interpretable step-by-step T2I generation framework that decomposes T2I generation into three steps: object/count generation, layout generation, and image generation.
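A minimal sketch of the three-step decomposition described above; the three generator callables are placeholders (e.g., a fine-tuned LM for the first two steps and a layout-conditioned image generator for the last), not VPGen's actual interface.

```python
def vpgen_style_generate(prompt, object_generator, layout_generator, image_generator):
    """Three interpretable steps: objects/counts -> layouts -> image."""
    objects = object_generator(prompt)            # e.g. {"dog": 2, "frisbee": 1}
    layouts = layout_generator(prompt, objects)   # one bounding box per object instance
    return image_generator(prompt, layouts)       # layout-guided image synthesis
```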
1 code implementation • NeurIPS 2023 • Zhenhailong Wang, Ansel Blume, Sha Li, Genglin Liu, Jaemin Cho, Zineng Tang, Mohit Bansal, Heng Ji
Action knowledge involves the understanding of textual, visual, and temporal aspects of actions.
Ranked #19 on Video Question Answering on NExT-QA (using extra training data)
1 code implementation • NeurIPS 2023 • Shoubin Yu, Jaemin Cho, Prateek Yadav, Mohit Bansal
The SeViLA framework consists of two modules, a Localizer and an Answerer, both parameter-efficiently fine-tuned from BLIP-2.
Ranked #3 on Zero-Shot Video Question Answer on IntentQA (using extra training data)
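A minimal sketch of the localize-then-answer pipeline from the SeViLA entry above: score frames for question relevance, keep the top keyframes, then answer from those frames only. The localizer and answerer here are placeholders, not BLIP-2 code, and the top-k selection is an illustrative simplification.

```python
def localize_then_answer(frames, question, localizer, answerer, top_k=4):
    """Score frames for relevance, keep the top-k keyframes, then answer."""
    scores = [localizer(frame, question) for frame in frames]
    ranked = sorted(zip(scores, frames), key=lambda pair: pair[0], reverse=True)
    keyframes = [frame for _, frame in ranked[:top_k]]
    return answerer(keyframes, question)
```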
1 code implementation • 13 Apr 2023 • Jaemin Cho, Linjie Li, Zhengyuan Yang, Zhe Gan, Lijuan Wang, Mohit Bansal
In this paper, we propose LayoutBench, a diagnostic benchmark for layout-guided image generation that examines four categories of spatial control skills: number, position, size, and shape.
Ranked #1 on Layout-to-Image Generation on LayoutBench
1 code implementation • CVPR 2023 • Abhay Zala, Jaemin Cho, Satwik Kottur, Xilun Chen, Barlas Oğuz, Yashar Mehdad, Mohit Bansal
Our hierarchical benchmark consists of video retrieval, moment retrieval, and two novel moment segmentation and step captioning tasks.
1 code implementation • 21 Nov 2022 • Zineng Tang, Jaemin Cho, Jie Lei, Mohit Bansal
We present Perceiver-VL, a vision-and-language framework that efficiently handles high-dimensional multimodal inputs such as long videos and text.
2 code implementations • 28 Sep 2022 • Zineng Tang, Jaemin Cho, Yixin Nie, Mohit Bansal
In this work, we present the Textless Vision-Language Transformer (TVLT), where homogeneous transformer blocks take raw visual and audio inputs for vision-and-language representation learning with minimal modality-specific design, and do not use text-specific modules such as tokenization or automatic speech recognition (ASR).
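To make the "textless" design above concrete, here is a rough, hypothetical sketch in which video frames and audio spectrograms are both patchified and fed as one joint sequence to a shared transformer stack. Patch sizes, dimensions, and the absence of masking/positional details are simplifying assumptions, not TVLT's exact architecture.

```python
import torch
import torch.nn as nn

class TextlessVLSketch(nn.Module):
    def __init__(self, dim=256, depth=4, heads=4):
        super().__init__()
        self.vis_patch = nn.Conv2d(3, dim, kernel_size=16, stride=16)  # image patches
        self.aud_patch = nn.Conv2d(1, dim, kernel_size=16, stride=16)  # spectrogram patches
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)             # shared (homogeneous) blocks

    def forward(self, image, spectrogram):
        v = self.vis_patch(image).flatten(2).transpose(1, 2)           # (B, Nv, dim)
        a = self.aud_patch(spectrogram).flatten(2).transpose(1, 2)     # (B, Na, dim)
        tokens = torch.cat([v, a], dim=1)                              # one joint sequence
        return self.encoder(tokens)
```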
2 code implementations • 13 Jun 2022 • Yi-Lin Sung, Jaemin Cho, Mohit Bansal
LST saves 69% of the memory cost of fine-tuning the whole network, while other methods save only 26% at a similar parameter usage (hence, 2.7x more memory savings).
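The 2.7x figure follows from the two quoted savings percentages, assuming both are measured relative to the memory of full fine-tuning:

```python
# Ratio of memory saved by LST vs. the baseline methods (assumption: both
# percentages are relative to the memory of fine-tuning the whole network).
lst_savings, baseline_savings = 0.69, 0.26
print(round(lst_savings / baseline_savings, 1))  # 2.7 -> "2.7x more memory savings"
```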
1 code implementation • Findings (NAACL) 2022 • Jaemin Cho, Seunghyun Yoon, Ajinkya Kale, Franck Dernoncourt, Trung Bui, Mohit Bansal
Toward more descriptive and distinctive caption generation, we propose using CLIP, a multimodal encoder trained on huge sets of image-text pairs from the web, to compute multimodal similarity and use it as a reward function.
Ranked #26 on Image Captioning on COCO Captions
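A hypothetical sketch of the reward idea in the entry above: an image-caption similarity score serves as the reward inside a self-critical policy-gradient objective. The feature inputs stand in for CLIP embeddings, and the loss form is one common choice rather than the paper's exact training recipe.

```python
import torch.nn.functional as F

def similarity_reward(image_features, caption_features):
    """Image-caption cosine similarity used as the caption-level reward."""
    return F.cosine_similarity(image_features, caption_features, dim=-1)

def self_critical_loss(log_probs, sampled_reward, greedy_reward):
    """REINFORCE with a greedy-decoding baseline (self-critical training)."""
    advantage = (sampled_reward - greedy_reward).detach()
    return -(advantage * log_probs.sum(dim=-1)).mean()
```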
2 code implementations • ICCV 2023 • Jaemin Cho, Abhay Zala, Mohit Bansal
In this work, we investigate the visual reasoning capabilities and social biases of different text-to-image models, covering both multimodal transformer language models and diffusion models.
2 code implementations • 20 Dec 2021 • Revanth Gangi Reddy, Xilin Rui, Manling Li, Xudong Lin, Haoyang Wen, Jaemin Cho, Lifu Huang, Mohit Bansal, Avirup Sil, Shih-Fu Chang, Alexander Schwing, Heng Ji
Specifically, the task involves multi-hop questions that require reasoning over image-caption pairs to identify the grounded visual object being referred to and then predicting a span from the news body text to answer the question.
1 code implementation • CVPR 2022 • Yi-Lin Sung, Jaemin Cho, Mohit Bansal
Our results demonstrate that training the adapter with the weight-sharing technique (4.18% of total parameters for image-text tasks and 3.39% for video-text tasks) can match the performance of fine-tuning the entire model.
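A minimal sketch of the adapter-with-weight-sharing idea above: a small residual bottleneck module whose single instance is reused across layers, while the backbone stays frozen. The dimensions and the sharing granularity are illustrative assumptions, not the paper's exact configuration.

```python
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Small residual bottleneck inserted into a frozen backbone layer."""
    def __init__(self, hidden_dim=768, bottleneck_dim=64):
        super().__init__()
        self.down = nn.Linear(hidden_dim, bottleneck_dim)
        self.act = nn.ReLU()
        self.up = nn.Linear(bottleneck_dim, hidden_dim)

    def forward(self, hidden_states):
        return hidden_states + self.up(self.act(self.down(hidden_states)))

# Weight sharing: the same adapter instance is reused across layers/tasks,
# which is what keeps the trainable-parameter fraction at a few percent.
shared_adapter = BottleneckAdapter()
```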
1 code implementation • NeurIPS 2021 • Zineng Tang, Jaemin Cho, Hao Tan, Mohit Bansal
We train a multi-modal teacher model on a video-text dataset, and then transfer its knowledge to a student language model with a text dataset.
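A generic sketch of the teacher-to-student transfer described above, written as a standard distillation loss on output distributions; the temperature and scaling are illustrative choices, not necessarily the objective used in the paper.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence between softened teacher and student token distributions."""
    t = temperature
    student_logp = F.log_softmax(student_logits / t, dim=-1)
    teacher_prob = F.softmax(teacher_logits / t, dim=-1)
    return F.kl_div(student_logp, teacher_prob, reduction="batchmean") * (t * t)
```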
2 code implementations • 4 Feb 2021 • Jaemin Cho, Jie Lei, Hao Tan, Mohit Bansal
On 7 popular vision-and-language benchmarks, including visual question answering, referring expression comprehension, and visual commonsense reasoning, most of which have previously been modeled as discriminative tasks, our generative approach (with a single unified architecture) reaches performance comparable to recent task-specific state-of-the-art vision-and-language models.
Ranked #3 on Image Captioning on nocaps val
1 code implementation • EMNLP 2020 • Jaemin Cho, Jiasen Lu, Dustin Schwenk, Hannaneh Hajishirzi, Aniruddha Kembhavi
X-LXMERT's image generation capabilities rival state-of-the-art generative models, while its question answering and captioning abilities remain comparable to LXMERT.
1 code implementation • IJCNLP 2019 • Jaemin Cho, Minjoon Seo, Hannaneh Hajishirzi
The diversification stage uses a mixture of experts to sample different binary masks on the source sequence for diverse content selection.
Ranked #10 on Question Generation on SQuAD1.1
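A hypothetical sketch of the mixture-of-experts content selection from the entry above: each expert produces per-token keep probabilities over the source, and a sampled binary mask determines which content the generator focuses on. Shapes, dimensions, and the number of experts are illustrative.

```python
import torch
import torch.nn as nn

class SelectorExpert(nn.Module):
    """One expert: per-token keep probabilities over the source sequence."""
    def __init__(self, hidden_dim=256):
        super().__init__()
        self.scorer = nn.Linear(hidden_dim, 1)

    def forward(self, source_states):                                       # (B, T, H)
        keep_prob = torch.sigmoid(self.scorer(source_states)).squeeze(-1)   # (B, T)
        return torch.bernoulli(keep_prob)                                   # sampled binary focus mask

# A small mixture: each expert tends to focus on different source content,
# which is what drives diversity in the generated questions.
experts = nn.ModuleList(SelectorExpert() for _ in range(3))
```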
4 code implementations • NAACL 2018 • Yookoon Park, Jaemin Cho, Gunhee Kim
To solve the degeneration problem, we propose a novel model named Variational Hierarchical Conversation RNNs (VHCR), involving two key ideas of (1) using a hierarchical structure of latent variables, and (2) exploiting an utterance drop regularization.
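A rough sketch of the utterance drop regularizer mentioned above: during training, some utterance representations in the context are randomly zeroed out, so the model has to rely on the higher-level conversation latent rather than copying local context. The drop rate and the zero placeholder are illustrative assumptions.

```python
import torch

def utterance_drop(utterance_embs, drop_prob=0.25):
    """utterance_embs: (num_utterances, dim). Randomly zeroes whole utterance
    embeddings so the decoder must lean on the conversation-level latent."""
    keep = (torch.rand(utterance_embs.size(0)) > drop_prob).float().unsqueeze(-1)
    return utterance_embs * keep
```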