Search Results for author: Yong Jae Lee

Found 84 papers, 48 papers with code

Stay-Positive: A Case for Ignoring Real Image Features in Fake Image Detection

no code implementations • 11 Feb 2025 • Anirudh Sundara Rajan, Yong Jae Lee

Additionally, unlike detectors that associate artifacts with real images, those that focus purely on fake artifacts are better at detecting inpainted real images.

Fake Image Detection

Building a Mind Palace: Structuring Environment-Grounded Semantic Graphs for Effective Long Video Analysis with LLMs

no code implementations • 8 Jan 2025 • Zeyi Huang, Yuyang Ji, Xiaofang Wang, Nikhil Mehta, Tong Xiao, DongHyun Lee, Sigmund Vanvalkenburgh, Shengxin Zha, Bolin Lai, Licheng Yu, Ning Zhang, Yong Jae Lee, Miao Liu

Long-form video understanding with Large Vision Language Models is challenged by the need to analyze temporally dispersed yet spatially concentrated key moments within limited context windows.

EgoSchema • Object Tracking +1

On the Effectiveness of Dataset Alignment for Fake Image Detection

no code implementations • 15 Oct 2024 • Anirudh Sundara Rajan, Utkarsh Ojha, Jedidiah Schloesser, Yong Jae Lee

In this work, we argue that in addition to these algorithmic choices, we also require a well-aligned dataset of real/fake images to train a robust detector.

Denoising • Fake Image Detection +1

TemporalBench: Benchmarking Fine-grained Temporal Understanding for Multimodal Video Models

1 code implementation • 14 Oct 2024 • Mu Cai, Reuben Tan, Jianrui Zhang, Bocheng Zou, Kai Zhang, Feng Yao, Fangrui Zhu, Jing Gu, Yiwu Zhong, Yuzhang Shang, Yao Dou, Jaden Park, Jianfeng Gao, Yong Jae Lee, Jianwei Yang

TemporalBench consists of ~10K video question-answer pairs, derived from ~2K high-quality human annotations detailing the temporal dynamics in video clips.

2k • Benchmarking +4

Vinoground: Scrutinizing LMMs over Dense Temporal Reasoning with Short Videos

1 code implementation • 3 Oct 2024 • Jianrui Zhang, Mu Cai, Yong Jae Lee

There has been growing sentiment recently that modern large multimodal models (LMMs) have addressed most of the key challenges related to short video comprehension.

counterfactual

Removing Distributional Discrepancies in Captions Improves Image-Text Alignment

no code implementations • 1 Oct 2024 • Yuheng Li, Haotian Liu, Mu Cai, Yijun Li, Eli Shechtman, Zhe Lin, Yong Jae Lee, Krishna Kumar Singh

In this paper, we introduce a model designed to improve the prediction of image-text alignment, targeting the challenge of compositional understanding in current vision-language models.

Language Modeling • Language Modelling

Interpolating Video-LLMs: Toward Longer-sequence LMMs in a Training-free Manner

no code implementations • 19 Sep 2024 • Yuzhang Shang, Bingxin Xu, Weitai Kang, Mu Cai, Yuheng Li, Zehao Wen, Zhen Dong, Kurt Keutzer, Yong Jae Lee, Yan Yan

In this paper, we first identify the primary challenges in interpolating Video-LLMs: (1) the video encoder and modality alignment projector are fixed, preventing the integration of additional frames into Video-LLMs, and (2) the LLM backbone is limited in its context length, which complicates the processing of an increased number of video tokens.
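
To make challenge (2) concrete, a common training-free workaround is positional interpolation: position indices are rescaled to fit inside the backbone's trained context window so that more video tokens can be accommodated without fine-tuning. The sketch below is a generic illustration of that idea, not necessarily the paper's exact interpolation scheme:

```python
import torch

def interpolate_position_ids(num_tokens: int, trained_ctx: int) -> torch.Tensor:
    """Compress position ids so that num_tokens fit a window of trained_ctx.

    Illustrative training-free position interpolation: indices are scaled
    into [0, trained_ctx) instead of extending beyond the trained range.
    """
    scale = min(1.0, trained_ctx / num_tokens)
    return (torch.arange(num_tokens, dtype=torch.float32) * scale).long()

# e.g. 8192 video tokens squeezed into a 4096-position trained window
print(interpolate_position_ids(8192, 4096)[:6])  # tensor([0, 0, 1, 1, 2, 2])
```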

CHARTOM: A Visual Theory-of-Mind Benchmark for Multimodal Large Language Models

1 code implementation • 26 Aug 2024 • Shubham Bharti, Shiyun Cheng, Jihyun Rho, Jianrui Zhang, Mu Cai, Yong Jae Lee, Martina Rau, Xiaojin Zhu

We benchmarked leading LLMs as of late 2024, including GPT, Claude, Gemini, Qwen, Llama, and LLaVA, on the CHARTOM dataset and found that the benchmark was challenging for all of them, suggesting room for future large language models to improve.

Language Modeling • Language Modelling

VGBench: Evaluating Large Language Models on Vector Graphics Understanding and Generation

1 code implementation • 15 Jul 2024 • Bocheng Zou, Mu Cai, Jianrui Zhang, Yong Jae Lee

In the realm of vision models, the primary mode of representation is using pixels to rasterize the visual world.

Vector Graphics

LLaRA: Supercharging Robot Learning Data for Vision-Language Policy

1 code implementation • 28 Jun 2024 • Xiang Li, Cristina Mata, Jongwoo Park, Kumara Kahatapitiya, Yoo Sung Jang, Jinghuan Shang, Kanchana Ranasinghe, Ryan Burgert, Mu Cai, Yong Jae Lee, Michael S. Ryoo

In this work, we introduce LLaRA: Large Language and Robotics Assistant, a framework that formulates robot action policy as visuo-textual conversations and enables an efficient transfer of a pretrained VLM into a powerful VLA, motivated by the success of visual instruction tuning in Computer Vision.

Vision-Language-Action • World Knowledge

MATE: Meet At The Embedding -- Connecting Images with Long Texts

no code implementations • 26 Jun 2024 • Young Kyun Jang, Junmo Kang, Yong Jae Lee, Donghyun Kim

While advancements in Vision Language Models (VLMs) have significantly improved the alignment of visual and textual data, these models primarily focus on aligning images with short descriptive captions.

Cross-Modal Retrieval • Descriptive

Yo'LLaVA: Your Personalized Language and Vision Assistant

1 code implementation • 13 Jun 2024 • Thao Nguyen, Haotian Liu, Yuheng Li, Mu Cai, Utkarsh Ojha, Yong Jae Lee

In this paper, we introduce the novel task of personalizing LMMs, so that they can have conversations about a specific subject.

Image Captioning • Question Answering +1

Matryoshka Multimodal Models

no code implementations • 27 May 2024 • Mu Cai, Jianwei Yang, Jianfeng Gao, Yong Jae Lee

Large Multimodal Models (LMMs) such as LLaVA have shown strong performance in visual-linguistic reasoning.

Language Modelling • Large Language Model

LLaVA-PruMerge: Adaptive Token Reduction for Efficient Large Multimodal Models

1 code implementation • 22 Mar 2024 • Yuzhang Shang, Mu Cai, Bingxin Xu, Yong Jae Lee, Yan Yan

In response, we propose PruMerge, a novel adaptive visual token reduction strategy that significantly reduces the number of visual tokens without compromising the performance of LMMs.

Language Modelling • Large Language Model +4
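
The general shape of attention-based visual token reduction can be sketched in a few lines: keep the visual tokens that receive the most [CLS] attention and fold the rest into their nearest kept token. This is a deliberately simplified stand-in (a fixed top-k in place of the paper's adaptive selection rule):

```python
import torch

def prune_and_merge(tokens, cls_attn, keep):
    """Simplified PruMerge-style reduction (illustrative, not the paper's exact rule).

    tokens:   (N, D) visual tokens from the vision encoder
    cls_attn: (N,)   attention weight of the [CLS] token on each visual token
    keep:     number of tokens to retain
    """
    kept_idx = cls_attn.topk(keep).indices          # most-attended tokens survive
    kept = tokens[kept_idx]
    assign = (tokens @ kept.T).argmax(dim=1)        # nearest kept token for every token
    out = torch.zeros_like(kept)
    w = torch.zeros(keep)
    out.index_add_(0, assign, tokens * cls_attn[:, None])   # attention-weighted merge
    w.index_add_(0, assign, cls_attn)
    return out / w.clamp_min(1e-6)[:, None]

# toy usage: 576 CLIP patch tokens reduced to 64 merged tokens
print(prune_and_merge(torch.randn(576, 1024), torch.rand(576), 64).shape)  # (64, 1024)
```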

LLM Inference Unveiled: Survey and Roofline Model Insights

2 code implementations • 26 Feb 2024 • Zhihang Yuan, Yuzhang Shang, Yang Zhou, Zhen Dong, Zhe Zhou, Chenhao Xue, Bingzhe Wu, Zhikai Li, Qingyi Gu, Yong Jae Lee, Yan Yan, Beidi Chen, Guangyu Sun, Kurt Keutzer

Our survey stands out from traditional literature reviews by not only summarizing the current state of research but also by introducing a framework based on the roofline model for the systematic analysis of LLM inference techniques.

Knowledge Distillation • Language Modelling +5
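
For readers unfamiliar with it, the roofline model is a one-line bound: attainable throughput is the minimum of peak compute and memory bandwidth times arithmetic intensity. A minimal sketch with illustrative hardware numbers (not tied to any particular GPU):

```python
def roofline_tflops(arith_intensity: float, peak_tflops: float, mem_bw_tbs: float) -> float:
    """Attainable throughput = min(peak compute, memory bandwidth * arithmetic intensity).
    arith_intensity is in FLOPs/byte and bandwidth in TB/s, so the product is TFLOP/s."""
    return min(peak_tflops, mem_bw_tbs * arith_intensity)

# Illustrative numbers. Single-token LLM decoding reuses each weight byte only
# once, so its arithmetic intensity is low and it lands on the bandwidth roof.
print(roofline_tflops(1.0,   peak_tflops=300.0, mem_bw_tbs=2.0))  # 2.0   -> memory-bound
print(roofline_tflops(300.0, peak_tflops=300.0, mem_bw_tbs=2.0))  # 300.0 -> compute-bound
```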

CounterCurate: Enhancing Physical and Semantic Visio-Linguistic Compositional Reasoning via Counterfactual Examples

1 code implementation • 20 Feb 2024 • Jianrui Zhang, Mu Cai, Tengyang Xie, Yong Jae Lee

We first spotlight the near-chance performance of multimodal models like CLIP and LLaVA in physically grounded compositional reasoning.

counterfactual • Data Augmentation +2

Edit One for All: Interactive Batch Image Editing

no code implementations • CVPR 2024 • Thao Nguyen, Utkarsh Ojha, Yuheng Li, Haotian Liu, Yong Jae Lee

With increased human control, it is now possible to edit an image in a plethora of ways: from specifying in text what we want to change, to directly dragging the contents of the image in an interactive, point-based manner.

All

Interfacing Foundation Models' Embeddings

1 code implementation • 12 Dec 2023 • Xueyan Zou, Linjie Li, JianFeng Wang, Jianwei Yang, Mingyu Ding, Junyi Wei, Zhengyuan Yang, Feng Li, Hao Zhang, Shilong Liu, Arul Aravinthan, Yong Jae Lee, Lijuan Wang

To further unleash the power of foundation models, we present FIND, a generalized interface for aligning foundation models' embeddings, with unified image- and dataset-level understanding spanning modality and granularity.

Decoder • Image Segmentation +3

Diversify, Don't Fine-Tune: Scaling Up Visual Recognition Training with Synthetic Images

no code implementations • 4 Dec 2023 • Zhuoran Yu, Chenchen Zhu, Sean Culatana, Raghuraman Krishnamoorthi, Fanyi Xiao, Yong Jae Lee

We present a new framework leveraging off-the-shelf generative models to generate synthetic training images, addressing multiple challenges: class name ambiguity, lack of diversity in naive prompts, and domain shifts.

Diversity • Domain Generalization +1

ViP-LLaVA: Making Large Multimodal Models Understand Arbitrary Visual Prompts

4 code implementations • CVPR 2024 • Mu Cai, Haotian Liu, Dennis Park, Siva Karthik Mustikovela, Gregory P. Meyer, Yuning Chai, Yong Jae Lee

Furthermore, we present ViP-Bench, a comprehensive benchmark to assess the capability of models in understanding visual prompts across multiple dimensions, enabling future research in this domain.

Visual Commonsense Reasoning • Visual Prompting

Testing learning-enabled cyber-physical systems with Large-Language Models: A Formal Approach

no code implementations • 13 Nov 2023 • Xi Zheng, Aloysius K. Mok, Ruzica Piskac, Yong Jae Lee, Bhaskar Krishnamachari, Dakai Zhu, Oleg Sokolsky, Insup Lee

The integration of machine learning (ML) into cyber-physical systems (CPS) offers significant benefits, including enhanced efficiency, predictive capabilities, real-time responsiveness, and the enabling of autonomous operations.

Autonomous Vehicles

A Sentence Speaks a Thousand Images: Domain Generalization through Distilling CLIP with Language Guidance

1 code implementation • ICCV 2023 • Zeyi Huang, Andy Zhou, Zijian Lin, Mu Cai, Haohan Wang, Yong Jae Lee

Domain generalization studies the problem of training a model with samples from several domains (or distributions) and then testing the model with samples from a new, unseen domain.

Domain Generalization • Knowledge Distillation +3

Investigating the Catastrophic Forgetting in Multimodal Large Language Models

no code implementations • 19 Sep 2023 • Yuexiang Zhai, Shengbang Tong, Xiao Li, Mu Cai, Qing Qu, Yong Jae Lee, Yi Ma

However, catastrophic forgetting, a notorious phenomenon in which the fine-tuned model fails to retain performance comparable to that of the pre-trained model, remains an inherent problem in multimodal LLMs (MLLMs).

Image Classification • Language Modelling +2

Visual Instruction Inversion: Image Editing via Visual Prompting

1 code implementation • 26 Jul 2023 • Thao Nguyen, Yuheng Li, Utkarsh Ojha, Yong Jae Lee

Given pairs of examples that represent the "before" and "after" images of an edit, our goal is to learn a text-based editing direction that can be used to perform the same edit on new images.

Visual Prompting

Benchmarking and Analyzing Generative Data for Visual Recognition

no code implementations • 25 Jul 2023 • Bo Li, Haotian Liu, Liangyu Chen, Yong Jae Lee, Chunyuan Li, Ziwei Liu

Advancements in large pre-trained generative models have expanded their potential as effective data generators in visual recognition.

Benchmarking • Retrieval

Generate Anything Anywhere in Any Scene

no code implementations • 29 Jun 2023 • Yuheng Li, Haotian Liu, Yangming Wen, Yong Jae Lee

Text-to-image diffusion models have attracted considerable interest due to their wide applicability across diverse fields.

Data Augmentation • Object

Leveraging Large Language Models for Scalable Vector Graphics-Driven Image Understanding

no code implementations • 9 Jun 2023 • Mu Cai, Zeyi Huang, Yuheng Li, Utkarsh Ojha, Haohan Wang, Yong Jae Lee

To study what the LLM can do with this XML-based textual description of images, we test the LLM on three broad computer vision tasks: (i) visual reasoning and question answering, (ii) image classification under distribution shift and few-shot learning, and (iii) generating new images using visual prompting.

Few-Shot Learning • Image Classification +6
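
The setup can be illustrated in a few lines: the image is handed to a text-only LLM as raw SVG markup inside a prompt. The SVG and the prompt wording below are invented for illustration and may differ from the paper's prompts:

```python
# Illustrative only: pose a visual question to a text-only LLM by passing
# the image as SVG markup. The SVG and prompt wording here are made up.
svg = """<svg xmlns="http://www.w3.org/2000/svg" width="64" height="64">
  <rect x="8" y="8" width="48" height="48" fill="blue"/>
  <circle cx="32" cy="32" r="12" fill="red"/>
</svg>"""

prompt = (
    "The following SVG code describes an image.\n"
    f"{svg}\n"
    "Question: what shape is drawn on top of the blue square?"
)
print(prompt)  # feed this string to any text-only LLM
```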

Visual Instruction Tuning

11 code implementations • NeurIPS 2023 • Haotian Liu, Chunyuan Li, Qingyang Wu, Yong Jae Lee

Instruction tuning large language models (LLMs) using machine-generated instruction-following data has improved zero-shot capabilities on new tasks, but the idea is less explored in the multimodal field.

1 Image, 2*2 Stitching • 3D Question Answering (3D-QA) +9
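
For context, a single machine-generated training record in the visual instruction-tuning style that LLaVA popularized looks roughly like this (field names follow the released LLaVA data; the values are invented):

```python
# One LLaVA-style visual instruction-tuning record (values invented).
record = {
    "id": "000001",
    "image": "coco/train2017/000000000001.jpg",
    "conversations": [
        {"from": "human", "value": "<image>\nWhat is unusual about this image?"},
        {"from": "gpt", "value": "A man is ironing clothes on the back of a moving taxi."},
    ],
}
```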

Segment Everything Everywhere All at Once

3 code implementations • NeurIPS 2023 • Xueyan Zou, Jianwei Yang, Hao Zhang, Feng Li, Linjie Li, JianFeng Wang, Lijuan Wang, Jianfeng Gao, Yong Jae Lee

In SEEM, we propose a novel decoding mechanism that enables diverse prompting for all types of segmentation tasks, aiming at a universal segmentation interface that behaves like large language models (LLMs).

All • Decoder +6

InPL: Pseudo-labeling the Inliers First for Imbalanced Semi-supervised Learning

no code implementations • 13 Mar 2023 • Zhuoran Yu, Yin Li, Yong Jae Lee

Without relying on model confidence, we propose to measure whether an unlabeled sample is likely to be "in-distribution", i.e., close to the current training data.

Out-of-Distribution Detection

Towards Universal Fake Image Detectors that Generalize Across Generative Models

2 code implementations • CVPR 2023 • Utkarsh Ojha, Yuheng Li, Yong Jae Lee

In this work, we first show that the existing paradigm, which consists of training a deep network for real-vs-fake classification, fails to detect fake images from newer breeds of generative models when trained to detect GAN fake images.

Classification • Language Modeling +1

EnergyMatch: Energy-based Pseudo-Labeling for Semi-Supervised Learning

no code implementations • 13 Jun 2022 • Zhuoran Yu, Yin Li, Yong Jae Lee

However, it has been shown that softmax-based confidence scores in deep networks can be arbitrarily high for samples far from the training data, and thus, the pseudo-labels for even high-confidence unlabeled samples may still be unreliable.

Out-of-Distribution Detection
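
The quantity this line of work builds on is the free-energy score computed from the logits, E(x) = -T * logsumexp(f(x)/T), which reflects the overall logit magnitude rather than only the relative margins that softmax sees. A minimal sketch (the threshold below is illustrative, not the paper's exact rule):

```python
import torch
import torch.nn.functional as F

def energy_score(logits: torch.Tensor, T: float = 1.0) -> torch.Tensor:
    """Free energy E(x) = -T * logsumexp(f(x)/T); lower means more in-distribution."""
    return -T * torch.logsumexp(logits / T, dim=-1)

logits = torch.tensor([[8.0, 6.0, 6.0],    # large logits overall: near the training data
                       [2.0, 0.0, 0.0]])   # same relative margins, much smaller logits
print(F.softmax(logits, dim=-1).max(-1).values)  # both ~0.79: confidence can't tell them apart
print(energy_score(logits))                      # ~[-8.2, -2.2]: energy can
mask = energy_score(logits) < -5.0               # pseudo-label only low-energy samples
```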

The Two Dimensions of Worst-case Training and the Integrated Effect for Out-of-domain Generalization

1 code implementation • 9 Apr 2022 • Zeyi Huang, Haohan Wang, Dong Huang, Yong Jae Lee, Eric P. Xing

Training with an emphasis on "hard-to-learn" components of the data has proven to be an effective method for improving the generalization of machine learning models, especially in settings where robustness (e.g., generalization across distributions) is valued.

BIG-bench Machine Learning • Domain Generalization

End-to-End Instance Edge Detection

no code implementations • 6 Apr 2022 • Xueyan Zou, Haotian Liu, Yong Jae Lee

We demonstrate highly competitive instance edge detection performance compared to state-of-the-art baselines, and also show that the proposed task and loss are complementary to instance segmentation and object detection.

Decoder • Edge Detection +6

GIRAFFE HD: A High-Resolution 3D-aware Generative Model

1 code implementation • CVPR 2022 • Yang Xue, Yuheng Li, Krishna Kumar Singh, Yong Jae Lee

3D-aware generative models have shown that the introduction of 3D information can lead to more controllable image generation.

Disentanglement • Image Generation +2

Masked Discrimination for Self-Supervised Learning on Point Clouds

1 code implementation • 21 Mar 2022 • Haotian Liu, Mu Cai, Yong Jae Lee

Masked autoencoding has achieved great success for self-supervised learning in the image and language domains.

3D Shape Classification • Binary Classification +4

The Two Dimensions of Worst-Case Training and Their Integrated Effect for Out-of-Domain Generalization

no code implementations • CVPR 2022 • Zeyi Huang, Haohan Wang, Dong Huang, Yong Jae Lee, Eric P. Xing

Training with an emphasis on "hard-to-learn" components of the data has proven to be an effective method for improving the generalization of machine learning models, especially in settings where robustness (e.g., generalization across distributions) is valued.

BIG-bench Machine Learning • Domain Generalization

Toward Learning Human-aligned Cross-domain Robust Models by Countering Misaligned Features

no code implementations • 5 Nov 2021 • Haohan Wang, Zeyi Huang, Hanlin Zhang, Yong Jae Lee, Eric Xing

Machine learning has demonstrated remarkable prediction accuracy on i.i.d. data, but this accuracy often drops when tested with data from another distribution.

BIG-bench Machine Learning

Progressive Temporal Feature Alignment Network for Video Inpainting

1 code implementation • CVPR 2021 • Xueyan Zou, Linjie Yang, Ding Liu, Yong Jae Lee

To achieve this goal, it is necessary to find correspondences from neighbouring frames to faithfully hallucinate the unknown content.

Optical Flow Estimation • Video Inpainting

Generating Furry Cars: Disentangling Object Shape & Appearance across Multiple Domains

no code implementations • 5 Apr 2021 • Utkarsh Ojha, Krishna Kumar Singh, Yong Jae Lee

We consider the novel task of learning disentangled representations of object shape and appearance across multiple domains (e.g., dogs and cars).

Disentanglement • Object

Generating Furry Cars: Disentangling Object Shape and Appearance across Multiple Domains

no code implementations • ICLR 2021 • Utkarsh Ojha, Krishna Kumar Singh, Yong Jae Lee

We consider the novel task of learning disentangled representations of object shape and appearance across multiple domains (e.g., dogs and cars).

Disentanglement • Object

YolactEdge: Real-time Instance Segmentation on the Edge

2 code implementations • 22 Dec 2020 • Haotian Liu, Rafael A. Rivera Soto, Fanyi Xiao, Yong Jae Lee

We propose YolactEdge, the first competitive instance segmentation approach that runs on small edge devices at real-time speeds.

Real-time Instance Segmentation • Semantic Segmentation

Delving Deeper into Anti-aliasing in ConvNets

2 code implementations • 21 Aug 2020 • Xueyan Zou, Fanyi Xiao, Zhiding Yu, Yong Jae Lee

Aliasing refers to the phenomenon in which high-frequency signals degenerate into completely different ones after sampling.

Instance Segmentation • Segmentation +1
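
The classical remedy this work builds on is to low-pass filter feature maps before subsampling them (blur-then-stride). Below is a minimal fixed-kernel sketch of that baseline; the paper itself studies richer, content-adaptive filtering, so this is only to make the aliasing argument concrete:

```python
import torch
import torch.nn.functional as F

def blur_pool(x: torch.Tensor, stride: int = 2) -> torch.Tensor:
    """Anti-aliased downsampling: low-pass (binomial) blur, then subsample.
    x: (B, C, H, W). A fixed-blur baseline, not the paper's adaptive variant."""
    k1d = torch.tensor([1.0, 2.0, 1.0])
    k = k1d[:, None] * k1d[None, :]                   # 3x3 binomial kernel
    k = (k / k.sum()).expand(x.shape[1], 1, 3, 3)     # one kernel per channel
    return F.conv2d(x, k, stride=stride, padding=1, groups=x.shape[1])

print(blur_pool(torch.randn(1, 3, 32, 32)).shape)  # torch.Size([1, 3, 16, 16])
```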

Password-conditioned Anonymization and Deanonymization with Face Identity Transformers

1 code implementation • 26 Nov 2019 • Xiuye Gu, Weixin Luo, Michael S. Ryoo, Yong Jae Lee

Cameras are prevalent in our daily lives, and enable many useful systems built upon computer vision technologies such as smart cameras and home robots for service applications.

MixNMatch: Multifactor Disentanglement and Encoding for Conditional Image Generation

3 code implementations • CVPR 2020 • Yuheng Li, Krishna Kumar Singh, Utkarsh Ojha, Yong Jae Lee

We present MixNMatch, a conditional generative model that learns to disentangle and encode background, object pose, shape, and texture from real images with minimal supervision, for mix-and-match image generation.

Conditional Image Generation • Disentanglement

Elastic-InfoGAN: Unsupervised Disentangled Representation Learning in Class-Imbalanced Data

1 code implementation • NeurIPS 2020 • Utkarsh Ojha, Krishna Kumar Singh, Cho-Jui Hsieh, Yong Jae Lee

We propose a novel unsupervised generative model that learns to disentangle object identity from other low-level aspects in class-imbalanced data.

Object • Representation Learning

FineGAN: Unsupervised Hierarchical Disentanglement for Fine-Grained Object Generation and Discovery

1 code implementation • CVPR 2019 • Krishna Kumar Singh, Utkarsh Ojha, Yong Jae Lee

We propose FineGAN, a novel unsupervised GAN framework, which disentangles the background, object shape, and object appearance to hierarchically generate images of fine-grained object categories.

Conditional Image Generation • Disentanglement +3

DOCK: Detecting Objects by transferring Common-sense Knowledge

no code implementations • ECCV 2018 • Krishna Kumar Singh, Santosh Divvala, Ali Farhadi, Yong Jae Lee

We present a scalable approach for Detecting Objects by transferring Common-sense Knowledge (DOCK) from source to target categories.

Attribute • Common Sense Reasoning +3

Learning to Anonymize Faces for Privacy Preserving Action Detection

1 code implementation • ECCV 2018 • Zhongzheng Ren, Yong Jae Lee, Michael S. Ryoo

The end result is a video anonymizer that performs pixel-level modifications to anonymize each person's face, with minimal effect on action detection performance.

Action Detection • Privacy Preserving

Who Will Share My Image? Predicting the Content Diffusion Path in Online Social Networks

no code implementations • 25 May 2017 • Wenjian Hu, Krishna Kumar Singh, Fanyi Xiao, Jinyoung Han, Chen-Nee Chuah, Yong Jae Lee

Content popularity prediction has been extensively studied due to its importance and interest for both users and hosts of social media sites like Facebook, Instagram, Twitter, and Pinterest.

Weakly-supervised Visual Grounding of Phrases with Linguistic Structures

no code implementations • CVPR 2017 • Fanyi Xiao, Leonid Sigal, Yong Jae Lee

We propose a weakly-supervised approach that takes image-sentence pairs as input and learns to visually ground (i.e., localize) arbitrary linguistic phrases, in the form of spatial attention masks.

Sentence • Visual Grounding

Identifying First-person Camera Wearers in Third-person Videos

no code implementations • CVPR 2017 • Chenyou Fan, Jang-Won Lee, Mingze Xu, Krishna Kumar Singh, Yong Jae Lee, David J. Crandall, Michael S. Ryoo

We consider scenarios in which we wish to perform joint scene understanding, object tracking, activity recognition, and other tasks in environments in which multiple people are wearing body-worn cameras while a third-person static camera also captures the scene.

Activity Recognition • Object Tracking +2

Interspecies Knowledge Transfer for Facial Keypoint Detection

1 code implementation • CVPR 2017 • Maheen Rashid, Xiuye Gu, Yong Jae Lee

Instead of directly finetuning a network trained to detect keypoints on human faces to animal faces (which is sub-optimal since human and animal faces can look quite different), we propose to first adapt the animal images to the pre-trained human detection network by correcting for the differences in animal and human face shape.

Human Detection • Keypoint Detection +1

End-to-End Localization and Ranking for Relative Attributes

no code implementations • 9 Aug 2016 • Krishna Kumar Singh, Yong Jae Lee

We propose an end-to-end deep convolutional network to simultaneously localize and rank relative visual attributes, given only weakly-supervised pairwise image comparisons.

Attribute

Track and Segment: An Iterative Unsupervised Approach for Video Object Proposals

no code implementations • CVPR 2016 • Fanyi Xiao, Yong Jae Lee

We present an unsupervised approach that generates a diverse, ranked set of bounding box and segmentation video object proposals (spatio-temporal tubes that localize the foreground objects) in an unannotated video.

Segmentation

Discovering the Spatial Extent of Relative Attributes

no code implementations • ICCV 2015 • Fanyi Xiao, Yong Jae Lee

We present a weakly-supervised approach that discovers the spatial extent of relative attributes, given only pairs of ordered images.

Attribute

FlowWeb: Joint Image Set Alignment by Weaving Consistent, Pixel-Wise Correspondences

no code implementations • CVPR 2015 • Tinghui Zhou, Yong Jae Lee, Stella X. Yu, Alyosha A. Efros

Given a set of poorly aligned images of the same visual concept without any annotations, we propose an algorithm to jointly bring them into pixel-wise correspondence by estimating a FlowWeb representation of the image set.

Optical Flow Estimation
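
The core consistency signal behind joint alignment is easy to state: composing the A->B and B->C flows should reproduce the direct A->C flow, and deviations flag bad correspondences. A toy sketch of measuring that on dense integer displacement fields (a simplification of the paper's formulation):

```python
import numpy as np

def compose(flow_ab: np.ndarray, flow_bc: np.ndarray) -> np.ndarray:
    """Compose dense flows A->B and B->C into A->C via nearest-pixel lookup.
    flow_*: (H, W, 2) integer (dy, dx) displacements."""
    H, W, _ = flow_ab.shape
    ys, xs = np.mgrid[0:H, 0:W]
    yb = np.clip(ys + flow_ab[..., 0], 0, H - 1).astype(int)
    xb = np.clip(xs + flow_ab[..., 1], 0, W - 1).astype(int)
    return flow_ab + flow_bc[yb, xb]

def cycle_error(flow_ab, flow_bc, flow_ac):
    """Mean deviation of the composed A->B->C flow from the direct A->C flow;
    zero means the triplet is perfectly cycle-consistent."""
    return np.abs(compose(flow_ab, flow_bc) - flow_ac).mean()

# toy check: constant shifts compose exactly, so the error is zero
f1, f2, f3 = (np.full((8, 8, 2), s) for s in (1, 2, 3))
print(cycle_error(f1, f2, f3))  # 0.0
```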

Predicting Important Objects for Egocentric Video Summarization

no code implementations • 18 May 2015 • Yong Jae Lee, Kristen Grauman

Our results on two egocentric video datasets show the method's promise relative to existing techniques for saliency and summarization.

Event Detection • Video Summarization

Weakly-supervised Discovery of Visual Pattern Configurations

no code implementations • NeurIPS 2014 • Hyun Oh Song, Yong Jae Lee, Stefanie Jegelka, Trevor Darrell

The increasing prominence of weakly labeled data nurtures a growing demand for object detection methods that can cope with minimal supervision.

Object • object-detection +1
