Search Results for author: Dahun Kim

Found 33 papers, 16 papers with code

VideoComp: Advancing Fine-Grained Compositional and Temporal Alignment in Video-Text Models

1 code implementation • 4 Apr 2025 • Dahun Kim, AJ Piergiovanni, Ganesh Mallya, Anelia Angelova

We introduce VideoComp, a benchmark and learning framework for advancing video-text compositionality understanding, aimed at improving vision-language models (VLMs) in fine-grained temporal alignment.

Whats in a Video: Factorized Autoregressive Decoding for Online Dense Video Captioning

no code implementations • 22 Nov 2024 • AJ Piergiovanni, Dahun Kim, Michael S. Ryoo, Isaac Noble, Anelia Angelova

Instead, we propose an efficient, online approach which outputs frequent, detailed and temporally aligned captions, without access to future frames.

Dense Video Captioning

Learning Visual Grounding from Generative Vision and Language Model

no code implementations • 18 Jul 2024 • Shijie Wang, Dahun Kim, Ali Taalimi, Chen Sun, Weicheng Kuo

In this work, we explore whether generative VLMs predominantly trained on image-text data could be leveraged to scale up the text annotation of visual grounding data.

Attribute • Language Modeling +8

Uni-DVPS: Unified Model for Depth-Aware Video Panoptic Segmentation

1 code implementation • IEEE Robotics and Automation Letters (RA-L) 2024 • Kim Ji-Yeon, Oh Hyun-Bin, Kwon Byung-Ki, Dahun Kim, Yongjin Kwon, Tae-Hyun Oh

We present Uni-DVPS, a unified model for Depth-aware Video Panoptic Segmentation (DVPS) that jointly tackles distinct vision tasks, i.e., video panoptic segmentation, monocular depth estimation, and object tracking.

Autonomous Driving • Decoder +7

OmniBind: Teach to Build Unequal-Scale Modality Interaction for Omni-Bind of All

no code implementations • 25 May 2024 • Yuanhuiyi Lyu, Xu Zheng, Dahun Kim, Lin Wang

Specifically, we propose Cross-modal Alignment Distillation (CAD) to address the unequal-scale problem between student and teacher modalities and effectively align student modalities into the teacher modalities' representation space in stage one.

All cross-modal alignment
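The alignment idea can be pictured with a simple cosine-distance objective. This toy loss and the embedding sizes below are illustrative assumptions, not the paper's exact CAD formulation:

```python
import math

def alignment_distillation_loss(student_emb, teacher_emb):
    """Toy stand-in for an alignment-distillation objective: cosine distance
    between a trainable student modality embedding and a frozen teacher
    modality embedding."""
    dot = sum(s * t for s, t in zip(student_emb, teacher_emb))
    s_norm = math.sqrt(sum(s * s for s in student_emb))
    t_norm = math.sqrt(sum(t * t for t in teacher_emb))
    return 1.0 - dot / (s_norm * t_norm)

teacher = [0.3, -1.2, 0.5, 2.0]
aligned = alignment_distillation_loss(teacher, teacher)                 # ~0.0
opposed = alignment_distillation_loss([-t for t in teacher], teacher)   # ~2.0
```

Minimizing such a loss pulls the student modality's embedding toward the teacher modality's direction in the shared representation space, which is the "align student modalities into the teacher modalities' representation space" step the abstract describes.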

Mirasol3B: A Multimodal Autoregressive model for time-aligned and contextual modalities

no code implementations • CVPR 2024 • AJ Piergiovanni, Isaac Noble, Dahun Kim, Michael S. Ryoo, Victor Gomes, Anelia Angelova

We propose a multimodal model, called Mirasol3B, consisting of an autoregressive component for the time-synchronized modalities (audio and video), and an autoregressive component for the context modalities which are not necessarily aligned in time but are still sequential.

Action Classification • Audio Classification +1
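The split the abstract describes can be illustrated by partitioning the inputs: time-aligned audio/video features are chunked into synchronized snippets for one autoregressive component, while context tokens go to the other. The shapes and snippet length below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
video_feats = rng.standard_normal((32, 8))  # 32 frames, 8-dim features
audio_feats = rng.standard_normal((32, 4))  # time-aligned with the video

# Time-synchronized modalities are processed jointly, snippet by snippet,
# by one autoregressive component.
snippet_len = 4
snippets = [(video_feats[i:i + snippet_len], audio_feats[i:i + snippet_len])
            for i in range(0, len(video_feats), snippet_len)]

# Context modalities (e.g. title or description text) are sequential but not
# aligned in time, so they are routed to a separate autoregressive component.
context_tokens = ["a", "video", "of", "a", "dog"]
```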

Region-centric Image-Language Pretraining for Open-Vocabulary Detection

2 code implementations • 29 Sep 2023 • Dahun Kim, Anelia Angelova, Weicheng Kuo

We present a new open-vocabulary detection approach based on region-centric image-language pretraining to bridge the gap between image-level pretraining and open-vocabulary object detection.

Contrastive Learning • Object +3

Contrastive Feature Masking Open-Vocabulary Vision Transformer

no code implementations • ICCV 2023 • Dahun Kim, Anelia Angelova, Weicheng Kuo

We present Contrastive Feature Masking Vision Transformer (CFM-ViT) - an image-text pretraining methodology that achieves simultaneous learning of image- and region-level representation for open-vocabulary object detection (OVD).

Contrastive Learning • Image-text Retrieval +4
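The feature-masking half of the recipe can be sketched as follows: a fraction of patch embeddings is hidden and must be reconstructed, with the loss taken only at masked positions. The mask ratio, shapes, and the zero-valued "predictions" are assumptions for illustration, and the actual method additionally keeps the contrastive image-text objective:

```python
import numpy as np

rng = np.random.default_rng(0)
patches = rng.standard_normal((196, 64))   # 14x14 grid of ViT patch embeddings
mask = rng.random(196) < 0.75              # hide roughly 75% of positions

visible = patches.copy()
visible[mask] = 0.0                        # masked features are hidden from the encoder

# A decoder would predict the masked features from the visible ones; the
# predictions are faked as zeros here just to show where the loss applies.
pred = np.zeros_like(patches)
recon_loss = float(((pred[mask] - patches[mask]) ** 2).mean())
```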

Textless Unit-to-Unit training for Many-to-Many Multilingual Speech-to-Speech Translation

1 code implementation • 3 Aug 2023 • Minsu Kim, Jeongsoo Choi, Dahun Kim, Yong Man Ro

By setting both the inputs and outputs of our learning problem as speech units, we propose to train an encoder-decoder model in a many-to-many spoken language translation setting, namely Unit-to-Unit Translation (UTUT).

Decoder • Quantization +7

Region-Aware Pretraining for Open-Vocabulary Object Detection with Vision Transformers

2 code implementations • CVPR 2023 • Dahun Kim, Anelia Angelova, Weicheng Kuo

We present Region-aware Open-vocabulary Vision Transformers (RO-ViT) - a contrastive image-text pretraining recipe to bridge the gap between image-level pretraining and open-vocabulary object detection.

Contrastive Learning • Image-text Retrieval +5
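A key region-aware ingredient reported for RO-ViT is randomly cropping and resizing the whole-image positional embeddings during pretraining, so that at detection time the embeddings behave as if each region were a full image. A minimal sketch, using nearest-neighbor resizing as a stand-in for proper interpolation (grid and embedding sizes are illustrative):

```python
import numpy as np

def cropped_positional_embedding(pos, crop_hw, rng):
    """Randomly crop the full positional-embedding grid and resize the crop
    back to the full grid size. `pos` has shape (H, W, D)."""
    H, W, D = pos.shape
    ch, cw = crop_hw
    y = rng.integers(0, H - ch + 1)
    x = rng.integers(0, W - cw + 1)
    crop = pos[y:y + ch, x:x + cw]
    # Nearest-neighbor upsample of the crop back to (H, W, D).
    ys = np.arange(H) * ch // H
    xs = np.arange(W) * cw // W
    return crop[np.ix_(ys, xs)]

rng = np.random.default_rng(0)
pos = rng.standard_normal((14, 14, 32))
out = cropped_positional_embedding(pos, (7, 7), rng)
```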

RECLIP: Resource-efficient CLIP by Training with Small Images

no code implementations • 12 Apr 2023 • Runze Li, Dahun Kim, Bir Bhanu, Weicheng Kuo

We present RECLIP (Resource-efficient CLIP), a simple method that minimizes computational resource footprint for CLIP (Contrastive Language Image Pretraining).

Contrastive Learning • Image-text Retrieval +3
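The resource saving from training with small images follows from simple token arithmetic. A back-of-the-envelope sketch (the 16-pixel patch size and resolutions are illustrative, not RECLIP's exact settings):

```python
# For a ViT-style encoder with 16x16 patches, shrinking inputs from 224x224
# to 112x112 cuts the token count by 4x and the quadratic self-attention
# cost by roughly 16x.
def num_patches(image_size, patch_size=16):
    return (image_size // patch_size) ** 2

tokens_224 = num_patches(224)                      # 196 tokens
tokens_112 = num_patches(112)                      # 49 tokens
attention_ratio = (tokens_224 / tokens_112) ** 2   # ~16x fewer attention FLOPs
```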

Neural Image-based Avatars: Generalizable Radiance Fields for Human Avatar Modeling

no code implementations • 10 Apr 2023 • Youngjoong Kwon, Dahun Kim, Duygu Ceylan, Henry Fuchs

We present a method that enables synthesizing novel views and novel poses of arbitrary human performers from sparse multi-view images.

NeRF

Video-kMaX: A Simple Unified Approach for Online and Near-Online Video Panoptic Segmentation

no code implementations • 10 Apr 2023 • Inkyu Shin, Dahun Kim, Qihang Yu, Jun Xie, Hong-Seok Kim, Bradley Green, In So Kweon, Kuk-Jin Yoon, Liang-Chieh Chen

The meta architecture of the proposed Video-kMaX consists of two components: within clip segmenter (for clip-level segmentation) and cross-clip associater (for association beyond clips).

Segmentation • Video Panoptic Segmentation +1

MaMMUT: A Simple Architecture for Joint Learning for MultiModal Tasks

1 code implementation • 29 Mar 2023 • Weicheng Kuo, AJ Piergiovanni, Dahun Kim, Xiyang Luo, Ben Caine, Wei Li, Abhijit Ogale, Luowei Zhou, Andrew Dai, Zhifeng Chen, Claire Cui, Anelia Angelova

We propose a novel paradigm of training with a decoder-only model for multimodal tasks, which is surprisingly effective in jointly learning these disparate vision-language tasks.

Cross-Modal Retrieval • Decoder +9

Neural Human Performer: Learning Generalizable Radiance Fields for Human Performance Rendering

1 code implementation • NeurIPS 2021 • Youngjoong Kwon, Dahun Kim, Duygu Ceylan, Henry Fuchs

To tackle this, we propose Neural Human Performer, a novel approach that learns generalizable neural radiance fields based on a parametric human body model for robust performance capture.

Generalizable Novel View Synthesis • NeRF

Learning Open-World Object Proposals without Learning to Classify

6 code implementations • 15 Aug 2021 • Dahun Kim, Tsung-Yi Lin, Anelia Angelova, In So Kweon, Weicheng Kuo

In this paper, we identify the problem: the binary classifiers in existing proposal methods tend to overfit to the training categories.

Object • object-detection +4
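The fix this points at is scoring proposals by localization quality instead of a learned object/background classifier, so proposals on unseen categories are not suppressed as "background". A minimal sketch of such class-agnostic targets (the best-IoU target below is an illustrative choice):

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def objectness_targets(proposals, gt_boxes):
    """Score each proposal by its best overlap with any annotated box;
    no category label is involved, so nothing is pushed toward
    'background' merely for belonging to an unseen class."""
    return [max(iou(p, g) for g in gt_boxes) for p in proposals]

gt = [(10, 10, 50, 50)]
props = [(10, 10, 50, 50), (30, 30, 70, 70), (100, 100, 120, 120)]
targets = objectness_targets(props, gt)
```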

DeepLab2: A TensorFlow Library for Deep Labeling

4 code implementations • 17 Jun 2021 • Mark Weber, Huiyu Wang, Siyuan Qiao, Jun Xie, Maxwell D. Collins, Yukun Zhu, Liangzhe Yuan, Dahun Kim, Qihang Yu, Daniel Cremers, Laura Leal-Taixe, Alan L. Yuille, Florian Schroff, Hartwig Adam, Liang-Chieh Chen

DeepLab2 is a TensorFlow library for deep labeling, aiming to provide a state-of-the-art and easy-to-use TensorFlow codebase for general dense pixel prediction problems in computer vision.

Learning to Associate Every Segment for Video Panoptic Segmentation

no code implementations • CVPR 2021 • Sanghyun Woo, Dahun Kim, Joon-Young Lee, In So Kweon

Temporal correspondence - linking pixels or objects across frames - is a fundamental supervisory signal for the video models.

Ranked #6 on Video Panoptic Segmentation on Cityscapes-VPS (using extra training data)

Video Panoptic Segmentation

The Devil is in the Boundary: Exploiting Boundary Representation for Basis-based Instance Segmentation

no code implementations • 26 Nov 2020 • Myungchul Kim, Sanghyun Woo, Dahun Kim, In So Kweon

In this work, we propose Boundary Basis based Instance Segmentation(B2Inst) to learn a global boundary representation that can complement existing global-mask-based methods that are often lacking high-frequency details.

Instance Segmentation • Scene Understanding +2

Video Panoptic Segmentation

1 code implementation • CVPR 2020 • Dahun Kim, Sanghyun Woo, Joon-Young Lee, In So Kweon

In this paper, we propose and explore a new video extension of this task, called video panoptic segmentation.

Ranked #7 on Video Panoptic Segmentation on Cityscapes-VPS (using extra training data)

Instance Segmentation • Segmentation +5

Deep Blind Video Decaptioning by Temporal Aggregation and Recurrence

1 code implementation • CVPR 2019 • Dahun Kim, Sanghyun Woo, Joon-Young Lee, In So Kweon

Blind video decaptioning is a problem of automatically removing text overlays and inpainting the occluded parts in videos without any input masks.

Decoder • Video Denoising +2

Discriminative Feature Learning for Unsupervised Video Summarization

1 code implementation • 24 Nov 2018 • Yunjae Jung, Donghyeon Cho, Dahun Kim, Sanghyun Woo, In So Kweon

The proposed variance loss allows a network to predict output scores for each frame with high discrepancy, which enables effective feature learning and significantly improves model performance.

Supervised Video Summarization • Unsupervised Video Summarization
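A minimal reading of such a variance loss is below; the reciprocal form and the epsilon are assumptions for illustration, not necessarily the paper's exact formulation:

```python
from statistics import pvariance

def variance_loss(frame_scores, eps=1e-8):
    """Penalize flat score distributions: the lower the variance of the
    predicted per-frame importance scores, the larger the loss, which
    pushes the network toward discriminative (high-discrepancy) scores."""
    return 1.0 / (pvariance(frame_scores) + eps)

flat = variance_loss([0.5, 0.5, 0.5, 0.5])    # no discrepancy -> huge loss
spread = variance_loss([0.9, 0.1, 0.8, 0.2])  # high discrepancy -> small loss
```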

Self-Supervised Video Representation Learning with Space-Time Cubic Puzzles

no code implementations • 24 Nov 2018 • Dahun Kim, Donghyeon Cho, In So Kweon

Self-supervised tasks such as colorization, inpainting and jigsaw puzzles have been utilized for visual representation learning for still images, when the number of labeled images is limited or entirely absent.

Colorization • Representation Learning +2
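The pretext task can be pictured as cutting a clip into spatio-temporal cubes, shuffling them with a known permutation, and training a network to classify which permutation was applied. The 2x2x2 layout and sizes below are illustrative, not necessarily the paper's exact crop scheme:

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
clip = rng.standard_normal((8, 16, 16))  # (time, height, width), single channel

# Cut the clip into a 2x2x2 grid of space-time cubes.
cubes = [clip[t:t + 4, y:y + 8, x:x + 8]
         for t in (0, 4) for y in (0, 8) for x in (0, 8)]

# Shuffle with a known permutation; the permutation index is the pretext
# label a network must recover from the shuffled cubes.
perms = list(itertools.permutations(range(8)))
label = int(rng.integers(len(perms)))
shuffled = [cubes[i] for i in perms[label]]
```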

LinkNet: Relational Embedding for Scene Graph

3 code implementations • NeurIPS 2018 • Sanghyun Woo, Dahun Kim, Donghyeon Cho, In So Kweon

In this paper, we present a method that improves scene graph generation by explicitly modeling inter-dependency among the entire object instances.

Graph Generation • Scene Graph Generation

Learning Image Representations by Completing Damaged Jigsaw Puzzles

no code implementations • 6 Feb 2018 • Dahun Kim, Donghyeon Cho, Donggeun Yoo, In So Kweon

The recovery of the aforementioned damage pushes the network to obtain robust and general-purpose representations.

Colorization • Representation Learning +2

Two-Phase Learning for Weakly Supervised Object Localization

no code implementations • ICCV 2017 • Dahun Kim, Donghyeon Cho, Donggeun Yoo, In So Kweon

Weakly supervised semantic segmentation and localization have a problem of focusing only on the most important parts of an image since they use only image-level annotations.

Object • Segmentation +5
