1 code implementation • ECCV 2020 • Youngjoong Kwon, Stefano Petrangeli, Dahun Kim, Haoliang Wang, Eunbyung Park, Viswanathan Swaminathan, Henry Fuchs
Second, we introduce a novel loss to explicitly enforce consistency across generated views both in space and in time.
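A minimal sketch of what such a space-time consistency objective could look like, assuming generated views can be warped into a common reference frame (the `warp` callable and the L1 loss form below are illustrative assumptions, not the paper's exact formulation):

```python
# Illustrative space-time consistency loss: warp each generated view at each
# time step into a shared reference frame and penalize pixel-wise disagreement.
import torch
import torch.nn.functional as F

def spacetime_consistency_loss(gen_views, warp):
    """gen_views: (V, T, C, H, W) generated images for V views over T time steps.
    warp(image, v, t): maps view v at time t into the reference frame (assumed)."""
    V, T = gen_views.shape[:2]
    reference = warp(gen_views[0, 0], 0, 0)
    loss = gen_views.new_zeros(())
    for v in range(V):
        for t in range(T):
            aligned = warp(gen_views[v, t], v, t)
            loss = loss + F.l1_loss(aligned, reference)  # cross-view + cross-time term
    return loss / (V * T)
```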
1 code implementation • 4 Apr 2025 • Dahun Kim, AJ Piergiovanni, Ganesh Mallya, Anelia Angelova
We introduce VideoComp, a benchmark and learning framework for advancing video-text compositionality understanding, aimed at improving vision-language models (VLMs) in fine-grained temporal alignment.
no code implementations • 22 Nov 2024 • AJ Piergiovanni, Dahun Kim, Michael S. Ryoo, Isaac Noble, Anelia Angelova
Instead, we propose an efficient, online approach which outputs frequent, detailed and temporally aligned captions, without access to future frames.
no code implementations • 18 Jul 2024 • Shijie Wang, Dahun Kim, Ali Taalimi, Chen Sun, Weicheng Kuo
In this work, we explore whether generative VLMs predominantly trained on image-text data could be leveraged to scale up the text annotation of visual grounding data.
1 code implementation • IEEE Robotics and Automation Letters (RA-L) 2024 • Kim Ji-Yeon, Oh Hyun-Bin, Kwon Byung-Ki, Dahun Kim, Yongjin Kwon, Tae-Hyun Oh
We present Uni-DVPS, a unified model for Depth-aware Video Panoptic Segmentation (DVPS) that jointly tackles distinct vision tasks, i.e., video panoptic segmentation, monocular depth estimation, and object tracking.
no code implementations • 25 May 2024 • Yuanhuiyi Lyu, Xu Zheng, Dahun Kim, Lin Wang
Specifically, we propose Cross-modal Alignment Distillation (CAD) to address the unequal-scale problem between student and teacher modalities and effectively align student modalities into the teacher modalities' representation space in stage one.
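A minimal sketch of the alignment step, assuming the teacher embeddings are frozen and the student is pulled toward them with a cosine objective (the function name and loss form are illustrative):

```python
# Illustrative cross-modal alignment distillation: align student-modality
# embeddings to the frozen teacher's representation space.
import torch.nn.functional as F

def alignment_distillation_loss(student_emb, teacher_emb):
    """student_emb, teacher_emb: (B, D) embeddings of the same samples."""
    s = F.normalize(student_emb, dim=-1)
    t = F.normalize(teacher_emb.detach(), dim=-1)  # teacher is not updated
    return (1.0 - (s * t).sum(dim=-1)).mean()      # 1 - cosine similarity
```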
no code implementations • CVPR 2024 • AJ Piergiovanni, Isaac Noble, Dahun Kim, Michael S. Ryoo, Victor Gomes, Anelia Angelova
We propose a multimodal model, called Mirasol3B, consisting of an autoregressive component for the time-synchronized modalities (audio and video), and an autoregressive component for the context modalities which are not necessarily aligned in time but are still sequential.
Ranked #1 on Audio Classification on VGGSound
2 code implementations • 29 Sep 2023 • Dahun Kim, Anelia Angelova, Weicheng Kuo
We present a new open-vocabulary detection approach based on region-centric image-language pretraining to bridge the gap between image-level pretraining and open-vocabulary object detection.
Ranked #2 on Open Vocabulary Object Detection on LVIS v1.0
no code implementations • ICCV 2023 • Dahun Kim, Anelia Angelova, Weicheng Kuo
We present Contrastive Feature Masking Vision Transformer (CFM-ViT), an image-text pretraining methodology that achieves simultaneous learning of image- and region-level representations for open-vocabulary object detection (OVD); a minimal sketch of the feature-masking idea follows below.
Ranked #8 on Open Vocabulary Object Detection on LVIS v1.0
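As a rough illustration of the feature-masking idea, the snippet below randomly drops a fraction of ViT patch tokens during pretraining; the masking ratio and helper are assumptions, not the paper's exact recipe:

```python
# Illustrative random masking of patch tokens during image-text pretraining.
import torch

def mask_patch_tokens(tokens, mask_ratio=0.5):
    """tokens: (B, N, D) patch embeddings. Returns visible tokens and a mask."""
    B, N, D = tokens.shape
    keep = max(1, int(N * (1.0 - mask_ratio)))
    noise = torch.rand(B, N, device=tokens.device)
    keep_idx = noise.argsort(dim=1)[:, :keep]                  # random tokens to keep
    mask = torch.ones(B, N, dtype=torch.bool, device=tokens.device)
    mask.scatter_(1, keep_idx, False)                          # True = masked out
    visible = torch.gather(tokens, 1, keep_idx.unsqueeze(-1).expand(-1, -1, D))
    return visible, mask
```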
1 code implementation • 3 Aug 2023 • Minsu Kim, Jeongsoo Choi, Dahun Kim, Yong Man Ro
By setting both the inputs and outputs of our learning problem as speech units, we propose to train an encoder-decoder model in a many-to-many spoken language translation setting, namely Unit-to-Unit Translation (UTUT).
2 code implementations • CVPR 2023 • Dahun Kim, Anelia Angelova, Weicheng Kuo
We present Region-aware Open-vocabulary Vision Transformers (RO-ViT), a contrastive image-text pretraining recipe to bridge the gap between image-level pretraining and open-vocabulary object detection; a sketch of one region-aware ingredient follows below.
Ranked #6 on Zero-Shot Cross-Modal Retrieval on Flickr30k
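One plausible region-aware ingredient is to randomly crop and resize the positional-embedding grid during image-level pretraining so that it better matches region-level use at detection time; the sketch below assumes that form, with illustrative grid sizes and scale bounds:

```python
# Illustrative cropped-and-resized positional embeddings for region-aware pretraining.
import torch
import torch.nn.functional as F

def cropped_pos_embed(pos_embed, grid=14, min_scale=0.5):
    """pos_embed: (1, grid*grid, D) learned positional embeddings."""
    D = pos_embed.shape[-1]
    pe = pos_embed.reshape(1, grid, grid, D).permute(0, 3, 1, 2)   # (1, D, g, g)
    size = max(1, int(grid * torch.empty(1).uniform_(min_scale, 1.0).item()))
    y0 = torch.randint(0, grid - size + 1, (1,)).item()
    x0 = torch.randint(0, grid - size + 1, (1,)).item()
    crop = pe[:, :, y0:y0 + size, x0:x0 + size]
    crop = F.interpolate(crop, size=(grid, grid), mode="bilinear", align_corners=False)
    return crop.permute(0, 2, 3, 1).reshape(1, grid * grid, D)
```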
no code implementations • 12 Apr 2023 • Runze Li, Dahun Kim, Bir Bhanu, Weicheng Kuo
We present RECLIP (Resource-efficient CLIP), a simple method that minimizes computational resource footprint for CLIP (Contrastive Language Image Pretraining).
no code implementations • 10 Apr 2023 • Youngjoong Kwon, Dahun Kim, Duygu Ceylan, Henry Fuchs
We present a method that enables synthesizing novel views and novel poses of arbitrary human performers from sparse multi-view images.
no code implementations • 10 Apr 2023 • Inkyu Shin, Dahun Kim, Qihang Yu, Jun Xie, Hong-Seok Kim, Bradley Green, In So Kweon, Kuk-Jin Yoon, Liang-Chieh Chen
The meta-architecture of the proposed Video-kMaX consists of two components: a within-clip segmenter (for clip-level segmentation) and a cross-clip associater (for association beyond clips).
1 code implementation • 29 Mar 2023 • Weicheng Kuo, AJ Piergiovanni, Dahun Kim, Xiyang Luo, Ben Caine, Wei Li, Abhijit Ogale, Luowei Zhou, Andrew Dai, Zhifeng Chen, Claire Cui, Anelia Angelova
We propose a novel paradigm of training with a decoder-only model for multimodal tasks, which is surprisingly effective at jointly learning these disparate vision-language tasks.
Ranked #1 on Video Captioning on MSVD
2 code implementations • CVPR 2022 • Qihang Yu, Huiyu Wang, Dahun Kim, Siyuan Qiao, Maxwell Collins, Yukun Zhu, Hartwig Adam, Alan Yuille, Liang-Chieh Chen
We propose Clustering Mask Transformer (CMT-DeepLab), a transformer-based framework for panoptic segmentation designed around clustering (see the sketch below).
Ranked #6 on Panoptic Segmentation on COCO test-dev
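A rough sketch of the clustering view of cross-attention, where pixels are softly assigned to cluster centers and the centers are refreshed as assignment-weighted means (a simplification with illustrative names, not the full CMT-DeepLab module):

```python
# Illustrative clustering-style update: assign pixels to centers, then refresh centers.
import torch

def cluster_update(pixel_feats, centers):
    """pixel_feats: (B, N, D) pixel features; centers: (B, K, D) cluster queries."""
    logits = torch.einsum("bnd,bkd->bnk", pixel_feats, centers)
    assign = logits.softmax(dim=-1)                        # per-pixel cluster assignment
    weights = assign / (assign.sum(dim=1, keepdim=True) + 1e-6)
    new_centers = torch.einsum("bnk,bnd->bkd", weights, pixel_feats)
    return new_centers, assign
```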
no code implementations • CVPR 2022 • Dahun Kim, Jun Xie, Huiyu Wang, Siyuan Qiao, Qihang Yu, Hong-Seok Kim, Hartwig Adam, In So Kweon, Liang-Chieh Chen
We present TubeFormer-DeepLab, the first attempt to tackle multiple core video segmentation tasks in a unified manner.
1 code implementation • NeurIPS 2021 • Youngjoong Kwon, Dahun Kim, Duygu Ceylan, Henry Fuchs
To tackle this, we propose Neural Human Performer, a novel approach that learns generalizable neural radiance fields based on a parametric human body model for robust performance capture.
Ranked #3 on Generalizable Novel View Synthesis on ZJU-MoCap
6 code implementations • 15 Aug 2021 • Dahun Kim, Tsung-Yi Lin, Anelia Angelova, In So Kweon, Weicheng Kuo
In this paper, we identify the problem that binary classifiers in existing proposal methods tend to overfit to the training categories (a class-agnostic alternative is sketched below).
Ranked #2 on Open World Object Detection on COCO VOC to non-VOC
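One class-agnostic alternative to a binary classifier is to score proposals by localization quality, for example a centerness-style target as sketched below (an assumed simplification; the paper's exact objectness definition may differ):

```python
# Illustrative centerness-style localization-quality target for proposals.
import torch

def centerness_target(points, gt_boxes):
    """points: (N, 2) anchor points (x, y); gt_boxes: (N, 4) matched boxes (x1, y1, x2, y2)."""
    left = points[:, 0] - gt_boxes[:, 0]
    top = points[:, 1] - gt_boxes[:, 1]
    right = gt_boxes[:, 2] - points[:, 0]
    bottom = gt_boxes[:, 3] - points[:, 1]
    lr = torch.stack([left, right], dim=-1)
    tb = torch.stack([top, bottom], dim=-1)
    quality = (lr.min(-1).values / lr.max(-1).values.clamp(min=1e-6)) * \
              (tb.min(-1).values / tb.max(-1).values.clamp(min=1e-6))
    return quality.clamp(min=0).sqrt()   # in [0, 1]; highest at the box center
```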
4 code implementations • 17 Jun 2021 • Mark Weber, Huiyu Wang, Siyuan Qiao, Jun Xie, Maxwell D. Collins, Yukun Zhu, Liangzhe Yuan, Dahun Kim, Qihang Yu, Daniel Cremers, Laura Leal-Taixe, Alan L. Yuille, Florian Schroff, Hartwig Adam, Liang-Chieh Chen
DeepLab2 is a TensorFlow library for deep labeling, aiming to provide a state-of-the-art and easy-to-use TensorFlow codebase for general dense pixel prediction problems in computer vision.
no code implementations • CVPR 2021 • Sanghyun Woo, Dahun Kim, Joon-Young Lee, In So Kweon
Temporal correspondence, linking pixels or objects across frames, is a fundamental supervisory signal for video models (see the sketch below).
Ranked #6 on Video Panoptic Segmentation on Cityscapes-VPS (using extra training data)
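A minimal sketch of how frame-to-frame correspondence can act as a learning signal, using a soft feature-matching affinity (an illustrative form; the paper's actual objective may differ):

```python
# Illustrative soft matching between per-pixel features of adjacent frames.
import torch

def frame_affinity(feat_t, feat_t1, temperature=0.07):
    """feat_t, feat_t1: (N, D) L2-normalized per-pixel features of two frames."""
    sim = feat_t @ feat_t1.t() / temperature
    return sim.softmax(dim=-1)   # row i: soft location of pixel i of frame t in frame t+1
```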
no code implementations • 26 Nov 2020 • Myungchul Kim, Sanghyun Woo, Dahun Kim, In So Kweon
In this work, we propose Boundary Basis based Instance Segmentation (B2Inst) to learn a global boundary representation that can complement existing global-mask-based methods, which often lack high-frequency details.
1 code implementation • CVPR 2020 • Dahun Kim, Sanghyun Woo, Joon-Young Lee, In So Kweon
In this paper, we propose and explore a new video extension of this task, called video panoptic segmentation.
Ranked #7 on Video Panoptic Segmentation on Cityscapes-VPS (using extra training data)
no code implementations • 3 Feb 2020 • Yunjae Jung, Dahun Kim, Sanghyun Woo, Kyung-Su Kim, Sungjin Kim, In So Kweon
In this paper, we propose to explicitly learn to imagine a storyline that bridges the visual gap.
Ranked #7 on Visual Storytelling on VIST
no code implementations • 21 Aug 2019 • Kwanyong Park, Sanghyun Woo, Dahun Kim, Donghyeon Cho, In So Kweon
In this paper, we investigate the problem of unpaired video-to-video translation.
no code implementations • 30 May 2019 • Sanghyun Woo, Dahun Kim, KwanYong Park, Joon-Young Lee, In So Kweon
Our video inpainting network consists of two stages.
1 code implementation • CVPR 2019 • Dahun Kim, Sanghyun Woo, Joon-Young Lee, In So Kweon
Blind video decaptioning is a problem of automatically removing text overlays and inpainting the occluded parts in videos without any input masks.
2 code implementations • CVPR 2019 • Dahun Kim, Sanghyun Woo, Joon-Young Lee, In So Kweon
Video inpainting aims to fill spatio-temporal holes with plausible content in a video.
Ranked #7 on Video Inpainting on DAVIS
1 code implementation • 24 Nov 2018 • Yunjae Jung, Donghyeon Cho, Dahun Kim, Sanghyun Woo, In So Kweon
The proposed variance loss allows a network to predict output scores for each frame with high discrepancy, which enables effective feature learning and significantly improves model performance (a minimal sketch follows below).
Ranked #4 on Unsupervised Video Summarization on SumMe
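A minimal sketch of one way to encourage high-discrepancy frame scores, penalizing low per-video score variance (an assumed instantiation; the paper's exact formulation may differ):

```python
# Illustrative variance loss: discourage near-constant frame importance scores.
import torch

def variance_loss(scores, eps=1e-6):
    """scores: (B, T) predicted frame importance scores in [0, 1]."""
    var = scores.var(dim=1, unbiased=False)   # per-video score variance
    return (1.0 / (var + eps)).mean()         # small variance -> large penalty
```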
no code implementations • 24 Nov 2018 • Dahun Kim, Donghyeon Cho, In So Kweon
Self-supervised tasks such as colorization, inpainting, and jigsaw puzzles have been utilized for visual representation learning on still images when labeled images are scarce or entirely absent (a video-puzzle variant is sketched below).
Ranked #42 on Self-Supervised Action Recognition on HMDB51
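As an illustration of extending such pretext tasks to video, the snippet below builds a simple space-time "puzzle": crop several pieces from a clip, shuffle them, and ask a classifier to predict the permutation (the piece count and sampling are assumptions, not the paper's exact setup):

```python
# Illustrative video puzzle pretext task: predict which permutation shuffled the pieces.
import itertools
import random
import torch

PERMS = list(itertools.permutations(range(4)))   # 24 possible orderings of 4 pieces

def make_puzzle(pieces):
    """pieces: list of 4 tensors (C, T, H, W) cropped from one clip."""
    label = random.randrange(len(PERMS))
    shuffled = [pieces[i] for i in PERMS[label]]
    return torch.stack(shuffled), torch.tensor(label)   # network input, permutation id
```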
3 code implementations • NeurIPS 2018 • Sanghyun Woo, Dahun Kim, Donghyeon Cho, In So Kweon
In this paper, we present a method that improves scene graph generation by explicitly modeling inter-dependency among the entire object instances.
no code implementations • 6 Feb 2018 • Dahun Kim, Donghyeon Cho, Donggeun Yoo, In So Kweon
Recovering this damage pushes the network to learn robust, general-purpose representations.
no code implementations • ICCV 2017 • Dahun Kim, Donghyeon Cho, Donggeun Yoo, In So Kweon
Weakly supervised semantic segmentation and localization have the problem of focusing only on the most important parts of an image, since they use only image-level annotations.