2 code implementations • CVPR 2023 • Dahun Kim, Anelia Angelova, Weicheng Kuo
We present Region-aware Open-vocabulary Vision Transformers (RO-ViT) - a contrastive image-text pretraining recipe to bridge the gap between image-level pretraining and open-vocabulary object detection.
Ranked #5 on Zero-Shot Cross-Modal Retrieval on Flickr30k
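The contrastive image-text pretraining this entry builds on is the standard CLIP-style symmetric InfoNCE objective; a minimal numpy sketch of that generic objective (not RO-ViT's actual implementation, which modifies the image-level recipe with region-aware components) is:

```python
import numpy as np

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings.

    img_emb, txt_emb: (batch, dim) arrays; row i of each is a matched pair.
    Plain-numpy illustration of the generic objective, not the paper's code.
    """
    # L2-normalize so the dot product is a cosine similarity.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)

    logits = img @ txt.T / temperature   # (batch, batch) similarity matrix
    labels = np.arange(len(logits))      # matched pairs lie on the diagonal

    def cross_entropy(l, y):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_prob = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_prob[np.arange(len(y)), y].mean()

    # Average the image-to-text and text-to-image directions.
    return 0.5 * (cross_entropy(logits, labels) + cross_entropy(logits.T, labels))
```

Each image is pulled toward its own caption and pushed away from every other caption in the batch, and vice versa.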
1 code implementation • 29 Sep 2023 • Dahun Kim, Anelia Angelova, Weicheng Kuo
We present a new open-vocabulary detection approach based on detection-oriented image-text pretraining to bridge the gap between image-level pretraining and open-vocabulary object detection.
Ranked #1 on Open Vocabulary Object Detection on LVIS v1.0
4 code implementations • 17 Jun 2021 • Mark Weber, Huiyu Wang, Siyuan Qiao, Jun Xie, Maxwell D. Collins, Yukun Zhu, Liangzhe Yuan, Dahun Kim, Qihang Yu, Daniel Cremers, Laura Leal-Taixe, Alan L. Yuille, Florian Schroff, Hartwig Adam, Liang-Chieh Chen
DeepLab2 is a TensorFlow library for deep labeling, aiming to provide a state-of-the-art and easy-to-use TensorFlow codebase for general dense pixel prediction problems in computer vision.
2 code implementations • CVPR 2019 • Dahun Kim, Sanghyun Woo, Joon-Young Lee, In So Kweon
Video inpainting aims to fill spatio-temporal holes with plausible content in a video.
Ranked #7 on Video Inpainting on DAVIS
1 code implementation • CVPR 2020 • Dahun Kim, Sanghyun Woo, Joon-Young Lee, In So Kweon
In this paper, we propose and explore a new video extension of this task, called video panoptic segmentation.
Ranked #7 on Video Panoptic Segmentation on Cityscapes-VPS (using extra training data)
3 code implementations • 15 Aug 2021 • Dahun Kim, Tsung-Yi Lin, Anelia Angelova, In So Kweon, Weicheng Kuo
In this paper, we identify that the binary classifiers in existing proposal methods tend to overfit to the training categories.
Ranked #2 on Open World Object Detection on COCO VOC to non-VOC
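The remedy this observation suggests is to score proposals by localization quality rather than by a binary foreground/background classifier; the IoU helper and target construction below are a hedged illustration of that idea, not the paper's code:

```python
import numpy as np

def box_iou(box, gt_boxes):
    """IoU between one box and an array of ground-truth boxes (x1, y1, x2, y2)."""
    x1 = np.maximum(box[0], gt_boxes[:, 0])
    y1 = np.maximum(box[1], gt_boxes[:, 1])
    x2 = np.minimum(box[2], gt_boxes[:, 2])
    y2 = np.minimum(box[3], gt_boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    gt_area = (gt_boxes[:, 2] - gt_boxes[:, 0]) * (gt_boxes[:, 3] - gt_boxes[:, 1])
    return inter / (area + gt_area - inter)

def objectness_targets(proposals, gt_boxes):
    """Class-agnostic regression targets: each proposal's best IoU with any
    annotated box, regardless of category. Geometry-based scores like this
    avoid the failure mode above, where a binary classifier pushes proposals
    on unlabeled categories toward 0 as 'background'."""
    return np.array([box_iou(p, gt_boxes).max() for p in proposals])
```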
1 code implementation • NeurIPS 2021 • Youngjoong Kwon, Dahun Kim, Duygu Ceylan, Henry Fuchs
To tackle this, we propose Neural Human Performer, a novel approach that learns generalizable neural radiance fields based on a parametric human body model for robust performance capture.
Ranked #3 on Generalizable Novel View Synthesis on ZJU-MoCap
2 code implementations • CVPR 2022 • Qihang Yu, Huiyu Wang, Dahun Kim, Siyuan Qiao, Maxwell Collins, Yukun Zhu, Hartwig Adam, Alan Yuille, Liang-Chieh Chen
We propose Clustering Mask Transformer (CMT-DeepLab), a transformer-based framework for panoptic segmentation designed around clustering.
Ranked #6 on Panoptic Segmentation on COCO test-dev
1 code implementation • ECCV 2020 • Youngjoong Kwon, Stefano Petrangeli, Dahun Kim, Haoliang Wang, Eunbyung Park, Viswanathan Swaminathan, Henry Fuchs
Second, we introduce a novel loss to explicitly enforce consistency across generated views both in space and in time.
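A space-and-time consistency term of this flavor can be sketched generically: gather renderings of the same surface point across views and time steps and penalize disagreement. The tensor layout and mean-deviation form below are assumptions for illustration, not the paper's loss:

```python
import numpy as np

def consistency_loss(renderings):
    """renderings: (views, time, dim) features of one surface point rendered
    from several views at several time steps. The spatial term penalizes
    deviation from the per-timestep mean across views; the temporal term
    penalizes deviation from the per-view mean across time."""
    spatial = ((renderings - renderings.mean(axis=0, keepdims=True)) ** 2).mean()
    temporal = ((renderings - renderings.mean(axis=1, keepdims=True)) ** 2).mean()
    return spatial + temporal
```

The loss is zero exactly when every view and time step renders the point identically.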
1 code implementation • CVPR 2019 • Dahun Kim, Sanghyun Woo, Joon-Young Lee, In So Kweon
Blind video decaptioning is a problem of automatically removing text overlays and inpainting the occluded parts in videos without any input masks.
3 code implementations • NeurIPS 2018 • Sanghyun Woo, Dahun Kim, Donghyeon Cho, In So Kweon
In this paper, we present a method that improves scene graph generation by explicitly modeling inter-dependency among all object instances.
1 code implementation • 3 Aug 2023 • Minsu Kim, Jeongsoo Choi, Dahun Kim, Yong Man Ro
A single pre-trained model with UTUT can be employed for diverse multilingual speech- and text-related tasks, such as Speech-to-Speech Translation (STS), multilingual Text-to-Speech Synthesis (TTS), and Text-to-Speech Translation (TTST).
1 code implementation • 24 Nov 2018 • Yunjae Jung, Donghyeon Cho, Dahun Kim, Sanghyun Woo, In So Kweon
The proposed variance loss allows a network to predict output scores for each frame with high discrepancy, which enables effective feature learning and significantly improves model performance.
Ranked #3 on Unsupervised Video Summarization on SumMe
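One simple way to realize a variance loss of this kind is to penalize low variance among per-frame scores; the reciprocal form below is an illustrative sketch, and the paper's exact formulation may differ:

```python
import numpy as np

def variance_loss(frame_scores, eps=1e-8):
    """Encourage high discrepancy among per-frame importance scores:
    minimizing 1/(var + eps) pushes the variance up, so the network
    cannot collapse to near-uniform scores. Illustrative form only."""
    scores = np.asarray(frame_scores, dtype=float)
    return 1.0 / (scores.var() + eps)
```

Uniform scores incur a very large loss, while well-separated scores incur a small one.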
no code implementations • 6 Feb 2018 • Dahun Kim, Donghyeon Cho, Donggeun Yoo, In So Kweon
The recovery of the aforementioned damage pushes the network to obtain robust and general-purpose representations.
no code implementations • ICCV 2017 • Dahun Kim, Donghyeon Cho, Donggeun Yoo, In So Kweon
Weakly supervised semantic segmentation and localization suffer from focusing only on the most important parts of an image, since they use only image-level annotations.
no code implementations • 24 Nov 2018 • Dahun Kim, Donghyeon Cho, In So Kweon
Self-supervised tasks such as colorization, inpainting and jigsaw puzzles have been utilized for visual representation learning on still images when labeled images are scarce or entirely absent.
Ranked #42 on Self-Supervised Action Recognition on HMDB51
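The jigsaw-puzzle pretext task named above can be set up in a few lines: shuffle image tiles and ask a network to predict the permutation, so spatial structure must be learned without labels. This 2D sample generator is a minimal illustration (the video extension would permute space-time crops instead):

```python
import numpy as np

def make_jigsaw_sample(image, grid=2, rng=None):
    """Split an image (H, W, C) into grid x grid tiles, shuffle them, and
    return (shuffled_tiles, permutation). A network trained to predict the
    permutation from the tiles must learn spatial layout without labels."""
    if rng is None:
        rng = np.random.default_rng()
    h = image.shape[0] // grid
    w = image.shape[1] // grid
    tiles = [image[i*h:(i+1)*h, j*w:(j+1)*w]
             for i in range(grid) for j in range(grid)]
    perm = rng.permutation(len(tiles))
    return [tiles[k] for k in perm], perm
```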
no code implementations • 30 May 2019 • Sanghyun Woo, Dahun Kim, KwanYong Park, Joon-Young Lee, In So Kweon
Our video inpainting network consists of two stages.
no code implementations • 21 Aug 2019 • Kwanyong Park, Sanghyun Woo, Dahun Kim, Donghyeon Cho, In So Kweon
In this paper, we investigate the problem of unpaired video-to-video translation.
no code implementations • 3 Feb 2020 • Yunjae Jung, Dahun Kim, Sanghyun Woo, Kyung-Su Kim, Sungjin Kim, In So Kweon
In this paper, we propose to explicitly learn to imagine a storyline that bridges the visual gap.
Ranked #7 on Visual Storytelling on VIST
no code implementations • 26 Nov 2020 • Myungchul Kim, Sanghyun Woo, Dahun Kim, In So Kweon
In this work, we propose Boundary Basis based Instance Segmentation (B2Inst) to learn a global boundary representation that complements existing global-mask-based methods, which often lack high-frequency details.
no code implementations • CVPR 2021 • Sanghyun Woo, Dahun Kim, Joon-Young Lee, In So Kweon
Temporal correspondence - linking pixels or objects across frames - is a fundamental supervisory signal for video models.
Ranked #6 on Video Panoptic Segmentation on Cityscapes-VPS (using extra training data)
no code implementations • CVPR 2022 • Dahun Kim, Jun Xie, Huiyu Wang, Siyuan Qiao, Qihang Yu, Hong-Seok Kim, Hartwig Adam, In So Kweon, Liang-Chieh Chen
We present TubeFormer-DeepLab, the first attempt to tackle multiple core video segmentation tasks in a unified manner.
no code implementations • 29 Mar 2023 • Weicheng Kuo, AJ Piergiovanni, Dahun Kim, Xiyang Luo, Ben Caine, Wei Li, Abhijit Ogale, Luowei Zhou, Andrew Dai, Zhifeng Chen, Claire Cui, Anelia Angelova
We propose a novel paradigm of training with a decoder-only model for multimodal tasks, which is surprisingly effective in jointly learning these disparate vision-language tasks.
no code implementations • 10 Apr 2023 • Inkyu Shin, Dahun Kim, Qihang Yu, Jun Xie, Hong-Seok Kim, Bradley Green, In So Kweon, Kuk-Jin Yoon, Liang-Chieh Chen
The meta architecture of the proposed Video-kMaX consists of two components: a within-clip segmenter (for clip-level segmentation) and a cross-clip associater (for association beyond clips).
no code implementations • 10 Apr 2023 • Youngjoong Kwon, Dahun Kim, Duygu Ceylan, Henry Fuchs
We present a method that enables synthesizing novel views and novel poses of arbitrary human performers from sparse multi-view images.
no code implementations • 12 Apr 2023 • Runze Li, Dahun Kim, Bir Bhanu, Weicheng Kuo
We present RECLIP (Resource-efficient CLIP), a simple method that minimizes computational resource footprint for CLIP (Contrastive Language Image Pretraining).
no code implementations • ICCV 2023 • Dahun Kim, Anelia Angelova, Weicheng Kuo
We present Contrastive Feature Masking Vision Transformer (CFM-ViT) - an image-text pretraining methodology that achieves simultaneous learning of image- and region-level representation for open-vocabulary object detection (OVD).
Ranked #5 on Open Vocabulary Object Detection on LVIS v1.0
no code implementations • 9 Nov 2023 • AJ Piergiovanni, Isaac Noble, Dahun Kim, Michael S. Ryoo, Victor Gomes, Anelia Angelova
We propose a multimodal model, called Mirasol3B, consisting of an autoregressive component for the time-synchronized modalities (audio and video), and an autoregressive component for the context modalities which are not necessarily aligned in time but are still sequential.
Ranked #1 on Audio Classification on VGGSound