1 code implementation • CVPR 2024 • Joonmyung Choi, Sanghyeok Lee, Jaewon Chu, Minhyuk Choi, Hyunwoo J. Kim
To tackle these issues, we propose training-free token merging for lightweight video Transformers (vid-TLDR), which aims to enhance the efficiency of video Transformers by merging background tokens without additional training.
Ranked #2 on Video Retrieval on SSv2-template retrieval (using extra training data)
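The core idea behind this entry, dropping uninformative background tokens at inference time with no retraining, can be illustrated with a minimal, hypothetical sketch (this is not the authors' vid-TLDR implementation): score each token by an attention-derived saliency and collapse the lowest-scoring tokens into a single averaged background token.

```python
import torch

def merge_background_tokens(tokens: torch.Tensor,
                            saliency: torch.Tensor,
                            keep_ratio: float = 0.5) -> torch.Tensor:
    """Toy training-free merging: keep the most salient tokens and
    collapse the rest into one averaged "background" token.

    tokens:   (B, N, C) token embeddings from a Transformer block
    saliency: (B, N)    per-token scores, e.g. mean attention received
    """
    B, N, C = tokens.shape
    n_keep = max(1, int(N * keep_ratio))
    idx = saliency.argsort(dim=1, descending=True)            # (B, N)
    keep_idx = idx[:, :n_keep]                                 # salient tokens
    drop_idx = idx[:, n_keep:]                                 # background tokens
    gather = lambda i: tokens.gather(1, i.unsqueeze(-1).expand(-1, -1, C))
    kept = gather(keep_idx)                                    # (B, n_keep, C)
    merged = gather(drop_idx).mean(dim=1, keepdim=True)        # (B, 1, C)
    return torch.cat([kept, merged], dim=1)                    # (B, n_keep+1, C)

# Example: 8 video tokens, saliency taken from attention maps
x = torch.randn(2, 8, 16)
s = torch.rand(2, 8)
print(merge_background_tokens(x, s, keep_ratio=0.5).shape)  # torch.Size([2, 5, 16])
```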
1 code implementation • CVPR 2024 • Sanghyeok Lee, Joonmyung Choi, Hyunwoo J. Kim
Here, we argue that token fusion needs to consider diverse relations between tokens to minimize information loss.
Ranked #1 on Efficient ViTs on ImageNet-1K (With LV-ViT-S)
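One way to read the claim about diverse relations is that a single similarity measure is not enough to decide which tokens to fuse. The sketch below is a hypothetical illustration (not the paper's method or code): it mixes two pairwise relations, feature redundancy and how little attention a pair receives, into one fusion score.

```python
import torch
import torch.nn.functional as F

def multi_criteria_scores(tokens: torch.Tensor,
                          attn_received: torch.Tensor,
                          w_sim: float = 0.7,
                          w_info: float = 0.3) -> torch.Tensor:
    """Hypothetical fusion score combining two relations between tokens:
    feature similarity (redundancy) and pairwise uninformativeness
    (low attention received). Higher score = better candidates to fuse.

    tokens:        (N, C) token embeddings
    attn_received: (N,)   mean attention each token receives
    """
    feats = F.normalize(tokens, dim=-1)
    sim = feats @ feats.T                                          # (N, N) cosine similarity
    info = attn_received / attn_received.max().clamp_min(1e-6)
    uninformative = 1.0 - 0.5 * (info[:, None] + info[None, :])    # pairwise
    score = w_sim * sim + w_info * uninformative
    score.fill_diagonal_(float('-inf'))                            # never fuse a token with itself
    return score

tokens = torch.randn(6, 32)
attn = torch.rand(6)
score = multi_criteria_scores(tokens, attn)
i, j = divmod(score.argmax().item(), score.size(1))
print(f"fuse tokens {i} and {j}")  # most redundant, least informative pair
```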
1 code implementation • ICCV 2023 • Dongjun Lee, Seokwon Song, Jihee Suh, Joonmyung Choi, Sanghyeok Lee, Hyunwoo J. Kim
RPO leverages masked attention to prevent the internal representation shift in the pre-trained model.
Ranked #7 on Prompt Engineering on Caltech-101
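The masked-attention idea in this entry can be sketched as follows (a minimal, single-head illustration under my own assumptions, not the released RPO code): prompt tokens may attend to everything, but the original tokens cannot attend to the prompts, so the pre-trained tokens' representations are left untouched.

```python
import torch
import torch.nn.functional as F

def read_only_prompt_attention(x: torch.Tensor, prompts: torch.Tensor) -> torch.Tensor:
    """Sketch of read-only-prompt masked attention (hypothetical,
    single-head, no projections): prompts read from all tokens, while
    original tokens see only each other, preventing representation shift.

    x:       (B, N, C) original (pre-trained) tokens
    prompts: (B, P, C) learnable prompt tokens
    """
    B, N, C = x.shape
    P = prompts.size(1)
    q = k = v = torch.cat([x, prompts], dim=1)          # (B, N+P, C)
    mask = torch.zeros(N + P, N + P, dtype=torch.bool)
    mask[:N, :N] = True                                 # originals attend only to originals
    mask[N:, :] = True                                  # prompts attend to everything
    return F.scaled_dot_product_attention(q, k, v, attn_mask=mask)

x = torch.randn(2, 4, 8)
p = torch.randn(2, 2, 8)
out = read_only_prompt_attention(x, p)
print(out.shape)  # torch.Size([2, 6, 8]); out[:, :4] is unaffected by the prompts
```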
no code implementations • 23 Aug 2023 • Injae Kim, Jongha Kim, Joonmyung Choi, Hyunwoo J. Kim
However, those methods do not consider whether a concept is visually relevant or not, which is an important factor in computing meaningful concept scores.
1 code implementation • CVPR 2023 • Dohwan Ko, Joonmyung Choi, Hyeong Kyu Choi, Kyoung-Woon On, Byungseok Roh, Hyunwoo J. Kim
Therefore, we propose MEta Loss TRansformer (MELTR), a plug-in module that automatically and non-linearly combines various loss functions to aid in learning the target task via auxiliary learning.
Ranked #2 on Video Captioning on YouCook2
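The shape of that idea, learning a non-linear combination of several loss values instead of a fixed weighted sum, can be sketched with a tiny module (hypothetical layer sizes and training setup; not the released MELTR code): embed each loss value as a token, let a small Transformer mix them, and regress one scalar objective.

```python
import torch
import torch.nn as nn

class LossCombiner(nn.Module):
    """Minimal sketch of a learned, non-linear loss combiner."""
    def __init__(self, num_losses: int, dim: int = 32):
        super().__init__()
        self.embed = nn.Linear(1, dim)
        self.loss_id = nn.Parameter(torch.randn(num_losses, dim))  # identifies each loss
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=1)
        self.head = nn.Linear(dim, 1)

    def forward(self, losses: torch.Tensor) -> torch.Tensor:
        # losses: (K,) individual task/auxiliary loss values
        tok = self.embed(losses.unsqueeze(-1)) + self.loss_id      # (K, dim)
        out = self.encoder(tok.unsqueeze(0))                       # (1, K, dim)
        return self.head(out.mean(dim=1)).squeeze()                # scalar total loss

combiner = LossCombiner(num_losses=3)
aux_losses = torch.tensor([0.8, 1.2, 0.3])
total = combiner(aux_losses)
print(total.shape)  # torch.Size([]) -- a single differentiable objective
```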
1 code implementation • 14 Oct 2022 • Hyeong Kyu Choi, Joonmyung Choi, Hyunwoo J. Kim
To this end, we propose TokenMixup, an efficient attention-guided token-level data augmentation method that aims to maximize the saliency of a mixed set of tokens.
Ranked #72 on Image Classification on CIFAR-10
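A toy version of attention-guided, saliency-maximizing token mixing might look like the following (a sketch of the idea under simplified assumptions, not the authors' TokenMixup code): at each position keep the token from whichever sample is more salient there, and mix the labels by the resulting ratio.

```python
import torch

def token_mixup(tok_a: torch.Tensor, tok_b: torch.Tensor,
                sal_a: torch.Tensor, sal_b: torch.Tensor):
    """Toy attention-guided token-level mixup: keep the higher-saliency
    token at every position so the mixed sequence retains as much total
    saliency as possible.

    tok_a, tok_b: (N, C) token embeddings of two training samples
    sal_a, sal_b: (N,)   per-token saliency, e.g. attention rollout scores
    """
    take_a = sal_a >= sal_b                                   # (N,) boolean mask
    mixed = torch.where(take_a.unsqueeze(-1), tok_a, tok_b)   # (N, C)
    lam = take_a.float().mean()                               # label mix ratio for sample A
    return mixed, lam

tok_a, tok_b = torch.randn(16, 64), torch.randn(16, 64)
sal_a, sal_b = torch.rand(16), torch.rand(16)
mixed, lam = token_mixup(tok_a, tok_b, sal_a, sal_b)
print(mixed.shape, lam.item())  # mixed label = lam * y_a + (1 - lam) * y_b
```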
1 code implementation • CVPR 2022 • Dohwan Ko, Joonmyung Choi, Juyeon Ko, Shinyeong Noh, Kyoung-Woon On, Eun-Sol Kim, Hyunwoo J. Kim
In this paper, we propose a novel multi-modal self-supervised framework Video-Text Temporally Weak Alignment-based Contrastive Learning (VT-TWINS) to capture significant information from noisy and weakly correlated data using a variant of Dynamic Time Warping (DTW).
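To make the alignment component concrete, the sketch below computes a classic dynamic-time-warping cost between a video feature sequence and a text feature sequence; it is shown only to illustrate the idea, since VT-TWINS relies on its own DTW variant rather than this textbook recursion.

```python
import torch

def dtw_cost(video: torch.Tensor, text: torch.Tensor) -> torch.Tensor:
    """Textbook DTW cost between two feature sequences, using a
    cosine-distance cost matrix.

    video: (T, C) per-clip features, text: (S, C) per-sentence features
    """
    v = torch.nn.functional.normalize(video, dim=-1)
    t = torch.nn.functional.normalize(text, dim=-1)
    cost = 1.0 - v @ t.T                                      # (T, S) pairwise distances
    T, S = cost.shape
    acc = torch.full((T + 1, S + 1), float('inf'))
    acc[0, 0] = 0.0
    for i in range(1, T + 1):                                 # dynamic program over the cost matrix
        for j in range(1, S + 1):
            acc[i, j] = cost[i - 1, j - 1] + torch.min(
                torch.stack([acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1]]))
    return acc[T, S]

video_feats, text_feats = torch.randn(10, 128), torch.randn(4, 128)
print(dtw_cost(video_feats, text_feats))  # lower cost = better temporal alignment
```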