no code implementations • 5 Apr 2024 • Manjin Kim, Paul Hongsuck Seo, Cordelia Schmid, Minsu Cho
We introduce a new attention mechanism, dubbed structural self-attention (StructSA), that leverages rich correlation patterns naturally emerging in key-query interactions of attention.
Ranked #4 on Action Recognition on Diving-48
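To illustrate the idea of attending via correlation *patterns* rather than raw query-key similarities, here is a minimal NumPy sketch; the 1-D convolution over the correlation map and the kernel shape are illustrative stand-ins, not the paper's actual StructSA operator:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def structural_self_attention(x, w_q, w_k, w_v, corr_kernel):
    """Toy structural self-attention over a 1-D token sequence.

    Instead of using the raw query-key correlation map directly, each
    row of the map is convolved with a small kernel, so that local
    correlation patterns (not just point-wise similarities) shape the
    attention weights -- a simplified stand-in for StructSA.
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    corr = q @ k.T / np.sqrt(q.shape[-1])           # (T, T) correlation map
    pad = len(corr_kernel) // 2
    padded = np.pad(corr, ((0, 0), (pad, pad)), mode="edge")
    struct = np.stack([np.convolve(row, corr_kernel, mode="valid")
                       for row in padded])          # pattern-transformed map
    attn = softmax(struct, axis=-1)
    return attn @ v
```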
1 code implementation • CVPR 2023 • Seonghoon Yu, Paul Hongsuck Seo, Jeany Son
To overcome this issue, we propose a simple yet effective zero-shot referring image segmentation method by leveraging the pre-trained cross-modal knowledge from CLIP.
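A rough sketch of the zero-shot selection step, assuming candidate region masks and a referring expression have already been encoded into a shared CLIP-style embedding space (the actual method's mask proposal and feature extraction are abstracted away):

```python
import numpy as np

def pick_referred_mask(region_feats, text_feat):
    """Score each candidate region against the referring expression by
    cosine similarity in a shared image-text embedding space and return
    the index of the best-matching mask. Inputs are stand-ins for
    CLIP region/text embeddings."""
    r = region_feats / np.linalg.norm(region_feats, axis=1, keepdims=True)
    t = text_feat / np.linalg.norm(text_feat)
    return int(np.argmax(r @ t))
```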
no code implementations • CVPR 2023 • Paul Hongsuck Seo, Arsha Nagrani, Cordelia Schmid
(ii) We also introduce a simple curriculum scheme during training which we show is crucial to enable the model to jointly process audio and visual information effectively; and finally (iii) we show that our model achieves state-of-the-art zero-shot results on three different AV-ASR benchmarks (How2, VisSpeech and Ego4D), while also crucially preserving decent performance on traditional audio-only speech recognition benchmarks (LibriSpeech).
1 code implementation • CVPR 2023 • Sukmin Yun, Seong Hyeon Park, Paul Hongsuck Seo, Jinwoo Shin
In this paper, we introduce a novel image-free segmentation task where the goal is to perform semantic segmentation given only a set of the target semantic categories, but without any task-specific images and annotations.
3 code implementations • 21 Mar 2023 • Seokju Cho, Heeseong Shin, Sunghwan Hong, Anurag Arnab, Paul Hongsuck Seo, Seungryong Kim
Open-vocabulary semantic segmentation presents the challenge of labeling each pixel within an image based on a wide range of text descriptions.
Ranked #1 on Open Vocabulary Semantic Segmentation on ADE20K-150
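The core open-vocabulary step can be sketched as nearest-text-embedding classification per pixel; this toy version assumes pixel features and class-name embeddings already live in a shared space, and omits the cost-aggregation machinery of the actual method:

```python
import numpy as np

def label_pixels(pixel_feats, class_embeds):
    """Assign each pixel the category whose text embedding it matches
    best by cosine similarity.

    pixel_feats: (H, W, D) per-pixel features; class_embeds: (C, D)
    text embeddings of the target category names. Returns an (H, W)
    integer label map."""
    p = pixel_feats / np.linalg.norm(pixel_feats, axis=-1, keepdims=True)
    c = class_embeds / np.linalg.norm(class_embeds, axis=-1, keepdims=True)
    return np.argmax(p @ c.T, axis=-1)
```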
3 code implementations • CVPR 2023 • Antoine Yang, Arsha Nagrani, Paul Hongsuck Seo, Antoine Miech, Jordi Pont-Tuset, Ivan Laptev, Josef Sivic, Cordelia Schmid
In this work, we introduce Vid2Seq, a multi-modal single-stage dense event captioning model pretrained on narrated videos which are readily available at scale.
Ranked #1 on Dense Video Captioning on ActivityNet Captions (using extra training data)
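The single-stage formulation can be sketched by serializing timestamped events into one token sequence with quantized time tokens; the `<time_k>` token names and bin count below are illustrative, not Vid2Seq's exact vocabulary:

```python
def events_to_sequence(events, duration, n_bins=100):
    """Serialize (start, end, caption) events into a single token
    sequence, quantizing timestamps into `n_bins` special time tokens
    so one decoder can emit both event boundaries and captions."""
    seq = []
    for start, end, caption in sorted(events):
        s = min(int(start / duration * n_bins), n_bins - 1)
        e = min(int(end / duration * n_bins), n_bins - 1)
        seq += [f"<time_{s}>", f"<time_{e}>"] + caption.split()
    return seq
```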
no code implementations • 18 Nov 2022 • Paul Hongsuck Seo, Arsha Nagrani, Cordelia Schmid
In this report, we describe our submission to the Ego4D AudioVisual (AV) Speech Transcription Challenge 2022.
1 code implementation • 15 Jun 2022 • Valentin Gabeur, Paul Hongsuck Seo, Arsha Nagrani, Chen Sun, Karteek Alahari, Cordelia Schmid
Audio-visual automatic speech recognition (AV-ASR) is an extension of ASR that incorporates visual cues, often from the movements of a speaker's mouth.
Automatic Speech Recognition (ASR) +1
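A minimal sketch of conditioning a recognizer on visual context, assuming precomputed audio-frame and visual-frame features; this simple pool-and-concatenate fusion is an illustration, not the paper's architecture:

```python
import numpy as np

def fuse_audio_visual(audio_feats, visual_feats):
    """Early-fusion sketch for AV-ASR: mean-pool the visual track and
    concatenate the pooled vector to every audio frame, so downstream
    recognition can condition on visual cues.

    audio_feats: (T, Da); visual_feats: (F, Dv) -> returns (T, Da+Dv)."""
    v = visual_feats.mean(axis=0)
    v_tiled = np.broadcast_to(v, (audio_feats.shape[0], v.shape[0]))
    return np.concatenate([audio_feats, v_tiled], axis=1)
```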
no code implementations • 1 Apr 2022 • Arsha Nagrani, Paul Hongsuck Seo, Bryan Seybold, Anja Hauth, Santiago Manen, Chen Sun, Cordelia Schmid
To close this gap we propose a new video mining pipeline which involves transferring captions from image captioning datasets to video clips with no additional manual effort.
Ranked #6 on Zero-shot Text to Audio Retrieval on AudioCaps
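The mining pipeline's transfer step can be sketched as nearest-neighbor caption transfer in a shared embedding space; the threshold and the assumption of precomputed clip/image embeddings are illustrative choices:

```python
import numpy as np

def transfer_captions(clip_feats, image_feats, image_captions, threshold=0.8):
    """For each video clip, find its nearest captioned image by cosine
    similarity and transfer that caption if the similarity clears a
    threshold; otherwise leave the clip unlabeled (None). A toy version
    of similarity-based caption mining with no manual effort."""
    c = clip_feats / np.linalg.norm(clip_feats, axis=1, keepdims=True)
    i = image_feats / np.linalg.norm(image_feats, axis=1, keepdims=True)
    sims = c @ i.T
    out = []
    for row in sims:
        j = int(np.argmax(row))
        out.append(image_captions[j] if row[j] >= threshold else None)
    return out
```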
no code implementations • CVPR 2022 • Paul Hongsuck Seo, Arsha Nagrani, Anurag Arnab, Cordelia Schmid
Recent video and language pretraining frameworks lack the ability to generate sentences.
Ranked #14 on Video Captioning on MSR-VTT (using extra training data)
no code implementations • CVPR 2021 • Paul Hongsuck Seo, Arsha Nagrani, Cordelia Schmid
Leveraging recent advances in multimodal learning, our model consists of a novel co-attentional multimodal video transformer, and when trained on both textual and visual context, outperforms baselines that use textual inputs alone.
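One co-attentional step can be sketched as symmetric cross-attention, where text tokens attend over video tokens and vice versa; projections, heads, and normalization layers of the actual transformer are omitted for brevity:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def co_attention(text, video):
    """Symmetric cross-attention sketch: each modality is updated with
    a residual attention readout over the other, so text and video
    contextualize one another."""
    d = text.shape[-1]
    a_tv = softmax(text @ video.T / np.sqrt(d), axis=-1)   # text -> video
    a_vt = softmax(video @ text.T / np.sqrt(d), axis=-1)   # video -> text
    return text + a_tv @ video, video + a_vt @ text
```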
1 code implementation • NeurIPS 2019 • Paul Hongsuck Seo, Geeho Kim, Bohyung Han
Label noise is one of the critical factors that significantly degrade the generalization performance of deep neural networks.
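The standard experimental setup for studying this problem injects symmetric label noise into a clean dataset; a minimal sketch (the noise model is the common benchmark convention, not this paper's specific method):

```python
import numpy as np

def corrupt_labels(labels, n_classes, noise_rate, rng):
    """Symmetric label noise: with probability `noise_rate`, flip each
    label to a different class chosen uniformly at random."""
    labels = labels.copy()
    flip = rng.random(len(labels)) < noise_rate
    for idx in np.where(flip)[0]:
        choices = [c for c in range(n_classes) if c != labels[idx]]
        labels[idx] = rng.choice(choices)
    return labels
```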
no code implementations • 21 Nov 2019 • Paul Hongsuck Seo, Piyush Sharma, Tomer Levinboim, Bohyung Han, Radu Soricut
Human ratings are currently the most accurate way to assess the quality of an image captioning model, yet an expensive human rating evaluation typically yields only a few overall statistics over the evaluation dataset.
no code implementations • 3 Oct 2019 • Wonpyo Park, Paul Hongsuck Seo, Bohyung Han, Minsu Cho
We introduce a novel stochastic regularization technique for deep neural networks, which decomposes a layer into multiple branches with different parameters and merges stochastically sampled combinations of the outputs from the branches during training.
no code implementations • CVPR 2019 • Seonguk Seo, Paul Hongsuck Seo, Bohyung Han
The proposed loss function enables us to learn deep neural networks that predict confidence calibrated scores using a single inference.
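Calibration of this kind is commonly measured with Expected Calibration Error (ECE), which the sketch below computes; this is the standard metric, not the paper's proposed loss:

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Bin predictions by confidence and average the gap between mean
    confidence and accuracy within each bin, weighted by bin size."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece, n = 0.0, len(confidences)
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(confidences[mask].mean() - correct[mask].mean())
            ece += mask.sum() / n * gap
    return ece
```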
no code implementations • ECCV 2018 • Paul Hongsuck Seo, Tobias Weyand, Jack Sim, Bohyung Han
Image geolocalization is the task of identifying the location depicted in a photo based only on its visual information.
Ranked #1 on Photo geolocation estimation on Im2GPS (Reference images metric)
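Geolocalization is often posed as classification over geographic cells; a simple sketch that reads out a location from cell probabilities (the cell partitioning itself, which is the paper's focus, is abstracted away):

```python
import numpy as np

def geolocate(cell_probs, cell_centers):
    """Predict a (lat, lng) as the probability-weighted average of
    geographic cell centers, given classifier probabilities over cells.
    A simplified readout for cell-classification geolocalization."""
    return cell_probs @ cell_centers / cell_probs.sum()
```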
no code implementations • ECCV 2018 • Paul Hongsuck Seo, Jongmin Lee, Deunsol Jung, Bohyung Han, Minsu Cho
Semantic correspondence is the problem of establishing correspondences across images depicting different instances of the same object or scene class.
no code implementations • NeurIPS 2017 • Paul Hongsuck Seo, Andreas Lehrmann, Bohyung Han, Leonid Sigal
From this memory, the model retrieves the previous attention that is most relevant to the current question, taking recency into account, in order to resolve potentially ambiguous references.
Ranked #13 on Visual Dialog on VisDial v0.9 val (R@1 metric)
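The retrieval step can be sketched as a relevance-and-recency weighted combination of stored attention maps; the exponential decay and the precomputed relevance scores below are illustrative simplifications:

```python
import numpy as np

def retrieve_attention(memory, query_sims, decay=0.9):
    """Combine previously stored attention maps, weighting each by its
    relevance to the current question (`query_sims`) and by recency
    (older entries decay more). A toy rendering of attention-memory
    retrieval for reference resolution."""
    t = len(memory)
    recency = decay ** np.arange(t - 1, -1, -1)   # oldest decays most
    w = query_sims * recency
    w = w / w.sum()
    return sum(wi * m for wi, m in zip(w, memory))
```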
no code implementations • ICCV 2017 • Jonghwan Mun, Paul Hongsuck Seo, Ilchae Jung, Bohyung Han
To address this objective, we automatically generate a customized synthetic VideoQA dataset using Super Mario Bros. gameplay videos so that it contains events with different levels of reasoning complexity.
1 code implementation • 8 Jun 2016 • Paul Hongsuck Seo, Zhe Lin, Scott Cohen, Xiaohui Shen, Bohyung Han
We propose a novel attention model that accurately attends to target objects of various scales and shapes in images.
1 code implementation • CVPR 2016 • Hyeonwoo Noh, Paul Hongsuck Seo, Bohyung Han
We tackle the image question answering (ImageQA) problem by learning a convolutional neural network (CNN) with a dynamic parameter layer whose weights are determined adaptively based on questions.
Image Retrieval with Multi-Modal Query • Parameter Prediction +2
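The dynamic parameter layer amounts to a small hypernetwork: a question embedding is mapped to the weights of a linear layer. The sketch below omits the parameter hashing the paper uses to keep the prediction compact; shapes and names are illustrative:

```python
import numpy as np

def dynamic_layer(x, question_embed, w_hyper):
    """Question-conditioned linear layer: `w_hyper` maps the question
    embedding to a flat weight vector, which is reshaped into the
    layer's (d_in, d_out) weight matrix and applied to `x`."""
    d_in = x.shape[-1]
    d_out = w_hyper.shape[-1] // d_in
    w = (question_embed @ w_hyper).reshape(d_in, d_out)
    return x @ w
```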