no code implementations • 6 Jun 2025 • Seung-jae Lee, Paul Hongsuck Seo
Audiovisual segmentation (AVS) aims to identify visual regions corresponding to sound sources, playing a vital role in video understanding, surveillance, and human-computer interaction.
no code implementations • 3 Jun 2025 • Geonyoung Lee, Geonhee Han, Paul Hongsuck Seo
Language-queried Audio Source Separation (LASS) enables open-vocabulary sound separation via natural language queries.
no code implementations • 2 Apr 2025 • Dohyun Kim, Sehwan Park, Geonhee Han, Seung Wook Kim, Paul Hongsuck Seo
In this work, we propose Random Conditioning, a novel approach that pairs noised images with randomly selected text conditions to enable efficient, image-free knowledge distillation.
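A minimal sketch of the random-conditioning idea described above, assuming generic teacher/student diffusion denoisers with a (latents, t, text) interface; all names, shapes, and the text bank are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def random_conditioning_step(teacher, student, text_bank, batch_size=8, num_steps=1000):
    # Image-free: start from random latents rather than a paired image-caption dataset.
    latents = torch.randn(batch_size, 4, 64, 64)
    t = torch.randint(0, num_steps, (batch_size,))
    # Random conditioning: draw text prompts independently of the latents.
    idx = torch.randint(len(text_bank), (batch_size,))
    cond = [text_bank[i] for i in idx]
    with torch.no_grad():
        target = teacher(latents, t, cond)   # teacher's denoising prediction
    pred = student(latents, t, cond)         # student learns to mimic the teacher
    return F.mse_loss(pred, target)
```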
2 code implementations • 31 Mar 2025 • Guhnoo Yun, Juhan Yoo, Kijung Kim, Jeongho Lee, Paul Hongsuck Seo, Dong Hwan Kim
Recent studies have shown that 2D convolution and self-attention exhibit distinct spectral behaviors, and optimizing their spectral properties can enhance vision model performance.
no code implementations • 10 Feb 2025 • Sumin An, Junyoung Sung, Wonpyo Park, Chanjun Park, Paul Hongsuck Seo
While large language models (LLMs) excel in generating coherent and contextually rich outputs, their capacity to efficiently handle long-form contexts is limited by fixed-length position embeddings.
no code implementations • CVPR 2025 • Dohyun Kim, Sehwan Park, Geonhee Han, Seung Wook Kim, Paul Hongsuck Seo
In this work, we propose Random Conditioning, a novel approach that pairs noised images with randomly selected text conditions to enable efficient, image-free knowledge distillation.
1 code implementation • 2 Dec 2024 • Sangbeom Lim, Seongchan Kim, Seungjun An, Seokju Cho, Paul Hongsuck Seo, Seungryong Kim
Thus, developing a new video segmentation dataset aimed at tracking multi-granularity segmentation targets in video scenes is necessary.
no code implementations • 30 Sep 2024 • Heeseong Shin, Chaehyun Kim, Sunghwan Hong, Seokju Cho, Anurag Arnab, Paul Hongsuck Seo, Seungryong Kim
Large-scale vision-language models like CLIP have demonstrated impressive open-vocabulary capabilities for image-level tasks, excelling in recognizing what objects are present.
1 code implementation • 10 Jul 2024 • Seonghoon Yu, Paul Hongsuck Seo, Jeany Son
To address this challenge, we propose a two-fold strategy to generate distinctive captions: 1) 'distinctive caption sampling', a new decoding method for the captioning model, to generate multiple expression candidates with detailed words focusing on the target.
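A rough sketch of how such distinctive caption sampling could work: sample several candidate expressions for the target region and keep the one that best separates the target from other regions under a cross-modal score. The `captioner.sample` and `clip_model.encode_region` calls below are hypothetical helpers for illustration, not the paper's exact procedure.

```python
import torch

def distinctive_caption(captioner, clip_model, image, target_box, distractor_boxes, n=8):
    # Sample several candidate expressions for the target region (hypothetical API).
    candidates = [captioner.sample(image, target_box) for _ in range(n)]
    best, best_margin = None, float("-inf")
    for cap in candidates:
        text_feat = clip_model.encode_text(cap)
        target_sim = torch.cosine_similarity(
            text_feat, clip_model.encode_region(image, target_box), dim=-1)
        distractor_sim = max(
            torch.cosine_similarity(
                text_feat, clip_model.encode_region(image, box), dim=-1)
            for box in distractor_boxes)
        # Keep the caption that separates the target from distractors the most.
        margin = (target_sim - distractor_sim).item()
        if margin > best_margin:
            best, best_margin = cap, margin
    return best
```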
no code implementations • CVPR 2024 • Manjin Kim, Paul Hongsuck Seo, Cordelia Schmid, Minsu Cho
We introduce a new attention mechanism, dubbed structural self-attention (StructSA), that leverages rich correlation patterns naturally emerging in key-query interactions of attention.
Ranked #5 on Action Recognition on Diving-48
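StructSA, as described in the entry above, exploits the structure of query-key correlation maps. A much-simplified, hypothetical sketch of that general idea is to treat each query's map of attention logits as a spatial signal and refine it with a small convolution before the softmax; this is only an illustration, not the paper's formulation.

```python
import torch
import torch.nn as nn

class StructureAwareAttention(nn.Module):
    """Toy structure-aware attention: convolve each query's correlation map."""
    def __init__(self, dim, h, w):
        super().__init__()
        self.h, self.w = h, w
        self.qkv = nn.Linear(dim, dim * 3)
        self.refine = nn.Conv2d(1, 1, kernel_size=3, padding=1)  # acts on each HxW logit map
        self.scale = dim ** -0.5

    def forward(self, x):                                        # x: (B, H*W, dim)
        B, N, D = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        logits = q @ k.transpose(-2, -1) * self.scale            # (B, N, N) correlations
        maps = logits.reshape(B * N, 1, self.h, self.w)          # one spatial map per query
        logits = self.refine(maps).reshape(B, N, N)              # exploit spatial structure
        attn = logits.softmax(dim=-1)
        return attn @ v
```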
1 code implementation • CVPR 2023 • Seonghoon Yu, Paul Hongsuck Seo, Jeany Son
To overcome this issue, we propose a simple yet effective zero-shot referring image segmentation method by leveraging the pre-trained cross-modal knowledge from CLIP.
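A rough sketch of the kind of zero-shot pipeline described above: score class-agnostic mask proposals against the referring expression with off-the-shelf CLIP and keep the best-matching mask. The proposal source and the simple masking strategy here are assumptions for illustration, not the paper's exact method.

```python
import numpy as np
import torch
import clip
from PIL import Image

def zero_shot_referring_segmentation(image, expression, mask_proposals):
    """image: PIL.Image; mask_proposals: list of boolean (H, W) masks from any generator."""
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model, preprocess = clip.load("ViT-B/32", device=device)
    text = clip.tokenize([expression]).to(device)
    with torch.no_grad():
        text_feat = model.encode_text(text)
        text_feat /= text_feat.norm(dim=-1, keepdim=True)
        scores = []
        for mask in mask_proposals:
            # Blank out everything outside the proposal and encode the masked image.
            arr = np.array(image).copy()
            arr[~mask] = 0
            region = preprocess(Image.fromarray(arr)).unsqueeze(0).to(device)
            img_feat = model.encode_image(region)
            img_feat /= img_feat.norm(dim=-1, keepdim=True)
            scores.append((img_feat @ text_feat.T).item())
    return mask_proposals[int(np.argmax(scores))]
```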
no code implementations • CVPR 2023 • Paul Hongsuck Seo, Arsha Nagrani, Cordelia Schmid
(ii) We also introduce a simple curriculum scheme during training which we show is crucial to enable the model to jointly process audio and visual information effectively; and finally (iii) we show that our model achieves state-of-the-art zero-shot results on three different AV-ASR benchmarks (How2, VisSpeech, and Ego4D), while also crucially preserving decent performance on traditional audio-only speech recognition benchmarks (LibriSpeech).
1 code implementation • CVPR 2023 • Sukmin Yun, Seong Hyeon Park, Paul Hongsuck Seo, Jinwoo Shin
In this paper, we introduce a novel image-free segmentation task where the goal is to perform semantic segmentation given only a set of target semantic categories, without any task-specific images or annotations.
3 code implementations • CVPR 2024 • Seokju Cho, Heeseong Shin, Sunghwan Hong, Anurag Arnab, Paul Hongsuck Seo, Seungryong Kim
Open-vocabulary semantic segmentation presents the challenge of labeling each pixel within an image based on a wide range of text descriptions.
3 code implementations • CVPR 2023 • Antoine Yang, Arsha Nagrani, Paul Hongsuck Seo, Antoine Miech, Jordi Pont-Tuset, Ivan Laptev, Josef Sivic, Cordelia Schmid
In this work, we introduce Vid2Seq, a multi-modal single-stage dense event captioning model pretrained on narrated videos, which are readily available at scale.
Ranked #1 on Dense Video Captioning on ActivityNet Captions (using extra training data)
no code implementations • 18 Nov 2022 • Paul Hongsuck Seo, Arsha Nagrani, Cordelia Schmid
In this report, we describe our submission to the Ego4D AudioVisual (AV) Speech Transcription Challenge 2022.
1 code implementation • 15 Jun 2022 • Valentin Gabeur, Paul Hongsuck Seo, Arsha Nagrani, Chen Sun, Karteek Alahari, Cordelia Schmid
Audio-visual automatic speech recognition (AV-ASR) is an extension of ASR that incorporates visual cues, often from the movements of a speaker's mouth.
Automatic Speech Recognition (ASR)
no code implementations • 1 Apr 2022 • Arsha Nagrani, Paul Hongsuck Seo, Bryan Seybold, Anja Hauth, Santiago Manen, Chen Sun, Cordelia Schmid
To close this gap, we propose a new video mining pipeline that transfers captions from image captioning datasets to video clips with no additional manual effort.
Ranked #6 on Zero-shot Text to Audio Retrieval on AudioCaps
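The caption-transfer idea described above can be sketched as retrieving, for each captioned image, the most visually similar video frame and attaching the caption to the clip containing that frame. The feature inputs and the similarity threshold below are illustrative assumptions, not the exact pipeline.

```python
import torch
import torch.nn.functional as F

def transfer_captions(image_feats, captions, frame_feats, frame_to_clip, sim_threshold=0.3):
    """image_feats: (M, D) features of captioned images; captions: list of M strings.
    frame_feats: (F, D) features of sampled video frames; frame_to_clip: (F,) clip ids."""
    image_feats = F.normalize(image_feats, dim=-1)
    frame_feats = F.normalize(frame_feats, dim=-1)
    sims = image_feats @ frame_feats.T                  # (M, F) image-to-frame similarity
    mined = []
    for i in range(sims.size(0)):
        best = sims[i].argmax()
        if sims[i, best] >= sim_threshold:              # keep only confident matches
            # Transfer the image caption to the clip containing the matched frame.
            mined.append((int(frame_to_clip[best]), captions[i]))
    return mined
```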
no code implementations • CVPR 2022 • Paul Hongsuck Seo, Arsha Nagrani, Anurag Arnab, Cordelia Schmid
Recent video and language pretraining frameworks lack the ability to generate sentences.
Ranked #15 on Video Captioning on MSR-VTT (using extra training data)
no code implementations • CVPR 2021 • Paul Hongsuck Seo, Arsha Nagrani, Cordelia Schmid
Leveraging recent advances in multimodal learning, our model consists of a novel co-attentional multimodal video transformer, and when trained on both textual and visual context, outperforms baselines that use textual inputs alone.
1 code implementation • NeurIPS 2019 • Paul Hongsuck Seo, Geeho Kim, Bohyung Han
Label noise is one of the critical factors that significantly degrade the generalization performance of deep neural networks.
no code implementations • 21 Nov 2019 • Paul Hongsuck Seo, Piyush Sharma, Tomer Levinboim, Bohyung Han, Radu Soricut
Human ratings are currently the most accurate way to assess the quality of an image captioning model, yet most often the only outcome retained from an expensive human rating evaluation is a few overall statistics over the evaluation dataset.
no code implementations • 3 Oct 2019 • Wonpyo Park, Paul Hongsuck Seo, Bohyung Han, Minsu Cho
We introduce a novel stochastic regularization technique for deep neural networks, which decomposes a layer into multiple branches with different parameters and merges stochastically sampled combinations of the outputs from the branches during training.
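A minimal sketch of the regularizer described above: a layer is decomposed into several parallel branches, and during training a randomly sampled subset of their outputs is merged. The branch count, merging rule, and names are assumptions for illustration, not the paper's exact construction.

```python
import torch
import torch.nn as nn

class StochasticBranchLayer(nn.Module):
    """A layer decomposed into branches whose outputs are merged stochastically in training."""
    def __init__(self, in_dim, out_dim, num_branches=4):
        super().__init__()
        self.branches = nn.ModuleList(nn.Linear(in_dim, out_dim) for _ in range(num_branches))

    def forward(self, x):
        outs = torch.stack([branch(x) for branch in self.branches], dim=0)  # (K, B, out_dim)
        if self.training:
            # Sample a random non-empty subset of branches and average their outputs.
            keep = torch.rand(len(self.branches)) < 0.5
            if not keep.any():
                keep[torch.randint(len(self.branches), (1,))] = True
            return outs[keep].mean(dim=0)
        return outs.mean(dim=0)      # at test time, merge all branches deterministically
```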
no code implementations • CVPR 2019 • Seonguk Seo, Paul Hongsuck Seo, Bohyung Han
The proposed loss function enables us to learn deep neural networks that predict confidence-calibrated scores using a single inference.
no code implementations • ECCV 2018 • Paul Hongsuck Seo, Tobias Weyand, Jack Sim, Bohyung Han
Image geolocalization is the task of identifying the location depicted in a photo based only on its visual information.
Ranked #1 on Photo geolocation estimation on Im2GPS (Reference images metric)
no code implementations • ECCV 2018 • Paul Hongsuck Seo, Jongmin Lee, Deunsol Jung, Bohyung Han, Minsu Cho
Semantic correspondence is the problem of establishing correspondences across images depicting different instances of the same object or scene class.
no code implementations • NeurIPS 2017 • Paul Hongsuck Seo, Andreas Lehrmann, Bohyung Han, Leonid Sigal
From this memory, the model retrieves the previous attention that is most relevant to the current question, taking recency into account, in order to resolve potentially ambiguous references.
Ranked #13 on Visual Dialog on VisDial v0.9 val (R@1 metric)
no code implementations • ICCV 2017 • Jonghwan Mun, Paul Hongsuck Seo, Ilchae Jung, Bohyung Han
To address this objective, we automatically generate a customized synthetic VideoQA dataset using Super Mario Bros. gameplay videos so that it contains events with different levels of reasoning complexity.
1 code implementation • 8 Jun 2016 • Paul Hongsuck Seo, Zhe Lin, Scott Cohen, Xiaohui Shen, Bohyung Han
We propose a novel attention model that accurately attends to target objects of various scales and shapes in images.
1 code implementation • CVPR 2016 • Hyeonwoo Noh, Paul Hongsuck Seo, Bohyung Han
We tackle the image question answering (ImageQA) problem by learning a convolutional neural network (CNN) with a dynamic parameter layer whose weights are determined adaptively based on questions.
Image Retrieval with Multi-Modal Query, Parameter Prediction
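The dynamic parameter layer mentioned above predicts part of the network's weights from the question. Below is a heavily simplified sketch with a plain linear predictor and no hashing trick; the module names and sizes are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class DynamicParameterLayer(nn.Module):
    """Applies a linear map to image features whose weights are predicted from the question."""
    def __init__(self, img_dim, q_dim, out_dim):
        super().__init__()
        self.img_dim, self.out_dim = img_dim, out_dim
        # Predict the weight matrix of the dynamic layer from the question embedding.
        self.param_predictor = nn.Linear(q_dim, img_dim * out_dim)

    def forward(self, img_feat, q_feat):       # img_feat: (B, img_dim), q_feat: (B, q_dim)
        w = self.param_predictor(q_feat).view(-1, self.out_dim, self.img_dim)
        return torch.bmm(w, img_feat.unsqueeze(-1)).squeeze(-1)   # (B, out_dim)

# Usage sketch: answer logits depend on both the image and the question.
# layer = DynamicParameterLayer(img_dim=2048, q_dim=512, out_dim=1000)
# logits = layer(cnn_features, question_embedding)
```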