Search Results for author: Paul Hongsuck Seo

Found 21 papers, 8 papers with code

Zero-shot Referring Image Segmentation with Global-Local Context Features

1 code implementation · CVPR 2023 · Seonghoon Yu, Paul Hongsuck Seo, Jeany Son

To overcome this issue, we propose a simple yet effective zero-shot referring image segmentation method by leveraging the pre-trained cross-modal knowledge from CLIP.

Image Segmentation · Referring Expression · +4
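As a rough illustration of the core idea, the sketch below scores candidate masks against a referring expression with off-the-shelf CLIP. The mask proposals are assumed given, and the paper's global-local context features are omitted, so this is a minimal baseline in the spirit of the method, not the authors' implementation.

```python
# Minimal sketch: rank candidate masks for a referring expression with CLIP.
# Assumes the `clip` package (github.com/openai/CLIP) and precomputed binary
# mask proposals; the paper's global-local context features refine this
# basic mask-text matching and are not shown here.
import numpy as np
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def score_masks(image, masks, expression):
    """CLIP similarity between each masked image and the expression."""
    text = clip.tokenize([expression]).to(device)
    crops = []
    for m in masks:  # m: HxW boolean numpy array
        arr = np.array(image).copy()
        arr[~m] = 127  # grey out the background, keep the referred region
        crops.append(preprocess(Image.fromarray(arr)))
    batch = torch.stack(crops).to(device)
    with torch.no_grad():
        img_f = model.encode_image(batch).float()
        txt_f = model.encode_text(text).float()
        img_f = img_f / img_f.norm(dim=-1, keepdim=True)
        txt_f = txt_f / txt_f.norm(dim=-1, keepdim=True)
    return (img_f @ txt_f.T).squeeze(-1)  # one score per mask

# best = masks[int(score_masks(img, masks, "the dog on the left").argmax())]
```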

AVFormer: Injecting Vision into Frozen Speech Models for Zero-Shot AV-ASR

no code implementations · CVPR 2023 · Paul Hongsuck Seo, Arsha Nagrani, Cordelia Schmid

(ii) We also introduce a simple curriculum scheme during training, which we show is crucial for the model to jointly process audio and visual information effectively; and finally (iii) we show that our model achieves state-of-the-art zero-shot results on three different AV-ASR benchmarks (How2, VisSpeech and Ego4D), while also crucially preserving decent performance on traditional audio-only speech recognition benchmarks (LibriSpeech).

Automatic Speech Recognition · Domain Adaptation · +2

IFSeg: Image-free Semantic Segmentation via Vision-Language Model

1 code implementation · CVPR 2023 · Sukmin Yun, Seong Hyeon Park, Paul Hongsuck Seo, Jinwoo Shin

In this paper, we introduce a novel image-free segmentation task where the goal is to perform semantic segmentation given only a set of the target semantic categories, but without any task-specific images and annotations.

Image Segmentation · Language Modelling · +3

CAT-Seg: Cost Aggregation for Open-Vocabulary Semantic Segmentation

3 code implementations · 21 Mar 2023 · Seokju Cho, Heeseong Shin, Sunghwan Hong, Seungjun An, Seungjun Lee, Anurag Arnab, Paul Hongsuck Seo, Seungryong Kim

However, transferring the capabilities learned from image-level supervision to the pixel-level task of segmentation, while handling arbitrary unseen categories at inference, makes this task challenging.

Image Segmentation · Open Vocabulary Semantic Segmentation · +3
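The "cost" in the title is a pixel-class matching-cost volume between dense image embeddings and class text embeddings, which the model then aggregates. A minimal sketch with random tensors follows; the encoders and the aggregation transformer itself are assumed away:

```python
# Sketch: the raw pixel-class cosine cost volume that cost aggregation
# operates on. Shapes are illustrative; real features would come from
# CLIP's image and text encoders.
import torch
import torch.nn.functional as F

B, D, H, W, C = 2, 512, 24, 24, 171             # batch, dim, spatial, classes
pixel_feats = F.normalize(torch.randn(B, D, H, W), dim=1)
text_feats  = F.normalize(torch.randn(C, D), dim=1)  # one embedding per class

# cost[b, c, h, w] = cosine similarity of pixel (h, w) with class c
cost = torch.einsum("bdhw,cd->bchw", pixel_feats, text_feats)

# A naive open-vocabulary baseline takes the per-pixel argmax; the paper
# instead refines this volume with aggregation layers before prediction.
pred = cost.argmax(dim=1)                       # B x H x W class map
```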

Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning

3 code implementations · CVPR 2023 · Antoine Yang, Arsha Nagrani, Paul Hongsuck Seo, Antoine Miech, Jordi Pont-Tuset, Ivan Laptev, Josef Sivic, Cordelia Schmid

In this work, we introduce Vid2Seq, a multi-modal single-stage dense event captioning model pretrained on narrated videos, which are readily available at scale.

Ranked #1 on Dense Video Captioning on ActivityNet Captions (using extra training data)

Dense Video Captioning · Language Modelling · +1
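The single-stage formulation serializes event boundaries and captions into one token sequence by quantizing timestamps into special time tokens. A rough sketch of that target construction is below; the bin count and token format are illustrative, not the paper's exact vocabulary:

```python
# Sketch: build a Vid2Seq-style target sequence in which event start/end
# times become discrete time tokens interleaved with caption text.
N_BINS = 100  # illustrative number of relative-time bins

def time_token(t, duration):
    """Quantize a timestamp into one of N_BINS relative time tokens."""
    b = min(int(t / duration * N_BINS), N_BINS - 1)
    return f"<time={b}>"

def build_target(events, duration):
    """events: list of (start_sec, end_sec, caption) tuples."""
    parts = []
    for start, end, caption in sorted(events):
        parts += [time_token(start, duration), time_token(end, duration), caption]
    return " ".join(parts)

print(build_target([(0.0, 2.0, "a man opens a door"),
                    (1.0, 6.0, "he walks into the kitchen")], duration=8.0))
# -> "<time=0> <time=25> a man opens a door <time=12> <time=75> he walks ..."
```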

AVATAR submission to the Ego4D AV Transcription Challenge

no code implementations · 18 Nov 2022 · Paul Hongsuck Seo, Arsha Nagrani, Cordelia Schmid

In this report, we describe our submission to the Ego4D AudioVisual (AV) Speech Transcription Challenge 2022.

AVATAR: Unconstrained Audiovisual Speech Recognition

1 code implementation · 15 Jun 2022 · Valentin Gabeur, Paul Hongsuck Seo, Arsha Nagrani, Chen Sun, Karteek Alahari, Cordelia Schmid

Audio-visual automatic speech recognition (AV-ASR) is an extension of ASR that incorporates visual cues, often from the movements of a speaker's mouth.

Automatic Speech Recognition · Automatic Speech Recognition (ASR) · +1

Learning Audio-Video Modalities from Image Captions

no code implementations · 1 Apr 2022 · Arsha Nagrani, Paul Hongsuck Seo, Bryan Seybold, Anja Hauth, Santiago Manen, Chen Sun, Cordelia Schmid

To close this gap we propose a new video mining pipeline which involves transferring captions from image captioning datasets to video clips with no additional manual effort.

Image Captioning · Retrieval · +4
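The mining step can be pictured as nearest-neighbour matching between captioned images and video frames. A toy sketch with placeholder embeddings follows; the encoder choice, candidate pool, and threshold are assumptions for illustration, not the paper's settings:

```python
# Sketch: transfer captions from an image dataset to video clips by visual
# similarity. Embeddings are random placeholders standing in for a real
# visual encoder applied to seed images and video frames.
import torch
import torch.nn.functional as F

captions  = ["a dog catches a frisbee", "a chef chops onions"]
img_emb   = F.normalize(torch.randn(len(captions), 512), dim=-1)  # seed images
frame_emb = F.normalize(torch.randn(1000, 512), dim=-1)           # video frames

sim = img_emb @ frame_emb.T                  # caption-to-frame similarity
best_sim, best_frame = sim.max(dim=1)
THRESH = 0.8                                 # illustrative cutoff
for cap, s, f in zip(captions, best_sim, best_frame):
    if s > THRESH:  # keep only confident matches, mine the clip around them
        print(f"transfer '{cap}' to clip around frame {int(f)} "
              f"(sim={s.item():.2f})")
```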

Look Before you Speak: Visually Contextualized Utterances

no code implementations · CVPR 2021 · Paul Hongsuck Seo, Arsha Nagrani, Cordelia Schmid

Leveraging recent advances in multimodal learning, our model consists of a novel co-attentional multimodal video transformer, and when trained on both textual and visual context, outperforms baselines that use textual inputs alone.
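A minimal sketch of the co-attention pattern the description points at, in which each modality queries the other with standard multi-head attention. Dimensions and the surrounding video transformer are illustrative, not the paper's exact architecture:

```python
# Sketch: one co-attention step between text and video token streams, in
# the spirit of a co-attentional multimodal transformer.
import torch
import torch.nn as nn

class CoAttentionBlock(nn.Module):
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.txt_to_vid = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.vid_to_txt = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, text, video):
        # each stream queries the other modality's tokens
        t, _ = self.txt_to_vid(query=text, key=video, value=video)
        v, _ = self.vid_to_txt(query=video, key=text, value=text)
        return text + t, video + v  # residual updates

text  = torch.randn(2, 12, 256)   # batch x text tokens x dim
video = torch.randn(2, 64, 256)   # batch x video tokens x dim
text, video = CoAttentionBlock()(text, video)
```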

Combinatorial Inference against Label Noise

1 code implementation · NeurIPS 2019 · Paul Hongsuck Seo, Geeho Kim, Bohyung Han

Label noise is one of the critical factors that significantly degrade the generalization performance of deep neural networks.

Clustering

Reinforcing an Image Caption Generator Using Off-Line Human Feedback

no code implementations · 21 Nov 2019 · Paul Hongsuck Seo, Piyush Sharma, Tomer Levinboim, Bohyung Han, Radu Soricut

Human ratings are currently the most accurate way to assess the quality of an image captioning model, yet most often the only outcome used from an expensive human rating evaluation is a few overall statistics over the evaluation dataset.

Image Captioning

Regularizing Neural Networks via Stochastic Branch Layers

no code implementations · 3 Oct 2019 · Wonpyo Park, Paul Hongsuck Seo, Bohyung Han, Minsu Cho

We introduce a novel stochastic regularization technique for deep neural networks, which decomposes a layer into multiple branches with different parameters and merges stochastically sampled combinations of the outputs from the branches during training.
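Reading the description literally, a sketch of one such layer in PyTorch: the layer is split into parallel branches, a random subset of branch outputs is averaged during training, and all branches are averaged at inference. The branch count and sampling scheme here are illustrative stand-ins:

```python
# Sketch: a linear layer decomposed into parallel branches whose outputs
# are stochastically combined during training.
import torch
import torch.nn as nn

class StochasticBranchLinear(nn.Module):
    def __init__(self, d_in, d_out, n_branches=4):
        super().__init__()
        self.branches = nn.ModuleList(nn.Linear(d_in, d_out)
                                      for _ in range(n_branches))

    def forward(self, x):
        outs = torch.stack([b(x) for b in self.branches])  # K x B x d_out
        if self.training:
            k = len(self.branches)
            n = int(torch.randint(1, k + 1, (1,)))         # how many branches
            mask = torch.zeros(k)
            mask[torch.randperm(k)[:n]] = 1.0              # which branches
            mask = mask.to(outs.device)
            return (outs * mask.view(k, 1, 1)).sum(0) / n  # average sampled
        return outs.mean(0)  # deterministic average at inference

layer = StochasticBranchLinear(128, 64)
y = layer(torch.randn(32, 128))
```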

Learning for Single-Shot Confidence Calibration in Deep Neural Networks through Stochastic Inferences

no code implementations · CVPR 2019 · Seonguk Seo, Paul Hongsuck Seo, Bohyung Han

The proposed loss function enables us to learn deep neural networks that predict confidence calibrated scores using a single inference.

Attentive Semantic Alignment with Offset-Aware Correlation Kernels

no code implementations · ECCV 2018 · Paul Hongsuck Seo, Jongmin Lee, Deunsol Jung, Bohyung Han, Minsu Cho

Semantic correspondence is the problem of establishing correspondences across images depicting different instances of the same object or scene class.

Semantic correspondence · Translation
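Models for this task typically start from a dense correlation volume between the two images' feature maps; a generic sketch of that construction is below. The paper's offset-aware correlation kernels and attentive alignment operate on top of such a volume and are not shown:

```python
# Sketch: dense feature correlation between two images, the raw matching
# signal that semantic-correspondence models build on.
import torch
import torch.nn.functional as F

fa = F.normalize(torch.randn(1, 256, 16, 16), dim=1)  # features of image A
fb = F.normalize(torch.randn(1, 256, 16, 16), dim=1)  # features of image B

# corr[b, h, w, :] = similarities of A's position (h, w) to all B positions
corr = torch.einsum("bdhw,bdxy->bhwxy", fa, fb).flatten(3)  # 1x16x16x256

# naive readout: hard best match in B for each position in A
match = corr.flatten(1, 2).argmax(dim=-1)                   # 1 x 256
```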

Visual Reference Resolution using Attention Memory for Visual Dialog

no code implementations · NeurIPS 2017 · Paul Hongsuck Seo, Andreas Lehrmann, Bohyung Han, Leonid Sigal

From this memory, the model retrieves the previous attention that is most relevant to the current question, taking recency into account, in order to resolve potentially ambiguous references.

Ranked #13 on Visual Dialog on VisDial v0.9 val (R@1 metric)

Parameter Prediction · Question Answering · +3
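A toy sketch of the retrieval step as described: stored attention maps are combined by question relevance plus a recency term. The embeddings, recency weighting, and fusion here are simplified stand-ins for the paper's learned components:

```python
# Sketch: an attention memory for visual dialog. Past attention maps are
# stored with key embeddings of their questions; a new question retrieves
# a relevance-weighted combination of previous attentions.
import torch
import torch.nn.functional as F

keys  = torch.randn(5, 128)                              # 5 past questions
maps  = torch.softmax(torch.randn(5, 14 * 14), dim=-1)   # stored attentions
query = torch.randn(128)                                 # current question

recency = 0.1 * torch.arange(5.0)        # later dialog turns get a boost
weights = F.softmax(keys @ query + recency, dim=0)
retrieved = weights @ maps               # 14*14 attention over the image
```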

MarioQA: Answering Questions by Watching Gameplay Videos

no code implementations · ICCV 2017 · Jonghwan Mun, Paul Hongsuck Seo, Ilchae Jung, Bohyung Han

To address this objective, we automatically generate a customized synthetic VideoQA dataset using Super Mario Bros. gameplay videos so that it contains events with different levels of reasoning complexity.

Question Answering · Video Question Answering

Progressive Attention Networks for Visual Attribute Prediction

1 code implementation · 8 Jun 2016 · Paul Hongsuck Seo, Zhe Lin, Scott Cohen, Xiaohui Shen, Bohyung Han

We propose a novel attention model that can accurately attend to target objects of various scales and shapes in images.

Attribute · Hard Attention
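A minimal sketch of the progressive idea: a query-conditioned attention map reweights the feature map at each stage before it is passed to the next, letting later stages focus at finer granularity. Module sizes are illustrative:

```python
# Sketch: progressive attention over successive stages of a feature map.
import torch
import torch.nn as nn

class AttentionStage(nn.Module):
    def __init__(self, c, q_dim=128):
        super().__init__()
        self.score = nn.Conv2d(c + q_dim, 1, kernel_size=1)

    def forward(self, feat, query):
        # broadcast the query over space, score each location, reweight
        q = query[:, :, None, None].expand(-1, -1, *feat.shape[2:])
        attn = torch.sigmoid(self.score(torch.cat([feat, q], dim=1)))
        return feat * attn  # reweighted features for the next stage

feat  = torch.randn(2, 64, 32, 32)
query = torch.randn(2, 128)        # e.g. an attribute/query embedding
for stage in nn.ModuleList([AttentionStage(64) for _ in range(3)]):
    feat = stage(feat, query)
```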

Image Question Answering using Convolutional Neural Network with Dynamic Parameter Prediction

1 code implementation · CVPR 2016 · Hyeonwoo Noh, Paul Hongsuck Seo, Bohyung Han

We tackle image question answering (ImageQA) problem by learning a convolutional neural network (CNN) with a dynamic parameter layer whose weights are determined adaptively based on questions.

Image Retrieval with Multi-Modal Query · Parameter Prediction · +2
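A compact sketch of the dynamic parameter layer: a question embedding predicts the weights of a fully connected layer applied to image features. The paper additionally compresses the predicted parameters with a hashing trick, omitted here, and all sizes are illustrative:

```python
# Sketch: a fully connected layer whose weights are predicted per example
# from the question embedding (a small hypernetwork).
import torch
import torch.nn as nn

class DynamicFC(nn.Module):
    def __init__(self, q_dim=256, d_in=512, d_out=128):
        super().__init__()
        self.d_in, self.d_out = d_in, d_out
        self.predict = nn.Linear(q_dim, d_in * d_out)  # hypernetwork head

    def forward(self, img_feat, q_emb):
        # one weight matrix per example, applied to its image features
        W = self.predict(q_emb).view(-1, self.d_out, self.d_in)
        return torch.bmm(W, img_feat.unsqueeze(-1)).squeeze(-1)

layer = DynamicFC()
out = layer(torch.randn(4, 512), torch.randn(4, 256))  # 4 x 128 answers
```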
