Search Results for author: Joon Son Chung

Found 60 papers, 17 papers with code

Scaling Up Video Summarization Pretraining with Large Language Models

no code implementations • 4 Apr 2024 • Dawit Mureja Argaw, Seunghyun Yoon, Fabian Caba Heilbron, Hanieh Deilamsalehy, Trung Bui, Zhaowen Wang, Franck Dernoncourt, Joon Son Chung

Long-form video content constitutes a significant portion of internet traffic, making automated video summarization an essential research problem.

Video Alignment · Video Summarization

EquiAV: Leveraging Equivariance for Audio-Visual Contrastive Learning

no code implementations • 14 Mar 2024 • Jongsuk Kim, Hyeongkeun Lee, Kyeongha Rho, Junmo Kim, Joon Son Chung

Recent advancements in self-supervised audio-visual representation learning have demonstrated its potential to capture rich and comprehensive representations.

audio-visual learning · Contrastive Learning +2

FreGrad: Lightweight and Fast Frequency-aware Diffusion Vocoder

2 code implementations • 18 Jan 2024 • Tan Dat Nguyen, Ji-Hoon Kim, Youngjoon Jang, Jaehun Kim, Joon Son Chung

The goal of this paper is to generate realistic audio with a lightweight and fast diffusion-based vocoder, named FreGrad.

From Coarse to Fine: Efficient Training for Audio Spectrogram Transformers

no code implementations • 16 Jan 2024 • Jiu Feng, Mehmet Hamza Erol, Joon Son Chung, Arda Senocak

We introduce multi-phase training of audio spectrogram transformers by connecting the seminal idea of coarse-to-fine with transformer models.

Audio Classification

Can CLIP Help Sound Source Localization?

1 code implementation • 7 Nov 2023 • Sooyoung Park, Arda Senocak, Joon Son Chung

Large-scale pre-trained image-text models demonstrate remarkable versatility across diverse tasks, benefiting from their robust representational capabilities and effective multimodal alignment.

audio-visual learning · Contrastive Learning

Seeing Through the Conversation: Audio-Visual Speech Separation based on Diffusion Model

no code implementations • 30 Oct 2023 • Suyeon Lee, Chaeyoung Jung, Youngjoon Jang, Jaehun Kim, Joon Son Chung

For an effective fusion of the two modalities for diffusion, we also propose a cross-attention-based feature fusion mechanism.

Speech Separation
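As a rough illustration of the cross-attention-based fusion mentioned in this entry, the sketch below lets audio frames attend to visual (lip) features; the module name, dimensions, and residual layout are assumptions for illustration, not the paper's exact architecture.

```python
# Hypothetical cross-attention fusion block (illustrative only).
import torch
import torch.nn as nn


class CrossAttentionFusion(nn.Module):
    def __init__(self, dim: int = 256, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, audio_feats: torch.Tensor, visual_feats: torch.Tensor) -> torch.Tensor:
        # audio_feats:  (batch, T_audio, dim) act as queries
        # visual_feats: (batch, T_video, dim) act as keys and values
        context, _ = self.attn(query=audio_feats, key=visual_feats, value=visual_feats)
        return self.norm(audio_feats + context)  # residual fusion of the attended visual context


fusion = CrossAttentionFusion()
fused = fusion(torch.randn(2, 100, 256), torch.randn(2, 25, 256))
print(fused.shape)  # torch.Size([2, 100, 256])
```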

VoiceLDM: Text-to-Speech with Environmental Context

no code implementations • 24 Sep 2023 • Yeonghyeon Lee, Inmo Yeon, Juhan Nam, Joon Son Chung

This paper presents VoiceLDM, a model designed to produce audio that accurately follows two distinct natural language text prompts: the description prompt and the content prompt.

AudioCaps

SlowFast Network for Continuous Sign Language Recognition

no code implementations • 21 Sep 2023 • Junseok Ahn, Youngjoon Jang, Joon Son Chung

The objective of this work is the effective extraction of spatial and dynamic features for Continuous Sign Language Recognition (CSLR).

Sign Language Recognition
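The following toy two-pathway extractor illustrates the SlowFast idea of combining spatial detail from sparsely sampled frames with motion cues from densely sampled ones; the channel widths, kernel sizes, and sampling rates are illustrative assumptions rather than the paper's configuration.

```python
# Toy SlowFast-style two-pathway feature extractor (illustrative only).
import torch
import torch.nn as nn


class TwoPathway(nn.Module):
    def __init__(self):
        super().__init__()
        self.slow = nn.Conv3d(3, 64, kernel_size=(1, 7, 7), stride=(1, 2, 2), padding=(0, 3, 3))
        self.fast = nn.Conv3d(3, 8, kernel_size=(5, 7, 7), stride=(1, 2, 2), padding=(2, 3, 3))

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, 3, T, H, W) RGB video clip
        slow_feats = self.slow(frames[:, :, ::4])      # sparse temporal sampling: spatial detail
        fast_feats = self.fast(frames)                 # dense temporal sampling: dynamics
        slow_vec = slow_feats.mean(dim=(2, 3, 4))      # global average pooling per pathway
        fast_vec = fast_feats.mean(dim=(2, 3, 4))
        return torch.cat([slow_vec, fast_vec], dim=1)  # fused clip-level feature


feats = TwoPathway()(torch.randn(2, 3, 32, 112, 112))
print(feats.shape)  # torch.Size([2, 72])
```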

TalkNCE: Improving Active Speaker Detection with Talk-Aware Contrastive Learning

no code implementations • 21 Sep 2023 • Chaeyoung Jung, Suyeon Lee, Kihyun Nam, Kyeongha Rho, You Jin Kim, Youngjoon Jang, Joon Son Chung

The goal of this work is Active Speaker Detection (ASD), a task to determine whether a person is speaking or not in a series of video frames.

Contrastive Learning

Sound Source Localization is All about Cross-Modal Alignment

no code implementations • ICCV 2023 • Arda Senocak, Hyeonggon Ryu, Junsik Kim, Tae-Hyun Oh, Hanspeter Pfister, Joon Son Chung

However, prior arts and existing benchmarks do not account for a more important aspect of the problem, cross-modal semantic understanding, which is essential for genuine sound source localization.

Cross-Modal Retrieval · Retrieval

Let There Be Sound: Reconstructing High Quality Speech from Silent Videos

no code implementations • 29 Aug 2023 • Ji-Hoon Kim, Jaehun Kim, Joon Son Chung

In this paper, we propose a novel lip-to-speech system that significantly improves the generation quality by alleviating the one-to-many mapping problem from multiple perspectives.

FlexiAST: Flexibility is What AST Needs

no code implementations • 18 Jul 2023 • Jiu Feng, Mehmet Hamza Erol, Joon Son Chung, Arda Senocak

To overcome this limitation, this paper proposes a training procedure to provide flexibility to standard AST models without architectural changes, allowing them to work with various patch sizes at the inference stage - FlexiAST.

Audio Classification
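One generic way to run a patch-based audio transformer with a patch size it was not trained for is to resize the patch-embedding kernel, as in the FlexiViT line of work; the sketch below shows only that resizing trick with assumed shapes, not the training procedure proposed in the paper.

```python
# Resizing a patch-embedding kernel so a spectrogram transformer can consume a
# different patch size at inference time (generic sketch, assumed shapes).
import torch
import torch.nn.functional as F

train_patch, infer_patch, dim = 16, 8, 768
weight = torch.randn(dim, 1, train_patch, train_patch)   # trained conv patch-embedding kernel

# Naive bilinear resize of the kernel to the new patch size.
resized = F.interpolate(weight, size=(infer_patch, infer_patch),
                        mode="bilinear", align_corners=False)

spec = torch.randn(1, 1, 128, 1024)                       # (batch, 1, mel bins, frames)
tokens = F.conv2d(spec, resized, stride=infer_patch)      # patchify with the new size
tokens = tokens.flatten(2).transpose(1, 2)                # (batch, num_patches, dim) for the transformer
print(tokens.shape)
```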

Hindi as a Second Language: Improving Visually Grounded Speech with Semantically Similar Samples

no code implementations • 30 Mar 2023 • Hyeonggon Ryu, Arda Senocak, In So Kweon, Joon Son Chung

The objective of this work is to explore the learning of visually grounded speech (VGS) models from a multilingual perspective.

Cross-Modal Retrieval · Retrieval

Self-Sufficient Framework for Continuous Sign Language Recognition

no code implementations • 21 Mar 2023 • Youngjoon Jang, Youngtaek Oh, Jae Won Cho, Myungchul Kim, Dong-Jin Kim, In So Kweon, Joon Son Chung

The goal of this work is to develop a self-sufficient framework for Continuous Sign Language Recognition (CSLR) that addresses key issues of sign language recognition.

Pseudo Label · Sign Language Recognition

VoxSRC 2022: The Fourth VoxCeleb Speaker Recognition Challenge

1 code implementation • 20 Feb 2023 • Jaesung Huh, Andrew Brown, Jee-weon Jung, Joon Son Chung, Arsha Nagrani, Daniel Garcia-Romero, Andrew Zisserman

This paper summarises the findings from the VoxCeleb Speaker Recognition Challenge 2022 (VoxSRC-22), which was held in conjunction with INTERSPEECH 2022.

Speaker Diarization · Speaker Recognition +1

MarginNCE: Robust Sound Localization with a Negative Margin

no code implementations • 3 Nov 2022 • Sooyoung Park, Arda Senocak, Joon Son Chung

Furthermore, we demonstrate that the introduction of a negative margin to existing methods results in a consistent improvement in performance.

Contrastive Learning
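A minimal sketch of an InfoNCE-style objective with a negative margin applied to the positive pairs is shown below; the margin value, temperature, and similarity layout are assumptions for illustration rather than the paper's exact formulation.

```python
# InfoNCE with a negative margin on the positive similarities (illustrative sketch).
import torch
import torch.nn.functional as F


def margin_nce_loss(audio_emb, visual_emb, margin: float = -0.2, temperature: float = 0.07):
    # audio_emb, visual_emb: (batch, dim) embeddings of paired audio/visual clips
    audio_emb = F.normalize(audio_emb, dim=-1)
    visual_emb = F.normalize(visual_emb, dim=-1)
    logits = audio_emb @ visual_emb.t()                   # (batch, batch) cosine similarities
    # Apply the (negative) margin to the diagonal, i.e. only to the true pairs,
    # so the loss is more tolerant of imperfect positive correspondences.
    logits = logits + margin * torch.eye(logits.size(0), device=logits.device)
    labels = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits / temperature, labels)


print(margin_nce_loss(torch.randn(8, 512), torch.randn(8, 512)).item())
```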

Signing Outside the Studio: Benchmarking Background Robustness for Continuous Sign Language Recognition

1 code implementation • 1 Nov 2022 • Youngjoon Jang, Youngtaek Oh, Jae Won Cho, Dong-Jin Kim, Joon Son Chung, In So Kweon

Most existing Continuous Sign Language Recognition (CSLR) benchmarks have fixed backgrounds and are filmed in studios with a static monochromatic background.

Benchmarking · Disentanglement +1

Metric Learning for User-defined Keyword Spotting

no code implementations • 1 Nov 2022 • Jaemin Jung, Youkyum Kim, Jihwan Park, Youshin Lim, Byeong-Yeol Kim, Youngjoon Jang, Joon Son Chung

In particular, we make the following contributions: (1) we construct a large-scale keyword dataset from an existing speech corpus and propose a filtering method to remove data that degrade model training; (2) we propose a metric learning-based two-stage training strategy, and demonstrate that the proposed method improves performance on the user-defined keyword spotting task by enriching keyword representations; (3) to facilitate fair comparison in the user-defined KWS field, we propose a unified evaluation protocol and metrics.

Keyword Spotting · Metric Learning
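As a generic illustration of metric learning for user-defined keywords (not the paper's specific two-stage strategy), the sketch below scores query embeddings against class prototypes built from a few enrolled examples per keyword; all names and sizes are assumptions.

```python
# Prototype-based metric-learning loss for keyword embeddings (generic sketch).
import torch
import torch.nn.functional as F


def prototypical_loss(support, query, query_labels, temperature: float = 0.1):
    # support: (num_keywords, shots, dim) embeddings of enrolled keyword examples
    # query:   (num_queries, dim) embeddings to classify
    # query_labels: (num_queries,) ground-truth keyword indices
    prototypes = F.normalize(support.mean(dim=1), dim=-1)  # one prototype per keyword
    query = F.normalize(query, dim=-1)
    logits = query @ prototypes.t() / temperature          # cosine similarity to each prototype
    return F.cross_entropy(logits, query_labels)


support = torch.randn(10, 5, 256)                          # 10 keywords, 5 examples each
query = torch.randn(32, 256)
labels = torch.randint(0, 10, (32,))
print(prototypical_loss(support, query, labels).item())
```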

In search of strong embedding extractors for speaker diarisation

no code implementations • 26 Oct 2022 • Jee-weon Jung, Hee-Soo Heo, Bong-Jin Lee, Jaesung Huh, Andrew Brown, Youngki Kwon, Shinji Watanabe, Joon Son Chung

First, the evaluation is not straightforward because the features required for better performance differ between speaker verification and diarisation.

Data Augmentation · Speaker Verification

Pushing the limits of raw waveform speaker recognition

2 code implementations • 16 Mar 2022 • Jee-weon Jung, You Jin Kim, Hee-Soo Heo, Bong-Jin Lee, Youngki Kwon, Joon Son Chung

Our best model achieves an equal error rate of 0.89%, which is competitive with the state-of-the-art models based on handcrafted features, and outperforms the best model based on raw waveform inputs by a large margin.

Self-Supervised Learning · Speaker Recognition +1
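For reference, an equal error rate such as the 0.89% quoted above is computed from verification trial scores as the operating point where false acceptances and false rejections are equally likely; the snippet below is a standard, paper-agnostic way to estimate it.

```python
# Estimating the equal error rate (EER) from verification scores (standard definition).
import numpy as np


def equal_error_rate(scores: np.ndarray, labels: np.ndarray) -> float:
    # scores: similarity score per trial; labels: 1 = same speaker, 0 = different speakers
    thresholds = np.sort(np.unique(scores))
    fars, frrs = [], []
    for t in thresholds:
        accept = scores >= t
        fars.append(np.mean(accept[labels == 0]))        # false acceptance rate
        frrs.append(np.mean(~accept[labels == 1]))       # false rejection rate
    fars, frrs = np.array(fars), np.array(frrs)
    idx = np.argmin(np.abs(fars - frrs))                 # threshold where FAR ~= FRR
    return float((fars[idx] + frrs[idx]) / 2)


rng = np.random.default_rng(0)
labels = rng.integers(0, 2, 1000)
scores = labels + rng.normal(0, 0.5, 1000)               # toy scores: positives score higher
print(f"EER: {equal_error_rate(scores, labels):.2%}")
```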

Spell my name: keyword boosted speech recognition

no code implementations • 6 Oct 2021 • Namkyu Jung, Geonmin Kim, Joon Son Chung

Recognition of uncommon words such as names and technical terminology is important to understanding conversations in context.

Automatic Speech Recognition · Automatic Speech Recognition (ASR) +3

Look Who's Talking: Active Speaker Detection in the Wild

1 code implementation • 17 Aug 2021 • You Jin Kim, Hee-Soo Heo, Soyeon Choe, Soo-Whan Chung, Yoohwan Kwon, Bong-Jin Lee, Youngki Kwon, Joon Son Chung

Face tracks are extracted from the videos and active segments are annotated based on the timestamps of VoxConverse in a semi-automatic way.

Graph Attention Networks for Speaker Verification

no code implementations • 22 Oct 2020 • Jee-weon Jung, Hee-Soo Heo, Ha-Jin Yu, Joon Son Chung

The proposed framework inputs segment-wise speaker embeddings from an enrollment and a test utterance and directly outputs a similarity score.

Graph Attention · Speaker Verification
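The interface described above, segment-wise embeddings in and a single similarity score out, is sketched below using self-attention over a fully connected graph of enrollment and test segments; the layer sizes and scoring head are assumptions, and this is a simplification rather than the paper's graph attention network.

```python
# Attention over enrollment and test segment embeddings, pooled to one score (sketch).
import torch
import torch.nn as nn


class GraphAttentionScorer(nn.Module):
    def __init__(self, dim: int = 192, heads: int = 4):
        super().__init__()
        # Self-attention over all nodes, i.e. attention on a fully connected graph.
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.score = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 1))

    def forward(self, enroll_segs: torch.Tensor, test_segs: torch.Tensor) -> torch.Tensor:
        # enroll_segs: (batch, n_enroll, dim), test_segs: (batch, n_test, dim)
        nodes = torch.cat([enroll_segs, test_segs], dim=1)
        nodes, _ = self.attn(nodes, nodes, nodes)
        return self.score(nodes.mean(dim=1)).squeeze(-1)   # one similarity score per trial


scorer = GraphAttentionScorer()
print(scorer(torch.randn(4, 10, 192), torch.randn(4, 6, 192)).shape)  # torch.Size([4])
```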

Augmentation adversarial training for self-supervised speaker recognition

no code implementations • 23 Jul 2020 • Jaesung Huh, Hee Soo Heo, Jingu Kang, Shinji Watanabe, Joon Son Chung

Since the augmentation simulates the acoustic characteristics, training the network to be invariant to augmentation also encourages the network to be invariant to the channel information in general.

Contrastive Learning · Speaker Recognition

BSL-1K: Scaling up co-articulated sign language recognition using mouthing cues

1 code implementation • ECCV 2020 • Samuel Albanie, Gül Varol, Liliane Momeni, Triantafyllos Afouras, Joon Son Chung, Neil Fox, Andrew Zisserman

Recent progress in fine-grained gesture and action classification, and in machine translation, points to the possibility of automated sign language recognition becoming a reality.

Action Classification · Keyword Spotting +2

Spot the conversation: speaker diarisation in the wild

no code implementations • 2 Jul 2020 • Joon Son Chung, Jaesung Huh, Arsha Nagrani, Triantafyllos Afouras, Andrew Zisserman

Finally, we use this pipeline to create a large-scale diarisation dataset called VoxConverse, collected from 'in the wild' videos, which we will release publicly to the research community.

Speaker Verification

FaceFilter: Audio-visual speech separation using still images

no code implementations • 14 May 2020 • Soo-Whan Chung, Soyeon Choe, Joon Son Chung, Hong-Goo Kang

The objective of this paper is to separate a target speaker's speech from a mixture of two speakers using a deep audio-visual speech separation network.

Speech Separation

Disentangled Speech Embeddings using Cross-modal Self-supervision

no code implementations • 20 Feb 2020 • Arsha Nagrani, Joon Son Chung, Samuel Albanie, Andrew Zisserman

The objective of this paper is to learn representations of speaker identity without access to manually annotated data.

Self-Supervised Learning · Speaker Recognition

VoxSRC 2019: The first VoxCeleb Speaker Recognition Challenge

no code implementations • 5 Dec 2019 • Joon Son Chung, Arsha Nagrani, Ernesto Coto, Weidi Xie, Mitchell McLaren, Douglas A. Reynolds, Andrew Zisserman

The VoxCeleb Speaker Recognition Challenge 2019 aimed to assess how well current speaker recognition technology is able to identify speakers in unconstrained or 'in the wild' data.

Speaker Recognition

ASR is all you need: cross-modal distillation for lip reading

no code implementations • 28 Nov 2019 • Triantafyllos Afouras, Joon Son Chung, Andrew Zisserman

The goal of this work is to train strong models for visual speech recognition without requiring human annotated ground truth data.

Ranked #14 on Lipreading on LRS3-TED (using extra training data)

Automatic Speech Recognition · Automatic Speech Recognition (ASR) +4
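A bare-bones view of cross-modal distillation in this spirit is to let an ASR teacher's frame-level posteriors supervise the lip-reading student with a KL divergence; the snippet assumes generic logits and omits the details of the paper's actual training objective.

```python
# Frame-level KL distillation from an ASR teacher to a lip-reading student (generic sketch).
import torch
import torch.nn.functional as F

teacher_logits = torch.randn(4, 100, 40)                      # (batch, frames, units) from an ASR model
student_logits = torch.randn(4, 100, 40, requires_grad=True)  # from the lip-reading model

loss = F.kl_div(
    F.log_softmax(student_logits, dim=-1),   # student log-probabilities
    F.softmax(teacher_logits, dim=-1),       # teacher probabilities as soft targets
    reduction="batchmean",
)
loss.backward()
print(loss.item())
```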

Delving into VoxCeleb: environment invariant speaker recognition

1 code implementation • 24 Oct 2019 • Joon Son Chung, Jaesung Huh, Seongkyu Mun

Research in speaker recognition has recently seen significant progress due to the application of neural network models and the availability of new large-scale datasets.

Speaker Identification · Speaker Recognition

My lips are concealed: Audio-visual speech enhancement through obstructions

no code implementations • 11 Jul 2019 • Triantafyllos Afouras, Joon Son Chung, Andrew Zisserman

To this end, we introduce a deep audio-visual speech enhancement network that is able to separate a speaker's voice by conditioning on the speaker's lip movements and/or a representation of their voice.

Speech Enhancement

Who said that?: Audio-visual speaker diarisation of real-world meetings

no code implementations • 24 Jun 2019 • Joon Son Chung, Bong-Jin Lee, Icksang Han

The goal of this work is to determine 'who spoke when' in real-world meetings.

Utterance-level Aggregation For Speaker Recognition In The Wild

10 code implementations • 26 Feb 2019 • Weidi Xie, Arsha Nagrani, Joon Son Chung, Andrew Zisserman

The objective of this paper is speaker recognition "in the wild", where utterances may be of variable length and may also contain irrelevant signals.

Speaker Recognition · Text-Independent Speaker Verification

Perfect match: Improved cross-modal embeddings for audio-visual synchronisation

no code implementations • 21 Sep 2018 • Soo-Whan Chung, Joon Son Chung, Hong-Goo Kang

This paper proposes a new strategy for learning powerful cross-modal embeddings for audio-to-video synchronization.

Binary Classification · Cross-Modal Retrieval +4

Deep Lip Reading: a comparison of models and an online application

no code implementations • 15 Jun 2018 • Triantafyllos Afouras, Joon Son Chung, Andrew Zisserman

The goal of this paper is to develop state-of-the-art models for lip reading -- visual speech recognition.

Language Modelling · Lip Reading +2

VoxCeleb2: Deep Speaker Recognition

2 code implementations • 14 Jun 2018 • Joon Son Chung, Arsha Nagrani, Andrew Zisserman

The objective of this paper is speaker recognition under noisy and unconstrained conditions.

 Ranked #1 on Speaker Verification on VoxCeleb2 (using extra training data)

Speaker Recognition · Speaker Verification

The Conversation: Deep Audio-Visual Speech Enhancement

no code implementations • 11 Apr 2018 • Triantafyllos Afouras, Joon Son Chung, Andrew Zisserman

Our goal is to isolate individual speakers from multi-talker simultaneous speech in videos.

Speech Enhancement

VoxCeleb: a large-scale speaker identification dataset

8 code implementations • Interspeech 2018 • Arsha Nagrani, Joon Son Chung, Andrew Zisserman

Our second contribution is to apply and compare various state-of-the-art speaker identification techniques on our dataset to establish baseline performance.

Sound

You said that?

1 code implementation • 8 May 2017 • Joon Son Chung, Amir Jamaludin, Andrew Zisserman

To achieve this we propose an encoder-decoder CNN model that uses a joint embedding of the face and audio to generate synthesised talking face video frames.

Unconstrained Lip-synchronization

Lip Reading Sentences in the Wild

1 code implementation • CVPR 2017 • Joon Son Chung, Andrew Senior, Oriol Vinyals, Andrew Zisserman

The goal of this work is to recognise phrases and sentences being spoken by a talking face, with or without the audio.

Ranked #4 on Lipreading on GRID corpus (mixed-speech) (using extra training data)

Lipreading · Lip Reading +2

Signs in time: Encoding human motion as a temporal image

no code implementations • 6 Aug 2016 • Joon Son Chung, Andrew Zisserman

The goal of this work is to recognise and localise short temporal signals in image time series, where strong supervision is not available for training.

Time Series · Time Series Analysis
