Search Results for author: Joon Son Chung

Found 60 papers, 17 papers with code

Towards Automated Movie Trailer Generation

no code implementations • 4 Apr 2024 • Dawit Mureja Argaw, Mattia Soldan, Alejandro Pardo, Chen Zhao, Fabian Caba Heilbron, Joon Son Chung, Bernard Ghanem

Movie trailers are an essential tool for promoting films and attracting audiences.

Machine Translation

Paper
Add Code

Scaling Up Video Summarization Pretraining with Large Language Models

no code implementations • 4 Apr 2024 • Dawit Mureja Argaw, Seunghyun Yoon, Fabian Caba Heilbron, Hanieh Deilamsalehy, Trung Bui, Zhaowen Wang, Franck Dernoncourt, Joon Son Chung

Long-form video content constitutes a significant portion of internet traffic, making automated video summarization an essential research problem.

Video Alignment Video Summarization

Paper
Add Code

EquiAV: Leveraging Equivariance for Audio-Visual Contrastive Learning

no code implementations • 14 Mar 2024 • Jongsuk Kim, Hyeongkeun Lee, Kyeongha Rho, Junmo Kim, Joon Son Chung

Recent advancements in self-supervised audio-visual representation learning have demonstrated its potential to capture rich and comprehensive representations.

audio-visual learning Contrastive Learning +2

Paper
Add Code

FreGrad: Lightweight and Fast Frequency-aware Diffusion Vocoder

2 code implementations • 18 Jan 2024 • Tan Dat Nguyen, Ji-Hoon Kim, Youngjoon Jang, Jaehun Kim, Joon Son Chung

The goal of this paper is to generate realistic audio with a lightweight and fast diffusion-based vocoder, named FreGrad.

Paper
Code

From Coarse to Fine: Efficient Training for Audio Spectrogram Transformers

no code implementations • 16 Jan 2024 • Jiu Feng, Mehmet Hamza Erol, Joon Son Chung, Arda Senocak

We introduce multi-phase training of audio spectrogram transformers by connecting the seminal idea of coarse-to-fine with transformer models.

Audio Classification

Paper
Add Code

Can CLIP Help Sound Source Localization?

1 code implementation • 7 Nov 2023 • Sooyoung Park, Arda Senocak, Joon Son Chung

Large-scale pre-trained image-text models demonstrate remarkable versatility across diverse tasks, benefiting from their robust representational capabilities and effective multimodal alignment.

audio-visual learning Contrastive Learning

Paper
Code

Seeing Through the Conversation: Audio-Visual Speech Separation based on Diffusion Model

no code implementations • 30 Oct 2023 • Suyeon Lee, Chaeyoung Jung, Youngjoon Jang, Jaehun Kim, Joon Son Chung

For an effective fusion of the two modalities for diffusion, we also propose a cross-attention-based feature fusion mechanism.

Speech Separation

Paper
Add Code

Rethinking Session Variability: Leveraging Session Embeddings for Session Robustness in Speaker Verification

no code implementations • 26 Sep 2023 • Hee-Soo Heo, Kihyun Nam, Bong-Jin Lee, Youngki Kwon, Minjae Lee, You Jin Kim, Joon Son Chung

In the field of speaker verification, session or channel variability poses a significant challenge.

Speaker Verification

Paper
Add Code

VoiceLDM: Text-to-Speech with Environmental Context

no code implementations • 24 Sep 2023 • Yeonghyeon Lee, Inmo Yeon, Juhan Nam, Joon Son Chung

This paper presents VoiceLDM, a model designed to produce audio that accurately follows two distinct natural language text prompts: the description prompt and the content prompt.

AudioCaps

Paper
Add Code

SlowFast Network for Continuous Sign Language Recognition

no code implementations • 21 Sep 2023 • Junseok Ahn, Youngjoon Jang, Joon Son Chung

The objective of this work is the effective extraction of spatial and dynamic features for Continuous Sign Language Recognition (CSLR).

Sign Language Recognition

Paper
Add Code

TalkNCE: Improving Active Speaker Detection with Talk-Aware Contrastive Learning

no code implementations • 21 Sep 2023 • Chaeyoung Jung, Suyeon Lee, Kihyun Nam, Kyeongha Rho, You Jin Kim, Youngjoon Jang, Joon Son Chung

The goal of this work is Active Speaker Detection (ASD), a task to determine whether a person is speaking or not in a series of video frames.

Contrastive Learning

Paper
Add Code

Sound Source Localization is All about Cross-Modal Alignment

no code implementations • ICCV 2023 • Arda Senocak, Hyeonggon Ryu, Junsik Kim, Tae-Hyun Oh, Hanspeter Pfister, Joon Son Chung

However, prior arts and existing benchmarks do not account for a more important aspect of the problem, cross-modal semantic understanding, which is essential for genuine sound source localization.

Cross-Modal Retrieval Retrieval

Paper
Add Code

Let There Be Sound: Reconstructing High Quality Speech from Silent Videos

no code implementations • 29 Aug 2023 • Ji-Hoon Kim, Jaehun Kim, Joon Son Chung

In this paper, we propose a novel lip-to-speech system that significantly improves the generation quality by alleviating the one-to-many mapping problem from multiple perspectives.

Paper
Add Code

FlexiAST: Flexibility is What AST Needs

no code implementations • 18 Jul 2023 • Jiu Feng, Mehmet Hamza Erol, Joon Son Chung, Arda Senocak

To overcome this limitation, this paper proposes a training procedure to provide flexibility to standard AST models without architectural changes, allowing them to work with various patch sizes at the inference stage - FlexiAST.

Audio Classification

Paper
Add Code

That's What I Said: Fully-Controllable Talking Face Generation

no code implementations • 6 Apr 2023 • Youngjoon Jang, Kyeongha Rho, Jong-Bin Woo, Hyeongkeun Lee, Jihwan Park, Youshin Lim, Byeong-Yeol Kim, Joon Son Chung

The goal of this paper is to synthesise talking faces with controllable facial motions.

Navigate Talking Face Generation

Paper
Add Code

Hindi as a Second Language: Improving Visually Grounded Speech with Semantically Similar Samples

no code implementations • 30 Mar 2023 • Hyeonggon Ryu, Arda Senocak, In So Kweon, Joon Son Chung

The objective of this work is to explore the learning of visually grounded speech models (VGS) from multilingual perspective.

Cross-Modal Retrieval Retrieval

Paper
Add Code

Self-Sufficient Framework for Continuous Sign Language Recognition

no code implementations • 21 Mar 2023 • Youngjoon Jang, Youngtaek Oh, Jae Won Cho, Myungchul Kim, Dong-Jin Kim, In So Kweon, Joon Son Chung

The goal of this work is to develop self-sufficient framework for Continuous Sign Language Recognition (CSLR) that addresses key issues of sign language recognition.

Pseudo Label Sign Language Recognition

Paper
Add Code

Imaginary Voice: Face-styled Diffusion Model for Text-to-Speech

1 code implementation • 27 Feb 2023 • Jiyoung Lee, Joon Son Chung, Soo-Whan Chung

This is the first time that face images are used as a condition to train a TTS model.

Speech Synthesis Text-To-Speech Synthesis

Paper
Code

VoxSRC 2022: The Fourth VoxCeleb Speaker Recognition Challenge

1 code implementation • 20 Feb 2023 • Jaesung Huh, Andrew Brown, Jee-weon Jung, Joon Son Chung, Arsha Nagrani, Daniel Garcia-Romero, Andrew Zisserman

This paper summarises the findings from the VoxCeleb Speaker Recognition Challenge 2022 (VoxSRC-22), which was held in conjunction with INTERSPEECH 2022.

Speaker Diarization Speaker Recognition +1

Paper
Code

MarginNCE: Robust Sound Localization with a Negative Margin

no code implementations • 3 Nov 2022 • Sooyoung Park, Arda Senocak, Joon Son Chung

Furthermore, we demonstrate that the introduction of a negative margin to existing methods results in a consistent improvement in performance.

Contrastive Learning

Paper
Add Code

Signing Outside the Studio: Benchmarking Background Robustness for Continuous Sign Language Recognition

1 code implementation • 1 Nov 2022 • Youngjoon Jang, Youngtaek Oh, Jae Won Cho, Dong-Jin Kim, Joon Son Chung, In So Kweon

Most existing Continuous Sign Language Recognition (CSLR) benchmarks have fixed backgrounds and are filmed in studios with a static monochromatic background.

Benchmarking Disentanglement +1

Paper
Code

Metric Learning for User-defined Keyword Spotting

no code implementations • 1 Nov 2022 • Jaemin Jung, Youkyum Kim, Jihwan Park, Youshin Lim, Byeong-Yeol Kim, Youngjoon Jang, Joon Son Chung

In particular, we make the following contributions: (1) we construct a large-scale keyword dataset with an existing speech corpus and propose a filtering method to remove data that degrade model training; (2) we propose a metric learning-based two-stage training strategy, and demonstrate that the proposed method improves the performance on the user-defined keyword spotting task by enriching their representations; (3) to facilitate the fair comparison in the user-defined KWS field, we propose unified evaluation protocol and metrics.

Keyword Spotting Metric Learning

Paper
Add Code

Disentangled representation learning for multilingual speaker recognition

no code implementations • 1 Nov 2022 • Kihyun Nam, Youkyum Kim, Jaesung Huh, Hee Soo Heo, Jee-weon Jung, Joon Son Chung

The goal of this paper is to learn robust speaker representation for bilingual speaking scenario.

Disentanglement Metric Learning +1

Paper
Add Code

In search of strong embedding extractors for speaker diarisation

no code implementations • 26 Oct 2022 • Jee-weon Jung, Hee-Soo Heo, Bong-Jin Lee, Jaesung Huh, Andrew Brown, Youngki Kwon, Shinji Watanabe, Joon Son Chung

First, the evaluation is not straightforward because the features required for better performance differ between speaker verification and diarisation.

Data Augmentation Speaker Verification

Paper
Add Code

Large-scale learning of generalised representations for speaker recognition

no code implementations • 20 Oct 2022 • Jee-weon Jung, Hee-Soo Heo, Bong-Jin Lee, Jaesong Lee, Hye-jin Shim, Youngki Kwon, Joon Son Chung, Shinji Watanabe

We also show that training with proposed large data configurations gives better performance.

Inductive Bias Speaker Recognition

Paper
Add Code

Curriculum learning for self-supervised speaker verification

no code implementations • 28 Mar 2022 • Hee-Soo Heo, Jee-weon Jung, Jingu Kang, Youngki Kwon, You Jin Kim, Bong-Jin Lee, Joon Son Chung

The goal of this paper is to train effective self-supervised speaker representations without identity labels.

Self-Supervised Learning Speaker Recognition +1

Paper
Add Code

Pushing the limits of raw waveform speaker recognition

2 code implementations • 16 Mar 2022 • Jee-weon Jung, You Jin Kim, Hee-Soo Heo, Bong-Jin Lee, Youngki Kwon, Joon Son Chung

Our best model achieves an equal error rate of 0. 89%, which is competitive with the state-of-the-art models based on handcrafted features, and outperforms the best model based on raw waveform inputs by a large margin.

Self-Supervised Learning Speaker Recognition +1

967

Paper
Code

Multi-scale speaker embedding-based graph attention networks for speaker diarisation

no code implementations • 7 Oct 2021 • Youngki Kwon, Hee-Soo Heo, Jee-weon Jung, You Jin Kim, Bong-Jin Lee, Joon Son Chung

The objective of this work is effective speaker diarisation using multi-scale speaker embeddings.

Graph Attention

Paper
Add Code

Advancing the dimensionality reduction of speaker embeddings for speaker diarisation: disentangling noise and informing speech activity

no code implementations • 7 Oct 2021 • You Jin Kim, Hee-Soo Heo, Jee-weon Jung, Youngki Kwon, Bong-Jin Lee, Joon Son Chung

The objective of this work is to train noise-robust speaker embeddings adapted for speaker diarisation.

Dimensionality Reduction

Paper
Add Code

Spell my name: keyword boosted speech recognition

no code implementations • 6 Oct 2021 • Namkyu Jung, Geonmin Kim, Joon Son Chung

Recognition of uncommon words such as names and technical terminology is important to understanding conversations in context.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +3

Paper
Add Code

AASIST: Audio Anti-Spoofing using Integrated Spectro-Temporal Graph Attention Networks

1 code implementation • 4 Oct 2021 • Jee-weon Jung, Hee-Soo Heo, Hemlata Tak, Hye-jin Shim, Joon Son Chung, Bong-Jin Lee, Ha-Jin Yu, Nicholas Evans

Artefacts that differentiate spoofed from bona-fide utterances can reside in spectral or temporal domains.

Ranked #1 on Voice Anti-spoofing on ASVspoof 2019 - LA

Graph Attention Voice Anti-spoofing

116

Paper
Code

Look Who's Talking: Active Speaker Detection in the Wild

1 code implementation • 17 Aug 2021 • You Jin Kim, Hee-Soo Heo, Soyeon Choe, Soo-Whan Chung, Yoohwan Kwon, Bong-Jin Lee, Youngki Kwon, Joon Son Chung

Face tracks are extracted from the videos and active segments are annotated based on the timestamps of VoxConverse in a semi-automatic way.

Paper
Code

Three-class Overlapped Speech Detection using a Convolutional Recurrent Neural Network

no code implementations • 7 Apr 2021 • Jee-weon Jung, Hee-Soo Heo, Youngki Kwon, Joon Son Chung, Bong-Jin Lee

In this work, we propose an overlapped speech detection system trained as a three-class classifier.

Binary Classification speaker-diarization +1

Paper
Add Code

Adapting Speaker Embeddings for Speaker Diarisation

no code implementations • 7 Apr 2021 • Youngki Kwon, Jee-weon Jung, Hee-Soo Heo, You Jin Kim, Bong-Jin Lee, Joon Son Chung

The goal of this paper is to adapt speaker embeddings for solving the problem of speaker diarisation.

Clustering Dimensionality Reduction +1

Paper
Add Code

VoxSRC 2020: The Second VoxCeleb Speaker Recognition Challenge

no code implementations • 12 Dec 2020 • Arsha Nagrani, Joon Son Chung, Jaesung Huh, Andrew Brown, Ernesto Coto, Weidi Xie, Mitchell McLaren, Douglas A Reynolds, Andrew Zisserman

We held the second installment of the VoxCeleb Speaker Recognition Challenge in conjunction with Interspeech 2020.

Speaker Recognition

Paper
Add Code

Graph Attention Networks for Speaker Verification

no code implementations • 22 Oct 2020 • Jee-weon Jung, Hee-Soo Heo, Ha-Jin Yu, Joon Son Chung

The proposed framework inputs segment-wise speaker embeddings from an enrollment and a test utterance and directly outputs a similarity score.

Graph Attention Speaker Verification

Paper
Add Code

Self-Supervised Learning of Audio-Visual Objects from Video

1 code implementation • ECCV 2020 • Triantafyllos Afouras, Andrew Owens, Joon Son Chung, Andrew Zisserman

Our objective is to transform a video into a set of discrete audio-visual objects using self-supervised learning.

Face Detection Optical Flow Estimation +1

110

Paper
Code

Augmentation adversarial training for self-supervised speaker recognition

no code implementations • 23 Jul 2020 • Jaesung Huh, Hee Soo Heo, Jingu Kang, Shinji Watanabe, Joon Son Chung

Since the augmentation simulates the acoustic characteristics, training the network to be invariant to augmentation also encourages the network to be invariant to the channel information in general.

Contrastive Learning Speaker Recognition

Paper
Add Code

BSL-1K: Scaling up co-articulated sign language recognition using mouthing cues

1 code implementation • ECCV 2020 • Samuel Albanie, Gül Varol, Liliane Momeni, Triantafyllos Afouras, Joon Son Chung, Neil Fox, Andrew Zisserman

Recent progress in fine-grained gesture and action classification, and machine translation, point to the possibility of automated sign language recognition becoming a reality.

Ranked #4 on Sign Language Recognition on WLASL-2000

Action Classification Keyword Spotting +2

Paper
Code

Spot the conversation: speaker diarisation in the wild

no code implementations • 2 Jul 2020 • Joon Son Chung, Jaesung Huh, Arsha Nagrani, Triantafyllos Afouras, Andrew Zisserman

Finally, we use this pipeline to create a large-scale diarisation dataset called VoxConverse, collected from 'in the wild' videos, which we will release publicly to the research community.

Speaker Verification

Paper
Add Code

FaceFilter: Audio-visual speech separation using still images

no code implementations • 14 May 2020 • Soo-Whan Chung, Soyeon Choe, Joon Son Chung, Hong-Goo Kang

The objective of this paper is to separate a target speaker's speech from a mixture of two speakers using a deep audio-visual speech separation network.

Speech Separation

Paper
Add Code

Seeing voices and hearing voices: learning discriminative embeddings using cross-modal self-supervision

no code implementations • 29 Apr 2020 • Soo-Whan Chung, Hong Goo Kang, Joon Son Chung

We build on earlier work to train embeddings that are more discriminative for uni-modal downstream tasks.

Lip Reading Self-Supervised Learning +1

Paper
Add Code

Disentangled Speech Embeddings using Cross-modal Self-supervision

no code implementations • 20 Feb 2020 • Arsha Nagrani, Joon Son Chung, Samuel Albanie, Andrew Zisserman

The objective of this paper is to learn representations of speaker identity without access to manually annotated data.

Self-Supervised Learning Speaker Recognition

Paper
Add Code

VoxSRC 2019: The first VoxCeleb Speaker Recognition Challenge

no code implementations • 5 Dec 2019 • Joon Son Chung, Arsha Nagrani, Ernesto Coto, Weidi Xie, Mitchell McLaren, Douglas A. Reynolds, Andrew Zisserman

The VoxCeleb Speaker Recognition Challenge 2019 aimed to assess how well current speaker recognition technology is able to identify speakers in unconstrained or `in the wild' data.

Speaker Recognition

Paper
Add Code

ASR is all you need: cross-modal distillation for lip reading

no code implementations • 28 Nov 2019 • Triantafyllos Afouras, Joon Son Chung, Andrew Zisserman

The goal of this work is to train strong models for visual speech recognition without requiring human annotated ground truth data.

Ranked #14 on Lipreading on LRS3-TED (using extra training data)

Automatic Speech Recognition Automatic Speech Recognition (ASR) +4

Paper
Add Code

Delving into VoxCeleb: environment invariant speaker recognition

1 code implementation • 24 Oct 2019 • Joon Son Chung, Jaesung Huh, Seongkyu Mun

Research in speaker recognition has recently seen significant progress due to the application of neural network models and the availability of new large-scale datasets.

Speaker Identification Speaker Recognition

Paper
Code

My lips are concealed: Audio-visual speech enhancement through obstructions

no code implementations • 11 Jul 2019 • Triantafyllos Afouras, Joon Son Chung, Andrew Zisserman

To this end we introduce a deep audio-visual speech enhancement network that is able to separate a speaker's voice by conditioning on both the speaker's lip movements and/or a representation of their voice.

Speech Enhancement

Paper
Add Code

Naver at ActivityNet Challenge 2019 -- Task B Active Speaker Detection (AVA)

no code implementations • 25 Jun 2019 • Joon Son Chung

This report describes our submission to the ActivityNet Challenge at CVPR 2019.

Ranked #14 on Audio-Visual Active Speaker Detection on AVA-ActiveSpeaker (using extra training data)

Audio-Visual Active Speaker Detection

Paper
Add Code

Who said that?: Audio-visual speaker diarisation of real-world meetings

no code implementations • 24 Jun 2019 • Joon Son Chung, Bong-Jin Lee, Icksang Han

The goal of this work is to determine 'who spoke when' in real-world meetings.

Paper
Add Code

Utterance-level Aggregation For Speaker Recognition In The Wild

10 code implementations • 26 Feb 2019 • Weidi Xie, Arsha Nagrani, Joon Son Chung, Andrew Zisserman

The objective of this paper is speaker recognition "in the wild"-where utterances may be of variable length and also contain irrelevant signals.

Speaker Recognition Text-Independent Speaker Verification

449

Paper
Code

Perfect match: Improved cross-modal embeddings for audio-visual synchronisation

no code implementations • 21 Sep 2018 • Soo-Whan Chung, Joon Son Chung, Hong-Goo Kang

This paper proposes a new strategy for learning powerful cross-modal embeddings for audio-to-video synchronization.

Binary Classification Cross-Modal Retrieval +4

Paper
Add Code

Deep Audio-Visual Speech Recognition

4 code implementations • 6 Sep 2018 • Triantafyllos Afouras, Joon Son Chung, Andrew Senior, Oriol Vinyals, Andrew Zisserman

The goal of this work is to recognise phrases and sentences being spoken by a talking face, with or without the audio.

Ranked #6 on Audio-Visual Speech Recognition on LRS2

Audio-Visual Speech Recognition Automatic Speech Recognition (ASR) +4

185

Paper
Code

LRS3-TED: a large-scale dataset for visual speech recognition

no code implementations • 3 Sep 2018 • Triantafyllos Afouras, Joon Son Chung, Andrew Zisserman

This paper introduces a new multi-modal dataset for visual and audio-visual speech recognition.

Audio-Visual Speech Recognition speech-recognition +2

Paper
Add Code

Deep Lip Reading: a comparison of models and an online application

no code implementations • 15 Jun 2018 • Triantafyllos Afouras, Joon Son Chung, Andrew Zisserman

The goal of this paper is to develop state-of-the-art models for lip reading -- visual speech recognition.

Language Modelling Lip Reading +2

Paper
Add Code

VoxCeleb2: Deep Speaker Recognition

2 code implementations • 14 Jun 2018 • Joon Son Chung, Arsha Nagrani, Andrew Zisserman

The objective of this paper is speaker recognition under noisy and unconstrained conditions.

Ranked #1 on Speaker Verification on VoxCeleb2 (using extra training data)

Speaker Recognition Speaker Verification

366

Paper
Code

The Conversation: Deep Audio-Visual Speech Enhancement

no code implementations • 11 Apr 2018 • Triantafyllos Afouras, Joon Son Chung, Andrew Zisserman

Our goal is to isolate individual speakers from multi-talker simultaneous speech in videos.

Speech Enhancement

Paper
Add Code

VoxCeleb: a large-scale speaker identification dataset

8 code implementations • Interspeech 2018 • Arsha Nagrani, Joon Son Chung, Andrew Zisserman

Our second contribution is to apply and compare various state of the art speaker identification techniques on our dataset to establish baseline performance.

Sound

488

Paper
Code

You said that?

1 code implementation • 8 May 2017 • Joon Son Chung, Amir Jamaludin, Andrew Zisserman

To achieve this we propose an encoder-decoder CNN model that uses a joint embedding of the face and audio to generate synthesised talking face video frames.

Unconstrained Lip-synchronization

Paper
Code

Lip Reading Sentences in the Wild

1 code implementation • CVPR 2017 • Joon Son Chung, Andrew Senior, Oriol Vinyals, Andrew Zisserman

The goal of this work is to recognise phrases and sentences being spoken by a talking face, with or without the audio.

Ranked #4 on Lipreading on GRID corpus (mixed-speech) (using extra training data)

Lipreading Lip Reading +2

Paper
Code

Signs in time: Encoding human motion as a temporal image

no code implementations • 6 Aug 2016 • Joon Son Chung, Andrew Zisserman

The goal of this work is to recognise and localise short temporal signals in image time series, where strong supervision is not available for training.

Time Series Time Series Analysis

Paper
Add Code

Cannot find the paper you are looking for? You can Submit a new open access paper.