Search Results for author: Joon Son Chung

Found 80 papers, 25 papers with code

LAVCap: LLM-based Audio-Visual Captioning using Optimal Transport

1 code implementation • 16 Jan 2025 • Kyeongha Rho, Hyeongkeun Lee, Valentio Iverson, Joon Son Chung

LAVCap employs an optimal transport-based alignment loss to bridge the modality gap between audio and visual features, enabling more effective semantic extraction (see the illustrative sketch below).

AudioCaps • Audio captioning • +5
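The LAVCap entry above mentions an optimal transport-based alignment loss between audio and visual features. As a rough illustration of that general idea only, the Python/PyTorch sketch below computes an entropic-regularised (Sinkhorn) transport plan between audio and visual token features and uses the expected matching cost as an alignment loss; the function name, tensor shapes, and hyperparameters are hypothetical and not taken from the paper.

import torch
import torch.nn.functional as F

def ot_alignment_loss(audio_tokens, visual_tokens, eps=0.05, n_iters=50):
    # audio_tokens: (Na, D), visual_tokens: (Nv, D); hypothetical shapes.
    audio = F.normalize(audio_tokens, dim=-1)
    visual = F.normalize(visual_tokens, dim=-1)
    cost = 1.0 - audio @ visual.t()                  # (Na, Nv) cosine distance
    # Uniform marginals over the tokens of each modality.
    a = torch.full((audio.size(0),), 1.0 / audio.size(0), device=audio.device)
    b = torch.full((visual.size(0),), 1.0 / visual.size(0), device=visual.device)
    # Sinkhorn iterations for entropic-regularised optimal transport.
    K = torch.exp(-cost / eps)
    u = torch.ones_like(a)
    for _ in range(n_iters):
        v = b / (K.t() @ u + 1e-8)
        u = a / (K @ v + 1e-8)
    plan = u.unsqueeze(1) * K * v.unsqueeze(0)       # transport plan (Na, Nv)
    return (plan * cost).sum()                       # expected alignment cost

# Toy usage with random features.
loss = ot_alignment_loss(torch.randn(32, 256), torch.randn(50, 256))

Minimising a loss of this kind pulls together the audio and visual tokens that the transport plan matches, which is one common way to narrow a modality gap; the actual LAVCap objective may differ in its details.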

AdaptVC: High Quality Voice Conversion with Adaptive Learning

no code implementations • 2 Jan 2025 • Jaehun Kim, Ji-Hoon Kim, Yeunju Choi, Tan Dat Nguyen, Seongkyu Mun, Joon Son Chung

The goal of voice conversion is to transform the speech of a source speaker to sound like that of a reference speaker while preserving the original content.

Decoder • Disentanglement • +1

CrossSpeech++: Cross-lingual Speech Synthesis with Decoupled Language and Speaker Generation

no code implementations • 28 Dec 2024 • Ji-Hoon Kim, Hong-Sun Yang, Yoon-Cheol Ju, Il-Hwan Kim, Byeong-Yeol Kim, Joon Son Chung

A key challenge of cross-lingual speech synthesis is the language-speaker entanglement problem, which causes the quality of cross-lingual systems to lag behind that of intra-lingual systems.

Speech Synthesis

VoiceDiT: Dual-Condition Diffusion Transformer for Environment-Aware Speech Synthesis

no code implementations • 26 Dec 2024 • Jaemin Jung, Junseok Ahn, Chaeyoung Jung, Tan Dat Nguyen, Youngjoon Jang, Joon Son Chung

We present VoiceDiT, a multi-modal generative model for producing environment-aware speech and audio from text and visual prompts.

Audio Generation • Speech Synthesis

V2SFlow: Video-to-Speech Generation with Speech Decomposition and Rectified Flow

no code implementations • 29 Nov 2024 • Jeongsoo Choi, Ji-Hoon Kim, Jinyu Li, Joon Son Chung, Shujie Liu

In this paper, we introduce V2SFlow, a novel Video-to-Speech (V2S) framework designed to generate natural and intelligible speech directly from silent talking face videos.

Decoder

AVHBench: A Cross-Modal Hallucination Benchmark for Audio-Visual Large Language Models

1 code implementation • 23 Oct 2024 • Kim Sung-Bin, Oh Hyun-Bin, JungMok Lee, Arda Senocak, Joon Son Chung, Tae-Hyun Oh

Our results reveal that most existing audio-visual LLMs struggle with hallucinations caused by cross-interactions between modalities, due to their limited capacity to perceive complex multimodal signals and their relationships.

Hallucination

Accelerating Codec-based Speech Synthesis with Multi-Token Prediction and Speculative Decoding

no code implementations • 17 Oct 2024 • Tan Dat Nguyen, Ji-Hoon Kim, Jeongsoo Choi, Shukjae Choi, Jinseok Park, Younglo Lee, Joon Son Chung

In our experiments, we demonstrate that the time required to predict each token is reduced by a factor of 4 to 5 compared to baseline models, with minimal quality trade-off or even an improvement in speech intelligibility (see the illustrative sketch below).

Speech Synthesis
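The entry above reports a 4-to-5-times reduction in per-token prediction time from multi-token prediction combined with speculative decoding. The sketch below illustrates only the multi-token half of that idea for a generic codec language model: several classifier heads propose the next k tokens from a single decoder state, and a separate verification pass (not shown) would accept or reject them. All names and sizes are assumptions, not the paper's architecture.

import torch
import torch.nn as nn

class MultiTokenHead(nn.Module):
    # Propose the next k codec tokens from one decoder hidden state.
    def __init__(self, hidden_dim=512, vocab_size=1024, k=4):
        super().__init__()
        self.heads = nn.ModuleList(
            nn.Linear(hidden_dim, vocab_size) for _ in range(k)
        )

    def forward(self, h):
        # h: (batch, hidden_dim) -> logits of shape (batch, k, vocab_size)
        return torch.stack([head(h) for head in self.heads], dim=1)

heads = MultiTokenHead()
h = torch.randn(2, 512)              # last decoder state for 2 sequences
draft = heads(h).argmax(dim=-1)      # (2, 4) draft tokens proposed per step
# In speculative decoding, a verifier pass keeps the longest accepted prefix.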

SpoofCeleb: Speech Deepfake Detection and SASV In The Wild

no code implementations • 18 Sep 2024 • Jee-weon Jung, Yihan Wu, Xin Wang, Ji-Hoon Kim, Soumi Maiti, Yuta Matsunaga, Hye-jin Shim, Jinchuan Tian, Nicholas Evans, Joon Son Chung, Wangyou Zhang, Seyun Um, Shinnosuke Takamichi, Shinji Watanabe

This paper introduces SpoofCeleb, a dataset designed for Speech Deepfake Detection (SDD) and Spoofing-robust Automatic Speaker Verification (SASV), utilizing source data from real-world conditions and spoofing attacks generated by Text-To-Speech (TTS) systems also trained on the same real-world data.

DeepFake Detection • Diversity • +3

The VoxCeleb Speaker Recognition Challenge: A Retrospective

no code implementations • 27 Aug 2024 • Jaesung Huh, Joon Son Chung, Arsha Nagrani, Andrew Brown, Jee-weon Jung, Daniel Garcia-Romero, Andrew Zisserman

In this paper, we provide a review of these challenges that covers: what they explored; the methods developed by the challenge participants and how these evolved; and also the current state of the field for speaker verification and diarisation.

Domain Adaptation • Speaker Recognition • +1

Bridging the Gap between Audio and Text using Parallel-attention for User-defined Keyword Spotting

no code implementations • 7 Aug 2024 • Youkyum Kim, Jaemin Jung, Jihwan Park, Byeong-Yeol Kim, Joon Son Chung

This paper proposes a novel user-defined keyword spotting framework that accurately detects audio keywords based on text enrollment.

Keyword Spotting

Aligning Sight and Sound: Advanced Sound Source Localization Through Audio-Visual Alignment

1 code implementation • 18 Jul 2024 • Arda Senocak, Hyeonggon Ryu, Junsik Kim, Tae-Hyun Oh, Hanspeter Pfister, Joon Son Chung

Second, we introduce new evaluation metrics to rigorously assess sound source localization methods, focusing on accurately evaluating both localization performance and cross-modal interaction ability.

cross-modal alignment • Cross-Modal Retrieval • +1

ElasticAST: An Audio Spectrogram Transformer for All Length and Resolutions

1 code implementation • 11 Jul 2024 • Jiu Feng, Mehmet Hamza Erol, Joon Son Chung, Arda Senocak

Transformer-based models, such as the Audio Spectrogram Transformer (AST), also inherit the fixed-size input paradigm from CNNs.

Audio Classification

Lightweight Audio Segmentation for Long-form Speech Translation

no code implementations • 15 Jun 2024 • Jaesong Lee, Soyoon Kim, Hanbyul Kim, Joon Son Chung

We propose an ASR-with-punctuation task as an effective pre-training strategy for the segmentation model.

Segmentation • Translation

FlowAVSE: Efficient Audio-Visual Speech Enhancement with Conditional Flow Matching

no code implementations • 13 Jun 2024 • Chaeyoung Jung, Suyeon Lee, Ji-Hoon Kim, Joon Son Chung

This work proposes an efficient method to enhance the quality of corrupted speech signals by leveraging both acoustic and visual cues.

Speech Enhancement

To what extent can ASV systems naturally defend against spoofing attacks?

no code implementations • 8 Jun 2024 • Jee-weon Jung, Xin Wang, Nicholas Evans, Shinji Watanabe, Hye-jin Shim, Hemlata Tak, Sidhhant Arora, Junichi Yamagishi, Joon Son Chung

The current automatic speaker verification (ASV) task involves making binary decisions on two types of trials: target and non-target.

Speaker Verification

Audio Mamba: Bidirectional State Space Model for Audio Representation Learning

1 code implementation • 5 Jun 2024 • Mehmet Hamza Erol, Arda Senocak, Jiu Feng, Joon Son Chung

Transformers have rapidly become the preferred choice for audio classification, surpassing methods based on CNNs.

Audio Classification • Mamba • +2

EquiAV: Leveraging Equivariance for Audio-Visual Contrastive Learning

1 code implementation • 14 Mar 2024 • Jongsuk Kim, Hyeongkeun Lee, Kyeongha Rho, Junmo Kim, Joon Son Chung

Recent advancements in self-supervised audio-visual representation learning have demonstrated its potential to capture rich and comprehensive representations.

Ranked #3 on Audio Classification on VGGSound (using extra training data)

Audio Classification • audio-visual learning • +3

FreGrad: Lightweight and Fast Frequency-aware Diffusion Vocoder

2 code implementations • 18 Jan 2024 • Tan Dat Nguyen, Ji-Hoon Kim, Youngjoon Jang, Jaehun Kim, Joon Son Chung

The goal of this paper is to generate realistic audio with a lightweight and fast diffusion-based vocoder, named FreGrad.

From Coarse to Fine: Efficient Training for Audio Spectrogram Transformers

no code implementations • 16 Jan 2024 • Jiu Feng, Mehmet Hamza Erol, Joon Son Chung, Arda Senocak

We introduce multi-phase training of audio spectrogram transformers by connecting the seminal idea of coarse-to-fine with transformer models.

Audio Classification

Can CLIP Help Sound Source Localization?

1 code implementation • 7 Nov 2023 • Sooyoung Park, Arda Senocak, Joon Son Chung

Large-scale pre-trained image-text models demonstrate remarkable versatility across diverse tasks, benefiting from their robust representational capabilities and effective multimodal alignment.

audio-visual learning • Contrastive Learning • +1

Seeing Through the Conversation: Audio-Visual Speech Separation based on Diffusion Model

no code implementations • 30 Oct 2023 • Suyeon Lee, Chaeyoung Jung, Youngjoon Jang, Jaehun Kim, Joon Son Chung

For an effective fusion of the two modalities for diffusion, we also propose a cross-attention-based feature fusion mechanism (see the illustrative sketch below).

Speech Separation
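The entry above proposes a cross-attention-based fusion of the audio and visual streams. A generic, hedged sketch of such a fusion block is shown below, where audio tokens attend to visual tokens via PyTorch's multi-head attention; the dimensions, residual and normalisation choices, and the class name are assumptions rather than the paper's design.

import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, audio, visual):
        # audio: (B, Ta, D) queries; visual: (B, Tv, D) keys and values.
        fused, _ = self.attn(query=audio, key=visual, value=visual)
        return self.norm(audio + fused)   # residual connection, then LayerNorm

fusion = CrossAttentionFusion()
out = fusion(torch.randn(2, 100, 256), torch.randn(2, 25, 256))  # (2, 100, 256)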

VoiceLDM: Text-to-Speech with Environmental Context

no code implementations • 24 Sep 2023 • Yeonghyeon Lee, Inmo Yeon, Juhan Nam, Joon Son Chung

This paper presents VoiceLDM, a model designed to produce audio that accurately follows two distinct natural language text prompts: the description prompt and the content prompt.

AudioCaps • Text to Speech

SlowFast Network for Continuous Sign Language Recognition

1 code implementation • 21 Sep 2023 • Junseok Ahn, Youngjoon Jang, Joon Son Chung

The objective of this work is the effective extraction of spatial and dynamic features for Continuous Sign Language Recognition (CSLR).

Sign Language Recognition

Sound Source Localization is All about Cross-Modal Alignment

no code implementations • ICCV 2023 • Arda Senocak, Hyeonggon Ryu, Junsik Kim, Tae-Hyun Oh, Hanspeter Pfister, Joon Son Chung

However, prior work and existing benchmarks do not account for a more important aspect of the problem, cross-modal semantic understanding, which is essential for genuine sound source localization.

cross-modal alignment • Cross-Modal Retrieval • +2

Let There Be Sound: Reconstructing High Quality Speech from Silent Videos

no code implementations • 29 Aug 2023 • Ji-Hoon Kim, Jaehun Kim, Joon Son Chung

In this paper, we propose a novel lip-to-speech system that significantly improves the generation quality by alleviating the one-to-many mapping problem from multiple perspectives.

FlexiAST: Flexibility is What AST Needs

no code implementations • 18 Jul 2023 • Jiu Feng, Mehmet Hamza Erol, Joon Son Chung, Arda Senocak

To overcome this limitation, this paper proposes a training procedure to provide flexibility to standard AST models without architectural changes, allowing them to work with various patch sizes at the inference stage - FlexiAST.

Audio Classification

Hindi as a Second Language: Improving Visually Grounded Speech with Semantically Similar Samples

no code implementations • 30 Mar 2023 • Hyeonggon Ryu, Arda Senocak, In So Kweon, Joon Son Chung

The objective of this work is to explore the learning of visually grounded speech models (VGS) from a multilingual perspective.

Cross-Modal Retrieval • Retrieval

Self-Sufficient Framework for Continuous Sign Language Recognition

no code implementations • 21 Mar 2023 • Youngjoon Jang, Youngtaek Oh, Jae Won Cho, Myungchul Kim, Dong-Jin Kim, In So Kweon, Joon Son Chung

The goal of this work is to develop a self-sufficient framework for Continuous Sign Language Recognition (CSLR) that addresses the key issues of sign language recognition.

Pseudo Label • Sign Language Recognition

VoxSRC 2022: The Fourth VoxCeleb Speaker Recognition Challenge

1 code implementation • 20 Feb 2023 • Jaesung Huh, Andrew Brown, Jee-weon Jung, Joon Son Chung, Arsha Nagrani, Daniel Garcia-Romero, Andrew Zisserman

This paper summarises the findings from the VoxCeleb Speaker Recognition Challenge 2022 (VoxSRC-22), which was held in conjunction with INTERSPEECH 2022.

Speaker Diarization • Speaker Recognition • +1

MarginNCE: Robust Sound Localization with a Negative Margin

no code implementations • 3 Nov 2022 • Sooyoung Park, Arda Senocak, Joon Son Chung

Furthermore, we demonstrate that the introduction of a negative margin to existing methods results in a consistent improvement in performance (see the illustrative sketch below).

Contrastive Learning • Sound Source Localization
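The MarginNCE entry above reports that introducing a negative margin into existing contrastive methods consistently helps. As a hedged illustration of the general idea only, the sketch below shifts the positive-pair similarities of an InfoNCE-style loss by a (typically negative) margin, which relaxes how strictly noisy positive pairs are enforced; the function name, shapes, and values are hypothetical and not the paper's implementation.

import torch
import torch.nn.functional as F

def margin_nce_loss(audio_emb, visual_emb, margin=-0.1, temperature=0.07):
    # audio_emb, visual_emb: (B, D) paired embeddings (row i matches row i).
    audio_emb = F.normalize(audio_emb, dim=-1)
    visual_emb = F.normalize(visual_emb, dim=-1)
    sim = audio_emb @ visual_emb.t()                      # (B, B) cosine sims
    # Apply the margin to the diagonal (positive pairs) only.
    sim = sim + margin * torch.eye(sim.size(0), device=sim.device)
    targets = torch.arange(sim.size(0), device=sim.device)
    return F.cross_entropy(sim / temperature, targets)

loss = margin_nce_loss(torch.randn(8, 128), torch.randn(8, 128))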

Metric Learning for User-defined Keyword Spotting

no code implementations • 1 Nov 2022 • Jaemin Jung, Youkyum Kim, Jihwan Park, Youshin Lim, Byeong-Yeol Kim, Youngjoon Jang, Joon Son Chung

In particular, we make the following contributions: (1) we construct a large-scale keyword dataset from an existing speech corpus and propose a filtering method to remove data that degrade model training; (2) we propose a metric learning-based two-stage training strategy, and demonstrate that the proposed method improves performance on the user-defined keyword spotting task by enriching keyword representations; (3) to facilitate fair comparison in the user-defined KWS field, we propose a unified evaluation protocol and metrics.

Keyword Spotting • Metric Learning

Signing Outside the Studio: Benchmarking Background Robustness for Continuous Sign Language Recognition

1 code implementation • 1 Nov 2022 • Youngjoon Jang, Youngtaek Oh, Jae Won Cho, Dong-Jin Kim, Joon Son Chung, In So Kweon

Most existing Continuous Sign Language Recognition (CSLR) benchmarks have fixed backgrounds and are filmed in studios with a static monochromatic background.

Benchmarking • Disentanglement • +1

In search of strong embedding extractors for speaker diarisation

no code implementations • 26 Oct 2022 • Jee-weon Jung, Hee-Soo Heo, Bong-Jin Lee, Jaesung Huh, Andrew Brown, Youngki Kwon, Shinji Watanabe, Joon Son Chung

First, the evaluation is not straightforward because the features required for better performance differ between speaker verification and diarisation.

Data Augmentation • Speaker Verification

Pushing the limits of raw waveform speaker recognition

2 code implementations • 16 Mar 2022 • Jee-weon Jung, You Jin Kim, Hee-Soo Heo, Bong-Jin Lee, Youngki Kwon, Joon Son Chung

Our best model achieves an equal error rate of 0.89%, which is competitive with the state-of-the-art models based on handcrafted features, and outperforms the best model based on raw waveform inputs by a large margin.

Self-Supervised Learning • Speaker Recognition • +1

Spell my name: keyword boosted speech recognition

no code implementations • 6 Oct 2021 • Namkyu Jung, Geonmin Kim, Joon Son Chung

Recognition of uncommon words such as names and technical terminology is important to understanding conversations in context.

Automatic Speech Recognition • Automatic Speech Recognition (ASR) • +3

Look Who's Talking: Active Speaker Detection in the Wild

1 code implementation • 17 Aug 2021 • You Jin Kim, Hee-Soo Heo, Soyeon Choe, Soo-Whan Chung, Yoohwan Kwon, Bong-Jin Lee, Youngki Kwon, Joon Son Chung

Face tracks are extracted from the videos and active segments are annotated based on the timestamps of VoxConverse in a semi-automatic way.

Active Speaker Detection

Graph Attention Networks for Speaker Verification

no code implementations • 22 Oct 2020 • Jee-weon Jung, Hee-Soo Heo, Ha-Jin Yu, Joon Son Chung

The proposed framework inputs segment-wise speaker embeddings from an enrollment and a test utterance and directly outputs a similarity score.

Graph Attention • Speaker Verification

Augmentation adversarial training for self-supervised speaker recognition

no code implementations • 23 Jul 2020 • Jaesung Huh, Hee Soo Heo, Jingu Kang, Shinji Watanabe, Joon Son Chung

Since the augmentation simulates the acoustic characteristics, training the network to be invariant to augmentation also encourages the network to be invariant to the channel information in general.

Contrastive Learning • Speaker Recognition

BSL-1K: Scaling up co-articulated sign language recognition using mouthing cues

1 code implementation • ECCV 2020 • Samuel Albanie, Gül Varol, Liliane Momeni, Triantafyllos Afouras, Joon Son Chung, Neil Fox, Andrew Zisserman

Recent progress in fine-grained gesture and action classification, and in machine translation, points to the possibility of automated sign language recognition becoming a reality.

Action Classification • Keyword Spotting • +2

Spot the conversation: speaker diarisation in the wild

no code implementations • 2 Jul 2020 • Joon Son Chung, Jaesung Huh, Arsha Nagrani, Triantafyllos Afouras, Andrew Zisserman

Finally, we use this pipeline to create a large-scale diarisation dataset called VoxConverse, collected from 'in the wild' videos, which we will release publicly to the research community.

Active Speaker Detection • Speaker Verification

FaceFilter: Audio-visual speech separation using still images

no code implementations • 14 May 2020 • Soo-Whan Chung, Soyeon Choe, Joon Son Chung, Hong-Goo Kang

The objective of this paper is to separate a target speaker's speech from a mixture of two speakers using a deep audio-visual speech separation network.

Speech Separation

Disentangled Speech Embeddings using Cross-modal Self-supervision

no code implementations • 20 Feb 2020 • Arsha Nagrani, Joon Son Chung, Samuel Albanie, Andrew Zisserman

The objective of this paper is to learn representations of speaker identity without access to manually annotated data.

Self-Supervised Learning • Speaker Recognition

VoxSRC 2019: The first VoxCeleb Speaker Recognition Challenge

no code implementations • 5 Dec 2019 • Joon Son Chung, Arsha Nagrani, Ernesto Coto, Weidi Xie, Mitchell McLaren, Douglas A. Reynolds, Andrew Zisserman

The VoxCeleb Speaker Recognition Challenge 2019 aimed to assess how well current speaker recognition technology is able to identify speakers in unconstrained or 'in the wild' data.

Speaker Recognition

ASR is all you need: cross-modal distillation for lip reading

no code implementations • 28 Nov 2019 • Triantafyllos Afouras, Joon Son Chung, Andrew Zisserman

The goal of this work is to train strong models for visual speech recognition without requiring human annotated ground truth data.

Ranked #21 on Lipreading on LRS3-TED (using extra training data)

Automatic Speech Recognition • Automatic Speech Recognition (ASR) • +4

Delving into VoxCeleb: environment invariant speaker recognition

1 code implementation • 24 Oct 2019 • Joon Son Chung, Jaesung Huh, Seongkyu Mun

Research in speaker recognition has recently seen significant progress due to the application of neural network models and the availability of new large-scale datasets.

Speaker Identification • Speaker Recognition

My lips are concealed: Audio-visual speech enhancement through obstructions

no code implementations • 11 Jul 2019 • Triantafyllos Afouras, Joon Son Chung, Andrew Zisserman

To this end, we introduce a deep audio-visual speech enhancement network that is able to separate a speaker's voice by conditioning on the speaker's lip movements and/or a representation of their voice.

Speech Enhancement

Who said that?: Audio-visual speaker diarisation of real-world meetings

no code implementations • 24 Jun 2019 • Joon Son Chung, Bong-Jin Lee, Icksang Han

The goal of this work is to determine 'who spoke when' in real-world meetings.

Utterance-level Aggregation For Speaker Recognition In The Wild

10 code implementations • 26 Feb 2019 • Weidi Xie, Arsha Nagrani, Joon Son Chung, Andrew Zisserman

The objective of this paper is speaker recognition "in the wild", where utterances may be of variable length and also contain irrelevant signals.

Speaker Recognition • Text-Independent Speaker Verification

Perfect match: Improved cross-modal embeddings for audio-visual synchronisation

no code implementations • 21 Sep 2018 • Soo-Whan Chung, Joon Son Chung, Hong-Goo Kang

This paper proposes a new strategy for learning powerful cross-modal embeddings for audio-to-video synchronization.

Binary Classification • Cross-Modal Retrieval • +4

VoxCeleb2: Deep Speaker Recognition

2 code implementations • 14 Jun 2018 • Joon Son Chung, Arsha Nagrani, Andrew Zisserman

The objective of this paper is speaker recognition under noisy and unconstrained conditions.

 Ranked #1 on Speaker Verification on VoxCeleb2 (using extra training data)

Speaker Recognition • Speaker Verification

The Conversation: Deep Audio-Visual Speech Enhancement

no code implementations • 11 Apr 2018 • Triantafyllos Afouras, Joon Son Chung, Andrew Zisserman

Our goal is to isolate individual speakers from multi-talker simultaneous speech in videos.

Speech Enhancement

VoxCeleb: a large-scale speaker identification dataset

8 code implementations • Interspeech 2018 • Arsha Nagrani, Joon Son Chung, Andrew Zisserman

Our second contribution is to apply and compare various state-of-the-art speaker identification techniques on our dataset to establish baseline performance.

Sound

You said that?

1 code implementation • 8 May 2017 • Joon Son Chung, Amir Jamaludin, Andrew Zisserman

To achieve this we propose an encoder-decoder CNN model that uses a joint embedding of the face and audio to generate synthesised talking face video frames.

Decoder • Unconstrained Lip-synchronization

Lip Reading Sentences in the Wild

no code implementations • CVPR 2017 • Joon Son Chung, Andrew Senior, Oriol Vinyals, Andrew Zisserman

The goal of this work is to recognise phrases and sentences being spoken by a talking face, with or without the audio.

Ranked #4 on Lipreading on GRID corpus (mixed-speech) (using extra training data)

Lipreading • Lip Reading • +2

Signs in time: Encoding human motion as a temporal image

no code implementations • 6 Aug 2016 • Joon Son Chung, Andrew Zisserman

The goal of this work is to recognise and localise short temporal signals in image time series, where strong supervision is not available for training.

Time Series • Time Series Analysis
