1 code implementation • 16 Jan 2025 • Kyeongha Rho, Hyeongkeun Lee, Valentio Iverson, Joon Son Chung
LAVCap employs an optimal transport-based alignment loss to bridge the modality gap between audio and visual features, enabling more effective semantic extraction.
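Below is a minimal sketch of how an optimal transport-based alignment loss between audio and visual token features could be computed, using entropic (Sinkhorn) optimal transport over a cosine-distance cost. The function name, hyperparameters, and implementation are illustrative assumptions, not LAVCap's actual code.

```python
import torch
import torch.nn.functional as F

def sinkhorn_alignment_loss(audio_feats, visual_feats, eps=0.05, iters=50):
    """Entropic-OT alignment cost between audio and visual token features.

    audio_feats:  (Na, D) audio token features
    visual_feats: (Nv, D) visual token features
    """
    a = F.normalize(audio_feats, dim=-1)
    v = F.normalize(visual_feats, dim=-1)
    cost = 1.0 - a @ v.t()                                    # (Na, Nv) cosine distance

    na, nv = cost.shape
    log_mu = torch.log(torch.full((na,), 1.0 / na, device=cost.device))
    log_nu = torch.log(torch.full((nv,), 1.0 / nv, device=cost.device))

    # Log-domain Sinkhorn iterations for numerical stability.
    f = torch.zeros(na, device=cost.device)
    g = torch.zeros(nv, device=cost.device)
    for _ in range(iters):
        f = eps * (log_mu - torch.logsumexp((g[None, :] - cost) / eps, dim=1))
        g = eps * (log_nu - torch.logsumexp((f[:, None] - cost) / eps, dim=0))

    plan = torch.exp((f[:, None] + g[None, :] - cost) / eps)  # (Na, Nv) transport plan
    return (plan * cost).sum()

# Usage: the OT cost could be added to the captioning objective as an auxiliary loss.
loss = sinkhorn_alignment_loss(torch.randn(50, 256), torch.randn(20, 256))
```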
no code implementations • 2 Jan 2025 • Jaehun Kim, Ji-Hoon Kim, Yeunju Choi, Tan Dat Nguyen, Seongkyu Mun, Joon Son Chung
The goal of voice conversion is to transform the speech of a source speaker to sound like that of a reference speaker while preserving the original content.
no code implementations • 28 Dec 2024 • Ji-Hoon Kim, Hong-Sun Yang, Yoon-Cheol Ju, Il-Hwan Kim, Byeong-Yeol Kim, Joon Son Chung
A key challenge of cross-lingual speech synthesis is the language-speaker entanglement problem, which causes the quality of cross-lingual systems to lag behind that of intra-lingual systems.
no code implementations • 26 Dec 2024 • Jaemin Jung, Junseok Ahn, Chaeyoung Jung, Tan Dat Nguyen, Youngjoon Jang, Joon Son Chung
We present VoiceDiT, a multi-modal generative model for producing environment-aware speech and audio from text and visual prompts.
no code implementations • 29 Nov 2024 • Jeongsoo Choi, Ji-Hoon Kim, Jinyu Li, Joon Son Chung, Shujie Liu
In this paper, we introduce V2SFlow, a novel Video-to-Speech (V2S) framework designed to generate natural and intelligible speech directly from silent talking face videos.
1 code implementation • 23 Oct 2024 • Kim Sung-Bin, Oh Hyun-Bin, JungMok Lee, Arda Senocak, Joon Son Chung, Tae-Hyun Oh
Our results reveal that most existing audio-visual LLMs struggle with hallucinations caused by cross-interactions between modalities, due to their limited capacity to perceive complex multimodal signals and their relationships.
no code implementations • 17 Oct 2024 • Tan Dat Nguyen, Ji-Hoon Kim, Jeongsoo Choi, Shukjae Choi, Jinseok Park, Younglo Lee, Joon Son Chung
In our experiments, we demonstrate that the time required to predict each token is reduced by a factor of 4 to 5 compared to baseline models, with a minimal quality trade-off, or even an improvement, in speech intelligibility.
no code implementations • 17 Oct 2024 • Jongbhin Woo, Hyeonggon Ryu, Youngjoon Jang, Jae Won Cho, Joon Son Chung
Video Temporal Grounding (VTG) aims to identify visual frames in a video clip that match text queries.
no code implementations • 18 Sep 2024 • Jee-weon Jung, Yihan Wu, Xin Wang, Ji-Hoon Kim, Soumi Maiti, Yuta Matsunaga, Hye-jin Shim, Jinchuan Tian, Nicholas Evans, Joon Son Chung, Wangyou Zhang, Seyun Um, Shinnosuke Takamichi, Shinji Watanabe
This paper introduces SpoofCeleb, a dataset designed for Speech Deepfake Detection (SDD) and Spoofing-robust Automatic Speaker Verification (SASV), utilizing source data from real-world conditions and spoofing attacks generated by Text-To-Speech (TTS) systems also trained on the same real-world data.
no code implementations • 13 Sep 2024 • Jee-weon Jung, Wangyou Zhang, Soumi Maiti, Yihan Wu, Xin Wang, Ji-Hoon Kim, Yuta Matsunaga, Seyun Um, Jinchuan Tian, Hye-jin Shim, Nicholas Evans, Joon Son Chung, Shinnosuke Takamichi, Shinji Watanabe
The recent literature nonetheless shows efforts to train TTS systems using data collected in the wild.
no code implementations • 27 Aug 2024 • Jaesung Huh, Joon Son Chung, Arsha Nagrani, Andrew Brown, Jee-weon Jung, Daniel Garcia-Romero, Andrew Zisserman
In this paper, we provide a review of these challenges that covers what they explored, the methods developed by the challenge participants and how these evolved, and the current state of the field for speaker verification and diarisation.
no code implementations • 7 Aug 2024 • Youkyum Kim, Jaemin Jung, Jihwan Park, Byeong-Yeol Kim, Joon Son Chung
This paper proposes a novel user-defined keyword spotting framework that accurately detects audio keywords based on text enrollment.
1 code implementation • 26 Jul 2024 • Junseok Ahn, Youkyum Kim, Yeunju Choi, Doyeop Kwak, Ji-Hoon Kim, Seongkyu Mun, Joon Son Chung
This paper introduces VoxSim, a dataset of perceptual voice similarity ratings.
1 code implementation • 18 Jul 2024 • Arda Senocak, Hyeonggon Ryu, Junsik Kim, Tae-Hyun Oh, Hanspeter Pfister, Joon Son Chung
Second, we introduce new evaluation metrics to rigorously assess sound source localization methods, focusing on accurately evaluating both localization performance and cross-modal interaction ability.
1 code implementation • 11 Jul 2024 • Jiu Feng, Mehmet Hamza Erol, Joon Son Chung, Arda Senocak
Transformer-based models, such as the Audio Spectrogram Transformer (AST), also inherit the fixed-size input paradigm from CNNs.
no code implementations • 15 Jun 2024 • Jaesong Lee, Soyoon Kim, Hanbyul Kim, Joon Son Chung
We propose an ASR-with-punctuation task as an effective pre-training strategy for the segmentation model.
no code implementations • 13 Jun 2024 • Chaeyoung Jung, Suyeon Lee, Ji-Hoon Kim, Joon Son Chung
This work proposes an efficient method to enhance the quality of corrupted speech signals by leveraging both acoustic and visual cues.
no code implementations • 8 Jun 2024 • Jee-weon Jung, Xin Wang, Nicholas Evans, Shinji Watanabe, Hye-jin Shim, Hemlata Tak, Sidhhant Arora, Junichi Yamagishi, Joon Son Chung
The current automatic speaker verification (ASV) task involves making binary decisions on two types of trials: target and non-target.
1 code implementation • 5 Jun 2024 • Mehmet Hamza Erol, Arda Senocak, Jiu Feng, Joon Son Chung
Transformers have rapidly become the preferred choice for audio classification, surpassing methods based on CNNs.
no code implementations • CVPR 2024 • Youngjoon Jang, Ji-Hoon Kim, Junseok Ahn, Doyeop Kwak, Hong-Sun Yang, Yoon-Cheol Ju, Il-Hwan Kim, Byeong-Yeol Kim, Joon Son Chung
The goal of this work is to simultaneously generate natural talking faces and speech outputs from text.
no code implementations • CVPR 2024 • Dawit Mureja Argaw, Seunghyun Yoon, Fabian Caba Heilbron, Hanieh Deilamsalehy, Trung Bui, Zhaowen Wang, Franck Dernoncourt, Joon Son Chung
Long-form video content constitutes a significant portion of internet traffic, making automated video summarization an essential research problem.
no code implementations • CVPR 2024 • Dawit Mureja Argaw, Mattia Soldan, Alejandro Pardo, Chen Zhao, Fabian Caba Heilbron, Joon Son Chung, Bernard Ghanem
Movie trailers are an essential tool for promoting films and attracting audiences.
1 code implementation • 14 Mar 2024 • Jongsuk Kim, Hyeongkeun Lee, Kyeongha Rho, Junmo Kim, Joon Son Chung
Recent advancements in self-supervised audio-visual representation learning have demonstrated its potential to capture rich and comprehensive representations.
Ranked #3 on Audio Classification on VGGSound (using extra training data)
2 code implementations • 18 Jan 2024 • Tan Dat Nguyen, Ji-Hoon Kim, Youngjoon Jang, Jaehun Kim, Joon Son Chung
The goal of this paper is to generate realistic audio with a lightweight and fast diffusion-based vocoder, named FreGrad.
no code implementations • 16 Jan 2024 • Jiu Feng, Mehmet Hamza Erol, Joon Son Chung, Arda Senocak
We introduce multi-phase training of audio spectrogram transformers by connecting the seminal idea of coarse-to-fine with transformer models.
1 code implementation • 7 Nov 2023 • Sooyoung Park, Arda Senocak, Joon Son Chung
Large-scale pre-trained image-text models demonstrate remarkable versatility across diverse tasks, benefiting from their robust representational capabilities and effective multimodal alignment.
no code implementations • 30 Oct 2023 • Suyeon Lee, Chaeyoung Jung, Youngjoon Jang, Jaehun Kim, Joon Son Chung
For an effective fusion of the two modalities for diffusion, we also propose a cross-attention-based feature fusion mechanism.
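A cross-attention-based fusion could, for instance, let tokens of one modality query the other. The sketch below is a generic assumption; the module name, dimensions, and residual/normalisation choices are not taken from the paper.

```python
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """Illustrative cross-attention fusion: audio tokens attend to visual tokens."""

    def __init__(self, dim=512, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, audio_tokens, visual_tokens):
        # audio_tokens: (B, Ta, D) queries; visual_tokens: (B, Tv, D) keys/values
        fused, _ = self.attn(query=audio_tokens, key=visual_tokens, value=visual_tokens)
        return self.norm(audio_tokens + fused)   # residual connection

# Example usage with random features.
fusion = CrossAttentionFusion(dim=512)
audio = torch.randn(2, 100, 512)
video = torch.randn(2, 25, 512)
out = fusion(audio, video)   # (2, 100, 512)
```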
no code implementations • 26 Sep 2023 • Hee-Soo Heo, Kihyun Nam, Bong-Jin Lee, Youngki Kwon, Minjae Lee, You Jin Kim, Joon Son Chung
In the field of speaker verification, session or channel variability poses a significant challenge.
no code implementations • 24 Sep 2023 • Yeonghyeon Lee, Inmo Yeon, Juhan Nam, Joon Son Chung
This paper presents VoiceLDM, a model designed to produce audio that accurately follows two distinct natural language text prompts: the description prompt and the content prompt.
1 code implementation • 21 Sep 2023 • Chaeyoung Jung, Suyeon Lee, Kihyun Nam, Kyeongha Rho, You Jin Kim, Youngjoon Jang, Joon Son Chung
The goal of this work is Active Speaker Detection (ASD), a task to determine whether a person is speaking or not in a series of video frames.
1 code implementation • 21 Sep 2023 • Junseok Ahn, Youngjoon Jang, Joon Son Chung
The objective of this work is the effective extraction of spatial and dynamic features for Continuous Sign Language Recognition (CSLR).
no code implementations • ICCV 2023 • Arda Senocak, Hyeonggon Ryu, Junsik Kim, Tae-Hyun Oh, Hanspeter Pfister, Joon Son Chung
However, prior arts and existing benchmarks do not account for a more important aspect of the problem, cross-modal semantic understanding, which is essential for genuine sound source localization.
no code implementations • 29 Aug 2023 • Ji-Hoon Kim, Jaehun Kim, Joon Son Chung
In this paper, we propose a novel lip-to-speech system that significantly improves the generation quality by alleviating the one-to-many mapping problem from multiple perspectives.
no code implementations • 18 Jul 2023 • Jiu Feng, Mehmet Hamza Erol, Joon Son Chung, Arda Senocak
To overcome this limitation, this paper proposes a training procedure to provide flexibility to standard AST models without architectural changes, allowing them to work with various patch sizes at the inference stage - FlexiAST.
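One simple way to let an AST-style model accept different patch sizes is to resample the patch-embedding kernel to a randomly chosen size at each training step. The sketch below uses plain bilinear resizing and hypothetical patch sizes; it is a simplification of the actual FlexiAST procedure, and in practice positional embeddings would also need to be resized.

```python
import random
import torch
import torch.nn.functional as F

def resize_patch_embedding(weight, patch_size):
    """Bilinearly resize a Conv2d patch-embedding kernel to a new patch size.

    weight: (embed_dim, in_chans, P, P) original patch-embedding weights.
    """
    return F.interpolate(weight, size=(patch_size, patch_size),
                         mode="bilinear", align_corners=False)

def flexible_patchify(spectrogram, base_weight, patch_sizes=(8, 12, 16, 24, 32)):
    """Sample a patch size and tokenise the spectrogram with resized weights."""
    p = random.choice(patch_sizes)
    w = resize_patch_embedding(base_weight, p)
    # Non-overlapping patches: stride equals the patch size.
    tokens = F.conv2d(spectrogram, w, stride=p)          # (B, D, H/p, W/p)
    return tokens.flatten(2).transpose(1, 2)             # (B, N, D)

# Example with assumed shapes: a pretrained 16x16 patch embedding and a mel spectrogram.
base_weight = torch.randn(768, 1, 16, 16)
spec = torch.randn(2, 1, 128, 1024)                      # (B, 1, mel bins, time frames)
tokens = flexible_patchify(spec, base_weight)
```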
no code implementations • 6 Apr 2023 • Youngjoon Jang, Kyeongha Rho, Jong-Bin Woo, Hyeongkeun Lee, Jihwan Park, Youshin Lim, Byeong-Yeol Kim, Joon Son Chung
The goal of this paper is to synthesise talking faces with controllable facial motions.
no code implementations • 30 Mar 2023 • Hyeonggon Ryu, Arda Senocak, In So Kweon, Joon Son Chung
The objective of this work is to explore the learning of visually grounded speech (VGS) models from a multilingual perspective.
no code implementations • 21 Mar 2023 • Youngjoon Jang, Youngtaek Oh, Jae Won Cho, Myungchul Kim, Dong-Jin Kim, In So Kweon, Joon Son Chung
The goal of this work is to develop a self-sufficient framework for Continuous Sign Language Recognition (CSLR) that addresses key issues of sign language recognition.
1 code implementation • 27 Feb 2023 • Jiyoung Lee, Joon Son Chung, Soo-Whan Chung
This is the first time that face images are used as a condition to train a TTS model.
1 code implementation • 20 Feb 2023 • Jaesung Huh, Andrew Brown, Jee-weon Jung, Joon Son Chung, Arsha Nagrani, Daniel Garcia-Romero, Andrew Zisserman
This paper summarises the findings from the VoxCeleb Speaker Recognition Challenge 2022 (VoxSRC-22), which was held in conjunction with INTERSPEECH 2022.
no code implementations • 3 Nov 2022 • Sooyoung Park, Arda Senocak, Joon Son Chung
Furthermore, we demonstrate that the introduction of a negative margin to existing methods results in a consistent improvement in performance.
no code implementations • 1 Nov 2022 • Kihyun Nam, Youkyum Kim, Jaesung Huh, Hee Soo Heo, Jee-weon Jung, Joon Son Chung
The goal of this paper is to learn robust speaker representations for the bilingual speaking scenario.
no code implementations • 1 Nov 2022 • Jaemin Jung, Youkyum Kim, Jihwan Park, Youshin Lim, Byeong-Yeol Kim, Youngjoon Jang, Joon Son Chung
In particular, we make the following contributions: (1) we construct a large-scale keyword dataset from an existing speech corpus and propose a filtering method to remove data that degrade model training; (2) we propose a metric learning-based two-stage training strategy, and demonstrate that the proposed method improves performance on the user-defined keyword spotting task by enriching keyword representations; (3) to facilitate fair comparison in the user-defined KWS field, we propose a unified evaluation protocol and metrics.
1 code implementation • 1 Nov 2022 • Youngjoon Jang, Youngtaek Oh, Jae Won Cho, Dong-Jin Kim, Joon Son Chung, In So Kweon
Most existing Continuous Sign Language Recognition (CSLR) benchmarks have fixed backgrounds and are filmed in studios with a static monochromatic background.
no code implementations • 26 Oct 2022 • Jee-weon Jung, Hee-Soo Heo, Bong-Jin Lee, Jaesung Huh, Andrew Brown, Youngki Kwon, Shinji Watanabe, Joon Son Chung
First, the evaluation is not straightforward because the features required for better performance differ between speaker verification and diarisation.
no code implementations • 20 Oct 2022 • Jee-weon Jung, Hee-Soo Heo, Bong-Jin Lee, Jaesong Lee, Hye-jin Shim, Youngki Kwon, Joon Son Chung, Shinji Watanabe
We also show that training with the proposed large data configurations gives better performance.
no code implementations • 28 Mar 2022 • Hee-Soo Heo, Jee-weon Jung, Jingu Kang, Youngki Kwon, You Jin Kim, Bong-Jin Lee, Joon Son Chung
The goal of this paper is to train effective self-supervised speaker representations without identity labels.
2 code implementations • 16 Mar 2022 • Jee-weon Jung, You Jin Kim, Hee-Soo Heo, Bong-Jin Lee, Youngki Kwon, Joon Son Chung
Our best model achieves an equal error rate of 0.89%, which is competitive with the state-of-the-art models based on handcrafted features, and outperforms the best model based on raw waveform inputs by a large margin.
no code implementations • 7 Oct 2021 • You Jin Kim, Hee-Soo Heo, Jee-weon Jung, Youngki Kwon, Bong-Jin Lee, Joon Son Chung
The objective of this work is to train noise-robust speaker embeddings adapted for speaker diarisation.
no code implementations • 7 Oct 2021 • Youngki Kwon, Hee-Soo Heo, Jee-weon Jung, You Jin Kim, Bong-Jin Lee, Joon Son Chung
The objective of this work is effective speaker diarisation using multi-scale speaker embeddings.
no code implementations • 6 Oct 2021 • Namkyu Jung, Geonmin Kim, Joon Son Chung
Recognition of uncommon words such as names and technical terminology is important to understanding conversations in context.
2 code implementations • 4 Oct 2021 • Jee-weon Jung, Hee-Soo Heo, Hemlata Tak, Hye-jin Shim, Joon Son Chung, Bong-Jin Lee, Ha-Jin Yu, Nicholas Evans
Artefacts that differentiate spoofed from bona-fide utterances can reside in spectral or temporal domains.
Ranked #1 on Voice Anti-spoofing on ASVspoof 2019 - LA
1 code implementation • 17 Aug 2021 • You Jin Kim, Hee-Soo Heo, Soyeon Choe, Soo-Whan Chung, Yoohwan Kwon, Bong-Jin Lee, Youngki Kwon, Joon Son Chung
Face tracks are extracted from the videos and active segments are annotated based on the timestamps of VoxConverse in a semi-automatic way.
no code implementations • 7 Apr 2021 • Youngki Kwon, Jee-weon Jung, Hee-Soo Heo, You Jin Kim, Bong-Jin Lee, Joon Son Chung
The goal of this paper is to adapt speaker embeddings for solving the problem of speaker diarisation.
no code implementations • 7 Apr 2021 • Jee-weon Jung, Hee-Soo Heo, Youngki Kwon, Joon Son Chung, Bong-Jin Lee
In this work, we propose an overlapped speech detection system trained as a three-class classifier.
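An overlapped speech detector framed as a three-class problem typically predicts, for each frame, non-speech vs. single-speaker speech vs. overlapped speech. The minimal frame classifier below (LSTM encoder, feature dimensions, class ordering) is an assumed illustration, not the system described in the paper.

```python
import torch
import torch.nn as nn

# Frame-level classes: 0 = non-speech, 1 = single speaker, 2 = overlapped speech.
class OverlapDetector(nn.Module):
    """Illustrative three-class frame classifier for overlapped speech detection."""

    def __init__(self, n_mels=64, hidden=128, n_classes=3):
        super().__init__()
        self.encoder = nn.LSTM(n_mels, hidden, num_layers=2,
                               batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, n_classes)

    def forward(self, features):          # features: (B, T, n_mels)
        h, _ = self.encoder(features)
        return self.head(h)               # (B, T, 3) frame-wise logits

# Example training step on random features and labels.
model = OverlapDetector()
logits = model(torch.randn(4, 300, 64))
loss = nn.CrossEntropyLoss()(logits.reshape(-1, 3),
                             torch.randint(0, 3, (4 * 300,)))
```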
no code implementations • 12 Dec 2020 • Arsha Nagrani, Joon Son Chung, Jaesung Huh, Andrew Brown, Ernesto Coto, Weidi Xie, Mitchell McLaren, Douglas A Reynolds, Andrew Zisserman
We held the second installment of the VoxCeleb Speaker Recognition Challenge in conjunction with Interspeech 2020.
no code implementations • 22 Oct 2020 • Jee-weon Jung, Hee-Soo Heo, Ha-Jin Yu, Joon Son Chung
The proposed framework inputs segment-wise speaker embeddings from an enrollment and a test utterance and directly outputs a similarity score.
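A back-end that takes segment-wise speaker embeddings from the enrolment and test utterances and directly outputs a similarity score might look like the sketch below. The pooling, interaction features, and MLP here are placeholder assumptions; the actual framework processes segment-wise embeddings jointly, e.g. with attention.

```python
import torch
import torch.nn as nn

class SegmentScorer(nn.Module):
    """Illustrative back-end mapping segment-wise enrolment/test embeddings to a score."""

    def __init__(self, dim=256, hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(4 * dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1))

    def forward(self, enrol_segs, test_segs):
        # enrol_segs: (B, Ne, D), test_segs: (B, Nt, D)
        e = enrol_segs.mean(dim=1)
        t = test_segs.mean(dim=1)
        # Simple pairwise interaction features as a stand-in for learned interaction.
        x = torch.cat([e, t, (e - t).abs(), e * t], dim=-1)
        return self.mlp(x).squeeze(-1)    # one similarity logit per trial

scorer = SegmentScorer(dim=256)
score = scorer(torch.randn(8, 10, 256), torch.randn(8, 12, 256))   # (8,) trial scores
```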
1 code implementation • ECCV 2020 • Triantafyllos Afouras, Andrew Owens, Joon Son Chung, Andrew Zisserman
Our objective is to transform a video into a set of discrete audio-visual objects using self-supervised learning.
no code implementations • 23 Jul 2020 • Jaesung Huh, Hee Soo Heo, Jingu Kang, Shinji Watanabe, Joon Son Chung
Since the augmentation simulates the acoustic characteristics, training the network to be invariant to augmentation also encourages the network to be invariant to the channel information in general.
1 code implementation • ECCV 2020 • Samuel Albanie, Gül Varol, Liliane Momeni, Triantafyllos Afouras, Joon Son Chung, Neil Fox, Andrew Zisserman
Recent progress in fine-grained gesture and action classification, and in machine translation, points to the possibility of automated sign language recognition becoming a reality.
Ranked #7 on Sign Language Recognition on WLASL-2000
no code implementations • 2 Jul 2020 • Joon Son Chung, Jaesung Huh, Arsha Nagrani, Triantafyllos Afouras, Andrew Zisserman
Finally, we use this pipeline to create a large-scale diarisation dataset called VoxConverse, collected from 'in the wild' videos, which we will release publicly to the research community.
no code implementations • 14 May 2020 • Soo-Whan Chung, Soyeon Choe, Joon Son Chung, Hong-Goo Kang
The objective of this paper is to separate a target speaker's speech from a mixture of two speakers using a deep audio-visual speech separation network.
no code implementations • 29 Apr 2020 • Soo-Whan Chung, Hong Goo Kang, Joon Son Chung
We build on earlier work to train embeddings that are more discriminative for uni-modal downstream tasks.
no code implementations • 20 Feb 2020 • Arsha Nagrani, Joon Son Chung, Samuel Albanie, Andrew Zisserman
The objective of this paper is to learn representations of speaker identity without access to manually annotated data.
no code implementations • 5 Dec 2019 • Joon Son Chung, Arsha Nagrani, Ernesto Coto, Weidi Xie, Mitchell McLaren, Douglas A. Reynolds, Andrew Zisserman
The VoxCeleb Speaker Recognition Challenge 2019 aimed to assess how well current speaker recognition technology is able to identify speakers in unconstrained or 'in the wild' data.
no code implementations • 28 Nov 2019 • Triantafyllos Afouras, Joon Son Chung, Andrew Zisserman
The goal of this work is to train strong models for visual speech recognition without requiring human annotated ground truth data.
Ranked #21 on Lipreading on LRS3-TED (using extra training data)
1 code implementation • 24 Oct 2019 • Joon Son Chung, Jaesung Huh, Seongkyu Mun
Research in speaker recognition has recently seen significant progress due to the application of neural network models and the availability of new large-scale datasets.
no code implementations • 11 Jul 2019 • Triantafyllos Afouras, Joon Son Chung, Andrew Zisserman
To this end we introduce a deep audio-visual speech enhancement network that can separate a speaker's voice by conditioning on the speaker's lip movements and/or a representation of their voice.
no code implementations • 25 Jun 2019 • Joon Son Chung
This report describes our submission to the ActivityNet Challenge at CVPR 2019.
Ranked #17 on Audio-Visual Active Speaker Detection on AVA-ActiveSpeaker (using extra training data)
no code implementations • 24 Jun 2019 • Joon Son Chung, Bong-Jin Lee, Icksang Han
The goal of this work is to determine 'who spoke when' in real-world meetings.
10 code implementations • 26 Feb 2019 • Weidi Xie, Arsha Nagrani, Joon Son Chung, Andrew Zisserman
The objective of this paper is speaker recognition "in the wild", where utterances may be of variable length and also contain irrelevant signals.
no code implementations • 21 Sep 2018 • Soo-Whan Chung, Joon Son Chung, Hong-Goo Kang
This paper proposes a new strategy for learning powerful cross-modal embeddings for audio-to-video synchronization.
4 code implementations • 6 Sep 2018 • Triantafyllos Afouras, Joon Son Chung, Andrew Senior, Oriol Vinyals, Andrew Zisserman
The goal of this work is to recognise phrases and sentences being spoken by a talking face, with or without the audio.
Ranked #7 on Audio-Visual Speech Recognition on LRS2
no code implementations • 3 Sep 2018 • Triantafyllos Afouras, Joon Son Chung, Andrew Zisserman
This paper introduces a new multi-modal dataset for visual and audio-visual speech recognition.
no code implementations • 15 Jun 2018 • Triantafyllos Afouras, Joon Son Chung, Andrew Zisserman
The goal of this paper is to develop state-of-the-art models for lip reading -- visual speech recognition.
2 code implementations • 14 Jun 2018 • Joon Son Chung, Arsha Nagrani, Andrew Zisserman
The objective of this paper is speaker recognition under noisy and unconstrained conditions.
Ranked #1 on Speaker Verification on VoxCeleb2 (using extra training data)
no code implementations • 11 Apr 2018 • Triantafyllos Afouras, Joon Son Chung, Andrew Zisserman
Our goal is to isolate individual speakers from multi-talker simultaneous speech in videos.
8 code implementations • Interspeech 2018 • Arsha Nagrani, Joon Son Chung, Andrew Zisserman
Our second contribution is to apply and compare various state-of-the-art speaker identification techniques on our dataset to establish baseline performance.
1 code implementation • 8 May 2017 • Joon Son Chung, Amir Jamaludin, Andrew Zisserman
To achieve this we propose an encoder-decoder CNN model that uses a joint embedding of the face and audio to generate synthesised talking face video frames.
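A highly simplified sketch of the kind of encoder-decoder CNN described above: a face (identity) encoder and an audio encoder produce a joint embedding, and an image decoder generates a frame from it. The layer sizes and structure are illustrative assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class TalkingFaceGenerator(nn.Module):
    """Illustrative encoder-decoder: joint face + audio embedding -> face frame."""

    def __init__(self, emb=256):
        super().__init__()
        self.face_enc = nn.Sequential(                 # identity image -> embedding
            nn.Conv2d(3, 64, 4, 2, 1), nn.ReLU(),
            nn.Conv2d(64, 128, 4, 2, 1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(128, emb))
        self.audio_enc = nn.Sequential(                # audio features (e.g. MFCCs) -> embedding
            nn.Conv1d(13, 64, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1), nn.Flatten(),
            nn.Linear(64, emb))
        self.decoder = nn.Sequential(                  # joint embedding -> 64x64 frame
            nn.Linear(2 * emb, 128 * 8 * 8), nn.ReLU(),
            nn.Unflatten(1, (128, 8, 8)),
            nn.ConvTranspose2d(128, 64, 4, 2, 1), nn.ReLU(),
            nn.ConvTranspose2d(64, 32, 4, 2, 1), nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, 2, 1), nn.Sigmoid())

    def forward(self, face, audio):
        z = torch.cat([self.face_enc(face), self.audio_enc(audio)], dim=1)
        return self.decoder(z)

# Example: a 64x64 identity image and 20 frames of 13-dim audio features.
frame = TalkingFaceGenerator()(torch.randn(1, 3, 64, 64), torch.randn(1, 13, 20))
```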
no code implementations • CVPR 2017 • Joon Son Chung, Andrew Senior, Oriol Vinyals, Andrew Zisserman
The goal of this work is to recognise phrases and sentences being spoken by a talking face, with or without the audio.
Ranked #4 on Lipreading on GRID corpus (mixed-speech) (using extra training data)
no code implementations • 6 Aug 2016 • Joon Son Chung, Andrew Zisserman
The goal of this work is to recognise and localise short temporal signals in image time series, where strong supervision is not available for training.