Search Results for author: Tom Ko

Found 27 papers, 16 papers with code

MetaMix: Improved Meta-Learning with Interpolation-based Consistency Regularization

no code implementations29 Sep 2020 Yangbin Chen, Yun Ma, Tom Ko, Jian-Ping Wang, Qing Li

MetaMix can be integrated with any of the MAML-based algorithms and learn the decision boundaries generalizing better to new tasks.

Few-Shot Learning Transfer Learning
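The interpolation-based consistency regularization behind MetaMix can be illustrated with a mixup-style operation on labelled examples. The sketch below is a minimal, generic mixup (Beta-distributed interpolation of inputs and labels), not the paper's exact MetaMix rule inside the MAML inner/outer loop; all names are illustrative.

```python
import numpy as np

def mixup(x1, y1, x2, y2, alpha=2.0, rng=None):
    """Interpolate two labelled examples (mixup-style).

    `alpha` parametrizes the Beta distribution the mixing coefficient is
    drawn from. Illustrative sketch only; MetaMix applies interpolation
    within meta-learning episodes rather than to arbitrary pairs.
    """
    rng = rng or np.random.default_rng(0)
    lam = rng.beta(alpha, alpha)          # mixing coefficient in (0, 1)
    x = lam * x1 + (1.0 - lam) * x2       # interpolated input
    y = lam * y1 + (1.0 - lam) * y2       # interpolated (soft) label
    return x, y, lam

# Toy usage: mix two one-hot labelled feature vectors.
x_a, y_a = np.ones(4), np.array([1.0, 0.0])
x_b, y_b = np.zeros(4), np.array([0.0, 1.0])
x_mix, y_mix, lam = mixup(x_a, y_a, x_b, y_b)
```

Training on such interpolated pairs encourages decision boundaries that behave linearly between examples, which is the consistency effect the abstract refers to.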

AutoSpeech 2020: The Second Automated Machine Learning Challenge for Speech Classification

no code implementations25 Oct 2020 Jingsong Wang, Tom Ko, Zhen Xu, Xiawei Guo, Souxiang Liu, Wei-Wei Tu, Lei Xie

The AutoSpeech challenge calls for automated machine learning (AutoML) solutions to automate the process of applying machine learning to speech processing tasks.

AutoML BIG-bench Machine Learning +1

Auto-KWS 2021 Challenge: Task, Datasets, and Baselines

1 code implementation31 Mar 2021 Jingsong Wang, Yuxuan He, Chunyu Zhao, Qijie Shao, Wei-Wei Tu, Tom Ko, Hung-Yi Lee, Lei Xie

Auto-KWS 2021 challenge calls for automated machine learning (AutoML) solutions to automate the process of applying machine learning to a customized keyword spotting task.

AutoML BIG-bench Machine Learning +1

Exploring Machine Speech Chain for Domain Adaptation and Few-Shot Speaker Adaptation

no code implementations8 Apr 2021 Fengpeng Yue, Yan Deng, Lei He, Tom Ko

Machine Speech Chain, which integrates both end-to-end (E2E) automatic speech recognition (ASR) and text-to-speech (TTS) into one circle for joint training, has been proven to be effective in data augmentation by leveraging large amounts of unpaired data.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +3

CL4AC: A Contrastive Loss for Audio Captioning

2 code implementations21 Jul 2021 Xubo Liu, Qiushi Huang, Xinhao Mei, Tom Ko, H Lilian Tang, Mark D. Plumbley, Wenwu Wang

Automated audio captioning (AAC) is a cross-modal translation task that aims to use natural language to describe the content of an audio clip.

Audio captioning Translation
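A contrastive loss for a cross-modal task like audio captioning typically pulls paired audio/text embeddings together and pushes mismatched pairs apart. The following is a symmetric InfoNCE-style sketch in NumPy to illustrate that general idea; CL4AC's actual objective and architecture may differ, and all names here are assumptions.

```python
import numpy as np

def contrastive_loss(audio_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE-style loss over N paired embeddings.

    Row i of `audio_emb` is assumed to match row i of `text_emb`
    (the positives sit on the diagonal of the similarity matrix).
    """
    # L2-normalize so the dot product is cosine similarity.
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = a @ t.T / temperature        # (N, N) similarity logits
    labels = np.arange(len(a))

    def xent(lg):
        # Numerically stable cross-entropy against the diagonal targets.
        lg = lg - lg.max(axis=1, keepdims=True)
        logp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    # Average the audio-to-text and text-to-audio directions.
    return 0.5 * (xent(logits) + xent(logits.T))

# Aligned pairs should score a lower loss than mismatched pairs.
rng = np.random.default_rng(0)
emb = rng.normal(size=(8, 16))
aligned = contrastive_loss(emb, emb)
shuffled = contrastive_loss(emb, emb[::-1])
```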

Multi-View Self-Attention Based Transformer for Speaker Recognition

no code implementations11 Oct 2021 Rui Wang, Junyi Ao, Long Zhou, Shujie Liu, Zhihua Wei, Tom Ko, Qing Li, Yu Zhang

In this work, we propose a novel multi-view self-attention mechanism and present an empirical study of different Transformer variants with or without the proposed attention mechanism for speaker recognition.

Speaker Recognition

SpeechT5: Unified-Modal Encoder-Decoder Pre-Training for Spoken Language Processing

3 code implementations ACL 2022 Junyi Ao, Rui Wang, Long Zhou, Chengyi Wang, Shuo Ren, Yu Wu, Shujie Liu, Tom Ko, Qing Li, Yu Zhang, Zhihua Wei, Yao Qian, Jinyu Li, Furu Wei

Motivated by the success of T5 (Text-To-Text Transfer Transformer) in pre-trained natural language processing models, we propose a unified-modal SpeechT5 framework that explores the encoder-decoder pre-training for self-supervised speech/text representation learning.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +7

LightHuBERT: Lightweight and Configurable Speech Representation Learning with Once-for-All Hidden-Unit BERT

1 code implementation29 Mar 2022 Rui Wang, Qibing Bai, Junyi Ao, Long Zhou, Zhixiang Xiong, Zhihua Wei, Yu Zhang, Tom Ko, Haizhou Li

LightHuBERT outperforms the original HuBERT on ASR and five SUPERB tasks at the same model size as HuBERT, achieves performance comparable to the teacher model on most tasks with 29% fewer parameters, and obtains a $3.5\times$ compression ratio on three SUPERB tasks, e.g., automatic speaker verification, keyword spotting, and intent classification, with a slight accuracy loss.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +6

GigaST: A 10,000-hour Pseudo Speech Translation Corpus

1 code implementation8 Apr 2022 Rong Ye, Chengqi Zhao, Tom Ko, Chutong Meng, Tao Wang, Mingxuan Wang, Jun Cao

The training set is translated by a strong machine translation system, while the test set is translated by humans.

Machine Translation Translation

A Study of Modeling Rising Intonation in Cantonese Neural Speech Synthesis

no code implementations3 Aug 2022 Qibing Bai, Tom Ko, Yu Zhang

In human speech, the attitude of a speaker cannot be fully expressed only by the textual content.

Speech Synthesis

Personalized Dialogue Generation with Persona-Adaptive Attention

1 code implementation27 Oct 2022 Qiushi Huang, Yu Zhang, Tom Ko, Xubo Liu, Bo Wu, Wenwu Wang, Lilian Tang

Persona-based dialogue systems aim to generate consistent responses based on historical context and predefined persona.

Dialogue Generation

Leveraging per Image-Token Consistency for Vision-Language Pre-training

no code implementations CVPR 2023 Yunhao Gou, Tom Ko, Hansi Yang, James Kwok, Yu Zhang, Mingxuan Wang

(2) Under-utilization of the unmasked tokens: CMLM primarily focuses on the masked token but it cannot simultaneously leverage other tokens to learn vision-language associations.

Language Modelling Masked Language Modeling +1

WavCaps: A ChatGPT-Assisted Weakly-Labelled Audio Captioning Dataset for Audio-Language Multimodal Research

3 code implementations30 Mar 2023 Xinhao Mei, Chutong Meng, Haohe Liu, Qiuqiang Kong, Tom Ko, Chengqi Zhao, Mark D. Plumbley, Yuexian Zou, Wenwu Wang

To address this data scarcity issue, we introduce WavCaps, the first large-scale weakly-labelled audio captioning dataset, comprising approximately 400k audio clips with paired captions.

 Ranked #1 on Zero-Shot Environment Sound Classification on ESC-50 (using extra training data)

Audio captioning Event Detection +6

DUB: Discrete Unit Back-translation for Speech Translation

1 code implementation19 May 2023 Dong Zhang, Rong Ye, Tom Ko, Mingxuan Wang, Yaqian Zhou

The key point is to bridge the modality gap between speech and text so that useful MT techniques can be applied to ST.

Machine Translation Speech-to-Text Translation +1

CTC-based Non-autoregressive Speech Translation

1 code implementation27 May 2023 Chen Xu, Xiaoqian Liu, Xiaowen Liu, Qingxuan Sun, Yuhao Zhang, Murun Yang, Qianqian Dong, Tom Ko, Mingxuan Wang, Tong Xiao, Anxiang Ma, Jingbo Zhu

Combining end-to-end speech translation (ST) and non-autoregressive (NAR) generation is promising in language and speech processing for their advantages of less error propagation and low latency.

Translation

MOSPC: MOS Prediction Based on Pairwise Comparison

no code implementations18 Jun 2023 Kexin Wang, Yunlong Zhao, Qianqian Dong, Tom Ko, Mingxuan Wang

Our framework also surpasses the strong baseline in ranking accuracy on each fine-grained segment.

Recent Advances in Direct Speech-to-text Translation

no code implementations20 Jun 2023 Chen Xu, Rong Ye, Qianqian Dong, Chengqi Zhao, Tom Ko, Mingxuan Wang, Tong Xiao, Jingbo Zhu

Recently, speech-to-text translation has attracted more and more attention and many studies have emerged rapidly.

Data Augmentation Knowledge Distillation +2

RepCodec: A Speech Representation Codec for Speech Tokenization

1 code implementation31 Aug 2023 Zhichao Huang, Chutong Meng, Tom Ko

To improve the performance of these discrete speech tokens, we present RepCodec, a novel speech representation codec for semantic speech tokenization.

Language Modelling Quantization
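The core of turning continuous speech representations into discrete tokens is a learned codebook lookup. The sketch below shows only the quantization step, nearest-neighbour assignment against a fixed codebook, as a minimal illustration; RepCodec itself trains an encoder, a vector quantizer, and a decoder end-to-end, and the names here are hypothetical.

```python
import numpy as np

def quantize(representations, codebook):
    """Assign each representation frame to its nearest codebook entry.

    representations: (T, D) continuous frames, e.g. from a speech encoder.
    codebook: (K, D) learned code vectors.
    Returns discrete token ids (T,) and the quantized frames (T, D).
    """
    # Squared L2 distance from every frame to every codebook entry.
    d = ((representations[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    ids = d.argmin(axis=1)            # one discrete token per frame
    quantized = codebook[ids]         # frames snapped to their codes
    return ids, quantized

# Toy usage: frames near known codebook rows recover those rows' ids.
rng = np.random.default_rng(1)
codebook = rng.normal(size=(32, 8))
frames = codebook[[3, 7, 7, 0]] + 0.01 * rng.normal(size=(4, 8))
ids, quantized = quantize(frames, codebook)
```

The resulting id sequence is what a downstream speech language model consumes in place of the continuous features.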

Speech Translation with Large Language Models: An Industrial Practice

no code implementations21 Dec 2023 Zhichao Huang, Rong Ye, Tom Ko, Qianqian Dong, Shanbo Cheng, Mingxuan Wang, Hang Li

Given the great success of large language models (LLMs) across various tasks, in this paper, we introduce LLM-ST, a novel and effective speech translation model constructed upon a pre-trained LLM.

Language Modelling Large Language Model +1
