Search Results for author: Yu Wu

Found 79 papers, 30 papers with code

Why does Self-Supervised Learning for Speech Recognition Benefit Speaker Recognition?

no code implementations • 27 Apr 2022 • Sanyuan Chen, Yu Wu, Chengyi Wang, Shujie Liu, Zhuo Chen, Peidong Wang, Gang Liu, Jinyu Li, Jian Wu, Xiangzhan Yu, Furu Wei

Recently, self-supervised learning (SSL) has demonstrated strong performance in speaker recognition, even if the pre-training objective is designed for speech recognition.

Self-Supervised Learning, Speaker Recognition, +2

Ultra Fast Speech Separation Model with Teacher Student Learning

no code implementations • 27 Apr 2022 • Sanyuan Chen, Yu Wu, Zhuo Chen, Jian Wu, Takuya Yoshioka, Shujie Liu, Jinyu Li, Xiangzhan Yu

In this paper, an ultra-fast speech separation Transformer model is proposed to achieve both better performance and efficiency with teacher-student learning (T-S learning).

Speech Separation
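Teacher-student learning of the kind described above trains a small student model to match a large teacher's soft outputs instead of hard labels. A minimal, dependency-free sketch of that idea (the "teacher" and "student" here are toy stand-ins, not the paper's models):

```python
import math

# Hypothetical stand-ins, not the paper's code: a "teacher" with a fixed
# gain of 2.0 and a one-parameter "student" trained to mimic it.
def teacher_output(x):
    return [math.tanh(2.0 * v) for v in x]

def student_output(x, w):
    return [math.tanh(w * v) for v in x]

def distillation_loss(x, w):
    # The student matches the teacher's soft outputs (MSE) rather than
    # ground-truth labels -- the core idea of T-S learning.
    t, s = teacher_output(x), student_output(x, w)
    return sum((si - ti) ** 2 for si, ti in zip(s, t)) / len(x)

def sgd_step(x, w, lr=0.5, eps=1e-5):
    # Gradient descent with a finite-difference gradient,
    # to keep the sketch free of framework dependencies.
    g = (distillation_loss(x, w + eps) - distillation_loss(x, w - eps)) / (2 * eps)
    return w - lr * g

x = [0.1, -0.3, 0.7]
w = 0.0
for _ in range(300):
    w = sgd_step(x, w)
# w now approximates the teacher's gain of 2.0
```

In practice the student is a smaller Transformer and the loss is computed on separated-signal outputs or intermediate representations, but the matching objective is the same shape.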

A collaborative decomposition-based evolutionary algorithm integrating normal and penalty-based boundary intersection for many-objective optimization

no code implementations • 14 Apr 2022 • Yu Wu, Jianle Wei, Weiqin Ying, Yanqi Lan, Zhen Cui, Zhenyu Wang

On the other hand, the parallel reference lines of parallel decomposition methods, including normal boundary intersection (NBI), may yield poor diversity because of under-sampling near the boundaries for MaOPs with concave frontiers.

Quantized GAN for Complex Music Generation from Dance Videos

1 code implementation • 1 Apr 2022 • Ye Zhu, Kyle Olszewski, Yu Wu, Panos Achlioptas, Menglei Chai, Yan Yan, Sergey Tulyakov

We present Dance2Music-GAN (D2M-GAN), a novel adversarial multi-modal framework that generates complex musical samples conditioned on dance videos.

Music Generation

Streaming Speaker-Attributed ASR with Token-Level Speaker Embeddings

no code implementations • 30 Mar 2022 • Naoyuki Kanda, Jian Wu, Yu Wu, Xiong Xiao, Zhong Meng, Xiaofei Wang, Yashesh Gaur, Zhuo Chen, Jinyu Li, Takuya Yoshioka

The proposed speaker embedding, named t-vector, is extracted synchronously with the t-SOT ASR model, enabling joint execution of speaker identification (SID) or speaker diarization (SD) with the multi-talker transcription with low latency.

Automatic Speech Recognition, Speaker Diarization, +1

Streaming Multi-Talker ASR with Token-Level Serialized Output Training

no code implementations • 2 Feb 2022 • Naoyuki Kanda, Jian Wu, Yu Wu, Xiong Xiao, Zhong Meng, Xiaofei Wang, Yashesh Gaur, Zhuo Chen, Jinyu Li, Takuya Yoshioka

This paper proposes token-level serialized output training (t-SOT), a novel framework for streaming multi-talker automatic speech recognition (ASR).

Automatic Speech Recognition

Multi-query Video Retrieval

1 code implementation • 10 Jan 2022 • Zeyu Wang, Yu Wu, Karthik Narasimhan, Olga Russakovsky

In this paper, we focus on the less-studied setting of multi-query video retrieval, where multiple queries are provided to the model for searching over the video archive.

Video Retrieval

Self-Supervised Learning for speech recognition with Intermediate layer supervision

1 code implementation • 16 Dec 2021 • Chengyi Wang, Yu Wu, Sanyuan Chen, Shujie Liu, Jinyu Li, Yao Qian, Zhenglu Yang

Recently, pioneering work finds that speech pre-trained models can solve full-stack speech processing tasks, because the model utilizes bottom layers to learn speaker-related information and top layers to encode content-related information.

Self-Supervised Learning, Speech Recognition

SpeechT5: Unified-Modal Encoder-Decoder Pre-Training for Spoken Language Processing

no code implementations ACL 2022 Junyi Ao, Rui Wang, Long Zhou, Chengyi Wang, Shuo Ren, Yu Wu, Shujie Liu, Tom Ko, Qing Li, Yu Zhang, Zhihua Wei, Yao Qian, Jinyu Li, Furu Wei

Motivated by the success of T5 (Text-To-Text Transfer Transformer) in pre-trained natural language processing models, we propose a unified-modal SpeechT5 framework that explores the encoder-decoder pre-training for self-supervised speech/text representation learning.

Automatic Speech Recognition, Quantization, +5

Wav2vec-Switch: Contrastive Learning from Original-noisy Speech Pairs for Robust Speech Recognition

no code implementations • 11 Oct 2021 • Yiming Wang, Jinyu Li, Heming Wang, Yao Qian, Chengyi Wang, Yu Wu

In this paper we propose wav2vec-Switch, a method to encode noise robustness into contextualized representations of speech via contrastive learning.

Automatic Speech Recognition, Contrastive Learning, +4
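The noise-robustness-via-contrastive-learning idea can be illustrated with a generic InfoNCE objective: a noisy view of an utterance should score its clean counterpart higher than distractors. All vectors below are toy stand-ins, not wav2vec features, and the function names are hypothetical:

```python
import math, random

def cosine(a, b):
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return sum(x * y for x, y in zip(a, b)) / (na * nb)

def info_nce(anchor, positive, negatives, temp=0.1):
    # InfoNCE: -log softmax of the positive among all candidates.
    # The anchor (noisy view) should pick out its clean counterpart.
    logits = [cosine(anchor, positive) / temp] + [
        cosine(anchor, n) / temp for n in negatives
    ]
    m = max(logits)  # stabilize the log-sum-exp
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[0]

random.seed(0)
clean = [random.gauss(0, 1) for _ in range(8)]
noisy = [c + random.gauss(0, 0.1) for c in clean]  # lightly perturbed view
distractors = [[random.gauss(0, 1) for _ in range(8)] for _ in range(5)]

loss = info_nce(noisy, clean, distractors)  # small: views agree
```

In wav2vec-style training the anchor and positive come from a learned encoder over original/noisy speech pairs rather than raw vectors, but the loss has this form.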

Cell2State: Learning Cell State Representations From Barcoded Single-Cell Gene-Expression Transitions

no code implementations • 29 Sep 2021 • Yu Wu, Joseph Chahn Kim, Chengzhuo Ni, Le Cong, Mengdi Wang

Genetic barcoding coupled with single-cell sequencing technology enables direct measurement of cell-to-cell transitions and gene-expression evolution over a long timespan.

Dimensionality Reduction

Contrastive Video-Language Segmentation

no code implementations • 29 Sep 2021 • Chen Liang, Yawei Luo, Yu Wu, Yi Yang

We focus on the problem of segmenting a particular object referred to by a natural language sentence in video content, which hinges on formulating a precise vision-language relation.

Contrastive Learning

Knowledge Enhanced Fine-Tuning for Better Handling Unseen Entities in Dialogue Generation

1 code implementation EMNLP 2021 Leyang Cui, Yu Wu, Shujie Liu, Yue Zhang

To deal with this problem, instead of introducing knowledge base as the input, we force the model to learn a better semantic representation by predicting the information in the knowledge base, only based on the input context.

Dialogue Generation

Detecting Speaker Personas from Conversational Texts

1 code implementation EMNLP 2021 Jia-Chen Gu, Zhen-Hua Ling, Yu Wu, Quan Liu, Zhigang Chen, Xiaodan Zhu

This is a many-to-many semantic matching task because both contexts and personas in SPD are composed of multiple sentences.

Flexible Clustered Federated Learning for Client-Level Data Distribution Shift

1 code implementation • 22 Aug 2021 • Moming Duan, Duo Liu, Xinyuan Ji, Yu Wu, Liang Liang, Xianzhang Chen, Yujuan Tan

Federated Learning (FL) enables multiple participating devices to collaboratively contribute to a global neural network model while keeping the training data local.

Federated Learning

UniSpeech at scale: An Empirical Study of Pre-training Method on Large-Scale Speech Recognition Dataset

no code implementations • 12 Jul 2021 • Chengyi Wang, Yu Wu, Shujie Liu, Jinyu Li, Yao Qian, Kenichi Kumatani, Furu Wei

Recently, there has been great interest in self-supervised learning (SSL), where the model is pre-trained on large-scale unlabeled data and then fine-tuned on a small labeled dataset.

Self-Supervised Learning, Speech Recognition

Investigation of Practical Aspects of Single Channel Speech Separation for ASR

no code implementations • 5 Jul 2021 • Jian Wu, Zhuo Chen, Sanyuan Chen, Yu Wu, Takuya Yoshioka, Naoyuki Kanda, Shujie Liu, Jinyu Li

Speech separation has been successfully applied as a frontend processing module of conversation transcription systems thanks to its ability to handle overlapped speech and its flexibility to combine with downstream tasks such as automatic speech recognition (ASR).

Automatic Speech Recognition, Model Compression, +1

Saying the Unseen: Video Descriptions via Dialog Agents

1 code implementation • 26 Jun 2021 • Ye Zhu, Yu Wu, Yi Yang, Yan Yan

Current vision and language tasks usually take complete visual data (e.g., raw images or videos) as input; however, practical scenarios often involve situations where part of the visual information becomes inaccessible for various reasons, e.g., a restricted view from a fixed camera or an intentional vision block for security concerns.

Transfer Learning

VSPW: A Large-scale Dataset for Video Scene Parsing in the Wild

no code implementations CVPR 2021 Jiaxu Miao, Yunchao Wei, Yu Wu, Chen Liang, Guangrui Li, Yi Yang

To the best of our knowledge, our VSPW is the first attempt to tackle the challenging video scene parsing task in the wild by considering diverse scenarios.

Frame, Scene Parsing

Exploring Heterogeneous Clues for Weakly-Supervised Audio-Visual Video Parsing

no code implementations CVPR 2021 Yu Wu, Yi Yang

Previous works take the overall event labels to supervise both audio and visual model predictions.

Contrastive Learning

Minimum Word Error Rate Training with Language Model Fusion for End-to-End Speech Recognition

no code implementations • 4 Jun 2021 • Zhong Meng, Yu Wu, Naoyuki Kanda, Liang Lu, Xie Chen, Guoli Ye, Eric Sun, Jinyu Li, Yifan Gong

In this work, we perform LM fusion in the minimum WER (MWER) training of an E2E model to obviate the need for LM weights tuning during inference.

Speech Recognition

Template-Based Named Entity Recognition Using BART

1 code implementation Findings (ACL) 2021 Leyang Cui, Yu Wu, Jian Liu, Sen yang, Yue Zhang

To address the issue, we propose a template-based method for NER, treating NER as a language model ranking problem in a sequence-to-sequence framework, where original sentences and statement templates filled by candidate named entity span are regarded as the source sequence and the target sequence, respectively.

few-shot-ner, Few-shot NER, +3

Large-Scale Pre-Training of End-to-End Multi-Talker ASR for Meeting Transcription with Single Distant Microphone

no code implementations • 31 Mar 2021 • Naoyuki Kanda, Guoli Ye, Yu Wu, Yashesh Gaur, Xiaofei Wang, Zhong Meng, Zhuo Chen, Takuya Yoshioka

Transcribing meetings containing overlapped speech with only a single distant microphone (SDM) has been one of the most challenging problems for automatic speech recognition (ASR).

Automatic Speech Recognition

ClawCraneNet: Leveraging Object-level Relation for Text-based Video Segmentation

no code implementations • 19 Mar 2021 • Chen Liang, Yu Wu, Yawei Luo, Yi Yang

Text-based video segmentation is a challenging task that segments out the natural language referred objects in videos.

Referring Expression Segmentation, Video Segmentation, +2

UniSpeech: Unified Speech Representation Learning with Labeled and Unlabeled Data

2 code implementations • 19 Jan 2021 • Chengyi Wang, Yu Wu, Yao Qian, Kenichi Kumatani, Shujie Liu, Furu Wei, Michael Zeng, Xuedong Huang

In this paper, we propose a unified pre-training approach called UniSpeech to learn speech representations with both unlabeled and labeled data, in which supervised phonetic CTC learning and phonetically-aware contrastive self-supervised learning are conducted in a multi-task learning manner.

Multi-Task Learning, Representation Learning, +2

Learning to Anticipate Egocentric Actions by Imagination

no code implementations • 13 Jan 2021 • Yu Wu, Linchao Zhu, Xiaohan Wang, Yi Yang, Fei Wu

We further improve ImagineRNN by residual anticipation, i.e., changing its target to predicting the feature difference of adjacent frames instead of the frame content.

Action Anticipation, Autonomous Driving, +2

Connection-Adaptive Meta-Learning

no code implementations • 1 Jan 2021 • Yadong Ding, Yu Wu, Chengyue Huang, Siliang Tang, Yi Yang, Yueting Zhuang

In this paper, we aim to obtain better meta-learners by co-optimizing the architecture and meta-weights simultaneously.

Meta-Learning

Formality Style Transfer with Shared Latent Space

1 code implementation COLING 2020 Yunli Wang, Yu Wu, Lili Mou, Zhoujun Li, WenHan Chao

Conventional approaches for formality style transfer borrow models from neural machine translation, which typically requires massive parallel data for training.

Machine Translation, Style Transfer, +1

Don't shoot butterfly with rifles: Multi-channel Continuous Speech Separation with Early Exit Transformer

1 code implementation • 23 Oct 2020 • Sanyuan Chen, Yu Wu, Zhuo Chen, Takuya Yoshioka, Shujie Liu, Jinyu Li

With its strong modeling capacity that comes from a multi-head and multi-layer structure, Transformer is a very powerful model for learning a sequential representation and has been successfully applied to speech separation recently.

Speech Separation

Developing Real-time Streaming Transformer Transducer for Speech Recognition on Large-scale Dataset

no code implementations • 22 Oct 2020 • Xie Chen, Yu Wu, Zhenghao Wang, Shujie Liu, Jinyu Li

Recently, Transformer based end-to-end models have achieved great success in many areas including speech recognition.

Speech Recognition

Describing Unseen Videos via Multi-Modal Cooperative Dialog Agents

1 code implementation ECCV 2020 Ye Zhu, Yu Wu, Yi Yang, Yan Yan

With rising concerns about AI systems being provided with direct access to abundant sensitive information, researchers seek to develop more reliable AI with implicit information sources.

Video Description

Continuous Speech Separation with Conformer

1 code implementation • 13 Aug 2020 • Sanyuan Chen, Yu Wu, Zhuo Chen, Jian Wu, Jinyu Li, Takuya Yoshioka, Chengyi Wang, Shujie Liu, Ming Zhou

Continuous speech separation plays a vital role in complicated speech related tasks such as conversation transcription.

Speech Separation

On Commonsense Cues in BERT for Solving Commonsense Tasks

no code implementations Findings (ACL) 2021 Leyang Cui, Sijie Cheng, Yu Wu, Yue Zhang

We quantitatively investigate the presence of structural commonsense cues in BERT when solving commonsense tasks, and the importance of such cues for the model prediction.

Sentiment Analysis

A Retrieve-and-Rewrite Initialization Method for Unsupervised Machine Translation

1 code implementation ACL 2020 Shuo Ren, Yu Wu, Shujie Liu, Ming Zhou, Shuai Ma

The commonly used framework for unsupervised machine translation builds initial translation models of both translation directions, and then performs iterative back-translation to jointly boost their translation performance.

Translation, Unsupervised Machine Translation

Abnormal activity capture from passenger flow of elevator based on unsupervised learning and fine-grained multi-label recognition

no code implementations • 29 Jun 2020 • Chunhua Jia, Wenhai Yi, Yu Wu, Hui Huang, Lei Zhang, Leilei Wu

We present a workflow that aims to capture residents' abnormal activities through the elevator passenger flow in multi-storey residential buildings.

Anomaly Detection, Instance Segmentation, +1

On the Comparison of Popular End-to-End Models for Large Scale Speech Recognition

1 code implementation • 28 May 2020 • Jinyu Li, Yu Wu, Yashesh Gaur, Chengyi Wang, Rui Zhao, Shujie Liu

Among all three E2E models, transformer-AED achieved the best accuracy in both streaming and non-streaming modes.

Automatic Speech Recognition

Curriculum Pre-training for End-to-End Speech Translation

no code implementations ACL 2020 Chengyi Wang, Yu Wu, Shujie Liu, Ming Zhou, Zhenglu Yang

End-to-end speech translation poses a heavy burden on the encoder, because it has to transcribe, understand, and learn cross-lingual semantics simultaneously.

Speech Recognition, Translation

MuTual: A Dataset for Multi-Turn Dialogue Reasoning

1 code implementation ACL 2020 Leyang Cui, Yu Wu, Shujie Liu, Yue Zhang, Ming Zhou

Non-task oriented dialogue systems have achieved great success in recent years due to largely accessible conversation data and the development of deep learning techniques.

Task-Oriented Dialogue Systems

Low Latency End-to-End Streaming Speech Recognition with a Scout Network

no code implementations • 23 Mar 2020 • Chengyi Wang, Yu Wu, Shujie Liu, Jinyu Li, Liang Lu, Guoli Ye, Ming Zhou

The attention-based Transformer model has achieved promising results for speech recognition (SR) in the offline mode.

Audio and Speech Processing

Symbiotic Attention with Privileged Information for Egocentric Action Recognition

no code implementations • 8 Feb 2020 • Xiaohan Wang, Yu Wu, Linchao Zhu, Yi Yang

Due to the large action vocabulary in egocentric video datasets, recent studies usually utilize a two-branch structure for action recognition, i.e., one branch for verb classification and the other for noun classification.

Action Recognition, Egocentric Activity Recognition, +3

Semantic Mask for Transformer based End-to-End Speech Recognition

1 code implementation • 6 Dec 2019 • Chengyi Wang, Yu Wu, Yujiao Du, Jinyu Li, Shujie Liu, Liang Lu, Shuo Ren, Guoli Ye, Sheng Zhao, Ming Zhou

Attention-based encoder-decoder model has achieved impressive results for both automatic speech recognition (ASR) and text-to-speech (TTS) tasks.

Automatic Speech Recognition

Harnessing Pre-Trained Neural Networks with Rules for Formality Style Transfer

no code implementations IJCNLP 2019 Yunli Wang, Yu Wu, Lili Mou, Zhoujun Li, WenHan Chao

Formality text style transfer plays an important role in various NLP applications, such as non-native speaker assistants and child education.

Style Transfer, Text Style Transfer

Dual Attention Matching for Audio-Visual Event Localization

no code implementations ICCV 2019 Yu Wu, Linchao Zhu, Yan Yan, Yi Yang

The duration of these segments is usually short, so the visual and acoustic features of each segment may not be well aligned.

audio-visual event localization

Gated Channel Transformation for Visual Recognition

3 code implementations CVPR 2020 Zongxin Yang, Linchao Zhu, Yu Wu, Yi Yang

This lightweight layer incorporates a simple l2 normalization, making our transformation unit applicable at the operator level without adding many parameters.

General Classification, Image Classification, +4

Bridging the Gap between Pre-Training and Fine-Tuning for End-to-End Speech Translation

no code implementations • 17 Sep 2019 • Chengyi Wang, Yu Wu, Shujie Liu, Zhenglu Yang, Ming Zhou

End-to-end speech translation, a hot topic in recent years, aims to translate a segment of audio into a specific language with an end-to-end model.

Multi-Task Learning, Translation

Explicit Cross-lingual Pre-training for Unsupervised Machine Translation

no code implementations IJCNLP 2019 Shuo Ren, Yu Wu, Shujie Liu, Ming Zhou, Shuai Ma

Pre-training has proven to be effective in unsupervised machine translation due to its ability to model deep context information in cross-lingual scenarios.

Language Modelling, Translation, +1

Cascaded Revision Network for Novel Object Captioning

1 code implementation • 6 Aug 2019 • Qianyu Feng, Yu Wu, Hehe Fan, Chenggang Yan, Yi Yang

By this novel cascaded captioning-revising mechanism, CRN can accurately describe images with unseen objects.

Image Captioning, Object Detection

Baidu-UTS Submission to the EPIC-Kitchens Action Recognition Challenge 2019

no code implementations • 22 Jun 2019 • Xiaohan Wang, Yu Wu, Linchao Zhu, Yi Yang

In this report, we present the Baidu-UTS submission to the EPIC-Kitchens Action Recognition Challenge in CVPR 2019.

Action Recognition, Object Detection

Revisiting EmbodiedQA: A Simple Baseline and Beyond

no code implementations • 8 Apr 2019 • Yu Wu, Lu Jiang, Yi Yang

In this paper, we empirically study this problem and introduce 1) a simple yet effective baseline that achieves promising performance; and 2) an easier, practical setting for EmbodiedQA in which an agent has a chance to adapt the trained model to a new environment before it actually answers users' questions.

Embodied Question Answering, Question Answering

Text Morphing

no code implementations • 30 Sep 2018 • Shaohan Huang, Yu Wu, Furu Wei, Ming Zhou

In this paper, we introduce a novel natural language generation task, termed text morphing, which aims to generate intermediate sentences that are fluent and form a smooth transition between the two input sentences.

Text Generation

Neural Melody Composition from Lyrics

no code implementations • 12 Sep 2018 • Hangbo Bao, Shaohan Huang, Furu Wei, Lei Cui, Yu Wu, Chuanqi Tan, Songhao Piao, Ming Zhou

In this paper, we study a novel task that learns to compose music from natural language.

Towards Explainable and Controllable Open Domain Dialogue Generation with Dialogue Acts

no code implementations • 19 Jul 2018 • Can Xu, Wei Wu, Yu Wu

We study open domain dialogue generation with dialogue acts designed to explain how people engage in social chat.

Dialogue Generation, reinforcement-learning, +1

Dictionary-Guided Editing Networks for Paraphrase Generation

no code implementations • 21 Jun 2018 • Shaohan Huang, Yu Wu, Furu Wei, Ming Zhou

An intuitive way for a human to write paraphrase sentences is to replace words or phrases in the original sentence with their corresponding synonyms and make necessary changes to ensure the new sentences are fluent and grammatically correct.

Paraphrase Generation

Response Generation by Context-aware Prototype Editing

3 code implementations • 19 Jun 2018 • Yu Wu, Furu Wei, Shaohan Huang, Yunli Wang, Zhoujun Li, Ming Zhou

Open domain response generation has achieved remarkable progress in recent years, but sometimes yields short and uninformative responses.

Informativeness, Response Generation

Learning Matching Models with Weak Supervision for Response Selection in Retrieval-based Chatbots

no code implementations ACL 2018 Yu Wu, Wei Wu, Zhoujun Li, Ming Zhou

We propose a method that can leverage unlabeled data to learn a matching model for response selection in retrieval-based chatbots.

Decoupled Novel Object Captioner

1 code implementation • 11 Apr 2018 • Yu Wu, Linchao Zhu, Lu Jiang, Yi Yang

Thus, the sequence model can be decoupled from the novel object descriptions.

Image Captioning

Towards Interpretable Chit-chat: Open Domain Dialogue Generation with Dialogue Acts

no code implementations ICLR 2018 Wei Wu, Can Xu, Yu Wu, Zhoujun Li

Conventional methods model open domain dialogue generation as a black box through end-to-end learning from large scale conversation data.

Dialogue Generation, Response Generation

A Sequential Matching Framework for Multi-turn Response Selection in Retrieval-based Chatbots

no code implementations CL 2019 Yu Wu, Wei Wu, Chen Xing, Can Xu, Zhoujun Li, Ming Zhou

The task requires matching a response candidate with a conversation context, whose challenges include how to recognize important parts of the context, and how to model the relationships among utterances in the context.

Non-Convex Weighted Lp Nuclear Norm based ADMM Framework for Image Restoration

no code implementations • 24 Apr 2017 • Zhiyuan Zha, Xinggan Zhang, Yu Wu, Qiong Wang, Lan Tang

Since the matrix formed by nonlocal similar patches in a natural image is of low rank, the nuclear norm minimization (NNM) has been widely used in various image processing studies.

Compressive Sensing, Deblurring, +3
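The weighted Lp relaxation of NNM studied in this line of work is commonly written in the following standard form from the low-rank literature (the notation is generic, not necessarily the paper's exact formulation):

```latex
\min_{X} \; \tfrac{1}{2}\,\|Y - X\|_F^2 \;+\; \sum_{i} w_i \,\sigma_i(X)^p, \qquad 0 < p \le 1,
```

where Y stacks the nonlocal similar patches, σᵢ(X) are the singular values of X, and wᵢ are non-negative weights; setting p = 1 with wᵢ = 1 recovers standard nuclear norm minimization.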

Non-Convex Weighted Lp Minimization based Group Sparse Representation Framework for Image Denoising

no code implementations • 5 Apr 2017 • Qiong Wang, Xinggan Zhang, Yu Wu, Lan Tang, Zhiyuan Zha

Nonlocal image representation or group sparsity has attracted considerable interest in various low-level vision tasks and has led to several state-of-the-art image denoising techniques, such as BM3D, LSSC.

Image Denoising

Hierarchical Recurrent Attention Network for Response Generation

1 code implementation • 25 Jan 2017 • Chen Xing, Wei Wu, Yu Wu, Ming Zhou, YaLou Huang, Wei-Ying Ma

With the word level attention, hidden vectors of a word level encoder are synthesized as utterance vectors and fed to an utterance level encoder to construct hidden representations of the context.

Response Generation

Sequential Matching Network: A New Architecture for Multi-turn Response Selection in Retrieval-based Chatbots

3 code implementations ACL 2017 Yu Wu, Wei Wu, Chen Xing, Ming Zhou, Zhoujun Li

Existing work either concatenates utterances in context or matches a response with a highly abstract context vector finally, which may lose relationships among utterances or important contextual information.

Conversational Response Selection

Knowledge Enhanced Hybrid Neural Network for Text Matching

no code implementations • 15 Nov 2016 • Yu Wu, Wei Wu, Zhoujun Li, Ming Zhou

Long texts pose a major challenge for semantic matching because of their complicated semantic and syntactic structures.

Question Answering, Text Matching

Detecting Context Dependent Messages in a Conversational Environment

no code implementations COLING 2016 Chaozhuo Li, Yu Wu, Wei Wu, Chen Xing, Zhoujun Li, Ming Zhou

While automatic response generation for building chatbot systems has drawn a lot of attention recently, there is limited understanding of when we need to consider the linguistic context of an input text in the generation process.

Chatbot, Response Generation

Topic Aware Neural Response Generation

1 code implementation • 21 Jun 2016 • Chen Xing, Wei Wu, Yu Wu, Jie Liu, YaLou Huang, Ming Zhou, Wei-Ying Ma

We consider incorporating topic information into the sequence-to-sequence framework to generate informative and interesting responses for chatbots.

Response Generation

Response Selection with Topic Clues for Retrieval-based Chatbots

1 code implementation • 30 Apr 2016 • Yu Wu, Wei Wu, Zhoujun Li, Ming Zhou

The message vector, the response vector, and the two topic vectors are fed to neural tensors to calculate a matching score.
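The "neural tensor" scoring mentioned here typically follows the neural tensor network form; this is the generic formulation, and the paper's exact parameterization may differ. Given a message vector m and a response vector r:

```latex
s(m, r) = u^{\top} f\!\left( m^{\top} T^{[1:k]}\, r \;+\; W \begin{bmatrix} m \\ r \end{bmatrix} + b \right)
```

where T^{[1:k]} is a learned tensor whose k slices each contribute one bilinear term m^T T^{[i]} r, f is a nonlinearity such as tanh, and u, W, b are learned parameters; the topic vectors enter as additional inputs scored the same way.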
