Search Results for author: Kai Yu

Found 143 papers, 51 papers with code

Semi-supervised Learning for Code-Switching ASR with Large Language Model Filter

no code implementations5 Jul 2024 Yu Xi, Wen Ding, Kai Yu, Junjie Lai

Code-switching (CS) phenomenon occurs when words or phrases from different languages are alternated in a single sentence.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +4

On the Effectiveness of Acoustic BPE in Decoder-Only TTS

no code implementations4 Jul 2024 Bohan Li, Feiyu Shen, Yiwei Guo, Shuai Wang, Xie Chen, Kai Yu

Discretizing speech into tokens and generating them by a decoder-only model have been a promising direction for text-to-speech (TTS) and spoken language modeling (SLM).

Decoder Diversity +1

IBSEN: Director-Actor Agent Collaboration for Controllable and Interactive Drama Script Generation

1 code implementation1 Jul 2024 Senyu Han, Lu Chen, Li-Min Lin, Zhengshan Xu, Kai Yu

To evaluate the framework, we create a novel drama plot that involves several actor agents and check the interactions between them under the instruction of the director agent.

Language Modelling

Text-aware Speech Separation for Multi-talker Keyword Spotting

1 code implementation18 Jun 2024 Haoyu Li, Baochen Yang, Yu Xi, Linfeng Yu, Tian Tan, Hao Li, Kai Yu

TPDT-SS shows remarkable success in addressing permutation problems in mixed keyword speech, thereby greatly boosting the performance of the backend.

Keyword Spotting Speech Separation

GigaSpeech 2: An Evolving, Large-Scale and Multi-domain ASR Corpus for Low-Resource Languages with Automated Crawling, Transcription and Refinement

1 code implementation17 Jun 2024 Yifan Yang, Zheshu Song, Jianheng Zhuo, Mingyu Cui, Jinpeng Li, Bo Yang, Yexing Du, Ziyang Ma, Xunying Liu, Ziyuan Wang, Ke Li, Shuai Fan, Kai Yu, Wei-Qiang Zhang, Guoguo Chen, Xie Chen

Notably, ASR models trained on GigaSpeech 2 can reduce the word error rate for Thai, Indonesian, and Vietnamese on our challenging and realistic YouTube test set by 25% to 40% compared to the Whisper large-v3 model, with merely 10% model parameters.

speech-recognition Speech Recognition

Evolving Subnetwork Training for Large Language Models

no code implementations11 Jun 2024 Hanqi Li, Lu Chen, Da Ma, Zijian Wu, Su Zhu, Kai Yu

In this paper, inspired by the redundancy in the parameters of large language models, we propose a novel training paradigm: Evolving Subnetwork Training (EST).

Language Modelling Large Language Model

Sparsity-Accelerated Training for Large Language Models

no code implementations3 Jun 2024 Da Ma, Lu Chen, Pengyu Wang, Hongshen Xu, Hanqi Li, Liangtai Sun, Su Zhu, Shuai Fan, Kai Yu

Large language models (LLMs) have demonstrated proficiency across various natural language processing (NLP) tasks but often require additional training, such as continual pre-training and supervised fine-tuning.

Disentangling Foreground and Background Motion for Enhanced Realism in Human Video Generation

no code implementations26 May 2024 Jinlin Liu, Kai Yu, Mengyang Feng, Xiefan Guo, Miaomiao Cui

Training on real-world videos enhanced with this innovative motion depiction approach, our model generates videos exhibiting coherent movement in both foreground subjects and their surrounding contexts.

Video Generation

Performance Analysis of Uplink/Downlink Decoupled Access in Cellular-V2X Networks

no code implementations10 May 2024 Luofang Jiao, Kai Yu, Jiacheng Chen, Tingting Liu, Haibo Zhou, Lin Cai

This paper firstly develops an analytical framework to investigate the performance of uplink (UL) / downlink (DL) decoupled access in cellular vehicle-to-everything (C-V2X) networks, in which a vehicle's UL/DL can be connected to different macro/small base stations (MBSs/SBSs) separately.

AniTalker: Animate Vivid and Diverse Talking Faces through Identity-Decoupled Facial Motion Encoding

1 code implementation6 May 2024 Tao Liu, Feilong Chen, Shuai Fan, Chenpeng Du, Qi Chen, Xie Chen, Kai Yu

The paper introduces AniTalker, an innovative framework designed to generate lifelike talking faces from a single portrait.

Metric Learning Self-Supervised Learning

CoE-SQL: In-Context Learning for Multi-Turn Text-to-SQL with Chain-of-Editions

1 code implementation4 May 2024 Hanchong Zhang, Ruisheng Cao, Hongshen Xu, Lu Chen, Kai Yu

Recently, Large Language Models (LLMs) have been demonstrated to possess impressive capabilities in a variety of domains and tasks.

In-Context Learning Text-To-SQL

Attention-Constrained Inference for Robust Decoder-Only Text-to-Speech

no code implementations30 Apr 2024 Hankun Wang, Chenpeng Du, Yiwei Guo, Shuai Wang, Xie Chen, Kai Yu

We call the attention maps of those heads Alignment-Emerged Attention Maps (AEAMs).

Decoder

StoryTTS: A Highly Expressive Text-to-Speech Dataset with Rich Textual Expressiveness Annotations

no code implementations23 Apr 2024 Sen Liu, Yiwei Guo, Xie Chen, Kai Yu

While acoustic expressiveness has long been studied in expressive text-to-speech (ETTS), the inherent expressiveness in text lacks sufficient attention, especially for ETTS of artistic works.

The X-LANCE Technical Report for Interspeech 2024 Speech Processing Using Discrete Speech Unit Challenge

no code implementations9 Apr 2024 Yiwei Guo, Chenrun Wang, Yifan Yang, Hankun Wang, Ziyang Ma, Chenpeng Du, Shuai Wang, Hanzheng Li, Shuai Fan, HUI ZHANG, Xie Chen, Kai Yu

Discrete speech tokens have been more and more popular in multiple speech processing fields, including automatic speech recognition (ASR), text-to-speech (TTS) and singing voice synthesis (SVS).

Automatic Speech Recognition Automatic Speech Recognition (ASR) +2

Cell-Free Multi-User MIMO Equalization via In-Context Learning

1 code implementation8 Apr 2024 Matteo Zecchin, Kai Yu, Osvaldo Simeone

In this work, we demonstrate that ICL can be also used to tackle the problem of multi-user equalization in cell-free MIMO systems with limited fronthaul capacity.

In-Context Learning

Multilingual Brain Surgeon: Large Language Models Can be Compressed Leaving No Language Behind

1 code implementation6 Apr 2024 Hongchuan Zeng, Hongshen Xu, Lu Chen, Kai Yu

MBS overcomes the English-centric limitations of existing methods by sampling calibration data from various languages proportionally to the language distribution of the model training datasets.

Model Compression

Rejection Improves Reliability: Training LLMs to Refuse Unknown Questions Using RL from Knowledge Feedback

no code implementations27 Mar 2024 Hongshen Xu, Zichen Zhu, Situo Zhang, Da Ma, Shuai Fan, Lu Chen, Kai Yu

Large Language Models (LLMs) often generate erroneous outputs, known as hallucinations, due to their limitations in discerning questions beyond their knowledge scope.

Hallucination

TDT-KWS: Fast And Accurate Keyword Spotting Using Token-and-duration Transducer

no code implementations20 Mar 2024 Yu Xi, Hao Li, Baochen Yang, Haoyu Li, Hainan Xu, Kai Yu

Designing an efficient keyword spotting (KWS) system that delivers exceptional performance on resource-constrained edge devices has long been a subject of significant attention.

Keyword Spotting

ChatCite: LLM Agent with Human Workflow Guidance for Comparative Literature Summary

no code implementations5 Mar 2024 Yutong Li, Lu Chen, Aiwei Liu, Kai Yu, Lijie Wen

In this work, we firstly focus on the independent literature summarization step and introduce ChatCite, an LLM agent with human workflow guidance for comparative literature summary.

Retrieval

Hierarchical Multimodal Pre-training for Visually Rich Webpage Understanding

1 code implementation28 Feb 2024 Hongshen Xu, Lu Chen, Zihan Zhao, Da Ma, Ruisheng Cao, Zichen Zhu, Kai Yu

Additionally, we propose several pre-training tasks to model the interaction among text, structure, and image modalities effectively.

document understanding Information Retrieval +1

A BiRGAT Model for Multi-intent Spoken Language Understanding with Hierarchical Semantic Frames

1 code implementation28 Feb 2024 Hongshen Xu, Ruisheng Cao, Su Zhu, Sheng Jiang, Hanchong Zhang, Lu Chen, Kai Yu

Previous work on spoken language understanding (SLU) mainly focuses on single-intent settings, where each input utterance merely contains one user intent.

Decoder Graph Attention +1

Is Cognition and Action Consistent or Not: Investigating Large Language Model's Personality

no code implementations22 Feb 2024 Yiming Ai, Zhiwei He, Ziyin Zhang, Wenhong Zhu, Hongkun Hao, Kai Yu, Lingjun Chen, Rui Wang

In this study, we investigate the reliability of Large Language Models (LLMs) in professing human-like personality traits through responses to personality questionnaires.

MULTI: Multimodal Understanding Leaderboard with Text and Images

no code implementations5 Feb 2024 Zichen Zhu, Yang Xu, Lu Chen, Jingkai Yang, Yichuan Ma, Yiming Sun, Hailin Wen, Jiaqi Liu, Jinyu Cai, Yingzi Ma, Situo Zhang, Zihan Zhao, Liangtai Sun, Kai Yu

Rapid progress in multimodal large language models (MLLMs) highlights the need to introduce challenging yet realistic benchmarks to the academic community, while existing benchmarks primarily focus on understanding simple natural images and short context.

In-Context Learning

VALL-T: Decoder-Only Generative Transducer for Robust and Decoding-Controllable Text-to-Speech

no code implementations25 Jan 2024 Chenpeng Du, Yiwei Guo, Hankun Wang, Yifan Yang, Zhikang Niu, Shuai Wang, HUI ZHANG, Xie Chen, Kai Yu

Recent TTS models with decoder-only Transformer architecture, such as SPEAR-TTS and VALL-E, achieve impressive naturalness and demonstrate the ability for zero-shot adaptation given a speech prompt.

Decoder Hallucination

Contrastive Learning With Audio Discrimination For Customizable Keyword Spotting In Continuous Speech

no code implementations12 Jan 2024 Yu Xi, Baochen Yang, Hao Li, Jiaqi Guo, Kai Yu

Furthermore, experiments on the continuous speech dataset LibriSpeech demonstrate that, by incorporating audio discrimination, CLAD achieves significant performance gain over CL without audio discrimination.

Contrastive Learning Keyword Spotting +1

DiffusionGAN3D: Boosting Text-guided 3D Generation and Domain Adaptation by Combining 3D GANs and Diffusion Priors

no code implementations CVPR 2024 Biwen Lei, Kai Yu, Mengyang Feng, Miaomiao Cui, Xuansong Xie

Extensive experiments demonstrate that the proposed framework achieves excellent results in both domain adaptation and text-to-avatar tasks, outperforming existing methods in terms of generation quality and efficiency.

3D Generation Domain Adaptation

SEF-VC: Speaker Embedding Free Zero-Shot Voice Conversion with Cross Attention

no code implementations14 Dec 2023 Junjie Li, Yiwei Guo, Xie Chen, Kai Yu

Zero-shot voice conversion (VC) aims to transfer the source speaker timbre to arbitrary unseen target speaker timbre, while keeping the linguistic content unchanged.

Position Voice Conversion

DreaMoving: A Human Video Generation Framework based on Diffusion Models

no code implementations8 Dec 2023 Mengyang Feng, Jinlin Liu, Kai Yu, Yuan YAO, Zheng Hui, Xiefan Guo, Xianhui Lin, Haolan Xue, Chen Shi, Xiaowen Li, Aojie Li, Xiaoyang Kang, Biwen Lei, Miaomiao Cui, Peiran Ren, Xuansong Xie

In this paper, we present DreaMoving, a diffusion-based controllable video generation framework to produce high-quality customized human videos.

Video Generation

Boosting3D: High-Fidelity Image-to-3D by Boosting 2D Diffusion Prior to 3D Prior with Progressive Learning

no code implementations22 Nov 2023 Kai Yu, Jinlin Liu, Mengyang Feng, Miaomiao Cui, Xuansong Xie

After the progressive training, the LoRA learns the 3D information of the generated object and eventually turns to an object-level 3D prior.

3D Generation Image to 3D +1

In-Context Learning for MIMO Equalization Using Transformer-Based Sequence Models

1 code implementation10 Nov 2023 Matteo Zecchin, Kai Yu, Osvaldo Simeone

In ICL, a decision on a new input is made via a direct mapping of the input and of a few examples from the given task, serving as the task's context, to the output variable.

In-Context Learning Meta-Learning +1

DiffDub: Person-generic Visual Dubbing Using Inpainting Renderer with Diffusion Auto-encoder

1 code implementation3 Nov 2023 Tao Liu, Chenpeng Du, Shuai Fan, Feilong Chen, Kai Yu

Our rigorous experiments comprehensively highlight that our ground-breaking approach outpaces existing methods with considerable margins and delivers seamless, intelligible videos in person-generic and multilingual scenarios.

Talking Face Generation Talking Head Generation

Expressive TTS Driven by Natural Language Prompts Using Few Human Annotations

no code implementations2 Nov 2023 Hanglei Zhang, Yiwei Guo, Sen Liu, Xie Chen, Kai Yu

The LLM selects the best-matching style references from annotated utterances based on external style prompts, which can be raw input text or natural language style descriptions.

Language Modelling Large Language Model +1

ASTormer: An AST Structure-aware Transformer Decoder for Text-to-SQL

no code implementations28 Oct 2023 Ruisheng Cao, Hanchong Zhang, Hongshen Xu, Jieyu Li, Da Ma, Lu Chen, Kai Yu

Text-to-SQL aims to generate an executable SQL program given the user utterance and the corresponding database schema.

Decoder Text-To-SQL

ACT-SQL: In-Context Learning for Text-to-SQL with Automatically-Generated Chain-of-Thought

1 code implementation26 Oct 2023 Hanchong Zhang, Ruisheng Cao, Lu Chen, Hongshen Xu, Kai Yu

Recently Large Language Models (LLMs) have been proven to have strong abilities in various domains and tasks.

In-Context Learning Text-To-SQL

Towards Universal Speech Discrete Tokens: A Case Study for ASR and TTS

1 code implementation14 Sep 2023 Yifan Yang, Feiyu Shen, Chenpeng Du, Ziyang Ma, Kai Yu, Daniel Povey, Xie Chen

Self-supervised learning (SSL) proficiency in speech-related tasks has driven research into utilizing discrete tokens for speech tasks like recognition and translation, which offer lower storage requirements and great potential to employ natural language processing techniques.

Self-Supervised Learning speech-recognition +2

VoiceFlow: Efficient Text-to-Speech with Rectified Flow Matching

no code implementations10 Sep 2023 Yiwei Guo, Chenpeng Du, Ziyang Ma, Xie Chen, Kai Yu

Although diffusion models in text-to-speech have become a popular choice due to their strong generative ability, the intrinsic complexity of sampling from diffusion models harms their efficiency.

Diverse Data Augmentation with Diffusions for Effective Test-time Prompt Tuning

1 code implementation ICCV 2023 Chun-Mei Feng, Kai Yu, Yong liu, Salman Khan, WangMeng Zuo

In this paper, we focus on a particular setting of learning adaptive prompts on the fly for each test sample from an unseen new domain, which is known as test-time prompt tuning (TPT).

Data Augmentation

DSE-TTS: Dual Speaker Embedding for Cross-Lingual Text-to-Speech

no code implementations25 Jun 2023 Sen Liu, Yiwei Guo, Chenpeng Du, Xie Chen, Kai Yu

Although high-fidelity speech can be obtained for intralingual speech synthesis, cross-lingual text-to-speech (CTTS) is still far from satisfactory as it is difficult to accurately retain the speaker timbres(i. e. speaker similarity) and eliminate the accents from their first language(i. e. nativeness).

Speech Synthesis

Improving Audio Caption Fluency with Automatic Error Correction

no code implementations16 Jun 2023 Hanxue Zhang, Zeyu Xie, Xuenan Xu, Mengyue Wu, Kai Yu

Automated audio captioning (AAC) is an important cross-modality translation task, aiming at generating descriptions for audio clips.

Audio captioning Sentence

Improving Code-Switching and Named Entity Recognition in ASR with Speech Editing based Data Augmentation

no code implementations14 Jun 2023 Zheng Liang, Zheshu Song, Ziyang Ma, Chenpeng Du, Kai Yu, Xie Chen

Recently, end-to-end (E2E) automatic speech recognition (ASR) models have made great strides and exhibit excellent performance in general speech recognition.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +5

Large Language Models Are Semi-Parametric Reinforcement Learning Agents

1 code implementation NeurIPS 2023 Danyang Zhang, Lu Chen, Situo Zhang, Hongshen Xu, Zihan Zhao, Kai Yu

By equipping the LLM with a long-term experience memory, REMEMBERER is capable of exploiting the experiences from the past episodes even for different task goals, which excels an LLM-based agent with fixed exemplars or equipped with a transient working memory.

Language Modelling Large Language Model +1

CSS: A Large-scale Cross-schema Chinese Text-to-SQL Medical Dataset

1 code implementation25 May 2023 Hanchong Zhang, Jieyu Li, Lu Chen, Ruisheng Cao, Yunyan Zhang, Yu Huang, Yefeng Zheng, Kai Yu

Furthermore, we present CSS, a large-scale CrosS-Schema Chinese text-to-SQL dataset, to carry on corresponding studies.

Benchmarking Text-To-SQL

PointGPT: Auto-regressively Generative Pre-training from Point Clouds

1 code implementation NeurIPS 2023 Guangyan Chen, Meiling Wang, Yi Yang, Kai Yu, Li Yuan, Yufeng Yue

Large language models (LLMs) based on the generative pre-training transformer (GPT) have demonstrated remarkable effectiveness across a diverse range of downstream tasks.

Decoder Few-Shot 3D Point Cloud Classification +1

Mobile-Env: Building Qualified Evaluation Benchmarks for LLM-GUI Interaction

2 code implementations14 May 2023 Danyang Zhang, Zhennan Shen, Rui Xie, Situo Zhang, Tianbao Xie, Zihan Zhao, Siyuan Chen, Lu Chen, Hongshen Xu, Ruisheng Cao, Kai Yu

The Graphical User Interface (GUI) is pivotal for human interaction with the digital world, enabling efficient device control and the completion of complex tasks.

Language Modelling

DiffVoice: Text-to-Speech with Latent Diffusion

no code implementations23 Apr 2023 Zhijun Liu, Yiwei Guo, Kai Yu

In this work, we present DiffVoice, a novel text-to-speech model based on latent diffusion.

DAE-Talker: High Fidelity Speech-Driven Talking Face Generation with Diffusion Autoencoder

no code implementations30 Mar 2023 Chenpeng Du, Qi Chen, Xie Chen, Kai Yu

Additionally, we propose a novel method for generating continuous video frames with the DDIM image decoder trained on individual frames, eliminating the need for modelling the joint distribution of consecutive frames directly.

Decoder Talking Face Generation

Reliable Federated Disentangling Network for Non-IID Domain Feature

2 code implementations30 Jan 2023 Meng Wang, Kai Yu, Chun-Mei Feng, Yiming Qian, Ke Zou, Lianyu Wang, Rick Siow Mong Goh, Yong liu, Huazhu Fu

To the best of our knowledge, our proposed RFedDis is the first work to develop an FL approach based on evidential uncertainty combined with feature disentangling, which enhances the performance and reliability of FL in non-IID domain features.

Federated Learning

On the Structural Generalization in Text-to-SQL

no code implementations12 Jan 2023 Jieyu Li, Lu Chen, Ruisheng Cao, Su Zhu, Hongshen Xu, Zhi Chen, Hanchong Zhang, Kai Yu

Exploring the generalization of a text-to-SQL parser is essential for a system to automatically adapt the real-world databases.

Diversity Text-To-SQL

Spectral Efficiency Analysis of Uplink-Downlink Decoupled Access in C-V2X Networks

1 code implementation5 Dec 2022 Luofang Jiao, Kai Yu, Yunting Xu, Tianqi Zhang, Haibo Zhou, Xuemin, Shen

The uplink (UL)/downlink (DL) decoupled access has been emerging as a novel access architecture to improve the performance gains in cellular networks.

Spectral Efficiency Analysis of Uplink-Downlink Decoupled Access in C-V2X Networks

Reliable Joint Segmentation of Retinal Edema Lesions in OCT Images

no code implementations1 Dec 2022 Meng Wang, Kai Yu, Chun-Mei Feng, Ke Zou, Yanyu Xu, Qingquan Meng, Rick Siow Mong Goh, Yong liu, Huazhu Fu

Specifically, aiming at improving the model's ability to learn the complex pathological features of retinal edema lesions in OCT images, we develop a novel segmentation backbone that integrates a wavelet-enhanced feature extractor network and a multi-scale transformer module of our newly designed.

Segmentation

EmoDiff: Intensity Controllable Emotional Text-to-Speech with Soft-Label Guidance

no code implementations17 Nov 2022 Yiwei Guo, Chenpeng Du, Xie Chen, Kai Yu

Specifically, instead of being guided with a one-hot vector for the specified emotion, EmoDiff is guided with a soft label where the value of the specified emotion and \textit{Neutral} is set to $\alpha$ and $1-\alpha$ respectively.

Denoising

BER: Balanced Error Rate For Speaker Diarization

2 code implementations8 Nov 2022 Tao Liu, Kai Yu

DER is the primary metric to evaluate diarization performance while facing a dilemma: the errors in short utterances or segments tend to be overwhelmed by longer ones.

speaker-diarization Speaker Diarization

D4: a Chinese Dialogue Dataset for Depression-Diagnosis-Oriented Chat

no code implementations24 May 2022 Binwei Yao, Chao Shi, Likai Zou, Lingfeng Dai, Mengyue Wu, Lu Chen, Zhen Wang, Kai Yu

In a depression-diagnosis-directed clinical session, doctors initiate a conversation with ample emotional support that guides the patients to expose their symptoms based on clinical diagnosis criteria.

Response Generation

META-GUI: Towards Multi-modal Conversational Agents on Mobile GUI

no code implementations23 May 2022 Liangtai Sun, Xingyu Chen, Lu Chen, Tianle Dai, Zichen Zhu, Kai Yu

However, this API-based architecture greatly limits the information-searching capability of intelligent assistants and may even lead to task failure if TOD-specific APIs are not available or the task is too complicated to be executed by the provided APIs.

Scheduling

Climate and Weather: Inspecting Depression Detection via Emotion Recognition

no code implementations29 Apr 2022 Wen Wu, Mengyue Wu, Kai Yu

Automatic depression detection has attracted increasing amount of attention but remains a challenging task.

Depression Detection Emotion Recognition

VQTTS: High-Fidelity Text-to-Speech Synthesis with Self-Supervised VQ Acoustic Feature

no code implementations2 Apr 2022 Chenpeng Du, Yiwei Guo, Xie Chen, Kai Yu

The mainstream neural text-to-speech(TTS) pipeline is a cascade system, including an acoustic model(AM) that predicts acoustic feature from the input transcript and a vocoder that generates waveform according to the given acoustic feature.

Speech Synthesis Text-To-Speech Synthesis

Audio-text Retrieval in Context

no code implementations25 Mar 2022 Siyu Lou, Xuenan Xu, Mengyue Wu, Kai Yu

Using pre-trained audio features and a descriptor-based aggregation method, we build our contextual audio-text retrieval system.

AudioCaps Text Retrieval

Unsupervised word-level prosody tagging for controllable speech synthesis

no code implementations15 Feb 2022 Yiwei Guo, Chenpeng Du, Kai Yu

Although word-level prosody modeling in neural text-to-speech (TTS) has been investigated in recent research for diverse speech synthesis, it is still challenging to control speech synthesis manually without a specific reference.

Speech Synthesis

Few-Shot NLU with Vector Projection Distance and Abstract Triangular CRF

no code implementations9 Dec 2021 Su Zhu, Lu Chen, Ruisheng Cao, Zhi Chen, Qingliang Miao, Kai Yu

In this paper, we propose to improve prototypical networks with vector projection distance and abstract triangular Conditional Random Field (CRF) for the few-shot NLU.

intent-classification Intent Classification +5

Exploring Separable Attention for Multi-Contrast MR Image Super-Resolution

1 code implementation3 Sep 2021 Chun-Mei Feng, Yunlu Yan, Kai Yu, Yong Xu, Ling Shao, Huazhu Fu

Our SANet could explore the areas of high-intensity and low-intensity regions in the "forward" and "reverse" directions with the help of the auxiliary contrast, while learning clearer anatomical structure and edge information for the SR of a target-contrast MR image.

Image Super-Resolution

THE SJTU SYSTEM FOR DCASE2021 CHALLENGE TASK 6: AUDIO CAPTIONING BASED ON ENCODER PRE-TRAINING AND REINFORCEMENT LEARNING

1 code implementation DCASE Challenge 2021 Xuenan Xu, Zeyu Xie, Mengyue Wu, Kai Yu

This report proposes an audio captioning system for the Detection and Classification of Acoustic Scenes and Events (DCASE) 2021 challenge task Task 6.

Ranked #4 on Audio captioning on Clotho (using extra training data)

Audio captioning Audio Tagging +3

Quantum Dimensionality Reduction by Linear Discriminant Analysis

no code implementations4 Mar 2021 Kai Yu, Gong-De Guo, Song Lin

In this paper, we present a quantum algorithm and a quantum circuit to efficiently perform linear discriminant analysis (LDA) for dimensionality reduction.

Dimensionality Reduction Quantum Physics

LET: Linguistic Knowledge Enhanced Graph Transformer for Chinese Short Text Matching

1 code implementation25 Feb 2021 Boer Lyu, Lu Chen, Su Zhu, Kai Yu

Additionally, we adopt the word lattice graph as input to maintain multi-granularity information.

Text Matching

Rich Prosody Diversity Modelling with Phone-level Mixture Density Network

2 code implementations1 Feb 2021 Chenpeng Du, Kai Yu

Generating natural speech with diverse and smooth prosody pattern is a challenging task.

Speech Synthesis Text-To-Speech Synthesis Sound

Towards duration robust weakly supervised sound event detection

1 code implementation19 Jan 2021 Heinrich Dinkel, Mengyue Wu, Kai Yu

Our model outperforms other approaches on the DCASE2018 and URBAN-SED datasets without requiring prior duration knowledge.

Data Augmentation Sound Event Detection Sound Audio and Speech Processing

A 3D Non-stationary MmWave Channel Model for Vacuum Tube Ultra-High-Speed Train Channels

no code implementations17 Jan 2021 YingJie Xu, Kai Yu, Li Li, Xianfu Lei, Li Hao, Cheng-Xiang Wang

As a potential development direction of future transportation, the vacuum tube ultra-high-speed train (UHST) wireless communication systems have newly different channel characteristics from existing high-speed train (HST) scenarios.

A relic sketch extraction framework based on detail-aware hierarchical deep network

no code implementations17 Jan 2021 Jinye Peng, Jiaxin Wang, Jun Wang, Erlei Zhang, Qunxi Zhang, Yongqin Zhang, Xianlin Peng, Kai Yu

For the fine extraction stage, we design a new multiscale U-Net (MSU-Net) to effectively remove disease noise and refine the sketch.

Decoder Edge Detection +1

An Investigation on Different Underlying Quantization Schemes for Pre-trained Language Models

no code implementations14 Oct 2020 Zihan Zhao, Yuncong Liu, Lu Chen, Qi Liu, Rao Ma, Kai Yu

Recently, pre-trained language models like BERT have shown promising performance on multiple natural language processing tasks.

Clustering Quantization

Dual Learning for Dialogue State Tracking

no code implementations22 Sep 2020 Zhi Chen, Lu Chen, Yanbin Zhao, Su Zhu, Kai Yu

In task-oriented multi-turn dialogue systems, dialogue state refers to a compact representation of the user goal in the context of dialogue history.

Dialogue State Tracking Sentence

Structured Hierarchical Dialogue Policy with Graph Neural Networks

no code implementations22 Sep 2020 Zhi Chen, Xiaoyuan Liu, Lu Chen, Kai Yu

A novel ComNet is proposed to model the structure of a hierarchical agent.

Deep Reinforcement Learning for On-line Dialogue State Tracking

no code implementations22 Sep 2020 Zhi Chen, Lu Chen, Xiang Zhou, Kai Yu

To the best of our knowledge, this is the first effort to optimize the DST module within DRL framework for on-line task-oriented spoken dialogue systems.

Dialogue Management Dialogue State Tracking +4

Distributed Structured Actor-Critic Reinforcement Learning for Universal Dialogue Management

no code implementations22 Sep 2020 Zhi Chen, Lu Chen, Xiaoyuan Liu, Kai Yu

The task-oriented spoken dialogue system (SDS) aims to assist a human user in accomplishing a specific task (e. g., hotel booking).

Decision Making Dialogue Management +3

CREDIT: Coarse-to-Fine Sequence Generation for Dialogue State Tracking

no code implementations22 Sep 2020 Zhi Chen, Lu Chen, Zihan Xu, Yanbin Zhao, Su Zhu, Kai Yu

In dialogue systems, a dialogue state tracker aims to accurately find a compact representation of the current dialogue status, based on the entire dialogue history.

Dialogue State Tracking

Vector Projection Network for Few-shot Slot Tagging in Natural Language Understanding

1 code implementation21 Sep 2020 Su Zhu, Ruisheng Cao, Lu Chen, Kai Yu

Few-shot slot tagging becomes appealing for rapid domain transfer and adaptation, motivated by the tremendous development of conversational dialogue systems.

Few-Shot Learning Natural Language Understanding +2

Future Vector Enhanced LSTM Language Model for LVCSR

no code implementations31 Jul 2020 Qi Liu, Yanmin Qian, Kai Yu

For the speech recognition rescoring, although the proposed LSTM LM obtains very slight gains, the new model seems obtain the great complementary with the conventional LSTM LM.

Language Modelling speech-recognition +1

An Investigation on Deep Learning with Beta Stabilizer

no code implementations31 Jul 2020 Qi Liu, Tian Tan, Kai Yu

It is concluded that beta stabilizer parameters can reduce the sensitivity of learning rate with almost the same performance on DNN with relu activation function and LSTM.

Handwriting Recognition speech-recognition +1

Jointly Encoding Word Confusion Network and Dialogue Context with BERT for Spoken Language Understanding

1 code implementation24 May 2020 Chen Liu, Su Zhu, Zijian Zhao, Ruisheng Cao, Lu Chen, Kai Yu

In this paper, a novel BERT based SLU model (WCN-BERT SLU) is proposed to encode WCNs and the dialogue context jointly.

Spoken Language Understanding

Semi-Supervised Text Simplification with Back-Translation and Asymmetric Denoising Autoencoders

no code implementations30 Apr 2020 Yanbin Zhao, Lu Chen, Zhi Chen, Kai Yu

When modeling simple and complex sentences with autoencoders, we introduce different types of noise into the training process.

Denoising Language Modelling +4

Dual Learning for Semi-Supervised Natural Language Understanding

2 code implementations26 Apr 2020 Su Zhu, Ruisheng Cao, Kai Yu

The framework is composed of dual pseudo-labeling and dual learning method, which enables an NLU model to make full use of data (labeled and unlabeled) through a closed-loop of the primal and dual tasks.

Natural Language Understanding Sentence

Voice activity detection in the wild via weakly supervised sound event detection

1 code implementation27 Mar 2020 Heinrich Dinkel, Yefei Chen, Mengyue Wu, Kai Yu

We proposed two GPVAD models, one full (GPV-F), trained on 527 Audioset sound events, and one binary (GPV-B), only distinguishing speech and noise.

Sound Audio and Speech Processing

Margin Matters: Towards More Discriminative Deep Neural Network Embeddings for Speaker Recognition

no code implementations18 Jun 2019 Xu Xiang, Shuai Wang, Houjun Huang, Yanmin Qian, Kai Yu

The proposed approach can achieve the state-of-the-art performance, with 25% ~ 30% equal error rate (EER) reduction on both tasks when compared to strong baselines using cross entropy loss with softmax, obtaining 2. 238% EER on VoxCeleb1 test set and 2. 761% EER on SITW core-core test set, respectively.

Speaker Recognition

Audio Caption in a Car Setting with a Sentence-Level Loss

1 code implementation31 May 2019 Xuenan Xu, Heinrich Dinkel, Mengyue Wu, Kai Yu

Captioning has attracted much attention in image and video understanding while a small amount of work examines audio captioning.

Audio captioning Decoder +6

AgentGraph: Towards Universal Dialogue Management with Structured Deep Reinforcement Learning

no code implementations27 May 2019 Lu Chen, Zhi Chen, Bowen Tan, Sishan Long, Milica Gasic, Kai Yu

Experiments show that AgentGraph models significantly outperform traditional reinforcement learning approaches on most of the 18 tasks of the PyDial benchmark.

Dialogue Management Management +4

A Hierarchical Decoding Model For Spoken Language Understanding From Unaligned Data

1 code implementation9 Apr 2019 Zijian Zhao, Su Zhu, Kai Yu

In the paper, we focus on spoken language understanding from unaligned data whose annotation is a set of act-slot-value triples.

Spoken Language Understanding

Duration robust sound event detection

1 code implementation8 Apr 2019 Heinrich Dinkel, Kai Yu

Task 4 of the Dcase2018 challenge demonstrated that substantially more research is needed for a real-world application of sound event detection.

Sound Audio and Speech Processing

Text-based depression detection on sparse data

1 code implementation8 Apr 2019 Heinrich Dinkel, Mengyue Wu, Kai Yu

Previous text-based depression detection is commonly based on large user-generated data.

Depression Detection Sentence +1

Audio Caption: Listen and Tell

1 code implementation25 Feb 2019 Mengyue Wu, Heinrich Dinkel, Kai Yu

A baseline encoder-decoder model is provided for both English and Mandarin.

Decoder General Classification

End-to-End Monaural Multi-speaker ASR System without Pretraining

no code implementations5 Nov 2018 Xuankai Chang, Yanmin Qian, Kai Yu, Shinji Watanabe

The experiments demonstrate that the proposed methods can improve the performance of the end-to-end model in separating the overlapping speech and recognizing the separated streams.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +2

Sequence Discriminative Training for Deep Learning based Acoustic Keyword Spotting

no code implementations2 Aug 2018 Zhehuai Chen, Yanmin Qian, Kai Yu

The few studies on sequence discriminative training for KWS are limited for fixed vocabulary or LVCSR based methods and have not been compared to the state-of-the-art deep learning based KWS approaches.

Keyword Spotting speech-recognition +1

Structured Dialogue Policy with Graph Neural Networks

no code implementations COLING 2018 Lu Chen, Bowen Tan, Sishan Long, Kai Yu

The proposed structured deep reinforcement learning is based on graph neural networks (GNN), which consists of some sub-networks, each one for a node on a directed graph.

Automatic Speech Recognition (ASR) Decision Making +5

Binarized LSTM Language Model

no code implementations NAACL 2018 Xuan Liu, Di Cao, Kai Yu

Although excellent performance is obtained for large vocabulary tasks, tremendous memory consumption prohibits the use of LSTM LM in low-resource devices.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +2

On Modular Training of Neural Acoustics-to-Word Model for LVCSR

no code implementations3 Mar 2018 Zhehuai Chen, Qi Liu, Hao Li, Kai Yu

Finally, modules are integrated into an acousticsto-word model (A2W) and jointly optimized using acoustic data to retain the advantage of sequence modeling.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +2

Affordable On-line Dialogue Policy Learning

no code implementations EMNLP 2017 Cheng Chang, Runzhe Yang, Lu Chen, Xiang Zhou, Kai Yu

The key to building an evolvable dialogue system in real-world scenarios is to ensure an affordable on-line dialogue policy learning, which requires the on-line learning process to be safe, efficient and economical.

Dialogue Management

Concept Transfer Learning for Adaptive Language Understanding

no code implementations WS 2018 Su Zhu, Kai Yu

Concept definition is important in language understanding (LU) adaptation since literal definition difference can easily lead to data sparsity even if different data sets are actually semantically correlated.

Domain Adaptation Transfer Learning

On-line Dialogue Policy Learning with Companion Teaching

no code implementations EACL 2017 Lu Chen, Runzhe Yang, Cheng Chang, Zihao Ye, Xiang Zhou, Kai Yu

On-line dialogue policy learning is the key for building evolvable conversational agent in real world scenarios.

Dialogue Management

A Large-scale Distributed Video Parsing and Evaluation Platform

no code implementations29 Nov 2016 Kai Yu, Yang Zhou, Da Li, Zhang Zhang, Kaiqi Huang

Visual surveillance systems have become one of the largest data sources of Big Visual Data in real world.

Weakly-supervised Learning of Mid-level Features for Pedestrian Attribute Recognition and Localization

no code implementations17 Nov 2016 Kai Yu, Biao Leng, Zhang Zhang, Dangwei Li, Kaiqi Huang

Based on GoogLeNet, firstly, a set of mid-level attribute features are discovered by novelly designed detection layers, where a max-pooling based weakly-supervised object detection technique is used to train these layers with only image-level labels without the need of bounding box annotations of pedestrian attributes.

Attribute Clustering +5

Encoder-decoder with Focus-mechanism for Sequence Labelling Based Spoken Language Understanding

no code implementations6 Aug 2016 Su Zhu, Kai Yu

This paper investigates the framework of encoder-decoder with attention for sequence labelling based spoken language understanding.

Decoder speech-recognition +2

Text Flow: A Unified Text Detection System in Natural Scene Images

no code implementations ICCV 2015 Shangxuan Tian, Yifeng Pan, Chang Huang, Shijian Lu, Kai Yu, Chew Lim Tan

With character candidates detected by cascade boosting, the min-cost flow network model integrates the last three sequential steps into a single process which solves the error accumulation problem at both character level and text line level effectively.

Scene Text Detection Text Detection +1

On Training Bi-directional Neural Network Language Model with Noise Contrastive Estimation

1 code implementation19 Feb 2016 Tianxing He, Yu Zhang, Jasha Droppo, Kai Yu

We propose to train bi-directional neural network language model(NNLM) with noise contrastive estimation(NCE).

Language Modelling

Recurrent Polynomial Network for Dialogue State Tracking

no code implementations14 Jul 2015 Kai Sun, Qizhe Xie, Kai Yu

Dialogue state tracking (DST) is a process to estimate the distribution of the dialogue states as a dialogue progresses.

dialog state tracking Dialogue State Tracking

Deep Multiple Instance Learning for Image Classification and Auto-Annotation

no code implementations CVPR 2015 Jiajun Wu, Yinan Yu, Chang Huang, Kai Yu

The recent development in learning deep representations has demonstrated its wide applications in traditional vision tasks like classification and detection.

Classification General Classification +3

Large Scale Strongly Supervised Ensemble Metric Learning, with Applications to Face Verification and Retrieval

1 code implementation25 Dec 2012 Chang Huang, Shenghuo Zhu, Kai Yu

Learning Mahanalobis distance metrics in a high- dimensional feature space is very difficult especially when structural sparsity and low rank are enforced to improve com- putational efficiency in testing phase.

Face Verification Metric Learning +1

Deep Coding Network

no code implementations NeurIPS 2010 Yuanqing Lin, Tong Zhang, Shenghuo Zhu, Kai Yu

This paper proposes a principled extension of the traditional single-layer flat sparse coding scheme, where a two-layer coding scheme is derived based on theoretical analysis of nonlinear functional approximation that extends recent results for local coordinate coding.

Nonlinear Learning using Local Coordinate Coding

no code implementations NeurIPS 2009 Kai Yu, Tong Zhang, Yihong Gong

This paper introduces a new method for semi-supervised learning on high dimensional nonlinear manifolds, which includes a phase of unsupervised basis learning and a phase of supervised function learning.

Stochastic Relational Models for Large-scale Dyadic Data using MCMC

no code implementations NeurIPS 2008 Shenghuo Zhu, Kai Yu, Yihong Gong

Stochastic relational models provide a rich family of choices for learning and predicting dyadic data between two sets of entities.

Bayesian Inference Collaborative Filtering

Deep Learning with Kernel Regularization for Visual Recognition

no code implementations NeurIPS 2008 Kai Yu, Wei Xu, Yihong Gong

In this paper we focus on training deep neural networks for visual recognition tasks.

Gaussian Process Models for Link Analysis and Transfer Learning

no code implementations NeurIPS 2007 Kai Yu, Wei Chu

In this paper we develop a Gaussian process (GP) framework to model a collection of reciprocal random variables defined on the \emph{edges} of a network.

Link Prediction Transfer Learning

Predictive Matrix-Variate t Models

no code implementations NeurIPS 2007 Shenghuo Zhu, Kai Yu, Yihong Gong

It is becoming increasingly important to learn from a partially-observed random matrix and predict its missing elements.

Missing Elements Model Selection

Cannot find the paper you are looking for? You can Submit a new open access paper.