Search Results for author: Kai Shen

Found 17 papers, 8 papers with code

MoonCast: High-Quality Zero-Shot Podcast Generation

1 code implementation • 18 Mar 2025 • Zeqian Ju, Dongchao Yang, Jianwei Yu, Kai Shen, Yichong Leng, Zhengtao Wang, Xu Tan, Xinyu Zhou, Tao Qin, Xiangyang Li

Recent advances in text-to-speech synthesis have achieved notable success in generating high-quality short utterances for individual speakers.

Speech Synthesis, Text to Speech, +1

The Best of Both Worlds: Integrating Language Models and Diffusion Models for Video Generation

no code implementations • 6 Mar 2025 • Aoxiong Yin, Kai Shen, Yichong Leng, Xu Tan, Xinyu Zhou, Juncheng Li, Siliang Tang

Recent advancements in text-to-video (T2V) generation have been driven by two competing paradigms: autoregressive language models and diffusion models.

Semantic Compression, Video Generation

BFS-Prover: Scalable Best-First Tree Search for LLM-based Automatic Theorem Proving

no code implementations • 5 Feb 2025 • Ran Xin, Chenguang Xi, Jie Yang, Feng Chen, Hang Wu, Xia Xiao, Yifan Sun, Shen Zheng, Kai Shen

In this paper, we investigate whether BFS can achieve competitive performance in large-scale theorem proving tasks.

Automated Theorem Proving
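
The paper above studies best-first search (BFS) over a proof tree, where the most promising open node is expanded next. As a generic, illustrative sketch only (not the BFS-Prover implementation), a priority queue keyed by a model-assigned score realizes this; the score, expand, and is_proved callables below are hypothetical placeholders for an LLM-based scorer, a tactic generator, and a proof checker.

```python
import heapq
from dataclasses import dataclass, field

@dataclass(order=True)
class Node:
    # heapq is a min-heap, so we store the negated score to pop the best node first
    neg_score: float
    state: str = field(compare=False)      # proof state (placeholder representation)
    tactics: tuple = field(compare=False)  # tactics applied so far

def best_first_search(root_state, score, expand, is_proved, budget=1000):
    """Generic best-first tree search.

    score(state)     -> float, higher is more promising (e.g. a model log-prob)
    expand(state)    -> list of (tactic, next_state) pairs
    is_proved(state) -> True when no goals remain
    """
    frontier = [Node(-score(root_state), root_state, ())]
    for _ in range(budget):
        if not frontier:
            break
        node = heapq.heappop(frontier)      # most promising open node
        if is_proved(node.state):
            return node.tactics             # proof found
        for tactic, nxt in expand(node.state):
            heapq.heappush(frontier, Node(-score(nxt), nxt, node.tactics + (tactic,)))
    return None                             # budget exhausted
```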

Ask Questions with Double Hints: Visual Question Generation with Answer-awareness and Region-reference

1 code implementation • 6 Jul 2024 • Kai Shen, Lingfei Wu, Siliang Tang, Fangli Xu, Bo Long, Yueting Zhuang, Jian Pei

The visual question generation (VQG) task aims to generate human-like questions from an image and potentially other side information (e.g., answer type).

Graph-to-Sequence, Implicit Relations, +2

T2S-GPT: Dynamic Vector Quantization for Autoregressive Sign Language Production from Text

no code implementations • 11 Jun 2024 • Aoxiong Yin, Haoyuan Li, Kai Shen, Siliang Tang, Yueting Zhuang

In this work, we propose a two-stage sign language production (SLP) paradigm that first encodes sign language sequences into discrete codes and then autoregressively generates sign language from text based on the learned codebook.

Quantization, Sign Language Production
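
The two-stage paradigm described above first maps sign language sequences to discrete codes via a learned codebook, then generates those codes autoregressively from text. Below is a minimal sketch of the codebook-lookup step with assumed shapes and codebook size (the generic vector-quantization recipe, not the paper's dynamic quantizer); the second stage would then be ordinary next-token prediction over the resulting indices.

```python
import torch

def quantize(z, codebook):
    """Map continuous features to their nearest codebook entries.

    z:        (batch, time, dim) continuous encoder outputs
    codebook: (num_codes, dim)   learned embedding table
    returns:  (indices, z_q) where indices are the discrete codes an
              autoregressive model would later predict from text.
    """
    # Euclidean distance between every frame and every code
    dist = torch.cdist(z, codebook.unsqueeze(0).expand(z.size(0), -1, -1))
    indices = dist.argmin(dim=-1)        # (batch, time) discrete codes
    z_q = codebook[indices]              # (batch, time, dim) quantized features
    return indices, z_q

# Illustrative usage with made-up sizes
z = torch.randn(2, 50, 256)              # encoder output for 2 sequences
codebook = torch.randn(1024, 256)        # 1024-entry codebook (assumed size)
codes, z_q = quantize(z, codebook)
```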

NaturalSpeech 3: Zero-Shot Speech Synthesis with Factorized Codec and Diffusion Models

no code implementations • 5 Mar 2024 • Zeqian Ju, Yuancheng Wang, Kai Shen, Xu Tan, Detai Xin, Dongchao Yang, Yanqing Liu, Yichong Leng, Kaitao Song, Siliang Tang, Zhizheng Wu, Tao Qin, Xiang-Yang Li, Wei Ye, Shikun Zhang, Jiang Bian, Lei He, Jinyu Li, Sheng Zhao

Specifically, 1) we design a neural codec with factorized vector quantization (FVQ) to disentangle speech waveform into subspaces of content, prosody, timbre, and acoustic details; 2) we propose a factorized diffusion model to generate attributes in each subspace following its corresponding prompt.

Quantization, Speech Synthesis, +1
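
Factorized vector quantization, as described above, assigns a separate quantizer to each attribute subspace (content, prosody, timbre, acoustic details). The sketch below illustrates that idea with per-subspace projections and codebooks; the layer choices, codebook sizes, and plain nearest-neighbour lookup are illustrative assumptions, not the NaturalSpeech 3 codec.

```python
import torch
import torch.nn as nn

class FactorizedVQ(nn.Module):
    """Toy factorized quantizer: one projection and one codebook per attribute subspace."""

    def __init__(self, dim=256, subspaces=("content", "prosody", "timbre", "detail"),
                 codes_per_space=512):
        super().__init__()
        self.proj = nn.ModuleDict({s: nn.Linear(dim, dim) for s in subspaces})
        self.codebooks = nn.ParameterDict(
            {s: nn.Parameter(torch.randn(codes_per_space, dim)) for s in subspaces})

    def forward(self, z):
        # z: (batch, time, dim) frame-level features from some encoder
        out = {}
        for name, proj in self.proj.items():
            zs = proj(z)                                   # project into one attribute subspace
            cb = self.codebooks[name]
            dist = torch.cdist(zs, cb.unsqueeze(0).expand(z.size(0), -1, -1))
            out[name] = dist.argmin(dim=-1)                # discrete codes for this attribute
        return out                                         # dict of (batch, time) index tensors

codes = FactorizedVQ()(torch.randn(2, 100, 256))           # e.g. codes["prosody"].shape == (2, 100)
```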

PromptTTS 2: Describing and Generating Voices with Text Prompt

no code implementations • 5 Sep 2023 • Yichong Leng, Zhifang Guo, Kai Shen, Xu Tan, Zeqian Ju, Yanqing Liu, Yufei Liu, Dongchao Yang, Leying Zhang, Kaitao Song, Lei He, Xiang-Yang Li, Sheng Zhao, Tao Qin, Jiang Bian

TTS approaches based on text prompts face two main challenges: 1) the one-to-many problem, where not every detail of voice variability can be described in the text prompt, and 2) the limited availability of text prompt datasets, since writing text prompts for speech requires vendors and costly data labeling.

Language Modelling, Large Language Model, +1

NaturalSpeech 2: Latent Diffusion Models are Natural and Zero-Shot Speech and Singing Synthesizers

2 code implementations • 18 Apr 2023 • Kai Shen, Zeqian Ju, Xu Tan, Yanqing Liu, Yichong Leng, Lei He, Tao Qin, Sheng Zhao, Jiang Bian

To enhance the zero-shot capability that is important to achieve diverse speech synthesis, we design a speech prompting mechanism to facilitate in-context learning in the diffusion model and the duration/pitch predictor.

In-Context Learning, Speech Synthesis, +1
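
One generic way to realize the speech prompting mechanism mentioned above is to let the predictor's hidden states cross-attend to features extracted from a reference-speech prompt, so an unseen voice can be imitated in context. The snippet below is an assumed illustration of that pattern, not the NaturalSpeech 2 architecture; shapes and layer sizes are made up.

```python
import torch
import torch.nn as nn

class PromptConditioning(nn.Module):
    """Toy speech-prompt conditioning: queries attend to features of a reference prompt."""

    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, hidden, prompt):
        # hidden: (batch, time, dim)       states of the predictor / diffusion backbone
        # prompt: (batch, prompt_len, dim) encoded features of the reference speech
        ctx, _ = self.attn(query=hidden, key=prompt, value=prompt)
        return hidden + ctx                  # inject speaker/style information from the prompt

h = torch.randn(2, 80, 256)
p = torch.randn(2, 120, 256)                 # a few seconds of reference speech features
out = PromptConditioning()(h, p)             # out.shape == (2, 80, 256)
```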

A Study on ReLU and Softmax in Transformer

no code implementations • 13 Feb 2023 • Kai Shen, Junliang Guo, Xu Tan, Siliang Tang, Rui Wang, Jiang Bian

This paper sheds light on the following points: 1) Softmax and ReLU use different normalization methods over elements which lead to different variances of results, and ReLU is good at dealing with a large number of key-value slots; 2) FFN and key-value memory are equivalent, and thus the Transformer can be viewed as a memory network where FFNs and self-attention networks are both key-value memories.

Document Translation
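
The first point above, that softmax and ReLU normalize over elements differently and therefore behave differently as the number of key-value slots grows, can be checked numerically. The snippet below is a small self-contained illustration (not the paper's experiments): softmax weights always sum to 1 no matter how many slots there are, whereas ReLU activations are unnormalized and their total mass grows with the slot count.

```python
import torch

torch.manual_seed(0)
q = torch.randn(256)                               # a single query vector
for n_slots in (64, 1024, 16384):                  # number of key-value slots
    k = torch.randn(n_slots, 256)
    scores = k @ q / 256 ** 0.5                    # scaled dot-product scores

    softmax_w = torch.softmax(scores, dim=0)       # normalized across all slots
    relu_w = torch.relu(scores)                    # element-wise, no normalization

    # softmax weights always sum to 1, so each weight shrinks as n grows;
    # ReLU weights keep their scale, so their total mass grows roughly linearly with n.
    print(n_slots, softmax_w.sum().item(), relu_w.sum().item())
```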

Mask the Correct Tokens: An Embarrassingly Simple Approach for Error Correction

1 code implementation • 23 Nov 2022 • Kai Shen, Yichong Leng, Xu Tan, Siliang Tang, Yuan Zhang, Wenjie Liu, Edward Lin

Since the error rate of the incorrect sentence is usually low (e.g., 10%), the correction model learns to correct only the few error tokens and trivially copies most tokens (the correct ones), which harms effective training of error correction.

Decoder, Sentence, +2
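
The observation above motivates the title: if most source tokens are already correct and trivially copyable, masking a fraction of those correct tokens forces the model to predict them instead of copying. Below is a minimal, assumed sketch of such masking on aligned token-id sequences; the mask rate, the mask id, and the same-length alignment are illustrative simplifications, not the paper's settings.

```python
import random

def mask_correct_tokens(src_ids, tgt_ids, mask_id, mask_rate=0.5):
    """Randomly replace a fraction of the *correct* source tokens with a mask id.

    src_ids: token ids of the (possibly erroneous) input sentence
    tgt_ids: token ids of the reference (corrected) sentence, same length here
    Tokens that already differ from the target (i.e. errors) are left untouched,
    so the model still sees every error it must learn to fix.
    """
    masked = []
    for s, t in zip(src_ids, tgt_ids):
        if s == t and random.random() < mask_rate:
            masked.append(mask_id)      # hide an easy-to-copy correct token
        else:
            masked.append(s)            # keep error tokens (and unmasked correct ones)
    return masked

src = [12, 7, 99, 31, 5]                # "99" is an error; the rest match the target
tgt = [12, 7, 42, 31, 5]
print(mask_correct_tokens(src, tgt, mask_id=0, mask_rate=0.5))
```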

Scenario-based Multi-product Advertising Copywriting Generation for E-Commerce

no code implementations • 21 May 2022 • Xueying Zhang, Kai Shen, Chi Zhang, Xiaochuan Fan, Yun Xiao, Zhen He, Bo Long, Lingfei Wu

In this paper, we propose an automatic Scenario-based Multi-product Advertising Copywriting Generation system (SMPACG) for E-Commerce, which has been deployed on a leading Chinese e-commerce platform.

Attribute, Language Modeling, +1

Graph Neural Networks for Natural Language Processing: A Survey

1 code implementation • 10 Jun 2021 • Lingfei Wu, Yu Chen, Kai Shen, Xiaojie Guo, Hanning Gao, Shucheng Li, Jian Pei, Bo Long

Deep learning has become the dominant approach in coping with various tasks in Natural Language Processing (NLP).

Decoder, graph construction, +2
