Search Results for author: Yuxuan Wang

Found 132 papers, 52 papers with code

Simple and Effective Graph-to-Graph Annotation Conversion

1 code implementation COLING 2022 Yuxuan Wang, Zhilin Lei, Yuqiu Ji, Wanxiang Che

Annotation conversion is an effective way to construct datasets under new annotation guidelines based on existing datasets with little human labour.

Sounding that Object: Interactive Object-Aware Image to Audio Generation

no code implementations 4 Jun 2025 Tingle Li, Baihe Huang, Xiaobin Zhuang, Dongya Jia, Jiawei Chen, Yuping Wang, Zhuo Chen, Gopala Anumanchipalli, Yuxuan Wang

Generating accurate sounds for complex audio-visual scenes is challenging, especially in the presence of multiple objects and sound sources.

Audio Generation Image Segmentation +2

MagiCodec: Simple Masked Gaussian-Injected Codec for High-Fidelity Reconstruction and Generation

1 code implementation 31 May 2025 Yakun Song, Jiawei Chen, Xiaobin Zhuang, Chenpeng Du, Ziyang Ma, Jian Wu, Jian Cong, Dongya Jia, Zhuo Chen, Yuping Wang, Yuxuan Wang, Xie Chen

However, most existing codecs are optimized primarily for reconstruction quality, often at the expense of the downstream modelability of the encoded tokens.

Language Modeling Language Modelling

Discrete Markov Bridge

1 code implementation 26 May 2025 Hengli Li, Yuxuan Wang, Song-Chun Zhu, Ying Nian Wu, Zilong Zheng

To address these limitations, we propose Discrete Markov Bridge, a novel framework specifically designed for discrete representation learning.

Representation Learning

Towards Reliable Large Audio Language Model

no code implementations 25 May 2025 Ziyang Ma, Xiquan Li, Yakun Song, Wenxi Chen, Chenpeng Du, Jian Wu, Yuanzhe Chen, Zhuo Chen, Yuping Wang, Yuxuan Wang, Xie Chen

Recent advancements in large audio language models (LALMs) have demonstrated impressive results and promising prospects in universal understanding and reasoning across speech, music, and general sound.

Language Modeling Language Modelling +1

Leveraging Large Language Models for Command Injection Vulnerability Analysis in Python: An Empirical Study on Popular Open-Source Projects

no code implementations 21 May 2025 Yuxuan Wang, Jingshu Chen, Qingyang Wang

This study evaluates the potential of large language models (LLMs), such as GPT-4, as an alternative approach for automated testing for vulnerability detection.

Vulnerability Detection

Personalize Your Gaussian: Consistent 3D Scene Personalization from a Single Image

no code implementations 20 May 2025 Yuxuan Wang, Xuanyu Yi, Qingshan Xu, Yuan Zhou, Long Chen, Hanwang Zhang

Personalizing 3D scenes from a single reference image enables intuitive user-guided editing, which requires achieving both multi-view consistency across perspectives and referential consistency with the input image.

3D Generation 3DGS +1

Seek in the Dark: Reasoning via Test-Time Instance-Level Policy Gradient in Latent Space

1 code implementation 19 May 2025 Hengli Li, Chenxi Li, Tong Wu, Xuekai Zhu, Yuxuan Wang, Zhaoxin Yu, Eric Hanchen Jiang, Song-Chun Zhu, Zixia Jia, Ying Nian Wu, Zilong Zheng

We introduce LatentSeek, a novel framework that enhances LLM reasoning through Test-Time Instance-level Adaptation (TTIA) within the model's latent space.

GSM8K Math

JAEGER: Dual-Level Humanoid Whole-Body Controller

no code implementations 10 May 2025 Ziluo Ding, Haobin Jiang, Yuxuan Wang, Zhenguo Sun, Yu Zhang, Xiaojie Niu, Ming Yang, Weishuai Zeng, Xinrun Xu, Zongqing Lu

This paper presents JAEGER, a dual-level whole-body controller for humanoid robots that addresses the challenges of training a more robust and versatile policy.

Probing and Inducing Combinational Creativity in Vision-Language Models

no code implementations 17 Apr 2025 Yongqian Peng, Yuxi Ma, Mengmeng Wang, Yuxuan Wang, Yizhou Wang, Chi Zhang, Yixin Zhu, Zilong Zheng

The ability to combine existing concepts into novel ideas stands as a fundamental hallmark of human intelligence.

Seed1.5-Thinking: Advancing Superb Reasoning Models with Reinforcement Learning

no code implementations10 Apr 2025 ByteDance Seed, :, Jiaze Chen, Tiantian Fan, Xin Liu, Lingjun Liu, Zhiqi Lin, Mingxuan Wang, Chengyi Wang, Xiangpeng Wei, Wenyuan Xu, Yufeng Yuan, Yu Yue, Lin Yan, Qiying Yu, Xiaochen Zuo, Chi Zhang, Ruofei Zhu, Zhecheng An, Zhihao Bai, Yu Bao, Xingyan Bin, Jiangjie Chen, Feng Chen, Hongmin Chen, Riwei Chen, Liangqiang Chen, Zixin Chen, Jinsong Chen, Siyan Chen, Kaiyuan Chen, Zhi Chen, Jin Chen, Jiecao Chen, Jinxin Chi, Weinan Dai, Ning Dai, Jiahui Dai, Shihan Dou, Yantao Du, Zhengyin Du, Jianhui Duan, Chen Dun, Ting-Han Fan, Jiazhan Feng, Junda Feng, Ziyuan Feng, Yuwei Fu, Wenqi Fu, Hanjie Fu, Hao Ge, Hongyi Guo, Mingji Han, Li Han, Wenhao Hao, Xintong Hao, Qianyu He, Jerry He, Feng He, Wen Heng, Zehua Hong, Qi Hou, Liang Hu, Shengding Hu, Nan Hu, Kai Hua, Qi Huang, Ziyue Huang, Hongzhi Huang, Zihao Huang, Ting Huang, Wenhao Huang, Wei Jia, Bin Jia, Xiaoying Jia, Yuhua Jiang, Haobin Jiang, Ziheng Jiang, Kaihua Jiang, Chengquan Jiang, Jianpeng Jiao, Xiaoran Jin, Xing Jin, Xunhao Lai, Xiang Li, Liyi Li, Hongkai Li, Zheng Li, Shengxian Wan, Ya Wang, Yunshui Li, Chenggang Li, Niuniu Li, Siyu Li, Xi Li, Xiao Li, Aoyan Li, Yuntao Li, Nianning Liang, Xinnian Liang, Haibin Lin, Weijian Lin, Ye Lin, Zhicheng Liu, Guanlin Liu, Chenxiao Liu, Yan Liu, Gaohong Liu, Juncai Liu, Chundian Liu, Deyi Liu, Kaibo Liu, Siyao Liu, Qi Liu, Yongfei Liu, Kang Liu, Gan Liu, Boyi Liu, Rui Long, Weiqiang Lou, Chenwei Lou, Xiang Luo, Yao Luo, Caiping Lv, Heyang Lv, Bole Ma, Qianli Ma, Hongzhi Ma, Yiyuan Ma, Jin Ma, Wenchang Ma, Tingting Ma, Chen Mao, Qiyang Min, Zhe Nan, Guanghan Ning, Jinxiang Ou, Haojie Pan, Renming Pang, Yanghua Peng, Tao Peng, Lihua Qian, Mu Qiao, Meng Qu, Cheng Ren, Hongbin Ren, Yong Shan, Wei Shen, Ke Shen, Kai Shen, Guangming Sheng, Jinlong Shi, Wenlei Shi, Guang Shi, Shuai Shuai Cao, Yuxin Song, Zuquan Song, Jing Su, Yifan Sun, Tao Sun, Zewei Sun, Borui Wan, Xiaohui Wang, Xi Wang, Shuguang Wang, Jun Wang, Qinlong Wang, Chenyuan Wang, Shuai Wang, Zihan Wang, Changbao Wang, Jiaqiang Wang, Shihang Wang, Xuwu Wang, Zaiyuan Wang, Yuxuan Wang, Wenqi Wang, Taiqing Wang, Chengzhi Wei, Houmin Wei, Ziyun Wei, Shufa Wei, Zheng Wu, Yonghui Wu, Yangjun Wu, Bohong Wu, Shuang Wu, Jingqiao Wu, Ning Wu, Shuangzhi Wu, Jianmin Wu, Chenguang Xi, Fan Xia, Yuqiao Xian, Liang Xiang, Boren Xiang, Bowen Xiao, Zhen Xiao, Xia Xiao, Yongsheng Xiao, Chao Xin, Shulin Xin, Yuwen Xiong, Jingjing Xu, Ziwen Xu, Chenyin Xu, Jiayi Xu, Yifan Xu, Wei Xu, Yufei Xu, Shikun Xu, Shipeng Yan, Shen Yan, Qingping Yang, Xi Yang, Tianhao Yang, Yuehang Yang, Yuan Yang, Ximing Yang, Zeyu Yang, Guang Yang, Yifan Yang, Xuesong Yao, Bairen Yi, Fan Yin, Jianian Yin, Ziqiang Ying, Xiangyu Yu, Hongli Yu, Song Yu, Menghan Yu, Huan Yu, Siyu Yuan, Jun Yuan, Yutao Zeng, Tianyang Zhan, Zheng Zhang, Yun Zhang, Mofan Zhang, Wang Zhang, Ru Zhang, Zhi Zhang, Tianqi Zhang, Xinyi Zhang, Zhexi Zhang, Sijun Zhang, Wenqiang Zhang, Xiangxiang Zhang, Yongtao Zhang, Yuyu Zhang, Ge Zhang, He Zhang, Yue Zhang, Renjie Zheng, Ningxin Zheng, Zhuolin Zheng, Yaowei Zheng, Chen Zheng, Xiaoyun Zhi, Wanjun Zhong, Cheng Zhong, Zheng Zhong, Baoquan Zhong, Xun Zhou, Na Zhou, Huan Zhou, Hang Zhu, Defa Zhu, Wenjia Zhu, Lei Zuo

We introduce Seed1.5-Thinking, capable of reasoning through thinking before responding, resulting in improved performance on a wide range of benchmarks.

Mixture-of-Experts reinforcement-learning +1

OmniMMI: A Comprehensive Multi-modal Interaction Benchmark in Streaming Video Contexts

no code implementations CVPR 2025 Yuxuan Wang, Yueqian Wang, Bo Chen, Tong Wu, Dongyan Zhao, Zilong Zheng

The rapid advancement of multi-modal language models (MLLMs) like GPT-4o has propelled the development of Omni language models, designed to process and proactively respond to continuous streams of multi-modal data.

Video Understanding

QualiSpeech: A Speech Quality Assessment Dataset with Natural Language Reasoning and Descriptions

no code implementations 26 Mar 2025 Siyin Wang, Wenyi Yu, Xianzhao Chen, Xiaohai Tian, Jun Zhang, Lu Lu, Yu Tsao, Junichi Yamagishi, Yuxuan Wang, Chao Zhang

To bridge this gap, we introduce QualiSpeech, a comprehensive low-level speech quality assessment dataset encompassing 11 key aspects and detailed natural language comments that include reasoning and contextual insights.

Solla: Towards a Speech-Oriented LLM That Hears Acoustic Context

1 code implementation 19 Mar 2025 Junyi Ao, Dekun Chen, Xiaohai Tian, Wenjie Feng, Jun Zhang, Lu Lu, Yuxuan Wang, Haizhou Li, Zhizheng Wu

Large Language Models (LLMs) have recently shown remarkable ability to process not only text but also multimodal inputs such as speech and audio.

Audio captioning Audio Question Answering +2

A Parallel Hybrid Action Space Reinforcement Learning Model for Real-world Adaptive Traffic Signal Control

1 code implementation 18 Mar 2025 Yuxuan Wang, Meng Long, Qiang Wu, Wei Liu, Jiatian Pi, Xinmin Yang

In this study, we introduce a parallel hybrid action space reinforcement learning model (PH-DDPG) that optimizes traffic signal phase and duration simultaneously, eliminating the need for sequential decision-making seen in traditional two-stage models.

Decision Making Sequential Decision Making +1
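The hybrid action space described in the entry above pairs a discrete choice (which signal phase to run) with a continuous value (how long to run it). The sketch below is only a generic hybrid-action actor head in PyTorch, not the PH-DDPG architecture from the paper; the state size, number of phases, and duration bounds are illustrative assumptions.

```python
import torch
import torch.nn as nn

class HybridActionActor(nn.Module):
    """Illustrative actor for a hybrid action space: pick a discrete signal
    phase and, in parallel, predict a continuous duration for every phase.
    This is a generic sketch of the idea, not the paper's PH-DDPG model."""

    def __init__(self, state_dim: int, num_phases: int, hidden: int = 128,
                 min_dur: float = 5.0, max_dur: float = 60.0):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.phase_head = nn.Linear(hidden, num_phases)      # discrete-action logits
        self.duration_head = nn.Linear(hidden, num_phases)   # one duration per phase
        self.min_dur, self.max_dur = min_dur, max_dur

    def forward(self, state: torch.Tensor):
        h = self.body(state)
        phase_logits = self.phase_head(h)
        # Squash raw outputs into a plausible green-time range (seconds).
        durations = self.min_dur + (self.max_dur - self.min_dur) * torch.sigmoid(
            self.duration_head(h))
        return phase_logits, durations

if __name__ == "__main__":
    actor = HybridActionActor(state_dim=32, num_phases=4)
    logits, durations = actor(torch.randn(1, 32))
    phase = int(logits.argmax(dim=-1))
    print(phase, float(durations[0, phase]))  # chosen phase and its duration
```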

PBR3DGen: A VLM-guided Mesh Generation with High-quality PBR Texture

no code implementations 14 Mar 2025 Xiaokang Wei, BoWen Zhang, Xianghui Yang, Yuxuan Wang, Chunchao Guo, Xi Zhao, Yan Luximon

In this work, we present PBR3DGen, a two-stage mesh generation method with high-quality PBR materials that integrates the novel multi-view PBR material estimation model and a 3D PBR mesh reconstruction model.

3D Generation

From Hours to Minutes: Lossless Acceleration of Ultra Long Sequence Generation up to 100K Tokens

1 code implementation 26 Feb 2025 Tong Wu, Junzhe Shen, Zixia Jia, Yuxuan Wang, Zilong Zheng

While traditional speculative decoding methods exist, simply extending their generation limits fails to accelerate the process and can be detrimental.

DiTAR: Diffusion Transformer Autoregressive Modeling for Speech Generation

no code implementations 6 Feb 2025 Dongya Jia, Zhuo Chen, Jiawei Chen, Chenpeng Du, Jian Wu, Jian Cong, Xiaobin Zhuang, ChuMin Li, Zhen Wei, Yuping Wang, Yuxuan Wang

Several recent studies have attempted to autoregressively generate continuous speech representations without discrete speech tokens by combining diffusion and autoregressive models, yet they often face challenges with excessive computational loads or suboptimal outcomes.

Diversity Language Modeling +1

Nautilus: Locality-aware Autoencoder for Scalable Mesh Generation

no code implementations 24 Jan 2025 Yuxuan Wang, Xuanyu Yi, Haohan Weng, Qingshan Xu, Xiaokang Wei, Xianghui Yang, Chunchao Guo, Long Chen, Hanwang Zhang

To address these challenges, we propose Nautilus, a locality-aware autoencoder for artist-like mesh generation that leverages the local properties of manifold meshes to achieve structural fidelity and efficient representation.

LongViTU: Instruction Tuning for Long-Form Video Understanding

no code implementations 9 Jan 2025 Rujie Wu, Xiaojian Ma, Hai Ci, Yue Fan, Yuxuan Wang, Haozhe Zhao, Qing Li, Yizhou Wang

Each QA pair in LongViTU features: 1) long-term context (average certificate length of 4.6 minutes); 2) rich knowledge and condensed reasoning (commonsense, causality, planning, etc.).

EgoSchema Form +2

Reasoning Mamba: Hypergraph-Guided Region Relation Calculating for Weakly Supervised Affordance Grounding

no code implementations CVPR 2025 Yuxuan Wang, Aming Wu, Muli Yang, Yukuan Min, Yihang Zhu, Cheng Deng

This paper pays attention to the Weakly Supervised Affordance Grounding (WSAG) task, which aims to train a model to identify affordance regions using human-object interaction images and egocentric images without the need for costly pixel-level annotations.

Human-Object Interaction Detection Mamba +1

Friends-MMC: A Dataset for Multi-modal Multi-party Conversation Understanding

1 code implementation 23 Dec 2024 Yueqian Wang, Xiaojun Meng, Yuxuan Wang, Jianxin Liang, Qun Liu, Dongyan Zhao

Based on this Friends-MMC dataset, we further study two fundamental MMC tasks: conversation speaker identification and conversation response prediction, both of which have the multi-party nature with the video or image as visual context.

Speaker Identification

CosyVoice 2: Scalable Streaming Speech Synthesis with Large Language Models

1 code implementation 13 Dec 2024 Zhihao Du, Yuxuan Wang, Qian Chen, Xian Shi, Xiang Lv, Tianyu Zhao, Zhifu Gao, Yexin Yang, Changfeng Gao, Hui Wang, Fan Yu, Huadai Liu, Zhengyan Sheng, Yue Gu, Chong Deng, Wen Wang, Shiliang Zhang, Zhijie Yan, Jingren Zhou

By training on a large-scale multilingual dataset, CosyVoice 2 achieves human-parity naturalness, minimal response latency, and virtually lossless synthesis quality in the streaming mode.

In-Context Learning Quantization +1

Pushing Rendering Boundaries: Hard Gaussian Splatting

no code implementations 6 Dec 2024 Qingshan Xu, Jiequan Cui, Xuanyu Yi, Yuxuan Wang, Yuan Zhou, Yew-Soon Ong, Hanwang Zhang

To address this problem, we propose Hard Gaussian Splatting, dubbed HGS, which considers multi-view significant positional gradients and rendering errors to grow hard Gaussians that fill the gaps of classical Gaussian Splatting on 3D scenes, thus achieving superior NVS results.

3DGS Novel View Synthesis

SALMONN-omni: A Codec-free LLM for Full-duplex Speech Understanding and Generation

no code implementations 27 Nov 2024 Wenyi Yu, Siyin Wang, Xiaoyu Yang, Xianzhao Chen, Xiaohai Tian, Jun Zhang, Guangzhi Sun, Lu Lu, Yuxuan Wang, Chao Zhang

Unlike traditional modularised conversational AI systems, which separate speech recognition, understanding, and text-to-speech generation into distinct components, multimodal LLMs operate as single end-to-end models.

Question Answering Speech Enhancement +4

IntrinsicVoice: Empowering LLMs with Intrinsic Real-time Voice Interaction Abilities

no code implementations 9 Oct 2024 Xin Zhang, Xiang Lyu, Zhihao Du, Qian Chen, Dong Zhang, Hangrui Hu, Chaohong Tan, Tianyu Zhao, Yuxuan Wang, Bin Zhang, Heng Lu, Yaqian Zhou, Xipeng Qiu

Current methods of building LLMs with voice interaction capabilities rely heavily on explicit text autoregressive generation before or during speech response generation to maintain content quality, which unfortunately brings computational overhead and increases latency in multi-turn interactions.

Response Generation

Metadata Matters for Time Series: Informative Forecasting with Transformers

no code implementations 4 Oct 2024 Jiaxiang Dong, Haixu Wu, Yuxuan Wang, Li Zhang, Jianmin Wang, Mingsheng Long

Further, a Transformer encoder is employed to communicate series and metadata tokens, which can extend series representations by metadata information for more accurate forecasting.

Financial Analysis Time Series +1

Enabling Auditory Large Language Models for Automatic Speech Quality Evaluation

1 code implementation 25 Sep 2024 Siyin Wang, Wenyi Yu, Yudong Yang, Changli Tang, Yixuan Li, Jimin Zhuang, Xianzhao Chen, Xiaohai Tian, Jun Zhang, Guangzhi Sun, Lu Lu, Yuxuan Wang, Chao Zhang

The results demonstrate that auditory LLMs achieve competitive performance compared to state-of-the-art task-specific small models in predicting MOS and SIM, while also delivering promising results in A/B testing and natural language descriptions.

text-to-speech Text to Speech

VideoLLaMB: Long-context Video Understanding with Recurrent Memory Bridges

no code implementations 2 Sep 2024 Yuxuan Wang, Cihang Xie, Yang Liu, Zilong Zheng

Recent advancements in large-scale video-language models have shown significant potential for real-time planning and detailed interactions.

MVBench Video Understanding

Understanding Multimodal Hallucination with Parameter-Free Representation Alignment

1 code implementation 2 Sep 2024 Yueqian Wang, Jianxin Liang, Yuxuan Wang, Huishuai Zhang, Dongyan Zhao

To analyze image representations while completely avoiding the influence of all factors other than the image representation itself, we propose a parametric-free representation alignment metric (Pfram) that can measure the similarities between any two representation systems without requiring additional training parameters.

Hallucination Object +1

ExoViP: Step-by-step Verification and Exploration with Exoskeleton Modules for Compositional Visual Reasoning

no code implementations 5 Aug 2024 Yuxuan Wang, Alan Yuille, Zhuowan Li, Zilong Zheng

Experimental results on two representative VL programming methods showcase consistent improvements on five compositional reasoning tasks on standard benchmarks.

Visual Reasoning

Deep Time Series Models: A Comprehensive Survey and Benchmark

2 code implementations 18 Jul 2024 Yuxuan Wang, Haixu Wu, Jiaxiang Dong, Yong liu, Mingsheng Long, Jianmin Wang

Further, we develop and release Time Series Library (TSLib) as a fair benchmark of deep time series models for diverse analysis tasks, which implements 24 mainstream models, covers 30 datasets from different domains, and supports five prevalent analysis tasks.

Survey Time Series

video-SALMONN: Speech-Enhanced Audio-Visual Large Language Models

1 code implementation 22 Jun 2024 Guangzhi Sun, Wenyi Yu, Changli Tang, Xianzhao Chen, Tian Tan, Wei Li, Lu Lu, Zejun Ma, Yuxuan Wang, Chao Zhang

To obtain the fine-grained temporal information required by speech understanding, while remaining efficient for other video elements, this paper proposes a novel multi-resolution causal Q-Former (MRC Q-Former) structure to connect pre-trained audio-visual encoders and the backbone large language model.

Diversity Language Modeling +3

SD-Eval: A Benchmark Dataset for Spoken Dialogue Understanding Beyond Words

1 code implementation 19 Jun 2024 Junyi Ao, Yuancheng Wang, Xiaohai Tian, Dekun Chen, Jun Zhang, Lu Lu, Yuxuan Wang, Haizhou Li, Zhizheng Wu

We argue that this is due to the lack of principles on task definition and model development, which requires open-source datasets and metrics suitable for model evaluation.

Dialogue Understanding

Dual-Pipeline with Low-Rank Adaptation for New Language Integration in Multilingual ASR

no code implementations 12 Jun 2024 Yerbolat Khassanov, Zhipeng Chen, Tianfeng Chen, Tze Yuang Chong, Wei Li, Jun Zhang, Lu Lu, Yuxuan Wang

This paper addresses challenges in integrating new languages into a pre-trained multilingual automatic speech recognition (mASR) system, particularly in scenarios where training data for existing languages is limited or unavailable.

Automatic Speech Recognition Decoder +2

Progressive Confident Masking Attention Network for Audio-Visual Segmentation

1 code implementation 4 Jun 2024 Yuxuan Wang, Jinchao Zhu, Feng Dong, Shuyue Zhu

Audio and visual signals typically occur simultaneously, and humans possess an innate ability to correlate and synchronize information from these two modalities.

A-SDM: Accelerating Stable Diffusion through Model Assembly and Feature Inheritance Strategies

no code implementations 31 May 2024 Jinchao Zhu, Yuxuan Wang, Siyuan Pan, Pengfei Wan, Di Zhang, Gao Huang

1) For the tuning method, we design a model assembly strategy to reconstruct a lightweight model while preserving performance through distillation.

Human-Centered LLM-Agent User Interface: A Position Paper

1 code implementation 19 May 2024 Daniel Chin, Yuxuan Wang, Gus Xia

Large Language Model (LLM)-in-the-loop applications have been shown to effectively interpret the human user's commands, make plans, and operate external tools/systems accordingly.

Language Modeling Language Modelling +2

Medical Dialogue: A Survey of Categories, Methods, Evaluation and Challenges

no code implementations 17 May 2024 Xiaoming Shi, Zeming Liu, Li Du, Yuxuan Wang, Hongru Wang, Yuhang Guo, Tong Ruan, Jie Xu, Shaoting Zhang

As a result, an overview of the categories, methods, and evaluation of medical dialogue systems remains limited and underspecified, hindering the further improvement of this area.

Survey

FIMP-HGA: A Novel Approach to Addressing the Partitioning Min-Max Weighted Matching Problem

no code implementations 6 May 2024 Yuxuan Wang, Jiongzhi Zheng, Jinyao Xie, Kun He

Similar to MP$_{\text{LS}}$, FIMP-HGA divides the solving process into match and partition stages, iteratively refining the solution.

VoiceShop: A Unified Speech-to-Speech Framework for Identity-Preserving Zero-Shot Voice Editing

no code implementations 10 Apr 2024 Philip Anastassiou, Zhenyu Tang, Kainan Peng, Dongya Jia, Jiaxin Li, Ming Tu, Yuping Wang, Yuxuan Wang, Mingbo Ma

We present VoiceShop, a novel speech-to-speech framework that can modify multiple attributes of speech, such as age, gender, accent, and speech style, in a single forward pass while preserving the input speaker's timbre.

Attribute

Predicate Debiasing in Vision-Language Models Integration for Scene Graph Generation Enhancement

no code implementations 24 Mar 2024 Yuxuan Wang, Xiaoyuan Liu

Scene Graph Generation (SGG) provides basic language representation of visual scenes, requiring models to grasp complex and diverse semantics between objects.

Diversity Graph Generation +3

View-Consistent 3D Editing with Gaussian Splatting

no code implementations 18 Mar 2024 Yuxuan Wang, Xuanyu Yi, Zike Wu, Na Zhao, Long Chen, Hanwang Zhang

However, this approach faces a critical issue of multi-view inconsistency, where the guidance images exhibit significant discrepancies across views, leading to mode collapse and visual artifacts of 3DGS.

3DGS

HawkEye: Training Video-Text LLMs for Grounding Text in Videos

1 code implementation 15 Mar 2024 Yueqian Wang, Xiaojun Meng, Jianxin Liang, Yuxuan Wang, Qun Liu, Dongyan Zhao

Video-text Large Language Models (video-text LLMs) have shown remarkable performance in answering questions and holding conversations on simple videos.

Video Grounding Video Question Answering

Efficient Temporal Extrapolation of Multimodal Large Language Models with Temporal Grounding Bridge

2 code implementations 25 Feb 2024 Yuxuan Wang, Yueqian Wang, Pengfei Wu, Jianxin Liang, Dongyan Zhao, Yang Liu, Zilong Zheng

Our framework significantly enhances the temporal capabilities of current MLLMs through three key innovations: an efficient multi-span temporal grounding algorithm applied to low-dimension temporal features projected from flow; a multimodal length extrapolation training paradigm that utilizes low-dimension temporal features to extend the training context window size; and a bootstrapping framework that bridges our model with pluggable MLLMs without requiring annotation.

Computational Efficiency Language Modelling +3

TimeSiam: A Pre-Training Framework for Siamese Time-Series Modeling

1 code implementation 4 Feb 2024 Jiaxiang Dong, Haixu Wu, Yuxuan Wang, Yunzhong Qiu, Li Zhang, Jianmin Wang, Mingsheng Long

To emphasize temporal correlation modeling, this paper proposes TimeSiam as a simple but effective self-supervised pre-training framework for Time series based on Siamese networks.

Contrastive Learning Data Augmentation +1

STAIR: Spatial-Temporal Reasoning with Auditable Intermediate Results for Video Question Answering

1 code implementation 8 Jan 2024 Yueqian Wang, Yuxuan Wang, Kai Chen, Dongyan Zhao

However, most models can only handle simple videos in terms of temporal reasoning, and their performance tends to drop when answering temporal-reasoning questions on long and informative videos.

Question Answering Video Question Answering

A-SDM: Accelerating Stable Diffusion through Redundancy Removal and Performance Optimization

no code implementations 24 Dec 2023 Jinchao Zhu, Yuxuan Wang, Xiaobing Tu, Siyuan Pan, Pengfei Wan, Gao Huang

The Stable Diffusion Model (SDM) is a popular and efficient text-to-image (t2i) generation and image-to-image (i2i) generation model.

Quantization

Audio Prompt Tuning for Universal Sound Separation

1 code implementation 30 Nov 2023 Yuzhuo Liu, Xubo Liu, Yan Zhao, Yuanyuan Wang, Rui Xia, Pingchuan Tain, Yuxuan Wang

Specifically, APT improves the separation performance of specific sources through training a small number of prompt parameters with limited audio samples, while maintaining the generalization of the USS model by keeping its parameters frozen.
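The recipe summarized above, training a small set of prompt parameters while the separation model itself stays frozen, follows the general prompt-tuning pattern. Below is a minimal PyTorch sketch of that pattern with a placeholder Transformer standing in for the actual USS backbone; the prompt length, embedding size, and backbone are assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class PromptTunedEncoder(nn.Module):
    """Prepend a few learnable prompt vectors to a frozen backbone's input.
    Only `self.prompts` receives gradients; the backbone stays frozen."""

    def __init__(self, backbone: nn.Module, embed_dim: int, num_prompts: int = 8):
        super().__init__()
        self.backbone = backbone
        for p in self.backbone.parameters():
            p.requires_grad_(False)  # keep the pre-trained model untouched
        self.prompts = nn.Parameter(torch.randn(num_prompts, embed_dim) * 0.02)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, embed_dim) pre-embedded audio features
        prompts = self.prompts.unsqueeze(0).expand(x.shape[0], -1, -1)
        return self.backbone(torch.cat([prompts, x], dim=1))

if __name__ == "__main__":
    layer = nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True)
    model = PromptTunedEncoder(nn.TransformerEncoder(layer, num_layers=2), embed_dim=64)
    out = model(torch.randn(2, 100, 64))
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    print(out.shape, trainable)  # only the 8 x 64 prompt parameters train
```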

LLaMA Rider: Spurring Large Language Models to Explore the Open World

no code implementations 13 Oct 2023 Yicheng Feng, Yuxuan Wang, Jiazheng Liu, Sipeng Zheng, Zongqing Lu

Recently, various studies have leveraged Large Language Models (LLMs) to help decision-making and planning in environments, and try to align the LLMs' knowledge with the world conditions.

Decision Making Minecraft +1

Teaching Text-to-Image Models to Communicate in Dialog

no code implementations 27 Sep 2023 Xiaowen Sun, Jiazhan Feng, Yuxuan Wang, Yuxuan Lai, Xingyu Shen, Dongyan Zhao

In this paper, we focus on the innovative dialog-to-image generation task, where the model synthesizes a high-resolution image aligned with the given dialog context as a response.

Sentence Text to Image Generation +1

Separate Anything You Describe

1 code implementation 9 Aug 2023 Xubo Liu, Qiuqiang Kong, Yan Zhao, Haohe Liu, Yi Yuan, Yuzhuo Liu, Rui Xia, Yuxuan Wang, Mark D. Plumbley, Wenwu Wang

In this work, we introduce AudioSep, a foundation model for open-domain audio source separation with natural language queries.

Audio Source Separation Natural Language Queries +2

Query Encoder Distillation via Embedding Alignment is a Strong Baseline Method to Boost Dense Retriever Online Efficiency

1 code implementation 5 Jun 2023 Yuxuan Wang, Hong Lyu

The information retrieval community has made significant progress in improving the efficiency of Dual Encoder (DE) dense passage retrieval systems, making them suitable for latency-sensitive settings.

Passage Retrieval Retrieval

MoviePuzzle: Visual Narrative Reasoning through Multimodal Order Learning

no code implementations 4 Jun 2023 Jianghui Wang, Yuxuan Wang, Dongyan Zhao, Zilong Zheng

We introduce MoviePuzzle, a novel challenge that targets visual narrative reasoning and holistic movie understanding.

Benchmarking Contrastive Learning +1

Shuo Wen Jie Zi: Rethinking Dictionaries and Glyphs for Chinese Language Pre-training

1 code implementation 30 May 2023 Yuxuan Wang, Jianghui Wang, Dongyan Zhao, Zilong Zheng

We introduce CDBERT, a new learning paradigm that enhances the semantics understanding ability of the Chinese PLMs with dictionary knowledge and structure of Chinese characters.

Contrastive Learning

VSTAR: A Video-grounded Dialogue Dataset for Situated Semantic Understanding with Scene and Topic Transitions

1 code implementation 30 May 2023 Yuxuan Wang, Zilong Zheng, Xueliang Zhao, Jinpeng Li, Yueqian Wang, Dongyan Zhao

Video-grounded dialogue understanding is a challenging problem that requires a machine to perceive, parse, and reason over situated semantics extracted from weakly aligned video and dialogues.

Dialogue Generation Dialogue Understanding +2

Language-universal phonetic encoder for low-resource speech recognition

no code implementations 19 May 2023 Siyuan Feng, Ming Tu, Rui Xia, Chuanzeng Huang, Yuxuan Wang

Our main approach and adaptation are effective on extremely low-resource languages, even within domain- and language-mismatched scenarios.

Decoder speech-recognition +1

Language-Universal Phonetic Representation in Multilingual Speech Pretraining for Low-Resource Speech Recognition

no code implementations 19 May 2023 Siyuan Feng, Ming Tu, Rui Xia, Chuanzeng Huang, Yuxuan Wang

Moreover, on 3 of the 4 languages, compared to the standard HuBERT, the approach performs better while saving up to 1.5k hours (75%) of supervised training data.

Diversity Self-Supervised Learning +2

A unified front-end framework for English text-to-speech synthesis

no code implementations 18 May 2023 Zelin Ying, Chen Li, Yu Dong, Qiuqiang Kong, Qiao Tian, YuanYuan Huo, Yuxuan Wang

The front-end is a critical component of English text-to-speech (TTS) systems, responsible for extracting linguistic features that are essential for a text-to-speech model to synthesize speech, such as prosodies and phonemes.

Speech Synthesis Text Normalization +3

Zero-Shot Accent Conversion using Pseudo Siamese Disentanglement Network

no code implementations 12 Dec 2022 Dongya Jia, Qiao Tian, Kainan Peng, Jiaxin Li, Yuanzhe Chen, Mingbo Ma, Yuping Wang, Yuxuan Wang

The goal of accent conversion (AC) is to convert the accent of speech into the target accent while preserving the content and speaker identity.

Data Augmentation Disentanglement

How Many Grid-Forming Converters Do We Need? A Perspective From Small Signal Stability and Power Grid Strength

no code implementations 21 Sep 2022 Huanhai Xin, Chenxi Liu, Xia Chen, Yuxuan Wang, Eduardo Prieto-Araujo, Linbin Huang

Based on our analysis, we further study the problem of how to configure GFM converters in the grid and how many GFM converters we will need.

Network-Level Adversaries in Federated Learning

1 code implementation 27 Aug 2022 Giorgio Severi, Matthew Jagielski, Gökberk Yar, Yuxuan Wang, Alina Oprea, Cristina Nita-Rotaru

Federated learning is a popular strategy for training models on distributed, sensitive data, while preserving data privacy.

Federated Learning

The USTC-Ximalaya system for the ICASSP 2022 multi-channel multi-party meeting transcription (M2MeT) challenge

no code implementations 10 Feb 2022 Maokui He, Xiang Lv, Weilin Zhou, JingJing Yin, Xiaoqi Zhang, Yuxuan Wang, Shutong Niu, Yuhang Cao, Heng Lu, Jun Du, Chin-Hui Lee

We propose two improvements to target-speaker voice activity detection (TS-VAD), the core component in our proposed speaker diarization system that was submitted to the 2022 Multi-Channel Multi-Party Meeting Transcription (M2MeT) challenge.

Action Detection Activity Detection +2

Neural Dubber: Dubbing for Videos According to Scripts

no code implementations NeurIPS 2021 Chenxu Hu, Qiao Tian, Tingle Li, Yuping Wang, Yuxuan Wang, Hang Zhao

Neural Dubber is a multi-modal text-to-speech (TTS) model that utilizes the lip movement in the video to control the prosody of the generated speech.

text-to-speech Text to Speech

Deep Superpixel-based Network for Blind Image Quality Assessment

1 code implementation 13 Oct 2021 Guangyi Yang, Yang Zhan, Yuxuan Wang

In order to fill this gap, we propose a deep adaptive superpixel-based network, namely DSN-IQA, to assess image quality based on multi-scale and superpixel segmentation.

Audiovisual Singing Voice Separation

no code implementations 1 Jul 2021 Bochen Li, Yuxuan Wang, Zhiyao Duan

Separating a song into vocal and accompaniment components is an active research topic, and recent years have witnessed increased performance from supervised training using deep learning techniques.

VeniBot: Towards Autonomous Venipuncture with Automatic Puncture Area and Angle Regression from NIR Images

no code implementations 27 May 2021 Xu Cao, Zijie Chen, Bolin Lai, Yuxuan Wang, Yu Chen, Zhengqing Cao, Zhilin Yang, Nanyang Ye, Junbo Zhao, Xiao-Yun Zhou, Peng Qi

For the automation, we focus on the positioning part and propose a Dual-In-Dual-Out network based on two-step learning and two-task learning, which can achieve fully automatic regression of the suitable puncture area and angle from near-infrared (NIR) images.

Navigate regression

Modeling the Compatibility of Stem Tracks to Generate Music Mashups

no code implementations 26 Mar 2021 Jiawen Huang, Ju-Chiang Wang, Jordan B. L. Smith, Xuchen Song, Yuxuan Wang

A music mashup combines audio elements from two or more songs to create a new work.

Listen, Read, and Identify: Multimodal Singing Language Identification of Music

no code implementations 2 Mar 2021 Keunwoo Choi, Yuxuan Wang

Optionally, LRID-Net is equipped with modality dropouts to handle a missing modality.

Language Identification

Large-Scale MIDI-based Composer Classification

no code implementations 28 Oct 2020 Qiuqiang Kong, Keunwoo Choi, Yuxuan Wang

Music classification is a task to classify a music piece into labels such as genres or composers.

Classification General Classification +1

GiantMIDI-Piano: A large-scale MIDI dataset for classical piano music

3 code implementations 11 Oct 2020 Qiuqiang Kong, Bochen Li, Jitong Chen, Yuxuan Wang

In this article, we create a GiantMIDI-Piano (GP) dataset containing 38,700,838 transcribed notes and 10,855 unique solo piano works composed by 2,786 composers.

Information Retrieval Music Information Retrieval +1

High-resolution Piano Transcription with Pedals by Regressing Onset and Offset Times

3 code implementations 5 Oct 2020 Qiuqiang Kong, Bochen Li, Xuchen Song, Yuan Wan, Yuxuan Wang

In addition, previous AMT systems are sensitive to the misaligned onset and offset labels of audio recordings.

Music Transcription Sound Audio and Speech Processing

Xiaomingbot: A Multilingual Robot News Reporter

no code implementations ACL 2020 Runxin Xu, Jun Cao, Mingxuan Wang, Jiaze Chen, Hao Zhou, Ying Zeng, Yu-Ping Wang, Li Chen, Xiang Yin, Xijin Zhang, Songcheng Jiang, Yuxuan Wang, Lei LI

This paper proposes the building of Xiaomingbot, an intelligent, multilingual and multimodal software robot equipped with four integral capabilities: news generation, news translation, news reading and avatar animation.

News Generation Translation +1

Noise Robust TTS for Low Resource Speakers using Pre-trained Model and Speech Enhancement

no code implementations 26 May 2020 Dongyang Dai, Li Chen, Yu-Ping Wang, Mu Wang, Rui Xia, Xuchen Song, Zhiyong Wu, Yuxuan Wang

Firstly, the speech synthesis model is pre-trained with both multi-speaker clean data and noisy augmented data; then the pre-trained model is adapted on noisy low-resource new speaker data; finally, by setting the clean speech condition, the model can synthesize the new speaker's clean voice.

Decoder Speech Enhancement +1

Improving Accent Conversion with Reference Encoder and End-To-End Text-To-Speech

no code implementations 19 May 2020 Wenjie Li, Benlai Tang, Xiang Yin, Yushi Zhao, Wei Li, Kang Wang, Hao Huang, Yuxuan Wang, Zejun Ma

Accent conversion (AC) transforms a non-native speaker's accent into a native accent while maintaining the speaker's voice timbre.

text-to-speech Text to Speech

Review of Text Style Transfer Based on Deep Learning

no code implementations 6 May 2020 Xiang-Yang Li, Guo Pu, Keyu Ming, Pu Li, Jie Wang, Yuxuan Wang

In traditional text style transfer models, text style generally relies on expert knowledge and hand-designed rules; with the application of deep learning to natural language processing, deep-learning-based text style transfer methods have begun to be heavily researched.

Deep Learning Style Transfer +1

Adversarial Feature Learning and Unsupervised Clustering based Speech Synthesis for Found Data with Acoustic and Textual Noise

no code implementations 28 Apr 2020 Shan Yang, Yuxuan Wang, Lei Xie

As for the speech-side noise, we propose to learn a noise-independent feature in the auto-regressive decoder through adversarial training and data augmentation, which does not need an extra speech enhancement model.

Clustering Data Augmentation +6

ByteSing: A Chinese Singing Voice Synthesis System Using Duration Allocated Encoder-Decoder Acoustic Models and WaveRNN Vocoders

no code implementations 23 Apr 2020 Yu Gu, Xiang Yin, Yonghui Rao, Yuan Wan, Benlai Tang, Yang Zhang, Jitong Chen, Yuxuan Wang, Zejun Ma

This paper presents ByteSing, a Chinese singing voice synthesis (SVS) system based on duration allocated Tacotron-like acoustic models and WaveRNN neural vocoders.

Decoder Prediction +1

Convolutional Embedding for Edit Distance

2 code implementations 31 Jan 2020 Xinyan Dai, Xiao Yan, Kaiwen Zhou, Yuxuan Wang, Han Yang, James Cheng

Edit-distance-based string similarity search has many applications such as spell correction, data de-duplication, and sequence alignment.

Triplet
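The embedding model in this paper is trained so that distances in a vector space approximate string edit distance. As background only (not the paper's method), here is the classic dynamic-programming Levenshtein distance that such embeddings are meant to approximate.

```python
def edit_distance(a: str, b: str) -> int:
    """Classic Levenshtein distance via dynamic programming.
    Each row tracks the cost of turning a prefix of `a` into b[:j]."""
    m, n = len(a), len(b)
    prev = list(range(n + 1))            # turning "" into b[:j] costs j insertions
    for i in range(1, m + 1):
        curr = [i] + [0] * n             # turning a[:i] into "" costs i deletions
        for j in range(1, n + 1):
            sub = 0 if a[i - 1] == b[j - 1] else 1
            curr[j] = min(prev[j] + 1,          # delete a[i-1]
                          curr[j - 1] + 1,      # insert b[j-1]
                          prev[j - 1] + sub)    # substitute or match
        prev = curr
    return prev[n]

if __name__ == "__main__":
    print(edit_distance("kitten", "sitting"))  # -> 3
```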

A unified sequence-to-sequence front-end model for Mandarin text-to-speech synthesis

no code implementations 11 Nov 2019 Junjie Pan, Xiang Yin, Zhiling Zhang, Shichao Liu, Yang Zhang, Zejun Ma, Yuxuan Wang

In Mandarin text-to-speech (TTS) system, the front-end text processing module significantly influences the intelligibility and naturalness of synthesized speech.

Polyphone disambiguation Speech Synthesis +3

Cross-Lingual BERT Transformation for Zero-Shot Dependency Parsing

1 code implementation IJCNLP 2019 Yuxuan Wang, Wanxiang Che, Jiang Guo, Yijia Liu, Ting Liu

In this approach, a linear transformation is learned from contextual word alignments to align the contextualized embeddings independently trained in different languages.

Dependency Parsing Language Modeling +3
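The core step described above, learning a linear transformation from contextual word alignments so that embeddings from independently trained language models line up, can be sketched with ordinary least squares; an orthogonal Procrustes solution is a common alternative. The paper's exact training objective may differ, so treat this purely as an illustrative sketch with made-up dimensions.

```python
import numpy as np

def fit_linear_map(src: np.ndarray, tgt: np.ndarray) -> np.ndarray:
    """Least-squares W such that src @ W approximates tgt.
    src, tgt: (n_pairs, dim) contextual embeddings of aligned word pairs."""
    W, *_ = np.linalg.lstsq(src, tgt, rcond=None)
    return W

def fit_orthogonal_map(src: np.ndarray, tgt: np.ndarray) -> np.ndarray:
    """Orthogonal Procrustes variant: constrain W to be a rotation."""
    u, _, vt = np.linalg.svd(src.T @ tgt)
    return u @ vt

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    dim, n = 64, 2000
    true_w = rng.normal(size=(dim, dim)) / np.sqrt(dim)
    src = rng.normal(size=(n, dim))
    tgt = src @ true_w + 0.01 * rng.normal(size=(n, dim))   # noisy "aligned" pairs
    W = fit_linear_map(src, tgt)
    print(np.linalg.norm(src @ W - tgt) / np.linalg.norm(tgt))  # small residual
```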

Hierarchical Generative Modeling for Controllable Speech Synthesis

2 code implementations ICLR 2019 Wei-Ning Hsu, Yu Zhang, Ron J. Weiss, Heiga Zen, Yonghui Wu, Yuxuan Wang, Yuan Cao, Ye Jia, Zhifeng Chen, Jonathan Shen, Patrick Nguyen, Ruoming Pang

This paper proposes a neural sequence-to-sequence text-to-speech (TTS) model which can control latent attributes in the generated speech that are rarely annotated in the training data, such as speaking style, accent, background noise, and recording conditions.

Attribute Speech Synthesis +2

Semi-Supervised Training for Improving Data Efficiency in End-to-End Speech Synthesis

no code implementations 30 Aug 2018 Yu-An Chung, Yuxuan Wang, Wei-Ning Hsu, Yu Zhang, RJ Skerry-Ryan

We demonstrate that the proposed framework enables Tacotron to generate intelligible speech using less than half an hour of paired training data.

Decoder Speech Synthesis +2

Predicting Expressive Speaking Style From Text In End-To-End Speech Synthesis

no code implementations 4 Aug 2018 Daisy Stanton, Yuxuan Wang, RJ Skerry-Ryan

GSTs can be used within Tacotron, a state-of-the-art end-to-end text-to-speech synthesis system, to uncover expressive factors of variation in speaking style.

Speech Synthesis text-to-speech +2

Towards End-to-End Prosody Transfer for Expressive Speech Synthesis with Tacotron

2 code implementations ICML 2018 RJ Skerry-Ryan, Eric Battenberg, Ying Xiao, Yuxuan Wang, Daisy Stanton, Joel Shor, Ron J. Weiss, Rob Clark, Rif A. Saurous

We present an extension to the Tacotron speech synthesis architecture that learns a latent embedding space of prosody, derived from a reference acoustic representation containing the desired prosody.

Expressive Speech Synthesis

Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis

11 code implementations ICML 2018 Yuxuan Wang, Daisy Stanton, Yu Zhang, RJ Skerry-Ryan, Eric Battenberg, Joel Shor, Ying Xiao, Fei Ren, Ye Jia, Rif A. Saurous

In this work, we propose "global style tokens" (GSTs), a bank of embeddings that are jointly trained within Tacotron, a state-of-the-art end-to-end speech synthesis system.

Speech Synthesis Style Transfer +1
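The "bank of embeddings" described above is typically realized as a small attention layer: a reference-derived query attends over learned style tokens, and their weighted sum conditions the synthesizer. The sketch below is a simplified single-head version with assumed dimensions; the published model uses multi-head attention and a convolutional reference encoder.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StyleTokenLayer(nn.Module):
    """Attend over a learned bank of style-token embeddings.
    A reference embedding (e.g. summarizing a reference mel spectrogram)
    forms the query; the attention-weighted sum of the token bank is the
    style embedding used to condition synthesis."""

    def __init__(self, num_tokens: int = 10, token_dim: int = 256, ref_dim: int = 128):
        super().__init__()
        self.tokens = nn.Parameter(torch.randn(num_tokens, token_dim) * 0.5)
        self.query_proj = nn.Linear(ref_dim, token_dim)

    def forward(self, ref_embedding: torch.Tensor) -> torch.Tensor:
        # ref_embedding: (batch, ref_dim)
        query = self.query_proj(ref_embedding)               # (batch, token_dim)
        keys = torch.tanh(self.tokens)                       # (num_tokens, token_dim)
        scores = query @ keys.T / keys.shape[-1] ** 0.5      # scaled dot-product
        weights = F.softmax(scores, dim=-1)                  # (batch, num_tokens)
        return weights @ keys                                # (batch, token_dim)

if __name__ == "__main__":
    layer = StyleTokenLayer()
    style = layer(torch.randn(4, 128))
    print(style.shape)  # torch.Size([4, 256])
```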

The HIT-SCIR System for End-to-End Parsing of Universal Dependencies

no code implementations CONLL 2017 Wanxiang Che, Jiang Guo, Yuxuan Wang, Bo Zheng, Huaipeng Zhao, Yang Liu, Dechuan Teng, Ting Liu

Our system includes three pipelined components: tokenization, Part-of-Speech (POS) tagging, and dependency parsing.

Dependency Parsing Information Retrieval +4

Cocktail Party Processing via Structured Prediction

no code implementations NeurIPS 2012 Yuxuan Wang, DeLiang Wang

While human listeners excel at selectively attending to a conversation in a cocktail party, machine performance is still far inferior by comparison.

General Classification Prediction +2
