no code implementations • 22 May 2023 • Shentong Mo, Jing Shi, Yapeng Tian
In this work, we propose a novel and personalized text-to-sound generation approach with visual alignment based on latent diffusion models, namely DiffAVA, that can simply fine-tune lightweight visual-text alignment modules with frozen modality-specific encoders to update visual-aligned text embeddings as the condition.
no code implementations • 10 May 2023 • Xiyun Li, Ziyi Ni, Jingqing Ruan, Linghui Meng, Jing Shi, Tielin Zhang, Bo Xu
Inspired by this two-step psychology theory, we propose a biologically plausible mixture of personality (MoP) improved spiking actor network (SAN), whereby a determinantal point process is used to simulate the complex formation and integration of different types of personality in MoP, and dynamic and spiking neurons are incorporated into the SAN for efficient reinforcement learning.
no code implementations • 7 May 2023 • Feilong Chen, Minglun Han, Haozhi Zhao, Qingyang Zhang, Jing Shi, Shuang Xu, Bo Xu
(3) Integrating multiple modalities: all single-modal encoders are aligned with the LLM through X2L interfaces to integrate multimodal capabilities into the LLM.
no code implementations • 6 Apr 2023 • Jing Shi, Wei Xiong, Zhe Lin, Hyun Joon Jung
First, we learn the general concept of the input images by converting them to a textual token with a learnable image encoder.
1 code implementation • 2 Mar 2023 • Zefa Hu, Xiuyi Chen, Haoran Wu, Minglun Han, Ziyi Ni, Jing Shi, Shuang Xu, Bo Xu
The Medical Slot Filling (MSF) task aims to convert medical queries into structured information, playing an essential role in diagnosis dialogue systems.
1 code implementation • 30 Jan 2023 • Minglun Han, Feilong Chen, Jing Shi, Shuang Xu, Bo Xu
Large-scale pre-trained language models (PLMs) with powerful language modeling capabilities have been widely used in natural language processing.
Automatic Speech Recognition (ASR)
no code implementations • 28 Oct 2022 • Xuefeng Yang, Li Liu, Wenju Zhou, Jing Shi, Yinggang Zhang, Xin Hu, Huiyu Zhou
Moreover, the privacy of the system is analyzed to ensure the security of the real data.
1 code implementation • 18 Feb 2022 • Feilong Chen, Duzhen Zhang, Minglun Han, Xiuyi Chen, Jing Shi, Shuang Xu, Bo Xu
Finally, we discuss the new frontiers in VLP.
no code implementations • CVPR 2022 • Jing Shi, Ning Xu, Haitian Zheng, Alex Smith, Jiebo Luo, Chenliang Xu
Recently, large pretrained models (e.g., BERT, StyleGAN, CLIP) have shown great knowledge transfer and generalization capability on various downstream tasks within their domains.
no code implementations • 17 Dec 2021 • Jing Shi, Xuankai Chang, Tomoki Hayashi, Yen-Ju Lu, Shinji Watanabe, Bo Xu
Specifically, we propose a novel speech separation/enhancement model based on the recognition of discrete symbols, and convert the paradigm of the speech separation/enhancement related tasks from regression to classification.
no code implementations • 12 Dec 2021 • Guangyu Sun, Zhang Liu, Lianggong Wen, Jing Shi, Chenliang Xu
Video anomaly detection aims to identify abnormal events that occur in videos.
no code implementations • 30 Nov 2021 • Jing Shi, Ning Xu, Haitian Zheng, Alex Smith, Jiebo Luo, Chenliang Xu
Recently, large pretrained models (e.g., BERT, StyleGAN, CLIP) have shown great knowledge transfer and generalization capability on various downstream tasks within their domains.
no code implementations • 27 Oct 2021 • Wangyou Zhang, Jing Shi, Chenda Li, Shinji Watanabe, Yanmin Qian
Deep learning-based time-domain models, e.g., Conv-TasNet, have shown great potential in both single-channel and multi-channel speech enhancement.
no code implementations • 9 Oct 2021 • Xuankai Chang, Takashi Maekaku, Pengcheng Guo, Jing Shi, Yen-Ju Lu, Aswin Shanmugam Subramanian, Tianzi Wang, Shu-wen Yang, Yu Tsao, Hung-Yi Lee, Shinji Watanabe
We select several pretrained speech representations and present the experimental results on various open-source and publicly available corpora for E2E-ASR.
Automatic Speech Recognition (ASR)
1 code implementation • ICCV 2021 • Yiwu Zhong, Jing Shi, Jianwei Yang, Chenliang Xu, Yin Li
To bridge the gap between images and texts, we leverage an off-the-shelf object detector to identify and localize object instances, match labels of detected regions to concepts parsed from captions, and thus create "pseudo" labels for learning scene graphs.
1 code implementation • CVPR 2021 • Jing Shi, Ning Xu, Yihang Xu, Trung Bui, Franck Dernoncourt, Chenliang Xu
Recently, language-guided global image editing has drawn increasing attention with growing application potential.
no code implementations • ICCV 2021 • Jing Shi, Yiwu Zhong, Ning Xu, Yin Li, Chenliang Xu
We investigate weakly-supervised scene graph generation, which is a challenging task since no correspondence between labels and objects is provided.
no code implementations • ICCV 2021 • Wentao Jiang, Ning Xu, Jiayun Wang, Chen Gao, Jing Shi, Zhe Lin, Si Liu
Given the cycle, we propose several free augmentation strategies to help our model understand various editing requests given the imbalanced dataset.
no code implementations • 29 Nov 2020 • Peng Zhang, Jiaming Xu, Jing Shi, Yunzhe Hao, Bo Xu
In our model, we use a face detector to detect the number of speakers in the scene and use visual information to avoid the permutation problem.
no code implementations • 5 Oct 2020 • Jing Shi, Ning Xu, Trung Bui, Franck Dernoncourt, Zheng Wen, Chenliang Xu
To solve this new task, we first present a new language-driven image editing dataset that supports both local and global editing with editing operation and mask annotations.
no code implementations • 3 Oct 2020 • Jing Shi, Jing Bi, Yingru Liu, Chenliang Xu
The marriage of recurrent neural networks and neural ordinary differential networks (ODE-RNN) is effective in modeling irregularly-observed sequences.
no code implementations • 1 Aug 2020 • Jing Shi, Zhiheng Li, Haitian Zheng, Yihang Xu, Tianyou Xiao, Weitao Tan, Xiaoning Guo, Sizhe Li, Bin Yang, Zhexin Xu, Ruitao Lin, Zhongkai Shangguan, Yue Zhao, Jingwen Wang, Rohan Sharma, Surya Iyer, Ajinkya Deshmukh, Raunak Mahalik, Srishti Singh, Jayant G Rohra, Yi-Peng Zhang, Tongyu Yang, Xuan Wen, Ethan Fahnestock, Bryce Ikeda, Ian Lawson, Alan Finkelstein, Kehao Guo, Richard Magnotti, Andrew Sexton, Jeet Ketan Thaker, Yiyang Su, Chenliang Xu
This technical report summarizes and compiles submissions from the Actor-Action video classification challenge held as a final project in the CSC 249/449 Machine Vision course (Spring 2020) at the University of Rochester.
no code implementations • NeurIPS 2020 • Jing Shi, Xuankai Chang, Pengcheng Guo, Shinji Watanabe, Yusuke Fujita, Jiaming Xu, Bo Xu, Lei Xie
This model additionally has a simple and efficient stop criterion for the end of the transduction, making it able to infer the variable number of output sequences.
Ranked #2 on Speech Separation on WSJ0-5mix
no code implementations • 25 Jun 2020 • Jing Shi, Jiaming Xu, Yusuke Fujita, Shinji Watanabe, Bo Xu
With the predicted speaker information from the whole observation, our model helps solve the problems of conventional speech separation and speaker extraction for multi-round long recordings.
Audio and Speech Processing Sound
no code implementations • 11 Jun 2020 • Yingru Liu, Yucheng Xing, Xuewen Yang, Xin Wang, Jing Shi, Di Jin, Zhaoyue Chen
Learning continuous-time stochastic dynamics is a fundamental and essential problem in modeling sporadic time series, whose observations are irregular and sparse in both time and dimension.
1 code implementation • 2 Jun 2020 • Yusuke Fujita, Shinji Watanabe, Shota Horiguchi, Yawen Xue, Jing Shi, Kenji Nagamatsu
Speaker diarization is an essential step for processing multi-speaker audio.
no code implementations • CVPR 2019 • Jing Shi, Jia Xu, Boqing Gong, Chenliang Xu
We investigate the problem of weakly-supervised video grounding, where only video-level sentences are provided.
no code implementations • 2 Dec 2018 • Wentian Zhao, Shaojie Wang, Zhihuai Xie, Jing Shi, Chenliang Xu
To overcome this limitation, we propose a GAN-based EM learning framework that can maximize the likelihood of images and estimate the latent variables with only the constraint of L-Lipschitz continuity.
no code implementations • 15 Nov 2018 • Jing Shi, Jiaming Xu, Yiqun Yao, Bo Xu
In this paper, we present a memory-augmented neural network which is motivated by the process of human concept learning.
2 code implementations • ECCV 2018 • Yapeng Tian, Jing Shi, Bochen Li, Zhiyao Duan, Chenliang Xu
In this paper, we introduce a novel problem of audio-visual event localization in unconstrained videos.
no code implementations • WS 2016 • Jing Shi, Jiaming Xu, Yiqun Yao, Suncong Zheng, Bo Xu
As the evaluation results show, our solution provides a valuable and compact model that could be used for modelling question answering or sentence semantic relevance.
1 code implementation • COLING 2016 • Jiaming Xu, Jing Shi, Yiqun Yao, Suncong Zheng, Bo Xu
Recently, end-to-end memory networks have shown promising results on the Question Answering task; they encode past facts into an explicit memory and perform reasoning by making multiple computational steps on the memory.