no code implementations • Findings (ACL) 2022 • Tao Jin, Zhou Zhao, Meng Zhang, Xingshan Zeng
This paper attacks the challenging problem of sign language translation (SLT), which involves not only visual and textual understanding but also additional prior knowledge learning (i.e., performing style, syntax).
no code implementations • 10 Jun 2022 • Yang Zhao, Xuan Lin, Wenqiang Xu, Maozong Zheng, Zhengyong Liu, Zhou Zhao
Recently, streaming technology has greatly promoted the development of the livestreaming field.
no code implementations • 5 Jun 2022 • Ziyue Jiang, Su Zhe, Zhou Zhao, Qian Yang, Yi Ren, Jinglin Liu, Zhenhui Ye
This paper tackles the polyphone disambiguation problem from a concise and novel perspective: we propose Dict-TTS, a semantic-aware generative text-to-speech model with an online website dictionary (the existing prior information in the natural language).
no code implementations • 25 May 2022 • Rongjie Huang, Zhou Zhao, Jinglin Liu, Huadai Liu, Yi Ren, Lichao Zhang, Jinzheng He
To alleviate the acoustic multimodal problem, we propose bilateral perturbation, which consists of the style normalization and information enhancement stages, to learn only the linguistic information from speech samples and generate more deterministic representations.
no code implementations • 15 May 2022 • Rongjie Huang, Yi Ren, Jinglin Liu, Chenye Cui, Zhou Zhao
Style transfer for out-of-domain (OOD) speech synthesis aims to generate speech samples with unseen style (e.g., speaker identity, emotion, and prosody) derived from an acoustic reference, while facing the following challenges: 1) the highly dynamic style features in expressive voice are difficult to model and transfer; and 2) the TTS models should be robust enough to handle diverse OOD conditions that differ from the source data.
1 code implementation • 21 Apr 2022 • Rongjie Huang, Max W. Y. Lam, Jun Wang, Dan Su, Dong Yu, Yi Ren, Zhou Zhao
Also, FastDiff enables a sampling speed of 58x faster than real-time on a V100 GPU, making diffusion models practically applicable to speech synthesis deployment for the first time.
Ranked #6 on Text-To-Speech Synthesis on LJSpeech
no code implementations • CVPR 2022 • Xinyu Lyu, Lianli Gao, Yuyu Guo, Zhou Zhao, Hao Huang, Heng Tao Shen, Jingkuan Song
The performance of current Scene Graph Generation models is severely hampered by some hard-to-distinguish predicates, e.g., "woman-on/standing on/walking on-beach" or "woman-near/looking at/in front of-child".
no code implementations • 17 Mar 2022 • Dong Yao, Zhou Zhao, Shengyu Zhang, Jieming Zhu, Yudong Zhu, Rui Zhang, Xiuqiang He
We devise a novel contrastive learning objective to accommodate both self-augmented positives/negatives sampled from the same music.
no code implementations • ACL 2022 • Mengze Li, Tianbao Wang, Haoyu Zhang, Shengyu Zhang, Zhou Zhao, Jiaxu Miao, Wenqiao Zhang, Wenming Tan, Jin Wang, Peng Wang, ShiLiang Pu, Fei Wu
To achieve effective grounding under a limited annotation budget, we investigate one-shot video grounding, and learn to ground natural language in all video frames with solely one frame labeled, in an end-to-end manner.
3 code implementations • ACL 2022 • Jinglin Liu, Chengxi Li, Yi Ren, Zhiying Zhu, Zhou Zhao
Furthermore, we propose a latent-mapping algorithm in the latent space to convert the amateur vocal tone to the professional one.
no code implementations • ACL 2022 • Yi Ren, Xu Tan, Tao Qin, Zhou Zhao, Tie-Yan Liu
Then we conduct a comprehensive study on NAR-TTS models that use some advanced modeling methods.
1 code implementation • 21 Feb 2022 • Jiong Wang, Zhou Zhao, Weike Jin, Xinyu Duan, Zhen Lei, Baoxing Huai, Yiling Wu, Xiaofei He
In this paper, the VLAD aggregation method is adopted to quantize local features with visual vocabulary locally partitioning the feature space, and hence preserve the local discriminability.
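As a rough illustration of the VLAD aggregation mentioned above, the following pure-Python sketch hard-assigns each local feature to its nearest visual word, sums the residuals per word, and L2-normalizes the concatenated result. The function name `vlad`, the two-word vocabulary, and the toy features are assumptions made for illustration; they are not taken from the paper, which may use a different assignment or normalization scheme.

```python
import math


def vlad(features, centroids):
    """Aggregate local descriptors into one VLAD vector (illustrative sketch)."""
    dim = len(centroids[0])
    residual_sums = [[0.0] * dim for _ in centroids]
    for f in features:
        # Hard-assign the descriptor to its nearest visual word (squared L2 distance).
        nearest = min(
            range(len(centroids)),
            key=lambda k: sum((a - b) ** 2 for a, b in zip(f, centroids[k])),
        )
        # Accumulate the residual (descriptor minus centroid) for that word.
        for d in range(dim):
            residual_sums[nearest][d] += f[d] - centroids[nearest][d]
    # Concatenate the per-word residual sums and L2-normalize the result.
    flat = [v for row in residual_sums for v in row]
    norm = math.sqrt(sum(v * v for v in flat))
    return [v / norm for v in flat] if norm > 0 else flat


# Hypothetical vocabulary of two visual words and three local features.
centroids = [[0.0, 0.0], [10.0, 10.0]]
features = [[1.0, 0.0], [0.0, 1.0], [9.0, 10.0]]
descriptor = vlad(features, centroids)
```

Because residuals are accumulated separately for each visual word, the output has length (number of words) x (feature dimension), which is how the representation keeps information local to each partition of the feature space.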
3 code implementations • ICLR 2022 • Luping Liu, Yi Ren, Zhijie Lin, Zhou Zhao
Under such a perspective, we propose pseudo numerical methods for diffusion models (PNDMs).
Ranked #4 on Image Generation on CelebA 64x64
no code implementations • 16 Feb 2022 • Yi Ren, Ming Lei, Zhiying Huang, Shiliang Zhang, Qian Chen, Zhijie Yan, Zhou Zhao
Specifically, we first introduce a word-level prosody encoder, which quantizes the low-frequency band of the speech and compresses prosody attributes in the latent prosody vector (LPV).
no code implementations • 11 Jan 2022 • Shoutong Wang, Jinglin Liu, Yi Ren, Zhen Wang, Changliang Xu, Zhou Zhao
However, they face several challenges: 1) the fixed-size speaker embedding is not powerful enough to capture full details of the target timbre; 2) single reference audio does not contain sufficient timbre information of the target speaker; 3) the pitch inconsistency between different speakers also leads to a degradation in the generated voice.
no code implementations • 10 Jan 2022 • Lei LI, Fuping Wu, Sihan Wang, Xinzhe Luo, Carlos Martin-Isla, Shuwei Zhai, Jianpeng Zhang, Yanfei Liu, Zhen Zhang, Markus J. Ankenbrand, Haochuan Jiang, Xiaoran Zhang, Linhong Wang, Tewodros Weldebirhan Arega, Elif Altunok, Zhou Zhao, Feiyan Li, Jun Ma, Xiaoping Yang, Elodie Puybareau, Ilkay Oksuz, Stephanie Bricq, Weisheng Li, Kumaradevan Punithakumar, Sotirios A. Tsaftaris, Laura M. Schreiber, Mingjing Yang, Guocai Liu, Yong Xia, Guotai Wang, Sergio Escalera, Xiahai Zhuang
Assessment of myocardial viability is essential in diagnosis and treatment management of patients suffering from myocardial infarction, and classification of pathology on myocardium is the key to this assessment.
1 code implementation • CVPR 2022 • Yan Xia, Zhou Zhao
Audiovisual Event (AVE) localization requires the model to jointly localize an event by observing audio and visual information.
1 code implementation • CVPR 2022 • Wenwen Pan, Haonan Shi, Zhou Zhao, Jieming Zhu, Xiuqiang He, Zhigeng Pan, Lianli Gao, Jun Yu, Fei Wu, Qi Tian
Audio-Guided video semantic segmentation is a challenging problem in visual analysis and editing, which automatically separates foreground objects from background in a video sequence according to the referring audio expressions.
no code implementations • CVPR 2022 • Aoxiong Yin, Zhou Zhao, Weike Jin, Meng Zhang, Xingshan Zeng, Xiaofei He
In addition, we also explore zero-shot translation in sign language and find that our model can achieve comparable performance to the supervised BSLT model on some language pairs.
1 code implementation • MM '21: Proceedings of the 29th ACM International Conference on Multimedia 2021 • Rongjie Huang, Feiyang Chen, Yi Ren, Jinglin Liu, Chenye Cui, Zhou Zhao
High-fidelity multi-singer singing voice synthesis is challenging for neural vocoders due to the shortage of singing voice data, limited singer generalization, and large computational cost.
no code implementations • 8 Dec 2021 • Aoxiong Yin, Zhou Zhao, Jinglin Liu, Weike Jin, Meng Zhang, Xingshan Zeng, Xiaofei He
Sign language translation, a technology with profound social significance, has attracted growing interest from researchers in recent years.
no code implementations • NeurIPS 2021 • Tao Jin, Zhou Zhao
The majority of existing multimodal sequential learning methods focus on how to obtain effective representations and ignore the importance of multimodal fusion.
no code implementations • 11 Nov 2021 • Jianyun Zou, Min Yang, Lichao Zhang, Yechen Xu, Qifan Pan, Fengqing Jiang, Ran Qin, Shushu Wang, Yifan He, Songfang Huang, Zhou Zhao
We finally analyze the performance of SOTA KBQA models on this dataset and identify the challenges facing Chinese KBQA.
no code implementations • 14 Oct 2021 • Ziyue Jiang, Yi Ren, Ming Lei, Zhou Zhao
Federated learning enables collaborative training of machine learning models under strict privacy restrictions and federated text-to-speech aims to synthesize natural speech of multiple users with a few audio training samples stored in their devices locally.
no code implementations • 14 Oct 2021 • Rongjie Huang, Chenye Cui, Feiyang Chen, Yi Ren, Jinglin Liu, Zhou Zhao, Baoxing Huai, Zhefeng Wang
In this work, we propose SingGAN, a generative adversarial network designed for high-fidelity singing voice synthesis.
no code implementations • 8 Oct 2021 • Yujie Lu, Yingxuan Huang, Shengyu Zhang, Wei Han, Hui Chen, Zhou Zhao, Fei Wu
In this paper, we propose the DMR framework to explicitly model dynamic multi-trends of users' current preference and make predictions based on both the history and future potential trends.
no code implementations • 8 Oct 2021 • Shengyu Zhang, Kun Kuang, Jiezhong Qiu, Jin Yu, Zhou Zhao, Hongxia Yang, Zhongfei Zhang, Fei Wu
The results demonstrate that our method outperforms various SOTA GNNs for stable prediction on graphs with agnostic distribution shift, including shift caused by node labels and attributes.
no code implementations • 6 Oct 2021 • Fuming You, Jingjing Li, Zhou Zhao
A previous solution is test-time normalization, which substitutes the source statistics in BN layers with the target batch statistics.
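The statistic swap described above can be sketched in a few lines of pure Python. This is only a minimal illustration under stated assumptions: the helper names `batch_norm` and `batch_statistics`, the stored source statistics, and the target batch values are all hypothetical, and the learnable scale and shift of a real BN layer are omitted for brevity.

```python
import math


def batch_norm(batch, mean, var, eps=1e-5):
    """Normalize each feature column with the supplied statistics."""
    return [
        [(x - m) / math.sqrt(v + eps) for x, m, v in zip(row, mean, var)]
        for row in batch
    ]


def batch_statistics(batch):
    """Per-feature mean and biased variance computed from the batch itself."""
    n = len(batch)
    columns = list(zip(*batch))
    mean = [sum(col) / n for col in columns]
    var = [sum((x - m) ** 2 for x in col) / n for col, m in zip(columns, mean)]
    return mean, var


# Hypothetical running statistics stored in a BN layer after source training.
source_mean, source_var = [0.0, 0.0], [1.0, 1.0]

# A shifted target-domain batch (made-up numbers).
target_batch = [[4.0, 10.0], [6.0, 14.0], [5.0, 12.0]]

# Standard inference: normalize with the stored source statistics.
with_source_stats = batch_norm(target_batch, source_mean, source_var)

# Test-time normalization: recompute the statistics on the target batch.
target_mean, target_var = batch_statistics(target_batch)
with_target_stats = batch_norm(target_batch, target_mean, target_var)
```

With the source statistics the shifted batch stays far from zero mean, whereas the recomputed target statistics re-center each feature, which is the effect the substitution is meant to achieve.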
3 code implementations • NeurIPS 2021 • Yi Ren, Jinglin Liu, Zhou Zhao
Non-autoregressive text-to-speech (NAR-TTS) models such as FastSpeech 2 and Glow-TTS can synthesize high-quality speech from the given text in parallel.
no code implementations • 29 Sep 2021 • Ziyue Jiang, Yi Ren, Zhou Zhao
In this work, we propose a novel phase-oriented algorithm named PhaseFool that can efficiently construct imperceptible audio adversarial examples with energy dissipation.
no code implementations • 29 Sep 2021 • Rongjie Huang, Max W. Y. Lam, Jun Wang, Dan Su, Dong Yu, Zhou Zhao, Yi Ren
Learning generalizable speech representations for unseen samples in different domains has been a challenge with ever increasing importance to date.
no code implementations • 29 Sep 2021 • Zhijie Lin, Zijian Zhang, Zhou Zhao
Score-based generative models involve sequentially corrupting the data distribution with noise and then learning to recover the data distribution based on score matching.
1 code implementation • 26 Sep 2021 • Jiahao Xun, Shengyu Zhang, Zhou Zhao, Jieming Zhu, Qi Zhang, Jingjie Li, Xiuqiang He, Xiaofei He, Tat-Seng Chua, Fei Wu
In this work, inspired by the fact that users make their click decisions mostly based on the visual impression they perceive when browsing news, we propose to capture such visual impression information with visual-semantic modeling for news recommendation.
no code implementations • 11 Sep 2021 • Shengyu Zhang, Dong Yao, Zhou Zhao, Tat-Seng Chua, Fei Wu
In this paper, we propose to learn accurate and robust user representations, which are required to be less sensitive to (attack on) noisy behaviors and trust more on the indispensable ones, by modeling counterfactual data distribution.
no code implementations • 31 Aug 2021 • Zhijie Lin, Zhou Zhao, Haoyuan Li, Jinglin Liu, Meng Zhang, Xingshan Zeng, Xiaofei He
Lip reading, aiming to recognize spoken sentences according to the given video of lip movements without relying on the audio stream, has attracted great interest due to its application in many scenarios.
1 code implementation • 14 Jul 2021 • Jinglin Liu, Zhiying Zhu, Yi Ren, Wencan Huang, Baoxing Huai, Nicholas Yuan, Zhou Zhao
However, the AR decoding manner generates current lip frame conditioned on frames generated previously, which inherently hinders the inference speed, and also has a detrimental effect on the quality of generated lip frames due to error propagation.
no code implementations • CVPR 2021 • Yawen Zeng, Da Cao, Xiaochi Wei, Meng Liu, Zhou Zhao, Zheng Qin
Toward this end, we contribute a multi-modal relational graph to capture the interactions among objects from the visual and textual content to identify the differences among similar video moment candidates.
no code implementations • CVPR 2021 • Yang Zhao, Zhou Zhao, Zhu Zhang, Zhijie Lin
Temporal video grounding aims to localize the target segment which is semantically aligned with the given sentence in an untrimmed video.
no code implementations • 17 Jun 2021 • Chenye Cui, Yi Ren, Jinglin Liu, Feiyang Chen, Rongjie Huang, Ming Lei, Zhou Zhao
Finally, by showing a comparable performance in the emotional speech synthesis task, we successfully demonstrate the ability of the proposed model.
no code implementations • 2 Jun 2021 • Zhu Zhang, Chang Zhou, Jianxin Ma, Zhijie Lin, Jingren Zhou, Hongxia Yang, Zhou Zhao
Further, we design a history sampler to select informative fragments for rehearsal training, making the memory focus on the crucial information.
5 code implementations • 6 May 2021 • Jinglin Liu, Chengxi Li, Yi Ren, Feiyang Chen, Zhou Zhao
Singing voice synthesis (SVS) systems are built to synthesize high-quality and expressive singing voice, in which the acoustic model generates the acoustic features (e.g., mel-spectrogram) given a music score.
no code implementations • 1 Apr 2021 • Dong Yao, Shengyu Zhang, Zhou Zhao, Wenyan Fan, Jieming Zhu, Xiuqiang He, Fei Wu
Personalized recommendation systems have become pervasive on various video platforms.
no code implementations • 1 Jan 2021 • Zhijie Lin, Zhou Zhao, Zhu Zhang, Huai Baoxing, Jing Yuan
Model-Agnostic Meta-Learning (MAML) (Finn et al., 2017) is one of the most well-known gradient-based meta-learning algorithms; it learns the meta-initialization through inner and outer optimization loops.
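The inner and outer loops can be made concrete with a toy scalar problem. In this sketch, each task is assumed to have the loss 0.5 * (w - a)^2 for a task-specific target a, so the gradients can be written by hand; the function name `maml_scalar`, the learning rates, and the task targets are illustrative assumptions and do not come from the paper.

```python
def maml_scalar(task_targets, inner_lr=0.1, outer_lr=0.5, steps=100, init=0.0):
    """Learn a scalar meta-initialization w for toy tasks with loss 0.5 * (w - a)^2."""
    w = init
    for _ in range(steps):
        meta_grad = 0.0
        for a in task_targets:
            # Inner loop: one gradient step on the task loss, starting from w.
            adapted = w - inner_lr * (w - a)
            # Outer gradient: differentiate the post-adaptation loss w.r.t. w.
            # Since d(adapted)/dw = 1 - inner_lr, the chain rule gives:
            meta_grad += (1.0 - inner_lr) * (adapted - a)
        # Outer loop: update the shared initialization with the averaged gradient.
        w -= outer_lr * meta_grad / len(task_targets)
    return w


# Two hypothetical tasks with targets 1.0 and 3.0.
meta_init = maml_scalar([1.0, 3.0])
```

For this symmetric toy setup the learned initialization settles midway between the task targets, the point from which a single inner step helps both tasks equally.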
no code implementations • 1 Jan 2021 • Zhu Zhang, Chang Zhou, Zhou Zhao, Zhijie Lin, Jingren Zhou, Hongxia Yang
Existing reasoning tasks often follow the setting of "reasoning while experiencing", which has an important assumption that the raw contents can be always accessed while reasoning.
no code implementations • ICCV 2021 • Min Zhang, Yang Guo, Na lei, Zhou Zhao, Jianfeng Wu, Xiaoyin Xu, Yalin Wang, Xianfeng GU
Shape analysis has been playing an important role in early diagnosis and prognosis of neurodegenerative diseases such as Alzheimer's disease (AD).
no code implementations • NeurIPS 2020 • Zhu Zhang, Zhou Zhao, Zhijie Lin, Jieming Zhu, Xiuqiang He
Weakly-supervised vision-language grounding aims to localize a target moment in a video or a specific region in an image according to the given sentence query, where only video-level or image-level sentence annotations are provided during training.
no code implementations • 1 Nov 2020 • Yujie Lu, Shengyu Zhang, Yingxuan Huang, Luyao Wang, Xinyao Yu, Zhou Zhao, Fei Wu
By diverse trends, supposing the future preferences can be diversified, we propose the diverse trends extractor and the time-aware mechanism to represent the possible trends of preferences for a given user with multiple vectors.
no code implementations • 2 Oct 2020 • Shengyu Zhang, Donghui Wang, Zhou Zhao, Siliang Tang, Di Xie, Fei Wu
In this paper, we investigate the problem of text-to-pedestrian synthesis, which has many potential applications in art, design, and video surveillance.
1 code implementation • 19 Aug 2020 • Zhu Zhang, Zhijie Lin, Zhou Zhao, Jieming Zhu, Xiuqiang He
Thus, these methods fail to distinguish the target moment from plausible negative moments.
1 code implementation • 18 Aug 2020 • Yi Ren, Jinzheng He, Xu Tan, Tao Qin, Zhou Zhao, Tie-Yan Liu
To improve harmony, in this paper, we propose a novel MUlti-track MIDI representation (MuMIDI), which enables simultaneous multi-track generation in a single sequence and explicitly models the dependency of the notes from different tracks.
no code implementations • 16 Aug 2020 • Zhu Zhang, Zhou Zhao, Zhijie Lin, Baoxing Huai, Nicholas Jing Yuan
Spatio-temporal video grounding aims to retrieve the spatio-temporal tube of a queried object according to the given sentence.
1 code implementation • 16 Aug 2020 • Shengyu Zhang, Tan Jiang, Tan Wang, Kun Kuang, Zhou Zhao, Jianke Zhu, Jin Yu, Hongxia Yang, Fei Wu
In this paper, we propose to investigate the problem of out-of-domain visio-linguistic pretraining, where the pretraining data distribution differs from that of downstream data on which the pretrained model will be fine-tuned.
1 code implementation • 16 Aug 2020 • Shengyu Zhang, Ziqi Tan, Jin Yu, Zhou Zhao, Kun Kuang, Jie Liu, Jingren Zhou, Hongxia Yang, Fei Wu
Then, based on the aspects of the video-associated product, we perform knowledge-enhanced spatial-temporal inference on those graphs for capturing the dynamic change of fine-grained product-part characteristics.
no code implementations • 6 Aug 2020 • Jinglin Liu, Yi Ren, Zhou Zhao, Chen Zhang, Baoxing Huai, Nicholas Jing Yuan
NAR lipreading is a challenging task that has many difficulties: 1) the discrepancy of sequence lengths between source and target makes it difficult to estimate the length of the output sequence; 2) the conditionally independent behavior of NAR generation lacks the correlation across time which leads to a poor approximation of target distribution; 3) the feature representation ability of encoder can be weak due to lack of effective alignment mechanism; and 4) the removal of AR language model exacerbates the inherent ambiguity problem of lipreading.
no code implementations • 17 Jul 2020 • Jinglin Liu, Yi Ren, Xu Tan, Chen Zhang, Tao Qin, Zhou Zhao, Tie-Yan Liu
SAT contains a hyperparameter k, and each k value defines a SAT task with different degrees of parallelism.
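A small sketch may help show how k controls parallelism. It assumes the usual semi-autoregressive reading in which k consecutive target tokens are produced together at each decoding step; the helper `decoding_groups` and the target length of 6 are hypothetical and only serve the illustration.

```python
def decoding_groups(target_length, k):
    """Split target positions into consecutive groups of at most k tokens.

    Each group stands for the positions produced together in one decoding step.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    return [
        list(range(start, min(start + k, target_length)))
        for start in range(0, target_length, k)
    ]


groups_ar = decoding_groups(6, 1)   # k = 1: one token per step (autoregressive)
groups_sat = decoding_groups(6, 2)  # k = 2: two tokens per step
groups_nar = decoding_groups(6, 6)  # k = length: a single fully parallel step
```

Under this assumption the number of decoding steps is the ceiling of length / k, so larger k means fewer sequential steps and more parallelism.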
no code implementations • 9 Jul 2020 • Yi Ren, Xu Tan, Tao Qin, Jian Luan, Zhou Zhao, Tie-Yan Liu
DeepSinger has several advantages over previous SVS systems: 1) to the best of our knowledge, it is the first SVS system that directly mines training data from music websites, 2) the lyrics-to-singing alignment model further avoids any human efforts for alignment labeling and greatly reduces labeling cost, 3) the singing model based on a feed-forward Transformer is simple and efficient, by removing the complicated acoustic feature modeling in parametric synthesis and leveraging a reference encoder to capture the timbre of a singer from noisy singing data, and 4) it can synthesize singing voices in multiple languages and multiple singers.
no code implementations • ACL 2020 • Yi Ren, Jinglin Liu, Xu Tan, Chen Zhang, Tao Qin, Zhou Zhao, Tie-Yan Liu
In this work, we develop SimulSpeech, an end-to-end simultaneous speech to text translation system which translates speech in source language to text in target language concurrently.
1 code implementation • 24 Jun 2020 • Shengyu Zhang, Ziqi Tan, Jin Yu, Zhou Zhao, Kun Kuang, Tan Jiang, Jingren Zhou, Hongxia Yang, Fei Wu
In e-commerce, consumer-generated videos, which in general deliver consumers' individual preferences for the different aspects of certain products, are massive in volume.
25 code implementations • ICLR 2021 • Yi Ren, Chenxu Hu, Xu Tan, Tao Qin, Sheng Zhao, Zhou Zhao, Tie-Yan Liu
In this paper, we propose FastSpeech 2, which addresses the issues in FastSpeech and better solves the one-to-many mapping problem in TTS by 1) directly training the model with ground-truth target instead of the simplified output from teacher, and 2) introducing more variation information of speech (e.g., pitch, energy and more accurate duration) as conditional inputs.
no code implementations • ACL 2020 • Yi Ren, Jinglin Liu, Xu Tan, Zhou Zhao, Sheng Zhao, Tie-Yan Liu
In this work, we conduct a study to understand the difficulty of NAR sequence generation and try to answer: (1) Why NAR models can catch up with AR models in some tasks but not all?
1 code implementation • 21 Apr 2020 • Yang Sun, Fajie Yuan, Min Yang, Guoao Wei, Zhou Zhao, Duo Liu
Current state-of-the-art sequential recommender models are typically based on a sandwich-structured deep neural network, where one or more middle (hidden) layers are placed between the input embedding layer and output softmax layer.
no code implementations • 29 Feb 2020 • Shengyu Zhang, Tan Jiang, Qinghao Huang, Ziqi Tan, Zhou Zhao, Siliang Tang, Jin Yu, Hongxia Yang, Yi Yang, Fei Wu
Existing image completion procedures are highly subjective because they consider only visual context, which may produce unpredictable results that are plausible but not faithful to grounded knowledge.
1 code implementation • 31 Jan 2020 • Shuwen Xiao, Zhou Zhao, Zijian Zhang, Xiaohui Yan, Min Yang
This paper addresses the task of query-focused video summarization, which takes user's query and a long video as inputs and aims to generate a query-focused video summary.
1 code implementation • CVPR 2020 • Zhu Zhang, Zhou Zhao, Yang Zhao, Qi. Wang, Huasheng Liu, Lianli Gao
In this paper, we consider a novel task, Spatio-Temporal Video Grounding for Multi-Form Sentences (STVG).
no code implementations • 14 Jan 2020 • Boyuan Pan, Yazheng Yang, Zhou Zhao, Yueting Zhuang, Deng Cai
Neural Machine Translation (NMT) has become a popular technology in recent years, and the encoder-decoder framework is the mainstream among all the methods.
no code implementations • IEEE Transactions on Cybernetics 2020 • Wei Zhao, Benyou Wang, Min Yang, Jianbo Ye, Zhou Zhao, Xiaojun Chen, Ying Shen
Movie recommendation systems provide users with ranked lists of movies based on an individual's preferences and constraints.
no code implementations • 19 Nov 2019 • Zhijie Lin, Zhou Zhao, Zhu Zhang, Qi. Wang, Huasheng Liu
Video moment retrieval aims to find the moment that is most relevant to the given natural language query.
no code implementations • IJCNLP 2019 • Weike Jin, Zhou Zhao, Mao Gu, Jun Xiao, Furu Wei, Yueting Zhuang
Video dialog is a new and challenging task, which requires the agent to answer questions combining video information with dialog history.
2 code implementations • 5 Sep 2019 • Hongyang Xue, Wenqing Chu, Zhou Zhao, Deng Cai
We propose a new attention model for video question answering.
1 code implementation • 27 Aug 2019 • Yinwei Wei, Zhiyong Cheng, Xuzheng Yu, Zhou Zhao, Lei Zhu, Liqiang Nie
The hashtags that a user provides for a post (e.g., a micro-video) are the ones that, in her mind, best describe the post content she is interested in.
1 code implementation • ACL 2018 • Boyuan Pan, Yazheng Yang, Zhou Zhao, Yueting Zhuang, Deng Cai, Xiaofei He
We observe that people usually use some discourse markers such as "so" or "but" to represent the logical relationship between two sentences.
Ranked #12 on Natural Language Inference on SNLI
no code implementations • NeurIPS 2018 • Boyuan Pan, Yazheng Yang, Hao Li, Zhou Zhao, Yueting Zhuang, Deng Cai, Xiaofei He
In this paper, we transfer knowledge learned from machine comprehension to the sequence-to-sequence tasks to deepen the understanding of the text.
no code implementations • 1 Jul 2019 • Yutong Wang, Jiyuan Zheng, Qijiong Liu, Zhou Zhao, Jun Xiao, Yueting Zhuang
More specifically, we devise a discriminator, Relation Guider, to capture the relations between the whole passage and the associated answer and then the Multi-Interaction mechanism is deployed to transfer the knowledge dynamically for our question generation system.
no code implementations • 28 Jun 2019 • Zhu Zhang, Zhou Zhao, Zhijie Lin, Jingkuan Song, Deng Cai
Thus, we consider a new task to localize unseen activities in videos via image queries, named Image-Based Activity Localization.
no code implementations • 28 Jun 2019 • Zhu Zhang, Zhou Zhao, Zhijie Lin, Jingkuan Song, Xiaofei He
Concretely, we first develop a hierarchical convolutional self-attention encoder to efficiently model long-form video contents, which builds the hierarchical structure for video sequences and captures question-aware long-range dependencies from video context.
1 code implementation • 16 Jun 2019 • Lianli Gao, Xiaosu Zhu, Jingkuan Song, Zhou Zhao, Heng Tao Shen
In this work, we propose a deep progressive quantization (DPQ) model, as an alternative to PQ, for large scale image retrieval.
1 code implementation • 6 Jun 2019 • Zhou Yu, Dejing Xu, Jun Yu, Ting Yu, Zhou Zhao, Yueting Zhuang, DaCheng Tao
It is both crucial and natural to extend this research direction to the video domain for video question answering (VideoQA).
Ranked #4 on Video Question Answering on ActivityNet-QA
1 code implementation • 6 Jun 2019 • Zhu Zhang, Zhijie Lin, Zhou Zhao, Zhenxin Xiao
Query-based moment retrieval aims to localize the most relevant moment in an untrimmed video according to the given natural language query.
10 code implementations • 22 May 2019 • Yi Ren, Yangjun Ruan, Xu Tan, Tao Qin, Sheng Zhao, Zhou Zhao, Tie-Yan Liu
Compared with traditional concatenative and statistical parametric approaches, neural network based end-to-end models suffer from slow inference speed, and the synthesized speech is usually not robust (i.e., some words are skipped or repeated) and lacks controllability (voice speed or prosody control).
18 code implementations • NeurIPS 2019 • Yi Ren, Yangjun Ruan, Xu Tan, Tao Qin, Sheng Zhao, Zhou Zhao, Tie-Yan Liu
In this work, we propose a novel feed-forward network based on Transformer to generate mel-spectrogram in parallel for TTS.
Ranked #9 on Text-To-Speech Synthesis on LJSpeech
no code implementations • 13 May 2019 • Yi Ren, Xu Tan, Tao Qin, Sheng Zhao, Zhou Zhao, Tie-Yan Liu
Text to speech (TTS) and automatic speech recognition (ASR) are two dual tasks in speech processing, and both achieve impressive performance thanks to recent advances in deep learning and the large amount of aligned speech and text data.
1 code implementation • ICLR 2019 • Xu Tan, Yi Ren, Di He, Tao Qin, Zhou Zhao, Tie-Yan Liu
Multilingual machine translation, which translates multiple languages with a single model, has attracted much attention due to its efficiency of offline training and online serving.
2 code implementations • 17 Nov 2018 • Yao Wan, Zhou Zhao, Min Yang, Guandong Xu, Haochao Ying, Jian Wu, Philip S. Yu
To the best of our knowledge, most state-of-the-art approaches follow an encoder-decoder framework which encodes the code into a hidden space and then decodes it into natural language space, suffering from two major drawbacks: a) their encoders only consider the sequential content of code, ignoring the tree structure which is also critical for the task of code summarization; b) their decoders are typically trained to predict the next word by maximizing the likelihood of the next ground-truth word given the previous ground-truth word.
no code implementations • 12 Nov 2018 • Yao Wan, Wenqiang Yan, Jianwei Gao, Zhou Zhao, Jian Wu, Philip S. Yu
Dialogue Act (DA) classification is a challenging problem in dialogue interpretation, which aims to attach semantic labels to utterances and characterize the speaker's intention.
Ranked #4 on Dialogue Act Classification on Switchboard corpus
no code implementations • 1 Nov 2018 • Haojie Pan, Junpei Zhou, Zhou Zhao, Yan Liu, Deng Cai, Min Yang
We first propose a new task named Dialogue Description (Dial2Desc).
no code implementations • 24 Oct 2018 • Zhou Zhao, Hanbing Zhan, Lingtao Meng, Jun Xiao, Jun Yu, Min Yang, Fei Wu, Deng Cai
In this paper, we study the problem of image retweet prediction in social media, which predicts the image sharing behavior that the user reposts the image tweets from their followees.
1 code implementation • 9 May 2018 • Zhou Yu, Jun Yu, Chenchao Xiang, Zhou Zhao, Qi Tian, DaCheng Tao
Visual grounding aims to localize an object in an image referred to by a textual query phrase.
Ranked #7 on Phrase Grounding on Flickr30k Entities Test
4 code implementations • EMNLP 2018 • Wei Zhao, Jianbo Ye, Min Yang, Zeyang Lei, Suofei Zhang, Zhou Zhao
In this study, we explore capsule networks with dynamic routing for text classification.
Ranked #6 on Sentiment Analysis on MR
no code implementations • SIGIR 2018 • Zheqian Chen, Rongqin Yang, Zhou Zhao, Deng Cai, Xiaofei He
Dialogue Act Recognition (DAR) is a challenging problem in dialogue interpretation, which aims to attach semantic labels to utterances and characterize the speaker's intention.
no code implementations • 1 Nov 2017 • Boyuan Pan, Hao Li, Zhou Zhao, Deng Cai, Xiaofei He
In this paper, we propose a novel neural network system that consists of a Demand Optimization Model based on a passage-attention neural machine translation and a Reader Model that can find the answer given the optimized question.
no code implementations • 8 Oct 2017 • Zheqian Chen, Rongqin Yang, Bin Cao, Zhou Zhao, Deng Cai, Xiaofei He
Machine Comprehension (MC) is a challenging task in the field of Natural Language Processing, which aims to guide the machine to comprehend a passage and answer the given question.
Ranked #34 on Question Answering on SQuAD1.1 dev
no code implementations • EMNLP 2017 • Min Yang, Jincheng Mei, Heng Ji, Wei Zhao, Zhou Zhao, Xiaojun Chen
We study the problem of identifying the topics and sentiments and tracking their shifts from social media texts in different geographical regions during emergencies and disasters.
no code implementations • 28 Jul 2017 • Boyuan Pan, Hao Li, Zhou Zhao, Bin Cao, Deng Cai, Xiaofei He
Machine comprehension (MC) style question answering is a representative problem in natural language processing.
Ranked #19 on Question Answering on TriviaQA
no code implementations • 20 Jul 2017 • Yunan Ye, Zhou Zhao, Yimeng Li, Long Chen, Jun Xiao, Yueting Zhuang
Video Question Answering is a challenging problem in visual information retrieval, which provides the answer to the referenced video content according to the question.
no code implementations • 3 May 2017 • Hongyang Xue, Zhou Zhao, Deng Cai
Then we propose a TGIF-QA dataset for video question answering with the help of automatic question generation.
no code implementations • 24 Nov 2016 • Zheqian Chen, Chi Zhang, Zhou Zhao, Deng Cai
The challenge in this task lies in the lexical gaps between questions caused by word ambiguity and word mismatch.
no code implementations • 24 Nov 2016 • Zheqian Chen, Ben Gao, Huimin Zhang, Zhou Zhao, Deng Cai
However, it is not effective for users to attain answers within minutes.