Search Results for author: Zhou Zhao

Found 97 papers, 32 papers with code

Prior Knowledge and Memory Enriched Transformer for Sign Language Translation

no code implementations Findings (ACL) 2022 Tao Jin, Zhou Zhao, Meng Zhang, Xingshan Zeng

This paper attacks the challenging problem of sign language translation (SLT), which involves not only visual and textual understanding but also additional prior knowledge learning (i.e., performing style, syntax).

POS Sign Language Translation +1

AntPivot: Livestream Highlight Detection via Hierarchical Attention Mechanism

no code implementations 10 Jun 2022 Yang Zhao, Xuan Lin, Wenqiang Xu, Maozong Zheng, Zhengyong Liu, Zhou Zhao

In recent days, streaming technology has greatly promoted the development in the field of livestream.

Highlight Detection

Dict-TTS: Learning to Pronounce with Prior Dictionary Knowledge for Text-to-Speech

no code implementations 5 Jun 2022 Ziyue Jiang, Su Zhe, Zhou Zhao, Qian Yang, Yi Ren, Jinglin Liu, Zhenhui Ye

This paper tackles the polyphone disambiguation problem from a concise and novel perspective: we propose Dict-TTS, a semantic-aware generative text-to-speech model with an online website dictionary (the existing prior information in the natural language).

Polyphone disambiguation

TranSpeech: Speech-to-Speech Translation With Bilateral Perturbation

no code implementations 25 May 2022 Rongjie Huang, Zhou Zhao, Jinglin Liu, Huadai Liu, Yi Ren, Lichao Zhang, Jinzheng He

To alleviate the acoustic multimodal problem, we propose bilateral perturbation, which consists of the style normalization and information enhancement stages, to learn only the linguistic information from speech samples and generate more deterministic representations.

Representation Learning Speech Synthesis +2

GenerSpeech: Towards Style Transfer for Generalizable Out-Of-Domain Text-to-Speech Synthesis

no code implementations 15 May 2022 Rongjie Huang, Yi Ren, Jinglin Liu, Chenye Cui, Zhou Zhao

Style transfer for out-of-domain (OOD) speech synthesis aims to generate speech samples with unseen style (e.g., speaker identity, emotion, and prosody) derived from an acoustic reference, while facing the following challenges: 1) The highly dynamic style features in expressive voice are difficult to model and transfer; and 2) the TTS models should be robust enough to handle diverse OOD conditions that differ from the source data.

Speech Synthesis Style Transfer +1

FastDiff: A Fast Conditional Diffusion Model for High-Quality Speech Synthesis

1 code implementation 21 Apr 2022 Rongjie Huang, Max W. Y. Lam, Jun Wang, Dan Su, Dong Yu, Yi Ren, Zhou Zhao

Also, FastDiff enables a sampling speed of 58x faster than real-time on a V100 GPU, making diffusion models practically applicable to speech synthesis deployment for the first time.

Denoising Speech Synthesis +1

Fine-Grained Predicates Learning for Scene Graph Generation

no code implementations CVPR 2022 Xinyu Lyu, Lianli Gao, Yuyu Guo, Zhou Zhao, Hao Huang, Heng Tao Shen, Jingkuan Song

The performance of current Scene Graph Generation models is severely hampered by some hard-to-distinguish predicates, e.g., "woman-on/standing on/walking on-beach" or "woman-near/looking at/in front of-child".

Fine-Grained Image Classification Graph Generation +2

Contrastive Learning with Positive-Negative Frame Mask for Music Representation

no code implementations 17 Mar 2022 Dong Yao, Zhou Zhao, Shengyu Zhang, Jieming Zhu, Yudong Zhu, Rui Zhang, Xiuqiang He

We devise a novel contrastive learning objective to accommodate both self-augmented positives/negatives sampled from the same music.

Contrastive Learning Cover song identification +2

End-to-End Modeling via Information Tree for One-Shot Natural Language Spatial Video Grounding

no code implementations ACL 2022 Mengze Li, Tianbao Wang, Haoyu Zhang, Shengyu Zhang, Zhou Zhao, Jiaxu Miao, Wenqiao Zhang, Wenming Tan, Jin Wang, Peng Wang, ShiLiang Pu, Fei Wu

To achieve effective grounding under a limited annotation budget, we investigate one-shot video grounding, and learn to ground natural language in all video frames with solely one frame labeled, in an end-to-end manner.

Representation Learning Video Grounding

Learning the Beauty in Songs: Neural Singing Voice Beautifier

3 code implementations ACL 2022 Jinglin Liu, Chengxi Li, Yi Ren, Zhiying Zhu, Zhou Zhao

Furthermore, we propose a latent-mapping algorithm in the latent space to convert the amateur vocal tone to the professional one.

Dynamic Time Warping

Revisiting Over-Smoothness in Text to Speech

no code implementations ACL 2022 Yi Ren, Xu Tan, Tao Qin, Zhou Zhao, Tie-Yan Liu

Then we conduct a comprehensive study on NAR-TTS models that use some advanced modeling methods.

VLAD-VSA: Cross-Domain Face Presentation Attack Detection with Vocabulary Separation and Adaptation

1 code implementation 21 Feb 2022 Jiong Wang, Zhou Zhao, Weike Jin, Xinyu Duan, Zhen Lei, Baoxing Huai, Yiling Wu, Xiaofei He

In this paper, the VLAD aggregation method is adopted to quantize local features with visual vocabulary locally partitioning the feature space, and hence preserve the local discriminability.

Face Presentation Attack Detection

ProsoSpeech: Enhancing Prosody With Quantized Vector Pre-training in Text-to-Speech

no code implementations 16 Feb 2022 Yi Ren, Ming Lei, Zhiying Huang, Shiliang Zhang, Qian Chen, Zhijie Yan, Zhou Zhao

Specifically, we first introduce a word-level prosody encoder, which quantizes the low-frequency band of the speech and compresses prosody attributes in the latent prosody vector (LPV).

MR-SVS: Singing Voice Synthesis with Multi-Reference Encoder

no code implementations 11 Jan 2022 Shoutong Wang, Jinglin Liu, Yi Ren, Zhen Wang, Changliang Xu, Zhou Zhao

However, they face several challenges: 1) the fixed-size speaker embedding is not powerful enough to capture full details of the target timbre; 2) single reference audio does not contain sufficient timbre information of the target speaker; 3) the pitch inconsistency between different speakers also leads to a degradation in the generated voice.

Cross-Modal Background Suppression for Audio-Visual Event Localization

1 code implementation CVPR 2022 Yan Xia, Zhou Zhao

Audiovisual Event (AVE) localization requires the model to jointly localize an event by observing audio and visual information.

audio-visual event localization

Wnet: Audio-Guided Video Object Segmentation via Wavelet-Based Cross-Modal Denoising Networks

1 code implementation CVPR 2022 Wenwen Pan, Haonan Shi, Zhou Zhao, Jieming Zhu, Xiuqiang He, Zhigeng Pan, Lianli Gao, Jun Yu, Fei Wu, Qi Tian

Audio-Guided video semantic segmentation is a challenging problem in visual analysis and editing, which automatically separates foreground objects from background in a video sequence according to the referring audio expressions.

Denoising Semantic Segmentation +2

MLSLT: Towards Multilingual Sign Language Translation

no code implementations CVPR 2022 Aoxiong Yin, Zhou Zhao, Weike Jin, Meng Zhang, Xingshan Zeng, Xiaofei He

In addition, we also explore zero-shot translation in sign language and find that our model can achieve comparable performance to the supervised BSLT model on some language pairs.

Sign Language Translation Translation

Multi-Singer: Fast Multi-Singer Singing Voice Vocoder With A Large-Scale Corpus

1 code implementation MM '21: Proceedings of the 29th ACM International Conference on Multimedia 2021 Rongjie Huang, Feiyang Chen, Yi Ren, Jinglin Liu, Chenye Cui, Zhou Zhao

High-fidelity multi-singer singing voice synthesis is challenging for neural vocoder due to the singing voice data shortage, limited singer generalization, and large computational cost.

Audio Generation Text-To-Speech Synthesis

SimulSLT: End-to-End Simultaneous Sign Language Translation

no code implementations 8 Dec 2021 Aoxiong Yin, Zhou Zhao, Jinglin Liu, Weike Jin, Meng Zhang, Xingshan Zeng, Xiaofei He

Sign language translation, a technology with profound social significance, has attracted growing interest from researchers in recent years.

Sign Language Translation Translation

Generalizable Multi-linear Attention Network

no code implementations NeurIPS 2021 Tao Jin, Zhou Zhao

The majority of existing multimodal sequential learning methods focus on how to obtain effective representations and ignore the importance of multimodal fusion.

Multimodal Sentiment Analysis Video Retrieval

FedSpeech: Federated Text-to-Speech with Continual Learning

no code implementations 14 Oct 2021 Ziyue Jiang, Yi Ren, Ming Lei, Zhou Zhao

Federated learning enables collaborative training of machine learning models under strict privacy restrictions and federated text-to-speech aims to synthesize natural speech of multiple users with a few audio training samples stored in their devices locally.

Continual Learning Federated Learning

SingGAN: Generative Adversarial Network For High-Fidelity Singing Voice Generation

no code implementations 14 Oct 2021 Rongjie Huang, Chenye Cui, Feiyang Chen, Yi Ren, Jinglin Liu, Zhou Zhao, Baoxing Huai, Zhefeng Wang

In this work, we propose SingGAN, a generative adversarial network designed for high-fidelity singing voice synthesis.

Speech Synthesis

Multi-trends Enhanced Dynamic Micro-video Recommendation

no code implementations 8 Oct 2021 Yujie Lu, Yingxuan Huang, Shengyu Zhang, Wei Han, Hui Chen, Zhou Zhao, Fei Wu

In this paper, we propose the DMR framework to explicitly model dynamic multi-trends of users' current preference and make predictions based on both the history and future potential trends.

Recommendation Systems

Stable Prediction on Graphs with Agnostic Distribution Shift

no code implementations 8 Oct 2021 Shengyu Zhang, Kun Kuang, Jiezhong Qiu, Jin Yu, Zhou Zhao, Hongxia Yang, Zhongfei Zhang, Fei Wu

The results demonstrate that our method outperforms various SOTA GNNs for stable prediction on graphs with agnostic distribution shift, including shift caused by node labels and attributes.

Graph Learning Recommendation Systems

Test-time Batch Statistics Calibration for Covariate Shift

no code implementations 6 Oct 2021 Fuming You, Jingjing Li, Zhou Zhao

A previous solution is test-time normalization, which substitutes the source statistics in BN layers with the target batch statistics.

Domain Generalization Image Classification +1

PortaSpeech: Portable and High-Quality Generative Text-to-Speech

3 code implementations NeurIPS 2021 Yi Ren, Jinglin Liu, Zhou Zhao

Non-autoregressive text-to-speech (NAR-TTS) models such as FastSpeech 2 and Glow-TTS can synthesize high-quality speech from the given text in parallel.

Text-To-Speech Synthesis Word Alignment

PhaseFool: Phase-oriented Audio Adversarial Examples via Energy Dissipation

no code implementations 29 Sep 2021 Ziyue Jiang, Yi Ren, Zhou Zhao

In this work, we propose a novel phase-oriented algorithm named PhaseFool that can efficiently construct imperceptible audio adversarial examples with energy dissipation.

Automatic Speech Recognition

SynCLR: A Synthesis Framework for Contrastive Learning of out-of-domain Speech Representations

no code implementations 29 Sep 2021 Rongjie Huang, Max W. Y. Lam, Jun Wang, Dan Su, Dong Yu, Zhou Zhao, Yi Ren

Learning generalizable speech representations for unseen samples in different domains has been a challenge with ever increasing importance to date.

Contrastive Learning Data Augmentation +4

ST-DDPM: Explore Class Clustering for Conditional Diffusion Probabilistic Models

no code implementations 29 Sep 2021 Zhijie Lin, Zijian Zhang, Zhou Zhao

Score-based generative models involve sequentially corrupting the data distribution with noise and then learning to recover the data distribution based on score matching.

Conditional Image Generation

Why Do We Click: Visual Impression-aware News Recommendation

1 code implementation 26 Sep 2021 Jiahao Xun, Shengyu Zhang, Zhou Zhao, Jieming Zhu, Qi Zhang, Jingjie Li, Xiuqiang He, Xiaofei He, Tat-Seng Chua, Fei Wu

In this work, inspired by the fact that users make their click decisions mostly based on the visual impression they perceive when browsing news, we propose to capture such visual impression information with visual-semantic modeling for news recommendation.

Decision Making News Recommendation

CauseRec: Counterfactual User Sequence Synthesis for Sequential Recommendation

no code implementations 11 Sep 2021 Shengyu Zhang, Dong Yao, Zhou Zhao, Tat-Seng Chua, Fei Wu

In this paper, we propose to learn accurate and robust user representations, which are required to be less sensitive to (attack on) noisy behaviors and trust more on the indispensable ones, by modeling counterfactual data distribution.

Representation Learning Sequential Recommendation

SimulLR: Simultaneous Lip Reading Transducer with Attention-Guided Adaptive Memory

no code implementations 31 Aug 2021 Zhijie Lin, Zhou Zhao, Haoyuan Li, Jinglin Liu, Meng Zhang, Xingshan Zeng, Xiaofei He

Lip reading, aiming to recognize spoken sentences according to the given video of lip movements without relying on the audio stream, has attracted great interest due to its application in many scenarios.

Lip Reading

Parallel and High-Fidelity Text-to-Lip Generation

1 code implementation 14 Jul 2021 Jinglin Liu, Zhiying Zhu, Yi Ren, Wencan Huang, Baoxing Huai, Nicholas Yuan, Zhou Zhao

However, the AR decoding manner generates current lip frame conditioned on frames generated previously, which inherently hinders the inference speed, and also has a detrimental effect on the quality of generated lip frames due to error propagation.

Talking Face Generation Text-to-Face Generation

Multi-Modal Relational Graph for Cross-Modal Video Moment Retrieval

no code implementations CVPR 2021 Yawen Zeng, Da Cao, Xiaochi Wei, Meng Liu, Zhou Zhao, Zheng Qin

Toward this end, we contribute a multi-modal relational graph to capture the interactions among objects from the visual and textual content to identify the differences among similar video moment candidates.

Cross-Modal Retrieval Graph Matching +1

Cascaded Prediction Network via Segment Tree for Temporal Video Grounding

no code implementations CVPR 2021 Yang Zhao, Zhou Zhao, Zhu Zhang, Zhijie Lin

Temporal video grounding aims to localize the target segment which is semantically aligned with the given sentence in an untrimmed video.

Video Grounding

EMOVIE: A Mandarin Emotion Speech Dataset with a Simple Emotional Text-to-Speech Model

no code implementations 17 Jun 2021 Chenye Cui, Yi Ren, Jinglin Liu, Feiyang Chen, Rongjie Huang, Ming Lei, Zhou Zhao

Finally, by showing a comparable performance in the emotional speech synthesis task, we successfully demonstrate the ability of the proposed model.

Emotional Speech Synthesis Emotion Classification

Learning to Rehearse in Long Sequence Memorization

no code implementations 2 Jun 2021 Zhu Zhang, Chang Zhou, Jianxin Ma, Zhijie Lin, Jingren Zhou, Hongxia Yang, Zhou Zhao

Further, we design a history sampler to select informative fragments for rehearsal training, making the memory focus on the crucial information.

Question Answering Video Question Answering

DiffSinger: Singing Voice Synthesis via Shallow Diffusion Mechanism

5 code implementations 6 May 2021 Jinglin Liu, Chengxi Li, Yi Ren, Feiyang Chen, Zhou Zhao

Singing voice synthesis (SVS) systems are built to synthesize high-quality and expressive singing voice, in which the acoustic model generates the acoustic features (e.g., mel-spectrogram) given a music score.

Text-To-Speech Synthesis

To Learn Effective Features: Understanding the Task-Specific Adaptation of MAML

no code implementations 1 Jan 2021 Zhijie Lin, Zhou Zhao, Zhu Zhang, Huai Baoxing, Jing Yuan

Model-Agnostic Meta-Learning (MAML) (Finn et al., 2017) is one of the most well-known gradient-based meta-learning algorithms, which learns the meta-initialization through inner and outer optimization loops.

Contrastive Learning Meta-Learning

Continual Memory: Can We Reason After Long-Term Memorization?

no code implementations 1 Jan 2021 Zhu Zhang, Chang Zhou, Zhou Zhao, Zhijie Lin, Jingren Zhou, Hongxia Yang

Existing reasoning tasks often follow the setting of "reasoning while experiencing", which has an important assumption that the raw contents can be always accessed while reasoning.

Cortical Surface Shape Analysis Based on Alexandrov Polyhedra

no code implementations ICCV 2021 Min Zhang, Yang Guo, Na Lei, Zhou Zhao, Jianfeng Wu, Xiaoyin Xu, Yalin Wang, Xianfeng Gu

Shape analysis has been playing an important role in early diagnosis and prognosis of neurodegenerative diseases such as Alzheimer's disease (AD).

Counterfactual Contrastive Learning for Weakly-Supervised Vision-Language Grounding

no code implementations NeurIPS 2020 Zhu Zhang, Zhou Zhao, Zhijie Lin, Jieming Zhu, Xiuqiang He

Weakly-supervised vision-language grounding aims to localize a target moment in a video or a specific region in an image according to the given sentence query, where only video-level or image-level sentence annotations are provided during training.

Contrastive Learning

Future-Aware Diverse Trends Framework for Recommendation

no code implementations 1 Nov 2020 Yujie Lu, Shengyu Zhang, Yingxuan Huang, Luyao Wang, Xinyao Yu, Zhou Zhao, Fei Wu

By diverse trends, supposing the future preferences can be diversified, we propose the diverse trends extractor and the time-aware mechanism to represent the possible trends of preferences for a given user with multiple vectors.

Representation Learning Sequential Recommendation

MGD-GAN: Text-to-Pedestrian generation through Multi-Grained Discrimination

no code implementations 2 Oct 2020 Shengyu Zhang, Donghui Wang, Zhou Zhao, Siliang Tang, Di Xie, Fei Wu

In this paper, we investigate the problem of text-to-pedestrian synthesis, which has many potential applications in art, design, and video surveillance.

Image Generation

PopMAG: Pop Music Accompaniment Generation

1 code implementation 18 Aug 2020 Yi Ren, Jinzheng He, Xu Tan, Tao Qin, Zhou Zhao, Tie-Yan Liu

To improve harmony, in this paper, we propose a novel MUlti-track MIDI representation (MuMIDI), which enables simultaneous multi-track generation in a single sequence and explicitly models the dependency of the notes from different tracks.

Music Modeling

Object-Aware Multi-Branch Relation Networks for Spatio-Temporal Video Grounding

no code implementations 16 Aug 2020 Zhu Zhang, Zhou Zhao, Zhijie Lin, Baoxing Huai, Nicholas Jing Yuan

Spatio-temporal video grounding aims to retrieve the spatio-temporal tube of a queried object according to the given sentence.

Spatio-Temporal Video Grounding Video Grounding

DeVLBert: Learning Deconfounded Visio-Linguistic Representations

1 code implementation 16 Aug 2020 Shengyu Zhang, Tan Jiang, Tan Wang, Kun Kuang, Zhou Zhao, Jianke Zhu, Jin Yu, Hongxia Yang, Fei Wu

In this paper, we propose to investigate the problem of out-of-domain visio-linguistic pretraining, where the pretraining data distribution differs from that of downstream data on which the pretrained model will be fine-tuned.

Image Retrieval Question Answering +1

Poet: Product-oriented Video Captioner for E-commerce

1 code implementation 16 Aug 2020 Shengyu Zhang, Ziqi Tan, Jin Yu, Zhou Zhao, Kun Kuang, Jie Liu, Jingren Zhou, Hongxia Yang, Fei Wu

Then, based on the aspects of the video-associated product, we perform knowledge-enhanced spatial-temporal inference on those graphs for capturing the dynamic change of fine-grained product-part characteristics.

Video Captioning

FastLR: Non-Autoregressive Lipreading Model with Integrate-and-Fire

no code implementations 6 Aug 2020 Jinglin Liu, Yi Ren, Zhou Zhao, Chen Zhang, Baoxing Huai, Nicholas Jing Yuan

NAR lipreading is a challenging task that has many difficulties: 1) the discrepancy of sequence lengths between source and target makes it difficult to estimate the length of the output sequence; 2) the conditionally independent behavior of NAR generation lacks the correlation across time which leads to a poor approximation of target distribution; 3) the feature representation ability of encoder can be weak due to lack of effective alignment mechanism; and 4) the removal of AR language model exacerbates the inherent ambiguity problem of lipreading.

Language Modelling Lipreading

DeepSinger: Singing Voice Synthesis with Data Mined From the Web

no code implementations 9 Jul 2020 Yi Ren, Xu Tan, Tao Qin, Jian Luan, Zhou Zhao, Tie-Yan Liu

DeepSinger has several advantages over previous SVS systems: 1) to the best of our knowledge, it is the first SVS system that directly mines training data from music websites, 2) the lyrics-to-singing alignment model further avoids any human efforts for alignment labeling and greatly reduces labeling cost, 3) the singing model based on a feed-forward Transformer is simple and efficient, by removing the complicated acoustic feature modeling in parametric synthesis and leveraging a reference encoder to capture the timbre of a singer from noisy singing data, and 4) it can synthesize singing voices in multiple languages and multiple singers.

SimulSpeech: End-to-End Simultaneous Speech to Text Translation

no code implementations ACL 2020 Yi Ren, Jinglin Liu, Xu Tan, Chen Zhang, Tao Qin, Zhou Zhao, Tie-Yan Liu

In this work, we develop SimulSpeech, an end-to-end simultaneous speech to text translation system which translates speech in source language to text in target language concurrently.

Automatic Speech Recognition Knowledge Distillation +3

Comprehensive Information Integration Modeling Framework for Video Titling

1 code implementation 24 Jun 2020 Shengyu Zhang, Ziqi Tan, Jin Yu, Zhou Zhao, Kun Kuang, Tan Jiang, Jingren Zhou, Hongxia Yang, Fei Wu

In e-commerce, consumer-generated videos, which in general deliver consumers' individual preferences for the different aspects of certain products, are massive in volume.

Video Captioning

FastSpeech 2: Fast and High-Quality End-to-End Text to Speech

25 code implementations ICLR 2021 Yi Ren, Chenxu Hu, Xu Tan, Tao Qin, Sheng Zhao, Zhou Zhao, Tie-Yan Liu

In this paper, we propose FastSpeech 2, which addresses the issues in FastSpeech and better solves the one-to-many mapping problem in TTS by 1) directly training the model with ground-truth target instead of the simplified output from teacher, and 2) introducing more variation information of speech (e.g., pitch, energy and more accurate duration) as conditional inputs.

Knowledge Distillation Speech Synthesis

A Study of Non-autoregressive Model for Sequence Generation

no code implementations ACL 2020 Yi Ren, Jinglin Liu, Xu Tan, Zhou Zhao, Sheng Zhao, Tie-Yan Liu

In this work, we conduct a study to understand the difficulty of NAR sequence generation and try to answer: (1) Why NAR models can catch up with AR models in some tasks but not all?

Automatic Speech Recognition Knowledge Distillation +1

A Generic Network Compression Framework for Sequential Recommender Systems

1 code implementation 21 Apr 2020 Yang Sun, Fajie Yuan, Min Yang, Guoao Wei, Zhou Zhao, Duo Liu

Current state-of-the-art sequential recommender models are typically based on a sandwich-structured deep neural network, where one or more middle (hidden) layers are placed between the input embedding layer and output softmax layer.

Sequential Recommendation

Grounded and Controllable Image Completion by Incorporating Lexical Semantics

no code implementations 29 Feb 2020 Shengyu Zhang, Tan Jiang, Qinghao Huang, Ziqi Tan, Zhou Zhao, Siliang Tang, Jin Yu, Hongxia Yang, Yi Yang, Fei Wu

Existing image completion procedure is highly subjective by considering only visual context, which may trigger unpredictable results which are plausible but not faithful to a grounded knowledge.

Convolutional Hierarchical Attention Network for Query-Focused Video Summarization

1 code implementation 31 Jan 2020 Shuwen Xiao, Zhou Zhao, Zijian Zhang, Xiaohui Yan, Min Yang

This paper addresses the task of query-focused video summarization, which takes user's query and a long video as inputs and aims to generate a query-focused video summary.

Video Summarization

Bi-Decoder Augmented Network for Neural Machine Translation

no code implementations 14 Jan 2020 Boyuan Pan, Yazheng Yang, Zhou Zhao, Yueting Zhuang, Deng Cai

Neural Machine Translation (NMT) has become a popular technology in recent years, and the encoder-decoder framework is the mainstream among all the methods.

Machine Translation Translation

Video Dialog via Progressive Inference and Cross-Transformer

no code implementations IJCNLP 2019 Weike Jin, Zhou Zhao, Mao Gu, Jun Xiao, Furu Wei, Yueting Zhuang

Video dialog is a new and challenging task, which requires the agent to answer questions combining video information with dialog history.

Answer Generation Question Answering +3

Personalized Hashtag Recommendation for Micro-videos

1 code implementation 27 Aug 2019 Yinwei Wei, Zhiyong Cheng, Xuzheng Yu, Zhou Zhao, Lei Zhu, Liqiang Nie

The hashtags that a user attaches to a post (e.g., a micro-video) are the ones that, in her mind, best describe the post content she is interested in.

Weak Supervision Enhanced Generative Network for Question Generation

no code implementations 1 Jul 2019 Yutong Wang, Jiyuan Zheng, Qijiong Liu, Zhou Zhao, Jun Xiao, Yueting Zhuang

More specifically, we devise a discriminator, Relation Guider, to capture the relations between the whole passage and the associated answer and then the Multi-Interaction mechanism is deployed to transfer the knowledge dynamically for our question generation system.

Question Answering Question Generation

Localizing Unseen Activities in Video via Image Query

no code implementations 28 Jun 2019 Zhu Zhang, Zhou Zhao, Zhijie Lin, Jingkuan Song, Deng Cai

Thus, we consider a new task to localize unseen activities in videos via image queries, named Image-Based Activity Localization.

Action Localization Video Understanding

Open-Ended Long-Form Video Question Answering via Hierarchical Convolutional Self-Attention Networks

no code implementations 28 Jun 2019 Zhu Zhang, Zhou Zhao, Zhijie Lin, Jingkuan Song, Xiaofei He

Concretely, we first develop a hierarchical convolutional self-attention encoder to efficiently model long-form video contents, which builds the hierarchical structure for video sequences and captures question-aware long-range dependencies from video context.

Answer Generation Question Answering +1

Beyond Product Quantization: Deep Progressive Quantization for Image Retrieval

1 code implementation 16 Jun 2019 Lianli Gao, Xiaosu Zhu, Jingkuan Song, Zhou Zhao, Heng Tao Shen

In this work, we propose a deep progressive quantization (DPQ) model, as an alternative to PQ, for large scale image retrieval.

Image Retrieval Quantization

Cross-Modal Interaction Networks for Query-Based Moment Retrieval in Videos

1 code implementation 6 Jun 2019 Zhu Zhang, Zhijie Lin, Zhou Zhao, Zhenxin Xiao

Query-based moment retrieval aims to localize the most relevant moment in an untrimmed video according to the given natural language query.

Moment Retrieval Representation Learning

FastSpeech: Fast, Robust and Controllable Text-to-Speech

10 code implementations 22 May 2019 Yi Ren, Yangjun Ruan, Xu Tan, Tao Qin, Sheng Zhao, Zhou Zhao, Tie-Yan Liu

Compared with traditional concatenative and statistical parametric approaches, neural network based end-to-end models suffer from slow inference speed, and the synthesized speech is usually not robust (i.e., some words are skipped or repeated) and lacks controllability (voice speed or prosody control).

Text-To-Speech Synthesis

Almost Unsupervised Text to Speech and Automatic Speech Recognition

no code implementations 13 May 2019 Yi Ren, Xu Tan, Tao Qin, Sheng Zhao, Zhou Zhao, Tie-Yan Liu

Text to speech (TTS) and automatic speech recognition (ASR) are two dual tasks in speech processing and both achieve impressive performance thanks to the recent advance in deep learning and large amount of aligned speech and text data.

Automatic Speech Recognition Denoising

Multilingual Neural Machine Translation with Knowledge Distillation

1 code implementation ICLR 2019 Xu Tan, Yi Ren, Di He, Tao Qin, Zhou Zhao, Tie-Yan Liu

Multilingual machine translation, which translates multiple languages with a single model, has attracted much attention due to its efficiency of offline training and online serving.

Knowledge Distillation Machine Translation +1

Improving Automatic Source Code Summarization via Deep Reinforcement Learning

2 code implementations 17 Nov 2018 Yao Wan, Zhou Zhao, Min Yang, Guandong Xu, Haochao Ying, Jian Wu, Philip S. Yu

To the best of our knowledge, most state-of-the-art approaches follow an encoder-decoder framework which encodes the code into a hidden space and then decodes it into natural language space, suffering from two major drawbacks: a) their encoders only consider the sequential content of code, ignoring the tree structure, which is also critical for the task of code summarization; b) their decoders are typically trained to predict the next word by maximizing the likelihood of the next ground-truth word given the previous ground-truth word.

Code Summarization reinforcement-learning +1

Improved Dynamic Memory Network for Dialogue Act Classification with Adversarial Training

no code implementations 12 Nov 2018 Yao Wan, Wenqiang Yan, Jianwei Gao, Zhou Zhao, Jian Wu, Philip S. Yu

Dialogue Act (DA) classification is a challenging problem in dialogue interpretation, which aims to attach semantic labels to utterances and characterize the speaker's intention.

Classification Dialogue Act Classification +3

Textually Guided Ranking Network for Attentional Image Retweet Modeling

no code implementations 24 Oct 2018 Zhou Zhao, Hanbing Zhan, Lingtao Meng, Jun Xiao, Jun Yu, Min Yang, Fei Wu, Deng Cai

In this paper, we study the problem of image retweet prediction in social media, which predicts the image sharing behavior that the user reposts the image tweets from their followees.

Dialogue Act Recognition via CRF-Attentive Structured Network

no code implementations SIGIR 2018 Zheqian Chen, Rongqin Yang, Zhou Zhao, Deng Cai, Xiaofei He

Dialogue Act Recognition (DAR) is a challenging problem in dialogue interpretation, which aims to attach semantic labels to utterances and characterize the speaker's intention.

Dialogue Act Classification Dialogue Interpretation +1

Keyword-based Query Comprehending via Multiple Optimized-Demand Augmentation

no code implementations 1 Nov 2017 Boyuan Pan, Hao Li, Zhou Zhao, Deng Cai, Xiaofei He

In this paper, we propose a novel neural network system that consists of a Demand Optimization Model based on passage-attention neural machine translation and a Reader Model that can find the answer given the optimized question.

Machine Reading Comprehension Machine Translation

Smarnet: Teaching Machines to Read and Comprehend Like Human

no code implementations 8 Oct 2017 Zheqian Chen, Rongqin Yang, Bin Cao, Zhou Zhao, Deng Cai, Xiaofei He

Machine Comprehension (MC) is a challenging task in the Natural Language Processing field, which aims to guide the machine to comprehend a passage and answer the given question.

Natural Language Processing Question Answering +1

Identifying and Tracking Sentiments and Topics from Social Media Texts during Natural Disasters

no code implementations EMNLP 2017 Min Yang, Jincheng Mei, Heng Ji, Wei Zhao, Zhou Zhao, Xiaojun Chen

We study the problem of identifying the topics and sentiments and tracking their shifts from social media texts in different geographical regions during emergencies and disasters.

Topic Models

Video Question Answering via Attribute-Augmented Attention Network Learning

no code implementations 20 Jul 2017 Yunan Ye, Zhou Zhao, Yimeng Li, Long Chen, Jun Xiao, Yueting Zhuang

Video Question Answering is a challenging problem in visual information retrieval, which provides the answer to the referenced video content according to the question.

Information Retrieval Multiple-choice +4

The Forgettable-Watcher Model for Video Question Answering

no code implementations 3 May 2017 Hongyang Xue, Zhou Zhao, Deng Cai

Then we propose a TGIF-QA dataset for video question answering with the help of automatic question generation.

Question Answering Question Generation +2

Question Retrieval for Community-based Question Answering via Heterogeneous Network Integration Learning

no code implementations 24 Nov 2016 Zheqian Chen, Chi Zhang, Zhou Zhao, Deng Cai

The main challenge in this task is the lexical gap between questions, which arises from the word ambiguity and word mismatch problems.

Question Answering
