no code implementations • Findings (ACL) 2022 • Tao Jin, Zhou Zhao, Meng Zhang, Xingshan Zeng
This paper attacks the challenging problem of sign language translation (SLT), which involves not only visual and textual understanding but also additional prior knowledge learning (i.e., performing style and syntax).
no code implementations • 30 Nov 2023 • Liangcai Su, Fan Yan, Jieming Zhu, Xi Xiao, Haoyi Duan, Zhou Zhao, Zhenhua Dong, Ruiming Tang
Two-tower models are a prevalent matching framework for recommendation, which have been widely deployed in industrial applications.
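The two-tower matching paradigm mentioned above can be sketched as follows: a user tower and an item tower independently map raw features to embeddings, and matching is a dot product, so item embeddings can be pre-computed and served with nearest-neighbour search. This is a generic minimal sketch in numpy, not this paper's specific model; all dimensions and weights are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def tower(x, w1, w2):
    """A tiny MLP tower: raw features -> L2-normalized embedding."""
    h = np.maximum(x @ w1, 0.0)                            # ReLU hidden layer
    e = h @ w2
    return e / np.linalg.norm(e, axis=-1, keepdims=True)

# Hypothetical dimensions: 16-d raw features, 8-d embeddings.
w_u1, w_u2 = rng.normal(size=(16, 32)), rng.normal(size=(32, 8))
w_i1, w_i2 = rng.normal(size=(16, 32)), rng.normal(size=(32, 8))

user = tower(rng.normal(size=(1, 16)), w_u1, w_u2)         # user tower
items = tower(rng.normal(size=(100, 16)), w_i1, w_i2)      # item tower

# Matching score is a plain dot product, which is what makes the item
# side cacheable for large-scale retrieval.
scores = (user @ items.T).ravel()
top5 = np.argsort(-scores)[:5]
```

The key design choice is that the two towers never interact until the final dot product, trading ranking accuracy for serving efficiency.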
1 code implementation • NeurIPS 2023 • Haoyi Duan, Yan Xia, Mingze Zhou, Li Tang, Jieming Zhu, Zhou Zhao
This mechanism leverages audio and visual modalities as soft prompts to dynamically adjust the parameters of pre-trained models based on the current multi-modal input features.
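The soft-prompt idea above can be illustrated with a minimal sketch: modality features are projected into a few prompt vectors that are prepended to a frozen model's input sequence, so the prompts change with each multi-modal input while the backbone stays fixed. Projection matrices, dimensions, and the number of prompt tokens here are all hypothetical stand-ins, not the paper's architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_prompt = 64, 4

# Hypothetical learned projections from modality features to prompts.
proj_audio = rng.normal(size=(128, n_prompt * d_model)) * 0.02
proj_visual = rng.normal(size=(256, n_prompt * d_model)) * 0.02

def soft_prompts(audio_feat, visual_feat):
    """Map the current audio/visual features to prompt vectors that are
    prepended to the frozen backbone's token sequence."""
    pa = (audio_feat @ proj_audio).reshape(n_prompt, d_model)
    pv = (visual_feat @ proj_visual).reshape(n_prompt, d_model)
    return np.concatenate([pa, pv], axis=0)    # (2 * n_prompt, d_model)

tokens = rng.normal(size=(10, d_model))        # frozen model's input tokens
prompts = soft_prompts(rng.normal(size=128), rng.normal(size=256))
augmented = np.concatenate([prompts, tokens], axis=0)
```

Only the projection matrices would be trained; the backbone processing `augmented` is left untouched.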
no code implementations • 15 Oct 2023 • Zijian Zhang, Luping Liu, Zhijie Lin, Yichen Zhu, Zhou Zhao
We propose the first unsupervised and learning-based method to identify interpretable directions in h-space of pre-trained diffusion models.
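As a simple point of reference for direction-finding in an activation space (not the paper's learning-based method), one unsupervised baseline is PCA over collected h-space activations: the top principal components give orthonormal candidate edit directions. The activations below are random stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)

# Suppose we collected bottleneck ("h-space") activations of a diffusion
# U-Net over many samples; each row is one flattened h vector.
H = rng.normal(size=(500, 32)) @ rng.normal(size=(32, 32))

def principal_directions(H, k=5):
    """Top-k orthonormal directions of maximal variance in h-space."""
    Hc = H - H.mean(axis=0, keepdims=True)     # center the activations
    _, _, vt = np.linalg.svd(Hc, full_matrices=False)
    return vt[:k]                              # rows are unit directions

dirs = principal_directions(H)
# Editing then means shifting h along a direction: h' = h + alpha * dirs[i]
```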
1 code implementation • 13 Oct 2023 • Zehan Wang, Ziang Zhang, Luping Liu, Yang Zhao, Haifeng Huang, Tao Jin, Zhou Zhao
Inspired by recent C-MCR, this paper proposes Extending Multimodal Contrastive Representation (Ex-MCR), a training-efficient and paired-data-free method to flexibly learn unified contrastive representation space for more than three modalities by integrating the knowledge of existing MCR spaces.
no code implementations • 14 Sep 2023 • Yongqi Wang, Jionghao Bai, Rongjie Huang, RuiQi Li, Zhiqing Hong, Zhou Zhao
Direct speech-to-speech translation (S2ST) with discrete self-supervised representations has achieved remarkable accuracy, but is unable to preserve the speaker timbre of the source speech during translation.
no code implementations • 28 Aug 2023 • Shengpeng Ji, Jialong Zuo, Minghui Fang, Ziyue Jiang, Feiyang Chen, Xinyu Duan, Baoxing Huai, Zhou Zhao
The dataset comprises 236,220 pairs of style prompts in natural text descriptions with five style factors and corresponding speech samples.
no code implementations • 17 Aug 2023 • Zehan Wang, Haifeng Huang, Yang Zhao, Ziang Zhang, Zhou Zhao
This paper presents Chat-3D, which combines the 3D visual perceptual ability of pre-trained 3D representations and the impressive reasoning and conversation capabilities of advanced LLMs to achieve the first universal dialogue systems for 3D scenes.
no code implementations • 25 Jul 2023 • Zehan Wang, Haifeng Huang, Yang Zhao, Linjun Li, Xize Cheng, Yichen Zhu, Aoxiong Yin, Zhou Zhao
3D visual grounding aims to localize the target object in a 3D point cloud by a free-form language description.
no code implementations • 19 Jul 2023 • Jiahao Xun, Shengyu Zhang, Yanting Yang, Jieming Zhu, Liqun Deng, Zhou Zhao, Zhenhua Dong, RuiQi Li, Lichao Zhang, Fei Wu
We analyze the CSI task in a disentanglement view with the causal graph technique, and identify the intra-version and inter-version effects biasing the invariant learning.
no code implementations • ICCV 2023 • Zehan Wang, Haifeng Huang, Yang Zhao, Linjun Li, Xize Cheng, Yichen Zhu, Aoxiong Yin, Zhou Zhao
To accomplish this, we design a novel semantic matching model that analyzes the semantic similarity between object proposals and sentences in a coarse-to-fine manner.
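The coarse-to-fine matching pattern can be sketched as a two-stage ranking: a cheap global sentence similarity prunes the proposal set, then a finer word-level similarity re-ranks the survivors. The features below are random stand-ins and the stage design is a generic illustration, not the paper's exact model.

```python
import numpy as np

rng = np.random.default_rng(0)

def cosine(a, b):
    """Pairwise cosine similarity between rows of a and rows of b."""
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return a @ b.T

proposals = rng.normal(size=(100, 32))   # object proposal features
sentence = rng.normal(size=(1, 32))      # global sentence feature
words = rng.normal(size=(7, 32))         # word-level features

# Coarse stage: keep the top-k proposals by global sentence similarity.
coarse = cosine(proposals, sentence).ravel()
topk = np.argsort(-coarse)[:10]

# Fine stage: re-rank survivors by their best word-level match.
fine = cosine(proposals[topk], words).max(axis=1)
best = topk[int(np.argmax(fine))]
```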
no code implementations • 14 Jul 2023 • Ziyue Jiang, Jinglin Liu, Yi Ren, Jinzheng He, Chen Zhang, Zhenhui Ye, Pengfei Wei, Chunfeng Wang, Xiang Yin, Zejun Ma, Zhou Zhao
In this paper, we introduce Mega-TTS 2, a generic zero-shot multispeaker TTS model that is capable of synthesizing speech for unseen speakers with arbitrary-length prompts.
1 code implementation • CVPR 2023 • Aoxiong Yin, Tianyun Zhong, Li Tang, Weike Jin, Tao Jin, Zhou Zhao
We find that it can provide two kinds of information to the model: 1) it helps the model implicitly learn the location of semantic boundaries in continuous sign language videos, and 2) it helps the model understand the sign language video globally.
Ranked #2 on Gloss-free Sign Language Translation on PHOENIX14T
1 code implementation • 12 Jun 2023 • Yazheng Yang, Zhou Zhao, Qi Liu
Our proposed method addresses this issue by assigning individual style vector to each token in a text, allowing for fine-grained control and manipulation of the style strength.
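Per-token style control can be sketched as blending each token embedding with its own style vector, scaled by a per-token strength scalar. This is a minimal illustration of the idea of token-level style vectors with adjustable strength; the dimensions and the additive blending rule are assumptions, not the paper's exact formulation.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d = 6, 8

token_emb = rng.normal(size=(seq_len, d))
style_vec = rng.normal(size=(seq_len, d))   # one style vector per token

def stylize(token_emb, style_vec, strength):
    """Blend each token with its own style vector; `strength` is a
    per-token scalar in [0, 1] controlling style intensity."""
    s = np.asarray(strength).reshape(-1, 1)
    return token_emb + s * style_vec

# Full style on the first half of the sentence, none on the second half.
strength = np.array([1.0, 1.0, 1.0, 0.0, 0.0, 0.0])
out = stylize(token_emb, style_vec, strength)
```

Tokens with zero strength pass through unchanged, which is what makes the control fine-grained rather than sentence-level.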
1 code implementation • 10 Jun 2023 • Xize Cheng, Tao Jin, Linjun Li, Wang Lin, Xinyu Duan, Zhou Zhao
We demonstrate that OpenSR enables modality transfer from one to any in three different settings (zero-, few- and full-shot), and achieves highly competitive zero-shot performance compared to the existing few-shot and full-shot lip-reading methods.
no code implementations • 6 Jun 2023 • Ziyue Jiang, Yi Ren, Zhenhui Ye, Jinglin Liu, Chen Zhang, Qian Yang, Shengpeng Ji, Rongjie Huang, Chunfeng Wang, Xiang Yin, Zejun Ma, Zhou Zhao
3) We further use a VQGAN-based acoustic model to generate the spectrogram and a latent code language model to fit the distribution of prosody, since prosody changes quickly over time in a sentence, and language models can capture both local and long-range dependencies.
no code implementations • 6 Jun 2023 • Zhenhui Ye, Ziyue Jiang, Yi Ren, Jinglin Liu, Chen Zhang, Xiang Yin, Zejun Ma, Zhou Zhao
We are interested in a novel task, namely low-resource text-to-talking avatar.
1 code implementation • 4 Jun 2023 • Luping Liu, Zijian Zhang, Yi Ren, Rongjie Huang, Xiang Yin, Zhou Zhao
Previous works identify the problem of information mixing in the CLIP text encoder and introduce the T5 text encoder or incorporate strong prior knowledge to assist with the alignment.
no code implementations • 30 May 2023 • Rongjie Huang, Chunlei Zhang, Yongqi Wang, Dongchao Yang, Luping Liu, Zhenhui Ye, Ziyue Jiang, Chao Weng, Zhou Zhao, Dong Yu
Various applications of voice synthesis have been developed independently, despite the fact that they all generate "voice" as output.
no code implementations • 29 May 2023 • Jiawei Huang, Yi Ren, Rongjie Huang, Dongchao Yang, Zhenhui Ye, Chen Zhang, Jinglin Liu, Xiang Yin, Zejun Ma, Zhou Zhao
Finally, we use LLMs to augment and transform a large amount of audio-label data into audio-text datasets to alleviate the problem of scarcity of temporal data.
Ranked #4 on Audio Generation on AudioCaps
no code implementations • 24 May 2023 • Rongjie Huang, Huadai Liu, Xize Cheng, Yi Ren, Linjun Li, Zhenhui Ye, Jinzheng He, Lichao Zhang, Jinglin Liu, Xiang Yin, Zhou Zhao
Direct speech-to-speech translation (S2ST) aims to convert speech from one language into another, and has demonstrated significant progress to date.
no code implementations • NeurIPS 2023 • Zehan Wang, Yang Zhao, Xize Cheng, Haifeng Huang, Jiageng Liu, Li Tang, Linjun Li, Yongqi Wang, Aoxiong Yin, Ziang Zhang, Zhou Zhao
This paper proposes a novel training-efficient method for learning MCR without paired data called Connecting Multi-modal Contrastive Representations (C-MCR).
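The core trick of connecting two pre-trained contrastive spaces without paired data can be illustrated with a toy version: embeddings of the modality the two spaces share (e.g., text in both CLIP and CLAP) serve as anchors, and a connector is fit to map one space onto the other. Here the connector is a least-squares linear map on synthetic anchors, a deliberately simplified stand-in for the paper's learned connector.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two pre-trained contrastive spaces that share one modality (text).
# Shared-modality embeddings act as anchors between the spaces.
anchors_a = rng.normal(size=(200, 16))          # text embedded in space A
W_true = rng.normal(size=(16, 12))
anchors_b = anchors_a @ W_true                  # same text in space B

def fit_connector(A, B):
    """Least-squares linear map from space A to space B."""
    W, *_ = np.linalg.lstsq(A, B, rcond=None)
    return W

W = fit_connector(anchors_a, anchors_b)
# Any A-space embedding (e.g. an image) can now be mapped into space B,
# connecting modalities that were never directly paired.
```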
no code implementations • 22 May 2023 • Huadai Liu, Rongjie Huang, Xuan Lin, Wenqiang Xu, Maozong Zheng, Hong Chen, Jinzheng He, Zhou Zhao
To mitigate the data scarcity in learning visual acoustic information, we 1) introduce a self-supervised learning framework to enhance both the visual-text encoder and denoiser decoder; 2) leverage the diffusion transformer scalable in terms of parameters and capacity to learn visual scene information.
no code implementations • 21 May 2023 • Huadai Liu, Rongjie Huang, Jinzheng He, Gang Sun, Ran Shen, Xize Cheng, Zhou Zhao
Speech-to-SQL (S2SQL) aims to convert spoken questions into SQL queries given relational databases, which has been traditionally implemented in a cascaded manner while facing the following challenges: 1) model training is faced with the major issue of data scarcity, where limited parallel data is available; and 2) the systems should be robust enough to handle diverse out-of-domain speech samples that differ from the source data.
no code implementations • 18 May 2023 • Jinzheng He, Jinglin Liu, Zhenhui Ye, Rongjie Huang, Chenye Cui, Huadai Liu, Zhou Zhao
To tackle these challenges, we propose RMSSinger, the first RMS-SVS method, which takes realistic music scores as input, eliminating most of the tedious manual annotation and avoiding the aforementioned inconvenience.
no code implementations • 8 May 2023 • RuiQi Li, Rongjie Huang, Lichao Zhang, Jinglin Liu, Zhou Zhao
The speech-to-singing (STS) voice conversion task aims to generate singing samples corresponding to speech recordings while facing a major challenge: the alignment between the target (singing) pitch contour and the source (speech) content is difficult to learn in a text-free situation.
no code implementations • 6 May 2023 • Yufeng Huang, Jiji Tang, Zhuo Chen, Rongsheng Zhang, Xinfeng Zhang, WeiJie Chen, Zeng Zhao, Zhou Zhao, Tangjie Lv, Zhipeng Hu, Wen Zhang
Moreover, a Knowledge-Enhanced Encoder (KEE) is proposed to leverage SGK as input to further enhance structured representations.
1 code implementation • CVPR 2023 • Zhou Yu, Lixiang Zheng, Zhou Zhao, Fei Wu, Jianping Fan, Kui Ren, Jun Yu
A recent benchmark AGQA poses a promising paradigm to generate QA pairs automatically from pre-annotated scene graphs, enabling it to measure diverse reasoning abilities with granular control.
no code implementations • 3 May 2023 • Dong Yao, Shengyu Zhang, Zhou Zhao, Jieming Zhu, Wenqiao Zhang, Rui Zhang, Xiaofei He, Fei Wu
In contrast, modalities that do not cause users' behaviors are potential noises and might mislead the learning of a recommendation model.
no code implementations • 1 May 2023 • Zhenhui Ye, Jinzheng He, Ziyue Jiang, Rongjie Huang, Jiawei Huang, Jinglin Liu, Yi Ren, Xiang Yin, Zejun Ma, Zhou Zhao
Recently, neural radiance field (NeRF) has become a popular rendering technique in this field since it could achieve high-fidelity and 3D-consistent talking face generation with a few-minute-long training video.
1 code implementation • 25 Apr 2023 • Rongjie Huang, Mingze Li, Dongchao Yang, Jiatong Shi, Xuankai Chang, Zhenhui Ye, Yuning Wu, Zhiqing Hong, Jiawei Huang, Jinglin Liu, Yi Ren, Zhou Zhao, Shinji Watanabe
In this work, we propose a multi-modal AI system named AudioGPT, which complements LLMs (i.e., ChatGPT) with 1) foundation models to process complex audio information and solve numerous understanding and generation tasks; and 2) the input/output interface (ASR, TTS) to support spoken dialogue.
no code implementations • 13 Apr 2023 • Jiong Wang, Zhou Zhao, Fei Wu
Thus we propose to separate the identity features with the variance features in a light-weighted set-based disentanglement framework.
no code implementations • CVPR 2023 • Haoyuan Li, Hao Jiang, Tao Jin, Mengyan Li, Yan Chen, Zhijie Lin, Yang Zhao, Zhou Zhao
Then, we present two cooperative seekers to simultaneously search the image for PR and localize the product for PG.
no code implementations • 24 Mar 2023 • Qinglin Zhang, Chong Deng, Jiaqing Liu, Hai Yu, Qian Chen, Wen Wang, Zhijie Yan, Jinglin Liu, Yi Ren, Zhou Zhao
ICASSP2023 General Meeting Understanding and Generation Challenge (MUG) focuses on prompting a wide range of spoken language processing (SLP) research on meeting transcripts, as SLP applications are critical to improve users' efficiency in grasping important information in meetings.
1 code implementation • 24 Mar 2023 • Qinglin Zhang, Chong Deng, Jiaqing Liu, Hai Yu, Qian Chen, Wen Wang, Zhijie Yan, Jinglin Liu, Yi Ren, Zhou Zhao
To prompt SLP advancement, we establish a large-scale general Meeting Understanding and Generation Benchmark (MUG) to benchmark the performance of a wide range of SLP tasks, including topic segmentation, topic-level and session-level extractive summarization and topic title generation, keyphrase extraction, and action item detection.
2 code implementations • ICCV 2023 • Xize Cheng, Linjun Li, Tao Jin, Rongjie Huang, Wang Lin, Zehan Wang, Huangdai Liu, Ye Wang, Aoxiong Yin, Zhou Zhao
However, despite researchers exploring cross-lingual translation techniques such as machine translation and audio speech translation to overcome language barriers, there is still a shortage of cross-lingual studies on visual speech.
no code implementations • 5 Feb 2023 • Zijian Zhang, Zhou Zhao, Jun Yu, Qi Tian
In this paper, we propose a novel and flexible conditional diffusion model by introducing conditions into the forward process.
no code implementations • 31 Jan 2023 • Zhenhui Ye, Ziyue Jiang, Yi Ren, Jinglin Liu, Jinzheng He, Zhou Zhao
Generating photo-realistic video portrait with arbitrary speech audio is a crucial problem in film-making and virtual reality.
1 code implementation • 30 Jan 2023 • Rongjie Huang, Jiawei Huang, Dongchao Yang, Yi Ren, Luping Liu, Mingze Li, Zhenhui Ye, Jinglin Liu, Xiang Yin, Zhou Zhao
Its application to audio still lags behind for two main reasons: the lack of large-scale datasets with high-quality text-audio pairs, and the complexity of modeling long continuous audio data.
Ranked #8 on Audio Generation on AudioCaps
no code implementations • CVPR 2023 • Mengze Li, Han Wang, Wenqiao Zhang, Jiaxu Miao, Zhou Zhao, Shengyu Zhang, Wei Ji, Fei Wu
WINNER first builds the language decomposition tree in a bottom-up manner, upon which the structural attention mechanism and top-down feature backtracking jointly build a multi-modal decomposition tree, permitting a hierarchical understanding of unstructured videos.
no code implementations • ICCV 2023 • Wang Lin, Tao Jin, Ye Wang, Wenwen Pan, Linjun Li, Xize Cheng, Zhou Zhao
In this study, we propose a new task, group video captioning, which aims to infer the desired content among a group of target videos and describe it with another group of related reference videos.
no code implementations • ICCV 2023 • Jiong Wang, Huiming Zhang, Haiwen Hong, Xuan Jin, Yuan He, Hui Xue, Zhou Zhao
For the classification task, we introduce an open corpus classifier by reconstructing the original classifier with similar words in the text space.
1 code implementation • 26 Dec 2022 • Zijian Zhang, Zhou Zhao, Zhijie Lin
These imply that the gap corresponds to the lost information of the image, and we can reconstruct the image by filling the gap.
no code implementations • 14 Dec 2022 • Jinglin Liu, Zhenhui Ye, Qian Chen, Siqi Zheng, Wen Wang, Qinglin Zhang, Zhou Zhao
Recently, binaural audio synthesis (BAS) has emerged as a promising research field for its applications in augmented and virtual realities.
1 code implementation • 21 Nov 2022 • Luping Liu, Yi Ren, Xize Cheng, Rongjie Huang, Chongxuan Li, Zhou Zhao
In this paper, we introduce a new perceptron bias assumption that suggests discriminator models are more sensitive to certain features of the input, leading to the overconfidence problem.
no code implementations • 19 Nov 2022 • Chenye Cui, Yi Ren, Jinglin Liu, Rongjie Huang, Zhou Zhao
In this paper, we pose the task of generating sound with a specific timbre given a video input and a reference audio sample.
no code implementations • 8 Sep 2022 • Jiong Wang, Zhou Zhao, Weike Jin
Multi-modal video question answering aims to predict correct answer and localize the temporal boundary relevant to the question.
1 code implementation • 1 Sep 2022 • Yan Xia, Zhou Zhao, Shangwei Ye, Yang Zhao, Haoyuan Li, Yi Ren
To rectify the discriminative phonemes and extract video-related information from noisy audio, we develop a novel video-guided curriculum learning (VGCL) during the audio pre-training process, which can make use of the vital visual perceptions to help understand the spoken language and suppress the external noise.
no code implementations • 20 Aug 2022 • Yang Zhao, Wenqiang Xu, Xuan Lin, Jingjing Huo, Hong Chen, Zhou Zhao
The task of argument mining aims to detect all possible argumentative components and identify their relationships automatically.
no code implementations • 17 Aug 2022 • Shengyu Zhang, Bofang Li, Dong Yao, Fuli Feng, Jieming Zhu, Wenyan Fan, Zhou Zhao, Xiaofei He, Tat-Seng Chua, Fei Wu
Micro-video recommender systems suffer from the ubiquitous noises in users' behaviors, which might render the learned user representation indiscriminating, and lead to trivial recommendations (e.g., popular items) or even weird ones that are far beyond users' interests.
1 code implementation • 17 Aug 2022 • Shengyu Zhang, Lingxiao Yang, Dong Yao, Yujie Lu, Fuli Feng, Zhou Zhao, Tat-Seng Chua, Fei Wu
Specifically, Re4 encapsulates three backward flows, i.e., 1) Re-contrast, which drives each interest embedding to be distinct from other interests using contrastive learning; 2) Re-attend, which ensures the interest-item correlation estimation in the forward flow to be consistent with the criterion used in final recommendation; and 3) Re-construct, which ensures that each interest embedding can semantically reflect the information of representative items that relate to the corresponding interest.
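The Re-contrast flow can be illustrated with a small InfoNCE-style loss over a user's interest embeddings: each interest is its own positive and the other interests are negatives, so the loss is low when interests are mutually distinct and high when they collapse. This is a generic sketch of that contrastive objective, not the paper's exact loss; the temperature value is an assumption.

```python
import numpy as np

def re_contrast_loss(interests, temperature=0.1):
    """InfoNCE-style loss over one user's interest embeddings: each
    interest should match itself and differ from the other interests."""
    z = interests / np.linalg.norm(interests, axis=1, keepdims=True)
    sim = z @ z.T / temperature                       # (k, k) similarities
    logits = sim - sim.max(axis=1, keepdims=True)     # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))                # positives on diagonal

distinct = np.eye(4)           # mutually orthogonal interest embeddings
collapsed = np.ones((4, 4))    # all interests identical (degenerate case)
# Distinct interests yield a much lower loss than collapsed ones.
```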
3 code implementations • 13 Jul 2022 • Rongjie Huang, Zhou Zhao, Huadai Liu, Jinglin Liu, Chenye Cui, Yi Ren
Through the preliminary study on diffusion model parameterization, we find that previous gradient-based TTS models require hundreds or thousands of iterations to guarantee high sample quality, which poses a challenge for accelerating sampling.
no code implementations • 8 Jul 2022 • Yongqi Wang, Zhou Zhao
To tackle these problems, we propose FastLTS, a non-autoregressive end-to-end model which can directly synthesize high-quality speech audios from unconstrained talking videos with low latency, and has a relatively small model size.
no code implementations • 10 Jun 2022 • Yang Zhao, Xuan Lin, Wenqiang Xu, Maozong Zheng, Zhengyong Liu, Zhou Zhao
In recent years, streaming technology has greatly promoted the development of livestreaming.
1 code implementation • 5 Jun 2022 • Ziyue Jiang, Zhe Su, Zhou Zhao, Qian Yang, Yi Ren, Jinglin Liu, Zhenhui Ye
This paper tackles the polyphone disambiguation problem from a concise and novel perspective: we propose Dict-TTS, a semantic-aware generative text-to-speech model with an online website dictionary (the existing prior information in the natural language).
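The dictionary-based disambiguation idea can be sketched as semantic matching: each candidate pronunciation of a polyphonic character carries a gloss embedding, and the pronunciation whose gloss best matches the sentence-context embedding wins. The pronunciations, gloss vectors, and dot-product scoring below are illustrative stand-ins, not the paper's model.

```python
import numpy as np

# Toy gloss embeddings for the two pronunciations of the character "行":
# "xing2" ("to walk; to do") vs. "hang2" ("row; profession").
g_xing = np.concatenate([np.ones(8), np.zeros(8)])
g_hang = np.concatenate([np.zeros(8), np.ones(8)])
entries = {"xing2": g_xing, "hang2": g_hang}

def disambiguate(context_vec, entries):
    """Pick the pronunciation whose dictionary-gloss embedding best
    matches the sentence-context embedding (dot-product matching)."""
    scores = {p: float(context_vec @ g) for p, g in entries.items()}
    return max(scores, key=scores.get)

# A context embedding aligned with the "row/profession" sense.
context = g_hang
pron = disambiguate(context, entries)
```

The appeal of this design is that the dictionary supplies prior pronunciation knowledge for free, instead of learning it from scratch from labeled polyphone data.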
1 code implementation • 25 May 2022 • Rongjie Huang, Jinglin Liu, Huadai Liu, Yi Ren, Lichao Zhang, Jinzheng He, Zhou Zhao
Specifically, a sequence of discrete representations derived in a self-supervised manner are predicted from the model and passed to a vocoder for speech reconstruction, while still facing the following challenges: 1) Acoustic multimodality: the discrete units derived from speech with the same content could be indeterministic due to the acoustic property (e.g., rhythm, pitch, and energy), which causes deterioration of translation accuracy; 2) High latency: current S2ST systems utilize autoregressive models which predict each unit conditioned on the sequence previously generated, failing to take full advantage of parallelism.
2 code implementations • 15 May 2022 • Rongjie Huang, Yi Ren, Jinglin Liu, Chenye Cui, Zhou Zhao
Style transfer for out-of-domain (OOD) speech synthesis aims to generate speech samples with unseen style (e.g., speaker identity, emotion, and prosody) derived from an acoustic reference, while facing the following challenges: 1) The highly dynamic style features in expressive voice are difficult to model and transfer; and 2) the TTS models should be robust enough to handle diverse OOD conditions that differ from the source data.
2 code implementations • 21 Apr 2022 • Rongjie Huang, Max W. Y. Lam, Jun Wang, Dan Su, Dong Yu, Yi Ren, Zhou Zhao
Also, FastDiff enables a sampling speed of 58x faster than real-time on a V100 GPU, making diffusion models practically applicable to speech synthesis deployment for the first time.
Ranked #7 on Text-To-Speech Synthesis on LJSpeech (using extra training data)
1 code implementation • CVPR 2022 • Xinyu Lyu, Lianli Gao, Yuyu Guo, Zhou Zhao, Hao Huang, Heng Tao Shen, Jingkuan Song
The performance of current Scene Graph Generation models is severely hampered by some hard-to-distinguish predicates, e.g., "woman-on/standing on/walking on-beach" or "woman-near/looking at/in front of-child".
no code implementations • 17 Mar 2022 • Dong Yao, Zhou Zhao, Shengyu Zhang, Jieming Zhu, Yudong Zhu, Rui Zhang, Xiuqiang He
We devise a novel contrastive learning objective to accommodate both self-augmented positives/negatives sampled from the same music.
no code implementations • ACL 2022 • Mengze Li, Tianbao Wang, Haoyu Zhang, Shengyu Zhang, Zhou Zhao, Jiaxu Miao, Wenqiao Zhang, Wenming Tan, Jin Wang, Peng Wang, ShiLiang Pu, Fei Wu
To achieve effective grounding under a limited annotation budget, we investigate one-shot video grounding, and learn to ground natural language in all video frames with solely one frame labeled, in an end-to-end manner.
3 code implementations • ACL 2022 • Jinglin Liu, Chengxi Li, Yi Ren, Zhiying Zhu, Zhou Zhao
Furthermore, we propose a latent-mapping algorithm in the latent space to convert the amateur vocal tone to the professional one.
no code implementations • ACL 2022 • Yi Ren, Xu Tan, Tao Qin, Zhou Zhao, Tie-Yan Liu
Then we conduct a comprehensive study on NAR-TTS models that use some advanced modeling methods.
1 code implementation • 21 Feb 2022 • Jiong Wang, Zhou Zhao, Weike Jin, Xinyu Duan, Zhen Lei, Baoxing Huai, Yiling Wu, Xiaofei He
In this paper, the VLAD aggregation method is adopted to quantize local features with visual vocabulary locally partitioning the feature space, and hence preserve the local discriminability.
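VLAD aggregation itself is a classical procedure and can be sketched directly: each local descriptor is assigned to its nearest visual word, residuals to that word are accumulated per word, and the concatenated result is L2-normalized into one global vector. The descriptor and codebook sizes below are arbitrary; this is the textbook VLAD, not this paper's full pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)

def vlad(descriptors, codebook):
    """VLAD aggregation: assign each local descriptor to its nearest
    visual word and accumulate residuals per word."""
    d2 = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    assign = d2.argmin(axis=1)                       # hard assignment
    k, dim = codebook.shape
    agg = np.zeros((k, dim))
    for i, a in enumerate(assign):
        agg[a] += descriptors[i] - codebook[a]       # residual accumulation
    agg = agg.ravel()                                # concatenate per-word sums
    return agg / (np.linalg.norm(agg) + 1e-12)       # global L2 normalization

local_feats = rng.normal(size=(50, 8))               # 50 local descriptors
codebook = rng.normal(size=(4, 8))                   # 4 visual words
v = vlad(local_feats, codebook)
```

Because residuals are accumulated per word, the codebook locally partitions the feature space, which is exactly what preserves local discriminability.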
6 code implementations • ICLR 2022 • Luping Liu, Yi Ren, Zhijie Lin, Zhou Zhao
Under such a perspective, we propose pseudo numerical methods for diffusion models (PNDMs).
Ranked #10 on Image Generation on CelebA 64x64
no code implementations • 16 Feb 2022 • Yi Ren, Ming Lei, Zhiying Huang, Shiliang Zhang, Qian Chen, Zhijie Yan, Zhou Zhao
Specifically, we first introduce a word-level prosody encoder, which quantizes the low-frequency band of the speech and compresses prosody attributes in the latent prosody vector (LPV).
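Quantizing a latent prosody vector against a codebook, as described above, can be sketched as nearest-neighbour lookup: each word-level vector is replaced by its closest codebook entry, turning continuous prosody into discrete tokens a language model can predict. Codebook size and dimensions are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

codebook = rng.normal(size=(64, 16))        # learned prosody codebook

def quantize(lpv):
    """Map each latent prosody vector to its nearest codebook entry,
    yielding a discrete index plus the quantized vector."""
    d2 = ((lpv[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    idx = d2.argmin(axis=1)                 # nearest-neighbour assignment
    return idx, codebook[idx]

word_prosody = rng.normal(size=(10, 16))    # one LPV per word
idx, quantized = quantize(word_prosody)
```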
no code implementations • 11 Jan 2022 • Shoutong Wang, Jinglin Liu, Yi Ren, Zhen Wang, Changliang Xu, Zhou Zhao
However, they face several challenges: 1) the fixed-size speaker embedding is not powerful enough to capture full details of the target timbre; 2) single reference audio does not contain sufficient timbre information of the target speaker; 3) the pitch inconsistency between different speakers also leads to a degradation in the generated voice.
no code implementations • 10 Jan 2022 • Lei Li, Fuping Wu, Sihan Wang, Xinzhe Luo, Carlos Martin-Isla, Shuwei Zhai, Jianpeng Zhang, Yanfei Liu, Zhen Zhang, Markus J. Ankenbrand, Haochuan Jiang, Xiaoran Zhang, Linhong Wang, Tewodros Weldebirhan Arega, Elif Altunok, Zhou Zhao, Feiyan Li, Jun Ma, Xiaoping Yang, Elodie Puybareau, Ilkay Oksuz, Stephanie Bricq, Weisheng Li, Kumaradevan Punithakumar, Sotirios A. Tsaftaris, Laura M. Schreiber, Mingjing Yang, Guocai Liu, Yong Xia, Guotai Wang, Sergio Escalera, Xiahai Zhuang
Assessment of myocardial viability is essential in diagnosis and treatment management of patients suffering from myocardial infarction, and classification of pathology on myocardium is the key to this assessment.
1 code implementation • CVPR 2022 • Wenwen Pan, Haonan Shi, Zhou Zhao, Jieming Zhu, Xiuqiang He, Zhigeng Pan, Lianli Gao, Jun Yu, Fei Wu, Qi Tian
Audio-Guided video semantic segmentation is a challenging problem in visual analysis and editing, which automatically separates foreground objects from background in a video sequence according to the referring audio expressions.
1 code implementation • CVPR 2022 • Yan Xia, Zhou Zhao
Audiovisual Event (AVE) localization requires the model to jointly localize an event by observing audio and visual information.
no code implementations • CVPR 2022 • Aoxiong Yin, Zhou Zhao, Weike Jin, Meng Zhang, Xingshan Zeng, Xiaofei He
In addition, we also explore zero-shot translation in sign language and find that our model can achieve comparable performance to the supervised BSLT model on some language pairs.
1 code implementation • MM '21: Proceedings of the 29th ACM International Conference on Multimedia 2021 • Rongjie Huang, Feiyang Chen, Yi Ren, Jinglin Liu, Chenye Cui, Zhou Zhao
High-fidelity multi-singer singing voice synthesis is challenging for neural vocoder due to the singing voice data shortage, limited singer generalization, and large computational cost.
no code implementations • 8 Dec 2021 • Aoxiong Yin, Zhou Zhao, Jinglin Liu, Weike Jin, Meng Zhang, Xingshan Zeng, Xiaofei He
Sign language translation, as a technology with profound social significance, has attracted growing interest from researchers in recent years.
no code implementations • NeurIPS 2021 • Tao Jin, Zhou Zhao
The majority of existing multimodal sequential learning methods focus on how to obtain effective representations and ignore the importance of multimodal fusion.
no code implementations • 11 Nov 2021 • Jianyun Zou, Min Yang, Lichao Zhang, Yechen Xu, Qifan Pan, Fengqing Jiang, Ran Qin, Shushu Wang, Yifan He, Songfang Huang, Zhou Zhao
We finally analyze the performance of SOTA KBQA models on this dataset and identify the challenges facing Chinese KBQA.
no code implementations • 14 Oct 2021 • Ziyue Jiang, Yi Ren, Ming Lei, Zhou Zhao
Federated learning enables collaborative training of machine learning models under strict privacy restrictions, and federated text-to-speech aims to synthesize natural speech for multiple users with a few audio training samples stored locally on their devices.
no code implementations • 14 Oct 2021 • Rongjie Huang, Chenye Cui, Feiyang Chen, Yi Ren, Jinglin Liu, Zhou Zhao, Baoxing Huai, Zhefeng Wang
In this work, we propose SingGAN, a generative adversarial network designed for high-fidelity singing voice synthesis.
no code implementations • 8 Oct 2021 • Shengyu Zhang, Kun Kuang, Jiezhong Qiu, Jin Yu, Zhou Zhao, Hongxia Yang, Zhongfei Zhang, Fei Wu
The results demonstrate that our method outperforms various SOTA GNNs for stable prediction on graphs with agnostic distribution shift, including shift caused by node labels and attributes.
no code implementations • 8 Oct 2021 • Yujie Lu, Yingxuan Huang, Shengyu Zhang, Wei Han, Hui Chen, Zhou Zhao, Fei Wu
In this paper, we propose the DMR framework to explicitly model dynamic multi-trends of users' current preference and make predictions based on both the history and future potential trends.
no code implementations • 6 Oct 2021 • Fuming You, Jingjing Li, Zhou Zhao
A previous solution is test-time normalization, which substitutes the source statistics in BN layers with the target batch statistics.
3 code implementations • NeurIPS 2021 • Yi Ren, Jinglin Liu, Zhou Zhao
Non-autoregressive text-to-speech (NAR-TTS) models such as FastSpeech 2 and Glow-TTS can synthesize high-quality speech from the given text in parallel.
no code implementations • 29 Sep 2021 • Ziyue Jiang, Yi Ren, Zhou Zhao
In this work, we propose a novel phase-oriented algorithm named PhaseFool that can efficiently construct imperceptible audio adversarial examples with energy dissipation.
no code implementations • 29 Sep 2021 • Rongjie Huang, Max W. Y. Lam, Jun Wang, Dan Su, Dong Yu, Zhou Zhao, Yi Ren
Learning generalizable speech representations for unseen samples in different domains has been a challenge with ever increasing importance to date.
no code implementations • 29 Sep 2021 • Zhijie Lin, Zijian Zhang, Zhou Zhao
Score-based generative models sequentially corrupt the data distribution with noise and then learn to recover the data distribution via score matching.
1 code implementation • 26 Sep 2021 • Jiahao Xun, Shengyu Zhang, Zhou Zhao, Jieming Zhu, Qi Zhang, Jingjie Li, Xiuqiang He, Xiaofei He, Tat-Seng Chua, Fei Wu
In this work, inspired by the fact that users make their click decisions mostly based on the visual impression they perceive when browsing news, we propose to capture such visual impression information with visual-semantic modeling for news recommendation.
no code implementations • 11 Sep 2021 • Shengyu Zhang, Dong Yao, Zhou Zhao, Tat-Seng Chua, Fei Wu
In this paper, we propose to learn accurate and robust user representations, which are required to be less sensitive to (attack on) noisy behaviors and trust more on the indispensable ones, by modeling counterfactual data distribution.
no code implementations • 31 Aug 2021 • Zhijie Lin, Zhou Zhao, Haoyuan Li, Jinglin Liu, Meng Zhang, Xingshan Zeng, Xiaofei He
Lip reading, aiming to recognize spoken sentences according to the given video of lip movements without relying on the audio stream, has attracted great interest due to its application in many scenarios.
1 code implementation • 14 Jul 2021 • Jinglin Liu, Zhiying Zhu, Yi Ren, Wencan Huang, Baoxing Huai, Nicholas Yuan, Zhou Zhao
However, the AR decoding manner generates current lip frame conditioned on frames generated previously, which inherently hinders the inference speed, and also has a detrimental effect on the quality of generated lip frames due to error propagation.
no code implementations • CVPR 2021 • Yawen Zeng, Da Cao, Xiaochi Wei, Meng Liu, Zhou Zhao, Zheng Qin
Toward this end, we contribute a multi-modal relational graph to capture the interactions among objects from the visual and textual content to identify the differences among similar video moment candidates.
no code implementations • CVPR 2021 • Yang Zhao, Zhou Zhao, Zhu Zhang, Zhijie Lin
Temporal video grounding aims to localize the target segment which is semantically aligned with the given sentence in an untrimmed video.
no code implementations • 17 Jun 2021 • Chenye Cui, Yi Ren, Jinglin Liu, Feiyang Chen, Rongjie Huang, Ming Lei, Zhou Zhao
Finally, comparable performance on the emotional speech synthesis task demonstrates the capability of the proposed model.
no code implementations • 2 Jun 2021 • Zhu Zhang, Chang Zhou, Jianxin Ma, Zhijie Lin, Jingren Zhou, Hongxia Yang, Zhou Zhao
Further, we design a history sampler to select informative fragments for rehearsal training, making the memory focus on the crucial information.
6 code implementations • 6 May 2021 • Jinglin Liu, Chengxi Li, Yi Ren, Feiyang Chen, Zhou Zhao
Singing voice synthesis (SVS) systems are built to synthesize high-quality and expressive singing voice, in which the acoustic model generates the acoustic features (e.g., mel-spectrogram) given a music score.
no code implementations • 1 Apr 2021 • Dong Yao, Shengyu Zhang, Zhou Zhao, Wenyan Fan, Jieming Zhu, Xiuqiang He, Fei Wu
Personalized recommendation systems have become pervasive across video platforms.
no code implementations • ICCV 2021 • Min Zhang, Yang Guo, Na Lei, Zhou Zhao, Jianfeng Wu, Xiaoyin Xu, Yalin Wang, Xianfeng Gu
Shape analysis has been playing an important role in early diagnosis and prognosis of neurodegenerative diseases such as Alzheimer's disease (AD).
no code implementations • 1 Jan 2021 • Zhijie Lin, Zhou Zhao, Zhu Zhang, Huai Baoxing, Jing Yuan
Model-Agnostic Meta-Learning (MAML) (Finn et al., 2017) is one of the most well-known gradient-based meta-learning algorithms; it learns the meta-initialization through the inner and outer optimization loops.
no code implementations • 1 Jan 2021 • Zhu Zhang, Chang Zhou, Zhou Zhao, Zhijie Lin, Jingren Zhou, Hongxia Yang
Existing reasoning tasks often follow the setting of "reasoning while experiencing", which has an important assumption that the raw contents can be always accessed while reasoning.
no code implementations • NeurIPS 2020 • Zhu Zhang, Zhou Zhao, Zhijie Lin, Jieming Zhu, Xiuqiang He
Weakly-supervised vision-language grounding aims to localize a target moment in a video or a specific region in an image according to the given sentence query, where only video-level or image-level sentence annotations are provided during training.
1 code implementation • 1 Nov 2020 • Yujie Lu, Shengyu Zhang, Yingxuan Huang, Luyao Wang, Xinyao Yu, Zhou Zhao, Fei Wu
Supposing that future preferences can be diverse, we propose a diverse-trends extractor and a time-aware mechanism to represent the possible preference trends of a given user with multiple vectors.
no code implementations • 2 Oct 2020 • Shengyu Zhang, Donghui Wang, Zhou Zhao, Siliang Tang, Di Xie, Fei Wu
In this paper, we investigate the problem of text-to-pedestrian synthesis, which has many potential applications in art, design, and video surveillance.
no code implementations • IEEE Transactions on Circuits and Systems for Video Technology 2020 • Aming Wu, Yahong Han, Zhou Zhao, Yi Yang
In this article, we devise a novel memory decoder for visual narrating.
Ranked #13 on Visual Storytelling on VIST
1 code implementation • 19 Aug 2020 • Zhu Zhang, Zhijie Lin, Zhou Zhao, Jieming Zhu, Xiuqiang He
Thus, these methods fail to distinguish the target moment from plausible negative moments.
1 code implementation • 18 Aug 2020 • Yi Ren, Jinzheng He, Xu Tan, Tao Qin, Zhou Zhao, Tie-Yan Liu
To improve harmony, in this paper, we propose a novel MUlti-track MIDI representation (MuMIDI), which enables simultaneous multi-track generation in a single sequence and explicitly models the dependency of the notes from different tracks.
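A sketch of the underlying idea, merging several tracks into one time-ordered token stream; the token names and event format below are made up for illustration and are not MuMIDI's actual vocabulary:

```python
# Illustrative sketch of encoding multiple MIDI tracks as a single token
# sequence: notes across tracks are merged into one time-ordered stream,
# with a track token before each note so a single-sequence model can
# capture cross-track dependencies. Token names here are invented.

piano = [(0, 60), (4, 64)]          # (time step, pitch)
drums = [(0, 36), (2, 38), (4, 36)]

events = sorted(
    [("piano", t, p) for t, p in piano] + [("drums", t, p) for t, p in drums],
    key=lambda e: (e[1], e[0]),     # order by time, then track name
)

tokens = []
last_t = None
for track, t, pitch in events:
    if t != last_t:
        tokens.append(f"<pos_{t}>")   # position token shared by all tracks
        last_t = t
    tokens.extend([f"<track_{track}>", f"<pitch_{pitch}>"])

# Notes from different tracks at the same time step share one position
# token, so the sequence interleaves tracks rather than concatenating them.
assert tokens[:4] == ["<pos_0>", "<track_drums>", "<pitch_36>", "<track_piano>"]
```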
no code implementations • 16 Aug 2020 • Zhu Zhang, Zhou Zhao, Zhijie Lin, Baoxing Huai, Nicholas Jing Yuan
Spatio-temporal video grounding aims to retrieve the spatio-temporal tube of a queried object according to the given sentence.
1 code implementation • 16 Aug 2020 • Shengyu Zhang, Ziqi Tan, Jin Yu, Zhou Zhao, Kun Kuang, Jie Liu, Jingren Zhou, Hongxia Yang, Fei Wu
Then, based on the aspects of the video-associated product, we perform knowledge-enhanced spatial-temporal inference on those graphs for capturing the dynamic change of fine-grained product-part characteristics.
1 code implementation • 16 Aug 2020 • Shengyu Zhang, Tan Jiang, Tan Wang, Kun Kuang, Zhou Zhao, Jianke Zhu, Jin Yu, Hongxia Yang, Fei Wu
In this paper, we propose to investigate the problem of out-of-domain visio-linguistic pretraining, where the pretraining data distribution differs from that of downstream data on which the pretrained model will be fine-tuned.
no code implementations • 6 Aug 2020 • Jinglin Liu, Yi Ren, Zhou Zhao, Chen Zhang, Baoxing Huai, Nicholas Jing Yuan
NAR lipreading is a challenging task that has many difficulties: 1) the discrepancy of sequence lengths between source and target makes it difficult to estimate the length of the output sequence; 2) the conditionally independent behavior of NAR generation lacks correlation across time, which leads to a poor approximation of the target distribution; 3) the feature representation ability of the encoder can be weak due to the lack of an effective alignment mechanism; and 4) the removal of the AR language model exacerbates the inherent ambiguity problem of lipreading.
no code implementations • 17 Jul 2020 • Jinglin Liu, Yi Ren, Xu Tan, Chen Zhang, Tao Qin, Zhou Zhao, Tie-Yan Liu
SAT contains a hyperparameter k, and each k value defines a SAT task with different degrees of parallelism.
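The effect of k can be seen in a toy chunked-decoding loop; the stand-in "model" below just continues a known sequence, and only the k-token-per-step structure reflects the semi-autoregressive idea:

```python
# Toy illustration of semi-autoregressive decoding with chunk size k:
# each decoding step emits k tokens at once, conditioned on the prefix.
# k=1 recovers fully autoregressive decoding; k=length is fully parallel.

def decode_sat(predict_chunk, length, k):
    out, steps = [], 0
    while len(out) < length:
        out.extend(predict_chunk(out, k))  # k tokens produced in parallel
        steps += 1
    return out[:length], steps

target = list("semi-autoregressive")

def predict_chunk(prefix, k):
    # stand-in model: deterministically continues the known target
    i = len(prefix)
    return target[i:i + k]

seq1, steps1 = decode_sat(predict_chunk, len(target), k=1)
seq4, steps4 = decode_sat(predict_chunk, len(target), k=4)
assert seq1 == target and seq4 == target
assert steps4 < steps1   # larger k means fewer sequential decoding steps
```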
no code implementations • 9 Jul 2020 • Yi Ren, Xu Tan, Tao Qin, Jian Luan, Zhou Zhao, Tie-Yan Liu
DeepSinger has several advantages over previous SVS systems: 1) to the best of our knowledge, it is the first SVS system that directly mines training data from music websites, 2) the lyrics-to-singing alignment model further avoids any human efforts for alignment labeling and greatly reduces labeling cost, 3) the singing model based on a feed-forward Transformer is simple and efficient, by removing the complicated acoustic feature modeling in parametric synthesis and leveraging a reference encoder to capture the timbre of a singer from noisy singing data, and 4) it can synthesize singing voices in multiple languages and multiple singers.
no code implementations • ACL 2020 • Yi Ren, Jinglin Liu, Xu Tan, Chen Zhang, Tao Qin, Zhou Zhao, Tie-Yan Liu
In this work, we develop SimulSpeech, an end-to-end simultaneous speech-to-text translation system that translates speech in the source language to text in the target language concurrently.
Automatic Speech Recognition (ASR)
1 code implementation • 24 Jun 2020 • Shengyu Zhang, Ziqi Tan, Jin Yu, Zhou Zhao, Kun Kuang, Tan Jiang, Jingren Zhou, Hongxia Yang, Fei Wu
In e-commerce, consumer-generated videos, which in general deliver consumers' individual preferences for the different aspects of certain products, are massive in volume.
33 code implementations • ICLR 2021 • Yi Ren, Chenxu Hu, Xu Tan, Tao Qin, Sheng Zhao, Zhou Zhao, Tie-Yan Liu
In this paper, we propose FastSpeech 2, which addresses the issues in FastSpeech and better solves the one-to-many mapping problem in TTS by 1) directly training the model with the ground-truth target instead of the simplified output from the teacher, and 2) introducing more variation information of speech (e.g., pitch, energy and more accurate duration) as conditional inputs.
Ranked #6 on Text-To-Speech Synthesis on LJSpeech (using extra training data)
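A minimal sketch of a FastSpeech 2-style variance adaptor: predicted pitch and energy are quantized, embedded, and added to the phoneme hidden states, and predicted durations then expand the sequence to frame level. The bin counts, ranges, and dimensions below are illustrative assumptions, not the paper's values:

```python
import numpy as np

# Variance-adaptor sketch: add quantized pitch/energy embeddings to the
# phoneme hidden states, then length-regulate with predicted durations.
# All sizes and value ranges here are illustrative, not the paper's.

rng = np.random.default_rng(0)
n_phonemes, hidden = 5, 8
h = rng.normal(size=(n_phonemes, hidden))          # encoder outputs

pitch = np.array([100., 120., 180., 150., 110.])   # per-phoneme pitch (Hz)
energy = np.array([0.2, 0.5, 0.9, 0.4, 0.3])       # per-phoneme energy
durations = np.array([2, 1, 3, 2, 1])              # predicted frames/phoneme

n_bins = 16
pitch_emb = rng.normal(size=(n_bins, hidden))
energy_emb = rng.normal(size=(n_bins, hidden))

def quantize(x, lo, hi, n_bins):
    # map continuous values into integer bins for embedding lookup
    return np.clip(((x - lo) / (hi - lo) * n_bins).astype(int), 0, n_bins - 1)

h = h + pitch_emb[quantize(pitch, 80., 400., n_bins)]
h = h + energy_emb[quantize(energy, 0., 1., n_bins)]

# length regulator: repeat each phoneme's hidden state by its duration
frames = np.repeat(h, durations, axis=0)
assert frames.shape == (durations.sum(), hidden)
```

In the real model the pitch, energy, and duration values come from learned predictors; here they are hard-coded to keep the sketch self-contained.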
no code implementations • ACL 2020 • Yi Ren, Jinglin Liu, Xu Tan, Zhou Zhao, Sheng Zhao, Tie-Yan Liu
In this work, we conduct a study to understand the difficulty of NAR sequence generation and try to answer: (1) Why NAR models can catch up with AR models in some tasks but not all?
Automatic Speech Recognition (ASR)
1 code implementation • 21 Apr 2020 • Yang Sun, Fajie Yuan, Min Yang, Guoao Wei, Zhou Zhao, Duo Liu
Current state-of-the-art sequential recommender models are typically based on a sandwich-structured deep neural network, where one or more middle (hidden) layers are placed between the input embedding layer and output softmax layer.
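The "sandwich" structure can be sketched as a forward pass from an input embedding layer through hidden layers to an output softmax over items; the pooling, dimensions, and names below are illustrative assumptions:

```python
import numpy as np

# Minimal sketch of a sandwich-structured sequential recommender:
# input item-embedding layer -> middle (hidden) layers -> output softmax
# over the item vocabulary. All dimensions and the simple mean-pooling
# of the history are illustrative choices, not any specific model.

rng = np.random.default_rng(1)
n_items, emb_dim, hidden_dim = 100, 16, 32

item_emb = rng.normal(size=(n_items, emb_dim))        # input embedding layer
W1 = rng.normal(size=(emb_dim, hidden_dim)) * 0.1     # middle hidden layer
W_out = rng.normal(size=(hidden_dim, n_items)) * 0.1  # output softmax layer

def next_item_probs(history):
    x = item_emb[history].mean(axis=0)   # pool the interaction sequence
    hdn = np.maximum(x @ W1, 0.0)        # ReLU hidden layer
    logits = hdn @ W_out
    e = np.exp(logits - logits.max())    # numerically stable softmax
    return e / e.sum()

p = next_item_probs([3, 17, 42])
assert p.shape == (n_items,)
```

Real sequential recommenders replace the mean-pooling with recurrent, convolutional, or self-attention middle layers, but the input-embedding/hidden/softmax sandwich is the same.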
no code implementations • 29 Feb 2020 • Shengyu Zhang, Tan Jiang, Qinghao Huang, Ziqi Tan, Zhou Zhao, Siliang Tang, Jin Yu, Hongxia Yang, Yi Yang, Fei Wu
Existing image completion procedures are highly subjective, considering only the visual context, which may produce unpredictable results that are plausible but not faithful to grounded knowledge.
1 code implementation • 31 Jan 2020 • Shuwen Xiao, Zhou Zhao, Zijian Zhang, Xiaohui Yan, Min Yang
This paper addresses the task of query-focused video summarization, which takes a user's query and a long video as inputs and aims to generate a query-focused video summary.
1 code implementation • CVPR 2020 • Zhu Zhang, Zhou Zhao, Yang Zhao, Qi. Wang, Huasheng Liu, Lianli Gao
In this paper, we consider a novel task, Spatio-Temporal Video Grounding for Multi-Form Sentences (STVG).
no code implementations • 14 Jan 2020 • Boyuan Pan, Yazheng Yang, Zhou Zhao, Yueting Zhuang, Deng Cai
Neural Machine Translation (NMT) has become a popular technology in recent years, and the encoder-decoder framework is the mainstream among all the methods.
no code implementations • IEEE Transactions on Cybernetics 2020 • Wei Zhao, Benyou Wang, Min Yang, Jianbo Ye, Zhou Zhao, Xiaojun Chen, and Ying Shen
Movie recommendation systems provide users with ranked lists of movies based on individuals' preferences and constraints.
no code implementations • 19 Nov 2019 • Zhijie Lin, Zhou Zhao, Zhu Zhang, Qi. Wang, Huasheng Liu
Video moment retrieval aims to search for the moment that is most relevant to the given natural language query.
no code implementations • IJCNLP 2019 • Weike Jin, Zhou Zhao, Mao Gu, Jun Xiao, Furu Wei, Yueting Zhuang
Video dialog is a new and challenging task, which requires the agent to answer questions combining video information with dialog history.
2 code implementations • 5 Sep 2019 • Hongyang Xue, Wenqing Chu, Zhou Zhao, Deng Cai
We propose a new attention model for video question answering.
1 code implementation • 27 Aug 2019 • Yinwei Wei, Zhiyong Cheng, Xuzheng Yu, Zhou Zhao, Lei Zhu, Liqiang Nie
The hashtags that a user attaches to a post (e.g., a micro-video) are the ones which, in her mind, best describe the post content she is interested in.
1 code implementation • ACL 2018 • Boyuan Pan, Yazheng Yang, Zhou Zhao, Yueting Zhuang, Deng Cai, Xiaofei He
We observe that people usually use some discourse markers such as "so" or "but" to represent the logical relationship between two sentences.
Ranked #14 on Natural Language Inference on SNLI
no code implementations • NeurIPS 2018 • Boyuan Pan, Yazheng Yang, Hao Li, Zhou Zhao, Yueting Zhuang, Deng Cai, Xiaofei He
In this paper, we transfer knowledge learned from machine comprehension to the sequence-to-sequence tasks to deepen the understanding of the text.
no code implementations • 1 Jul 2019 • Yutong Wang, Jiyuan Zheng, Qijiong Liu, Zhou Zhao, Jun Xiao, Yueting Zhuang
More specifically, we devise a discriminator, Relation Guider, to capture the relations between the whole passage and the associated answer; the Multi-Interaction mechanism is then deployed to transfer the knowledge dynamically to our question generation system.
no code implementations • 28 Jun 2019 • Zhu Zhang, Zhou Zhao, Zhijie Lin, Jingkuan Song, Deng Cai
Thus, we consider a new task to localize unseen activities in videos via image queries, named Image-Based Activity Localization.
no code implementations • 28 Jun 2019 • Zhu Zhang, Zhou Zhao, Zhijie Lin, Jingkuan Song, Xiaofei He
Concretely, we first develop a hierarchical convolutional self-attention encoder to efficiently model long-form video contents, which builds the hierarchical structure for video sequences and captures question-aware long-range dependencies from video context.
1 code implementation • 16 Jun 2019 • Lianli Gao, Xiaosu Zhu, Jingkuan Song, Zhou Zhao, Heng Tao Shen
In this work, we propose a deep progressive quantization (DPQ) model, as an alternative to PQ, for large scale image retrieval.
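Progressive quantization can be sketched as residual quantization: each stage quantizes the residual left by the previous stages, so truncating to the first m codes still yields a usable, coarser approximation. The codebooks below are random and untrained (real DPQ learns them end to end), and reserving a zero code per stage is an illustrative device, not necessarily the paper's design:

```python
import numpy as np

# Residual-quantization sketch of progressive quantization: stage s picks
# the code in codebook s nearest to the current residual. Each codebook
# reserves a zero code so a stage can "pass", which guarantees the
# reconstruction error is non-increasing as more stages are used.

rng = np.random.default_rng(2)
dim, n_codes, n_stages = 8, 64, 4
codebooks = rng.normal(size=(n_stages, n_codes, dim))
codebooks[:, 0] = 0.0               # zero code: stage may leave residual as-is

def encode(x):
    codes, residual = [], x.copy()
    for cb in codebooks:
        idx = int(np.argmin(((cb - residual) ** 2).sum(axis=1)))
        codes.append(idx)
        residual = residual - cb[idx]
    return codes

def decode(codes, m):
    # reconstruct using only the first m stages
    return sum(codebooks[s][codes[s]] for s in range(m))

x = rng.normal(size=dim)
codes = encode(x)
errs = [np.linalg.norm(x - decode(codes, m)) for m in range(1, n_stages + 1)]
assert all(errs[i] >= errs[i + 1] for i in range(len(errs) - 1))
```

This progressive property is what distinguishes the approach from plain product quantization, where the sub-codes are independent rather than successively refining.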
1 code implementation • 6 Jun 2019 • Zhu Zhang, Zhijie Lin, Zhou Zhao, Zhenxin Xiao
Query-based moment retrieval aims to localize the most relevant moment in an untrimmed video according to the given natural language query.
1 code implementation • 6 Jun 2019 • Zhou Yu, Dejing Xu, Jun Yu, Ting Yu, Zhou Zhao, Yueting Zhuang, DaCheng Tao
It is both crucial and natural to extend this research direction to the video domain for video question answering (VideoQA).
Ranked #15 on Video Question Answering on ActivityNet-QA
Visual Question Answering (VQA)
Zero-Shot Video Question Answering
21 code implementations • NeurIPS 2019 • Yi Ren, Yangjun Ruan, Xu Tan, Tao Qin, Sheng Zhao, Zhou Zhao, Tie-Yan Liu
In this work, we propose a novel feed-forward network based on Transformer to generate mel-spectrogram in parallel for TTS.
Ranked #10 on Text-To-Speech Synthesis on LJSpeech (using extra training data)
11 code implementations • 22 May 2019 • Yi Ren, Yangjun Ruan, Xu Tan, Tao Qin, Sheng Zhao, Zhou Zhao, Tie-Yan Liu
Compared with traditional concatenative and statistical parametric approaches, neural network based end-to-end models suffer from slow inference speed, and the synthesized speech is usually not robust (i.e., some words are skipped or repeated) and lacks controllability (voice speed or prosody control).
no code implementations • 13 May 2019 • Yi Ren, Xu Tan, Tao Qin, Sheng Zhao, Zhou Zhao, Tie-Yan Liu
Text to speech (TTS) and automatic speech recognition (ASR) are two dual tasks in speech processing, and both achieve impressive performance thanks to the recent advances in deep learning and large amounts of aligned speech and text data.
Automatic Speech Recognition (ASR)
1 code implementation • ICLR 2019 • Xu Tan, Yi Ren, Di He, Tao Qin, Zhou Zhao, Tie-Yan Liu
Multilingual machine translation, which translates multiple languages with a single model, has attracted much attention due to its efficiency of offline training and online serving.
2 code implementations • 17 Nov 2018 • Yao Wan, Zhou Zhao, Min Yang, Guandong Xu, Haochao Ying, Jian Wu, Philip S. Yu
To the best of our knowledge, most state-of-the-art approaches follow an encoder-decoder framework that encodes the code into a hidden space and then decodes it into natural language space, suffering from two major drawbacks: a) their encoders consider only the sequential content of the code, ignoring the tree structure, which is also critical for the task of code summarization; b) their decoders are typically trained to predict the next word by maximizing the likelihood of the next ground-truth word given the previous ground-truth words.
no code implementations • 12 Nov 2018 • Yao Wan, Wenqiang Yan, Jianwei Gao, Zhou Zhao, Jian Wu, Philip S. Yu
Dialogue Act (DA) classification is a challenging problem in dialogue interpretation, which aims to attach semantic labels to utterances and characterize the speaker's intention.
Ranked #5 on Dialogue Act Classification on Switchboard corpus
no code implementations • 1 Nov 2018 • Haojie Pan, Junpei Zhou, Zhou Zhao, Yan Liu, Deng Cai, Min Yang
We first propose a new task named Dialogue Description (Dial2Desc).
no code implementations • 24 Oct 2018 • Zhou Zhao, Hanbing Zhan, Lingtao Meng, Jun Xiao, Jun Yu, Min Yang, Fei Wu, Deng Cai
In this paper, we study the problem of image retweet prediction in social media, which predicts whether a user will repost the image tweets from their followees.
1 code implementation • 9 May 2018 • Zhou Yu, Jun Yu, Chenchao Xiang, Zhou Zhao, Qi Tian, DaCheng Tao
Visual grounding aims to localize an object in an image referred to by a textual query phrase.
Ranked #9 on Phrase Grounding on Flickr30k Entities Test
4 code implementations • EMNLP 2018 • Wei Zhao, Jianbo Ye, Min Yang, Zeyang Lei, Suofei Zhang, Zhou Zhao
In this study, we explore capsule networks with dynamic routing for text classification.
Ranked #7 on Sentiment Analysis on MR
no code implementations • SIGIR 2018 • Zheqian Chen, Rongqin Yang, Zhou Zhao, Deng Cai, Xiaofei He
Dialogue Act Recognition (DAR) is a challenging problem in dialogue interpretation, which aims to attach semantic labels to utterances and characterize the speaker's intention.
no code implementations • 1 Nov 2017 • Boyuan Pan, Hao Li, Zhou Zhao, Deng Cai, Xiaofei He
In this paper, we propose a novel neural network system that consists of a Demand Optimization Model based on a passage-attention neural machine translation and a Reader Model that can find the answer given the optimized question.
no code implementations • 8 Oct 2017 • Zheqian Chen, Rongqin Yang, Bin Cao, Zhou Zhao, Deng Cai, Xiaofei He
Machine Comprehension (MC) is a challenging task in the field of Natural Language Processing, which aims to guide the machine to comprehend a passage and answer the given question.
Ranked #33 on Question Answering on SQuAD1.1 dev
no code implementations • EMNLP 2017 • Min Yang, Jincheng Mei, Heng Ji, Wei Zhao, Zhou Zhao, Xiaojun Chen
We study the problem of identifying the topics and sentiments and tracking their shifts from social media texts in different geographical regions during emergencies and disasters.
no code implementations • 28 Jul 2017 • Boyuan Pan, Hao Li, Zhou Zhao, Bin Cao, Deng Cai, Xiaofei He
Machine comprehension (MC) style question answering is a representative problem in natural language processing.
Ranked #32 on Question Answering on TriviaQA
no code implementations • 20 Jul 2017 • Yunan Ye, Zhou Zhao, Yimeng Li, Long Chen, Jun Xiao, Yueting Zhuang
Video Question Answering is a challenging problem in visual information retrieval, which aims to provide the answer about the referenced video content according to the question.
no code implementations • 3 May 2017 • Hongyang Xue, Zhou Zhao, Deng Cai
Then we propose a TGIF-QA dataset for video question answering with the help of automatic question generation.
no code implementations • 24 Nov 2016 • Zheqian Chen, Ben Gao, Huimin Zhang, Zhou Zhao, Deng Cai
However, users often cannot obtain answers within minutes.
no code implementations • 24 Nov 2016 • Zheqian Chen, Chi Zhang, Zhou Zhao, Deng Cai
The challenges in this task are the lexical gaps between questions, arising from the word ambiguity and word mismatch problems.