no code implementations • COLING 2022 • Lisung Chen, Nuo Chen, Yuexian Zou, Yong Wang, Xinzhong Sun
Furthermore, we propose a threshold-free multi-intent classifier that utilizes the output of the IND task and detects multiple intents without depending on a threshold.
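The threshold-free idea above can be illustrated with a minimal sketch: instead of keeping every intent whose score passes a tuned threshold, take the top-k intents where k comes from the intent-number-detection (IND) head. The function and variable names below are invented for illustration and are not from the paper's code.

```python
import numpy as np

def topk_intents(intent_logits, predicted_count):
    """Select intents by taking the top-k scores, where k is the count
    predicted by an IND head, instead of applying a fixed threshold."""
    order = np.argsort(intent_logits)[::-1]       # indices, highest score first
    return sorted(order[:predicted_count].tolist())

# Hypothetical scores over 5 intent labels; the IND head predicted 2 intents.
scores = np.array([0.1, 0.9, 0.2, 0.8, 0.05])
print(topk_intents(scores, 2))  # -> [1, 3]
```

The benefit over thresholding is that no per-dataset cutoff needs to be tuned; the count prediction adapts per utterance.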
1 code implementation • 15 Mar 2023 • Ziyu Yao, Xuxin Cheng, Yuexian Zou
Moreover, we introduce a pose-level method, PoseRAC, which is based on this representation and achieves state-of-the-art performance on two new versions of the datasets by using Pose Saliency Annotation to annotate salient poses for training.
no code implementations • 12 Mar 2023 • Tengtao Song, Nuo Chen, Ji Jiang, Zhihong Zhu, Yuexian Zou
Since incorporating syntactic information like dependency structures into neural models can promote a better understanding of the sentences, such a method has been widely used in NLP tasks.
1 code implementation • 11 Mar 2023 • Bang Yang, Fenglin Liu, Yuexian Zou, Xian Wu, YaoWei Wang, David A. Clifton
We present the results of extensive experiments on twelve NLG tasks, showing that, without using any labeled downstream pairs for training, ZeroNLG generates high-quality and believable outputs and significantly outperforms existing zero-shot methods.
no code implementations • 10 Mar 2023 • Yifei Xin, Dongchao Yang, Fan Cui, Yujun Wang, Yuexian Zou
Existing weakly supervised sound event detection (WSSED) work has not explored both types of co-occurrence simultaneously, i.e., some sound events often co-occur, and their occurrences are usually accompanied by specific background sounds, so they are inevitably entangled, causing misclassification and biased localization results under clip-level supervision alone.
1 code implementation • 23 Feb 2023 • Bowen Cao, Qichen Ye, Weiyuan Xu, Yuexian Zou
Existing neighborhood aggregation strategies fail to capture either the short-term features or the long-term features of temporal graph attributes, leading to unsatisfactory model performance and even poor robustness and domain generality of the representation learning method.
1 code implementation • 23 Feb 2023 • Qichen Ye, Bowen Cao, Nuo Chen, Weiyuan Xu, Yuexian Zou
Despite the promising results of recent KAQA systems, which tend to integrate linguistic knowledge from pre-trained language models (PLMs) and factual knowledge from knowledge graphs (KGs) to answer complex questions, a bottleneck remains in effectively fusing the representations from PLMs and KGs because of (i) the semantic and distributional gaps between them, and (ii) the difficulty of joint reasoning over the knowledge provided by both modalities.
no code implementations • 15 Jan 2023 • Hongxiang Li, Meng Cao, Xuxin Cheng, Zhihong Zhu, Yaowei Li, Yuexian Zou
Video grounding aims to locate a moment of interest matching the given query sentence from an untrimmed video.
no code implementations • 7 Dec 2022 • Xuxin Cheng, Qianqian Dong, Fengpeng Yue, Tom Ko, Mingxuan Wang, Yuexian Zou
How to solve the data scarcity problem for end-to-end speech-to-text translation (ST)?
no code implementations • 22 Nov 2022 • Fenglin Liu, Xian Wu, Chenyu You, Shen Ge, Yuexian Zou, Xu Sun
To this end, we introduce the unpaired video captioning task, which aims to train models without coupled video-caption pairs in the target language.
1 code implementation • 8 Nov 2022 • Zhihong Zhu, Weiyuan Xu, Xuxin Cheng, Tengtao Song, Yuexian Zou
Multi-intent detection and slot filling joint models are gaining increasing traction since they are closer to complicated real-world scenarios.
no code implementations • 28 Oct 2022 • Fenglin Liu, Xian Wu, Shen Ge, Xuancheng Ren, Wei Fan, Xu Sun, Yuexian Zou
To enhance the correlation between vision and language in disentangled spaces, we introduce the visual concepts to DiMBERT which represent visual information in textual format.
no code implementations • 19 Oct 2022 • Fenglin Liu, Xuewei Ma, Xuancheng Ren, Xian Wu, Wei Fan, Yuexian Zou, Xu Sun
Especially for image captioning, the attention based models are expected to ground correct image regions with proper generated words.
1 code implementation • 6 Oct 2022 • Ji Jiang, Meng Cao, Tengtao Song, Yuexian Zou
To this end, we introduce two new datasets (i.e., VID-Entity and VidSTG-Entity) by augmenting the VIDSentence and VidSTG datasets with the explicitly referred words in the whole sentence, respectively.
1 code implementation • 21 Jul 2022 • Meng Cao, Tianyu Yang, Junwu Weng, Can Zhang, Jue Wang, Yuexian Zou
To further enhance the temporal reasoning ability of the learned feature, we propose a context projection head and a temporal aware contrastive loss to perceive the contextual relationships.
1 code implementation • 21 Jul 2022 • Meng Cao, Ji Jiang, Long Chen, Yuexian Zou
Extensive experiments demonstrate that our DCNet achieves state-of-the-art performance on both video and image REC benchmarks.
1 code implementation • 20 Jul 2022 • Dongchao Yang, Jianwei Yu, Helin Wang, Wen Wang, Chao Weng, Yuexian Zou, Dong Yu
In this study, we investigate generating sound conditioned on a text prompt and propose a novel text-to-sound generation framework that consists of a text encoder, a Vector Quantized Variational Autoencoder (VQ-VAE), a decoder, and a vocoder.
Ranked #3 on Audio Generation on AudioCaps
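The four-stage text-to-sound pipeline described above (text encoder, VQ-VAE, decoder, vocoder) can be sketched as a composition of functions. The implementations below are toy stand-ins that only exercise the data flow and tensor shapes; none of the names or shapes come from the paper's actual model.

```python
import numpy as np

rng = np.random.default_rng(0)

def text_encoder(tokens):               # tokens -> text embedding sequence
    return rng.standard_normal((len(tokens), 8))

def token_decoder(text_emb):            # text embeddings -> discrete VQ codes
    return np.arange(text_emb.shape[0]) % 4

def vq_vae_decode(codes, codebook):     # codes -> mel-spectrogram frames
    return codebook[codes]

def vocoder(mel):                       # mel frames -> waveform samples
    return mel.reshape(-1)

codebook = rng.standard_normal((4, 16))         # toy 4-entry VQ codebook
mel = vq_vae_decode(token_decoder(text_encoder(["a", "dog", "barks"])), codebook)
wave = vocoder(mel)
print(mel.shape, wave.shape)  # (3, 16) (48,)
```

The design point is that text conditions a discrete-code prior, the VQ-VAE decodes codes to a spectrogram, and only the vocoder touches raw audio.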
1 code implementation • 5 Jun 2022 • Jinchuan Tian, Jianwei Yu, Chunlei Zhang, Chao Weng, Yuexian Zou, Dong Yu
Experiments conducted on Mandarin-English code-switched speech suggest that the proposed LAE is capable of discriminating between different languages at the frame level and shows superior performance on both monolingual and multilingual ASR tasks.
Automatic Speech Recognition (ASR)
no code implementations • 3 May 2022 • Xinmeng Xu, Rongzhi Gu, Yuexian Zou
Hand-crafted spatial features, such as inter-channel intensity difference (IID) and inter-channel phase difference (IPD), play a fundamental role in recent deep learning based dual-microphone speech enhancement (DMSE) systems.
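The two hand-crafted spatial features named above have direct definitions on complex STFT coefficients: IPD is the phase difference and IID the magnitude ratio (in dB) between the two channels. A minimal sketch (function name and eps handling are mine, not from the paper):

```python
import numpy as np

def spatial_features(stft_ch1, stft_ch2, eps=1e-8):
    """Dual-microphone features on complex STFTs:
    IPD = inter-channel phase difference (radians),
    IID = inter-channel intensity difference (dB)."""
    ipd = np.angle(stft_ch1) - np.angle(stft_ch2)
    iid = 20.0 * np.log10((np.abs(stft_ch1) + eps) / (np.abs(stft_ch2) + eps))
    return ipd, iid

# Two toy single-bin "STFTs": equal magnitude, 90-degree phase offset.
x1 = np.array([1.0 + 0.0j])
x2 = np.array([0.0 + 1.0j])
ipd, iid = spatial_features(x1, x2)
print(round(float(ipd[0]), 3), round(float(iid[0]), 3))  # -1.571 0.0
```

In DMSE systems these per-bin features are typically stacked with spectral features as extra network input channels.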
no code implementations • Findings (NAACL) 2022 • Chenyu You, Nuo Chen, Fenglin Liu, Shen Ge, Xian Wu, Yuexian Zou
To evaluate the capacity of SCQA systems in a dialogue-style interaction, we assemble a Spoken Conversational Question Answering (Spoken-CoQA) dataset with more than 40k question-answer pairs from 4k conversations.
Ranked #1 on Spoken Language Understanding on Spoken-SQuAD
Conversational Question Answering • Spoken Language Understanding
no code implementations • 15 Apr 2022 • Zifeng Zhao, Rongzhi Gu, Dongchao Yang, Jinchuan Tian, Yuexian Zou
Dominant research adopts supervised training for speaker extraction, while the scarcity of ideally clean corpora and the channel mismatch problem are rarely considered.
no code implementations • 4 Apr 2022 • Zifeng Zhao, Dongchao Yang, Rongzhi Gu, Haoran Zhang, Yuexian Zou
However, its performance is often inferior to that of a blind source separation (BSS) counterpart with a similar network architecture, because the auxiliary speaker encoder may sometimes generate ambiguous speaker embeddings.
no code implementations • 31 Mar 2022 • Li Wang, Rongzhi Gu, Weiji Zhuang, Peng Gao, Yujun Wang, Yuexian Zou
Bearing this in mind, a two-branch deep network (a KWS branch and an SV branch) with the same network structure is developed, and a novel decoupling feature learning method is proposed to improve the performance of KWS and SV simultaneously, where speaker-invariant keyword representations and keyword-invariant speaker representations are expected, respectively.
no code implementations • 31 Mar 2022 • Liyu Wu, Can Zhang, Yuexian Zou
Inspired by the recent attention mechanism, we propose a multi-grain contextual focus module, termed MCF, to capture the action associated relation information from the body joints and parts.
1 code implementation • 29 Mar 2022 • Jinchuan Tian, Jianwei Yu, Chao Weng, Yuexian Zou, Dong Yu
However, the effectiveness and efficiency of the MBR-based methods are compromised: the MBR criterion is only used in system training, which creates a mismatch between training and decoding; the on-the-fly decoding process in MBR-based methods results in the need for pre-trained models and slow training speeds.
Automatic Speech Recognition (ASR)
1 code implementation • CVPR 2022 • Can Zhang, Tianyu Yang, Junwu Weng, Meng Cao, Jue Wang, Yuexian Zou
These pre-trained models can be sub-optimal for temporal localization tasks due to the inherent discrepancy between video-level classification and clip-level localization.
1 code implementation • 6 Jan 2022 • Jinchuan Tian, Jianwei Yu, Chao Weng, Yuexian Zou, Dong Yu
Then, the LM score of the hypothesis is obtained by intersecting the generated lattice with an external word N-gram LM.
Automatic Speech Recognition (ASR)
1 code implementation • 5 Dec 2021 • Jinchuan Tian, Jianwei Yu, Chao Weng, Shi-Xiong Zhang, Dan Su, Dong Yu, Yuexian Zou
Recently, End-to-End (E2E) frameworks have achieved remarkable results on various Automatic Speech Recognition (ASR) tasks.
Automatic Speech Recognition (ASR)
1 code implementation • 30 Nov 2021 • Bang Yang, Tong Zhang, Yuexian Zou
DCD is an auxiliary task that requires a caption model to learn the correspondence between video content and concepts and the co-occurrence relations between concepts.
Ranked #7 on Video Captioning on MSR-VTT
no code implementations • 18 Sep 2021 • Dongsheng Chen, Zhiqi Huang, Xian Wu, Shen Ge, Yuexian Zou
Intent detection (ID) and Slot filling (SF) are two major tasks in spoken language understanding (SLU).
no code implementations • EMNLP 2021 • Meng Cao, Long Chen, Mike Zheng Shou, Can Zhang, Yuexian Zou
Almost all existing video grounding methods fall into two frameworks: 1) Top-down model: It predefines a set of segment candidates and then conducts segment classification and regression.
no code implementations • Findings (EMNLP) 2021 • Chenyu You, Nuo Chen, Yuexian Zou
In this paper, we propose novel training schemes for spoken question answering with a self-supervised training stage and a contrastive representation learning stage.
no code implementations • 26 Aug 2021 • Dongsheng Chen, Zhiqi Huang, Yuexian Zou
Spoken Language Understanding (SLU), including intent detection and slot filling, is a core component in human-computer interaction.
no code implementations • 25 Aug 2021 • Cong Wang, Yan Huang, Yuexian Zou, Yong Xu
However, it is noted that ASM-based SIDM degrades when dehazing real-world hazy images due to the limited modelling ability of ASM, in which the atmospheric light factor (ALF) and the angular scattering coefficient (ASC) are assumed to be constants for one image.
no code implementations • 18 Aug 2021 • Lisong Chen, Peilin Zhou, Yuexian Zou
With the auxiliary knowledge provided by the MIL Intent Decoder, we set Final Slot Decoder as the teacher model that imparts knowledge back to Initial Slot Decoder to complete the loop.
no code implementations • 12 Aug 2021 • Meng Cao, Can Zhang, Long Chen, Mike Zheng Shou, Yuexian Zou
In this paper, we show that the motion cues behind the optical flow features are complementarily informative.
Optical Flow Estimation • Weakly-supervised Temporal Action Localization
no code implementations • 12 Aug 2021 • Li Wang, Rongzhi Gu, Nuo Chen, Yuexian Zou
Recently proposed metric learning approaches have improved the generalizability of models for the KWS task, and 1D-CNN based KWS models have achieved state-of-the-art (SOTA) results in terms of model size.
no code implementations • Findings (ACL) 2021 • Fenglin Liu, Xuancheng Ren, Xian Wu, Bang Yang, Shen Ge, Yuexian Zou, Xu Sun
Video captioning combines video understanding and language generation.
no code implementations • 4 Jul 2021 • Zhiqi Huang, Fenglin Liu, Xian Wu, Shen Ge, Helin Wang, Wei Fan, Yuexian Zou
As a result, the proposed approach can handle various tasks including: Audio-Oriented Multimodal Machine Comprehension, Machine Reading Comprehension and Machine Listening Comprehension, in a single model, making fair comparisons possible between our model and the existing unimodal MC models.
no code implementations • 30 Jun 2021 • Liyu Wu, Yuexian Zou, Can Zhang
Efficient long-short temporal modeling is key to enhancing the performance of the action recognition task.
no code implementations • 29 Jun 2021 • Ranyu Ning, Can Zhang, Yuexian Zou
Current mainstream one-stage TAD approaches localize and classify action proposals relying on pre-defined anchors, where the location and scale for action instances are set by designers.
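The pre-defined anchors mentioned above are a hand-designed prior: at each position of the temporal feature map, a fixed set of (center, length) candidates is instantiated. A minimal sketch of such anchor generation (stride and scales are illustrative values, not the paper's):

```python
def make_anchors(num_positions, scales, stride=8.0):
    """Pre-defined 1-D temporal anchors: one (center, length) pair per
    scale at each feature-map position -- the designer-set prior that
    anchor-free TAD methods aim to remove."""
    anchors = []
    for i in range(num_positions):
        center = (i + 0.5) * stride         # anchor center in frame units
        for s in scales:
            anchors.append((center, s * stride))
    return anchors

anchors = make_anchors(num_positions=4, scales=[1, 2, 4])
print(len(anchors), anchors[0])  # 12 (4.0, 8.0)
```

The drawback the abstract points at is visible here: if true action instances fall outside the chosen scales, no anchor matches them well.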
no code implementations • 24 Jun 2021 • Meng Cao, Can Zhang, Dongming Yang, Yuexian Zou
Compared to the traditional single-stage segmentation network, our NASK conducts the detection in a coarse-to-fine manner with the first stage segmentation spotting the rectangle text proposals and the second one retrieving compact representations.
no code implementations • 20 Jun 2021 • Fenglin Liu, Meng Gao, Tianhao Zhang, Yuexian Zou
To further improve the quality of captions generated by the model, we propose the Semantic Relationship Explorer, which explores the relationships between semantic concepts for better understanding of the image.
no code implementations • CVPR 2021 • Fenglin Liu, Xian Wu, Shen Ge, Wei Fan, Yuexian Zou
In detail, PoKE explores the posterior knowledge, which provides explicit abnormal visual regions to alleviate visual data bias; PrKE explores the prior knowledge from the prior medical knowledge graph (medical knowledge) and prior radiology reports (working experience) to alleviate textual data bias.
no code implementations • Findings (ACL) 2021 • Xuewei Ma, Fenglin Liu, Changchang Yin, Xian Wu, Shen Ge, Yuexian Zou, Ping Zhang, Xu Sun
In addition, according to the analysis, the CA model can help existing models better attend to the abnormal regions and provide more accurate descriptions which are crucial for an interpretable diagnosis.
no code implementations • 4 Jun 2021 • Nuo Chen, Chenyu You, Yuexian Zou
We also utilize the proposed self-supervised learning tasks to capture intra-sentence coherence.
no code implementations • 15 May 2021 • Fenglin Liu, Xuancheng Ren, Zhiyuan Zhang, Xu Sun, Yuexian Zou
In this work, we investigate how the scale factors into the effectiveness of the skip connection and reveal that a trivial adjustment of the scale will lead to spurious gradient explosion or vanishing in line with the depth of the model, which can be addressed by normalization, in particular layer normalization, which induces consistent improvements over the plain skip connection.
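The compounding effect described above can be demonstrated numerically: with a residual block y = x + scale * F(x) and a toy linear F, any scale that makes the per-layer gain exceed 1 blows up with depth, while re-normalizing after each block keeps activations bounded. This is only an illustrative toy, not the paper's experimental setup.

```python
import numpy as np

def block(x, scale, use_layernorm=False):
    """Residual block y = x + scale * F(x) with a toy F(x) = 0.1 * x.
    With scale = 2 the plain stack multiplies activations by 1.2 per
    layer, which explodes with depth; layer norm keeps them bounded."""
    y = x + scale * 0.1 * x
    if use_layernorm:
        y = (y - y.mean(-1, keepdims=True)) / (y.std(-1, keepdims=True) + 1e-6)
    return y

x0 = np.array([[1.0, -1.0, 2.0, -2.0]])
plain, normed = x0, x0
for _ in range(50):                       # 50 stacked residual blocks
    plain = block(plain, 2.0)
    normed = block(normed, 2.0, use_layernorm=True)

print(np.abs(plain).max() > 1e3, np.abs(normed).max() < 10.0)  # True True
```

The same multiplicative argument applies to gradients in the backward pass, which is the instability the abstract refers to.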
no code implementations • 30 Apr 2021 • Dongming Yang, Yuexian Zou, Can Zhang, Meng Cao, Jie Chen
Upon this framework, an Interaction Intensifier Module and a Correlation Parsing Module are carefully designed, where: a) interactive semantics from humans can be exploited and passed to objects to intensify interactions, and b) interactive correlations among humans, objects, and interactions are integrated to promote predictions.
no code implementations • 8 Apr 2021 • Jinchuan Tian, Rongzhi Gu, Helin Wang, Yuexian Zou
Transformer-based self-supervised models are trained as feature extractors and have empowered many downstream speech tasks to achieve state-of-the-art performance.
no code implementations • 31 Mar 2021 • Helin Wang, Yuexian Zou, Wenwu Wang
In this paper, we present SpecAugment++, a novel data augmentation method for deep neural networks based acoustic scene classification (ASC).
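SpecAugment-style augmentation, which SpecAugment++ builds on, masks random frequency bands and time spans of a spectrogram. The sketch below shows only this input-level masking; SpecAugment++ itself additionally applies augmentation to hidden feature maps, which is not reproduced here. Parameter values and the function name are illustrative.

```python
import numpy as np

def mask_spectrogram(spec, rng, num_freq_masks=1, num_time_masks=1, max_width=8):
    """Zero out random frequency bands and time spans of a (freq, time)
    spectrogram -- the basic SpecAugment masking operation."""
    out = spec.copy()
    f_bins, t_steps = out.shape
    for _ in range(num_freq_masks):
        w = int(rng.integers(1, max_width + 1))
        f0 = int(rng.integers(0, max(1, f_bins - w)))
        out[f0:f0 + w, :] = 0.0               # frequency mask
    for _ in range(num_time_masks):
        w = int(rng.integers(1, max_width + 1))
        t0 = int(rng.integers(0, max(1, t_steps - w)))
        out[:, t0:t0 + w] = 0.0               # time mask
    return out

rng = np.random.default_rng(0)
spec = np.ones((64, 100))                     # toy constant spectrogram
aug = mask_spectrogram(spec, rng)
print(aug.shape == spec.shape, aug.sum() < spec.sum())  # True True
```

Masking forces the classifier not to rely on any single band or span, which is what makes it an effective regularizer for ASC.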
1 code implementation • CVPR 2021 • Can Zhang, Meng Cao, Dongming Yang, Jie Chen, Yuexian Zou
In this paper, we argue that learning by comparing helps identify these hard snippets and we propose to utilize snippet Contrastive learning to Localize Actions, CoLA for short.
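The "learning by comparing" idea above is commonly realized with an InfoNCE-style contrastive loss: pull a hard snippet's feature toward an easy snippet of the same action and away from background snippets. The sketch below is a generic InfoNCE, not CoLA's exact loss or snippet-mining scheme; all names are illustrative.

```python
import numpy as np

def info_nce(anchor, positive, negatives, tau=0.1):
    """InfoNCE loss over cosine similarities: low when the anchor is much
    closer to the positive than to every negative."""
    def sim(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    logits = np.array([sim(anchor, positive)] + [sim(anchor, n) for n in negatives])
    logits /= tau
    return float(-logits[0] + np.log(np.exp(logits).sum()))

anchor = np.array([1.0, 0.0])
pos = np.array([0.9, 0.1])
near_negs = [np.array([0.0, 1.0])]    # a harder (closer) negative
far_negs = [np.array([-1.0, 0.0])]    # an easy (distant) negative
print(info_nce(anchor, pos, near_negs) > info_nce(anchor, pos, far_negs))  # True
```

As the usage shows, harder negatives yield a larger loss, which is exactly why contrastive training concentrates learning signal on the hard snippets.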
no code implementations • 21 Jan 2021 • Cong Wang, Yan Huang, Yuexian Zou, Yong Xu
However, for images taken in the real world, the illumination is not uniformly distributed over the whole image, which brings model mismatch and possibly results in color shift in deep models using ASM.
no code implementations • 20 Dec 2020 • Nuo Chen, Fenglin Liu, Chenyu You, Peilin Zhou, Yuexian Zou
To predict the answer, it is common practice to employ a predictor that draws information only from the final encoder layer, which generates the coarse-grained representations of the source sequences, i.e., the passage and the question.
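One common remedy for the final-layer-only limitation described above is to feed the predictor a learned combination of all encoder layers' representations rather than just the last one. The sketch below uses a softmax-weighted sum with fixed weights for illustration; it is a generic multi-layer fusion pattern, not necessarily this paper's exact design.

```python
import numpy as np

def fused_prediction(layer_outputs, layer_weights, w_out):
    """Fuse all encoder layers with a softmax-weighted sum before the
    answer predictor, instead of reading only the final layer."""
    w = np.exp(layer_weights) / np.exp(layer_weights).sum()  # softmax over layers
    fused = sum(wi * h for wi, h in zip(w, layer_outputs))   # (seq, hidden)
    return fused @ w_out                                     # (seq, num_labels)

rng = np.random.default_rng(0)
layers = [rng.standard_normal((4, 8)) for _ in range(3)]     # 3 encoder layers
logits = fused_prediction(layers, np.zeros(3), rng.standard_normal((8, 2)))
print(logits.shape)  # (4, 2)
```

Lower layers tend to carry finer-grained lexical information, so mixing them in gives the predictor access to both granularities.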
no code implementations • COLING 2020 • Zhiqi Huang, Fenglin Liu, Yuexian Zou
To this end, we propose a federated learning framework, which can unify various types of datasets as well as tasks to learn and fuse various types of knowledge, i.e., text representations, from different datasets and tasks, without sharing downstream task data.
no code implementations • COLING 2020 • Fenglin Liu, Xuancheng Ren, Zhiyuan Zhang, Xu Sun, Yuexian Zou
In this work, we investigate how the scale factors into the effectiveness of the skip connection and reveal that a trivial adjustment of the scale will lead to spurious gradient explosion or vanishing in line with the depth of the model, which can be addressed by normalization, in particular layer normalization, which induces consistent improvements over the plain skip connection.
no code implementations • NeurIPS 2020 • Fenglin Liu, Xuancheng Ren, Xian Wu, Shen Ge, Wei Fan, Yuexian Zou, Xu Sun
Especially for image captioning, the attention based models are expected to ground correct image regions with proper generated words.
no code implementations • 21 Oct 2020 • Chenyu You, Nuo Chen, Yuexian Zou
Spoken conversational question answering (SCQA) requires machines to model complex dialogue flow given the speech utterances and text corpora.
Audio Signal Processing • Conversational Question Answering
no code implementations • 21 Oct 2020 • Chenyu You, Nuo Chen, Yuexian Zou
However, the recent work shows that ASR systems generate highly noisy transcripts, which critically limit the capability of machine comprehension on the SQA task.
Automatic Speech Recognition (ASR)
no code implementations • 18 Oct 2020 • Chenyu You, Nuo Chen, Fenglin Liu, Dongchao Yang, Yuexian Zou
In spoken question answering, QA systems are designed to answer questions from contiguous text spans within the related speech transcripts.
Automatic Speech Recognition (ASR)
no code implementations • 28 Sep 2020 • Peilin Zhou, Zhiqi Huang, Fenglin Liu, Yuexian Zou
However, we note that efforts to obtain better performance by supporting bidirectional and explicit information exchange between ID and SF have so far not been well studied. In addition, few studies attempt to capture local context information to enhance the performance of SF.
2 code implementations • 8 Aug 2020 • Can Zhang, Yuexian Zou, Guang Chen, Lei Gan
In contrast to optical flow, our PA focuses more on distilling the motion information at boundaries.
Ranked #2 on Action Recognition on Jester (Gesture Recognition)
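A cheap way to see why motion cues concentrate at boundaries, as the PA entry above notes, is plain frame differencing: static regions cancel out and only moving edges survive. This is a generic illustration of boundary-focused motion extraction, not the paper's PA formulation.

```python
import numpy as np

def frame_difference_motion(frames):
    """Per-step motion magnitude via absolute differences between
    consecutive frames of a (T, H, W) clip; static content cancels."""
    diffs = np.abs(np.diff(frames.astype(np.float64), axis=0))
    return diffs.sum(axis=(1, 2))        # one motion score per frame pair

# A bright 2x2 square shifts right by one pixel between frames 1 and 2.
frames = np.zeros((3, 6, 6))
frames[0, 2:4, 2:4] = 1.0
frames[1, 2:4, 2:4] = 1.0                # static between frames 0 and 1
frames[2, 2:4, 3:5] = 1.0                # moved between frames 1 and 2
motion = frame_difference_motion(frames)
print(motion[0] == 0.0, motion[1] > 0.0)  # True True
```

Only the object's leading and trailing edges contribute to the nonzero score, which is the boundary-distilling behavior the abstract contrasts with dense optical flow.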
no code implementations • 14 Jul 2020 • Dongming Yang, Yuexian Zou
However, recent HOI detection methods mostly rely on additional annotations (e.g., human pose) and neglect powerful interactive reasoning beyond convolutions.
no code implementations • 26 Apr 2020 • Meng Cao, Yuexian Zou
Specifically, NASK consists of a Text Instance Segmentation network, namely TIS (1st stage), a Text RoI Pooling module, and a Fiducial pOint eXpression module termed FOX (2nd stage).
no code implementations • 16 Mar 2020 • Rongzhi Gu, Shi-Xiong Zhang, Yong Xu, Lian-Wu Chen, Yuexian Zou, Dong Yu
Target speech separation refers to extracting a target speaker's voice from an overlapped audio of simultaneous talkers.
no code implementations • 11 Mar 2020 • Dongming Yang, Yuexian Zou, Jian Zhang, Ge Li
The GID block breaks through local neighborhoods and captures long-range dependencies of pixels at both the global level and the instance level to help detect interactions between instances.
no code implementations • 9 Mar 2020 • Rongzhi Gu, Shi-Xiong Zhang, Lian-Wu Chen, Yong Xu, Meng Yu, Dan Su, Yuexian Zou, Dong Yu
Hand-crafted spatial features (e.g., inter-channel phase difference, IPD) play a fundamental role in recent deep learning based multi-channel speech separation (MCSS) methods.
no code implementations • 2 Jan 2020 • Rongzhi Gu, Yuexian Zou
To address these challenges, we propose a temporal-spatial neural filter, which directly estimates the target speech waveform from multi-speaker mixture in reverberant environments, assisted with directional information of the speaker(s).
no code implementations • 14 Dec 2019 • Helin Wang, Yuexian Zou, Dading Chong, Wenwu Wang
Convolutional neural networks (CNNs) are among the best-performing neural network architectures for environmental sound classification (ESC).
1 code implementation • 27 Nov 2019 • Bang Yang, Yuexian Zou, Fenglin Liu, Can Zhang
However, mainstream video captioning methods suffer from slow inference speed due to the sequential manner of autoregressive decoding, and prefer generating generic descriptions due to the insufficient training of visual words (e.g., nouns and verbs) and an inadequate decoding paradigm.
no code implementations • 19 Aug 2019 • Dongming Yang, Yuexian Zou, Jian Zhang, Ge Li
Although two-stage detectors like Faster R-CNN have achieved great success in object detection thanks to the strategy of extracting region proposals with a region proposal network, they adapt poorly to real-world object detection because they do not consider mining hard samples when extracting region proposals.
no code implementations • 15 May 2019 • Rongzhi Gu, Jian Wu, Shi-Xiong Zhang, Lian-Wu Chen, Yong Xu, Meng Yu, Dan Su, Yuexian Zou, Dong Yu
This paper extends a previous approach and proposes a new end-to-end model for multi-channel speech separation.
no code implementations • 17 Oct 2014 • Weiyang Liu, Zhiding Yu, Lijia Lu, Yandong Wen, Hui Li, Yuexian Zou
The LCD similarity measure can be kernelized under KCRC, which theoretically links CRC and LCD under the kernel method.