Search Results for author: Yuexian Zou

Found 90 papers, 23 papers with code

Diffsound: Discrete Diffusion Model for Text-to-sound Generation

1 code implementation • 20 Jul 2022 • Dongchao Yang, Jianwei Yu, Helin Wang, Wen Wang, Chao Weng, Yuexian Zou, Dong Yu

In this study, we investigate generating sound conditioned on a text prompt and propose a novel text-to-sound generation framework that consists of a text encoder, a Vector Quantized Variational Autoencoder (VQ-VAE), a decoder, and a vocoder.

Audio Generation
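
The abstract above describes a four-stage text-to-sound pipeline. As a rough sketch of how such stages chain together (all module names, shapes, and numbers below are illustrative placeholders, not the released Diffsound code):

```python
import numpy as np

# Placeholder stages for a text-to-sound pipeline like the one described above
# (text encoder -> token decoder -> VQ-VAE decoder -> vocoder). Everything here
# is a dummy stand-in used only to show the data flow.

def text_encoder(prompt: str) -> np.ndarray:
    """Map a text prompt to a sequence of embedding vectors (dummy)."""
    rng = np.random.default_rng(abs(hash(prompt)) % (2**32))
    return rng.standard_normal((len(prompt.split()), 256))

def token_decoder(text_emb: np.ndarray, n_tokens: int = 256) -> np.ndarray:
    """Predict discrete mel-spectrogram token ids conditioned on the text
    embedding (in Diffsound this role is played by the discrete diffusion
    model; here the conditioning is ignored and random ids are sampled)."""
    rng = np.random.default_rng(0)
    return rng.integers(0, 1024, size=n_tokens)

def vqvae_decode(token_ids: np.ndarray) -> np.ndarray:
    """Turn discrete tokens back into a mel-spectrogram via a dummy codebook lookup."""
    codebook = np.random.default_rng(1).standard_normal((1024, 80))
    return codebook[token_ids]                    # (n_tokens, 80) mel frames

def vocoder(mel: np.ndarray, hop: int = 256) -> np.ndarray:
    """Synthesize a waveform from the mel-spectrogram (dummy upsampling)."""
    return np.repeat(mel.mean(axis=1), hop)

waveform = vocoder(vqvae_decode(token_decoder(text_encoder("a dog barks twice"))))
print(waveform.shape)
```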

WavCaps: A ChatGPT-Assisted Weakly-Labelled Audio Captioning Dataset for Audio-Language Multimodal Research

3 code implementations • 30 Mar 2023 • Xinhao Mei, Chutong Meng, Haohe Liu, Qiuqiang Kong, Tom Ko, Chengqi Zhao, Mark D. Plumbley, Yuexian Zou, Wenwu Wang

To address this data scarcity issue, we introduce WavCaps, the first large-scale weakly-labelled audio captioning dataset, comprising approximately 400k audio clips with paired captions.

 Ranked #1 on Zero-Shot Environment Sound Classification on ESC-50 (using extra training data)

Audio captioning Event Detection +6

Integrating Lattice-Free MMI into End-to-End Speech Recognition

1 code implementation • 29 Mar 2022 • Jinchuan Tian, Jianwei Yu, Chao Weng, Yuexian Zou, Dong Yu

However, the effectiveness and efficiency of the MBR-based methods are compromised: the MBR criterion is only used in system training, which creates a mismatch between training and decoding; the on-the-fly decoding process in MBR-based methods results in the need for pre-trained models and slow training speeds.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +1

LAE: Language-Aware Encoder for Monolingual and Multilingual ASR

1 code implementation • 5 Jun 2022 • Jinchuan Tian, Jianwei Yu, Chunlei Zhang, Chao Weng, Yuexian Zou, Dong Yu

Experiments conducted on Mandarin-English code-switched speech suggest that the proposed LAE is capable of discriminating different languages in frame-level and shows superior performance on both monolingual and multilingual ASR tasks.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +1

CoLA: Weakly-Supervised Temporal Action Localization with Snippet Contrastive Learning

1 code implementation • CVPR 2021 • Can Zhang, Meng Cao, Dongming Yang, Jie Chen, Yuexian Zou

In this paper, we argue that learning by comparing helps identify these hard snippets and we propose to utilize snippet Contrastive learning to Localize Actions, CoLA for short.

CoLA Contrastive Learning +3
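
Snippet-level contrastive learning of the kind described above is commonly implemented as an InfoNCE objective over mined snippet features; the sketch below is a generic formulation under that assumption, not necessarily CoLA's exact loss or mining strategy:

```python
import torch
import torch.nn.functional as F

def snippet_contrastive_loss(query, positives, negatives, temperature=0.07):
    """Generic InfoNCE-style loss over snippet features.

    query:     (D,)   feature of a hard snippet to be refined
    positives: (P, D) features of easy snippets from the same action class
    negatives: (N, D) features of background / other-class snippets
    """
    q = F.normalize(query, dim=-1)
    pos = F.normalize(positives, dim=-1)
    neg = F.normalize(negatives, dim=-1)
    pos_logits = pos @ q / temperature        # (P,)
    neg_logits = neg @ q / temperature        # (N,)
    # Contrast each positive against all negatives; the positive sits at index 0.
    logits = torch.cat([pos_logits.unsqueeze(1),
                        neg_logits.unsqueeze(0).expand(pos_logits.size(0), -1)], dim=1)
    labels = torch.zeros(pos_logits.size(0), dtype=torch.long)
    return F.cross_entropy(logits, labels)

loss = snippet_contrastive_loss(torch.randn(128),
                                torch.randn(4, 128),
                                torch.randn(50, 128))
```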

Non-Autoregressive Coarse-to-Fine Video Captioning

1 code implementation • 27 Nov 2019 • Bang Yang, Yuexian Zou, Fenglin Liu, Can Zhang

However, mainstream video captioning methods suffer from slow inference speed due to the sequential manner of autoregressive decoding, and prefer generating generic descriptions due to the insufficient training of visual words (e.g., nouns and verbs) and an inadequate decoding paradigm.

Sentence Video Captioning

UnifiedVisionGPT: Streamlining Vision-Oriented AI through Generalized Multimodal Framework

1 code implementation • 16 Nov 2023 • Chris Kelly, Luhui Hu, Cindy Yang, Yu Tian, Deshun Yang, Bang Yang, Zaoshan Huang, Zihao Li, Yuexian Zou

In the current landscape of artificial intelligence, foundation models serve as the bedrock for advancements in both language and vision domains.

LocVTP: Video-Text Pre-training for Temporal Localization

1 code implementation • 21 Jul 2022 • Meng Cao, Tianyu Yang, Junwu Weng, Can Zhang, Jue Wang, Yuexian Zou

To further enhance the temporal reasoning ability of the learned feature, we propose a context projection head and a temporal aware contrastive loss to perceive the contextual relationships.

Retrieval Temporal Localization +1

MultiCapCLIP: Auto-Encoding Prompts for Zero-Shot Multilingual Visual Captioning

1 code implementation • 25 Aug 2023 • Bang Yang, Fenglin Liu, Xian Wu, YaoWei Wang, Xu Sun, Yuexian Zou

To deal with the label shortage problem, we present a simple yet effective zero-shot approach MultiCapCLIP that can generate visual captions for different scenarios and languages without any labeled vision-caption pairs of downstream datasets.

Image Captioning Video Captioning

Unsupervised Pre-training for Temporal Action Localization Tasks

1 code implementation • CVPR 2022 • Can Zhang, Tianyu Yang, Junwu Weng, Meng Cao, Jue Wang, Yuexian Zou

These pre-trained models can be sub-optimal for temporal localization tasks due to the inherent discrepancy between video-level classification and clip-level localization.

Contrastive Learning Representation Learning +4

ZeroNLG: Aligning and Autoencoding Domains for Zero-Shot Multimodal and Multilingual Natural Language Generation

1 code implementation • 11 Mar 2023 • Bang Yang, Fenglin Liu, Yuexian Zou, Xian Wu, YaoWei Wang, David A. Clifton

We present the results of extensive experiments on twelve NLG tasks, showing that, without using any labeled downstream pairs for training, ZeroNLG generates high-quality and believable outputs and significantly outperforms existing zero-shot methods.

Image Captioning Machine Translation +5

Correspondence Matters for Video Referring Expression Comprehension

1 code implementation • 21 Jul 2022 • Meng Cao, Ji Jiang, Long Chen, Yuexian Zou

Extensive experiments demonstrate that our DCNet achieves state-of-the-art performance on both video and image REC benchmarks.

Contrastive Learning Referring Expression +3

G2L: Semantically Aligned and Uniform Video Grounding via Geodesic and Game Theory

1 code implementation • ICCV 2023 • Hongxiang Li, Meng Cao, Xuxin Cheng, Yaowei Li, Zhihong Zhu, Yuexian Zou

Due to two annoying issues in video grounding: (1) the co-existence of some visual entities in both the ground truth and other moments, i.e., semantic overlapping; (2) only a few moments in the video are annotated, i.e., the sparse annotation dilemma, vanilla contrastive learning is unable to model the correlations between temporally distant moments and learns inconsistent video representations.

Contrastive Learning Video Grounding

FTM: A Frame-level Timeline Modeling Method for Temporal Graph Representation Learning

1 code implementation • 23 Feb 2023 • Bowen Cao, Qichen Ye, Weiyuan Xu, Yuexian Zou

Existing neighborhood aggregation strategies fail to capture either the short-term features or the long-term features of temporal graph attributes, leading to unsatisfactory model performance and even poor robustness and domain generality of the representation learning method.

Graph Representation Learning

FiTs: Fine-grained Two-stage Training for Knowledge-aware Question Answering

1 code implementation • 23 Feb 2023 • Qichen Ye, Bowen Cao, Nuo Chen, Weiyuan Xu, Yuexian Zou

Despite the promising result of recent KAQA systems which tend to integrate linguistic knowledge from pre-trained language models (PLM) and factual knowledge from knowledge graphs (KG) to answer complex questions, a bottleneck exists in effectively fusing the representations from PLMs and KGs because of (i) the semantic and distributional gaps between them, and (ii) the difficulties in joint reasoning over the provided knowledge from both modalities.

Knowledge Graphs Question Answering +1

PoseRAC: Pose Saliency Transformer for Repetitive Action Counting

1 code implementation • 15 Mar 2023 • Ziyu Yao, Xuxin Cheng, Yuexian Zou

Moreover, we introduce a pose-level method, PoseRAC, which is based on this representation and achieves state-of-the-art performance on two new version datasets by using Pose Saliency Annotation to annotate salient poses for training.

Repetitive Action Counting

A Dynamic Graph Interactive Framework with Label-Semantic Injection for Spoken Language Understanding

1 code implementation • 8 Nov 2022 • Zhihong Zhu, Weiyuan Xu, Xuxin Cheng, Tengtao Song, Yuexian Zou

Multi-intent detection and slot filling joint models are gaining increasing traction since they are closer to complicated real-world scenarios.

Intent Detection slot-filling +2

CLIP Meets Video Captioning: Concept-Aware Representation Learning Does Matter

1 code implementation • 30 Nov 2021 • Bang Yang, Tong Zhang, Yuexian Zou

DCD is an auxiliary task that requires a caption model to learn the correspondence between video content and concepts and the co-occurrence relations between concepts.

Caption Generation Representation Learning +1

Video Referring Expression Comprehension via Transformer with Content-aware Query

1 code implementation • 6 Oct 2022 • Ji Jiang, Meng Cao, Tengtao Song, Yuexian Zou

To this end, we introduce two new datasets (i.e., VID-Entity and VidSTG-Entity) by augmenting the VIDSentence and VidSTG datasets with the explicitly referred words in the whole sentence, respectively.

Referring Expression Referring Expression Comprehension +1

End-to-End Multi-Channel Speech Separation

no code implementations • 15 May 2019 • Rongzhi Gu, Jian Wu, Shi-Xiong Zhang, Lian-Wu Chen, Yong Xu, Meng Yu, Dan Su, Yuexian Zou, Dong Yu

This paper extended the previous approach and proposed a new end-to-end model for multi-channel speech separation.

Speech Separation

C-RPNs: Promoting Object Detection in real world via a Cascade Structure of Region Proposal Networks

no code implementations • 19 Aug 2019 • Dongming Yang, Yuexian Zou, Jian Zhang, Ge Li

Although two-stage detectors like Faster R-CNN have achieved great success in object detection thanks to the strategy of extracting region proposals with a region proposal network, they adapt poorly to real-world object detection because they do not mine hard samples when extracting region proposals.

Object object-detection +2

Environmental Sound Classification with Parallel Temporal-spectral Attention

no code implementations • 14 Dec 2019 • Helin Wang, Yuexian Zou, Dading Chong, Wenwu Wang

Convolutional neural networks (CNN) are one of the best-performing neural network architectures for environmental sound classification (ESC).

Acoustic Scene Classification Environmental Sound Classification +3

Temporal-Spatial Neural Filter: Direction Informed End-to-End Multi-channel Target Speech Separation

no code implementations • 2 Jan 2020 • Rongzhi Gu, Yuexian Zou

To address these challenges, we propose a temporal-spatial neural filter, which directly estimates the target speech waveform from multi-speaker mixture in reverberant environments, assisted with directional information of the speaker(s).

Speech Separation

GID-Net: Detecting Human-Object Interaction with Global and Instance Dependency

no code implementations • 11 Mar 2020 • Dongming Yang, Yuexian Zou, Jian Zhang, Ge Li

GID block breaks through the local neighborhoods and captures long-range dependency of pixels both in global-level and instance-level from the scene to help detecting interactions between instances.

Human-Object Interaction Detection Object

Enhancing End-to-End Multi-channel Speech Separation via Spatial Feature Learning

no code implementations • 9 Mar 2020 • Rongzhi Gu, Shi-Xiong Zhang, Lian-Wu Chen, Yong Xu, Meng Yu, Dan Su, Yuexian Zou, Dong Yu

Hand-crafted spatial features (e.g., inter-channel phase difference, IPD) play a fundamental role in recent deep learning based multi-channel speech separation (MCSS) methods.

Speech Separation
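
The IPD feature mentioned in the abstract above is conventionally the phase difference between the STFTs of two microphone channels, often expressed as cosine/sine pairs; below is a minimal NumPy sketch of that generic definition, not this paper's learned front end:

```python
import numpy as np

def stft(x, n_fft=512, hop=256):
    """Naive STFT with a Hann window (illustrative, not optimized)."""
    win = np.hanning(n_fft)
    frames = [x[i:i + n_fft] * win
              for i in range(0, len(x) - n_fft + 1, hop)]
    return np.stack([np.fft.rfft(f) for f in frames])     # (frames, bins) complex

def ipd_features(ch1, ch2, n_fft=512, hop=256):
    """Inter-channel phase difference, expressed as cos/sin to avoid the
    2*pi wrap-around (a common convention in MCSS front ends)."""
    s1, s2 = stft(ch1, n_fft, hop), stft(ch2, n_fft, hop)
    ipd = np.angle(s1) - np.angle(s2)
    return np.cos(ipd), np.sin(ipd)

mix = np.random.randn(2, 16000)            # 1 s of 2-channel noise at 16 kHz
cos_ipd, sin_ipd = ipd_features(mix[0], mix[1])
print(cos_ipd.shape)                       # (frames, freq bins)
```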

Multi-modal Multi-channel Target Speech Separation

no code implementations • 16 Mar 2020 • Rongzhi Gu, Shi-Xiong Zhang, Yong Xu, Lian-Wu Chen, Yuexian Zou, Dong Yu

Target speech separation refers to extracting a target speaker's voice from an overlapped audio of simultaneous talkers.

Speech Separation

All you need is a second look: Towards Tighter Arbitrary shape text detection

no code implementations • 26 Apr 2020 • Meng Cao, Yuexian Zou

Specifically, NASK consists of a Text Instance Segmentation network named TIS (1st stage), a Text RoI Pooling module, and a Fiducial pOint eXpression module termed FOX (2nd stage).

Instance Segmentation Scene Text Detection +3

A Graph-based Interactive Reasoning for Human-Object Interaction Detection

no code implementations • 14 Jul 2020 • Dongming Yang, Yuexian Zou

However, recent HOI detection methods mostly rely on additional annotations (e.g., human pose) and neglect powerful interactive reasoning beyond convolutions.

Human-Object Interaction Detection

PIN: A Novel Parallel Interactive Network for Spoken Language Understanding

no code implementations • 28 Sep 2020 • Peilin Zhou, Zhiqi Huang, Fenglin Liu, Yuexian Zou

However, we note that, so far, efforts to obtain better performance by supporting bidirectional and explicit information exchange between ID and SF have not been well studied. In addition, few studies attempt to capture local context information to enhance the performance of SF.

Intent Detection Language Modelling +3

Towards Data Distillation for End-to-end Spoken Conversational Question Answering

no code implementations • 18 Oct 2020 • Chenyu You, Nuo Chen, Fenglin Liu, Dongchao Yang, Yuexian Zou

In spoken question answering, QA systems are designed to answer questions from contiguous text spans within the related speech transcripts.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +2

Knowledge Distillation for Improved Accuracy in Spoken Question Answering

no code implementations • 21 Oct 2020 • Chenyu You, Nuo Chen, Yuexian Zou

However, recent work shows that ASR systems generate highly noisy transcripts, which critically limits the capability of machine comprehension on the SQA task.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +5

Contextualized Attention-based Knowledge Transfer for Spoken Conversational Question Answering

no code implementations • 21 Oct 2020 • Chenyu You, Nuo Chen, Yuexian Zou

Spoken conversational question answering (SCQA) requires machines to model complex dialogue flow given the speech utterances and text corpora.

Audio Signal Processing Conversational Question Answering +2

Prophet Attention: Predicting Attention with Future Attention

no code implementations • NeurIPS 2020 • Fenglin Liu, Xuancheng Ren, Xian Wu, Shen Ge, Wei Fan, Yuexian Zou, Xu Sun

Especially for image captioning, the attention based models are expected to ground correct image regions with proper generated words.

Image Captioning

Adaptive Bi-directional Attention: Exploring Multi-Granularity Representations for Machine Reading Comprehension

no code implementations • 20 Dec 2020 • Nuo Chen, Fenglin Liu, Chenyu You, Peilin Zhou, Yuexian Zou

To predict the answer, it is common practice to employ a predictor to draw information only from the final encoder layer, which generates the coarse-grained representations of the source sequences, i.e., passage and question.

Machine Reading Comprehension

FWB-Net:Front White Balance Network for Color Shift Correction in Single Image Dehazing via Atmospheric Light Estimation

no code implementations • 21 Jan 2021 • Cong Wang, Yan Huang, Yuexian Zou, Yong Xu

However, for images taken in the real world, the illumination is not uniformly distributed over the whole image, which brings model mismatch and possibly results in color shift for deep models using ASM.

Image Dehazing Single Image Dehazing

SpecAugment++: A Hidden Space Data Augmentation Method for Acoustic Scene Classification

no code implementations • 31 Mar 2021 • Helin Wang, Yuexian Zou, Wenwu Wang

In this paper, we present SpecAugment++, a novel data augmentation method for deep neural networks based acoustic scene classification (ASC).

Acoustic Scene Classification Data Augmentation +2
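
SpecAugment-style masking applied to hidden feature maps, as the title above suggests, can be illustrated with a small stand-alone function; the masking widths, placement, and zero fill value below are assumptions for illustration, not the paper's exact recipe:

```python
import torch

def mask_hidden_features(h, max_time_mask=20, max_freq_mask=8):
    """Zero out random time and frequency stripes of a hidden feature map.

    h: (batch, channels, time, freq) activations from an intermediate CNN layer.
    This mirrors SpecAugment-style masking but operates in the hidden space.
    """
    h = h.clone()
    b, _, t_len, f_len = h.shape
    for i in range(b):
        t_width = int(torch.randint(0, max_time_mask + 1, (1,)))
        t_start = int(torch.randint(0, max(1, t_len - t_width), (1,)))
        h[i, :, t_start:t_start + t_width, :] = 0.0

        f_width = int(torch.randint(0, max_freq_mask + 1, (1,)))
        f_start = int(torch.randint(0, max(1, f_len - f_width), (1,)))
        h[i, :, :, f_start:f_start + f_width] = 0.0
    return h

hidden = torch.randn(4, 64, 100, 32)       # dummy intermediate activations
augmented = mask_hidden_features(hidden)
```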

RR-Net: Injecting Interactive Semantics in Human-Object Interaction Detection

no code implementations • 30 Apr 2021 • Dongming Yang, Yuexian Zou, Can Zhang, Meng Cao, Jie Chen

Upon the frame, an Interaction Intensifier Module and a Correlation Parsing Module are carefully designed, where: a) interactive semantics from humans can be exploited and passed to objects to intensify interactions, b) interactive correlations among humans, objects and interactions are integrated to promote predictions.

Human-Object Interaction Detection Relation

Layer Reduction: Accelerating Conformer-Based Self-Supervised Model via Layer Consistency

no code implementations • 8 Apr 2021 • Jinchuan Tian, Rongzhi Gu, Helin Wang, Yuexian Zou

Transformer-based self-supervised models are trained as feature extractors and have empowered many downstream speech tasks to achieve state-of-the-art performance.

speech-recognition Speech Recognition

Rethinking Skip Connection with Layer Normalization in Transformers and ResNets

no code implementations • 15 May 2021 • Fenglin Liu, Xuancheng Ren, Zhiyuan Zhang, Xu Sun, Yuexian Zou

In this work, we investigate how the scale factors into the effectiveness of the skip connection and reveal that a trivial adjustment of the scale can lead to spurious gradient explosion or vanishing as the models grow deeper, which can be addressed by normalization, in particular layer normalization, which induces consistent improvements over the plain skip connection.

Image Classification Machine Translation +1
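
The abstract above argues that scaling the residual branch can make gradients explode or vanish with depth unless normalization is applied. Below is a toy PyTorch block showing a scaled skip connection followed by layer normalization; the scale value and module choices are illustrative assumptions, not the paper's exact architecture:

```python
import torch
import torch.nn as nn

class ScaledSkipBlock(nn.Module):
    """Residual block with a scale factor on the transformation branch,
    followed by layer normalization, in the spirit of the analysis above."""

    def __init__(self, dim, scale=0.5):
        super().__init__()
        self.transform = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(),
                                       nn.Linear(dim, dim))
        self.norm = nn.LayerNorm(dim)
        self.scale = scale

    def forward(self, x):
        # y = LayerNorm(x + scale * F(x)); without the normalization,
        # an ill-chosen scale can amplify or shrink gradients with depth.
        return self.norm(x + self.scale * self.transform(x))

x = torch.randn(8, 256)
y = ScaledSkipBlock(256)(x)
```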

Contrastive Attention for Automatic Chest X-ray Report Generation

no code implementations • Findings (ACL) 2021 • Fenglin Liu, Changchang Yin, Xian Wu, Shen Ge, Ping Zhang, Yuexian Zou, Xu Sun

In addition, according to the analysis, the CA model can help existing models better attend to the abnormal regions and provide more accurate descriptions which are crucial for an interpretable diagnosis.

Exploring and Distilling Posterior and Prior Knowledge for Radiology Report Generation

no code implementations • CVPR 2021 • Fenglin Liu, Xian Wu, Shen Ge, Wei Fan, Yuexian Zou

In detail, PoKE explores the posterior knowledge, which provides explicit abnormal visual regions to alleviate visual data bias; PrKE explores the prior knowledge from the prior medical knowledge graph (medical knowledge) and prior radiology reports (working experience) to alleviate textual data bias.

Exploring Semantic Relationships for Unpaired Image Captioning

no code implementations • 20 Jun 2021 • Fenglin Liu, Meng Gao, Tianhao Zhang, Yuexian Zou

To further improve the quality of captions generated by the model, we propose the Semantic Relationship Explorer, which explores the relationships between semantic concepts for better understanding of the image.

Image Captioning Sentence

All You Need is a Second Look: Towards Arbitrary-Shaped Text Detection

no code implementations • 24 Jun 2021 • Meng Cao, Can Zhang, Dongming Yang, Yuexian Zou

Compared to the traditional single-stage segmentation network, our NASK conducts the detection in a coarse-to-fine manner with the first stage segmentation spotting the rectangle text proposals and the second one retrieving compact representations.

Instance Segmentation Segmentation +2

SRF-Net: Selective Receptive Field Network for Anchor-Free Temporal Action Detection

no code implementations • 29 Jun 2021 • Ranyu Ning, Can Zhang, Yuexian Zou

Current mainstream one-stage TAD approaches localize and classify action proposals relying on pre-defined anchors, where the location and scale for action instances are set by designers.

Action Detection

Long-Short Temporal Modeling for Efficient Action Recognition

no code implementations • 30 Jun 2021 • Liyu Wu, Yuexian Zou, Can Zhang

Efficient long-short temporal modeling is key for enhancing the performance of action recognition task.

Action Recognition

Audio-Oriented Multimodal Machine Comprehension: Task, Dataset and Model

no code implementations • 4 Jul 2021 • Zhiqi Huang, Fenglin Liu, Xian Wu, Shen Ge, Helin Wang, Wei Fan, Yuexian Zou

As a result, the proposed approach can handle various tasks including: Audio-Oriented Multimodal Machine Comprehension, Machine Reading Comprehension and Machine Listening Comprehension, in a single model, making fair comparisons possible between our model and the existing unimodal MC models.

Knowledge Distillation Machine Reading Comprehension

Text Anchor Based Metric Learning for Small-footprint Keyword Spotting

no code implementations • 12 Aug 2021 • Li Wang, Rongzhi Gu, Nuo Chen, Yuexian Zou

Recently proposed metric learning approaches have improved the generalizability of models for the KWS task, and 1D-CNN based KWS models have achieved state-of-the-art (SOTA) results in terms of model size.

Metric Learning Small-Footprint Keyword Spotting

Joint Multiple Intent Detection and Slot Filling via Self-distillation

no code implementations • 18 Aug 2021 • Lisong Chen, Peilin Zhou, Yuexian Zou

With the auxiliary knowledge provided by the MIL Intent Decoder, we set Final Slot Decoder as the teacher model that imparts knowledge back to Initial Slot Decoder to complete the loop.

Intent Detection Multiple Instance Learning +3

Fully Non-Homogeneous Atmospheric Scattering Modeling with Convolutional Neural Networks for Single Image Dehazing

no code implementations • 25 Aug 2021 • Cong Wang, Yan Huang, Yuexian Zou, Yong Xu

However, it is noted that ASM-based SIDM degrades in performance when dehazing real-world hazy images due to the limited modelling ability of ASM, in which the atmospheric light factor (ALF) and the angular scattering coefficient (ASC) are assumed to be constant for a single image.

Image Dehazing Single Image Dehazing

HAN: Higher-order Attention Network for Spoken Language Understanding

no code implementations • 26 Aug 2021 • Dongsheng Chen, Zhiqi Huang, Yuexian Zou

Spoken Language Understanding (SLU), including intent detection and slot filling, is a core component in human-computer interaction.

Intent Detection slot-filling +2

Self-supervised Contrastive Cross-Modality Representation Learning for Spoken Question Answering

no code implementations • Findings (EMNLP) 2021 • Chenyu You, Nuo Chen, Yuexian Zou

In this paper, we propose novel training schemes for spoken question answering with a self-supervised training stage and a contrastive representation learning stage.

Question Answering Representation Learning

On Pursuit of Designing Multi-modal Transformer for Video Grounding

no code implementations • EMNLP 2021 • Meng Cao, Long Chen, Mike Zheng Shou, Can Zhang, Yuexian Zou

Almost all existing video grounding methods fall into two frameworks: 1) Top-down model: It predefines a set of segment candidates and then conducts segment classification and regression.

Sentence Video Grounding

Federated Learning for Spoken Language Understanding

no code implementations • COLING 2020 • Zhiqi Huang, Fenglin Liu, Yuexian Zou

To this end, we propose a federated learning framework, which could unify various types of datasets as well as tasks to learn and fuse various types of knowledge, i.e., text representations, from different datasets and tasks, without the sharing of downstream task data.

Intent Detection slot-filling +4

Rethinking Skip Connection with Layer Normalization

no code implementations • COLING 2020 • Fenglin Liu, Xuancheng Ren, Zhiyuan Zhang, Xu Sun, Yuexian Zou

In this work, we investigate how the scale factors into the effectiveness of the skip connection and reveal that a trivial adjustment of the scale can lead to spurious gradient explosion or vanishing as the models grow deeper, which can be addressed by normalization, in particular layer normalization, which induces consistent improvements over the plain skip connection.

Image Classification Machine Translation +1

Learning Decoupling Features Through Orthogonality Regularization

no code implementations • 31 Mar 2022 • Li Wang, Rongzhi Gu, Weiji Zhuang, Peng Gao, Yujun Wang, Yuexian Zou

Bearing this in mind, a two-branch deep network (a KWS branch and an SV branch) with the same network structure is developed, and a novel decoupling feature learning method is proposed to push up the performance of KWS and SV simultaneously, where speaker-invariant keyword representations and keyword-invariant speaker representations are expected, respectively.

Keyword Spotting Speaker Verification

SpatioTemporal Focus for Skeleton-based Action Recognition

no code implementations • 31 Mar 2022 • Liyu Wu, Can Zhang, Yuexian Zou

Inspired by the recent attention mechanism, we propose a multi-grain contextual focus module, termed MCF, to capture the action associated relation information from the body joints and parts.

Action Recognition Skeleton Based Action Recognition

Target Confusion in End-to-end Speaker Extraction: Analysis and Approaches

no code implementations • 4 Apr 2022 • Zifeng Zhao, Dongchao Yang, Rongzhi Gu, Haoran Zhang, Yuexian Zou

However, its performance is often inferior to that of a blind source separation (BSS) counterpart with a similar network architecture, because the auxiliary speaker encoder may sometimes generate ambiguous speaker embeddings.

blind source separation Metric Learning +2

Speaker-Aware Mixture of Mixtures Training for Weakly Supervised Speaker Extraction

no code implementations • 15 Apr 2022 • Zifeng Zhao, Rongzhi Gu, Dongchao Yang, Jinchuan Tian, Yuexian Zou

Dominant research adopts supervised training for speaker extraction, while the scarcity of ideally clean corpora and the channel mismatch problem are rarely considered.

Domain Adaptation

End-to-end Spoken Conversational Question Answering: Task, Dataset and Model

no code implementations • Findings (NAACL) 2022 • Chenyu You, Nuo Chen, Fenglin Liu, Shen Ge, Xian Wu, Yuexian Zou

To evaluate the capacity of SCQA systems in a dialogue-style interaction, we assemble a Spoken Conversational Question Answering (Spoken-CoQA) dataset with more than 40k question-answer pairs from 4k conversations.

4k Conversational Question Answering +2

Improving Dual-Microphone Speech Enhancement by Learning Cross-Channel Features with Multi-Head Attention

no code implementations • 3 May 2022 • Xinmeng Xu, Rongzhi Gu, Yuexian Zou

Hand-crafted spatial features, such as inter-channel intensity difference (IID) and inter-channel phase difference (IPD), play a fundamental role in recent deep learning based dual-microphone speech enhancement (DMSE) systems.

Multi-Task Learning Speech Enhancement

Competence-based Multimodal Curriculum Learning for Medical Report Generation

no code implementations • ACL 2021 • Fenglin Liu, Shen Ge, Yuexian Zou, Xian Wu

Medical report generation task, which targets to produce long and coherent descriptions of medical images, has attracted growing research interests recently.

Image Captioning Medical Report Generation

A Transformer-based Threshold-Free Framework for Multi-Intent NLU

no code implementations • COLING 2022 • Lisung Chen, Nuo Chen, Yuexian Zou, Yong Wang, Xinzhong Sun

Furthermore, we propose a threshold-free multi-intent classifier that utilizes the output of the IND task and detects multiple intents without depending on a threshold.

Multi-Task Learning Natural Language Understanding

Prophet Attention: Predicting Attention with Future Attention for Image Captioning

no code implementations • 19 Oct 2022 • Fenglin Liu, Xuancheng Ren, Xian Wu, Wei Fan, Yuexian Zou, Xu Sun

Especially for image captioning, the attention based models are expected to ground correct image regions with proper generated words.

Image Captioning

DiMBERT: Learning Vision-Language Grounded Representations with Disentangled Multimodal-Attention

no code implementations • 28 Oct 2022 • Fenglin Liu, Xian Wu, Shen Ge, Xuancheng Ren, Wei Fan, Xu Sun, Yuexian Zou

To enhance the correlation between vision and language in disentangled spaces, we introduce the visual concepts to DiMBERT which represent visual information in textual format.

Image Captioning Language Modelling +3

Aligning Source Visual and Target Language Domains for Unpaired Video Captioning

no code implementations • 22 Nov 2022 • Fenglin Liu, Xian Wu, Chenyu You, Shen Ge, Yuexian Zou, Xu Sun

To this end, we introduce the unpaired video captioning task aiming to train models without coupled video-caption pairs in target language.

Translation Video Captioning

Exploiting Auxiliary Caption for Video Grounding

no code implementations • 15 Jan 2023 • Hongxiang Li, Meng Cao, Xuxin Cheng, Zhihong Zhu, Yaowei Li, Yuexian Zou

Video grounding aims to locate a moment of interest matching the given query sentence from an untrimmed video.

Contrastive Learning Dense Video Captioning +2

Improving Weakly Supervised Sound Event Detection with Causal Intervention

no code implementations • 10 Mar 2023 • Yifei Xin, Dongchao Yang, Fan Cui, Yujun Wang, Yuexian Zou

Existing weakly supervised sound event detection (WSSED) work has not explored both types of co-occurrences simultaneously, i.e., some sound events often co-occur, and their occurrences are usually accompanied by specific background sounds, so they would be inevitably entangled, causing misclassification and biased localization results with only clip-level supervision.

Event Detection Sound Event Detection

Improve Retrieval-based Dialogue System via Syntax-Informed Attention

no code implementations • 12 Mar 2023 • Tengtao Song, Nuo Chen, Ji Jiang, Zhihong Zhu, Yuexian Zou

Since incorporating syntactic information like dependency structures into neural models can promote a better understanding of the sentences, such a method has been widely used in NLP tasks.

Retrieval Sentence

Unify, Align and Refine: Multi-Level Semantic Alignment for Radiology Report Generation

no code implementations • ICCV 2023 • Yaowei Li, Bang Yang, Xuxin Cheng, Zhihong Zhu, Hongxiang Li, Yuexian Zou

Automatic radiology report generation has attracted enormous research interest due to its practical value in reducing the workload of radiologists.

Sentence

Iterative Proposal Refinement for Weakly-Supervised Video Grounding

no code implementations • CVPR 2023 • Meng Cao, Fangyun Wei, Can Xu, Xiubo Geng, Long Chen, Can Zhang, Yuexian Zou, Tao Shen, Daxin Jiang

Weakly-Supervised Video Grounding (WSVG) aims to localize events of interest in untrimmed videos with only video-level annotations.

Sentence Video Grounding

Customizing General-Purpose Foundation Models for Medical Report Generation

no code implementations • 9 Jun 2023 • Bang Yang, Asif Raza, Yuexian Zou, Tong Zhang

In this work, we propose customizing off-the-shelf general-purpose large-scale pre-trained models, i.e., foundation models (FMs), in computer vision and natural language processing with a specific focus on medical report generation.

Medical Report Generation Transfer Learning

Multimodal Prompt Learning for Product Title Generation with Extremely Limited Labels

no code implementations • 5 Jul 2023 • Bang Yang, Fenglin Liu, Zheng Li, Qingyu Yin, Chenyu You, Bing Yin, Yuexian Zou

We observe that the core challenges of novel product title generation are the understanding of novel product characteristics and the generation of titles in a novel writing style.

Image Captioning Text Generation

ML-LMCL: Mutual Learning and Large-Margin Contrastive Learning for Improving ASR Robustness in Spoken Language Understanding

no code implementations • 19 Nov 2023 • Xuxin Cheng, Bowen Cao, Qichen Ye, Zhihong Zhu, Hongxiang Li, Yuexian Zou

Specifically, in fine-tuning, we apply mutual learning and train two SLU models on the manual transcripts and the ASR transcripts, respectively, aiming to iteratively share knowledge between these two models.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +4

Embracing Language Inclusivity and Diversity in CLIP through Continual Language Learning

1 code implementation • 30 Jan 2024 • Bang Yang, Yong Dai, Xuxin Cheng, Yaowei Li, Asif Raza, Yuexian Zou

To alleviate CF raised by covariate shift and lexical overlap, we further propose a novel approach that ensures the identical distribution of all token embeddings during initialization and regularizes token embedding learning during training.

Text Retrieval
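
One common way to give newly added language tokens the same distribution as existing ones at initialization is to sample them from a Gaussian fitted to the current embedding matrix; the sketch below illustrates that general idea and is not necessarily the authors' exact procedure:

```python
import torch

def init_new_token_embeddings(embedding, num_new_tokens):
    """Append embeddings for new-language tokens, drawn from a Gaussian fitted
    to the existing embedding matrix so the added rows share the same per-
    dimension mean and standard deviation (i.e., a diagonal covariance)."""
    old = embedding.weight.data                     # (vocab, dim)
    mean, std = old.mean(dim=0), old.std(dim=0)
    new_rows = mean + std * torch.randn(num_new_tokens, old.size(1))
    new_weight = torch.cat([old, new_rows], dim=0)
    new_embedding = torch.nn.Embedding(new_weight.size(0), new_weight.size(1))
    new_embedding.weight.data.copy_(new_weight)
    return new_embedding

emb = torch.nn.Embedding(49408, 512)                # e.g. a CLIP-sized vocabulary
emb = init_new_token_embeddings(emb, num_new_tokens=1000)
print(emb.weight.shape)                             # torch.Size([50408, 512])
```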

Retrieval is Accurate Generation

no code implementations • 27 Feb 2024 • Bowen Cao, Deng Cai, Leyang Cui, Xuxin Cheng, Wei Bi, Yuexian Zou, Shuming Shi

To address this, we propose to initialize the training oracles using linguistic heuristics and, more importantly, bootstrap the oracles through iterative self-reinforcement.

Language Modelling Retrieval +1

Learn Suspected Anomalies from Event Prompts for Video Anomaly Detection

no code implementations • 2 Mar 2024 • Chenchen Tao, Chong Wang, Yuexian Zou, Xiaohao Peng, Jiafei Wu, Jiangbo Qian

Most models for weakly supervised video anomaly detection (WS-VAD) rely on multiple instance learning, aiming to distinguish normal and abnormal snippets without specifying the type of anomaly.

Anomaly Detection Multiple Instance Learning +1

WorldGPT: A Sora-Inspired Video AI Agent as Rich World Models from Text and Image Inputs

no code implementations • 10 Mar 2024 • Deshun Yang, Luhui Hu, Yu Tian, Zihao Li, Chris Kelly, Bang Yang, Cindy Yang, Yuexian Zou

Several text-to-video diffusion models have demonstrated commendable capabilities in synthesizing high-quality video content.

Video Generation

VisionGPT-3D: A Generalized Multimodal Agent for Enhanced 3D Vision Understanding

no code implementations • 14 Mar 2024 • Chris Kelly, Luhui Hu, Jiayin Hu, Yu Tian, Deshun Yang, Bang Yang, Cindy Yang, Zihao Li, Zaoshan Huang, Yuexian Zou

It seamlessly integrates various SOTA vision models, automates the selection of SOTA vision models, identifies the suitable 3D mesh creation algorithms corresponding to 2D depth map analysis, and generates optimal results based on diverse multimodal inputs such as text prompts.

VisionGPT: Vision-Language Understanding Agent Using Generalized Multimodal Framework

no code implementations • 14 Mar 2024 • Chris Kelly, Luhui Hu, Bang Yang, Yu Tian, Deshun Yang, Cindy Yang, Zaoshan Huang, Zihao Li, Jiayin Hu, Yuexian Zou

With the emergence of large language models (LLMs) and vision foundation models, how to combine the intelligence and capacity of these open-sourced or API-available models to achieve open-world visual perception remains an open question.

Language Modelling Large Language Model +2
