Search Results for author: Yuexian Zou

Found 90 papers, 23 papers with code

Diffsound: Discrete Diffusion Model for Text-to-sound Generation

1 code implementation • 20 Jul 2022 • Dongchao Yang, Jianwei Yu, Helin Wang, Wen Wang, Chao Weng, Yuexian Zou, Dong Yu

In this study, we investigate generating sound conditioned on a text prompt and propose a novel text-to-sound generation framework that consists of a text encoder, a Vector Quantized Variational Autoencoder (VQ-VAE), a decoder, and a vocoder.

Audio Generation
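
The abstract above describes a four-stage text-to-sound pipeline. As a rough sketch of how such stages chain together (all module names, shapes, and numbers below are illustrative placeholders, not the released Diffsound code):

```python
import numpy as np

# Placeholder stages for a text-to-sound pipeline like the one described above
# (text encoder -> token decoder -> VQ-VAE decoder -> vocoder). Everything here
# is a dummy stand-in used only to show the data flow.

def text_encoder(prompt: str) -> np.ndarray:
    """Map a text prompt to a sequence of embedding vectors (dummy)."""
    rng = np.random.default_rng(abs(hash(prompt)) % (2**32))
    return rng.standard_normal((len(prompt.split()), 256))

def token_decoder(text_emb: np.ndarray, n_tokens: int = 256) -> np.ndarray:
    """Predict discrete mel-spectrogram token ids conditioned on the text
    embedding (in Diffsound this role is played by the discrete diffusion
    model; here the conditioning is ignored and random ids are sampled)."""
    rng = np.random.default_rng(0)
    return rng.integers(0, 1024, size=n_tokens)

def vqvae_decode(token_ids: np.ndarray) -> np.ndarray:
    """Turn discrete tokens back into a mel-spectrogram via a dummy codebook lookup."""
    codebook = np.random.default_rng(1).standard_normal((1024, 80))
    return codebook[token_ids]                    # (n_tokens, 80) mel frames

def vocoder(mel: np.ndarray, hop: int = 256) -> np.ndarray:
    """Synthesize a waveform from the mel-spectrogram (dummy upsampling)."""
    return np.repeat(mel.mean(axis=1), hop)

waveform = vocoder(vqvae_decode(token_decoder(text_encoder("a dog barks twice"))))
print(waveform.shape)
```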

WavCaps: A ChatGPT-Assisted Weakly-Labelled Audio Captioning Dataset for Audio-Language Multimodal Research

3 code implementations • 30 Mar 2023 • Xinhao Mei, Chutong Meng, Haohe Liu, Qiuqiang Kong, Tom Ko, Chengqi Zhao, Mark D. Plumbley, Yuexian Zou, Wenwu Wang

To address this data scarcity issue, we introduce WavCaps, the first large-scale weakly-labelled audio captioning dataset, comprising approximately 400k audio clips with paired captions.

 Ranked #1 on Zero-Shot Environment Sound Classification on ESC-50 (using extra training data)

Audio captioning Event Detection +6

Integrating Lattice-Free MMI into End-to-End Speech Recognition

1 code implementation • 29 Mar 2022 • Jinchuan Tian, Jianwei Yu, Chao Weng, Yuexian Zou, Dong Yu

However, the effectiveness and efficiency of the MBR-based methods are compromised: the MBR criterion is only used in system training, which creates a mismatch between training and decoding; the on-the-fly decoding process in MBR-based methods results in the need for pre-trained models and slow training speeds.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +1

LAE: Language-Aware Encoder for Monolingual and Multilingual ASR

1 code implementation • 5 Jun 2022 • Jinchuan Tian, Jianwei Yu, Chunlei Zhang, Chao Weng, Yuexian Zou, Dong Yu

Experiments conducted on Mandarin-English code-switched speech suggest that the proposed LAE is capable of discriminating different languages in frame-level and shows superior performance on both monolingual and multilingual ASR tasks.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +1

CoLA: Weakly-Supervised Temporal Action Localization with Snippet Contrastive Learning

1 code implementation • CVPR 2021 • Can Zhang, Meng Cao, Dongming Yang, Jie Chen, Yuexian Zou

In this paper, we argue that learning by comparing helps identify these hard snippets and we propose to utilize snippet Contrastive learning to Localize Actions, CoLA for short.

CoLA Contrastive Learning +3
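
Snippet-level contrastive learning of the kind described above is commonly implemented as an InfoNCE objective over mined snippet features; the sketch below is a generic formulation under that assumption, not necessarily CoLA's exact loss or mining strategy:

```python
import torch
import torch.nn.functional as F

def snippet_contrastive_loss(query, positives, negatives, temperature=0.07):
    """Generic InfoNCE-style loss over snippet features.

    query:     (D,)   feature of a hard snippet to be refined
    positives: (P, D) features of easy snippets from the same action class
    negatives: (N, D) features of background / other-class snippets
    """
    q = F.normalize(query, dim=-1)
    pos = F.normalize(positives, dim=-1)
    neg = F.normalize(negatives, dim=-1)
    pos_logits = pos @ q / temperature        # (P,)
    neg_logits = neg @ q / temperature        # (N,)
    # Contrast each positive against all negatives; the positive sits at index 0.
    logits = torch.cat([pos_logits.unsqueeze(1),
                        neg_logits.unsqueeze(0).expand(pos_logits.size(0), -1)], dim=1)
    labels = torch.zeros(pos_logits.size(0), dtype=torch.long)
    return F.cross_entropy(logits, labels)

loss = snippet_contrastive_loss(torch.randn(128),
                                torch.randn(4, 128),
                                torch.randn(50, 128))
```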

Non-Autoregressive Coarse-to-Fine Video Captioning

1 code implementation • 27 Nov 2019 • Bang Yang, Yuexian Zou, Fenglin Liu, Can Zhang

However, mainstream video captioning methods suffer from slow inference speed due to the sequential manner of autoregressive decoding, and prefer generating generic descriptions due to the insufficient training of visual words (e.g., nouns and verbs) and an inadequate decoding paradigm.

Sentence Video Captioning

UnifiedVisionGPT: Streamlining Vision-Oriented AI through Generalized Multimodal Framework

1 code implementation • 16 Nov 2023 • Chris Kelly, Luhui Hu, Cindy Yang, Yu Tian, Deshun Yang, Bang Yang, Zaoshan Huang, Zihao Li, Yuexian Zou

In the current landscape of artificial intelligence, foundation models serve as the bedrock for advancements in both language and vision domains.

LocVTP: Video-Text Pre-training for Temporal Localization

1 code implementation • 21 Jul 2022 • Meng Cao, Tianyu Yang, Junwu Weng, Can Zhang, Jue Wang, Yuexian Zou

To further enhance the temporal reasoning ability of the learned feature, we propose a context projection head and a temporal aware contrastive loss to perceive the contextual relationships.

Retrieval Temporal Localization +1

MultiCapCLIP: Auto-Encoding Prompts for Zero-Shot Multilingual Visual Captioning

1 code implementation • 25 Aug 2023 • Bang Yang, Fenglin Liu, Xian Wu, YaoWei Wang, Xu Sun, Yuexian Zou

To deal with the label shortage problem, we present a simple yet effective zero-shot approach MultiCapCLIP that can generate visual captions for different scenarios and languages without any labeled vision-caption pairs of downstream datasets.

Image Captioning Video Captioning

Unsupervised Pre-training for Temporal Action Localization Tasks

1 code implementation • CVPR 2022 • Can Zhang, Tianyu Yang, Junwu Weng, Meng Cao, Jue Wang, Yuexian Zou

These pre-trained models can be sub-optimal for temporal localization tasks due to the inherent discrepancy between video-level classification and clip-level localization.

Contrastive Learning Representation Learning +4

ZeroNLG: Aligning and Autoencoding Domains for Zero-Shot Multimodal and Multilingual Natural Language Generation

1 code implementation • 11 Mar 2023 • Bang Yang, Fenglin Liu, Yuexian Zou, Xian Wu, YaoWei Wang, David A. Clifton

We present the results of extensive experiments on twelve NLG tasks, showing that, without using any labeled downstream pairs for training, ZeroNLG generates high-quality and believable outputs and significantly outperforms existing zero-shot methods.

Image Captioning Machine Translation +5

Correspondence Matters for Video Referring Expression Comprehension

1 code implementation • 21 Jul 2022 • Meng Cao, Ji Jiang, Long Chen, Yuexian Zou

Extensive experiments demonstrate that our DCNet achieves state-of-the-art performance on both video and image REC benchmarks.

Contrastive Learning Referring Expression +3

G2L: Semantically Aligned and Uniform Video Grounding via Geodesic and Game Theory

1 code implementation • ICCV 2023 • Hongxiang Li, Meng Cao, Xuxin Cheng, Yaowei Li, Zhihong Zhu, Yuexian Zou

Due to two annoying issues in video grounding: (1) the co-existence of some visual entities in both the ground truth and other moments, i.e., semantic overlapping; (2) only a few moments in the video are annotated, i.e., the sparse annotation dilemma, vanilla contrastive learning is unable to model the correlations between temporally distant moments and learns inconsistent video representations.

Contrastive Learning Video Grounding

FTM: A Frame-level Timeline Modeling Method for Temporal Graph Representation Learning

1 code implementation • 23 Feb 2023 • Bowen Cao, Qichen Ye, Weiyuan Xu, Yuexian Zou

Existing neighborhood aggregation strategies fail to capture either the short-term features or the long-term features of temporal graph attributes, leading to unsatisfactory model performance and even poor robustness and domain generality of the representation learning method.

Graph Representation Learning

FiTs: Fine-grained Two-stage Training for Knowledge-aware Question Answering

1 code implementation • 23 Feb 2023 • Qichen Ye, Bowen Cao, Nuo Chen, Weiyuan Xu, Yuexian Zou

Despite the promising result of recent KAQA systems which tend to integrate linguistic knowledge from pre-trained language models (PLM) and factual knowledge from knowledge graphs (KG) to answer complex questions, a bottleneck exists in effectively fusing the representations from PLMs and KGs because of (i) the semantic and distributional gaps between them, and (ii) the difficulties in joint reasoning over the provided knowledge from both modalities.

Knowledge Graphs Question Answering +1

PoseRAC: Pose Saliency Transformer for Repetitive Action Counting

1 code implementation • 15 Mar 2023 • Ziyu Yao, Xuxin Cheng, Yuexian Zou

Moreover, we introduce a pose-level method, PoseRAC, which is based on this representation and achieves state-of-the-art performance on two new version datasets by using Pose Saliency Annotation to annotate salient poses for training.

Repetitive Action Counting

A Dynamic Graph Interactive Framework with Label-Semantic Injection for Spoken Language Understanding

1 code implementation • 8 Nov 2022 • Zhihong Zhu, Weiyuan Xu, Xuxin Cheng, Tengtao Song, Yuexian Zou

Multi-intent detection and slot filling joint models are gaining increasing traction since they are closer to complicated real-world scenarios.

Intent Detection slot-filling +2

CLIP Meets Video Captioning: Concept-Aware Representation Learning Does Matter

1 code implementation • 30 Nov 2021 • Bang Yang, Tong Zhang, Yuexian Zou

DCD is an auxiliary task that requires a caption model to learn the correspondence between video content and concepts and the co-occurrence relations between concepts.

Caption Generation Representation Learning +1

Video Referring Expression Comprehension via Transformer with Content-aware Query

1 code implementation • 6 Oct 2022 • Ji Jiang, Meng Cao, Tengtao Song, Yuexian Zou

To this end, we introduce two new datasets (i.e., VID-Entity and VidSTG-Entity) by augmenting the VIDSentence and VidSTG datasets with the explicitly referred words in the whole sentence, respectively.

Referring Expression Referring Expression Comprehension +1

End-to-End Multi-Channel Speech Separation

no code implementations • 15 May 2019 • Rongzhi Gu, Jian Wu, Shi-Xiong Zhang, Lian-Wu Chen, Yong Xu, Meng Yu, Dan Su, Yuexian Zou, Dong Yu

This paper extended the previous approach and proposed a new end-to-end model for multi-channel speech separation.

Speech Separation

C-RPNs: Promoting Object Detection in real world via a Cascade Structure of Region Proposal Networks

no code implementations • 19 Aug 2019 • Dongming Yang, Yuexian Zou, Jian Zhang, Ge Li

Although two-stage detectors like Faster R-CNN have achieved great success in object detection thanks to the strategy of extracting region proposals with a region proposal network, they adapt poorly to real-world object detection because they do not mine hard samples when extracting region proposals.

Object object-detection +2

Environmental Sound Classification with Parallel Temporal-spectral Attention

no code implementations • 14 Dec 2019 • Helin Wang, Yuexian Zou, Dading Chong, Wenwu Wang

Convolutional neural networks (CNN) are one of the best-performing neural network architectures for environmental sound classification (ESC).

Acoustic Scene Classification Environmental Sound Classification +3

Temporal-Spatial Neural Filter: Direction Informed End-to-End Multi-channel Target Speech Separation

no code implementations • 2 Jan 2020 • Rongzhi Gu, Yuexian Zou

To address these challenges, we propose a temporal-spatial neural filter, which directly estimates the target speech waveform from multi-speaker mixture in reverberant environments, assisted with directional information of the speaker(s).

Speech Separation

GID-Net: Detecting Human-Object Interaction with Global and Instance Dependency

no code implementations • 11 Mar 2020 • Dongming Yang, Yuexian Zou, Jian Zhang, Ge Li

GID block breaks through the local neighborhoods and captures long-range dependency of pixels both in global-level and instance-level from the scene to help detecting interactions between instances.

Human-Object Interaction Detection Object

Enhancing End-to-End Multi-channel Speech Separation via Spatial Feature Learning

no code implementations • 9 Mar 2020 • Rongzhi Gu, Shi-Xiong Zhang, Lian-Wu Chen, Yong Xu, Meng Yu, Dan Su, Yuexian Zou, Dong Yu

Hand-crafted spatial features (e.g., inter-channel phase difference, IPD) play a fundamental role in recent deep learning based multi-channel speech separation (MCSS) methods.

Speech Separation
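
The IPD feature mentioned in the abstract above is conventionally the phase difference between the STFTs of two microphone channels, often expressed as cosine/sine pairs; below is a minimal NumPy sketch of that generic definition, not this paper's learned front end:

```python
import numpy as np

def stft(x, n_fft=512, hop=256):
    """Naive STFT with a Hann window (illustrative, not optimized)."""
    win = np.hanning(n_fft)
    frames = [x[i:i + n_fft] * win
              for i in range(0, len(x) - n_fft + 1, hop)]
    return np.stack([np.fft.rfft(f) for f in frames])     # (frames, bins) complex

def ipd_features(ch1, ch2, n_fft=512, hop=256):
    """Inter-channel phase difference, expressed as cos/sin to avoid the
    2*pi wrap-around (a common convention in MCSS front ends)."""
    s1, s2 = stft(ch1, n_fft, hop), stft(ch2, n_fft, hop)
    ipd = np.angle(s1) - np.angle(s2)
    return np.cos(ipd), np.sin(ipd)

mix = np.random.randn(2, 16000)            # 1 s of 2-channel noise at 16 kHz
cos_ipd, sin_ipd = ipd_features(mix[0], mix[1])
print(cos_ipd.shape)                       # (frames, freq bins)
```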

Multi-modal Multi-channel Target Speech Separation

no code implementations • 16 Mar 2020 • Rongzhi Gu, Shi-Xiong Zhang, Yong Xu, Lian-Wu Chen, Yuexian Zou, Dong Yu

Target speech separation refers to extracting a target speaker's voice from an overlapped audio of simultaneous talkers.

Speech Separation

All you need is a second look: Towards Tighter Arbitrary shape text detection

no code implementations • 26 Apr 2020 • Meng Cao, Yuexian Zou

Specifically, NASK consists of a Text Instance Segmentation network named TIS (1st stage), a Text RoI Pooling module, and a Fiducial pOint eXpression module termed FOX (2nd stage).

Instance Segmentation Scene Text Detection +3

A Graph-based Interactive Reasoning for Human-Object Interaction Detection

no code implementations • 14 Jul 2020 • Dongming Yang, Yuexian Zou

However, recent HOI detection methods mostly rely on additional annotations (e.g., human pose) and neglect powerful interactive reasoning beyond convolutions.

Human-Object Interaction Detection

PIN: A Novel Parallel Interactive Network for Spoken Language Understanding

no code implementations • 28 Sep 2020 • Peilin Zhou, Zhiqi Huang, Fenglin Liu, Yuexian Zou

However, we note that, so far, efforts to obtain better performance by supporting bidirectional and explicit information exchange between ID and SF have not been well studied. In addition, few studies attempt to capture local context information to enhance the performance of SF.

Intent Detection Language Modelling +3

Towards Data Distillation for End-to-end Spoken Conversational Question Answering

no code implementations • 18 Oct 2020 • Chenyu You, Nuo Chen, Fenglin Liu, Dongchao Yang, Yuexian Zou

In spoken question answering, QA systems are designed to answer questions from contiguous text spans within the related speech transcripts.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +2

Knowledge Distillation for Improved Accuracy in Spoken Question Answering

no code implementations • 21 Oct 2020 • Chenyu You, Nuo Chen, Yuexian Zou

However, recent work shows that ASR systems generate highly noisy transcripts, which critically limits the capability of machine comprehension on the SQA task.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +5

Contextualized Attention-based Knowledge Transfer for Spoken Conversational Question Answering

no code implementations • 21 Oct 2020 • Chenyu You, Nuo Chen, Yuexian Zou

Spoken conversational question answering (SCQA) requires machines to model complex dialogue flow given the speech utterances and text corpora.

Audio Signal Processing Conversational Question Answering +2

Prophet Attention: Predicting Attention with Future Attention

no code implementations • NeurIPS 2020 • Fenglin Liu, Xuancheng Ren, Xian Wu, Shen Ge, Wei Fan, Yuexian Zou, Xu Sun

Especially for image captioning, the attention based models are expected to ground correct image regions with proper generated words.

Image Captioning

Adaptive Bi-directional Attention: Exploring Multi-Granularity Representations for Machine Reading Comprehension

no code implementations • 20 Dec 2020 • Nuo Chen, Fenglin Liu, Chenyu You, Peilin Zhou, Yuexian Zou

To predict the answer, it is common practice to employ a predictor to draw information only from the final encoder layer, which generates the coarse-grained representations of the source sequences, i.e., passage and question.

Machine Reading Comprehension

FWB-Net:Front White Balance Network for Color Shift Correction in Single Image Dehazing via Atmospheric Light Estimation

no code implementations • 21 Jan 2021 • Cong Wang, Yan Huang, Yuexian Zou, Yong Xu

However, for images taken in the real world, the illumination is not uniformly distributed over the whole image, which brings model mismatch and possibly results in color shift for deep models using ASM.

Image Dehazing Single Image Dehazing

SpecAugment++: A Hidden Space Data Augmentation Method for Acoustic Scene Classification

no code implementations • 31 Mar 2021 • Helin Wang, Yuexian Zou, Wenwu Wang

In this paper, we present SpecAugment++, a novel data augmentation method for deep neural networks based acoustic scene classification (ASC).

Acoustic Scene Classification Data Augmentation +2
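
SpecAugment-style masking applied to hidden feature maps, as the title above suggests, can be illustrated with a small stand-alone function; the masking widths, placement, and zero fill value below are assumptions for illustration, not the paper's exact recipe:

```python
import torch

def mask_hidden_features(h, max_time_mask=20, max_freq_mask=8):
    """Zero out random time and frequency stripes of a hidden feature map.

    h: (batch, channels, time, freq) activations from an intermediate CNN layer.
    This mirrors SpecAugment-style masking but operates in the hidden space.
    """
    h = h.clone()
    b, _, t_len, f_len = h.shape
    for i in range(b):
        t_width = int(torch.randint(0, max_time_mask + 1, (1,)))
        t_start = int(torch.randint(0, max(1, t_len - t_width), (1,)))
        h[i, :, t_start:t_start + t_width, :] = 0.0

        f_width = int(torch.randint(0, max_freq_mask + 1, (1,)))
        f_start = int(torch.randint(0, max(1, f_len - f_width), (1,)))
        h[i, :, :, f_start:f_start + f_width] = 0.0
    return h

hidden = torch.randn(4, 64, 100, 32)       # dummy intermediate activations
augmented = mask_hidden_features(hidden)
```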

RR-Net: Injecting Interactive Semantics in Human-Object Interaction Detection

no code implementations • 30 Apr 2021 • Dongming Yang, Yuexian Zou, Can Zhang, Meng Cao, Jie Chen

Upon the frame, an Interaction Intensifier Module and a Correlation Parsing Module are carefully designed, where: a) interactive semantics from humans can be exploited and passed to objects to intensify interactions, b) interactive correlations among humans, objects and interactions are integrated to promote predictions.

Human-Object Interaction Detection Relation

Layer Reduction: Accelerating Conformer-Based Self-Supervised Model via Layer Consistency

no code implementations • 8 Apr 2021 • Jinchuan Tian, Rongzhi Gu, Helin Wang, Yuexian Zou

Transformer-based self-supervised models are trained as feature extractors and have empowered many downstream speech tasks to achieve state-of-the-art performance.

speech-recognition Speech Recognition

Rethinking Skip Connection with Layer Normalization in Transformers and ResNets

no code implementations • 15 May 2021 • Fenglin Liu, Xuancheng Ren, Zhiyuan Zhang, Xu Sun, Yuexian Zou

In this work, we investigate how the scale factors into the effectiveness of the skip connection and reveal that a trivial adjustment of the scale can lead to spurious gradient explosion or vanishing as the models grow deeper, which can be addressed by normalization, in particular layer normalization, which induces consistent improvements over the plain skip connection.

Image Classification Machine Translation +1
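
The abstract above argues that scaling the residual branch can make gradients explode or vanish with depth unless normalization is applied. Below is a toy PyTorch block showing a scaled skip connection followed by layer normalization; the scale value and module choices are illustrative assumptions, not the paper's exact architecture:

```python
import torch
import torch.nn as nn

class ScaledSkipBlock(nn.Module):
    """Residual block with a scale factor on the transformation branch,
    followed by layer normalization, in the spirit of the analysis above."""

    def __init__(self, dim, scale=0.5):
        super().__init__()
        self.transform = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(),
                                       nn.Linear(dim, dim))
        self.norm = nn.LayerNorm(dim)
        self.scale = scale

    def forward(self, x):
        # y = LayerNorm(x + scale * F(x)); without the normalization,
        # an ill-chosen scale can amplify or shrink gradients with depth.
        return self.norm(x + self.scale * self.transform(x))

x = torch.randn(8, 256)
y = ScaledSkipBlock(256)(x)
```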

Contrastive Attention for Automatic Chest X-ray Report Generation

no code implementations • Findings (ACL) 2021 • Fenglin Liu, Changchang Yin, Xian Wu, Shen Ge, Ping Zhang, Yuexian Zou, Xu Sun

In addition, according to the analysis, the CA model can help existing models better attend to the abnormal regions and provide more accurate descriptions which are crucial for an interpretable diagnosis.

Exploring and Distilling Posterior and Prior Knowledge for Radiology Report Generation

no code implementations • CVPR 2021 • Fenglin Liu, Xian Wu, Shen Ge, Wei Fan, Yuexian Zou

In detail, PoKE explores the posterior knowledge, which provides explicit abnormal visual regions to alleviate visual data bias; PrKE explores the prior knowledge from the prior medical knowledge graph (medical knowledge) and prior radiology reports (working experience) to alleviate textual data bias.

Exploring Semantic Relationships for Unpaired Image Captioning

no code implementations • 20 Jun 2021 • Fenglin Liu, Meng Gao, Tianhao Zhang, Yuexian Zou

To further improve the quality of captions generated by the model, we propose the Semantic Relationship Explorer, which explores the relationships between semantic concepts for better understanding of the image.

Image Captioning Sentence

All You Need is a Second Look: Towards Arbitrary-Shaped Text Detection

no code implementations • 24 Jun 2021 • Meng Cao, Can Zhang, Dongming Yang, Yuexian Zou

Compared to the traditional single-stage segmentation network, our NASK conducts the detection in a coarse-to-fine manner with the first stage segmentation spotting the rectangle text proposals and the second one retrieving compact representations.

Instance Segmentation Segmentation +2

SRF-Net: Selective Receptive Field Network for Anchor-Free Temporal Action Detection

no code implementations • 29 Jun 2021 • Ranyu Ning, Can Zhang, Yuexian Zou

Current mainstream one-stage TAD approaches localize and classify action proposals relying on pre-defined anchors, where the location and scale for action instances are set by designers.

Action Detection

Long-Short Temporal Modeling for Efficient Action Recognition

no code implementations • 30 Jun 2021 • Liyu Wu, Yuexian Zou, Can Zhang

Efficient long-short temporal modeling is key for enhancing the performance of action recognition task.

Action Recognition

Audio-Oriented Multimodal Machine Comprehension: Task, Dataset and Model

no code implementations • 4 Jul 2021 • Zhiqi Huang, Fenglin Liu, Xian Wu, Shen Ge, Helin Wang, Wei Fan, Yuexian Zou

As a result, the proposed approach can handle various tasks including: Audio-Oriented Multimodal Machine Comprehension, Machine Reading Comprehension and Machine Listening Comprehension, in a single model, making fair comparisons possible between our model and the existing unimodal MC models.

Knowledge Distillation Machine Reading Comprehension

Text Anchor Based Metric Learning for Small-footprint Keyword Spotting

no code implementations • 12 Aug 2021 • Li Wang, Rongzhi Gu, Nuo Chen, Yuexian Zou

Recently proposed metric learning approaches have improved the generalizability of models for the KWS task, and 1D-CNN based KWS models have achieved state-of-the-art (SOTA) results in terms of model size.

Metric Learning Small-Footprint Keyword Spotting

Joint Multiple Intent Detection and Slot Filling via Self-distillation

no code implementations • 18 Aug 2021 • Lisong Chen, Peilin Zhou, Yuexian Zou

With the auxiliary knowledge provided by the MIL Intent Decoder, we set Final Slot Decoder as the teacher model that imparts knowledge back to Initial Slot Decoder to complete the loop.

Intent Detection Multiple Instance Learning +3

Fully Non-Homogeneous Atmospheric Scattering Modeling with Convolutional Neural Networks for Single Image Dehazing

no code implementations • 25 Aug 2021 • Cong Wang, Yan Huang, Yuexian Zou, Yong Xu

However, it is noted that ASM-based SIDM degrades in performance when dehazing real-world hazy images due to the limited modelling ability of ASM, in which the atmospheric light factor (ALF) and the angular scattering coefficient (ASC) are assumed to be constant for a single image.

Image Dehazing Single Image Dehazing

HAN: Higher-order Attention Network for Spoken Language Understanding

no code implementations • 26 Aug 2021 • Dongsheng Chen, Zhiqi Huang, Yuexian Zou

Spoken Language Understanding (SLU), including intent detection and slot filling, is a core component in human-computer interaction.

Intent Detection slot-filling +2

Self-supervised Contrastive Cross-Modality Representation Learning for Spoken Question Answering

no code implementations • Findings (EMNLP) 2021 • Chenyu You, Nuo Chen, Yuexian Zou

In this paper, we propose novel training schemes for spoken question answering with a self-supervised training stage and a contrastive representation learning stage.

Question Answering Representation Learning

On Pursuit of Designing Multi-modal Transformer for Video Grounding

no code implementations • EMNLP 2021 • Meng Cao, Long Chen, Mike Zheng Shou, Can Zhang, Yuexian Zou

Almost all existing video grounding methods fall into two frameworks: 1) Top-down model: It predefines a set of segment candidates and then conducts segment classification and regression.

Sentence Video Grounding

Federated Learning for Spoken Language Understanding

no code implementations • COLING 2020 • Zhiqi Huang, Fenglin Liu, Yuexian Zou

To this end, we propose a federated learning framework, which could unify various types of datasets as well as tasks to learn and fuse various types of knowledge, i.e., text representations, from different datasets and tasks, without the sharing of downstream task data.

Intent Detection slot-filling +4

Rethinking Skip Connection with Layer Normalization

no code implementations • COLING 2020 • Fenglin Liu, Xuancheng Ren, Zhiyuan Zhang, Xu Sun, Yuexian Zou

In this work, we investigate how the scale factors into the effectiveness of the skip connection and reveal that a trivial adjustment of the scale can lead to spurious gradient explosion or vanishing as the models grow deeper, which can be addressed by normalization, in particular layer normalization, which induces consistent improvements over the plain skip connection.

Image Classification Machine Translation +1

Learning Decoupling Features Through Orthogonality Regularization

no code implementations • 31 Mar 2022 • Li Wang, Rongzhi Gu, Weiji Zhuang, Peng Gao, Yujun Wang, Yuexian Zou

Bearing this in mind, a two-branch deep network (a KWS branch and an SV branch) with the same network structure is developed, and a novel decoupling feature learning method is proposed to push up the performance of KWS and SV simultaneously, where speaker-invariant keyword representations and keyword-invariant speaker representations are expected, respectively.

Keyword Spotting Speaker Verification

SpatioTemporal Focus for Skeleton-based Action Recognition

no code implementations • 31 Mar 2022 • Liyu Wu, Can Zhang, Yuexian Zou

Inspired by the recent attention mechanism, we propose a multi-grain contextual focus module, termed MCF, to capture the action associated relation information from the body joints and parts.

Action Recognition Skeleton Based Action Recognition

Target Confusion in End-to-end Speaker Extraction: Analysis and Approaches

no code implementations • 4 Apr 2022 • Zifeng Zhao, Dongchao Yang, Rongzhi Gu, Haoran Zhang, Yuexian Zou

However, its performance is often inferior to that of a blind source separation (BSS) counterpart with a similar network architecture, because the auxiliary speaker encoder may sometimes generate ambiguous speaker embeddings.

blind source separation Metric Learning +2

Speaker-Aware Mixture of Mixtures Training for Weakly Supervised Speaker Extraction

no code implementations • 15 Apr 2022 • Zifeng Zhao, Rongzhi Gu, Dongchao Yang, Jinchuan Tian, Yuexian Zou

Dominant research adopts supervised training for speaker extraction, while the scarcity of ideally clean corpora and the channel mismatch problem are rarely considered.

Domain Adaptation

End-to-end Spoken Conversational Question Answering: Task, Dataset and Model

no code implementations • Findings (NAACL) 2022 • Chenyu You, Nuo Chen, Fenglin Liu, Shen Ge, Xian Wu, Yuexian Zou

To evaluate the capacity of SCQA systems in a dialogue-style interaction, we assemble a Spoken Conversational Question Answering (Spoken-CoQA) dataset with more than 40k question-answer pairs from 4k conversations.

4k Conversational Question Answering +2

Improving Dual-Microphone Speech Enhancement by Learning Cross-Channel Features with Multi-Head Attention

no code implementations • 3 May 2022 • Xinmeng Xu, Rongzhi Gu, Yuexian Zou

Hand-crafted spatial features, such as inter-channel intensity difference (IID) and inter-channel phase difference (IPD), play a fundamental role in recent deep learning based dual-microphone speech enhancement (DMSE) systems.

Multi-Task Learning Speech Enhancement

Competence-based Multimodal Curriculum Learning for Medical Report Generation

no code implementations • ACL 2021 • Fenglin Liu, Shen Ge, Yuexian Zou, Xian Wu

Medical report generation task, which targets to produce long and coherent descriptions of medical images, has attracted growing research interests recently.

Image Captioning Medical Report Generation

A Transformer-based Threshold-Free Framework for Multi-Intent NLU

no code implementations • COLING 2022 • Lisung Chen, Nuo Chen, Yuexian Zou, Yong Wang, Xinzhong Sun

Furthermore, we propose a threshold-free multi-intent classifier that utilizes the output of the IND task and detects multiple intents without depending on a threshold.

Multi-Task Learning Natural Language Understanding

Prophet Attention: Predicting Attention with Future Attention for Image Captioning

no code implementations • 19 Oct 2022 • Fenglin Liu, Xuancheng Ren, Xian Wu, Wei Fan, Yuexian Zou, Xu Sun

Especially for image captioning, the attention based models are expected to ground correct image regions with proper generated words.

Image Captioning

DiMBERT: Learning Vision-Language Grounded Representations with Disentangled Multimodal-Attention

no code implementations • 28 Oct 2022 • Fenglin Liu, Xian Wu, Shen Ge, Xuancheng Ren, Wei Fan, Xu Sun, Yuexian Zou

To enhance the correlation between vision and language in disentangled spaces, we introduce the visual concepts to DiMBERT which represent visual information in textual format.

Image Captioning Language Modelling +3

Aligning Source Visual and Target Language Domains for Unpaired Video Captioning

no code implementations • 22 Nov 2022 • Fenglin Liu, Xian Wu, Chenyu You, Shen Ge, Yuexian Zou, Xu Sun

To this end, we introduce the unpaired video captioning task aiming to train models without coupled video-caption pairs in target language.

Translation Video Captioning

Exploiting Auxiliary Caption for Video Grounding

no code implementations • 15 Jan 2023 • Hongxiang Li, Meng Cao, Xuxin Cheng, Zhihong Zhu, Yaowei Li, Yuexian Zou

Video grounding aims to locate a moment of interest matching the given query sentence from an untrimmed video.

Contrastive Learning Dense Video Captioning +2

Improving Weakly Supervised Sound Event Detection with Causal Intervention

no code implementations • 10 Mar 2023 • Yifei Xin, Dongchao Yang, Fan Cui, Yujun Wang, Yuexian Zou

Existing weakly supervised sound event detection (WSSED) work has not explored both types of co-occurrences simultaneously, i.e., some sound events often co-occur, and their occurrences are usually accompanied by specific background sounds, so they would be inevitably entangled, causing misclassification and biased localization results with only clip-level supervision.

Event Detection Sound Event Detection

Improve Retrieval-based Dialogue System via Syntax-Informed Attention

no code implementations • 12 Mar 2023 • Tengtao Song, Nuo Chen, Ji Jiang, Zhihong Zhu, Yuexian Zou

Since incorporating syntactic information like dependency structures into neural models can promote a better understanding of the sentences, such a method has been widely used in NLP tasks.

Retrieval Sentence

Unify, Align and Refine: Multi-Level Semantic Alignment for Radiology Report Generation

no code implementations • ICCV 2023 • Yaowei Li, Bang Yang, Xuxin Cheng, Zhihong Zhu, Hongxiang Li, Yuexian Zou

Automatic radiology report generation has attracted enormous research interest due to its practical value in reducing the workload of radiologists.

Sentence

Iterative Proposal Refinement for Weakly-Supervised Video Grounding

no code implementations • CVPR 2023 • Meng Cao, Fangyun Wei, Can Xu, Xiubo Geng, Long Chen, Can Zhang, Yuexian Zou, Tao Shen, Daxin Jiang

Weakly-Supervised Video Grounding (WSVG) aims to localize events of interest in untrimmed videos with only video-level annotations.

Sentence Video Grounding

Customizing General-Purpose Foundation Models for Medical Report Generation

no code implementations • 9 Jun 2023 • Bang Yang, Asif Raza, Yuexian Zou, Tong Zhang

In this work, we propose customizing off-the-shelf general-purpose large-scale pre-trained models, i.e., foundation models (FMs), in computer vision and natural language processing with a specific focus on medical report generation.

Medical Report Generation Transfer Learning

Multimodal Prompt Learning for Product Title Generation with Extremely Limited Labels

no code implementations • 5 Jul 2023 • Bang Yang, Fenglin Liu, Zheng Li, Qingyu Yin, Chenyu You, Bing Yin, Yuexian Zou

We observe that the core challenges of novel product title generation are the understanding of novel product characteristics and the generation of titles in a novel writing style.

Image Captioning Text Generation

ML-LMCL: Mutual Learning and Large-Margin Contrastive Learning for Improving ASR Robustness in Spoken Language Understanding

no code implementations • 19 Nov 2023 • Xuxin Cheng, Bowen Cao, Qichen Ye, Zhihong Zhu, Hongxiang Li, Yuexian Zou

Specifically, in fine-tuning, we apply mutual learning and train two SLU models on the manual transcripts and the ASR transcripts, respectively, aiming to iteratively share knowledge between these two models.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +4

Embracing Language Inclusivity and Diversity in CLIP through Continual Language Learning

1 code implementation • 30 Jan 2024 • Bang Yang, Yong Dai, Xuxin Cheng, Yaowei Li, Asif Raza, Yuexian Zou

To alleviate CF raised by covariate shift and lexical overlap, we further propose a novel approach that ensures the identical distribution of all token embeddings during initialization and regularizes token embedding learning during training.

Text Retrieval
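
One common way to give newly added language tokens the same distribution as existing ones at initialization is to sample them from a Gaussian fitted to the current embedding matrix; the sketch below illustrates that general idea and is not necessarily the authors' exact procedure:

```python
import torch

def init_new_token_embeddings(embedding, num_new_tokens):
    """Append embeddings for new-language tokens, drawn from a Gaussian fitted
    to the existing embedding matrix so the added rows share the same per-
    dimension mean and standard deviation (i.e., a diagonal covariance)."""
    old = embedding.weight.data                     # (vocab, dim)
    mean, std = old.mean(dim=0), old.std(dim=0)
    new_rows = mean + std * torch.randn(num_new_tokens, old.size(1))
    new_weight = torch.cat([old, new_rows], dim=0)
    new_embedding = torch.nn.Embedding(new_weight.size(0), new_weight.size(1))
    new_embedding.weight.data.copy_(new_weight)
    return new_embedding

emb = torch.nn.Embedding(49408, 512)                # e.g. a CLIP-sized vocabulary
emb = init_new_token_embeddings(emb, num_new_tokens=1000)
print(emb.weight.shape)                             # torch.Size([50408, 512])
```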

Retrieval is Accurate Generation

no code implementations • 27 Feb 2024 • Bowen Cao, Deng Cai, Leyang Cui, Xuxin Cheng, Wei Bi, Yuexian Zou, Shuming Shi

To address this, we propose to initialize the training oracles using linguistic heuristics and, more importantly, bootstrap the oracles through iterative self-reinforcement.

Language Modelling Retrieval +1

Learn Suspected Anomalies from Event Prompts for Video Anomaly Detection

no code implementations • 2 Mar 2024 • Chenchen Tao, Chong Wang, Yuexian Zou, Xiaohao Peng, Jiafei Wu, Jiangbo Qian

Most models for weakly supervised video anomaly detection (WS-VAD) rely on multiple instance learning, aiming to distinguish normal and abnormal snippets without specifying the type of anomaly.

Anomaly Detection Multiple Instance Learning +1

WorldGPT: A Sora-Inspired Video AI Agent as Rich World Models from Text and Image Inputs

no code implementations • 10 Mar 2024 • Deshun Yang, Luhui Hu, Yu Tian, Zihao Li, Chris Kelly, Bang Yang, Cindy Yang, Yuexian Zou

Several text-to-video diffusion models have demonstrated commendable capabilities in synthesizing high-quality video content.

Video Generation

VisionGPT-3D: A Generalized Multimodal Agent for Enhanced 3D Vision Understanding

no code implementations • 14 Mar 2024 • Chris Kelly, Luhui Hu, Jiayin Hu, Yu Tian, Deshun Yang, Bang Yang, Cindy Yang, Zihao Li, Zaoshan Huang, Yuexian Zou

It seamlessly integrates various SOTA vision models, automates the selection of SOTA vision models, identifies the suitable 3D mesh creation algorithms corresponding to 2D depth map analysis, and generates optimal results based on diverse multimodal inputs such as text prompts.

VisionGPT: Vision-Language Understanding Agent Using Generalized Multimodal Framework

no code implementations • 14 Mar 2024 • Chris Kelly, Luhui Hu, Bang Yang, Yu Tian, Deshun Yang, Cindy Yang, Zaoshan Huang, Zihao Li, Jiayin Hu, Yuexian Zou

With the emergence of large language models (LLMs) and vision foundation models, how to combine the intelligence and capacity of these open-sourced or API-available models to achieve open-world visual perception remains an open question.

Language Modelling Large Language Model +2
