Search Results for author: Yapeng Tian

Found 66 papers, 35 papers with code

Efficient Self-Improvement in Multimodal Large Language Models: A Model-Level Judge-Free Approach

no code implementations • 26 Nov 2024 • Shijian Deng, Wentian Zhao, Yu-Jhe Li, Kun Wan, Daniel Miranda, Ajinkya Kale, Yapeng Tian

Self-improvement in multimodal large language models (MLLMs) is crucial for enhancing their reliability and robustness.

Hallucination

CATCH: Complementary Adaptive Token-level Contrastive Decoding to Mitigate Hallucinations in LVLMs

no code implementations • 19 Nov 2024 • Zhehan Kan, Ce Zhang, Zihan Liao, Yapeng Tian, Wenming Yang, Junyuan Xiao, Xu Li, Dongmei Jiang, YaoWei Wang, Qingmin Liao

Large Vision-Language Model (LVLM) systems have demonstrated impressive vision-language reasoning capabilities but suffer from pervasive and severe hallucination issues, posing significant risks in critical domains such as healthcare and autonomous systems.

Hallucination Language Modelling +2

Motion-Grounded Video Reasoning: Understanding and Perceiving Motion at Pixel Level

no code implementations • 15 Nov 2024 • Andong Deng, Tongjia Chen, Shoubin Yu, Taojiannan Yang, Lincoln Spencer, Yapeng Tian, Ajmal Saeed Mian, Mohit Bansal, Chen Chen

In this paper, we introduce Motion-Grounded Video Reasoning, a new motion understanding task that requires generating visual answers (video segmentation masks) according to the input question, and hence needs implicit spatiotemporal reasoning and grounding.

Benchmarking counterfactual +6

Continual Audio-Visual Sound Separation

1 code implementation • 5 Nov 2024 • Weiguo Pian, Yiyang Nan, Shijian Deng, Shentong Mo, Yunhui Guo, Yapeng Tian

The task is inherently challenging as our models must not only effectively utilize information from both modalities in current tasks but also preserve their cross-modal association in old tasks to mitigate catastrophic forgetting during audio-visual continual learning.

Continual Learning Semantic Similarity +1

Scaling Concept With Text-Guided Diffusion Models

no code implementations • 31 Oct 2024 • Chao Huang, Susan Liang, Yunlong Tang, Yapeng Tian, Anurag Kumar, Chenliang Xu

Through an empirical study, we identify a trend where concepts can be decomposed in text-guided diffusion models.

CLIPErase: Efficient Unlearning of Visual-Textual Associations in CLIP

no code implementations • 30 Oct 2024 • Tianyu Yang, Lisen Dai, Zheyuan Liu, Xiangqi Wang, Meng Jiang, Yapeng Tian, Xiangliang Zhang

Machine unlearning (MU) has gained significant attention as a means to remove specific data from trained models without requiring a full retraining process.

Image Classification Machine Unlearning

DiffTED: One-shot Audio-driven TED Talk Video Generation with Diffusion-based Co-speech Gestures

no code implementations • 11 Sep 2024 • Steven Hogue, Chenxu Zhang, Hamza Daruger, Yapeng Tian, Xiaohu Guo

Audio-driven talking video generation has advanced significantly, but existing methods often depend on video-to-video translation techniques and traditional generative networks such as GANs, and they typically generate talking heads and co-speech gestures separately, leading to less coherent outputs.

Diversity Talking Head Generation +1

Semantic Grouping Network for Audio Source Separation

no code implementations • 4 Jul 2024 • Shentong Mo, Yapeng Tian

They extracted high-level semantics from visual inputs as guidance to help disentangle the sound representation for individual sources.

Audio Source Separation

MA-AVT: Modality Alignment for Parameter-Efficient Audio-Visual Transformers

1 code implementation • 7 Jun 2024 • Tanvir Mahmud, Shentong Mo, Yapeng Tian, Diana Marculescu

Furthermore, to suppress the background features in each modality from foreground matched audio-visual features, we introduce a robust discriminative foreground mining scheme.

audio-visual learning Contrastive Learning

Scaling Diffusion Mamba with Bidirectional SSMs for Efficient Image and Video Generation

no code implementations • 24 May 2024 • Shentong Mo, Yapeng Tian

Traditional diffusion transformers (DiT), which utilize self-attention blocks, are effective but their computational complexity scales quadratically with the input length, limiting their use for high-resolution images.

Image Generation Mamba +1
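To make the quadratic-vs-linear contrast in the snippet above concrete, here is a back-of-the-envelope FLOP count. This is a rough illustration with assumed dimensions (embedding width 512, SSM state size 16), not numbers from the paper:

```python
# Rough cost model: self-attention grows quadratically with sequence
# length n, while an SSM-style recurrent scan grows linearly.

def attention_flops(n, d):
    """QK^T and attn @ V each cost about n*n*d multiply-adds."""
    return 2 * n * n * d

def ssm_scan_flops(n, d, state=16):
    """A per-token state update costs about d*state multiply-adds."""
    return n * d * state

for n in (256, 1024, 4096):
    ratio = attention_flops(n, 512) / ssm_scan_flops(n, 512)
    print(f"n={n}: attention is ~{ratio:.0f}x the scan cost")
```

Doubling the resolution of a latent image quadruples the token count, so the attention-to-scan ratio itself grows linearly with the token count, which is the motivation the snippet gives for replacing self-attention at high resolutions.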

SignLLM: Sign Language Production Large Language Models

no code implementations • 17 May 2024 • Sen Fang, Lei Wang, Ce Zheng, Chunyu Sui, Mingyu Zhao, Yapeng Tian, Chen Chen

In this paper, we propose SignLLM, a multilingual Sign Language Production (SLP) large language model, which includes two novel multilingual SLP modes, MLSF and Prompt2LangGloss, that enable sign language gesture generation from query texts and question-style prompts, respectively.

Language Modelling Large Language Model +1

T-VSL: Text-Guided Visual Sound Source Localization in Mixtures

1 code implementation • CVPR 2024 • Tanvir Mahmud, Yapeng Tian, Diana Marculescu

Visual sound source localization poses a significant challenge in identifying the semantic region of each sounding source within a video.

Sound Source Localization

Robust Active Speaker Detection in Noisy Environments

no code implementations • 27 Mar 2024 • Siva Sai Nagender Vasireddy, Chenxu Zhang, Xiaohu Guo, Yapeng Tian

Experiments demonstrate that non-speech audio noises significantly impact ASD models, and our proposed approach improves ASD performance in noisy environments.

Active Speaker Detection Speech Separation

Hear Me, See Me, Understand Me: Audio-Visual Autism Behavior Recognition

no code implementations • 22 Mar 2024 • Shijian Deng, Erin E. Kosloski, Siddhi Patel, Zeke A. Barnett, Yiyang Nan, Alexander Kaplan, Sisira Aarukapalli, William T. Doan, Matthew Wang, Harsh Singh, Pamela R. Rollins, Yapeng Tian

To pave the way for further research on this new problem, we intensively explored leveraging foundation models and multimodal large language models across different modalities.

Language Modelling Large Language Model +1

Text-to-Audio Generation Synchronized with Videos

no code implementations • 8 Mar 2024 • Shentong Mo, Jing Shi, Yapeng Tian

Extensive evaluations on the AudioCaps and T2AV-Bench demonstrate that our T2AV sets a new standard for video-aligned TTA generation in ensuring visual alignment and temporal consistency.

AudioCaps Audio Generation +1

OSCaR: Object State Captioning and State Change Representation

1 code implementation • 27 Feb 2024 • Nguyen Nguyen, Jing Bi, Ali Vosoughi, Yapeng Tian, Pooyan Fazli, Chenliang Xu

To address these challenges, in this paper, we introduce the Object State Captioning and State Change Representation (OSCaR) dataset and benchmark.

Change Detection Object

Efficiently Leveraging Linguistic Priors for Scene Text Spotting

no code implementations • 27 Feb 2024 • Nguyen Nguyen, Yapeng Tian, Chenliang Xu

This paper proposes a method that leverages linguistic knowledge from a large text corpus to replace the traditional one-hot encoding used in auto-regressive scene text spotting and recognition models.

Scene Text Recognition Text Detection +1

LAVSS: Location-Guided Audio-Visual Spatial Audio Separation

no code implementations • 31 Oct 2023 • Yuxin Ye, Wenming Yang, Yapeng Tian

LAVSS is inspired by the correlation between spatial audio and visual location.

Separating Invisible Sounds Toward Universal Audiovisual Scene-Aware Sound Separation

no code implementations • 18 Oct 2023 • Yiyang Su, Ali Vosoughi, Shijian Deng, Yapeng Tian, Chenliang Xu

The audio-visual sound separation field assumes visible sources in videos, but this excludes invisible sounds beyond the camera's view.

cross-modal alignment

Neural Acoustic Context Field: Rendering Realistic Room Impulse Response With Neural Fields

no code implementations • 27 Sep 2023 • Susan Liang, Chao Huang, Yapeng Tian, Anurag Kumar, Chenliang Xu

Room impulse response (RIR), which measures the sound propagation within an environment, is critical for synthesizing high-fidelity audio for a given environment.

Room Impulse Response (RIR)

Class-Incremental Grouping Network for Continual Audio-Visual Learning

1 code implementation • ICCV 2023 • Shentong Mo, Weiguo Pian, Yapeng Tian

Our CIGN leverages learnable audio-visual class tokens and audio-visual grouping to continually aggregate class-aware features.

audio-visual learning class-incremental learning +3

SignDiff: Diffusion Models for American Sign Language Production

no code implementations • 30 Aug 2023 • Sen Fang, Chunyu Sui, Yanghao Zhou, Xuedong Zhang, Hongbin Zhong, Minyu Zhao, Yapeng Tian, Chen Chen

In this paper, we propose a dual-condition diffusion pre-training model named SignDiff that can generate human sign language speakers from a skeleton pose.

Pose Estimation Sign Language Production +1

DiffI2I: Efficient Diffusion Model for Image-to-Image Translation

no code implementations • 26 Aug 2023 • Bin Xia, Yulun Zhang, Shiyin Wang, Yitong Wang, Xinglong Wu, Yapeng Tian, Wenming Yang, Radu Timofte, Luc van Gool

Compared to traditional DMs, the compact IPR enables DiffI2I to obtain more accurate outcomes and employ a lighter denoising network and fewer iterations.

Denoising Image-to-Image Translation +2

Audio-Visual Class-Incremental Learning

1 code implementation • ICCV 2023 • Weiguo Pian, Shentong Mo, Yunhui Guo, Yapeng Tian

We demonstrate that joint audio-visual modeling can improve class-incremental learning, but current methods fail to preserve semantic similarity between audio and visual features as the number of incremental steps grows.

class-incremental learning Class Incremental Learning +4

High-Quality Visually-Guided Sound Separation from Diverse Categories

no code implementations • 31 Jul 2023 • Chao Huang, Susan Liang, Yapeng Tian, Anurag Kumar, Chenliang Xu

We compare DAVIS to existing state-of-the-art discriminative audio-visual separation methods on the AVE and MUSIC datasets, and results show that DAVIS outperforms other methods in separation quality, demonstrating the advantages of our framework for tackling the audio-visual source separation task.

Dual Arbitrary Scale Super-Resolution for Multi-Contrast MRI

1 code implementation • 5 Jul 2023 • Jiamiao Zhang, Yichen Chi, Jun Lyu, Wenming Yang, Yapeng Tian

Because imaging systems are limited, reconstructing Magnetic Resonance Imaging (MRI) images from partial measurements is essential to medical imaging research.

Decoder Super-Resolution

Unveiling Cross Modality Bias in Visual Question Answering: A Causal View with Possible Worlds VQA

no code implementations • 31 May 2023 • Ali Vosoughi, Shijian Deng, Songyang Zhang, Yapeng Tian, Chenliang Xu, Jiebo Luo

In this paper, we first model a confounding effect that causes language and vision bias simultaneously, then propose a counterfactual inference to remove the influence of this effect.

counterfactual Counterfactual Inference +2

EgoVSR: Towards High-Quality Egocentric Video Super-Resolution

1 code implementation • 24 May 2023 • Yichen Chi, Junhao Gu, Jiamiao Zhang, Wenming Yang, Yapeng Tian

We explicitly tackle motion blurs in egocentric videos using a Dual Branch Deblur Network (DB$^2$Net) in the VSR framework.

Video Super-Resolution

DiffAVA: Personalized Text-to-Audio Generation with Visual Alignment

no code implementations • 22 May 2023 • Shentong Mo, Jing Shi, Yapeng Tian

In this work, we propose DiffAVA, a novel and personalized text-to-audio generation approach with visual alignment based on latent diffusion models; it fine-tunes only lightweight visual-text alignment modules, keeping the modality-specific encoders frozen, to update the visually aligned text embeddings that serve as the condition.

AudioCaps Audio Generation +1

AV-SAM: Segment Anything Model Meets Audio-Visual Localization and Segmentation

no code implementations • 3 May 2023 • Shentong Mo, Yapeng Tian

In this work, we propose a simple yet effective audio-visual localization and segmentation framework based on the Segment Anything Model, namely AV-SAM, that can generate sounding object masks corresponding to the audio.

Decoder Object Localization +2

Audio-Visual Grouping Network for Sound Localization from Mixtures

1 code implementation • CVPR 2023 • Shentong Mo, Yapeng Tian

Sound source localization is a typical and challenging task that predicts the location of sound sources in a video.

Object Localization Sound Source Localization

Egocentric Audio-Visual Object Localization

1 code implementation • CVPR 2023 • Chao Huang, Yapeng Tian, Anurag Kumar, Chenliang Xu

In this paper, we explore the challenging egocentric audio-visual object localization task and observe that 1) egomotion commonly exists in first-person recordings, even within a short duration; 2) the out-of-view sound components can be created while wearers shift their attention.

Object Object Localization

DiffIR: Efficient Diffusion Model for Image Restoration

1 code implementation • ICCV 2023 • Bin Xia, Yulun Zhang, Shiyin Wang, Yitong Wang, Xinglong Wu, Yapeng Tian, Wenming Yang, Luc van Gool

Diffusion model (DM) has achieved SOTA performance by modeling the image synthesis process into a sequential application of a denoising network.

Denoising Image Generation +1

Basic Binary Convolution Unit for Binarized Image Restoration Network

2 code implementations • 2 Oct 2022 • Bin Xia, Yulun Zhang, Yitong Wang, Yapeng Tian, Wenming Yang, Radu Timofte, Luc van Gool

In this study, we reconsider components in binary convolution, such as residual connection, BatchNorm, activation function, and structure, for IR tasks.

Binarization Image Restoration +1

Learning in Audio-visual Context: A Review, Analysis, and New Perspective

no code implementations • 20 Aug 2022 • Yake Wei, Di Hu, Yapeng Tian, Xuelong Li

A comprehensive survey that systematically organizes and analyzes studies in the audio-visual field is therefore needed.

audio-visual learning Scene Understanding +1

Structured Sparsity Learning for Efficient Video Super-Resolution

1 code implementation • CVPR 2023 • Bin Xia, Jingwen He, Yulun Zhang, Yitong Wang, Yapeng Tian, Wenming Yang, Luc van Gool

In SSL, we design pruning schemes for several key components in VSR models, including residual blocks, recurrent networks, and upsampling networks.

Video Super-Resolution

Transformer-empowered Multi-scale Contextual Matching and Aggregation for Multi-contrast MRI Super-resolution

1 code implementation • CVPR 2022 • Guangyuan Li, Jun Lv, Yapeng Tian, Qi Dou, Chengyan Wang, Chenliang Xu, Jing Qin

However, existing methods still have two shortcomings: (1) they neglect that the multi-contrast features at different scales contain different anatomical details and hence lack effective mechanisms to match and fuse these features for better reconstruction; and (2) they are still deficient in capturing long-range dependencies, which are essential for the regions with complicated anatomical structures.

Super-Resolution

Learning to Answer Questions in Dynamic Audio-Visual Scenarios

1 code implementation • CVPR 2022 • Guangyao Li, Yake Wei, Yapeng Tian, Chenliang Xu, Ji-Rong Wen, Di Hu

In this paper, we focus on the Audio-Visual Question Answering (AVQA) task, which aims to answer questions regarding different visual objects, sounds, and their associations in videos.

audio-visual learning Audio-visual Question Answering +4

Learning Spatio-Temporal Downsampling for Effective Video Upscaling

no code implementations • 15 Mar 2022 • Xiaoyu Xiang, Yapeng Tian, Vijay Rengarajan, Lucas Young, Bo Zhu, Rakesh Ranjan

Consequently, the inverse task of upscaling a low-resolution, low frame-rate video in space and time becomes a challenging ill-posed problem due to information loss and aliasing artifacts.

Quantization

STDAN: Deformable Attention Network for Space-Time Video Super-Resolution

1 code implementation • 14 Mar 2022 • Hai Wang, Xiaoyu Xiang, Yapeng Tian, Wenming Yang, Qingmin Liao

Second, we put forward a spatial-temporal deformable feature aggregation (STDFA) module, in which spatial and temporal contexts in dynamic video frames are adaptively captured and aggregated to enhance SR reconstruction.

Space-time Video Super-resolution Video Super-Resolution

Coarse-to-Fine Embedded PatchMatch and Multi-Scale Dynamic Aggregation for Reference-based Super-Resolution

1 code implementation • 12 Jan 2022 • Bin Xia, Yapeng Tian, Yucheng Hang, Wenming Yang, Qingmin Liao, Jie zhou

To improve matching efficiency, we design a novel Embedded PatchMatch scheme with random sample propagation, which involves end-to-end training with computational cost that is asymptotically linear in the input size.

Reference-based Super-Resolution

Efficient Non-Local Contrastive Attention for Image Super-Resolution

1 code implementation • 11 Jan 2022 • Bin Xia, Yucheng Hang, Yapeng Tian, Wenming Yang, Qingmin Liao, Jie zhou

To demonstrate the effectiveness of ENLCA, we build an architecture called Efficient Non-Local Contrastive Network (ENLCN) by adding a few of our modules to a simple backbone.

Contrastive Learning Feature Correlation +1

Space-Time Memory Network for Sounding Object Localization in Videos

no code implementations • 10 Nov 2021 • Sizhe Li, Yapeng Tian, Chenliang Xu

Leveraging temporal synchronization and association within sight and sound is an essential step towards robust localization of sounding objects.

Object Localization

Video Matting via Consistency-Regularized Graph Neural Networks

no code implementations • ICCV 2021 • Tiantian Wang, Sifei Liu, Yapeng Tian, Kai Li, Ming-Hsuan Yang

In this paper, we propose to enhance the temporal coherence by Consistency-Regularized Graph Neural Networks (CRGNN) with the aid of a synthesized video matting dataset.

Image Matting Optical Flow Estimation +1

Unified Multisensory Perception: Weakly-Supervised Audio-Visual Video Parsing

1 code implementation • ECCV 2020 • Yapeng Tian, Dingzeyu Li, Chenliang Xu

In this paper, we introduce a new problem, named audio-visual video parsing, which aims to parse a video into temporal event segments and label them as either audible, visible, or both.

Multiple Instance Learning

TDAN: Temporally-Deformable Alignment Network for Video Super-Resolution

1 code implementation • CVPR 2020 • Yapeng Tian, Yulun Zhang, Yun Fu, Chenliang Xu

Video super-resolution (VSR) aims to restore a photo-realistic high-resolution (HR) video frame from both its corresponding low-resolution (LR) frame (reference frame) and multiple neighboring frames (supporting frames).

Optical Flow Estimation Video Super-Resolution
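The multi-frame setting in the snippet above can be illustrated with a toy 1-D example. This is my own sketch of why supporting frames carry recoverable detail, not the TDAN method itself; the even/odd sampling stands in for sub-pixel motion between frames:

```python
# Two LR frames sampled at different sub-pixel offsets jointly contain
# HR detail that any single LR frame has lost.

def lr_frame(hr, offset, factor=2):
    """Downsample by keeping every `factor`-th sample, starting at `offset`."""
    return hr[offset::factor]

def fuse(frames, factor=2):
    """Interleave aligned LR frames back into an HR signal."""
    hr = [0] * (len(frames[0]) * factor)
    for offset, frame in enumerate(frames):
        hr[offset::factor] = frame
    return hr

hr = [3, 1, 4, 1, 5, 9, 2, 6]
frames = [lr_frame(hr, 0), lr_frame(hr, 1)]  # "reference" + "supporting" frame
print(fuse(frames) == hr)  # True: the pair jointly determines the HR signal
```

Real video frames are of course not perfectly aligned, which is why methods like TDAN spend most of their effort on aligning the supporting frames to the reference frame before fusion.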

Zooming Slow-Mo: Fast and Accurate One-Stage Space-Time Video Super-Resolution

3 code implementations • CVPR 2020 • Xiaoyu Xiang, Yapeng Tian, Yulun Zhang, Yun Fu, Jan P. Allebach, Chenliang Xu

Rather than synthesizing missing LR video frames as VFI networks do, we first temporally interpolate LR frame features for the missing LR video frames, capturing local temporal contexts with the proposed feature temporal interpolation network.

Space-time Video Super-resolution Video Frame Interpolation +1

Deep Audio Prior

1 code implementation • 21 Dec 2019 • Yapeng Tian, Chenliang Xu, Dingzeyu Li

We are interested in applying deep networks in the absence of a training dataset.

blind source separation Texture Synthesis

LCSCNet: Linear Compressing Based Skip-Connecting Network for Image Super-Resolution

1 code implementation • 9 Sep 2019 • Wenming Yang, Xuechen Zhang, Yapeng Tian, Wei Wang, Jing-Hao Xue, Qingmin Liao

In this paper, we develop a concise but efficient network architecture called linear compressing based skip-connecting network (LCSCNet) for image super-resolution.

Image Super-Resolution

CFSNet: Toward a Controllable Feature Space for Image Restoration

1 code implementation • ICCV 2019 • Wei Wang, Ruiming Guo, Yapeng Tian, Wenming Yang

Deep learning methods have witnessed great progress in image restoration with specific metrics (e.g., PSNR, SSIM).

Image Restoration Image Super-Resolution +1

An Attempt towards Interpretable Audio-Visual Video Captioning

no code implementations • 7 Dec 2018 • Yapeng Tian, Chenxiao Guan, Justin Goodman, Marc Moore, Chenliang Xu

To achieve this, we propose a multimodal convolutional neural network-based audio-visual video captioning framework and introduce a modality-aware module for exploring modality selection during sentence generation.

Audio captioning Audio-Visual Video Captioning +3

TDAN: Temporally Deformable Alignment Network for Video Super-Resolution

2 code implementations • 7 Dec 2018 • Yapeng Tian, Yulun Zhang, Yun Fu, Chenliang Xu

Video super-resolution (VSR) aims to restore a photo-realistic high-resolution (HR) video frame from both its corresponding low-resolution (LR) frame (reference frame) and multiple neighboring frames (supporting frames).

Optical Flow Estimation Video Super-Resolution

Deep Learning for Single Image Super-Resolution: A Brief Review

1 code implementation • 9 Aug 2018 • Wenming Yang, Xuechen Zhang, Yapeng Tian, Wei Wang, Jing-Hao Xue

Single image super-resolution (SISR) is a notoriously challenging ill-posed problem, which aims to obtain a high-resolution (HR) output from one of its low-resolution (LR) versions.

Deep Learning Efficient Neural Network +1
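The ill-posedness described in the snippet above can be seen in a toy 1-D example. This is my own sketch, not from the survey: downsampling is many-to-one, so distinct HR signals can produce the identical LR observation, and no upscaler can tell them apart from the LR input alone.

```python
# Average-pooling two different HR "images" yields the same LR observation,
# so the inverse problem (SISR) has no unique solution.

def downsample_avg(hr, factor=2):
    """Average-pool a 1-D signal by the given factor."""
    return [sum(hr[i:i + factor]) / factor for i in range(0, len(hr), factor)]

hr_a = [10, 20, 30, 40]
hr_b = [15, 15, 35, 35]      # a different HR signal...
print(downsample_avg(hr_a))  # [15.0, 35.0]
print(downsample_avg(hr_b))  # [15.0, 35.0] -- the same LR observation
```

This is why SISR methods must inject a prior, whether hand-crafted or learned from data, to pick one plausible HR output among the many consistent with the LR input.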

Residual Dense Network for Image Super-Resolution

16 code implementations • CVPR 2018 • Yulun Zhang, Yapeng Tian, Yu Kong, Bineng Zhong, Yun Fu

In this paper, we propose a novel residual dense network (RDN) to address this problem in image SR. We fully exploit the hierarchical features from all the convolutional layers.

Color Image Denoising Image Super-Resolution
