no code implementations • 26 Nov 2024 • Shijian Deng, Wentian Zhao, Yu-Jhe Li, Kun Wan, Daniel Miranda, Ajinkya Kale, Yapeng Tian
Self-improvement in multimodal large language models (MLLMs) is crucial for enhancing their reliability and robustness.
no code implementations • 19 Nov 2024 • Zhehan Kan, Ce Zhang, Zihan Liao, Yapeng Tian, Wenming Yang, Junyuan Xiao, Xu Li, Dongmei Jiang, YaoWei Wang, Qingmin Liao
Large Vision-Language Model (LVLM) systems have demonstrated impressive vision-language reasoning capabilities but suffer from pervasive and severe hallucination issues, posing significant risks in critical domains such as healthcare and autonomous systems.
no code implementations • 15 Nov 2024 • Andong Deng, Tongjia Chen, Shoubin Yu, Taojiannan Yang, Lincoln Spencer, Yapeng Tian, Ajmal Saeed Mian, Mohit Bansal, Chen Chen
In this paper, we introduce Motion-Grounded Video Reasoning, a new motion understanding task that requires generating visual answers (video segmentation masks) according to the input question, and hence needs implicit spatiotemporal reasoning and grounding.
no code implementations • 7 Nov 2024 • Tianyu Yang, Yiyang Nan, Lisen Dai, Zhenwen Liang, Yapeng Tian, Xiangliang Zhang
Audio-Visual Question Answering (AVQA) is a challenging task that involves answering questions based on both auditory and visual information in videos.
1 code implementation • 5 Nov 2024 • Weiguo Pian, Yiyang Nan, Shijian Deng, Shentong Mo, Yunhui Guo, Yapeng Tian
The task is inherently challenging as our models must not only effectively utilize information from both modalities in current tasks but also preserve their cross-modal association in old tasks to mitigate catastrophic forgetting during audio-visual continual learning.
no code implementations • 31 Oct 2024 • Chao Huang, Susan Liang, Yunlong Tang, Yapeng Tian, Anurag Kumar, Chenliang Xu
Through an empirical study, we identify a trend where concepts can be decomposed in text-guided diffusion models.
no code implementations • 30 Oct 2024 • Tianyu Yang, Lisen Dai, Zheyuan Liu, Xiangqi Wang, Meng Jiang, Yapeng Tian, Xiangliang Zhang
Machine unlearning (MU) has gained significant attention as a means to remove specific data from trained models without requiring a full retraining process.
no code implementations • 9 Oct 2024 • Susan Liang, Chao Huang, Yapeng Tian, Anurag Kumar, Chenliang Xu
Secondly, we introduce a cross-modal semantic enhancement approach.
no code implementations • 11 Sep 2024 • Steven Hogue, Chenxu Zhang, Hamza Daruger, Yapeng Tian, Xiaohu Guo
Audio-driven talking video generation has advanced significantly, but existing methods often depend on video-to-video translation techniques and traditional generative networks such as GANs, and they typically generate talking heads and co-speech gestures separately, leading to less coherent outputs.
no code implementations • 4 Jul 2024 • Shentong Mo, Yapeng Tian
They extracted high-level semantics from visual inputs as guidance to help disentangle the sound representations of individual sources.
no code implementations • 11 Jun 2024 • Kai Wang, Shijian Deng, Jing Shi, Dimitrios Hatzinakos, Yapeng Tian
This shared backbone facilitates both audio and video generation.
1 code implementation • 7 Jun 2024 • Tanvir Mahmud, Shentong Mo, Yapeng Tian, Diana Marculescu
Furthermore, to suppress the background features in each modality from foreground matched audio-visual features, we introduce a robust discriminative foreground mining scheme.
no code implementations • 24 May 2024 • Shentong Mo, Yapeng Tian
Traditional diffusion transformers (DiT), which utilize self-attention blocks, are effective but their computational complexity scales quadratically with the input length, limiting their use for high-resolution images.
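To make the scaling issue concrete, here is a minimal sketch (plain PyTorch, not the paper's model) showing why the attention matrix in a DiT self-attention block grows quadratically with the number of tokens:

```python
import torch

# Illustrative sketch: the attention score matrix is (tokens x tokens),
# so compute and memory grow quadratically with sequence length.
def self_attention(x):                     # x: (batch, n_tokens, dim)
    q, k, v = x, x, x                      # learned projections omitted
    scores = q @ k.transpose(-2, -1) / x.shape[-1] ** 0.5   # (B, n, n)
    return torch.softmax(scores, dim=-1) @ v

for side in (16, 32, 64):                  # latent grid side length
    n = side * side                        # 2x resolution -> 4x tokens
    _ = self_attention(torch.randn(1, n, 8))
    print(f"{side}x{side} grid: {n} tokens, {n * n:,} attention entries")
```

Doubling the image side length quadruples the token count and hence grows the attention matrix sixteenfold, which is what limits high-resolution use.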
no code implementations • 17 May 2024 • Sen Fang, Lei Wang, Ce Zheng, Chunyu Sui, Mingyu Zhao, Yapeng Tian, Chen Chen
In this paper, we propose SignLLM, a multilingual Sign Language Production (SLP) large language model that includes two novel multilingual SLP modes, MLSF and Prompt2LangGloss, which enable sign language gesture generation from query-text inputs and question-style prompts, respectively.
1 code implementation • CVPR 2024 • Tanvir Mahmud, Yapeng Tian, Diana Marculescu
Visual sound source localization poses a significant challenge in identifying the semantic region of each sounding source within a video.
no code implementations • 27 Mar 2024 • Siva Sai Nagender Vasireddy, Chenxu Zhang, Xiaohu Guo, Yapeng Tian
Experiments demonstrate that non-speech audio noises significantly impact ASD models, and our proposed approach improves ASD performance in noisy environments.
no code implementations • 22 Mar 2024 • Shijian Deng, Erin E. Kosloski, Siddhi Patel, Zeke A. Barnett, Yiyang Nan, Alexander Kaplan, Sisira Aarukapalli, William T. Doan, Matthew Wang, Harsh Singh, Pamela R. Rollins, Yapeng Tian
To pave the way for further research on this new problem, we intensively explored leveraging foundation models and multimodal large language models across different modalities.
no code implementations • 8 Mar 2024 • Shentong Mo, Jing Shi, Yapeng Tian
Extensive evaluations on the AudioCaps and T2AV-Bench demonstrate that our T2AV sets a new standard for video-aligned TTA generation in ensuring visual alignment and temporal consistency.
1 code implementation • 27 Feb 2024 • Nguyen Nguyen, Jing Bi, Ali Vosoughi, Yapeng Tian, Pooyan Fazli, Chenliang Xu
To address these challenges, in this paper, we introduce the Object State Captioning and State Change Representation (OSCaR) dataset and benchmark.
no code implementations • 27 Feb 2024 • Nguyen Nguyen, Yapeng Tian, Chenliang Xu
This paper proposes a method that leverages linguistic knowledge from a large text corpus to replace the traditional one-hot encoding used in auto-regressive scene text spotting and recognition models.
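A rough illustration of the idea with hypothetical names (not the paper's code): one-hot character targets are replaced by language-model-derived embeddings, so the recognizer is supervised in a space where related characters share structure.

```python
import torch
import torch.nn.functional as F

# Hypothetical sketch: char_embed stands in for embeddings distilled from
# a large text corpus; the loss pulls predicted features toward the target
# character's embedding instead of a one-hot class index.
vocab, dim = 97, 256
char_embed = F.normalize(torch.randn(vocab, dim), dim=-1)

def linguistic_loss(pred_feat, target_ids):
    target = char_embed[target_ids]                    # (batch, dim)
    pred = F.normalize(pred_feat, dim=-1)
    return (1.0 - F.cosine_similarity(pred, target, dim=-1)).mean()

loss = linguistic_loss(torch.randn(4, dim), torch.tensor([5, 12, 30, 64]))
```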
no code implementations • 21 Dec 2023 • Chenxu Zhang, Chao Wang, Jianfeng Zhang, Hongyi Xu, Guoxian Song, You Xie, Linjie Luo, Yapeng Tian, Xiaohu Guo, Jiashi Feng
The generation of emotional talking faces from a single portrait image remains a significant challenge.
no code implementations • 31 Oct 2023 • Yuxin Ye, Wenming Yang, Yapeng Tian
LAVSS is inspired by the correlation between spatial audio and visual location.
no code implementations • 18 Oct 2023 • Yiyang Su, Ali Vosoughi, Shijian Deng, Yapeng Tian, Chenliang Xu
The audio-visual sound separation field assumes visible sources in videos, but this excludes invisible sounds beyond the camera's view.
no code implementations • 27 Sep 2023 • Susan Liang, Chao Huang, Yapeng Tian, Anurag Kumar, Chenliang Xu
Room impulse response (RIR), which measures the sound propagation within an environment, is critical for synthesizing high-fidelity audio for a given environment.
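A small sketch of why RIRs matter for audio synthesis: convolving a dry signal with an RIR "places" the sound in that environment. The toy RIR below is synthetic decaying noise, not one produced by the paper's method.

```python
import numpy as np
from scipy.signal import fftconvolve

sr = 16000
dry = np.random.randn(sr)                          # 1 s of dry audio (placeholder)
t = np.arange(int(0.3 * sr)) / sr
rir = np.random.randn(t.size) * np.exp(-t / 0.05)  # toy RIR with ~50 ms decay
wet = fftconvolve(dry, rir)[: dry.size]            # reverberant version
```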
2 code implementations • 19 Sep 2023 • Chengyan Wang, Jun Lyu, Shuo Wang, Chen Qin, Kunyuan Guo, Xinyu Zhang, Xiaotong Yu, Yan Li, Fanwen Wang, Jianhua Jin, Zhang Shi, Ziqiang Xu, Yapeng Tian, Sha Hua, Zhensen Chen, Meng Liu, Mengting Sun, Xutong Kuang, Kang Wang, Haoran Wang, Hao Li, Yinghua Chu, Guang Yang, Wenjia Bai, Xiahai Zhuang, He Wang, Jing Qin, Xiaobo Qu
However, a limitation of CMR is its slow imaging speed, which causes patient discomfort and introduces artifacts in the images.
1 code implementation • ICCV 2023 • Shentong Mo, Weiguo Pian, Yapeng Tian
Our CIGN leverages learnable audio-visual class tokens and audio-visual grouping to continually aggregate class-aware features.
no code implementations • 30 Aug 2023 • Sen Fang, Chunyu Sui, Yanghao Zhou, Xuedong Zhang, Hongbin Zhong, Minyu Zhao, Yapeng Tian, Chen Chen
In this paper, we propose a dual-condition diffusion pre-training model named SignDiff that can generate human sign language speakers from a skeleton pose.
no code implementations • 26 Aug 2023 • Bin Xia, Yulun Zhang, Shiyin Wang, Yitong Wang, Xinglong Wu, Yapeng Tian, Wenming Yang, Radu Timofte, Luc van Gool
Compared to traditional DMs, the compact IPR enables DiffI2I to obtain more accurate outcomes and employ a lighter denoising network and fewer iterations.
1 code implementation • ICCV 2023 • Weiguo Pian, Shentong Mo, Yunhui Guo, Yapeng Tian
We demonstrate that joint audio-visual modeling can improve class-incremental learning, but current methods fail to preserve semantic similarity between audio and visual features as the number of incremental steps grows.
no code implementations • 31 Jul 2023 • Chao Huang, Susan Liang, Yapeng Tian, Anurag Kumar, Chenliang Xu
We compare DAVIS to existing state-of-the-art discriminative audio-visual separation methods on the AVE and MUSIC datasets, and results show that DAVIS outperforms other methods in separation quality, demonstrating the advantages of our framework for tackling the audio-visual source separation task.
1 code implementation • 5 Jul 2023 • Jiamiao Zhang, Yichen Chi, Jun Lyu, Wenming Yang, Yapeng Tian
Because imaging systems limit acquisition, reconstructing Magnetic Resonance Imaging (MRI) images from partial measurements is essential to medical imaging research.
no code implementations • 31 May 2023 • Ali Vosoughi, Shijian Deng, Songyang Zhang, Yapeng Tian, Chenliang Xu, Jiebo Luo
In this paper, we first model a confounding effect that causes language and vision bias simultaneously, then propose a counterfactual inference to remove the influence of this effect.
1 code implementation • 24 May 2023 • Yichen Chi, Junhao Gu, Jiamiao Zhang, Wenming Yang, Yapeng Tian
We explicitly tackle motion blurs in egocentric videos using a Dual Branch Deblur Network (DB²Net) in the VSR framework.
no code implementations • 22 May 2023 • Shentong Mo, Jing Shi, Yapeng Tian
In this work, we propose a novel and personalized text-to-sound generation approach with visual alignment based on latent diffusion models, namely DiffAVA, that can simply fine-tune lightweight visual-text alignment modules with frozen modality-specific encoders to update visual-aligned text embeddings as the condition.
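A minimal sketch of this fine-tuning recipe, with hypothetical module names: the heavy modality-specific encoders stay frozen while a lightweight cross-attention module learns the visual-aligned text embeddings used as the condition.

```python
import torch.nn as nn

# Illustrative only, not the DiffAVA implementation.
class AlignmentModule(nn.Module):
    def __init__(self, v_dim=768, t_dim=768):
        super().__init__()
        self.attn = nn.MultiheadAttention(t_dim, num_heads=8, kdim=v_dim,
                                          vdim=v_dim, batch_first=True)

    def forward(self, text_emb, video_feat):
        aligned, _ = self.attn(text_emb, video_feat, video_feat)
        return text_emb + aligned              # visual-aligned text embeddings

def trainable_params(text_encoder, video_encoder, align):
    for enc in (text_encoder, video_encoder):  # freeze heavy encoders
        for p in enc.parameters():
            p.requires_grad = False
    return list(align.parameters())            # only the light module trains
```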
no code implementations • 3 May 2023 • Shentong Mo, Yapeng Tian
In this work, we propose a simple yet effective audio-visual localization and segmentation framework based on the Segment Anything Model, namely AV-SAM, that can generate sounding object masks corresponding to the audio.
1 code implementation • CVPR 2023 • Shentong Mo, Yapeng Tian
Sound source localization is a typical and challenging task that predicts the location of sound sources in a video.
1 code implementation • CVPR 2023 • Chao Huang, Yapeng Tian, Anurag Kumar, Chenliang Xu
In this paper, we explore the challenging egocentric audio-visual object localization task and observe that 1) egomotion commonly exists in first-person recordings, even within a short duration; and 2) out-of-view sound components can arise as wearers shift their attention.
1 code implementation • ICCV 2023 • Bin Xia, Yulun Zhang, Shiyin Wang, Yitong Wang, Xinglong Wu, Yapeng Tian, Wenming Yang, Luc van Gool
Diffusion model (DM) has achieved SOTA performance by modeling the image synthesis process into a sequential application of a denoising network.
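For readers unfamiliar with this sequential view, below is a generic DDPM-style sampling loop (a textbook sketch, not the paper's implementation): a denoising network is applied step by step to turn pure noise into an image.

```python
import torch

# `denoiser` predicts the added noise eps; the beta schedule is the standard
# linear one. Both are stand-ins, not the paper's choices.
def sample(denoiser, shape, timesteps=1000, device="cpu"):
    betas = torch.linspace(1e-4, 0.02, timesteps, device=device)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)
    x = torch.randn(shape, device=device)             # start from pure noise
    for t in reversed(range(timesteps)):
        eps = denoiser(x, t)                          # predict the noise
        coef = betas[t] / torch.sqrt(1.0 - alpha_bars[t])
        mean = (x - coef * eps) / torch.sqrt(alphas[t])
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + torch.sqrt(betas[t]) * noise       # one reverse step
    return x

img = sample(lambda x, t: torch.zeros_like(x), (1, 3, 8, 8), timesteps=10)
```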
1 code implementation • 30 Nov 2022 • Bin Xia, Yulun Zhang, Yitong Wang, Yapeng Tian, Wenming Yang, Radu Timofte, Luc van Gool
It consists of a knowledge distillation based implicit degradation estimator network (KD-IDE) and an efficient SR network.
2 code implementations • 2 Oct 2022 • Bin Xia, Yulun Zhang, Yitong Wang, Yapeng Tian, Wenming Yang, Radu Timofte, Luc van Gool
In this study, we reconsider components in binary convolution, such as residual connection, BatchNorm, activation function, and structure, for IR tasks.
no code implementations • 20 Aug 2022 • Yake Wei, Di Hu, Yapeng Tian, Xuelong Li
A comprehensive survey that can systematically organize and analyze studies of the audio-visual field is expected.
1 code implementation • 28 Jul 2022 • Bin Xia, Yapeng Tian, Yulun Zhang, Yucheng Hang, Wenming Yang, Qingmin Liao
Most CNN-based super-resolution (SR) methods assume that the degradation is known (e.g., bicubic).
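The known-degradation assumption is concrete: LR training inputs are typically synthesized from HR images with a fixed bicubic downscale, as in the two-line sketch below. Blind SR drops exactly this assumption.

```python
import torch
import torch.nn.functional as F

hr = torch.rand(1, 3, 256, 256)   # HR image (random placeholder)
lr = F.interpolate(hr, scale_factor=0.25, mode="bicubic", align_corners=False)
```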
1 code implementation • CVPR 2023 • Bin Xia, Jingwen He, Yulun Zhang, Yitong Wang, Yapeng Tian, Wenming Yang, Luc van Gool
In SSL, we design pruning schemes for several key components in VSR models, including residual blocks, recurrent networks, and upsampling networks.
1 code implementation • CVPR 2022 • Guangyuan Li, Jun Lv, Yapeng Tian, Qi Dou, Chengyan Wang, Chenliang Xu, Jing Qin
However, existing methods still have two shortcomings: (1) they neglect that the multi-contrast features at different scales contain different anatomical details and hence lack effective mechanisms to match and fuse these features for better reconstruction; and (2) they are still deficient in capturing long-range dependencies, which are essential for the regions with complicated anatomical structures.
1 code implementation • CVPR 2022 • Guangyao Li, Yake Wei, Yapeng Tian, Chenliang Xu, Ji-Rong Wen, Di Hu
In this paper, we focus on the Audio-Visual Question Answering (AVQA) task, which aims to answer questions regarding different visual objects, sounds, and their associations in videos.
Ranked #5 on Audio-visual Question Answering on MUSIC-AVQA
no code implementations • 15 Mar 2022 • Xiaoyu Xiang, Yapeng Tian, Vijay Rengarajan, Lucas Young, Bo Zhu, Rakesh Ranjan
Consequently, the inverse task of upscaling a low-resolution, low frame-rate video in space and time becomes a challenging ill-posed problem due to information loss and aliasing artifacts.
1 code implementation • 14 Mar 2022 • Hai Wang, Xiaoyu Xiang, Yapeng Tian, Wenming Yang, Qingmin Liao
Second, we put forward a spatial-temporal deformable feature aggregation (STDFA) module, in which spatial and temporal contexts in dynamic video frames are adaptively captured and aggregated to enhance SR reconstruction.
1 code implementation • 12 Jan 2022 • Bin Xia, Yapeng Tian, Yucheng Hang, Wenming Yang, Qingmin Liao, Jie zhou
To improve matching efficiency, we design a novel Embedded PatchMatch scheme with random sample propagation, which supports end-to-end training with computational cost that is asymptotically linear in the input size.
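A heavily simplified PatchMatch-style loop (illustrative only, not the paper's Embedded PatchMatch) shows the two ingredients named above: propagating good matches from already-processed neighbors and testing random samples.

```python
import numpy as np

def patch_dist(a, b, ay, ax, by, bx, p=3):
    return float(((a[ay:ay + p, ax:ax + p] - b[by:by + p, bx:bx + p]) ** 2).sum())

def patchmatch(a, b, p=3, iters=2, rng=np.random.default_rng(0)):
    h, w = a.shape[0] - p, a.shape[1] - p
    # random initialization of the nearest-neighbor field
    nnf = np.stack([rng.integers(0, h, (h, w)), rng.integers(0, w, (h, w))], -1)
    for _ in range(iters):
        for y in range(h):
            for x in range(w):
                best = nnf[y, x]
                cost = patch_dist(a, b, y, x, *best, p)
                for cy, cx in ((y - 1, x), (y, x - 1)):   # propagation
                    if cy >= 0 and cx >= 0:
                        c = patch_dist(a, b, y, x, *nnf[cy, cx], p)
                        if c < cost:
                            best, cost = nnf[cy, cx], c
                ry, rx = rng.integers(0, h), rng.integers(0, w)  # random sample
                if patch_dist(a, b, y, x, ry, rx, p) < cost:
                    best = np.array([ry, rx])
                nnf[y, x] = best
    return nnf

nnf = patchmatch(np.random.rand(32, 32), np.random.rand(32, 32))
```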
1 code implementation • 11 Jan 2022 • Bin Xia, Yucheng Hang, Yapeng Tian, Wenming Yang, Qingmin Liao, Jie zhou
To demonstrate the effectiveness of ENLCA, we build an architecture called Efficient Non-Local Contrastive Network (ENLCN) by adding a few of our modules to a simple backbone.
no code implementations • 10 Nov 2021 • Sizhe Li, Yapeng Tian, Chenliang Xu
Leveraging temporal synchronization and association within sight and sound is an essential step towards robust localization of sounding objects.
1 code implementation • 15 Apr 2021 • Xiaoyu Xiang, Yapeng Tian, Yulun Zhang, Yun Fu, Jan P. Allebach, Chenliang Xu
A na\"ive method is to decompose it into two sub-tasks: video frame interpolation (VFI) and video super-resolution (VSR).
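For reference, the naive two-stage cascade reads roughly as follows (hypothetical stage models; a one-stage design avoids this cascade and its compounded errors):

```python
# Sketch of the VFI -> VSR decomposition; vfi_model and vsr_model are
# illustrative stand-ins, not components of the paper's method.
def space_time_sr(lr_frames, vfi_model, vsr_model):
    # Stage 1: frame interpolation doubles the frame rate in LR space.
    dense = []
    for a, b in zip(lr_frames[:-1], lr_frames[1:]):
        dense += [a, vfi_model(a, b)]       # insert a synthesized middle frame
    dense.append(lr_frames[-1])
    # Stage 2: video super-resolution upscales every frame spatially.
    return [vsr_model(dense, i) for i in range(len(dense))]
```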
1 code implementation • CVPR 2021 • Yapeng Tian, Chenliang Xu
In this paper, we propose to conduct a systematic study of machines' multisensory perception under attacks.
1 code implementation • CVPR 2021 • Yapeng Tian, Di Hu, Chenliang Xu
There are rich synchronized audio and visual events in our daily life.
no code implementations • ICCV 2021 • Tiantian Wang, Sifei Liu, Yapeng Tian, Kai Li, Ming-Hsuan Yang
In this paper, we propose to enhance the temporal coherence by Consistency-Regularized Graph Neural Networks (CRGNN) with the aid of a synthesized video matting dataset.
1 code implementation • ECCV 2020 • Yapeng Tian, DIngzeyu Li, Chenliang Xu
In this paper, we introduce a new problem, named audio-visual video parsing, which aims to parse a video into temporal event segments and label them as either audible, visible, or both.
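An illustrative (not official) output structure for this parsing task: every detected event carries a temporal extent, a label, and an audible/visible/both tag.

```python
from dataclasses import dataclass

@dataclass
class AVEvent:
    label: str          # e.g. "dog barking"
    start: float        # seconds
    end: float
    modality: str       # "audible", "visible", or "audio-visual"

parse = [
    AVEvent("speech", 0.0, 3.2, "audio-visual"),
    AVEvent("car horn", 1.5, 2.0, "audible"),     # heard but off-screen
]
```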
1 code implementation • CVPR 2020 • Yapeng Tian, Yulun Zhang, Yun Fu, Chenliang Xu
Video super-resolution (VSR) aims to restore a photo-realistic high-resolution (HR) video frame from both its corresponding low-resolution (LR) frame (reference frame) and multiple neighboring frames (supporting frames).
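A generic skeleton of this reference-plus-supporting-frames setup (an illustrative sketch under assumed shapes, not the paper's network): supporting frames are aligned to the reference, fused, and upscaled.

```python
import torch
import torch.nn as nn

class SimpleVSR(nn.Module):
    def __init__(self, n_frames=5, c=64, scale=4):
        super().__init__()
        self.feat = nn.Conv2d(3, c, 3, padding=1)
        self.fuse = nn.Conv2d(c * n_frames, c, 1)
        self.up = nn.Sequential(nn.Conv2d(c, 3 * scale * scale, 3, padding=1),
                                nn.PixelShuffle(scale))

    def align(self, ref, sup):
        return sup                      # placeholder for motion alignment

    def forward(self, frames):          # frames: (B, T, 3, H, W), center = ref
        ref = frames[:, frames.shape[1] // 2]
        feats = [self.align(ref, self.feat(frames[:, t]))
                 for t in range(frames.shape[1])]
        return self.up(self.fuse(torch.cat(feats, dim=1)))

sr = SimpleVSR()(torch.rand(1, 5, 3, 32, 32))   # -> (1, 3, 128, 128)
```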
3 code implementations • CVPR 2020 • Xiaoyu Xiang, Yapeng Tian, Yulun Zhang, Yun Fu, Jan P. Allebach, Chenliang Xu
Rather than synthesizing the missing LR video frames as VFI networks do, we first temporally interpolate the features of the missing LR frames, capturing local temporal contexts with the proposed feature temporal interpolation network.
Ranked #4 on Video Frame Interpolation on Vid4 - 4x upscaling
1 code implementation • 21 Dec 2019 • Yapeng Tian, Chenliang Xu, DIngzeyu Li
We are interested in applying deep networks in the absence of a training dataset.
1 code implementation • 9 Sep 2019 • Wenming Yang, Xuechen Zhang, Yapeng Tian, Wei Wang, Jing-Hao Xue, Qingmin Liao
In this paper, we develop a concise but efficient network architecture called linear compressing based skip-connecting network (LCSCNet) for image super-resolution.
Ranked #18 on Image Super-Resolution on Set14 - 3x upscaling
1 code implementation • ICCV 2019 • Wei Wang, Ruiming Guo, Yapeng Tian, Wenming Yang
Deep learning methods have made great progress in image restoration as measured by specific metrics (e.g., PSNR, SSIM).
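PSNR, one of the distortion metrics mentioned above, is simple to compute; a minimal reference implementation for images in [0, 1]:

```python
import numpy as np

def psnr(x, y, peak=1.0):
    mse = np.mean((x - y) ** 2)
    return 10 * np.log10(peak ** 2 / mse)

a = np.random.rand(64, 64)
print(psnr(a, np.clip(a + np.random.normal(0, 0.05, a.shape), 0, 1)))
```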
3 code implementations • 25 Dec 2018 • Yulun Zhang, Yapeng Tian, Yu Kong, Bineng Zhong, Yun Fu
We fully exploit the hierarchical features from all the convolutional layers.
Ranked #1 on Color Image Denoising on Kodak24 (sigma 30)
no code implementations • 7 Dec 2018 • Yapeng Tian, Chenxiao Guan, Justin Goodman, Marc Moore, Chenliang Xu
To achieve this, we propose a multimodal convolutional neural network-based audio-visual video captioning framework and introduce a modality-aware module for exploring modality selection during sentence generation.
2 code implementations • 7 Dec 2018 • Yapeng Tian, Yulun Zhang, Yun Fu, Chenliang Xu
Video super-resolution (VSR) aims to restore a photo-realistic high-resolution (HR) video frame from both its corresponding low-resolution (LR) frame (reference frame) and multiple neighboring frames (supporting frames).
1 code implementation • 9 Aug 2018 • Wenming Yang, Xuechen Zhang, Yapeng Tian, Wei Wang, Jing-Hao Xue
Single image super-resolution (SISR) is a notoriously challenging ill-posed problem, which aims to obtain a high-resolution (HR) output from one of its low-resolution (LR) versions.
2 code implementations • ECCV 2018 • Yapeng Tian, Jing Shi, Bochen Li, Zhiyao Duan, Chenliang Xu
In this paper, we introduce a novel problem of audio-visual event localization in unconstrained videos.
16 code implementations • CVPR 2018 • Yulun Zhang, Yapeng Tian, Yu Kong, Bineng Zhong, Yun Fu
In this paper, we propose a novel residual dense network (RDN) to address this problem in image SR. We fully exploit the hierarchical features from all the convolutional layers.
Ranked #5 on Image Super-Resolution on IXI
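A minimal residual dense block in the spirit of RDN (sizes are illustrative; see the paper's code for the real architecture): every convolution sees the concatenation of all earlier features, and a local residual ties the block output back to its input.

```python
import torch
import torch.nn as nn

class RDB(nn.Module):
    def __init__(self, c=64, growth=32, n_layers=4):
        super().__init__()
        self.convs = nn.ModuleList(
            nn.Conv2d(c + i * growth, growth, 3, padding=1)
            for i in range(n_layers))
        self.fuse = nn.Conv2d(c + n_layers * growth, c, 1)   # local fusion

    def forward(self, x):
        feats = [x]
        for conv in self.convs:
            feats.append(torch.relu(conv(torch.cat(feats, dim=1))))
        return x + self.fuse(torch.cat(feats, dim=1))         # local residual

y = RDB()(torch.rand(1, 64, 32, 32))
```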