1 code implementation • 11 Feb 2025 • Weigao Sun, Disen Lan, Yiran Zhong, Xiaoye Qu, Yu Cheng
In this paper, we introduce LASP-2, a new SP method to enhance both communication and computation parallelism when training linear attention transformer models with very-long input sequences.
no code implementations • 14 Jan 2025 • MiniMax, Aonian Li, Bangwei Gong, Bo Yang, Boji Shan, Chang Liu, Cheng Zhu, Chunhao Zhang, Congchao Guo, Da Chen, Dong Li, Enwei Jiao, Gengxin Li, Guojun Zhang, Haohai Sun, Houze Dong, Jiadai Zhu, Jiaqi Zhuang, Jiayuan Song, Jin Zhu, Jingtao Han, Jingyang Li, Junbin Xie, Junhao Xu, Junjie Yan, Kaishun Zhang, Kecheng Xiao, Kexi Kang, Le Han, Leyang Wang, Lianfei Yu, Liheng Feng, Lin Zheng, Linbo Chai, Long Xing, Meizhi Ju, Mingyuan Chi, Mozhi Zhang, Peikai Huang, Pengcheng Niu, Pengfei Li, Pengyu Zhao, Qi Yang, Qidi Xu, Qiexiang Wang, Qin Wang, Qiuhui Li, Ruitao Leng, Shengmin Shi, Shuqi Yu, Sichen Li, Songquan Zhu, Tao Huang, Tianrun Liang, Weigao Sun, Weixuan Sun, Weiyu Cheng, Wenkai Li, Xiangjun Song, Xiao Su, Xiaodong Han, Xinjie Zhang, Xinzhu Hou, Xu Min, Xun Zou, Xuyang Shen, Yan Gong, Yingjie Zhu, Yipeng Zhou, Yiran Zhong, Yongyi Hu, Yuanxiang Fan, Yue Yu, Yufeng Yang, Yuhao Li, Yunan Huang, Yunji Li, Yunpeng Huang, Yunzhi Xu, Yuxin Mao, Zehan Li, Zekang Li, Zewei Tao, Zewen Ying, Zhaoyang Cong, Zhen Qin, Zhenhua Fan, Zhihang Yu, Zhuo Jiang, Zijia Wu
This approach enables us to conduct efficient training and inference on models with hundreds of billions of parameters across contexts spanning millions of tokens.
no code implementations • 29 Dec 2024 • Bingliang Li, Fengyu Yang, Yuxin Mao, Qingwen Ye, Hongkai Chen, Yiran Zhong
Video-to-audio (V2A) generation utilizes visual-only video features to produce realistic sounds that correspond to the scene.
no code implementations • 10 Dec 2024 • Hui Deng, Jiawei Shi, Zhen Qin, Yiran Zhong, Yuchao Dai
In this paper, we revisit deep NRSfM from two perspectives to address the limitations of current deep NRSfM methods: (1) canonicalization and (2) sequence modeling.
no code implementations • 10 Dec 2024 • Aixuan Li, Jing Zhang, Jiawei Shi, Yiran Zhong, Yuchao Dai
We find that the well-trained victim models (VMs), against which the attacks are generated, serve as fundamental prerequisites for adversarial attacks, i.e., a segmentation VM is needed to generate attacks for segmentation.
1 code implementation • 18 Nov 2024 • Jinxing Zhou, Dan Guo, Ruohao Guo, Yuxin Mao, Jingjing Hu, Yiran Zhong, Xiaojun Chang, Meng Wang
In this paper, we advance the field by introducing the Open-Vocabulary Audio-Visual Event Localization (OV-AVEL) problem, which requires localizing audio-visual events and predicting explicit categories for both seen and unseen data at inference.
1 code implementation • 16 Nov 2024 • Yuhong Chou, Man Yao, Kexin Wang, Yuqi Pan, Ruijie Zhu, Yiran Zhong, Yu Qiao, Jibin Wu, Bo Xu, Guoqi Li
Various linear complexity models, such as Linear Transformer (LinFormer), State Space Model (SSM), and Linear RNN (LinRNN), have been proposed to replace the conventional softmax attention in Transformer structures.
no code implementations • 18 Oct 2024 • Enqi Liu, Liyuan Pan, Yan Yang, Yiran Zhong, Zhijing Wu, Xinxiao Wu, Liu Liu
Fine-grained video action recognition can be conceptualized as a video-text matching problem.
no code implementations • 11 Jul 2024 • Jinxing Zhou, Dan Guo, Yuxin Mao, Yiran Zhong, Xiaojun Chang, Meng Wang
The Audio-Visual Video Parsing (AVVP) task aims to detect and temporally locate events within audio and visual modalities.
1 code implementation • 24 Jun 2024 • Xuyang Shen, Dong Li, Ruitao Leng, Zhen Qin, Weigao Sun, Yiran Zhong
In this study, we present the scaling laws for linear complexity language models to establish a foundation for their scalability.
1 code implementation • 3 Jun 2024 • Jinxing Zhou, Dan Guo, Yiran Zhong, Meng Wang
The Audio-Visual Video Parsing task aims to identify and temporally localize the events that occur in either or both the audio and visual streams of audible videos.
no code implementations • 31 May 2024 • Zhen Qin, Yuxin Mao, Xuyang Shen, Dong Li, Jing Zhang, Yuchao Dai, Yiran Zhong
Linear attention mechanisms have gained prominence in causal language models due to their linear computational complexity and enhanced speed.
1 code implementation • 27 May 2024 • Zhen Qin, Weigao Sun, Dong Li, Xuyang Shen, Weixuan Sun, Yiran Zhong
This eliminates the need for cumsum in the linear attention calculation.
no code implementations • 27 May 2024 • Zhen Qin, Xuyang Shen, Weigao Sun, Dong Li, Stan Birchfield, Richard Hartley, Yiran Zhong
Finally, the memory state is projected back to a low-dimensional space in the Shrink stage.
1 code implementation • 22 Apr 2024 • Yuxin Mao, Xuyang Shen, Jing Zhang, Zhen Qin, Jinxing Zhou, Mochu Xiang, Yiran Zhong, Yuchao Dai
To support research in this field, we have developed a comprehensive Text to Audible-Video Generation Benchmark (TAVGBench), which contains over 1.7 million clips with a total duration of 11.8 thousand hours.
2 code implementations • 11 Apr 2024 • Zhen Qin, Songlin Yang, Weixuan Sun, Xuyang Shen, Dong Li, Weigao Sun, Yiran Zhong
Hierarchically gated linear RNN (HGRN) has demonstrated competitive training speed and performance in language modeling while offering efficient inference.
1 code implementation • 3 Apr 2024 • Weigao Sun, Zhen Qin, Dong Li, Xuyang Shen, Yu Qiao, Yiran Zhong
However, for linear sequence modeling methods like linear attention, existing SP approaches do not take advantage of their right-product-first feature, resulting in sub-optimal communication efficiency and usability.
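The right-product-first feature mentioned above follows from matrix associativity: without a softmax, (QKᵀ)V equals Q(KᵀV), and the second ordering only ever materializes a d × d intermediate whose size does not depend on sequence length. The following is a minimal sketch of that identity only, assuming non-causal, unnormalized attention with identity feature maps; the function names and toy values are hypothetical and it is not the method of the paper.

```python
# Illustrative sketch only: non-causal, unnormalized linear attention with
# identity feature maps. All names and values here are hypothetical.

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    """Return the transpose of a matrix given as a list of rows."""
    return [list(col) for col in zip(*m)]

def left_product_first(q, k, v):
    # (Q K^T) V: builds an n x n intermediate, so cost grows as O(n^2 d).
    return matmul(matmul(q, transpose(k)), v)

def right_product_first(q, k, v):
    # Q (K^T V): builds a d x d intermediate, so cost grows as O(n d^2).
    return matmul(q, matmul(transpose(k), v))

# n = 3 tokens, d = 2 features (toy integers so both orderings compare exactly)
Q = [[1, 0], [0, 1], [1, 1]]
K = [[1, 2], [0, 1], [2, 0]]
V = [[1, 0], [0, 1], [1, 1]]

out_left = left_product_first(Q, K, V)
out_right = right_product_first(Q, K, V)
kv_state = matmul(transpose(K), V)  # d x d, independent of n

print(out_left == out_right)  # both orderings give the same output
```

Causal masking and normalization, which practical linear attention needs, are deliberately left out of this sketch.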
1 code implementation • 29 Jan 2024 • Weigao Sun, Zhen Qin, Weixuan Sun, Shidi Li, Dong Li, Xuyang Shen, Yu Qiao, Yiran Zhong
CO2 is able to attain a high scalability even on extensive multi-node clusters constrained by very limited communication bandwidth.
1 code implementation • 9 Jan 2024 • Zhen Qin, Weigao Sun, Dong Li, Xuyang Shen, Weixuan Sun, Yiran Zhong
With its ability to process tokens in linear computational complexities, linear attention, in theory, can handle sequences of unlimited length without sacrificing speed, i.e., maintaining a constant training speed for various sequence lengths with a fixed memory consumption.
1 code implementation • 15 Nov 2023 • Zhen Qin, Yiran Zhong
On the other hand, State Space Models (SSMs) achieve lower performance than TNNs in language modeling but offer the advantage of constant inference complexity.
1 code implementation • ICCV 2023 • Yuxin Mao, Jing Zhang, Mochu Xiang, Yiran Zhong, Yuchao Dai
To achieve this, our ECMVAE factorizes the representations of each modality with a modality-shared representation and a modality-specific representation.
1 code implementation • 16 Aug 2023 • Dawei Hao, Yuxin Mao, Bowen He, Xiaodong Han, Yuchao Dai, Yiran Zhong
In this paper, inspired by the human ability to mentally simulate the sound of an object and its visual appearance, we introduce a bidirectional generation framework.
1 code implementation • 11 Aug 2023 • Mengjie Zhou, Liu Liu, Yiran Zhong, Andrew Calway
In this paper, we lift cross-view matching to a 2.5D space, where heights of structures (e.g., trees and buildings) provide geometric information to guide the cross-view matching.
1 code implementation • 8 Aug 2023 • Weixuan Sun, Yanhao Zhang, Zhen Qin, Zheyuan Liu, Lin Cheng, Fanyi Wang, Yiran Zhong, Nick Barnes
Given a pair of augmented views, our approach regularizes the activation intensities between them, while also ensuring that the affinity across regions within each view remains consistent.
Ranked #16 on Weakly-Supervised Semantic Segmentation on COCO 2014 val
no code implementations • 31 Jul 2023 • Yuxin Mao, Jing Zhang, Mochu Xiang, Yunqiu Lv, Yiran Zhong, Yuchao Dai
We propose a latent diffusion model with contrastive learning for audio-visual segmentation (AVS) to extensively explore the contribution of audio.
2 code implementations • 27 Jul 2023 • Zhen Qin, Dong Li, Weigao Sun, Weixuan Sun, Xuyang Shen, Xiaodong Han, Yunshen Wei, Baohong Lv, Xiao Luo, Yu Qiao, Yiran Zhong
TransNormerLLM evolves from the previous linear attention architecture TransNormer by making advanced modifications that include positional embedding, linear attention acceleration, gating mechanisms, tensor normalization, and inference acceleration and stabilization.
no code implementations • 19 Jul 2023 • Zhen Qin, Yiran Zhong, Hui Deng
While these methods perform well on a variety of corpora, the conditions for length extrapolation have yet to be investigated.
no code implementations • 18 Jul 2023 • Zhen Qin, Weixuan Sun, Kaiyue Lu, Hui Deng, Dongxu Li, Xiaodong Han, Yuchao Dai, Lingpeng Kong, Yiran Zhong
Meanwhile, it emphasizes a general paradigm for designing a broader range of relative positional encoding methods that are applicable to linear transformers.
no code implementations • 10 Jul 2023 • Aixuan Li, Jing Zhang, Yunqiu Lv, Tong Zhang, Yiran Zhong, Mingyi He, Yuchao Dai
In this case, salient objects are typically non-camouflaged, and camouflaged objects are usually not salient.
2 code implementations • 8 May 2023 • Zhen Qin, Xiaodong Han, Weixuan Sun, Bowen He, Dong Li, Dongxu Li, Yuchao Dai, Lingpeng Kong, Yiran Zhong
Sequence modeling has important applications in natural language processing and computer vision.
1 code implementation • 2 May 2023 • Weixuan Sun, Zheyuan Liu, Yanhao Zhang, Yiran Zhong, Nick Barnes
The Segment Anything Model (SAM) has demonstrated exceptional performance and versatility, making it a promising tool for various related tasks.
Ranked #3 on Weakly-Supervised Semantic Segmentation on COCO 2014 val (using extra training data)
1 code implementation • CVPR 2023 • Xuyang Shen, Dong Li, Jinxing Zhou, Zhen Qin, Bowen He, Xiaodong Han, Aixuan Li, Yuchao Dai, Lingpeng Kong, Meng Wang, Yu Qiao, Yiran Zhong
We explore a new task for audio-visual-language modeling called fine-grained audible video description (FAVD).
1 code implementation • CVPR 2023 • Weixuan Sun, Jiayi Zhang, Jianyuan Wang, Zheyuan Liu, Yiran Zhong, Tianpeng Feng, Yandong Guo, Yanhao Zhang, Nick Barnes
Based on this observation, we propose a new learning strategy named False Negative Aware Contrastive (FNAC) to mitigate the problem of misleading the training with such false negative samples.
no code implementations • 4 Mar 2023 • Jinxing Zhou, Dan Guo, Yiran Zhong, Meng Wang
We perform extensive experiments on the LLP dataset and demonstrate that our method can generate high-quality segment-level pseudo labels with the help of our newly proposed loss and the label denoising strategy.
1 code implementation • 30 Jan 2023 • Jinxing Zhou, Xuyang Shen, Jianyuan Wang, Jiayi Zhang, Weixuan Sun, Jing Zhang, Stan Birchfield, Dan Guo, Lingpeng Kong, Meng Wang, Yiran Zhong
To deal with these problems, we propose a new baseline method that uses a temporal pixel-wise audio-visual interaction module to inject audio semantics as guidance for the visual segmentation process.
1 code implementation • 19 Oct 2022 • Zhen Qin, Xiaodong Han, Weixuan Sun, Dongxu Li, Lingpeng Kong, Nick Barnes, Yiran Zhong
In this paper, we examine existing kernel-based linear transformers and identify two key issues that lead to such performance gaps: 1) unbounded gradients in the attention computation adversely impact the convergence of linear transformer models; 2) attention dilution which trivially distributes attention scores over long sequences while neglecting neighbouring structures.
no code implementations • 15 Oct 2022 • Kaiyue Lu, Zexiang Liu, Jianyuan Wang, Weixuan Sun, Zhen Qin, Dong Li, Xuyang Shen, Hui Deng, Xiaodong Han, Yuchao Dai, Yiran Zhong
Therefore, we propose a feature fixation module to reweight the feature importance of the query and key before computing linear attention.
no code implementations • 28 Jul 2022 • Zexiang Liu, Dong Li, Kaiyue Lu, Zhen Qin, Weixuan Sun, Jiacheng Xu, Yiran Zhong
To address this issue, we propose a new framework to find optimal architectures for efficient Transformers with the neural architecture search (NAS) technique.
no code implementations • 26 Jul 2022 • Xuyang Shen, Jo Plested, Sabrina Caldwell, Yiran Zhong, Tom Gedeon
Fine-tuning is widely applied in image classification tasks as a transfer learning approach.
1 code implementation • 25 Jul 2022 • Xuelian Cheng, Yiran Zhong, Mehrtash Harandi, Tom Drummond, Zhiyong Wang, ZongYuan Ge
The self-attention mechanism, successfully employed with the transformer structure, has shown promise in many computer vision tasks, including image recognition and object detection.
2 code implementations • 11 Jul 2022 • Jinxing Zhou, Jianyuan Wang, Jiayi Zhang, Weixuan Sun, Jing Zhang, Stan Birchfield, Dan Guo, Lingpeng Kong, Meng Wang, Yiran Zhong
To deal with the AVS problem, we propose a novel method that uses a temporal pixel-wise audio-visual interaction module to inject audio semantics as guidance for the visual segmentation process.
1 code implementation • 21 Jun 2022 • Weixuan Sun, Zhen Qin, Hui Deng, Jianyuan Wang, Yi Zhang, Kaihao Zhang, Nick Barnes, Stan Birchfield, Lingpeng Kong, Yiran Zhong
Based on this observation, we present a Vicinity Attention that introduces a locality bias to vision transformers with linear complexity.
Ranked #301 on Image Classification on ImageNet
no code implementations • 10 Apr 2022 • Hui Deng, Tong Zhang, Yuchao Dai, Jiawei Shi, Yiran Zhong, Hongdong Li
In this paper, we propose to model deep NRSfM from a sequence-to-sequence translation perspective, where the input 2D frame sequence is taken as a whole to reconstruct the deforming 3D non-rigid shape sequence.
1 code implementation • CVPR 2022 • Xuelian Cheng, Huan Xiong, Deng-Ping Fan, Yiran Zhong, Mehrtash Harandi, Tom Drummond, ZongYuan Ge
We propose a new video camouflaged object detection (VCOD) framework that can exploit both short-term dynamics and long-term temporal consistency to detect camouflaged objects from video frames.
Ranked #2 on Camouflaged Object Segmentation on Camouflaged Animal Dataset (using extra training data)
3 code implementations • ICLR 2022 • Zhen Qin, Weixuan Sun, Hui Deng, Dongxu Li, Yunshen Wei, Baohong Lv, Junjie Yan, Lingpeng Kong, Yiran Zhong
As one of its core components, the softmax attention helps to capture long-range dependencies yet prohibits its scale-up due to the quadratic space and time complexity to the sequence length.
Ranked #6 on D4RL on D4RL
1 code implementation • 17 Dec 2021 • Dongxu Li, Chenchen Xu, Liu Liu, Yiran Zhong, Rong Wang, Lars Petersson, Hongdong Li
This work studies the task of glossification, of which the aim is to transcribe natural spoken language sentences for the Deaf (hard-of-hearing) community to ordered sign language glosses.
1 code implementation • 6 Dec 2021 • Weixuan Sun, Jing Zhang, Zheyuan Liu, Yiran Zhong, Nick Barnes
To bridge their gap, a Class Activation Map (CAM) is usually generated to provide pixel level pseudo labels.
no code implementations • 29 Nov 2021 • Jiadai Sun, Yuxin Mao, Yuchao Dai, Yiran Zhong, Jianyuan Wang
The task of semi-supervised video object segmentation (VOS) has been greatly advanced, and state-of-the-art performance has been achieved by dense matching-based methods.
no code implementations • 22 Nov 2021 • Jing Zhang, Yuchao Dai, Mehrtash Harandi, Yiran Zhong, Nick Barnes, Richard Hartley
Uncertainty estimation has been extensively studied in recent literature, which can usually be classified as aleatoric uncertainty and epistemic uncertainty.
no code implementations • 29 Sep 2021 • Xuelian Cheng, Huan Xiong, Deng-Ping Fan, Yiran Zhong, Mehrtash Harandi, Tom Drummond, ZongYuan Ge
The proposed SLT-Net leverages both short-term dynamics and long-term temporal consistency to detect concealed objects in continuous video frames.
1 code implementation • ICCV 2021 • Jing Zhang, Deng-Ping Fan, Yuchao Dai, Xin Yu, Yiran Zhong, Nick Barnes, Ling Shao
In this paper, we introduce a novel multi-stage cascaded learning framework via mutual information minimization to "explicitly" model the multi-modal information between RGB image and depth data.
1 code implementation • 1 Sep 2021 • Xiaomeng Xin, Yiran Zhong, Yunzhong Hou, Jinjun Wang, Liang Zheng
With the absence of old task images, they often assume that old knowledge is well preserved if the classifier produces similar output on new images.
no code implementations • 24 Jun 2021 • Mochu Xiang, Jing Zhang, Yunqiu Lv, Aixuan Li, Yiran Zhong, Yuchao Dai
In this paper, we study the depth contribution for camouflaged object detection, where the depth maps are generated with existing monocular depth estimation (MDE) methods.
1 code implementation • 16 Jun 2021 • Jiajun Zha, Yiran Zhong, Jing Zhang, Richard Hartley, Liang Zheng
Attention has been proved to be an efficient mechanism to capture long-range dependencies.
no code implementations • 27 May 2021 • Wenjia Niu, Kaihao Zhang, Wenhan Luo, Yiran Zhong
Single-image super-resolution (SR) and multi-frame SR are two ways to super resolve low-resolution images.
2 code implementations • CVPR 2021 • Jinxing Zhou, Liang Zheng, Yiran Zhong, Shijie Hao, Meng Wang
To encourage the network to extract highly correlated features for positive samples, a new audio-visual pair similarity loss is proposed.
1 code implementation • CVPR 2021 • Jianyuan Wang, Yiran Zhong, Yuchao Dai, Stan Birchfield, Kaihao Zhang, Nikolai Smolyanskiy, Hongdong Li
Two-view structure-from-motion (SfM) is the cornerstone of 3D reconstruction and visual SLAM.
Ranked #28 on Monocular Depth Estimation on KITTI Eigen split
no code implementations • CVPR 2021 • Dongxu Li, Chenchen Xu, Kaihao Zhang, Xin Yu, Yiran Zhong, Wenqi Ren, Hanna Suominen, Hongdong Li
Video deblurring models exploit consecutive frames to remove blurs from camera shakes and object motions.
no code implementations • 6 Dec 2020 • Yiran Zhong, Yuchao Dai, Hongdong Li
More specifically, we represent the desired depth map as a collection of 3D planes, and the reconstruction problem is formulated as the optimization of planar parameters.
no code implementations • 2 Dec 2020 • Yiran Zhong, Yuchao Dai, Hongdong Li
The given sparse depth points serve as a data term to constrain the weighting process.
no code implementations • 1 Dec 2020 • Yiran Zhong, Charles Loop, Wonmin Byeon, Stan Birchfield, Yuchao Dai, Kaihao Zhang, Alexey Kamenev, Thomas Breuel, Hongdong Li, Jan Kautz
A common way to speed up the computation is to downsample the feature volume, but this loses high-frequency details.
3 code implementations • NeurIPS 2020 • Jianyuan Wang, Yiran Zhong, Yuchao Dai, Kaihao Zhang, Pan Ji, Hongdong Li
Learning matching costs has been shown to be critical to the success of the state-of-the-art deep stereo matching methods, in which 3D convolutions are applied on a 4D feature volume to learn a 3D cost volume.
1 code implementation • NeurIPS 2020 • Xuelian Cheng, Yiran Zhong, Mehrtash Harandi, Yuchao Dai, Xiaojun Chang, Tom Drummond, Hongdong Li, ZongYuan Ge
To reduce the human efforts in neural network design, Neural Architecture Search (NAS) has been applied with remarkable success to various high-level vision tasks such as classification and semantic segmentation.
Ranked #2 on Stereo Disparity Estimation on Scene Flow
1 code implementation • CVPR 2020 • Kaihao Zhang, Wenhan Luo, Yiran Zhong, Lin Ma, Bjorn Stenger, Wei Liu, Hongdong Li
To address this problem, we propose a new method which combines two GAN models, i.e., a learning-to-Blur GAN (BGAN) and learning-to-DeBlur GAN (DBGAN), in order to learn a better model for image deblurring by primarily learning how to blur images.
Ranked #23 on Deblurring on HIDE (trained on GOPRO)
no code implementations • CVPR 2019 • Yiran Zhong, Pan Ji, Jianyuan Wang, Yuchao Dai, Hongdong Li
In this paper, we propose Deep Epipolar Flow, an unsupervised optical flow method which incorporates global geometric constraints into network learning.
3 code implementations • CVPR 2019 • Xuelian Cheng, Yiran Zhong, Yuchao Dai, Pan Ji, Hongdong Li
In this paper, we present LidarStereoNet, the first unsupervised Lidar-stereo fusion network, which can be trained in an end-to-end manner without the need of ground truth depth maps.
no code implementations • ECCV 2018 • Yiran Zhong, Yuchao Dai, Hongdong Li
This paper proposes an original problem of stereo computation from a single mixture image, a challenging problem that had not been researched before.
no code implementations • 13 Aug 2018 • Yiran Zhong, Yuchao Dai, Hongdong Li
This paper is concerned with the problem of how to better exploit 3D geometric information for dense semantic image labeling.
no code implementations • ECCV 2018 • Yiran Zhong, Hongdong Li, Yuchao Dai
Deep Learning based stereo matching methods have shown great successes and achieved top scores across different benchmarks.
1 code implementation • 28 Mar 2018 • Kaihao Zhang, Wenhan Luo, Yiran Zhong, Lin Ma, Wei Liu, Hongdong Li
To tackle the second challenge, we leverage the developed DBLRNet as a generator in the GAN (generative adversarial network) architecture, and employ a content loss in addition to an adversarial loss for efficient adversarial training.
no code implementations • 4 Sep 2017 • Yiran Zhong, Yuchao Dai, Hongdong Li
Existing deep-learning based dense stereo matching methods often rely on ground-truth disparity maps as the training signals, which are however not always available in many situations.
no code implementations • CVPR 2016 • Pan Ji, Hongdong Li, Mathieu Salzmann, Yiran Zhong
Feature tracking is a fundamental problem in computer vision, with applications in many computer vision tasks, such as visual SLAM and action recognition.