1 code implementation • 15 Nov 2024 • Sucheng Ren, Yaodong Yu, Nataniel Ruiz, Feng Wang, Alan Yuille, Cihang Xie
In this paper, we show that this scale-wise autoregressive framework can be effectively decoupled into intra-scale modeling, which captures local spatial dependencies within each scale, and inter-scale modeling, which models cross-scale relationships progressively from coarse to fine scales.
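As a rough illustration of this decoupling, the sketch below separates bidirectional attention within one scale from cross-attention to the previously generated coarser scales. All module names and sizes here are hypothetical stand-ins, not the paper's architecture (in a real model the finer scale's inputs would come from upsampled predictions, not random tensors).

```python
import torch
import torch.nn as nn

class ScaleBlock(nn.Module):
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.intra = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.inter = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x, coarser):
        # Intra-scale: bidirectional attention among tokens of the current scale.
        h = self.norm1(x)
        x = x + self.intra(h, h, h, need_weights=False)[0]
        # Inter-scale: attend to tokens from all previously generated scales.
        h = self.norm2(x)
        x = x + self.inter(h, coarser, coarser, need_weights=False)[0]
        return x

dim = 64
block = ScaleBlock(dim)
scales = [torch.randn(2, s * s, dim) for s in (1, 2, 4)]  # coarse-to-fine token maps
history = scales[0]
for finer in scales[1:]:
    out = block(finer, history)          # predict the finer scale from coarser ones
    history = torch.cat([history, out], dim=1)
print(history.shape)  # torch.Size([2, 21, 64])
```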
1 code implementation • 10 Oct 2024 • Feng Wang, Timing Yang, Yaodong Yu, Sucheng Ren, Guoyizhe Wei, Angtian Wang, Wei Shao, Yuyin Zhou, Alan Yuille, Cihang Xie
In this work, we present a comprehensive analysis of causal image modeling and introduce the Adventurer series models where we treat images as sequences of patch tokens and employ uni-directional language models to learn visual representations.
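A minimal sketch of the causal image modeling recipe follows, assuming raster-order patch tokens and a masked Transformer as a stand-in for the uni-directional model (Adventurer uses Mamba-style state space layers; this is not the authors' code, and all sizes are illustrative).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

patch, dim = 16, 192
n = (224 // patch) ** 2                          # 196 patch tokens in raster order
embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=2)
head = nn.Linear(dim, dim)                       # predicts the next patch embedding

x = torch.randn(2, 3, 224, 224)
tok = embed(x).flatten(2).transpose(1, 2)        # (B, 196, dim)
mask = nn.Transformer.generate_square_subsequent_mask(n)   # block future tokens
h = encoder(tok, mask=mask)
loss = F.mse_loss(head(h[:, :-1]), tok[:, 1:].detach())    # next-token regression
loss.backward()
```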
no code implementations • 12 Jun 2024 • Xianhang Li, Haoqin Tu, Mude Hui, Zeyu Wang, Bingchen Zhao, Junfei Xiao, Sucheng Ren, Jieru Mei, Qing Liu, Huangjie Zheng, Yuyin Zhou, Cihang Xie
For discriminative models like CLIP, we observe enhanced zero-shot performance in cross-modal retrieval tasks.
Ranked #119 on Visual Question Answering on MM-Vet
1 code implementation • 11 Jun 2024 • Sucheng Ren, Xianhang Li, Haoqin Tu, Feng Wang, Fangxun Shu, Lei Zhang, Jieru Mei, Linjie Yang, Peng Wang, Heng Wang, Alan Yuille, Cihang Xie
The vision community has begun adopting the recently developed state space model Mamba as a new backbone for a range of tasks.
1 code implementation • 8 Jun 2024 • Sucheng Ren, Xiaoke Huang, Xianhang Li, Junfei Xiao, Jieru Mei, Zeyu Wang, Alan Yuille, Yuyin Zhou
This study presents Medical Vision Generalist (MVG), the first foundation model capable of handling various medical imaging tasks -- such as cross-modal synthesis, image segmentation, denoising, and inpainting -- within a unified image-to-image generation framework.
no code implementations • 24 May 2024 • Sucheng Ren, Hongru Zhu, Chen Wei, Yijiang Li, Alan Yuille, Cihang Xie
This paper presents a new self-supervised video representation learning framework, ARVideo, which autoregressively predicts the next video token in a tailored sequence order.
1 code implementation • 23 May 2024 • Feng Wang, Jiahao Wang, Sucheng Ren, Guoyizhe Wei, Jieru Mei, Wei Shao, Yuyin Zhou, Alan Yuille, Cihang Xie
This paper identifies that, as with Vision Transformers, artifacts are also present within the feature maps of Vision Mamba.
no code implementations • 8 Mar 2024 • Yijiang Li, Sucheng Ren, Weipeng Deng, Yuzhi Xu, Ying Gao, Edith Ngai, Haohan Wang
Starting with the class of interest, we query the LLMs to extract relevant knowledge for these novel domains.
no code implementations • 11 Dec 2023 • Lei Zhang, Fangxun Shu, Tianyang Liu, Sucheng Ren, Hao Jiang, Cihang Xie
However, the vast scale of these datasets inevitably introduces significant variability in data quality, which can adversely affect model performance.
4 code implementations • 4 Dec 2023 • Sucheng Ren, Zeyu Wang, Hongru Zhu, Junfei Xiao, Alan Yuille, Cihang Xie
This paper enhances image-GPT (iGPT), one of the pioneering works that introduced autoregressive pretraining to predict the next pixels for visual representation learning.
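For context, here is a toy sketch of the iGPT-style next-pixel objective the paper builds on: quantized pixels form a raster-ordered sequence and a causal model predicts each token from its prefix. Vocabulary and model sizes are hypothetical, and this omits the paper's enhanced prediction targets.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab, dim, seq = 512, 128, 32 * 32              # 9-bit palette, 32x32 images
emb = nn.Embedding(vocab, dim)
pos = nn.Parameter(torch.zeros(1, seq, dim))
layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
body = nn.TransformerEncoder(layer, num_layers=2)
head = nn.Linear(dim, vocab)

pixels = torch.randint(0, vocab, (2, seq))       # raster-ordered pixel tokens
mask = nn.Transformer.generate_square_subsequent_mask(seq)
h = body(emb(pixels) + pos, mask=mask)
loss = F.cross_entropy(head(h[:, :-1]).reshape(-1, vocab), pixels[:, 1:].reshape(-1))
loss.backward()
```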
1 code implementation • 23 Aug 2023 • Ziyu Yang, Sucheng Ren, Zongwei Wu, Nanxuan Zhao, Junle Wang, Jing Qin, Shengfeng He
Non-photorealistic videos are increasingly in demand with the wave of the metaverse, yet they remain insufficiently studied.
1 code implementation • ICCV 2023 • Sucheng Ren, Xingyi Yang, Songhua Liu, Xinchao Wang
At the heart of our approach is to utilize a significance map, which is estimated through hybrid-scale self-attention and evolves itself during training, to reallocate tokens based on the significance of each region.
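The sketch below shows one plausible reading of significance-driven token reallocation: per-token significance is estimated from received attention, salient tokens are kept at fine granularity, and the rest are pooled. The significance estimate and pooling here are crude stand-ins (the paper uses hybrid-scale self-attention and evolves the map during training).

```python
import torch
import torch.nn as nn

B, N, dim, keep = 2, 64, 32, 16
x = torch.randn(B, N, dim)
attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
out, w = attn(x, x, x)                    # w: (B, N, N) attention weights
significance = w.mean(dim=1)              # attention each token receives, (B, N)
idx = significance.topk(keep, dim=1).indices
fine = torch.gather(x, 1, idx.unsqueeze(-1).expand(-1, -1, dim))  # salient tokens
coarse = x.mean(dim=1, keepdim=True)      # crude stand-in: pool the remainder
tokens = torch.cat([fine, coarse], dim=1) # (B, keep + 1, dim) reallocated sequence
print(tokens.shape)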
1 code implementation • 15 Mar 2023 • Sucheng Ren, Fangyun Wei, Samuel Albanie, Zheng Zhang, Han Hu
Deep supervision, which adds extra supervision to the intermediate features of a neural network, was widely used for image classification in the early deep learning era, since it significantly reduces training difficulty and eases optimization, e.g., by mitigating vanishing gradients relative to vanilla training.
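As a minimal sketch of the generic pattern (not this paper's specific scheme), an auxiliary classifier attached to an intermediate stage adds a direct gradient path:

```python
import torch
import torch.nn as nn

class DeeplySupervisedNet(nn.Module):
    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.stage1 = nn.Sequential(nn.Conv2d(3, 32, 3, 2, 1), nn.ReLU())
        self.stage2 = nn.Sequential(nn.Conv2d(32, 64, 3, 2, 1), nn.ReLU())
        self.aux_head = nn.Linear(32, num_classes)    # supervises stage1 directly
        self.head = nn.Linear(64, num_classes)

    def forward(self, x):
        f1 = self.stage1(x)
        f2 = self.stage2(f1)
        aux = self.aux_head(f1.mean(dim=(2, 3)))      # global-average-pooled features
        return self.head(f2.mean(dim=(2, 3))), aux

net = DeeplySupervisedNet()
x, y = torch.randn(4, 3, 32, 32), torch.randint(0, 10, (4,))
logits, aux = net(x)
loss = nn.functional.cross_entropy(logits, y) + 0.3 * nn.functional.cross_entropy(aux, y)
loss.backward()
```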
2 code implementations • CVPR 2023 • Sucheng Ren, Fangyun Wei, Zheng Zhang, Han Hu
Our tiny-sized TinyMIM model achieves 79.6% top-1 accuracy on ImageNet-1K image classification, setting a new record for small vision models of the same size and computation budget.
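A minimal sketch of distilling a masked-image-modeling teacher into a small student is shown below, using plain feature-level MSE with a projection to align dimensions. This is only the general recipe under assumed toy modules; TinyMIM itself studies several distillation targets, including token relations.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

teacher = nn.Sequential(nn.Linear(64, 256), nn.GELU(), nn.Linear(256, 256)).eval()
student = nn.Sequential(nn.Linear(64, 128), nn.GELU(), nn.Linear(128, 128))
proj = nn.Linear(128, 256)                   # align student dim to teacher dim

tokens = torch.randn(4, 196, 64)             # patch tokens of one image batch
with torch.no_grad():
    target = teacher(tokens)                 # frozen MIM-pretrained teacher
loss = F.mse_loss(proj(student(tokens)), target)
loss.backward()
```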
1 code implementation • 22 Jul 2022 • Zhengqi Gao, Fan-Keng Sun, Mingran Yang, Sucheng Ren, Zikai Xiong, Marc Engeler, Antonio Burazer, Linda Wildling, Luca Daniel, Duane S. Boning
Data lies at the core of modern deep learning.
1 code implementation • 13 Jul 2022 • Songhua Liu, Jingwen Ye, Sucheng Ren, Xinchao Wang
Despite promising results, prior approaches have relied either on estimating dense attention to compute per-point matching, which is limited to coarse scales due to its quadratic memory cost, or on fixing the number of correspondences to achieve linear complexity, which lacks flexibility.
1 code implementation • CVPR 2022 • Sucheng Ren, Huiyu Wang, Zhengqi Gao, Shengfeng He, Alan Yuille, Yuyin Zhou, Cihang Xie
More notably, our SDMP is the first method that successfully leverages data mixing to improve (rather than hurt) the performance of Vision Transformers in the self-supervised setting.
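The core idea can be sketched as follows: a mixed image is treated as a lambda-weighted positive of its two sources in embedding space. This is a simplified illustration with a toy encoder; SDMP's full objective also models patch-level relations and plugs into standard SSL frameworks.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 128))
x1, x2 = torch.randn(8, 3, 32, 32), torch.randn(8, 3, 32, 32)
lam = 0.7
mixed = lam * x1 + (1 - lam) * x2            # mixup-style data mixing

z1, z2, zm = (F.normalize(encoder(t), dim=1) for t in (x1, x2, mixed))
# The mixed view should sit between its sources, weighted by lambda.
loss = -(lam * (zm * z1).sum(1) + (1 - lam) * (zm * z2).sum(1)).mean()
loss.backward()
```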
1 code implementation • 13 Jun 2022 • Zihui Xue, Zhengqi Gao, Sucheng Ren, Hang Zhao
Crossmodal knowledge distillation (KD) extends traditional knowledge distillation to the area of multimodal learning and demonstrates great success in various applications.
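A minimal sketch of the crossmodal KD setup is given below: a teacher trained on one modality (e.g., RGB) supervises a student that sees a paired view from another modality (e.g., depth). The models and the temperature are illustrative; the paper's contribution is analyzing when such transfer succeeds, not this recipe itself.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

teacher = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10)).eval()  # RGB
student = nn.Sequential(nn.Flatten(), nn.Linear(1 * 32 * 32, 10))         # depth

rgb, depth = torch.randn(4, 3, 32, 32), torch.randn(4, 1, 32, 32)  # paired views
with torch.no_grad():
    t_logits = teacher(rgb)
T = 2.0                                             # softening temperature
loss = F.kl_div(F.log_softmax(student(depth) / T, dim=1),
                F.softmax(t_logits / T, dim=1), reduction="batchmean") * T * T
loss.backward()
```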
no code implementations • 29 May 2022 • Zheng Xiong, Liangyu Chai, Wenxi Liu, Yongtuo Liu, Sucheng Ren, Shengfeng He
To enable training under this new setting, we convert the crowd count regression problem to a ranking potential prediction problem.
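One way to read the count-to-ranking conversion: a crop can never contain more people than the larger region it came from, so a margin ranking loss can supervise a scorer without exact counts. The scorer below is a hypothetical toy; the paper's formulation differs in detail.

```python
import torch
import torch.nn as nn

scorer = nn.Sequential(nn.AdaptiveAvgPool2d(8), nn.Flatten(), nn.Linear(3 * 64, 1))
image = torch.randn(4, 3, 128, 128)
crop = image[:, :, 32:96, 32:96]                    # strictly contained sub-region

s_full, s_crop = scorer(image), scorer(crop)
# Enforce score(full) >= score(crop): target=1 means the first input ranks higher.
loss = nn.functional.margin_ranking_loss(s_full, s_crop,
                                          torch.ones_like(s_full), margin=0.0)
loss.backward()
```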
no code implementations • 5 Apr 2022 • Zhengqi Gao, Sucheng Ren, Zihui Xue, Siting Li, Hang Zhao
Multimodal fusion emerges as an appealing technique to improve model performance on many tasks.
no code implementations • 22 Mar 2022 • Tianyu Hua, Yonglong Tian, Sucheng Ren, Michalis Raptis, Hang Zhao, Leonid Sigal
We show that randomized serialization of the segments significantly improves performance and yields a distribution over spatially long (across-segment) and spatially short (within-segment) predictions, which is effective for feature learning.
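A minimal sketch of randomized segment serialization: tokens are grouped into segments whose order is shuffled per sample before autoregressive prediction. Shapes are illustrative and the downstream causal model is omitted.

```python
import torch

B, n_seg, seg_len, dim = 2, 4, 8, 16
tokens = torch.randn(B, n_seg, seg_len, dim)        # (B, segments, tokens, dim)

order = torch.stack([torch.randperm(n_seg) for _ in range(B)])   # per-sample order
shuffled = torch.stack([tokens[b, order[b]] for b in range(B)])
sequence = shuffled.flatten(1, 2)                   # (B, n_seg * seg_len, dim)
# `sequence` now feeds any causal model; within a segment the next token is
# spatially near, while segment boundaries force spatially long predictions.
print(sequence.shape)
```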
1 code implementation • CVPR 2022 • Sucheng Ren, Daquan Zhou, Shengfeng He, Jiashi Feng, Xinchao Wang
This novel merging scheme enables the self-attention to learn relationships between objects with different sizes and simultaneously reduces the token numbers and the computational cost.
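The sketch below illustrates the general flavor of multi-rate token merging for self-attention: keys and values are spatially merged at different rates so one branch sees fine tokens and another sees coarse ones, cutting the key/value count. This is schematic, not the paper's exact module.

```python
import torch
import torch.nn as nn

B, H, W, dim = 2, 14, 14, 64
x = torch.randn(B, H * W, dim)
grid = x.transpose(1, 2).reshape(B, dim, H, W)

merge2 = nn.Conv2d(dim, dim, kernel_size=2, stride=2)   # merge 2x2 token groups
kv_fine = x                                             # 196 key/value tokens
kv_coarse = merge2(grid).flatten(2).transpose(1, 2)     # 49 key/value tokens

attn_f = nn.MultiheadAttention(dim, 4, batch_first=True)
attn_c = nn.MultiheadAttention(dim, 4, batch_first=True)
out = attn_f(x, kv_fine, kv_fine)[0] + attn_c(x, kv_coarse, kv_coarse)[0]
print(out.shape)  # torch.Size([2, 196, 64]) with far fewer coarse-branch keys
```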
no code implementations • 6 Aug 2021 • Yongtuo Liu, Dan Xu, Sucheng Ren, Hanjie Wu, Hongmin Cai, Shengfeng He
To this end, we propose to untangle domain-invariant crowd and domain-specific background from crowd images and design a fine-grained domain adaptation method for crowd counting.
no code implementations • 6 Aug 2021 • Yongtuo Liu, Sucheng Ren, Liangyu Chai, Hanjie Wu, Jing Qin, Dan Xu, Shengfeng He
In this way, we transfer the spatial labeling redundancy caused by similarities between individuals into effective supervision signals for the unlabeled regions.
1 code implementation • 5 Aug 2021 • Sucheng Ren, Qiang Wen, Nanxuan Zhao, Guoqiang Han, Shengfeng He
In this paper, we introduce a new attention-based encoder, the vision transformer, into salient object detection to ensure global context in the representations from shallow to deep layers.
no code implementations • CVPR 2022 • Sucheng Ren, Zhengqi Gao, Tianyu Hua, Zihui Xue, Yonglong Tian, Shengfeng He, Hang Zhao
Transformers have recently been adapted from the natural language processing community as a promising substitute for convolution-based neural networks in visual learning tasks.
1 code implementation • CVPR 2021 • Sucheng Ren, Wenxi Liu, Yongtuo Liu, Haoxin Chen, Guoqiang Han, Shengfeng He
Additionally, to exclude information about moving background objects from the motion features, our transformation module reciprocally transforms the appearance features to enhance the motion features, focusing on moving objects with salient appearance while removing co-moving outliers.
Ranked #12 on Unsupervised Video Object Segmentation on DAVIS 2016 val
1 code implementation • CVPR 2021 • Haoxin Chen, Hanjie Wu, Nanxuan Zhao, Sucheng Ren, Shengfeng He
The key is to model the relationship between the query videos and the support images for propagating the object information.
no code implementations • CVPR 2021 • Sucheng Ren, Yong Du, Jianming Lv, Guoqiang Han, Shengfeng He
To these ends, we introduce a trainable "master" network which ingests both audio signals and silent lip videos instead of a pretrained teacher.
1 code implementation • ICCV 2021 • Tianyu Hua, Wenxiao Wang, Zihui Xue, Sucheng Ren, Yue Wang, Hang Zhao
In self-supervised representation learning, a common idea behind most of the state-of-the-art approaches is to enforce the robustness of the representations to predefined augmentations.
1 code implementation • ICCV 2021 • Zihui Xue, Sucheng Ren, Zhengqi Gao, Hang Zhao
The popularity of multimodal sensors and the accessibility of the Internet have brought us a massive amount of unlabeled multimodal data.
Ranked #69 on Semantic Segmentation on NYU Depth v2
no code implementations • ECCV 2020 • Sucheng Ren, Chu Han, Xin Yang, Guoqiang Han, Shengfeng He
In this paper, we propose a simple yet effective approach, named Triple Excitation Network, to reinforce the training of video salient object detection (VSOD) from three aspects: spatial, temporal, and online excitations.