no code implementations • 1 Aug 2024 • Benlin Liu, Yuhao Dong, Yiqin Wang, Yongming Rao, Yansong Tang, Wei-Chiu Ma, Ranjay Krishna
We introduce Coarse Correspondence, a simple, training-free, effective, and general-purpose visual prompting method to elicit 3D and temporal understanding in multimodal LLMs.
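The prompting idea can be sketched as follows: the same object ID is attached to its coarse location in each sampled frame so the model can link one instance across views and timesteps. The `tracks` format and the function name are hypothetical; in practice a lightweight tracker would supply the coordinates.

```python
def coarse_correspondence_prompt(tracks, frame_ids):
    """Build a per-frame mark string linking object IDs across frames.
    `tracks` maps object_id -> {frame_id: (x, y)} (hypothetical format)."""
    prompts = {}
    for f in frame_ids:
        marks = [(oid, xy[f]) for oid, xy in sorted(tracks.items()) if f in xy]
        prompts[f] = ", ".join(f"[{oid}]@{pos}" for oid, pos in marks)
    return prompts

# two objects tracked over two frames
tracks = {1: {0: (4, 5), 1: (6, 5)}, 2: {0: (10, 2)}}
print(coarse_correspondence_prompt(tracks, [0, 1]))
```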
no code implementations • 15 Jul 2024 • Haipeng Luo, Qingfeng Sun, Can Xu, Pu Zhao, Qingwei Lin, Jianguang Lou, Shifeng Chen, Yansong Tang, Weizhu Chen
In this paper, we introduce Arena Learning, an innovative offline strategy designed to simulate these arena battles using AI-driven annotations to evaluate battle outcomes, thus facilitating the continuous improvement of the target model through both supervised fine-tuning and reinforcement learning.
no code implementations • 9 Jul 2024 • Bowen Zhang, Yiji Cheng, Chunyu Wang, Ting Zhang, Jiaolong Yang, Yansong Tang, Feng Zhao, Dong Chen, Baining Guo
We present RodinHD, which can generate high-fidelity 3D avatars from a portrait image.
no code implementations • 30 Jun 2024 • Yiqin Wang, Haoji Zhang, Yansong Tang, Yong Liu, Jiashi Feng, Jifeng Dai, Xiaojie Jin
This paper describes our champion solution to the LOVEU Challenge @ CVPR'24, Track 1 (Long Video VQA).
no code implementations • 25 Jun 2024 • Aoyang Liu, Qingnan Fan, Shuai Qin, Hong Gu, Yansong Tang
In this paper, we explore a novel task: learning the personalized identity prior for text-based non-rigid image editing.
1 code implementation • 21 Jun 2024 • Chubin Zhang, Hongliang Song, Yi Wei, Yu Chen, Jiwen Lu, Yansong Tang
In this work, we introduce the Geometry-Aware Large Reconstruction Model (GeoLRM), an approach that can predict high-quality assets with 512K Gaussians from 21 input images in only 11 GB of GPU memory.
1 code implementation • 18 Jun 2024 • Xubing Ye, Yukang Gan, Xiaoke Huang, Yixiao Ge, Ying Shan, Yansong Tang
Previous approaches compress vision tokens with external modules and force LLMs to understand the compressed ones, leading to visual information loss.
no code implementations • 14 Jun 2024 • Gengyuan Zhang, Mang Ling Ada Fok, Yan Xia, Yansong Tang, Daniel Cremers, Philip Torr, Volker Tresp, Jindong Gu
Our new benchmark aims to evaluate how well models can localize an event given a multimodal semantic query that consists of a reference image, which depicts the event, and a refinement text to adjust the images' semantics.
1 code implementation • 12 Jun 2024 • Haoji Zhang, Yiqin Wang, Yansong Tang, Yong Liu, Jiashi Feng, Jifeng Dai, Xiaojie Jin
Our model is able to process extremely long video streams in real-time and respond to user queries simultaneously.
Ranked #1 on Zero-Shot Video Question Answer on MSRVTT-QA
no code implementations • 3 Jun 2024 • Guanxing Lu, Zifeng Gao, Tianxing Chen, Wenxun Dai, Ziwei Wang, Yansong Tang
To model this process, we design a consistency distillation technique that predicts the action sample directly, rather than the noise as in the vision community, for fast convergence in the low-dimensional action manifold.
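The core consistency idea can be sketched as follows: the student is trained so that adjacent points on one noising trajectory map to the same clean action sample. This is a minimal sketch, not the paper's exact objective; the toy noise schedule and linear "policy head" are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def add_noise(x0, t, eps):
    # variance-preserving-style forward process (toy schedule, 0 <= t <= 1)
    return np.sqrt(1.0 - t) * x0 + np.sqrt(t) * eps

def consistency_loss(f, x0, t, t_next, theta):
    # the student f should map adjacent points on the SAME noising trajectory
    # to the same clean action sample (predicting the sample, not the noise)
    eps = rng.standard_normal(x0.shape)
    x_t, x_next = add_noise(x0, t, eps), add_noise(x0, t_next, eps)
    return float(np.mean((f(x_t, t, theta) - f(x_next, t_next, theta)) ** 2))

f = lambda x, t, theta: theta * x       # toy linear "policy head"
x0 = rng.standard_normal(8)             # a low-dimensional action vector
print(consistency_loss(f, x0, 0.9, 0.1, 1.0) >= 0.0)
```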
1 code implementation • CVPR 2024 • Yixuan Zhu, Wenliang Zhao, Ao Li, Yansong Tang, Jie Zhou, Jiwen Lu
Image enhancement holds extensive applications in real-world scenarios due to complex environments and limitations of imaging devices.
no code implementations • CVPR 2024 • Kun Yuan, Hongbo Liu, Mading Li, Muyi Sun, Ming Sun, Jiachao Gong, Jinhua Hao, Chao Zhou, Yansong Tang
In this paper, we propose a VQA method named PTM-VQA, which leverages PreTrained Models to transfer knowledge from models pretrained on various pre-tasks, enabling benefits for VQA from different aspects.
1 code implementation • 21 May 2024 • Zhaojian Yu, Yinghao Wu, Zhuotao Deng, Yansong Tang, Xiao-Ping Zhang
By promoting sustainable AI development and deployment, OpenCarbonEval can help reduce the environmental impact of large-scale models and contribute to a more environmentally responsible future for the AI community.
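A back-of-the-envelope operational-carbon estimate illustrates the kind of quantity such a tool reports. This is a generic formula, not OpenCarbonEval's actual model; the default power, PUE, and grid-intensity values are illustrative assumptions.

```python
def carbon_emissions_kg(gpu_hours, gpu_power_w=400, pue=1.2,
                        intensity_kg_per_kwh=0.4):
    # energy drawn = GPU-hours x device power x datacenter PUE overhead;
    # emissions scale that energy by the grid's carbon intensity
    energy_kwh = gpu_hours * gpu_power_w / 1000.0 * pue
    return energy_kwh * intensity_kg_per_kwh

# ~1920 kg CO2e for 10k GPU-hours under the assumed defaults
print(round(carbon_emissions_kg(10_000), 1))
```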
1 code implementation • 30 Apr 2024 • Wenxun Dai, Ling-Hao Chen, Jingbo Wang, Jinpeng Liu, Bo Dai, Yansong Tang
By employing one-step (or few-step) inference, we further improve the runtime efficiency of the motion latent diffusion model for motion generation.
Ranked #19 on Motion Synthesis on HumanML3D
1 code implementation • CVPR 2024 • Shiyi Zhang, Sule Bai, Guangyi Chen, Lei Chen, Jiwen Lu, Junle Wang, Yansong Tang
NAE is a more challenging task because it requires both narrative flexibility and evaluation rigor.
2 code implementations • CVPR 2023 • Shiyi Zhang, Wenxun Dai, Sujia Wang, Xiangwei Shen, Jiwen Lu, Jie Zhou, Yansong Tang
Action quality assessment (AQA) has become an emerging topic because of its wide applicability in numerous scenarios.
1 code implementation • CVPR 2024 • Yixuan Zhu, Ao Li, Yansong Tang, Wenliang Zhao, Jie Zhou, Jiwen Lu
The recovery of occluded human meshes presents challenges for current methods due to the difficulty in extracting effective image features under severe occlusion.
no code implementations • 28 Mar 2024 • Bowen Zhang, Yiji Cheng, Jiaolong Yang, Chunyu Wang, Feng Zhao, Yansong Tang, Dong Chen, Baining Guo
We introduce a radiance representation that is both structured and fully explicit and thus greatly facilitates 3D generative modeling.
1 code implementation • CVPR 2024 • Hancheng Ye, Chong Yu, Peng Ye, Renqiu Xia, Yansong Tang, Jiwen Lu, Tao Chen, Bo Zhang
Recent Vision Transformer Compression (VTC) works mainly follow a two-stage scheme, where the importance score of each model unit is first evaluated or preset in each submodule, followed by the sparsity score evaluation according to the target sparsity constraint.
no code implementations • 16 Mar 2024 • Zhiheng Li, Muheng Li, Jixuan Fan, Lei Chen, Yansong Tang, Jie Zhou, Jiwen Lu
Arbitrary-scale super-resolution based on implicit image functions has gained increasing popularity, since it can represent the visual world in a continuous manner.
no code implementations • 13 Mar 2024 • Guanxing Lu, Shiyi Zhang, Ziwei Wang, Changliu Liu, Jiwen Lu, Yansong Tang
Performing language-conditioned robotic manipulation tasks in unstructured environments is in high demand for general-purpose intelligent robots.
1 code implementation • CVPR 2024 • Jianjian Cao, Peng Ye, Shengze Li, Chong Yu, Yansong Tang, Jiwen Lu, Tao Chen
To this end, we propose a novel framework named Multimodal Alignment-Guided Dynamic Token Pruning (MADTP) for accelerating various VLTs.
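A minimal sketch of the token-pruning step: keep only the top-scoring fraction of vision tokens. The scoring source and `keep_ratio` here are assumptions; in MADTP the scores would come from multimodal alignment rather than being given directly.

```python
import numpy as np

def prune_tokens(tokens, scores, keep_ratio=0.5):
    # keep the top-scoring fraction of vision tokens, preserving their order
    k = max(1, int(len(tokens) * keep_ratio))
    keep = np.sort(np.argsort(scores)[::-1][:k])
    return tokens[keep]

tokens = np.arange(12).reshape(6, 2)                 # 6 toy vision tokens
scores = np.array([0.1, 0.9, 0.2, 0.8, 0.3, 0.0])    # per-token importance
print(prune_tokens(tokens, scores).tolist())         # [[2, 3], [6, 7], [8, 9]]
```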
1 code implementation • 1 Jan 2024 • Zhuoyan Luo, Yicheng Xiao, Yong Liu, Yitong Wang, Yansong Tang, Xiu Li, Yujiu Yang
Recent transformer-based models have dominated the Referring Video Object Segmentation (RVOS) task due to their superior performance.
no code implementations • 22 Dec 2023 • Jinpeng Liu, Wenxun Dai, Chunyu Wang, Yiji Cheng, Yansong Tang, Xin Tong
Some works use the CLIP model to align the motion space and the text space, aiming to enable motion generation from natural language motion descriptions.
1 code implementation • 14 Dec 2023 • Chubin Zhang, Juncheng Yan, Yi Wei, Jiaxin Li, Li Liu, Yansong Tang, Yueqi Duan, Jiwen Lu
Occupancy prediction reconstructs 3D structures of surrounding environments.
no code implementations • 12 Dec 2023 • Guanxing Lu, Ziwei Wang, Changliu Liu, Jiwen Lu, Yansong Tang
Embodied Instruction Following (EIF) requires agents to complete human instructions by interacting with objects in complex surrounding environments.
1 code implementation • CVPR 2024 • Yong Liu, Sule Bai, Guanbin Li, Yitong Wang, Yansong Tang
We attribute this to the in-vocabulary embedding and domain-biased CLIP prediction.
no code implementations • 7 Dec 2023 • Kang Ge, Chen Wang, Yutao Guo, Yansong Tang, Zhenzhong Hu, Hongbing Chen
Two parameter-efficient fine-tuning methods, adapter and low-rank adaptation, are adopted to fine-tune the foundation model in semantic segmentation: the Segment Anything Model (SAM).
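The low-rank adaptation idea can be sketched as a linear layer whose pretrained weight stays frozen while only a rank-r update is learned. This is a generic LoRA sketch, not the paper's SAM-specific implementation; the class name and defaults are assumptions.

```python
import numpy as np

class LoRALinear:
    """Freeze pretrained W; learn only the rank-r update alpha * B @ A."""
    def __init__(self, w, rank=4, alpha=1.0, seed=0):
        rng = np.random.default_rng(seed)
        self.w = w                                      # frozen weight (out, in)
        self.a = rng.standard_normal((rank, w.shape[1])) * 0.01
        self.b = np.zeros((w.shape[0], rank))           # zero-init: starts as a no-op
        self.alpha = alpha

    def __call__(self, x):
        return x @ (self.w + self.alpha * self.b @ self.a).T

layer = LoRALinear(np.eye(3), rank=2)
x = np.ones((1, 3))
print(np.allclose(layer(x), x))  # True: the update is zero before training
```

Because B is zero-initialized, the adapted layer reproduces the frozen model exactly at the start of fine-tuning; only A and B would receive gradients.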
1 code implementation • CVPR 2024 • Yong Liu, Cairong Zhang, Yitong Wang, Jiahao Wang, Yujiu Yang, Yansong Tang
This paper aims to achieve universal segmentation at arbitrary semantic levels.
Ranked #1 on Referring Expression Segmentation on RefCOCOg-test (using extra training data)
1 code implementation • CVPR 2024 • Xiaoke Huang, Jianfeng Wang, Yansong Tang, Zheng Zhang, Han Hu, Jiwen Lu, Lijuan Wang, Zicheng Liu
We propose a method to efficiently equip the Segment Anything Model (SAM) with the ability to generate regional captions.
no code implementations • 10 Nov 2023 • Siao Tang, Xin Wang, Hong Chen, Chaoyu Guan, Zewen Wu, Yansong Tang, Wenwu Zhu
In this paper, we propose a novel post-training quantization method PCR (Progressive Calibration and Relaxing) for text-to-image diffusion models, which consists of a progressive calibration strategy that considers the accumulated quantization error across timesteps, and an activation relaxing strategy that improves the performance with negligible cost.
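The quantity being controlled can be sketched as generic uniform quantization with the per-timestep error summed over the denoising trajectory. PCR's actual progressive schedule and relaxing strategy are beyond this sketch; the function names are assumptions.

```python
import numpy as np

def quantize(x, n_bits=8):
    # uniform symmetric quantization of an activation tensor
    scale = np.abs(x).max() / (2 ** (n_bits - 1) - 1)
    return np.round(x / scale) * scale

def accumulated_error(activations_per_timestep):
    # quantization error summed across denoising timesteps -- the quantity a
    # progressive calibration schedule aims to keep small
    return sum(float(np.mean((a - quantize(a)) ** 2))
               for a in activations_per_timestep)

rng = np.random.default_rng(0)
acts = [rng.standard_normal(64) for _ in range(4)]   # toy per-step activations
print(accumulated_error(acts) >= 0.0)  # True
```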
no code implementations • 8 Nov 2023 • Siao Tang, Xin Wang, Hong Chen, Chaoyu Guan, Yansong Tang, Wenwu Zhu
When retraining the searched architecture, we adopt a dynamic joint loss to maintain the consistency between supernet training and subnet retraining, which also provides informative objectives for each block and shortens the paths of gradient propagation.
1 code implementation • NeurIPS 2023 • Yinan Liang, Ziwei Wang, Xiuwei Xu, Yansong Tang, Jie Zhou, Jiwen Lu
Due to the high price and heavy energy consumption of GPUs, deploying deep models on IoT devices such as microcontrollers makes a significant contribution to ecological AI.
1 code implementation • ICCV 2023 • Zhiheng Li, Wenjia Geng, Muheng Li, Lei Chen, Yansong Tang, Jiwen Lu, Jie Zhou
By this means, our model explores all sorts of reliable sub-relations within an action sequence in the condensed action space.
1 code implementation • ICCV 2023 • Guangyi Chen, Xiao Liu, Guangrun Wang, Kun Zhang, Philip H. S. Torr, Xiao-Ping Zhang, Yansong Tang
To bridge these gaps, in this paper, we propose Tem-Adapter, which enables the learning of temporal dynamics and complex semantics by a visual Temporal Aligner and a textual Semantic Aligner.
Ranked #1 on Video Question Answering on SUTD-TrafficQA
no code implementations • 1 Aug 2023 • Hongbo Liu, Mingda Wu, Kun Yuan, Ming Sun, Yansong Tang, Chuanchuan Zheng, Xing Wen, Xiu Li
Video quality assessment (VQA) has attracted growing attention in recent years.
1 code implementation • 7 Jul 2023 • Xiao Liu, Guangyi Chen, Yansong Tang, Guangrun Wang, Xiao-Ping Zhang, Ser-Nam Lim
Composing simple elements into complex concepts is crucial yet challenging, especially for 3D action generation.
no code implementations • 3 Jun 2023 • Yiji Cheng, Fei Yin, Xiaoke Huang, Xintong Yu, Jiaxiang Liu, Shikun Feng, Yujiu Yang, Yansong Tang
These elaborated designs enable our model to generate portraits with robust multi-view semantic consistency, eliminating the need for optimization-based methods.
1 code implementation • CVPR 2024 • Changyuan Wang, Ziwei Wang, Xiuwei Xu, Yansong Tang, Jie Zhou, Jiwen Lu
On the contrary, we design group-wise quantization functions for activation discretization in different timesteps and sample the optimal timestep for informative calibration image generation, so that our quantized diffusion model can reduce the discretization errors with negligible computational overhead.
1 code implementation • NeurIPS 2023 • Zhuoyan Luo, Yicheng Xiao, Yong Liu, Shuyan Li, Yitong Wang, Yansong Tang, Xiu Li, Yujiu Yang
To address this issue, we propose Semantic-assisted Object Cluster (SOC), which aggregates video content and textual guidance for unified temporal modeling and cross-modal alignment.
Ranked #2 on Referring Expression Segmentation on A2D Sentences (using extra training data)
1 code implementation • 2 May 2023 • Yuanzheng Ma, Wangting Zhou, Rui Ma, Sihua Yang, Yansong Tang, Xun Guan
To address this challenge, we propose a novel approach that employs a super-resolution PAA method trained with forged PAA images.
1 code implementation • 23 Mar 2023 • Xiaoke Huang, Yiji Cheng, Yansong Tang, Xiu Li, Jie Zhou, Jiwen Lu
Moreover, only minutes of optimization are enough to obtain plausible reconstruction results.
1 code implementation • ICCV 2023 • Kunyang Han, Yong Liu, Jun Hao Liew, Henghui Ding, Yunchao Wei, Jiajun Liu, Yitong Wang, Yansong Tang, Yujiu Yang, Jiashi Feng, Yao Zhao
Recent advancements in pre-trained vision-language models, such as CLIP, have enabled the segmentation of arbitrary concepts solely from textual inputs, a process commonly referred to as open-vocabulary semantic segmentation (OVS).
no code implementations • 11 Mar 2023 • Zhao Yang, Jiaqi Wang, Yansong Tang, Kai Chen, Hengshuang Zhao, Philip H. S. Torr
Referring image segmentation segments out the object in an image referred to by a language expression.
1 code implementation • CVPR 2023 • Yansong Tang, Jinpeng Liu, Aoyang Liu, Bin Yang, Wenxun Dai, Yongming Rao, Jiwen Lu, Jie Zhou, Xiu Li
With its continuously growing popularity around the world, fitness activity analysis has become an emerging research topic in computer vision.
1 code implementation • ICCV 2023 • Ronghui Li, Junfan Zhao, Yachao Zhang, Mingyang Su, Zeping Ren, Han Zhang, Yansong Tang, Xiu Li
To address these problems, we propose FineDance, which contains 14.6 hours of music-dance paired data, with fine-grained hand motions, fine-grained genres (22 dance genres), and accurate posture.
1 code implementation • 11 Oct 2022 • Yong Liu, Ran Yu, Jiahao Wang, Xinyuan Zhao, Yitong Wang, Yansong Tang, Yujiu Yang
Besides, we empirically find that low-frequency features should be enhanced in the encoder (backbone), while high-frequency features should be enhanced in the decoder (segmentation head).
7 code implementations • 28 Jul 2022 • Yongming Rao, Wenliang Zhao, Yansong Tang, Jie Zhou, Ser-Nam Lim, Jiwen Lu
In this paper, we show that the key ingredients behind the vision Transformers, namely input-adaptive, long-range and high-order spatial interactions, can also be efficiently implemented with a convolution-based framework.
Ranked #20 on Semantic Segmentation on ADE20K
1 code implementation • 17 Jul 2022 • Yansong Tang, Xingyu Liu, Xumin Yu, Danyang Zhang, Jiwen Lu, Jie Zhou
Different from the conventional adversarial learning-based approaches for UDA, we utilize a self-supervision scheme to reduce the domain shift between two skeleton-based action datasets.
1 code implementation • 6 Jun 2022 • Wanhua Li, Xiaoke Huang, Zheng Zhu, Yansong Tang, Xiu Li, Jie Zhou, Jiwen Lu
In this paper, we propose to learn the rank concepts from the rich semantic CLIP latent space.
Ranked #1 on Few-shot Age Estimation on MORPH Album2
1 code implementation • CVPR 2022 • Kejie Li, Yansong Tang, Victor Adrian Prisacariu, Philip H. S. Torr
Dense 3D reconstruction from a stream of depth images is the key to many mixed reality and robotic applications.
2 code implementations • 21 Mar 2022 • Rui Yang, Hailong Ma, Jie Wu, Yansong Tang, Xuefeng Xiao, Min Zheng, Xiu Li
The vanilla self-attention mechanism inherently relies on pre-defined and fixed computational dimensions.
no code implementations • CVPR 2022 • Donglai Wei, Siddhant Kharbanda, Sarthak Arora, Roshan Roy, Nishant Jain, Akash Palrecha, Tanav Shah, Shray Mathur, Ritik Mathur, Abhijay Kemkar, Anirudh Chakravarthy, Zudi Lin, Won-Dong Jang, Yansong Tang, Song Bai, James Tompkin, Philip H.S. Torr, Hanspeter Pfister
Many video understanding tasks require analyzing multi-shot videos, but existing datasets for video object segmentation (VOS) only consider single-shot videos.
1 code implementation • CVPR 2022 • Guangrun Wang, Yansong Tang, Liang Lin, Philip H.S. Torr
Inspired by perceptual learning that could use cross-view learning to perceive concepts and semantics, we propose a novel AE that could learn semantic-aware representation via cross-view image reconstruction.
1 code implementation • CVPR 2022 • Zhao Yang, Jiaqi Wang, Yansong Tang, Kai Chen, Hengshuang Zhao, Philip H. S. Torr
Referring image segmentation is a fundamental vision-language task that aims to segment out an object referred to by a natural language expression from an image.
1 code implementation • CVPR 2022 • Yongming Rao, Wenliang Zhao, Guangyi Chen, Yansong Tang, Zheng Zhu, Guan Huang, Jie Zhou, Jiwen Lu
In this work, we present a new framework for dense prediction by implicitly and explicitly leveraging the pre-trained knowledge from CLIP.
no code implementations • British Machine Vision Conference 2021 • Zhao Yang, Yansong Tang, Luca Bertinetto, Hengshuang Zhao, Philip Torr
In this paper, we investigate the problem of video object segmentation from referring expressions (VOSRE).
Ranked #1 on Referring Expression Segmentation on J-HMDB (Precision@0.9 metric)
no code implementations • 19 Jul 2021 • Jiahuan Zhou, Yansong Tang, Bing Su, Ying Wu
We justify that the performance limitation is caused by the gradient vanishing on these sample outliers.
1 code implementation • 12 May 2021 • Yansong Tang, Zhenyu Jiang, Zhenda Xie, Yue Cao, Zheng Zhang, Philip H. S. Torr, Han Hu
Previous cycle-consistency correspondence learning methods usually leverage image patches for training.
1 code implementation • CVPR 2020 • Yansong Tang, Zanlin Ni, Jiahuan Zhou, Danyang Zhang, Jiwen Lu, Ying Wu, Jie Zhou
Assessing action quality from videos has attracted growing attention in recent years.
Ranked #4 on Action Quality Assessment on AQA-7
no code implementations • 20 Mar 2020 • Yansong Tang, Jiwen Lu, Jie Zhou
We believe the introduction of the COIN dataset will promote future in-depth research on instructional video analysis in the community.
no code implementations • CVPR 2019 • Yansong Tang, Dajun Ding, Yongming Rao, Yu Zheng, Danyang Zhang, Lili Zhao, Jiwen Lu, Jie Zhou
There are numerous instructional videos on the Internet, which enable us to acquire knowledge for completing various tasks.
no code implementations • CVPR 2018 • Yansong Tang, Yi Tian, Jiwen Lu, Peiyang Li, Jie Zhou
In this paper, we propose a deep progressive reinforcement learning (DPRL) method for action recognition in skeleton-based videos, which aims to distil the most informative frames and discard ambiguous frames in sequences for recognizing actions.
Ranked #3 on Skeleton Based Action Recognition on UT-Kinect