no code implementations • 19 Mar 2024 • Hanlin Wang, Zhan Tong, Kecheng Zheng, Yujun Shen, LiMin Wang
With video features, text, a character bank, and context information as inputs, the generated ADs are able to refer to characters by name and provide reasonable, contextual descriptions that help the audience understand the storyline of the movie.
1 code implementation • 21 Dec 2023 • Qinying Liu, Wei Wu, Kecheng Zheng, Zhan Tong, Jiawei Liu, Yu Liu, Wei Chen, Zilei Wang, Yujun Shen
The crux of learning vision-language models is to extract semantically aligned information from visual and linguistic data.
1 code implementation • CVPR 2024 • Ziteng Gao, Zhan Tong, Kevin Qinghong Lin, Joya Chen, Mike Zheng Shou
In this paper, we propose to bootstrap SparseFormers from ViT-based vision foundation models in a simple and efficient way.
1 code implementation • 26 Nov 2023 • Chongjian Ge, Xiaohan Ding, Zhan Tong, Li Yuan, Jiangliu Wang, Yibing Song, Ping Luo
The attention map is computed from the mixture of tokens and group proxies and is then used to re-combine the tokens and group proxies in the Value.
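The entry above describes attention computed over a mixture of individual tokens and group proxies. The following is a minimal sketch of that idea, assuming average pooling over fixed-size token groups to form the proxies and a single attention head; the paper's actual design (learned group aggregators, multi-head attention) is more elaborate.

```python
# Sketch only: attention over a mixture of tokens and group proxies.
# Group size and proxy construction (average pooling) are assumptions.
import torch
import torch.nn.functional as F

def group_mix_attention(x, group_size=4):
    # x: (batch, num_tokens, dim); num_tokens divisible by group_size here
    b, n, d = x.shape
    # Form group proxies by average-pooling groups of neighbouring tokens.
    proxies = x.view(b, n // group_size, group_size, d).mean(dim=2)  # (b, n/g, d)
    # Mix individual tokens and group proxies into one sequence.
    mixed = torch.cat([x, proxies], dim=1)                           # (b, n + n/g, d)
    q = k = v = mixed
    attn = F.softmax(q @ k.transpose(-2, -1) / d ** 0.5, dim=-1)
    out = attn @ v                                                   # re-combined tokens + proxies
    # Keep only the token part as the layer output in this toy version.
    return out[:, :n]

tokens = torch.randn(2, 16, 64)
print(group_mix_attention(tokens).shape)  # torch.Size([2, 16, 64])
```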
no code implementations • 25 Sep 2023 • Jiangliu Wang, Jianbo Jiao, Yibing Song, Stephen James, Zhan Tong, Chongjian Ge, Pieter Abbeel, Yun-hui Liu
This work aims to improve unsupervised audio-visual pre-training.
1 code implementation • 23 May 2023 • Ziyun Zeng, Yixiao Ge, Zhan Tong, Xihui Liu, Shu-Tao Xia, Ying Shan
We argue that tuning a text encoder end-to-end, as done in previous work, is suboptimal since it may overfit in terms of styles, thereby losing its original generalization ability to capture the semantics of various language registers.
2 code implementations • ICCV 2023 • Lei Chen, Zhan Tong, Yibing Song, Gangshan Wu, LiMin Wang
Our EVAD consists of two specialized designs for video action detection.
1 code implementation • 7 Apr 2023 • Ziteng Gao, Zhan Tong, LiMin Wang, Mike Zheng Shou
In this paper, we challenge this dense paradigm and present a new method, coined SparseFormer, to imitate the sparse visual recognition of humans in an end-to-end manner.
Sparse Representation-based Classification · Video Classification
no code implementations • 30 Mar 2023 • Chongjian Ge, Jiangliu Wang, Zhan Tong, Shoufa Chen, Yibing Song, Ping Luo
We evaluate our soft neighbor contrastive learning method (SNCLR) on standard visual recognition benchmarks, including image classification, object detection, and instance segmentation.
1 code implementation • CVPR 2023 • LiMin Wang, Bingkun Huang, Zhiyu Zhao, Zhan Tong, Yinan He, Yi Wang, Yali Wang, Yu Qiao
Finally, we successfully train a video ViT model with a billion parameters, which achieves a new state-of-the-art performance on the datasets of Kinetics (90.0% on K400 and 89.9% on K600) and Something-Something (68.7% on V1 and 77.0% on V2).
Ranked #1 on Self-Supervised Action Recognition on UCF101 (using extra training data)
no code implementations • 28 Mar 2023 • Lei Chen, Zhan Tong, Yibing Song, Gangshan Wu, LiMin Wang
Existing studies model the relations between each actor and the scene to improve action recognition.
2 code implementations • 26 May 2022 • Shoufa Chen, Chongjian Ge, Zhan Tong, Jiangliu Wang, Yibing Song, Jue Wang, Ping Luo
To address this challenge, we propose an effective adaptation approach for Transformers, namely AdaptFormer, which can efficiently adapt pre-trained ViTs to many different image and video tasks.
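The adaptation idea can be sketched as a small bottleneck module trained alongside a frozen pre-trained backbone. The hidden size, scaling factor, and parallel placement below are illustrative assumptions, not the paper's exact configuration.

```python
# Sketch only: a lightweight bottleneck adapter trained while the
# pre-trained branch stays frozen. All hyperparameters are placeholders.
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    def __init__(self, dim, bottleneck=64, scale=0.1):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        self.act = nn.ReLU()
        self.scale = scale

    def forward(self, x):
        return self.scale * self.up(self.act(self.down(x)))

dim = 768
mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
for p in mlp.parameters():          # the pre-trained branch stays frozen
    p.requires_grad = False
adapter = BottleneckAdapter(dim)    # only these few parameters are tuned

x = torch.randn(2, 196, dim)
out = x + mlp(x) + adapter(x)       # adapter runs in parallel with the frozen MLP
print(out.shape)                    # torch.Size([2, 196, 768])
```

The appeal of this kind of design is that the number of trainable parameters grows with the bottleneck width, not with the backbone size.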
7 code implementations • 23 Mar 2022 • Zhan Tong, Yibing Song, Jue Wang, LiMin Wang
Pre-training video transformers on extra large-scale datasets is generally required to achieve premier performance on relatively small datasets.
Ranked #3 on Self-Supervised Action Recognition Linear on HMDB51
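A common recipe in this line of self-supervised video pre-training is masked autoencoding with tube-style masking, where the same spatial patches are dropped in every frame at a very high ratio. The toy illustration below shows only the masking step; the masking ratio and patch counts are assumptions, not the paper's released configuration.

```python
# Sketch only: tube masking for masked video pre-training.
import torch

def tube_mask(num_frames, num_patches_per_frame, mask_ratio=0.9):
    num_masked = int(num_patches_per_frame * mask_ratio)
    # Choose the masked spatial positions once ...
    masked = torch.randperm(num_patches_per_frame)[:num_masked]
    mask = torch.zeros(num_patches_per_frame, dtype=torch.bool)
    mask[masked] = True
    # ... and repeat them along time, so masked "tubes" span all frames.
    return mask.unsqueeze(0).expand(num_frames, -1)       # (frames, patches)

mask = tube_mask(num_frames=8, num_patches_per_frame=196, mask_ratio=0.9)
print(mask.shape, mask.float().mean().item())             # ~0.9 of patches masked
```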
2 code implementations • 16 Feb 2022 • Youwei Liang, Chongjian Ge, Zhan Tong, Yibing Song, Jue Wang, Pengtao Xie
Second, while keeping the computational cost unchanged, our method empowers ViTs to take more image tokens as input to improve recognition accuracy, where the image tokens come from higher-resolution images.
Ranked #4 on Efficient ViTs on ImageNet-1K (with DeiT-S)
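One way to keep the computational cost constant while admitting more input tokens is to retain only the most informative image tokens and fuse the rest. The simplified sketch below assumes selection by the attention the class token pays to each image token and a weighted fusion of the discarded ones; the keep ratio is a placeholder.

```python
# Sketch only: keep attentive tokens, fuse the inattentive ones into one token.
import torch

def reorganize_tokens(tokens, cls_attn, keep_ratio=0.7):
    # tokens: (batch, n, dim) image tokens; cls_attn: (batch, n) attention from [CLS]
    b, n, d = tokens.shape
    k = int(n * keep_ratio)
    idx = cls_attn.topk(k, dim=1).indices                      # attentive tokens to keep
    keep = torch.gather(tokens, 1, idx.unsqueeze(-1).expand(-1, -1, d))
    # Fuse the remaining tokens into one, weighted by their attention scores.
    mask = torch.ones(b, n, dtype=torch.bool)
    mask.scatter_(1, idx, False)
    rest_attn = (cls_attn * mask).unsqueeze(-1)
    fused = (tokens * rest_attn).sum(1, keepdim=True) / rest_attn.sum(1, keepdim=True).clamp(min=1e-6)
    return torch.cat([keep, fused], dim=1)                     # (batch, k + 1, dim)

tokens, cls_attn = torch.randn(2, 196, 768), torch.rand(2, 196)
print(reorganize_tokens(tokens, cls_attn).shape)               # torch.Size([2, 138, 768])
```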
1 code implementation • ICLR 2022 • Youwei Liang, Chongjian Ge, Zhan Tong, Yibing Song, Jue Wang, Pengtao Xie
Second, while keeping the computational cost unchanged, our method empowers ViTs to take more image tokens as input to improve recognition accuracy, where the image tokens come from higher-resolution images.
1 code implementation • ICCV 2021 • Yuan Zhi, Zhan Tong, LiMin Wang, Gangshan Wu
First, we present two different motion representations that enable us to efficiently distinguish motion-salient frames from the background.
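A rough sketch of motion-guided frame sampling consistent with the description above: frames are scored by how much they differ from their neighbours, and sampling follows the cumulative motion distribution rather than uniform time steps. The scoring and sampling details here are assumptions for illustration, not the paper's exact formulation.

```python
# Sketch only: sample frames according to a cumulative motion distribution.
import torch

def motion_guided_sample(video, num_out=8):
    # video: (T, C, H, W)
    diff = (video[1:] - video[:-1]).abs().flatten(1).sum(1)   # per-frame motion magnitude
    diff = torch.cat([diff[:1], diff])                        # pad so every frame has a score
    cdf = diff.cumsum(0) / diff.sum()                         # cumulative motion distribution
    targets = torch.linspace(0, 1, num_out)                   # evenly spaced in "motion space"
    idx = torch.searchsorted(cdf, targets).clamp(max=len(cdf) - 1)
    return idx

video = torch.rand(64, 3, 32, 32)
print(motion_guided_sample(video))   # indices concentrate where motion is large
```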
no code implementations • 1 Jan 2021 • LiMin Wang, Bin Ji, Zhan Tong, Gangshan Wu
To mitigate this issue, this paper presents a new video architecture, termed Temporal Difference Network (TDN), with a focus on capturing multi-scale temporal information for efficient action recognition.
1 code implementation • CVPR 2021 • LiMin Wang, Zhan Tong, Bin Ji, Gangshan Wu
To mitigate this issue, this paper presents a new video architecture, termed Temporal Difference Network (TDN), with a focus on capturing multi-scale temporal information for efficient action recognition.
Ranked #18 on Action Recognition on Something-Something V1
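The temporal-difference idea can be illustrated with a small module that combines appearance features with features computed from explicit frame differences as an efficient motion cue. The fusion below is a placeholder for illustration, not the paper's multi-scale short-term/long-term design.

```python
# Sketch only: frame differences as a cheap motion cue next to appearance features.
import torch
import torch.nn as nn

class TemporalDifferenceStem(nn.Module):
    def __init__(self, channels=3, out_channels=16):
        super().__init__()
        self.appearance = nn.Conv2d(channels, out_channels, 3, padding=1)
        self.motion = nn.Conv2d(channels, out_channels, 3, padding=1)

    def forward(self, clip):
        # clip: (batch, T, C, H, W)
        diff = clip[:, 1:] - clip[:, :-1]                 # short-term temporal differences
        diff = torch.cat([diff, diff[:, -1:]], dim=1)     # keep the temporal length
        b, t, c, h, w = clip.shape
        app = self.appearance(clip.reshape(b * t, c, h, w))
        mot = self.motion(diff.reshape(b * t, c, h, w))
        return (app + mot).reshape(b, t, -1, h, w)        # appearance + motion features

clip = torch.rand(2, 8, 3, 56, 56)
print(TemporalDifferenceStem()(clip).shape)               # torch.Size([2, 8, 16, 56, 56])
```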