1 code implementation • 1 Jun 2023 • Chaitanya Ryali, Yuan-Ting Hu, Daniel Bolya, Chen Wei, Haoqi Fan, Po-Yao Huang, Vaibhav Aggarwal, Arkabandhu Chowdhury, Omid Poursaeed, Judy Hoffman, Jitendra Malik, Yanghao Li, Christoph Feichtenhofer
Modern hierarchical vision transformers have added several vision-specific components in the pursuit of supervised classification performance.
Ranked #1 on Image Classification on Places365-Standard (Top-1 Accuracy metric)
no code implementations • 6 Apr 2023 • Chen Wei, Karttikeya Mangalam, Po-Yao Huang, Yanghao Li, Haoqi Fan, Hu Xu, Huiyu Wang, Cihang Xie, Alan Yuille, Christoph Feichtenhofer
There has been a longstanding belief that generation can facilitate a true understanding of visual data.
1 code implementation • CVPR 2023 • Yubin Hu, Yuze He, Yanghao Li, Jisheng Li, Yuxing Han, Jiangtao Wen, Yong-Jin Liu
In this paper, we propose an altering resolution framework called AR-Seg for compressed videos to achieve efficient VSS.
4 code implementations • CVPR 2022 • Karttikeya Mangalam, Haoqi Fan, Yanghao Li, Chao-yuan Wu, Bo Xiong, Christoph Feichtenhofer, Jitendra Malik
Reversible Vision Transformers achieve a reduced memory footprint of up to 15. 5x at roughly identical model complexity, parameters and accuracy, demonstrating the promise of reversible vision transformers as an efficient backbone for hardware resource limited training regimes.
no code implementations • 15 Dec 2022 • Po-Yao Huang, Vasu Sharma, Hu Xu, Chaitanya Ryali, Haoqi Fan, Yanghao Li, Shang-Wen Li, Gargi Ghosh, Jitendra Malik, Christoph Feichtenhofer
We present Masked Audio-Video Learners (MAViL) to train audio-visual representations.
Ranked #1 on Audio Classification on VGGSound (using extra training data)
4 code implementations • CVPR 2023 • Yanghao Li, Haoqi Fan, Ronghang Hu, Christoph Feichtenhofer, Kaiming He
We present Fast Language-Image Pre-training (FLIP), a simple and more efficient method for training CLIP.
1 code implementation • CVPR 2023 • Mengmeng Xu, Yanghao Li, Cheng-Yang Fu, Bernard Ghanem, Tao Xiang, Juan-Manuel Perez-Rua
Our experiments show the proposed adaptations improve egocentric query detection, leading to a better visual query localization system in both 2D and 3D configurations.
1 code implementation • 3 Aug 2022 • Mengmeng Xu, Cheng-Yang Fu, Yanghao Li, Bernard Ghanem, Juan-Manuel Perez-Rua, Tao Xiang
The repeated gradient computation of the same object lead to an inefficient training; (2) The false positive rate is high on background frames.
2 code implementations • 18 May 2022 • Christoph Feichtenhofer, Haoqi Fan, Yanghao Li, Kaiming He
We randomly mask out spacetime patches in videos and learn an autoencoder to reconstruct them in pixels.
5 code implementations • 30 Mar 2022 • Yanghao Li, Hanzi Mao, Ross Girshick, Kaiming He
This design enables the original ViT architecture to be fine-tuned for object detection without needing to redesign a hierarchical backbone for pre-training.
Ranked #2 on Instance Segmentation on LVIS v1.0 val
1 code implementation • CVPR 2022 • Chao-yuan Wu, Yanghao Li, Karttikeya Mangalam, Haoqi Fan, Bo Xiong, Jitendra Malik, Christoph Feichtenhofer
Instead of trying to process more frames at once like most existing methods, we propose to process videos in an online fashion and cache "memory" at each iteration.
Ranked #2 on Action Anticipation on EPIC-KITCHENS-100 (using extra training data)
6 code implementations • CVPR 2022 • Yanghao Li, Chao-yuan Wu, Haoqi Fan, Karttikeya Mangalam, Bo Xiong, Jitendra Malik, Christoph Feichtenhofer
In this paper, we study Multiscale Vision Transformers (MViTv2) as a unified architecture for image and video classification, as well as object detection.
Ranked #1 on Action Classification on Kinetics-600 (GFLOPs metric)
2 code implementations • 22 Nov 2021 • Yanghao Li, Saining Xie, Xinlei Chen, Piotr Dollar, Kaiming He, Ross Girshick
The complexity of object detection methods can make this benchmarking non-trivial when new architectures, such as Vision Transformer (ViT) models, arrive.
1 code implementation • 18 Nov 2021 • Haoqi Fan, Tullie Murrell, Heng Wang, Kalyan Vasudev Alwala, Yanghao Li, Yilei Li, Bo Xiong, Nikhila Ravi, Meng Li, Haichuan Yang, Jitendra Malik, Ross Girshick, Matt Feiszli, Aaron Adcock, Wan-Yen Lo, Christoph Feichtenhofer
We introduce PyTorchVideo, an open-source deep-learning library that provides a rich set of modular, efficient, and reproducible components for a variety of video understanding tasks, including classification, detection, self-supervised learning, and low-level processing.
38 code implementations • CVPR 2022 • Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, Ross Girshick
Our MAE approach is simple: we mask random patches of the input image and reconstruct the missing pixels.
Ranked #1 on Out-of-Distribution Generalization on ImageNet-W
3 code implementations • CVPR 2022 • Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu, Xingyu Liu, Miguel Martin, Tushar Nagarajan, Ilija Radosavovic, Santhosh Kumar Ramakrishnan, Fiona Ryan, Jayant Sharma, Michael Wray, Mengmeng Xu, Eric Zhongcong Xu, Chen Zhao, Siddhant Bansal, Dhruv Batra, Vincent Cartillier, Sean Crane, Tien Do, Morrie Doulaty, Akshay Erapalli, Christoph Feichtenhofer, Adriano Fragomeni, Qichen Fu, Abrham Gebreselasie, Cristina Gonzalez, James Hillis, Xuhua Huang, Yifei HUANG, Wenqi Jia, Weslie Khoo, Jachym Kolar, Satwik Kottur, Anurag Kumar, Federico Landini, Chao Li, Yanghao Li, Zhenqiang Li, Karttikeya Mangalam, Raghava Modhugu, Jonathan Munro, Tullie Murrell, Takumi Nishiyasu, Will Price, Paola Ruiz Puentes, Merey Ramazanova, Leda Sari, Kiran Somasundaram, Audrey Southerland, Yusuke Sugano, Ruijie Tao, Minh Vo, Yuchen Wang, Xindi Wu, Takuma Yagi, Ziwei Zhao, Yunyi Zhu, Pablo Arbelaez, David Crandall, Dima Damen, Giovanni Maria Farinella, Christian Fuegen, Bernard Ghanem, Vamsi Krishna Ithapu, C. V. Jawahar, Hanbyul Joo, Kris Kitani, Haizhou Li, Richard Newcombe, Aude Oliva, Hyun Soo Park, James M. Rehg, Yoichi Sato, Jianbo Shi, Mike Zheng Shou, Antonio Torralba, Lorenzo Torresani, Mingfei Yan, Jitendra Malik
We introduce Ego4D, a massive-scale egocentric video dataset and benchmark suite.
6 code implementations • ICCV 2021 • Haoqi Fan, Bo Xiong, Karttikeya Mangalam, Yanghao Li, Zhicheng Yan, Jitendra Malik, Christoph Feichtenhofer
We evaluate this fundamental architectural prior for modeling the dense nature of visual signals for a variety of video recognition tasks where it outperforms concurrent vision transformers that rely on large scale external pre-training and are 5-10x more costly in computation and parameters.
Ranked #13 on Action Classification on Charades
1 code implementation • CVPR 2021 • Yanghao Li, Tushar Nagarajan, Bo Xiong, Kristen Grauman
We introduce an approach for pre-training egocentric video models using large-scale third-person video datasets.
no code implementations • 7 Jul 2020 • Yanghao Li, Bichuan Guo, Jiangtao Wen, Zhen Xia, Shan Liu, Yuxing Han
Denoisers trained with synthetic data often fail to cope with the diversity of unknown noises, giving way to methods that can adapt to existing noise without knowing its ground truth.
no code implementations • 31 Jan 2020 • Sijie Song, Jiaying Liu, Yanghao Li, Zongming Guo
In this work, we propose a Modality Compensation Network (MCN) to explore the relationships of different modalities, and boost the representations for human action recognition.
1 code implementation • CVPR 2020 • Tushar Nagarajan, Yanghao Li, Christoph Feichtenhofer, Kristen Grauman
We introduce a model for environment affordances that is learned directly from egocentric video.
1 code implementation • 14 Mar 2019 • Yuntao Chen, Chenxia Han, Yanghao Li, Zehao Huang, Yi Jiang, Naiyan Wang, Zhao-Xiang Zhang
A Simple and Versatile Framework for Object Detection and Instance Recognition
4 code implementations • ICCV 2019 • Yanghao Li, Yuntao Chen, Naiyan Wang, Zhao-Xiang Zhang
In this work, we first present a controlled experiment to investigate the effect of receptive fields for scale variation in object detection.
Ranked #103 on Object Detection on COCO test-dev
no code implementations • 25 Nov 2018 • Yanghao Li, Sijie Song, Yuqi Li, Jiaying Liu
Temporal modeling in videos is a fundamental yet challenging problem in computer vision.
no code implementations • 22 Mar 2017 • Chunhui Liu, Yueyu Hu, Yanghao Li, Sijie Song, Jiaying Liu
Despite the fact that many 3D human activity benchmarks being proposed, most existing action datasets focus on the action recognition tasks for the segmented videos.
3 code implementations • 4 Jan 2017 • Yanghao Li, Naiyan Wang, Jiaying Liu, Xiaodi Hou
Neural Style Transfer has recently demonstrated very exciting results which catches eyes in both academia and industry.
1 code implementation • ICCV 2017 • Yanghao Li, Naiyan Wang, Jiaying Liu, Xiaodi Hou
Although Deep Convolutional Neural Networks (CNNs) have liberated their power in various computer vision tasks, the most important components of CNN, convolutional layers and fully connected layers, are still limited to linear transformations.
1 code implementation • 19 Apr 2016 • Yanghao Li, Cuiling Lan, Junliang Xing, Wen-Jun Zeng, Chunfeng Yuan, Jiaying Liu
In this paper, we study the problem of online action detection from streaming skeleton data.
no code implementations • 24 Mar 2016 • Wentao Zhu, Cuiling Lan, Junliang Xing, Wen-Jun Zeng, Yanghao Li, Li Shen, Xiaohui Xie
Skeleton based action recognition distinguishes human actions using the trajectories of skeleton joints, which provide a very good representation for describing actions.
1 code implementation • 15 Mar 2016 • Yanghao Li, Naiyan Wang, Jianping Shi, Jiaying Liu, Xiaodi Hou
However, it is still a common annoyance during the training phase, that one has to prepare at least thousands of labeled images to fine-tune a network to a specific domain.