no code implementations • 9 Nov 2023 • Daniel Bolya, Chaitanya Ryali, Judy Hoffman, Christoph Feichtenhofer
To fix it, we introduce a simple absolute window position embedding strategy, which solves the bug outright in Hiera and allows us to increase both speed and performance of the model in ViTDet.
2 code implementations • 28 Sep 2023 • Hu Xu, Saining Xie, Xiaoqing Ellen Tan, Po-Yao Huang, Russell Howes, Vasu Sharma, Shang-Wen Li, Gargi Ghosh, Luke Zettlemoyer, Christoph Feichtenhofer
We believe that the main ingredient to the success of CLIP is its data and not the model architecture or pre-training objective.
3 code implementations • 1 Jun 2023 • Chaitanya Ryali, Yuan-Ting Hu, Daniel Bolya, Chen Wei, Haoqi Fan, Po-Yao Huang, Vaibhav Aggarwal, Arkabandhu Chowdhury, Omid Poursaeed, Judy Hoffman, Jitendra Malik, Yanghao Li, Christoph Feichtenhofer
Modern hierarchical vision transformers have added several vision-specific components in the pursuit of supervised classification performance.
Ranked #1 on Image Classification on iNaturalist 2019 (using extra training data)
no code implementations • ICCV 2023 • Chen Wei, Karttikeya Mangalam, Po-Yao Huang, Yanghao Li, Haoqi Fan, Hu Xu, Huiyu Wang, Cihang Xie, Alan Yuille, Christoph Feichtenhofer
There has been a longstanding belief that generation can facilitate a true understanding of visual data.
1 code implementation • CVPR 2023 • Jathushan Rajasegaran, Georgios Pavlakos, Angjoo Kanazawa, Christoph Feichtenhofer, Jitendra Malik
Subsequently, we propose a Lagrangian Action Recognition model by fusing 3D pose and contextualized appearance over tracklets.
Ranked #1 on Action Recognition on AVA v2.2 (using extra training data)
1 code implementation • ICCV 2023 • Mannat Singh, Quentin Duval, Kalyan Vasudev Alwala, Haoqi Fan, Vaibhav Aggarwal, Aaron Adcock, Armand Joulin, Piotr Dollár, Christoph Feichtenhofer, Ross Girshick, Rohit Girdhar, Ishan Misra
While MAE has only been shown to scale with the size of models, we find that it scales with the size of the training dataset as well.
Ranked #1 on Few-Shot Image Classification on ImageNet - 10-shot (using extra training data)
4 code implementations • CVPR 2022 • Karttikeya Mangalam, Haoqi Fan, Yanghao Li, Chao-yuan Wu, Bo Xiong, Christoph Feichtenhofer, Jitendra Malik
Reversible Vision Transformers achieve a reduced memory footprint of up to 15.5x at roughly identical model complexity, parameters and accuracy, demonstrating the promise of reversible vision transformers as an efficient backbone for hardware resource limited training regimes.
1 code implementation • CVPR 2023 • Chao-yuan Wu, Justin Johnson, Jitendra Malik, Christoph Feichtenhofer, Georgia Gkioxari
We introduce a simple framework that operates on 3D points of single objects or whole scenes coupled with category-agnostic large-scale training from diverse RGB-D videos.
1 code implementation • ICCV 2023 • Hu Xu, Saining Xie, Po-Yao Huang, Licheng Yu, Russell Howes, Gargi Ghosh, Luke Zettlemoyer, Christoph Feichtenhofer
Large vision-language models are generally applicable to many downstream tasks, but come at an exorbitant training cost that only large institutions can afford.
1 code implementation • NeurIPS 2023 • Po-Yao Huang, Vasu Sharma, Hu Xu, Chaitanya Ryali, Haoqi Fan, Yanghao Li, Shang-Wen Li, Gargi Ghosh, Jitendra Malik, Christoph Feichtenhofer
We present Masked Audio-Video Learners (MAViL) to train audio-visual representations.
5 code implementations • CVPR 2023 • Yanghao Li, Haoqi Fan, Ronghang Hu, Christoph Feichtenhofer, Kaiming He
We present Fast Language-Image Pre-training (FLIP), a simple and more efficient method for training CLIP.
3 code implementations • 17 Oct 2022 • Daniel Bolya, Cheng-Yang Fu, Xiaoliang Dai, Peizhao Zhang, Christoph Feichtenhofer, Judy Hoffman
Off-the-shelf, ToMe can 2x the throughput of state-of-the-art ViT-L @ 512 and ViT-H @ 518 models on images and 2.2x the throughput of ViT-L on video, with only a 0.2-0.3% accuracy drop in each case.
Ranked #13 on Efficient ViTs on ImageNet-1K (with DeiT-S)
4 code implementations • 13 Jul 2022 • Po-Yao Huang, Hu Xu, Juncheng Li, Alexei Baevski, Michael Auli, Wojciech Galuba, Florian Metze, Christoph Feichtenhofer
Following the Transformer encoder-decoder design in MAE, our Audio-MAE first encodes audio spectrogram patches with a high masking ratio, feeding only the non-masked tokens through encoder layers.
Ranked #2 on Speaker Identification on VoxCeleb1 (using extra training data)
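The masking step described above can be sketched in a few lines: embed the spectrogram patches, drop a large random fraction, and feed only the survivors to the encoder. The shapes, names, and the exact 80% ratio below are illustrative stand-ins, not the paper's code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: a log-mel spectrogram already split into patches and
# embedded as a (num_patches, dim) sequence; all sizes are illustrative.
num_patches, dim = 512, 768
patch_tokens = rng.standard_normal((num_patches, dim))

def visible_tokens(tokens, mask_ratio=0.8, rng=rng):
    """Keep a random (1 - mask_ratio) fraction of patch tokens; only these
    visible tokens are fed through the encoder layers."""
    n = tokens.shape[0]
    n_keep = int(n * (1.0 - mask_ratio))
    keep = np.sort(rng.permutation(n)[:n_keep])  # visible indices, in order
    return tokens[keep], keep

visible, keep_idx = visible_tokens(patch_tokens, mask_ratio=0.8)
print(visible.shape)  # the encoder only sees ~20% of the patch sequence
```

Because the encoder never touches masked tokens, its cost scales with the visible fraction rather than the full sequence length.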
3 code implementations • 18 May 2022 • Christoph Feichtenhofer, Haoqi Fan, Yanghao Li, Kaiming He
We randomly mask out spacetime patches in videos and learn an autoencoder to reconstruct them in pixels.
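A minimal numpy sketch of that recipe: cut a video volume into spacetime patches, mask most of them, and score a reconstruction against raw pixels on the masked set. The patch sizes are made up, and an all-zero array stands in for the decoder output.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy video volume (frames, height, width); real inputs also carry a channel
# axis, dropped here to keep the sketch short. All sizes are illustrative.
T, H, W = 8, 16, 16
pt, ph, pw = 2, 4, 4  # spacetime patch shape

video = rng.standard_normal((T, H, W))

# Cut the video into non-overlapping spacetime patches, flattened to
# (num_patches, pixels_per_patch).
patches = (video.reshape(T // pt, pt, H // ph, ph, W // pw, pw)
                .transpose(0, 2, 4, 1, 3, 5)
                .reshape(-1, pt * ph * pw))

# Random 90% spacetime mask; the autoencoder is trained to reconstruct the
# masked patches in raw pixels (a dummy all-zero "reconstruction" stands in
# for the decoder output here).
mask = rng.random(patches.shape[0]) < 0.9
recon = np.zeros_like(patches)
loss = np.mean((recon[mask] - patches[mask]) ** 2)
print(patches.shape, int(mask.sum()))
```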
1 code implementation • CVPR 2022 • Chao-yuan Wu, Yanghao Li, Karttikeya Mangalam, Haoqi Fan, Bo Xiong, Jitendra Malik, Christoph Feichtenhofer
Instead of trying to process more frames at once like most existing methods, we propose to process videos in an online fashion and cache "memory" at each iteration.
Ranked #4 on Action Anticipation on EPIC-KITCHENS-100 (using extra training data)
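The cache-and-reuse idea above can be illustrated with a toy rolling buffer: each iteration consumes one clip's features and may attend to features cached from earlier iterations. `OnlineMemory`, the clip strings, and the buffer size are all hypothetical, not the paper's implementation.

```python
from collections import deque

class OnlineMemory:
    """Bounded cache of past clip features for online video processing."""

    def __init__(self, max_clips=4):
        self.cache = deque(maxlen=max_clips)  # oldest clip evicted first

    def step(self, clip_features):
        # Context available at this iteration: cached past + current clip.
        context = list(self.cache) + [clip_features]
        self.cache.append(clip_features)      # cache for future iterations
        return context

mem = OnlineMemory(max_clips=2)
for t in range(4):
    ctx = mem.step(f"clip{t}")
print(ctx)  # → ['clip1', 'clip2', 'clip3']
```

The bounded buffer is what keeps per-step cost constant no matter how long the video is: old context is evicted instead of reprocessed.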
46 code implementations • CVPR 2022 • Zhuang Liu, Hanzi Mao, Chao-yuan Wu, Christoph Feichtenhofer, Trevor Darrell, Saining Xie
The "Roaring 20s" of visual recognition began with the introduction of Vision Transformers (ViTs), which quickly superseded ConvNets as the state-of-the-art image classification model.
Ranked #1 on Classification on InDL
5 code implementations • CVPR 2022 • Chen Wei, Haoqi Fan, Saining Xie, Chao-yuan Wu, Alan Yuille, Christoph Feichtenhofer
We present Masked Feature Prediction (MaskFeat) for self-supervised pre-training of video models.
Ranked #8 on Action Recognition on AVA v2.2 (using extra training data)
7 code implementations • CVPR 2022 • Yanghao Li, Chao-yuan Wu, Haoqi Fan, Karttikeya Mangalam, Bo Xiong, Jitendra Malik, Christoph Feichtenhofer
In this paper, we study Multiscale Vision Transformers (MViTv2) as a unified architecture for image and video classification, as well as object detection.
Ranked #1 on Action Classification on Kinetics-600 (GFLOPs metric)
1 code implementation • 18 Nov 2021 • Haoqi Fan, Tullie Murrell, Heng Wang, Kalyan Vasudev Alwala, Yanghao Li, Yilei Li, Bo Xiong, Nikhila Ravi, Meng Li, Haichuan Yang, Jitendra Malik, Ross Girshick, Matt Feiszli, Aaron Adcock, Wan-Yen Lo, Christoph Feichtenhofer
We introduce PyTorchVideo, an open-source deep-learning library that provides a rich set of modular, efficient, and reproducible components for a variety of video understanding tasks, including classification, detection, self-supervised learning, and low-level processing.
7 code implementations • CVPR 2022 • Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu, Xingyu Liu, Miguel Martin, Tushar Nagarajan, Ilija Radosavovic, Santhosh Kumar Ramakrishnan, Fiona Ryan, Jayant Sharma, Michael Wray, Mengmeng Xu, Eric Zhongcong Xu, Chen Zhao, Siddhant Bansal, Dhruv Batra, Vincent Cartillier, Sean Crane, Tien Do, Morrie Doulaty, Akshay Erapalli, Christoph Feichtenhofer, Adriano Fragomeni, Qichen Fu, Abrham Gebreselasie, Cristina Gonzalez, James Hillis, Xuhua Huang, Yifei HUANG, Wenqi Jia, Weslie Khoo, Jachym Kolar, Satwik Kottur, Anurag Kumar, Federico Landini, Chao Li, Yanghao Li, Zhenqiang Li, Karttikeya Mangalam, Raghava Modhugu, Jonathan Munro, Tullie Murrell, Takumi Nishiyasu, Will Price, Paola Ruiz Puentes, Merey Ramazanova, Leda Sari, Kiran Somasundaram, Audrey Southerland, Yusuke Sugano, Ruijie Tao, Minh Vo, Yuchen Wang, Xindi Wu, Takuma Yagi, Ziwei Zhao, Yunyi Zhu, Pablo Arbelaez, David Crandall, Dima Damen, Giovanni Maria Farinella, Christian Fuegen, Bernard Ghanem, Vamsi Krishna Ithapu, C. V. Jawahar, Hanbyul Joo, Kris Kitani, Haizhou Li, Richard Newcombe, Aude Oliva, Hyun Soo Park, James M. Rehg, Yoichi Sato, Jianbo Shi, Mike Zheng Shou, Antonio Torralba, Lorenzo Torresani, Mingfei Yan, Jitendra Malik
We introduce Ego4D, a massive-scale egocentric video dataset and benchmark suite.
2 code implementations • EMNLP 2021 • Hu Xu, Gargi Ghosh, Po-Yao Huang, Dmytro Okhonko, Armen Aghajanyan, Florian Metze, Luke Zettlemoyer, Christoph Feichtenhofer
We present VideoCLIP, a contrastive approach to pre-train a unified model for zero-shot video and text understanding, without using any labels on downstream tasks.
Ranked #1 on Temporal Action Localization on CrossTask (using extra training data)
2 code implementations • NeurIPS 2021 • Mandela Patrick, Dylan Campbell, Yuki M. Asano, Ishan Misra, Florian Metze, Christoph Feichtenhofer, Andrea Vedaldi, João F. Henriques
In video transformers, the time dimension is often treated in the same way as the two spatial dimensions.
Ranked #15 on Action Recognition on EPIC-KITCHENS-100 (using extra training data)
1 code implementation • Findings (ACL) 2021 • Hu Xu, Gargi Ghosh, Po-Yao Huang, Prahal Arora, Masoumeh Aminzadeh, Christoph Feichtenhofer, Florian Metze, Luke Zettlemoyer
We present a simplified, task-agnostic multi-modal pre-training approach that can accept either video or text input, or both for a variety of end tasks.
Ranked #2 on Temporal Action Localization on CrossTask (using extra training data)
2 code implementations • CVPR 2021 • Christoph Feichtenhofer, Haoqi Fan, Bo Xiong, Ross Girshick, Kaiming He
We present a large-scale study on unsupervised spatiotemporal representation learning from videos.
Ranked #3 on Self-Supervised Action Recognition on HMDB51
7 code implementations • ICCV 2021 • Haoqi Fan, Bo Xiong, Karttikeya Mangalam, Yanghao Li, Zhicheng Yan, Jitendra Malik, Christoph Feichtenhofer
We evaluate this fundamental architectural prior for modeling the dense nature of visual signals on a variety of video recognition tasks, where it outperforms concurrent vision transformers that rely on large-scale external pre-training and are 5-10x more costly in computation and parameters.
Ranked #14 on Action Classification on Charades
no code implementations • ICCV 2021 • Bo Xiong, Haoqi Fan, Kristen Grauman, Christoph Feichtenhofer
We present a multiview pseudo-labeling approach to video learning, a novel framework that uses complementary views in the form of appearance and motion information for semi-supervised learning in video.
2 code implementations • CVPR 2022 • Tim Meinhardt, Alexander Kirillov, Laura Leal-Taixe, Christoph Feichtenhofer
The challenging task of multi-object tracking (MOT) requires simultaneous reasoning about track initialization, identity, and spatio-temporal trajectories.
Ranked #1 on Multi-Object Tracking on MOT17 (e2e-MOT metric)
8 code implementations • CVPR 2020 • Christoph Feichtenhofer
This paper presents X3D, a family of efficient video networks that progressively expand a tiny 2D image classification architecture along multiple network axes, in space, time, width and depth.
Ranked #93 on Action Classification on Kinetics-400
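The expand-one-axis-at-a-time idea can be caricatured with a greedy loop under a compute budget. The axes, expansion factors, budget, and the cheapest-first selection rule below are stand-ins: X3D trains and validates each candidate expansion, which this sketch does not do.

```python
# Start from a tiny 2D-image-like base configuration and repeatedly expand
# exactly one axis until no expansion fits the compute budget.
config = dict(frames=1, size=112, width=24, depth=10)
steps = dict(frames=2.0, size=1.3, width=2.0, depth=2.2)  # per-axis factors

def cost(c):
    # Crude compute proxy: temporal x spatial x width x depth.
    return c["frames"] * c["size"] ** 2 * c["width"] * c["depth"]

budget = 1e9
while True:
    candidates = []
    for axis, factor in steps.items():
        c = dict(config)
        c[axis] = int(round(c[axis] * factor))  # expand one axis only
        candidates.append(c)
    affordable = [c for c in candidates if cost(c) <= budget]
    if not affordable:
        break
    # Stand-in selection rule: take the cheapest affordable expansion
    # (the real method picks the best accuracy/compute trade-off).
    config = min(affordable, key=cost)
print(config, cost(config))
```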
1 code implementation • 7 Apr 2020 • Kai Chen, Yuhang Cao, Chen Change Loy, Dahua Lin, Christoph Feichtenhofer
Feature pyramid networks have been widely adopted in the object detection literature to improve feature representations for better handling of variations in scale.
3 code implementations • 23 Jan 2020 • Fanyi Xiao, Yong Jae Lee, Kristen Grauman, Jitendra Malik, Christoph Feichtenhofer
We present Audiovisual SlowFast Networks, an architecture for integrated audiovisual perception.
1 code implementation • CVPR 2020 • Tushar Nagarajan, Yanghao Li, Christoph Feichtenhofer, Kristen Grauman
We introduce a model for environment affordances that is learned directly from egocentric video.
3 code implementations • CVPR 2020 • Chao-yuan Wu, Ross Girshick, Kaiming He, Christoph Feichtenhofer, Philipp Krähenbühl
We empirically demonstrate a general and robust grid schedule that yields a significant out-of-the-box training speedup without a loss in accuracy for different models (I3D, non-local, SlowFast), datasets (Kinetics, Something-Something, Charades), and training settings (with and without pre-training, 128 GPUs or 1 GPU).
Ranked #1 on Video Classification on Kinetics
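A toy grid schedule in that spirit trades batch size against spatiotemporal shape so that per-iteration compute stays constant: early phases use small clips with large batches, the final phase uses the full shape. The concrete factors below are illustrative, not the paper's schedule.

```python
# Base (final-phase) training shape; numbers are illustrative.
base = dict(batch=8, frames=16, size=224)

def grid_schedule(phase):
    # (batch multiplier, temporal scale, spatial scale) per phase, chosen so
    # batch * frames * size^2 is identical in every phase.
    scales = [(8, 0.5, 0.5), (2, 0.5, 1.0), (1, 1.0, 1.0)]
    b, ts, ss = scales[phase]
    return dict(batch=base["batch"] * b,
                frames=max(1, int(base["frames"] * ts)),
                size=int(base["size"] * ss))

for phase in range(3):
    print(grid_schedule(phase))
```

Keeping the per-iteration cost fixed is what makes the speedup "out of the box": early phases see many more clips per unit of compute, while the final phase matches the baseline's shape.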
3 code implementations • NeurIPS 2019 • Gedas Bertasius, Christoph Feichtenhofer, Du Tran, Jianbo Shi, Lorenzo Torresani
To reduce the need for dense annotations, we propose a PoseWarper network that leverages training videos with sparse annotations (every k frames) to learn to perform dense temporal pose propagation and estimation.
Ranked #2 on Multi-Person Pose Estimation on PoseTrack2018 (using extra training data)
1 code implementation • 21 Jan 2019 • Dario Pavllo, Christoph Feichtenhofer, Michael Auli, David Grangier
Previous work on predicting or generating 3D human pose sequences regresses either joint rotations or joint positions.
4 code implementations • CVPR 2019 • Chao-yuan Wu, Christoph Feichtenhofer, Haoqi Fan, Kaiming He, Philipp Krähenbühl, Ross Girshick
To understand the world, we humans constantly need to relate the present to the past, and put events in context.
Ranked #4 on Egocentric Activity Recognition on EPIC-KITCHENS-55
no code implementations • 11 Dec 2018 • Gedas Bertasius, Christoph Feichtenhofer, Du Tran, Jianbo Shi, Lorenzo Torresani
Our network learns to spatially sample features from Frame B in order to maximize pose detection accuracy in Frame A.
1 code implementation • ICCV 2019 • Tushar Nagarajan, Christoph Feichtenhofer, Kristen Grauman
Learning how to interact with objects is an important step towards embodied visual intelligence, but existing techniques suffer from heavy supervision or sensing requirements.
Ranked #3 on Video-to-image Affordance Grounding on EPIC-Hotspot
15 code implementations • ICCV 2019 • Christoph Feichtenhofer, Haoqi Fan, Jitendra Malik, Kaiming He
We present SlowFast networks for video recognition.
Ranked #4 on Action Recognition on AVA v2.1
10 code implementations • CVPR 2019 • Dario Pavllo, Christoph Feichtenhofer, David Grangier, Michael Auli
We start with predicted 2D keypoints for unlabeled video, then estimate 3D poses and finally back-project to the input 2D keypoints.
Ranked #13 on Weakly-supervised 3D Human Pose Estimation on Human3.6M (Number of Frames Per View metric)
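That back-projection step can be sketched with a simple pinhole camera: project the predicted 3D pose to 2D and penalize its distance from the input 2D keypoints. The intrinsics and the 3D pose below are fabricated for illustration; here the 2D input is constructed to be perfectly consistent, so the loss is zero.

```python
import numpy as np

f, cx, cy = 1000.0, 500.0, 500.0  # made-up pinhole intrinsics

def project(points_3d):
    """Project (N, 3) camera-space points to (N, 2) pixel coordinates."""
    x, y, z = points_3d[:, 0], points_3d[:, 1], points_3d[:, 2]
    return np.stack([f * x / z + cx, f * y / z + cy], axis=1)

# Pretend output of the 3D lifting network (two joints, camera space).
pose_3d = np.array([[0.0, 0.0, 4.0],
                    [0.2, -0.1, 4.0]])

# Input 2D keypoints for the unlabeled video (here: exactly consistent).
keypoints_2d = project(pose_3d)

# Semi-supervised signal: reprojected 3D pose vs. the input 2D keypoints.
loss = np.mean(np.sum((project(pose_3d) - keypoints_2d) ** 2, axis=1))
print(loss)  # → 0.0
```

On real data the 2D detections are noisy, so this loss is nonzero and supervises the 3D lifter without any 3D ground truth.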
no code implementations • 20 Feb 2018 • Moritz Kampelmühler, Michael G. Müller, Christoph Feichtenhofer
This paper documents the winning entry at the CVPR2017 vehicle velocity estimation challenge.
no code implementations • CVPR 2018 • Christoph Feichtenhofer, Axel Pinz, Richard P. Wildes, Andrew Zisserman
In this paper, we shed light on deep spatiotemporal representations by visualizing what two-stream models have learned in order to recognize actions in video.
3 code implementations • ICCV 2017 • Christoph Feichtenhofer, Axel Pinz, Andrew Zisserman
Recent approaches for high accuracy detection and tracking of object categories in video consist of complex multistage solutions that become more cumbersome each year.
1 code implementation • CVPR 2017 • Christoph Feichtenhofer, Axel Pinz, Richard P. Wildes
This paper presents a general ConvNet architecture for video action recognition based on multiplicative interactions of spacetime features.
Ranked #47 on Action Recognition on HMDB-51
1 code implementation • CVPR 2017 • Christoph Feichtenhofer, Axel Pinz, Richard P. Wildes
Finally, our temporal ResNet boosts recognition performance and establishes a new state-of-the-art on dynamic scene recognition, as well as on the complementary task of action recognition.
1 code implementation • NeurIPS 2016 • Christoph Feichtenhofer, Axel Pinz, Richard P. Wildes
Two-stream Convolutional Networks (ConvNets) have shown strong performance for human action recognition in videos.
Ranked #48 on Action Recognition on UCF101
1 code implementation • CVPR 2016 • Christoph Feichtenhofer, Axel Pinz, Andrew Zisserman
Recent applications of Convolutional Neural Networks (ConvNets) for human action recognition in videos have proposed different solutions for incorporating the appearance and motion information.
Ranked #60 on Action Recognition on UCF101 (using extra training data)
no code implementations • CVPR 2015 • Christoph Feichtenhofer, Axel Pinz, Richard P. Wildes
By using the resulting definition of saliency during feature pooling, we show that action recognition achieves state-of-the-art performance on three widely used action recognition datasets.
no code implementations • CVPR 2014 • Christoph Feichtenhofer, Axel Pinz, Richard P. Wildes
This paper presents a unified bag of visual word (BoW) framework for dynamic scene recognition.