Search Results for author: Christoph Feichtenhofer

Found 49 papers, 40 papers with code

Window Attention is Bugged: How not to Interpolate Position Embeddings

no code implementations • 9 Nov 2023 • Daniel Bolya, Chaitanya Ryali, Judy Hoffman, Christoph Feichtenhofer

To fix it, we introduce a simple absolute window position embedding strategy, which solves the bug outright in Hiera and allows us to increase both speed and performance of the model in ViTDet.

Position
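The fix amounts to giving each window its own absolute position embedding, tiled across windows, instead of interpolating one global embedding across window boundaries. A minimal sketch of that idea, assuming a tiled per-window table plus a coarse resized per-window-location table (module name, shapes, and the bicubic resize are illustrative, not the released code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AbsWinPosEmbed(nn.Module):
    """Sketch: a learned embedding for positions *within* a window, tiled
    across all windows, plus a coarse per-window embedding resized to the
    current grid. Tiling (instead of interpolating one global table) means
    a change of input size never stretches embeddings across window edges."""
    def __init__(self, dim, win_size, grid_size):
        super().__init__()
        self.window = nn.Parameter(torch.zeros(1, dim, win_size, win_size))
        self.grid = nn.Parameter(torch.zeros(1, dim, grid_size, grid_size))

    def forward(self, h_wins, w_wins):
        tiled = self.window.repeat(1, 1, h_wins, w_wins)     # tile, don't stretch
        coarse = F.interpolate(self.grid, size=tiled.shape[-2:],
                               mode="bicubic", align_corners=False)
        return tiled + coarse          # (1, dim, H, W), added to patch features

pe = AbsWinPosEmbed(dim=96, win_size=8, grid_size=7)
print(pe(h_wins=14, w_wins=14).shape)  # torch.Size([1, 96, 112, 112])
```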

Demystifying CLIP Data

2 code implementations • 28 Sep 2023 • Hu Xu, Saining Xie, Xiaoqing Ellen Tan, Po-Yao Huang, Russell Howes, Vasu Sharma, Shang-Wen Li, Gargi Ghosh, Luke Zettlemoyer, Christoph Feichtenhofer

We believe that the main ingredient to the success of CLIP is its data and not the model architecture or pre-training objective.

Reversible Vision Transformers

4 code implementations CVPR 2022 Karttikeya Mangalam, Haoqi Fan, Yanghao Li, Chao-yuan Wu, Bo Xiong, Christoph Feichtenhofer, Jitendra Malik

Reversible Vision Transformers achieve a reduced memory footprint of up to 15.5x at roughly identical model complexity, parameters, and accuracy, demonstrating their promise as an efficient backbone for hardware-resource-limited training regimes.

Image Classification object-detection +2
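The memory reduction comes from reversible residual blocks: a block's inputs can be recomputed exactly from its outputs, so intermediate activations need not be cached for backpropagation. A minimal RevNet-style sketch of the two-stream update (a generic illustration, not the released RevViT code):

```python
import torch
import torch.nn as nn

class ReversibleBlock(nn.Module):
    """Two-stream reversible transformer block (sketch).
    Forward:  Y1 = X1 + Attn(X2),  Y2 = X2 + MLP(Y1)
    Inverse:  X2 = Y2 - MLP(Y1),   X1 = Y1 - Attn(X2)
    Inputs are recomputable from outputs, so activations need not be
    stored during training, which is where the memory saving comes from."""
    def __init__(self, dim, heads=8):
        super().__init__()
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))

    def f(self, x):  # attention sub-block
        x = self.norm1(x)
        return self.attn(x, x, x, need_weights=False)[0]

    def g(self, y):  # MLP sub-block
        return self.mlp(self.norm2(y))

    def forward(self, x1, x2):
        y1 = x1 + self.f(x2)
        y2 = x2 + self.g(y1)
        return y1, y2

    def inverse(self, y1, y2):
        x2 = y2 - self.g(y1)
        x1 = y1 - self.f(x2)
        return x1, x2

# round-trip check: inverse(forward(x)) recovers the inputs
blk = ReversibleBlock(64).eval()
x1, x2 = torch.randn(2, 16, 64), torch.randn(2, 16, 64)
with torch.no_grad():
    r1, r2 = blk.inverse(*blk(x1, x2))
print(torch.allclose(r1, x1, atol=1e-5), torch.allclose(r2, x2, atol=1e-5))
```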

Multiview Compressive Coding for 3D Reconstruction

1 code implementation CVPR 2023 Chao-yuan Wu, Justin Johnson, Jitendra Malik, Christoph Feichtenhofer, Georgia Gkioxari

We introduce a simple framework that operates on 3D points of single objects or whole scenes coupled with category-agnostic large-scale training from diverse RGB-D videos.

3D Reconstruction Self-Supervised Learning +1

CiT: Curation in Training for Effective Vision-Language Data

1 code implementation ICCV 2023 Hu Xu, Saining Xie, Po-Yao Huang, Licheng Yu, Russell Howes, Gargi Ghosh, Luke Zettlemoyer, Christoph Feichtenhofer

Large vision-language models are generally applicable to many downstream tasks, but come at an exorbitant training cost that only large institutions can afford.

Scaling Language-Image Pre-training via Masking

4 code implementations CVPR 2023 Yanghao Li, Haoqi Fan, Ronghang Hu, Christoph Feichtenhofer, Kaiming He

We present Fast Language-Image Pre-training (FLIP), a simple and more efficient method for training CLIP.
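The efficiency gain comes from randomly masking a large fraction of image patches (the paper ablates ratios such as 50% and 75%) and encoding only the visible ones, so each training step is cheaper and larger batches fit in memory. A minimal sketch of the patch-dropping step, assuming standard ViT patch tokens (function name and shapes are illustrative, not the released code):

```python
import torch

def drop_patches(tokens, keep_ratio=0.5):
    """Keep a random subset of patch tokens per image; only these are
    encoded, cutting image-encoder compute by roughly the mask ratio."""
    B, N, D = tokens.shape
    n_keep = max(1, int(N * keep_ratio))
    idx = torch.rand(B, N).argsort(dim=1)[:, :n_keep]   # random subset per image
    visible = torch.gather(tokens, 1, idx.unsqueeze(-1).expand(-1, -1, D))
    return visible, idx

patches = torch.randn(8, 196, 768)        # 14x14 ViT patch tokens per image
visible, idx = drop_patches(patches, keep_ratio=0.5)
print(visible.shape)                      # torch.Size([8, 98, 768])
# `visible` goes through the image tower; the contrastive loss against the
# text tower is then computed exactly as in CLIP.
```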

Token Merging: Your ViT But Faster

3 code implementations • 17 Oct 2022 • Daniel Bolya, Cheng-Yang Fu, Xiaoliang Dai, Peizhao Zhang, Christoph Feichtenhofer, Judy Hoffman

Off-the-shelf, ToMe can 2x the throughput of state-of-the-art ViT-L @ 512 and ViT-H @ 518 models on images and 2.2x the throughput of ViT-L on video, with only a 0.2-0.3% accuracy drop in each case.

Efficient ViTs
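Under the hood, ToMe gradually reduces the token count by merging the r most similar token pairs in every block, found with a cheap bipartite matching rather than full clustering. A simplified, runnable sketch of that matching (the released version additionally tracks merged-token sizes and protects the class token):

```python
import torch

def merge_tokens(x: torch.Tensor, r: int):
    """Simplified ToMe-style bipartite soft matching: split the tokens into
    two alternating sets, find the r most similar (a -> b) pairs, and average
    each pair into its destination, shrinking N by r. If two sources pick the
    same destination, the last write wins here; the released code reduces
    duplicates with a mean."""
    a, b = x[:, ::2], x[:, 1::2]                       # alternating split
    an = a / a.norm(dim=-1, keepdim=True)
    bn = b / b.norm(dim=-1, keepdim=True)
    sim = an @ bn.transpose(1, 2)                      # cosine similarity (B, Na, Nb)
    best_val, best_dst = sim.max(dim=-1)               # best b-partner per a-token
    src = best_val.argsort(dim=-1, descending=True)[:, :r]  # r most similar a-tokens

    B, _, D = x.shape
    batch = torch.arange(B).unsqueeze(-1)
    dst = best_dst[batch, src]
    b = b.clone()
    b[batch, dst] = (b[batch, dst] + a[batch, src]) / 2     # merge pairs by averaging
    keep = torch.ones(B, a.shape[1], dtype=torch.bool)
    keep[batch, src] = False                                # drop merged a-tokens
    a_kept = a[keep].reshape(B, -1, D)
    return torch.cat([a_kept, b], dim=1)

x = torch.randn(2, 196, 384)
print(merge_tokens(x, r=16).shape)  # torch.Size([2, 180, 384])
```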

Masked Autoencoders that Listen

4 code implementations • 13 Jul 2022 • Po-Yao Huang, Hu Xu, Juncheng Li, Alexei Baevski, Michael Auli, Wojciech Galuba, Florian Metze, Christoph Feichtenhofer

Following the Transformer encoder-decoder design in MAE, our Audio-MAE first encodes audio spectrogram patches with a high masking ratio, feeding only the non-masked tokens through encoder layers.

Ranked #2 on Speaker Identification on VoxCeleb1 (using extra training data)

Audio Classification Representation Learning +1
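A minimal sketch of that encoder-side flow, assuming a 128-mel, 1024-frame log-mel spectrogram cut into 16x16 patches with an 80% mask ratio (shapes and the tiny encoder are illustrative, not the released model):

```python
import torch
import torch.nn as nn

# assumed shapes: a 128-mel x 1024-frame log-mel spectrogram, 16x16 patches
B, mels, frames, p = 4, 128, 1024, 16
spec = torch.randn(B, 1, mels, frames)
patches = spec.unfold(2, p, p).unfold(3, p, p).reshape(B, -1, p * p)  # (B, 512, 256)

tokens = nn.Linear(p * p, 768)(patches)           # patch embedding -> (B, 512, 768)

mask_ratio = 0.8                                  # encoder sees only ~20% of tokens
n_keep = int(tokens.shape[1] * (1 - mask_ratio))
idx = torch.rand(B, tokens.shape[1]).argsort(1)[:, :n_keep]
visible = torch.gather(tokens, 1, idx.unsqueeze(-1).expand(-1, -1, 768))

encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(768, nhead=12, batch_first=True), num_layers=2)
latent = encoder(visible)
print(latent.shape)                               # torch.Size([4, 102, 768])
# a lightweight decoder then reinserts mask tokens at their original
# positions and reconstructs the masked spectrogram patches.
```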

Masked Autoencoders As Spatiotemporal Learners

3 code implementations • 18 May 2022 • Christoph Feichtenhofer, Haoqi Fan, Yanghao Li, Kaiming He

We randomly mask out spacetime patches in videos and learn an autoencoder to reconstruct them in pixels.

Inductive Bias Representation Learning
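Because video is temporally redundant, a very high masking ratio (90% in the paper) still leaves enough signal to learn from. The training objective is plain mean-squared error in pixel space, averaged over masked patches only; a minimal sketch:

```python
import torch

def masked_pixel_loss(pred, target, mask):
    """MSE in pixel space over masked spacetime patches only.
    pred/target: (B, N, patch_pixels); mask: (B, N), 1 = masked/to-predict.
    (The paper also ablates per-patch normalized pixel targets.)"""
    per_patch = (pred - target).pow(2).mean(dim=-1)
    return (per_patch * mask).sum() / mask.sum().clamp(min=1)

# toy shapes: 8x14x14 spacetime patches of 2x16x16x3 pixels from a 16-frame clip
B, N, D = 2, 8 * 14 * 14, 2 * 16 * 16 * 3
pred, target = torch.randn(B, N, D), torch.randn(B, N, D)
mask = (torch.rand(B, N) < 0.9).float()      # ~90% of patches masked
print(masked_pixel_loss(pred, target, mask))
```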

MeMViT: Memory-Augmented Multiscale Vision Transformer for Efficient Long-Term Video Recognition

1 code implementation CVPR 2022 Chao-yuan Wu, Yanghao Li, Karttikeya Mangalam, Haoqi Fan, Bo Xiong, Jitendra Malik, Christoph Feichtenhofer

Instead of trying to process more frames at once like most existing methods, we propose to process videos in an online fashion and cache "memory" at each iteration.

Ranked #3 on Action Anticipation on EPIC-KITCHENS-100 (using extra training data)

Action Anticipation Action Classification +2
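Concretely, each attention layer lets the current clip's queries attend over keys and values extended with cached, gradient-free activations from earlier clips; the paper additionally compresses this cache with a learned pooling, which the sketch below omits (module name and shapes are assumptions):

```python
import torch
import torch.nn as nn

class MemoryAttention(nn.Module):
    """Sketch of memory-augmented attention: the current clip's queries
    attend over keys/values extended with detached activations cached from
    previous clips, so long-term context costs no extra backprop memory."""
    def __init__(self, dim, heads=8, mem_clips=2):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mem_clips = mem_clips
        self.cache = []                              # past clips' tokens

    def forward(self, x):                            # x: (B, N, dim)
        kv = torch.cat(self.cache + [x], dim=1) if self.cache else x
        out, _ = self.attn(x, kv, kv, need_weights=False)
        self.cache.append(x.detach())                # no gradients into the past
        self.cache = self.cache[-self.mem_clips:]    # bounded memory window
        return out

layer = MemoryAttention(dim=96)
for clip in torch.randn(5, 2, 64, 96):               # 5 consecutive clips
    y = layer(clip)                                  # later clips see cached memory
print(y.shape)                                       # torch.Size([2, 64, 96])
```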

A ConvNet for the 2020s

45 code implementations CVPR 2022 Zhuang Liu, Hanzi Mao, Chao-yuan Wu, Christoph Feichtenhofer, Trevor Darrell, Saining Xie

The "Roaring 20s" of visual recognition began with the introduction of Vision Transformers (ViTs), which quickly superseded ConvNets as the state-of-the-art image classification model.

Classification Domain Generalization +3

MViTv2: Improved Multiscale Vision Transformers for Classification and Detection

7 code implementations CVPR 2022 Yanghao Li, Chao-yuan Wu, Haoqi Fan, Karttikeya Mangalam, Bo Xiong, Jitendra Malik, Christoph Feichtenhofer

In this paper, we study Multiscale Vision Transformers (MViTv2) as a unified architecture for image and video classification, as well as object detection.

Ranked #1 on Action Classification on Kinetics-600 (GFLOPs metric)

Action Classification Action Recognition +6

PyTorchVideo: A Deep Learning Library for Video Understanding

1 code implementation • 18 Nov 2021 • Haoqi Fan, Tullie Murrell, Heng Wang, Kalyan Vasudev Alwala, Yanghao Li, Yilei Li, Bo Xiong, Nikhila Ravi, Meng Li, Haichuan Yang, Jitendra Malik, Ross Girshick, Matt Feiszli, Aaron Adcock, Wan-Yen Lo, Christoph Feichtenhofer

We introduce PyTorchVideo, an open-source deep-learning library that provides a rich set of modular, efficient, and reproducible components for a variety of video understanding tasks, including classification, detection, self-supervised learning, and low-level processing.

Self-Supervised Learning Video Understanding
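A minimal usage sketch via torch.hub, following the PyTorchVideo model zoo tutorials (requires `pip install pytorchvideo`; model names and expected clip shapes may change between releases, and input normalization is skipped here):

```python
import torch

# load a pretrained Slow-pathway ResNet-50 video classifier from the model zoo
model = torch.hub.load("facebookresearch/pytorchvideo", "slow_r50", pretrained=True)
model.eval()

# (batch, channels, frames, height, width); slow_r50 expects 8-frame clips
clip = torch.randn(1, 3, 8, 224, 224)
with torch.no_grad():
    logits = model(clip)
print(logits.shape)  # 400 Kinetics classes: torch.Size([1, 400])
```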

Ego4D: Around the World in 3,000 Hours of Egocentric Video

6 code implementations CVPR 2022 Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu, Xingyu Liu, Miguel Martin, Tushar Nagarajan, Ilija Radosavovic, Santhosh Kumar Ramakrishnan, Fiona Ryan, Jayant Sharma, Michael Wray, Mengmeng Xu, Eric Zhongcong Xu, Chen Zhao, Siddhant Bansal, Dhruv Batra, Vincent Cartillier, Sean Crane, Tien Do, Morrie Doulaty, Akshay Erapalli, Christoph Feichtenhofer, Adriano Fragomeni, Qichen Fu, Abrham Gebreselasie, Cristina Gonzalez, James Hillis, Xuhua Huang, Yifei HUANG, Wenqi Jia, Weslie Khoo, Jachym Kolar, Satwik Kottur, Anurag Kumar, Federico Landini, Chao Li, Yanghao Li, Zhenqiang Li, Karttikeya Mangalam, Raghava Modhugu, Jonathan Munro, Tullie Murrell, Takumi Nishiyasu, Will Price, Paola Ruiz Puentes, Merey Ramazanova, Leda Sari, Kiran Somasundaram, Audrey Southerland, Yusuke Sugano, Ruijie Tao, Minh Vo, Yuchen Wang, Xindi Wu, Takuma Yagi, Ziwei Zhao, Yunyi Zhu, Pablo Arbelaez, David Crandall, Dima Damen, Giovanni Maria Farinella, Christian Fuegen, Bernard Ghanem, Vamsi Krishna Ithapu, C. V. Jawahar, Hanbyul Joo, Kris Kitani, Haizhou Li, Richard Newcombe, Aude Oliva, Hyun Soo Park, James M. Rehg, Yoichi Sato, Jianbo Shi, Mike Zheng Shou, Antonio Torralba, Lorenzo Torresani, Mingfei Yan, Jitendra Malik

We introduce Ego4D, a massive-scale egocentric video dataset and benchmark suite.

De-identification Ethics

VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding

2 code implementations EMNLP 2021 Hu Xu, Gargi Ghosh, Po-Yao Huang, Dmytro Okhonko, Armen Aghajanyan, Florian Metze, Luke Zettlemoyer, Christoph Feichtenhofer

We present VideoCLIP, a contrastive approach to pre-train a unified model for zero-shot video and text understanding, without using any labels on downstream tasks.

Ranked #1 on Temporal Action Localization on CrossTask (using extra training data)

Action Segmentation Long Video Retrieval (Background Removed) +4
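At its core is a symmetric video-text contrastive objective; the paper's contributions on top of it, temporally overlapped positives and retrieval-augmented hard negatives, are omitted from this minimal InfoNCE sketch:

```python
import torch
import torch.nn.functional as F

def clip_style_loss(video_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired video/text embeddings:
    each video should score highest with its own text and vice versa."""
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.T / temperature            # (B, B) similarities
    labels = torch.arange(len(v))             # positives on the diagonal
    return (F.cross_entropy(logits, labels) +
            F.cross_entropy(logits.T, labels)) / 2

video_emb, text_emb = torch.randn(32, 512), torch.randn(32, 512)
print(clip_style_loss(video_emb, text_emb))
```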

Multiscale Vision Transformers

7 code implementations ICCV 2021 Haoqi Fan, Bo Xiong, Karttikeya Mangalam, Yanghao Li, Zhicheng Yan, Jitendra Malik, Christoph Feichtenhofer

We evaluate this fundamental architectural prior for modeling the dense nature of visual signals on a variety of video recognition tasks, where it outperforms concurrent vision transformers that rely on large-scale external pre-training and are 5-10x more costly in computation and parameters.

Action Classification Action Recognition +2

Multiview Pseudo-Labeling for Semi-supervised Learning from Video

no code implementations ICCV 2021 Bo Xiong, Haoqi Fan, Kristen Grauman, Christoph Feichtenhofer

We present a multiview pseudo-labeling approach to video learning, a novel framework that uses complementary views in the form of appearance and motion information for semi-supervised learning in video.

Representation Learning Video Recognition
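A minimal sketch of the idea, assuming the two views are an RGB (appearance) model and an optical-flow (motion) model whose averaged prediction pseudo-labels confident unlabeled clips (the fusion rule and threshold here are assumptions, not the paper's exact recipe):

```python
import torch

def multiview_pseudo_labels(appearance_logits, motion_logits, threshold=0.8):
    """Average the two views' class probabilities for each unlabeled clip
    and keep only confident predictions as pseudo-labels, which then
    supervise both views."""
    probs = (appearance_logits.softmax(-1) + motion_logits.softmax(-1)) / 2
    conf, label = probs.max(dim=-1)
    keep = conf > threshold          # only confident clips get pseudo-labeled
    return label[keep], keep

app = torch.randn(16, 400)           # appearance-view logits, unlabeled clips
mot = torch.randn(16, 400)           # motion-view logits
labels, keep = multiview_pseudo_labels(app, mot)
print(labels.shape, keep.sum().item())
```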

TrackFormer: Multi-Object Tracking with Transformers

2 code implementations CVPR 2022 Tim Meinhardt, Alexander Kirillov, Laura Leal-Taixe, Christoph Feichtenhofer

The challenging task of multi-object tracking (MOT) requires simultaneous reasoning about track initialization, identity, and spatio-temporal trajectories.

Ranked #1 on Multi-Object Tracking on MOT17 (e2e-MOT metric)

Multi-Object Tracking Object +1

X3D: Expanding Architectures for Efficient Video Recognition

8 code implementations CVPR 2020 Christoph Feichtenhofer

This paper presents X3D, a family of efficient video networks that progressively expand a tiny 2D image classification architecture along multiple network axes, in space, time, width and depth.

Action Classification feature selection +4
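The expansion itself is a greedy coordinate-descent loop: grow each axis by a fixed step in isolation, train and evaluate every candidate, keep the single best accuracy/complexity trade-off, and repeat until a target complexity. A toy, runnable sketch with a mocked score function standing in for training (axis names follow the paper's spirit; the step values and scoring are made up):

```python
import math

# multiplicative growth per axis (toy values, not the paper's factors)
STEPS = {"frames": 2.0, "resolution": 1.3, "width": 2.0, "depth": 2.2}

def cost(cfg):
    # crude FLOPs proxy: time * space^2 * width * depth
    return cfg["frames"] * cfg["resolution"] ** 2 * cfg["width"] * cfg["depth"]

def score(cfg):
    # stand-in for "train the candidate briefly, then evaluate accuracy"
    return sum(math.log(v) for v in cfg.values()) / cost(cfg) ** 0.1

cfg = {"frames": 1.0, "resolution": 112.0, "width": 24.0, "depth": 10.0}
while cost(cfg) < 1e9:                 # stop at a target complexity regime
    candidates = []
    for axis, step in STEPS.items():
        cand = dict(cfg)
        cand[axis] *= step             # expand exactly one axis at a time
        candidates.append(cand)
    cfg = max(candidates, key=score)   # keep the best single-axis expansion
print(cfg)
```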

Feature Pyramid Grids

1 code implementation • 7 Apr 2020 • Kai Chen, Yuhang Cao, Chen Change Loy, Dahua Lin, Christoph Feichtenhofer

Feature pyramid networks have been widely adopted in the object detection literature to improve feature representations for better handling of variations in scale.

Neural Architecture Search object-detection +2

EGO-TOPO: Environment Affordances from Egocentric Video

1 code implementation CVPR 2020 Tushar Nagarajan, Yanghao Li, Christoph Feichtenhofer, Kristen Grauman

We introduce a model for environment affordances that is learned directly from egocentric video.

A Multigrid Method for Efficiently Training Video Models

3 code implementations CVPR 2020 Chao-yuan Wu, Ross Girshick, Kaiming He, Christoph Feichtenhofer, Philipp Krähenbühl

We empirically demonstrate a general and robust grid schedule that yields a significant out-of-the-box training speedup without a loss in accuracy for different models (I3D, non-local, SlowFast), datasets (Kinetics, Something-Something, Charades), and training settings (with and without pre-training, 128 GPUs or 1 GPU).

Action Detection Action Recognition +2
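The key mechanic is that mini-batch size and clip resolution are varied together: on coarser spatial/temporal grids the batch grows proportionally, so per-iteration compute stays roughly constant while early training sees many more clips. A toy sketch of such a schedule (the paper's actual long and short cycles differ in detail):

```python
# base (fine-grid) settings: 16-frame clips at 224x224, batch of 8 per GPU
base_t, base_s, base_batch = 16, 224, 8

# long cycle of (temporal scale, spatial scale): coarser grids early on
long_cycle = [(0.25, 0.5), (0.5, 0.5), (0.5, 1.0), (1.0, 1.0)]

epochs = 8
for epoch in range(epochs):
    t_scale, s_scale = long_cycle[epoch * len(long_cycle) // epochs]
    t, s = int(base_t * t_scale), int(base_s * s_scale)
    # per-iteration compute ~ batch * t * s^2, so scale the batch inversely
    batch = int(base_batch / (t_scale * s_scale ** 2))
    print(f"epoch {epoch}: clips {t}x{s}x{s}, batch {batch}")
```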

Learning Temporal Pose Estimation from Sparsely-Labeled Videos

3 code implementations NeurIPS 2019 Gedas Bertasius, Christoph Feichtenhofer, Du Tran, Jianbo Shi, Lorenzo Torresani

To reduce the need for dense annotations, we propose a PoseWarper network that leverages training videos with sparse annotations (every k frames) to learn to perform dense temporal pose propagation and estimation.

Ranked #2 on Multi-Person Pose Estimation on PoseTrack2018 (using extra training data)

Multi-Person Pose Estimation Optical Flow Estimation

Grounded Human-Object Interaction Hotspots from Video (Extended Abstract)

no code implementations • 3 Jun 2019 • Tushar Nagarajan, Christoph Feichtenhofer, Kristen Grauman

Learning how to interact with objects is an important step towards embodied visual intelligence, but existing techniques suffer from heavy supervision or sensing requirements.

Human-Object Interaction Detection Object +1

Modeling Human Motion with Quaternion-based Neural Networks

1 code implementation • 21 Jan 2019 • Dario Pavllo, Christoph Feichtenhofer, Michael Auli, David Grangier

Previous work on predicting or generating 3D human pose sequences regresses either joint rotations or joint positions.

Grounded Human-Object Interaction Hotspots from Video

1 code implementation ICCV 2019 Tushar Nagarajan, Christoph Feichtenhofer, Kristen Grauman

Learning how to interact with objects is an important step towards embodied visual intelligence, but existing techniques suffer from heavy supervision or sensing requirements.

Human-Object Interaction Detection Object +3

What have we learned from deep representations for action recognition?

no code implementations CVPR 2018 Christoph Feichtenhofer, Axel Pinz, Richard P. Wildes, Andrew Zisserman

In this paper, we shed light on deep spatiotemporal representations by visualizing what two-stream models have learned in order to recognize actions in video.

Action Recognition Temporal Action Localization

Detect to Track and Track to Detect

3 code implementations ICCV 2017 Christoph Feichtenhofer, Axel Pinz, Andrew Zisserman

Recent approaches for high accuracy detection and tracking of object categories in video consist of complex multistage solutions that become more cumbersome each year.

Object object-detection +1

Spatiotemporal Multiplier Networks for Video Action Recognition

1 code implementation CVPR 2017 Christoph Feichtenhofer, Axel Pinz, Richard P. Wildes

This paper presents a general ConvNet architecture for video action recognition based on multiplicative interactions of spacetime features.

Action Recognition General Classification +1

Temporal Residual Networks for Dynamic Scene Recognition

1 code implementation CVPR 2017 Christoph Feichtenhofer, Axel Pinz, Richard P. Wildes

Finally, our temporal ResNet boosts recognition performance and establishes a new state-of-the-art on dynamic scene recognition, as well as on the complementary task of action recognition.

Action Recognition Scene Recognition +1

Convolutional Two-Stream Network Fusion for Video Action Recognition

1 code implementation CVPR 2016 Christoph Feichtenhofer, Axel Pinz, Andrew Zisserman

Recent applications of Convolutional Neural Networks (ConvNets) for human action recognition in videos have proposed different solutions for incorporating the appearance and motion information.

Ranked #60 on Action Recognition on UCF101 (using extra training data)

Action Recognition In Videos Temporal Action Localization +1

Dynamically Encoded Actions Based on Spacetime Saliency

no code implementations CVPR 2015 Christoph Feichtenhofer, Axel Pinz, Richard P. Wildes

By using the resulting definition of saliency during feature pooling we show that action recognition performance achieves state-of-the-art levels on three widely considered action recognition datasets.

Action Recognition Temporal Action Localization
