Search Results for author: Christoph Feichtenhofer

Found 49 papers, 40 papers with code

Window Attention is Bugged: How not to Interpolate Position Embeddings

no code implementations • 9 Nov 2023 • Daniel Bolya, Chaitanya Ryali, Judy Hoffman, Christoph Feichtenhofer

To fix it, we introduce a simple absolute window position embedding strategy, which solves the bug outright in Hiera and allows us to increase both speed and performance of the model in ViTDet.

Position
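The fix amounts to giving each window its own absolute position embedding, tiled across windows, instead of interpolating one global embedding across window boundaries. A minimal sketch of that idea, assuming a tiled per-window table plus a coarse resized per-window-location table (module name, shapes, and the bicubic resize are illustrative, not the released code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AbsWinPosEmbed(nn.Module):
    """Sketch: a learned embedding for positions *within* a window, tiled
    across all windows, plus a coarse per-window embedding resized to the
    current grid. Tiling (instead of interpolating one global table) means
    a change of input size never stretches embeddings across window edges."""
    def __init__(self, dim, win_size, grid_size):
        super().__init__()
        self.window = nn.Parameter(torch.zeros(1, dim, win_size, win_size))
        self.grid = nn.Parameter(torch.zeros(1, dim, grid_size, grid_size))

    def forward(self, h_wins, w_wins):
        tiled = self.window.repeat(1, 1, h_wins, w_wins)     # tile, don't stretch
        coarse = F.interpolate(self.grid, size=tiled.shape[-2:],
                               mode="bicubic", align_corners=False)
        return tiled + coarse          # (1, dim, H, W), added to patch features

pe = AbsWinPosEmbed(dim=96, win_size=8, grid_size=7)
print(pe(h_wins=14, w_wins=14).shape)  # torch.Size([1, 96, 112, 112])
```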

Demystifying CLIP Data

2 code implementations • 28 Sep 2023 • Hu Xu, Saining Xie, Xiaoqing Ellen Tan, Po-Yao Huang, Russell Howes, Vasu Sharma, Shang-Wen Li, Gargi Ghosh, Luke Zettlemoyer, Christoph Feichtenhofer

We believe that the main ingredient to the success of CLIP is its data and not the model architecture or pre-training objective.

Reversible Vision Transformers

4 code implementations CVPR 2022 Karttikeya Mangalam, Haoqi Fan, Yanghao Li, Chao-yuan Wu, Bo Xiong, Christoph Feichtenhofer, Jitendra Malik

Reversible Vision Transformers achieve a reduced memory footprint of up to 15.5x at roughly identical model complexity, parameters, and accuracy, demonstrating their promise as an efficient backbone for hardware-resource-limited training regimes.

Image Classification object-detection +2
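The memory reduction comes from reversible residual blocks: a block's inputs can be recomputed exactly from its outputs, so intermediate activations need not be cached for backpropagation. A minimal RevNet-style sketch of the two-stream update (a generic illustration, not the released RevViT code):

```python
import torch
import torch.nn as nn

class ReversibleBlock(nn.Module):
    """Two-stream reversible transformer block (sketch).
    Forward:  Y1 = X1 + Attn(X2),  Y2 = X2 + MLP(Y1)
    Inverse:  X2 = Y2 - MLP(Y1),   X1 = Y1 - Attn(X2)
    Inputs are recomputable from outputs, so activations need not be
    stored during training, which is where the memory saving comes from."""
    def __init__(self, dim, heads=8):
        super().__init__()
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))

    def f(self, x):  # attention sub-block
        x = self.norm1(x)
        return self.attn(x, x, x, need_weights=False)[0]

    def g(self, y):  # MLP sub-block
        return self.mlp(self.norm2(y))

    def forward(self, x1, x2):
        y1 = x1 + self.f(x2)
        y2 = x2 + self.g(y1)
        return y1, y2

    def inverse(self, y1, y2):
        x2 = y2 - self.g(y1)
        x1 = y1 - self.f(x2)
        return x1, x2

# round-trip check: inverse(forward(x)) recovers the inputs
blk = ReversibleBlock(64).eval()
x1, x2 = torch.randn(2, 16, 64), torch.randn(2, 16, 64)
with torch.no_grad():
    r1, r2 = blk.inverse(*blk(x1, x2))
print(torch.allclose(r1, x1, atol=1e-5), torch.allclose(r2, x2, atol=1e-5))
```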

Multiview Compressive Coding for 3D Reconstruction

1 code implementation CVPR 2023 Chao-yuan Wu, Justin Johnson, Jitendra Malik, Christoph Feichtenhofer, Georgia Gkioxari

We introduce a simple framework that operates on 3D points of single objects or whole scenes coupled with category-agnostic large-scale training from diverse RGB-D videos.

3D Reconstruction Self-Supervised Learning +1

CiT: Curation in Training for Effective Vision-Language Data

1 code implementation ICCV 2023 Hu Xu, Saining Xie, Po-Yao Huang, Licheng Yu, Russell Howes, Gargi Ghosh, Luke Zettlemoyer, Christoph Feichtenhofer

Large vision-language models are generally applicable to many downstream tasks, but come at an exorbitant training cost that only large institutions can afford.

Scaling Language-Image Pre-training via Masking

4 code implementations CVPR 2023 Yanghao Li, Haoqi Fan, Ronghang Hu, Christoph Feichtenhofer, Kaiming He

We present Fast Language-Image Pre-training (FLIP), a simple and more efficient method for training CLIP.
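The efficiency gain comes from randomly masking a large fraction of image patches (the paper ablates ratios such as 50% and 75%) and encoding only the visible ones, so each training step is cheaper and larger batches fit in memory. A minimal sketch of the patch-dropping step, assuming standard ViT patch tokens (function name and shapes are illustrative, not the released code):

```python
import torch

def drop_patches(tokens, keep_ratio=0.5):
    """Keep a random subset of patch tokens per image; only these are
    encoded, cutting image-encoder compute by roughly the mask ratio."""
    B, N, D = tokens.shape
    n_keep = max(1, int(N * keep_ratio))
    idx = torch.rand(B, N).argsort(dim=1)[:, :n_keep]   # random subset per image
    visible = torch.gather(tokens, 1, idx.unsqueeze(-1).expand(-1, -1, D))
    return visible, idx

patches = torch.randn(8, 196, 768)        # 14x14 ViT patch tokens per image
visible, idx = drop_patches(patches, keep_ratio=0.5)
print(visible.shape)                      # torch.Size([8, 98, 768])
# `visible` goes through the image tower; the contrastive loss against the
# text tower is then computed exactly as in CLIP.
```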

Token Merging: Your ViT But Faster

3 code implementations • 17 Oct 2022 • Daniel Bolya, Cheng-Yang Fu, Xiaoliang Dai, Peizhao Zhang, Christoph Feichtenhofer, Judy Hoffman

Off-the-shelf, ToMe can 2x the throughput of state-of-the-art ViT-L @ 512 and ViT-H @ 518 models on images and 2.2x the throughput of ViT-L on video, with only a 0.2-0.3% accuracy drop in each case.

Efficient ViTs
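Under the hood, ToMe gradually reduces the token count by merging the r most similar token pairs in every block, found with a cheap bipartite matching rather than full clustering. A simplified, runnable sketch of that matching (the released version additionally tracks merged-token sizes and protects the class token):

```python
import torch

def merge_tokens(x: torch.Tensor, r: int):
    """Simplified ToMe-style bipartite soft matching: split the tokens into
    two alternating sets, find the r most similar (a -> b) pairs, and average
    each pair into its destination, shrinking N by r. If two sources pick the
    same destination, the last write wins here; the released code reduces
    duplicates with a mean."""
    a, b = x[:, ::2], x[:, 1::2]                       # alternating split
    an = a / a.norm(dim=-1, keepdim=True)
    bn = b / b.norm(dim=-1, keepdim=True)
    sim = an @ bn.transpose(1, 2)                      # cosine similarity (B, Na, Nb)
    best_val, best_dst = sim.max(dim=-1)               # best b-partner per a-token
    src = best_val.argsort(dim=-1, descending=True)[:, :r]  # r most similar a-tokens

    B, _, D = x.shape
    batch = torch.arange(B).unsqueeze(-1)
    dst = best_dst[batch, src]
    b = b.clone()
    b[batch, dst] = (b[batch, dst] + a[batch, src]) / 2     # merge pairs by averaging
    keep = torch.ones(B, a.shape[1], dtype=torch.bool)
    keep[batch, src] = False                                # drop merged a-tokens
    a_kept = a[keep].reshape(B, -1, D)
    return torch.cat([a_kept, b], dim=1)

x = torch.randn(2, 196, 384)
print(merge_tokens(x, r=16).shape)  # torch.Size([2, 180, 384])
```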

Masked Autoencoders that Listen

4 code implementations • 13 Jul 2022 • Po-Yao Huang, Hu Xu, Juncheng Li, Alexei Baevski, Michael Auli, Wojciech Galuba, Florian Metze, Christoph Feichtenhofer

Following the Transformer encoder-decoder design in MAE, our Audio-MAE first encodes audio spectrogram patches with a high masking ratio, feeding only the non-masked tokens through encoder layers.

Ranked #2 on Speaker Identification on VoxCeleb1 (using extra training data)

Audio Classification Representation Learning +1
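A minimal sketch of that encoder-side flow, assuming a 128-mel, 1024-frame log-mel spectrogram cut into 16x16 patches with an 80% mask ratio (shapes and the tiny encoder are illustrative, not the released model):

```python
import torch
import torch.nn as nn

# assumed shapes: a 128-mel x 1024-frame log-mel spectrogram, 16x16 patches
B, mels, frames, p = 4, 128, 1024, 16
spec = torch.randn(B, 1, mels, frames)
patches = spec.unfold(2, p, p).unfold(3, p, p).reshape(B, -1, p * p)  # (B, 512, 256)

tokens = nn.Linear(p * p, 768)(patches)           # patch embedding -> (B, 512, 768)

mask_ratio = 0.8                                  # encoder sees only ~20% of tokens
n_keep = int(tokens.shape[1] * (1 - mask_ratio))
idx = torch.rand(B, tokens.shape[1]).argsort(1)[:, :n_keep]
visible = torch.gather(tokens, 1, idx.unsqueeze(-1).expand(-1, -1, 768))

encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(768, nhead=12, batch_first=True), num_layers=2)
latent = encoder(visible)
print(latent.shape)                               # torch.Size([4, 102, 768])
# a lightweight decoder then reinserts mask tokens at their original
# positions and reconstructs the masked spectrogram patches.
```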

Masked Autoencoders As Spatiotemporal Learners

3 code implementations • 18 May 2022 • Christoph Feichtenhofer, Haoqi Fan, Yanghao Li, Kaiming He

We randomly mask out spacetime patches in videos and learn an autoencoder to reconstruct them in pixels.

Inductive Bias Representation Learning
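Because video is temporally redundant, a very high masking ratio (90% in the paper) still leaves enough signal to learn from. The training objective is plain mean-squared error in pixel space, averaged over masked patches only; a minimal sketch:

```python
import torch

def masked_pixel_loss(pred, target, mask):
    """MSE in pixel space over masked spacetime patches only.
    pred/target: (B, N, patch_pixels); mask: (B, N), 1 = masked/to-predict.
    (The paper also ablates per-patch normalized pixel targets.)"""
    per_patch = (pred - target).pow(2).mean(dim=-1)
    return (per_patch * mask).sum() / mask.sum().clamp(min=1)

# toy shapes: 8x14x14 spacetime patches of 2x16x16x3 pixels from a 16-frame clip
B, N, D = 2, 8 * 14 * 14, 2 * 16 * 16 * 3
pred, target = torch.randn(B, N, D), torch.randn(B, N, D)
mask = (torch.rand(B, N) < 0.9).float()      # ~90% of patches masked
print(masked_pixel_loss(pred, target, mask))
```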

MeMViT: Memory-Augmented Multiscale Vision Transformer for Efficient Long-Term Video Recognition

1 code implementation CVPR 2022 Chao-yuan Wu, Yanghao Li, Karttikeya Mangalam, Haoqi Fan, Bo Xiong, Jitendra Malik, Christoph Feichtenhofer

Instead of trying to process more frames at once like most existing methods, we propose to process videos in an online fashion and cache "memory" at each iteration.

Ranked #3 on Action Anticipation on EPIC-KITCHENS-100 (using extra training data)

Action Anticipation Action Classification +2
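Concretely, each attention layer lets the current clip's queries attend over keys and values extended with cached, gradient-free activations from earlier clips; the paper additionally compresses this cache with a learned pooling, which the sketch below omits (module name and shapes are assumptions):

```python
import torch
import torch.nn as nn

class MemoryAttention(nn.Module):
    """Sketch of memory-augmented attention: the current clip's queries
    attend over keys/values extended with detached activations cached from
    previous clips, so long-term context costs no extra backprop memory."""
    def __init__(self, dim, heads=8, mem_clips=2):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mem_clips = mem_clips
        self.cache = []                              # past clips' tokens

    def forward(self, x):                            # x: (B, N, dim)
        kv = torch.cat(self.cache + [x], dim=1) if self.cache else x
        out, _ = self.attn(x, kv, kv, need_weights=False)
        self.cache.append(x.detach())                # no gradients into the past
        self.cache = self.cache[-self.mem_clips:]    # bounded memory window
        return out

layer = MemoryAttention(dim=96)
for clip in torch.randn(5, 2, 64, 96):               # 5 consecutive clips
    y = layer(clip)                                  # later clips see cached memory
print(y.shape)                                       # torch.Size([2, 64, 96])
```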

A ConvNet for the 2020s

45 code implementations CVPR 2022 Zhuang Liu, Hanzi Mao, Chao-yuan Wu, Christoph Feichtenhofer, Trevor Darrell, Saining Xie

The "Roaring 20s" of visual recognition began with the introduction of Vision Transformers (ViTs), which quickly superseded ConvNets as the state-of-the-art image classification model.

Classification Domain Generalization +3

MViTv2: Improved Multiscale Vision Transformers for Classification and Detection

7 code implementations CVPR 2022 Yanghao Li, Chao-yuan Wu, Haoqi Fan, Karttikeya Mangalam, Bo Xiong, Jitendra Malik, Christoph Feichtenhofer

In this paper, we study Multiscale Vision Transformers (MViTv2) as a unified architecture for image and video classification, as well as object detection.

Ranked #1 on Action Classification on Kinetics-600 (GFLOPs metric)

Action Classification Action Recognition +6

PyTorchVideo: A Deep Learning Library for Video Understanding

1 code implementation • 18 Nov 2021 • Haoqi Fan, Tullie Murrell, Heng Wang, Kalyan Vasudev Alwala, Yanghao Li, Yilei Li, Bo Xiong, Nikhila Ravi, Meng Li, Haichuan Yang, Jitendra Malik, Ross Girshick, Matt Feiszli, Aaron Adcock, Wan-Yen Lo, Christoph Feichtenhofer

We introduce PyTorchVideo, an open-source deep-learning library that provides a rich set of modular, efficient, and reproducible components for a variety of video understanding tasks, including classification, detection, self-supervised learning, and low-level processing.

Self-Supervised Learning Video Understanding
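A minimal usage sketch via torch.hub, following the PyTorchVideo model zoo tutorials (requires `pip install pytorchvideo`; model names and expected clip shapes may change between releases, and input normalization is skipped here):

```python
import torch

# load a pretrained Slow-pathway ResNet-50 video classifier from the model zoo
model = torch.hub.load("facebookresearch/pytorchvideo", "slow_r50", pretrained=True)
model.eval()

# (batch, channels, frames, height, width); slow_r50 expects 8-frame clips
clip = torch.randn(1, 3, 8, 224, 224)
with torch.no_grad():
    logits = model(clip)
print(logits.shape)  # 400 Kinetics classes: torch.Size([1, 400])
```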

Ego4D: Around the World in 3,000 Hours of Egocentric Video

6 code implementations CVPR 2022 Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu, Xingyu Liu, Miguel Martin, Tushar Nagarajan, Ilija Radosavovic, Santhosh Kumar Ramakrishnan, Fiona Ryan, Jayant Sharma, Michael Wray, Mengmeng Xu, Eric Zhongcong Xu, Chen Zhao, Siddhant Bansal, Dhruv Batra, Vincent Cartillier, Sean Crane, Tien Do, Morrie Doulaty, Akshay Erapalli, Christoph Feichtenhofer, Adriano Fragomeni, Qichen Fu, Abrham Gebreselasie, Cristina Gonzalez, James Hillis, Xuhua Huang, Yifei HUANG, Wenqi Jia, Weslie Khoo, Jachym Kolar, Satwik Kottur, Anurag Kumar, Federico Landini, Chao Li, Yanghao Li, Zhenqiang Li, Karttikeya Mangalam, Raghava Modhugu, Jonathan Munro, Tullie Murrell, Takumi Nishiyasu, Will Price, Paola Ruiz Puentes, Merey Ramazanova, Leda Sari, Kiran Somasundaram, Audrey Southerland, Yusuke Sugano, Ruijie Tao, Minh Vo, Yuchen Wang, Xindi Wu, Takuma Yagi, Ziwei Zhao, Yunyi Zhu, Pablo Arbelaez, David Crandall, Dima Damen, Giovanni Maria Farinella, Christian Fuegen, Bernard Ghanem, Vamsi Krishna Ithapu, C. V. Jawahar, Hanbyul Joo, Kris Kitani, Haizhou Li, Richard Newcombe, Aude Oliva, Hyun Soo Park, James M. Rehg, Yoichi Sato, Jianbo Shi, Mike Zheng Shou, Antonio Torralba, Lorenzo Torresani, Mingfei Yan, Jitendra Malik

We introduce Ego4D, a massive-scale egocentric video dataset and benchmark suite.

De-identification Ethics

VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding

2 code implementations EMNLP 2021 Hu Xu, Gargi Ghosh, Po-Yao Huang, Dmytro Okhonko, Armen Aghajanyan, Florian Metze, Luke Zettlemoyer, Christoph Feichtenhofer

We present VideoCLIP, a contrastive approach to pre-train a unified model for zero-shot video and text understanding, without using any labels on downstream tasks.

Ranked #1 on Temporal Action Localization on CrossTask (using extra training data)

Action Segmentation Long Video Retrieval (Background Removed) +4
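At its core is a symmetric video-text contrastive objective; the paper's contributions on top of it, temporally overlapped positives and retrieval-augmented hard negatives, are omitted from this minimal InfoNCE sketch:

```python
import torch
import torch.nn.functional as F

def clip_style_loss(video_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired video/text embeddings:
    each video should score highest with its own text and vice versa."""
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.T / temperature            # (B, B) similarities
    labels = torch.arange(len(v))             # positives on the diagonal
    return (F.cross_entropy(logits, labels) +
            F.cross_entropy(logits.T, labels)) / 2

video_emb, text_emb = torch.randn(32, 512), torch.randn(32, 512)
print(clip_style_loss(video_emb, text_emb))
```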

Multiscale Vision Transformers

7 code implementations ICCV 2021 Haoqi Fan, Bo Xiong, Karttikeya Mangalam, Yanghao Li, Zhicheng Yan, Jitendra Malik, Christoph Feichtenhofer

We evaluate this fundamental architectural prior for modeling the dense nature of visual signals on a variety of video recognition tasks, where it outperforms concurrent vision transformers that rely on large-scale external pre-training and are 5-10x more costly in computation and parameters.

Action Classification Action Recognition +2

Multiview Pseudo-Labeling for Semi-supervised Learning from Video

no code implementations ICCV 2021 Bo Xiong, Haoqi Fan, Kristen Grauman, Christoph Feichtenhofer

We present a multiview pseudo-labeling approach to video learning, a novel framework that uses complementary views in the form of appearance and motion information for semi-supervised learning in video.

Representation Learning Video Recognition
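A minimal sketch of the idea, assuming the two views are an RGB (appearance) model and an optical-flow (motion) model whose averaged prediction pseudo-labels confident unlabeled clips (the fusion rule and threshold here are assumptions, not the paper's exact recipe):

```python
import torch

def multiview_pseudo_labels(appearance_logits, motion_logits, threshold=0.8):
    """Average the two views' class probabilities for each unlabeled clip
    and keep only confident predictions as pseudo-labels, which then
    supervise both views."""
    probs = (appearance_logits.softmax(-1) + motion_logits.softmax(-1)) / 2
    conf, label = probs.max(dim=-1)
    keep = conf > threshold          # only confident clips get pseudo-labeled
    return label[keep], keep

app = torch.randn(16, 400)           # appearance-view logits, unlabeled clips
mot = torch.randn(16, 400)           # motion-view logits
labels, keep = multiview_pseudo_labels(app, mot)
print(labels.shape, keep.sum().item())
```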

TrackFormer: Multi-Object Tracking with Transformers

2 code implementations CVPR 2022 Tim Meinhardt, Alexander Kirillov, Laura Leal-Taixe, Christoph Feichtenhofer

The challenging task of multi-object tracking (MOT) requires simultaneous reasoning about track initialization, identity, and spatio-temporal trajectories.

Ranked #1 on Multi-Object Tracking on MOT17 (e2e-MOT metric)

Multi-Object Tracking Object +1

X3D: Expanding Architectures for Efficient Video Recognition

8 code implementations CVPR 2020 Christoph Feichtenhofer

This paper presents X3D, a family of efficient video networks that progressively expand a tiny 2D image classification architecture along multiple network axes, in space, time, width and depth.

Action Classification feature selection +4
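The expansion itself is a greedy coordinate-descent loop: grow each axis by a fixed step in isolation, train and evaluate every candidate, keep the single best accuracy/complexity trade-off, and repeat until a target complexity. A toy, runnable sketch with a mocked score function standing in for training (axis names follow the paper's spirit; the step values and scoring are made up):

```python
import math

# multiplicative growth per axis (toy values, not the paper's factors)
STEPS = {"frames": 2.0, "resolution": 1.3, "width": 2.0, "depth": 2.2}

def cost(cfg):
    # crude FLOPs proxy: time * space^2 * width * depth
    return cfg["frames"] * cfg["resolution"] ** 2 * cfg["width"] * cfg["depth"]

def score(cfg):
    # stand-in for "train the candidate briefly, then evaluate accuracy"
    return sum(math.log(v) for v in cfg.values()) / cost(cfg) ** 0.1

cfg = {"frames": 1.0, "resolution": 112.0, "width": 24.0, "depth": 10.0}
while cost(cfg) < 1e9:                 # stop at a target complexity regime
    candidates = []
    for axis, step in STEPS.items():
        cand = dict(cfg)
        cand[axis] *= step             # expand exactly one axis at a time
        candidates.append(cand)
    cfg = max(candidates, key=score)   # keep the best single-axis expansion
print(cfg)
```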

Feature Pyramid Grids

1 code implementation • 7 Apr 2020 • Kai Chen, Yuhang Cao, Chen Change Loy, Dahua Lin, Christoph Feichtenhofer

Feature pyramid networks have been widely adopted in the object detection literature to improve feature representations for better handling of variations in scale.

Neural Architecture Search object-detection +2

EGO-TOPO: Environment Affordances from Egocentric Video

1 code implementation CVPR 2020 Tushar Nagarajan, Yanghao Li, Christoph Feichtenhofer, Kristen Grauman

We introduce a model for environment affordances that is learned directly from egocentric video.

A Multigrid Method for Efficiently Training Video Models

3 code implementations CVPR 2020 Chao-yuan Wu, Ross Girshick, Kaiming He, Christoph Feichtenhofer, Philipp Krähenbühl

We empirically demonstrate a general and robust grid schedule that yields a significant out-of-the-box training speedup without a loss in accuracy for different models (I3D, non-local, SlowFast), datasets (Kinetics, Something-Something, Charades), and training settings (with and without pre-training, 128 GPUs or 1 GPU).

Action Detection Action Recognition +2
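The key mechanic is that mini-batch size and clip resolution are varied together: on coarser spatial/temporal grids the batch grows proportionally, so per-iteration compute stays roughly constant while early training sees many more clips. A toy sketch of such a schedule (the paper's actual long and short cycles differ in detail):

```python
# base (fine-grid) settings: 16-frame clips at 224x224, batch of 8 per GPU
base_t, base_s, base_batch = 16, 224, 8

# long cycle of (temporal scale, spatial scale): coarser grids early on
long_cycle = [(0.25, 0.5), (0.5, 0.5), (0.5, 1.0), (1.0, 1.0)]

epochs = 8
for epoch in range(epochs):
    t_scale, s_scale = long_cycle[epoch * len(long_cycle) // epochs]
    t, s = int(base_t * t_scale), int(base_s * s_scale)
    # per-iteration compute ~ batch * t * s^2, so scale the batch inversely
    batch = int(base_batch / (t_scale * s_scale ** 2))
    print(f"epoch {epoch}: clips {t}x{s}x{s}, batch {batch}")
```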

Learning Temporal Pose Estimation from Sparsely-Labeled Videos

3 code implementations NeurIPS 2019 Gedas Bertasius, Christoph Feichtenhofer, Du Tran, Jianbo Shi, Lorenzo Torresani

To reduce the need for dense annotations, we propose a PoseWarper network that leverages training videos with sparse annotations (every k frames) to learn to perform dense temporal pose propagation and estimation.

Ranked #2 on Multi-Person Pose Estimation on PoseTrack2018 (using extra training data)

Multi-Person Pose Estimation Optical Flow Estimation

Grounded Human-Object Interaction Hotspots from Video (Extended Abstract)

no code implementations • 3 Jun 2019 • Tushar Nagarajan, Christoph Feichtenhofer, Kristen Grauman

Learning how to interact with objects is an important step towards embodied visual intelligence, but existing techniques suffer from heavy supervision or sensing requirements.

Human-Object Interaction Detection Object +1

Modeling Human Motion with Quaternion-based Neural Networks

1 code implementation • 21 Jan 2019 • Dario Pavllo, Christoph Feichtenhofer, Michael Auli, David Grangier

Previous work on predicting or generating 3D human pose sequences regresses either joint rotations or joint positions.

Grounded Human-Object Interaction Hotspots from Video

1 code implementation ICCV 2019 Tushar Nagarajan, Christoph Feichtenhofer, Kristen Grauman

Learning how to interact with objects is an important step towards embodied visual intelligence, but existing techniques suffer from heavy supervision or sensing requirements.

Human-Object Interaction Detection Object +3

What have we learned from deep representations for action recognition?

no code implementations CVPR 2018 Christoph Feichtenhofer, Axel Pinz, Richard P. Wildes, Andrew Zisserman

In this paper, we shed light on deep spatiotemporal representations by visualizing what two-stream models have learned in order to recognize actions in video.

Action Recognition Temporal Action Localization

Detect to Track and Track to Detect

3 code implementations ICCV 2017 Christoph Feichtenhofer, Axel Pinz, Andrew Zisserman

Recent approaches for high accuracy detection and tracking of object categories in video consist of complex multistage solutions that become more cumbersome each year.

Object object-detection +1

Spatiotemporal Multiplier Networks for Video Action Recognition

1 code implementation CVPR 2017 Christoph Feichtenhofer, Axel Pinz, Richard P. Wildes

This paper presents a general ConvNet architecture for video action recognition based on multiplicative interactions of spacetime features.

Action Recognition General Classification +1

Temporal Residual Networks for Dynamic Scene Recognition

1 code implementation CVPR 2017 Christoph Feichtenhofer, Axel Pinz, Richard P. Wildes

Finally, our temporal ResNet boosts recognition performance and establishes a new state-of-the-art on dynamic scene recognition, as well as on the complementary task of action recognition.

Action Recognition Scene Recognition +1

Convolutional Two-Stream Network Fusion for Video Action Recognition

1 code implementation CVPR 2016 Christoph Feichtenhofer, Axel Pinz, Andrew Zisserman

Recent applications of Convolutional Neural Networks (ConvNets) for human action recognition in videos have proposed different solutions for incorporating the appearance and motion information.

Ranked #60 on Action Recognition on UCF101 (using extra training data)

Action Recognition In Videos Temporal Action Localization +1

Dynamically Encoded Actions Based on Spacetime Saliency

no code implementations CVPR 2015 Christoph Feichtenhofer, Axel Pinz, Richard P. Wildes

By using the resulting definition of saliency during feature pooling we show that action recognition performance achieves state-of-the-art levels on three widely considered action recognition datasets.

Action Recognition Temporal Action Localization
