Many top-down architectures for instance segmentation achieve significant success when trained and tested on a pre-defined closed-world taxonomy.
Video understanding tasks take many forms, from action detection to visual query localization and spatio-temporal grounding of sentences.
From learned pairwise affinities (PA) we construct a large set of pseudo-ground-truth instance masks; combined with human-annotated instance masks, we train GGNs and significantly outperform the SOTA on open-world instance segmentation on various benchmarks including COCO, LVIS, ADE20K, and UVO.
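As a rough illustration of how pseudo masks might be derived from affinities, the sketch below thresholds a simplified per-pixel affinity map and groups pixels by connected components. The map shape, threshold, and `min_area` filter are illustrative assumptions, not the paper's actual grouping module.

```python
import numpy as np
from scipy import ndimage

def pseudo_masks_from_affinity(affinity, threshold=0.5, min_area=100):
    """Group pixels into pseudo-ground-truth instance masks.

    `affinity` is assumed to be an (H, W) map where each value summarizes a
    pixel's predicted affinity to its neighbors (a simplification of dense
    pairwise affinities). Pixels above `threshold` are grouped into connected
    components, and small components are discarded.
    """
    foreground = affinity > threshold
    labeled, num = ndimage.label(foreground)
    masks = []
    for i in range(1, num + 1):
        mask = labeled == i
        if mask.sum() >= min_area:
            masks.append(mask)
    return masks

# Toy usage: two blobs of high affinity become two pseudo masks.
aff = np.zeros((64, 64))
aff[5:20, 5:20] = 0.9
aff[40:60, 30:60] = 0.8
print(len(pseudo_masks_from_affinity(aff, min_area=50)))  # -> 2
```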
Our approach, named Long-Short Temporal Contrastive Learning (LSTCL), enables video transformers to learn an effective clip-level representation by predicting temporal context captured from a longer temporal extent.
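A minimal sketch of the underlying idea, assuming a SimCLR-style InfoNCE objective between embeddings of a short clip and a longer clip drawn from the same video; the paper's actual setup may differ (e.g., in using momentum encoders), and the function name is ours:

```python
import torch
import torch.nn.functional as F

def long_short_infonce(z_short, z_long, temperature=0.1):
    """InfoNCE loss pairing each short-clip embedding with the embedding of a
    longer clip from the same video: matching videos sit on the diagonal of
    the similarity matrix, other videos in the batch act as negatives."""
    z_short = F.normalize(z_short, dim=1)
    z_long = F.normalize(z_long, dim=1)
    logits = z_short @ z_long.t() / temperature  # (B, B) cosine similarities
    targets = torch.arange(z_short.size(0), device=z_short.device)
    return F.cross_entropy(logits, targets)

# Toy usage: random embeddings stand in for the video transformer's outputs.
loss = long_short_infonce(torch.randn(8, 128), torch.randn(8, 128))
print(loss.item())
```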
Current state-of-the-art object detection and segmentation methods work well under the closed-world assumption.
A majority of methods for video frame interpolation compute bidirectional optical flow between adjacent frames of a video, followed by a suitable warping algorithm to generate the output frames; a minimal warp-and-blend sketch follows the entry below.
Ranked #2 on Video Frame Interpolation on GoPro
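The sketch below shows the generic pipeline these methods share, assuming precomputed bidirectional flows and linear motion; real interpolators add occlusion reasoning and refinement networks, and both function names here are illustrative.

```python
import torch
import torch.nn.functional as F

def backward_warp(frame, flow):
    """Warp `frame` (N, C, H, W) by a dense flow field (N, 2, H, W), where
    flow[:, 0] is the horizontal and flow[:, 1] the vertical displacement in
    pixels, using bilinear sampling."""
    n, _, h, w = frame.shape
    ys, xs = torch.meshgrid(
        torch.arange(h, dtype=frame.dtype),
        torch.arange(w, dtype=frame.dtype),
        indexing="ij",
    )
    # Shift the sampling grid by the flow, then normalize to [-1, 1].
    x = (xs[None] + flow[:, 0]) / (w - 1) * 2 - 1
    y = (ys[None] + flow[:, 1]) / (h - 1) * 2 - 1
    grid = torch.stack((x, y), dim=-1)  # (N, H, W, 2) in (x, y) order
    return F.grid_sample(frame, grid, align_corners=True)

def interpolate_midpoint(f0, f1, flow_0to1, flow_1to0):
    """Synthesize the frame halfway between f0 and f1 by warping each input
    half-way along the flows and averaging; assumes linear motion and
    ignores occlusions, which real methods handle explicitly."""
    return 0.5 * (backward_warp(f0, 0.5 * flow_1to0) +
                  backward_warp(f1, 0.5 * flow_0to1))

f0, f1 = torch.rand(1, 3, 64, 64), torch.rand(1, 3, 64, 64)
flow = torch.zeros(1, 2, 64, 64)  # zero motion as a sanity check
print(interpolate_midpoint(f0, f1, flow, -flow).shape)  # (1, 3, 64, 64)
```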
To the best of our knowledge, XDC is the first self-supervised learning method that outperforms large-scale fully-supervised pretraining for action recognition on the same architecture.
FASTER aims to leverage the redundancy between neighboring clips and reduce the computational cost by learning to aggregate the predictions from models of different complexities (a toy aggregator is sketched below).
Ranked #23 on Action Recognition on UCF101
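A minimal sketch of learned clip aggregation, under the assumption that per-clip features from both a cheap and an expensive backbone have been projected to a common dimension and are fused by a recurrent model; the class name and the GRU choice are illustrative:

```python
import torch
import torch.nn as nn

class ClipAggregator(nn.Module):
    """Fuse per-clip feature vectors from models of different capacities
    into a single video-level prediction via a recurrent aggregator."""

    def __init__(self, feat_dim, num_classes):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, feat_dim, batch_first=True)
        self.fc = nn.Linear(feat_dim, num_classes)

    def forward(self, clip_feats):  # (B, num_clips, feat_dim)
        _, h = self.rnn(clip_feats)
        return self.fc(h[-1])       # classify from the final hidden state

# Toy usage: 8 clips per video; in the FASTER setting a few clips would
# come from an expensive model and the rest from a lightweight one.
feats = torch.randn(2, 8, 256)
print(ClipAggregator(256, 400)(feats).shape)  # torch.Size([2, 400])
```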
It consists of a shared 2D spatial convolution followed by two parallel point-wise convolutional layers, one devoted to images and the other to videos.
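Since the block is described concretely, here is a minimal PyTorch rendering of it; the module and argument names are ours, and video frames are assumed to be folded into the batch dimension before the 2D convolutions.

```python
import torch
import torch.nn as nn

class SharedSpatialBlock(nn.Module):
    """One 2D spatial convolution shared across modalities, followed by two
    parallel point-wise (1x1) convs: one for images, one for video frames."""

    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.spatial = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)
        self.point_image = nn.Conv2d(out_ch, out_ch, kernel_size=1)
        self.point_video = nn.Conv2d(out_ch, out_ch, kernel_size=1)

    def forward(self, x, modality="image"):
        x = self.spatial(x)  # weights shared between both modalities
        head = self.point_image if modality == "image" else self.point_video
        return head(x)

block = SharedSpatialBlock(3, 64)
img = torch.randn(2, 3, 224, 224)      # a batch of images
frames = torch.randn(2, 3, 224, 224)   # video frames folded into the batch
print(block(img, "image").shape, block(frames, "video").shape)
```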
To reduce the need for dense annotations, we propose a PoseWarper network that leverages training videos with sparse annotations (every k frames) to learn to perform dense temporal pose propagation and estimation.
Ranked #2 on Multi-Person Pose Estimation on PoseTrack2018 (using extra training data)
Consider end-to-end training of a multi-modal vs. a single-modal network on a task with multiple input modalities: the multi-modal network receives more information, so it should match or outperform its single-modal counterpart.
Ranked #1 on Action Recognition In Videos on miniSports
Second, since frame-based models already perform quite well on action recognition, is pre-training for good image features sufficient, or is pre-training for spatio-temporal features valuable for optimal transfer learning?
Ranked #2 on Egocentric Activity Recognition on EPIC-KITCHENS-55 (Actions Top-1 (S2) metric)
We demonstrate that the computational cost of action recognition on untrimmed videos can be dramatically reduced by invoking recognition only on these most salient clips; a selection sketch follows the entry below.
Ranked #1 on Action Recognition on miniSports
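A sketch of that select-then-recognize scheme, assuming a cheap saliency scorer and an expensive clip classifier; both callables here are stand-ins, not the paper's API.

```python
import torch

def classify_salient_clips(clips, sampler, recognizer, k=3):
    """Score every clip with a lightweight `sampler`, then run the expensive
    `recognizer` only on the top-k clips and average its predictions."""
    with torch.no_grad():
        saliency = sampler(clips)        # (num_clips,) cheap scores
    topk = saliency.topk(k).indices
    preds = recognizer(clips[topk])      # (k, num_classes) costly pass
    return preds.mean(dim=0)

# Toy usage with stand-in models.
clips = torch.randn(16, 3, 8, 112, 112)                   # 16 clips of 8 frames
sampler = lambda c: c.abs().mean(dim=(1, 2, 3, 4))        # fake saliency scores
recognizer = lambda c: torch.randn(c.shape[0], 400)       # fake classifier
print(classify_salient_clips(clips, sampler, recognizer).shape)  # (400,)
```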
It is natural to ask: 1) whether group convolution can help alleviate the high computational cost of video classification networks; 2) what factors matter the most in 3D group convolutional networks; and 3) what are good computation/accuracy trade-offs with 3D group convolutional networks. A grouped, channel-separated block illustrating these trade-offs is sketched below.
Ranked #1 on Action Recognition on Sports-1M
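One common answer to question 1), sketched under our own naming: channel separation, i.e., pushing the spatiotemporal 3x3x3 convolution into a depthwise (fully grouped) form preceded by a point-wise channel-mixing convolution.

```python
import torch
import torch.nn as nn

class ChannelSeparated3DBlock(nn.Module):
    """A point-wise 1x1x1 conv mixes channels; a depthwise 3x3x3 conv
    (groups == channels) handles spatiotemporal structure far more cheaply
    than a full 3D convolution."""

    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.pointwise = nn.Conv3d(in_ch, out_ch, kernel_size=1)
        self.depthwise = nn.Conv3d(
            out_ch, out_ch, kernel_size=3, padding=1, groups=out_ch
        )

    def forward(self, x):  # x: (B, C, T, H, W)
        return self.depthwise(self.pointwise(x))

x = torch.randn(2, 64, 8, 56, 56)
print(ChannelSeparated3DBlock(64, 128)(x).shape)  # (2, 128, 8, 56, 56)
```

The depthwise convolution costs roughly 1/C of a full 3D convolution's multiply-adds at the same kernel size, which is where the computation/accuracy trade-off comes from.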
In this work, we propose an alternative approach to learning video representations that requires no semantically labeled videos and instead leverages the years of effort in collecting and labeling large and clean still-image datasets.
Ranked #70 on Action Recognition on HMDB-51 (using extra training data)
Our network learns to spatially sample features from Frame B in order to maximize pose detection accuracy in Frame A.
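A simplified sketch of such learned spatial sampling: predict a per-pixel offset field from both frames' features and bilinearly resample Frame B's features (the published model uses deformable convolutions; this `grid_sample`-based version and its names are our simplification).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureWarper(nn.Module):
    """Predict a 2-channel offset field from the concatenated features of
    Frames A and B, then bilinearly sample Frame B's features at the
    offset locations."""

    def __init__(self, ch):
        super().__init__()
        self.offset = nn.Conv2d(2 * ch, 2, kernel_size=3, padding=1)

    def forward(self, feat_a, feat_b):  # both (N, C, H, W)
        n, _, h, w = feat_a.shape
        flow = self.offset(torch.cat([feat_a, feat_b], dim=1))
        ys, xs = torch.meshgrid(
            torch.arange(h, dtype=flow.dtype),
            torch.arange(w, dtype=flow.dtype),
            indexing="ij",
        )
        # Normalize offset sampling locations to [-1, 1] for grid_sample.
        x = (xs[None] + flow[:, 0]) / (w - 1) * 2 - 1
        y = (ys[None] + flow[:, 1]) / (h - 1) * 2 - 1
        grid = torch.stack((x, y), dim=-1)
        return F.grid_sample(feat_b, grid, align_corners=True)

fa, fb = torch.randn(1, 32, 24, 24), torch.randn(1, 32, 24, 24)
print(FeatureWarper(32)(fa, fb).shape)  # torch.Size([1, 32, 24, 24])
```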
The videos retrieved by the search engines are then verified for correctness by human annotators.
There is a natural correlation between the visual and auditory elements of a video.
Ranked #7 on Self-Supervised Audio Classification on ESC-50
This paper addresses the problem of estimating and tracking human body keypoints in complex, multi-person video.
Ranked #7 on Keypoint Detection on COCO test-challenge
In this paper we discuss several forms of spatiotemporal convolutions for video analysis and study their effects on action recognition (one factorized form is sketched below).
Ranked #3 on Action Recognition on Sports-1M
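One of the studied forms factorizes each 3D convolution into a 2D spatial convolution followed by a 1D temporal one, with a nonlinearity in between; a minimal sketch with illustrative layer widths:

```python
import torch
import torch.nn as nn

class Conv2Plus1D(nn.Module):
    """A '(2+1)D' block: a full t x k x k 3D convolution is replaced by a
    2D spatial conv (1, k, k) followed by a 1D temporal conv (t, 1, 1)."""

    def __init__(self, in_ch, out_ch, mid_ch=None):
        super().__init__()
        mid_ch = mid_ch or out_ch
        self.spatial = nn.Conv3d(in_ch, mid_ch, (1, 3, 3), padding=(0, 1, 1))
        self.temporal = nn.Conv3d(mid_ch, out_ch, (3, 1, 1), padding=(1, 0, 0))
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):  # (B, C, T, H, W)
        return self.temporal(self.relu(self.spatial(x)))

x = torch.randn(2, 3, 16, 112, 112)
print(Conv2Plus1D(3, 64)(x).shape)  # (2, 64, 16, 112, 112)
```

Choosing `mid_ch` to match the parameter count of the full 3D kernel keeps the comparison fair while adding an extra nonlinearity.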
Learning image representations with ConvNets by pre-training on ImageNet has proven useful across many visual understanding tasks including object detection, semantic segmentation, and image captioning.
Ranked #69 on Action Recognition on HMDB-51
Language has been exploited to sidestep the problem of defining video categories, by formulating video understanding as the task of captioning or description.
Over the last few years, deep learning methods have emerged as one of the most prominent approaches for video analysis.
We propose a simple, yet effective approach for spatiotemporal feature learning using deep 3-dimensional convolutional networks (3D ConvNets) trained on a large-scale supervised video dataset (a toy variant is sketched below).
Ranked #8 on Action Recognition on Sports-1M
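For orientation, a toy homogeneous 3D ConvNet in this spirit, using uniformly small 3x3x3 kernels throughout; the layer sizes are illustrative and not the published architecture.

```python
import torch
import torch.nn as nn

# Minimal homogeneous 3D ConvNet: small 3x3x3 kernels in every conv layer,
# ending in global pooling and a linear classifier.
c3d_like = nn.Sequential(
    nn.Conv3d(3, 64, 3, padding=1), nn.ReLU(),
    nn.MaxPool3d((1, 2, 2)),               # pool space only in the first stage
    nn.Conv3d(64, 128, 3, padding=1), nn.ReLU(),
    nn.MaxPool3d(2),                       # then pool space and time
    nn.AdaptiveAvgPool3d(1), nn.Flatten(),
    nn.Linear(128, 487),                   # e.g., Sports-1M has 487 classes
)
print(c3d_like(torch.randn(2, 3, 16, 112, 112)).shape)  # (2, 487)
```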
We show the generality of our approach by building our mid-level descriptors from two different low-level feature representations.