Search Results for author: Du Tran

Found 30 papers, 11 papers with code

Learning Space-Time Semantic Correspondences

no code implementations 16 Jun 2023 Du Tran, Jitendra Malik

We propose a new task of space-time semantic correspondence prediction in videos.

Imitation Learning Semantic correspondence +1

Open-world Instance Segmentation: Top-down Learning with Bottom-up Supervision

no code implementations 9 Mar 2023 Tarun Kalluri, Weiyao Wang, Heng Wang, Manmohan Chandraker, Lorenzo Torresani, Du Tran

Many top-down architectures for instance segmentation achieve significant success when trained and tested on a pre-defined closed-world taxonomy.

Open-World Instance Segmentation Segmentation +1

MINOTAUR: Multi-task Video Grounding From Multimodal Queries

no code implementations 16 Feb 2023 Raghav Goyal, Effrosyni Mavroudi, Xitong Yang, Sainbayar Sukhbaatar, Leonid Sigal, Matt Feiszli, Lorenzo Torresani, Du Tran

Video understanding tasks take many forms, from action detection to visual query localization and spatio-temporal grounding of sentences.

Action Detection Sentence +2

Open-World Instance Segmentation: Exploiting Pseudo Ground Truth From Learned Pairwise Affinity

1 code implementation CVPR 2022 Weiyao Wang, Matt Feiszli, Heng Wang, Jitendra Malik, Du Tran

From pairwise affinity (PA) we construct a large set of pseudo-ground-truth instance masks; combined with human-annotated instance masks we train GGNs and significantly outperform the SOTA on open-world instance segmentation on various benchmarks including COCO, LVIS, ADE20K, and UVO.

Diversity Open-World Instance Segmentation +1

Long-Short Temporal Contrastive Learning of Video Transformers

no code implementations CVPR 2022 Jue Wang, Gedas Bertasius, Du Tran, Lorenzo Torresani

Our approach, named Long-Short Temporal Contrastive Learning (LSTCL), enables video transformers to learn an effective clip-level representation by predicting temporal context captured from a longer temporal extent.

Action Recognition Contrastive Learning +1

FLAVR: Flow-Agnostic Video Representations for Fast Frame Interpolation

1 code implementation 15 Dec 2020 Tarun Kalluri, Deepak Pathak, Manmohan Chandraker, Du Tran

A majority of methods for video frame interpolation compute bidirectional optical flow between adjacent frames of a video, followed by a suitable warping algorithm to generate the output frames.

Action Recognition Motion Magnification +2

Self-Supervised Learning by Cross-Modal Audio-Video Clustering

1 code implementation NeurIPS 2020 Humam Alwassel, Dhruv Mahajan, Bruno Korbar, Lorenzo Torresani, Bernard Ghanem, Du Tran

To the best of our knowledge, XDC is the first self-supervised learning method that outperforms large-scale fully-supervised pretraining for action recognition on the same architecture.

Audio Classification Clustering +5

FASTER Recurrent Networks for Efficient Video Classification

no code implementations 10 Jun 2019 Linchao Zhu, Laura Sevilla-Lara, Du Tran, Matt Feiszli, Yi Yang, Heng Wang

FASTER aims to leverage the redundancy between neighboring clips and reduce the computational cost by learning to aggregate the predictions from models of different complexities.

Action Classification Action Recognition +3

UniDual: A Unified Model for Image and Video Understanding

no code implementations 10 Jun 2019 Yufei Wang, Du Tran, Lorenzo Torresani

It consists of a shared 2D spatial convolution followed by two parallel point-wise convolutional layers, one devoted to images and the other one used for videos.

Multi-Task Learning Video Understanding

Learning Temporal Pose Estimation from Sparsely-Labeled Videos

3 code implementations NeurIPS 2019 Gedas Bertasius, Christoph Feichtenhofer, Du Tran, Jianbo Shi, Lorenzo Torresani

To reduce the need for dense annotations, we propose a PoseWarper network that leverages training videos with sparse annotations (every k frames) to learn to perform dense temporal pose propagation and estimation.

Ranked #2 on Multi-Person Pose Estimation on PoseTrack2018 (using extra training data)

Multi-Person Pose Estimation Optical Flow Estimation

What Makes Training Multi-Modal Classification Networks Hard?

3 code implementations CVPR 2020 Wei-Yao Wang, Du Tran, Matt Feiszli

Consider end-to-end training of a multi-modal vs. a single-modal network on a task with multiple input modalities: the multi-modal network receives more information, so it should match or outperform its single-modal counterpart.

Action Classification Action Recognition In Videos +4

Large-scale weakly-supervised pre-training for video action recognition

3 code implementations CVPR 2019 Deepti Ghadiyaram, Matt Feiszli, Du Tran, Xueting Yan, Heng Wang, Dhruv Mahajan

Second, frame-based models perform quite well on action recognition; is pre-training for good image features sufficient or is pre-training for spatio-temporal features valuable for optimal transfer learning?

Ranked #2 on Egocentric Activity Recognition on EPIC-KITCHENS-55 (Actions Top-1 (S2) metric)

Action Classification Action Recognition +3

SCSampler: Sampling Salient Clips from Video for Efficient Action Recognition

no code implementations ICCV 2019 Bruno Korbar, Du Tran, Lorenzo Torresani

We demonstrate that the computational cost of action recognition on untrimmed videos can be dramatically reduced by invoking recognition only on these most salient clips.

Action Recognition Temporal Action Localization

Video Classification with Channel-Separated Convolutional Networks

7 code implementations ICCV 2019 Du Tran, Heng Wang, Lorenzo Torresani, Matt Feiszli

It is natural to ask: 1) if group convolution can help to alleviate the high computational cost of video classification networks; 2) what factors matter the most in 3D group convolutional networks; and 3) what are good computation/accuracy trade-offs with 3D group convolutional networks.

Action Classification Action Recognition +3

DistInit: Learning Video Representations Without a Single Labeled Video

no code implementations ICCV 2019 Rohit Girdhar, Du Tran, Lorenzo Torresani, Deva Ramanan

In this work, we propose an alternative approach to learning video representations that requires no semantically labeled videos and instead leverages the years of effort in collecting and labeling large and clean still-image datasets.

Ranked #71 on Action Recognition on HMDB-51 (using extra training data)

Action Recognition Temporal Action Localization +1

Detect-and-Track: Efficient Pose Estimation in Videos

1 code implementation CVPR 2018 Rohit Girdhar, Georgia Gkioxari, Lorenzo Torresani, Manohar Paluri, Du Tran

This paper addresses the problem of estimating and tracking human body keypoints in complex, multi-person video.

Ranked #8 on Pose Tracking on PoseTrack2017 (using extra training data)

Human Detection Keypoint Estimation +4

ConvNet Architecture Search for Spatiotemporal Feature Learning

1 code implementation 16 Aug 2017 Du Tran, Jamie Ray, Zheng Shou, Shih-Fu Chang, Manohar Paluri

Learning image representations with ConvNets by pre-training on ImageNet has proven useful across many visual understanding tasks including object detection, semantic segmentation, and image captioning.

Action Classification Action Recognition +5

Transformation-Based Models of Video Sequences

no code implementations 29 Jan 2017 Joost van Amersfoort, Anitha Kannan, Marc'Aurelio Ranzato, Arthur Szlam, Du Tran, Soumith Chintala

In this work we propose a simple unsupervised approach for next frame prediction in video.

VideoMCC: a New Benchmark for Video Comprehension

no code implementations 23 Jun 2016 Du Tran, Maksim Bolonkin, Manohar Paluri, Lorenzo Torresani

Language has been exploited to sidestep the problem of defining video categories, by formulating video understanding as the task of captioning or description.

Multiple-choice Video Description +1

Deep End2End Voxel2Voxel Prediction

no code implementations 20 Nov 2015 Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, Manohar Paluri

Over the last few years deep learning methods have emerged as one of the most prominent approaches for video analysis.

Neural Architecture Search Optical Flow Estimation +3

Learning Spatiotemporal Features with 3D Convolutional Networks

29 code implementations ICCV 2015 Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, Manohar Paluri

We propose a simple, yet effective approach for spatiotemporal feature learning using deep 3-dimensional convolutional networks (3D ConvNets) trained on a large scale supervised video dataset.

Action Recognition In Videos Dynamic Facial Expression Recognition

EXMOVES: Classifier-based Features for Scalable Action Recognition

no code implementations 20 Dec 2013 Du Tran, Lorenzo Torresani

We show the generality of our approach by building our mid-level descriptors from two different low-level feature representations.

Action Recognition General Classification +1
