no code implementations • 1 Apr 2022 • Arsha Nagrani, Paul Hongsuck Seo, Bryan Seybold, Anja Hauth, Santiago Manen, Chen Sun, Cordelia Schmid
To close this gap, we propose a new video mining pipeline that transfers captions from image captioning datasets to video clips with no additional manual effort.
no code implementations • 31 Mar 2022 • Tian Yun, Usha Bhalla, Ellie Pavlick, Chen Sun
Our study reveals that state-of-the-art VL pretrained models learn primitive concepts that are highly useful as visual descriptors, as demonstrated by their strong performance on fine-grained visual recognition tasks. However, these concepts struggle to provide interpretable compositional derivations, which highlights a limitation of existing VL models.
no code implementations • 31 Jan 2022 • Nikolay Malkin, Moksh Jain, Emmanuel Bengio, Chen Sun, Yoshua Bengio
Generative Flow Networks (GFlowNets) are a method for learning a stochastic policy that generates compositional objects, such as graphs or strings, from a given unnormalized density through sequences of actions, where many different action sequences may lead to the same object.
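As a concrete illustration of this family of methods, here is a minimal sketch of training a GFlowNet with a trajectory-balance-style objective on a toy environment; the environment, reward, and network sizes are illustrative assumptions, not the paper's setup.

```python
# Minimal sketch (not the paper's code): a GFlowNet whose policy builds a
# binary string one bit at a time, trained with a trajectory-balance-style
# loss against a toy unnormalized density (the reward).
import torch
import torch.nn as nn

SEQ_LEN = 4  # objects are binary strings of this length

def reward(x: torch.Tensor) -> torch.Tensor:
    # Hypothetical unnormalized density: favor strings with many ones.
    return x.sum(-1) + 0.1

class Policy(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(SEQ_LEN, 64), nn.ReLU(), nn.Linear(64, 2))
        self.log_z = nn.Parameter(torch.zeros(()))  # learned log partition function

    def forward(self, state):
        return self.net(state).log_softmax(-1)  # log-probs over actions {0, 1}

policy = Policy()
opt = torch.optim.Adam(policy.parameters(), lr=1e-2)

for step in range(500):
    state = torch.zeros(SEQ_LEN)
    log_pf, bits = torch.zeros(()), []
    for t in range(SEQ_LEN):
        logp = policy(state)
        a = torch.distributions.Categorical(logits=logp).sample()
        log_pf = log_pf + logp[a]
        bits.append(a)
        state = state.clone()
        state[t] = a.float()
    x = torch.stack(bits).float()
    # Trajectory balance: (log Z + log P_F(trajectory) - log R(x) - log P_B)^2.
    # Each string here has exactly one construction order, so log P_B = 0.
    loss = (policy.log_z + log_pf - reward(x).log()) ** 2
    opt.zero_grad(); loss.backward(); opt.step()
```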
1 code implementation • 12 Jan 2022 • Shen Yan, Xuehan Xiong, Anurag Arnab, Zhichao Lu, Mi Zhang, Chen Sun, Cordelia Schmid
Video understanding requires reasoning at multiple spatiotemporal resolutions, from short fine-grained motions to events taking place over longer durations.
Ranked #1 on Action Classification on Kinetics-400 (using extra training data)
no code implementations • 1 Nov 2021 • Valentin Gabeur, Arsha Nagrani, Chen Sun, Karteek Alahari, Cordelia Schmid
Our proposal is to pre-train a video encoder using all the available video modalities as supervision, namely, appearance, sound, and transcribed speech.
1 code implementation • Findings (EMNLP) 2021 • Tian Yun, Chen Sun, Ellie Pavlick
Linguistic representations derived from text alone have been criticized for their lack of grounding, i.e., connecting words to their meanings in the physical world.
1 code implementation • ICCV 2021 • Junru Gu, Chen Sun, Hang Zhao
In this work, we propose an anchor-free and end-to-end trajectory prediction model, named DenseTNT, that directly outputs a set of trajectories from dense goal candidates.
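As a rough sketch of the dense-goal idea (shapes and modules below are placeholders, not DenseTNT itself), one can score a dense grid of goal candidates from a scene feature, keep the top-scoring goals, and regress a trajectory per goal:

```python
# Minimal sketch (assumed shapes): dense goal scoring + per-goal trajectory
# regression; the real model uses learned scene encoders, not random features.
import torch
import torch.nn as nn

dim, horizon = 64, 12
goal_scorer = nn.Linear(dim + 2, 1)             # scene feature + goal (x, y)
traj_decoder = nn.Linear(dim + 2, horizon * 2)  # one trajectory per goal

scene = torch.randn(1, dim)                     # pooled scene/agent context
xs = torch.linspace(-50, 50, 20)
goals = torch.stack(torch.meshgrid(xs, xs, indexing="ij"), dim=-1).reshape(-1, 2)

inp = torch.cat([scene.expand(goals.shape[0], -1), goals], dim=-1)
scores = goal_scorer(inp).squeeze(-1).softmax(-1)  # probability per candidate
top = scores.topk(6).indices                       # small set of output goals
trajs = traj_decoder(inp[top]).reshape(-1, horizon, 2)
```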
no code implementations • NeurIPS 2021 • Dianbo Liu, Alex Lamb, Kenji Kawaguchi, Anirudh Goyal, Chen Sun, Michael Curtis Mozer, Yoshua Bengio
Deep learning has advanced from fully connected architectures to structured models organized into components, e.g., the transformer composed of positional elements, modular architectures divided into slots, and graph neural nets made up of nodes.
no code implementations • NeurIPS 2021 • Arsha Nagrani, Shan Yang, Anurag Arnab, Aren Jansen, Cordelia Schmid, Chen Sun
Humans perceive the world by concurrently processing and fusing high-dimensional inputs from multiple modalities such as vision and audio.
Ranked #1 on Audio Classification on VGGSound
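A minimal sketch of the bottleneck idea (layer counts and dimensions are assumptions, not the paper's architecture): the two modalities exchange information only through a few shared latent tokens.

```python
# Minimal sketch: video and audio token streams fused through shared
# "bottleneck" tokens; cross-modal attention is restricted to these tokens.
import torch
import torch.nn as nn

class BottleneckFusion(nn.Module):
    def __init__(self, dim=256, heads=4, n_bottleneck=4):
        super().__init__()
        self.bottleneck = nn.Parameter(torch.randn(n_bottleneck, dim))
        self.attn_v = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.attn_a = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, video_tokens, audio_tokens):
        b, nb = video_tokens.shape[0], self.bottleneck.shape[0]
        z = self.bottleneck.unsqueeze(0).expand(b, -1, -1)
        # Video stream self-attends together with the bottleneck tokens...
        v_in = torch.cat([video_tokens, z], dim=1)
        v_out, _ = self.attn_v(v_in, v_in, v_in)
        # ...then the updated bottleneck tokens carry video context to audio.
        a_in = torch.cat([audio_tokens, v_out[:, -nb:]], dim=1)
        a_out, _ = self.attn_a(a_in, a_in, a_in)
        return v_out[:, :-nb], a_out[:, :-nb]

fusion = BottleneckFusion()
v_feat, a_feat = fusion(torch.randn(2, 32, 256), torch.randn(2, 16, 256))
```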
no code implementations • CVPR 2021 • Lu Mi, Hang Zhao, Charlie Nash, Xiaohan Jin, Jiyang Gao, Chen Sun, Cordelia Schmid, Nir Shavit, Yuning Chai, Dragomir Anguelov
To address this issue, we introduce a new challenging task to generate HD maps.
1 code implementation • ICCV 2021 • Alexander Pashevich, Cordelia Schmid, Chen Sun
We demonstrate that encoding the history with a transformer is critical to solve compositional tasks, and that pretraining and joint training with synthetic instructions further improve the performance.
1 code implementation • 6 Apr 2021 • Jack Valmadre, Alex Bewley, Jonathan Huang, Chen Sun, Cristian Sminchisescu, Cordelia Schmid
This paper introduces temporally local metrics for Multi-Object Tracking.
no code implementations • ICCV 2021 • Chen Sun, Arsha Nagrani, Yonglong Tian, Cordelia Schmid
We focus on contrastive methods for self-supervised video representation learning.
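A common objective in this family is InfoNCE over clip embeddings; a minimal sketch (not the paper's exact formulation), where two clips from the same video form a positive pair and clips from other videos serve as negatives:

```python
# Minimal sketch: InfoNCE contrastive loss between two batches of clip
# embeddings; row i of z1 and row i of z2 come from the same video.
import torch
import torch.nn.functional as F

def infonce(z1, z2, temperature=0.07):
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / temperature   # pairwise cosine similarities
    labels = torch.arange(z1.shape[0])   # positives sit on the diagonal
    return F.cross_entropy(logits, labels)

loss = infonce(torch.randn(8, 128), torch.randn(8, 128))
```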
no code implementations • ICCV 2021 • Anurag Arnab, Chen Sun, Cordelia Schmid
Accurate video understanding involves reasoning about the relationships between actors, objects and their environment, often over long temporal intervals.
4 code implementations • ICCV 2021 • Anurag Arnab, Mostafa Dehghani, Georg Heigold, Chen Sun, Mario Lučić, Cordelia Schmid
We present pure-transformer based models for video classification, drawing upon the recent success of such models in image classification.
Ranked #6 on Action Classification on Moments in Time (Top 5 Accuracy metric, using extra training data)
no code implementations • ICCV 2021 • Dave Epstein, Jiajun Wu, Cordelia Schmid, Chen Sun
Learning to model how the world changes as time elapses has proven a challenging problem for the computer vision community.
2 code implementations • 19 Aug 2020 • Hang Zhao, Jiyang Gao, Tian Lan, Chen Sun, Benjamin Sapp, Balakrishnan Varadarajan, Yue Shen, Yi Shen, Yuning Chai, Cordelia Schmid, Congcong Li, Dragomir Anguelov
Our key insight is that for prediction within a moderate time horizon, the future modes can be effectively captured by a set of target states.
1 code implementation • 3 Aug 2020 • Samuel Albanie, Yang Liu, Arsha Nagrani, Antoine Miech, Ernesto Coto, Ivan Laptev, Rahul Sukthankar, Bernard Ghanem, Andrew Zisserman, Valentin Gabeur, Chen Sun, Karteek Alahari, Cordelia Schmid, Shi-Zhe Chen, Yida Zhao, Qin Jin, Kaixu Cui, Hui Liu, Chen Wang, Yudong Jiang, Xiaoshuai Hao
This report summarizes the results of the first edition of the challenge together with the findings of the participants.
no code implementations • 29 Jul 2020 • Jonathan C. Stroud, Zhichao Lu, Chen Sun, Jia Deng, Rahul Sukthankar, Cordelia Schmid, David A. Ross
Based on this observation, we propose to use text as a method for learning video representations.
2 code implementations • ECCV 2020 • Valentin Gabeur, Chen Sun, Karteek Alahari, Cordelia Schmid
In this paper, we present a multi-modal transformer to jointly encode the different modalities in video, which allows each of them to attend to the others.
Ranked #6 on Video Retrieval on ActivityNet (using extra training data)
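A minimal sketch of the joint-encoding idea (the modality set, feature shapes, and pooling are assumptions): each modality's features receive a modality embedding, and all tokens are processed by one transformer so every token can attend across modalities.

```python
# Minimal sketch: concatenate per-modality feature tokens, tag them with
# modality embeddings, and encode jointly with a single transformer.
import torch
import torch.nn as nn

dim = 256
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True),
    num_layers=2,
)
modality_emb = nn.Embedding(3, dim)   # appearance, audio, speech

appearance = torch.randn(2, 30, dim)  # hypothetical pre-extracted features
audio = torch.randn(2, 30, dim)
speech = torch.randn(2, 10, dim)

tokens = torch.cat([
    appearance + modality_emb.weight[0],
    audio + modality_emb.weight[1],
    speech + modality_emb.weight[2],
], dim=1)
video_repr = encoder(tokens).mean(dim=1)  # pooled cross-modal video embedding
```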
no code implementations • ECCV 2020 • Anurag Arnab, Chen Sun, Arsha Nagrani, Cordelia Schmid
Despite the recent advances in video classification, progress in spatio-temporal action recognition has lagged behind.
1 code implementation • NeurIPS 2020 • Yonglong Tian, Chen Sun, Ben Poole, Dilip Krishnan, Cordelia Schmid, Phillip Isola
Contrastive learning between multiple views of the data has recently achieved state-of-the-art performance in the field of self-supervised representation learning.
Ranked #50 on Self-Supervised Image Classification on ImageNet
5 code implementations • CVPR 2020 • Jiyang Gao, Chen Sun, Hang Zhao, Yi Shen, Dragomir Anguelov, Congcong Li, Cordelia Schmid
Behavior prediction in dynamic, multi-agent systems is an important problem in the context of self-driving cars, due to the complex representations and interactions of road components, including moving agents (e.g., pedestrians and vehicles) and road context information (e.g., lanes, traffic lights).
no code implementations • CVPR 2020 • Arsha Nagrani, Chen Sun, David Ross, Rahul Sukthankar, Cordelia Schmid, Andrew Zisserman
We train a BERT-based Speech2Action classifier on over a thousand movie screenplays, to predict action labels from transcribed speech segments.
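A minimal sketch of such a classifier using Hugging Face Transformers; the label set, utterances, and hyperparameters below are hypothetical, not taken from the paper's screenplay data.

```python
# Minimal sketch: fine-tune BERT to map transcribed speech to action labels.
import torch
from transformers import BertTokenizer, BertForSequenceClassification

ACTIONS = ["open door", "answer phone", "drive", "run"]  # hypothetical labels
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=len(ACTIONS)
)

batch = tokenizer(["pick up, it's me", "get in, quick!"],
                  padding=True, return_tensors="pt")
labels = torch.tensor([1, 2])  # "answer phone", "drive"
out = model(**batch, labels=labels)
out.loss.backward()  # plug in any optimizer to fine-tune from here
```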
1 code implementation • CONLL 2019 • Yanjun Gao, Chen Sun, Rebecca J. Passonneau
Pyramid evaluation was developed to assess the content of paragraph length summaries of source texts.
1 code implementation • NeurIPS 2019 • Matthias Minderer, Chen Sun, Ruben Villegas, Forrester Cole, Kevin Murphy, Honglak Lee
Extracting and predicting object structure and dynamics from videos without supervision is a major challenge in machine learning.
Ranked #11 on Video Prediction on KTH
no code implementations • 13 Jun 2019 • Chen Sun, Fabien Baradel, Kevin Murphy, Cordelia Schmid
This paper proposes a self-supervised learning approach for video features that results in significantly improved performance on downstream tasks (such as video classification, captioning and segmentation) compared to existing methods.
no code implementations • ICLR 2019 • Zhenjia Xu, Zhijian Liu, Chen Sun, Kevin Murphy, William T. Freeman, Joshua B. Tenenbaum, Jiajun Wu
Humans easily recognize object parts and their hierarchical structure by watching how they move; they can then predict how each part moves in the future.
no code implementations • ICLR 2019 • Chen Sun, Per Karlsson, Jiajun Wu, Joshua B. Tenenbaum, Kevin Murphy
We present a method that learns to integrate temporal information, from a learned dynamics model, with ambiguous visual information, from a learned vision model, in the context of interacting agents.
no code implementations • CVPR 2019 • Chen Sun, Abhinav Shrivastava, Carl Vondrick, Rahul Sukthankar, Kevin Murphy, Cordelia Schmid
This paper focuses on multi-person action forecasting in videos.
3 code implementations • ICCV 2019 • Chen Sun, Austin Myers, Carl Vondrick, Kevin Murphy, Cordelia Schmid
Self-supervised learning has become increasingly important to leverage the abundance of unlabeled data available on platforms like YouTube.
Ranked #1 on Action Classification on YouCook2
no code implementations • 20 Mar 2019 • Chen Sun, Jean M. Uwabeza Vianney, Dongpu Cao
Our results indicate that this method could act as a cheaper way to collect training data for autonomous driving.
no code implementations • 12 Mar 2019 • Zhenjia Xu, Zhijian Liu, Chen Sun, Kevin Murphy, William T. Freeman, Joshua B. Tenenbaum, Jiajun Wu
Humans easily recognize object parts and their hierarchical structure by watching how they move; they can then predict how each part moves in the future.
no code implementations • 25 Feb 2019 • Chen Sun, Per Karlsson, Jiajun Wu, Joshua B. Tenenbaum, Kevin Murphy
We present a method that learns to integrate temporal information, from a learned dynamics model, with ambiguous visual information, from a learned vision model, in the context of interacting agents.
no code implementations • 13 Feb 2019 • Chen Sun, Ye Tian, Liang Gao, Yishuai Niu, Tianlong Zhang, Hua Li, Yuqing Zhang, Zengqi Yue, Nicole Delepine-Gilon, Jin Yu
Machine learning has been used to develop the model.
1 code implementation • 19 Dec 2018 • Jonathan C. Stroud, David A. Ross, Chen Sun, Jia Deng, Rahul Sukthankar
State-of-the-art methods for video action recognition commonly use an ensemble of two networks: the spatial stream, which takes RGB frames as input, and the temporal stream, which takes optical flow as input.
Ranked #10 on Action Recognition on AVA v2.1
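For context, a minimal sketch of the two-stream ensemble the abstract describes, with tiny 3D-CNN stand-ins for the real backbones:

```python
# Minimal sketch: spatial stream over RGB, temporal stream over optical
# flow, fused by averaging class probabilities (late fusion).
import torch
import torch.nn as nn

num_classes = 10

def make_stream(in_channels):
    return nn.Sequential(
        nn.Conv3d(in_channels, 16, kernel_size=3, padding=1),
        nn.ReLU(),
        nn.AdaptiveAvgPool3d(1),
        nn.Flatten(),
        nn.Linear(16, num_classes),
    )

spatial = make_stream(in_channels=3)    # RGB frames
temporal = make_stream(in_channels=2)   # optical flow (x, y components)

rgb = torch.randn(2, 3, 8, 112, 112)    # batch, channels, frames, H, W
flow = torch.randn(2, 2, 8, 112, 112)
scores = (spatial(rgb).softmax(-1) + temporal(flow).softmax(-1)) / 2
```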
4 code implementations • CVPR 2019 • Nam Vo, Lu Jiang, Chen Sun, Kevin Murphy, Li-Jia Li, Li Fei-Fei, James Hays
In this paper, we study the task of image retrieval, where the input query is specified in the form of an image plus some text that describes desired modifications to the input image.
Ranked #2 on Image Retrieval with Multi-Modal Query on MIT-States
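A minimal sketch of the query-composition idea (a simple gated residual module, not the paper's exact model): fuse the image and text embeddings into a modified query, then retrieve by cosine similarity.

```python
# Minimal sketch: compose an image embedding with a text embedding that
# describes the desired modification, then rank a gallery by similarity.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Compose(nn.Module):
    def __init__(self, dim=128):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())
        self.res = nn.Linear(2 * dim, dim)

    def forward(self, img, txt):
        h = torch.cat([img, txt], dim=-1)
        return self.gate(h) * img + self.res(h)  # modified query feature

compose = Compose()
query = compose(torch.randn(1, 128), torch.randn(1, 128))
gallery = torch.randn(100, 128)                        # candidate image features
sims = F.cosine_similarity(query.expand_as(gallery), gallery)
best = sims.argmax()                                   # top retrieval
```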
1 code implementation • ECCV 2018 • Chen Sun, Abhinav Shrivastava, Carl Vondrick, Kevin Murphy, Rahul Sukthankar, Cordelia Schmid
A visualization of the learned relation features confirms that our approach is able to attend to the relevant relations for each action.
Ranked #14 on Action Recognition on AVA v2.1
1 code implementation • CVPR 2018 • Yin Cui, Yang Song, Chen Sun, Andrew Howard, Serge Belongie
We propose a measure to estimate domain similarity via Earth Mover's Distance and demonstrate that transfer learning benefits from pre-training on a source domain that is similar to the target domain by this measure.
Ranked #15 on Fine-Grained Image Classification on CUB-200-2011
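A minimal sketch of an EMD-based domain similarity; with uniform class weights and equally many classes on both sides, EMD reduces to an optimal assignment problem. The features and the similarity scale below are placeholders, not the paper's.

```python
# Minimal sketch: Earth Mover's Distance between per-class mean features of
# a source and a target domain, turned into a similarity score.
import numpy as np
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(0)
source_means = rng.normal(size=(10, 64))  # per-class mean features, source
target_means = rng.normal(size=(10, 64))  # per-class mean features, target

# Pairwise transport costs between source and target classes.
cost = np.linalg.norm(source_means[:, None] - target_means[None, :], axis=-1)
rows, cols = linear_sum_assignment(cost)   # optimal matching = EMD here
emd = cost[rows, cols].mean()
similarity = np.exp(-0.01 * emd)           # assumed scale; higher = more similar
print(similarity)
```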
no code implementations • 22 Jan 2018 • Unaiza Ahsan, Chen Sun, Irfan Essa
We propose an action recognition framework using Generative Adversarial Networks.
1 code implementation • ECCV 2018 • Saining Xie, Chen Sun, Jonathan Huang, Zhuowen Tu, Kevin Murphy
Despite the steady progress in video analysis led by the adoption of convolutional neural networks (CNNs), the relative improvement has been less drastic than that in 2D static image classification.
Ranked #22 on Action Recognition on UCF101 (using extra training data)
1 code implementation • ICCV 2017 • Chuang Gan, Yandong Li, Haoxiang Li, Chen Sun, Boqing Gong
Many seemingly distant annotations (e.g., semantic segmentation and visual question answering (VQA)) are inherently connected, in that they reveal different levels and perspectives of human understanding about the same visual scenes, and even about the same set of images (e.g., those in COCO).
6 code implementations • CVPR 2018 • Grant Van Horn, Oisin Mac Aodha, Yang Song, Yin Cui, Chen Sun, Alex Shepard, Hartwig Adam, Pietro Perona, Serge Belongie
Existing image classification datasets used in computer vision tend to have a uniform distribution of images across object categories.
Ranked #6 on Image Classification on iNaturalist
2 code implementations • ICCV 2017 • Chen Sun, Abhinav Shrivastava, Saurabh Singh, Abhinav Gupta
What will happen if we increase the dataset size by 10x or 100x?
Ranked #2 on Semantic Segmentation on PASCAL VOC 2007
4 code implementations • CVPR 2018 • Chunhui Gu, Chen Sun, David A. Ross, Carl Vondrick, Caroline Pantofaru, Yeqing Li, Sudheendra Vijayanarasimhan, George Toderici, Susanna Ricco, Rahul Sukthankar, Cordelia Schmid, Jitendra Malik
The AVA dataset densely annotates 80 atomic visual actions in 430 15-minute video clips, where actions are localized in space and time, resulting in 1.58M action labels with multiple labels per person occurring frequently.
Ranked #2 on Temporal Action Localization on UCF101-24
8 code implementations • ICCV 2017 • Jiyang Gao, Chen Sun, Zhenheng Yang, Ram Nevatia
For evaluation, we adopt the TACoS dataset and build a new dataset for this task, Charades-STA, by adding sentence-level temporal annotations to Charades.
1 code implementation • ICCV 2017 • Jiyang Gao, Zhenheng Yang, Chen Sun, Kan Chen, Ram Nevatia
Temporal Action Proposal (TAP) generation is an important problem, as fast and accurate extraction of semantically salient segments (e.g., human actions) from untrimmed videos is a key step for large-scale video analysis.
Ranked #8 on Action Recognition on THUMOS’14
no code implementations • 17 Jan 2017 • Unaiza Ahsan, Chen Sun, James Hays, Irfan Essa
We propose to leverage concept-level representations for complex event recognition in photographs given limited training examples.
14 code implementations • CVPR 2017 • Jonathan Huang, Vivek Rathod, Chen Sun, Menglong Zhu, Anoop Korattikara, Alireza Fathi, Ian Fischer, Zbigniew Wojna, Yang Song, Sergio Guadarrama, Kevin Murphy
At the opposite end, where accuracy is critical, we present a detector that achieves state-of-the-art performance measured on the COCO detection task.
Ranked #190 on Object Detection on COCO test-dev
no code implementations • 16 Apr 2016 • Jiyang Gao, Chen Sun, Ram Nevatia
It obtains candidate action concepts by extracting verb-object pairs from sentences and verifies their visualness with the associated images.
no code implementations • CVPR 2016 • Chen Sun, Manohar Paluri, Ronan Collobert, Ram Nevatia, Lubomir Bourdev
This paper aims to classify and locate objects accurately and efficiently, without using bounding box annotations.
Ranked #5 on Weakly Supervised Object Detection on COCO
no code implementations • ICCV 2015 • Chen Sun, Chuang Gan, Ram Nevatia
Humans connect language and vision to perceive the world.
1 code implementation • 4 Apr 2015 • Chen Sun, Sanketh Shetty, Rahul Sukthankar, Ram Nevatia
To solve this problem, we propose a simple yet effective method that takes weak video labels and noisy image labels as input, and generates localized action frames as output.
no code implementations • CVPR 2014 • Chen Sun, Ram Nevatia
Our goal is to find the important segments and capture their information for event classification and recounting.