no code implementations • 23 Nov 2022 • Ryan Burgert, Kanchana Ranasinghe, Xiang Li, Michael S. Ryoo
Recent diffusion-based generative models combined with vision-language models are capable of creating realistic images from natural language prompts.
no code implementations • 16 Nov 2022 • Michael S. Ryoo, Keerthana Gopalakrishnan, Kumara Kahatapitiya, Ted Xiao, Kanishka Rao, Austin Stone, Yao Lu, Julian Ibarz, Anurag Arnab
We propose Token Turing Machines (TTM), a sequential, autoregressive Transformer model with memory for real-world sequential visual understanding.
Ranked #1 on Action Detection on Charades
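To make the sequential read-process-write idea above concrete, here is a minimal sketch of a TTM-style step, assuming TokenLearner-style token summarization for the read and write operations; the class names, token counts, and dimensions are illustrative, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class TokenSummarizer(nn.Module):
    """Summarize n input tokens into k output tokens via learned attention."""
    def __init__(self, k, dim):
        super().__init__()
        self.attn = nn.Linear(dim, k)  # one attention map per output token

    def forward(self, x):                          # x: (B, n, dim)
        w = self.attn(x).softmax(dim=1)            # (B, n, k), normalized over tokens
        return torch.einsum('bnk,bnd->bkd', w, x)  # (B, k, dim)

class TokenTuringMachine(nn.Module):
    def __init__(self, mem_tokens=96, proc_tokens=16, dim=256):
        super().__init__()
        self.mem = nn.Parameter(torch.zeros(1, mem_tokens, dim))
        self.read = TokenSummarizer(proc_tokens, dim)
        self.process = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True), num_layers=2)
        self.write = TokenSummarizer(mem_tokens, dim)

    def forward(self, frames):   # frames: (B, T, n, dim), n visual tokens per step
        mem = self.mem.expand(frames.shape[0], -1, -1)
        outs = []
        for t in range(frames.shape[1]):
            fused = torch.cat([mem, frames[:, t]], dim=1)         # memory + input
            out = self.process(self.read(fused))                  # read, then process
            mem = self.write(torch.cat([fused, out], dim=1))      # write back to memory
            outs.append(out)
        return torch.stack(outs, 1)  # per-step processed tokens
```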
no code implementations • 28 Oct 2022 • Jongwoo Park, Kumara Kahatapitiya, Donghyun Kim, Shivchander Sudalairaj, Quanfu Fan, Michael S. Ryoo
In this paper, we present a simple and efficient add-on component (termed GrafT) that considers global dependencies and multi-scale information throughout the network, in high- and low-resolution features alike.
no code implementations • 20 Sep 2022 • Boyuan Chen, Fei Xia, Brian Ichter, Kanishka Rao, Keerthana Gopalakrishnan, Michael S. Ryoo, Austin Stone, Daniel Kappler
Large language models (LLMs) have unlocked new capabilities of task planning from human instructions.
no code implementations • 1 Aug 2022 • AJ Piergiovanni, Kairo Morton, Weicheng Kuo, Michael S. Ryoo, Anelia Angelova
Video question answering is a challenging task that requires jointly understanding the language input, the visual information in individual video frames, and the temporal information about the events occurring in the video.
Ranked #4 on Video Question Answering on iVQA
1 code implementation • 1 Jul 2022 • Srijan Das, Michael S. Ryoo
The CLIP embedding provides a fine-grained understanding of the objects relevant to an action, whereas the SlowFast network is responsible for modeling temporal information within a video clip of a few frames.
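The description suggests a two-stream design where CLIP supplies object semantics and SlowFast supplies short-range motion. A minimal late-fusion sketch follows; the concatenation-based fusion, the dimensions, and the class count are assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class ClipSlowFastFusion(nn.Module):
    """Late-fusion sketch: concatenate a pooled CLIP frame embedding (object
    semantics) with a SlowFast clip feature (motion)."""
    def __init__(self, clip_dim=512, video_dim=2304, num_classes=400):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(clip_dim + video_dim, 512), nn.ReLU())
        self.cls = nn.Linear(512, num_classes)

    def forward(self, clip_emb, video_feat):
        # clip_emb: (B, clip_dim) CLIP features averaged over sampled frames
        # video_feat: (B, video_dim) from the SlowFast backbone
        return self.cls(self.proj(torch.cat([clip_emb, video_feat], dim=-1)))
```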
1 code implementation • 23 Jun 2022 • Jinghuan Shang, Srijan Das, Michael S. Ryoo
To this end, we propose a 3D Token Representation Layer (3DTRL) that estimates the 3D positional information of the visual tokens and leverages it for learning viewpoint-agnostic representations.
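A rough, hypothetical sketch of the idea: predict a depth value per token, lift each token's 2D grid location to a pseudo-3D point, and inject it as a positional embedding. This omits the camera-transform estimation the layer also performs; all module names and shapes are illustrative.

```python
import torch
import torch.nn as nn

class TokenRepLayer3D(nn.Module):
    """Hypothetical sketch of a 3DTRL-style layer: per-token depth + 2D grid
    position give a pseudo-3D point, used as a positional embedding."""
    def __init__(self, dim, grid=14):
        super().__init__()
        self.depth = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, 1))
        self.pos = nn.Linear(3, dim)               # (x, y, z) -> embedding
        ys, xs = torch.meshgrid(torch.linspace(-1, 1, grid),
                                torch.linspace(-1, 1, grid), indexing='ij')
        self.register_buffer('uv', torch.stack([xs, ys], -1).reshape(1, -1, 2))

    def forward(self, tokens):                     # tokens: (B, grid*grid, dim)
        z = self.depth(tokens)                     # (B, N, 1) per-token depth
        uv = self.uv.expand(tokens.shape[0], -1, -1)
        xyz = torch.cat([uv * z, z], dim=-1)       # back-project with depth
        return tokens + self.pos(xyz)              # viewpoint-aware tokens
```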
2 code implementations • 10 Jun 2022 • Xiang Li, Jinghuan Shang, Srijan Das, Michael S. Ryoo
We investigate whether self-supervised learning (SSL) can improve online reinforcement learning (RL) from pixels.
no code implementations • 7 Dec 2021 • Srijan Das, Michael S. Ryoo
Self-supervised video representation learning predominantly focuses on discriminating instances generated from simple data augmentation schemes.
no code implementations • 7 Dec 2021 • Srijan Das, Michael S. Ryoo
To this end, we propose Cross-Modal Manifold Cutmix (CMMC) that inserts a video tesseract into another video tesseract in the feature space across two different modalities.
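The operation resembles CutMix applied to intermediate feature volumes across modality streams. A minimal sketch, with the block shape and placement strategy being assumptions:

```python
import torch

def cross_modal_manifold_cutmix(feat_a, feat_b, ratio=0.5):
    """Insert a random spatio-temporal block of modality B's feature tensor
    (a 'video tesseract') into modality A's, in feature space.
    feat_*: (B, C, T, H, W) intermediate features from two modality streams."""
    _, _, T, H, W = feat_a.shape
    t, h, w = int(T * ratio), int(H * ratio), int(W * ratio)
    t0 = torch.randint(0, T - t + 1, (1,)).item()
    h0 = torch.randint(0, H - h + 1, (1,)).item()
    w0 = torch.randint(0, W - w + 1, (1,)).item()
    mixed = feat_a.clone()
    mixed[:, :, t0:t0+t, h0:h0+h, w0:w0+w] = feat_b[:, :, t0:t0+t, h0:h0+h, w0:w0+w]
    return mixed
```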
1 code implementation • CVPR 2022 • Rui Dai, Srijan Das, Kumara Kahatapitiya, Michael S. Ryoo, Francois Bremond
Action detection is an essential and challenging task, especially for densely labelled datasets of untrimmed videos.
Ranked #2 on Action Detection on TSU
no code implementations • 26 Nov 2021 • Kumara Kahatapitiya, Michael S. Ryoo
Modeling visual data as tokens (i.e., image patches) using attention mechanisms, feed-forward networks or convolutions has been highly effective in recent years.
1 code implementation • 26 Nov 2021 • Kumara Kahatapitiya, Zhou Ren, Haoxiang Li, Zhenyu Wu, Michael S. Ryoo, Gang Hua
However, such pretrained models are not ideal for downstream detection, due to the disparity between the pretraining and the downstream fine-tuning tasks.
Ranked #3 on Action Detection on Charades
1 code implementation • 12 Oct 2021 • Jinghuan Shang, Kumara Kahatapitiya, Xiang Li, Michael S. Ryoo
Reinforcement Learning (RL) can be viewed as a sequence modeling task: given a sequence of past state-action-reward experiences, an agent predicts a sequence of next actions.
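This formulation can be sketched in the Decision-Transformer style: interleave state, action, and reward tokens and decode actions causally. The sketch below shows only this generic sequence formulation, not the paper's specific architecture; all hyperparameters are illustrative.

```python
import torch
import torch.nn as nn

class SequenceRLPolicy(nn.Module):
    """Minimal sequence-modeling RL sketch: interleave (state, action, reward)
    tokens and predict each action from its preceding state token."""
    def __init__(self, state_dim, n_actions, dim=128, ctx=30):
        super().__init__()
        self.s_emb = nn.Linear(state_dim, dim)
        self.a_emb = nn.Embedding(n_actions, dim)
        self.r_emb = nn.Linear(1, dim)
        self.pos = nn.Parameter(torch.zeros(1, 3 * ctx, dim))   # assumes T <= ctx
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=3)
        self.head = nn.Linear(dim, n_actions)

    def forward(self, states, actions, rewards):
        # states: (B, T, state_dim); actions: (B, T) int; rewards: (B, T, 1)
        B, T = actions.shape
        toks = torch.stack([self.s_emb(states), self.a_emb(actions),
                            self.r_emb(rewards)], dim=2)        # (B, T, 3, dim)
        toks = toks.reshape(B, 3 * T, -1) + self.pos[:, :3 * T]
        causal = torch.triu(torch.full((3 * T, 3 * T), float('-inf'),
                                       device=states.device), diagonal=1)
        h = self.encoder(toks, mask=causal)
        return self.head(h[:, 0::3])   # logits of a_t from each state token s_t
```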
1 code implementation • ICCV 2021 • AJ Piergiovanni, Vincent Casser, Michael S. Ryoo, Anelia Angelova
We present 4D-Net, a 3D object detection approach, which utilizes 3D Point Cloud and RGB sensing information, both in time.
no code implementations • 2 Aug 2021 • Jinghuan Shang, Michael S. Ryoo
Third-person imitation learning (TPIL) is the concept of learning action policies by observing other agents in a third-person view (TPV), similar to what humans do.
no code implementations • 28 Jun 2021 • AJ Piergiovanni, Anelia Angelova, Michael S. Ryoo, Irfan Essa
In this paper we address the problem of automatically discovering atomic actions in an unsupervised manner from instructional videos.
4 code implementations • 21 Jun 2021 • Michael S. Ryoo, AJ Piergiovanni, Anurag Arnab, Mostafa Dehghani, Anelia Angelova
In this paper, we introduce a novel visual representation learning approach that relies on a handful of adaptively learned tokens and is applicable to both image and video understanding tasks.
Ranked #1 on Action Classification on Charades
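The core mechanism can be sketched as learning a few spatial attention maps and using each map to pool the feature map into one token; the normalization and gating details below are simplifying assumptions.

```python
import torch
import torch.nn as nn

class TokenLearner(nn.Module):
    """Learn a handful of tokens from a feature map: each output token is a
    spatial pooling of the input weighted by its own learned attention map."""
    def __init__(self, dim, num_tokens=8):
        super().__init__()
        self.attn = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, num_tokens))

    def forward(self, x):                          # x: (B, H*W, dim)
        w = self.attn(x).softmax(dim=1)            # (B, H*W, S): S spatial maps
        return torch.einsum('bns,bnd->bsd', w, x)  # (B, S, dim) learned tokens
```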
no code implementations • 7 Jun 2021 • AJ Piergiovanni, Anelia Angelova, Michael S. Ryoo, Irfan Essa
In this paper we address the problem of automatically discovering atomic actions in an unsupervised manner from instructional videos, which are rarely annotated with atomic actions.
no code implementations • CVPR 2021 • AJ Piergiovanni, Michael S. Ryoo
Standard methods for video recognition use large CNNs designed to capture spatio-temporal data.
Ranked #2 on Action Classification on Toyota Smarthome dataset (CV1 metric)
no code implementations • 26 Mar 2021 • Iretiayo Akinola, Anelia Angelova, Yao Lu, Yevgen Chebotar, Dmitry Kalashnikov, Jacob Varley, Julian Ibarz, Michael S. Ryoo
We propose a vision-based architecture search algorithm for robot manipulation learning, which discovers interactions between low-dimensional action inputs and high-dimensional visual inputs.
1 code implementation • CVPR 2021 • Kumara Kahatapitiya, Michael S. Ryoo
In this paper, we introduce Coarse-Fine Networks, a two-stream architecture which benefits from different abstractions of temporal resolution to learn better video representations for long-term motion.
Ranked #7 on Action Detection on Charades
no code implementations • 13 Nov 2020 • Ramyad Hadidi, Jiashen Cao, Michael S. Ryoo, Hyesoon Kim
Meeting the high computation demand of modern deep learning architectures while achieving low inference latency is challenging.
1 code implementation • ECCV 2020 • Michael S. Ryoo, AJ Piergiovanni, Juhana Kangaspunta, Anelia Angelova
We create a family of powerful video models which are able to: (i) learn interactions between semantic object information and raw appearance and motion features, and (ii) deploy attention in order to better learn the importance of features at each convolutional block of the network.
Ranked #4 on Action Classification on Toyota Smarthome dataset
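Point (ii) can be illustrated with a simple form of cross-stream attention: channel weights computed from one stream's features gate another stream's block output. This wiring is a guess for illustration, not the paper's connectivity-searched design.

```python
import torch
import torch.nn as nn

class PeerAttention(nn.Module):
    """Illustrative guess at cross-stream attention: channel weights computed
    from a peer stream's features gate this stream's block output."""
    def __init__(self, peer_channels, channels):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(peer_channels, channels), nn.Sigmoid())

    def forward(self, x, peer):
        # x: (B, C, T, H, W) this stream; peer: (B, Cp, T, H, W) another stream
        w = self.gate(peer.mean(dim=(2, 3, 4)))    # (B, C) importance weights
        return x * w[:, :, None, None, None]       # reweight feature channels
```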
no code implementations • ECCV 2020 • AJ Piergiovanni, Anelia Angelova, Alexander Toshev, Michael S. Ryoo
In this paper we propose an adversarial generative grammar model for future prediction.
no code implementations • ECCV 2020 • Xiaofang Wang, Xuehan Xiong, Maxim Neumann, AJ Piergiovanni, Michael S. Ryoo, Anelia Angelova, Kris M. Kitani, Wei Hua
The discovered attention cells can be seamlessly inserted into existing backbone networks, e.g., I3D or S3D, and improve video classification accuracy by more than 2% on both the Kinetics-600 and MiT datasets.
1 code implementation • NeurIPS 2020 • AJ Piergiovanni, Michael S. Ryoo
We confirm that most of the existing video datasets are statistically biased to only capture action videos from a limited number of countries.
Ranked #2 on Action Classification on AViD
no code implementations • 13 Mar 2020 • Ramyad Hadidi, Bahar Asgari, Jiashen Cao, Younmin Bae, Da Eun Shim, Hyojong Kim, Sung-Kyu Lim, Michael S. Ryoo, Hyesoon Kim
To benefit from available compute resources, we propose the first DNN parallelization method designed to reduce communication overhead in a distributed system.
no code implementations • CVPR 2020 • AJ Piergiovanni, Anelia Angelova, Michael S. Ryoo
We present a new method to learn video representations from large-scale unlabeled video data.
1 code implementation • 26 Nov 2019 • Xiuye Gu, Weixin Luo, Michael S. Ryoo, Yong Jae Lee
Cameras are prevalent in our daily lives and enable many useful systems built on computer vision technologies, such as smart cameras and home robots for service applications.
2 code implementations • 15 Oct 2019 • AJ Piergiovanni, Anelia Angelova, Michael S. Ryoo
Video understanding is a challenging problem with great impact on the abilities of autonomous agents working in the real world.
1 code implementation • 8 Oct 2019 • Alan Wu, AJ Piergiovanni, Michael S. Ryoo
We present a visual imitation learning framework that enables learning of robot action policies solely based on expert samples without any robot trials.
no code implementations • 7 Jun 2019 • AJ Piergiovanni, Anelia Angelova, Michael S. Ryoo
We present a new method to learn video representations from unlabeled data.
2 code implementations • ICLR 2020 • Michael S. Ryoo, AJ Piergiovanni, Mingxing Tan, Anelia Angelova
Learning to represent videos is a very challenging task both algorithmically and computationally.
no code implementations • ICLR 2019 • AJ Piergiovanni, Michael S. Ryoo
In this paper, we present a method to learn a joint multimodal representation space that allows for the recognition of unseen activities in videos.
no code implementations • 18 Apr 2019 • AJ Piergiovanni, Michael S. Ryoo
Injuries are a major cost in sports.
no code implementations • 1 Feb 2019 • AJ Piergiovanni, Anelia Angelova, Michael S. Ryoo
This paper proposes a novel algorithm which learns a formal regular grammar from real-world continuous data, such as videos.
no code implementations • ICCV 2019 • AJ Piergiovanni, Anelia Angelova, Alexander Toshev, Michael S. Ryoo
We present a new method for finding video CNN architectures that capture rich spatio-temporal information in videos.
Ranked #20 on Action Classification on Moments in Time
5 code implementations • CVPR 2019 • AJ Piergiovanni, Michael S. Ryoo
Our representation flow layer is a fully-differentiable layer designed to capture the 'flow' of any representation channel within a convolutional neural network for action recognition.
Ranked #14 on Action Recognition on HMDB-51
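To illustrate that flow can be computed differentiably on feature channels, here is a crude single-step estimate from spatial and temporal feature gradients; the actual layer iterates learnable TV-L1-style updates, which this sketch deliberately omits.

```python
import torch
import torch.nn.functional as F

def feature_flow_step(f1, f2, eps=1e-3):
    """Crude one-shot 'flow' between consecutive feature maps f1, f2 (B, C, H, W),
    computed per channel from spatial and temporal gradients."""
    C = f1.shape[1]
    kx = torch.tensor([[[[-0.5, 0.0, 0.5]]]], device=f1.device)  # d/dx kernel
    ky = kx.transpose(2, 3)                                      # d/dy kernel
    ix = F.conv2d(f1, kx.expand(C, 1, 1, 3).contiguous(), padding=(0, 1), groups=C)
    iy = F.conv2d(f1, ky.expand(C, 1, 3, 1).contiguous(), padding=(1, 0), groups=C)
    it = f2 - f1                                                 # temporal gradient
    mag = ix ** 2 + iy ** 2 + eps
    u, v = -it * ix / mag, -it * iy / mag                        # per-channel flow
    return torch.stack([u, v], dim=1)                            # (B, 2, C, H, W)
```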
1 code implementation • 21 Jun 2018 • AJ Piergiovanni, Michael S. Ryoo
We present a method to learn a joint multimodal representation space that enables recognition of unseen activities in videos.
no code implementations • 20 May 2018 • AJ Piergiovanni, Alan Wu, Michael S. Ryoo
Learning to control robots directly based on images is a primary challenge in robotics.
3 code implementations • 9 Apr 2018 • AJ Piergiovanni, Michael S. Ryoo
In this paper, we introduce a challenging new dataset, MLB-YouTube, designed for fine-grained activity detection.
1 code implementation • ECCV 2018 • Zhongzheng Ren, Yong Jae Lee, Michael S. Ryoo
The end result is a video anonymizer that performs pixel-level modifications to anonymize each person's face, with minimal effect on action detection performance.
no code implementations • ECCV 2018 • Mingze Xu, Chenyou Fan, Yuchen Wang, Michael S. Ryoo, David J. Crandall
In this paper, we wish to solve two specific problems: (1) given two or more synchronized third-person videos of a scene, produce a pixel-level segmentation of each visible person and identify corresponding people across different views (i.e., determine who in camera A corresponds with whom in camera B), and (2) given one or more synchronized third-person videos as well as a first-person video taken by a mobile or wearable camera, segment and identify the camera wearer in the third-person videos.
1 code implementation • ICLR 2019 • AJ Piergiovanni, Michael S. Ryoo
We introduce a new convolutional layer named the Temporal Gaussian Mixture (TGM) layer and present how it can be used to efficiently capture longer-term temporal information in continuous activity videos.
Ranked #4 on Action Detection on Multi-THUMOS
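The layer can be sketched as a temporal convolution whose kernels are built from a few learned Gaussian parameters rather than stored weight-by-weight, which is what makes long kernels cheap. The mixing softmax and normalization details below are assumptions.

```python
import torch
import torch.nn as nn

class TemporalGaussianMixture(nn.Module):
    """Sketch of a TGM-style layer: each channel's temporal kernel is a mixture
    of Gaussians with learned centers and widths."""
    def __init__(self, channels, n_gauss=4, length=15):
        super().__init__()
        self.center = nn.Parameter(torch.randn(channels, n_gauss) * 0.1)
        self.width = nn.Parameter(torch.ones(channels, n_gauss))
        self.mix = nn.Parameter(torch.randn(channels, n_gauss))
        self.register_buffer('t', torch.linspace(-1, 1, length))

    def forward(self, x):                          # x: (B, C, T)
        # Build per-channel kernels from the Gaussian mixture parameters.
        g = torch.exp(-(self.t - self.center.unsqueeze(-1)) ** 2
                      / (2 * self.width.unsqueeze(-1) ** 2 + 1e-4))   # (C, G, L)
        k = (self.mix.softmax(-1).unsqueeze(-1) * g).sum(1)           # (C, L)
        k = k / (k.sum(-1, keepdim=True) + 1e-6)   # normalize each kernel
        return nn.functional.conv1d(x, k.unsqueeze(1),
                                    padding=k.shape[-1] // 2, groups=x.shape[1])
```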
no code implementations • 5 Feb 2018 • Ramyad Hadidi, Jiashen Cao, Matthew Woodward, Michael S. Ryoo, Hyesoon Kim
Furthermore, on image recognition tasks, Musical Chair achieves similar performance while saving dynamic energy.
2 code implementations • CVPR 2018 • AJ Piergiovanni, Michael S. Ryoo
In this paper, we introduce the concept of learning latent super-events from activity videos, and present how it benefits activity detection in continuous videos.
Ranked #6 on Action Detection on Multi-THUMOS
no code implementations • 3 Aug 2017 • Michael S. Ryoo, Kiyoon Kim, Hyun Jong Yang
This paper presents an approach for recognizing human activities from extremely low-resolution (e.g., 16x12) videos.
no code implementations • 20 May 2017 • Chenyou Fan, Jangwon Lee, Michael S. Ryoo
The key idea is that (1) an intermediate representation of a convolutional object recognition model abstracts the scene information in its frame, and (2) we can predict (i.e., regress) such representations for future frames based on that of the current frame.
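A minimal sketch of idea (2): a small convolutional regressor maps the current frame's intermediate CNN representation to a predicted future representation, which can then be fed to the pretrained recognition head. The architecture and loss here are illustrative assumptions.

```python
import torch
import torch.nn as nn

class FutureRepRegressor(nn.Module):
    """Sketch: regress the future frame's intermediate CNN representation
    from the current frame's representation."""
    def __init__(self, channels=256):
        super().__init__()
        self.reg = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1))

    def forward(self, feat_now):       # (B, C, H, W) current-frame features
        return self.reg(feat_now)      # predicted future-frame features

# Training pairs features k frames apart, e.g.:
#   loss = torch.nn.functional.mse_loss(model(feat_t), feat_t_plus_k.detach())
```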
no code implementations • CVPR 2017 • Chenyou Fan, Jang-Won Lee, Mingze Xu, Krishna Kumar Singh, Yong Jae Lee, David J. Crandall, Michael S. Ryoo
We consider scenarios in which we wish to perform joint scene understanding, object tracking, activity recognition, and other tasks in environments in which multiple people are wearing body-worn cameras while a third-person static camera also captures the scene.
no code implementations • 3 Mar 2017 • Jang-Won Lee, Michael S. Ryoo
We design a new approach that allows robot learning of new activities from unlabeled human example videos.
no code implementations • 1 Mar 2017 • Tianmin Shu, Xiaofeng Gao, Michael S. Ryoo, Song-Chun Zhu
In this paper, we present a general framework for learning social affordance grammar as a spatiotemporal AND-OR graph (ST-AOG) from RGB-D videos of human interactions, and transfer the grammar to humanoids to enable real-time motion inference for human-robot interaction (HRI).
1 code implementation • 26 May 2016 • AJ Piergiovanni, Chenyou Fan, Michael S. Ryoo
In this paper, we introduce the concept of temporal attention filters and describe how they can be used for human activity recognition from videos.
Ranked #1 on Activity Recognition In Videos on DogCentric
no code implementations • 12 Apr 2016 • Michael S. Ryoo, Brandon Rothrock, Charles Fleming, Hyun Jong Yang
We introduce the paradigm of inverse super resolution (ISR), the concept of learning the optimal set of image transformations to generate multiple low-resolution (LR) training videos from a single video.
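A sketch of the ISR idea, assuming the learned transformation set consists of sub-pixel translations (one of several plausible choices): each transform yields a distinct low-resolution variant of the same video.

```python
import torch
import torch.nn.functional as F

def inverse_super_resolution(video, transforms, size=(12, 16)):
    """Generate multiple low-resolution training videos from one video by
    applying a set of (possibly learned) sub-pixel shifts before downsampling.
    video: (T, C, H, W); transforms: list of (dx, dy) shifts in pixels."""
    lr_videos = []
    for dx, dy in transforms:
        theta = torch.tensor([[1.0, 0.0, 2 * dx / video.shape[-1]],
                              [0.0, 1.0, 2 * dy / video.shape[-2]]],
                             device=video.device)
        grid = F.affine_grid(theta.expand(video.shape[0], 2, 3), video.shape,
                             align_corners=False)
        shifted = F.grid_sample(video, grid, align_corners=False)
        lr_videos.append(F.interpolate(shifted, size=size, mode='bilinear',
                                       align_corners=False))
    return lr_videos   # one LR variant per transformation
```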
no code implementations • 9 Jul 2015 • Ilaria Gori, J. K. Aggarwal, Larry Matthies, Michael S. Ryoo
Activity recognition is very useful in scenarios where robots interact with, monitor or assist humans.
no code implementations • CVPR 2013 • Michael S. Ryoo, Larry Matthies
This paper discusses the problem of recognizing interaction-level human activities from a first-person viewpoint.