Search Results for author: Michael S. Ryoo

Found 66 papers, 30 papers with code

AssembleNet++: Assembling Modality Representations via Attention Connections - Supplementary Material -

no code implementations ECCV 2020 Michael S. Ryoo, AJ Piergiovanni, Juhana Kangaspunta, Anelia Angelova

We create a family of powerful video models which are able to: (i) learn interactions between semantic object information and raw appearance and motion features, and (ii) deploy attention in order to better learn the importance of features at each convolutional block of the network.

Activity Recognition

Understanding Long Videos in One Multimodal Language Model Pass

1 code implementation 25 Mar 2024 Kanchana Ranasinghe, Xiang Li, Kumara Kahatapitiya, Michael S. Ryoo

In addition to faster inference, we find that the resulting models yield surprisingly good accuracy on long-video tasks, even with no video-specific information.

Fine-grained Action Recognition Language Modelling +3

Language Repository for Long Video Understanding

1 code implementation 21 Mar 2024 Kumara Kahatapitiya, Kanchana Ranasinghe, Jongwoo Park, Michael S. Ryoo

In this paper, we introduce a Language Repository (LangRepo) for LLMs that maintains concise and structured information as an interpretable (i.e., all-textual) representation.

Video Understanding Visual Question Answering +1

Diffusion Illusions: Hiding Images in Plain Sight

no code implementations 6 Dec 2023 Ryan Burgert, Xiang Li, Abe Leite, Kanchana Ranasinghe, Michael S. Ryoo

We explore the problem of computationally generating special 'prime' images that produce optical illusions when physically arranged and viewed in a certain way.

Mirasol3B: A Multimodal Autoregressive model for time-aligned and contextual modalities

no code implementations 9 Nov 2023 AJ Piergiovanni, Isaac Noble, Dahun Kim, Michael S. Ryoo, Victor Gomes, Anelia Angelova

We propose a multimodal model, called Mirasol3B, consisting of an autoregressive component for the time-synchronized modalities (audio and video), and an autoregressive component for the context modalities which are not necessarily aligned in time but are still sequential.

Action Classification Audio Classification +1

Limited Data, Unlimited Potential: A Study on ViTs Augmented by Masked Autoencoders

1 code implementation 31 Oct 2023 Srijan Das, Tanmay Jain, Dominick Reilly, Pranav Balaji, Soumyajit Karmakar, Shyam Marjit, Xiang Li, Abhijit Das, Michael S. Ryoo

We explore the appropriate SSL tasks that can be optimized alongside the primary task, the training schemes for these tasks, and the data scale at which they can be most effective.

DeepFake Detection Face Swapping +1

Energy-Based Models for Cross-Modal Localization using Convolutional Transformers

no code implementations 6 Jun 2023 Alan Wu, Michael S. Ryoo

We present a novel framework using Energy-Based Models (EBMs) for localizing a ground vehicle mounted with a range sensor against satellite imagery in the absence of GPS.

Autonomous Vehicles

Active Vision Reinforcement Learning under Limited Visual Observability

2 code implementations NeurIPS 2023 Jinghuan Shang, Michael S. Ryoo

This learnable reward, assigned by a sensorimotor reward module, incentivizes the sensory policy to select observations that are optimal for inferring its own motor action, inspired by the sensorimotor stage of humans.

reinforcement-learning

VicTR: Video-conditioned Text Representations for Activity Recognition

no code implementations 5 Apr 2023 Kumara Kahatapitiya, Anurag Arnab, Arsha Nagrani, Michael S. Ryoo

All such recipes rely on augmenting visual embeddings with temporal information (i.e., image -> video), often keeping text embeddings unchanged or even discarding them.

Action Classification Activity Recognition +1

Peekaboo: Text to Image Diffusion Models are Zero-Shot Segmentors

1 code implementation 23 Nov 2022 Ryan Burgert, Kanchana Ranasinghe, Xiang Li, Michael S. Ryoo

In this work, we explore how an off-the-shelf text-to-image diffusion model, trained without exposure to localization information, can ground various semantic phrases without segmentation-specific re-training.

Segmentation Unsupervised Semantic Segmentation

Token Turing Machines

1 code implementation CVPR 2023 Michael S. Ryoo, Keerthana Gopalakrishnan, Kumara Kahatapitiya, Ted Xiao, Kanishka Rao, Austin Stone, Yao Lu, Julian Ibarz, Anurag Arnab

The model's memory module ensures that a new observation will only be processed with the contents of the memory (and not the entire history), meaning that it can efficiently process long sequences with a bounded computational cost at each step.

Action Detection Activity Detection

Grafting Vision Transformers

no code implementations 28 Oct 2022 Jongwoo Park, Kumara Kahatapitiya, Donghyun Kim, Shivchander Sudalairaj, Quanfu Fan, Michael S. Ryoo

In this paper, we present a simple and efficient add-on component (termed GrafT) that considers global dependencies and multi-scale information throughout the network, in both high- and low-resolution features alike.

Image Classification Instance Segmentation +3

Video Question Answering with Iterative Video-Text Co-Tokenization

no code implementations 1 Aug 2022 AJ Piergiovanni, Kairo Morton, Weicheng Kuo, Michael S. Ryoo, Anelia Angelova

Video question answering is a challenging task that requires understanding jointly the language input, the visual information in individual video frames, as well as the temporal information about the events occurring in the video.

Question Answering Video Question Answering +1

Video + CLIP Baseline for Ego4D Long-term Action Anticipation

1 code implementation 1 Jul 2022 Srijan Das, Michael S. Ryoo

The CLIP embedding provides fine-grained understanding of objects relevant for an action, whereas the SlowFast network is responsible for modeling temporal information within a video clip of a few frames.

Action Anticipation Long Term Action Anticipation

Learning Viewpoint-Agnostic Visual Representations by Recovering Tokens in 3D Space

1 code implementation 23 Jun 2022 Jinghuan Shang, Srijan Das, Michael S. Ryoo

To this end, we propose a 3D Token Representation Layer (3DTRL) that estimates the 3D positional information of the visual tokens and leverages it for learning viewpoint-agnostic representations.

Action Recognition Image Classification +1

ViewCLR: Learning Self-supervised Video Representation for Unseen Viewpoints

no code implementations 7 Dec 2021 Srijan Das, Michael S. Ryoo

Learning self-supervised video representation predominantly focuses on discriminating instances generated from simple data augmentation schemes.

Data Augmentation

Cross-modal Manifold Cutmix for Self-supervised Video Representation Learning

no code implementations 7 Dec 2021 Srijan Das, Michael S. Ryoo

To this end, we propose Cross-Modal Manifold Cutmix (CMMC) that inserts a video tesseract into another video tesseract in the feature space across two different modalities.

Action Recognition Representation Learning +3

Weakly-guided Self-supervised Pretraining for Temporal Activity Detection

1 code implementation 26 Nov 2021 Kumara Kahatapitiya, Zhou Ren, Haoxiang Li, Zhenyu Wu, Michael S. Ryoo, Gang Hua

However, such pretrained models are not ideal for downstream detection, due to the disparity between the pretraining and the downstream fine-tuning tasks.

Action Detection Activity Detection +2

SWAT: Spatial Structure Within and Among Tokens

1 code implementation 26 Nov 2021 Kumara Kahatapitiya, Michael S. Ryoo

Modeling visual data as tokens (i.e., image patches) using attention mechanisms, feed-forward networks or convolutions has been highly effective in recent years.

StARformer: Transformer with State-Action-Reward Representations for Visual Reinforcement Learning

1 code implementation 12 Oct 2021 Jinghuan Shang, Kumara Kahatapitiya, Xiang Li, Michael S. Ryoo

Reinforcement Learning (RL) can be considered as a sequence modeling task: given a sequence of past state-action-reward experiences, an agent predicts a sequence of next actions.

Imitation Learning Inductive Bias +3

4D-Net for Learned Multi-Modal Alignment

1 code implementation ICCV 2021 AJ Piergiovanni, Vincent Casser, Michael S. Ryoo, Anelia Angelova

We present 4D-Net, a 3D object detection approach, which utilizes 3D Point Cloud and RGB sensing information, both in time.

3D Object Detection object-detection

Self-Supervised Disentangled Representation Learning for Third-Person Imitation Learning

no code implementations 2 Aug 2021 Jinghuan Shang, Michael S. Ryoo

Third-person imitation learning (TPIL) is the concept of learning action policies by observing other agents in a third-person view (TPV), similar to what humans do.

Imitation Learning Representation Learning

Unsupervised Discovery of Actions in Instructional Videos

no code implementations 28 Jun 2021 AJ Piergiovanni, Anelia Angelova, Michael S. Ryoo, Irfan Essa

In this paper, we address the problem of automatically discovering atomic actions in an unsupervised manner from instructional videos.

TokenLearner: What Can 8 Learned Tokens Do for Images and Videos?

4 code implementations 21 Jun 2021 Michael S. Ryoo, AJ Piergiovanni, Anurag Arnab, Mostafa Dehghani, Anelia Angelova

In this paper, we introduce a novel visual representation learning approach that relies on a handful of adaptively learned tokens and is applicable to both image and video understanding tasks.

Action Classification Image Classification +3

Unsupervised Action Segmentation for Instructional Videos

no code implementations 7 Jun 2021 AJ Piergiovanni, Anelia Angelova, Michael S. Ryoo, Irfan Essa

In this paper, we address the problem of automatically discovering atomic actions in an unsupervised manner from instructional videos, which are rarely annotated with atomic actions.

Action Segmentation Segmentation

Visionary: Vision architecture discovery for robot learning

no code implementations 26 Mar 2021 Iretiayo Akinola, Anelia Angelova, Yao Lu, Yevgen Chebotar, Dmitry Kalashnikov, Jacob Varley, Julian Ibarz, Michael S. Ryoo

We propose a vision-based architecture search algorithm for robot manipulation learning, which discovers interactions between low-dimensional action inputs and high-dimensional visual inputs.

Neural Architecture Search Robot Manipulation

Coarse-Fine Networks for Temporal Activity Detection in Videos

1 code implementation CVPR 2021 Kumara Kahatapitiya, Michael S. Ryoo

In this paper, we introduce Coarse-Fine Networks, a two-stream architecture which benefits from different abstractions of temporal resolution to learn better video representations for long-term motion.

Action Detection Activity Detection

Reducing Inference Latency with Concurrent Architectures for Image Recognition

no code implementations 13 Nov 2020 Ramyad Hadidi, Jiashen Cao, Michael S. Ryoo, Hyesoon Kim

Satisfying the high computation demand of modern deep learning architectures is challenging for achieving low inference latency.

Neural Architecture Search

AssembleNet++: Assembling Modality Representations via Attention Connections

1 code implementation 18 Aug 2020 Michael S. Ryoo, AJ Piergiovanni, Juhana Kangaspunta, Anelia Angelova

We create a family of powerful video models which are able to: (i) learn interactions between semantic object information and raw appearance and motion features, and (ii) deploy attention in order to better learn the importance of features at each convolutional block of the network.

Action Classification Activity Recognition

AttentionNAS: Spatiotemporal Attention Cell Search for Video Classification

no code implementations ECCV 2020 Xiaofang Wang, Xuehan Xiong, Maxim Neumann, AJ Piergiovanni, Michael S. Ryoo, Anelia Angelova, Kris M. Kitani, Wei Hua

The discovered attention cells can be seamlessly inserted into existing backbone networks, e.g., I3D or S3D, and improve video classification accuracy by more than 2% on both Kinetics-600 and MiT datasets.

Classification General Classification +1

AViD Dataset: Anonymized Videos from Diverse Countries

1 code implementation NeurIPS 2020 AJ Piergiovanni, Michael S. Ryoo

We confirm that most of the existing video datasets are statistically biased to only capture action videos from a limited number of countries.

Action Classification Action Detection +1

LCP: A Low-Communication Parallelization Method for Fast Neural Network Inference in Image Recognition

no code implementations 13 Mar 2020 Ramyad Hadidi, Bahar Asgari, Jiashen Cao, Younmin Bae, Da Eun Shim, Hyojong Kim, Sung-Kyu Lim, Michael S. Ryoo, Hyesoon Kim

To benefit from available compute resources with low communication overhead, we propose the first DNN parallelization method for reducing the communication overhead in a distributed system.

Quantization

Password-conditioned Anonymization and Deanonymization with Face Identity Transformers

1 code implementation 26 Nov 2019 Xiuye Gu, Weixin Luo, Michael S. Ryoo, Yong Jae Lee

Cameras are prevalent in our daily lives, and enable many useful systems built upon computer vision technologies such as smart cameras and home robots for service applications.

Tiny Video Networks

2 code implementations 15 Oct 2019 AJ Piergiovanni, Anelia Angelova, Michael S. Ryoo

Video understanding is a challenging problem with great impact on the abilities of autonomous agents working in the real-world.

Video Understanding

Model-based Behavioral Cloning with Future Image Similarity Learning

1 code implementation 8 Oct 2019 Alan Wu, AJ Piergiovanni, Michael S. Ryoo

We present a visual imitation learning framework that enables learning of robot action policies solely based on expert samples without any robot trials.

Imitation Learning

Unseen Action Recognition with Unpaired Adversarial Multimodal Learning

no code implementations ICLR 2019 AJ Piergiovanni, Michael S. Ryoo

In this paper, we present a method to learn a joint multimodal representation space that allows for the recognition of unseen activities in videos.

Action Recognition General Classification +1

Differentiable Grammars for Videos

no code implementations 1 Feb 2019 AJ Piergiovanni, Anelia Angelova, Michael S. Ryoo

This paper proposes a novel algorithm which learns a formal regular grammar from real-world continuous data, such as videos.

Representation Flow for Action Recognition

5 code implementations CVPR 2019 AJ Piergiovanni, Michael S. Ryoo

Our representation flow layer is a fully-differentiable layer designed to capture the 'flow' of any representation channel within a convolutional neural network for action recognition.

Action Classification Action Recognition In Videos +4

Learning Multimodal Representations for Unseen Activities

1 code implementation 21 Jun 2018 AJ Piergiovanni, Michael S. Ryoo

We present a method to learn a joint multimodal representation space that enables recognition of unseen activities in videos.

General Classification Temporal Action Localization

Fine-grained Activity Recognition in Baseball Videos

3 code implementations 9 Apr 2018 AJ Piergiovanni, Michael S. Ryoo

In this paper, we introduce a challenging new dataset, MLB-YouTube, designed for fine-grained activity detection.

Action Detection Activity Detection +3

Learning to Anonymize Faces for Privacy Preserving Action Detection

1 code implementation ECCV 2018 Zhongzheng Ren, Yong Jae Lee, Michael S. Ryoo

The end result is a video anonymizer that performs pixel-level modifications to anonymize each person's face, with minimal effect on action detection performance.

Action Detection Privacy Preserving

Joint Person Segmentation and Identification in Synchronized First- and Third-person Videos

no code implementations ECCV 2018 Mingze Xu, Chenyou Fan, Yuchen Wang, Michael S. Ryoo, David J. Crandall

In this paper, we wish to solve two specific problems: (1) given two or more synchronized third-person videos of a scene, produce a pixel-level segmentation of each visible person and identify corresponding people across different views (i.e., determine who in camera A corresponds with whom in camera B), and (2) given one or more synchronized third-person videos as well as a first-person video taken by a mobile or wearable camera, segment and identify the camera wearer in the third-person videos.

Segmentation

Temporal Gaussian Mixture Layer for Videos

1 code implementation ICLR 2019 AJ Piergiovanni, Michael S. Ryoo

We introduce a new convolutional layer named the Temporal Gaussian Mixture (TGM) layer and present how it can be used to efficiently capture longer-term temporal information in continuous activity videos.

Action Detection Activity Detection

Learning Latent Super-Events to Detect Multiple Activities in Videos

2 code implementations CVPR 2018 AJ Piergiovanni, Michael S. Ryoo

In this paper, we introduce the concept of learning latent super-events from activity videos, and present how it benefits activity detection in continuous videos.

Action Detection Activity Detection

Extreme Low Resolution Activity Recognition with Multi-Siamese Embedding Learning

no code implementations 3 Aug 2017 Michael S. Ryoo, Kiyoon Kim, Hyun Jong Yang

This paper presents an approach for recognizing human activities from extreme low resolution (e.g., 16x12) videos.

Activity Recognition Privacy Preserving

Forecasting Hands and Objects in Future Frames

no code implementations 20 May 2017 Chenyou Fan, JangWon Lee, Michael S. Ryoo

The key idea is that (1) an intermediate representation of a convolutional object recognition model abstracts scene information in its frame and that (2) we can predict (i.e., regress) such representations corresponding to the future frames based on that of the current frame.

Object object-detection +2

Identifying First-person Camera Wearers in Third-person Videos

no code implementations CVPR 2017 Chenyou Fan, Jang-Won Lee, Mingze Xu, Krishna Kumar Singh, Yong Jae Lee, David J. Crandall, Michael S. Ryoo

We consider scenarios in which we wish to perform joint scene understanding, object tracking, activity recognition, and other tasks in environments in which multiple people are wearing body-worn cameras while a third-person static camera also captures the scene.

Activity Recognition Object Tracking +1

Learning Social Affordance Grammar from Videos: Transferring Human Interactions to Human-Robot Interactions

no code implementations 1 Mar 2017 Tianmin Shu, Xiaofeng Gao, Michael S. Ryoo, Song-Chun Zhu

In this paper, we present a general framework for learning social affordance grammar as a spatiotemporal AND-OR graph (ST-AOG) from RGB-D videos of human interactions, and transfer the grammar to humanoids to enable a real-time motion inference for human-robot interaction (HRI).

Learning Latent Sub-events in Activity Videos Using Temporal Attention Filters

1 code implementation 26 May 2016 AJ Piergiovanni, Chenyou Fan, Michael S. Ryoo

In this paper, we introduce the concept of temporal attention filters and describe how they can be used for human activity recognition from videos.

Action Classification Action Recognition In Videos +2

Privacy-Preserving Human Activity Recognition from Extreme Low Resolution

no code implementations 12 Apr 2016 Michael S. Ryoo, Brandon Rothrock, Charles Fleming, Hyun Jong Yang

We introduce the paradigm of inverse super resolution (ISR), the concept of learning the optimal set of image transformations to generate multiple low-resolution (LR) training videos from a single video.

Human Activity Recognition Privacy Preserving +1
