We analyze the suitability of our new primitive for video action recognition and explore several novel variations of our approach to enable stronger representational flexibility while maintaining an efficient design.
1 code implementation • Tony Lee, Michihiro Yasunaga, Chenlin Meng, Yifan Mai, Joon Sung Park, Agrim Gupta, Yunzhi Zhang, Deepak Narayanan, Hannah Benita Teufel, Marco Bellagente, Minguk Kang, Taesung Park, Jure Leskovec, Jun-Yan Zhu, Li Fei-Fei, Jiajun Wu, Stefano Ermon, Percy Liang
The stunning qualitative improvement of recent text-to-image models has led to their widespread attention and adoption.
no code implementations • 2 Nov 2023 • Ruohan Zhang, Sharon Lee, Minjune Hwang, Ayano Hiranaka, Chen Wang, Wensi Ai, Jin Jie Ryan Tan, Shreya Gupta, Yilun Hao, Gabrael Levine, Ruohan Gao, Anthony Norcia, Li Fei-Fei, Jiajun Wu
We present Neural Signal Operated Intelligent Robots (NOIR), a general-purpose, intelligent brain-robot interface system that enables humans to command robots to perform everyday activities through brain signals.
Further, we observe that Score Distillation Sampling (SDS) tends to truncate the distribution of complex backgrounds during distillation of 360-degree scenes, and propose "SDS anchoring" to improve the diversity of synthesized novel views.
We present Mini-BEHAVIOR, a novel benchmark for embodied AI that challenges agents to use reasoning and decision-making skills to solve complex activities that resemble everyday human challenges.
These fields capture the dynamics of the underlying 3D environment and encode both semantic features and instance masks.
Large Language Models (LLMs) have the capacity to perform complex scheduling in a multi-agent system and can coordinate these agents to complete sophisticated tasks that require extensive collaboration.
However, challenges arise from the high-dimensional action space of dexterous hands and the complex compositional dynamics of long-horizon tasks.
By combining them, SEED reduces the human effort required in RLHF and increases safety in training robot manipulation with RL in real-world settings.
The composed value maps are then used in a model-based planning framework to zero-shot synthesize closed-loop robot trajectories with robustness to dynamic perturbations.
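To make this concrete, here is a minimal sketch (not the paper's pipeline) of composing value maps and greedily extracting a trajectory from the composed volume; the grid size, weighting, and greedy planner are illustrative assumptions. A closed-loop system would recompute the maps and replan as the scene changes.

```python
# Illustrative sketch: compose a task affordance map and an obstacle
# avoidance map into one cost volume, then greedily descend it.
import numpy as np

def compose_value_map(affordance, avoidance, w_avoid=2.0):
    """Higher affordance = better; higher avoidance = worse."""
    return -affordance + w_avoid * avoidance

def greedy_plan(cost, start, max_steps=200):
    """Follow the locally best neighbor in a 3D cost volume."""
    pos, path = np.array(start), [tuple(start)]
    for _ in range(max_steps):
        best, best_cost = None, cost[tuple(pos)]
        for d in np.ndindex(3, 3, 3):              # 26-connected neighborhood
            nxt = pos + np.array(d) - 1
            if np.all(nxt >= 0) and np.all(nxt < cost.shape) \
                    and cost[tuple(nxt)] < best_cost:
                best, best_cost = nxt, cost[tuple(nxt)]
        if best is None:                           # reached a local minimum
            break
        pos = best
        path.append(tuple(pos))
    return path

rng = np.random.default_rng(0)
affordance = rng.random((20, 20, 20))
avoidance = rng.random((20, 20, 20))
path = greedy_plan(compose_value_map(affordance, avoidance), start=(0, 0, 0))
print(len(path), "waypoints")
```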
Prior works typically assume representation at a fixed dimension or resolution, which may be inefficient for simple tasks and ineffective for more complicated tasks.
In recent years, differential privacy has seen significant advancements in image classification; however, its application to video activity recognition remains under-explored.
Embodied AI agents in large scenes often need to navigate to find objects.
In this work, we propose a novel method for representation learning of multi-view videos, where we explicitly model the representation space to maintain Homography Equivariance (HomE).
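As a rough illustration of an equivariance objective of this kind (the actual HomE formulation may differ), the sketch below penalizes the discrepancy between encoding a warped view and warping the encoding. The `warp` helper only uses the affine part of the homography, and the toy encoder is a stand-in.

```python
# Hedged sketch of a homography-equivariance loss on per-pixel features.
import torch
import torch.nn.functional as F

def warp(img, H):
    """Apply a 2x3 affine approximation of a homography via grid_sample."""
    theta = H[:, :2, :].to(img.dtype)              # affine part only
    grid = F.affine_grid(theta, img.shape, align_corners=False)
    return F.grid_sample(img, grid, align_corners=False)

def home_loss(encoder, x, H):
    """Equivariance: encode-then-warp should match warp-then-encode."""
    z_of_warped = encoder(warp(x, H))
    warped_z = warp(encoder(x), H)
    return F.mse_loss(z_of_warped, warped_z)

encoder = torch.nn.Conv2d(3, 8, 3, padding=1)      # toy feature extractor
x = torch.randn(4, 3, 64, 64)
H = torch.eye(3).expand(4, 3, 3)                   # identity "homography"
print(home_loss(encoder, x, H).item())
```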
We introduce the ObjectFolder Benchmark, a benchmark suite of 10 tasks for multisensory object-centric learning, centered around object recognition, reconstruction, and manipulation with sight, sound, and touch.
We introduce Sonicverse, a multisensory simulation platform with integrated audio-visual simulation for training household agents that can both see and hear.
We evaluate our method in the Dynamic House Simulator, a new benchmark that creates diverse dynamic graphs following the semantic patterns typically seen in homes, and show that NEP can be trained to predict the locations of objects in a variety of environments with diverse object movement dynamics, outperforming baselines in both new-scene adaptability and overall accuracy.
Humans use all of their senses to accomplish different tasks in everyday activities.
ATR selects suitable tasks, which consist of an initial environment state and manipulation goal, for learning robust skills by balancing the diversity and feasibility of the tasks.
no code implementations • 13 Oct 2022 • Matt Deitke, Dhruv Batra, Yonatan Bisk, Tommaso Campari, Angel X. Chang, Devendra Singh Chaplot, Changan Chen, Claudia Pérez D'Arpino, Kiana Ehsani, Ali Farhadi, Li Fei-Fei, Anthony Francis, Chuang Gan, Kristen Grauman, David Hall, Winson Han, Unnat Jain, Aniruddha Kembhavi, Jacob Krantz, Stefan Lee, Chengshu Li, Sagnik Majumder, Oleksandr Maksymets, Roberto Martín-Martín, Roozbeh Mottaghi, Sonia Raychaudhuri, Mike Roberts, Silvio Savarese, Manolis Savva, Mohit Shridhar, Niko Sünderhauf, Andrew Szot, Ben Talbot, Joshua B. Tenenbaum, Jesse Thomason, Alexander Toshev, Joanne Truong, Luca Weihs, Jiajun Wu
We present a retrospective on the state of Embodied AI research.
These results identify tasks where expectation alignment is a more useful strategy than curiosity-driven exploration for multi-agent coordination, enabling agents to do zero-shot coordination.
We show that a wide spectrum of robot manipulation tasks can be expressed with multimodal prompts, interleaving textual and visual tokens.
Because of this clinical data scarcity and inspired by the recent advances in self-supervised large-scale language models like GPT-3, we use human motion forecasting as an effective self-supervised pre-training task for the estimation of motor impairment severity.
This work shows that we can create good video prediction models by pre-training transformers via masked visual modeling.
Robots excel at performing repetitive and precision-sensitive tasks in controlled environments such as warehouses and factories, but have not yet been extended to embodied AI agents providing assistance in household tasks.
The accelerated use of digital cameras prompts an increasing concern about privacy and security, particularly in applications such as action recognition.
Building on recent progress in self-supervised image-language models, we revisit this question in the context of video and language tasks.
Ranked #1 on Video Question Answering on MSR-VTT-MC
We present ObjectFolder 2.0, a large-scale, multisensory dataset of common household objects in the form of implicit neural representations that significantly enhances ObjectFolder 1.0 in three aspects.
Multiple domains like vision, natural language, and audio are witnessing tremendous progress by leveraging Transformers for large-scale pre-training followed by task-specific fine-tuning.
Doing this is challenging for two reasons: on the data side, current interfaces make collecting high-quality human demonstrations difficult, and on the learning side, policies trained on limited data can suffer from covariate shift when deployed.
Over the last decade, Computer Vision, the branch of Artificial Intelligence aimed at understanding the visual world, has evolved from simply recognizing objects in images to describing pictures, answering questions about images, helping robots maneuver around physical spaces, and even generating novel visual content.
In this paper, we study the problem of learning a repertoire of low-level skills from raw images that can be sequenced to complete long-horizon visuomotor tasks.
Multisensory object-centric perception, reasoning, and interaction have been key research topics in recent years.
3 code implementations • 16 Aug 2021 • Rishi Bommasani, Drew A. Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S. Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, Erik Brynjolfsson, Shyamal Buch, Dallas Card, Rodrigo Castellon, Niladri Chatterji, Annie Chen, Kathleen Creel, Jared Quincy Davis, Dora Demszky, Chris Donahue, Moussa Doumbouya, Esin Durmus, Stefano Ermon, John Etchemendy, Kawin Ethayarajh, Li Fei-Fei, Chelsea Finn, Trevor Gale, Lauren Gillespie, Karan Goel, Noah Goodman, Shelby Grossman, Neel Guha, Tatsunori Hashimoto, Peter Henderson, John Hewitt, Daniel E. Ho, Jenny Hong, Kyle Hsu, Jing Huang, Thomas Icard, Saahil Jain, Dan Jurafsky, Pratyusha Kalluri, Siddharth Karamcheti, Geoff Keeling, Fereshte Khani, Omar Khattab, Pang Wei Koh, Mark Krass, Ranjay Krishna, Rohith Kuditipudi, Ananya Kumar, Faisal Ladhak, Mina Lee, Tony Lee, Jure Leskovec, Isabelle Levent, Xiang Lisa Li, Xuechen Li, Tengyu Ma, Ali Malik, Christopher D. Manning, Suvir Mirchandani, Eric Mitchell, Zanele Munyikwa, Suraj Nair, Avanika Narayan, Deepak Narayanan, Ben Newman, Allen Nie, Juan Carlos Niebles, Hamed Nilforoshan, Julian Nyarko, Giray Ogut, Laurel Orr, Isabel Papadimitriou, Joon Sung Park, Chris Piech, Eva Portelance, Christopher Potts, aditi raghunathan, Rob Reich, Hongyu Ren, Frieda Rong, Yusuf Roohani, Camilo Ruiz, Jack Ryan, Christopher Ré, Dorsa Sadigh, Shiori Sagawa, Keshav Santhanam, Andy Shih, Krishnan Srinivasan, Alex Tamkin, Rohan Taori, Armin W. Thomas, Florian Tramèr, Rose E. Wang, William Wang, Bohan Wu, Jiajun Wu, Yuhuai Wu, Sang Michael Xie, Michihiro Yasunaga, Jiaxuan You, Matei Zaharia, Michael Zhang, Tianyi Zhang, Xikun Zhang, Yuhui Zhang, Lucia Zheng, Kaitlyn Zhou, Percy Liang
AI is undergoing a paradigm shift with the rise of models (e.g., BERT, DALL-E, GPT-3) that are trained on broad data at scale and are adaptable to a wide range of downstream tasks.
Our method co-optimizes a human policy and a robot policy in an interactive learning process: the human policy learns to generate diverse and plausible collaborative behaviors from demonstrations while the robot policy learns to assist by estimating the unobserved latent strategy of its human collaborator.
no code implementations • 6 Aug 2021 • Sanjana Srivastava, Chengshu Li, Michael Lingelbach, Roberto Martín-Martín, Fei Xia, Kent Vainio, Zheng Lian, Cem Gokmen, Shyamal Buch, C. Karen Liu, Silvio Savarese, Hyowon Gweon, Jiajun Wu, Li Fei-Fei
We introduce BEHAVIOR, a benchmark for embodied AI with 100 activities in simulation, spanning a range of everyday household chores such as cleaning, maintenance, and food preparation.
1 code implementation • 6 Aug 2021 • Chengshu Li, Fei Xia, Roberto Martín-Martín, Michael Lingelbach, Sanjana Srivastava, Bokui Shen, Kent Vainio, Cem Gokmen, Gokul Dharan, Tanish Jain, Andrey Kurenkov, C. Karen Liu, Hyowon Gweon, Jiajun Wu, Li Fei-Fei, Silvio Savarese
We evaluate the new capabilities of iGibson 2.0 to enable robot learning of novel tasks, in the hope of demonstrating the potential of this new simulator to support new research in embodied AI.
Based on the study, we derive a series of lessons including the sensitivity to different algorithmic design choices, the dependence on the quality of the demonstrations, and the variability based on the stopping criteria due to the different objectives in training and evaluation.
Although virtual agents are increasingly situated in environments where natural language is the most effective mode of interaction with humans, these exchanges are rarely used as an opportunity for learning.
Active learning promises to alleviate the massive data needs of supervised machine learning: it has successfully improved sample efficiency by an order of magnitude on traditional tasks like topic classification and object recognition.
To encourage generalizable skills to emerge, our method trains each skill to specialize in the paired task and maximizes the diversity of the generated tasks.
We propose a novel method for privacy-preserving training of deep neural networks leveraging public, out-of-domain data.
A student network then learns to mimic the expert policy by supervised learning with strong augmentations, making its representation more robust against visual variations compared to the expert.
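A minimal sketch of such a distillation step, assuming a frozen expert and MSE imitation; the networks and the augmentations (additive noise plus random erasing) are illustrative stand-ins.

```python
# Sketch: a student imitates a frozen expert policy under strong augmentations.
import torch
import torch.nn as nn

expert = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 7))   # frozen teacher
student = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 7))
for p in expert.parameters():
    p.requires_grad_(False)

def strong_augment(obs):
    """Stand-in for color jitter / random erasing: noise + random masking."""
    noisy = obs + 0.1 * torch.randn_like(obs)
    mask = (torch.rand_like(obs[:, :1, :, :]) > 0.1).float()
    return noisy * mask

opt = torch.optim.Adam(student.parameters(), lr=1e-4)
obs = torch.randn(16, 3, 64, 64)                 # a batch of observations
target_actions = expert(obs)                     # expert acts on clean views
loss = nn.functional.mse_loss(student(strong_augment(obs)), target_actions)
opt.zero_grad(); loss.backward(); opt.step()
print(loss.item())
```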
3 code implementations • 15 Jun 2021 • Daniel M. Bear, Elias Wang, Damian Mrowca, Felix J. Binder, Hsiao-Yu Fish Tung, R. T. Pramod, Cameron Holdaway, Sirui Tao, Kevin Smith, Fan-Yun Sun, Li Fei-Fei, Nancy Kanwisher, Joshua B. Tenenbaum, Daniel L. K. Yamins, Judith E. Fan
While current vision algorithms excel at many challenging tasks, it is unclear how well they understand the physical dynamics of real-world environments.
Federated learning is an emerging research paradigm enabling collaborative training of machine learning models among different organizations while keeping data private at each institution.
Batch Normalization (BN) and its variants have delivered tremendous success in combating the covariate shift induced by the training step of deep learning methods.
In this paper, we explore the effects of face obfuscation on the popular ImageNet challenge visual recognition benchmark.
Our key insight is that greedy and modular optimization of hierarchical autoencoders can simultaneously address both the memory constraints and the optimization challenges of large-scale video prediction.
Ranked #1 on Video Prediction on Cityscapes 128x128
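The greedy, modular recipe can be sketched as follows: each autoencoder level is trained on the frozen latents of the previous level, so only one module's activations and gradients live in memory at a time. The module sizes and training loop below are illustrative assumptions, not the paper's architecture.

```python
# Sketch of greedy, layer-wise training of a hierarchy of autoencoders.
import torch
import torch.nn as nn

def make_autoencoder(dim_in, dim_latent):
    return nn.Linear(dim_in, dim_latent), nn.Linear(dim_latent, dim_in)

data = torch.randn(256, 512)                     # stand-in for video features
levels, inputs = [], data
for dim in (256, 128, 64):                       # hierarchy of latent sizes
    enc, dec = make_autoencoder(inputs.shape[1], dim)
    opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), lr=1e-3)
    for _ in range(100):                         # train this module greedily
        loss = nn.functional.mse_loss(dec(enc(inputs)), inputs)
        opt.zero_grad(); loss.backward(); opt.step()
    levels.append((enc, dec))
    with torch.no_grad():                        # freeze: next level sees latents only
        inputs = enc(inputs)
print("trained", len(levels), "levels")
```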
Key to such capability is hand-eye coordination, a cognitive ability that enables humans to adaptively direct their movements at task-relevant objects and be invariant to the objects' absolute spatial location.
However, the principles governing the relations between environmental complexity, evolved morphology, and the learnability of intelligent control remain elusive, partially due to the substantial challenge of performing large-scale in silico experiments on evolution and learning.
We develop a simple and effective algorithm to train the policy iteratively on new data collected by the system that encourages the policy to learn how to traverse bottlenecks through the interventions.
To address these challenges, we present Multi-Arm RoboTurk (MART), a multi-user data collection platform that allows multiple remote users to simultaneously teleoperate a set of robotic arms and collect demonstrations for multi-arm tasks.
2 code implementations • 5 Dec 2020 • Bokui Shen, Fei Xia, Chengshu Li, Roberto Martín-Martín, Linxi Fan, Guanzhi Wang, Claudia Pérez-D'Arpino, Shyamal Buch, Sanjana Srivastava, Lyne P. Tchapmi, Micael E. Tchapmi, Kent Vainio, Josiah Wong, Li Fei-Fei, Silvio Savarese
We present iGibson 1.0, a novel simulation environment to develop robotic solutions for interactive tasks in large-scale realistic scenes.
In a third study, we assess effects of metaphor choices on potential users' desire to try out the system and find that users are drawn to systems that project higher competence and warmth.
This is the first benchmark for classifying PD patients based on MDS-UPDRS gait severity and could be an objective biomarker for disease severity.
To enable curriculum learning in the absence of a direct indicator of learning progress, we propose to train the task generator by balancing the agent's performance in the generated tasks and the similarity to the target tasks.
To overcome these limitations, we introduce the idea of Physical Scene Graphs (PSGs), which represent scenes as hierarchical graphs, with nodes in the hierarchy corresponding intuitively to object parts at different scales, and edges to physical connections between parts.
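One way to picture a PSG-style structure is as nodes carrying part-level attributes, within-level edges for physical connections, and parent-child links for the part-whole hierarchy. The field names in this sketch are assumptions, not the paper's schema.

```python
# Illustrative data structure for a hierarchical physical scene graph.
from dataclasses import dataclass, field

@dataclass
class PSGNode:
    name: str
    attributes: dict                                    # e.g. centroid, color
    children: list = field(default_factory=list)        # finer-scale parts
    connected_to: list = field(default_factory=list)    # physical connections

table_top = PSGNode("table_top", {"centroid": (0.0, 0.9, 0.0)})
leg = PSGNode("table_leg", {"centroid": (0.4, 0.45, 0.4)})
table_top.connected_to.append(leg)                      # physically attached parts
table = PSGNode("table", {"centroid": (0.0, 0.5, 0.0)}, children=[table_top, leg])
scene = PSGNode("scene", {}, children=[table])
print(len(scene.children[0].children), "parts under 'table'")
```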
In the second stage of GTI, we collect a small set of rollouts from the unconditioned stochastic policy of the first stage, and train a goal-directed agent to generalize to novel start and goal configurations.
Computer vision technology is being used by many but remains representative of only a few.
Next, by decomposing and learning the temporal changes in visual relationships that result in an action, we demonstrate the utility of a hierarchical event decomposition by enabling few-shot action recognition, achieving 42.7% mAP using as few as 10 examples.
The assumption that these tasks always have exactly one correct answer has resulted in the creation of numerous uncertainty-based measurements, such as entropy and least confidence, which operate over a model's outputs.
We further show that by using the automatically inferred goal from the video demonstration, our robot is able to reproduce the same task in a real kitchen environment.
For simple short-horizon manipulation tasks with modest variation in task instances, offline learning from a small set of demonstrations can produce controllers that successfully solve the task.
We evaluate the quality of our platform, the diversity of demonstrations in our dataset, and the utility of our dataset via quantitative and qualitative analysis.
We present Interactive Gibson Benchmark, the first comprehensive benchmark for training and evaluating Interactive Navigation: robot navigation strategies where physical interaction with objects is allowed and even encouraged to accomplish a task.
The fundamental challenge of planning for multi-step manipulation is to find effective and plausible action sequences that lead to the task goal.
We present 6-PACK, a deep learning approach to category-level 6D object pose tracking on RGB-D data.
Ranked #1 on 6D Pose Estimation using RGBD on REAL275 (Rerr metric)
The presence of bias (in datasets or tasks) is inarguably one of the most critical challenges in machine learning applications, and it has sparked pivotal debates in recent years.
A major difficulty of solving continuous POMDPs is to infer the multi-modal distribution of the unobserved true states and to make the planning algorithm dependent on the perceived uncertainty.
We present an overview of SURREAL-System, a reproducible, flexible, and scalable framework for distributed reinforcement learning (RL).
Recently, neural-network based forward dynamics models have been proposed that attempt to learn the dynamics of physical systems in a deterministic way.
A complex visual navigation task puts an agent in different situations which call for a diverse range of visual perception abilities.
The key technical challenge is that the symbol grounding is prone to error with limited training data and leads to subsequent symbolic planning failures.
Contact-rich manipulation tasks in unstructured environments often require both haptic and visual feedback.
In this paper, we study the problem of procedure planning in instructional videos, which can be seen as a step towards enabling autonomous agents to plan for complex tasks in everyday settings such as cooking.
We introduce the first scene graph prediction model that supports few-shot learning of predicates.
We first evaluate the E3D-LSTM network on widely-used future video prediction datasets and achieve state-of-the-art performance.
Ranked #1 on Video Prediction on KTH (Cond metric)
All scene graph models to date are limited to training on a small set of visual relationships that have thousands of training labels each.
Ranked #1 on Scene Graph Detection on VRD
We construct Human eYe Perceptual Evaluation (HYPE), a human benchmark that is (1) grounded in psychophysics research in perception, (2) reliable across different sets of randomly sampled outputs from a model, (3) able to produce separable model performances, and (4) efficient in cost and time.
Many robotic applications require the agent to perform long-horizon tasks in partially observable environments.
To facilitate training, the network is trained with an auxiliary task of predicting the future location in which the activity will happen.
Ranked #1 on Activity Prediction on ActEV
A key technical challenge in performing 6D object pose estimation from an RGB-D image is to fully leverage the two complementary data sources.
Ranked #4 on 6D Pose Estimation on LineMOD
Therefore, we propose to search the network-level structure in addition to the cell-level structure, which forms a hierarchical architecture search space.
Ranked #7 on Semantic Segmentation on PASCAL VOC 2012 val
The key technical challenge for discriminative modeling with weak supervision is that the loss function of the ordering supervision is usually formulated using dynamic programming and is thus not differentiable.
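The non-differentiability comes from the hard min/max operations inside the alignment recursion. The toy dynamic program below (illustrative, not the paper's loss) aligns per-frame scores to a known action order; the hard `max` has zero gradient almost everywhere, which is exactly the obstacle described above.

```python
# Sketch of a monotone frame-to-action alignment DP with a hard max.
import numpy as np

def align(frame_scores, order):
    """frame_scores: (T, num_actions); order: action indices in known order.
    Returns the best score of a monotone alignment of frames to actions."""
    T, K = frame_scores.shape[0], len(order)
    dp = np.full((T, K), -np.inf)
    dp[0, 0] = frame_scores[0, order[0]]
    for t in range(1, T):
        for k in range(K):
            stay = dp[t - 1, k]
            advance = dp[t - 1, k - 1] if k > 0 else -np.inf
            dp[t, k] = max(stay, advance) + frame_scores[t, order[k]]  # hard max
    return dp[T - 1, K - 1]

scores = np.random.default_rng(0).random((50, 10))
print(align(scores, order=[3, 1, 7]))   # frames must follow actions 3 -> 1 -> 7
```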
In this paper, we study the task of image retrieval, where the input query is specified in the form of an image plus some text that describes desired modifications to the input image.
Ranked #2 on Image Retrieval with Multi-Modal Query on MIT-States
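A minimal sketch of this retrieval setup, using simple additive composition of image and text features as a placeholder (the paper studies more expressive composition functions); the encoders are replaced here by random features.

```python
# Sketch: compose an image query with modification text, rank a gallery.
import torch
import torch.nn.functional as F

def compose(img_feat, txt_feat):
    return F.normalize(img_feat + txt_feat, dim=-1)   # simplest composition

img_feat = torch.randn(1, 128)          # from an image encoder (stand-in)
txt_feat = torch.randn(1, 128)          # from a text encoder (stand-in)
gallery = F.normalize(torch.randn(1000, 128), dim=-1)

query = compose(img_feat, txt_feat)
scores = gallery @ query.T              # cosine similarities (both normalized)
print("top-5 matches:", scores.squeeze(1).topk(5).indices.tolist())
```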
As the senior population rapidly increases, it is challenging yet crucial to provide effective long-term care for seniors who live at home or in senior care facilities.
Computer-vision hospital systems can greatly assist healthcare workers and improve medical facility treatment, but often face patient resistance due to the perceived intrusiveness and violation of privacy associated with visual surveillance.
Homomorphic encryption enables arbitrary computation over data while it remains encrypted.
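As a toy demonstration of the additively homomorphic property, here is a miniature Paillier scheme with insecurely small primes; it only shows that multiplying ciphertexts adds plaintexts, and any real deployment would use a vetted library with far larger parameters.

```python
# Toy Paillier: E(a) * E(b) mod n^2 decrypts to a + b. NOT secure.
import math, random

p, q = 1009, 1013                       # toy primes (insecurely small)
n, n2 = p * q, (p * q) ** 2
lam = math.lcm(p - 1, q - 1)
mu = pow(lam, -1, n)                    # valid because g = n + 1

def encrypt(m):
    r = random.randrange(2, n)
    while math.gcd(r, n) != 1:          # r must be invertible mod n
        r = random.randrange(2, n)
    return (pow(n + 1, m, n2) * pow(r, n, n2)) % n2

def decrypt(c):
    return (((pow(c, lam, n2) - 1) // n) * mu) % n

a, b = 42, 58
c_sum = (encrypt(a) * encrypt(b)) % n2  # multiplying ciphertexts ...
print(decrypt(c_sum))                   # ... adds plaintexts: prints 100
```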
This technology could be deployed to cell phones worldwide and facilitate low-cost universal access to mental health care.
Imitation Learning has empowered recent advances in learning robotic manipulation tasks by addressing shortcomings of Reinforcement Learning such as exploration and reward specification.
A major challenge in computer vision is scaling activity understanding to the long tail of complex activities without requiring collecting large quantities of data for new actions.
We propose Neural Graph Matching (NGM) Networks, a novel framework that can learn to recognize a previously unseen 3D action class with only a few examples.
Ranked #1 on Skeleton Based Action Recognition on CAD-120
We show that these encodings are competitive with existing data hiding algorithms, and further that they can be made robust to noise: our models learn to reconstruct hidden information in an encoded image despite the presence of Gaussian blurring, pixel-wise dropout, cropping, and JPEG compression.
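A hedged sketch of the encode-corrupt-decode training loop behind such robust encodings; the tiny networks and noise layers are stand-ins, and in practice the non-differentiable JPEG step is handled with a differentiable approximation.

```python
# Sketch: hide a bit string in an image, corrupt it, and decode it.
import torch
import torch.nn as nn
import torch.nn.functional as F

bits = 16
encoder = nn.Conv2d(3 + bits, 3, 3, padding=1)   # writes the message residual
decoder = nn.Sequential(nn.Conv2d(3, bits, 3, padding=1),
                        nn.AdaptiveAvgPool2d(1), nn.Flatten())

def corrupt(img):
    img = F.avg_pool2d(img, 3, stride=1, padding=1)   # blur stand-in
    img = F.dropout(img, p=0.1)                       # pixel-wise dropout
    return img + 0.05 * torch.randn_like(img)         # sensor noise

opt = torch.optim.Adam([*encoder.parameters(), *decoder.parameters()], lr=1e-3)
cover = torch.rand(8, 3, 32, 32)
msg = torch.randint(0, 2, (8, bits)).float()
msg_plane = msg[:, :, None, None].expand(-1, -1, 32, 32)  # broadcast bits spatially

encoded = cover + 0.1 * torch.tanh(encoder(torch.cat([cover, msg_plane], 1)))
logits = decoder(corrupt(encoded))
loss = F.binary_cross_entropy_with_logits(logits, msg)
opt.zero_grad(); loss.backward(); opt.step()
print(loss.item())
```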
We hypothesize that to successfully generalize to unseen complex tasks from a single video demonstration, it is necessary to explicitly incorporate the compositional structure of the tasks into the model.
One of the most widely used optimization methods for large-scale machine learning problems is distributed asynchronous stochastic gradient descent (DASGD).
We perform both simulated and real-world experiments on two tool-based manipulation tasks: sweeping and hammering.
Humans have a remarkable capacity to understand the physical dynamics of objects in their environment, flexibly capturing complex structures and interactions at multiple levels of detail.
Our goal is to predict future video frames given a sequence of input frames.
In this work, we propose to tackle this new task with a weakly-supervised framework for reference-aware visual grounding in instructional videos, where only the temporal alignment between the transcription and the video segment is available for supervision.
The ability to capture temporal information has been critical to the development of video understanding models.
To overcome this limitation, we propose a method for generating images from scene graphs, enabling explicit reasoning about objects and their relationships.
Ranked #4 on Layout-to-Image Generation on Visual Genome 64x64
Understanding human motion behavior is critical for autonomous moving platforms (like self-driving cars and social robots) if they are to navigate human-centric environments.
Ranked #4 on Trajectory Prediction on ETH
The framework consists of two core modules: a local module that uses spatial memory to store previous beliefs with parallel updates; and a global graph-reasoning module.
We show that our method both effectively detects the spatial bounds of tools and significantly outperforms existing methods on tool presence detection.
We demonstrate that this policy causes the agent to explore novel and informative interactions with its environment, leading to the generation of a spectrum of complex behaviors, including ego-motion prediction, object attention, and object gathering.
Moreover, the world model that the agent learns supports improved performance on object dynamics prediction and localization tasks.
Recent deep networks are capable of memorizing the entire data even when the labels are completely random.
Ranked #16 on Image Classification on WebVision-1000
We propose a new method for learning the structure of convolutional neural networks (CNNs) that is more efficient than recent state-of-the-art methods based on reinforcement learning and evolutionary algorithms.
Ranked #14 on Neural Architecture Search on NAS-Bench-201, ImageNet-16-120 (Accuracy (Val) metric)
We propose a technique that tackles action detection in multimodal videos under a realistic and challenging condition in which only limited training data and partially observed modalities are available.
We propose a framework that learns a representation transferable across different domains and tasks in a label efficient manner.
Accurate identification and localization of abnormalities from radiology images play an integral part in clinical diagnosis and treatment planning.
In this work, we propose a novel robot learning framework called Neural Task Programming (NTP), which bridges the idea of few-shot learning from demonstration and neural program induction.
We present a crowdsourcing workflow to collect image annotations for visually similar synthetic categories without requiring experts.
In this work, we leverage the ubiquity of Google Street View images and develop a computer vision pipeline to predict income, per capita carbon emission, crime rates and other city attributes from a single source of publicly available visual data.
While fine-grained object recognition is an important problem in computer vision, current models are unlikely to accurately classify objects in the wild.
no code implementations • 1 Aug 2017 • Albert Haque, Michelle Guo, Alexandre Alahi, Serena Yeung, Zelun Luo, Alisha Rege, Jeffrey Jopling, Lance Downing, William Beninati, Amit Singh, Terry Platchek, Arnold Milstein, Li Fei-Fei
One in twenty-five patients admitted to a hospital will suffer from a hospital acquired infection.
Physiological signals such as heart rate can provide valuable information about an individual's state and activity.
Our method uses Q-learning to learn a data labeling policy on a small labeled training dataset, and then uses this to automatically label noisy web data for new visual concepts.
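A toy version of a learned labeling policy, reduced here to a one-step (contextual-bandit style) Q update over discretized confidence states; the states, rewards, and feedback signal are illustrative assumptions rather than the paper's design.

```python
# Sketch: learn when to keep a candidate web label via value updates.
import random
from collections import defaultdict

Q = defaultdict(float)                  # Q[(state, action)]
alpha, eps = 0.1, 0.2

def classifier_confidence():
    return random.random()              # stand-in for the model's score

def trusted_set_agrees(conf):
    return random.random() < conf       # stand-in for validation feedback

for episode in range(5000):
    conf = classifier_confidence()
    state = int(conf * 10)              # discretize confidence into 10 bins
    if random.random() < eps:           # epsilon-greedy exploration
        action = random.choice(("keep", "discard"))
    else:
        action = max(("keep", "discard"), key=lambda a: Q[(state, a)])
    reward = (1.0 if trusted_set_agrees(conf) else -1.0) if action == "keep" else 0.0
    # one-step episodes, so no bootstrapping term is needed
    Q[(state, action)] += alpha * (reward - Q[(state, action)])

print({s: max(("keep", "discard"), key=lambda a: Q[(s, a)]) for s in range(10)})
```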
A crucial capability of real-world intelligent agents is their ability to plan a sequence of actions to achieve their goals in the visual world.
Existing methods for visual reasoning attempt to directly map inputs to outputs using black-box architectures without explicitly modeling the underlying reasoning processes.
Ranked #5 on Visual Question Answering (VQA) on CLEVR-Humans
Recent progress in style transfer on images has focused on improving the quality of stylized images and speed of methods.
We propose an unsupervised method for reference resolution in instructional videos, where the goal is to temporally link an entity (e.g., "dressing") to the action (e.g., "mix yogurt") that produced it.
The United States spends more than $1B each year on initiatives such as the American Community Survey (ACS), a labor-intensive door-to-door study that measures statistics relating to race, gender, education, occupation, unemployment, and other demographic factors.
In this work, we explicitly model the objects and their relationships using scene graphs, a visually-grounded graphical structure of an image.
Ranked #7 on Panoptic Scene Graph Generation on PSG Dataset
We present an unsupervised representation learning approach that compactly encodes the motion dependencies in videos.
When building artificial intelligence systems that can reason and answer questions about visual data, we need diagnostic tests to analyze our progress and discover shortcomings.
We present an attention-based model that reasons on human body shape and motion dynamics to identify individuals in the absence of RGB information, hence in the dark.
Recent progress on image captioning has made it possible to generate novel sentences describing images in natural language, but compressing an image into a single sentence can describe visual content in only coarse detail.
To address the second issue, we propose the AI2-THOR framework, which provides an environment with high-quality 3D scenes and a physics engine.
We improve on prior work by leveraging language priors from semantic word embeddings to finetune the likelihood of a predicted relationship.
Ranked #2 on Scene Graph Generation on VRD
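One simple way to realize such a prior (a sketch, not the paper's exact model) is to rescore each visual prediction with a learned function of the triple's word embeddings, so semantically similar triples receive similar likelihoods; the embeddings and projection below are random stand-ins for pretrained vectors and learned weights.

```python
# Sketch: rescore (subject, predicate, object) predictions with a language prior.
import numpy as np

rng = np.random.default_rng(0)
emb = {w: rng.standard_normal(50) for w in
       ["person", "horse", "dog", "ride", "pet", "wear"]}

def lang_prior(s, p, o, W, b):
    """Sigmoid of a linear projection of the concatenated triple embedding."""
    x = np.concatenate([emb[s], emb[p], emb[o]])
    return float(1.0 / (1.0 + np.exp(-(W @ x + b))))

W, b = rng.standard_normal(150), 0.0    # learned jointly in the real model

def relationship_score(vis_score, triple):
    return vis_score * lang_prior(*triple, W, b)

print(relationship_score(0.8, ("person", "ride", "horse")))
```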
We propose a weakly-supervised framework for action labeling in video, where only the order of occurring actions is required during training time.
Our method outperforms the two most commonly used alternatives (anatomical landmark-based AFNI alignment and cortical convexity-based FreeSurfer alignment) in overlap between predicted region and functionally-defined LOC.
Different from the conventional LSTM, we share the information between multiple LSTMs through a new pooling layer.
Ranked #1 on Trajectory Prediction on Stanford Drone (ADE (8/12) @K=5 metric)
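The pooling idea can be sketched as follows: each agent's LSTM receives a grid-pooled sum of nearby agents' hidden states. The grid size, neighborhood radius, and shapes below are illustrative assumptions.

```python
# Sketch of a social pooling layer over neighbors' hidden states.
import torch

def social_pool(positions, hidden, grid=4, radius=2.0):
    """positions: (N, 2); hidden: (N, H) -> pooled tensor (N, grid*grid*H)."""
    N, H = hidden.shape
    pooled = torch.zeros(N, grid, grid, H)
    cell = 2 * radius / grid
    for i in range(N):
        rel = positions - positions[i]              # neighbors relative to i
        for j in range(N):
            if i == j or rel[j].abs().max() >= radius:
                continue
            gx = int((rel[j, 0] + radius) // cell)  # neighbor j's grid cell
            gy = int((rel[j, 1] + radius) // cell)
            pooled[i, gx, gy] += hidden[j]          # sum hidden states per cell
    return pooled.reshape(N, -1)

positions = torch.randn(5, 2)
hidden = torch.randn(5, 32)
print(social_pool(positions, hidden).shape)         # torch.Size([5, 512])
```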
We consider image transformation problems, where an input image is transformed into an output image.
Ranked #4 on Nuclear Segmentation on Cell17
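A minimal sketch of the feed-forward setup with a perceptual loss: a transform network is trained so its output matches the target in the feature space of a frozen loss network. The tiny networks here are stand-ins for the pretrained VGG features typically used in this line of work.

```python
# Sketch: train an image transform net with a feature-space (perceptual) loss.
import torch
import torch.nn as nn

transform_net = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                              nn.Conv2d(16, 3, 3, padding=1))
feature_net = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU())
for p in feature_net.parameters():      # the loss network stays frozen
    p.requires_grad_(False)

opt = torch.optim.Adam(transform_net.parameters(), lr=1e-3)
x = torch.rand(4, 3, 64, 64)            # input images
y = torch.rand(4, 3, 64, 64)            # target images (e.g. stylized targets)

out = transform_net(x)
loss = nn.functional.mse_loss(feature_net(out), feature_net(y))  # perceptual loss
opt.zero_grad(); loss.backward(); opt.step()
print(loss.item())
```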
We propose a viewpoint invariant model for 3D human pose estimation from a single depth image.
Ranked #4 on Pose Estimation on ITOP top-view
Microtask crowdsourcing has enabled dataset advances in social science and machine learning, but existing crowdsourcing schemes are too expensive to scale up with the expanding volume of data.
Inspired by the recent success of RGB-D cameras, we propose the enrichment of RGB data with an additional "quasi-free" modality, namely, the wireless signal (e.g., WiFi or Bluetooth) emitted by individuals' cell phones, referred to as RGB-W.
We introduce the dense captioning task, which requires a computer vision system to both localize and describe salient regions in images in natural language.
Ranked #3 on Dense Captioning on Visual Genome
In this work we introduce a fully end-to-end approach for action detection in videos that learns to directly predict the temporal bounds of actions.
Ranked #9 on Temporal Action Localization on THUMOS’14 (mAP IOU@0.2 metric)
Current approaches for fine-grained recognition do the following: First, recruit experts to annotate a dataset of images, optionally also collecting more structured data in the form of part annotations and bounding boxes.
Ranked #4 on Fine-Grained Image Classification on CUB-200-2011 (using extra training data)
It enables a new type of QA with visual answers, in addition to textual answers used in previous work.
In this paper, we propose a model which learns to detect events in such videos while automatically "attending" to the people responsible for the event.
Every moment counts in action recognition.
Ranked #7 on Action Detection on Multi-THUMOS