The ability for robots to comprehend and execute manipulation tasks based on natural language instructions is a long-term goal in robotics.
Ranked #4 on Robot Manipulation on RLBench
To address this issue, we present VidChapters-7M, a dataset of 817K user-chaptered videos including 7M chapters in total.
Ranked #1 on Zero-shot dense video captioning on ViTT (using extra training data)
Most CoIR approaches require manually annotated datasets, comprising image-text-image triplets, where the text describes a modification from the query image to the target image.
Ranked #1 on Composed Video Retrieval (CoVR) on WebVid-CoVR
To address this, we develop POCO, a novel framework for training HPS regressors to estimate not only a 3D human body, but also their confidence, in a single feed-forward pass.
While large-scale image-text pretrained models such as CLIP have been used for multiple video-level tasks on trimmed videos, their use for temporal localization in untrimmed videos is still a relatively unexplored task.
Ranked #1 on Action Segmentation on COIN
While previous work mainly evaluates DR for disembodied tasks, such as pose estimation and object detection, here we systematically explore visual domain randomization methods and benchmark them on a rich set of challenging robotic manipulation tasks.
A positive result would refute the common belief that explicit visual abstraction (e.g., object detection) is essential for compositional generalization on visual reasoning, and confirm the feasibility of a neural network "generalist" to solve visual recognition and reasoning tasks.
We show our task is more general than grounding, and models trained on our task can directly be applied to grounding by finding the bounding box with the maximum likelihood of generating the query sentence.
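As a rough illustration of that reduction, the sketch below scores candidate boxes by the likelihood of the query under a captioning model and returns the argmax; caption_log_likelihood is a hypothetical stand-in (stubbed with random scores), not the paper's actual model.

import random
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)

def caption_log_likelihood(image, box: Box, query: str) -> float:
    # Placeholder: a real captioning model would return log P(query | image, box).
    return random.random()

def ground_query(image, candidate_boxes: List[Box], query: str) -> Box:
    # Grounding as a by-product: pick the box most likely to generate the query.
    return max(candidate_boxes, key=lambda b: caption_log_likelihood(image, b, query))

boxes = [(0, 0, 50, 50), (10, 20, 120, 200), (30, 30, 90, 90)]
print(ground_query(None, boxes, "a person holding a red umbrella"))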
In this paper, we investigate how we can use knowledge of objects to design better video models, namely to process fewer tokens and to improve recognition accuracy.
Second, we use examples of user decision-making to provide our LLM-powered planner and reasoner with relevant contextual instances, enhancing their capacity to make informed decisions.
Remarkably, we show that this can be done with a light-weight, single-layer, fusion transformer on top of a frozen CLIP.
Ranked #3 on Fine-Grained Image Recognition on OVEN
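A minimal sketch of the kind of light-weight fusion described above, assuming 512-d CLIP embeddings; random tensors stand in for the frozen CLIP outputs, and the exact fusion design in the paper may differ.

import torch
import torch.nn as nn

d_model = 512
fusion = nn.TransformerEncoderLayer(d_model=d_model, nhead=8, batch_first=True)

# Frozen CLIP features for, e.g., video frames and a text query (placeholders).
frame_feats = torch.randn(2, 16, d_model)  # (batch, num_frames, dim)
text_feats = torch.randn(2, 1, d_model)    # (batch, 1, dim)

tokens = torch.cat([text_feats, frame_feats], dim=1)  # concatenate modalities
fused = fusion(tokens)                                 # single fusion layer
query_repr = fused[:, 0]                               # e.g., pooled text token
print(query_repr.shape)  # torch.Size([2, 512])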
The visual classification performance of vision-language models such as CLIP has been shown to benefit from additional semantic knowledge from large language models (LLMs) such as GPT-3.
We present a framework that formulates visual question answering as modular code generation.
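To make the idea concrete, the sketch below shows the flavor of a program a code-generating model might emit over visual primitives; find and count are hypothetical stubs, not the paper's actual API.

def find(image, name: str):
    # Placeholder: would return regions detected for `name` in the image.
    return ["dog_1", "dog_2"] if name == "dog" else []

def count(regions) -> int:
    return len(regions)

# A program that might be generated for the question "How many dogs are there?"
def answer(image) -> int:
    return count(find(image, "dog"))

print(answer(None))  # 2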
To encourage generalization to new tasks, we avoid particular tasks during training and learn our policy from unlabelled robot trajectories and corresponding robot videos.
In particular, we address reconstruction of hands and manipulated objects from monocular RGB images.
The most performant spatio-temporal action localisation models use external person proposals and complex external memory banks.
Ranked #1 on Action Recognition on AVA v2.1 (using extra training data)
Understanding verbs is crucial to modelling how people and objects interact with each other and the environment through space and time.
Ranked #9 on Video Question Answering on NExT-QA (using extra training data)
Retrieval augmented models are becoming increasingly popular for computer vision tasks after their recent success in NLP problems.
Ranked #1 on Image Classification on WebVision-1000 (using extra training data)
In this work, we introduce ODmAP@k, an object decorrelation metric that measures a model's robustness to spurious correlations in the training data.
Based on these findings, we propose to boost the attribution scores of the model trained with partial labels to make its explanation resemble that of the model trained with full labels.
(ii) We also introduce a simple curriculum scheme during training, which we show is crucial to enable the model to jointly process audio and visual information effectively; and finally (iii) we show that our model achieves state-of-the-art zero-shot results on three different AV-ASR benchmarks (How2, VisSpeech and Ego4D), while also crucially preserving decent performance on traditional audio-only speech recognition benchmarks (LibriSpeech).
In this work, we introduce Vid2Seq, a multi-modal single-stage dense event captioning model pretrained on narrated videos which are readily-available at scale.
Ranked #1 on Dense Video Captioning on ActivityNet Captions (using extra training data)
One of the major challenges of machine translation (MT) is ambiguity, which can in some cases be resolved by accompanying context such as images.
REVEAL consists of four key components: the memory, the encoder, the retriever and the generator.
Ranked #1 on Visual Question Answering (VQA) on A-OKVQA (Accuracy metric)
Can we leverage the audiovisual information already present in video to improve self-supervised representation learning?
Ranked #1 on Audio Classification on EPIC-KITCHENS-100 (using extra training data)
This paper presents WALDO (WArping Layer-Decomposed Objects), a novel approach to the prediction of future video frames from past ones.
In this work we propose a language-conditioned transformer model for grounding 3D objects and their spatial relations.
Unlike prior work on leveraging human videos to teach robots, our method, Human Offline Learned Distances (HOLD), requires neither a priori data from the robot environment, nor a set of task-specific human demonstrations, nor a predefined notion of correspondence across morphologies; yet it is able to accelerate training of several manipulation tasks on a simulated robot arm compared to using only a sparse reward obtained from task completion.
We study class-incremental learning, a training setup in which new classes of data are observed over time for the model to learn from.
Reinforcement learning (RL) and trajectory optimization (TO) present strong complementary advantages.
In human environments, robots are expected to accomplish a variety of manipulation tasks given simple natural language instructions.
Ranked #2 on Robot Manipulation on RLBench (Succ. Rate (10 tasks, 100 demos/task) metric)
Our resulting HM3D-AutoVLN dataset is an order of magnitude larger than existing VLN datasets in terms of navigation environments and instructions.
In this work, we focus on summarizing instructional videos, an under-explored area of video summarization.
We show that such aligned SDFs better focus on reconstructing shape details and improve reconstruction accuracy both for hands and objects.
Transfer learning is the predominant paradigm for training deep networks on small target datasets.
This report describes the approach behind our winning solution to the 2022 Epic-Kitchens Action Recognition Challenge.
Ranked #2 on Action Recognition on EPIC-KITCHENS-100 (using extra training data)
Manual annotation of question and answers for videos, however, is tedious and prohibits scalability.
Ranked #1 on Zero-Shot Learning on iVQA (using extra training data)
Audio-visual automatic speech recognition (AV-ASR) is an extension of ASR that incorporates visual cues, often from the movements of a speaker's mouth.
Visual grounding localizes regions (boxes or segments) in the image corresponding to given referring expressions.
We use our method to generate the WebVidVQA3M dataset from the WebVid dataset, i.e., videos with alt-text annotations, and show its benefits for training VideoQA models.
This paper addresses the problem of copying an unknown assembly of primitives with known shape and appearance using information extracted from a single photograph by an off-the-shelf procedure for object detection and pose estimation.
To close this gap we propose a new video mining pipeline which involves transferring captions from image captioning datasets to video clips with no additional manual effort.
Ranked #4 on Zero-shot Text to Audio Retrieval on AudioCaps
We consider the problem of localizing a spatio-temporal tube in a video corresponding to a given text query.
Ranked #1 on Spatio-Temporal Video Grounding on VidSTG
In this work, we argue that the coupling of camera rotation and camera translation can create complex motion fields that are difficult for a deep network to untangle directly.
To balance the complexity of large action space reasoning and fine-grained language grounding, we dynamically combine a fine-scale encoding over local observations and a coarse-scale encoding on a global map via graph transformers.
Recent advances in deep learning have relied on large, labelled datasets to train high-capacity models.
Recent video and language pretraining frameworks lack the ability to generate sentences.
Ranked #11 on Video Captioning on MSR-VTT (using extra training data)
Video understanding requires reasoning at multiple spatiotemporal resolutions -- from short fine-grained motions to events taking place over longer durations.
Ranked #4 on Action Classification on Kinetics-700 (using extra training data)
Our proposal is to pre-train a video encoder using all the available video modalities as supervision, namely, appearance, sound, and transcribed speech.
Vision-and-language navigation (VLN) aims to build autonomous visual agents that follow instructions and navigate in real scenes.
Ranked #3 on Vision and Language Navigation on RxR
Notably, images depend both on the properties of observed scenes and on the process of image formation.
Given the scarcity of domain-specific training data and the high diversity of image and language inputs, the generalization of VLN agents to unseen environments remains challenging.
Ranked #3 on Vision and Language Navigation on VLN Challenge
Our work aims to obtain 3D reconstruction of hands and manipulated objects from monocular videos.
The prediction model is doubly autoregressive, in the latent space of an autoencoder for forecasting, and in image space for updating contextual information, which is also used to enforce spatio-temporal consistency through a learnable optical flow module.
Ranked #8 on Video Prediction on Kinetics-600 12 frames, 64x64
Goal-conditioned reinforcement learning endows an agent with a large variety of skills, but it often struggles to solve tasks that require more temporally extended reasoning.
Humans perceive the world by concurrently processing and fusing high-dimensional inputs from multiple modalities such as vision and audio.
Ranked #1 on Action Classification on Kinetics-Sounds
To address this issue, we introduce a new challenging task to generate HD maps.
Residual reinforcement learning (RL) has been proposed as a way to solve challenging robotic tasks by adapting control actions from a conventional feedback controller to maximize a reward signal.
Extensive experiments on COCO and OpenImages show that, in the single-object discovery setting where a single prominent object is sought in each image, the proposed LOD (Large-scale Object Discovery) approach is on par with, or better than, the state of the art for medium-scale datasets (up to 120K images), and over 37% better than the only other algorithms capable of scaling up to 1.7M images.
We demonstrate that encoding the history with a transformer is critical to solve compositional tasks, and that pretraining and joint training with synthetic instructions further improve the performance.
In this paper we introduce Segmenter, a transformer model for semantic segmentation.
Ranked #15 on Semantic Segmentation on PASCAL Context
An effective and simple approach to long-tailed visual recognition is to learn feature representations and a classifier separately, with instance and class-balanced sampling, respectively.
Ranked #10 on Long-tail Learning on iNaturalist 2018
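A minimal sketch of the decoupled recipe in PyTorch, using a toy long-tailed label set: uniform (instance-balanced) sampling for representation learning, then class-balanced sampling when retraining the classifier.

import torch
from torch.utils.data import WeightedRandomSampler, DataLoader, TensorDataset
from collections import Counter

labels = torch.tensor([0, 0, 0, 0, 0, 1, 1, 2])  # long-tailed toy labels
data = torch.randn(len(labels), 8)
dataset = TensorDataset(data, labels)

# Stage 1: instance-balanced (uniform) sampling for representation learning.
stage1_loader = DataLoader(dataset, batch_size=4, shuffle=True)

# Stage 2: class-balanced sampling, each class equally likely per draw.
counts = Counter(labels.tolist())
weights = torch.tensor([1.0 / counts[int(y)] for y in labels])
sampler = WeightedRandomSampler(weights, num_samples=len(labels), replacement=True)
stage2_loader = DataLoader(dataset, batch_size=4, sampler=sampler)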
CNNs perform remarkably well when the training and test distributions are i.i.d., but unseen image corruptions can cause a surprisingly large drop in performance.
We present pure-transformer based models for video classification, drawing upon the recent success of such models in image classification.
Ranked #8 on Action Classification on Moments in Time (Top 5 Accuracy metric, using extra training data)
Accurate video understanding involves reasoning about the relationships between actors, objects and their environment, often over long temporal intervals.
Leveraging recent advances in multimodal learning, our model consists of a novel co-attentional multimodal video transformer, and when trained on both textual and visual context, outperforms baselines that use textual inputs alone.
In this work, we propose to avoid manual annotation and generate a large-scale training dataset for video question answering making use of automatic cross-modal supervision.
Ranked #2 on Zero-Shot Learning on ActivityNet-QA (using extra training data)
Motion planning and obstacle avoidance is a key challenge in robotics applications.
4 code implementations • 19 Aug 2020 • Hang Zhao, Jiyang Gao, Tian Lan, Chen Sun, Benjamin Sapp, Balakrishnan Varadarajan, Yue Shen, Yi Shen, Yuning Chai, Cordelia Schmid, Cong-Cong Li, Dragomir Anguelov
Our key insight is that for prediction within a moderate time horizon, the future modes can be effectively captured by a set of target states.
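A minimal sketch of that target-based view: enumerate candidate target states, score them with a placeholder scorer, and keep the top-k as predicted modes; the actual networks in the paper are not reproduced here.

import numpy as np

def score_target(history: np.ndarray, target: np.ndarray) -> float:
    # Placeholder scorer: prefer targets roughly continuing the last heading.
    last_step = history[-1] - history[-2]
    return float(-np.linalg.norm((history[-1] + 3.0 * last_step) - target))

history = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.1]])       # past positions
candidates = np.random.uniform(low=[2, -3], high=[8, 3], size=(64, 2))

scores = np.array([score_target(history, t) for t in candidates])
top_k = candidates[np.argsort(scores)[-6:]]  # e.g., 6 predicted modes
print(top_k)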
1 code implementation • 3 Aug 2020 • Samuel Albanie, Yang Liu, Arsha Nagrani, Antoine Miech, Ernesto Coto, Ivan Laptev, Rahul Sukthankar, Bernard Ghanem, Andrew Zisserman, Valentin Gabeur, Chen Sun, Karteek Alahari, Cordelia Schmid, Shi-Zhe Chen, Yida Zhao, Qin Jin, Kaixu Cui, Hui Liu, Chen Wang, Yudong Jiang, Xiaoshuai Hao
This report summarizes the results of the first edition of the challenge together with the findings of the participants.
Based on this observation, we propose to use text as a method for learning video representations.
Despite the recent advances in video classification, progress in spatio-temporal action recognition has lagged behind.
In this paper, we present a multi-modal transformer to jointly encode the different modalities in video, which allows each of them to attend to the others.
Ranked #1 on Zero-Shot Video Retrieval on MSR-VTT (text-to-video Mean Rank metric, using extra training data)
This paper addresses the task of unsupervised learning of representations for action recognition in videos.
To handle inherent modeling error in the consistency loss (e.g., Lambertian assumptions) and for better generalization, we further introduce a learned, output refinement network, which takes the initial predictions, the loss, and the gradient as input, and efficiently predicts a correlated output update.
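A minimal sketch of such a refinement step, assuming a generic vector-valued prediction and a stand-in consistency loss: a small network maps (prediction, loss, gradient) to an additive update.

import torch
import torch.nn as nn

class Refiner(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(2 * dim + 1, 64), nn.ReLU(), nn.Linear(64, dim))

    def forward(self, pred, loss_value, grad):
        # Concatenate prediction, gradient, and the scalar loss as input features.
        inp = torch.cat([pred, grad, loss_value.reshape(1, 1).expand(pred.shape[0], 1)], dim=-1)
        return pred + self.net(inp)  # correlated additive update

dim = 16
pred = torch.randn(4, dim, requires_grad=True)
loss = ((pred - 1.0) ** 2).mean()        # stand-in for the consistency loss
grad, = torch.autograd.grad(loss, pred)  # gradient fed to the refiner
refined = Refiner(dim)(pred.detach(), loss.detach(), grad)
print(refined.shape)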
To this end, we ask annotators to label objects that move at any point in the video, and give names to them post factum.
Contrastive learning between multiple views of the data has recently achieved state of the art performance in the field of self-supervised representation learning.
Ranked #2 on Contrastive Learning on imagenet-1k
Behavior prediction in dynamic, multi-agent systems is an important problem in the context of self-driving cars, due to the complex representations and interactions of road components, including moving agents (e.g., pedestrians and vehicles) and road context information (e.g., lanes, traffic lights).
Modeling hand-object manipulations is essential for understanding how humans interact with their environment.
We assume that the model is updated incrementally for new classes as new data becomes available sequentially. This requires adapting the previously stored feature vectors to the updated feature space without having access to the corresponding original training images.
We train a BERT-based Speech2Action classifier on over a thousand movie screenplays, to predict action labels from transcribed speech segments.
Popular approaches for few-shot classification consist of first learning a generic data representation based on a large annotated dataset, before adapting the representation to new classes given only a few labeled samples.
Ranked #4 on Few-Shot Image Classification on Meta-Dataset
Eye movement and strategic placement of the visual field onto the retina give animals increased resolution of the scene and suppress distracting information.
To learn models or features that generalize across tasks and domains is one of the grand goals of machine learning.
Although synthetic training data has been shown to be beneficial for tasks such as human pose estimation, its use for RGB human action recognition is relatively unexplored.
Moreover, at test time the same network can be applied to detection and tracking, resulting in a unified approach for the two tasks.
In this work we consider the problem of learning a classifier from noisy labels when a few clean labeled examples are given.
Membership inference determines, given a sample and trained parameters of a machine learning model, whether the sample was part of the training set.
Manipulation tasks such as preparing a meal or assembling furniture remain highly challenging for robotics and vision.
In this paper, we tackle the problem of 3D human shape estimation from single RGB images.
We present GLNet, a self-supervised framework for learning depth, optical flow, camera pose and intrinsic parameters from monocular video - addressing the difficulty of acquiring realistic ground-truth for such tasks.
This paper proposes a self-supervised learning approach for video features that results in significantly improved performance on downstream tasks (such as video classification, captioning and segmentation) compared to existing methods.
Previous work has made significant progress towards reconstruction of hand poses and object shapes in isolation.
Ranked #10 on 3D Hand Pose Estimation on FreiHAND (PA-F@5mm metric)
Self-supervised learning has become increasingly important to leverage the abundance of unlabeled data available on platforms like YouTube.
Ranked #1 on Action Classification on YouCook2
Few-shot classification consists of learning a predictive model that is able to effectively adapt to a new class, given only a few annotated samples.
Policies learned in simulators, however, do not transfer well to real scenes given the domain gap between real and synthetic data.
1 code implementation • 5 Jan 2019 • Joseph Roth, Sourish Chaudhuri, Ondrej Klejch, Radhika Marvin, Andrew Gallagher, Liat Kaver, Sharadh Ramaswamy, Arkadiusz Stopczynski, Cordelia Schmid, Zhonghua Xi, Caroline Pantofaru
The dataset contains temporally labeled face tracks in video, where each face instance is labeled as speaking or not, and whether the speech is audible.
We show that our model significantly improves over existing hybrid models: offering GAN-like samples, IS and FID scores that are competitive with fully adversarial models, and improved likelihood scores.
We seek to detect visual relations in images of the form of triplets t = (subject, predicate, object), such as "person riding dog", where training examples of the individual entities are available but their combinations are unseen at training.
A dominant paradigm for learning-based approaches in computer vision is training generic models, such as ResNet for image recognition, or I3D for video understanding, on large datasets and allowing them to discover the optimal representation for the problem at hand.
First, we propose a model that extends variational autoencoders by using deterministic invertible transformation layers to map samples from the decoder to the image space.
Convolutional neural networks memorize part of their training data, which is why strategies such as data augmentation and drop-out are employed to mitigate overfitting.
In this work, we consider object detection, semantic and instance segmentation and augment the training images by blending objects in existing scenes, using instance segmentation annotations.
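A minimal sketch of the core paste operation on synthetic arrays; the paper additionally models context and blends boundaries, which is omitted here.

import numpy as np

def paste_object(scene: np.ndarray, obj: np.ndarray, mask: np.ndarray, top: int, left: int) -> np.ndarray:
    out = scene.copy()
    h, w = mask.shape
    region = out[top:top + h, left:left + w]
    region[mask] = obj[mask]  # copy object pixels under the instance mask
    return out

scene = np.zeros((128, 128, 3), dtype=np.uint8)
obj = np.full((32, 32, 3), 255, dtype=np.uint8)  # white square as a fake object
mask = np.zeros((32, 32), dtype=bool)
mask[8:24, 8:24] = True                           # its instance mask
augmented = paste_object(scene, obj, mask, top=40, left=50)
print(augmented.sum())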
A visualization of the learned relation features confirms that our approach is able to attend to the relevant relations for each action.
Ranked #15 on Action Recognition on AVA v2.1
Although deep learning approaches have stood out in recent years due to their state-of-the-art results, they continue to suffer from catastrophic forgetting, a dramatic decrease in overall performance when training with new classes added incrementally.
Ranked #2 on Incremental Learning on ImageNet - 10 steps (# M Params metric)
For this approach to be successful, we show that appropriately modeling the visual context surrounding objects is crucial to placing them in the right environment.
Our model is based on discriminative clustering and integrates different types of supervision as constraints on the optimization.
In order to localize actions in time, we propose a recurrent localization network (RecLNet) designed to model the temporal structure of actions on the level of person tracks.
We use the human joints as these keypoints and term our Pose moTion representation PoTion.
Ranked #1 on Skeleton Based Action Recognition on J-HMDB
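A minimal sketch, with synthetic heatmaps, of time-colorized aggregation in the spirit of PoTion: each frame's joint heatmap is weighted by time-dependent colour weights and summed over the clip.

import numpy as np

T, H, W, C = 8, 64, 64, 2            # frames, height, width, colour channels
heatmaps = np.random.rand(T, H, W)   # per-frame heatmap for one joint (placeholder)

potion = np.zeros((H, W, C))
for t in range(T):
    a = t / (T - 1)                  # normalized time in [0, 1]
    colour = np.array([1 - a, a])    # 2-channel colourization: early vs. late
    potion += heatmaps[t, :, :, None] * colour

potion /= potion.max()               # simple normalization
print(potion.shape)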
In this paper, we introduce an unsupervised learning approach to automatically discover, summarize, and manipulate artistic styles from large collections of paintings.
In this paper we describe the egocentric aspect of the dataset and present annotations for Charades-Ego with 68,536 activity instances in 68.8 hours of first- and third-person video, making it one of the largest and most diverse egocentric datasets available.
Several theories in cognitive neuroscience suggest that when people interact with the world, or simulate interactions, they do so from a first-person egocentric perspective, and seamlessly transfer knowledge between third-person (observer) and first-person (actor).
Human shape estimation is an important task for video editing, animation and fashion industry.
Ranked #3 on 3D Human Pose Estimation on Surreal (using extra training data)
We propose an end-to-end architecture for joint 2D and 3D human pose estimation in natural images.
Here, we propose a solution to generate a large set of photorealistic synthetic images of humans with 3D pose annotations.
We formulate this as a learning problem and design our framework with three cues: (i) independent object motion between a pair of frames, which complements object recognition, (ii) object appearance, which helps to correct errors in motion estimation, and (iii) temporal consistency, which imposes additional constraints on the segmentation.
dog and cat jumping, enabling detection of an object's actions without training on these object-action pairs.
Despite their success for object detection, convolutional neural networks are ill-equipped for incremental learning, i.e., adapting the original model trained on a set of classes to additionally detect objects of new classes, in the absence of the initial training data.
Real-time scene understanding has become crucial in many applications such as autonomous driving.
Ranked #2 on Real-Time Object Detection on PASCAL VOC 2007
This paper introduces a novel approach for modeling visual relations between pairs of objects.
In this paper, we propose a new framework for action localization that tracks people in videos and extracts full-body human tubes, i.e., spatio-temporal regions localizing actions, even in the case of occlusions or truncations.
To this end, we regard the evolving landmark data as a high-dimensional path and apply non-linear path signature techniques to provide an expressive, robust, non-linear, and interpretable representation for the sequential events.
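A minimal numpy sketch of a level-2 truncated path signature of a toy landmark trajectory; real pipelines typically rely on a dedicated signature library and higher truncation levels.

import numpy as np

path = np.cumsum(np.random.randn(20, 2), axis=0)  # toy 2-D landmark path
dx = np.diff(path, axis=0)                        # increments

level1 = dx.sum(axis=0)                           # S^(i): total displacement
# S^(i,j) ~ sum over s < t of dx_i(s) * dx_j(t): discrete iterated-integral approximation.
level2 = np.zeros((2, 2))
for t in range(1, len(dx)):
    level2 += np.outer(dx[:t].sum(axis=0), dx[t])

signature = np.concatenate([level1, level2.ravel()])
print(signature.shape)  # (6,)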
We propose an end-to-end architecture for joint 2D and 3D human pose estimation in natural images.
Ranked #4 on 3D Multi-Person Pose Estimation (root-relative) on MuPoTS-3D (MPJPE metric)
8 code implementations • Chunhui Gu, Chen Sun, David A. Ross, Carl Vondrick, Caroline Pantofaru, Yeqing Li, Sudheendra Vijayanarasimhan, George Toderici, Susanna Ricco, Rahul Sukthankar, Cordelia Schmid, Jitendra Malik
The AVA dataset densely annotates 80 atomic visual actions in 430 15-minute video clips, where actions are localized in space and time, resulting in 1.58M action labels with multiple labels per person occurring frequently.
Ranked #6 on Action Detection on UCF101-24
This paper addresses the problem of establishing semantic correspondences between images depicting different instances of the same object or scene category.
We propose the ACtion Tubelet detector (ACT-detector) that takes as input a sequence of frames and outputs tubelets, i.e., sequences of bounding boxes with associated scores.
We propose SfM-Net, a geometry-aware neural network for motion estimation in videos that decomposes frame-to-frame pixel motion in terms of scene and object depth, camera motion and 3D object rotations and translations.
The module to build a "visual memory" in video, i.e., a joint representation of all the video frames, is realized with a convolutional recurrent unit learned from a small number of training video sequences.
Finding image correspondences remains a challenging problem in the presence of intra-class variations and large changes in scene layout.
In this work we present SURREAL (Synthetic hUmans foR REAL tasks): a new large-scale dataset with synthetically-generated but realistic images of people rendered from 3D sequences of human motion capture data.
Here, we propose a solution to generate a large set of photorealistic synthetic images of humans with 3D pose annotations.
Ranked #113 on 3D Human Pose Estimation on Human3.6M (PA-MPJPE metric)
Typical human actions last several seconds and exhibit characteristic spatio-temporal structure.
Ranked #61 on Action Recognition on UCF101
We also demonstrate that the performance of M-CNN learned with 150 weak video annotations is on par with state-of-the-art weakly-supervised methods trained with thousands of images.
Convolutional neural networks (CNNs) have recently received a lot of attention due to their ability to model local stationary structures in natural images in a multi-scale fashion, when learning all model parameters with supervision.
Patch-level descriptors underlie several important computer vision tasks, such as stereo-matching or content-based image retrieval.
Finding image correspondences remains a challenging problem in the presence of intra-class variations and large changes in scene layout. Semantic flow methods are designed to handle images depicting different instances of the same object or scene category.
It has been experimentally observed that the performance of BoW and FV representations can be improved by employing discounting transformations such as power normalization.
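For reference, power normalization is the element-wise discounting transform f(x) = sign(x) * |x|^alpha with 0 < alpha <= 1; a minimal sketch:

import numpy as np

def power_normalize(x: np.ndarray, alpha: float = 0.5) -> np.ndarray:
    # Element-wise power normalization, commonly applied to BoW/FV vectors.
    return np.sign(x) * np.abs(x) ** alpha

v = np.array([4.0, -9.0, 0.25, 0.0])
print(power_normalize(v))  # [ 2.  -3.   0.5  0. ]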
While important advances were recently made towards temporally localizing and recognizing specific human actions or activities in videos, efficient detection and classification of long video chunks belonging to semantically defined categories such as "pursuit" or "romance" remains challenging. We introduce a new dataset, Action Movie Franchises, consisting of a collection of Hollywood action movie franchises.
We introduce a novel matching algorithm, called DeepMatching, to compute dense correspondences between images.
Ranked #4 on Dense Pixel Correspondence Estimation on HPatches