In this paper, we study multimodal coreference resolution, specifically where a longer descriptive text, i.e., a narration, is paired with an image.
Situation Recognition is the task of generating a structured summary of what is happening in an image using an activity verb and the semantic roles played by actors and objects.
Ranked #1 on Situation Recognition on imSitu
We evaluate models on CLAVI and find that all models achieve high performance on multimodal shortcut instances, but most of them have poor performance on the counterfactual instances that necessitate joint multimodal understanding.
The final context-infused spatio-temporal interaction tokens are used for compositional action recognition.
To tackle this, we propose a simple yet effective Regional Prompt Tuning, which encodes "regional visual hints" and "global contexts" separately at fine- and coarse-grained levels.
Ranked #1 on Visual Abductive Reasoning on SHERLOCK
We show that 1. EBL can be used to improve instance selection for a self-training task on the unlabelled target domain, and 2. aligning and normalizing energy scores can learn domain-invariant representations.
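A minimal sketch of energy-based instance selection for self-training, assuming the common logsumexp free-energy score; the function names and keep-ratio here are illustrative, not the paper's exact procedure:

```python
import torch

def free_energy(logits, temperature=1.0):
    # Free energy of an input under the energy-based view of a classifier:
    # E(x) = -T * logsumexp(f(x) / T). Lower energy ~ more "in-distribution".
    return -temperature * torch.logsumexp(logits / temperature, dim=1)

def select_pseudo_labels(logits, keep_ratio=0.5):
    # Keep the lowest-energy (most confident) target instances for self-training.
    energy = free_energy(logits)
    k = int(keep_ratio * logits.size(0))
    keep = torch.topk(-energy, k).indices   # lowest-energy instances first
    pseudo = logits.argmax(dim=1)           # hard pseudo-labels
    return keep, pseudo[keep]

# Usage: logits from a source-trained model on unlabelled target data.
logits = torch.randn(128, 10)
idx, labels = select_pseudo_labels(logits, keep_ratio=0.25)
```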
Coreference resolution, a core task in natural language processing, aims to identify words and phrases that refer to the same entity in a text.
On the EK100 evaluation server, InAViT is the top-performing method on the public leaderboard (at the time of submission), where it outperforms the second-best model by 3.3% on mean top-5 recall.
Ranked #1 on Action Anticipation on EGTEA
In this paper, we introduce a novel research task, "abductive action inference", which addresses the question of which actions a human executed to reach a specific state shown in a single snapshot.
As of this submission, our method is the new state of the art for action anticipation on EK55 and EGTEA Gaze+ (https://competitions.codalab.org/competitions/20071#results). Code is available at https://github.com/debadityaroy/Abstract_Goal
In this work, we address two key limitations of such representations: their failure to capture local 3D geometric fine details, and to learn from and generalize to shapes with unseen 3D transformations.
We propose LocFormer, a Transformer-based model for video grounding that operates at a constant memory footprint regardless of the video length, i.e., the number of frames.
Scene graph generation (SGG) aims to capture a wide variety of interactions between pairs of objects, which is essential for full scene understanding.
Attention modules for Convolutional Neural Networks (CNNs) are an effective method to enhance performance on multiple computer vision tasks.
A temporal recurrent encoder captures the temporal information of input videos, while a self-attention model attends to relevant feature dimensions of the input space.
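A minimal sketch of such an encoder, assuming a GRU for the temporal part and a learned sigmoid gate over feature dimensions for the attention part; all layer sizes and names are illustrative, not the paper's:

```python
import torch
import torch.nn as nn

class TemporalEncoderWithFeatureAttention(nn.Module):
    """Sketch: a GRU encodes the frame sequence, and a learned attention
    vector re-weights feature dimensions of the encoded video."""

    def __init__(self, feat_dim, hidden_dim):
        super().__init__()
        self.gru = nn.GRU(feat_dim, hidden_dim, batch_first=True)
        self.attn = nn.Sequential(nn.Linear(hidden_dim, hidden_dim), nn.Sigmoid())

    def forward(self, frames):              # frames: (B, T, feat_dim)
        states, _ = self.gru(frames)        # states: (B, T, hidden_dim)
        video = states.mean(dim=1)          # temporal summary
        weights = self.attn(video)          # per-dimension attention in [0, 1]
        return video * weights              # attended representation

enc = TemporalEncoderWithFeatureAttention(feat_dim=2048, hidden_dim=512)
out = enc(torch.randn(4, 16, 2048))         # (4, 512)
```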
We propose a framework for early action recognition and anticipation by correlating past features with the future using three novel similarity measures called Jaccard vector similarity, Jaccard cross-correlation and Jaccard Frobenius inner product over covariances.
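For the first of these measures, a minimal sketch assuming the min-max (Ruzicka) generalization of Jaccard similarity to non-negative feature vectors; the paper's exact definitions of the three measures may differ:

```python
import numpy as np

def jaccard_vector_similarity(x, y, eps=1e-8):
    # Min-max (Ruzicka) generalization of Jaccard similarity to
    # non-negative feature vectors.
    x, y = np.abs(x), np.abs(y)
    return np.minimum(x, y).sum() / (np.maximum(x, y).sum() + eps)

past = np.random.rand(512)     # pooled features of observed frames
future = np.random.rand(512)   # hypothesized future features
print(jaccard_vector_similarity(past, future))
```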
In this paper, we introduce a new variational model that extends the recurrent network in two ways for the task of video frame prediction.
Capsule networks (CapsNets) have recently shown promise in a range of computer vision tasks, especially those pertaining to scene understanding.
This paper studies the task of temporal moment localization in a long untrimmed video using a natural language query.
This paper presents a framework to recognize temporal compositions of atomic actions in videos.
We extend our action sequence forecasting model to perform weakly supervised action forecasting on two challenging datasets, Breakfast and 50Salads.
Automatically generating natural language descriptions from an image is a challenging problem in artificial intelligence that requires a good understanding of the visual and textual signals and the correlations between them.
We introduce a novel Recurrent Neural Network-based algorithm for future video feature generation and action anticipation called feature mapping RNN.
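A minimal sketch of the idea, assuming an LSTM that regresses a future feature vector which is then classified to anticipate the action; the architecture details are illustrative, not the paper's:

```python
import torch
import torch.nn as nn

class FutureFeatureRNN(nn.Module):
    """Sketch of RNN-based future feature generation: encode observed frame
    features, regress the next feature vector, then classify the generated
    feature to anticipate the action."""

    def __init__(self, feat_dim, hidden_dim, num_classes):
        super().__init__()
        self.rnn = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.to_future = nn.Linear(hidden_dim, feat_dim)
        self.classifier = nn.Linear(feat_dim, num_classes)

    def forward(self, observed):              # (B, T, feat_dim)
        _, (h, _) = self.rnn(observed)
        future_feat = self.to_future(h[-1])   # predicted future feature
        return future_feat, self.classifier(future_feat)

model = FutureFeatureRNN(feat_dim=1024, hidden_dim=512, num_classes=101)
future, logits = model(torch.randn(2, 10, 1024))
```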
Furthermore, we use our model, which is trained to output action sequences, to solve downstream tasks such as video captioning and action localization.
Detecting temporal extents of human actions in videos is a challenging computer vision problem that requires detailed manual supervision including frame-level labels.
Action anticipation is critical in scenarios where one needs to react before the action is finalized.
State-of-the-art face super-resolution methods use deep convolutional neural networks to learn a mapping between low-resolution (LR) facial patterns and their corresponding high-resolution (HR) counterparts by exploring local information.
Human action-anticipation methods predict the future action by observing only a small portion of an action in progress.
An LR input contains the low-frequency facial components of its HR version, while its residual face image, defined as the difference between the HR ground truth and the interpolated LR image, contains the missing high-frequency facial details.
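A minimal sketch of constructing this residual training target, assuming bicubic interpolation for the LR upsampling:

```python
import torch
import torch.nn.functional as F

def residual_target(hr, lr):
    # Residual face image: HR ground truth minus the bicubic-interpolated
    # LR input. A network trained on this target only has to predict the
    # missing high-frequency facial details.
    lr_up = F.interpolate(lr, size=hr.shape[-2:], mode="bicubic",
                          align_corners=False)
    return hr - lr_up

hr = torch.rand(1, 3, 128, 128)   # HR ground truth
lr = torch.rand(1, 3, 32, 32)     # LR input
residual = residual_target(hr, lr)
# At test time: sr = upsampled_lr + predicted_residual
```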
First, we present "discriminative rank pooling" in which the shared weights of our video representation and the parameters of the action classifiers are estimated jointly for a given training dataset of labelled vector sequences using a bilevel optimization formulation of the learning problem.
Unrolling these iterations in a Sinkhorn network layer, we propose DeepPermNet, an end-to-end CNN model for this task.
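A minimal sketch of the Sinkhorn iteration in log-space, which produces the doubly-stochastic (soft permutation) matrices such a layer relies on; the exact layer in DeepPermNet may differ in detail:

```python
import torch

def sinkhorn(log_scores, n_iters=20):
    # Sinkhorn normalization: alternately normalize rows and columns in
    # log-space, converging to a doubly-stochastic matrix (a soft,
    # differentiable relaxation of a permutation matrix).
    for _ in range(n_iters):
        log_scores = log_scores - torch.logsumexp(log_scores, dim=-1, keepdim=True)
        log_scores = log_scores - torch.logsumexp(log_scores, dim=-2, keepdim=True)
    return log_scores.exp()

P = sinkhorn(torch.randn(8, 8))
print(P.sum(dim=0), P.sum(dim=1))   # both ~1 (doubly stochastic)
```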
Most popular deep models for action recognition split video sequences into short sub-sequences consisting of a few frames; frame-based features are then pooled for recognizing the activity.
In contrast to the widely studied problem of recognizing an action given a complete sequence, action anticipation aims to identify the action from only partially available videos.
Three retinal image analysis experts were employed to categorize these images into Accept and Reject classes based on the precise definition of image quality in the context of DR. A deep learning framework was trained using 3428 images.
Existing image captioning models do not generalize well to out-of-domain images containing novel scenes or objects.
On the MPII Cooking dataset we detect action segments with a precision of 21.6% and recall of 11.7% over 946 long video pairs and over 5000 ground truth action segments.
This is a powerful idea because it allows us to convert any video to an image, so that existing CNN models pre-trained for the analysis of still images can be immediately extended to videos.
On action classification, our method obtains 60.3% on the UCF101 dataset using only UCF101 data for training, which is approximately 10% better than current state-of-the-art self-supervised learning methods.
Ranked #46 on Self-Supervised Action Recognition on UCF101
We outperform the state-of-the-art methods that, like ours, rely only on RGB frames as input, for both action recognition and anticipation.
This paper introduces an extension of the backpropagation algorithm that enables us to have layers with constrained weights in a deep network.
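For context, a minimal sketch of the usual alternative, projected gradient descent, in which weights are projected back onto the constraint set (here, unit-norm rows) after each ordinary gradient step; the paper's backpropagation extension handles constraints differently:

```python
import torch
import torch.nn as nn

# Projected gradient descent baseline for a layer with constrained weights:
# take an ordinary gradient step, then project the weights back onto the
# constraint set (unit-norm rows in this sketch).
layer = nn.Linear(16, 8, bias=False)
opt = torch.optim.SGD(layer.parameters(), lr=0.1)

x, y = torch.randn(32, 16), torch.randn(32, 8)
loss = ((layer(x) - y) ** 2).mean()
loss.backward()
opt.step()
with torch.no_grad():  # projection step
    layer.weight.div_(layer.weight.norm(dim=1, keepdim=True).clamp_min(1e-8))
```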
Some recent works in machine learning and computer vision involve the solution of a bi-level optimization problem.
We introduce the concept of dynamic image, a novel compact representation of videos useful for video analysis especially when convolutional neural networks (CNNs) are used.
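A minimal sketch of computing a dynamic image by approximate rank pooling, i.e., a weighted sum of the frames; the closed-form coefficients below are the ones commonly cited for this construction, so treat the exact weighting as an assumption:

```python
import numpy as np

def dynamic_image(frames):
    # Approximate rank pooling: collapse a video into a single image as a
    # weighted sum of its frames, with weights derived from the harmonic
    # numbers H_t = sum_{i=1..t} 1/i.
    T = len(frames)
    harmonic = np.concatenate([[0.0], np.cumsum(1.0 / np.arange(1, T + 1))])
    alphas = [2 * (T - t + 1) - (T + 1) * (harmonic[T] - harmonic[t - 1])
              for t in range(1, T + 1)]
    return sum(a * f for a, f in zip(alphas, frames))

video = [np.random.rand(224, 224, 3) for _ in range(16)]
img = dynamic_image(video)   # a single "dynamic image" summarizing the video
```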
Ranked #61 on Action Recognition on HMDB-51
We present hierarchical rank pooling, a video sequence encoding method for activity recognition.
We show how the parameters of a function that has been fit to the video data can serve as a robust new video representation.
We present a supervised learning to rank algorithm that effectively orders images by exploiting the structure in image sequences.
We postulate that a function capable of ordering the frames of a video temporally (based on the appearance) captures well the evolution of the appearance within the video.
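A minimal sketch of rank pooling built on this postulate, assuming the SVR-based instantiation with time-varying mean smoothing; hyper-parameters and feature sizes are illustrative:

```python
import numpy as np
from sklearn.svm import LinearSVR

def rank_pool(frame_features):
    # Rank pooling sketch: fit a linear function whose scores increase with
    # time over the (smoothed) frame features; the learned parameter vector
    # serves as the video representation.
    V = np.cumsum(frame_features, axis=0)
    V /= np.arange(1, len(V) + 1)[:, None]        # time-varying mean smoothing
    t = np.arange(1, len(V) + 1).astype(float)    # target: temporal order
    svr = LinearSVR(C=1.0).fit(V, t)
    return svr.coef_                              # video descriptor u

feats = np.random.rand(50, 128)   # 50 frames, 128-dim features
u = rank_pool(feats)              # (128,) video representation
```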
As the amount of visual data increases, so does the need for summarization tools that can be used to explore large image collections and to quickly get familiar with their content.
This paper presents novel multi-scale gradient and corner-point-based shape descriptors.
Domain adaptation aims at adapting the knowledge acquired on a source domain to a new, different but related target domain.
We present two approaches to determine the only hyper-parameter in our method, which corresponds to the size of the subspaces.
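A minimal sketch of the underlying adaptation step, assuming the standard PCA-based subspace alignment formulation in which d, the subspace size, is the single hyper-parameter:

```python
import numpy as np
from sklearn.decomposition import PCA

def subspace_alignment(Xs, Xt, d=50):
    # Learn d-dimensional PCA bases for source and target, align the source
    # basis to the target one, and project both domains into comparable
    # coordinates. d is the method's only hyper-parameter.
    Ps = PCA(n_components=d).fit(Xs).components_.T   # (D, d) source basis
    Pt = PCA(n_components=d).fit(Xt).components_.T   # (D, d) target basis
    M = Ps.T @ Pt                                    # alignment matrix
    return Xs @ Ps @ M, Xt @ Pt                      # aligned projections

Xs = np.random.rand(200, 100)   # labelled source features
Xt = np.random.rand(300, 100)   # unlabelled target features
Zs, Zt = subspace_alignment(Xs, Xt, d=20)
```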