Specifically, considering that the text modifier may refer to semantic concepts that do not exist in the reference image and must be added to the target image, we learn a multi-modal concept alignment between the text modifier and the concatenation of the reference and target images, under a multiple-instance learning framework with image- and sentence-level weak supervision.
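As a rough illustration of this idea, the sketch below scores each modifier word against regions of the concatenated reference/target images and pools word-level matches into a sentence-level score, so only weak pair-level labels are needed. The function name, shapes, and pooling choices are assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def mil_alignment_loss(word_feats, region_feats, labels):
    """Illustrative multiple-instance alignment loss (hypothetical).
    word_feats: (B, W, D) text-modifier tokens; region_feats: (B, R, D)
    regions from the concatenated reference/target images;
    labels: (B,) 1 if the pair matches, 0 otherwise."""
    word_feats = F.normalize(word_feats, dim=-1)
    region_feats = F.normalize(region_feats, dim=-1)
    # Word-region similarity: (B, W, R)
    sim = torch.einsum('bwd,brd->bwr', word_feats, region_feats)
    # MIL view: each word is an instance; match it to its best region,
    # then pool words into a sentence-level score (weak supervision only).
    word_scores = sim.max(dim=2).values        # (B, W)
    sent_scores = word_scores.mean(dim=1)      # (B,)
    return F.binary_cross_entropy_with_logits(sent_scores, labels.float())
```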
In this paper we propose a novel feature learning model, referred to as CircleNet, which achieves feature adaptation by mimicking how humans look at low-resolution and occluded objects: if the object cannot be identified clearly at first, they focus on it again at a finer scale.
Action understanding has evolved into the era of fine granularity, as most human behaviors in real life have only minor differences.
Ranked #1 on Weakly Supervised Action Localization on FineAction
In this paper, we aim to improve the generalization ability of the compositional action recognition model to novel verbs or novel nouns unseen during training, by leveraging the power of knowledge graphs.
Knowledge graph-to-text (KG-to-text) generation aims to generate easy-to-understand sentences from a knowledge graph while maintaining semantic consistency between the generated sentences and the KG.
We extensively benchmark against baselines for SSAD and OSAD on our data splits of THUMOS14 and ActivityNet1.2, and demonstrate the effectiveness of the proposed UFA and IB methods.
By adjusting the auxiliary task weights to minimize the divergence between the surrogate prior and the true prior of the main task, we obtain a more accurate prior estimate, minimizing the amount of training data required for the main task and avoiding a costly grid search.
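A minimal sketch of the underlying idea, with hypothetical names: the auxiliary losses enter a weighted multi-task objective, and the weights themselves are learnable, so they can be tuned by gradient descent (e.g., against a held-out main-task loss in a bilevel step) instead of by grid search.

```python
import torch

def weighted_multitask_loss(main_loss, aux_losses, log_weights):
    # Softmax keeps the auxiliary weights positive and normalized.
    w = torch.softmax(log_weights, dim=0)
    return main_loss + sum(wi * li for wi, li in zip(w, aux_losses))

# log_weights is a learnable tensor; in practice it would be updated
# to reduce a held-out main-task loss, replacing a grid search over
# fixed auxiliary weights.
log_weights = torch.zeros(3, requires_grad=True)
```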
Spatio-temporal action detection in videos requires localizing the action both spatially and temporally in the form of an "action tube".
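A common heuristic for building such tubes, shown here only as an illustration (not a specific paper's method), is to greedily link per-frame detections whose boxes overlap across consecutive frames:

```python
def link_action_tubes(frame_dets, iou_thresh=0.3):
    """Greedy tube-linking sketch. frame_dets: list over frames; each
    entry is a list of (box, score) with box = (x1, y1, x2, y2)."""
    def iou(a, b):
        ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
        ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
        area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
        return inter / (area(a) + area(b) - inter + 1e-8)

    # Start one tube per detection in the first frame.
    tubes = [[(0, b, s)] for b, s in frame_dets[0]]
    for t, dets in enumerate(frame_dets[1:], start=1):
        for tube in tubes:
            last_box = tube[-1][1]
            # Extend each tube with its best-overlapping detection.
            cands = [(iou(last_box, b), b, s) for b, s in dets]
            if cands:
                o, b, s = max(cands, key=lambda c: c[0])
                if o >= iou_thresh:
                    tube.append((t, b, s))
    return tubes
```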
In this paper, we present a conceptually simple, general, yet novel framework for few-shot temporal activity detection based on proposal regression, which detects the start and end times of activities in untrimmed videos.
Weakly-supervised action localization requires training a model to localize action segments in a video given only video-level action labels.
Ranked #9 on Weakly Supervised Action Localization on THUMOS’14
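A typical building block in this setting, sketched here under assumed names and sizes, is a snippet-level classifier whose class activation sequence is top-k pooled into a video-level prediction, so the model trains with video-level labels alone:

```python
import torch
import torch.nn as nn

class WeakTALHead(nn.Module):
    """Minimal weakly-supervised localization head (illustrative)."""
    def __init__(self, feat_dim=2048, num_classes=20, k=8):
        super().__init__()
        self.classifier = nn.Conv1d(feat_dim, num_classes, kernel_size=1)
        self.k = k

    def forward(self, feats):            # feats: (B, D, T) snippet features
        cas = self.classifier(feats)     # class activation sequence (B, C, T)
        k = min(self.k, cas.shape[-1])
        # Top-k mean over time yields a video-level logit per class.
        video_logits = cas.topk(k, dim=-1).values.mean(-1)   # (B, C)
        return cas, video_logits         # cas is thresholded at test time
```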
The boundary between these two lines of work has yet to be explored, and the effectiveness of meta-learning in few-shot learning remains unclear.
Human action is naturally compositional: humans can easily recognize and perform actions with objects that are different from those used in training demonstrations.
Generating realistic images of complex visual scenes becomes challenging when one wishes to control the structure of the generated images.
Ranked #3 on Layout-to-Image Generation on Visual Genome 256x256
However, while such approaches tend to focus on identifying relationships between elements of the video and language modalities, there is less emphasis on modeling relational context between video frames given the semantic context of the query.
Given a video and a sentence, the goal of weakly-supervised video moment retrieval is to locate the video segment which is described by the sentence without having access to temporal annotations during training.
By leveraging the concept of mobile edge computing (MEC), the massive amounts of data generated by large numbers of Internet of Things (IoT) devices can be offloaded to an MEC server at the edge of the wireless network for further computationally intensive processing.
We address the problem of temporal activity detection in continuous, untrimmed video streams.
Ranked #4 on Action Recognition on THUMOS’14
In this paper, we present a conceptually simple, general, yet novel framework for few-shot temporal activity detection which detects the start and end times of the few-shot input activities in an untrimmed video.
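To make the proposal-regression idea concrete, here is a minimal head (layer sizes and names are assumptions, not the paper's architecture) that scores a pooled proposal feature and regresses start/end offsets to refine its boundaries:

```python
import torch
import torch.nn as nn

class ProposalRegressor(nn.Module):
    """Sketch of a proposal-regression head for temporal detection."""
    def __init__(self, feat_dim=1024):
        super().__init__()
        self.fc = nn.Linear(feat_dim, 256)
        self.cls = nn.Linear(256, 1)   # actionness / few-shot match score
        self.reg = nn.Linear(256, 2)   # offsets for (start, end)

    def forward(self, proposal_feat):
        h = torch.relu(self.fc(proposal_feat))
        return self.cls(h), self.reg(h)

# Boundary refinement, given a proposal (start, end) of duration dur:
#   new_start = start + dur * reg[0];  new_end = end + dur * reg[1]
```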
Events defined by the interaction of objects in a scene are often of critical importance; yet important events may have insufficient labeled examples to train a conventional deep model to generalize to future object appearance.
To capture the inherent structures present in both text and video, we introduce a multilevel model that integrates vision and language features earlier and more tightly than prior work.
In order to explicitly model temporal relationships between visual events and their captions in a single video, we also propose a two-level hierarchical captioning module that keeps track of context.
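A simplified sketch of such a two-level design, with assumed layer sizes and a plain GRU standing in for whatever the module actually uses: a sentence-level recurrence carries context across events, and a word-level decoder generates each caption conditioned on that context.

```python
import torch
import torch.nn as nn

class HierarchicalCaptioner(nn.Module):
    """Two-level captioning sketch (illustrative, simplified)."""
    def __init__(self, feat_dim=512, hid=512, vocab=10000):
        super().__init__()
        self.sent_rnn = nn.GRUCell(feat_dim, hid)     # tracks event context
        self.word_rnn = nn.GRU(hid, hid, batch_first=True)
        self.embed = nn.Embedding(vocab, hid)
        self.out = nn.Linear(hid, vocab)

    def forward(self, event_feats, captions):
        # event_feats: (B, E, feat_dim); captions: (B, E, L) token ids
        B, E, _ = event_feats.shape
        ctx = event_feats.new_zeros(B, self.sent_rnn.hidden_size)
        logits = []
        for e in range(E):
            ctx = self.sent_rnn(event_feats[:, e], ctx)  # carry context
            emb = self.embed(captions[:, e])             # (B, L, hid)
            out, _ = self.word_rnn(emb, ctx.unsqueeze(0))
            logits.append(self.out(out))
        return torch.stack(logits, dim=1)                # (B, E, L, vocab)
```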
In this paper, we propose the contextual multi-scale region convolutional 3D network (CMS-RC3D) for activity detection.
We address the problem of activity detection in continuous, untrimmed video streams.
Ranked #1 on Action Recognition In Videos on THUMOS’14
We propose a novel spatial attention architecture that aligns words with image patches in the first hop, and we obtain improved results by adding a second attention hop that considers the whole question to choose visual evidence based on the results of the first hop.
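The two-hop mechanism can be sketched as follows (a simplification under assumed shapes, not the exact architecture): hop 1 aligns individual words with patches, and hop 2 re-attends with a query combining the whole question and the hop-1 visual summary.

```python
import torch
import torch.nn.functional as F

def two_hop_attention(word_feats, question_feat, patch_feats):
    # word_feats: (W, D), question_feat: (D,), patch_feats: (P, D)
    hop1 = F.softmax(word_feats @ patch_feats.T, dim=-1)   # (W, P) word-patch
    visual1 = (hop1 @ patch_feats).mean(0)                 # (D,) hop-1 summary
    query2 = question_feat + visual1                       # whole-question query
    hop2 = F.softmax(patch_feats @ query2, dim=-1)         # (P,) second hop
    return hop2 @ patch_feats                              # attended evidence
```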
Most state-of-the-art methods for solving this problem borrow existing deep convolutional neural network (CNN) architectures (AlexNet, GoogLeNet) to extract a visual representation of the input video.
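This pattern is straightforward to reproduce; a minimal sketch with torchvision (GoogLeNet chosen arbitrarily from the two named backbones) drops the classification layer and keeps the pooled features as the visual representation:

```python
import torch
import torchvision.models as models

# Reuse an ImageNet-pretrained CNN as a frame-level feature extractor.
backbone = models.googlenet(weights="IMAGENET1K_V1")
backbone.fc = torch.nn.Identity()        # keep pooled 1024-d features
backbone.eval()

frames = torch.randn(16, 3, 224, 224)    # 16 video frames (dummy input)
with torch.no_grad():
    feats = backbone(frames)             # (16, 1024) visual representation
```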
Solving the visual symbol grounding problem has long been a goal of artificial intelligence.