Our key novelty is that we augment the original training videos in the deep feature space, not in the visual spatiotemporal domain as done by previous work.
We propose Bidirectional Alignment for Domain Adaptive Detection with Transformers (BiADT) to improve cross-domain object detection performance.
This paper addresses incremental few-shot instance segmentation, where a few examples of new object classes arrive once training examples of old classes are no longer accessible, and the goal is to perform well on both old and new classes.
Second, we use a mini-detector to initialize the content queries in the decoder with the classification and regression embeddings from its respective heads.
More concretely, our CX-ToM framework generates a sequence of explanations in a dialog by mediating the differences between the minds of the machine and the human user.
The resulting predictions on training images are taken as the pseudo-ground truth for the standard training of Mask-RCNN, which we use for amodal instance segmentation of test images.
This paper is about action segmentation under weak supervision in training, where the ground truth provides only a set of actions present, but neither their temporal ordering nor when they occur in a training video.
Our SSL trains an RNN to recognize positive and negative action sequences, and the RNN's hidden layer is taken as our new action-level feature embedding.
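As a minimal sketch of using an RNN's hidden layer as a sequence-level embedding (the weights, dimensions, and input sequence below are illustrative, not from the paper), the final hidden state of a one-layer Elman RNN can serve as the embedding of an action sequence:

```python
import math

def rnn_embedding(seq, w_in, w_rec):
    """Run a one-layer Elman RNN over a scalar sequence and return
    the final hidden state as the sequence-level embedding."""
    h = [0.0] * len(w_in)
    for x in seq:
        # h_t = tanh(W_in * x_t + W_rec * h_{t-1}); the comprehension
        # reads the previous h before the new one is assigned
        h = [math.tanh(w_in[i] * x
                       + sum(w_rec[i][j] * h[j] for j in range(len(h))))
             for i in range(len(h))]
    return h

# illustrative 2-unit RNN applied to a short action-score sequence
w_in = [0.5, -0.3]                # input-to-hidden weights (hypothetical)
w_rec = [[0.1, 0.0], [0.0, 0.1]]  # recurrent weights (hypothetical)
emb = rnn_embedding([1.0, 2.0, 0.5], w_in, w_rec)
```

In practice the RNN is trained to discriminate positive from negative action sequences, and the learned hidden state is reused as the feature embedding.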
This paper addresses unsupervised few-shot object recognition, where all training images are unlabeled, and test images are divided into queries and a few labeled support images per object class of interest.
This paper is about weakly supervised action segmentation, where the ground truth specifies only a set of actions present in a training video, but not their true temporal ordering.
Finally, the target object is segmented in the query image using the cosine similarity between the class feature vector and the query's feature map.
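A minimal sketch of this scoring step (the feature values, map size, and threshold are hypothetical): the class vector is compared against every spatial location of the query feature map, and thresholding the resulting similarity map yields a segmentation mask.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# class_vec: pooled support feature of the target class (illustrative)
class_vec = [1.0, 0.0, 1.0]
# query_map: per-location feature vectors of a 2x2 query feature map
query_map = [[[1.0, 0.0, 1.0], [0.0, 1.0, 0.0]],
             [[0.5, 0.5, 0.5], [1.0, 0.0, 0.9]]]

# one similarity score per spatial location; thresholding gives the mask
sim_map = [[cosine_similarity(class_vec, f) for f in row] for row in query_map]
mask = [[1 if s > 0.5 else 0 for s in row] for row in sim_map]
```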
This paper is about labeling video frames with action classes under weak supervision in training, where we have access to a temporal ordering of actions, but their start and end frames in training videos are unknown.
We present a new explainable AI (XAI) framework aimed at increasing justified human trust and reliance in the AI machine through explanations.
This paper presents an explainable AI (XAI) system that provides explanations for its predictions.
This motivates us to formulate our problem as a sequential search for informative parts over a deep feature map produced by a deep Convolutional Neural Network (CNN).
In this work, we study a poorly understood trade-off between accuracy and runtime costs for deep semantic video segmentation.
The summarizer is an autoencoder long short-term memory network (LSTM) that first selects video frames and then decodes the resulting summary to reconstruct the input video.
This work is about recognizing human activities occurring in videos at distinct semantic levels, including individual actions, interactions, and group activities.
Using deep learning, this paper addresses the problem of joint object boundary detection and boundary motion estimation in videos, which we call boundary flow estimation.
Our second contribution is the algorithm for learning a policy for the sparse selection of supervoxels and their descriptors for budgeted CRF inference.
This paper is about detecting functional objects and inferring human intentions in surveillance videos of public spaces.
This paper argues that large-scale action recognition in video can be greatly improved by providing an additional modality in training data -- namely, 3D human-skeleton sequences -- aimed at complementing poorly represented or missing features of human actions in the training videos.
This paper presents a new probabilistic generative model for image segmentation, i.e., the task of partitioning an image into homogeneous regions.
The mainstream approach to structured prediction problems in computer vision is to learn an energy function such that the solution minimizes that function.
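As a minimal illustration of this energy-minimization view (the graph, unary costs, and pairwise weight below are hypothetical), consider a chain of three nodes with binary labels, a per-node unary cost, and a Potts-style penalty for adjacent nodes that disagree; the prediction is the labeling of minimum energy:

```python
from itertools import product

def energy(labels, unary, pairwise_w):
    """Unary + pairwise (Potts) energy of a labeling over a chain."""
    # unary[i][l]: cost of assigning label l to node i
    e = sum(unary[i][l] for i, l in enumerate(labels))
    # constant penalty whenever adjacent nodes take different labels
    e += sum(pairwise_w for a, b in zip(labels, labels[1:]) if a != b)
    return e

unary = [[0.2, 1.0], [0.9, 0.1], [0.8, 0.3]]  # illustrative costs
# brute-force minimization over all 2^3 labelings of the chain
best = min(product([0, 1], repeat=3), key=lambda L: energy(L, unary, 0.5))
```

Real vision models replace this brute-force search with structured solvers (e.g., graph cuts or message passing), but the objective has the same unary-plus-pairwise form.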
Our model is a latent tree (LT) that represents input features of facial landmark points and facial action unit (FAU) intensities as leaf nodes, and encodes their higher-order dependencies with latent nodes at tree levels closer to the root.
This problem is a middle-ground between frame-level person counting, which does not localize counts, and person detection aimed at perfectly localizing people with count-one detections.
This paper addresses a new problem of parsing low-resolution aerial videos of large spatial areas, in terms of 1) grouping, 2) recognizing events, and 3) assigning roles to people engaged in events.
This paper presents a new approach to tracking people in crowded scenes, where people are subject to long-term (partial) occlusions and may assume varying postures and articulations.