Under this family, we study Mask R-CNN and discover that instead of its default strategy of training the mask-head with a combination of proposals and groundtruth boxes, training the mask-head with only groundtruth boxes dramatically improves its performance on novel classes.
In recent years, many works in the video action recognition literature have shown that two-stream models (combining spatial and temporal input streams) are necessary for achieving state-of-the-art performance.
Ranked #2 on Action Recognition on UCF101
Deep neural networks (DNN) have recently been widely used in speaker recognition systems, achieving state-of-the-art performance on various benchmarks.
Traditionally, multi-object tracking and object detection are performed using separate systems, with most prior works focusing exclusively on one of these aspects over the other.
Ranked #1 on Multiple Object Tracking on Waymo Open Dataset
Recent progress in visual grounding techniques and audio understanding is enabling machines to understand shared semantic concepts and listen to the various sensory events in the environment.
With the recent advancements in Artificial Intelligence (AI), Intelligent Virtual Assistants (IVA) such as Alexa, Google Home, etc., have become a ubiquitous part of many homes.
In this paper we propose a method that leverages temporal context from the unlabeled frames of a novel camera to improve performance at that camera.
With the recent advancements in AI, Intelligent Virtual Assistants (IVA) have become a ubiquitous part of every home.
In the multimodal setting, the proposed framework improved precision-recall AUC by 10.2% on the subset of the MiT dataset compared to the non-Bayesian baseline.
Understanding affect from video segments has brought researchers from the language, audio and video domains together.
Despite the steady progress in video analysis led by the adoption of convolutional neural networks (CNNs), the relative improvement has been less drastic than that in 2D static image classification.
Ranked #21 on Action Recognition on UCF101 (using extra training data)
We propose a new method for learning the structure of convolutional neural networks (CNNs) that is more efficient than recent state-of-the-art methods based on reinforcement learning and evolutionary algorithms.
Ranked #155 on Image Classification on ImageNet
In this work, we present stochastic neural network architectures that handle such multimodality through stochasticity: future trajectories of objects, body joints or frames are represented as deep, non-linear transformations of random (as opposed to deterministic) variables.
This paper proposes a deep learning architecture based on Residual Network that dynamically adjusts the number of executed layers for the regions of the image.
On the opposite end in which accuracy is critical, we present a detector that achieves state-of-the-art performance measured on the COCO detection task.
Ranked #172 on Object Detection on COCO test-dev
We present a system which can recognize the contents of your meal from a single image, and then predict its nutritional contents, such as calories.
In this paper, we propose a model which learns to detect events in such videos while automatically "attending" to the people responsible for the event.
We propose a method that can generate an unambiguous description (known as a referring expression) of a specific object or region in an image, and which can also comprehend or interpret such an expression to infer which object is being described.
Knowledge tracing, where a machine models the knowledge of a student as they interact with coursework, is a well-established problem in computer-supported education.
Providing feedback, both assessing final work and giving hints to stuck students, is difficult for open-ended assignments in massive online classes which can range from thousands to millions of students.
We present a novel method for aligning a sequence of instructions to a video of someone carrying out a task.
Simultaneously addressing all of these challenges, i.e., designing a compactly representable model that is amenable to efficient inference and can be learned using partial ranking data, is a difficult task, but is necessary if we would like to scale to problems of nontrivial size.
In massive open online courses (MOOCs), peer grading serves as a critical tool for scaling the grading of complex, open-ended assignments to courses with tens or hundreds of thousands of students.
Accurate and detailed models of the progression of neurodegenerative diseases such as Alzheimer's (AD) are crucially important for reliable early diagnosis and the determination and deployment of effective treatments.
Representing distributions over permutations can be a daunting task due to the fact that the number of permutations of n objects scales factorially in n. One recent way that has been used to reduce storage complexity has been to exploit probabilistic independence, but as we argue, full independence assumptions impose strong sparsity constraints on distributions and are unsuitable for modeling rankings.
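To make the factorial scaling concrete, the following minimal sketch counts how many probabilities an explicit table over all permutations of n objects would need. It also shows, purely as an illustrative assumption, the table size if the objects are split into two groups whose orderings are stored independently. The function names `full_table_size` and `split_table_size` are hypothetical and do not come from the paper.

```python
import math


def full_table_size(n: int) -> int:
    """Entries in an explicit probability table over all permutations of n objects."""
    return math.factorial(n)


def split_table_size(n: int, p: int) -> int:
    """Entries needed if p objects and the remaining n - p objects are
    ranked independently (illustrative assumption: one table per group)."""
    if not 0 <= p <= n:
        raise ValueError("p must satisfy 0 <= p <= n")
    return math.factorial(p) + math.factorial(n - p)


if __name__ == "__main__":
    for n in (5, 10, 15, 20):
        p = n // 2
        print(f"n={n:2d}  full={full_table_size(n):>22,d}  "
              f"split({p},{n - p})={split_table_size(n, p):>12,d}")
```

The gap between the two columns illustrates why independence reduces storage; the cost, as the abstract argues, is that full independence restricts which distributions can be represented.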