Determining from video when people are struggling enables a finer-grained understanding of actions and opens opportunities for building intelligent visual support interfaces.
This work focuses on low-bitrate video streaming scenarios (e.g. 50–200 kbps) where the video quality is severely compromised.
We present AROS, a one-shot learning approach that uses an explicit representation of interactions between highly articulated human poses and 3D scenes.
This work presents a method to implement fully convolutional neural networks (FCNs) on Pixel Processor Array (PPA) sensors, and demonstrates coarse segmentation and object localisation tasks.
This work demonstrates direct visual sensory-motor control using high-speed CNN inference via a SCAMP-5 Pixel Processor Array (PPA).
Neural network designers have achieved progressively higher accuracy by increasing model depth, introducing new layer types, and discovering new combinations of layers.
Increasing the number of filters in deeper layers as the feature maps shrink is a widely adopted pattern in convolutional network design.
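As a hedged illustration of this pattern (not code from any of the works above), a minimal PyTorch sketch might look like the following; the channel counts 64/128/256 and the input size are illustrative assumptions:

```python
# Minimal sketch of the filter-doubling pattern: each time the spatial
# resolution is halved (stride 2), the number of filters is doubled.
# Channel counts are illustrative, not taken from any specific paper.
import torch
import torch.nn as nn

def stage(in_ch: int, out_ch: int) -> nn.Sequential:
    """One downsampling stage: halve the feature map, double the filters."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=2, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

net = nn.Sequential(
    stage(3, 64),     # 224x224 -> 112x112, 64 filters
    stage(64, 128),   # 112x112 -> 56x56, 128 filters
    stage(128, 256),  # 56x56  -> 28x28, 256 filters
)

x = torch.randn(1, 3, 224, 224)
print(net(x).shape)  # torch.Size([1, 256, 28, 28])
```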
Experimental results demonstrate the algorithm's ability to enable a ground vehicle to navigate at an average speed of 2.20 m/s when passing through multiple gates, and 3.88 m/s on a 'slalom' task, in an environment featuring significant visual clutter.
Agents that need to act on their surroundings can significantly benefit from the perception of their interaction possibilities or affordances.
This is in contrast to previous works that use sensor-level processing only to sequentially compute image convolutions, and must transfer data to an external digital processor to complete the computation.
We present a method to learn a representation for adverbs from instructional videos using weak supervision from the accompanying narrations.
This allows images to be stored and manipulated directly at the point of light capture, rather than having to transfer images to external processing hardware.
In this abstract we describe our recent [4, 7] and latest work on the determination of affordances in visually perceived 3D scenes.
In addition to attending to task-relevant video parts, our proposed loss jointly trains two attention modules to separately attend to video parts indicative of higher (pros) and lower (cons) skill.
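To make the idea concrete, here is a hedged PyTorch sketch of two temporal attention branches trained with a pairwise margin ranking loss on (higher-skill, lower-skill) video pairs; the branch structure, feature dimension, and single shared margin are illustrative assumptions, not the paper's exact formulation:

```python
# Sketch: two attention branches ("pros" and "cons") that each pool
# per-segment video features and score skill; both are trained jointly
# with a pairwise hinge ranking loss. This is a simplification of the
# rank-aware loss described in the abstract, for illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionBranch(nn.Module):
    """Attends over T per-segment features and returns a scalar skill score."""
    def __init__(self, dim: int):
        super().__init__()
        self.attn = nn.Linear(dim, 1)    # segment-level attention logits
        self.score = nn.Linear(dim, 1)   # skill score from the pooled feature

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, T, dim)
        w = F.softmax(self.attn(feats), dim=1)   # (batch, T, 1) attention
        pooled = (w * feats).sum(dim=1)          # (batch, dim) weighted pool
        return self.score(pooled).squeeze(-1)    # (batch,) skill score

pros, cons = AttentionBranch(1024), AttentionBranch(1024)

def rank_loss(high: torch.Tensor, low: torch.Tensor, margin: float = 1.0):
    """high/low: (batch, T, dim) features of higher/lower-skill videos."""
    # Each branch must preserve the skill ordering of the pair by a margin.
    l_pros = F.relu(margin - (pros(high) - pros(low))).mean()
    l_cons = F.relu(margin - (cons(high) - cons(low))).mean()
    return l_pros + l_cons
```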
This paper develops and evaluates a novel method for detecting affordances in a scalable, multiple-instance manner on visually recovered point clouds.
We present an approach for estimating constrained motion with a novel Cellular Processor Array (CPA) camera, in which each pixel is capable of limited processing and data storage, allowing fast, low-power parallel computation directly on the focal plane of the device.
This paper presents a study on the use of Convolutional Neural Networks for camera relocalisation and its application to map compression.
We present a method for assessing skill from video, applicable to a variety of tasks, ranging from surgery to drawing and rolling pizza dough.
Manual annotations of temporal bounds for object interactions (i.e. start and end times) are the typical training input to recognition, localisation and detection algorithms.
This work deviates from easy-to-define class boundaries for object interactions.
We present SEMBED, an approach for embedding an egocentric object interaction video in a semantic-visual graph to estimate the probability distribution over its potential semantic labels.
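To illustrate the underlying idea (not SEMBED's actual graph construction, which is richer), a minimal sketch that estimates a label distribution from a query video's nearest visual neighbours might look like this; the feature space, neighbourhood size k, and Gaussian similarity are assumptions for illustration:

```python
# Sketch: connect a query video to its visually nearest training videos
# and read off a weighted distribution over their semantic labels.
from collections import Counter
import numpy as np

def label_distribution(query_feat, train_feats, train_labels, k=5, sigma=1.0):
    """query_feat: (dim,); train_feats: (N, dim); train_labels: length-N list."""
    dists = np.linalg.norm(train_feats - query_feat, axis=1)
    nn_idx = np.argsort(dists)[:k]                  # k nearest neighbours
    weights = np.exp(-dists[nn_idx] ** 2 / (2 * sigma ** 2))
    scores = Counter()
    for i, w in zip(nn_idx, weights):
        scores[train_labels[i]] += w                # weighted label votes
    total = sum(scores.values())
    return {label: w / total for label, w in scores.items()}
```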
This paper presents an unsupervised approach to automatically extracting video-based guidance on object usage from egocentric video and wearable gaze tracking, collected from multiple users while performing tasks.