Learned networks in the domain of visual recognition and cognition impress in part because even though they are trained with datasets many orders of magnitude smaller than the full population of possible images, they exhibit sufficient generalization to be applicable to new and previously unseen data.
Drivers deal with multiple concurrent tasks, such as keeping the vehicle in the lane, observing and anticipating the actions of other road users, reacting to hazards, and dealing with distractions inside and outside the vehicle.
We present baseline evaluations with five well-known classification deep neural networks and show that TEOS poses a significant challenge for all of them.
Up to four different two-stream-based approaches, which have been successfully applied to human action recognition, are adapted here by stacking visual cues from forward-looking video cameras to recognize and anticipate lane changes of target vehicles.
The key conclusions of this paper are that an executive controller is necessary for human attentional function in vision, and that there is a 'first principles' computational approach to its understanding that is complementary to the previous approaches that focus on modelling or learning from experimental observations directly.
When a convolutional kernel is applied to an image and the output must remain the same size as the input, some form of padding is required around the image boundary. As a result, each convolutional layer in a convolutional neural network (CNN) produces a strip of pixels, as wide as the half-width of the kernel, with a non-veridical representation.
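The boundary effect described above can be illustrated with a minimal NumPy sketch. It assumes zero padding, stride 1, and odd kernel sizes; the function names `conv2d_same_zero_pad` and `affected_border_width` are hypothetical and introduced only for illustration, not taken from the paper.

```python
import numpy as np

def conv2d_same_zero_pad(image, kernel):
    """Same-size 2D filtering with zero padding (assumed padding scheme).

    Implemented as cross-correlation, which equals convolution for the
    symmetric kernel used in the example below.
    """
    kh, kw = kernel.shape
    ph, pw = kh // 2, kw // 2
    padded = np.pad(image, ((ph, ph), (pw, pw)), mode="constant")
    out = np.zeros(image.shape, dtype=float)
    for i in range(image.shape[0]):
        for j in range(image.shape[1]):
            out[i, j] = np.sum(padded[i:i + kh, j:j + kw] * kernel)
    return out

def affected_border_width(kernel_sizes):
    """Width of the border strip touched by padding after stacked layers.

    Assumes stride 1 and odd kernels: each layer adds kernel_size // 2 pixels.
    """
    return sum(k // 2 for k in kernel_sizes)

# A constant image filtered with a 3x3 mean kernel: interior values stay
# at 1, while the outermost ring is attenuated by the padded zeros.
image = np.ones((9, 9))
kernel = np.full((3, 3), 1.0 / 9.0)
out = conv2d_same_zero_pad(image, kernel)
print(out[0, 0], out[4, 4])
print(affected_border_width([3, 3, 5]))
```

With stacked layers the affected strip widens, which is why the cumulative width is computed as a sum over layers in this sketch.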
Different sizes of the regions around the vehicles are analyzed, evaluating the importance of the interaction between vehicles and the context information in the performance.
Furthermore, we investigate the effect of training state-of-the-art CNN-based saliency models on these types of stimuli and conclude that the additional training data does not lead to a significant improvement of their ability to find odd-one-out targets.
To this end, we propose a solution for the problem of pedestrian action anticipation at the point of crossing.
This reveals a strong mismatch between optimal performance ranges of classical theory-driven algorithms and sensor setting distributions in the common vision datasets, while data-driven models were trained for those datasets.
Scene classification has been addressed with numerous techniques in the computer vision literature.
In this paper, we demonstrate a novel algorithm that uses ellipse fitting to estimate the rotation angle and size of the bounding box from the segmentation mask of the target, for online and real-time visual object tracking.
Ranked #1 on Visual Object Tracking on VOT2017/18 (using extra training data)
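As a rough illustration of deriving a rotated box from a segmentation mask, the sketch below approximates an ellipse from the second-order moments of the mask pixels. This moment-based approximation is a swapped-in technique chosen to keep the example dependency-free; it is not claimed to be the fitting procedure used in the paper, and the function name `ellipse_from_mask` is hypothetical.

```python
import numpy as np

def ellipse_from_mask(mask):
    """Approximate an ellipse from a binary mask via second-order moments.

    Returns ((cx, cy), (major_len, minor_len), angle_deg), where the angle
    is the orientation of the major axis in image coordinates (x to the
    right, y downward), in the range [0, 180).
    """
    ys, xs = np.nonzero(mask)
    if xs.size < 2:
        raise ValueError("mask must contain at least two foreground pixels")
    cx, cy = xs.mean(), ys.mean()
    cov = np.cov(np.stack([xs, ys]))        # 2x2 covariance of pixel coordinates
    evals, evecs = np.linalg.eigh(cov)      # eigenvalues in ascending order
    major_vec = evecs[:, 1]                 # direction of largest variance
    angle = np.degrees(np.arctan2(major_vec[1], major_vec[0])) % 180.0
    # For a uniformly filled ellipse, the variance along an axis equals
    # (semi-axis)^2 / 4, so the full axis length is 4 * sqrt(variance).
    major_len = 4.0 * np.sqrt(evals[1])
    minor_len = 4.0 * np.sqrt(evals[0])
    return (cx, cy), (major_len, minor_len), angle

# Synthetic mask: an axis-aligned ellipse with semi-axes 20 (x) and 8 (y).
yy, xx = np.mgrid[0:80, 0:100]
mask = ((xx - 50) / 20.0) ** 2 + ((yy - 40) / 8.0) ** 2 <= 1.0
center, axes, angle = ellipse_from_mask(mask)
print(center, axes, angle)
```

The returned center, axis lengths, and angle are enough to define a rotated bounding box around the target; a tracker would recompute them per frame from the predicted mask.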
Our simulation results indicate that multiplicative modulations contribute significantly to the encoding of hues along intermediate directions in the MacLeod-Boynton diagram and that model V4 neurons have the capacity to encode unique hues.
The current dominant visual processing paradigm in both human and machine research is the feedforward, layered hierarchy of neural-like processing elements.
The Saliency Model Implementation Library for Experimental Research (SMILER) is a new software package which provides an open, standardized, and extensible framework for maintaining and executing computational saliency models.
In this paper, we propose a new model that actively extracts visual information via visual attention techniques and, in conjunction with a non-myopic decision-making algorithm, leads the robot to search more relevant areas of the environment.
It is almost universal to regard attention as the facility that permits an agent, human or machine, to give priority processing resources to relevant stimuli while ignoring the irrelevant.
To make it a reality, autonomous vehicles require the ability to communicate with other road users and understand their intentions.
We present a Polyhedral Scene Generator system which creates a random scene based on a few user parameters, renders the scene from random view points and creates a dataset containing the renderings and corresponding annotation files.
Perceptual judgment of image similarity by humans relies on rich internal representations ranging from low-level features to high-level concepts, scene properties and even cultural associations.
While great advances have been made in pattern recognition and machine learning, the successes of these fields remain restricted to narrow applications and seem to break down when training data is scarce, a shift in domain occurs, or intelligent reasoning is required for rapid adaptation to new environments.
The implications of this intriguing property of deep neural networks are discussed and we suggest ways to harness it to create more robust representations.
Given an existing trained neural network, it is often desirable to learn new capabilities without hindering performance of those already learned.
The results indicate that, on average, the C1C2C3 color space, followed by HSI and XYZ, achieves the best time in searching for objects of various colors.
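For readers unfamiliar with C1C2C3, the sketch below converts RGB values using the commonly cited definition of that color space (each channel is the arctangent of one RGB component over the maximum of the other two). It assumes that definition matches the one used in the study; the function name `rgb_to_c1c2c3` is hypothetical.

```python
import numpy as np

def rgb_to_c1c2c3(rgb):
    """Convert RGB values (last axis of length 3) to C1C2C3.

    Assumes the commonly cited definition:
        c1 = arctan(R / max(G, B))
        c2 = arctan(G / max(R, B))
        c3 = arctan(B / max(R, G))
    arctan2 is used so that a zero denominator does not raise an error.
    """
    rgb = np.asarray(rgb, dtype=float)
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    c1 = np.arctan2(r, np.maximum(g, b))
    c2 = np.arctan2(g, np.maximum(r, b))
    c3 = np.arctan2(b, np.maximum(r, g))
    return np.stack([c1, c2, c3], axis=-1)

# Scaling a pixel's intensity leaves its C1C2C3 values unchanged, because
# each channel depends only on ratios of RGB components.
pixel = np.array([0.2, 0.4, 0.1])
dim = rgb_to_c1c2c3(pixel)
bright = rgb_to_c1c2c3(3.0 * pixel)
print(np.allclose(dim, bright))
```

This ratio-based construction makes the representation insensitive to uniform changes in intensity, which is one plausible reason it could help when searching for objects by color.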
Thus, in this survey we wanted to shift the focus towards a more inclusive and high-level overview of the research on cognitive architectures.
In this paper, we present a novel dataset for a critical aspect of autonomous driving: the joint attention that must occur between drivers and pedestrians, cyclists, or other drivers.