The rapid emergence of airborne platforms and imaging sensors is enabling new forms of aerial surveillance due to their unprecedented advantages in scale, mobility, deployment and covert observation capabilities.
Point clouds are a key modality used for perception in autonomous vehicles, providing the means for a robust geometric understanding of the surrounding environment.
Retrieval-based place recognition is an efficient and effective solution for enabling re-localization within a pre-built map or global data association for Simultaneous Localization and Mapping (SLAM).
Domain generalization approaches aim to learn a domain invariant prediction model for unknown target domains from multiple training source domains with different distributions.
This paper presents a novel lightweight COVID-19 diagnosis framework using CT scans.
We exceed state-of-the-art results in all evaluations.
To tackle the aforementioned problem, we introduce an end-to-end feature-norm network (FNN) which is robust to negative transfer as it does not need to match the feature distribution among the source domains.
This network can achieve source-to-target domain matching by capturing semantic information at the feature level and producing images for unsupervised domain adaptation from both the source and the target domains.
Person re-identification (re-ID) concerns the matching of subject images across different camera views in a multi-camera surveillance system.
In a real world environment, person re-identification (Re-ID) is a challenging task due to variations in lighting conditions, viewing angles, pose and occlusions.
However, R-MAC suffers in the presence of background clutter/trivial regions and scale variance, and discards important spatial information.
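For context, R-MAC aggregates the maximum CNN activation over square regions sampled at several scales, normalising and summing the regional descriptors. A minimal numpy sketch follows; the region placement and scale schedule here are simplified assumptions, not the exact published procedure:

```python
import numpy as np

def l2n(v, eps=1e-8):
    # L2-normalise a vector, guarding against division by zero
    return v / (np.linalg.norm(v) + eps)

def rmac(fmap, levels=2):
    """R-MAC-style descriptor: max-pool activations over square regions
    at several scales, L2-normalise each regional descriptor, then sum
    and re-normalise. fmap has shape (C, H, W)."""
    C, H, W = fmap.shape
    agg = np.zeros(C)
    for l in range(1, levels + 1):
        size = int(2 * min(H, W) / (l + 1))      # region side length at this scale
        if size < 1:
            continue
        # place l+1 regions uniformly along each axis (approximate grid)
        ys = np.linspace(0, H - size, l + 1).astype(int)
        xs = np.linspace(0, W - size, l + 1).astype(int)
        for y in ys:
            for x in xs:
                region = fmap[:, y:y + size, x:x + size]
                agg += l2n(region.max(axis=(1, 2)))   # max-pool per channel
    return l2n(agg)

desc = rmac(np.random.rand(8, 16, 16))
print(desc.shape)  # (8,)
```

Because every region is max-pooled per channel, the descriptor length equals the channel count regardless of spatial resolution, which is also why spatial information is discarded, as the excerpt notes.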
Recently, Zero-shot Sketch-based Image Retrieval (ZS-SBIR) has attracted the attention of the computer vision community due to its real-world applications, and its more realistic and challenging setting compared to standard SBIR.
Machine learning-based medical anomaly detection is an important problem that has been extensively studied.
Unlike the problem of general object recognition, where real-valued neural networks can be used to extract pertinent features, iris recognition depends on the extraction of both phase and magnitude information from the input iris texture in order to better represent its biometric content.
Patient-independent seizure prediction models are designed to offer accurate performance across multiple subjects within a dataset, and have been identified as a real-world solution to the seizure prediction problem.
Conclusion: Recognizing the complexity induced by the inherent temporal nature of biosignal data, the two-stage method proposed in this study effectively simplifies the domain generalization process while demonstrating good results on both unseen domains and the adopted basis domains.
In addition, we demonstrate the practical implications of the proposed learning strategy, where the feedback path can be shared among multiple neural memory networks as a mechanism for knowledge sharing.
Gesture recognition is a much studied research area which has myriad real-world applications including robotics and human-machine interaction.
In this paper, we present a deep learning-based approach to exploit and fuse text and acoustic data for emotion classification.
The use of multi-modal data for deep machine learning has shown promise when compared to uni-modal approaches with fusion of multi-modal features resulting in improved performance in several applications.
Automating the analysis of imagery of the Gastrointestinal (GI) tract captured during endoscopy procedures has substantial potential benefits for patients, as it can provide diagnostic support to medical practitioners and reduce mistakes via human error.
To mitigate this challenge, transfer learning, in which pre-trained models are fine-tuned, has been applied.
In this study, we explicitly examine the importance of heart sound segmentation as a prior step for heart sound classification, and then seek to apply the obtained insights to propose a robust classifier for abnormal heart sound detection.
Person re-identification (re-ID) remains challenging in a real-world scenario, as it requires a trained network to generalise to totally unseen target data in the presence of variations across domains.
The temporal segmentation of events is an essential task and a precursor for the automatic recognition of human actions in video.
Multimodal dimensional emotion recognition has drawn great attention from the affective computing community and numerous schemes have been extensively investigated, making significant progress in this area.
This paper presents a novel framework for Speech Activity Detection (SAD).
This paper proposes a novel framework for the segmentation of phonocardiogram (PCG) signals into heart states, exploiting the temporal evolution of the PCG as well as considering the salient information that it provides for the detection of the heart state.
Deep learning has been applied to achieve significant progress in emotion recognition.
However, their invariance to target data is pre-defined by the network architecture and training data.
The demand for multimodal sensing systems for robotics is growing due to the increase in robustness, reliability and accuracy offered by these systems.
Inspired by human neurological structures for action anticipation, we present an action anticipation model that enables the prediction of plausible future actions by forecasting both the visual and temporal future.
The results of ablation studies demonstrate that the proposed multi-branch architecture with attention blocks is effective and essential.
Domain adaptation (DA) and domain generalization (DG) have emerged as a solution to the domain shift problem where the distribution of the source and target data is different.
Advances in computer vision have brought us to the point where we have the ability to synthesise realistic fake content.
In the domain of machine learning, Neural Memory Networks (NMNs) have recently achieved impressive results in a variety of application areas including visual question answering, trajectory prediction, object tracking, and language modelling.
The goal of both GANs is to generate similar `action codes', a vector representation of the current action.
In this paper we address the problem of continuous fine-grained action segmentation, in which multiple actions are present in an unsegmented video stream.
In addition, the new parameterization of this task is general and can be implemented by any fully convolutional network (FCN) architecture.
Unlike existing methods which only use attention mechanisms to locate 2D discriminative information, our work learns a novel 3D perspective feature representation of a vehicle, which is then fused with 2D appearance features to predict the category.
Developing such a generic text eraser for real scenes is a challenging task, since it inherits all the challenges of multi-lingual and curved text detection and inpainting.
In the presence of large sets of labeled data, Deep Learning (DL) has achieved extraordinary success in computer vision, particularly in object classification and recognition tasks.
In this paper, we propose a four-stream Siamese deep convolutional neural network for person re-identification that jointly optimises verification and identification losses over a four-image input group.
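Jointly optimising verification and identification losses is a common re-ID formulation: a per-image classification (identification) term plus a pairwise (verification) term over embeddings. A minimal numpy sketch, where the contrastive margin and the weighting factor `lam` are illustrative assumptions rather than values from the paper:

```python
import numpy as np

def softmax_xent(logits, label):
    # identification loss: softmax cross-entropy for one image
    z = logits - logits.max()
    p = np.exp(z) / np.exp(z).sum()
    return -np.log(p[label] + 1e-12)

def contrastive(f1, f2, same, margin=1.0):
    # verification loss: pull same-identity embeddings together,
    # push different identities beyond the margin
    d = np.linalg.norm(f1 - f2)
    return d ** 2 if same else max(0.0, margin - d) ** 2

def joint_loss(feats, logits, labels, lam=0.5):
    """Joint objective: sum of per-image identification losses plus a
    weighted verification loss over all embedding pairs in the group."""
    ident = sum(softmax_xent(l, y) for l, y in zip(logits, labels))
    verif = 0.0
    for i in range(len(feats)):
        for j in range(i + 1, len(feats)):
            verif += contrastive(feats[i], feats[j], labels[i] == labels[j])
    return ident + lam * verif
```

The identification term teaches the network who each subject is, while the verification term shapes the embedding space so that distances are directly useful for matching at test time.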
If DA methods are applied directly to DG by a simple exclusion of the target data from training, poor performance will result for a given task.
This paper presents a novel deep learning framework for human trajectory prediction and detecting social group membership in crowds.
The generator is fed with person-level and scene-level features that are mapped temporally through LSTM networks.
The demand for large-scale trademark retrieval (TR) systems has significantly increased to combat the rise in international trademark infringement.
We propose a puppet model-based tracking approach using a skeleton prior, which provides a better initialization for tracking articulated movements.
This paper presents a novel framework for human trajectory prediction based on multimodal data (video and radar).
State-of-the-art patch matching techniques take image patches as input to a convolutional neural network to extract the patch features and evaluate their similarity.
The use of deep learning techniques for automatic facial expression recognition has recently attracted great interest but developed models are still unable to generalize well due to the lack of large emotion datasets for deep learning.
With the explosion in the availability of spatio-temporal tracking data in modern sports, there is an enormous opportunity to better analyse, learn and predict important events in adversarial group environments.
This paper presents a novel framework for automatic learning of complex strategies in human decision making.
Representing 3D shape in deep learning frameworks in an accurate, efficient and compact manner still remains an open challenge.
Visual saliency patterns are the result of a variety of factors aside from the image being parsed; however, existing approaches have ignored these additional factors.
We present a novel, complete deep learning framework for multi-person localisation and tracking.
One challenge that remains open in 3D deep learning is how to efficiently represent 3D data to feed deep networks.
The concept of continuous-time trajectory representation has brought increased accuracy and efficiency to multi-modal sensor fusion in modern SLAM.
In this paper, the problem of complex event detection in the continuous domain (i.e., events with unknown start and end locations) is addressed.
Our contribution in this paper is a deep fusion framework that more effectively exploits spatial features from CNNs with temporal features from LSTM models.
In this paper, we propose a Tree Memory Network (TMN) for modelling long term and short term relationships in sequence-to-sequence mapping problems.
We illustrate how a simple approximation of attention weights (i.e., hard-wired) can be merged with soft attention weights in order to make our model applicable to challenging real-world scenarios with hundreds of neighbours.
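One way to read this merging of weights: soft attention scores each neighbour embedding against a query, while the hard-wired term fixes weights from a simple geometric cue such as inverse distance; the two are blended into one set of weights. A numpy sketch, where the blend factor `alpha` and the inverse-distance rule are illustrative assumptions:

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def combined_attention(query, neighbours, distances, alpha=0.5):
    """Context vector mixing soft attention (dot-product scores over
    neighbour embeddings) with hard-wired weights (inverse distance to
    each neighbour). neighbours has shape (N, D)."""
    soft = softmax(neighbours @ query)              # learned-style weights
    hard = 1.0 / (np.asarray(distances) + 1e-6)     # closer neighbours weigh more
    hard = hard / hard.sum()                        # normalise to sum to 1
    w = alpha * soft + (1 - alpha) * hard           # blended attention weights
    return w @ neighbours                           # weighted context vector

q = np.array([1.0, 0.0])
N = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
ctx = combined_attention(q, N, distances=[1.0, 2.0, 4.0])
print(ctx.shape)  # (2,)
```

The hard-wired term costs almost nothing to compute, which is what makes scaling to hundreds of neighbours tractable: only the soft component involves the embedding dot products.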
Though such systems are still heavily reliant on human labour to monitor the captured information, a number of automatic techniques have been proposed to analyse the data.
A popular approach in this regard is to represent a sequence using a bag of words (BOW) representation due to: (i) its fixed dimensionality irrespective of the sequence length, and (ii) its ability to compactly model the statistics of the sequence.
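Both properties follow directly from how a BOW histogram is built: each frame is quantised to its nearest codeword and counted, so the output dimensionality is the codebook size no matter how long the sequence is. A minimal sketch with a hypothetical three-word codebook:

```python
import numpy as np

def bow_histogram(sequence, codebook):
    """Quantise each frame of a variable-length sequence to its nearest
    codeword (Euclidean distance) and count occurrences. The histogram's
    dimensionality equals the codebook size, not the sequence length."""
    hist = np.zeros(len(codebook))
    for frame in sequence:
        idx = np.argmin(np.linalg.norm(codebook - frame, axis=1))
        hist[idx] += 1
    return hist / hist.sum()   # normalise so sequences of any length compare

codebook = np.array([[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
short = bow_histogram(np.array([[0.1, 0.1], [0.9, 1.0]]), codebook)
long_ = bow_histogram(np.random.rand(50, 2), codebook)
print(short.shape, long_.shape)  # both (3,)
```

Normalising the counts makes a 2-frame and a 50-frame sequence directly comparable, which is exactly the fixed-dimensionality property the excerpt highlights; the trade-off is that temporal ordering within the sequence is discarded.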