To address this problem, we propose a novel triplet-loss-based fine-tuning scheme that improves the robustness of existing face recognition models against varying image resolution.
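As context for the triplet-loss idea, here is a minimal NumPy sketch of the standard triplet loss; the specific pairing strategy (e.g. matching high- and low-resolution crops of the same face) and the margin value are assumptions for illustration, not the paper's exact setup.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Standard triplet loss on embedding vectors: pull the anchor
    toward the positive (same identity, e.g. a down-sampled crop of
    the same face) and push it from the negative (different identity)
    by at least `margin`."""
    d_pos = np.sum((anchor - positive) ** 2, axis=-1)  # squared L2 distances
    d_neg = np.sum((anchor - negative) ** 2, axis=-1)
    return np.maximum(d_pos - d_neg + margin, 0.0).mean()

# toy embeddings: anchor close to positive, far from negative
a = np.array([[1.0, 0.0]])
p = np.array([[0.9, 0.1]])
n = np.array([[-1.0, 0.0]])
loss = triplet_loss(a, p, n)  # triplet already satisfied, so loss is 0
```

Fine-tuning then amounts to minimizing this loss over mixed-resolution triplets so that resolution changes no longer separate embeddings of the same identity.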
Ranked #1 on Face Recognition on XQLFW
Since adversarial vulnerability can be regarded as a high-frequency phenomenon, it is essential to regulate the adversarially-trained neural network models in the frequency domain.
State-of-the-art face recognition (FR) approaches have shown remarkable results in predicting whether two faces belong to the same identity, yielding accuracies between 92% and 100% depending on the difficulty of the protocol.
Gait recognition is a promising biometric with unique properties for identifying individuals from a long distance by their walking patterns.
Recent work even suggests that detectors' confidence predictions are biased with respect to object size and position, but it is still unclear how this bias relates to the performance of the affected object detectors.
Real-world face recognition applications often deal with suboptimal image quality or resolution due to different capturing conditions such as various subject-to-camera distances, poor camera settings, or motion blur.
In this work, we first analyze the impact of image resolution on face verification performance with a state-of-the-art face recognition model.
In this paper, we propose a novel approach to partial face recognition capable of recognizing faces with different occluded areas.
Successful active speaker detection requires a three-stage pipeline: (i) audio-visual encoding for all speakers in the clip, (ii) inter-speaker relation modeling between a reference speaker and the background speakers within each frame, and (iii) temporal modeling for the reference speaker.
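The three-stage pipeline above can be sketched as a data-flow skeleton. Every component below is a deliberately simple stand-in (concatenation, cosine similarity, mean pooling) chosen only to show the shapes and staging; real systems learn each stage.

```python
import numpy as np

def av_encode(audio_feats, face_feats):
    """(i) Audio-visual encoding per speaker: stand-in fusion by
    concatenating modality features."""
    return np.concatenate([audio_feats, face_feats], axis=-1)

def inter_speaker_relation(ref, background):
    """(ii) Relate the reference speaker to the background speakers
    within a frame; here via cosine similarity to the closest one."""
    sims = background @ ref / (
        np.linalg.norm(background, axis=-1) * np.linalg.norm(ref) + 1e-8)
    return np.append(ref, sims.max())  # append relation feature

def temporal_model(frame_embeddings):
    """(iii) Temporal modeling for the reference speaker across
    frames: mean pooling as a stand-in for an RNN/attention module."""
    return frame_embeddings.mean(axis=0)

# toy clip: 5 frames, one reference speaker and 2 background speakers
rng = np.random.default_rng(0)
per_frame = []
for _ in range(5):
    ref = av_encode(rng.normal(size=3), rng.normal(size=3))
    background = rng.normal(size=(2, 6))
    per_frame.append(inter_speaker_relation(ref, background))
clip_embedding = temporal_model(np.stack(per_frame))
```

A final classifier on `clip_embedding` would then produce the active/inactive decision for the reference speaker.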
Inspired by the extensive applications of generative adversarial networks (GANs) in speech enhancement and ASR tasks, we propose an adversarial joint training framework with a self-attention mechanism to boost the noise robustness of ASR systems.
However, silhouette images can lose fine-grained spatial information, and most papers do not consider how to obtain these silhouettes in complex scenes.
Ranked #4 on Multiview Gait Recognition on CASIA-B
Person Re-Identification aims to retrieve person identities from images captured by multiple cameras, or by the same camera at different times and locations.
Ranked #2 on Person Re-Identification on CUHK03 detected
For this, we propose Lightweight Sinc-Convolutions (LSC), which integrate Sinc-convolutions with depthwise convolutions as a low-parameter, machine-learnable feature extractor for end-to-end ASR systems.
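The parameter saving of Sinc-convolutions comes from describing each band-pass filter kernel by just two learnable cutoff frequencies rather than by every tap. A hedged NumPy sketch of such a kernel (the windowing choice and kernel size here are assumptions, not LSC's exact configuration):

```python
import numpy as np

def sinc_bandpass(f_low, f_high, kernel_size=65, fs=16000):
    """Band-pass FIR kernel parameterized only by two cutoff
    frequencies: the difference of two windowed sinc low-pass
    filters. Only f_low and f_high would be learned."""
    t = np.arange(kernel_size) - (kernel_size - 1) / 2  # sample offsets

    def lowpass(fc):
        return 2 * fc / fs * np.sinc(2 * fc * t / fs)

    h = lowpass(f_high) - lowpass(f_low)
    return h * np.hamming(kernel_size)  # window to reduce ripple

# a speech-band filter: passes ~300-3000 Hz at 16 kHz sampling
h = sinc_bandpass(300.0, 3000.0)
H = np.abs(np.fft.rfft(h, 1024))
freqs = np.fft.rfftfreq(1024, d=1 / 16000)
```

In LSC these fixed-form filters replace a fully learned first convolution, and subsequent depthwise convolutions keep the rest of the front-end similarly low-parameter.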
Convolutional Neural Networks with 3D kernels (3D-CNNs) currently achieve state-of-the-art results in video recognition tasks owing to their superior ability to extract spatiotemporal features from video frames.
For this task, we introduce a new video-based benchmark, the Driver Anomaly Detection (DAD) dataset, which contains normal driving videos together with a set of anomalous actions in its training set.
The present work proposes MP3 compression as a means to decrease the impact of Adversarial Noise (AN) in audio samples transcribed by ASR systems.
In this work, we combine freely available corpora for German speech recognition, including as-yet unlabeled speech data, into a large dataset of over 1,700 hours of speech.
Ranked #3 on Speech Recognition on TUDA (using extra training data)
In this work, we propose an HCI system for dynamic recognition of driver micro hand gestures, which can have a crucial impact in the automotive sector, especially for safety-related issues.
To this end, a lightweight network architecture is introduced and mean teacher, virtual adversarial training and pseudo-labeling algorithms are evaluated on 2D-pose estimation for surgical instruments.
YOWO is a single-stage architecture with two branches to extract temporal and spatial information concurrently and predict bounding boxes and action probabilities directly from video clips in one evaluation.
Ranked #1 on Temporal Action Localization on J-HMDB-21
Keyword Spotting (KWS) enables speech-based user interaction on smart devices.
Understanding actions and gestures in video streams requires temporal reasoning of the spatial content from different time instants, i.e., spatiotemporal (ST) modeling.
Ranked #82 on Action Recognition on Something-Something V2
Experiments show that our approach outperforms the state-of-the-art results on the Distracted Driver Dataset (96.31%), with an accuracy of 99.10% for 10-class classification while providing real-time performance.
Video representation is a key challenge in many computer vision applications such as video classification, video captioning, and video surveillance.
The use of hand gestures provides a natural alternative to cumbersome interface devices for Human-Computer Interaction (HCI) systems.
Recently, convolutional neural networks with 3D kernels (3D CNNs) have become very popular in the computer vision community as a result of their superior ability to extract spatio-temporal features from video frames compared to 2D CNNs.
Ranked #10 on Action Recognition on Jester
We evaluate our architecture on two publicly available datasets - EgoGesture and NVIDIA Dynamic Hand Gesture Datasets - which require temporal detection and classification of the performed hand gestures.
Ranked #1 on Hand Gesture Recognition on EgoGesture
In this paper, we propose a CNN architecture, Layer Reuse Network (LruNet), in which convolutional layers are used repeatedly, improving performance without the need to introduce new layers.
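The layer-reuse idea can be illustrated with a toy 1-D forward pass: the same weights are applied several times, so effective depth grows while the parameter count stays fixed. This is a minimal sketch of the concept, not LruNet's actual architecture (which operates on images with additional design details).

```python
import numpy as np

def conv1d_same(x, w):
    """'Same'-padded 1-D convolution: a cheap stand-in for a conv layer."""
    pad = len(w) // 2
    return np.convolve(np.pad(x, pad, mode="constant"), w, mode="valid")

def layer_reuse_forward(x, w, n_reuse=4):
    """Apply the SAME convolution weights repeatedly (conv + ReLU).
    Parameter count stays len(w) no matter how large n_reuse is."""
    for _ in range(n_reuse):
        x = np.maximum(conv1d_same(x, w), 0.0)
    return x

y = layer_reuse_forward(np.ones(8), np.array([0.1, 0.2, 0.1]), n_reuse=4)
```

A conventional network of the same depth would need `n_reuse` separate weight sets; here depth is traded for parameters by reuse.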
To this end, tracklet re-identification is performed by utilizing a novel multi-stage deep network that can jointly reason about the visual appearance and spatio-temporal properties of a pair of tracklets, thereby providing a robust measure of affinity.
While fully-convolutional neural networks are very strong at modeling local features, they fail to aggregate global context due to their constrained receptive field.
In gait recognition, a gait feature such as the Gait Energy Image (GEI) is normally extracted from one full gait cycle.
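The GEI itself is simple to state: it is the per-pixel average of the aligned binary silhouettes over one gait cycle. A minimal sketch (the 2x2 "silhouettes" are a toy stand-in for real aligned frames):

```python
import numpy as np

def gait_energy_image(silhouettes):
    """Gait Energy Image: pixel-wise mean of aligned binary
    silhouettes over one full gait cycle; values lie in [0, 1]."""
    return np.mean(np.asarray(silhouettes, dtype=float), axis=0)

# toy cycle of 4 aligned 2x2 binary silhouettes
cycle = [np.array([[1, 0], [1, 0]]),
         np.array([[1, 0], [0, 1]]),
         np.array([[1, 1], [0, 0]]),
         np.array([[1, 1], [1, 1]])]
gei = gait_energy_image(cycle)
```

Pixels that are foreground in every frame get value 1.0; pixels swept by moving limbs get intermediate values, which is what encodes the walking dynamics.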
Acquiring spatio-temporal states of an action is the most crucial step for action classification.
Ranked #1 on Hand Gesture Recognition on ChaLearn test
In this work, we present a novel background subtraction system that uses a deep Convolutional Neural Network (CNN) to perform the segmentation.
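For context on what the CNN segmenter replaces, the classical hand-crafted baseline for background subtraction is a per-pixel median background model with thresholding. This sketch shows that baseline only (the threshold value is an arbitrary assumption); the paper's contribution is learning the segmentation with a CNN instead.

```python
import numpy as np

def median_background_subtract(frames, current, threshold=0.25):
    """Classical baseline: model the background as the per-pixel
    median of past frames, then threshold the absolute difference
    to get a binary foreground mask."""
    background = np.median(np.asarray(frames, dtype=float), axis=0)
    return (np.abs(current - background) > threshold).astype(np.uint8)

# toy scene: static empty background, one bright pixel appears
frames = [np.zeros((4, 4)) for _ in range(5)]
current = np.zeros((4, 4))
current[1, 1] = 1.0
mask = median_background_subtract(frames, current)
```

Hand-crafted models like this struggle with shadows, camouflage, and dynamic backgrounds, which motivates replacing the differencing-and-thresholding step with a learned segmenter.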
Transcription of broadcast news is an interesting and challenging application for large-vocabulary continuous speech recognition (LVCSR).
The goal of the system is to analyse sounds emitted by walking persons (mostly the step sounds) and identify those persons.