Ground-truth depth, when combined with color data, helps improve object detection accuracy over baseline models that only use color.
We demonstrate the effectiveness of maIoU on a state-of-the-art (SOTA) assigner, ATSS, by replacing its IoU operation with our maIoU and training YOLACT, a SOTA real-time instance segmentation method.
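As a rough illustration of the idea behind a mask-aware assignment criterion, the sketch below computes IoU between an anchor box and the object's binary ground-truth mask rather than its bounding box, so background pixels inside the box lower the overlap. This is an illustrative simplification under assumed inputs, not the paper's exact maIoU formula; the helper name, pixel-set representation, and toy L-shaped mask are all made up for the example.

```python
# Illustrative mask-based IoU: overlap of an axis-aligned box with a binary
# ground-truth mask given as a set of (x, y) pixel coordinates.
def box_mask_iou(box, mask):
    x1, y1, x2, y2 = box
    box_pixels = {(x, y) for x in range(x1, x2) for y in range(y1, y2)}
    inter = len(box_pixels & mask)
    union = len(box_pixels) + len(mask) - inter
    return inter / union if union else 0.0

# Toy example: an L-shaped object whose bounding box contains background.
mask = {(x, y) for x in range(0, 4) for y in range(0, 8)} | \
       {(x, y) for x in range(4, 8) for y in range(0, 4)}
anchor = (0, 0, 8, 8)  # exactly the object's bounding box, so box IoU = 1.0

mask_iou_value = box_mask_iou(anchor, mask)  # < 1.0: penalises background
print(round(mask_iou_value, 3))
```

Here box IoU would call the anchor a perfect match, while the mask-based overlap (0.75) reflects that a quarter of the box covers background, which is the kind of signal a mask-aware assigner can exploit.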
Ranked #2 on Real-time Instance Segmentation on MSCOCO
RS Loss supervises the classifier, a sub-network of these methods, to rank each positive above all negatives as well as to sort positives among themselves with respect to (wrt.) their continuous localisation qualities.
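The two objectives above (rank positives over negatives; sort positives by localisation quality) can be sketched with a plain pairwise hinge penalty. This is only a minimal illustration of the two objectives, not the paper's error-driven RS Loss formulation; the margin, scores, and IoU values are assumed toy numbers.

```python
# Pairwise-hinge sketch of the two objectives RS Loss targets:
# (1) ranking: every positive should score above every negative;
# (2) sorting: among positives, scores should follow localisation quality (IoU).
def rank_and_sort_penalty(pos, neg, margin=0.0):
    """pos: list of (score, iou) for positives; neg: list of negative scores."""
    # Ranking term: hinge over positive-vs-negative score pairs.
    ranking = sum(max(0.0, margin + sn - sp) for sp, _ in pos for sn in neg)
    # Sorting term: hinge over positive pairs ordered by IoU.
    sorting = sum(
        max(0.0, margin + sj - si)
        for si, iou_i in pos for sj, iou_j in pos
        if iou_i > iou_j  # i is better localised, so it should score higher
    )
    return ranking, sorting

pos = [(0.9, 0.85), (0.6, 0.95)]  # second positive: better localised, lower score
neg = [0.7, 0.3]

ranking, sorting = rank_and_sort_penalty(pos, neg)
print(ranking, sorting)
```

The second positive sits below a negative (ranking violation) and below a worse-localised positive (sorting violation), so both terms are non-zero; a perfect detector would drive both to zero.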
Despite being widely used as a performance measure for visual detection tasks, Average Precision (AP) is limited in (i) reflecting localisation quality, (ii) interpretability, (iii) robustness to the design choices regarding its computation, and (iv) applicability to outputs without confidence scores.
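Limitation (i) is easy to see concretely: at a fixed IoU threshold, a detection with IoU 0.52 and one with IoU 0.98 are both plain true positives, so AP cannot tell a sloppy detector from a precise one. The sketch below assumes a simplified single-class AP (each detection pre-matched to one ground truth) with made-up detector names and numbers.

```python
# Simplified AP at a fixed IoU threshold: once a detection clears the
# threshold it counts as a full TP, regardless of how good its box is.
def average_precision(detections, num_gt, iou_thresh=0.5):
    """detections: (score, iou_with_matched_gt) pairs, one GT match each."""
    detections = sorted(detections, key=lambda d: -d[0])
    tp, ap, prev_recall = 0, 0.0, 0.0
    for rank, (_, iou) in enumerate(detections, start=1):
        if iou >= iou_thresh:
            tp += 1
            recall = tp / num_gt
            ap += (recall - prev_recall) * (tp / rank)  # precision at this rank
            prev_recall = recall
    return ap

sloppy  = [(0.9, 0.55), (0.8, 0.52)]  # barely-overlapping boxes
precise = [(0.9, 0.98), (0.8, 0.95)]  # near-perfect boxes

print(average_precision(sloppy, num_gt=2), average_precision(precise, num_gt=2))
```

Both detectors get AP = 1.0 even though their localisation quality differs drastically, which is precisely the blindness that localisation-aware measures aim to fix.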
In real-world interactions, however, facial expressions are usually more subtle and evolve in a temporal manner requiring AU detection models to learn spatial as well as temporal information.
Deep neural network approaches have demonstrated high performance in object recognition (CNN) and detection (Faster-RCNN) tasks, but experiments have shown that such architectures are vulnerable to adversarial attacks (FFF, UAP): low-amplitude perturbations, barely perceptible to the human eye, can lead to a drastic reduction in labeling performance.
We propose average Localisation-Recall-Precision (aLRP), a unified, bounded, balanced and ranking-based loss function for both classification and localisation tasks in object detection.
Ranked #59 on Object Detection on COCO test-dev
In this work, we combine 3D convolution with late temporal modeling for action recognition.
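"Late temporal modeling" here means a 3D CNN (not shown) first turns each clip of a video into a feature vector, and a lightweight temporal module then combines the per-clip features into one video-level representation. The sketch below uses a fixed-query softmax attention pool as that module; the attention variant, dimensions, and numbers are illustrative assumptions, not the paper's exact temporal module.

```python
# Late temporal pooling over per-clip features from a 3D CNN backbone:
# score each clip against a query, softmax the scores, and average the
# clip features with the resulting weights.
import math

def attention_pool(clip_features, query):
    scores = [sum(q * f for q, f in zip(query, feat)) for feat in clip_features]
    m = max(scores)  # subtract max for numerical stability
    weights = [math.exp(s - m) for s in scores]
    z = sum(weights)
    weights = [w / z for w in weights]
    dim = len(clip_features[0])
    return [sum(w * feat[d] for w, feat in zip(weights, clip_features))
            for d in range(dim)]

clips = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy per-clip 3D-CNN features
video_feature = attention_pool(clips, query=[1.0, 0.0])
print([round(v, 3) for v in video_feature])
```

Unlike plain average pooling, clips that match the query contribute more, which is the basic appeal of replacing temporal averaging with an attention-style late module.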
Ranked #1 on Action Recognition on UCF 101
Recognition of expressions of emotions and affect from facial images is a well-studied research problem in the fields of affective computing and computer vision, with many datasets available that contain facial images and corresponding expression labels.
Robots collaborating with humans in realistic environments will need to be able to detect the tools that can be used and manipulated.
Using our generator as an analysis tool, we show that (i) IoU imbalance has an adverse effect on performance, (ii) hard positive example mining improves the performance only for certain input IoU distributions, and (iii) the imbalance among the foreground classes has an adverse effect on performance and that it can be alleviated at the batch level.
Ranked #156 on Object Detection on COCO minival
Especially in ambiguous settings, humans prefer expressions (called relational referring expressions) that describe an object with respect to a distinguishing, unique object.
Referring to objects in a natural and unambiguous manner is crucial for effective human-robot interaction.
Moreover, we present LRP results of a simple online video object detector which uses a SOTA still-image object detector, and show that class-specific optimized thresholds increase accuracy compared to the common approach of using a single general threshold for all classes.
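A class-specific threshold can be picked by sweeping candidate score thresholds and keeping the one with the lowest LRP error for that class. The LRP form below follows the commonly cited definition (a localisation component over true positives plus FP and FN counts, normalised by TP+FP+FN); the helper names, toy detections, and candidate grid are assumptions for illustration.

```python
# Hedged sketch: choose a per-class score threshold by minimising LRP error.
def lrp_error(dets, num_gt, score_thresh, tau=0.5):
    """dets: (score, iou) pairs; iou == 0.0 marks an unmatched detection."""
    kept = [(s, iou) for s, iou in dets if s >= score_thresh]
    tps = [iou for _, iou in kept if iou >= tau]
    n_tp, n_fp = len(tps), len(kept) - len(tps)
    n_fn = num_gt - n_tp
    total = n_tp + n_fp + n_fn
    if total == 0:
        return 1.0  # no detections and no ground truth: define as worst case
    loc = sum((1.0 - iou) / (1.0 - tau) for iou in tps)  # localisation term
    return (loc + n_fp + n_fn) / total

def best_threshold(dets, num_gt, candidates):
    return min(candidates, key=lambda t: lrp_error(dets, num_gt, t))

# Toy class: two good TPs, one FP, one low-scoring low-IoU TP.
dets = [(0.9, 0.8), (0.7, 0.9), (0.4, 0.0), (0.3, 0.6)]
thr = best_threshold(dets, num_gt=3, candidates=[0.2, 0.5, 0.8])
print(thr)
```

For this toy class the sweep keeps the middle threshold: lowering it admits the false positive, raising it costs a true positive, and the LRP minimum balances the two. Repeating the sweep per class yields the class-specific thresholds the sentence above refers to.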
To this end, we introduce a hybrid version of Boltzmann Machines (BMs) where relations and affordances are introduced with shared, tri-way connections into the model.
Scene models allow robots to reason about what is in the scene, what else should be in it, and what should not be in it.
Context is an essential capability for robots that are to be as adaptive as possible in challenging environments.
In this paper, we provide a large-scale dataset with benchmark queries with which different TR approaches can be evaluated systematically.