We present LDP, a lightweight dense prediction neural architecture search (NAS) framework.
Accurate and robust visual object tracking is one of the most challenging and fundamental computer vision problems.
In this paper, we present Point Cloud Color Constancy, in short PCCC, an illumination chromaticity estimation algorithm exploiting a point cloud.
Humans can robustly recognize and localize objects by using visual and/or auditory cues.
This work focuses on learning deep visual representation models for retrieval by exploring the interplay between a new loss function, the batch size, and a new regularization approach.
Ranked #1 on Vehicle Re-Identification on VehicleID Large
In addition, we introduce a normalized Hessian loss term invariant to scaling and shear along the depth direction, which is shown to substantially improve the accuracy.
This paper presents a novel neural architecture search method, called LiDNAS, for generating lightweight monocular depth estimation models.
In many computer vision classification tasks, class priors at test time often differ from priors on the training set.
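The correction for such a prior shift can be sketched as a simple re-weighting of classifier posteriors, assuming the classifier outputs calibrated probabilities under the training priors; function and variable names below are illustrative, not from the paper:

```python
import numpy as np

def adapt_to_new_priors(posteriors, train_priors, test_priors):
    """Re-weight classifier posteriors p(y|x) for a change of class priors.

    By Bayes' rule, p(y|x) is proportional to p(x|y) p(y); replacing the
    training priors with the (estimated) test-time priors amounts to
    multiplying each posterior by test_prior / train_prior and renormalizing.
    """
    posteriors = np.asarray(posteriors, dtype=float)
    weights = np.asarray(test_priors, dtype=float) / np.asarray(train_priors, dtype=float)
    adjusted = posteriors * weights
    return adjusted / adjusted.sum(axis=-1, keepdims=True)

# A class rare in training but common at test time gains posterior mass.
p = adapt_to_new_priors([0.6, 0.4], train_priors=[0.9, 0.1], test_priors=[0.5, 0.5])
```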
We propose Progressive-X+, a new algorithm for finding an unknown number of geometric models, e.g., homographies.
To that end, we propose a reconstruction module that can be used with many existing semantic segmentation networks, and that is trained to recognize and reconstruct road (drivable) surface from a small bottleneck.
In this paper, we propose enhancing monocular depth estimation by adding 3D points as depth guidance.
Compared to other methods, such as deblatting, inference is several orders of magnitude faster and enables applications such as real-time fast moving object detection and retrieval in large video collections.
We propose a method that, given a single image with its estimated background, outputs the object's appearance and position in a series of sub-frames as if captured by a high-speed camera (i.e., temporal super-resolution).
Ranked #1 on Video Super-Resolution on Falling Objects
We propose a new cascaded architecture for novel view synthesis, called RGBD-Net, which consists of two core components: a hierarchical depth regression network and a depth-aware generator network.
We propose ways to speed up the initial pose-graph generation for global Structure-from-Motion algorithms.
We claim, and present evidence, that allowing arXiv publication before a conference or journal submission benefits researchers, especially early career, as well as the whole scientific community.
This paper presents the evaluation methodology, datasets, and results of the BOP Challenge 2020, the third in a series of public competitions organized with the goal of capturing the status quo in the field of 6D object pose estimation from an RGB-D image.
Recovering the scene depth from a single image is an ill-posed problem that requires additional priors, often referred to as monocular depth cues, to disambiguate different 3D interpretations.
A data-dependent number of corresponding 3D locations is selected per pixel, and poses of possibly multiple object instances are estimated using a robust and efficient variant of the PnP-RANSAC algorithm.
We introduce a comprehensive benchmark for local features and robust estimation algorithms, focusing on the downstream task -- the accuracy of the reconstructed camera pose -- as our primary metric.
In this work, we propose a deep depth-aware long-term tracker that achieves state-of-the-art RGBD tracking performance and is fast to run.
We propose a novel method that tracks fast moving objects, mainly non-uniform spherical, in full 6 degrees of freedom, estimating simultaneously their 3D motion trajectory, 3D pose and object appearance changes with a time step that is a fraction of the video frame exposure time.
Ranked #1 on Video Super-Resolution on TbD
no code implementations • 1 Jul 2019 • Nibal Nayef, Yash Patel, Michal Busta, Pinaki Nath Chowdhury, Dimosthenis Karatzas, Wafa Khlif, Jiri Matas, Umapada Pal, Jean-Christophe Burie, Cheng-Lin Liu, Jean-Marc Ogier
With the growing cosmopolitan culture of modern cities, the need of robust Multi-Lingual scene Text (MLT) detection and recognition systems has never been more immense.
We propose Progressive NAPSAC, P-NAPSAC in short, which merges the advantages of local and global sampling by drawing samples from gradually growing neighborhoods.
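The local-to-global sampling idea behind this merging, i.e. drawing minimal samples from a neighborhood whose size grows with the iteration count, can be sketched as follows (a toy illustration with assumed names and growth schedule, not the reference implementation):

```python
import numpy as np

def pnapsac_sample(points, iteration, growth_rate=0.1, sample_size=4, rng=None):
    """Draw a minimal sample from a gradually growing neighborhood.

    Early iterations sample locally (nearest neighbors of a random seed
    point); as iterations progress the neighborhood grows until sampling
    becomes effectively global, as in plain RANSAC.
    """
    rng = rng or np.random.default_rng()
    n = len(points)
    seed = rng.integers(n)
    # Neighborhood size (in nearest-neighbor rank) grows with the iteration count.
    k = min(n, sample_size + int(growth_rate * iteration))
    dists = np.linalg.norm(points - points[seed], axis=1)
    neighborhood = np.argsort(dists)[:k]
    return rng.choice(neighborhood, size=min(sample_size, k), replace=False)

pts = np.random.default_rng(0).random((100, 2))
early = pnapsac_sample(pts, iteration=0, rng=np.random.default_rng(1))      # local
late = pnapsac_sample(pts, iteration=10000, rng=np.random.default_rng(1))   # global
```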
Standard RGB-D trackers treat the target as an inherently 2D structure, which makes modelling appearance changes related even to simple out-of-plane rotation highly challenging.
The paper addresses the problem of acquiring high-quality photographs with handheld smartphone cameras in low-light imaging conditions.
no code implementations • 9 Oct 2018 • Tomas Hodan, Rigas Kouskouridas, Tae-Kyun Kim, Federico Tombari, Kostas Bekris, Bertram Drost, Thibault Groueix, Krzysztof Walas, Vincent Lepetit, Ales Leonardis, Carsten Steger, Frank Michel, Caner Sahin, Carsten Rother, Jiri Matas
The workshop featured four invited talks, oral and poster presentations of accepted workshop papers, and an introduction of the BOP benchmark for 6D object pose estimation.
We propose a deblurring method that incorporates gyroscope measurements into a convolutional neural network (CNN).
1 code implementation • Tomas Hodan, Frank Michel, Eric Brachmann, Wadim Kehl, Anders Glent Buch, Dirk Kraft, Bertram Drost, Joel Vidal, Stephan Ihrke, Xenophon Zabulis, Caner Sahin, Fabian Manhardt, Federico Tombari, Tae-Kyun Kim, Jiri Matas, Carsten Rother
We propose a benchmark for 6D pose estimation of a rigid object from a single RGB-D input image.
We consider single-query 6-DoF camera pose estimation, i.e., the problem of estimating a camera's position and orientation using reference images and a point cloud.
It is well-known that motion blur decreases the performance of traditional feature detectors and descriptors.
The proposed Maximum a Posteriori estimation increases the prediction accuracy by 2.8% on PlantCLEF 2017 and by 1.8% on FGVCx Fungi, where the existing MLE method would decrease accuracy.
We present a statistical color constancy method that relies on novel gray pixel detection and mean shift clustering.
An end-to-end trainable (fully differentiable) method for multi-language scene text localization and recognition is proposed.
The quality of the deblurring model is also evaluated in a novel way on a real-world problem -- object detection on (de-)blurred images.
Ranked #3 on Deblurring on REDS
A method for learning local affine-covariant regions is presented.
Ranked #5 on Image Matching on IMC PhotoTourism (using extra training data)
We introduce a novel formulation of temporal color constancy which considers multiple frames preceding the frame for which illumination is estimated.
The move replaces a set of labels with the corresponding density mode in the model parameter domain, thus achieving fast and robust optimization.
We introduce a novel loss for learning local feature descriptors which is inspired by the Lowe's matching criterion for SIFT.
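The hardest-in-batch triplet margin formulation behind such a loss, which pushes each matching pair away from the closest non-matching descriptor in the batch (echoing Lowe's first-to-second nearest-neighbor ratio test), can be sketched as follows; this is an illustrative NumPy version, not the reference implementation:

```python
import numpy as np

def hardest_in_batch_loss(anchors, positives, margin=1.0):
    """Triplet margin loss with the hardest negative mined within the batch.

    For each matching pair (a_i, p_i), the negative is the closest
    non-matching descriptor in the batch, mirroring Lowe's criterion
    that a reliable match should be far from the second-nearest neighbor.
    """
    # Pairwise L2 distance matrix between anchors and positives.
    d = np.linalg.norm(anchors[:, None, :] - positives[None, :, :], axis=2)
    pos = np.diag(d)                        # distances of the matching pairs
    d_off = d + np.eye(len(d)) * 1e9        # mask out the matching pairs
    hardest_neg = np.minimum(d_off.min(axis=1), d_off.min(axis=0))
    return np.maximum(0.0, margin + pos - hardest_neg).mean()

rng = np.random.default_rng(0)
a = rng.normal(size=(8, 128))
a /= np.linalg.norm(a, axis=1, keepdims=True)     # unit-norm descriptors
loss_matched = hardest_in_batch_loss(a, a)        # identical pairs: zero positive distance
```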
There are approximately 39K training and 10K test images from each sensor.
The notion of a Fast Moving Object (FMO), i.e., an object that moves over a distance exceeding its size within the exposure time, is introduced.
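Under this definition, whether an object qualifies as an FMO reduces to comparing the distance traveled during the exposure with the object's size; a toy check with illustrative names:

```python
def is_fast_moving_object(speed_px_per_s, exposure_s, object_size_px):
    """An FMO moves over a distance exceeding its own size within
    the exposure time of a single frame."""
    distance_px = speed_px_per_s * exposure_s
    return distance_px > object_size_px

# A 20 px ball moving at 3000 px/s captured with a 10 ms exposure
# travels 30 px per exposure -- more than its size, so it is an FMO.
fmo = is_fast_moving_object(3000.0, 0.01, 20.0)
```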
A novel similarity-covariant feature detector is presented that extracts points whose neighbourhoods, when treated as a 3D intensity surface, have a saddle-like intensity profile.
Computational color constancy, which requires estimating the illuminant colors of images, is a fundamental yet active problem in computer vision that can be formulated as a regression problem.
The paper systematically studies the impact of a range of recent advances in CNN architectures and learning methods on the object categorization (ILSVRC) problem.
We present an algorithm that leverages the appearance variety to obtain more complete and accurate scene geometry along with consistent multi-illumination appearance information.
The goal of COCO-Text is to advance state-of-the-art in text detection and recognition in natural images.
After a novel efficient classification step, the number of regions is reduced to one seventh of that of the standard method, and the detector is still almost 3 times faster.
Experiments with different activation functions (maxout, ReLU-family, tanh) show that the proposed initialization leads to learning of very deep nets that (i) produce networks with test accuracy better than or equal to standard methods and (ii) is at least as fast as the complex schemes proposed specifically for very deep nets such as FitNets (Romero et al. (2015)) and Highway (Srivastava et al. (2015)).
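The layer-sequential unit-variance idea underlying such an initialization, i.e. starting from a (semi-)orthonormal matrix and rescaling each layer until its output variance on a data batch is close to 1, can be sketched for a single fully connected layer; names and the exact procedure here are illustrative:

```python
import numpy as np

def lsuv_init_layer(fan_in, fan_out, batch, tol=0.01, max_iter=10, rng=None):
    """Initialize one linear layer's weights layer-sequentially:
    start from a semi-orthonormal matrix (via QR decomposition of a
    Gaussian matrix), then rescale until the variance of the layer
    output on a data batch is close to 1."""
    rng = rng or np.random.default_rng()
    q, _ = np.linalg.qr(rng.normal(size=(fan_in, fan_out)))
    w = q.T[:fan_out, :fan_in]          # rows are orthonormal
    for _ in range(max_iter):
        out = batch @ w.T
        var = out.var()
        if abs(var - 1.0) < tol:
            break
        w /= np.sqrt(var)               # scaling makes the output variance 1
    return w

rng = np.random.default_rng(0)
x = rng.normal(size=(256, 64)) * 3.0    # batch with non-unit variance
w = lsuv_init_layer(64, 32, x, rng=rng)
out_var = (x @ w.T).var()               # close to 1 after the procedure
```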
Ranked #24 on Image Classification on MNIST
We have presented a new problem -- the wide multiple baseline stereo (WxBS) -- which considers matching of images that simultaneously differ in more than one image acquisition factor such as viewpoint, illumination, sensor type, or where object appearance changes significantly, e.g., over time.
This paper addresses the problem of single-target tracker performance evaluation.
An "elephant in the room" for most current object detection and localization methods is the lack of explicit modelling of partial visibility due to occlusion by other objects or truncation by the image boundary.