We claim that 1) strong cues can be obtained from small amounts of training data if some key design choices are applied, and 2) given these strong cues, standard Hungarian matching-based association is enough to obtain impressive results.
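To make the association step concrete, the following is a minimal sketch of minimum-cost matching between tracks and detections. It is not the authors' implementation: it solves the assignment by brute force over permutations, which optimizes the same objective the Hungarian algorithm solves efficiently. The function name `min_cost_assignment` and the cost values are hypothetical, chosen only for illustration.

```python
from itertools import permutations


def min_cost_assignment(cost):
    """Return (total_cost, assignment) for a square cost matrix.

    assignment[i] is the column (detection) matched to row (track) i.
    Brute force over all permutations, so O(n!): usable only for tiny
    matrices. The Hungarian algorithm finds the same optimum in
    polynomial time.
    """
    n = len(cost)
    best_total, best_perm = float("inf"), None
    for perm in permutations(range(n)):
        total = sum(cost[i][perm[i]] for i in range(n))
        if total < best_total:
            best_total, best_perm = total, perm
    return best_total, list(best_perm)


# Hypothetical costs: rows are existing tracks, columns are new detections.
# A low cost means the cues (e.g. appearance, motion) agree strongly.
cost = [
    [0.1, 0.9, 0.8],
    [0.7, 0.2, 0.9],
    [0.8, 0.6, 0.3],
]
total, assignment = min_cost_assignment(cost)
print(assignment)
```

With strong cues the cost matrix is close to diagonal-dominant, as in this toy example, and the optimal assignment is unambiguous.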
The Sinkhorn operator has recently experienced a surge of popularity in computer vision and related fields.
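For readers unfamiliar with the operator, here is a minimal sketch of Sinkhorn normalization under simple assumptions: a square matrix with strictly positive entries and a fixed number of iterations. It alternately normalizes rows and columns, which drives the matrix towards a doubly stochastic one. The function name `sinkhorn` and the input values are illustrative and not taken from any listed paper.

```python
def sinkhorn(matrix, n_iters=50):
    """Alternately normalize rows and columns of a positive square matrix.

    Returns a new matrix that is approximately doubly stochastic
    (every row and every column sums to 1).
    """
    m = [row[:] for row in matrix]  # work on a copy
    n_rows, n_cols = len(m), len(m[0])
    for _ in range(n_iters):
        # Row step: make every row sum to 1.
        for i in range(n_rows):
            row_sum = sum(m[i])
            m[i] = [v / row_sum for v in m[i]]
        # Column step: make every column sum to 1.
        for j in range(n_cols):
            col_sum = sum(m[i][j] for i in range(n_rows))
            for i in range(n_rows):
                m[i][j] /= col_sum
    return m


# Hypothetical positive score matrix, e.g. exponentiated similarities.
scores = [[2.0, 1.0], [1.0, 3.0]]
ds = sinkhorn(scores)
print([round(sum(row), 6) for row in ds])
```

Because every step is differentiable, this kind of iteration is often used as a soft relaxation of a permutation or matching inside a network.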
Object detection and forecasting are fundamental components of embodied perception.
no code implementations • • Aysim Toker, Lukas Kondmann, Mark Weber, Marvin Eisenberger, Andrés Camero, Jingliang Hu, Ariadna Pregel Hoderlein, Çağlar Şenaras, Timothy Davis, Daniel Cremers, Giovanni Marchisio, Xiao Xiang Zhu, Laura Leal-Taixé
These observations are paired with pixel-wise monthly semantic segmentation labels of 7 land use and land cover (LULC) classes.
A benchmark that would allow us to perform an apples-to-apples comparison of existing efforts is a crucial first step towards advancing this important research field.
We introduce CenterGroup, an attention-based framework to estimate human poses from a set of identity-agnostic keypoints and person center predictions in an image.
Ranked #6 on Multi-Person Pose Estimation on COCO
It uses this model to analyze differences in the pixel and its spatial context-based predictions in subsequent time periods for change detection.
However, given the initial rotation estimate supplied by Kabsch, we show we can improve point correspondence learning during model training by extending the original optimization problem.
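As background for the rotation estimate mentioned here, the sketch below solves the same least-squares alignment problem that Kabsch addresses, but in 2D, where the optimal rotation angle has a closed form. This is a deliberate simplification: the 3D Kabsch algorithm obtains the rotation from an SVD of the cross-covariance matrix. The function name `optimal_rotation_2d` and the sample points are hypothetical, and known point correspondences are assumed.

```python
import math


def optimal_rotation_2d(source, target):
    """Angle (radians) of the rotation that best maps source onto target.

    Minimizes sum_i ||R p_i - q_i||^2 over 2D rotations R after removing
    both centroids, assuming source[i] corresponds to target[i].
    """
    n = len(source)
    sx = sum(x for x, _ in source) / n
    sy = sum(y for _, y in source) / n
    tx = sum(x for x, _ in target) / n
    ty = sum(y for _, y in target) / n
    dot = 0.0    # sum of p . q over centered pairs
    cross = 0.0  # sum of p x q (2D cross product) over centered pairs
    for (px, py), (qx, qy) in zip(source, target):
        px, py, qx, qy = px - sx, py - sy, qx - tx, qy - ty
        dot += px * qx + py * qy
        cross += px * qy - py * qx
    # Maximizing dot*cos(theta) + cross*sin(theta) gives this angle.
    return math.atan2(cross, dot)


# Hypothetical data: rotate by 30 degrees and translate, then recover the angle.
true_angle = math.radians(30.0)
source = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (3.0, 1.0)]
c, s = math.cos(true_angle), math.sin(true_angle)
target = [(c * x - s * y + 5.0, s * x + c * y - 2.0) for x, y in source]
estimated = optimal_rotation_2d(source, target)
print(math.degrees(estimated))
```

The estimate is exact only when the correspondences are correct, which is why improving correspondence learning and improving the rotation estimate go hand in hand.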
1 code implementation • 17 Jun 2021 • Matthijs Douze, Giorgos Tolias, Ed Pizzi, Zoë Papakipos, Lowik Chanussot, Filip Radenovic, Tomas Jenicek, Maxim Maximov, Laura Leal-Taixé, Ismail Elezi, Ondřej Chum, Cristian Canton Ferrer
This benchmark is used for the Image Similarity Challenge at NeurIPS'21 (ISC2021).
Ranked #1 on Image Similarity Detection on DISC21 dev
Multi-object tracking (MOT) enables mobile robots to perform well-informed motion planning and navigation by localizing surrounding objects in 3D space and time.
Ranked #1 on 3D Multi-Object Tracking on KITTI
We hope to open a new front in multi-object tracking research that will bring us a step closer to intelligent systems that can operate safely in the real world.
The goal of cross-view image based geo-localization is to determine the location of a given street view image by matching it against a collection of geo-tagged satellite images.
Obstacle avoidance is a fundamental and challenging problem for autonomous navigation of mobile robots.
In this paper, we propose 4D panoptic LiDAR segmentation to assign a semantic class and a temporally-consistent instance ID to a sequence of 3D points.
1 code implementation • 23 Feb 2021 • Mark Weber, Jun Xie, Maxwell Collins, Yukun Zhu, Paul Voigtlaender, Hartwig Adam, Bradley Green, Andreas Geiger, Bastian Leibe, Daniel Cremers, Aljoša Ošep, Laura Leal-Taixé, Liang-Chieh Chen
The task of assigning semantic classes and track identities to every pixel in a video is called video panoptic segmentation.
To this end, we propose an approach based on message passing networks that takes all the relations in a mini-batch into account.
Ranked #2 on Metric Learning on CARS196
We propose a novel unsupervised learning approach to 3D shape correspondence that builds a multiscale matching pipeline into a deep neural network.
We present MOTChallenge, a benchmark for single-camera Multiple Object Tracking (MOT) launched in late 2014, to collect existing and new data, and create a framework for the standardized evaluation of multiple object tracking methods.
Inspired by human navigation, we model the task of trajectory prediction as an intuitive two-stage process: (i) goal estimation, which predicts the most likely target positions of the agent, followed by (ii) a routing module, which estimates a set of plausible trajectories that route towards the estimated goal.
On the other hand, 3D convolutional networks have been successfully applied to video classification tasks, but they have not been leveraged as effectively as their 2D convolutional counterparts for problems involving dense per-pixel interpretation of videos, and they lag behind the aforementioned networks in terms of performance.
Ranked #1 on Video Object Segmentation on DAVIS 2016 (using extra training data)
We are able to train our model completely on synthetic data and directly apply it to a wide range of real-world images.
Ranked #1 on Depth Estimation on NYU-Depth V2 (RMSE metric)
In many real-world scenarios, such as people tracking or action recognition, it is important to be able to process the data while taking care to protect people's identity.
We propose to learn a deep latent Gaussian process dynamics (DLGPD) model that learns low-dimensional system dynamics from environment interactions with visual observations.
The benchmark for Multiple Object Tracking, MOTChallenge, was launched with the goal of establishing a standardized evaluation of multiple object tracking methods.
In this paper, we propose a different approach that is well-suited to a variety of tasks involving instance segmentation in videos.
Ranked #4 on Unsupervised Video Object Segmentation on DAVIS 2017 (val) (using extra training data)
In our formulation, we define a likelihood for a set distribution represented by a) two discrete distributions defining the set cardinality and permutation variables, and b) a joint distribution over set elements with a fixed cardinality.
Our method results in an overall improvement in the count and size distribution prediction compared to the state-of-the-art instance segmentation method Mask R-CNN.
The ability of deep learning models to generalize well across different scenarios depends primarily on the quality and quantity of annotated data.
Additionally, we propose a first set of metrics to quantitatively evaluate the accuracy as well as the perceptual quality of the temporal evolution.
Ranked #11 on Video Super-Resolution on Vid4 - 4x upscaling (PSNR metric)
To this end, we propose to divide the task of vehicle control into two independent modules: a control module which is only trained on one weather condition for which labeled steering data is available, and a perception module which is used as an interface between new weather conditions and the fixed control module.
We demonstrate the validity of this new formulation on two relevant vision problems: object detection, for which our formulation outperforms state-of-the-art detectors such as Faster R-CNN and YOLO, and a complex CAPTCHA test, where we observe that, surprisingly, our set-based network acquired the ability to mimic arithmetic without any rules being coded.
Second, we demonstrate how another network can be used to map from an image or video frames to a DAM network to reproduce this appearance, without using a lengthy optimization such as stochastic gradient descent (learning-to-learn).
Video Object Segmentation, and video processing in general, has historically been dominated by methods that rely on the temporal consistency and redundancy in consecutive video frames.
Ranked #27 on Semi-Supervised Video Object Segmentation on DAVIS 2016
In order to track all persons in a scene, the tracking-by-detection paradigm has proven to be a very effective approach.
Ranked #19 on Multi-Object Tracking on MOT16
This paper introduces a novel algorithm for transductive inference in higher-order MRFs, where the unary energies are parameterized by a variable classifier.
Standardized benchmarks are crucial for the majority of computer vision applications.
In this work we propose a new CNN+LSTM architecture for camera pose regression for indoor and outdoor scenes.
This paper tackles the task of semi-supervised video object segmentation, i.e., the separation of an object from the background in a video, given the mask of the first frame.
Ranked #1 on Visual Object Tracking on YouTube-VOS 2018 val
This paper introduces a novel approach to the task of data association within the context of pedestrian tracking, by introducing a two-stage learning scheme to match pairs of detections.
We discuss the challenges of creating such a framework, collecting existing and new data, gathering state-of-the-art methods to be tested on the datasets, and finally creating a unified evaluation system.
Multiple people tracking is a key problem for many applications such as surveillance, animation or car navigation, and a key input for tasks such as activity recognition.