Existing state-of-the-art methods for Video Object Segmentation (VOS) learn low-level pixel-to-pixel correspondences between frames to propagate object masks across video.
We further show that D^2Conv3D outperforms trivial extensions of existing dilated and deformable convolutions to 3D.
Ranked #2 on Unsupervised Video Object Segmentation on DAVIS 2016 (using extra training data)
Since scene context helps reasoning about object semantics, current works focus on models with large capacity and receptive fields that can fully capture the global context of an input 3D scene.
Ranked #1 on Semantic Segmentation on ScanNet
Person detection is a crucial task for mobile robots navigating in human-populated environments, and LiDAR sensors are promising for this task given their accurate depth measurements and large field of view.
1 code implementation • 23 Feb 2021 • Mark Weber, Jun Xie, Maxwell Collins, Yukun Zhu, Paul Voigtlaender, Hartwig Adam, Bradley Green, Andreas Geiger, Bastian Leibe, Daniel Cremers, Aljoša Ošep, Laura Leal-Taixé, Liang-Chieh Chen
The task of assigning semantic classes and track identities to every pixel in a video is called video panoptic segmentation.
Through experiments on the JackRabbot dataset with two detector models, DROW3 and DR-SPAAM, we show that self-supervised detectors, trained or fine-tuned with pseudo-labels, outperform detectors trained only on a different dataset.
We use a deep convolutional network to automatically create pseudo-labels on a pixel level from much cheaper bounding box annotations and investigate how far such pseudo-labels can carry us for training state-of-the-art VOS approaches.
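To see why box annotations are so much weaker than pixel masks, it helps to look at the trivial pseudo-label baseline: painting each box with its class id. A minimal numpy sketch of that baseline (the function name `boxes_to_pseudo_mask` is illustrative, not from the paper, whose approach refines such labels with a deep network):

```python
import numpy as np

def boxes_to_pseudo_mask(shape, boxes, labels):
    """Naive pixel pseudo-labels: paint each (x1, y1, x2, y2) box with its class id.
    0 is background; later boxes overwrite earlier ones where they overlap."""
    mask = np.zeros(shape, dtype=np.int32)
    for (x1, y1, x2, y2), cls in zip(boxes, labels):
        mask[y1:y2, x1:x2] = cls
    return mask

# One 3x3 box of class 5 on an 8x8 canvas.
mask = boxes_to_pseudo_mask((8, 8), [(1, 1, 4, 4)], [5])
```

Every background pixel inside a box is mislabeled by this baseline, which is exactly the gap a learned box-to-mask network is meant to close.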
Multi-Object Tracking (MOT) has been notoriously difficult to evaluate.
3D convolutional networks, on the other hand, have been applied successfully to video classification, but for problems involving dense per-pixel interpretation of videos they have not been leveraged as effectively as their 2D counterparts and still lag behind them in performance.
Ranked #1 on Unsupervised Video Object Segmentation on DAVIS-2016 (using extra training data)
Heatmap representations have formed the basis of human pose estimation systems for many years, and their extension to 3D has been a fruitful line of recent research.
Ranked #1 on 3D Human Pose Estimation on 3D Poses in the Wild Challenge (MPJPE metric)
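Heatmap-based pose estimation commonly recovers a joint coordinate as the expectation of the pixel grid under a softmax over the heatmap (soft-argmax), which is differentiable unlike a hard argmax. A minimal 2D sketch assuming a numpy array of logits (`soft_argmax_2d` is an illustrative name, not the papers' API):

```python
import numpy as np

def soft_argmax_2d(heatmap):
    """Differentiable 2D soft-argmax: expected (x, y) under softmax weights."""
    h, w = heatmap.shape
    p = np.exp(heatmap - heatmap.max())  # subtract max for numerical stability
    p /= p.sum()
    ys, xs = np.mgrid[0:h, 0:w]
    return float((p * xs).sum()), float((p * ys).sum())

# A sharp peak at column x=3, row y=2 yields coordinates very close to (3, 2).
hm = np.full((5, 5), -10.0)
hm[2, 3] = 10.0
x, y = soft_argmax_2d(hm)
```

The same expectation extends directly to 3D volumetric heatmaps by adding a depth axis.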
In this paper, we propose to use 3D shape and motion priors to regularize the estimation of the trajectory and the shape of vehicles in sequences of stereo images.
Detecting persons using a 2D LiDAR is a challenging task due to the low information content of 2D range data.
Instead of training the network for estimating keypoint correspondences on video data, it is trained on a large scale image datasets for human pose estimation using self-supervision.
That is, the convolutional kernel weights are mapped to the local surface of a given mesh.
We show that grouping proposals improves over NMS and outperforms previous state-of-the-art methods on the tasks of 3D object detection and semantic instance segmentation on the ScanNetV2 benchmark and the S3DIS dataset.
Ranked #1 on 3D Semantic Instance Segmentation on ScanNetV2
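For context, the NMS baseline that proposal grouping is compared against greedily keeps the highest-scoring box and discards any remaining box that overlaps it too much. A sketch of standard greedy NMS (this is the baseline, not the paper's grouping method):

```python
import numpy as np

def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, thresh=0.5):
    """Greedy NMS: keep boxes in score order, drop those with IoU >= thresh to a kept box."""
    keep = []
    for i in np.argsort(scores)[::-1]:
        if all(iou(boxes[i], boxes[j]) < thresh for j in keep):
            keep.append(i)
    return keep

boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [50, 50, 60, 60]], dtype=float)
scores = np.array([0.9, 0.8, 0.7])
kept = nms(boxes, scores)  # the second box overlaps the first and is suppressed
```

Grouping overlapping proposals instead of discarding them lets evidence from suppressed detections contribute to the final instance, which is where the reported gains over NMS come from.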
In this paper, we propose a different approach that is well-suited to a variety of tasks involving instance segmentation in videos.
Ranked #3 on Unsupervised Video Object Segmentation on DAVIS 2017 (val) (using extra training data)
Furthermore, as the image space is decoupled from the heatmap space, the network can learn to reason about joints beyond the image boundary.
UnOVOST performs competitively with many semi-supervised video object segmentation algorithms even though it is not given any input as to which objects should be tracked and segmented.
We present Siam R-CNN, a Siamese re-detection architecture which unleashes the full power of two-stage object detection approaches for visual object tracking.
Ranked #2 on Visual Object Tracking on TrackingNet
We present a novel end-to-end single-shot method that segments countable object instances (things) as well as background regions (stuff) into a non-overlapping panoptic segmentation at almost video frame rate.
In this work, we focus on precise 3D track state estimation and propose a learning-based approach for object-centric relative motion estimation of partially observed objects.
Object tracking and 3D reconstruction are often performed together, with tracking used as input for reconstruction.
In a thorough ablation study, we show that the receptive field size is directly related to the performance of 3D point cloud processing tasks, including semantic segmentation and object classification.
Ranked #12 on Semantic Segmentation on S3DIS Area5
We address the problem of learning a single model for person re-identification, attribute classification, body part segmentation, and pose estimation.
Following this paradigm, we present BoLTVOS (Box-Level Tracking for VOS), which consists of an R-CNN detector conditioned on the first-frame bounding box to detect the object of interest, a temporal consistency rescoring algorithm, and a Box2Seg network that converts bounding boxes to segmentation masks.
A lot of progress has been made in the field of object classification and semantic segmentation.
Ranked #4 on 3D Semantic Instance Segmentation on ScanNetV2
This paper addresses the problem of object discovery from unlabeled driving videos captured in a realistic automotive setting.
Many of the recent successful methods for video object segmentation (VOS) are overly complicated, heavily rely on fine-tuning on the first frame, and/or are slow, and are hence of limited practical use.
Ranked #1 on Semi-Supervised Video Object Segmentation on YouTube
This paper extends the popular task of multi-object tracking to multi-object tracking and segmentation (MOTS).
Ranked #7 on Multiple Object Tracking on KITTI Tracking test
In this paper, we present a deep learning architecture which addresses the problem of 3D semantic segmentation of unstructured point clouds.
We propose to leverage a generic object tracker in order to perform object mining in large-scale unlabeled videos, captured in a realistic automotive setting.
Most of the current vision-based tracking methods perform tracking in the image domain.
Ranked #20 on Multiple Object Tracking on KITTI Tracking test
In this paper we present our winning entry at the 2018 ECCV PoseTrack Challenge on 3D human pose estimation.
Ranked #40 on 3D Human Pose Estimation on Human3.6M
Occlusion is commonplace in realistic human-robot shared environments, yet its effects are not considered in standard 3D human pose estimation benchmarks.
Ranked #42 on 3D Human Pose Estimation on Human3.6M
We address semi-supervised video object segmentation, the task of automatically generating accurate and consistent pixel masks for objects in a video sequence, given the first-frame ground truth annotations.
In the past decade, many robots have been deployed in the wild, and people detection and tracking is an important component of such deployments.
The recently proposed PointNet architecture presents an interesting step ahead in that it can operate on unstructured point clouds, achieving encouraging segmentation results.
We explore object discovery and detector adaptation based on unlabeled video sequences captured from a mobile platform.
In this paper, we propose a model-free multi-object tracking approach that uses a category-agnostic image segmentation method to track objects.
We tackle the task of semi-supervised video object segmentation, i.e., segmenting the pixels belonging to an object in the video using the ground truth pixel mask for the first frame.
Ranked #2 on Visual Object Tracking on YouTube-VOS 2018 val
Recent progress in Reinforcement Learning (RL), fueled by its combination with Deep Learning, has enabled impressive results in learning to interact with complex virtual environments, yet real-world applications of RL are still scarce.
With the rise of end-to-end learning through deep learning, person detectors and re-identification (ReID) models have recently become very strong.
In the past few years, the field of computer vision has gone through a revolution fueled mainly by the advent of large datasets and the adoption of deep convolutional neural networks for end-to-end learning.
Ranked #3 on Person Re-Identification on CUHK03 (Rank-5 metric)
Our visual-inertial SLAM system is based on a real-time capable visual-inertial odometry method that provides locally consistent trajectory and map estimates.
As such, and due to their quick adoption in a wide range of applications, appropriate benchmarks are crucial for algorithm selection and comparison.
Therefore, additional processing steps have to be performed in order to obtain pixel-accurate segmentation masks at the full image resolution.
Ranked #18 on Real-Time Semantic Segmentation on Cityscapes test
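A common form of such post-processing is simply upsampling the low-resolution class logits back to the input resolution before taking the per-pixel argmax. A dependency-free sketch using nearest-neighbour upsampling (real pipelines typically use bilinear interpolation; `upsample_logits` is an illustrative name):

```python
import numpy as np

def upsample_logits(logits, factor):
    """Nearest-neighbour upsampling of per-class logits (C, h, w) -> (C, h*f, w*f).
    Repeating rows and columns keeps the sketch free of interpolation libraries."""
    return logits.repeat(factor, axis=1).repeat(factor, axis=2)

# Full-resolution prediction = argmax over classes after upsampling.
logits = np.random.randn(3, 4, 4)          # 3 classes on a 4x4 feature grid
pred = upsample_logits(logits, 8).argmax(axis=0)  # (32, 32) label map
```

Nearest-neighbour upsampling produces blocky boundaries, which is precisely why pixel-accurate methods invest in learned or interpolated refinement instead.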
In this paper, we address the problem of object discovery in time-varying, large-scale image collections.
We evaluate how different choices of methods and parameters for the individual pipeline steps affect overall system performance and examine their effects for different query categories such as buildings, paintings or sculptures.