Sparse local feature matching is pivotal for many computer vision and robotics tasks.
Given an image or a video captured from a monocular camera, amodal layout estimation is the task of predicting semantics and occupancy in bird's eye view.
We evaluate our approach on this dataset and on three diverse sequences from standard datasets, including two real-world dynamic sequences, and show a significant improvement in robustness and accuracy over a state-of-the-art monocular visual-inertial odometry system.
Given a monocular colour image of a warehouse rack, we aim to predict the bird's-eye view layout for each shelf in the rack, which we term multi-layer layout prediction.
Local detectors and descriptors in typical computer vision pipelines work well until variations in viewpoint and appearance become extreme.
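For concreteness, a minimal sketch of such a detect-describe-match pipeline using OpenCV's ORB detector and brute-force matching (the image file names are placeholders):

```python
# Minimal sketch of a typical detect-describe-match pipeline (OpenCV ORB).
# Pipelines like this degrade as viewpoint and appearance changes grow extreme.
import cv2

def match_local_features(img1, img2, max_matches=50):
    orb = cv2.ORB_create(nfeatures=1000)
    kp1, des1 = orb.detectAndCompute(img1, None)
    kp2, des2 = orb.detectAndCompute(img2, None)
    # Hamming distance suits ORB's binary descriptors; crossCheck filters
    # out asymmetric matches.
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = sorted(matcher.match(des1, des2), key=lambda m: m.distance)
    return kp1, kp2, matches[:max_matches]

img1 = cv2.imread("view1.png", cv2.IMREAD_GRAYSCALE)  # placeholder paths
img2 = cv2.imread("view2.png", cv2.IMREAD_GRAYSCALE)
kp1, kp2, matches = match_local_features(img1, img2)
```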
We present DRACO, a method for Dense Reconstruction And Canonicalization of Object shape from one or more RGB images.
In this paper, we present BirdSLAM, a novel simultaneous localization and mapping (SLAM) system for the challenging scenario of autonomous driving platforms equipped with only a monocular camera.
In particular, our integration of VPR with SLAM by leveraging the robustness of deep-learned features and our homography-based extreme viewpoint invariance significantly boosts the performance of VPR, feature correspondence, and pose graph submodules of the SLAM pipeline.
In this paper, we present a simple baseline for visual grounding for autonomous driving which outperforms state-of-the-art methods while retaining minimal design choices.
We present a novel Multi-Relational Graph Convolutional Network (MRGCN) based framework to model on-road vehicle behaviors from a sequence of temporally ordered frames captured by a moving monocular camera.
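The MRGCN's exact formulation is not reproduced here; a minimal multi-relational graph convolution layer in PyTorch, loosely in the spirit of R-GCN, could look like the following (dimensions and relation count are illustrative):

```python
# Hypothetical sketch of one multi-relational graph convolution layer,
# in the spirit of R-GCN; not the authors' exact MRGCN.
import torch
import torch.nn as nn

class RelationalGraphConv(nn.Module):
    def __init__(self, in_dim, out_dim, num_relations):
        super().__init__()
        # One weight matrix per relation type, plus a self-loop transform.
        self.rel_weights = nn.ModuleList(
            [nn.Linear(in_dim, out_dim, bias=False) for _ in range(num_relations)]
        )
        self.self_loop = nn.Linear(in_dim, out_dim)

    def forward(self, x, adj):
        # x: (num_nodes, in_dim); adj: (num_relations, num_nodes, num_nodes)
        out = self.self_loop(x)
        for r, w in enumerate(self.rel_weights):
            deg = adj[r].sum(dim=1, keepdim=True).clamp(min=1.0)
            out = out + (adj[r] / deg) @ w(x)  # mean-aggregate per relation
        return torch.relu(out)

layer = RelationalGraphConv(in_dim=16, out_dim=32, num_relations=3)
x = torch.randn(5, 16)                        # 5 vehicle/landmark nodes
adj = torch.randint(0, 2, (3, 5, 5)).float()  # one adjacency per relation
h = layer(x, adj)                             # (5, 32)
```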
This paper presents a new system to obtain dense object reconstructions along with 6-DoF poses from a single image.
In this paper, we present a method to reliably detect such obstacles through a multi-modal framework of sparse LiDAR (VLP-16) and monocular vision.
We further present an extensive benchmark in a photo-realistic 3D simulation across diverse scenes to study the convergence and generalisation of visual servoing approaches.
We dub this problem amodal scene layout estimation, which involves "hallucinating" scene layout for even parts of the world that are occluded in the image.
At the intermediate level, the map is represented as a Manhattan Graph, where the nodes and edges are characterized by Manhattan properties, and as a Pose Graph at the lowest level of detail.
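A hypothetical sketch of such a multi-level map representation (all class and field names here are illustrative, not the authors'):

```python
# Illustrative data layout only: topological places at the top, a Manhattan
# graph (axis-aligned edges) in the middle, a metric pose graph at the bottom.
from dataclasses import dataclass, field

@dataclass
class PoseNode:
    x: float
    y: float
    theta: float

@dataclass
class ManhattanEdge:
    src: int
    dst: int
    direction: str  # one of 'N', 'E', 'S', 'W'

@dataclass
class HierarchicalMap:
    topological: dict = field(default_factory=dict)      # place id -> label
    manhattan_edges: list = field(default_factory=list)  # ManhattanEdge items
    pose_graph: list = field(default_factory=list)       # PoseNode items
```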
In this paper, we tackle the problem of multibody SLAM from a monocular camera.
Understanding on-road vehicle behaviour from a temporal sequence of sensor data is gaining in popularity.
We leverage the expressiveness of the popular stacked hourglass architecture and augment it by adopting memory units between intermediate layers of the network with weights shared across stages for video frames.
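As a rough sketch, a convolutional GRU-style cell could serve as such a memory unit, with a single cell's weights shared across stages and frames (the paper's exact memory unit and wiring may differ):

```python
# Hedged sketch: a convolutional GRU-style memory cell that could sit between
# hourglass stages, with one cell's weights shared across all stages/frames.
import torch
import torch.nn as nn

class ConvGRUCell(nn.Module):
    def __init__(self, channels, kernel_size=3):
        super().__init__()
        pad = kernel_size // 2
        self.gates = nn.Conv2d(2 * channels, 2 * channels, kernel_size, padding=pad)
        self.cand = nn.Conv2d(2 * channels, channels, kernel_size, padding=pad)

    def forward(self, x, h):
        if h is None:
            h = torch.zeros_like(x)
        z, r = torch.sigmoid(self.gates(torch.cat([x, h], dim=1))).chunk(2, dim=1)
        h_tilde = torch.tanh(self.cand(torch.cat([x, r * h], dim=1)))
        return (1 - z) * h + z * h_tilde

cell = ConvGRUCell(channels=64)                   # shared across stages/frames
h = None
for frame_feat in torch.randn(4, 1, 64, 32, 32):  # 4 video frames
    h = cell(frame_feat, h)                       # carry memory across frames
```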
In the indoor setting, we use an autonomous drone to navigate various scenarios and also a ground robot which can explore the environment using the trajectories proposed by our framework.
The proposed parameterization associates a category-specific 3D CAD model with the object under consideration using a dictionary-based RANSAC method that takes object viewpoints as a prior, together with edges detected in the corresponding intensity image of the scene.
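The generic RANSAC skeleton underlying such a method is sketched below; the dictionary-based sampling from viewpoint priors and the edge-based scoring are abstracted behind the fit and error callbacks:

```python
# Generic RANSAC skeleton for model-to-data association; the paper's
# dictionary-based variant hypothesizes CAD-model alignments using viewpoint
# priors and scores them against detected image edges (not shown here).
import random

def ransac(data, fit, error, n_sample, n_iters=200, inlier_thresh=1.0):
    best_model, best_inliers = None, []
    for _ in range(n_iters):
        model = fit(random.sample(data, n_sample))   # hypothesize from minimal set
        inliers = [d for d in data if error(model, d) < inlier_thresh]
        if len(inliers) > len(best_inliers):         # keep best consensus
            best_model, best_inliers = model, inliers
    return best_model, best_inliers
```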
Monocular SLAM refers to using a single camera to estimate robot ego-motion while building a map of the environment.
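The core geometric step inside many monocular SLAM front-ends is two-view relative pose estimation via the essential matrix; a minimal OpenCV sketch (note that monocular translation is recovered only up to scale):

```python
# Minimal two-view ego-motion sketch (essential matrix + pose recovery),
# the standard geometric step in many monocular SLAM front-ends.
import numpy as np
import cv2

def relative_pose(pts1, pts2, K):
    """pts1, pts2: (N, 2) matched pixel coordinates; K: 3x3 intrinsics."""
    E, inliers = cv2.findEssentialMat(pts1, pts2, K,
                                      method=cv2.RANSAC, threshold=1.0)
    # Monocular translation t is recovered only up to an unknown scale.
    _, R, t, _ = cv2.recoverPose(E, pts1, pts2, K, mask=inliers)
    return R, t
```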
However, given the limited on-chip memory and computation resources of an FPGA, meeting the high memory-throughput requirements and exploiting the parallelism of CNNs are major challenges.
This paper proposes a novel architecture to learn multiple driving behaviors in a traffic scenario.
We show that using a noisy teacher, which could be a standard VO pipeline, and by designing a loss term that enforces geometric consistency of the trajectory, we can train accurate deep models for VO that do not require ground-truth labels.
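A hedged sketch of such a loss: the noisy teacher supervises per-step poses, while a composition term penalizes disagreement between chained short-range predictions and a direct longer-range prediction (the pose parameterization and weighting here are assumptions, not the paper's exact loss):

```python
# Illustrative trajectory-consistency loss: teacher supervision plus a term
# enforcing that chained frame-to-frame poses agree over a longer window.
import torch

def pose_loss(pred_rel, teacher_rel, pred_skip, w_consistency=0.1):
    """pred_rel: (B, 2, 4, 4) predicted poses for steps t->t+1 and t+1->t+2;
    teacher_rel: same shape, from a (noisy) classical VO pipeline;
    pred_skip: (B, 4, 4) predicted pose for the direct step t->t+2."""
    supervision = (pred_rel - teacher_rel).abs().mean()
    composed = pred_rel[:, 0] @ pred_rel[:, 1]   # chain t -> t+1 -> t+2
    consistency = (composed - pred_skip).abs().mean()
    return supervision + w_consistency * consistency
```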
During training, the network only takes as input a LiDAR point cloud, the corresponding monocular image, and the camera calibration matrix K. At train time, we do not impose direct supervision (i.e., we do not directly regress to the calibration parameters).
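For reference, the standard pinhole projection relating LiDAR points to image pixels via the intrinsics K and an extrinsic transform [R|t]; the paper learns calibration without regressing these parameters directly:

```python
# Standard pinhole projection of LiDAR points into the image plane.
import numpy as np

def project_lidar_to_image(points, K, R, t):
    """points: (N, 3) LiDAR points; K: 3x3 intrinsics; R: 3x3; t: (3,)."""
    cam = points @ R.T + t          # LiDAR frame -> camera frame
    cam = cam[cam[:, 2] > 0]        # keep points in front of the camera
    uv = cam @ K.T                  # apply intrinsics
    return uv[:, :2] / uv[:, 2:3]   # perspective divide -> pixel coordinates
```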
We present MergeNet, a novel network architecture for discovering small obstacles in on-road scenes in the context of autonomous driving.
The proposed approach significantly improves the state-of-the-art for monocular object localization on arbitrarily-shaped roads.
In this paper, we propose a new approach for simultaneous training of multiple tasks sharing a set of common actions in continuous action spaces, which we call DiGrad (Differential Policy Gradient).
These category models are instance-independent and aid in the design of object landmark observations that can be incorporated into a generic monocular SLAM framework.
This paper introduces geometry and object shape and pose costs for multi-object tracking in urban driving scenarios.
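A minimal tracking-by-detection association sketch, with a simple centroid distance standing in for the paper's geometry, shape, and pose costs, solved with the Hungarian algorithm:

```python
# Hedged sketch: data association for multi-object tracking. A 2D centroid
# distance stands in for the paper's combined geometry/shape/pose costs.
import numpy as np
from scipy.optimize import linear_sum_assignment

def associate(track_centers, det_centers, gate=5.0):
    cost = np.linalg.norm(track_centers[:, None] - det_centers[None, :], axis=-1)
    rows, cols = linear_sum_assignment(cost)  # Hungarian algorithm
    # Discard assignments whose cost exceeds the gating threshold.
    return [(r, c) for r, c in zip(rows, cols) if cost[r, c] < gate]

tracks = np.array([[0.0, 0.0], [10.0, 10.0]])
dets = np.array([[0.5, -0.2], [9.7, 10.4], [50.0, 50.0]])
print(associate(tracks, dets))  # [(0, 0), (1, 1)]
```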
Efficient, real-time segmentation of color images is important in many fields of computer vision, such as image compression, medical imaging, mapping, and autonomous navigation.
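Purely as an illustration of classical color segmentation (not necessarily the paper's method), k-means clustering in RGB space:

```python
# Illustrative sketch only: k-means color clustering as one classical,
# near-real-time way to segment a color image.
import numpy as np
import cv2

def kmeans_segment(img, k=4):
    pixels = img.reshape(-1, 3).astype(np.float32)
    criteria = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 10, 1.0)
    _, labels, centers = cv2.kmeans(pixels, k, None, criteria,
                                    attempts=3, flags=cv2.KMEANS_PP_CENTERS)
    # Paint each pixel with its cluster's mean color.
    return centers[labels.flatten()].reshape(img.shape).astype(np.uint8)
```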
In this paper, we present an end-to-end learning based approach for visual servoing in diverse scenes where the knowledge of camera parameters and scene geometry is not available a priori.
This paper proposes an approach to fuse semantic features and motion cues using CNNs, to address the problem of monocular semantic motion segmentation.
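One simple fusion scheme, shown only as an assumption-labeled sketch, is channel concatenation of semantic and motion feature maps followed by a small convolutional head; the paper's exact fusion architecture may differ:

```python
# Hypothetical fusion head: concatenate semantic and motion feature maps
# along the channel axis and predict per-pixel labels.
import torch
import torch.nn as nn

class FusionHead(nn.Module):
    def __init__(self, sem_ch, mot_ch, num_classes):
        super().__init__()
        self.head = nn.Sequential(
            nn.Conv2d(sem_ch + mot_ch, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, num_classes, 1),  # per-pixel class logits
        )

    def forward(self, sem_feat, mot_feat):
        return self.head(torch.cat([sem_feat, mot_feat], dim=1))
```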
We then formulate a shape-aware adjustment problem that uses the learnt shape priors to recover the 3D pose and shape of a query object from an image.
We show results on the challenging KITTI urban dataset for accuracy of motion segmentation and reconstruction of the trajectory and shape of moving objects relative to ground truth.
We propose an algorithm that jointly infers the semantic class and motion labels of an object.