In an Atlanta world, given a set of image lines, we aim to cluster them according to their underlying vanishing points (VPs), whose number is unknown a priori.
In large-scale warehouses, precise instance masks are crucial for robotic bin picking but are challenging to obtain.
This work aims to improve unsupervised audio-visual pre-training.
Furthermore, to reduce the influence of differing spatial distributions between the mapping and query sequences, an effect not considered by previous methods, we also introduce a space constraint term based on 3D discretized grids.
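As an illustration of what a 3D-discretized-grid constraint can look like (a minimal sketch, not the paper's actual formulation — the cell size and the overlap score are assumptions), one can voxelize each point cloud and compare the sets of occupied cells:

```python
import numpy as np

def voxelize(points, cell=0.5):
    # Map 3D points (N x 3) to integer grid cells; each occupied cell is
    # represented once, regardless of how many points fall inside it.
    return {tuple(c) for c in np.floor(points / cell).astype(int)}

def grid_overlap(a, b, cell=0.5):
    # Jaccard overlap of occupied-cell sets: a simple spatial-consistency
    # score between two point sets (1.0 = same cells, 0.0 = disjoint).
    A, B = voxelize(a, cell), voxelize(b, cell)
    return len(A & B) / max(len(A | B), 1)
```

A score like this can serve as a soft constraint term that penalizes query poses whose points fall into grid cells never occupied by the mapping sequence.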
Task automation of surgical robots has the potential to improve surgical efficiency.
It abstracts the shape prior of a category, and thus can provide constraints on the overall shape of an instance.
To further improve the performance of the stereo framework, StereoPose is equipped with a parallax attention module for stereo feature fusion and an epipolar loss for improving the stereo-view consistency of network predictions.
To continuously improve the quality of pseudo labels, we iterate the above steps by taking the trained student model as a new teacher and re-label real data using the refined teacher model.
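The iterative teacher-student loop described above can be sketched generically (a minimal, model-agnostic sketch; `train_fn`, `predict_fn`, and the confidence threshold are assumptions, not the paper's actual training code):

```python
def self_training(train_fn, predict_fn, labeled, unlabeled,
                  rounds=3, conf_thresh=0.9):
    """Iterative teacher-student pseudo-labeling (generic sketch).

    train_fn(examples) -> model, where examples is a list of (x, y);
    predict_fn(model, x) -> (label, confidence).
    Each round, the current student becomes the new teacher and
    re-labels the unlabeled pool; only confident predictions are kept.
    """
    model = train_fn(labeled)               # initial teacher from labeled data
    for _ in range(rounds):
        pseudo = []
        for x in unlabeled:
            y, conf = predict_fn(model, x)
            if conf >= conf_thresh:         # keep only confident pseudo labels
                pseudo.append((x, y))
        model = train_fn(labeled + pseudo)  # student sees real + pseudo labels
    return model
```

Each round the pseudo-label set can grow as the model improves, which is the mechanism by which label quality increases over iterations.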
Industrial bin picking is a challenging task that requires accurate and robust segmentation of individual object instances.
Given a set of putative 3D-3D point correspondences, we aim to remove outliers and estimate the rigid transformation with 6 degrees of freedom (DOF).
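A standard baseline for this problem combines RANSAC-style hypothesize-and-verify with the Kabsch/Procrustes solver (a minimal sketch of that baseline, not the paper's proposed method; the iteration count and inlier threshold are assumptions):

```python
import numpy as np

def kabsch(P, Q):
    # Closed-form rotation R and translation t minimizing ||R P + t - Q||
    # over corresponding rows of P and Q (N x 3 each).
    cp, cq = P.mean(0), Q.mean(0)
    H = (P - cp).T @ (Q - cq)
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))   # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = cq - R @ cp
    return R, t

def ransac_rigid(P, Q, iters=200, thresh=0.05, seed=0):
    # Hypothesize from minimal 3-point samples, verify by residual count,
    # then refit on the largest consensus set.
    rng = np.random.default_rng(seed)
    best = np.zeros(len(P), bool)
    for _ in range(iters):
        idx = rng.choice(len(P), 3, replace=False)
        R, t = kabsch(P[idx], Q[idx])
        inliers = np.linalg.norm(P @ R.T + t - Q, axis=1) < thresh
        if inliers.sum() > best.sum():
            best = inliers
    return kabsch(P[best], Q[best]), best
```

The 6 DOF are the 3 rotation parameters in `R` and the 3 components of `t`; a minimal sample of 3 non-collinear correspondences determines them.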
Ten learning-based surgical tasks are built into the platform, which are common in real-world autonomous surgical execution.
To greatly increase label efficiency, we explore a new problem, i.e., adaptive instrument segmentation: effectively adapting one source model to new robotic surgical videos from multiple target domains, given only the annotated instruments in the first frame.
Automatic surgical gesture recognition is fundamentally important to enable intelligent cognitive assistance in robotic surgery.
Ranked #1 on Action Segmentation on JIGSAWS
Specifically, given an unlabeled video clip, we compute a series of spatio-temporal statistical summaries, such as the spatial location and dominant direction of the largest motion, or the spatial location and dominant color of the largest color diversity along the temporal axis.
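One such summary, the spatial location of the largest motion, can be computed as follows (a minimal sketch under assumed conventions: grayscale frames, a coarse patch grid, and summed frame differences as the motion measure):

```python
import numpy as np

def largest_motion_patch(frames, grid=4):
    # frames: (T, H, W) grayscale video. Accumulate absolute frame-to-frame
    # differences into a motion-energy map, pool it over a grid x grid
    # partition, and return the (row, col) index of the most active patch.
    diff = np.abs(np.diff(frames, axis=0)).sum(0)          # (H, W) energy
    H, W = diff.shape
    ph, pw = H // grid, W // grid
    energy = diff[:ph * grid, :pw * grid] \
        .reshape(grid, ph, grid, pw).sum((1, 3))           # (grid, grid)
    return np.unravel_index(energy.argmax(), energy.shape)
```

The patch index then serves as a free pseudo-label: the network is trained to predict it from the raw clip, without any human annotation.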
This paper addresses the problem of self-supervised video representation learning from a new perspective -- by video pace prediction.
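The pace-prediction pretext task rests on a simple sampling step: clips subsampled at different temporal strides get the stride as a free label (a minimal sketch; the function name and clip length are assumptions, not the paper's code):

```python
import numpy as np

def sample_clip(video, pace, clip_len=8, start=0):
    """Subsample a video (T x ... array) at a given pace.

    pace=1 keeps consecutive frames, pace=2 takes every other frame, etc.
    The pretext task trains a network to predict the pace class from the
    resulting clip, which requires it to model motion and appearance.
    """
    idx = start + pace * np.arange(clip_len)
    return video[idx], pace  # (clip, self-supervised pace label)
```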
Learning a good 3D human pose representation is important for human pose related tasks, e.g., 3D human pose estimation and action recognition.
Anil Armagan, Guillermo Garcia-Hernando, Seungryul Baek, Shreyas Hampali, Mahdi Rad, Zhaohui Zhang, Shipeng Xie, Mingxiu Chen, Boshen Zhang, Fu Xiong, Yang Xiao, Zhiguo Cao, Junsong Yuan, Pengfei Ren, Weiting Huang, Haifeng Sun, Marek Hrúz, Jakub Kanis, Zdeněk Krňoul, Qingfu Wan, Shile Li, Linlin Yang, Dongheui Lee, Angela Yao, Weiguo Zhou, Sijia Mei, Yun-hui Liu, Adrian Spurr, Umar Iqbal, Pavlo Molchanov, Philippe Weinzaepfel, Romain Brégier, Grégory Rogez, Vincent Lepetit, Tae-Kyun Kim
To address these issues, we designed a public challenge (HANDS'19) to evaluate the abilities of current 3D hand pose estimators (HPEs) to interpolate and extrapolate the poses of a training set.
In this paper, we propose a method that takes advantage of human hand morphological topology (HMT) structure to improve the pose estimation performance.
The proposal in this paper is verified by a simulated assembly in which a robot arm completed the assembly process, including picking parts from a bin and a subsequent peg-in-hole assembly.
Place recognition and loop-closure detection are major challenges in the localization, mapping, and navigation tasks of self-driving vehicles.
We conduct extensive experiments with C3D to validate the effectiveness of our proposed approach.
Ranked #47 on Self-Supervised Action Recognition on HMDB51
This paper presents an efficient neural network model to generate robotic grasps from high-resolution images.
Point cloud based place recognition remains an open problem due to the difficulty of extracting local features from raw 3D point clouds and generating a global descriptor, and it is even harder in large-scale dynamic environments.