Our paper aims to automate the generation of medical reports from chest X-ray image inputs, a critical yet time-consuming task for radiologists.
In this paper, we propose Snipper, a framework to perform multi-person 3D pose estimation, tracking and motion forecasting simultaneously in a single inference.
Our approach is flexible and can be used for both text2motion and motion2text tasks.
Laborious and time-consuming manual annotation has become a real bottleneck in various practical scenarios.
Music and dance have always co-existed as pillars of human activities, contributing immensely to the cultural, social, and entertainment functions in virtually all societies.
Automated generation of 3D human motions from text is a challenging problem.
Inspired by recent success in unsupervised contrastive representation learning, we propose a novel denoised cross-video contrastive algorithm, aiming to enhance the feature discrimination ability of video snippets for accurate temporal action localization in the weakly-supervised setting.
Our framework encodes a video into a vector representation by learning to pick video clips that help to distinguish it from other videos via a contrastive objective using dropout noise.
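The contrastive objective with dropout noise can be illustrated with an InfoNCE-style loss. The numpy sketch below is a generic formulation, not the paper's exact loss; the function name and temperature value are our own assumptions.

```python
import numpy as np

def info_nce(anchors, positives, temperature=0.1):
    """InfoNCE-style contrastive loss: row i of `positives` is the positive
    for row i of `anchors`; every other row serves as an in-batch negative."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / temperature                # (N, N) cosine similarities
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))           # diagonal entries = positive pairs
```

In the dropout-noise setting, `positives` would be a second encoding of the same clips computed under a different dropout mask.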
One aspect that has been overlooked so far is that how we represent the skeletal pose has a critical impact on the prediction results.
We introduce an optimal transport distance for evaluating the authenticity of the generated dance distribution and a Gromov-Wasserstein distance to measure the correspondence between the dance distribution and the input music.
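As a minimal illustration of evaluating generated samples with an optimal transport distance: the full Gromov-Wasserstein computation requires a dedicated solver (e.g. the POT library), so the 1D sketch below only conveys the idea, with invented stand-in data.

```python
import numpy as np

def wasserstein_1d(a, b):
    """Wasserstein-1 distance between two equal-size 1D samples: the optimal
    transport plan simply matches order statistics after sorting."""
    return np.mean(np.abs(np.sort(a) - np.sort(b)))

rng = np.random.default_rng(0)
real = rng.normal(0.0, 1.0, size=1000)   # stand-in for real dance features
close = rng.normal(0.1, 1.0, size=1000)  # generated samples, close to real
far = rng.normal(3.0, 1.0, size=1000)    # generated samples, far from real

# A lower OT distance indicates a more authentic generated distribution.
assert wasserstein_1d(real, close) < wasserstein_1d(real, far)
```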
As a by-product, a CapS dataset is constructed by augmenting the existing benchmark training set with additional image tags and captions.
This paper jointly tackles the highly correlated tasks of estimating 3D human body poses and predicting future 3D motions from RGB image sequences.
Action2motion stochastically generates plausible 3D pose sequences of a prescribed action category, which are processed and rendered by motion2video to form 2D videos.
The event camera is an emerging imaging sensor that captures the dynamics of moving objects as events, which motivates our work on estimating 3D human pose and shape from event signals.
This paper focuses on the new problem of estimating human pose and shape from single polarization images.
A dataset of generic 3D objects with ground-truth annotated skeletons is collected.
However, the Chamfer distance is quite sensitive to noise and outliers, and can therefore be unreliable for assigning correspondences.
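The outlier sensitivity of the Chamfer distance is easy to demonstrate; the numpy sketch below uses our own minimal definition (not tied to any particular implementation) and shows how a single stray point inflates the distance.

```python
import numpy as np

def chamfer_distance(A, B):
    """Symmetric Chamfer distance between point sets A (n, d) and B (m, d):
    mean nearest-neighbour distance in both directions."""
    d = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)  # (n, m) pairwise
    return d.min(axis=1).mean() + d.min(axis=0).mean()

rng = np.random.default_rng(0)
A = rng.normal(size=(100, 3))
B = A + rng.normal(scale=0.01, size=(100, 3))    # near-perfect match
B_noisy = np.vstack([B, [[50.0, 50.0, 50.0]]])   # one far outlier added

# The single outlier dominates the B -> A term of the distance.
assert chamfer_distance(A, B_noisy) > 5 * chamfer_distance(A, B)
```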
A strong visual object tracker nowadays relies on its well-crafted modules, which typically consist of manually-designed network architectures to deliver high-quality tracking results.
Complex backgrounds and similar appearances between objects and their surroundings are generally recognized as challenging scenarios in Salient Object Detection (SOD).
To our knowledge, our work is the first to produce calibrated predictions under different expertise levels for medical image segmentation.
High-resolution 3D medical images are important for analysis and diagnosis, but axial scanning to acquire them is very time-consuming.
Action recognition is a relatively established task: given an input sequence of human motion, the goal is to predict its action category.
Inspired by recent advances in human shape estimation from single color images, in this paper we attempt to estimate human body shapes by leveraging the geometric cues in single polarization images.
First, based on a generative human template, an initial pairwise alignment is performed for every two frames with sufficient overlap. This is followed by a global non-rigid registration procedure, in which partial results from the RGBD frames are assembled into a unified 3D shape under the guidance of correspondences from the pairwise alignment. Finally, the texture map of the reconstructed human model is optimized to deliver a clear and spatially consistent texture.
To address this problem, we introduce a context-aware IoU-guided tracker (COMET) that exploits a multitask two-stream network and an offline reference proposal generation strategy.
Polarization images can capture polarized reflected light that preserves rich geometric cues of an object, which has motivated their recent application to reconstructing detailed surface normals of objects of interest.
Generative adversarial networks (GANs), famous for their capability to learn complex underlying data distributions, are nevertheless known to be difficult to train, often resulting in mode collapse or performance deterioration.
Existing methods usually perform feature selection and outlier scoring separately, which can select feature subsets that do not optimally serve outlier detection, leading to unsatisfactory performance.
Second, popular visual tracking benchmarks and their respective properties are compared, and their evaluation metrics are summarized.
In this paper, a novel wavelet-driven deep neural network termed WaveletKernelNet (WKN) is presented, in which a continuous wavelet convolutional (CWConv) layer is designed to replace the first convolutional layer of a standard CNN.
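The idea of a wavelet-driven first layer can be sketched as a filter bank whose kernels are sampled from a parameterized mother wavelet. The Morlet form, parameter names, and fixed scales below are illustrative assumptions, not the exact WKN formulation, where the scale would be learned.

```python
import numpy as np

def morlet_kernel(length, scale, omega=5.0):
    """Sample a real Morlet wavelet; in a CWConv layer the scale would be a
    learnable parameter rather than a fixed hyperparameter."""
    t = np.linspace(-1.0, 1.0, length) / scale
    return np.cos(omega * t) * np.exp(-0.5 * t ** 2)

def cwconv(signal, scales, length=65):
    """First-layer sketch: convolve a 1D signal with a bank of wavelet kernels."""
    return np.stack([np.convolve(signal, morlet_kernel(length, s), mode="valid")
                     for s in scales])
```

For a length-256 signal and three scales, `cwconv` yields a `(3, 192)` feature map, one channel per wavelet scale.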
Presentation bias is one of the key challenges when learning from implicit feedback in search engines, as it confounds the relevance signal.
In this paper, we study how to improve the data efficiency of IPS approaches in the offline comparison setting.
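Inverse propensity scoring (IPS) corrects presentation bias by up-weighting each click by the probability that its position was examined. The simulation below is a generic sketch of the estimator; the propensity values and relevance probability are invented for illustration.

```python
import numpy as np

def ips_estimate(clicks, propensities):
    """IPS estimator: re-weight each observed click by 1 / P(position examined)."""
    return np.mean(np.asarray(clicks) / np.asarray(propensities))

rng = np.random.default_rng(0)
n = 100_000
relevance = 0.8                             # true click probability if examined
prop = rng.choice([1.0, 0.5, 0.1], size=n)  # examination probability by position
examined = rng.random(n) < prop
clicks = (examined & (rng.random(n) < relevance)).astype(float)

naive = clicks.mean()              # biased low: ignores unexamined impressions
ips = ips_estimate(clicks, prop)   # approximately unbiased for `relevance`
assert abs(ips - relevance) < abs(naive - relevance)
```

The high variance of the `1 / propensity` weights for rarely examined positions is exactly the data-efficiency problem that offline comparison methods aim to mitigate.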
A major bottleneck in pedestrian detection is the sharp performance deterioration in the presence of small-size pedestrians that are relatively far from the camera.
This paper aims to synthesize filamentary structured images, such as retinal fundus images and neuronal images, as follows: given a ground truth, generate multiple realistic-looking phantoms.
Our model first takes a correction step on the grossly corrupted responses via geodesic curves on the manifold, and then performs multivariate linear regression on the corrected data.
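The two-stage structure (correct first, then regress) can be sketched as follows; the correction step here is a user-supplied placeholder standing in for the geodesic-curve correction on the manifold, and all names are our own.

```python
import numpy as np

def corrected_regression(X, Y_corrupt, correct_fn):
    """Stage 1: repair grossly corrupted responses with `correct_fn`
    (a stand-in for the geodesic-curve correction step).
    Stage 2: multivariate linear regression by ordinary least squares."""
    Y = correct_fn(Y_corrupt)
    X1 = np.hstack([X, np.ones((X.shape[0], 1))])  # append intercept column
    W, *_ = np.linalg.lstsq(X1, Y, rcond=None)
    return W                                       # shape (d + 1, q)
```

With noiseless responses and an identity correction, `corrected_regression` recovers the true coefficients and intercept exactly.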
The implementations of our approach and the comparison methods, as well as the involved datasets, are made publicly available in support of open-source and reproducible research.
We propose in this paper an atomic action-based Bayesian model that constructs Allen's interval relation networks to characterize complex activities with structural varieties in a probabilistic generative way. By introducing latent variables from the Chinese restaurant process, our approach captures all possible styles of a particular complex activity as a unique set of distributions over atomic actions and relations.
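The Chinese restaurant process that supplies the latent style variables can be sketched directly; the code below is the standard CRP sampler, with `alpha` as the concentration parameter, not the paper's full Bayesian model.

```python
import numpy as np

def crp_partition(n, alpha, rng):
    """Chinese restaurant process: item i joins an existing table with
    probability proportional to the table's occupancy, or opens a new
    table with probability proportional to alpha."""
    tables = []       # occupancy counts, one per table
    assignment = []
    for _ in range(n):
        probs = np.array(tables + [alpha], dtype=float)
        k = rng.choice(len(probs), p=probs / probs.sum())
        if k == len(tables):
            tables.append(1)  # open a new table
        else:
            tables[k] += 1
        assignment.append(k)
    return assignment
```

In the activity model, each table would correspond to one style of a complex activity, i.e. one set of distributions over atomic actions and relations.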
This paper focuses on the challenging problem of 3D pose estimation of a diverse spectrum of articulated objects from single depth images.
Pose estimation, tracking, and action recognition of articulated objects from depth images are important and challenging problems, which are normally considered separately.
Detecting hand actions from ego-centric depth sequences is a practically challenging problem, owing mostly to the complex and dexterous nature of hand articulations as well as non-stationary camera motion.
We focus on the challenging problem of efficient mouse 3D pose estimation based on static images, and especially single depth images.
In this paper we consider the problem of graph-based transductive classification, and we are particularly interested in the directed-graph scenario, which is a natural form for many real-world applications.
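A standard baseline for graph-based transductive classification is label propagation over a row-normalized adjacency matrix; the sketch below is this generic baseline (not the paper's directed-graph method), with `alpha` controlling how strongly seed labels are re-injected.

```python
import numpy as np

def label_propagation(A, labels, alpha=0.9, iters=100):
    """Propagate known labels (entries >= 0 in `labels`; -1 = unlabelled)
    over the graph via the random-walk transition matrix."""
    n = A.shape[0]
    classes = sorted(set(labels[labels >= 0]))
    Y = np.zeros((n, len(classes)))
    for i, l in enumerate(labels):
        if l >= 0:
            Y[i, classes.index(l)] = 1.0
    P = A / np.maximum(A.sum(axis=1, keepdims=True), 1e-12)  # row-stochastic
    F = Y.copy()
    for _ in range(iters):
        F = alpha * P @ F + (1 - alpha) * Y  # propagate, then re-inject seeds
    return np.array(classes)[F.argmax(axis=1)]
```

On a toy graph with two disconnected cliques and one labelled node per clique, the remaining nodes inherit their clique's label.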