Current embedding-based large-scale retrieval models are trained with 0-1 hard labels that indicate whether a query is relevant to a document, ignoring rich information about the degree of relevance.
The ODD score enhances the VOD system in two ways: 1) it enables the VOD system to select superior global reference frames, thereby improving overall accuracy; and 2) it serves as an indicator in the newly designed ODD Scheduler to eliminate the aggregation of frames that are easy to detect, thus accelerating the VOD process.
However, in order to ensure that the AUV is able to carry out its mission successfully, a control system capable of adapting to changing environmental conditions is required.
The prevalence of domain adaptive semantic segmentation has prompted concerns regarding source domain data leakage, where private information from the source domain could inadvertently be exposed in the target domain.
To take advantage of image augmentations while mitigating the semantic distortion issue, we propose a novel ZSL approach by Harnessing Adversarial Samples (HAS).
no code implementations • 13 Jul 2023 • MD Wahiduzzaman Khan, Hongwei Sheng, Hu Zhang, Heming Du, Sen Wang, Minas Theodore Coroneo, Farshid Hajati, Sahar Shariflou, Michael Kalloniatis, Jack Phu, Ashish Agar, Zi Huang, Mojtaba Golzan, Xin Yu
Retinal vessel segmentation is generally grounded in image-based datasets collected with bench-top devices.
Motivated by the above-mentioned issues, we present in this paper a dedicated end-to-end sparse deep learning approach for event-based pose tracking: 1) to our knowledge this is the first time that 3D human pose tracking is obtained from events only, thus eliminating the need to access any frame-based images as part of the input; 2) our approach is based entirely upon the framework of Spiking Neural Networks (SNNs), which consists of Spike-Element-Wise (SEW) ResNet and a novel Spiking Spatiotemporal Transformer; 3) a large-scale synthetic dataset, named SynEventHPD, is constructed that features a broad and diverse set of annotated 3D human motions, as well as longer hours of event stream data.
However, due to the sparse characteristics of point clouds, it is non-trivial to apply a standard transformer on sparse points.
Ranked #1 on 3D Object Detection on nuScenes LiDAR only
Motivated by the positive effect of the projector in feature distillation, we propose an ensemble of projectors to further improve the quality of student features.
Deep Reinforcement Learning (DRL) has achieved impressive performance in robotics and autonomous systems (RASs).
We identify two key challenges in our FedZSL protocol: 1) the trained models are prone to bias toward the locally observed classes, thus failing to generalize to the unseen classes and/or seen classes that appear on other devices; 2) as each category in the training data comes from a single source, the central model is highly vulnerable to model replacement (backdoor) attacks.
Based on this, we propose to exploit the image frequency distributions for night-time scene parsing.
Compared to all methods that do not use additional data for training, our models achieve 67.3% and 41.5% robust accuracy on CIFAR-10 and CIFAR-100 (improving upon the state-of-the-art by +7.23% and +9.07%).
A typical multi-source domain adaptation (MSDA) approach aims to transfer knowledge learned from a set of labeled source domains to an unlabeled target domain.
To address this issue, we propose a novel flow-based generative framework that consists of multiple conditional affine coupling layers for learning unseen data generation.
Our approach is flexible and can be used for both text2motion and motion2text tasks.
Ranked #8 on Motion Synthesis on KIT Motion-Language
Automated generation of 3D human motions from text is a challenging problem.
Ranked #7 on Motion Synthesis on KIT Motion-Language
The increasing use of Machine Learning (ML) components embedded in autonomous systems -- so-called Learning-Enabled Systems (LESs) -- has resulted in the pressing need to assure their functional safety.
As the Hi-C links of two adjacent contigs concentrate only at the neighbor ends of the contigs, larger contig size will reduce the power to differentiate adjacent (signal) and non-adjacent (noise) contig linkages, leading to a higher rate of mis-assembly.
This paper jointly tackles the highly correlated tasks of estimating 3D human body poses and predicting future 3D motions from RGB image sequences.
Action2motion stochastically generates plausible 3D pose sequences of a prescribed action category, which are processed and rendered by motion2video to form 2D videos.
Neural Architecture Search (NAS) automatically searches for well-performing network architectures from a given search space.
Sequential diagnosis prediction on the Electronic Health Record (EHR) has been proven crucial for predictive analytics in the medical domain.
Latent neural process, a member of NPF, is believed to be capable of modelling the uncertainty on certain points (local uncertainty) as well as the general function priors (global uncertainties).
The event camera is an emerging imaging sensor that captures the dynamics of moving objects as events, which motivates our work in estimating 3D human pose and shape from the event signals.
This paper focuses on a new problem of estimating human pose and shape from single polarization images.
In this paper, we propose a novel learning-based framework that combines the robustness of the parametric model with the flexibility of free-form 3D deformation.
A dataset of generic 3D objects with ground-truth annotated skeletons is collected.
This paper presents a novel dataset for the development of visual navigation and simultaneous localisation and mapping (SLAM) algorithms as well as for underwater intervention tasks.
Hence, some recent works train healthcare representations by incorporating medical ontology, by self-supervised tasks like diagnosis prediction, but (1) the small-scale, monotonous ontology is insufficient for robust learning, and (2) critical contexts or dependencies underlying patient journeys are barely exploited to enhance ontology learning.
However, Chamfer distance is quite sensitive to noise and outliers, and can therefore be unreliable for assigning correspondences.
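The sensitivity to outliers mentioned above can be illustrated with a minimal sketch. It assumes the common symmetric, squared-distance form of Chamfer distance averaged over each point set; the listed paper may use a different variant, and the point sets and the name `chamfer_distance` are made up for illustration.

```python
import math


def chamfer_distance(a, b):
    """Symmetric Chamfer distance between two point sets (lists of tuples).

    Assumed variant: mean squared nearest-neighbour distance, summed over
    both directions. Other normalisations exist.
    """
    def one_way(src, dst):
        # For each source point, take the squared distance to its nearest
        # neighbour in dst, then average over the source set.
        return sum(min(math.dist(p, q) ** 2 for q in dst) for p in src) / len(src)

    return one_way(a, b) + one_way(b, a)


# Hypothetical toy data: a shape, a slightly shifted copy, and the same
# copy with one far-away outlier appended.
clean = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
shifted = [(0.1, 0.0), (1.1, 0.0), (0.1, 1.0)]
with_outlier = shifted + [(10.0, 10.0)]

d_clean = chamfer_distance(clean, shifted)
d_outlier = chamfer_distance(clean, with_outlier)
print(d_clean, d_outlier)
```

In this toy case a single outlier dominates the distance, because its nearest-neighbour term is large and nothing in the plain average down-weights it.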
Generalized Zero-Shot Learning (GZSL) is the task of leveraging semantic information (e.g., attributes) to recognize the seen and unseen samples, where unseen classes are not observable during training.
Existing visually-aware models make use of the visual features as a separate collaborative signal similarly to other features to directly predict the user's preference without considering a potential bias, which gives rise to a visually biased recommendation.
Finally, we verify the proposed framework on the public KITTI dataset with different 3D object detectors.
The embedding-based large-scale query-document retrieval problem is a hot topic in the information retrieval (IR) field.
Meta-Learning (ML) has proven to be a useful tool for training Few-Shot Learning (FSL) algorithms by exposure to batches of tasks sampled from a meta-dataset.
Generalized zero-shot learning (GZSL) aims to classify samples under the assumption that some classes are not observable during training.
Few-Shot Learning (FSL) algorithms are commonly trained through Meta-Learning (ML), which exposes models to batches of tasks sampled from a meta-dataset to mimic tasks seen during evaluation.
Few-shot learning aims to train models on a limited number of labeled samples from a support set in order to generalize to unseen samples from a query set.
In this paper, we teach machines to understand visuals and natural language by learning the mapping between sentences and noisy video snippets without explicit annotations.
To the best of our knowledge, this is the first public radar dataset which provides high-resolution radar images on public roads with a large amount of road actors labelled.
Electronic health records (EHRs) are longitudinal records of a patient's interactions with healthcare systems.
Action recognition is a relatively established task, where given an input sequence of human motion, the goal is to predict its action category.
A voting strategy averages the probability distributions output from the classifiers and, given that some patches are more discriminative than others, a discrimination-based attention mechanism helps to weight each patch accordingly.
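A minimal sketch of weighted probability voting follows, assuming each patch classifier outputs a probability distribution over classes and each patch has a scalar weight. The weights here are hand-picked placeholders, not the discrimination-based attention the listed paper describes, and `weighted_vote` is a hypothetical helper name.

```python
def weighted_vote(patch_probs, weights):
    """Combine per-patch class distributions into one distribution.

    patch_probs: list of per-patch probability lists (one entry per class).
    weights: one non-negative weight per patch (assumed given; how they are
             learned is outside this sketch).
    """
    total = sum(weights)
    n_classes = len(patch_probs[0])
    # Weighted average of the per-patch probabilities, class by class.
    return [
        sum(w * p[c] for w, p in zip(weights, patch_probs)) / total
        for c in range(n_classes)
    ]


def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=lambda i: values[i])


# Hypothetical outputs of three patch classifiers over two classes.
patch_probs = [[0.6, 0.4], [0.2, 0.8], [0.5, 0.5]]

uniform = weighted_vote(patch_probs, [1.0, 1.0, 1.0])   # plain averaging
attended = weighted_vote(patch_probs, [4.0, 1.0, 1.0])  # first patch trusted more

print(argmax(uniform), argmax(attended))
```

With uniform weights this reduces to plain averaging. Up-weighting the first patch is enough to change the predicted class in this toy example.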
Inspired by the recent advances in human shape estimation from single color images, in this paper, we attempt to estimate human body shapes by leveraging the geometric cues from single polarization images.
The key challenge of patient journey understanding is to design an effective encoding mechanism which can properly tackle the aforementioned multi-level structured patient journey data with temporal sequential visits and a set of medical codes.
Tracking and grasping a dynamic object with a random trajectory is even harder.
First, based on a generative human template, an initial pairwise alignment is performed for every two frames having sufficient overlap; this is followed by a global non-rigid registration procedure, in which partial results from RGBD frames are collected into a unified 3D shape, under the guidance of correspondences from the pairwise alignment; finally, the texture map of the reconstructed human model is optimized to deliver a clear and spatially consistent texture.
Deep convolutional neural networks generally perform well in underwater object recognition tasks on both optical and sonar images.
Polarization images are known to capture polarized reflected light that preserves rich geometric cues of an object, which has motivated their recent application to reconstructing detailed surface normals of the objects of interest.
We show that corrected third-generation data can be used to count k-mer frequencies and estimate genome size reliably, in replacement of using second-generation data.
An integral part of video analysis and surveillance is temporal activity detection, that is, simultaneously recognizing and localizing activities in long untrimmed videos.
Chinese calligraphy is a unique art form with great artistic value but difficult to master.
To protect image privacy, we propose to locally perturb the image representation before revealing it to the data user.
Demand for smartwatches has taken off in recent years with new models which can run independently from smartphones and provide more useful features, becoming first-class mobile platforms.
For high resolution scene mapping and object recognition, optical technologies such as cameras and LiDAR are the sensors of choice.
Chinese calligraphy is a unique form of art that has great artistic value but is difficult to master.
In this paper, we propose a medical concept embedding method based on applying a self-attention mechanism to represent each medical concept.
Mining causality from text is a complex and crucial natural language understanding task corresponding to the human cognition.
The framework directly regresses 3D bounding boxes for all instances in a point cloud, while simultaneously predicting a point-level mask for each instance.
Ranked #12 on 3D Instance Segmentation on S3DIS (mPrec metric)
Inspired by the cognitive process of humans and animals, Curriculum Learning (CL) trains a model by gradually increasing the difficulty of the training data.
To deal with these challenges, we first adopt the deep deterministic policy gradient (DDPG) algorithm, which has the capacity to handle complex state and action spaces in continuous domains.
As it is labor-intensive to annotate semantic parts on real street views, we propose a specific approach to implicitly transfer part features from synthesized images to real street views.
Due to the sparse rewards and high degree of environment variation, reinforcement learning approaches such as Deep Deterministic Policy Gradient (DDPG) are plagued by issues of high variance when applied in complex real world environments.
Semi-supervised learning is crucial for alleviating labelling burdens in people-centric sensing.
However, GRU based approaches are unable to consistently estimate 3D shapes given different permutations of the same set of input images as the recurrent unit is permutation variant.
Ranked #1 on 3D Reconstruction on Data3D−R2N2
Modeling user-item interaction patterns is an important task for personalized recommendations.
This is further confounded by the fact that shape information about encountered objects in the real world is often impaired by occlusions, noise and missing regions; e.g., a robot manipulating an object will only be able to observe a partial view of the entire solid.
Multimodal wearable sensor data classification plays an important role in ubiquitous computing and has a wide range of applications in scenarios from healthcare to entertainment.
Modelling the physical properties of everyday objects is a fundamental prerequisite for autonomous robots.
For vehicle autonomy, driver assistance and situational awareness, it is necessary to operate at day and night, and in all weather conditions.
In this paper, we propose the Gated Multimodal Embedding LSTM with Temporal Attention (GME-LSTM(A)) model that is composed of 2 modules.
In this paper, we integrate both soft and hard attention into one context fusion model, "reinforced self-attention (ReSA)", for the mutual benefit of each other.
Ranked #56 on Natural Language Inference on SNLI
To the best of our knowledge, SMR is the first to learn embeddings of a patient-disease-medicine graph for medicine recommendation.
This paper presents a novel end-to-end framework for monocular VO by using deep Recurrent Convolutional Neural Networks (RCNNs).
UnDeepVO is able to estimate the 6-DoF pose of a monocular camera and the depth of its view by using deep neural networks.
In this paper, we propose a novel 3D-RecGAN approach, which reconstructs the complete 3D structure of a given object from a single arbitrary depth view using generative adversarial networks.
Brain-Computer Interface (BCI) is a system empowering humans to communicate with or control the outside world with exclusively brain intentions.
Human-Computer Interaction Neurons and Cognition
Electronic medical records contain multi-format electronic medical data that consist of an abundance of medical knowledge.
Machine learning techniques, namely convolutional neural networks (CNN) and regression forests, have recently shown great promise in performing 6-DoF localization of monocular images.
In this paper we present a novel approach for depth map enhancement from an RGB-D video sequence.
In this paper we present an on-manifold sequence-to-sequence learning approach to motion estimation using visual and inertial sensors.
Our method works directly with the network code and, in contrast to existing methods, can guarantee that adversarial examples, if they exist, are found for the given region and family of manipulations.
We discovered that these internal contours, which are results of convex parts on an object's surface, can lead to a tighter fit than the original visual hull.
In this paper, we propose an unsupervised feature selection method seeking a feature coefficient matrix to select the most distinctive features.