Moreover, the positional features are embedded through a novel cyclic positional encoding (CPE) method that allows the Transformer to effectively capture the circularity and symmetry of VRP solutions (i.e., cyclic sequences).
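To illustrate what "cyclic" means for a positional encoding, here is a minimal sketch. It is not the CPE formula from the paper: it assumes a plain sinusoidal encoding whose period equals the sequence length, and the names `cyclic_encoding` and `euclid` are hypothetical. The property it demonstrates is that the last position sits as close to the first as any other pair of neighbours does.

```python
import math

def cyclic_encoding(pos, seq_len, dim):
    """Illustrative encoding (an assumption, not the paper's CPE):
    harmonics of a sinusoid whose base period is the sequence length,
    so position `seq_len` maps to the same vector as position 0."""
    if dim % 2 != 0:
        raise ValueError("dim must be even")
    enc = []
    for k in range(1, dim // 2 + 1):
        angle = 2.0 * math.pi * k * pos / seq_len
        enc.extend([math.sin(angle), math.cos(angle)])
    return enc

def euclid(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# In a tour of 8 nodes, node 7 is adjacent to node 0, exactly like node 1.
first = cyclic_encoding(0, 8, 4)
second = cyclic_encoding(1, 8, 4)
last = cyclic_encoding(7, 8, 4)
wraps_around = abs(euclid(first, last) - euclid(first, second)) < 1e-9
```

A standard (non-cyclic) sinusoidal encoding would place positions 0 and 7 far apart, which is why a periodic construction suits tour-like sequences.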
The asymmetric bilateral encoder has a transformer path and a lightweight CNN path, where the two paths communicate at each encoder stage to learn complementary global contexts and local spatial details, respectively.
We hope that the scale, diversity, and quality of our dataset can benefit researchers in this area and beyond.
We tackle the inefficiency of vision transformers caused by the high computational and space complexity of Multi-Head Self-Attention (MHSA).
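As a rough illustration of where that cost comes from, the sketch below counts attention-map entries, which grow with the square of the token count. The image size, patch sizes, and the helper name `attention_entries` are assumptions chosen for the example, not values from the paper.

```python
def attention_entries(image_size, patch_size, num_heads=1):
    """Number of entries in the attention maps of one MHSA layer:
    each head stores an N x N map for N = (image_size / patch_size)^2 tokens."""
    tokens_per_side = image_size // patch_size
    num_tokens = tokens_per_side * tokens_per_side
    return num_heads * num_tokens * num_tokens

coarse = attention_entries(224, 16)  # 14 * 14 = 196 tokens
fine = attention_entries(224, 4)     # 56 * 56 = 3136 tokens
growth = fine // coarse              # 16x more tokens gives 256x more entries
```

Shrinking the patch side by a factor of 4 multiplies the token count by 16 and the attention map by 256, which is the quadratic growth that efficient-attention designs try to avoid.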
We propose a monocular depth estimator, SC-Depth, which requires only unlabelled videos for training and enables scale-consistent prediction at inference time.
To accomplish better multi-level feature fusion, we construct the Scale-Correlated Pyramid Convolution (SCPC) to build an elegant decoder for recovering object details from the above extreme downsampling.
In all cases, our method outperforms competing methods and relevant baselines, particularly in cases where the number of annotations is small and the amount of disagreement is large.
Recent generative methods formulate GZSL as a missing data problem and mainly adopt GANs or VAEs to generate visual features for unseen classes.
Much of the recent effort on salient object detection (SOD) has been devoted to producing accurate saliency maps without being aware of their instance labels.
Recent years have seen increasing use of supervised learning methods for segmentation tasks.
Le Zhang, Damianos Karakos, William Hartmann, Manaj Srivastava, Lee Tarlin, David Akodes, Sanjay Krishna Gouda, Numra Bathool, Lingjun Zhao, Zhuolin Jiang, Richard Schwartz, John Makhoul
In this paper, we describe a cross-lingual information retrieval (CLIR) system that, given a query in English, and a set of audio and text documents in a foreign language, can return a scored list of relevant documents, and present findings in a summary form in English.
Specifically, with a diagnostic analysis, we show that the recurrent structure may be less effective at learning temporal dependencies than expected and implicitly yields an orderless representation.
The effectiveness of the triplet loss heavily relies on the triplet selection, in which a common practice is to first sample intra-class patches (positives) from the dataset for batch construction and then mine in-batch negatives to form triplets.
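The common practice described above can be sketched as follows. This is a minimal illustration under assumptions, not the paper's method: each batch row holds an anchor and its positive, every other row's positive is treated as a candidate negative, and the hardest (closest) one is mined. The function names and the margin value are hypothetical.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two descriptors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def hardest_in_batch_triplet_loss(anchors, positives, margin=1.0):
    """anchors[i] and positives[i] come from the same class.
    For each anchor, the closest positive of another row is mined as
    the negative. Requires a batch of at least two pairs."""
    n = len(anchors)
    if n < 2 or n != len(positives):
        raise ValueError("need at least two anchor/positive pairs")
    total = 0.0
    for i in range(n):
        d_pos = euclidean(anchors[i], positives[i])
        d_neg = min(euclidean(anchors[i], positives[j])
                    for j in range(n) if j != i)
        total += max(0.0, margin + d_pos - d_neg)  # hinge on the margin
    return total / n

# Two classes whose descriptors lie one unit apart along the x axis.
loss = hardest_in_batch_triplet_loss(
    anchors=[[0.0, 0.0], [1.0, 0.0]],
    positives=[[0.0, 1.0], [1.0, 1.0]],
)
```

Because negatives are drawn only from the current batch, the quality of the mined triplets depends on which positives were sampled together, which is the sensitivity the sentence above points to.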
Then according to the statistical features of noise, we propose a novel centroid update approach to enhance the robustness of clustering-based superpixel methods.
According to this, we propose three high-quality matching systems and a Coarse-to-Fine RANSAC estimator.
Nonlinear regression has been extensively employed in many computer vision problems (e.g., crowd counting, age estimation, affective computing).
In this study, we apply a deep neural network to skeletal bone age prediction and explore the use of specifically combined joint images, rather than whole-hand images, to improve accuracy.
We present a fully automatic, high throughput image parsing workflow for the analysis of cardiac MR images, and test its performance on the UK Biobank (UKB) cardiac dataset.
In this paper, we observe that the contexts of a natural image can be well expressed by a high-to-low self-learning of side-output convolutional features.
Apparent personality and emotion analysis are both central to affective computing.
Full coverage of the left ventricle (LV), from base to apex, is a basic criterion for CMR image quality and necessary for accurate measurement of cardiac volume and functional assessment.
This leaves a critical gap in the field: there are no standard datasets or evaluation metrics for comparing different feature matchers fairly.
Deep convolutional networks (ConvNets) have achieved unprecedented performance on many computer vision tasks.
Ranked #8 on Crowd Counting on WorldExpo’10
Collecting sufficient annotated data is very expensive in many applications, especially for pixel-level prediction tasks such as semantic segmentation.
In this paper, we comprehensively describe the methodology of our submissions to the One-Minute Gradual-Emotion Behavior Challenge 2018.
Semantic edge detection (SED), which aims at jointly extracting edges as well as their category information, has far-reaching applications in domains such as semantic segmentation, object proposal generation, and object recognition.
To this end, we present a uniform benchmark with novel evaluation metrics and a large-scale dataset for evaluating the overall performance of image matching methods.
Unlike conventional orthogonal decision trees that use a single feature and heuristic measures to obtain a split at each node, we propose to use a more powerful proximal SVM to obtain oblique hyperplanes to capture the geometric structure of the data better.
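The difference between the two kinds of split can be illustrated with a small sketch. The hyperplane below is fixed by hand rather than fitted with a proximal SVM, and the toy data, `oblique_split`, and `best_axis_accuracy` are hypothetical names introduced only for this example.

```python
def oblique_split(x, w, b):
    """Route a sample by the sign of w.x + b (a hyperplane over all features)."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b > 0

def accuracy(predictions, labels):
    return sum(p == bool(y) for p, y in zip(predictions, labels)) / len(labels)

def best_axis_accuracy(points, labels):
    """Best accuracy of any single-feature test x[f] > t, with thresholds
    taken from the observed feature values (an orthogonal split)."""
    best = 0.0
    for f in range(len(points[0])):
        for t in (p[f] for p in points):
            preds = [p[f] > t for p in points]
            best = max(best, accuracy(preds, labels))
    return best

# Toy data whose classes are separated by the line x + y = 1.
points = [(0.0, 0.0), (0.2, 0.6), (0.6, 0.2), (1.0, 1.0), (0.4, 0.9), (0.9, 0.4)]
labels = [0, 0, 0, 1, 1, 1]

axis_acc = best_axis_accuracy(points, labels)
oblique_acc = accuracy(
    [oblique_split(p, w=(1.0, 1.0), b=-1.0) for p in points], labels
)
```

On this data no single-feature threshold separates the classes, while one oblique hyperplane does, so an oblique tree needs one node where an orthogonal tree needs several.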