This paper proposes a novel method for 3D shape representation learning, namely Hyperbolic Embedded Attentive Representation (HEAR).
To exploit the progressive interactions among these regions, we represent them as a region graph, on which the parts relation reasoning is performed with graph convolutions, thus leading to our PRR branch.
We present DFormer, a novel RGB-D pretraining framework to learn transferable representations for RGB-D segmentation tasks.
Ranked #1 on RGB-D Salient Object Detection on DES
3D occupancy prediction, which quantizes 3D scenes into grid cells with semantic labels, holds significant promise for robot perception and autonomous driving.
A standard convex SPCA-based model with PSD constraint for unsupervised feature selection is proposed.
These boundary proposals are then incorporated into the proposed image segmentation model, such that the target segmentation contours are made up of a set of selected boundary proposals and the corresponding geodesic paths linking them.
In summary, this survey paper provides, for the first time, a comprehensive understanding of deep multi-modal learning for various BL generation and recognition tasks.
In our in-depth examination, we delve into various facets of FSCIL, encompassing the problem definition, the discussion of primary challenges of unreliable empirical risk minimization and the stability-plasticity dilemma, general schemes, and relevant problems of incremental learning and few-shot learning.
Multimedia recommendation involves personalized ranking tasks, where multimedia content is usually represented using a generic encoder.
1 code implementation • 27 Jul 2023 • Lingdong Kong, Yaru Niu, Shaoyuan Xie, Hanjiang Hu, Lai Xing Ng, Benoit R. Cottereau, Ding Zhao, Liangjun Zhang, Hesheng Wang, Wei Tsang Ooi, Ruijie Zhu, Ziyang Song, Li Liu, Tianzhu Zhang, Jun Yu, Mohan Jing, Pengwei Li, Xiaohua Qi, Cheng Jin, Yingfeng Chen, Jie Hou, Jie Zhang, Zhen Kan, Qiang Ling, Liang Peng, Minglei Li, Di Xu, Changpeng Yang, Yuanqi Yao, Gang Wu, Jian Kuai, Xianming Liu, Junjun Jiang, Jiamian Huang, Baojun Li, Jiale Chen, Shuang Zhang, Sun Ao, Zhenyu Li, Runze Chen, Haiyong Luo, Fang Zhao, Jingze Yu
In this paper, we summarize the winning solutions from the RoboDepth Challenge -- an academic competition designed to facilitate and advance robust OoD depth estimation.
To address this limitation, we present Masked Multi-view with Swin Transformers (SwinMM), a novel multi-view pipeline for enabling accurate and data-efficient self-supervised medical image analysis.
In this paper, we propose a hybrid point-wise Radar-Optical fusion approach for object detection in autonomous driving scenarios.
To remedy this, we propose Two-stage Causal Modeling (TsCM) for the SGG task, which takes the long-tailed distribution and semantic confusion as confounders to the Structural Causal Model (SCM) and then decouples the causal intervention into two stages.
This approach achieves feature integration in a unified backbone, removing the need for carefully-designed fusion modules and resulting in a more effective and efficient VL tracking framework.
In this technical report, we present our solution, named UniOCC, for the Vision-Centric 3D occupancy prediction track in the nuScenes Open Dataset Challenge at CVPR 2023.
Ranked #3 on Prediction Of Occupancy Grid Maps on Occ3D-nuScenes
The occlusion problem remains a key challenge in Optical Flow Estimation (OFE) despite the significant recent progress brought by deep learning in the field.
Given an audio clip and a reference face image, the goal of talking head generation is to produce a high-fidelity talking head video.
Cued Speech (CS) is a multi-modal visual coding system combining lip reading with several hand cues at the phonetic level to make the spoken language visible to the hearing impaired.
Audio-visual speech recognition (AVSR) gains increasing attention from researchers as an important part of human-computer interaction.
These triggers have demonstrated strong attack performance even under backdoor defense, which aims to eliminate or suppress the backdoor effect in the model.
This paper introduces a novel explainable image quality evaluation approach called X-IQE, which leverages visual large language models (LLMs) to evaluate text-to-image generation methods by generating textual explanations.
We introduce FedAds, the first benchmark for CVR estimation with vFL, to facilitate standardized and systematical evaluations for vFL algorithms.
As the first comprehensive review of the segment anything task for vision and beyond, based on the SAM foundation model, this work focuses on SAM's applications to various tasks and data types, discussing its historical development, recent progress, and profound impact on a broad range of applications.
Deep Neural Networks (DNNs), from AlexNet to ResNet to ChatGPT, have made revolutionary progress in recent years, and are widely used in various fields.
In practice, the expensive cost of data annotation and the continuously increasing categories of new pills make it meaningful to develop a few-shot class-incremental pill recognition system.
This paper proposes a novel module called middle spectrum grouped convolution (MSGC) for efficient deep convolutional neural networks (DCNNs) with the mechanism of grouped convolution.
The proposed method contributes a mixed clutter variants generation strategy and a new inference branch equipped with channel-weighted mean square error (CWMSE) loss for invariant representation learning.
Interestingly, during the training phase supervised by point labels, we discover that CNNs first learn to segment a cluster of pixels near the targets, and then gradually converge to predicting the ground-truth point labels.
Therefore, the degree of overfitting for clutter reflects the non-causality of deep learning in SAR ATR.
The ever-increasing demand for intuitive interactions in Virtual Reality has triggered a boom in the realm of Facial Expression Recognition (FER).
Deep learning has been highly successful in computer vision with large amounts of labeled data, but struggles with limited labeled training data.
Experiments on two medical image datasets (i.e., the ISIC 2018 challenge and ChestX-ray14) show that our method outperforms state-of-the-art SSL methods.
Some paradigms have been recently developed to explore this adversarial phenomenon occurring at different stages of a machine learning system, such as training-time adversarial attack (i.e., backdoor attack), deployment-time adversarial attack (i.e., weight attack), and inference-time adversarial attack (i.e., adversarial example).
The goal of Few-Shot Continual Learning (FSCL) is to incrementally learn novel tasks with limited labeled samples while simultaneously preserving previous capabilities, yet current FSCL methods all target the class-incremental setting.
Given a model well-trained on a large-scale base dataset, Few-Shot Class-Incremental Learning (FSCIL) aims at incrementally learning novel classes from a few labeled samples without overfitting, and without catastrophically forgetting previously encountered classes.
While US imaging is not a standard paradigm for spinal surgery, the scarcity of intra-operative clinical US data is an insurmountable bottleneck in training a neural network.
Various datasets have been proposed for simultaneous localization and mapping (SLAM) and related problems.
Aligning users across networks using graph representation learning has been found effective where the alignment is accomplished in a low-dimensional embedding space.
In this paper, we propose a scheme for learning Target Inner-Geometry from the LiDAR modality in camera-based BEV detectors, for both dense depth and BEV features, termed TiG-BEV.
In this paper, we propose a deep model called Attention-based Multiple Dimensions EEG Transformer (AMDET), which can exploit the complementarity among the spectral-spatial-temporal features of EEG data by employing the multi-dimensional global attention mechanism.
In this work, we pre-train DNNs on ultrasound (US) domains instead of ImageNet to reduce the domain gap in medical US applications.
Deep transfer learning (DTL) has formed a long-term quest toward enabling deep neural networks (DNNs) to reuse historical experiences as efficiently as humans.
In the first stage, we propose a novel algorithm called polar decomposition-based orthogonal initialization (PDOI) to find a good initialization for the orthogonal optimization.
no code implementations • 7 Nov 2022 • Andrey Ignatov, Radu Timofte, Cheng-Ming Chiang, Hsien-Kai Kuo, Yu-Syuan Xu, Man-Yu Lee, Allen Lu, Chia-Ming Cheng, Chih-Cheng Chen, Jia-Ying Yong, Hong-Han Shuai, Wen-Huang Cheng, Zhuang Jia, Tianyu Xu, Yijian Zhang, Long Bao, Heng Sun, Diankai Zhang, Si Gao, Shaoli Liu, Biao Wu, Xiaofeng Zhang, Chengjian Zheng, Kaidi Lu, Ning Wang, Xiao Sun, HaoDong Wu, Xuncheng Liu, Weizhan Zhang, Caixia Yan, Haipeng Du, Qinghua Zheng, Qi Wang, Wangdu Chen, Ran Duan, Mengdi Sun, Dan Zhu, Guannan Chen, Hojin Cho, Steve Kim, Shijie Yue, Chenghua Li, Zhengyang Zhuge, Wei Chen, Wenxu Wang, Yufeng Zhou, Xiaochen Cai, Hengxing Cai, Kele Xu, Li Liu, Zehua Cheng, Wenyi Lian, Wenjing Lian
While numerous solutions have been proposed for this problem, they are usually quite computationally demanding, exhibiting low FPS rates and poor power efficiency on mobile devices.
For ViTs, DyBinaryCCT presents the superiority of the convolutional embedding layer in fully binarized ViTs and achieves 56.1% on the ImageNet dataset, which is nearly 9% higher than the baseline.
Moreover, the privacy of the system is analyzed to ensure the security of the real data.
This work proposes a hierarchical contrastive learning (HiCo) method to improve the transferability for the US video model pretraining.
Computer models have been extensively adopted to overcome the time limitations of studying language evolution, transforming language theory into physical modeling mechanisms and helping to explore the general laws of the evolution.
Precisely, the presence of scalar features makes the major part of the network binarizable, while vector features serve to retain rich structural information and ensure SO(3) equivariance.
Toward building more robust DNN-based SAR ATR models, this article explores the domain knowledge of SAR imaging process and proposes a novel Scattering Model Guided Adversarial Attack (SMGAA) algorithm which can generate adversarial perturbations in the form of electromagnetic scattering response (called adversarial scatterers).
Different from existing models, in this paper, we propose a new interpretation method that explains the image similarity models by salience maps and attribute words.
Adversarial training (AT) with samples generated by Fast Gradient Sign Method (FGSM), also known as FGSM-AT, is a computationally simple method to train robust networks.
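FGSM itself is simple enough to sketch. The minimal numpy example below (a hypothetical `fgsm_perturb` helper, using a logistic-regression model so the input gradient has a closed form) shows the single-step perturbation x + eps * sign(∇x L) that FGSM-AT trains against; it is a generic illustration, not the paper's training recipe.

```python
import numpy as np

def fgsm_perturb(x, w, b, y, eps):
    """Single-step FGSM on a logistic-regression model.

    For binary cross-entropy, the gradient of the loss w.r.t. the
    input x is (sigmoid(w.x + b) - y) * w, so the adversarial
    example is x + eps * sign(grad).
    """
    p = 1.0 / (1.0 + np.exp(-(x @ w + b)))  # model's predicted probability
    grad_x = (p - y) * w                    # dL/dx in closed form
    return x + eps * np.sign(grad_x)

# A correctly classified point (score w.x + b = 0.8 > 0) is pushed
# across the decision boundary by one FGSM step.
w, b = np.array([2.0, -1.0]), 0.0
x, y = np.array([0.5, 0.2]), 1.0
x_adv = fgsm_perturb(x, w, b, y, eps=0.6)   # approximately [-0.1, 0.8]
```

In FGSM-AT, such perturbed samples are generated on the fly at every iteration and fed back as training inputs, which is what makes the method computationally cheap compared with multi-step attacks.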
The goal of Cross-Domain Few-Shot Classification (CDFSC) is to accurately classify a target dataset with limited labelled data by exploiting the knowledge of a richly labelled auxiliary dataset, despite the differences between the domains of the two datasets.
Neural networks (deep learning) are a modern class of Artificial Intelligence models and have been exploited in Survival Analysis.
Then, we propose a regression model for the HCD, which decomposes the source signal into the regressed signal and the changed signal, and requires the regressed signal to have the same spectral property as the target signal on the same graph.
Particularly, the model integrates macro-level guided-category knowledge and micro-level open-domain dialogue data for training, leveraging prior knowledge in the latent space, which enables the model to disentangle the latent variables at the mesoscopic scale.
Few-Shot Class-Incremental Learning (FSCIL) aims at incrementally learning novel classes from a few labeled samples while avoiding overfitting and catastrophic forgetting simultaneously.
In-memory approximate nearest neighbor search (ANNS) algorithms have achieved great success in fast high-recall query processing, but are extremely inefficient when handling hybrid queries with both unstructured (i.e., feature vectors) and structured (i.e., related attributes) constraints.
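For intuition, a hybrid query can always be answered exactly by filtering on the structured attribute and then ranking the survivors by vector distance; the inefficiency arises because this brute-force scan does not scale. The sketch below uses hypothetical names and is not tied to any specific ANNS library.

```python
import numpy as np

def hybrid_query(vectors, attrs, q_vec, q_attr, k=1):
    """Exact hybrid search by pre-filtering: keep only the items whose
    structured attribute satisfies the constraint, then rank the
    survivors by Euclidean distance to the query vector."""
    idx = np.array([i for i, a in enumerate(attrs) if a == q_attr])
    dists = np.linalg.norm(vectors[idx] - q_vec, axis=1)
    return idx[np.argsort(dists)[:k]]

vectors = np.array([[0.0, 0.0], [1.0, 1.0], [0.1, 0.0], [5.0, 5.0]])
attrs = ["red", "blue", "blue", "blue"]
# Item 0 is the nearest vector overall but fails the attribute filter;
# item 2 is the nearest "blue" item.
result = hybrid_query(vectors, attrs, np.zeros(2), "blue", k=1)
```

Graph- or tree-based ANNS indexes avoid this full scan, but must then decide whether to filter before, during, or after traversal, which is the core difficulty hybrid-query systems address.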
In this work, we propose a meta-learning-based LR tuner, named MetaLR, to make different layers automatically co-adapt to downstream tasks based on their transferabilities across domains.
Face recognition is one of the most active tasks in computer vision and has been widely used in the real world.
Visual speech, referring to the visual domain of speech, has attracted increasing attention due to its wide applications, such as public security, medical treatment, military defense, and film entertainment.
Acoustic-to-articulatory inversion (AAI) aims to recover the movement of the articulators from speech signals.
Experimental results show that the speech synthesized by our model is comparable to the personalized speech synthesized in previous works by training on large amounts of audio data.
In this work, we propose WebUAV-3M, the largest public UAV tracking benchmark to date, to facilitate both the development and evaluation of deep UAV trackers.
Aircraft detection in Synthetic Aperture Radar (SAR) imagery is a challenging task in SAR Automatic Target Recognition (SAR ATR) due to aircraft's extremely discrete appearance, obvious intraclass variation, small size, and severe background interference.
In CRYPTO 2019, Gohr made a pioneering attempt and successfully applied deep learning to the differential cryptanalysis against NSA block cipher SPECK32/64, achieving higher accuracy than the pure differential distinguishers.
Weakly supervised learning can help local feature methods to overcome the obstacle of acquiring a large-scale dataset with densely labeled correspondences.
Ranked #1 on Camera Localization on Aachen Day-Night benchmark
Specifically, motivated by the local motion prior in the spatio-temporal dimension, we propose a local spatio-temporal attention module to perform implicit frame alignment and incorporate the local spatio-temporal information to enhance the local features (especially for small targets).
Since a linear quantizer (i.e., the round(*) function) cannot well fit the bell-shaped distributions of weights and activations, many existing methods use pre-defined functions (e.g., an exponential function) with learnable parameters to build the quantizer for joint optimization.
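The contrast between the two quantizer families can be sketched in a few lines of numpy: a linear round-based quantizer spaces its levels evenly, while a companding quantizer built from a pre-defined power function with a tunable exponent (a simplified stand-in for the learnable quantizers mentioned above, with hypothetical helper names) concentrates levels near zero, where bell-shaped distributions put most of their mass.

```python
import numpy as np

def uniform_quantize(x, bits, x_max):
    """Linear quantizer: clip, scale, round to 2**bits evenly spaced levels."""
    n = 2 ** bits - 1
    xc = np.clip(x, -x_max, x_max)
    return np.round((xc + x_max) / (2 * x_max) * n) / n * (2 * x_max) - x_max

def companded_quantize(x, bits, x_max, alpha=0.5):
    """Non-uniform quantizer via companding: compress with the power
    function sign(x)*|x|**alpha (alpha would be learned jointly with
    the network), quantize uniformly, then expand.  The resulting
    levels are denser near zero, matching bell-shaped distributions."""
    comp = np.sign(x) * (np.abs(x) / x_max) ** alpha * x_max
    q = uniform_quantize(comp, bits, x_max)
    return np.sign(q) * (np.abs(q) / x_max) ** (1.0 / alpha) * x_max

# Near zero, the companded quantizer resolves more distinct levels
# than the linear one at the same bit-width.
grid = np.linspace(-0.1, 0.1, 1001)
n_uniform = len(np.unique(np.round(uniform_quantize(grid, 4, 1.0), 6)))
n_companded = len(np.unique(np.round(companded_quantize(grid, 4, 1.0), 6)))
```

In learnable-quantizer methods the exponent (or an equivalent shape parameter) is optimized together with the network weights rather than fixed in advance.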
Multi-view clustering has received increasing attention due to its effectiveness in fusing complementary information without manual annotations.
Video transformers have achieved impressive results on major video recognition benchmarks, which however suffer from high computational cost.
Social network alignment aims at aligning person identities across social networks.
Ultrasound (US) imaging is commonly used to assist in the diagnosis and interventions of spine diseases, while the standardized US acquisitions performed by manually operating the probe require substantial experience and training of sonographers.
MMC-HVDC-connected offshore wind farms (OWFs) can suffer short-circuit faults (SCFs), yet their transient stability is not well analysed.
The results show that DRCA improved the classification accuracy on six subjects (p < 0.05) compared with the baseline models trained only with the source-domain data, while CPSC did not guarantee an accuracy improvement.
The experimental results prove that our method is an effective and straightforward way to reduce information loss and enhance the performance of BNNs.
A faster version of PiDiNet with fewer than 0.1M parameters can still achieve performance comparable to the state of the art at 200 FPS.
Ranked #2 on Edge Detection on BRIND
As an effective framework to quantify the prediction reliability, conformal prediction (CP) was developed with the CPKNN (CP with kNN).
To mitigate this problem, a viable approach is to integrate complementary knowledge from other MMKGs.
In each hypergraph, different temporal granularities are captured by hyperedges that connect a set of graph nodes (i.e., part-based features) across different temporal ranges.
Ranked #5 on Person Re-Identification on iLIDS-VID
Extensive experiments on the novel dataset as well as three existing datasets clearly demonstrate the effectiveness of the proposed framework for both group-based re-id tasks.
In this study, we classified the different origins of three categories of herbal medicines with different feature extraction methods: manual feature extraction, mathematical transformation, and deep learning algorithms.
Person search aims to simultaneously localize and identify a query person from realistic, uncropped images, which can be regarded as the unified task of pedestrian detection and person re-identification (re-id).
Ranked #10 on Person Search on CUHK-SYSU
Furthermore, we propose a confidence-based approach to encode the optimization of image quality in the learning process.
Furthermore, this study provides a systematic analysis of different augmentation strategies.
In recent years a vast amount of visual content has been generated and shared in many fields, such as social media platforms, medical imaging, and robotics.
Scene classification, aiming at classifying a scene image to one of the predefined scene categories by comprehending the entire image, is a longstanding, fundamental and challenging problem in computer vision.
Typically, this requires an agent to fully understand the knowledge from the given text materials and generate correct and fluent novel paragraphs, which is very challenging in practice.
Ranked #3 on KG-to-Text Generation on AGENDA
Long-tailed visual class recognition poses significant challenges to traditional machine learning and emerging deep networks due to its inherent class imbalance.
In this way, the generated partition can guide multi-view matrix factorization to produce a more purposive coefficient matrix which, as feedback, improves the quality of the partition.
Extensive experiments on two vision tasks, including ImageNet classification and Pascal VOC segmentation, demonstrate the superiority of our ICKD, which consistently outperforms many existing methods, advancing the state-of-the-art in the fields of Knowledge Distillation.
Ranked #17 on Knowledge Distillation on ImageNet
After that, we theoretically show that the objective of SimpleMKKM is a special case of this local kernel alignment criterion with each base kernel matrix normalized.
This paper investigates an issue of distributed fusion estimation under network-induced complexity and stochastic parameter uncertainties.
To alleviate this problem, an US dataset named US-4 is constructed for direct pretraining on the same domain.
A major challenge in Fine-Grained Visual Classification (FGVC) is distinguishing various categories with high inter-class similarity by learning features that differentiate the details.
Ranked #10 on Fine-Grained Image Classification on FGVC Aircraft
no code implementations • 12 Nov 2020 • Moloud Abdar, Farhad Pourpanah, Sadiq Hussain, Dana Rezazadegan, Li Liu, Mohammad Ghavamzadeh, Paul Fieguth, Xiaochun Cao, Abbas Khosravi, U Rajendra Acharya, Vladimir Makarenkov, Saeid Nahavandi
Uncertainty quantification (UQ) plays a pivotal role in reduction of uncertainties during both optimization and decision making processes.
Hence, in this paper, we propose to recommend an appropriate sticker to a user based on the multi-turn dialog context and the user's sticker-usage history.
With the development of radiomics, noninvasive diagnosis like ultrasound (US) imaging plays a very important role in automatic liver fibrosis diagnosis (ALFD).
no code implementations • 26 Oct 2020 • Gang Wang, Qunxi Dong, Jianfeng Wu, Yi Su, Kewei Chen, Qingtang Su, Xiaofeng Zhang, Jinguang Hao, Tao Yao, Li Liu, Caiming Zhang, Richard J Caselli, Eric M Reiman, Yalin Wang
With hippocampal UMIs, the estimated minimum sample sizes needed to detect a 25$\%$ reduction in the mean annual change with 80$\%$ power and two-tailed $P=0.05$ are 116, 279 and 387 for the longitudinal $A\beta+$ AD, $A\beta+$ mild cognitive impairment (MCI) and $A\beta+$ CU groups, respectively.
Binary neural networks (BNNs), where both weights and activations are binarized to 1 bit, have been widely studied in recent years due to their great benefits of highly accelerated computation and a substantially reduced memory footprint, which appeal to the development of resource-constrained devices.
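The basic weight binarization underlying most BNNs can be sketched as follows (XNOR-Net-style per-tensor scaling, a generic illustration rather than the specific method of any entry above): each real-valued tensor is replaced by a 1-bit sign tensor plus a single scaling factor.

```python
import numpy as np

def binarize_weights(W):
    """XNOR-Net-style binarization: approximate W by alpha * B, where
    B = sign(W) needs only 1 bit per weight and the per-tensor scale
    alpha = mean(|W|) minimizes ||W - alpha * B||^2."""
    B = np.where(W >= 0, 1.0, -1.0)
    alpha = np.abs(W).mean()
    return alpha, B

W = np.array([0.3, -0.1, 0.7, -0.5])
alpha, B = binarize_weights(W)
# alpha = 0.4, B = [1, -1, 1, -1]; the 1-bit approximation of W is
# alpha * B = [0.4, -0.4, 0.4, -0.4]
```

With both weights and activations binarized this way, dense multiply-accumulates reduce to XNOR and popcount operations, which is the source of the speed and memory savings.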
The focus of this survey is on the analysis of two modalities of multimodal deep learning: image and text.
To this end, in this work, we present a novel end-to-end 3D lip motion Network (3LMNet) by utilizing the sentence-level 3D lip motion (S3DLM) to recognize speakers in both the text-independent and text-dependent contexts.
However, there is a lack of a sufficiently reasonable mechanism for measuring each agent's contribution when distributing the reward.
Results show that GW consistently improves the performance of different architectures, with absolute gains of $1.02\%\sim1.49\%$ in top-1 accuracy on ImageNet and $1.82\%\sim3.21\%$ in bounding-box AP on COCO.
Normalization techniques are essential for accelerating the training and improving the generalization of deep neural networks (DNNs), and have successfully been used in various applications.
Ultrasound (US) is a non-invasive yet effective medical diagnostic imaging technique for the COVID-19 global pandemic.
A key challenge of oversampling in imbalanced classification is that the generation of new minority samples often neglects the majority classes, resulting in most new minority samples spreading across the whole minority space.
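The failure mode described above is easy to reproduce with a SMOTE-style oversampler, sketched below with hypothetical names: synthetic samples are convex combinations of minority neighbours, so the majority class is never consulted.

```python
import numpy as np

def smote_like(minority, n_new, k=2, seed=0):
    """SMOTE-style oversampling: each synthetic point is a random convex
    combination of a minority sample and one of its k nearest minority
    neighbours.  Majority samples are never consulted, which is exactly
    the neglect described above."""
    rng = np.random.default_rng(seed)
    out = []
    for _ in range(n_new):
        i = rng.integers(len(minority))
        dists = np.linalg.norm(minority - minority[i], axis=1)
        nbrs = np.argsort(dists)[1:k + 1]   # skip the point itself
        j = rng.choice(nbrs)
        lam = rng.random()                  # interpolation coefficient
        out.append(minority[i] + lam * (minority[j] - minority[i]))
    return np.array(out)

minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
synthetic = smote_like(minority, 20)
# All synthetic points lie inside the minority hull, regardless of
# where the majority class sits.
```

Majority-aware oversamplers instead bias the interpolation away from regions occupied by majority samples, e.g., near the decision boundary.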
(1) The teacher model serves a dual role as teacher and student, such that the teacher's predictions on unlabeled images may be very close to those of the student, which limits the upper bound of the student.
Given a speaker's speech, it is interesting to see if it is possible to generate this speaker's face.
The key ideas are two-fold: a) explicitly modeling the dependencies among joints and the relations between the pixels and the joints for better local feature representation learning; b) unifying the dense pixel-wise offset predictions and direct joint regression for end-to-end training.
In this paper, we propose dynamic group convolution (DGC), which adaptively selects which parts of the input channels are connected within each group for individual samples on the fly.
This work introduces pyramidal convolution (PyConv), which is capable of processing the input at multiple filter scales.
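The idea behind pyramidal convolution can be illustrated in 1-D: apply kernels of several sizes to the same input in parallel and stack the responses, so a single layer sees both fine and coarse context. The sketch below (hypothetical helper names, simple averaging kernels) is a conceptual illustration, not the PyConv implementation.

```python
import numpy as np

def conv1d_same(x, kernel):
    """Naive 'same'-padded 1-D convolution."""
    pad = len(kernel) // 2
    xp = np.pad(x, pad)
    return np.array([xp[i:i + len(kernel)] @ kernel for i in range(len(x))])

def pyconv1d(x, kernel_sizes=(3, 5, 9)):
    """Pyramidal convolution sketch: kernels of several sizes run in
    parallel over the same input, and their responses are stacked so a
    single layer captures both fine and coarse context."""
    return np.stack([conv1d_same(x, np.ones(k) / k) for k in kernel_sizes])

x = np.zeros(11)
x[5] = 1.0                      # a single impulse
y = pyconv1d(x)                 # shape (3, 11): one response per scale
```

In the 2-D version, PyConv also varies the group count per scale so the multi-scale layer stays within a normal convolution's compute budget.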
Ranked #69 on Semantic Segmentation on ADE20K val
Compared with static views, abundant dynamic properties between video frames are beneficial to refined depth estimation, especially for dynamic objects.
This work studies black-box adversarial attacks against deep neural networks (DNNs), where the attacker can only access the query feedback returned by the attacked DNN model, while other information such as model parameters or the training datasets are unknown.
We successfully train a 404-layer deep CNN on the ImageNet dataset and a 3002-layer network on CIFAR-10 and CIFAR-100, while the baseline is not able to converge at such extreme depths.
Orthogonality is widely used for training deep neural networks (DNNs) due to its ability to maintain all singular values of the Jacobian close to 1 and reduce redundancy in representation.
Our work originates from the observation that while various whitening transformations equivalently improve the conditioning, they show significantly different behaviors in discriminative scenarios and training Generative Adversarial Networks (GANs).
Stickers with vivid and engaging expressions are becoming increasingly popular in online messaging apps, and some works are dedicated to automatically selecting sticker responses by matching the text labels of stickers with previous utterances.
Tubular structure tracking is a crucial task in the fields of computer vision and medical image analysis.
One bottleneck (i.e., binary codes) conveys the high-level intrinsic data structure captured by the code-driven graph to the other (i.e., continuous variables for low-level detail information), which in turn propagates the updated network feedback for the encoder to learn more discriminative binary codes.
To this end, we propose layer-wise conditioning analysis, which explores the optimization landscape with respect to each layer independently.
The key challenge for video SR lies in the effective exploitation of temporal dependency between consecutive frames.
Ranked #6 on Video Super-Resolution on MSU Super-Resolution for Video Compression (BSQ-rate over ERQA metric)
To stimulate future research, this paper presents a comprehensive review of recent progress in deep learning methods for point clouds.
DEAN can be interpreted as a GOF game between two generative networks, where one explicit generative network learns an energy-based distribution that fits the real data, and the other implicit generative network is trained by minimizing a GOF test statistic between the energy-based distribution and the generated data, such that the underlying distribution of the generated data is close to the energy-based distribution.
The popular $L_2$-norm and M-estimator are employed to fit the data in the standard image CS and robust CS problems, respectively.
The primary goal of skeletal motion prediction is to generate future motion by observing a sequence of 3D skeletons.
Binary optimization, a representative subclass of discrete optimization, plays an important role in mathematical optimization and has various applications in computer vision and machine learning.
Recent binary representation learning models usually require sophisticated binary optimization, similarity measure or even generative models as auxiliaries.
Specifically, to integrate the insights of matching based and propagation based methods, we employ an encoder-decoder framework to learn pixel-level similarity and segmentation in an end-to-end manner.
It is computationally efficient and only marginally increases the cost of computing LBPTOP, yet is extremely effective for ME recognition.
This is motivated by the fact that finding closely similar pixels is more feasible than similar patches in natural images, which can be used to enhance image denoising performance.
A simple but useful observation on our NAC is: as long as the noise is weak, it is feasible to learn a self-supervised network only with the corrupted image, approximating the optimal parameters of a supervised network learned with pairs of noisy and clean images.
A novel Structure and Texture Aware Retinex (STAR) model is further proposed for illumination and reflectance decomposition of a single image.
To address this issue, we present an adaptive multi-model framework that resolves polysemy by visual disambiguation.
With the support of SND, we provide natural explanations for several phenomena from the perspective of optimization, e.g., why group-wise whitening of DBN generally outperforms full whitening and why the accuracy of BN degenerates with reduced batch sizes.
Experiments show that our selected features have achieved a precision rate of 86.77%, a recall rate of 89.03% and an F1 score of 87.89%.
In this paper, we propose a generative adversarial model based on prior knowledge and attention mechanism to achieve the generation of irradiated material images (data-to-image model), and a prediction model for corresponding industrial performance (image-to-data model).
How to economically cluster large-scale multi-view images is a long-standing problem in computer vision.
Object detection, one of the most fundamental and challenging problems in computer vision, seeks to locate object instances from a large number of predefined categories in natural images.
Despite the remarkable success of Convolutional Neural Networks (CNNs) on generalized visual tasks, high computational and memory costs restrict their comprehensive applications on consumer electronics (e.g., portable or smart wearable devices).
The generative model learns a mapping such that the distribution of sketches becomes indistinguishable from the distribution of natural images under an adversarial loss, and simultaneously learns an inverse mapping based on a cycle-consistency loss to enhance this indistinguishability.
To advance subtle expression recognition, we contribute a Large-scale Subtle Emotions and Mental States in the Wild database (LSEMSW).
First, the formula is abstractly expressed as a multiway tree model, and then each step of the formula derivation transformation is abstracted as a mapping of multiway trees.
While many image colorization algorithms have recently shown the capability of producing plausible color versions from gray-scale photographs, they still suffer from the problems of context confusion and edge color bleeding.
Results: The error rates for different fractional orders of FrScatNet are examined and show that the classification accuracy is significantly improved in the fractional scattering domain.
Scene text recognition has drawn great attention in the computer vision and artificial intelligence communities due to its challenges and wide applications.
In this paper, we propose an end-to-end generic salient object segmentation model called Metric Expression Network (MEnet) to deal with saliency detection with the tolerance of distortion.
Because extreme scale variations are not necessarily present in most standard texture databases, we are developing a new dataset, the Extreme Scale Variation Textures (ESVaT), to support the proposed extreme-scale aspects of texture understanding and to test the performance of our framework.
Texture is a fundamental characteristic of many types of images, and texture representation is one of the essential and challenging problems in computer vision and pattern recognition which has attracted extensive research attention.
We introduce a novel approach for annotating a large quantity of in-the-wild facial images with high-quality posterior age distributions as labels.
Ranked #6 on Age Estimation on MORPH Album2 (using extra training data)
To eliminate manual annotation, in this work, we propose a novel image dataset construction framework by employing multiple textual queries.
Cross-modal hashing is usually regarded as an effective technique for large-scale textual-visual cross retrieval, where data from different modalities are mapped into a shared Hamming space for matching.
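The "shared Hamming space" idea can be sketched with sign-of-projection hashing: real-valued features from either modality are projected and binarized, and retrieval reduces to Hamming distance. The toy below assumes, for simplicity, that both modalities are already embedded in a common feature space and share one projection; real cross-modal methods learn a separate projection per modality so that paired items receive similar codes.

```python
import numpy as np

def hash_codes(feats, proj):
    """Binary codes obtained by signing a linear projection of the features."""
    return (feats @ proj > 0).astype(np.uint8)

def hamming(a, b):
    """Number of differing bits between two binary codes."""
    return int(np.count_nonzero(a != b))

# One shared 2-D -> 3-bit projection for both modalities (toy setup).
P = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, -1.0]])
img = np.array([1.0, 0.5])          # image feature
txt_pair = np.array([0.9, 0.6])     # its paired text feature (nearby)
txt_other = np.array([-1.0, 1.0])   # an unrelated text feature
c_img = hash_codes(img, P)          # [1, 1, 1]
c_pair = hash_codes(txt_pair, P)    # [1, 1, 1] -> Hamming distance 0
c_other = hash_codes(txt_other, P)  # [0, 1, 0] -> Hamming distance 2
```

Matching then only needs bitwise XOR and popcount over the stored codes, which is what makes hashing attractive at large scale.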
Numerous methods have been proposed for person re-identification, most of which however neglect the matching efficiency.
Our ZSECOC equips the conventional ECOC with the additional capability of ZSAR, by addressing the domain shift problem.
Ranked #4 on Zero-Shot Action Recognition on Olympics