Experiments show that the perceptual representation in the HVS is effective for predicting subjective temporal quality; TPQI thus, for the first time, achieves performance comparable to spatial quality metrics and is even more effective in assessing videos with large temporal variations.
Consisting of fragments and FANet, the proposed FrAgment Sample Transformer for VQA (FAST-VQA) enables efficient end-to-end deep VQA and learns effective video-quality-related representations (a fragment-sampling sketch follows below).
Ranked #1 on Video Quality Assessment on YouTube-UGC (using extra training data)
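A minimal sketch of the grid-based fragment sampling idea behind FAST-VQA, assuming a video tensor of shape (frames, channels, height, width); `sample_fragments` and all sizes are illustrative, not the authors' code:

```python
import torch

def sample_fragments(video, grid=7, patch=32):
    """Crop one patch x patch mini-patch from each cell of a grid x grid
    spatial layout and splice the crops into a small fragment map."""
    t, c, h, w = video.shape
    cell_h, cell_w = h // grid, w // grid
    rows = []
    for i in range(grid):
        cols = []
        for j in range(grid):
            # random offset inside the (i, j) grid cell, shared across frames
            y = i * cell_h + torch.randint(0, max(cell_h - patch, 1), (1,)).item()
            x = j * cell_w + torch.randint(0, max(cell_w - patch, 1), (1,)).item()
            cols.append(video[:, :, y:y + patch, x:x + patch])
        rows.append(torch.cat(cols, dim=3))
    return torch.cat(rows, dim=2)  # (t, c, grid*patch, grid*patch)

fragments = sample_fragments(torch.rand(8, 3, 448, 448))
print(fragments.shape)  # torch.Size([8, 3, 224, 224])
```

The intuition is that splicing raw mini-patches, rather than resizing whole frames, keeps local quality-sensitive textures intact while shrinking the input.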
Based on the prominent time-series modeling ability of transformers, we propose a novel and effective transformer-based VQA method to tackle these two issues.
To achieve this, an adversarial loss is first proposed to make the adversarial images attack the deep learning models successfully.
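The exact form of this adversarial loss is not reproduced here; a minimal PyTorch sketch of one common untargeted variant (an assumption, not the paper's exact formulation) is:

```python
import torch.nn.functional as F

def adversarial_loss(model, adv_images, true_labels):
    # Negative cross-entropy w.r.t. the true labels: minimizing this loss
    # maximizes the classification error, i.e. makes the attack succeed.
    logits = model(adv_images)
    return -F.cross_entropy(logits, true_labels)
```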
To deal with this dilemma, we propose to distill knowledge on semantic patterns for a vast variety of image contents from multiple pre-trained object classification (POC) models to an IAA model.
The global and local contexts significantly contribute to the integrity of predictions in Salient Object Detection (SOD).
Ranked #1 on Salient Object Detection on DUT-OMRON
no code implementations • 30 Mar 2022 • Qian Zheng, Ankur Purwar, Heng Zhao, Guang Liang Lim, Ling Li, Debasish Behera, Qian Wang, Min Tan, Rizhao Cai, Jennifer Werner, Dennis Sng, Maurice van Steensel, Weisi Lin, Alex C Kot
We present an automatic facial skin feature detection method that works across a variety of skin tones and age groups for selfies in the wild.
As the key component of ACCoNet, ACCoM activates the salient regions of output features of the encoder and transmits them to the decoder.
Besides, according to the experimental results of the proposed model, the JNDs of the red and blue channels are larger than that of the green one, demonstrating that more change can be tolerated in the red and blue channels; this is in line with the well-known fact that the human visual system is more sensitive to the green channel than to the red and blue ones.
Then, following the coarse-to-fine strategy, we generate an initial coarse saliency map from high-level semantic features in a Correlation Module (CorrM).
Experimental results show that the VSD can be accurately estimated with the weights learnt by the nonlinear mapping function once its associated S-VSDs are available.
Meanwhile, an image predictor is designed and trained to achieve general-quality image reconstruction from the 16-bit gray-scale profile and signal features.
In this paper, we propose a novel Multi-Content Complementation Network (MCCNet) to explore the complementarity of multiple types of content for RSI-SOD.
Instead of explicitly formulating and fusing different masking effects in a bottom-up way, the proposed JND estimation model is dedicated to first predicting a critical perceptual lossless (CPL) counterpart of the original image and then computing the difference map between the original image and the predicted CPL image as the JND map.
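The described pipeline reduces to a forward prediction followed by an absolute difference; in the NumPy sketch below, `cpl_predictor` is a hypothetical stand-in for the learned model:

```python
import numpy as np

def jnd_map(image, cpl_predictor):
    # Predict the critical perceptual lossless (CPL) counterpart, then take
    # the per-pixel absolute difference from the original as the JND map.
    cpl = cpl_predictor(image)
    return np.abs(image.astype(np.float32) - cpl.astype(np.float32))

# identity placeholder (yields an all-zero JND map); in practice this would
# be the trained deep CPL predictor
identity = lambda img: img
print(jnd_map(np.zeros((64, 64), np.uint8), identity).max())  # 0.0
```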
As a prevailing task in the video surveillance and forensics fields, person re-identification (re-ID) aims to match person images captured by non-overlapping cameras.
Then, we design a two-level perturbation fusion strategy to alleviate the conflict between the adversarial watermarks generated by different facial images and models.
In video-based dynamic point cloud compression (V-PCC), 3D point clouds are projected onto 2D images for compression with existing video codecs.
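As a heavily simplified illustration of the projection step (real V-PCC segments the cloud into patches and packs them into geometry/texture atlases), a toy orthographic depth projection could look like:

```python
import numpy as np

def project_to_depth(points, size=64):
    # Drop each 3D point (coordinates in [0, 1]) onto the XY plane,
    # keeping the nearest depth per pixel.
    depth = np.full((size, size), np.inf)
    xy = np.clip((points[:, :2] * (size - 1)).astype(int), 0, size - 1)
    for (x, y), z in zip(xy, points[:, 2]):
        depth[y, x] = min(depth[y, x], z)  # keep the closest surface point
    depth[np.isinf(depth)] = 0             # empty pixels get background depth
    return depth

cloud = np.random.rand(1000, 3)            # random points in the unit cube
print(project_to_depth(cloud).shape)       # (64, 64)
```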
Experimental results on image classification demonstrate that we successfully find the JND for deep machine vision.
We present a simple yet effective progressive self-guided loss function to facilitate deep learning-based salient object detection (SOD) in images.
In this paper, we study this problem from the viewpoint of adversarial attack and identify a totally new task, i.e., adversarial exposure attack, which generates adversarial images by tuning image exposure to mislead DNNs with significantly high transferability.
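A simplified PyTorch reading of the idea: rather than perturbing pixels additively, optimize an exposure parameter so the re-exposed image misleads the classifier (a single global log-exposure scalar here; the actual attack is assumed to be more elaborate):

```python
import torch
import torch.nn.functional as F

def exposure_attack(model, image, label, steps=50, lr=0.01):
    # image: (1, 3, H, W) in [0, 1]; label: (1,) true class index
    log_exposure = torch.zeros(1, requires_grad=True)
    optimizer = torch.optim.Adam([log_exposure], lr=lr)
    for _ in range(steps):
        adv = (image * log_exposure.exp()).clamp(0, 1)  # re-exposed image
        loss = -F.cross_entropy(model(adv), label)      # push toward misclassification
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return (image * log_exposure.exp().detach()).clamp(0, 1)
```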
To this end, we initiate the very first attempt to study this problem from the perspective of adversarial attack and propose the adversarial denoise attack.
Existing CNN-based methods for pixel labeling heavily depend on multi-scale features to meet the requirements of both semantic comprehension and detail preservation.
Given an image, our quality measure first extracts 17 features through analysis of contrast, sharpness, brightness, and more, and then yields a measure of visual quality using a regression module, which is learned with big-data training samples far larger than the relevant image datasets.
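A toy sketch of the two-stage design, with three stand-in features for the 17 described and a scikit-learn regressor in place of the learned regression module (data and scores below are random placeholders):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def quality_features(gray):
    # Three illustrative features in the spirit of the 17 described.
    gy, gx = np.gradient(gray.astype(np.float32))
    return np.array([
        gray.std(),                 # contrast: global intensity spread
        np.mean(np.hypot(gx, gy)),  # sharpness: mean gradient magnitude
        gray.mean(),                # brightness: mean intensity
    ])

X = np.stack([quality_features(np.random.rand(64, 64)) for _ in range(100)])
y = np.random.rand(100)             # stand-in subjective quality scores
regressor = RandomForestRegressor(n_estimators=50).fit(X, y)
print(regressor.predict(X[:1]))     # predicted quality for one image
```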
It is challenging to detect curved texts due to their irregular shapes and varying sizes.
The proposed method, SFA, is compared with nine representative blur-specific NR-IQA methods, two general-purpose NR-IQA methods, and two extra full-reference IQA methods on Gaussian blur images (with and without Gaussian noise/JPEG compression) and realistic blur images from multiple databases, including LIVE, TID2008, TID2013, MLIVE1, MLIVE2, BID, and CLIVE.
Recent advances in hardware technology have made front-end intelligent analysis equipped with deep learning more prevalent and practical.
In this study, we investigate the robustness of pedestrian detection algorithms to video quality degradation.
MCN predicts instance-level bounding boxes by first converting an image into a Stochastic Flow Graph (SFG) and then performing Markov Clustering on this graph.
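Plain Markov Clustering (MCL), the step named above, alternates expansion and inflation on a column-stochastic matrix until clusters emerge; the SFG construction itself is omitted in this sketch:

```python
import numpy as np

def markov_cluster(adj, expansion=2, inflation=2.0, iters=50):
    m = adj + np.eye(len(adj))                 # add self-loops
    m = m / m.sum(axis=0, keepdims=True)       # column-stochastic flow matrix
    for _ in range(iters):
        m = np.linalg.matrix_power(m, expansion)  # expansion: spread flow
        m = m ** inflation                        # inflation: sharpen strong flow
        m = m / m.sum(axis=0, keepdims=True)
    # attractor rows that retain mass index the clusters
    clusters = {frozenset(int(i) for i in np.nonzero(row > 1e-6)[0])
                for row in m if row.max() > 1e-6}
    return [set(c) for c in clusters]

adj = np.array([[0, 1, 1, 0], [1, 0, 1, 0],
                [1, 1, 0, 0], [0, 0, 0, 0]], dtype=float)
print(markov_cluster(adj))  # clusters {0, 1, 2} and {3}
```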
With the development of acquisition technology, more comprehensive information, such as depth cues, inter-image correspondence, or temporal relationships, is available to extend image saliency detection to RGBD saliency detection, co-saliency detection, or video saliency detection.
In this paper, we propose an iterative RGBD co-saliency framework, which utilizes the existing single saliency maps as the initialization and generates the final RGBD co-saliency map using a refinement-cycle model.
For many computer vision problems, deep neural networks are trained and validated based on the assumption that the input images are pristine (i.e., artifact-free).