Existing computer vision research on artwork struggles with recognizing fine-grained attributes and suffers from a lack of curated annotated datasets, which are costly to create.
Learning-based approaches for perceptual image quality assessment (IQA) usually require both the distorted and the reference image to measure perceptual quality accurately.
Digital cameras transform sensor RAW readings into RGB images by means of their Image Signal Processor (ISP).
Autonomous robots are currently one of the most active areas of Artificial Intelligence research, having experienced significant advances in the last decade, from self-driving cars and humanoids to delivery robots and drones.
It is easier to hear birds than to see them; nevertheless, they play an essential role in nature and are excellent indicators of deteriorating environmental quality and pollution.
In this work, we propose a multi-stage ViT framework for fine-grained image classification that uses the inherent multi-head self-attention mechanism to localize informative image regions without requiring architectural changes.
Model performance is further improved by constructing multiple sets of attention networks.
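To illustrate the idea of localizing informative regions from a ViT's inherent self-attention, the following is a minimal NumPy sketch, not the authors' implementation: it assumes access to the last layer's attention weights (shape `heads × tokens × tokens`, with token 0 as the [CLS] token), averages the [CLS]-to-patch attention over heads, and selects the top-k most-attended patch indices. The function name `localize_informative_patches` and the top-k selection rule are illustrative assumptions.

```python
import numpy as np

def localize_informative_patches(attn, top_k=4):
    """Return indices of the top-k patch tokens that the [CLS] token
    attends to most, averaged over attention heads.

    attn: array of shape (heads, tokens, tokens); token 0 is [CLS].
    """
    # [CLS]-to-patch attention, averaged across heads; drop the [CLS] column
    cls_attn = attn[:, 0, 1:].mean(axis=0)
    # indices of the k most-attended patches, highest first (0-based)
    return np.argsort(cls_attn)[::-1][:top_k]

# Toy example: 3 heads, 1 [CLS] token + 9 patch tokens
rng = np.random.default_rng(0)
attn = rng.random((3, 10, 10))
attn /= attn.sum(axis=-1, keepdims=True)  # row-normalize, like softmax
patches = localize_informative_patches(attn, top_k=2)
```

In a multi-stage pipeline, the selected patch indices would be mapped back to image coordinates and the corresponding crops fed to the next stage; since the attention weights are already computed in the forward pass, no architectural change is needed.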