Although vision transformers (ViTs) have achieved great success in computer vision, their heavy computational cost hampers their application to dense prediction tasks, such as semantic segmentation, on mobile devices.
In this paper, we propose a conceptually novel, efficient, and fully convolutional framework for real-time instance segmentation.
Ranked #1 on Real-time Instance Segmentation on MSCOCO
For segmentation, we integrate AziNorm into KPConv.
Then we develop a transformer-based point-supervised saliency detection model to produce the first round of saliency maps.
Current benchmarks for facial expression recognition (FER) mainly focus on static images, while there are limited datasets for FER in videos.
To improve data efficiency, we propose hierarchically cascaded transformers that exploit intrinsic image structures through spectral tokens pooling and optimize the learnable parameters through latent attribute surrogates.
Ranked #1 on Few-Shot Learning on Mini-Imagenet 5-way (1-shot) (5 way 1~2 shot metric)
To move towards a practical certifiable patch defense, we introduce Vision Transformer (ViT) into the framework of Derandomized Smoothing (DS).
Recently, adversarial attacks have been applied in visual object tracking to deceive deep trackers by injecting imperceptible perturbations into video frames.
no code implementations • 17 Aug 2021 • Weier Wan, Rajkumar Kubendran, Clemens Schaefer, S. Burc Eryilmaz, Wenqiang Zhang, Dabin Wu, Stephen Deiss, Priyanka Raina, He Qian, Bin Gao, Siddharth Joshi, Huaqiang Wu, H. -S. Philip Wong, Gert Cauwenberghs
Realizing today's cloud-level artificial intelligence functionalities directly on devices distributed at the edge of the internet calls for edge hardware capable of processing multiple modalities of sensory data (e.g., video, audio) at unprecedented energy-efficiency.
In this paper, based on the observation that domain adaptation frameworks performed in the source and target domains are almost complementary in terms of image translation and SSL, we propose a novel dual path learning (DPL) framework to alleviate visual inconsistency.
We find that: (1) Different variants of the BLEU metric are used in previous works, which affects the evaluation and understanding of existing methods.
That is, we can only access training data in a high-resource language, while needing to answer multilingual questions without any labeled data in the target languages.
First, we propose a patch selection and refining scheme that finds the pixels of greatest importance for the attack and gradually removes the inconsequential perturbations.
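The pruning step described above can be sketched with a simple thresholding rule: given a perturbation and a per-pixel importance map, keep only the most important fraction of perturbed pixels and zero out the rest. This is a minimal illustrative NumPy sketch; the function name, the importance map, and the keep ratio are assumptions for illustration, not the paper's actual scheme.

```python
import numpy as np

def prune_perturbation(delta, importance, keep_ratio):
    """Zero out perturbation entries whose importance falls below the
    top `keep_ratio` fraction, removing inconsequential noise."""
    flat = importance.ravel()
    k = max(1, int(keep_ratio * flat.size))
    # Threshold at the k-th largest importance value.
    thresh = np.partition(flat, -k)[-k]
    mask = (importance >= thresh).astype(delta.dtype)
    return delta * mask

# Toy example: keep the most important half of a 2x2 perturbation.
delta = np.ones((2, 2))
importance = np.arange(4.0).reshape(2, 2)
pruned = prune_perturbation(delta, importance, keep_ratio=0.5)
```

Repeating this with a shrinking `keep_ratio` yields the gradual removal the snippet describes.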
Humans express feelings or emotions via different channels.
Human pose estimation from image and video is a vital task in many multimedia applications.
In this paper, we propose a learnable sampling module based on a variational auto-encoder (VAE) for neural architecture search (NAS), named VAENAS, which can be easily embedded into existing weight-sharing NAS frameworks (e.g., one-shot and gradient-based approaches) and significantly improves the quality of search results.
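A VAE-based sampling module of this kind typically draws latent codes via the standard reparameterization trick, which keeps samples differentiable with respect to the encoder outputs. A minimal NumPy sketch follows; the function name, dimensions, and toy values are illustrative assumptions, not the VAENAS implementation.

```python
import numpy as np

def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, I).
    Because randomness enters only through eps, z stays
    differentiable w.r.t. mu and log_var."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

rng = np.random.default_rng(0)

# Toy latent distribution over architecture codes (illustrative only).
mu = np.zeros(4)       # encoder mean
log_var = np.zeros(4)  # encoder log-variance (sigma = 1)

z = reparameterize(mu, log_var, rng)
print(z.shape)  # (4,)
```

In a NAS setting, the sampled `z` would then be decoded into an architecture choice, with the encoder parameters trained jointly with the supernet.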
Underwater images play a key role in ocean exploration, but often suffer from severe quality degradation due to light absorption and scattering in the water medium.