Masked auto-encoding for feature pretraining and multi-scale hybrid convolution-transformer architectures can further unleash the potentials of ViT, leading to state-of-the-art performances on image classification, detection and semantic segmentation.
Vision transformers (ViTs) have demonstrated great potential in various visual tasks, but suffer from expensive computational and memory cost problems when deployed on resource-constrained devices.
Recent advances in large-scale contrastive visual-language pretraining shed light on a new pathway for visual recognition.
Ranked #2 on Long-tail Learning on ImageNet-LT
Large-scale contrastive vision-language pre-training has shown significant progress in visual representation learning.
Object detection with Transformers (DETR) has achieved a competitive performance over traditional detectors, such as Faster R-CNN.
Transformers with remarkable global representation capacities achieve competitive results for visual tasks, but fail to consider high-level local pattern information in input images.
Edge computing is promising to become one of the next hottest topics in artificial intelligence because it benefits various evolving domains such as real-time unmanned aerial systems, industrial applications, and the demand for privacy protection.