Masked image modeling (MIM) methods achieve great success in various visual tasks but remain largely unexplored in knowledge distillation for heterogeneous deep models.
Hybrid models that combine self-attention and convolution are one approach to making ViT lighter.
This technique allows our method to achieve a superior trade-off between editability and high fidelity to the input image.
Pretraining on large-scale datasets can boost the performance of object detectors, but annotated datasets for object detection are hard to scale up due to the high labeling cost.
Based on this observation, we propose a simple strategy, i.e., increasing the number of training shots, to mitigate the loss of intrinsic dimension caused by robustness-promoting regularization.
That is to say, the smaller the model, the lower the mask ratio needs to be.
We develop a simple but effective module to explore the full potential of transformers for visual representation by learning fine-grained and coarse-grained features at a token level and dynamically fusing them.
Pretraining comprises two tasks: masked representation prediction, which predicts the representations of the masked patches, and masked patch reconstruction, which reconstructs the masked patches themselves.
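As a minimal sketch of how two such masked objectives could be combined, the toy function below masks a subset of (scalar) patches and sums a representation-prediction loss against teacher targets with a raw-patch reconstruction loss. All names (`masked_pretrain_loss`, `predict_repr`, `reconstruct`) are hypothetical and the scalar patches are a simplification; this is not the paper's implementation.

```python
import random

def masked_pretrain_loss(patches, teacher_reprs, predict_repr, reconstruct,
                         mask_ratio=0.4, rng=None):
    """Toy combination of the two masked pretraining objectives:
    L = L_repr  (predict the teacher's representation of each masked patch)
      + L_recon (reconstruct the raw masked patch).
    `patches` and `teacher_reprs` are scalars here for simplicity."""
    rng = rng or random.Random(0)
    n = len(patches)
    # Sample which patch indices are masked out.
    masked = rng.sample(range(n), max(1, int(n * mask_ratio)))
    # Mean squared error on the masked positions only, for both heads.
    l_repr = sum((predict_repr(i) - teacher_reprs[i]) ** 2 for i in masked) / len(masked)
    l_recon = sum((reconstruct(i) - patches[i]) ** 2 for i in masked) / len(masked)
    return l_repr + l_recon
```

With perfect predictors both terms vanish, so the loss is zero; any error in either head increases it.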
Recent works apply contrastive learning to the discriminator of Generative Adversarial Networks, but little work has explored whether contrastive learning can be applied to encoders to learn disentangled representations.
In self-supervised learning frameworks, deep networks are optimized to align different views of an instance that contain similar visual semantics.
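A common way to express that alignment objective is the negative cosine similarity between the embeddings of two augmented views; minimizing it pulls the views together. The sketch below is a generic illustration of this idea, not any specific paper's loss.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def alignment_loss(z1, z2):
    """Negative cosine similarity of two view embeddings:
    -1 when perfectly aligned, 0 when orthogonal."""
    return -cosine_similarity(z1, z2)
```

Full frameworks add more machinery (negatives, stop-gradients, projection heads), but the alignment term typically reduces to something of this shape.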
At each layer, it exploits a differentiable binarization search (DBS) to minimize the angular error in a student-teacher framework.
Object detection with Transformers (DETR) has achieved competitive performance compared with traditional detectors such as Faster R-CNN.
Transformers with remarkable global representation capacities achieve competitive results for visual tasks, but fail to consider high-level local pattern information in input images.
This leads to a new problem of confidence discrepancy for the detector ensembles.
Therefore, a trade-off between effectiveness and efficiency is necessary in practical scenarios.
To meet these two concerns, we comprehensively evaluate a collection of existing refinements that improve the performance of PP-YOLO while keeping the inference time almost unchanged.
Anomaly detection is a challenging task and is usually formulated as a one-class learning problem, since anomalies are by nature unexpected.
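To make the one-class formulation concrete, a minimal baseline fits a model of "normal" from normal-only data and scores test samples by their deviation from it; here that model is simply the centroid of normal features. The helper names are hypothetical and this is a generic illustration, not the method described above.

```python
import math

def fit_center(normal_feats):
    """One-class baseline: summarize normal-only training features
    by their centroid (no anomalous examples are ever seen)."""
    dim = len(normal_feats[0])
    n = len(normal_feats)
    return [sum(f[d] for f in normal_feats) / n for d in range(dim)]

def anomaly_score(center, feat):
    """Euclidean distance to the normal centroid; larger = more anomalous."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(center, feat)))
```

Thresholding this score separates normal from anomalous samples; practical methods replace the centroid with richer density or reconstruction models but keep the same one-class structure.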
Moreover, the Hierarchical-Split block is flexible and efficient, providing a large design space of potential network architectures for different applications.
1 code implementation • 16 Sep 2020 • Xuehui Yu, Zhenjun Han, Yuqi Gong, Nan Jiang, Jian Zhao, Qixiang Ye, Jie Chen, Yuan Feng, Bin Zhang, Xiaodi Wang, Ying Xin, Jingwei Liu, Mingyuan Mao, Sheng Xu, Baochang Zhang, Shumin Han, Cheng Gao, Wei Tang, Lizuo Jin, Mingbo Hong, Yuchao Yang, Shuiwang Li, Huan Luo, Qijun Zhao, Humphrey Shi
The 1st Tiny Object Detection (TOD) Challenge aims to encourage research in developing novel and accurate methods for tiny object detection in images which have wide views, with a current focus on tiny person detection.
We mainly combine various existing tricks that add almost no model parameters or FLOPs, aiming to improve detector accuracy as much as possible while keeping the speed almost unchanged.
We present an object detection framework based on PaddlePaddle.
In this manner, the influence of bias and noise in the web data can be gradually alleviated, leading to the steadily improving performance of URNet.