Specifically, we replace the MLP module in the token-mixing step with a novel sparse MLP (sMLP) module.
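The snippet above does not say how the sMLP module works internally. As a rough illustration only, the sketch below assumes one common reading of "sparse" token mixing: tokens are mixed separately along the height and width axes rather than across all positions at once, and the results are fused with the unmixed input. The function name `sparse_token_mix`, the weight shapes, and the three-branch fusion are all assumptions made for this example, not details confirmed by the text.

```python
import numpy as np

def sparse_token_mix(x, w_h, w_w, w_fuse):
    """Hypothetical axial token mixing.

    x:      (H, W, C) feature map
    w_h:    (H, H) weights mixing tokens along the height axis
    w_w:    (W, W) weights mixing tokens along the width axis
    w_fuse: (3*C, C) pointwise weights fusing the three branches
    """
    # Mix along height: weights are shared across all columns and channels.
    mixed_h = np.einsum("gh,hwc->gwc", w_h, x)
    # Mix along width: weights are shared across all rows and channels.
    mixed_w = np.einsum("vw,hwc->hvc", w_w, x)
    # Concatenate identity, height, and width branches on the channel axis.
    fused = np.concatenate([x, mixed_h, mixed_w], axis=-1)  # (H, W, 3C)
    # Pointwise (per-token) linear fusion back to C channels.
    return fused @ w_fuse

rng = np.random.default_rng(0)
H, W, C = 4, 5, 8
x = rng.standard_normal((H, W, C))
out = sparse_token_mix(
    x,
    rng.standard_normal((H, H)),
    rng.standard_normal((W, W)),
    rng.standard_normal((3 * C, C)),
)
print(out.shape)  # (4, 5, 8)
```

Under these assumptions, the mixing weights number H² + W² instead of the (HW)² needed by a dense token-mixing MLP over all positions, which is one sense in which such a module could be called sparse.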
Ranked #131 on Image Classification on ImageNet
Convolutional neural networks (CNNs) are the dominant deep neural network (DNN) architecture for computer vision.
In this paper, we propose a novel contrastive mask prediction (CMP) task for visual representation learning and design a mask contrast (MaskCo) framework to implement the idea.
The proxy task is to estimate the position and size of the image patch in a sequence of video frames, given only the target bounding box in the first frame.
The two stages are connected in series: the input proposals of the FM stage are generated by the CM stage.