Specifically, we replace the MLP module in the token-mixing step with a novel sparse MLP (sMLP) module.
Ranked #131 on Image Classification on ImageNet
Given a piece of speech and its transcript text, text-based speech editing aims to generate speech that can be seamlessly inserted into the given speech by editing the transcript.
Convolutional neural networks (CNN) are the dominant deep neural network (DNN) architecture for computer vision.
In this paper, we propose a novel contrastive mask prediction (CMP) task for visual representation learning and design a mask contrast (MaskCo) framework to implement the idea.
This paper presents a self-supervised learning framework, named MGF, for general-purpose speech representation learning.