We present a convolution-free approach to video classification built exclusively on self-attention over space and time.
Ranked #1 on Action Recognition on Diving-48
This paper presents a new vision Transformer, called Swin Transformer, that capably serves as a general-purpose backbone for computer vision.
Ranked #1 on Semantic Segmentation on ADE20K
A text-to-speech synthesis system typically consists of multiple stages, such as a text analysis frontend, an acoustic model and an audio synthesis module.
Ranked #4 on Speech Synthesis on North American English
In recent years, the use of Generative Adversarial Networks (GANs) has become very popular in generative image modeling.
Recent advances in document image analysis (DIA) have been primarily driven by the application of neural networks.
In this paper, we propose the Self-Attention Generative Adversarial Network (SAGAN) which allows attention-driven, long-range dependency modeling for image generation tasks.
Ranked #11 on Conditional Image Generation on ImageNet 128x128
Recent advances in neural-network-based generative modeling of speech have shown great potential for speech coding.
Inspired by the common painting process of drawing a draft and revising the details, we introduce a novel feed-forward method named Laplacian Pyramid Network (LapStyle).
The problem of answering questions using knowledge from pre-trained language models (LMs) and knowledge graphs (KGs) presents two challenges: given a QA context (question and answer choice), methods need to (i) identify relevant knowledge from large KGs, and (ii) perform joint reasoning over the QA context and KG.
Ranked #1 on Common Sense Reasoning on CommonsenseQA