Overall, the proposed mask classification-based method simplifies the landscape of effective approaches to semantic and panoptic segmentation tasks and shows excellent empirical results.
Ranked #4 on Semantic Segmentation on Mapillary val
The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder-decoder configuration.
Ranked #2 on Multimodal Machine Translation on Multi30K (BLEU (DE-EN) metric)
A big convergence of language, multimodal perception, action, and world modeling is a key step toward artificial general intelligence.
Neural sequence-to-sequence TTS has achieved significantly better output quality than statistical speech synthesis using HMMs.
Ranked #3 on Speech Synthesis on LJSpeech
Graph Convolutional Networks (GCNs) have been drawing significant attention with the power of representation learning on graphs.
Ranked #1 on Node Property Prediction on ogbn-proteins
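To make the "representation learning on graphs" claim concrete, here is a minimal NumPy sketch of one standard GCN layer, $H' = \sigma(\hat{D}^{-1/2}(A+I)\hat{D}^{-1/2} H W)$. This is an illustrative toy, not the ranked model's implementation; the graph, features, and weights are made up.

```python
import numpy as np

def gcn_layer(A, H, W):
    """One GCN layer: ReLU(D^{-1/2} (A + I) D^{-1/2} @ H @ W)."""
    A_hat = A + np.eye(A.shape[0])           # add self-loops
    d = A_hat.sum(axis=1)                    # node degrees of A_hat
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))   # D^{-1/2}
    A_norm = D_inv_sqrt @ A_hat @ D_inv_sqrt # symmetric normalization
    return np.maximum(A_norm @ H @ W, 0.0)   # ReLU activation

# Toy graph: a 3-node path 0-1-2, 2-d features, 2 output channels.
A = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])
rng = np.random.default_rng(0)
H = rng.standard_normal((3, 2))
W = rng.standard_normal((2, 2))
H_next = gcn_layer(A, H, W)
print(H_next.shape)  # (3, 2): one learned representation per node
```

Each node's new representation mixes its own features with its neighbors', which is the core of graph representation learning.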
Transformers are emerging as a natural alternative to standard RNNs, replacing recurrent computations with a multi-head attention mechanism.
Ranked #7 on Speech Separation on WSJ0-3mix
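The replacement of recurrence by multi-head attention can be sketched in a few lines of NumPy. This is a generic illustration of the mechanism (dimensions, weights, and head count are arbitrary), not the speech-separation model itself.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, num_heads, Wq, Wk, Wv, Wo):
    """Self-attention over a sequence X of shape (T, d_model):
    heads are slices of the model dimension, processed in parallel
    rather than step-by-step as in an RNN."""
    T, d_model = X.shape
    d_head = d_model // num_heads
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    outs = []
    for h in range(num_heads):
        s = slice(h * d_head, (h + 1) * d_head)
        scores = Q[:, s] @ K[:, s].T / np.sqrt(d_head)  # (T, T)
        outs.append(softmax(scores) @ V[:, s])          # weighted values
    return np.concatenate(outs, axis=-1) @ Wo           # merge heads

rng = np.random.default_rng(0)
T, d_model, n_heads = 5, 8, 2
X = rng.standard_normal((T, d_model))
Wq, Wk, Wv, Wo = (rng.standard_normal((d_model, d_model)) for _ in range(4))
Y = multi_head_attention(X, n_heads, Wq, Wk, Wv, Wo)
print(Y.shape)  # (5, 8): every position attends to every other in one pass
```

Because every output position is computed from all inputs at once, the sequential dependency of recurrent computation disappears.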
However, they fail to accurately morph the lip movements of arbitrary identities in dynamic, unconstrained talking face videos, resulting in significant parts of the video being out-of-sync with the new audio.
Ranked #1 on Unconstrained Lip-synchronization on LRS3 (using extra training data)
The new update rule is equivalent to the attention mechanism used in transformers.
Tasks: Immune Repertoire Classification, Multiple Instance Learning, +1
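The equivalence can be seen in a few lines: the modern continuous Hopfield update, xi_new = X softmax(beta * X^T xi), is one step of attention with query xi over stored patterns X as keys and values. Below is a minimal NumPy sketch under assumed shapes (patterns as columns) and an arbitrary inverse temperature beta; it is illustrative, not the paper's code.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def hopfield_update(X, xi, beta=8.0):
    """One modern Hopfield update: xi_new = X @ softmax(beta * X.T @ xi).
    Columns of X are stored patterns; structurally this is attention with
    query xi, keys X, and values X."""
    return X @ softmax(beta * X.T @ xi)

rng = np.random.default_rng(1)
X = rng.standard_normal((16, 4))              # 4 stored 16-d patterns (columns)
xi = X[:, 2] + 0.1 * rng.standard_normal(16)  # noisy query near pattern 2
for _ in range(3):                            # a few steps typically converge
    xi = hopfield_update(X, xi)
print(np.argmax(X.T @ xi))                    # index of the retrieved pattern
```

For a well-separated pattern set like this one, the update snaps the noisy query back to the stored pattern it started near.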
Deep networks often suffer from vanishing or exploding gradients due to inefficient signal propagation, leading to long training times or convergence difficulties.
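The vanishing-gradient effect is easy to demonstrate numerically: the end-to-end gradient is a product of per-layer Jacobians, and when each factor shrinks the signal, the product decays exponentially with depth. A minimal NumPy sketch, using a deep tanh network with a deliberately small (hypothetical) weight scale to make the decay visible:

```python
import numpy as np

rng = np.random.default_rng(0)
depth, width = 50, 32

x = rng.standard_normal(width)
J_total = np.eye(width)  # accumulated Jacobian d h_L / d h_0
for _ in range(depth):
    # Small-scale init chosen for illustration; tanh' <= 1 shrinks it further.
    W = 0.1 * rng.standard_normal((width, width)) / np.sqrt(width)
    x = np.tanh(W @ x)
    J_layer = (1.0 - x**2)[:, None] * W  # diag(tanh'(pre)) @ W
    J_total = J_layer @ J_total

grad_norm = np.linalg.norm(J_total)
print(grad_norm)  # far below 1: the gradient signal has effectively vanished
```

With the opposite problem, overly large weights, the same product can blow up instead, which is why signal-propagation-aware initializations and normalization layers matter for training speed and convergence.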
Our model, \emph{ByteFormer}, achieves an ImageNet Top-1 classification accuracy of $77.33\%$ when training and testing directly on TIFF file bytes using a transformer backbone with configuration similar to DeiT-Ti ($72.2\%$ accuracy when operating on RGB images).