Robust Speech Recognition via Large-Scale Weak Supervision

openai/whisper Preprint 2022

We study the capabilities of speech processing systems trained simply to predict large amounts of transcripts of audio on the internet.

Robust Speech Recognition

10,865
0.58 stars / hour

Dilated Neighborhood Attention Transformer

SHI-Labs/Neighborhood-Attention-Transformer 29 Sep 2022

These models typically employ localized attention mechanisms, such as the sliding-window Neighborhood Attention (NA) or Swin Transformer's Shifted Window Self Attention.

Image Classification Instance Segmentation +2

564
0.57 stars / hour

towhee

towhee-io/towhee ICCV 2019

Towhee is a framework that is dedicated to making neural data processing pipelines simple and fast.

Action Classification Action Recognition +5

1,535
0.54 stars / hour

SoundStream: An End-to-End Neural Audio Codec

google/lyra 7 Jul 2021

We present SoundStream, a novel neural audio codec that can efficiently compress speech, music and general audio at bitrates normally targeted by speech-tailored codecs.

Speech Enhancement

3,245
0.49 stars / hour

High-Resolution Image Synthesis with Latent Diffusion Models

compvis/stable-diffusion CVPR 2022

By decomposing the image formation process into a sequential application of denoising autoencoders, diffusion models (DMs) achieve state-of-the-art synthesis results on image data and beyond.

Denoising Image Inpainting +3

26,759
0.48 stars / hour

NP-Match: When Neural Processes meet Semi-Supervised Learning

jianf-wang/np-match 3 Jul 2022

Semi-supervised learning (SSL) has been widely explored in recent years, and it is an effective way of leveraging unlabeled data to reduce the reliance on labeled data.

Semi-Supervised Image Classification

119
0.40 stars / hour

Mega: Moving Average Equipped Gated Attention

facebookresearch/mega 21 Sep 2022

The design choices in the Transformer attention mechanism, including weak inductive bias and quadratic computational complexity, have limited its application for modeling long sequences.

Image Classification +3

79
0.39 stars / hour

TVLT: Textless Vision-Language Transformer

zinengtang/tvlt 28 Sep 2022

In this work, we present the Textless Vision-Language Transformer (TVLT), where homogeneous transformer blocks take raw visual and audio inputs for vision-and-language representation learning with minimal modality-specific design, and do not use text-specific modules such as tokenization or automatic speech recognition (ASR).

Automatic Speech Recognition Image Retrieval +6

46
0.35 stars / hour

VToonify: Controllable High-Resolution Portrait Video Style Transfer

williamyang1991/vtoonify 22 Sep 2022

Although a series of successful portrait image toonification models built upon the powerful StyleGAN have been proposed, these image-oriented methods have obvious limitations when applied to videos, such as the fixed frame size, the requirement of face alignment, missing non-facial details and temporal inconsistency.

Face Alignment Style Transfer +1

499
0.29 stars / hour