TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Audio Classification	AudioSet	Audio-MAE (Audio-only, single)	Test mAP	0.473	# 20
Speaker Identification	VoxCeleb1	AudioMAE (global)	Top-1 (%)	94.1	# 5
Speaker Identification	VoxCeleb1	AudioMAE (global)	Accuracy	94.1	# 5
Speaker Identification	VoxCeleb1	AudioMAE (local)	Top-1 (%)	94.8	# 2
Speaker Identification	VoxCeleb1	AudioMAE (local)	Accuracy	94.8	# 2

Badge	Markdown
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/masked-autoencoders-that-listen/speaker-identification-on-voxceleb1)](https://paperswithcode.com/sota/speaker-identification-on-voxceleb1?p=masked-autoencoders-that-listen)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/masked-autoencoders-that-listen/audio-classification-on-audioset)](https://paperswithcode.com/sota/audio-classification-on-audioset?p=masked-autoencoders-that-listen)`

Masked Autoencoders that Listen

13 Jul 2022 · Po-Yao Huang, Hu Xu, Juncheng Li, Alexei Baevski, Michael Auli, Wojciech Galuba, Florian Metze, Christoph Feichtenhofer

This paper studies a simple extension of image-based Masked Autoencoders (MAE) to self-supervised representation learning from audio spectrograms. Following the Transformer encoder-decoder design in MAE, our Audio-MAE first encodes audio spectrogram patches with a high masking ratio, feeding only the non-masked tokens through encoder layers. The decoder then re-orders and decodes the encoded context padded with mask tokens, in order to reconstruct the input spectrogram. We find it beneficial to incorporate local window attention in the decoder, as audio spectrograms are highly correlated in local time and frequency bands. We then fine-tune the encoder with a lower masking ratio on target datasets. Empirically, Audio-MAE sets new state-of-the-art performance on six audio and speech classification tasks, outperforming other recent models that use external supervised pre-training. The code and models will be at https://github.com/facebookresearch/AudioMAE.
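The masking and re-ordering steps described in the abstract can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the function names `random_masking` and `restore_order`, the uniform random masking strategy, and the 0.8 masking ratio used in the usage note are assumptions chosen for illustration, and the encoder itself is omitted.

```python
import numpy as np


def random_masking(tokens, mask_ratio, rng):
    """Keep a random subset of patch tokens (hypothetical helper).

    tokens: array of shape (num_patches, dim), one row per spectrogram patch.
    Returns the visible tokens, a boolean mask (True = masked), and the
    indices needed to undo the shuffle later.
    """
    n = tokens.shape[0]
    n_keep = int(n * (1 - mask_ratio))

    # Shuffle patch indices by sorting random noise.
    ids_shuffle = np.argsort(rng.random(n))
    # The inverse permutation maps shuffled positions back to the original order.
    ids_restore = np.argsort(ids_shuffle)

    ids_keep = ids_shuffle[:n_keep]
    visible = tokens[ids_keep]  # only these rows would go through the encoder

    mask = np.ones(n, dtype=bool)
    mask[ids_keep] = False
    return visible, mask, ids_restore


def restore_order(encoded, mask_token, ids_restore):
    """Pad the encoded tokens with mask tokens and restore the original patch order."""
    n_masked = ids_restore.shape[0] - encoded.shape[0]
    padded = np.concatenate([encoded, np.tile(mask_token, (n_masked, 1))], axis=0)
    return padded[ids_restore]  # this sequence is the decoder input
```

With 64 patches and a masking ratio of 0.8, `random_masking` keeps 12 tokens. Passing those tokens and a single learned mask token of shape `(1, dim)` to `restore_order` yields a 64-row sequence in which visible patches sit at their original positions and every other row holds the mask token.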


Code


facebookresearch/audiomae (official) · 470 stars

facebookresearch/multimodal · 1,290 stars

rishikksh20/AudioMAE-pytorch

eml-eda/tle-supervised

Tasks


Audio Classification

Representation Learning

Speaker Identification

Datasets

VoxCeleb1

AudioSet

Speech Commands

ESC-50

Results from the Paper


Ranked #2 on Speaker Identification on VoxCeleb1 (using extra training data)


Methods


Absolute Position Encodings • Adam • BPE • Dense Connections • Dropout • Label Smoothing • Layer Normalization • Linear Layer • MAE • Multi-Head Attention • Position-Wise Feed-Forward Layer • Residual Connection • Scaled Dot-Product Attention • Softmax • Transformer
