TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Audio Classification	ESC-50	M2D ratio=0.7	Top-1 Accuracy	95.0	# 11
Audio Classification	ESC-50	M2D ratio=0.7	Accuracy (5-fold)	95.0	# 11
Keyword Spotting	Google Speech Commands	M2D	Google Speech Commands V2 35	98.5	# 2
Music Genre Classification	GTZAN	M2D ratio=0.7	Accuracy	83.9	# 1
Music Genre Classification	GTZAN	M2D ratio=0.6	Accuracy	83.3	# 2
Speaker Identification	VoxCeleb1	MSM-MAE	Top-1 (%)	95.3	# 1
Speaker Identification	VoxCeleb1	MSM-MAE	Accuracy	95.3	# 1
Speaker Identification	VoxCeleb1	M2D ratio=0.6	Top-1 (%)	94.8	# 2
Speaker Identification	VoxCeleb1	M2D ratio=0.6	Accuracy	94.8	# 2

Badge	Markdown
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/masked-modeling-duo-learning-representations/music-genre-classification-on-gtzan)](https://paperswithcode.com/sota/music-genre-classification-on-gtzan?p=masked-modeling-duo-learning-representations)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/masked-modeling-duo-learning-representations/speaker-identification-on-voxceleb1)](https://paperswithcode.com/sota/speaker-identification-on-voxceleb1?p=masked-modeling-duo-learning-representations)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/masked-modeling-duo-learning-representations/keyword-spotting-on-google-speech-commands)](https://paperswithcode.com/sota/keyword-spotting-on-google-speech-commands?p=masked-modeling-duo-learning-representations)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/masked-modeling-duo-learning-representations/audio-classification-on-esc-50)](https://paperswithcode.com/sota/audio-classification-on-esc-50?p=masked-modeling-duo-learning-representations)`

Masked Modeling Duo: Learning Representations by Encouraging Both Networks to Model the Input

26 Oct 2022 · Daisuke Niizumi, Daiki Takeuchi, Yasunori Ohishi, Noboru Harada, Kunio Kashino

Masked Autoencoders is a simple yet powerful self-supervised learning method. However, it learns representations indirectly by reconstructing masked input patches. Several methods learn representations directly by predicting representations of masked patches; however, we think using all patches to encode training signal representations is suboptimal. We propose a new method, Masked Modeling Duo (M2D), that learns representations directly while obtaining training signals using only masked patches. In the M2D, the online network encodes visible patches and predicts masked patch representations, and the target network, a momentum encoder, encodes masked patches. To better predict target representations, the online network should model the input well, while the target network should also model it well to agree with online predictions. Then the learned representations should better model the input. We validated the M2D by learning general-purpose audio representations, and M2D set new state-of-the-art performance on tasks such as UrbanSound8K, VoxCeleb1, AudioSet20K, GTZAN, and SpeechCommandsV2. We additionally validate the effectiveness of M2D for images using ImageNet-1K in the appendix.
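The data flow described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The function names, the callable interfaces (`online_encode`, `predict`, `target_encode`), and the normalized mean-squared-error loss are all assumptions chosen for illustration; the abstract only states that the online network sees visible patches, the target network is a momentum encoder that sees masked patches, and the online predictions should agree with the target representations.

```python
import numpy as np


def split_patches(num_patches, mask_ratio, rng):
    """Randomly partition patch indices into (visible, masked) index arrays."""
    num_masked = int(round(num_patches * mask_ratio))
    perm = rng.permutation(num_patches)
    masked = np.sort(perm[:num_masked])
    visible = np.sort(perm[num_masked:])
    return visible, masked


def normalized_mse(pred, target, eps=1e-8):
    """MSE between L2-normalized vectors (an assumed loss, for illustration)."""
    p = pred / (np.linalg.norm(pred, axis=-1, keepdims=True) + eps)
    t = target / (np.linalg.norm(target, axis=-1, keepdims=True) + eps)
    return float(np.mean(np.sum((p - t) ** 2, axis=-1)))


def ema_update(target_params, online_params, momentum):
    """Move target parameters toward the online ones (momentum encoder)."""
    return [momentum * t + (1.0 - momentum) * o
            for t, o in zip(target_params, online_params)]


def m2d_forward(patches, mask_ratio, online_encode, predict, target_encode, rng):
    """One forward pass: online sees visible patches, target sees masked ones.

    The three callables are hypothetical stand-ins for the real networks.
    """
    visible, masked = split_patches(len(patches), mask_ratio, rng)
    z_visible = online_encode(patches[visible])   # online: visible patches only
    pred = predict(z_visible, visible, masked)    # predicted masked representations
    target = target_encode(patches[masked])       # target: masked patches only
    return normalized_mse(pred, target), visible, masked
```

In a training loop under these assumptions, the loss would be backpropagated through the online network only, and `ema_update` would then refresh the target parameters after each optimizer step. The `mask_ratio` argument corresponds to the ratio values (0.6, 0.7) that appear in the model names in the results table.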

Code

nttcslab/m2d official

Tasks

Audio Classification

Audio Tagging

Keyword Spotting

Keyword Spotting on Google Speech Commands

Music Genre Classification

Self-Supervised Learning

Speaker Identification

Datasets

VoxCeleb1

AudioSet

Speech Commands

ESC-50

UrbanSound8K

NSynth

CREMA-D

VoxForge

GTZAN

Results from the Paper

Ranked #1 on Speaker Identification on VoxCeleb1 (using extra training data)
