TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Audio Classification	AudioSet	HTS-AT (Single)	Test mAP	0.471	# 21
Audio Classification	AudioSet	HTS-AT (Ensemble)	Test mAP	0.487	# 12
Sound Event Detection	DESED	HTS-AT	event-based F1 score	50.7	# 4
Audio Classification	ESC-50	HTS-AT	Top-1 Accuracy	97.0	# 6
Audio Classification	ESC-50	HTS-AT	PRE-TRAINING DATASET	AudioSet	# 1
Audio Classification	ESC-50	HTS-AT	Accuracy (5-fold)	97.0	# 6
Keyword Spotting	Google Speech Commands	HTS-AT	Google Speech Commands V2 35	98.0	# 5


HTS-AT: A Hierarchical Token-Semantic Audio Transformer for Sound Classification and Detection

2 Feb 2022 · Ke Chen, Xingjian Du, Bilei Zhu, Zejun Ma, Taylor Berg-Kirkpatrick, Shlomo Dubnov

Audio classification is an important task of mapping audio samples to their corresponding labels. Recently, transformer models with self-attention mechanisms have been adopted in this field. However, existing audio transformers require large amounts of GPU memory and long training times, and they rely on pretrained vision models to achieve high performance, which limits their scalability in audio tasks. To address these problems, we introduce HTS-AT: an audio transformer with a hierarchical structure that reduces model size and training time. It is further combined with a token-semantic module that maps the final outputs into class feature maps, enabling the model to perform audio event detection (i.e., localization in time). We evaluate HTS-AT on three audio classification datasets, where it achieves new state-of-the-art (SOTA) results on AudioSet and ESC-50 and equals the SOTA on Speech Commands V2. It also achieves better event localization performance than previous CNN-based models. Moreover, HTS-AT requires only 35% of the model parameters and 15% of the training time of the previous audio transformer. These results demonstrate the high performance and high efficiency of HTS-AT.
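The idea of mapping final outputs into class feature maps can be illustrated with a small sketch. This is not the authors' implementation: it assumes the final transformer output has already been reduced to one feature vector per time frame, and it substitutes a plain linear projection for the token-semantic module. The function names (`class_feature_map`, `clip_prediction`, `detect_events`) and the 0.5 threshold are hypothetical choices made only for illustration.

```python
import math


def sigmoid(z):
    """Logistic function, used to turn activations into probabilities."""
    return 1.0 / (1.0 + math.exp(-z))


def class_feature_map(tokens, weights):
    """Project per-frame token features onto class activations.

    tokens:  T frames, each a list of D floats (assumed final-layer features).
    weights: C rows of D floats (hypothetical stand-in for the token-semantic module).
    Returns a T x C map: one activation per time frame and class.
    """
    return [[sum(w * x for w, x in zip(row, frame)) for row in weights]
            for frame in tokens]


def clip_prediction(fmap):
    """Clip-level classification: average each class over time, then apply sigmoid."""
    n_frames = len(fmap)
    n_classes = len(fmap[0])
    return [sigmoid(sum(fmap[t][c] for t in range(n_frames)) / n_frames)
            for c in range(n_classes)]


def detect_events(fmap, threshold=0.5):
    """Event detection: mark, per frame, which classes exceed the threshold."""
    return [[sigmoid(v) >= threshold for v in frame] for frame in fmap]


# Toy input: 3 time frames, 2 feature dimensions, 2 classes.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
weights = [[2.0, -2.0], [-2.0, 2.0]]

fmap = class_feature_map(tokens, weights)
print(clip_prediction(fmap))   # one probability per class for the whole clip
print(detect_events(fmap))     # per-frame presence flags for each class
```

Because the same map feeds both the time-averaged clip prediction and the per-frame presence flags, a model trained only for classification can still report when an event occurs, which is the property the abstract describes as localization in time.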


Code


retrocirce/hts-audio-transformer official

300

Tasks


Audio Classification

Event Detection

Keyword Spotting

Sound Classification

Sound Event Detection

Datasets

AudioSet

Speech Commands

ESC-50

DESED

Results from the Paper


Ranked #4 on Sound Event Detection on DESED


