TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Audio-Visual Active Speaker Detection	AVA-ActiveSpeaker	TalkNet	Validation mean average precision	92.3%	#9


NUS-HLT Report for ActivityNet Challenge 2021 AVA (Speaker)

The ActivityNet Large-Scale Activity Recognition Challenge Workshop, CVPR 2021 · Ruijie Tao, Zexu Pan, Rohan Kumar Das, Xinyuan Qian, Mike Zheng Shou, Haizhou Li

Active speaker detection (ASD) seeks to detect who is speaking in a visual scene of one or more speakers. Successful ASD depends on accurate interpretation of short-term and long-term audio and visual information, as well as audio-visual interaction. Unlike prior work, where systems make decisions instantaneously using short-term features, we propose a novel framework, named TalkNet, that makes decisions by taking both short-term and long-term features into consideration. TalkNet consists of audio and visual temporal encoders for feature representation, an audio-visual cross-attention mechanism for inter-modality interaction, and a self-attention mechanism to capture long-term speaking evidence. The experiments demonstrate that TalkNet achieves 3.5% and 3.0% improvement over the state-of-the-art systems on the AVA-ActiveSpeaker validation and test sets, respectively. We will release the code, the models and data logs.
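The fusion stage described in the abstract (cross-attention between modalities, followed by self-attention over the whole sequence) can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes plain scaled dot-product attention with no learned projections, assumes each modality queries the other, and uses hypothetical names (`attention`, `fuse`) and toy two-dimensional per-frame features standing in for the outputs of the temporal encoders.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention over sequences of per-frame vectors."""
    dim = len(keys[0])
    attended = []
    for q in queries:
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim) for k in keys
        ]
        weights = softmax(scores)
        # Each output frame is a weighted average of all value frames.
        attended.append(
            [sum(w * v[j] for w, v in zip(weights, values))
             for j in range(len(values[0]))]
        )
    return attended


def fuse(audio, visual):
    """Hypothetical fusion: cross-attention, then self-attention."""
    # Cross-attention (assumed direction): each modality queries the other.
    audio_attended = attention(audio, visual, visual)
    visual_attended = attention(visual, audio, audio)
    # Concatenate the two attended streams frame by frame.
    joint = [a + v for a, v in zip(audio_attended, visual_attended)]
    # Self-attention lets every frame see the whole sequence,
    # which is how long-term evidence can enter each frame's output.
    return attention(joint, joint, joint)


# Toy inputs: three frames, two features per modality (made-up values).
audio = [[0.1, 0.3], [0.5, 0.2], [0.9, 0.4]]
visual = [[0.2, 0.1], [0.4, 0.6], [0.8, 0.7]]
fused = fuse(audio, visual)
```

Here `fused` holds one vector per frame, each mixing audio and visual information from every frame in the clip. A real system would add learned query, key, and value projections and a per-frame classifier on top.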
