Naver at ActivityNet Challenge 2019 -- Task B Active Speaker Detection (AVA)

25 Jun 2019 · Joon Son Chung

This report describes our submission to the ActivityNet Challenge at CVPR 2019. We use a 3D convolutional neural network (CNN) based front-end and an ensemble of temporal convolution and LSTM classifiers to predict whether a visible person is speaking. Our results show significant improvements over the baseline on the AVA-ActiveSpeaker dataset.
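The abstract does not say how the temporal convolution and LSTM classifiers are combined. As a minimal sketch, assuming each classifier emits one speaking probability per frame and the ensemble is a weighted average of the two, the fusion step might look like the following. The names `ensemble_scores`, `is_speaking`, `weight`, and `threshold` are hypothetical and are not taken from the report.

```python
from typing import List, Sequence


def ensemble_scores(
    tcn_scores: Sequence[float],
    lstm_scores: Sequence[float],
    weight: float = 0.5,
) -> List[float]:
    """Fuse per-frame speaking probabilities from two classifiers.

    Assumption: a weighted average. The report may use a different
    fusion rule; this only illustrates the general idea of an ensemble.
    """
    if len(tcn_scores) != len(lstm_scores):
        raise ValueError("both classifiers must score the same frames")
    return [
        weight * t + (1.0 - weight) * l
        for t, l in zip(tcn_scores, lstm_scores)
    ]


def is_speaking(scores: Sequence[float], threshold: float = 0.5) -> List[bool]:
    """Turn fused probabilities into per-frame speaking decisions."""
    return [s >= threshold for s in scores]


# Made-up scores for three frames of one face track.
tcn = [0.9, 0.2, 0.6]
lstm = [0.7, 0.4, 0.2]
fused = ensemble_scores(tcn, lstm)
decisions = is_speaking(fused)
```

With equal weights, a frame is labelled as speaking only when the two classifiers agree on average, which is the usual motivation for combining models with different temporal behaviour.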


Results from the Paper


Task: Audio-Visual Active Speaker Detection
Dataset: AVA-ActiveSpeaker
Model: VGG-{LSTM+TCN} (ensemble)
Metric: validation mean average precision
Value: 87.8%
Global rank: #14
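The reported metric is built on average precision over ranked predictions. As a rough illustration of that quantity, and not the official AVA-ActiveSpeaker evaluation tool, the sketch below computes non-interpolated average precision for binary speaking labels. The function name `average_precision` and the sample data are hypothetical.

```python
from typing import Sequence


def average_precision(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Non-interpolated average precision for binary labels.

    Predictions are ranked by descending score, and precision is
    averaged over the ranks at which a positive label occurs.
    """
    if len(labels) != len(scores):
        raise ValueError("labels and scores must have the same length")
    total_positives = sum(labels)
    if total_positives == 0:
        return 0.0  # no positives: define AP as 0 for this sketch

    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / total_positives


# Made-up example: 1 = speaking, 0 = not speaking.
labels = [1, 0, 1, 1, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.2]
ap = average_precision(labels, scores)
```

Here the positives land at ranks 1, 3, and 4, so the result is the mean of 1/1, 2/3, and 3/4.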

Methods