TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Action Classification	CelebV-HQ	MARLIN	Accuracy	95.48	# 1
Action Classification	CelebV-HQ	MARLIN	AUC	0.9406	# 1
Facial Attribute Classification	CelebV-HQ	MARLIN	Accuracy	93.9	# 1
Facial Attribute Classification	CelebV-HQ	MARLIN	AUC	0.9561	# 1
Emotion Classification	CMU-MOSEI	MARLIN (ViT-S)	Accuracy	80.38	# 3
Multimodal Sentiment Analysis	CMU-MOSEI	MARLIN (ViT-L)	Accuracy	74.83	# 12
Multimodal Sentiment Analysis	CMU-MOSEI	MARLIN (ViT-B)	Accuracy	73.7	# 13
Multimodal Sentiment Analysis	CMU-MOSEI	MARLIN (ViT-S)	Accuracy	72.69	# 14
Emotion Classification	CMU-MOSEI	MARLIN (ViT-L)	Accuracy	80.63	# 1
Emotion Classification	CMU-MOSEI	MARLIN (ViT-B)	Accuracy	80.6	# 2
DeepFake Detection	FaceForensics++	MARLIN (ViT-L)	AUC	0.9377	# 2
DeepFake Detection	FaceForensics++	MARLIN (ViT-S)	AUC	0.8863	# 4
DeepFake Detection	FaceForensics++	MARLIN (ViT-B)	AUC	0.9305	# 3
Unconstrained Lip-synchronization	LRS2	Wav2Lip + ViT + MARLIN	LSE-D	7.127	# 1
Unconstrained Lip-synchronization	LRS2	Wav2Lip + ViT + MARLIN	LSE-C	5.528	# 2
Unconstrained Lip-synchronization	LRS2	Wav2Lip + ViT + MARLIN	FID	3.452	# 1

MARLIN: Masked Autoencoder for facial video Representation LearnINg

CVPR 2023 · Zhixi Cai, Shreya Ghosh, Kalin Stefanov, Abhinav Dhall, Jianfei Cai, Hamid Rezatofighi, Reza Haffari, Munawar Hayat

This paper proposes a self-supervised approach to learn universal facial representations from videos that can transfer across a variety of facial analysis tasks such as Facial Attribute Recognition (FAR), Facial Expression Recognition (FER), DeepFake Detection (DFD), and Lip Synchronization (LS). Our proposed framework, named MARLIN, is a facial video masked autoencoder that learns highly robust and generic facial embeddings from abundantly available, non-annotated, web-crawled facial videos. As a challenging auxiliary task, MARLIN reconstructs the spatio-temporal details of the face from densely masked facial regions, which mainly include the eyes, nose, mouth, lips, and skin, to capture local and global aspects that in turn help in encoding generic and transferable features. Through a variety of experiments on diverse downstream tasks, we demonstrate that MARLIN is an excellent facial video encoder as well as feature extractor that performs consistently well across a variety of downstream tasks including FAR (1.13% gain over supervised benchmark), FER (2.64% gain over unsupervised benchmark), DFD (1.86% gain over unsupervised benchmark), and LS (29.36% gain for Fréchet Inception Distance), and even in the low-data regime. Our code and models are available at https://github.com/ControlNet/MARLIN.
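The masked-reconstruction idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract says MARLIN masks facial regions (eyes, nose, mouth, lips, skin), whereas the sketch below substitutes uniform random masking of spatio-temporal patches as a stand-in. The clip size (16 frames of 224 x 224), the patch size (2 x 16 x 16), the 0.9 mask ratio, and all function names are assumptions chosen for illustration only.

```python
import random


def split_into_tubes(num_frames, height, width, t=2, p=16):
    """Return the (frame, row, col) origin of every t x p x p spatio-temporal patch.

    The patch size is an illustrative assumption, not a value from the paper.
    """
    return [
        (f, r, c)
        for f in range(0, num_frames, t)
        for r in range(0, height, p)
        for c in range(0, width, p)
    ]


def sample_mask(num_patches, mask_ratio, rng):
    """Pick the indices of patches hidden from the encoder.

    Uniform random masking is used here as a stand-in; the paper describes
    masking guided by facial regions instead.
    """
    num_masked = int(round(num_patches * mask_ratio))
    return set(rng.sample(range(num_patches), num_masked))


def masked_mse(target, prediction, masked):
    """Mean squared error computed over the masked patches only.

    `target` and `prediction` map a patch index to a flat list of pixel values.
    """
    total, count = 0.0, 0
    for i in masked:
        for t_val, p_val in zip(target[i], prediction[i]):
            total += (t_val - p_val) ** 2
            count += 1
    return total / count if count else 0.0


if __name__ == "__main__":
    rng = random.Random(0)
    patches = split_into_tubes(num_frames=16, height=224, width=224)
    masked = sample_mask(len(patches), mask_ratio=0.9, rng=rng)
    visible = [i for i in range(len(patches)) if i not in masked]
    # Only the visible patches would be fed to the encoder; a decoder would
    # then predict the masked ones, and masked_mse would score that prediction.
    print(len(patches), len(masked), len(visible))
```

Scoring only the masked patches means the model gets no credit for copying the pixels it was shown, so it has to infer the hidden content from context. This is the general masked-autoencoder objective rather than anything specific to MARLIN.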

Code

ControlNet/MARLIN (official) · 193 stars

Tasks

Action Classification

Attribute

DeepFake Detection

Emotion Classification

Face Swapping

Facial Attribute Classification

Facial Expression Recognition

Facial Expression Recognition (FER)

Multimodal Sentiment Analysis

Representation Learning

Sentiment Analysis

Unconstrained Lip-synchronization

Datasets

FaceForensics++

CMU-MOSEI

LRS2

CelebV-HQ

Results from the Paper

Ranked #1 on Emotion Classification on CMU-MOSEI

Methods

MAE • MARLIN
