TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Long Term Action Anticipation	Ego4D	RepLAI	ED@20 Noun	83.4	# 1
Long Term Action Anticipation	Ego4D	RepLAI	ED@20 Verb	75.5	# 1
Object State Change Classification	Ego4D	RepLAI	Acc	66.30	# 3

Badge	Markdown
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/learning-state-aware-visual-representations/long-term-action-anticipation-on-ego4d)](https://paperswithcode.com/sota/long-term-action-anticipation-on-ego4d?p=learning-state-aware-visual-representations)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/learning-state-aware-visual-representations/object-state-change-classification-on-ego4d)](https://paperswithcode.com/sota/object-state-change-classification-on-ego4d?p=learning-state-aware-visual-representations)`

Learning State-Aware Visual Representations from Audible Interactions

27 Sep 2022 · Himangi Mittal, Pedro Morgado, Unnat Jain, Abhinav Gupta

We propose a self-supervised algorithm to learn representations from egocentric video data. Recently, significant efforts have been made to capture humans interacting with their own environments as they go about their daily activities. As a result, several large egocentric datasets of interaction-rich multi-modal data have emerged. However, learning representations from videos can be challenging. First, given the uncurated nature of long-form continuous videos, learning effective representations requires focusing on moments in time when interactions take place. Second, visual representations of daily activities should be sensitive to changes in the state of the environment. However, current successful multi-modal learning frameworks encourage representation invariance over time. To address these challenges, we leverage audio signals to identify moments of likely interactions, which are conducive to better learning. We also propose a novel self-supervised objective that learns from audible state changes caused by interactions. We validate these contributions extensively on two large-scale egocentric datasets, EPIC-Kitchens-100 and the recently released Ego4D, and show improvements on several downstream tasks, including action recognition, long-term action anticipation, and object state change classification.
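
The abstract does not say how moments of likely interaction are picked out of the audio track. As a rough illustration only, the sketch below flags sudden rises in short-time audio energy, a generic onset-detection heuristic that stands in for whatever the paper actually uses. The function names, the frame and hop sizes, and the threshold factor `k` are all assumptions made for this example and are not taken from the RepLAI code.

```python
from statistics import mean, pstdev


def frame_energies(wave, frame_len=1024, hop=512):
    """Mean squared amplitude of each (overlapping) frame of a mono waveform."""
    energies = []
    for start in range(0, max(len(wave) - frame_len, 0) + 1, hop):
        frame = wave[start:start + frame_len]
        energies.append(sum(x * x for x in frame) / max(len(frame), 1))
    return energies


def candidate_moments(wave, sample_rate, frame_len=1024, hop=512, k=3.0):
    """Return timestamps (seconds) where audio energy rises sharply.

    Hypothetical heuristic: keep frames whose positive energy increase
    exceeds mean + k * std of all increases. Not the paper's method.
    """
    energies = frame_energies(wave, frame_len, hop)
    # Positive change in energy between consecutive frames ("energy flux").
    flux = [0.0] + [max(b - a, 0.0) for a, b in zip(energies, energies[1:])]
    threshold = mean(flux) + k * pstdev(flux)
    return [
        (i * hop + frame_len / 2) / sample_rate  # frame centre in seconds
        for i, f in enumerate(flux)
        if f > threshold
    ]


if __name__ == "__main__":
    # Two seconds of silence with a short burst starting at t = 1 s.
    sr = 16000
    wave = [0.0] * (2 * sr)
    wave[sr:sr + 800] = [1.0] * 800
    print(candidate_moments(wave, sr))
```

In a training pipeline, timestamps like these could be used to decide where to sample video clips, so that clips are centred on likely interactions rather than drawn uniformly from long, uncurated footage.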

Code

HimangiM/RepLAI official

Tasks

Action Anticipation

Action Recognition

Long Term Action Anticipation

Object State Change Classification

Point-of-no-return (PNR) temporal localization

Datasets

EPIC-KITCHENS-100

Ego4D

Results from the Paper

Ranked #1 on Long Term Action Anticipation on Ego4D (ED@20 Noun metric)
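
The page does not define the ED@20 metric. On the assumption that it follows the usual long-term anticipation convention (an edit distance between a predicted and a ground-truth sequence of 20 future verb or noun labels, normalized by sequence length, with the best of several candidate predictions kept), a minimal sketch could look like the following. The function names are hypothetical, and the restricted Damerau-Levenshtein variant used here is one plausible choice and not a copy of the official evaluation code.

```python
def edit_distance(pred, target):
    """Restricted Damerau-Levenshtein distance between two label sequences.

    Counts insertions, deletions, substitutions, and swaps of adjacent labels.
    """
    n, m = len(pred), len(target)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if pred[i - 1] == target[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # match or substitution
            )
            # Swap of two adjacent labels counts as a single edit.
            if (i > 1 and j > 1
                    and pred[i - 1] == target[j - 2]
                    and pred[i - 2] == target[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[n][m]


def normalized_ed(pred, target):
    """Edit distance divided by the longer sequence length (0 = identical)."""
    return edit_distance(pred, target) / max(len(pred), len(target), 1)


def best_of_k(candidates, target):
    """Lowest normalized edit distance among several candidate predictions."""
    return min(normalized_ed(c, target) for c in candidates)


if __name__ == "__main__":
    truth = ["take", "cut", "put", "wash"]
    guesses = [["take", "put", "cut", "wash"], ["wash", "wash", "wash", "wash"]]
    print(best_of_k(guesses, truth))
```

Under this reading, lower values indicate predictions closer to the ground-truth sequence.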
