no code implementations • 17 Sep 2024 • Gabriele Goletto, Tushar Nagarajan, Giuseppe Averta, Dima Damen
Egocentric videos provide a unique perspective into individuals' daily experiences, yet their unstructured nature presents challenges for perception.
no code implementations • 15 Apr 2024 • Siddhant Bansal, Michael Wray, Dima Damen
Our results demonstrate that VLMs trained for referral on third person images fail to recognise and refer hands and objects in egocentric images.
1 code implementation • CVPR 2024 • Jacob Chalk, Jaesung Huh, Evangelos Kazakos, Andrew Zisserman, Dima Damen
We address the interplay between the two modalities in long videos by explicitly modelling the temporal extents of audio and visual events.
no code implementations • 7 Apr 2024 • Chiara Plizzari, Shubham Goel, Toby Perrett, Jacob Chalk, Angjoo Kanazawa, Dima Damen
As humans move around, performing their daily tasks, they are able to recall where they have positioned objects in their environment, even if these objects are currently out of sight.
1 code implementation • 26 Mar 2024 • Saptarshi Sinha, Alexandros Stergiou, Dima Damen
We propose an exemplar-based approach that discovers visual correspondence of video exemplars across repetitions within target videos.
Ranked #1 on Repetitive Action Counting on UCFRep
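The correspondence idea behind exemplar-based counting can be illustrated with a minimal sketch (not the paper's model): correlate an exemplar's embedding against per-frame embeddings of the target video and count well-separated similarity peaks. All function names and thresholds below are illustrative assumptions.

```python
import numpy as np
from scipy.signal import find_peaks

def count_repetitions(frame_feats: np.ndarray, exemplar_feat: np.ndarray,
                      min_gap: int = 8, threshold: float = 0.6) -> int:
    """Count repetitions by locating peaks in the cosine similarity
    between an exemplar embedding and per-frame embeddings.

    frame_feats: (T, D) per-frame features of the target video.
    exemplar_feat: (D,) feature of one exemplar repetition.
    """
    # Cosine similarity between the exemplar and every frame.
    f = frame_feats / (np.linalg.norm(frame_feats, axis=1, keepdims=True) + 1e-8)
    e = exemplar_feat / (np.linalg.norm(exemplar_feat) + 1e-8)
    sim = f @ e                                      # (T,)

    # Each sufficiently high, well-separated peak counts as one repetition.
    peaks, _ = find_peaks(sim, height=threshold, distance=min_gap)
    return len(peaks)

# Toy usage with random features (real features would come from a video encoder).
rng = np.random.default_rng(0)
print(count_repetitions(rng.normal(size=(200, 128)), rng.normal(size=128)))
```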
no code implementations • 4 Feb 2024 • Bin Zhu, Kevin Flanagan, Adriano Fragomeni, Michael Wray, Dima Damen
The teacher model is employed to edit the clips in the training set whereas the student model trains on the edited clips.
no code implementations • 25 Dec 2023 • Zhifan Zhu, Dima Damen
Our proposed EPIC-Grasps dataset includes 390 object instances of 9 categories, featuring stable grasps from videos of daily interactions in 141 environments.
no code implementations • 20 Dec 2023 • Joseph Heyward, João Carreira, Dima Damen, Andrew Zisserman, Viorica Pătrăucean
The First Perception Test challenge was held as a half-day workshop alongside the IEEE/CVF International Conference on Computer Vision (ICCV) 2023, with the goal of benchmarking state-of-the-art video models on the recently proposed Perception Test benchmark.
1 code implementation • CVPR 2024 • Tomáš Souček, Dima Damen, Michael Wray, Ivan Laptev, Josef Sivic
We address the task of generating temporally consistent and physically plausible images of actions and object state transformations.
no code implementations • CVPR 2024 • João Carreira, Michael King, Viorica Pătrăucean, Dilara Gokay, Cătălin Ionescu, Yi Yang, Daniel Zoran, Joseph Heyward, Carl Doersch, Yusuf Aytar, Dima Damen, Andrew Zisserman
We introduce a framework for online learning from a single continuous video stream -- the way people and animals learn, without mini-batches, data augmentation or shuffling.
2 code implementations • CVPR 2024 • Kristen Grauman, Andrew Westbury, Lorenzo Torresani, Kris Kitani, Jitendra Malik, Triantafyllos Afouras, Kumar Ashutosh, Vijay Baiyya, Siddhant Bansal, Bikram Boote, Eugene Byrne, Zach Chavis, Joya Chen, Feng Cheng, Fu-Jen Chu, Sean Crane, Avijit Dasgupta, Jing Dong, Maria Escobar, Cristhian Forigua, Abrham Gebreselasie, Sanjay Haresh, Jing Huang, Md Mohaiminul Islam, Suyog Jain, Rawal Khirodkar, Devansh Kukreja, Kevin J Liang, Jia-Wei Liu, Sagnik Majumder, Yongsen Mao, Miguel Martin, Effrosyni Mavroudi, Tushar Nagarajan, Francesco Ragusa, Santhosh Kumar Ramakrishnan, Luigi Seminara, Arjun Somayazulu, Yale Song, Shan Su, Zihui Xue, Edward Zhang, Jinxu Zhang, Angela Castillo, Changan Chen, Xinzhu Fu, Ryosuke Furuta, Cristina Gonzalez, Prince Gupta, Jiabo Hu, Yifei HUANG, Yiming Huang, Weslie Khoo, Anush Kumar, Robert Kuo, Sach Lakhavani, Miao Liu, Mi Luo, Zhengyi Luo, Brighid Meredith, Austin Miller, Oluwatumininu Oguntola, Xiaqing Pan, Penny Peng, Shraman Pramanick, Merey Ramazanova, Fiona Ryan, Wei Shan, Kiran Somasundaram, Chenan Song, Audrey Southerland, Masatoshi Tateno, Huiyu Wang, Yuchen Wang, Takuma Yagi, Mingfei Yan, Xitong Yang, Zecheng Yu, Shengxin Cindy Zha, Chen Zhao, Ziwei Zhao, Zhifan Zhu, Jeff Zhuo, Pablo Arbelaez, Gedas Bertasius, David Crandall, Dima Damen, Jakob Engel, Giovanni Maria Farinella, Antonino Furnari, Bernard Ghanem, Judy Hoffman, C. V. Jawahar, Richard Newcombe, Hyun Soo Park, James M. Rehg, Yoichi Sato, Manolis Savva, Jianbo Shi, Mike Zheng Shou, Michael Wray
We present Ego-Exo4D, a diverse, large-scale multimodal multiview video dataset and benchmark challenge.
1 code implementation • 28 Nov 2023 • Hanyuan Wang, Majid Mirmehdi, Dima Damen, Toby Perrett
Previous one-stage action detection approaches have modelled temporal dependencies using only the visual modality.
1 code implementation • 26 Oct 2023 • Kevin Flanagan, Dima Damen, Michael Wray
Compared to traditional benchmarks on which this task is evaluated, these datasets offer finer-grained sentences to ground in notably longer videos.
no code implementations • 14 Aug 2023 • Chiara Plizzari, Gabriele Goletto, Antonino Furnari, Siddhant Bansal, Francesco Ragusa, Giovanni Maria Farinella, Dima Damen, Tatiana Tommasi
What will the future be?
no code implementations • ICCV 2023 • Chiara Plizzari, Toby Perrett, Barbara Caputo, Dima Damen
We propose and address a new generalisation problem: can a model trained for action recognition successfully classify actions when they are performed within a previously unseen scenario and in a previously unseen location?
1 code implementation • NeurIPS 2023 • Vadim Tschernezki, Ahmad Darkhalil, Zhifan Zhu, David Fouhey, Iro Laina, Diane Larlus, Dima Damen, Andrea Vedaldi
Compared to other neural rendering datasets, EPIC Fields is better tailored to video understanding because it is paired with labelled action segments and the recent VISOR segment annotations.
1 code implementation • NeurIPS 2023 • Viorica Pătrăucean, Lucas Smaira, Ankush Gupta, Adrià Recasens Continente, Larisa Markeeva, Dylan Banarse, Skanda Koppula, Joseph Heyward, Mateusz Malinowski, Yi Yang, Carl Doersch, Tatiana Matejovicova, Yury Sulsky, Antoine Miech, Alex Frechette, Hanna Klimczak, Raphael Koster, Junlin Zhang, Stephanie Winkler, Yusuf Aytar, Simon Osindero, Dima Damen, Andrew Zisserman, João Carreira
We propose a novel multimodal video benchmark - the Perception Test - to evaluate the perception and reasoning skills of pre-trained multimodal models (e.g. Flamingo, SeViLA, or GPT-4).
Ranked #1 on Point Tracking on Perception Test
1 code implementation • CVPR 2023 • Toby Perrett, Saptarshi Sinha, Tilo Burghardt, Majid Mirmehdi, Dima Damen
We demonstrate that, unlike naturally-collected video datasets and existing long-tail image benchmarks, current video benchmarks fall short on multiple long-tailed properties.
1 code implementation • 1 Feb 2023 • Jaesung Huh, Jacob Chalk, Evangelos Kazakos, Dima Damen, Andrew Zisserman
We introduce EPIC-SOUNDS, a large-scale dataset of audio annotations capturing temporal extents and class labels within the audio stream of the egocentric videos.
1 code implementation • 25 Oct 2022 • Hanyuan Wang, Majid Mirmehdi, Dima Damen, Toby Perrett
We obtain state-of-the-art performance on the challenging EPIC-KITCHENS-100 and standard THUMOS14 action detection benchmarks, and achieve improvements on the ActivityNet-1.3 benchmark.
1 code implementation • 20 Oct 2022 • Alexandros Stergiou, Dima Damen
A key function of auditory cognition is the association of characteristic sounds with their corresponding semantics over time.
Ranked #4 on Audio Classification on EPIC-KITCHENS-100
1 code implementation • DeepMind 2022 • Viorica Pătrăucean, Lucas Smaira, Ankush Gupta, Adrià Recasens Continente, Larisa Markeeva, Dylan Banarse, Mateusz Malinowski, Yi Yang, Carl Doersch, Tatiana Matejovicova, Yury Sulsky, Antoine Miech, Skanda Koppula, Alex Frechette, Hanna Klimczak, Raphael Koster, Junlin Zhang, Stephanie Winkler, Yusuf Aytar, Simon Osindero, Dima Damen, Andrew Zisserman and João Carreira
We propose a novel multimodal benchmark – the Perception Test – that aims to extensively evaluate perception and reasoning skills of multimodal models.
1 code implementation • 9 Oct 2022 • Adriano Fragomeni, Michael Wray, Dima Damen
When the clip is short or visually ambiguous, knowledge of its local temporal context (i.e. surrounding video segments) can be used to improve the retrieval performance.
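A hedged sketch of the general idea (not the paper's architecture): enrich a clip embedding with its neighbouring segments via cross-attention before matching it to text. The class name, dimensions, and context size are assumptions for illustration.

```python
import torch
import torch.nn as nn

class ContextEnhancedClipEncoder(nn.Module):
    """Illustrative sketch: attend from a clip to its surrounding segments."""
    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, clip: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
        # clip: (B, 1, D) target clip embedding; context: (B, N, D) neighbours.
        fused, _ = self.attn(query=clip, key=context, value=context)
        return self.norm(clip + fused).squeeze(1)          # (B, D)

clip = torch.randn(2, 1, 512)
context = torch.randn(2, 6, 512)    # e.g. 3 segments before and 3 after the clip
print(ContextEnhancedClipEncoder()(clip, context).shape)   # torch.Size([2, 512])
```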
3 code implementations • 26 Sep 2022 • Ahmad Darkhalil, Dandan Shan, Bin Zhu, Jian Ma, Amlan Kar, Richard Higgins, Sanja Fidler, David Fouhey, Dima Damen
VISOR annotates videos from EPIC-KITCHENS, which comes with a new set of challenges not encountered in current video segmentation datasets.
no code implementations • 14 Jul 2022 • Alessandro Masullo, Toby Perrett, Tilo Burghardt, Ian Craddock, Dima Damen, Majid Mirmehdi
We propose a novel approach to multimodal sensor fusion for Ambient Assisted Living (AAL) which takes advantage of learning using privileged information (LUPI).
1 code implementation • 4 Jul 2022 • Kevin Qinghong Lin, Alex Jinpeng Wang, Mattia Soldan, Michael Wray, Rui Yan, Eric Zhongcong Xu, Difei Gao, RongCheng Tu, Wenzhe Zhao, Weijie Kong, Chengfei Cai, Hongfa Wang, Dima Damen, Bernard Ghanem, Wei Liu, Mike Zheng Shou
In this report, we propose a video-language pretraining (VLP) based solution, EgoVLP, for four Ego4D challenge tasks, including Natural Language Query (NLQ), Moment Query (MQ), Object State Change Classification (OSCC), and PNR Localization (PNR).
1 code implementation • 11 Jun 2022 • Valentin Popescu, Dima Damen, Toby Perrett
In this paper, we evaluate state-of-the-art OCR methods on egocentric data.
2 code implementations • 3 Jun 2022 • Kevin Qinghong Lin, Alex Jinpeng Wang, Mattia Soldan, Michael Wray, Rui Yan, Eric Zhongcong Xu, Difei Gao, RongCheng Tu, Wenzhe Zhao, Weijie Kong, Chengfei Cai, Hongfa Wang, Dima Damen, Bernard Ghanem, Wei Liu, Mike Zheng Shou
Video-Language Pretraining (VLP), which aims to learn transferable representation to advance a wide range of video-text downstream tasks, has recently received increasing attention.
1 code implementation • CVPR 2023 • Alexandros Stergiou, Dima Damen
We propose a bottleneck-based attention model that captures the evolution of the action, through progressive sampling over fine-to-coarse scales.
Ranked #1 on Early Action Prediction on UCF101
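The bottleneck-attention idea can be sketched, under assumptions, as a small set of latent tokens that cross-attend to frame features sampled at progressively coarser temporal scales; this is a toy illustration of fine-to-coarse sampling, not the released model, and all names and sizes are placeholders.

```python
import torch
import torch.nn as nn

class BottleneckPredictor(nn.Module):
    """Sketch: a few latent tokens cross-attend to frames sampled at
    progressively coarser temporal scales (fine -> coarse)."""
    def __init__(self, dim=256, n_latents=4, n_classes=101, heads=4):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(n_latents, dim))
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.head = nn.Linear(dim, n_classes)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (B, T, D) features of the observed (partial) video.
        B, T, _ = frames.shape
        z = self.latents.unsqueeze(0).expand(B, -1, -1)
        for stride in (1, 2, 4):                    # fine-to-coarse sampling
            sampled = frames[:, ::stride]
            z, _ = self.attn(z, sampled, sampled)   # bottleneck cross-attention
        return self.head(z.mean(dim=1))             # class logits

print(BottleneckPredictor()(torch.randn(2, 32, 256)).shape)  # (2, 101)
```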
1 code implementation • 19 Apr 2022 • Dena Bazazian, Andrew Calway, Dima Damen
We build on the successes of few-shot StyleGAN and single-shot semantic segmentation to minimise the amount of training required in utilising two domains.
no code implementations • 13 Jan 2022 • Jian Ma, Dima Damen
This paper proposes an interaction reasoning network for modelling spatio-temporal relationships between hands and objects in video.
1 code implementation • 2 Jan 2022 • Hanyuan Wang, Dima Damen, Majid Mirmehdi, Toby Perrett
This incorporates a novel Voting Evidence Module to locate temporal boundaries more accurately, accumulating temporal contextual evidence to predict frame-level probabilities of start and end action boundaries.
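The voting intuition can be illustrated with a minimal, assumed sketch (not the released module): each frame casts a weighted vote for where a boundary lies, and votes are accumulated into a per-frame evidence curve. Function and parameter names are hypothetical.

```python
import numpy as np

def accumulate_boundary_votes(votes: np.ndarray, offsets: np.ndarray, T: int) -> np.ndarray:
    """Illustrative voting: each frame t casts a weighted vote that a boundary
    lies at t + offsets[t]; votes are accumulated per target frame.

    votes:   (T,) confidence of each frame's vote.
    offsets: (T,) predicted signed offset (in frames) to the boundary.
    Returns a normalised per-frame boundary evidence curve of length T.
    """
    evidence = np.zeros(T)
    targets = np.clip(np.arange(T) + offsets.round().astype(int), 0, T - 1)
    np.add.at(evidence, targets, votes)          # accumulate contextual votes
    return evidence / (evidence.max() + 1e-8)

rng = np.random.default_rng(0)
T = 100
print(accumulate_boundary_votes(rng.random(T), rng.integers(-5, 6, T), T)[:10])
```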
1 code implementation • CVPR 2022 • Will Price, Carl Vondrick, Dima Damen
Our lives can be seen as a complex weaving of activities; we switch from one activity to another, to maximise our achievements or in reaction to demands placed upon us.
1 code implementation • 1 Nov 2021 • Evangelos Kazakos, Jaesung Huh, Arsha Nagrani, Andrew Zisserman, Dima Damen
We capitalise on the action's temporal context and propose a method that learns to attend to surrounding actions in order to improve recognition performance.
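Attending to surrounding actions can be sketched, under assumptions, as a transformer encoder over a window of neighbouring action features, classifying the centre action from its contextualised token. This is an illustrative stand-in, not the paper's model; the class count and dimensions are placeholders.

```python
import torch
import torch.nn as nn

class ActionContextClassifier(nn.Module):
    """Sketch: encode a window of consecutive action features with a
    transformer and classify the centre action from its contextualised token."""
    def __init__(self, dim=512, n_classes=100, heads=8, layers=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)
        self.head = nn.Linear(dim, n_classes)

    def forward(self, window: torch.Tensor) -> torch.Tensor:
        # window: (B, N, D) features of N surrounding actions (centre in the middle).
        ctx = self.encoder(window)
        centre = ctx[:, window.shape[1] // 2]   # contextualised centre action
        return self.head(centre)

print(ActionContextClassifier()(torch.randn(2, 5, 512)).shape)  # (2, 100)
```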
no code implementations • 25 Oct 2021 • Jonathan Munro, Michael Wray, Diane Larlus, Gabriela Csurka, Dima Damen
Given a gallery of uncaptioned video sequences, this paper considers the task of retrieving videos based on their relevance to an unseen text query.
8 code implementations • CVPR 2022 • Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu, Xingyu Liu, Miguel Martin, Tushar Nagarajan, Ilija Radosavovic, Santhosh Kumar Ramakrishnan, Fiona Ryan, Jayant Sharma, Michael Wray, Mengmeng Xu, Eric Zhongcong Xu, Chen Zhao, Siddhant Bansal, Dhruv Batra, Vincent Cartillier, Sean Crane, Tien Do, Morrie Doulaty, Akshay Erapalli, Christoph Feichtenhofer, Adriano Fragomeni, Qichen Fu, Abrham Gebreselasie, Cristina Gonzalez, James Hillis, Xuhua Huang, Yifei HUANG, Wenqi Jia, Weslie Khoo, Jachym Kolar, Satwik Kottur, Anurag Kumar, Federico Landini, Chao Li, Yanghao Li, Zhenqiang Li, Karttikeya Mangalam, Raghava Modhugu, Jonathan Munro, Tullie Murrell, Takumi Nishiyasu, Will Price, Paola Ruiz Puentes, Merey Ramazanova, Leda Sari, Kiran Somasundaram, Audrey Southerland, Yusuke Sugano, Ruijie Tao, Minh Vo, Yuchen Wang, Xindi Wu, Takuma Yagi, Ziwei Zhao, Yunyi Zhu, Pablo Arbelaez, David Crandall, Dima Damen, Giovanni Maria Farinella, Christian Fuegen, Bernard Ghanem, Vamsi Krishna Ithapu, C. V. Jawahar, Hanbyul Joo, Kris Kitani, Haizhou Li, Richard Newcombe, Aude Oliva, Hyun Soo Park, James M. Rehg, Yoichi Sato, Jianbo Shi, Mike Zheng Shou, Antonio Torralba, Lorenzo Torresani, Mingfei Yan, Jitendra Malik
We introduce Ego4D, a massive-scale egocentric video dataset and benchmark suite.
3 code implementations • CVPR 2021 • Michael Wray, Hazel Doughty, Dima Damen
Current video retrieval efforts all found their evaluation on an instance-based assumption, that only a single caption is relevant to a query video and vice versa.
2 code implementations • 5 Mar 2021 • Evangelos Kazakos, Arsha Nagrani, Andrew Zisserman, Dima Damen
We propose a two-stream convolutional network for audio recognition, that operates on time-frequency spectrogram inputs.
Ranked #1 on Human Interaction Recognition on EPIC-SOUNDS
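A toy sketch of the two-stream idea, assuming log-mel spectrogram input: one stream sees a temporally subsampled spectrogram, the other the full-rate one, and their pooled features are fused before classification. This is not the released model; channel widths, strides, and the class count are illustrative.

```python
import torch
import torch.nn as nn

class TwoStreamAudioNet(nn.Module):
    """Toy sketch: a slow (low temporal rate) and a fast (high temporal rate)
    2D-conv stream over a spectrogram, fused by concatenation."""
    def __init__(self, n_classes=10):
        super().__init__()
        def stream(ch):
            return nn.Sequential(
                nn.Conv2d(1, ch, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(ch, ch, 3, stride=2, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.slow, self.fast = stream(64), stream(8)
        self.fc = nn.Linear(64 + 8, n_classes)

    def forward(self, spec: torch.Tensor) -> torch.Tensor:
        # spec: (B, 1, T, F) log-mel spectrogram.
        slow_in = spec[:, :, ::4]   # temporally subsampled input for the slow stream
        return self.fc(torch.cat([self.slow(slow_in), self.fast(spec)], dim=1))

print(TwoStreamAudioNet()(torch.randn(2, 1, 400, 128)).shape)  # (2, 10)
```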
2 code implementations • CVPR 2021 • Toby Perrett, Alessandro Masullo, Tilo Burghardt, Majid Mirmehdi, Dima Damen
We propose a novel approach to few-shot action recognition, finding temporally-corresponding frame tuples between the query and videos in the support set.
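Tuple correspondence can be illustrated with a simplified, non-learned sketch: build ordered frame-pair representations for the query and each support video, take each query pair's best cosine match among support pairs, and average. The helper names are assumptions; the paper's learned matching is not reproduced here.

```python
import torch
import torch.nn.functional as F
from itertools import combinations

def pair_representations(video: torch.Tensor) -> torch.Tensor:
    """Build ordered frame-pair features by concatenating frame embeddings.
    video: (T, D) -> (num_pairs, 2*D)"""
    idx = list(combinations(range(video.shape[0]), 2))
    return torch.stack([torch.cat([video[i], video[j]]) for i, j in idx])

def tuple_match_score(query: torch.Tensor, support: torch.Tensor) -> torch.Tensor:
    """For each query pair, take its best cosine match among support pairs,
    then average -- an illustrative stand-in for learned tuple matching."""
    q = F.normalize(pair_representations(query), dim=1)
    s = F.normalize(pair_representations(support), dim=1)
    return (q @ s.T).max(dim=1).values.mean()

# Toy usage: score a query clip against one support clip per class (5-way, 1-shot).
query = torch.randn(8, 256)
supports = [torch.randn(8, 256) for _ in range(5)]
scores = torch.stack([tuple_match_score(query, s) for s in supports])
print(scores.argmax().item())   # predicted class index
```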
1 code implementation • 24 Nov 2020 • Will Price, Dima Damen
We offer detailed analysis of supporting/distracting frames, and the relationships of ESVs to the frame's position, class prediction, and sequence length.
1 code implementation • 22 Aug 2020 • Dima Damen, Michael Wray
We propose a three-dimensional discrete and incremental scale to encode a method's level of supervision - i.e. the data and labels used when training a model to achieve a given performance.
1 code implementation • 29 Jul 2020 • Toby Perrett, Alessandro Masullo, Tilo Burghardt, Majid Mirmehdi, Dima Damen
This produces an initialisation for fine-tuning to target which is both context-agnostic and task-generalised.
7 code implementations • 23 Jun 2020 • Dima Damen, Hazel Doughty, Giovanni Maria Farinella, Antonino Furnari, Evangelos Kazakos, Jian Ma, Davide Moltisanti, Jonathan Munro, Toby Perrett, Will Price, Michael Wray
This paper introduces the pipeline to extend the largest dataset in egocentric vision, EPIC-KITCHENS.
Ranked #8 on Action Anticipation on EPIC-KITCHENS-100
2 code implementations • 29 Apr 2020 • Dima Damen, Hazel Doughty, Giovanni Maria Farinella, Sanja Fidler, Antonino Furnari, Evangelos Kazakos, Davide Moltisanti, Jonathan Munro, Toby Perrett, Will Price, Michael Wray
Our dataset features 55 hours of video consisting of 11.5M frames, which we densely labelled for a total of 39.6K action segments and 454.2K object bounding boxes.
1 code implementation • CVPR 2020 • Jonathan Munro, Dima Damen
We then combine adversarial training with multi-modal self-supervision, showing that our approach outperforms other UDA methods by 3%.
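The two ingredients can be sketched under assumptions: adversarial domain alignment via a gradient-reversal domain classifier, plus a self-supervised head predicting whether RGB and flow features come from the same clip. This is an illustration of the general recipe, not the released code; the linear heads and the crude summation of modality features are simplifications.

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; negated, scaled gradient in the backward pass."""
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None

def grad_reverse(x, lambd=1.0):
    return GradReverse.apply(x, lambd)

# Illustrative heads: a domain classifier fed through gradient reversal, and a
# binary head predicting whether an RGB/flow pair comes from the same clip.
domain_clf = nn.Linear(512, 2)      # source vs target domain
corresp_clf = nn.Linear(1024, 2)    # same clip vs mismatched pair

rgb, flow = torch.randn(4, 512), torch.randn(4, 512)
domain_logits = domain_clf(grad_reverse(rgb + flow))      # adversarial alignment
corresp_logits = corresp_clf(torch.cat([rgb, flow], 1))   # self-supervised task
print(domain_logits.shape, corresp_logits.shape)
```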
1 code implementation • CVPR 2020 • Hazel Doughty, Ivan Laptev, Walterio Mayol-Cuevas, Dima Damen
We present a method to learn a representation for adverbs from instructional videos using weak supervision from the accompanying narrations.
no code implementations • 22 Oct 2019 • Farnoosh Heidarivincheh, Majid Mirmehdi, Dima Damen
In this work, we target detecting the completion moment of actions, that is, the moment when the action's goal has been successfully accomplished.
no code implementations • 3 Oct 2019 • Alessandro Masullo, Tilo Burghardt, Toby Perrett, Dima Damen, Majid Mirmehdi
We present the first fully automated Sit-to-Stand or Stand-to-Sit (StS) analysis framework for long-term monitoring of patients in free-living environments using video silhouettes.
no code implementations • 20 Sep 2019 • Will Price, Dima Damen
We investigate video transforms that result in class-homogeneous label-transforms.
1 code implementation • ICCV 2019 • Evangelos Kazakos, Arsha Nagrani, Andrew Zisserman, Dima Damen
We focus on multi-modal fusion for egocentric action recognition, and propose a novel architecture for multi-modal temporal-binding, i.e. the combination of modalities within a range of temporal offsets.
Ranked #2 on Egocentric Activity Recognition on EPIC-KITCHENS-55
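The binding idea can be sketched, under assumptions, on the sampling side: draw each modality's snippet from the same temporal window (allowing asynchronous offsets between modalities), then fuse their mid-level features before classification. Window size, feature dimensions, and the fusion head are placeholders, not the paper's architecture.

```python
import random
import torch
import torch.nn as nn

def sample_binding_window(video_len: int, window: int = 32):
    """Draw per-modality frame indices from one shared temporal window,
    allowing asynchronous offsets between modalities (illustrative)."""
    start = random.randint(0, video_len - window)
    return {m: random.randint(start, start + window - 1)
            for m in ("rgb", "flow", "audio")}

# Mid-level fusion of per-modality features sampled from one binding window.
fusion = nn.Sequential(nn.Linear(3 * 256, 512), nn.ReLU(), nn.Linear(512, 10))
feats = {m: torch.randn(1, 256) for m in ("rgb", "flow", "audio")}
print(sample_binding_window(video_len=300))
print(fusion(torch.cat([feats["rgb"], feats["flow"], feats["audio"]], dim=1)).shape)
```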
no code implementations • ICCV 2019 • Michael Wray, Diane Larlus, Gabriela Csurka, Dima Damen
We report the first retrieval results on fine-grained actions for the large-scale EPIC dataset, in a generalised zero-shot setting.
2 code implementations • 2 Aug 2019 • Will Price, Dima Damen
We benchmark contemporary action recognition models (TSN, TRN, and TSM) on the recently introduced EPIC-Kitchens dataset and release pretrained models on GitHub (https://github.com/epic-kitchens/action-models) for others to build upon.
1 code implementation • 25 Jul 2019 • Michael Wray, Dima Damen
We collect multi-verb annotations for three action video datasets and evaluate the verb-only labelling representations for action recognition and cross-modal retrieval (video-to-text and text-to-video).
no code implementations • CVPR 2019 • Toby Perrett, Dima Damen
Domain alignment in convolutional networks aims to learn the degree of layer-specific feature alignment beneficial to the joint learning of source and target datasets.
1 code implementation • CVPR 2019 • Davide Moltisanti, Sanja Fidler, Dima Damen
We propose a method that is supervised by single timestamps located around each action instance, in untrimmed videos.
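One way to picture single-timestamp supervision, under assumptions: weight frames by a plateau-like distribution centred on the annotated timestamp and sample training frames from it. The function and its width/decay parameters are illustrative, not the paper's exact formulation.

```python
import numpy as np

def plateau_weights(T: int, t0: float, width: float = 20.0, decay: float = 0.2) -> np.ndarray:
    """Illustrative plateau-shaped sampling distribution centred on a single
    annotated timestamp t0: roughly flat near t0, decaying smoothly outside."""
    t = np.arange(T)
    w = 1.0 / ((np.exp(decay * (t - t0 - width / 2)) + 1) *
               (np.exp(decay * (t0 - t - width / 2)) + 1))
    return w / w.sum()

T, t0 = 200, 75.0
w = plateau_weights(T, t0)
rng = np.random.default_rng(0)
train_frames = rng.choice(T, size=16, p=w)   # frames sampled around the timestamp
print(sorted(train_frames))
```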
1 code implementation • CVPR 2019 • Hazel Doughty, Walterio Mayol-Cuevas, Dima Damen
In addition to attending to task relevant video parts, our proposed loss jointly trains two attention modules to separately attend to video parts which are indicative of higher (pros) and lower (cons) skill.
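A hedged sketch of the pros/cons idea: two attention modules pool segment features into a video-level skill score, trained with a margin ranking loss so that higher-skill videos score above lower-skill ones. The module and loss below are illustrative stand-ins, not the paper's model.

```python
import torch
import torch.nn as nn

class RankAwareAttention(nn.Module):
    """Sketch: separate 'pros' and 'cons' attention over video segments,
    combined into a single skill score (illustrative only)."""
    def __init__(self, dim=512):
        super().__init__()
        self.pros_attn = nn.Linear(dim, 1)
        self.cons_attn = nn.Linear(dim, 1)
        self.score = nn.Linear(2 * dim, 1)

    def pool(self, feats, attn):
        w = torch.softmax(attn(feats), dim=1)   # (B, T, 1) attention weights
        return (w * feats).sum(dim=1)           # (B, D)

    def forward(self, feats):
        # feats: (B, T, D) segment features of one video.
        pooled = torch.cat([self.pool(feats, self.pros_attn),
                            self.pool(feats, self.cons_attn)], dim=1)
        return self.score(pooled).squeeze(1)    # (B,) skill score

model = RankAwareAttention()
higher, lower = torch.randn(4, 20, 512), torch.randn(4, 20, 512)
loss = nn.MarginRankingLoss(margin=1.0)(
    model(higher), model(lower), torch.ones(4))  # higher-skill should outrank lower
print(loss.item())
```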
no code implementations • 21 Jun 2018 • Alessandro Masullo, Tilo Burghardt, Dima Damen, Sion Hannuna, Victor Ponce-López, Majid Mirmehdi
We propose a novel deep fusion architecture, CaloriNet, for the online estimation of energy expenditure for free living monitoring in private environments, where RGB data is discarded and replaced by silhouettes.
no code implementations • 11 Jun 2018 • Víctor Ponce-López, Tilo Burghardt, Sion Hannuna, Dima Damen, Alessandro Masullo, Majid Mirmehdi
We present a deep person re-identification approach that combines semantically selective, deep data augmentation with clustering-based network compression to generate high performance, light and fast inference networks.
1 code implementation • 17 May 2018 • Farnoosh Heidarivincheh, Majid Mirmehdi, Dima Damen
The paper proposes a joint classification-regression recurrent model that predicts completion from a given frame, and then integrates frame-level contributions to detect the sequence-level completion moment.
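The classification-regression split can be sketched, under assumptions, as an LSTM with two per-frame heads whose outputs are then aggregated into a sequence-level decision. The class, head definitions, and aggregation rule below are illustrative, not the paper's implementation.

```python
import torch
import torch.nn as nn

class CompletionDetector(nn.Module):
    """Sketch: recurrent model with per-frame classification (completed?) and
    regression (signed offset to the completion frame) heads."""
    def __init__(self, feat_dim=512, hidden=256):
        super().__init__()
        self.rnn = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.cls = nn.Linear(hidden, 2)    # not-yet / completed
        self.reg = nn.Linear(hidden, 1)    # offset (in frames) to completion

    def forward(self, frames):
        # frames: (B, T, feat_dim)
        h, _ = self.rnn(frames)
        return self.cls(h), self.reg(h).squeeze(-1)

model = CompletionDetector()
logits, offsets = model(torch.randn(2, 50, 512))
# Crude sequence-level decision: the earliest frame classified as completed.
completed = logits.argmax(-1).bool()                 # (B, T), True where class 1
first = [(row.nonzero()[0].item() if row.any() else None) for row in completed]
print(first)
```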
no code implementations • 10 May 2018 • Michael Wray, Davide Moltisanti, Dima Damen
This work introduces verb-only representations for actions and interactions; the problem of describing similar motions (e.g. 'open door', 'open cupboard') and distinguishing differing ones (e.g. 'open door' vs 'open bottle') using verb-only labels.
2 code implementations • ECCV 2018 • Dima Damen, Hazel Doughty, Giovanni Maria Farinella, Sanja Fidler, Antonino Furnari, Evangelos Kazakos, Davide Moltisanti, Jonathan Munro, Toby Perrett, Will Price, Michael Wray
First-person vision is gaining interest as it offers a unique viewpoint on people's interaction with objects, their attention, and even intention.
no code implementations • 6 Oct 2017 • Farnoosh Heidarivincheh, Majid Mirmehdi, Dima Damen
Action completion detection is the problem of modelling the action's progression towards localising the moment of completion - when the action's goal is confidently considered achieved.
no code implementations • CVPR 2018 • Hazel Doughty, Dima Damen, Walterio Mayol-Cuevas
We present a method for assessing skill from video, applicable to a variety of tasks, ranging from surgery to drawing and rolling pizza dough.
no code implementations • ICCV 2017 • Davide Moltisanti, Michael Wray, Walterio Mayol-Cuevas, Dima Damen
Manual annotations of temporal bounds for object interactions (i.e. start and end times) are typical training input to recognition, localization and detection algorithms.
no code implementations • 24 Mar 2017 • Michael Wray, Davide Moltisanti, Walterio Mayol-Cuevas, Dima Damen
This work deviates from easy-to-define class boundaries for object interactions.
no code implementations • 28 Jul 2016 • Michael Wray, Davide Moltisanti, Walterio Mayol-Cuevas, Dima Damen
We present SEMBED, an approach for embedding an egocentric object interaction video in a semantic-visual graph to estimate the probability distribution over its potential semantic labels.
no code implementations • 27 Jul 2016 • Lili Tao, Tilo Burghardt, Majid Mirmehdi, Dima Damen, Ashley Cooper, Sion Hannuna, Massimo Camplani, Adeline Paiement, Ian Craddock
We present a new framework for vision-based estimation of calorific expenditure from RGB-D data - the first that is validated on physical gas exchange measurements and applied to daily living scenarios.
no code implementations • 14 Jun 2016 • Massimo Camplani, Adeline Paiement, Majid Mirmehdi, Dima Damen, Sion Hannuna, Tilo Burghardt, Lili Tao
Finally, we present a brief comparative evaluation of the performance of those works that have applied their methods to these datasets.
no code implementations • 16 Oct 2015 • Dima Damen, Teesid Leelasawassuk, Walterio Mayol-Cuevas
This paper presents an unsupervised approach towards automatically extracting video-based guidance on object usage, from egocentric video and wearable gaze tracking, collected from multiple users while performing tasks.