no code implementations • 24 Apr 2023 • Alexey Gritsenko, Xuehan Xiong, Josip Djolonga, Mostafa Dehghani, Chen Sun, Mario Lučić, Cordelia Schmid, Anurag Arnab
The most performant spatio-temporal action localisation models use external person proposals and complex external memory banks.
no code implementations • 5 Apr 2023 • Kumara Kahatapitiya, Anurag Arnab, Arsha Nagrani, Michael S. Ryoo
All such recipes rely on augmenting visual embeddings with temporal information (i.e., image → video), often keeping text embeddings unchanged or even discarding them.
Ranked #7 on Zero-Shot Action Recognition on HMDB51
1 code implementation • 21 Mar 2023 • Seokju Cho, Heeseong Shin, Sunghwan Hong, Seungjun An, Seungjun Lee, Anurag Arnab, Paul Hongsuck Seo, Seungryong Kim
However, transferring capabilities learned from image-level supervision to the pixel-level task of segmentation, while handling arbitrary unseen categories at inference, makes this task challenging.
no code implementations • 10 Feb 2023 • Mostafa Dehghani, Josip Djolonga, Basil Mustafa, Piotr Padlewski, Jonathan Heek, Justin Gilmer, Andreas Steiner, Mathilde Caron, Robert Geirhos, Ibrahim Alabdulmohsin, Rodolphe Jenatton, Lucas Beyer, Michael Tschannen, Anurag Arnab, Xiao Wang, Carlos Riquelme, Matthias Minderer, Joan Puigcerver, Utku Evci, Manoj Kumar, Sjoerd van Steenkiste, Gamaleldin F. Elsayed, Aravindh Mahendran, Fisher Yu, Avital Oliver, Fantine Huot, Jasmijn Bastings, Mark Patrick Collier, Alexey Gritsenko, Vighnesh Birodkar, Cristina Vasconcelos, Yi Tay, Thomas Mensink, Alexander Kolesnikov, Filip Pavetić, Dustin Tran, Thomas Kipf, Mario Lučić, Xiaohua Zhai, Daniel Keysers, Jeremiah Harmsen, Neil Houlsby
The scaling of Transformers has driven breakthrough capabilities for language models.
Ranked #1 on Linear-Probe Classification on ImageNet (using extra training data)
1 code implementation • 30 Jan 2023 • Fuzhao Xue, Valerii Likhosherstov, Anurag Arnab, Neil Houlsby, Mostafa Dehghani, Yang You
However, most standard neural networks have the same function type and fixed computation budget on different samples regardless of their nature and difficulty.
no code implementations • CVPR 2023 • Xingyi Zhou, Anurag Arnab, Chen Sun, Cordelia Schmid
In this paper, we investigate how we can use knowledge of objects to design better video models, namely to process fewer tokens and to improve recognition accuracy.
no code implementations • 9 Dec 2022 • Mariana-Iuliana Georgescu, Eduardo Fonseca, Radu Tudor Ionescu, Mario Lucic, Cordelia Schmid, Anurag Arnab
Can we leverage the audiovisual information already present in video to improve self-supervised representation learning?
1 code implementation • CVPR 2023 • Michael S. Ryoo, Keerthana Gopalakrishnan, Kumara Kahatapitiya, Ted Xiao, Kanishka Rao, Austin Stone, Yao Lu, Julian Ibarz, Anurag Arnab
The model's memory module ensures that a new observation will only be processed with the contents of the memory (and not the entire history), meaning that it can efficiently process long sequences with a bounded computational cost at each step.
Ranked #1 on Action Detection on Charades
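The bounded-cost memory idea above can be sketched in a few lines: each step combines the new observation's tokens with a small fixed-size token memory only, never the full history. This is a minimal NumPy sketch, not the paper's architecture; the query weights, sizes, and update rule here are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def memory_step(memory, observation, w_query):
    """One update of a fixed-size token memory.

    memory:      (k, d) tokens summarising everything seen so far
    observation: (n, d) tokens from the current frame only
    w_query:     (k, d) learned queries that re-summarise back to k tokens

    The new observation is combined with the k memory tokens, never the
    full history, so the per-step cost stays O((k + n) * d) regardless of
    sequence length.
    """
    pool = np.concatenate([memory, observation], axis=0)       # (k + n, d)
    attn = softmax(w_query @ pool.T / np.sqrt(pool.shape[1]))  # (k, k + n)
    return attn @ pool                                         # new (k, d) memory

# Usage: stream 100 frames through a memory of 8 tokens.
rng = np.random.default_rng(0)
k, n, d = 8, 16, 32
memory = np.zeros((k, d))
w_query = rng.normal(size=(k, d))
for _ in range(100):
    frame_tokens = rng.normal(size=(n, d))
    memory = memory_step(memory, frame_tokens, w_query)
assert memory.shape == (k, d)
```

Because the memory size k is constant, processing a sequence twice as long simply runs twice as many identical-cost steps.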
2 code implementations • 20 Sep 2022 • Li Zhang, Mohan Chen, Anurag Arnab, Xiangyang Xue, Philip H. S. Torr
A fully-connected graph, such as the self-attention operation in Transformers, is beneficial for such modelling; however, its computational overhead is prohibitive.
no code implementations • 8 Jul 2022 • Anurag Arnab, Xuehan Xiong, Alexey Gritsenko, Rob Romijnders, Josip Djolonga, Mostafa Dehghani, Chen Sun, Mario Lučić, Cordelia Schmid
Transfer learning is the predominant paradigm for training deep networks on small target datasets.
no code implementations • 20 Jun 2022 • Xuehan Xiong, Anurag Arnab, Arsha Nagrani, Cordelia Schmid
This report describes the approach behind our winning solution to the 2022 Epic-Kitchens Action Recognition Challenge.
Ranked #1 on Action Recognition on EPIC-KITCHENS-100 (using extra training data)
2 code implementations • 12 May 2022 • Matthias Minderer, Alexey Gritsenko, Austin Stone, Maxim Neumann, Dirk Weissenborn, Alexey Dosovitskiy, Aravindh Mahendran, Anurag Arnab, Mostafa Dehghani, Zhuoran Shen, Xiao Wang, Xiaohua Zhai, Thomas Kipf, Neil Houlsby
Combining simple architectures with large-scale pre-training has led to massive improvements in image classification.
Ranked #1 on One-Shot Object Detection on COCO
1 code implementation • CVPR 2022 • Ahmet Iscen, Jack Valmadre, Anurag Arnab, Cordelia Schmid
Recent advances in deep learning have relied on large, labelled datasets to train high-capacity models.
no code implementations • CVPR 2022 • Paul Hongsuck Seo, Arsha Nagrani, Anurag Arnab, Cordelia Schmid
Recent video and language pretraining frameworks lack the ability to generate sentences.
Ranked #8 on Video Captioning on MSR-VTT (using extra training data)
1 code implementation • CVPR 2022 • Shen Yan, Xuehan Xiong, Anurag Arnab, Zhichao Lu, Mi Zhang, Chen Sun, Cordelia Schmid
Video understanding requires reasoning at multiple spatiotemporal resolutions -- from short fine-grained motions to events taking place over longer durations.
Ranked #2 on Action Recognition on EPIC-KITCHENS-100 (using extra training data)
1 code implementation • NeurIPS 2021 • Michael Ryoo, AJ Piergiovanni, Anurag Arnab, Mostafa Dehghani, Anelia Angelova
In this paper, we introduce a novel visual representation learning approach that relies on a handful of adaptively learned tokens, and is applicable to both image and video understanding tasks.
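The "handful of adaptively learned tokens" can be illustrated with a short sketch: learned spatial attention maps pool a large feature grid down to a few tokens. This is a minimal NumPy sketch under assumed shapes and weights, not the paper's implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def learn_tokens(features, w_attn):
    """Reduce a (p, d) flattened spatial feature map to s adaptive tokens.

    features: (p, d) spatial features (e.g. a 14x14 grid flattened to p=196)
    w_attn:   (s, d) weights producing s spatial attention maps

    Each output token is a spatially weighted average of the input, with
    weights computed from the input itself, so the s tokens adapt to the
    content of each image or frame.
    """
    maps = sigmoid(features @ w_attn.T)                 # (p, s) attention maps
    maps = maps / (maps.sum(axis=0, keepdims=True) + 1e-6)  # normalise per map
    return maps.T @ features                            # (s, d) learned tokens

# Usage: 196 patch features collapse to 8 tokens for downstream layers.
rng = np.random.default_rng(0)
p, d, s = 196, 64, 8
tokens = learn_tokens(rng.normal(size=(p, d)), rng.normal(size=(s, d)))
assert tokens.shape == (s, d)
```

Downstream attention layers then operate on s tokens instead of p, which is where the computational saving comes from.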
no code implementations • 25 Nov 2021 • Valerii Likhosherstov, Anurag Arnab, Krzysztof Choromanski, Mario Lucic, Yi Tay, Adrian Weller, Mostafa Dehghani
Can we train a single transformer model capable of processing multiple modalities and datasets, whilst sharing almost all of its learnable parameters?
no code implementations • ICLR 2022 • Mostafa Dehghani, Anurag Arnab, Lucas Beyer, Ashish Vaswani, Yi Tay
We further present suggestions to improve reporting of efficiency metrics.
1 code implementation • CVPR 2022 • Mostafa Dehghani, Alexey Gritsenko, Anurag Arnab, Matthias Minderer, Yi Tay
Scenic is an open-source JAX library with a focus on Transformer-based models for computer vision research and beyond.
1 code implementation • NeurIPS 2021 • Kuang-Huei Lee, Anurag Arnab, Sergio Guadarrama, John Canny, Ian Fischer
We verify this by developing SimCLR and BYOL formulations compatible with the Conditional Entropy Bottleneck (CEB) objective, allowing us to both measure and control the amount of compression in the learned representation, and observe their impact on downstream tasks.
Ranked #29 on Self-Supervised Image Classification on ImageNet
1 code implementation • NeurIPS 2021 • Arsha Nagrani, Shan Yang, Anurag Arnab, Aren Jansen, Cordelia Schmid, Chen Sun
Humans perceive the world by concurrently processing and fusing high-dimensional inputs from multiple modalities such as vision and audio.
Ranked #1 on Audio Classification on VGGSound (Top 5 Accuracy metric)
4 code implementations • 21 Jun 2021 • Michael S. Ryoo, AJ Piergiovanni, Anurag Arnab, Mostafa Dehghani, Anelia Angelova
In this paper, we introduce a novel visual representation learning approach that relies on a handful of adaptively learned tokens, and is applicable to both image and video understanding tasks.
Ranked #1 on Action Classification on Charades
no code implementations • ICCV 2021 • Anurag Arnab, Chen Sun, Cordelia Schmid
Accurate video understanding involves reasoning about the relationships between actors, objects and their environment, often over long temporal intervals.
5 code implementations • ICCV 2021 • Anurag Arnab, Mostafa Dehghani, Georg Heigold, Chen Sun, Mario Lučić, Cordelia Schmid
We present pure-transformer based models for video classification, drawing upon the recent success of such models in image classification.
Ranked #8 on Action Classification on Moments in Time (Top 5 Accuracy metric, using extra training data)
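A common way such pure-transformer video models tokenise their input is "tubelet embedding": the clip is split into small spatio-temporal cubes, each linearly projected to a token. The sketch below is a plain NumPy illustration with made-up sizes and random projection weights, not the published model.

```python
import numpy as np

def tubelet_embed(video, t, h, w, proj):
    """Split a video into non-overlapping spatio-temporal tubelets
    and linearly project each to a token.

    video: (T, H, W, C) clip; t, h, w must divide T, H, W
    proj:  (t*h*w*C, d) linear projection to token dimension d
    """
    T, H, W, C = video.shape
    tokens = []
    for ti in range(0, T, t):
        for hi in range(0, H, h):
            for wi in range(0, W, w):
                cube = video[ti:ti + t, hi:hi + h, wi:wi + w]  # one tubelet
                tokens.append(cube.reshape(-1) @ proj)
    return np.stack(tokens)  # (num_tubelets, d)

# Usage: an 8-frame 32x32 RGB clip with 2x8x8 tubelets gives 64 tokens.
rng = np.random.default_rng(0)
clip = rng.normal(size=(8, 32, 32, 3))
proj = rng.normal(size=(2 * 8 * 8 * 3, 16))
toks = tubelet_embed(clip, 2, 8, 8, proj)
assert toks.shape == (4 * 4 * 4, 16)  # (8/2) * (32/8) * (32/8) tubelets
```

The resulting token sequence can then be fed to a standard transformer encoder, exactly as patch tokens are in image classification.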
no code implementations • ECCV 2020 • Anurag Arnab, Chen Sun, Arsha Nagrani, Cordelia Schmid
Despite the recent advances in video classification, progress in spatio-temporal action recognition has lagged behind.
5 code implementations • 13 Sep 2019 • Li Zhang, Xiangtai Li, Anurag Arnab, Kuiyuan Yang, Yunhai Tong, Philip H. S. Torr
Exploiting long-range contextual information is key for pixel-wise prediction tasks such as semantic segmentation.
Ranked #26 on Semantic Segmentation on Cityscapes test
1 code implementation • CVPR 2020 • Li Zhang, Dan Xu, Anurag Arnab, Philip H. S. Torr
We propose a dynamic graph message passing network, that significantly reduces the computational complexity compared to related works modelling a fully-connected graph.
1 code implementation • CVPR 2019 • Anurag Arnab, Carl Doersch, Andrew Zisserman
We present a bundle-adjustment-based algorithm for recovering accurate 3D human pose and meshes from monocular videos.
Ranked #1 on Monocular 3D Human Pose Estimation on Human3.6M (Use Video Sequence metric)
1 code implementation • 4 Dec 2018 • Harkirat Singh Behl, Mohammad Najafi, Anurag Arnab, Philip H. S. Torr
We address this problem by considering the task of video object segmentation.
1 code implementation • ECCV 2018 • Qizhu Li, Anurag Arnab, Philip H. S. Torr
We present a weakly supervised model that jointly performs both semantic- and instance-segmentation -- a particularly relevant problem given the substantial cost of obtaining pixel-perfect annotation for these tasks.
Ranked #31 on Panoptic Segmentation on Cityscapes val
1 code implementation • CVPR 2018 • Anurag Arnab, Ondrej Miksik, Philip H. S. Torr
Deep Neural Networks (DNNs) have demonstrated exceptional performance on most recognition tasks such as image classification and segmentation.
1 code implementation • 11 Sep 2017 • Qizhu Li, Anurag Arnab, Philip H. S. Torr
We address this problem by segmenting the parts of objects at an instance-level, such that each pixel in the image is assigned a part label, as well as the identity of the object it belongs to.
Ranked #2 on Multi-Human Parsing on PASCAL-Part
1 code implementation • CVPR 2017 • Anurag Arnab, Philip H. S. Torr
This subnetwork uses the initial category-level segmentation, along with cues from the output of an object detector, within an end-to-end CRF to predict instances.
Ranked #8 on Instance Segmentation on Cityscapes test
no code implementations • 24 Jan 2017 • Måns Larsson, Anurag Arnab, Fredrik Kahl, Shuai Zheng, Philip Torr
It is empirically demonstrated that such learned potentials can improve segmentation accuracy and that certain label class interactions are indeed better modelled by a non-Gaussian potential.
no code implementations • 8 Sep 2016 • Anurag Arnab, Philip H. S. Torr
Traditional Scene Understanding problems such as Object Detection and Semantic Segmentation have made breakthroughs in recent years due to the adoption of deep learning.
no code implementations • 10 Jan 2016 • Anurag Arnab, Michael Sapienza, Stuart Golodetz, Julien Valentin, Ondrej Miksik, Shahram Izadi, Philip Torr
It is not always possible to recognise objects and infer material properties for a scene from visual cues alone, since objects can look visually similar whilst being made of very different materials.
1 code implementation • 25 Nov 2015 • Anurag Arnab, Sadeep Jayasumana, Shuai Zheng, Philip Torr
Recent deep learning approaches have incorporated CRFs into Convolutional Neural Networks (CNNs), with some even training the CRF end-to-end with the rest of the network.
Ranked #55 on Semantic Segmentation on PASCAL Context
no code implementations • 13 Oct 2015 • Stuart Golodetz, Michael Sapienza, Julien P. C. Valentin, Vibhav Vineet, Ming-Ming Cheng, Anurag Arnab, Victor A. Prisacariu, Olaf Kähler, Carl Yuheng Ren, David W. Murray, Shahram Izadi, Philip H. S. Torr
We present an open-source, real-time implementation of SemanticPaint, a system for geometric reconstruction, object-class segmentation and learning of 3D scenes.