Search Results for author: Anurag Arnab

Found 46 papers, 28 papers with code

Higher Order Conditional Random Fields in Deep Neural Networks

1 code implementation · 25 Nov 2015 · Anurag Arnab, Sadeep Jayasumana, Shuai Zheng, Philip Torr

Recent deep learning approaches have incorporated CRFs into Convolutional Neural Networks (CNNs), with some even training the CRF end-to-end with the rest of the network.

Segmentation Semantic Segmentation +1
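The core idea of this line of work is that CRF mean-field inference can be unrolled as differentiable network layers and trained end-to-end with the CNN. A minimal numpy sketch of one mean-field update follows; for illustration it replaces the Gaussian-filtered message passing of a dense CRF with a plain mean over the other pixels' marginals, so function names and the message rule are simplified stand-ins, not the paper's implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def mean_field_step(unary, Q, weight, compat):
    """One differentiable mean-field update for a fully-connected CRF.

    unary:  (N, L) unary scores per pixel and label
    Q:      (N, L) current label marginals
    compat: (L, L) label compatibility matrix
    weight: scalar pairwise strength
    """
    N = Q.shape[0]
    # Toy message passing: mean of every *other* pixel's marginals
    # (a dense CRF would use Gaussian bilateral filtering here).
    msg = (Q.sum(axis=0, keepdims=True) - Q) / (N - 1)
    pairwise = weight * (msg @ compat)       # compatibility transform
    return softmax(unary - pairwise, axis=-1)  # renormalise to marginals
```

Because every step is composed of differentiable operations, gradients flow through the update, which is what allows joint training with the rest of the network.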

Joint Object-Material Category Segmentation from Audio-Visual Cues

no code implementations · 10 Jan 2016 · Anurag Arnab, Michael Sapienza, Stuart Golodetz, Julien Valentin, Ondrej Miksik, Shahram Izadi, Philip Torr

It is not always possible to recognise objects and infer material properties for a scene from visual cues alone, since objects can look visually similar whilst being made of very different materials.

Object

Bottom-up Instance Segmentation using Deep Higher-Order CRFs

no code implementations · 8 Sep 2016 · Anurag Arnab, Philip H. S. Torr

Traditional Scene Understanding problems such as Object Detection and Semantic Segmentation have seen breakthroughs in recent years due to the adoption of deep learning.

Instance Segmentation Object +5

A Projected Gradient Descent Method for CRF Inference allowing End-To-End Training of Arbitrary Pairwise Potentials

no code implementations · 24 Jan 2017 · Måns Larsson, Anurag Arnab, Fredrik Kahl, Shuai Zheng, Philip Torr

It is empirically demonstrated that such learned potentials can improve segmentation accuracy and that certain label class interactions are indeed better modelled by a non-Gaussian potential.

Segmentation Semantic Segmentation +1
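Unlike mean-field, projected gradient descent places no filtering-friendly restriction on the pairwise potential: one simply descends the relaxed CRF energy and projects the marginals back onto the probability simplex. A toy numpy sketch under stated assumptions (a small fully-connected graph, a symmetric learned potential `W`; function names are illustrative, not the paper's code):

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of each row of v onto the probability simplex."""
    n, L = v.shape
    u = np.sort(v, axis=-1)[:, ::-1]            # sort each row descending
    css = np.cumsum(u, axis=-1)
    rho = np.sum(u * np.arange(1, L + 1) > (css - 1), axis=-1)
    theta = (css[np.arange(n), rho - 1] - 1) / rho
    return np.maximum(v - theta[:, None], 0)

def pgd_crf_inference(unary, W, steps=50, lr=0.05):
    """Minimise E(Q) = -sum_i Q_i.u_i + sum_{i!=j} Q_i^T W Q_j by
    projected gradient descent; W is an arbitrary (possibly
    non-Gaussian) symmetric pairwise potential."""
    N, L = unary.shape
    Q = np.full((N, L), 1.0 / L)                # uniform initialisation
    for _ in range(steps):
        grad = -unary + 2 * ((np.ones((N, N)) - np.eye(N)) @ Q) @ W
        Q = project_simplex(Q - lr * grad)      # stay on the simplex
    return Q
```

Since each iteration is differentiable (the projection almost everywhere), the potentials themselves can receive gradients during end-to-end training, which is the point of the paper's title.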

Pixelwise Instance Segmentation with a Dynamically Instantiated Network

1 code implementation CVPR 2017 Anurag Arnab, Philip H. S. Torr

This subnetwork uses the initial category-level segmentation, along with cues from the output of an object detector, within an end-to-end CRF to predict instances.

Instance Segmentation Object +4

Holistic, Instance-Level Human Parsing

1 code implementation · 11 Sep 2017 · Qizhu Li, Anurag Arnab, Philip H. S. Torr

We address this problem by segmenting the parts of objects at an instance-level, such that each pixel in the image is assigned a part label, as well as the identity of the object it belongs to.

Human Detection Multi-Human Parsing +2

On the Robustness of Semantic Segmentation Models to Adversarial Attacks

1 code implementation CVPR 2018 Anurag Arnab, Ondrej Miksik, Philip H. S. Torr

Deep Neural Networks (DNNs) have demonstrated exceptional performance on most recognition tasks such as image classification and segmentation.

General Classification Image Classification +3

Weakly- and Semi-Supervised Panoptic Segmentation

1 code implementation ECCV 2018 Qizhu Li, Anurag Arnab, Philip H. S. Torr

We present a weakly supervised model that jointly performs both semantic- and instance-segmentation -- a particularly relevant problem given the substantial cost of obtaining pixel-perfect annotation for these tasks.

Instance Segmentation Panoptic Segmentation +4

Exploiting temporal context for 3D human pose estimation in the wild

1 code implementation CVPR 2019 Anurag Arnab, Carl Doersch, Andrew Zisserman

We present a bundle-adjustment-based algorithm for recovering accurate 3D human pose and meshes from monocular videos.

Ranked #1 on Monocular 3D Human Pose Estimation on Human3.6M (Use Video Sequence metric)

3D Pose Estimation Monocular 3D Human Pose Estimation

Dynamic Graph Message Passing Networks

1 code implementation CVPR 2020 Li Zhang, Dan Xu, Anurag Arnab, Philip H. S. Torr

We propose a dynamic graph message passing network that significantly reduces the computational complexity compared to related works modelling a fully-connected graph.

Image Classification object-detection +3
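The complexity reduction comes from letting each node exchange messages with a small, dynamically chosen set of neighbours rather than with every other node. A minimal numpy sketch, assuming random neighbour sampling and dot-product affinities as stand-ins for the paper's learned sampling and filter weights:

```python
import numpy as np

def dynamic_message_passing(feats, k, seed=0):
    """Sparse message passing: each node aggregates messages from only k
    sampled neighbours, O(N*k) work instead of the O(N^2) of a
    fully-connected graph such as self-attention."""
    rng = np.random.default_rng(seed)
    N, D = feats.shape
    out = np.empty_like(feats)
    for i in range(N):
        nbrs = rng.choice(N, size=k, replace=False)  # dynamic neighbourhood
        sims = feats[nbrs] @ feats[i]                # affinity to each neighbour
        w = np.exp(sims - sims.max())
        w /= w.sum()                                 # softmax over neighbours
        out[i] = w @ feats[nbrs]                     # weighted aggregation
    return out
```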

Dual Graph Convolutional Network for Semantic Segmentation

6 code implementations · 13 Sep 2019 · Li Zhang, Xiangtai Li, Anurag Arnab, Kuiyuan Yang, Yunhai Tong, Philip H. S. Torr

Exploiting long-range contextual information is key for pixel-wise prediction tasks such as semantic segmentation.

Semantic Segmentation

ViViT: A Video Vision Transformer

8 code implementations ICCV 2021 Anurag Arnab, Mostafa Dehghani, Georg Heigold, Chen Sun, Mario Lučić, Cordelia Schmid

We present pure-transformer based models for video classification, drawing upon the recent success of such models in image classification.

Ranked #8 on Action Classification on MiT (Top 5 Accuracy metric, using extra training data)

Action Classification Action Recognition +4
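ViViT's first step is turning a video into a token sequence: non-overlapping spatio-temporal "tubelets" are flattened and linearly projected, after which a standard transformer encoder operates on the tokens. A small numpy sketch of that tubelet embedding (the loop-based layout is for clarity, not efficiency):

```python
import numpy as np

def tubelet_embed(video, t, p, proj):
    """Tubelet embedding: split a (T, H, W, C) video into non-overlapping
    t x p x p tubelets, flatten each one, and project it to a D-dim token.

    proj: (t*p*p*C, D) projection matrix (learned in the real model).
    Returns (T//t * H//p * W//p, D) tokens.
    """
    T, H, W, C = video.shape
    tokens = []
    for ti in range(0, T, t):
        for hi in range(0, H, p):
            for wi in range(0, W, p):
                tube = video[ti:ti + t, hi:hi + p, wi:wi + p].reshape(-1)
                tokens.append(tube @ proj)   # linear projection to token
    return np.stack(tokens)
```

Compared to embedding each frame independently, a tubelet fuses local motion into every token before any attention is applied.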

Unified Graph Structured Models for Video Understanding

no code implementations ICCV 2021 Anurag Arnab, Chen Sun, Cordelia Schmid

Accurate video understanding involves reasoning about the relationships between actors, objects and their environment, often over long temporal intervals.

Action Detection Graph Classification +3

TokenLearner: What Can 8 Learned Tokens Do for Images and Videos?

4 code implementations · 21 Jun 2021 · Michael S. Ryoo, AJ Piergiovanni, Anurag Arnab, Mostafa Dehghani, Anelia Angelova

In this paper, we introduce a novel visual representation learning approach that relies on a handful of adaptively learned tokens and is applicable to both image and video understanding tasks.

Action Classification Image Classification +3
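The gist of TokenLearner is that each of the few output tokens is a spatial pooling of the feature map under its own learned attention map, so the token count is decoupled from the input resolution. A minimal numpy sketch, assuming a single linear layer produces the attention logits (the paper uses a small conv/MLP block):

```python
import numpy as np

def token_learner(feats, attn_weights):
    """Pool a flattened (HW, D) feature map into S adaptive tokens.

    attn_weights: (D, S) stand-in for the learned attention function;
    each column yields one softmax attention map over spatial positions.
    Returns (S, D) tokens.
    """
    logits = feats @ attn_weights                       # (HW, S) spatial scores
    a = np.exp(logits - logits.max(axis=0, keepdims=True))
    a = a / a.sum(axis=0, keepdims=True)                # softmax over space
    return a.T @ feats                                  # attention-weighted pooling
```

Downstream transformer layers then attend over these S tokens (e.g. 8) instead of hundreds of patch tokens, which is where the compute savings come from.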

Attention Bottlenecks for Multimodal Fusion

1 code implementation NeurIPS 2021 Arsha Nagrani, Shan Yang, Anurag Arnab, Aren Jansen, Cordelia Schmid, Chen Sun

Humans perceive the world by concurrently processing and fusing high-dimensional inputs from multiple modalities such as vision and audio.

Action Classification Action Recognition +2
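The bottleneck idea is that the two modalities never attend to each other directly: all cross-modal information must pass through a small set of shared "fusion" tokens. A toy single-head numpy sketch under that assumption (the update schedule and head structure are simplified relative to the paper):

```python
import numpy as np

def attend(q, kv):
    """Single-head scaled dot-product attention (keys = values = kv)."""
    s = q @ kv.T / np.sqrt(q.shape[-1])
    a = np.exp(s - s.max(axis=-1, keepdims=True))
    a /= a.sum(axis=-1, keepdims=True)
    return a @ kv

def bottleneck_fusion(video_tok, audio_tok, bottleneck):
    """One fusion layer: each modality attends only to itself plus the
    bottleneck tokens, so cross-modal flow is squeezed through them."""
    v = attend(video_tok, np.vstack([video_tok, bottleneck]))
    a = attend(audio_tok, np.vstack([audio_tok, bottleneck]))
    b = attend(bottleneck, np.vstack([video_tok, bottleneck]))  # collect from video
    b = attend(b, np.vstack([audio_tok, b]))                    # then from audio
    return v, a, b
```

Because the bottleneck is much smaller than either token stream, the layer is far cheaper than full pairwise cross-attention while still forcing the model to share only the most useful cross-modal information.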

Compressive Visual Representations

1 code implementation NeurIPS 2021 Kuang-Huei Lee, Anurag Arnab, Sergio Guadarrama, John Canny, Ian Fischer

We verify this by developing SimCLR and BYOL formulations compatible with the Conditional Entropy Bottleneck (CEB) objective, allowing us to both measure and control the amount of compression in the learned representation, and observe their impact on downstream tasks.

Contrastive Learning Self-Supervised Image Classification

SCENIC: A JAX Library for Computer Vision Research and Beyond

1 code implementation CVPR 2022 Mostafa Dehghani, Alexey Gritsenko, Anurag Arnab, Matthias Minderer, Yi Tay

Scenic is an open-source JAX library with a focus on Transformer-based models for computer vision research and beyond.

The Efficiency Misnomer

no code implementations ICLR 2022 Mostafa Dehghani, Anurag Arnab, Lucas Beyer, Ashish Vaswani, Yi Tay

We further present suggestions to improve reporting of efficiency metrics.

PolyViT: Co-training Vision Transformers on Images, Videos and Audio

no code implementations · 25 Nov 2021 · Valerii Likhosherstov, Anurag Arnab, Krzysztof Choromanski, Mario Lucic, Yi Tay, Adrian Weller, Mostafa Dehghani

Can we train a single transformer model capable of processing multiple modalities and datasets, whilst sharing almost all of its learnable parameters?

Audio Classification

TokenLearner: Adaptive Space-Time Tokenization for Videos

1 code implementation NeurIPS 2021 Michael Ryoo, AJ Piergiovanni, Anurag Arnab, Mostafa Dehghani, Anelia Angelova

In this paper, we introduce a novel visual representation learning approach that relies on a handful of adaptively learned tokens and is applicable to both image and video understanding tasks.

Representation Learning Video Recognition +1

Multiview Transformers for Video Recognition

1 code implementation CVPR 2022 Shen Yan, Xuehan Xiong, Anurag Arnab, Zhichao Lu, Mi Zhang, Chen Sun, Cordelia Schmid

Video understanding requires reasoning at multiple spatiotemporal resolutions -- from short fine-grained motions to events taking place over longer durations.

Ranked #5 on Action Recognition on EPIC-KITCHENS-100 (using extra training data)

Action Classification Action Recognition +1

M&M Mix: A Multimodal Multiview Transformer Ensemble

no code implementations · 20 Jun 2022 · Xuehan Xiong, Anurag Arnab, Arsha Nagrani, Cordelia Schmid

This report describes the approach behind our winning solution to the 2022 Epic-Kitchens Action Recognition Challenge.

Ranked #2 on Action Recognition on EPIC-KITCHENS-100 (using extra training data)

Action Recognition Video Recognition

Dynamic Graph Message Passing Networks for Visual Recognition

2 code implementations · 20 Sep 2022 · Li Zhang, Mohan Chen, Anurag Arnab, Xiangyang Xue, Philip H. S. Torr

A fully-connected graph, such as the self-attention operation in Transformers, is beneficial for such modelling, however, its computational overhead is prohibitive.

Image Classification object-detection +3

Token Turing Machines

1 code implementation CVPR 2023 Michael S. Ryoo, Keerthana Gopalakrishnan, Kumara Kahatapitiya, Ted Xiao, Kanishka Rao, Austin Stone, Yao Lu, Julian Ibarz, Anurag Arnab

The model's memory module ensures that a new observation will only be processed with the contents of the memory (and not the entire history), meaning that it can efficiently process long sequences with a bounded computational cost at each step.

Action Detection Activity Detection
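The bounded per-step cost described above follows from the read/write structure: at every step the model summarizes [memory; observation] down to a fixed number of tokens, processes those, and writes a fixed-size memory back. A numpy sketch of one such step, with a deliberately crude norm-based token summarizer and a `tanh` standing in for the learned summarization and processing modules:

```python
import numpy as np

def summarize(tokens, k):
    """Stand-in token summarizer: keep the k tokens with largest norm
    (the real model uses a learned, attention-based summarization)."""
    idx = np.argsort(-np.linalg.norm(tokens, axis=1))[:k]
    return tokens[idx]

def ttm_step(memory, obs, k_read, mem_size):
    """One Token Turing Machine step with bounded per-step cost.

    Read:    summarize [memory; obs] down to k_read tokens.
    Process: run the processing unit on those tokens only.
    Write:   summarize [memory; obs; output] back to mem_size tokens.
    """
    read = summarize(np.vstack([memory, obs]), k_read)
    out = np.tanh(read)                                   # stand-in processing unit
    memory = summarize(np.vstack([memory, obs, out]), mem_size)
    return memory, out
```

However long the input stream, each step touches only `mem_size + len(obs)` tokens, which is exactly the bounded-cost property the abstract refers to.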

Audiovisual Masked Autoencoders

2 code implementations ICCV 2023 Mariana-Iuliana Georgescu, Eduardo Fonseca, Radu Tudor Ionescu, Mario Lucic, Cordelia Schmid, Anurag Arnab

Can we leverage the audiovisual information already present in video to improve self-supervised representation learning?

Ranked #1 on Audio Classification on EPIC-KITCHENS-100 (using extra training data)

Audio Classification Representation Learning

Adaptive Computation with Elastic Input Sequence

1 code implementation · 30 Jan 2023 · Fuzhao Xue, Valerii Likhosherstov, Anurag Arnab, Neil Houlsby, Mostafa Dehghani, Yang You

However, most standard neural networks have a fixed function type and computation budget regardless of the sample's nature or difficulty.

Inductive Bias

CAT-Seg: Cost Aggregation for Open-Vocabulary Semantic Segmentation

3 code implementations · 21 Mar 2023 · Seokju Cho, Heeseong Shin, Sunghwan Hong, Seungjun An, Seungjun Lee, Anurag Arnab, Paul Hongsuck Seo, Seungryong Kim

However, the problem of transferring these capabilities learned from image-level supervision to the pixel-level task of segmentation and addressing arbitrary unseen categories at inference makes this task challenging.

Image Segmentation Open Vocabulary Semantic Segmentation +3

VicTR: Video-conditioned Text Representations for Activity Recognition

no code implementations · 5 Apr 2023 · Kumara Kahatapitiya, Anurag Arnab, Arsha Nagrani, Michael S. Ryoo

All such recipes rely on augmenting visual embeddings with temporal information (i.e., image → video), often keeping text embeddings unchanged or even discarding them.

Action Classification Activity Recognition +1

End-to-End Spatio-Temporal Action Localisation with Video Transformers

no code implementations · 24 Apr 2023 · Alexey Gritsenko, Xuehan Xiong, Josip Djolonga, Mostafa Dehghani, Chen Sun, Mario Lučić, Cordelia Schmid, Anurag Arnab

The most performant spatio-temporal action localisation models use external person proposals and complex external memory banks.

Ranked #1 on Action Recognition on AVA v2.1 (using extra training data)

Action Detection Action Recognition +1

Optimizing ViViT Training: Time and Memory Reduction for Action Recognition

no code implementations · 7 Jun 2023 · Shreyank N Gowda, Anurag Arnab, Jonathan Huang

In this paper, we address the substantial training time and memory consumption of video transformers, focusing on the ViViT (Video Vision Transformer) model, and in particular its Factorised Encoder version, as our baseline for action recognition tasks.

Action Recognition

Dense Video Object Captioning from Disjoint Supervision

1 code implementation · 20 Jun 2023 · Xingyi Zhou, Anurag Arnab, Chen Sun, Cordelia Schmid

We show our task is more general than grounding, and models trained on our task can directly be applied to grounding by finding the bounding box with the maximum likelihood of generating the query sentence.

Object Sentence +1

How can objects help action recognition?

1 code implementation CVPR 2023 Xingyi Zhou, Anurag Arnab, Chen Sun, Cordelia Schmid

In this paper, we investigate how we can use knowledge of objects to design better video models, namely to process fewer tokens and to improve recognition accuracy.

Action Recognition Object

Does Visual Pretraining Help End-to-End Reasoning?

no code implementations NeurIPS 2023 Chen Sun, Calvin Luo, Xingyi Zhou, Anurag Arnab, Cordelia Schmid

A positive result would refute the common belief that explicit visual abstraction (e.g. object detection) is essential for compositional generalization on visual reasoning, and confirm the feasibility of a neural network "generalist" to solve visual recognition and reasoning tasks.

Image Classification Object +3

UnLoc: A Unified Framework for Video Localization Tasks

1 code implementation ICCV 2023 Shen Yan, Xuehan Xiong, Arsha Nagrani, Anurag Arnab, Zhonghao Wang, Weina Ge, David Ross, Cordelia Schmid

While large-scale image-text pretrained models such as CLIP have been used for multiple video-level tasks on trimmed videos, their use for temporal localization in untrimmed videos is still a relatively unexplored task.

Action Segmentation Moment Retrieval +5

Pixel Aligned Language Models

no code implementations · 14 Dec 2023 · Jiarui Xu, Xingyi Zhou, Shen Yan, Xiuye Gu, Anurag Arnab, Chen Sun, Xiaolong Wang, Cordelia Schmid

When taking locations as inputs, the model performs location-conditioned captioning, which generates captions for the indicated object or region.

Language Modelling

Time-, Memory- and Parameter-Efficient Visual Adaptation

no code implementations · 5 Feb 2024 · Otniel-Bogdan Mercea, Alexey Gritsenko, Cordelia Schmid, Anurag Arnab

Here, we outperform a prior adaptor-based method that could only scale to a 1-billion-parameter backbone, as well as fully finetuning a smaller backbone, with the same GPU and less training time.

Video Classification
