2 code implementations • 1 Feb 2024 • Carl Doersch, Pauline Luc, Yi Yang, Dilara Gokay, Skanda Koppula, Ankush Gupta, Joseph Heyward, Ignacio Rocco, Ross Goroshin, João Carreira, Andrew Zisserman
To endow models with greater understanding of physics and motion, it is useful to enable them to perceive how solid surfaces move and deform in real scenes.
Ranked #1 on Point Tracking on TAP-Vid-RGB-Stacking
1 code implementation • ICCV 2023 • Chuhan Zhang, Ankush Gupta, Andrew Zisserman
We demonstrate the performance of the object-aware representations learnt by our model by: (i) evaluating them for strong transfer, i.e. through zero-shot testing, on a number of downstream video-text retrieval and classification benchmarks; and (ii) using the learned representations as input for long-term video understanding tasks (e.g. Episodic Memory in Ego4D).
3 code implementations • ICCV 2023 • Carl Doersch, Yi Yang, Mel Vecerik, Dilara Gokay, Ankush Gupta, Yusuf Aytar, João Carreira, Andrew Zisserman
We present a novel model for Tracking Any Point (TAP) that effectively tracks any queried point on any physical surface throughout a video sequence.
Ranked #1 on Visual Tracking on Kinetics
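The core of TAP-style tracking can be pictured as matching a query point's feature against dense per-frame feature maps. Below is a minimal numpy sketch of only that matching step; the learned feature extraction, iterative refinement, and occlusion handling that TAPIR adds are omitted, and all shapes and names are illustrative assumptions:

```python
import numpy as np

def track_point(feature_maps, query_feature):
    """Track a query point by nearest-neighbour feature matching.

    feature_maps: (T, C, H, W) dense per-frame features (assumed given)
    query_feature: (C,) feature sampled at the query point
    Returns a (T, 2) array of (row, col) predictions, one per frame.
    """
    T, C, H, W = feature_maps.shape
    tracks = np.zeros((T, 2), dtype=int)
    q = query_feature / (np.linalg.norm(query_feature) + 1e-8)
    for t in range(T):
        f = feature_maps[t].reshape(C, H * W)
        f = f / (np.linalg.norm(f, axis=0, keepdims=True) + 1e-8)
        scores = q @ f                    # cosine similarity at every location
        idx = int(np.argmax(scores))     # best-matching location this frame
        tracks[t] = divmod(idx, W)
    return tracks
```

In practice a real tracker also predicts per-frame visibility, since argmax matching alone cannot tell an occluded point from a visible one.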
1 code implementation • NeurIPS 2023 • Viorica Pătrăucean, Lucas Smaira, Ankush Gupta, Adrià Recasens Continente, Larisa Markeeva, Dylan Banarse, Skanda Koppula, Joseph Heyward, Mateusz Malinowski, Yi Yang, Carl Doersch, Tatiana Matejovicova, Yury Sulsky, Antoine Miech, Alex Frechette, Hanna Klimczak, Raphael Koster, Junlin Zhang, Stephanie Winkler, Yusuf Aytar, Simon Osindero, Dima Damen, Andrew Zisserman, João Carreira
We propose a novel multimodal video benchmark - the Perception Test - to evaluate the perception and reasoning skills of pre-trained multimodal models (e.g. Flamingo, SeViLA, or GPT-4).
Ranked #1 on Point Tracking on Perception Test
2 code implementations • ICCV 2023 • Vishaal Udandarao, Ankush Gupta, Samuel Albanie
Contrastive Language-Image Pre-training (CLIP) has emerged as a simple yet effective way to train large-scale vision-language models.
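At the heart of CLIP is a symmetric contrastive (InfoNCE) objective over matched image-text pairs. A minimal sketch, assuming pre-computed L2-normalised embeddings and an illustrative temperature value:

```python
import numpy as np

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired embeddings.

    image_emb, text_emb: (N, D) L2-normalised embeddings; row i of
    each matrix is a matching image-text pair.
    """
    logits = image_emb @ text_emb.T / temperature   # (N, N) similarity matrix
    labels = np.arange(len(logits))                 # the match is the diagonal

    def xent(l):
        l = l - l.max(axis=1, keepdims=True)        # numerical stability
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    # cross-entropy in both directions: image->text and text->image
    return 0.5 * (xent(logits) + xent(logits.T))
```

The symmetric form matters: each image must pick out its caption among all captions in the batch, and each caption its image among all images.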
3 code implementations • 7 Nov 2022 • Carl Doersch, Ankush Gupta, Larisa Markeeva, Adrià Recasens, Lucas Smaira, Yusuf Aytar, João Carreira, Andrew Zisserman, Yi Yang
Generic motion understanding from video involves not only tracking objects, but also perceiving how their surfaces deform and move.
1 code implementation • DeepMind 2022 • Viorica Pătrăucean, Lucas Smaira, Ankush Gupta, Adrià Recasens Continente, Larisa Markeeva, Dylan Banarse, Mateusz Malinowski, Yi Yang, Carl Doersch, Tatiana Matejovicova, Yury Sulsky, Antoine Miech, Skanda Koppula, Alex Frechette, Hanna Klimczak, Raphael Koster, Junlin Zhang, Stephanie Winkler, Yusuf Aytar, Simon Osindero, Dima Damen, Andrew Zisserman, João Carreira
We propose a novel multimodal benchmark – the Perception Test – that aims to extensively evaluate perception and reasoning skills of multimodal models.
no code implementations • 20 Jul 2022 • Chuhan Zhang, Ankush Gupta, Andrew Zisserman
The model learns a set of object-centric summary vectors for the video, and uses these vectors to fuse the visual and spatio-temporal trajectory 'modalities' of the video clip.
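One loose way to picture such summary-vector fusion is cross-attention from a small set of summary queries over the concatenated token streams of both modalities. The shapes, the single attention step, and the function names below are illustrative assumptions, not the paper's actual architecture:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def fuse_modalities(summaries, visual_tokens, trajectory_tokens):
    """Fuse two token streams via a shared set of summary vectors.

    summaries: (K, D) object-centric query vectors (hypothetical);
    visual_tokens, trajectory_tokens: (Nv, D) and (Nt, D) features.
    Each summary cross-attends over all tokens from both modalities.
    """
    tokens = np.concatenate([visual_tokens, trajectory_tokens], axis=0)
    scores = summaries @ tokens.T / np.sqrt(summaries.shape[1])
    attn = softmax(scores, axis=1)            # each summary's attention weights
    return attn @ tokens                      # (K, D) fused summary vectors
```

Because the attention weights per summary sum to one, each fused vector is a convex combination of tokens drawn from both modalities.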
no code implementations • NAACL 2021 • Arvind Agarwal, Laura Chiticariu, Poornima Chozhiyath Raman, Marina Danilevsky, Diman Ghazi, Ankush Gupta, Shanmukha Guttula, Yannis Katsis, Rajasekar Krishnamurthy, Yunyao Li, Shubham Mudgal, Vitobha Munigala, Nicholas Phan, Dhaval Sonawane, Sneha Srinivasan, Sudarshan R. Thitte, Mitesh Vasa, Ramiya Venkatachalam, Vinitha Yaski, Huaiyu Zhu
Contracts are arguably the most important type of business document.
no code implementations • CVPR 2021 • Chuhan Zhang, Ankush Gupta, Andrew Zisserman
It attends to relevant segments for each query with a temporal attention mechanism, and can be trained using only the labels for each query.
Ranked #13 on Action Recognition on Diving-48
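The temporal attention mechanism described above can be sketched as a single query softmax-attending over per-segment features. Shapes and the scaling factor here are illustrative assumptions; the actual model is learned end-to-end:

```python
import numpy as np

def attend_segments(query, segment_feats):
    """Pool per-segment features with softmax attention for one query.

    query: (D,) query embedding; segment_feats: (S, D), one feature
    vector per video segment. Returns (pooled (D,), weights (S,)).
    """
    scores = segment_feats @ query / np.sqrt(len(query))
    scores = scores - scores.max()            # numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()
    return weights @ segment_feats, weights
```

Supervision via only per-query labels works because gradients flow through the soft weights, so the model learns which segments are relevant without segment-level annotation.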
no code implementations • 3 Nov 2020 • Markus Wulfmeier, Arunkumar Byravan, Tim Hertweck, Irina Higgins, Ankush Gupta, Tejas Kulkarni, Malcolm Reynolds, Denis Teplyashin, Roland Hafner, Thomas Lampe, Martin Riedmiller
Furthermore, the value of each representation is evaluated in terms of three properties: dimensionality, observability and disentanglement.
no code implementations • ECCV 2020 • Chuhan Zhang, Ankush Gupta, Andrew Zisserman
In this work, our objective is to address the problems of generalization and flexibility for text recognition in documents.
6 code implementations • NeurIPS 2020 • Carl Doersch, Ankush Gupta, Andrew Zisserman
In this work, we illustrate how the neural network representations which underpin modern vision systems are subject to supervision collapse, whereby they lose any information that is not necessary for performing the training task, including information that may be necessary for transfer to new tasks or domains.
no code implementations • 20 Aug 2019 • Srikanth G Tamilselvam, Ankush Gupta, Arvind Agarwal
Compliance officers responsible for maintaining adherence constantly struggle to keep up with the large amount of changes in regulatory requirements.
no code implementations • CVPR 2020 • Tomas Jakab, Ankush Gupta, Hakan Bilen, Andrea Vedaldi
We propose KeypointGAN, a new method for recognizing the pose of objects from a single image that learns from only unlabelled videos and a weak empirical prior on the object poses.
6 code implementations • NeurIPS 2019 • Tejas Kulkarni, Ankush Gupta, Catalin Ionescu, Sebastian Borgeaud, Malcolm Reynolds, Andrew Zisserman, Volodymyr Mnih
In this work we aim to learn object representations that are useful for control and reinforcement learning (RL).
no code implementations • 23 Sep 2018 • Ankush Gupta, Andrea Vedaldi, Andrew Zisserman
This work presents a method for visual text recognition without using any paired supervisory data.
no code implementations • 21 Jul 2018 • Ankush Gupta, Andrea Vedaldi, Andrew Zisserman
End-to-end trained Recurrent Neural Networks (RNNs) have been successfully applied to numerous problems that require processing sequences, such as image captioning, machine translation, and text recognition.
2 code implementations • NeurIPS 2018 • Tomas Jakab, Ankush Gupta, Hakan Bilen, Andrea Vedaldi
We propose a method for learning landmark detectors for visual objects (such as the eyes and the nose in a face) without any manual supervision.
1 code implementation • 15 Sep 2017 • Ankush Gupta, Arvind Agarwal, Prawaan Singh, Piyush Rai
In this paper, we address the problem of generating paraphrases automatically.
3 code implementations • CVPR 2016 • Ankush Gupta, Andrea Vedaldi, Andrew Zisserman
In this paper we introduce a new method for text detection in natural images.
Ranked #12 on Scene Text Detection on ICDAR 2013