Search Results for author: Alexey Gritsenko

Found 13 papers, 6 papers with code

Time-, Memory- and Parameter-Efficient Visual Adaptation

no code implementations 5 Feb 2024 Otniel-Bogdan Mercea, Alexey Gritsenko, Cordelia Schmid, Anurag Arnab

Here, we outperform a prior adaptor-based method, which could only scale to a 1-billion-parameter backbone, as well as fully finetuning a smaller backbone, with the same GPU and less training time.

Video Classification

Time-, Memory- and Parameter-Efficient Visual Adaptation

no code implementations CVPR 2024 Otniel-Bogdan Mercea, Alexey Gritsenko, Cordelia Schmid, Anurag Arnab

Here, we outperform a prior adaptor-based method, which could only scale to a 1-billion-parameter backbone, as well as fully finetuning a smaller backbone, with the same GPU and less training time.

Video Classification

Video OWL-ViT: Temporally-consistent open-world localization in video

no code implementations ICCV 2023 Georg Heigold, Matthias Minderer, Alexey Gritsenko, Alex Bewley, Daniel Keysers, Mario Lučić, Fisher Yu, Thomas Kipf

Our model is end-to-end trainable on video data and enjoys improved temporal consistency compared to tracking-by-detection baselines, while retaining the open-world capabilities of the backbone detector.

Decoder, Object +1

Scaling Open-Vocabulary Object Detection

1 code implementation NeurIPS 2023 Matthias Minderer, Alexey Gritsenko, Neil Houlsby

However, with OWL-ST, we can scale to over 1B examples, yielding further large improvement: With an L/14 architecture, OWL-ST improves AP on LVIS rare classes, for which the model has seen no human box annotations, from 31.2% to 44.6% (43% relative improvement).

Ranked #2 on Zero-Shot Object Detection on LVIS v1.0 val (using extra training data)

Image Classification, Language Modelling +4

VUT: Versatile UI Transformer for Multi-Modal Multi-Task User Interface Modeling

no code implementations 10 Dec 2021 Yang Li, Gang Li, Xin Zhou, Mostafa Dehghani, Alexey Gritsenko

Our model consists of a multimodal Transformer encoder that jointly encodes UI images and structures, and performs UI object detection when the UI structures are absent in the input.

Object Detection +2
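
The VUT entry above describes a multimodal Transformer encoder that jointly encodes UI images and UI structures. As a rough, hedged illustration only (none of the names below come from the paper: joint_encode, w_img, emb_struct, the stand-in encoder, and all shapes are hypothetical), one way to express such joint encoding in JAX is to embed each modality into a shared width, concatenate the token sequences, and run a single shared encoder:

    # Hypothetical sketch of joint multimodal encoding; not the VUT implementation.
    import jax
    import jax.numpy as jnp

    def joint_encode(image_patches, structure_tokens, w_img, emb_struct, encoder):
        # Linear patch embedding for the image modality -> (n_img, d).
        img_tokens = image_patches @ w_img
        # Lookup embedding for structure-token ids -> (n_struct, d).
        struct_tokens = emb_struct[structure_tokens]
        # Concatenate both modalities into one token sequence for a shared encoder.
        tokens = jnp.concatenate([img_tokens, struct_tokens], axis=0)
        return encoder(tokens)

    # Toy usage with illustrative shapes and an identity stand-in for the encoder.
    key = jax.random.PRNGKey(0)
    k1, k2, k3 = jax.random.split(key, 3)
    image_patches = jax.random.normal(k1, (196, 768))   # 14x14 patches, 768-dim each
    structure_tokens = jnp.arange(32)                    # 32 structure-token ids
    w_img = jax.random.normal(k2, (768, 256))            # patch projection to d=256
    emb_struct = jax.random.normal(k3, (1000, 256))      # vocabulary of 1000 structure tokens
    encoded = joint_encode(image_patches, structure_tokens, w_img, emb_struct, lambda t: t)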

SCENIC: A JAX Library for Computer Vision Research and Beyond

1 code implementation CVPR 2022 Mostafa Dehghani, Alexey Gritsenko, Anurag Arnab, Matthias Minderer, Yi Tay

Scenic is an open-source JAX library with a focus on Transformer-based models for computer vision research and beyond.
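
As a generic, hedged sketch (this is not Scenic's actual API, only an illustration of the kind of jit-compiled Transformer building block that JAX libraries such as Scenic are organized around), single-head scaled dot-product self-attention can be written as:

    # Generic JAX sketch, not Scenic code: jit-compiled self-attention over a token sequence.
    import jax
    import jax.numpy as jnp

    def self_attention(x, w_q, w_k, w_v):
        # Project tokens to queries, keys and values.
        q, k, v = x @ w_q, x @ w_k, x @ w_v
        # Scaled pairwise similarities between query and key tokens.
        scores = q @ k.T / jnp.sqrt(q.shape[-1])
        # Attention weights sum to one over the key axis.
        weights = jax.nn.softmax(scores, axis=-1)
        return weights @ v

    # Toy usage: 16 tokens of width 64, compiled with jax.jit.
    key = jax.random.PRNGKey(0)
    kx, kq, kk, kv = jax.random.split(key, 4)
    x = jax.random.normal(kx, (16, 64))
    w_q, w_k, w_v = (jax.random.normal(k, (64, 64)) for k in (kq, kk, kv))
    out = jax.jit(self_attention)(x, w_q, w_k, w_v)   # shape (16, 64)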
