1 code implementation • 17 Mar 2025 • Chiara Plizzari, Alessio Tonioni, Yongqin Xian, Achin Kulshrestha, Federico Tombari
Understanding fine-grained temporal dynamics is crucial in egocentric videos, where continuous streams capture frequent, close-up interactions with objects.
no code implementations • 19 Dec 2024 • Enis Simsar, Alessio Tonioni, Yongqin Xian, Thomas Hofmann, Federico Tombari
We propose an unsupervised model for instruction-based image editing that eliminates the need for ground-truth edited images during training.
no code implementations • 27 Nov 2024 • Vishaal Udandarao, Nikhil Parthasarathy, Muhammad Ferjad Naeem, Talfan Evans, Samuel Albanie, Federico Tombari, Yongqin Xian, Alessio Tonioni, Olivier J. Hénaff
Knowledge distillation (KD) is the de facto standard for compressing large-scale models into smaller ones.
2 code implementations • 30 Oct 2024 • Haiyang Wang, Yue Fan, Muhammad Ferjad Naeem, Yongqin Xian, Jan Eric Lenssen, LiWei Wang, Federico Tombari, Bernt Schiele
By treating model parameters as tokens, we replace all the linear projections in Transformers with our token-parameter attention layer, where input tokens act as queries and model parameters as keys and values.
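A minimal sketch of that idea (class name, dimensions, and the plain softmax normalisation are my own assumptions, not the paper's exact formulation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TokenParameterAttention(nn.Module):
    """Replace a fixed linear projection with attention over learnable
    parameter tokens: input tokens are queries, parameters are keys/values."""
    def __init__(self, dim, num_param_tokens=64):
        super().__init__()
        self.param_keys = nn.Parameter(torch.randn(num_param_tokens, dim) * 0.02)
        self.param_values = nn.Parameter(torch.randn(num_param_tokens, dim) * 0.02)

    def forward(self, x):                       # x: (batch, seq_len, dim)
        scores = x @ self.param_keys.t() / x.shape[-1] ** 0.5
        weights = F.softmax(scores, dim=-1)     # (batch, seq_len, num_param_tokens)
        return weights @ self.param_values      # (batch, seq_len, dim)
```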
no code implementations • 29 Jun 2024 • Yue Fan, Yongqin Xian, Xiaohua Zhai, Alexander Kolesnikov, Muhammad Ferjad Naeem, Bernt Schiele, Federico Tombari
In this paper, we explore diffusion-based vision generalists, where we unify different types of dense prediction tasks as conditional image generation and re-purpose pre-trained diffusion models for it.
1 code implementation • 28 Mar 2024 • Bo Wan, Michael Tschannen, Yongqin Xian, Filip Pavetic, Ibrahim Alabdulmohsin, Xiao Wang, André Susano Pinto, Andreas Steiner, Lucas Beyer, Xiaohua Zhai
In this paper, we propose a simple visual pretraining method with location-aware captioners (LocCa).
no code implementations • 19 Dec 2023 • Bruno Korbar, Yongqin Xian, Alessio Tonioni, Andrew Zisserman, Federico Tombari
In this paper we present a text-conditioned video resampler (TCR) module that uses a pre-trained and frozen visual encoder and large language model (LLM) to process long video sequences for a task.
Ranked #25 on Video Question Answering on NExT-QA
no code implementations • 14 Dec 2023 • Enis Simsar, Alessio Tonioni, Yongqin Xian, Thomas Hofmann, Federico Tombari
A significant challenge within this domain is localized editing, where specific areas of an image are modified without affecting the rest of the content.
no code implementations • 29 Nov 2023 • Sanghwan Kim, Daoji Huang, Yongqin Xian, Otmar Hilliges, Luc van Gool, Xi Wang
Traditional methods heavily rely on representation learning that is trained on a large amount of video data.
no code implementations • 20 Oct 2023 • Muhammad Ferjad Naeem, Yongqin Xian, Xiaohua Zhai, Lukas Hoyer, Luc van Gool, Federico Tombari
However, the contrastive objective used by these models only focuses on image-text alignment and does not incentivise image feature learning for dense prediction tasks.
1 code implementation • 22 Apr 2023 • Qian Wang, Yongqin Xian, Hefei Ling, Jinyuan Zhang, Xiaorui Lin, Ping Li, Jiazhong Chen, Ning Yu
Adversarial attacks aim to disrupt the functionality of a target system by adding specific noise to input samples, posing potential threats to the security and robustness of facial recognition systems.
1 code implementation • 1 Feb 2023 • Saurabh Sharma, Yongqin Xian, Ning Yu, Ambuj Singh
In this work, we show that learning prototype classifiers addresses the biased softmax problem in LTR.
Ranked #8 on Long-tail Learning on CIFAR-100-LT (ρ=10)
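As a rough illustration of a prototype classifier (a generic sketch with placeholder names, not the paper's exact formulation):

```python
import torch
import torch.nn as nn

class PrototypeClassifier(nn.Module):
    """Score each sample by its negative squared distance to learnable
    per-class prototypes, instead of using a softmax weight matrix."""
    def __init__(self, feat_dim, num_classes):
        super().__init__()
        self.prototypes = nn.Parameter(torch.randn(num_classes, feat_dim) * 0.02)

    def forward(self, features):                       # features: (batch, feat_dim)
        dists = torch.cdist(features, self.prototypes)  # (batch, num_classes)
        return -dists.pow(2)                            # use as logits for cross-entropy
```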
1 code implementation • CVPR 2023 • Anurag Das, Yongqin Xian, Dengxin Dai, Bernt Schiele
In this work, we propose a common framework that uses different weak labels, e.g. image, point and coarse labels from the target domain, to reduce this performance gap.
no code implementations • 15 Dec 2022 • Anurag Das, Yongqin Xian, Yang He, Zeynep Akata, Bernt Schiele
For best performance, today's semantic segmentation methods use large and carefully labeled datasets, requiring expensive annotation budgets.
1 code implementation • CVPR 2023 • JieZhang Cao, Qin Wang, Yongqin Xian, Yawei Li, Bingbing Ni, Zhiming Pi, Kai Zhang, Yulun Zhang, Radu Timofte, Luc van Gool
We explicitly design an implicit attention network to learn the ensemble weights for the nearby local features.
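A hedged sketch of the general mechanism, predicting softmax ensemble weights over K nearby local features (module name and sizes are assumptions, not the paper's architecture):

```python
import torch
import torch.nn as nn

class LocalEnsemble(nn.Module):
    """Score each of the K nearby local features with a small network and
    combine them with the resulting softmax weights."""
    def __init__(self, feat_dim, hidden=256):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(feat_dim, hidden), nn.ReLU(),
                                   nn.Linear(hidden, 1))

    def forward(self, local_feats):                   # (batch, K, feat_dim)
        weights = torch.softmax(self.score(local_feats), dim=1)  # (batch, K, 1)
        return (weights * local_feats).sum(dim=1)     # (batch, feat_dim)
```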
no code implementations • CVPR 2023 • Muhammad Ferjad Naeem, Muhammad Gul Zain Ali Khan, Yongqin Xian, Muhammad Zeshan Afzal, Didier Stricker, Luc van Gool, Federico Tombari
Our proposed model, I2MVFormer, learns multi-view semantic embeddings for zero-shot image classification with these class views.
no code implementations • 21 Sep 2022 • Muhammad Ferjad Naeem, Yongqin Xian, Luc van Gool, Federico Tombari
In order to distill discriminative visual words from noisy documents, we introduce a new cross-modal attention module that learns fine-grained interactions between image patches and document words.
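A speculative sketch of such a patch-word cross-attention step (function and tensor names are mine, not the paper's):

```python
import torch

def patch_word_attention(patch_feats, word_feats):
    """Each image patch attends over document word embeddings; averaging the
    attended context gives an image-conditioned summary of the document.
    patch_feats: (num_patches, dim), word_feats: (num_words, dim)."""
    scores = patch_feats @ word_feats.t() / patch_feats.shape[-1] ** 0.5
    attn = torch.softmax(scores, dim=-1)          # (num_patches, num_words)
    attended = attn @ word_feats                  # word context per patch
    return attended.mean(dim=0)                   # (dim,) document summary for the image
```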
no code implementations • 4 Apr 2022 • Wenjia Xu, Yongqin Xian, Jiuniu Wang, Bernt Schiele, Zeynep Akata
While a visual-semantic embedding layer learns global features, local features are learned through an attribute prototype network that simultaneously regresses and decorrelates attributes from intermediate features.
Ranked #5 on GZSL Video Classification on ActivityNet-GZSL (main)
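A rough sketch of the attribute-prototype idea (a simplified reading with hypothetical names; the decorrelation term mentioned above is omitted):

```python
import torch
import torch.nn as nn

class AttributePrototypeHead(nn.Module):
    """Correlate per-attribute prototypes with a spatial feature map and take
    the maximum response per attribute as the regressed attribute score."""
    def __init__(self, feat_dim, num_attrs):
        super().__init__()
        self.prototypes = nn.Parameter(torch.randn(num_attrs, feat_dim) * 0.02)

    def forward(self, feat_map):                  # feat_map: (batch, feat_dim, H, W)
        flat = feat_map.flatten(2)                # (batch, feat_dim, H*W)
        responses = torch.einsum('af,bfl->bal', self.prototypes, flat)
        return responses.max(dim=-1).values       # (batch, num_attrs)
```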
1 code implementation • CVPR 2022 • Wenjia Xu, Yongqin Xian, Jiuniu Wang, Bernt Schiele, Zeynep Akata
Our model visually divides a set of images from seen classes into clusters of local image regions according to their visual similarity, and further imposes their class discrimination and semantic relatedness.
no code implementations • 29 Nov 2021 • Muhammad Ferjad Naeem, Evin Pınar Örnek, Yongqin Xian, Luc van Gool, Federico Tombari
Parts represent a basic unit of geometric and semantic similarity across different objects.
2 code implementations • 3 May 2021 • Massimiliano Mancini, Muhammad Ferjad Naeem, Yongqin Xian, Zeynep Akata
In this work, we overcome this assumption by operating in the open-world setting, where no limit is imposed on the compositional space at test time and the search space contains a large number of unseen compositions.
1 code implementation • CVPR 2021 • Yanbei Chen, Yongqin Xian, A. Sophia Koepke, Ying Shan, Zeynep Akata
Having access to multi-modal cues (e.g. vision and audio) allows some cognitive tasks to be performed faster than when learning from a single modality.
1 code implementation • 21 Apr 2021 • Giuseppe Pastore, Fabio Cermelli, Yongqin Xian, Massimiliano Mancini, Zeynep Akata, Barbara Caputo
Being able to segment unseen classes not observed during training is an important technical challenge in deep learning, because of its potential to reduce the expensive annotation required for semantic segmentation.
Ranked #8 on Zero-Shot Semantic Segmentation on PASCAL VOC
1 code implementation • CVPR 2021 • Muhammad Ferjad Naeem, Yongqin Xian, Federico Tombari, Zeynep Akata
In compositional zero-shot learning, the goal is to recognize unseen compositions (e.g. old dog) of visual primitives, i.e. states (e.g. old, cute) and objects (e.g. car, dog), observed in the training set.
2 code implementations • CVPR 2021 • Massimiliano Mancini, Muhammad Ferjad Naeem, Yongqin Xian, Zeynep Akata
After estimating the feasibility score of each composition, we use these scores either to directly mask the output space or as a margin for the cosine similarity between visual features and compositional embeddings during training.
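A minimal sketch of the output-masking variant (threshold value and tensor names are assumptions):

```python
import torch
import torch.nn.functional as F

def masked_composition_scores(img_feat, comp_embeds, feasibility, thresh=0.0):
    """Cosine similarity between an image feature and every state-object
    composition embedding, with compositions whose feasibility score falls
    below a threshold removed from the prediction space.
    img_feat: (batch, dim), comp_embeds: (num_comps, dim), feasibility: (num_comps,)."""
    sims = F.cosine_similarity(img_feat.unsqueeze(1), comp_embeds.unsqueeze(0), dim=-1)
    return sims.masked_fill(feasibility < thresh, float('-inf'))
```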
1 code implementation • 30 Nov 2020 • Fabio Cermelli, Massimiliano Mancini, Yongqin Xian, Zeynep Akata, Barbara Caputo
Semantic segmentation models have two fundamental weaknesses: i) they require large training sets with costly pixel-level annotations, and ii) they have a static output space, constrained to the classes of the training set.
no code implementations • NeurIPS 2020 • Wenjia Xu, Yongqin Xian, Jiuniu Wang, Bernt Schiele, Zeynep Akata
As an additional benefit, our model points to the visual evidence of the attributes in an image, e.g. for the CUB dataset, confirming the improved attribute localization ability of our image representation.
1 code implementation • 9 Jul 2020 • Yongqin Xian, Bruno Korbar, Matthijs Douze, Lorenzo Torresani, Bernt Schiele, Zeynep Akata
Few-shot learning aims to recognize novel classes from a few examples.
no code implementations • 5 Feb 2020 • Yue Fan, Yongqin Xian, Max Maria Losch, Bernt Schiele
In this paper, we are pushing the envelope and aim to further investigate the reliance on spatial information.
1 code implementation • CVPR 2019 • Yongqin Xian, Subhabrata Choudhury, Yang He, Bernt Schiele, Zeynep Akata
In this paper we take this one step further and focus on the challenging task of zero- and few-shot learning of semantic segmentation.
Ranked #10 on Zero-Shot Semantic Segmentation on COCO-Stuff
no code implementations • CVPR 2019 • Yongqin Xian, Saurabh Sharma, Bernt Schiele, Zeynep Akata
When labeled training data is scarce, a promising data augmentation approach is to generate visual features of unknown classes using their attributes.
Ranked #4 on Generalized Zero-Shot Learning on SUN Attribute
4 code implementations • CVPR 2018 • Yongqin Xian, Tobias Lorenz, Bernt Schiele, Zeynep Akata
Suffering from the extreme training data imbalance between seen and unseen classes, most existing state-of-the-art approaches fail to achieve satisfactory results on the challenging generalized zero-shot learning task.
Ranked #6 on Generalized Zero-Shot Learning on SUN Attribute
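A minimal sketch of such a conditional feature generator (layer sizes and names are placeholders; the adversarial training loop is omitted):

```python
import torch
import torch.nn as nn

class AttributeConditionedGenerator(nn.Module):
    """Generate CNN feature vectors for unseen classes from their class
    attributes concatenated with random noise."""
    def __init__(self, attr_dim, noise_dim, feat_dim, hidden=4096):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(attr_dim + noise_dim, hidden), nn.LeakyReLU(0.2),
            nn.Linear(hidden, feat_dim), nn.ReLU())

    def forward(self, attributes, noise):         # (batch, attr_dim), (batch, noise_dim)
        return self.net(torch.cat([attributes, noise], dim=-1))
```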
10 code implementations • 3 Jul 2017 • Yongqin Xian, Christoph H. Lampert, Bernt Schiele, Zeynep Akata
Due to the importance of zero-shot learning, i.e. classifying images for which labeled training data is lacking, the number of proposed approaches has increased steadily in recent years.
1 code implementation • CVPR 2017 • Yongqin Xian, Bernt Schiele, Zeynep Akata
Due to the importance of zero-shot learning, the number of proposed approaches has increased steadily in recent years.
no code implementations • CVPR 2016 • Yongqin Xian, Zeynep Akata, Gaurav Sharma, Quynh Nguyen, Matthias Hein, Bernt Schiele
We train the model with a ranking based objective function which penalizes incorrect rankings of the true class for a given image.
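A hedged sketch of such a max-margin ranking objective (the margin value and the reduction over violators are assumptions):

```python
import torch
import torch.nn.functional as F

def max_margin_ranking_loss(scores, target, margin=1.0):
    """Penalise any class whose compatibility score comes within `margin`
    of the true class for a given image.
    scores: (batch, num_classes), target: (batch,) ground-truth class indices."""
    true_score = scores.gather(1, target.unsqueeze(1))             # (batch, 1)
    true_mask = F.one_hot(target, scores.size(1)).bool()
    violations = (margin + scores - true_score).clamp(min=0)
    violations = violations.masked_fill(true_mask, 0.0)            # ignore the true class
    return violations.max(dim=1).values.mean()                     # worst violator per image
```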