Search Results for author: Leonid Karlinsky

Found 50 papers, 29 papers with code

TAP: Targeted Prompting for Task Adaptive Generation of Textual Training Instances for Visual Classification

no code implementations 13 Sep 2023 M. Jehanzeb Mirza, Leonid Karlinsky, Wei Lin, Horst Possegger, Rogerio Feris, Horst Bischof

Vision and Language Models (VLMs), such as CLIP, have enabled visual recognition of a potentially unlimited set of categories described by text prompts.

Zero-Shot Learning

Dense and Aligned Captions (DAC) Promote Compositional Reasoning in VL Models

no code implementations 31 May 2023 Sivan Doveh, Assaf Arbelle, Sivan Harary, Roei Herzig, Donghyun Kim, Paola Cascante-Bonilla, Amit Alfassy, Rameswar Panda, Raja Giryes, Rogerio Feris, Shimon Ullman, Leonid Karlinsky

Vision and Language (VL) models offer an effective method for aligning representation spaces of images and text, leading to numerous applications such as cross-modal retrieval, visual question answering, captioning, and more.

Cross-Modal Retrieval, Question Answering +2

LaFTer: Label-Free Tuning of Zero-shot Classifier using Language and Unlabeled Image Collections

no code implementations 29 May 2023 M. Jehanzeb Mirza, Leonid Karlinsky, Wei Lin, Mateusz Kozinski, Horst Possegger, Rogerio Feris, Horst Bischof

Recently, large-scale pre-trained Vision and Language (VL) models have set a new state-of-the-art (SOTA) in zero-shot visual classification, enabling open-vocabulary recognition of a potentially unlimited set of categories defined by simple language prompts.

Language Modelling, Large Language Model

Comparison of Multilingual Self-Supervised and Weakly-Supervised Speech Pre-Training for Adaptation to Unseen Languages

no code implementations 21 May 2023 Andrew Rouditchenko, Sameer Khurana, Samuel Thomas, Rogerio Feris, Leonid Karlinsky, Hilde Kuehne, David Harwath, Brian Kingsbury, James Glass

Recent models such as XLS-R and Whisper have made multilingual speech technologies more accessible by pre-training on audio from around 100 spoken languages each.

Listen, Think, and Understand

1 code implementation 18 May 2023 Yuan Gong, Hongyin Luo, Alexander H. Liu, Leonid Karlinsky, James Glass

In this paper, we propose a novel audio foundation model, called LTU (Listen, Think, and Understand).

Ranked #3 on Music Question Answering on MusicQA Dataset (using extra training data)

Language Modelling, Large Language Model +1

Going Beyond Nouns With Vision & Language Models Using Synthetic Data

1 code implementation 30 Mar 2023 Paola Cascante-Bonilla, Khaled Shehada, James Seale Smith, Sivan Doveh, Donghyun Kim, Rameswar Panda, Gül Varol, Aude Oliva, Vicente Ordonez, Rogerio Feris, Leonid Karlinsky

We contribute Synthetic Visual Concepts (SyViC) - a million-scale synthetic dataset and data-generation codebase that allows generating additional suitable data to improve the VLC understanding and compositional reasoning of VL models.

Multitask Prompt Tuning Enables Parameter-Efficient Transfer Learning

no code implementations 6 Mar 2023 Zhen Wang, Rameswar Panda, Leonid Karlinsky, Rogerio Feris, Huan Sun, Yoon Kim

Prompt tuning, in which a base pretrained model is adapted to each task via conditioning on learned prompt vectors, has emerged as a promising approach for efficiently adapting large language models to multiple downstream tasks.

Transfer Learning
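
The abstract above describes the prompt-tuning mechanism itself: a frozen pretrained model is conditioned on a small set of learned prompt vectors. Below is a minimal, hedged sketch of that generic mechanism (not the paper's specific multitask decomposition); the class name, `n_prompts`, and the toy backbone are illustrative assumptions.

```python
# Generic soft prompt tuning: prepend trainable prompt vectors to the input
# token embeddings of a frozen backbone; only the prompts are trained.
import torch
import torch.nn as nn

class PromptTunedEncoder(nn.Module):
    def __init__(self, backbone: nn.Module, embed_dim: int, n_prompts: int = 16):
        super().__init__()
        self.backbone = backbone                      # frozen pretrained model
        for p in self.backbone.parameters():
            p.requires_grad = False
        # the only trainable parameters: the prompt matrix for this task
        self.prompts = nn.Parameter(torch.randn(n_prompts, embed_dim) * 0.02)

    def forward(self, token_embeddings: torch.Tensor) -> torch.Tensor:
        # token_embeddings: (batch, seq_len, embed_dim)
        batch = token_embeddings.size(0)
        prompts = self.prompts.unsqueeze(0).expand(batch, -1, -1)
        return self.backbone(torch.cat([prompts, token_embeddings], dim=1))

# Usage with a stand-in transformer layer as the "backbone":
backbone = nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True)
model = PromptTunedEncoder(backbone, embed_dim=64, n_prompts=8)
out = model(torch.randn(2, 10, 64))   # output shape: (2, 8 + 10, 64)
```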

Learning to Grow Pretrained Models for Efficient Transformer Training

no code implementations 2 Mar 2023 Peihao Wang, Rameswar Panda, Lucas Torroba Hennigen, Philip Greengard, Leonid Karlinsky, Rogerio Feris, David Daniel Cox, Zhangyang Wang, Yoon Kim

Scaling transformers has led to significant breakthroughs in many domains, giving rise to a paradigm in which larger versions of existing models are trained and released on a periodic basis.

PromptonomyViT: Multi-Task Prompt Learning Improves Video Transformers using Synthetic Scene Data

no code implementations 8 Dec 2022 Roei Herzig, Ofir Abramovich, Elad Ben-Avraham, Assaf Arbelle, Leonid Karlinsky, Ariel Shamir, Trevor Darrell, Amir Globerson

We present a multi-task prompt learning approach for video transformers, where a shared video transformer backbone is enhanced by a small set of specialized parameters for each task.

Action Recognition, Video Understanding
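
The abstract above describes a shared backbone enhanced by a small set of task-specific parameters. A hedged sketch of that general idea follows, using per-task prompt tokens on a shared transformer; the class, task names, and sizes are assumptions for illustration, not the paper's exact design.

```python
# One shared backbone, one small set of trainable prompt tokens per task.
import torch
import torch.nn as nn

class MultiTaskPromptedBackbone(nn.Module):
    def __init__(self, backbone: nn.Module, embed_dim: int, tasks, n_prompts: int = 4):
        super().__init__()
        self.backbone = backbone  # shared across all tasks
        self.task_prompts = nn.ParameterDict({
            t: nn.Parameter(torch.randn(n_prompts, embed_dim) * 0.02) for t in tasks
        })

    def forward(self, patch_tokens: torch.Tensor, task: str) -> torch.Tensor:
        # patch_tokens: (batch, n_tokens, embed_dim) video patch embeddings
        p = self.task_prompts[task].unsqueeze(0).expand(patch_tokens.size(0), -1, -1)
        return self.backbone(torch.cat([p, patch_tokens], dim=1))

backbone = nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True)
model = MultiTaskPromptedBackbone(backbone, embed_dim=64, tasks=["action", "depth"])
features = model(torch.randn(2, 32, 64), task="action")
```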

On the Transferability of Visual Features in Generalized Zero-Shot Learning

1 code implementation 22 Nov 2022 Paola Cascante-Bonilla, Leonid Karlinsky, James Seale Smith, Yanjun Qi, Vicente Ordonez

Generalized Zero-Shot Learning (GZSL) aims to train a classifier that can generalize to unseen classes, using a set of attributes as auxiliary information, and the visual features extracted from a pre-trained convolutional neural network.

Generalized Zero-Shot Learning, Knowledge Distillation +2
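
The abstract above describes the standard GZSL recipe: visual features from a pretrained CNN are matched against per-class attribute vectors. The sketch below shows that compatibility-scoring idea in its simplest form; the class name, dimensions, and the linear projection are illustrative assumptions.

```python
# Attribute-based zero-shot classification: project visual features into the
# attribute space and score them against the attribute vectors of all classes
# (seen and unseen, in the generalized setting).
import torch
import torch.nn as nn

class AttributeCompatibilityClassifier(nn.Module):
    def __init__(self, feat_dim: int, attr_dim: int):
        super().__init__()
        self.project = nn.Linear(feat_dim, attr_dim)   # learned on seen classes

    def forward(self, visual_feats: torch.Tensor, class_attributes: torch.Tensor):
        # visual_feats: (batch, feat_dim); class_attributes: (n_classes, attr_dim)
        projected = self.project(visual_feats)
        return projected @ class_attributes.t()        # (batch, n_classes) scores

clf = AttributeCompatibilityClassifier(feat_dim=2048, attr_dim=85)
scores = clf(torch.randn(4, 2048), torch.rand(50, 85))
pred = scores.argmax(dim=1)                            # predicted class indices
```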

ConStruct-VL: Data-Free Continual Structured VL Concepts Learning

1 code implementation CVPR 2023 James Seale Smith, Paola Cascante-Bonilla, Assaf Arbelle, Donghyun Kim, Rameswar Panda, David Cox, Diyi Yang, Zsolt Kira, Rogerio Feris, Leonid Karlinsky

This leads to reasoning mistakes, which need to be corrected as they occur by teaching VL models the missing SVLC skills; often this must be done using the private data on which the issue was found, which naturally leads to a data-free continual (no task-id) VL learning setting.

On the Importance of Calibration in Semi-supervised Learning

no code implementations 10 Oct 2022 Charlotte Loh, Rumen Dangovski, Shivchander Sudalairaj, Seungwook Han, Ligong Han, Leonid Karlinsky, Marin Soljacic, Akash Srivastava

State-of-the-art (SOTA) semi-supervised learning (SSL) methods have been highly successful in leveraging a mix of labeled and unlabeled data by combining techniques of consistency regularization and pseudo-labeling.
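The abstract above names the two standard SSL ingredients, consistency regularization and pseudo-labeling. The sketch below shows how they combine in a FixMatch-style unlabeled loss; this is generic SSL pseudocode with assumed names and a 0.95 threshold, not the paper's calibration method.

```python
# Pseudo-labels from weakly augmented views are kept only above a confidence
# threshold and used as targets for strongly augmented views of the same images.
import torch
import torch.nn.functional as F

def ssl_unlabeled_loss(model, weak_batch, strong_batch, threshold=0.95):
    with torch.no_grad():
        probs = F.softmax(model(weak_batch), dim=1)      # predictions on weak views
        conf, pseudo = probs.max(dim=1)                  # confidence + pseudo-label
        mask = (conf >= threshold).float()               # keep confident samples only
    logits_strong = model(strong_batch)                  # predictions on strong views
    per_sample = F.cross_entropy(logits_strong, pseudo, reduction="none")
    return (per_sample * mask).mean()

# Usage with a toy classifier standing in for the real network:
model = torch.nn.Linear(32, 10)
loss = ssl_unlabeled_loss(model, torch.randn(8, 32), torch.randn(8, 32))
```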

C2KD: Cross-Lingual Cross-Modal Knowledge Distillation for Multilingual Text-Video Retrieval

1 code implementation 7 Oct 2022 Andrew Rouditchenko, Yung-Sung Chuang, Nina Shvetsova, Samuel Thomas, Rogerio Feris, Brian Kingsbury, Leonid Karlinsky, David Harwath, Hilde Kuehne, James Glass

Inspired by the fact that English text-video retrieval outperforms other languages, we train a student model using input text in different languages to match the cross-modal predictions from teacher models using input text in English.

Knowledge Distillation, Retrieval +2
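
The abstract above describes matching a student's cross-modal predictions (from non-English text) to a teacher's predictions (from English text). A hedged sketch of that distillation step follows: KL divergence between the two text-video similarity distributions. Embedding shapes and the temperature are illustrative assumptions.

```python
# Cross-lingual knowledge distillation over text-video similarity distributions.
import torch
import torch.nn.functional as F

def cross_lingual_kd_loss(student_text_emb, teacher_text_emb, video_emb, tau=0.07):
    # *_text_emb: (batch, dim) text embeddings; video_emb: (batch, dim)
    student_sim = student_text_emb @ video_emb.t() / tau     # non-English text
    teacher_sim = teacher_text_emb @ video_emb.t() / tau     # English text
    teacher_probs = F.softmax(teacher_sim, dim=1).detach()   # teacher is not updated
    return F.kl_div(F.log_softmax(student_sim, dim=1), teacher_probs,
                    reduction="batchmean")

loss = cross_lingual_kd_loss(torch.randn(4, 256), torch.randn(4, 256),
                             torch.randn(4, 256))
```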

Contrastive Audio-Visual Masked Autoencoder

1 code implementation 2 Oct 2022 Yuan Gong, Andrew Rouditchenko, Alexander H. Liu, David Harwath, Leonid Karlinsky, Hilde Kuehne, James Glass

In this paper, we first extend the recent Masked Auto-Encoder (MAE) model from a single modality to the audio-visual multi-modal setting.

Ranked #1 on Audio Tagging on AudioSet (using extra training data)

Audio Classification, Audio Tagging +4
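
Since the abstract above describes extending MAE to two modalities, the sketch below shows the core masking step under that setup: a large fraction of audio and of visual patch tokens is dropped independently and only the visible tokens are concatenated for a joint encoder. The masking ratio, token counts, and helper name are assumptions for illustration.

```python
# Random per-modality masking of patch tokens, MAE-style.
import torch

def random_mask(tokens: torch.Tensor, mask_ratio: float = 0.75):
    # tokens: (batch, n_tokens, dim); returns visible tokens and kept indices
    b, n, d = tokens.shape
    n_keep = int(n * (1 - mask_ratio))
    noise = torch.rand(b, n)                               # random score per token
    keep = noise.argsort(dim=1)[:, :n_keep]                # indices of kept tokens
    visible = torch.gather(tokens, 1, keep.unsqueeze(-1).expand(-1, -1, d))
    return visible, keep

audio_tokens = torch.randn(2, 512, 768)    # e.g. spectrogram patch embeddings
video_tokens = torch.randn(2, 196, 768)    # e.g. image patch embeddings
audio_vis, _ = random_mask(audio_tokens)
video_vis, _ = random_mask(video_tokens)
joint_input = torch.cat([audio_vis, video_vis], dim=1)  # fed to a joint encoder
```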

VL-Taboo: An Analysis of Attribute-based Zero-shot Capabilities of Vision-Language Models

1 code implementation 12 Sep 2022 Felix Vogel, Nina Shvetsova, Leonid Karlinsky, Hilde Kuehne

We follow up with an analysis of the attribute-based zero-shot learning capabilities of these models, evaluating how well this classical zero-shot notion emerges from large-scale webly supervision.

Retrieval, Text Retrieval +1

FETA: Towards Specializing Foundation Models for Expert Task Applications

1 code implementation 8 Sep 2022 Amit Alfassy, Assaf Arbelle, Oshri Halimi, Sivan Harary, Roei Herzig, Eli Schwartz, Rameswar Panda, Michele Dolfi, Christoph Auer, Kate Saenko, Peter W. J. Staar, Rogerio Feris, Leonid Karlinsky

However, as we show in this paper, FMs still have poor out-of-the-box performance on expert tasks (e.g., retrieval of technical illustrations from car manuals via language queries), whose data is either unseen or belongs to a long-tail part of the data distribution of the huge datasets used for FM pre-training.

Domain Generalization, Image Retrieval +6

Structured Video Tokens @ Ego4D PNR Temporal Localization Challenge 2022

no code implementations 15 Jun 2022 Elad Ben-Avraham, Roei Herzig, Karttikeya Mangalam, Amir Bar, Anna Rohrbach, Leonid Karlinsky, Trevor Darrell, Amir Globerson

First, as both images and videos contain structured information, we enrich a transformer model with a set of object tokens that can be used across images and videos.

Point-of-no-return (PNR) Temporal Localization, Temporal Localization

Bringing Image Scene Structure to Video via Frame-Clip Consistency of Object Tokens

no code implementations 13 Jun 2022 Elad Ben-Avraham, Roei Herzig, Karttikeya Mangalam, Amir Bar, Anna Rohrbach, Leonid Karlinsky, Trevor Darrell, Amir Globerson

We explore a particular instantiation of scene structure, namely a Hand-Object Graph, consisting of hands and objects with their locations as nodes, and physical relations of contact/no-contact as edges.

Action Recognition, Video Understanding
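
The abstract above defines the Hand-Object Graph as a data structure: hands and objects (with locations) as nodes, contact/no-contact relations as edges. The minimal encoding below illustrates that structure only; the dataclass layout and field names are assumptions for the example, not the paper's implementation.

```python
# A plain data structure mirroring the described Hand-Object Graph.
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class Node:
    kind: str                                   # "hand" or "object"
    box: Tuple[float, float, float, float]      # location as an (x1, y1, x2, y2) box

@dataclass
class HandObjectGraph:
    nodes: List[Node] = field(default_factory=list)
    # edges are (node index, node index, in_contact flag)
    edges: List[Tuple[int, int, bool]] = field(default_factory=list)

    def add_relation(self, i: int, j: int, in_contact: bool) -> None:
        self.edges.append((i, j, in_contact))

g = HandObjectGraph(nodes=[Node("hand", (10, 10, 50, 60)),
                           Node("object", (40, 20, 90, 80))])
g.add_relation(0, 1, in_contact=True)           # the hand touches the object
```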

Unsupervised Domain Generalization by Learning a Bridge Across Domains

1 code implementation CVPR 2022 Sivan Harary, Eli Schwartz, Assaf Arbelle, Peter Staar, Shady Abu-Hussein, Elad Amrani, Roei Herzig, Amit Alfassy, Raja Giryes, Hilde Kuehne, Dina Katabi, Kate Saenko, Rogerio Feris, Leonid Karlinsky

The ability to generalize learned representations across significantly different visual domains, such as between real photos, clipart, paintings, and sketches, is a fundamental capacity of the human visual system.

Domain Generalization, Self-Supervised Learning

Dynamic Distillation Network for Cross-Domain Few-Shot Recognition with Unlabeled Data

1 code implementation NeurIPS 2021 Ashraful Islam, Chun-Fu Chen, Rameswar Panda, Leonid Karlinsky, Rogerio Feris, Richard J. Radke

As the base dataset and unlabeled dataset are from different domains, projecting the target images in the class-domain of the base dataset with a fixed pretrained model might be sub-optimal.

cross-domain few-shot learning

Self-Supervised Classification Network

2 code implementations 19 Mar 2021 Elad Amrani, Leonid Karlinsky, Alex Bronstein

To avoid degenerate solutions (i.e., solutions where all samples are assigned to the same label), we propose a mathematically motivated variant of the cross-entropy loss that asserts a uniform prior on the predicted labels.

Classification, Clustering +5
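
The abstract above mentions a cross-entropy variant with a uniform prior on the predicted labels to prevent collapse. The sketch below is one simple way to express that idea, and not necessarily the paper's exact formulation: the usual cross-entropy term plus a penalty that keeps the batch-averaged predicted class distribution close to uniform.

```python
# Cross-entropy with a uniform-prior penalty on the mean prediction.
import torch
import torch.nn.functional as F

def uniform_prior_cross_entropy(logits, targets, prior_weight: float = 1.0):
    # logits: (batch, n_classes); targets: (batch,) pseudo-label indices
    ce = F.cross_entropy(logits, targets)
    probs = F.softmax(logits, dim=1)
    mean_pred = probs.mean(dim=0)                      # average prediction over batch
    uniform = torch.full_like(mean_pred, 1.0 / mean_pred.numel())
    # KL(mean_pred || uniform): zero when predictions spread evenly over classes
    prior_term = (mean_pred * (mean_pred.clamp_min(1e-8) / uniform).log()).sum()
    return ce + prior_weight * prior_term

loss = uniform_prior_cross_entropy(torch.randn(16, 10),
                                   torch.randint(0, 10, (16,)))
```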

A Maximal Correlation Approach to Imposing Fairness in Machine Learning

no code implementations 30 Dec 2020 Joshua Lee, Yuheng Bu, Prasanna Sattigeri, Rameswar Panda, Gregory Wornell, Leonid Karlinsky, Rogerio Feris

As machine learning algorithms grow in popularity and diversify to many industries, ethical and legal concerns regarding their fairness have become increasingly relevant.

BIG-bench Machine Learning, Fairness

AR-Net: Adaptive Frame Resolution for Efficient Action Recognition

1 code implementation ECCV 2020 Yue Meng, Chung-Ching Lin, Rameswar Panda, Prasanna Sattigeri, Leonid Karlinsky, Aude Oliva, Kate Saenko, Rogerio Feris

Specifically, given a video frame, a policy network is used to decide what input resolution should be used for processing by the action recognition model, with the goal of improving both accuracy and efficiency.

Action Recognition
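
The abstract above describes a per-frame decision: a policy network picks an input resolution for each video frame, and the frame is resized accordingly before the recognition model sees it. The sketch below illustrates only that selection-and-resize step; the candidate resolutions, network sizes, and names are assumptions. At training time the discrete choice would need a differentiable relaxation (e.g., Gumbel-Softmax), which is omitted here.

```python
# A lightweight policy picks a resolution per frame; the frame is resized to it.
import torch
import torch.nn as nn
import torch.nn.functional as F

RESOLUTIONS = [224, 168, 112, 84]   # candidate side lengths (illustrative)

class ResolutionPolicy(nn.Module):
    def __init__(self, n_choices: int = len(RESOLUTIONS)):
        super().__init__()
        self.features = nn.Sequential(nn.Conv2d(3, 8, 3, stride=2), nn.ReLU(),
                                      nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.head = nn.Linear(8, n_choices)

    def forward(self, frame: torch.Tensor) -> int:
        # frame: (1, 3, H, W); returns the index of the chosen resolution
        logits = self.head(self.features(frame))
        return int(logits.argmax(dim=1).item())

policy = ResolutionPolicy()
frame = torch.randn(1, 3, 224, 224)
side = RESOLUTIONS[policy(frame)]
resized = F.interpolate(frame, size=(side, side), mode="bilinear",
                        align_corners=False)   # fed to the action recognition model
```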

TAFSSL: Task-Adaptive Feature Sub-Space Learning for few-shot classification

1 code implementation ECCV 2020 Moshe Lichtenstein, Prasanna Sattigeri, Rogerio Feris, Raja Giryes, Leonid Karlinsky

The field of Few-Shot Learning (FSL), or learning from very few (typically 1 or 5) examples per novel class (unseen during training), has received a lot of attention and seen significant performance advances in the recent literature.

Few-Shot Learning, General Classification

A Broader Study of Cross-Domain Few-Shot Learning

2 code implementations ECCV 2020 Yunhui Guo, Noel C. Codella, Leonid Karlinsky, James V. Codella, John R. Smith, Kate Saenko, Tajana Rosing, Rogerio Feris

Extensive experiments on the proposed benchmark are performed to evaluate state-of-the-art meta-learning approaches, transfer learning approaches, and newer methods for cross-domain few-shot learning.

cross-domain few-shot learning, Few-Shot Image Classification +1

MetAdapt: Meta-Learned Task-Adaptive Architecture for Few-Shot Classification

no code implementations 1 Dec 2019 Sivan Doveh, Eli Schwartz, Chao Xue, Rogerio Feris, Alex Bronstein, Raja Giryes, Leonid Karlinsky

In this work, we propose to employ tools inspired by the Differentiable Neural Architecture Search (D-NAS) literature in order to optimize the architecture for FSL without over-fitting.

Classification, Few-Shot Learning +2
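
The abstract above refers to tools from the Differentiable NAS (D-NAS) literature. A core such tool is the "mixed operation": a softmax-weighted combination of candidate operations whose weights (architecture parameters) are learned by gradient descent. The sketch below is a generic D-NAS illustration, not MetAdapt's specific search space or blocks.

```python
# DARTS-style mixed operation: architecture parameters weight candidate ops.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixedOp(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.ops = nn.ModuleList([
            nn.Identity(),                                  # skip connection
            nn.Conv2d(channels, channels, 3, padding=1),    # 3x3 conv
            nn.Conv2d(channels, channels, 1),               # 1x1 conv
        ])
        # one architecture parameter per candidate operation
        self.alpha = nn.Parameter(torch.zeros(len(self.ops)))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        weights = F.softmax(self.alpha, dim=0)
        return sum(w * op(x) for w, op in zip(weights, self.ops))

mixed = MixedOp(channels=16)
out = mixed(torch.randn(2, 16, 8, 8))   # alphas are optimized alongside the weights
```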

Baby steps towards few-shot learning with multiple semantics

no code implementations 5 Jun 2019 Eli Schwartz, Leonid Karlinsky, Rogerio Feris, Raja Giryes, Alex M. Bronstein

Learning from one or few visual examples is one of the key capabilities of humans since early infancy, but is still a significant challenge for modern AI systems.

Few-Shot Image Classification, Few-Shot Learning

RepMet: Representative-Based Metric Learning for Classification and Few-Shot Object Detection

1 code implementation CVPR 2019 Leonid Karlinsky, Joseph Shtok, Sivan Harary, Eli Schwartz, Amit Aides, Rogerio Feris, Raja Giryes, Alex M. Bronstein

Distance metric learning (DML) has been successfully applied to object classification, both in the standard regime of rich training data and in the few-shot scenario, where each category is represented by only a few examples.

Classification, Few-Shot Object Detection +3

LaSO: Label-Set Operations networks for multi-label few-shot learning

2 code implementations CVPR 2019 Amit Alfassy, Leonid Karlinsky, Amit Aides, Joseph Shtok, Sivan Harary, Rogerio Feris, Raja Giryes, Alex M. Bronstein

We conduct numerous experiments showing promising results for the label-set manipulation capabilities of the proposed approach, both directly (using the classification and retrieval metrics), and in the context of performing data augmentation for multi-label few-shot learning.

Data Augmentation, Few-Shot Learning +2
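
The paper above learns networks that perform set operations on the label sets of image pairs in feature space. The sketch below only shows the target operations in label space, on multi-hot label vectors (union, intersection, subtraction); the learned feature-space counterparts are not reproduced here, and the label encoding is an assumption for the example.

```python
# Label-set operations on multi-hot label vectors (the targets the networks mimic).
import torch

def label_union(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    return torch.clamp(a + b, max=1)           # labels present in either image

def label_intersection(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    return a * b                               # labels present in both images

def label_subtraction(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    return torch.clamp(a - b, min=0)           # labels of a that b does not have

a = torch.tensor([1, 0, 1, 1])                 # e.g. {dog, car, person}
b = torch.tensor([0, 1, 1, 0])                 # e.g. {cat, car}
print(label_union(a, b), label_intersection(a, b), label_subtraction(a, b))
```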

Co-regularized Alignment for Unsupervised Domain Adaptation

no code implementations NeurIPS 2018 Abhishek Kumar, Prasanna Sattigeri, Kahini Wadhawan, Leonid Karlinsky, Rogerio Feris, William T. Freeman, Gregory Wornell

Deep neural networks, trained with large amounts of labeled data, can fail to generalize well when tested on examples from a target domain whose distribution differs from that of the training data, referred to as the source domain.

Unsupervised Domain Adaptation

RepMet: Representative-based metric learning for classification and one-shot object detection

1 code implementation 12 Jun 2018 Leonid Karlinsky, Joseph Shtok, Sivan Harary, Eli Schwartz, Amit Aides, Rogerio Feris, Raja Giryes, Alex M. Bronstein

Distance metric learning (DML) has been successfully applied to object classification, both in the standard regime of rich training data and in the few-shot scenario, where each category is represented by only a few examples.

Classification, Few-Shot Object Detection +4

Fine-Grained Recognition of Thousands of Object Categories With Single-Example Training

1 code implementation CVPR 2017 Leonid Karlinsky, Joseph Shtok, Yochay Tzur, Asaf Tzadok

We approach the problem of fast detection and recognition of a large number (thousands) of object categories while training on a very limited amount of examples, usually one per category.

Using body-anchored priors for identifying actions in single images

no code implementations NeurIPS 2010 Leonid Karlinsky, Michael Dinerstein, Shimon Ullman

The task is easy for humans but difficult for current approaches to object recognition, because action instances may be similar in terms of body pose, and often require detailed examination of relations between participating objects and body parts in order to be recognized.

Object Recognition
