Search Results for author: Roei Herzig

Found 24 papers, 16 papers with code

TraveLER: A Multi-LMM Agent Framework for Video Question-Answering

no code implementations • 1 Apr 2024 • Chuyi Shang, Amos You, Sanjay Subramanian, Trevor Darrell, Roei Herzig

Specifically, we propose TraveLER, a model that can create a plan to "Traverse" through the video, ask questions about individual frames to "Locate" and store key information, and then "Evaluate" if there is enough information to answer the question.

Question Answering Video Question Answering

Paper
Add Code

Unsupervised Universal Image Segmentation

1 code implementation • 28 Dec 2023 • Dantong Niu, Xudong Wang, Xinyang Han, Long Lian, Roei Herzig, Trevor Darrell

Several unsupervised image segmentation approaches have been proposed which eliminate the need for dense manually-annotated segmentation masks; current models separately handle either semantic segmentation (e. g., STEGO) or class-agnostic instance segmentation (e. g., CutLER), but not both (i. e., panoptic segmentation).

Ranked #1 on Unsupervised Panoptic Segmentation on COCO val2017

Image Segmentation Instance Segmentation +7

125

Paper
Code

Recursive Visual Programming

no code implementations • 4 Dec 2023 • Jiaxin Ge, Sanjay Subramanian, Baifeng Shi, Roei Herzig, Trevor Darrell

Visual Programming (VP) has emerged as a powerful framework for Visual Question Answering (VQA).

Code Generation Question Answering +1

Paper
Add Code

Object-based (yet Class-agnostic) Video Domain Adaptation

no code implementations • 29 Nov 2023 • Dantong Niu, Amir Bar, Roei Herzig, Trevor Darrell, Anna Rohrbach

Existing video-based action recognition systems typically require dense annotation and struggle in environments when there is significant distribution shift relative to the training data.

Action Recognition Domain Adaptation +1

Paper
Add Code

Compositional Chain-of-Thought Prompting for Large Multimodal Models

1 code implementation • 27 Nov 2023 • Chancharik Mitra, Brandon Huang, Trevor Darrell, Roei Herzig

The combination of strong visual backbones and Large Language Model (LLM) reasoning has led to Large Multimodal Models (LMMs) becoming the current standard for a wide range of vision and language (VL) tasks.

Ranked #30 on Visual Reasoning on Winoground

Language Modelling Large Language Model +1

Paper
Code

Incorporating Structured Representations into Pretrained Vision & Language Models Using Scene Graphs

no code implementations • 10 May 2023 • Roei Herzig, Alon Mendelson, Leonid Karlinsky, Assaf Arbelle, Rogerio Feris, Trevor Darrell, Amir Globerson

For the visual side, we incorporate a special "SG Component" in the image transformer trained to predict SG information, while for the textual side, we utilize SGs to generate fine-grained captions that highlight different compositional aspects of the scene.

Ranked #24 on Visual Reasoning on Winoground

Scene Understanding Visual Reasoning

Paper
Add Code

Teaching Structured Vision & Language Concepts to Vision & Language Models

1 code implementation • CVPR 2023 • Sivan Doveh, Assaf Arbelle, Sivan Harary, Eli Schwartz, Roei Herzig, Raja Giryes, Rogerio Feris, Rameswar Panda, Shimon Ullman, Leonid Karlinsky

Vision and Language (VL) models have demonstrated remarkable zero-shot performance in a variety of tasks.

Paper
Code

PromptonomyViT: Multi-Task Prompt Learning Improves Video Transformers using Synthetic Scene Data

no code implementations • 8 Dec 2022 • Roei Herzig, Ofir Abramovich, Elad Ben-Avraham, Assaf Arbelle, Leonid Karlinsky, Ariel Shamir, Trevor Darrell, Amir Globerson

In this work, we propose an approach to leverage synthetic scene data for improving video understanding.

Action Recognition Video Understanding

Paper
Add Code

Teaching Structured Vision&Language Concepts to Vision&Language Models

1 code implementation • 21 Nov 2022 • Sivan Doveh, Assaf Arbelle, Sivan Harary, Rameswar Panda, Roei Herzig, Eli Schwartz, Donghyun Kim, Raja Giryes, Rogerio Feris, Shimon Ullman, Leonid Karlinsky

Vision and Language (VL) models have demonstrated remarkable zero-shot performance in a variety of tasks.

Paper
Code

FETA: Towards Specializing Foundation Models for Expert Task Applications

1 code implementation • 8 Sep 2022 • Amit Alfassy, Assaf Arbelle, Oshri Halimi, Sivan Harary, Roei Herzig, Eli Schwartz, Rameswar Panda, Michele Dolfi, Christoph Auer, Kate Saenko, PeterW. J. Staar, Rogerio Feris, Leonid Karlinsky

However, as we show in this paper, FMs still have poor out-of-the-box performance on expert tasks (e. g. retrieval of car manuals technical illustrations from language queries), data for which is either unseen or belonging to a long-tail part of the data distribution of the huge datasets used for FM pre-training.

Ranked #1 on Image-to-Text Retrieval on FETA Car-Manuals

Domain Generalization Image Retrieval +6

Paper
Code

Structured Video Tokens @ Ego4D PNR Temporal Localization Challenge 2022

no code implementations • 15 Jun 2022 • Elad Ben-Avraham, Roei Herzig, Karttikeya Mangalam, Amir Bar, Anna Rohrbach, Leonid Karlinsky, Trevor Darrell, Amir Globerson

First, as both images and videos contain structured information, we enrich a transformer model with a set of \emph{object tokens} that can be used across images and videos.

Point- of-no-return (PNR) temporal localization Temporal Localization

Paper
Add Code

Bringing Image Scene Structure to Video via Frame-Clip Consistency of Object Tokens

no code implementations • 13 Jun 2022 • Elad Ben-Avraham, Roei Herzig, Karttikeya Mangalam, Amir Bar, Anna Rohrbach, Leonid Karlinsky, Trevor Darrell, Amir Globerson

We explore a particular instantiation of scene structure, namely a \emph{Hand-Object Graph}, consisting of hands and objects with their locations as nodes, and physical relations of contact/no-contact as edges.

Action Recognition Video Understanding

Paper
Add Code

Unsupervised Domain Generalization by Learning a Bridge Across Domains

1 code implementation • CVPR 2022 • Sivan Harary, Eli Schwartz, Assaf Arbelle, Peter Staar, Shady Abu-Hussein, Elad Amrani, Roei Herzig, Amit Alfassy, Raja Giryes, Hilde Kuehne, Dina Katabi, Kate Saenko, Rogerio Feris, Leonid Karlinsky

The ability to generalize learned representations across significantly different visual domains, such as between real photos, clipart, paintings, and sketches, is a fundamental capacity of the human visual system.

Domain Generalization Self-Supervised Learning

Paper
Code

Object-Region Video Transformers

1 code implementation • CVPR 2022 • Roei Herzig, Elad Ben-Avraham, Karttikeya Mangalam, Amir Bar, Gal Chechik, Anna Rohrbach, Trevor Darrell, Amir Globerson

In this work, we present Object-Region Video Transformers (ORViT), an \emph{object-centric} approach that extends video transformer layers with a block that directly incorporates object representations.

Ranked #6 on Action Recognition on Diving-48

Action Detection Few-Shot action recognition +3

Paper
Code

DETReg: Unsupervised Pretraining with Region Priors for Object Detection

1 code implementation • CVPR 2022 • Amir Bar, Xin Wang, Vadim Kantorov, Colorado J Reed, Roei Herzig, Gal Chechik, Anna Rohrbach, Trevor Darrell, Amir Globerson

Recent self-supervised pretraining methods for object detection largely focus on pretraining the backbone of the object detector, neglecting key parts of detection architecture.

Ranked #1 on Few-Shot Object Detection on COCO 2017

Few-Shot Learning Few-Shot Object Detection +6

332

Paper
Code

Learning Object Detection from Captions via Textual Scene Attributes

no code implementations • 30 Sep 2020 • Achiya Jerbi, Roei Herzig, Jonathan Berant, Gal Chechik, Amir Globerson

In this work, we argue that captions contain much richer information about the image, including attributes of objects and their relations.

Image Captioning Object +2

Paper
Add Code

Compositional Video Synthesis with Action Graphs

1 code implementation • 27 Jun 2020 • Amir Bar, Roei Herzig, Xiaolong Wang, Anna Rohrbach, Gal Chechik, Trevor Darrell, Amir Globerson

Our generative model for this task (AG2Vid) disentangles motion and appearance features, and by incorporating a scheduling mechanism for actions facilitates a timely and coordinated video generation.

Scheduling Video Generation +2

Paper
Code

Something-Else: Compositional Action Recognition with Spatial-Temporal Interaction Networks

1 code implementation • CVPR 2020 • Joanna Materzynska, Tete Xiao, Roei Herzig, Huijuan Xu, Xiaolong Wang, Trevor Darrell

Human action is naturally compositional: humans can easily recognize and perform actions with objects that are different from those used in training demonstrations.

Action Recognition Object

138

Paper
Code

Learning Canonical Representations for Scene Graph to Image Generation

2 code implementations • ECCV 2020 • Roei Herzig, Amir Bar, Huijuan Xu, Gal Chechik, Trevor Darrell, Amir Globerson

Generating realistic images of complex visual scenes becomes challenging when one wishes to control the structure of the generated images.

Ranked #3 on Layout-to-Image Generation on Visual Genome 256x256

Layout-to-Image Generation Scene Generation

Paper
Code

Accurate Visual Localization for Automotive Applications

1 code implementation • 1 May 2019 • Eli Brosh, Matan Friedmann, Ilan Kadar, Lev Yitzhak Lavy, Elad Levi, Shmuel Rippa, Yair Lempert, Bruno Fernandez-Ruiz, Roei Herzig, Trevor Darrell

We propose a hybrid coarse-to-fine approach that leverages visual and GPS location cues.

Retrieval Visual Localization

Paper
Code

Precise Detection in Densely Packed Scenes

5 code implementations • CVPR 2019 • Eran Goldman, Roei Herzig, Aviv Eisenschtat, Oria Ratzon, Itsik Levi, Jacob Goldberger, Tal Hassner

We propose a novel, deep-learning based method for precise object detection, designed for such challenging settings.

Ranked #4 on Dense Object Detection on SKU-110K

Dense Object Detection Object +1

754

Paper
Code

Differentiable Scene Graphs

1 code implementation • 26 Feb 2019 • Moshiko Raboh, Roei Herzig, Gal Chechik, Jonathan Berant, Amir Globerson

In many domains, it is preferable to train systems jointly in an end-to-end manner, but SGs are not commonly used as intermediate components in visual reasoning systems because being discrete and sparse, scene-graph representations are non-differentiable and difficult to optimize.

Visual Reasoning

Paper
Code

Spatio-Temporal Action Graph Networks

1 code implementation • 4 Dec 2018 • Roei Herzig, Elad Levi, Huijuan Xu, Hang Gao, Eli Brosh, Xiaolong Wang, Amir Globerson, Trevor Darrell

Events defined by the interaction of objects in a scene are often of critical importance; yet important events may have insufficient labeled examples to train a conventional deep model to generalize to future object appearance.

Activity Recognition Autonomous Driving +3

Paper
Code

Mapping Images to Scene Graphs with Permutation-Invariant Structured Prediction

1 code implementation • NeurIPS 2018 • Roei Herzig, Moshiko Raboh, Gal Chechik, Jonathan Berant, Amir Globerson

Machine understanding of complex images is a key goal of artificial intelligence.

Scene Graph Classification Scene Graph Generation +1

Paper
Code

Cannot find the paper you are looking for? You can Submit a new open access paper.