Search Results for author: Leonid Sigal

Found 98 papers, 26 papers with code

Light Field Neural Rendering

1 code implementation CVPR 2022 Mohammed Suhail, Carlos Esteves, Leonid Sigal, Ameesh Makadia

Classical light field rendering for novel view synthesis can accurately reproduce view-dependent effects such as reflection, refraction, and translucency, but requires a dense view sampling of the scene.

Neural Rendering Novel View Synthesis

Improved Few-Shot Visual Classification

2 code implementations CVPR 2020 Peyman Bateni, Raghav Goyal, Vaden Masrani, Frank Wood, Leonid Sigal

Few-shot learning is a fundamental task in computer vision that carries the promise of alleviating the need for exhaustively labeled data.

Classification Few-Shot Image Classification +3

Beyond Simple Meta-Learning: Multi-Purpose Models for Multi-Domain, Active and Continual Few-Shot Learning

2 code implementations 13 Jan 2022 Peyman Bateni, Jarred Barber, Raghav Goyal, Vaden Masrani, Jan-Willem van de Meent, Leonid Sigal, Frank Wood

The first method, Simple CNAPS, employs a hierarchically regularized Mahalanobis-distance-based classifier combined with a state-of-the-art neural adaptive feature extractor to achieve strong performance on the Meta-Dataset, mini-ImageNet, and tiered-ImageNet benchmarks.

Active Learning continual few-shot learning +3
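The hierarchically regularized Mahalanobis classifier named in the Simple CNAPS snippet can be sketched in a few lines of NumPy. This is a minimal sketch, not the authors' code: it assumes features are already extracted (the neural adaptive feature extractor is omitted) and assumes a per-class covariance Q_k = λ_k Σ_k + (1 − λ_k) Σ_task + βI with λ_k = n_k / (n_k + 1) and β = 1, which is our reading of the Simple CNAPS formulation. The function name `mahalanobis_classify` is hypothetical.

```python
import numpy as np


def mahalanobis_classify(support, support_labels, query, beta=1.0):
    """Hypothetical sketch of a hierarchically regularized Mahalanobis classifier.

    Assumed covariance for class k (not taken from any official code):
        Q_k = lam_k * Sigma_k + (1 - lam_k) * Sigma_task + beta * I,
        lam_k = n_k / (n_k + 1)
    """
    classes = np.unique(support_labels)
    d = support.shape[1]
    # Task-level covariance pooled over all support features (needs >= 2 rows).
    task_cov = np.atleast_2d(np.cov(support, rowvar=False))
    logits = []
    for k in classes:
        x_k = support[support_labels == k]
        n_k = len(x_k)
        mu_k = x_k.mean(axis=0)
        # A single example has no sample covariance, so use zeros in that case.
        if n_k > 1:
            class_cov = np.atleast_2d(np.cov(x_k, rowvar=False))
        else:
            class_cov = np.zeros((d, d))
        lam = n_k / (n_k + 1.0)
        q_k = lam * class_cov + (1.0 - lam) * task_cov + beta * np.eye(d)
        diff = query - mu_k
        # Squared Mahalanobis distance of every query row to the class mean.
        dist = np.einsum("ij,jk,ik->i", diff, np.linalg.inv(q_k), diff)
        logits.append(-0.5 * dist)
    logits = np.stack(logits, axis=1)
    # Numerically stable softmax over classes.
    logits -= logits.max(axis=1, keepdims=True)
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)
    return classes[np.argmax(probs, axis=1)], probs
```

With two well-separated 2-D clusters as the support set, queries close to each cluster mean are assigned that cluster's label. The βI term keeps every Q_k invertible even when a class has only one support example, which is the low-shot case this kind of regularization is meant to handle.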

Multi-level Semantic Feature Augmentation for One-shot Learning

1 code implementation 15 Apr 2018 Zitian Chen, Yanwei Fu, Yinda Zhang, Yu-Gang Jiang, Xiangyang Xue, Leonid Sigal

In semantic space, we search for related concepts, which are then projected back into the image feature spaces by the decoder portion of the TriNet.

Novel Concepts One-Shot Learning

Discriminative Feature Alignment: Improving Transferability of Unsupervised Domain Adaptation by Gaussian-guided Latent Alignment

1 code implementation 23 Jun 2020 Jing Wang, Jiahong Chen, Jianzhe Lin, Leonid Sigal, Clarence W. de Silva

To solve this problem, we introduce a Gaussian-guided latent alignment approach to align the latent feature distributions of the two domains under the guidance of the prior distribution.

Data Augmentation Domain Generalization +3

Referring Transformer: A One-step Approach to Multi-task Visual Grounding

1 code implementation NeurIPS 2021 Muchen Li, Leonid Sigal

As an important step towards visual reasoning, visual grounding (e.g., phrase localization, referring expression comprehension/segmentation) has been widely explored. Previous approaches to referring expression comprehension (REC) or segmentation (RES) either suffer from limited performance, due to a two-stage setup, or require designing complex task-specific one-stage architectures.

Referring Expression Referring Expression Comprehension +4

DwNet: Dense warp-based network for pose-guided human video generation

2 code implementations 21 Oct 2019 Polina Zablotskaia, Aliaksandr Siarohin, Bo Zhao, Leonid Sigal

In this paper, we focus on human motion transfer: the generation of a video depicting a particular subject, observed in a single image, performing a series of motions exemplified by an auxiliary (driving) video.

Video Generation

Multilevel Language and Vision Integration for Text-to-Clip Retrieval

1 code implementation 13 Apr 2018 Huijuan Xu, Kun He, Bryan A. Plummer, Leonid Sigal, Stan Sclaroff, Kate Saenko

To capture the inherent structures present in both text and video, we introduce a multilevel model that integrates vision and language features earlier and more tightly than prior work.

Retrieval Sentence

Make-A-Story: Visual Memory Conditioned Consistent Story Generation

1 code implementation CVPR 2023 Tanzila Rahman, Hsin-Ying Lee, Jian Ren, Sergey Tulyakov, Shweta Mahajan, Leonid Sigal

Our story generation experiments on the MUGEN, PororoSV, and FlintstonesSV datasets show that our method not only outperforms the prior state of the art in generating frames with high visual quality that are consistent with the story, but also models appropriate correspondences between the characters and the background.

Sentence Story Generation +1

Joint Event Detection and Description in Continuous Video Streams

1 code implementation 28 Feb 2018 Huijuan Xu, Boyang Li, Vasili Ramanishka, Leonid Sigal, Kate Saenko

In order to explicitly model temporal relationships between visual events and their captions in a single video, we also propose a two-level hierarchical captioning module that keeps track of context.

Dense Captioning Dense Video Captioning +2

Modular Generative Adversarial Networks

2 code implementations ECCV 2018 Bo Zhao, Bo Chang, Zequn Jie, Leonid Sigal

Existing methods for multi-domain image-to-image translation (or generation) attempt to directly map an input image (or a random vector) to an image in one of the output domains.

Attribute Image-to-Image Translation +1

VLC-BERT: Visual Question Answering with Contextualized Commonsense Knowledge

1 code implementation 24 Oct 2022 Sahithya Ravi, Aditya Chinchure, Leonid Sigal, Renjie Liao, Vered Shwartz

In contrast to previous methods which inject knowledge from static knowledge bases, we investigate the incorporation of contextualized knowledge using Commonsense Transformer (COMET), an existing knowledge model trained on human-curated knowledge bases.

Ranked #8 on Visual Question Answering (VQA) on A-OKVQA (DA VQA Score metric)

Question Answering Visual Question Answering

A Neural Multi-sequence Alignment TeCHnique (NeuMATCH)

1 code implementation CVPR 2018 Pelin Dogan, Boyang Li, Leonid Sigal, Markus Gross

The alignment of heterogeneous sequential data (video to text) is an important and challenging problem.

Dynamic Time Warping

TriBERT: Full-body Human-centric Audio-visual Representation Learning for Visual Sound Separation

1 code implementation 26 Oct 2021 Tanzila Rahman, Mengyu Yang, Leonid Sigal

In this work, we introduce TriBERT, a transformer-based architecture inspired by ViLBERT, which enables contextual feature learning across three modalities (vision, pose, and audio) with the use of flexible co-attention.

Pose Retrieval Representation Learning +1

TriBERT: Human-centric Audio-visual Representation Learning

1 code implementation NeurIPS 2021 Tanzila Rahman, Mengyu Yang, Leonid Sigal

In this work, we introduce TriBERT, a transformer-based architecture inspired by ViLBERT, which enables contextual feature learning across three modalities (vision, pose, and audio) with the use of flexible co-attention.

Pose Retrieval Representation Learning +1

Front2Back: Single View 3D Shape Reconstruction via Front to Back Prediction

1 code implementation CVPR 2020 Yuan Yao, Nico Schertler, Enrique Rosales, Helge Rhodin, Leonid Sigal, Alla Sheffer

Reconstruction of a 3D shape from a single 2D image is a classical computer vision problem, whose difficulty stems from the inherent ambiguity of recovering occluded or only partially observed surfaces.

3D Shape Reconstruction Surface Reconstruction

Attribute-guided image generation from layout

2 code implementations 27 Aug 2020 Ke Ma, Bo Zhao, Leonid Sigal

Also, images generated by our model have higher resolution, object classification accuracy, and consistency compared to the previous state of the art.

Attribute Image Generation +2

Vocabulary-informed Zero-shot and Open-set Learning

1 code implementation 3 Jan 2023 Yanwei Fu, Xiaomei Wang, Hanze Dong, Yu-Gang Jiang, Meng Wang, Xiangyang Xue, Leonid Sigal

Despite significant progress in object categorization in recent years, a number of important challenges remain, mainly the ability to learn from limited labeled data and to recognize object classes within a large, potentially open, set of labels.

Object Categorization Open Set Learning +1

DINN360: Deformable Invertible Neural Network for Latitude-Aware 360° Image Rescaling

1 code implementation CVPR 2023 Yichen Guo, Mai Xu, Lai Jiang, Leonid Sigal, Yunjin Chen

To alleviate this issue, we propose the first attempt at 360° image rescaling, which refers to downscaling a 360° image to a visually valid low-resolution (LR) counterpart and then upscaling it to a high-resolution (HR) 360° image given the LR variant.

valid

Visual Reference Resolution using Attention Memory for Visual Dialog

no code implementations NeurIPS 2017 Paul Hongsuck Seo, Andreas Lehrmann, Bohyung Han, Leonid Sigal

From this memory, the model retrieves the previous attention, taking into account recency, which is most relevant for the current question, in order to resolve potentially ambiguous references.

Ranked #13 on Visual Dialog on VisDial v0.9 val (R@1 metric)

Parameter Prediction Question Answering +3

Recent Advances in Zero-shot Recognition

no code implementations 13 Oct 2017 Yanwei Fu, Tao Xiang, Yu-Gang Jiang, Xiangyang Xue, Leonid Sigal, Shaogang Gong

With the recent renaissance of deep convolutional neural networks, encouraging breakthroughs have been achieved on supervised recognition tasks, where each class has sufficient and fully annotated training data.

Open Set Learning Zero-Shot Learning

Action Classification and Highlighting in Videos

no code implementations 31 Aug 2017 Atousa Torabi, Leonid Sigal

Inspired by recent advances in neural machine translation, which jointly align and translate using encoder-decoder networks equipped with attention, we propose an attention-based LSTM model for human activity recognition.

Action Classification Classification +4

Weakly-supervised Visual Grounding of Phrases with Linguistic Structures

no code implementations CVPR 2017 Fanyi Xiao, Leonid Sigal, Yong Jae Lee

We propose a weakly-supervised approach that takes image-sentence pairs as input and learns to visually ground (i.e., localize) arbitrary linguistic phrases, in the form of spatial attention masks.

Sentence Visual Grounding

Weakly-Supervised Spatial Context Networks

no code implementations 10 Apr 2017 Zuxuan Wu, Larry S. Davis, Leonid Sigal

In particular, we propose spatial context networks that learn to predict a representation of one image patch from another image patch, within the same image, conditioned on their real-valued relative spatial offset.

Object Object Categorization

Semi-Latent GAN: Learning to generate and modify facial images from attributes

no code implementations 7 Apr 2017 Weidong Yin, Yanwei Fu, Leonid Sigal, Xiangyang Xue

Generating and manipulating human facial images using high-level attribute controls are important and interesting problems.

Attribute Generative Adversarial Network

Learning to Generate Posters of Scientific Papers by Probabilistic Graphical Models

no code implementations 21 Feb 2017 Yu-ting Qiang, Yanwei Fu, Xiao Yu, Yanwen Guo, Zhi-Hua Zhou, Leonid Sigal

In order to bridge the gap between panel attributes and the composition within each panel, we also propose a recursive page splitting algorithm to generate the panel layout for a poster.

Learning Language-Visual Embedding for Movie Understanding with Natural-Language

no code implementations 26 Sep 2016 Atousa Torabi, Niket Tandon, Leonid Sigal

We evaluate our models on the large-scale LSMDC16 movie dataset for two tasks: (1) standard ranking for video annotation and retrieval, and (2) our proposed movie multiple-choice test.

Multiple-choice Retrieval +1

Semi-supervised Vocabulary-informed Learning

no code implementations CVPR 2016 Yanwei Fu, Leonid Sigal

Despite significant progress in object categorization in recent years, a number of important challenges remain, mainly the ability to learn from limited labeled data and the ability to recognize object classes within a large, potentially open, set of labels.

Object Categorization Open Set Learning +1

Learning to Generate Posters of Scientific Papers

no code implementations 5 Apr 2016 Yu-ting Qiang, Yanwei Fu, Yanwen Guo, Zhi-Hua Zhou, Leonid Sigal

Then, given the inferred layout and attributes, the composition of graphical elements within each panel is synthesized.

Robust Classification by Pre-conditioned LASSO and Transductive Diffusion Component Analysis

no code implementations 19 Nov 2015 Yanwei Fu, De-An Huang, Leonid Sigal

Collecting datasets in this way, however, requires robust and efficient ways of detecting and excluding outliers, which are prevalent.

BIG-bench Machine Learning Classification +3

Learning from Synthetic Data Using a Stacked Multichannel Autoencoder

no code implementations 17 Sep 2015 Xi Zhang, Yanwei Fu, Shanshan Jiang, Leonid Sigal, Gady Agam

In this paper, we investigate and formalize a general framework, the Stacked Multichannel Autoencoder (SMCAE), that enables bridging the synthetic gap and learning from synthetic data more efficiently.

Sketch Recognition

Learning Classifiers from Synthetic Data Using a Multichannel Autoencoder

no code implementations 11 Mar 2015 Xi Zhang, Yanwei Fu, Andi Zang, Leonid Sigal, Gady Agam

Experimental results on two datasets validate the efficiency of our MCAE model and our methodology of generating synthetic data.

General Classification

Hierarchical Maximum-Margin Clustering

no code implementations 6 Feb 2015 Guang-Tong Zhou, Sung Ju Hwang, Mark Schmidt, Leonid Sigal, Greg Mori

We present a hierarchical maximum-margin clustering method for unsupervised data analysis.

Clustering

A Unified Semantic Embedding: Relating Taxonomies and Attributes

no code implementations NeurIPS 2014 Sung Ju Hwang, Leonid Sigal

We propose a method that learns a discriminative yet semantic space for object categorization, where we also embed auxiliary semantic entities such as supercategories and attributes.

Object Categorization

High-Dimensional Feature Selection by Feature-Wise Kernelized Lasso

no code implementations 2 Feb 2012 Makoto Yamada, Wittawat Jitkrittum, Leonid Sigal, Eric P. Xing, Masashi Sugiyama

We first show that, with particular choices of kernel functions, non-redundant features with strong statistical dependence on output values can be found in terms of kernel-based independence measures.

feature selection Vocal Bursts Intensity Prediction
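The kernel-based independence measure referred to in this snippet is, to our understanding, the Hilbert-Schmidt Independence Criterion (HSIC). The sketch below only scores each feature on its own by its empirical HSIC with the output, using Gaussian kernels with a median-heuristic bandwidth (our choice, an assumption). It is not the feature-wise kernelized Lasso itself, which additionally solves a non-negative Lasso problem so that redundant features are penalized. The function names are hypothetical.

```python
import numpy as np


def gaussian_gram(v, sigma=None):
    """Gaussian-kernel Gram matrix for a 1-D sample (median-heuristic bandwidth)."""
    v = np.asarray(v, dtype=float).reshape(-1, 1)
    sq = (v - v.T) ** 2
    if sigma is None:
        # Median of the non-zero pairwise distances; fall back to 1.0 if all equal.
        nonzero = np.sqrt(sq[sq > 0])
        sigma = np.median(nonzero) if nonzero.size else 1.0
    return np.exp(-sq / (2.0 * sigma ** 2))


def hsic(x, y):
    """Biased empirical HSIC: trace(K H L H) / (n - 1)^2."""
    n = len(x)
    h = np.eye(n) - np.ones((n, n)) / n  # centering matrix
    k = gaussian_gram(x)
    l = gaussian_gram(y)
    return np.trace(k @ h @ l @ h) / (n - 1) ** 2


def rank_features_by_hsic(X, y):
    """Score each column of X by its HSIC with y; return scores and a ranking."""
    scores = np.array([hsic(X[:, j], y) for j in range(X.shape[1])])
    return scores, np.argsort(-scores)
```

As a usage example, if one feature is y² plus small noise and another is pure noise, the first has near-zero linear correlation with y but should still receive the higher HSIC score. That nonlinear sensitivity is the reason to prefer a kernel-based dependence measure over correlation for this kind of selection.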

Learning the Compositional Spaces for Generalized Zero-shot Learning

no code implementations ICLR 2019 Hanze Dong, Yanwei Fu, Sung Ju Hwang, Leonid Sigal, Xiangyang Xue

This paper studies the problem of Generalized Zero-shot Learning (G-ZSL), whose goal is to classify instances belonging to both seen and unseen classes at test time.

Generalized Zero-Shot Learning Open Set Learning

Middle-Out Decoding

no code implementations NeurIPS 2018 Shikib Mehri, Leonid Sigal

Despite being virtually ubiquitous, sequence-to-sequence models are challenged by their lack of diversity and inability to be externally controlled.

Video Captioning

Image Generation from Layout

no code implementations CVPR 2019 Bo Zhao, Lili Meng, Weidong Yin, Leonid Sigal

The representation of each object is disentangled into a specified/certain part (category) and an unspecified/uncertain part (appearance).

Layout-to-Image Generation Object

Walking on Thin Air: Environment-Free Physics-based Markerless Motion Capture

no code implementations 4 Dec 2018 Micha Livne, Leonid Sigal, Marcus A. Brubaker, David J. Fleet

To our knowledge, this is the first approach to take physics into account without explicit a priori knowledge of the environment or body dimensions.

Markerless Motion Capture

Traversing the Continuous Spectrum of Image Retrieval with Deep Dynamic Models

no code implementations 1 Dec 2018 Ziad Al-Halah, Andreas M. Lehrmann, Leonid Sigal

While the approaches proposed in the literature can be roughly categorized into two main groups, category- and instance-based retrieval, in this work we show that the retrieval task is much richer and more complex.

Attribute Continuous Control +2

Non-parametric Structured Output Networks

no code implementations NeurIPS 2017 Andreas Lehrmann, Leonid Sigal

End-to-end training methods for models with structured graphical dependencies on top of neural predictions have recently emerged as a principled way of combining these two paradigms.

Facial Expression Transfer with Input-Output Temporal Restricted Boltzmann Machines

no code implementations NeurIPS 2011 Matthew D. Zeiler, Graham W. Taylor, Leonid Sigal, Iain Matthews, Rob Fergus

We present a type of Temporal Restricted Boltzmann Machine that defines a probability distribution over an output sequence conditional on an input sequence.

Expanding Object Detector's Horizon: Incremental Learning Framework for Object Detection in Videos

no code implementations CVPR 2015 Alina Kuznetsova, Sung Ju Hwang, Bodo Rosenhahn, Leonid Sigal

By incrementally detecting object instances in video and adding confident detections into the model, we are able to dynamically adjust the complexity of the detector over time by instantiating new prototypes to span all domains the model has seen.

Domain Adaptation Incremental Learning +3

Where and when to look? Spatial-temporal attention for action recognition in videos

no code implementations ICLR 2019 Lili Meng, Bo Zhao, Bo Chang, Gao Huang, Frederick Tung, Leonid Sigal

Our model is efficient, as it proposes a separable spatio-temporal mechanism for video attention, while being able to identify important parts of the video both spatially and temporally.

Action Recognition In Videos Temporal Action Localization +1

Poselet Key-Framing: A Model for Human Activity Recognition

no code implementations CVPR 2013 Michalis Raptis, Leonid Sigal

We show classification performance that is competitive with the state of the art on the benchmark UT-Interaction dataset and illustrate that our model outperforms prior methods in an on-line streaming setting.

Human Activity Recognition Temporal Localization

Joint Summarization of Large-scale Collections of Web Images and Videos for Storyline Reconstruction

no code implementations CVPR 2014 Gunhee Kim, Leonid Sigal, Eric P. Xing

The reconstruction of storyline graphs is formulated as the inference of sparse time-varying directed graphs from a set of photo streams with assistance of videos.

16k Video Summarization

Ranking and Retrieval of Image Sequences From Multiple Paragraph Queries

no code implementations CVPR 2015 Gunhee Kim, Seungwhan Moon, Leonid Sigal

While most previous work has dealt with the relations between a natural language sentence and an image or a video, our work extends to the relations between paragraphs and image sequences.

Retrieval Sentence

Joint Photo Stream and Blog Post Summarization and Exploration

no code implementations CVPR 2015 Gunhee Kim, Seungwhan Moon, Leonid Sigal

We alternate between solving the two coupled latent SVM problems, by first fixing the summarization and solving for the alignment from blog images to photo streams and vice versa.

Transfer Learning

Space-Time Tree Ensemble for Action Recognition

no code implementations CVPR 2015 Shugao Ma, Leonid Sigal, Stan Sclaroff

Using the action vocabulary we then utilize tree mining with subsequent tree clustering and ranking to select a compact set of highly discriminative tree patterns.

Action Recognition Clustering +1

Learning Activity Progression in LSTMs for Activity Detection and Early Detection

no code implementations CVPR 2016 Shugao Ma, Leonid Sigal, Stan Sclaroff

In this work we improve training of temporal deep models to better learn activity progression for activity detection and early detection.

Action Detection Activity Detection +1

Storyline Representation of Egocentric Videos With an Applications to Story-Based Search

no code implementations ICCV 2015 Bo Xiong, Gunhee Kim, Leonid Sigal

To address this, we propose a storyline representation that expresses an egocentric video as a set of story elements (actors, locations, supporting objects, and events), jointly inferred through MRF inference and depicted on a timeline.

AttentionRNN: A Structured Spatial Attention Mechanism

no code implementations ICCV 2019 Siddhesh Khandelwal, Leonid Sigal

Visual attention mechanisms have proven to be integral components of many modern deep neural architectures.

Image Categorization Image Generation +1

Interpretable Spatio-temporal Attention for Video Action Recognition

no code implementations 1 Oct 2018 Lili Meng, Bo Zhao, Bo Chang, Gao Huang, Wei Sun, Frederick Tung, Leonid Sigal

Inspired by the observation that humans are able to process videos efficiently by only paying attention where and when it is needed, we propose an interpretable and easy plug-in spatial-temporal attention mechanism for video action recognition.

Action Recognition Temporal Action Localization

Watch, Listen and Tell: Multi-modal Weakly Supervised Dense Event Captioning

no code implementations ICCV 2019 Tanzila Rahman, Bicheng Xu, Leonid Sigal

Multi-modal learning, particularly among imaging and linguistic modalities, has made amazing strides in many high-level fundamental visual understanding problems, ranging from language grounding to dense event captioning.

OptiBox: Breaking the Limits of Proposals for Visual Grounding

no code implementations 29 Nov 2019 Zicong Fan, Si Yi Meng, Leonid Sigal, James J. Little

The problem of language grounding has attracted much attention in recent years due to its pivotal role in more general image-lingual high-level reasoning tasks (e.g., image captioning, VQA).

Image Captioning Visual Grounding +1

Variational Hyper RNN for Sequence Modeling

no code implementations 24 Feb 2020 Ruizhi Deng, Yanshuai Cao, Bo Chang, Leonid Sigal, Greg Mori, Marcus A. Brubaker

In this work, we propose a novel probabilistic sequence model that excels at capturing high variability in time series data, both across sequences and within an individual sequence.

Time Series Time Series Analysis

Consistent Multiple Sequence Decoding

no code implementations 2 Apr 2020 Bicheng Xu, Leonid Sigal

Our formulation utilizes a consistency fusion mechanism, implemented using message passing in a Graph Neural Network (GNN), to aggregate context from related decoders.

Image Captioning

UniT: Unified Knowledge Transfer for Any-shot Object Detection and Segmentation

no code implementations CVPR 2021 Siddhesh Khandelwal, Raghav Goyal, Leonid Sigal

Weakly-supervised approaches draw on image-level labels to build detectors/segmentors, while zero/few-shot methods assume abundant instance-level data for a set of base classes, and none to a few examples for novel classes.

object-detection Object Detection +1

Person-in-Context Synthesis with Compositional Structural Space

no code implementations 28 Aug 2020 Weidong Yin, Ziwei Liu, Leonid Sigal

To handle the stark difference in input structures, we propose two separate neural branches that attentively composite the respective (context/person) inputs into a shared "compositional structural space", which encodes shape, location, and appearance information for both context and person structures in a disentangled manner.

Weakly-supervised Audio-visual Sound Source Detection and Separation

no code implementations 25 Mar 2021 Tanzila Rahman, Leonid Sigal

Learning how to localize and separate individual object sounds in the audio channel of the video is a difficult task.

Audio Source Separation Denoising +5

Segmentation-grounded Scene Graph Generation

no code implementations ICCV 2021 Siddhesh Khandelwal, Mohammed Suhail, Leonid Sigal

Our framework is agnostic to the underlying scene graph generation method and addresses the lack of segmentation annotations in target scene graph datasets (e.g., Visual Genome) through transfer and multi-task learning from, and with, an auxiliary dataset (e.g., MS COCO).

Graph Generation Multi-Task Learning +2

Saliency-Guided Image Translation

no code implementations CVPR 2021 Lai Jiang, Mai Xu, Xiaofei Wang, Leonid Sigal

In this paper, we propose a novel task for saliency-guided image translation, with the goal of image-to-image translation conditioned on a user-specified saliency map.

Generative Adversarial Network Image-to-Image Translation +1

Layered Controllable Video Generation

no code implementations 24 Nov 2021 Jiahui Huang, Yuhe Jin, Kwang Moo Yi, Leonid Sigal

In the first stage, with the rich set of losses and dynamic foreground size prior, we learn how to separate the frame into foreground and background layers and, conditioned on these layers, how to generate the next frame using a VQ-VAE generator.

Video Generation

Self-supervision through Random Segments with Autoregressive Coding (RandSAC)

no code implementations 22 Mar 2022 Tianyu Hua, Yonglong Tian, Sucheng Ren, Michalis Raptis, Hang Zhao, Leonid Sigal

We illustrate that randomized serialization of the segments significantly improves performance and results in a distribution over spatially long (across-segment) and short (within-segment) predictions, which are effective for feature learning.

Representation Learning Self-Supervised Learning

Generalizable Patch-Based Neural Rendering

no code implementations 21 Jul 2022 Mohammed Suhail, Carlos Esteves, Leonid Sigal, Ameesh Makadia

Neural rendering has received tremendous attention since the advent of Neural Radiance Fields (NeRF), and has pushed the state-of-the-art on novel-view synthesis considerably.

Neural Rendering Novel View Synthesis

Iterative Scene Graph Generation

no code implementations 27 Jul 2022 Siddhesh Khandelwal, Leonid Sigal

In this work, we propose a novel framework for scene graph generation that addresses this limitation, as well as introduces dynamic conditioning on the image, using message passing in a Markov Random Field.

Graph Generation Scene Graph Generation

GraphPNAS: Learning Distribution of Good Neural Architectures via Deep Graph Generative Models

no code implementations 28 Nov 2022 Muchen Li, Jeffrey Yunfan Liu, Leonid Sigal, Renjie Liao

Moreover, our graph generator leads to a learnable probabilistic search method that is more flexible and efficient than the commonly used RNN generator and random search methods.

Neural Architecture Search

Framework-agnostic Semantically-aware Global Reasoning for Segmentation

no code implementations 6 Dec 2022 Mir Rayat Imtiaz Hossain, Leonid Sigal, James J. Little

Recent advances in pixel-level tasks (e.g., segmentation) illustrate the benefit of long-range interactions between aggregated region-based representations that can enhance local features.

Instance Segmentation Segmentation +1

Self-Supervised Relation Alignment for Scene Graph Generation

no code implementations 2 Feb 2023 Bicheng Xu, Renjie Liao, Leonid Sigal

In the auxiliary branch, relational input features are partially masked prior to message passing and predicate prediction.

Graph Generation Relation +1

Frustratingly Simple but Effective Zero-shot Detection and Segmentation: Analysis and a Strong Baseline

no code implementations 14 Feb 2023 Siddhesh Khandelwal, Anirudth Nambirajan, Behjat Siddiquie, Jayan Eledath, Leonid Sigal

Methods for object detection and segmentation often require abundant instance-level annotations for training, which are time-consuming and expensive to collect.

Object object-detection +3

MINOTAUR: Multi-task Video Grounding From Multimodal Queries

no code implementations 16 Feb 2023 Raghav Goyal, Effrosyni Mavroudi, Xitong Yang, Sainbayar Sukhbaatar, Leonid Sigal, Matt Feiszli, Lorenzo Torresani, Du Tran

Video understanding tasks take many forms, from action detection to visual query localization and spatio-temporal grounding of sentences.

Action Detection Sentence +2

Implicit and Explicit Commonsense for Multi-sentence Video Captioning

no code implementations 14 Mar 2023 Shih-Han Chou, James J. Little, Leonid Sigal

We show that our commonsense-knowledge-enhanced approach produces significant improvements on this task (up to 57% in METEOR and 8.5% in CIDEr), as well as a state-of-the-art result on more traditional video captioning on the ActivityNet Captions dataset [29].

Imitation Learning Sentence +1

Omnimatte3D: Associating Objects and Their Effects in Unconstrained Monocular Video

no code implementations CVPR 2023 Mohammed Suhail, Erika Lu, Zhengqi Li, Noah Snavely, Leonid Sigal, Forrester Cole

Instead, our method applies recent progress in monocular camera pose and depth estimation to create a full, RGBD video layer for the background, along with a video layer for each foreground object.

Depth Estimation

INVE: Interactive Neural Video Editing

no code implementations 15 Jul 2023 Jiahui Huang, Leonid Sigal, Kwang Moo Yi, Oliver Wang, Joon-Young Lee

We present Interactive Neural Video Editing (INVE), a real-time video editing solution, which can assist the video editing process by consistently propagating sparse frame edits to the entire video clip.

Video Editing

TIBET: Identifying and Evaluating Biases in Text-to-Image Generative Models

no code implementations 3 Dec 2023 Aditya Chinchure, Pushkar Shukla, Gaurav Bhatt, Kiri Salij, Kartik Hosanagar, Leonid Sigal, Matthew Turk

Text-to-Image (TTI) generative models have shown great progress in the past few years in terms of their ability to generate complex and high-quality imagery.

counterfactual Counterfactual Reasoning

TAM-VT: Transformation-Aware Multi-scale Video Transformer for Segmentation and Tracking

no code implementations 13 Dec 2023 Raghav Goyal, Wan-Cyuan Fan, Mennatullah Siam, Leonid Sigal

In this work, we propose a novel clip-based DETR-style encoder-decoder architecture that focuses on systematically analyzing and addressing the aforementioned challenges.

Semantic Segmentation Video Object Segmentation +1

Prompting Hard or Hardly Prompting: Prompt Inversion for Text-to-Image Diffusion Models

no code implementations 19 Dec 2023 Shweta Mahajan, Tanzila Rahman, Kwang Moo Yi, Leonid Sigal

Further, we leverage the finding that different timesteps of the diffusion process cater to different levels of detail in an image.

Image Generation Prompt Engineering

Joint Generative Modeling of Scene Graphs and Images via Diffusion Models

no code implementations 2 Jan 2024 Bicheng Xu, Qi Yan, Renjie Liao, Lele Wang, Leonid Sigal

While previous works have explored image generation conditioned on scene graphs or layouts, our task is distinctive and important as it involves generating scene graphs themselves unconditionally from noise, enabling efficient and interpretable control for image generation.

Graph Generation Image Generation +2

Multi-modal News Understanding with Professionally Labelled Videos (ReutersViLNews)

no code implementations 23 Jan 2024 Shih-Han Chou, Matthew Kowal, Yasmin Niknam, Diana Moyano, Shayaan Mehdi, Richard Pito, Cheng Zhang, Ian Knopke, Sedef Akinli Kocak, Leonid Sigal, Yalda Mohsenzadeh

Towards a solution for designing this ability in algorithms, we present a large-scale analysis on an in-house dataset collected by the Reuters News Agency, called Reuters Video-Language News (ReutersViLNews) dataset which focuses on high-level video-language understanding with an emphasis on long-form news.

Miscellaneous Video Description

Visual Concept-driven Image Generation with Text-to-Image Diffusion Model

no code implementations 18 Feb 2024 Tanzila Rahman, Shweta Mahajan, Hsin-Ying Lee, Jian Ren, Sergey Tulyakov, Leonid Sigal

We illustrate that such joint alternating refinement leads to the learning of better tokens for concepts and, as a by-product, latent masks.

Image Generation

Preventing Catastrophic Forgetting through Memory Networks in Continuous Detection

no code implementations 21 Mar 2024 Gaurav Bhatt, James Ross, Leonid Sigal

Modern pre-trained architectures struggle to retain previous information while undergoing continuous fine-tuning on new tasks.

Information Retrieval

Visual Prompting for Generalized Few-shot Segmentation: A Multi-scale Approach

no code implementations 17 Apr 2024 Mir Rayat Imtiaz Hossain, Mennatullah Siam, Leonid Sigal, James J. Little

These learned visual prompts are used to prompt a multiscale transformer decoder to facilitate accurate dense predictions.
