no code implementations • ICLR 2019 • Lili Meng, Bo Zhao, Bo Chang, Gao Huang, Frederick Tung, Leonid Sigal
Our model is efficient, as it proposes a separable spatio-temporal mechanism for video attention, while being able to identify important parts of the video both spatially and temporally.
Action Recognition In Videos
Temporal Action Localization
no code implementations • 14 Mar 2023 • Shih-Han Chou, James J. Little, Leonid Sigal
We show that our commonsense knowledge enhanced approach produces significant improvements on this task (up to 57% in METEOR and 8.5% in CIDEr), as well as the state-of-the-art result on more traditional video captioning in the ActivityNet Captions dataset [29].
no code implementations • 16 Feb 2023 • Raghav Goyal, Effrosyni Mavroudi, Xitong Yang, Sainbayar Sukhbaatar, Leonid Sigal, Matt Feiszli, Lorenzo Torresani, Du Tran
Video understanding tasks take many forms, from action detection to visual query localization and spatio-temporal grounding of sentences.
no code implementations • 14 Feb 2023 • Siddhesh Khandelwal, Anirudth Nambirajan, Behjat Siddiquie, Jayan Eledath, Leonid Sigal
Methods for object detection and segmentation often require abundant instance-level annotations for training, which are time-consuming and expensive to collect.
no code implementations • 2 Feb 2023 • Bicheng Xu, Renjie Liao, Leonid Sigal
In the auxiliary branch, relational input features are partially masked prior to message passing and predicate prediction.
1 code implementation • 3 Jan 2023 • Yanwei Fu, Xiaomei Wang, Hanze Dong, Yu-Gang Jiang, Meng Wang, xiangyang xue, Leonid Sigal
Despite significant progress in object categorization in recent years, a number of important challenges remain, mainly the ability to learn from limited labeled data and to recognize object classes within a large, potentially open, set of labels.
1 code implementation • CVPR 2023 • Yichen Guo, Mai Xu, Lai Jiang, Leonid Sigal, Yunjin Chen
To alleviate this issue, we propose the first attempt at 360° image rescaling, which refers to downscaling a 360° image to a visually valid low-resolution (LR) counterpart and then upscaling to a high-resolution (HR) 360° image given the LR variant.
no code implementations • CVPR 2023 • Mohammed Suhail, Erika Lu, Zhengqi Li, Noah Snavely, Leonid Sigal, Forrester Cole
Instead, our method applies recent progress in monocular camera pose and depth estimation to create a full, RGBD video layer for the background, along with a video layer for each foreground object.
no code implementations • 6 Dec 2022 • Mir Rayat Imtiaz Hossain, Leonid Sigal, James J. Little
Recent advances in pixel-level tasks (e.g., segmentation) illustrate the benefit of long-range interactions between aggregated region-based representations that can enhance local features.
no code implementations • 28 Nov 2022 • Muchen Li, Jeffrey Yunfan Liu, Leonid Sigal, Renjie Liao
Moreover, our graph generator leads to a learnable probabilistic search method that is more flexible and efficient than the commonly used RNN generator and random search methods.
1 code implementation • CVPR 2023 • Tanzila Rahman, Hsin-Ying Lee, Jian Ren, Sergey Tulyakov, Shweta Mahajan, Leonid Sigal
Our experiments for story generation on the MUGEN, PororoSV, and FlintstonesSV datasets show that our method not only outperforms the prior state of the art in generating frames with high visual quality that are consistent with the story, but also models appropriate correspondences between the characters and the background.
1 code implementation • 24 Oct 2022 • Sahithya Ravi, Aditya Chinchure, Leonid Sigal, Renjie Liao, Vered Shwartz
In contrast to previous methods which inject knowledge from static knowledge bases, we investigate the incorporation of contextualized knowledge using Commonsense Transformer (COMET), an existing knowledge model trained on human-curated knowledge bases.
Ranked #5 on
Visual Question Answering (VQA)
on A-OKVQA
(DA VQA Score metric)
3 code implementations • 4 Oct 2022 • Peyman Bateni, Leonid Sigal
The user's pulse wave is then used to determine stress (according to the Baevsky Stress Index), heart rate, and heart rate variability.
no code implementations • 27 Jul 2022 • Siddhesh Khandelwal, Leonid Sigal
In this work, we propose a novel framework for scene graph generation that addresses this limitation, as well as introduces dynamic conditioning on the image, using message passing in a Markov Random Field.
no code implementations • 21 Jul 2022 • Mohammed Suhail, Carlos Esteves, Leonid Sigal, Ameesh Makadia
Neural rendering has received tremendous attention since the advent of Neural Radiance Fields (NeRF), and has pushed the state-of-the-art on novel-view synthesis considerably.
no code implementations • 22 Mar 2022 • Tianyu Hua, Yonglong Tian, Sucheng Ren, Michalis Raptis, Hang Zhao, Leonid Sigal
We illustrate that randomized serialization of the segments significantly improves performance and results in a distribution over spatially long (across-segment) and spatially short (within-segment) predictions, which are effective for feature learning.
2 code implementations • 13 Jan 2022 • Peyman Bateni, Jarred Barber, Raghav Goyal, Vaden Masrani, Jan-Willem van de Meent, Leonid Sigal, Frank Wood
The first method, Simple CNAPS, employs a hierarchically regularized Mahalanobis-distance-based classifier combined with a state-of-the-art neural adaptive feature extractor to achieve strong performance on the Meta-Dataset, mini-ImageNet, and tiered-ImageNet benchmarks.
1 code implementation • CVPR 2022 • Mohammed Suhail, Carlos Esteves, Leonid Sigal, Ameesh Makadia
Classical light field rendering for novel view synthesis can accurately reproduce view-dependent effects such as reflection, refraction, and translucency, but requires a dense view sampling of the scene.
1 code implementation • NeurIPS 2021 • Tanzila Rahman, Mengyu Yang, Leonid Sigal
In this work, we introduce TriBERT -- a transformer-based architecture, inspired by ViLBERT, which enables contextual feature learning across three modalities: vision, pose, and audio, with the use of flexible co-attention.
no code implementations • 24 Nov 2021 • Jiahui Huang, Yuhe Jin, Kwang Moo Yi, Leonid Sigal
In the first stage, with a rich set of losses and a dynamic foreground size prior, we learn how to separate the frame into foreground and background layers and, conditioned on these layers, how to generate the next frame using a VQ-VAE generator.
1 code implementation • 26 Oct 2021 • Tanzila Rahman, Mengyu Yang, Leonid Sigal
In this work, we introduce TriBERT -- a transformer-based architecture, inspired by ViLBERT, which enables contextual feature learning across three modalities: vision, pose, and audio, with the use of flexible co-attention.
no code implementations • CVPR 2021 • Lai Jiang, Mai Xu, Xiaofei Wang, Leonid Sigal
In this paper, we propose a novel task for saliency-guided image translation, with the goal of image-to-image translation conditioned on a user-specified saliency map.
1 code implementation • NeurIPS 2021 • Muchen Li, Leonid Sigal
As an important step towards visual reasoning, visual grounding (e.g., phrase localization, referring expression comprehension/segmentation) has been widely explored. Previous approaches to referring expression comprehension (REC) or segmentation (RES) either suffer from limited performance, due to a two-stage setup, or require the design of complex task-specific one-stage architectures.
Ranked #6 on
Referring Expression Segmentation
on RefCOCO testA
no code implementations • ICCV 2021 • Siddhesh Khandelwal, Mohammed Suhail, Leonid Sigal
Our framework is agnostic to the underlying scene graph generation method and addresses the lack of segmentation annotations in target scene graph datasets (e.g., Visual Genome) through transfer and multi-task learning from, and with, an auxiliary dataset (e.g., MS COCO).
no code implementations • 25 Mar 2021 • Tanzila Rahman, Leonid Sigal
Learning how to localize and separate individual object sounds in the audio channel of the video is a difficult task.
1 code implementation • CVPR 2021 • Mohammed Suhail, Abhay Mittal, Behjat Siddiquie, Chris Broaddus, Jayan Eledath, Gerard Medioni, Leonid Sigal
The proposed formulation allows for efficiently incorporating the structure of scene graphs in the output space.
Ranked #2 on
Scene Graph Generation
on Visual Genome
1 code implementation • 4 Nov 2020 • Tanzila Rahman, Shih-Han Chou, Leonid Sigal, Giuseppe Carenini
We also propose a multimodal fusion module to combine both visual and textual information.
no code implementations • 28 Aug 2020 • Weidong Yin, Ziwei Liu, Leonid Sigal
To handle the stark difference in input structures, we propose two separate neural branches to attentively composite the respective (context/person) inputs into a shared "compositional structural space", which encodes shape, location and appearance information for both context and person structures in a disentangled manner.
2 code implementations • 27 Aug 2020 • Ke Ma, Bo Zhao, Leonid Sigal
Also, the generated images from our model have higher resolution, object classification accuracy and consistency, as compared to the previous state-of-the-art.
no code implementations • 25 Jun 2020 • Polina Zablotskaia, Edoardo A. Dominici, Leonid Sigal, Andreas M. Lehrmann
Unsupervised multi-object scene decomposition is a fast-emerging problem in representation learning.
1 code implementation • 23 Jun 2020 • Jing Wang, Jiahong Chen, Jianzhe Lin, Leonid Sigal, Clarence W. de Silva
To solve this problem, we introduce a Gaussian-guided latent alignment approach to align the latent feature distributions of the two domains under the guidance of the prior distribution.
Ranked #1 on
Domain Adaptation
on MNIST-to-USPS
no code implementations • CVPR 2021 • Siddhesh Khandelwal, Raghav Goyal, Leonid Sigal
Weakly-supervised approaches draw on image-level labels to build detectors/segmentors, while zero/few-shot methods assume abundant instance-level data for a set of base classes, and none to a few examples for novel classes.
no code implementations • 2 Apr 2020 • Bicheng Xu, Leonid Sigal
Our formulation utilizes a consistency fusion mechanism, implemented using message passing in a Graph Neural Network (GNN), to aggregate context from related decoders.
no code implementations • 24 Feb 2020 • Ruizhi Deng, Yanshuai Cao, Bo Chang, Leonid Sigal, Greg Mori, Marcus A. Brubaker
In this work, we propose a novel probabilistic sequence model that excels at capturing high variability in time series data, both across sequences and within an individual sequence.
1 code implementation • CVPR 2020 • Yuan Yao, Nico Schertler, Enrique Rosales, Helge Rhodin, Leonid Sigal, Alla Sheffer
Reconstruction of a 3D shape from a single 2D image is a classical computer vision problem, whose difficulty stems from the inherent ambiguity of recovering occluded or only partially observed surfaces.
2 code implementations • CVPR 2020 • Peyman Bateni, Raghav Goyal, Vaden Masrani, Frank Wood, Leonid Sigal
Few-shot learning is a fundamental task in computer vision that carries the promise of alleviating the need for exhaustively labeled data.
Ranked #2 on
Few-Shot Image Classification
on Mini-Imagenet 10-way (5-shot)
(using extra training data)
no code implementations • ECCV 2020 • Megha Nawhal, Mengyao Zhai, Andreas Lehrmann, Leonid Sigal, Greg Mori
Human activity videos involve rich, varied interactions between people and objects.
no code implementations • 29 Nov 2019 • Zicong Fan, Si Yi Meng, Leonid Sigal, James J. Little
The problem of language grounding has attracted much attention in recent years due to its pivotal role in more general image-lingual high-level reasoning tasks (e.g., image captioning, VQA).
2 code implementations • 21 Oct 2019 • Polina Zablotskaia, Aliaksandr Siarohin, Bo Zhao, Leonid Sigal
In this paper, we focus on human motion transfer - generation of a video depicting a particular subject, observed in a single image, performing a series of motions exemplified by an auxiliary (driving) video.
no code implementations • ICCV 2019 • Tanzila Rahman, Bicheng Xu, Leonid Sigal
Multi-modal learning, particularly among imaging and linguistic modalities, has made amazing strides in many high-level fundamental visual understanding problems, ranging from language grounding to dense event captioning.
2 code implementations • ICCV 2019 • Akash Abdu Jyothi, Thibaut Durand, JiaWei He, Leonid Sigal, Greg Mori
Recently, there has been increasing interest in scene generation within the research community.
no code implementations • ICCV 2019 • Siddhesh Khandelwal, Leonid Sigal
Visual attention mechanisms have proven to be integrally important constituent components of many modern deep neural architectures.
no code implementations • CVPR 2019 • Nazanin Mehrasa, Akash Abdu Jyothi, Thibaut Durand, JiaWei He, Leonid Sigal, Greg Mori
We propose a novel probabilistic generative model for action sequences.
no code implementations • CVPR 2019 • Pelin Dogan, Leonid Sigal, Markus Gross
We propose an end-to-end approach for phrase grounding in images.
no code implementations • 4 Dec 2018 • Micha Livne, Leonid Sigal, Marcus A. Brubaker, David J. Fleet
To our knowledge, this is the first approach to take physics into account without explicit a priori knowledge of the environment or body dimensions.
no code implementations • 1 Dec 2018 • Ziad Al-Halah, Andreas M. Lehrmann, Leonid Sigal
While the approaches proposed in the literature can be roughly categorized into two main groups, category-based and instance-based retrieval, in this work we show that the retrieval task is much richer and more complex.
no code implementations • CVPR 2019 • Bo Zhao, Lili Meng, Weidong Yin, Leonid Sigal
The representation of each object is disentangled into a specified/certain part (category) and an unspecified/uncertain part (appearance).
Ranked #2 on
Layout-to-Image Generation
on Visual Genome 64x64
no code implementations • NeurIPS 2018 • Shikib Mehri, Leonid Sigal
Despite being virtually ubiquitous, sequence-to-sequence models are challenged by their lack of diversity and inability to be externally controlled.
no code implementations • ICLR 2019 • Hanze Dong, Yanwei Fu, Sung Ju Hwang, Leonid Sigal, xiangyang xue
This paper studies the problem of Generalized Zero-shot Learning (G-ZSL), whose goal is to classify instances belonging to both seen and unseen classes at the test time.
no code implementations • 1 Oct 2018 • Lili Meng, Bo Zhao, Bo Chang, Gao Huang, Wei Sun, Frederick Tung, Leonid Sigal
Inspired by the observation that humans are able to process videos efficiently by only paying attention where and when it is needed, we propose an interpretable and easy plug-in spatial-temporal attention mechanism for video action recognition.
1 code implementation • CVPR 2018 • Hareesh Ravi, Lezi Wang, Carlos Muniz, Leonid Sigal, Dimitris Metaxas, Mubbasir Kapadia
We propose an end-to-end network for the visual illustration of a sequence of sentences forming a story.
1 code implementation • 15 Apr 2018 • Zitian Chen, Yanwei Fu, yinda zhang, Yu-Gang Jiang, xiangyang xue, Leonid Sigal
In semantic space, we search for related concepts, which are then projected back into the image feature spaces by the decoder portion of the TriNet.
1 code implementation • 13 Apr 2018 • Huijuan Xu, Kun He, Bryan A. Plummer, Leonid Sigal, Stan Sclaroff, Kate Saenko
To capture the inherent structures present in both text and video, we introduce a multilevel model that integrates vision and language features earlier and more tightly than prior work.
2 code implementations • ECCV 2018 • Bo Zhao, Bo Chang, Zequn Jie, Leonid Sigal
Existing methods for multi-domain image-to-image translation (or generation) attempt to directly map an input image (or a random vector) to an image in one of the output domains.
1 code implementation • ECCV 2018 • Jiawei He, Andreas Lehrmann, Joseph Marino, Greg Mori, Leonid Sigal
Videos express highly structured spatio-temporal patterns of visual data.
1 code implementation • 28 Feb 2018 • Huijuan Xu, Boyang Li, Vasili Ramanishka, Leonid Sigal, Kate Saenko
In order to explicitly model temporal relationships between visual events and their captions in a single video, we also propose a two-level hierarchical captioning module that keeps track of context.
1 code implementation • CVPR 2018 • Pelin Dogan, Boyang Li, Leonid Sigal, Markus Gross
The alignment of heterogeneous sequential data (video to text) is an important and challenging problem.
no code implementations • NeurIPS 2017 • Andreas Lehrmann, Leonid Sigal
End-to-end training methods for models with structured graphical dependencies on top of neural predictions have recently emerged as a principled way of combining these two paradigms.
no code implementations • 13 Oct 2017 • Yanwei Fu, Tao Xiang, Yu-Gang Jiang, xiangyang xue, Leonid Sigal, Shaogang Gong
With the recent renaissance of deep convolutional neural networks, encouraging breakthroughs have been achieved on supervised recognition tasks, where each class has sufficient and fully annotated training data.
no code implementations • NeurIPS 2017 • Paul Hongsuck Seo, Andreas Lehrmann, Bohyung Han, Leonid Sigal
From this memory, the model retrieves the previous attention, taking into account recency, which is most relevant for the current question, in order to resolve potentially ambiguous references.
Ranked #13 on
Visual Dialog
on VisDial v0.9 val
(R@1 metric)
no code implementations • 31 Aug 2017 • Atousa Torabi, Leonid Sigal
Inspired by recent advances in neural machine translation that jointly align and translate using encoder-decoder networks equipped with attention, we propose an attention-based LSTM model for human activity recognition.
no code implementations • CVPR 2017 • Fanyi Xiao, Leonid Sigal, Yong Jae Lee
We propose a weakly-supervised approach that takes image-sentence pairs as input and learns to visually ground (i.e., localize) arbitrary linguistic phrases, in the form of spatial attention masks.
no code implementations • 10 Apr 2017 • Zuxuan Wu, Larry S. Davis, Leonid Sigal
In particular, we propose spatial context networks that learn to predict a representation of one image patch from another image patch, within the same image, conditioned on their real-valued relative spatial offset.
no code implementations • 7 Apr 2017 • Weidong Yin, Yanwei Fu, Leonid Sigal, xiangyang xue
Generating and manipulating human facial images using high-level attributal controls are important and interesting problems.
no code implementations • 21 Feb 2017 • Yu-ting Qiang, Yanwei Fu, Xiao Yu, Yanwen Guo, Zhi-Hua Zhou, Leonid Sigal
In order to bridge the gap between panel attributes and the composition within each panel, we also propose a recursive page splitting algorithm to generate the panel layout for a poster.
no code implementations • 26 Sep 2016 • Atousa Torabi, Niket Tandon, Leonid Sigal
We evaluate our models on the large-scale LSMDC16 movie dataset for two tasks: 1) standard ranking for video annotation and retrieval, and 2) our proposed movie multiple-choice test.
Ranked #30 on
Video Retrieval
on MSR-VTT
no code implementations • CVPR 2016 • Shugao Ma, Leonid Sigal, Stan Sclaroff
In this work we improve training of temporal deep models to better learn activity progression for activity detection and early detection.
no code implementations • CVPR 2016 • Zuxuan Wu, Yanwei Fu, Yu-Gang Jiang, Leonid Sigal
Large-scale action recognition and video categorization are important problems in computer vision.
no code implementations • CVPR 2016 • Yanwei Fu, Leonid Sigal
Despite significant progress in object categorization in recent years, a number of important challenges remain, mainly the ability to learn from limited labeled data and the ability to recognize object classes within a large, potentially open, set of labels.
no code implementations • 5 Apr 2016 • Yu-ting Qiang, Yanwei Fu, Yanwen Guo, Zhi-Hua Zhou, Leonid Sigal
Then, given the inferred layout and attributes, the composition of graphical elements within each panel is synthesized.
no code implementations • 22 Dec 2015 • Shugao Ma, Sarah Adel Bargal, Jianming Zhang, Leonid Sigal, Stan Sclaroff
In contrast, collecting action images from the Web is much easier and training on images requires much less computation.
Ranked #13 on
Action Recognition
on ActivityNet
(using extra training data)
no code implementations • ICCV 2015 • Bo Xiong, Gunhee Kim, Leonid Sigal
To address this, we propose a storyline representation that expresses an egocentric video as a set of story elements, jointly inferred through MRF inference, comprising actors, locations, supporting objects and events, depicted on a timeline.
no code implementations • 19 Nov 2015 • Yanwei Fu, De-An Huang, Leonid Sigal
Collecting datasets in this way, however, requires robust and efficient ways for detecting and excluding outliers that are common and prevalent.
no code implementations • 16 Nov 2015 • Baohan Xu, Yanwei Fu, Yu-Gang Jiang, Boyang Li, Leonid Sigal
Emotion is a key element in user-generated videos.
no code implementations • 17 Sep 2015 • Xi Zhang, Yanwei Fu, Shanshan Jiang, Leonid Sigal, Gady Agam
In this paper, we investigate and formalize a general framework, Stacked Multichannel Autoencoder (SMCAE), that enables bridging the synthetic gap and learning from synthetic data more efficiently.
no code implementations • CVPR 2015 • Gunhee Kim, Seungwhan Moon, Leonid Sigal
While most previous work has dealt with the relations between a natural language sentence and an image or a video, our work extends to the relations between paragraphs and image sequences.
no code implementations • CVPR 2015 • Gunhee Kim, Seungwhan Moon, Leonid Sigal
We alternate between solving the two coupled latent SVM problems, by first fixing the summarization and solving for the alignment from blog images to photo streams and vice versa.
no code implementations • CVPR 2015 • Shugao Ma, Leonid Sigal, Stan Sclaroff
Using the action vocabulary we then utilize tree mining with subsequent tree clustering and ranking to select a compact set of highly discriminative tree patterns.
no code implementations • CVPR 2015 • Alina Kuznetsova, Sung Ju Hwang, Bodo Rosenhahn, Leonid Sigal
By incrementally detecting object instances in video and adding confident detections into the model, we are able to dynamically adjust the complexity of the detector over time by instantiating new prototypes to span all domains the model has seen.
no code implementations • 11 Mar 2015 • Xi Zhang, Yanwei Fu, Andi Zang, Leonid Sigal, Gady Agam
Experimental results on two datasets validate the efficiency of our MCAE model and our methodology of generating synthetic data.
no code implementations • 6 Feb 2015 • Guang-Tong Zhou, Sung Ju Hwang, Mark Schmidt, Leonid Sigal, Greg Mori
We present a hierarchical maximum-margin clustering method for unsupervised data analysis.
no code implementations • NeurIPS 2014 • Sung Ju Hwang, Leonid Sigal
We propose a method that learns a discriminative yet semantic space for object categorization, where we also embed auxiliary semantic entities such as supercategories and attributes.
no code implementations • CVPR 2014 • Gunhee Kim, Leonid Sigal, Eric P. Xing
The reconstruction of storyline graphs is formulated as the inference of sparse time-varying directed graphs from a set of photo streams with the assistance of videos.
no code implementations • NeurIPS 2013 • Nataliya Shapovalova, Michalis Raptis, Leonid Sigal, Greg Mori
We propose a new weakly-supervised structured learning approach for recognition and spatio-temporal localization of actions in video.
no code implementations • CVPR 2013 • Michalis Raptis, Leonid Sigal
We show classification performance that is competitive with the state of the art on the benchmark UT-Interaction dataset and illustrate that our model outperforms prior methods in an on-line streaming setting.
Ranked #3 on
Human Interaction Recognition
on UT
no code implementations • 2 Feb 2012 • Makoto Yamada, Wittawat Jitkrittum, Leonid Sigal, Eric P. Xing, Masashi Sugiyama
We first show that, with particular choices of kernel functions, non-redundant features with strong statistical dependence on output values can be found in terms of kernel-based independence measures.
no code implementations • NeurIPS 2011 • Matthew D. Zeiler, Graham W. Taylor, Leonid Sigal, Iain Matthews, Rob Fergus
We present a type of Temporal Restricted Boltzmann Machine that defines a probability distribution over an output sequence conditional on an input sequence.