10 code implementations • ICCV 2021 • Yinbo Chen, Zhuang Liu, Huijuan Xu, Trevor Darrell, Xiaolong Wang
The edge between these two lines of work has so far been underexplored, and the effectiveness of meta-learning in few-shot learning remains unclear.
3 code implementations • ICCV 2017 • Huijuan Xu, Abir Das, Kate Saenko
We address the problem of activity detection in continuous, untrimmed video streams.
Ranked #1 on Action Recognition In Videos on THUMOS’14
1 code implementation • CVPR 2020 • Joanna Materzynska, Tete Xiao, Roei Herzig, Huijuan Xu, Xiaolong Wang, Trevor Darrell
Human action is naturally compositional: humans can easily recognize and perform actions with objects that are different from those used in training demonstrations.
1 code implementation • HLT 2015 • Subhashini Venugopalan, Huijuan Xu, Jeff Donahue, Marcus Rohrbach, Raymond Mooney, Kate Saenko
Solving the visual symbol grounding problem has long been a goal of artificial intelligence.
1 code implementation • 13 Apr 2018 • Huijuan Xu, Kun He, Bryan A. Plummer, Leonid Sigal, Stan Sclaroff, Kate Saenko
To capture the inherent structures present in both text and video, we introduce a multilevel model that integrates vision and language features earlier and more tightly than prior work.
2 code implementations • ECCV 2020 • Roei Herzig, Amir Bar, Huijuan Xu, Gal Chechik, Trevor Darrell, Amir Globerson
Generating realistic images of complex visual scenes becomes challenging when one wishes to control the structure of the generated images.
Ranked #3 on Layout-to-Image Generation on Visual Genome 256x256
1 code implementation • 28 Feb 2018 • Huijuan Xu, Boyang Li, Vasili Ramanishka, Leonid Sigal, Kate Saenko
In order to explicitly model temporal relationships between visual events and their captions in a single video, we also propose a two-level hierarchical captioning module that keeps track of context.
1 code implementation • 24 Jul 2022 • Zhi Li, Lu He, Huijuan Xu
Action understanding has evolved into the era of fine granularity, as most human behaviors in real life have only minor differences.
Ranked #1 on Weakly Supervised Action Localization on FineAction
1 code implementation • 3 Dec 2018 • Ximeng Sun, Huijuan Xu, Kate Saenko
Video generation is an inherently challenging task, as it requires modeling realistic temporal dynamics as well as spatial content.
1 code implementation • 4 Dec 2018 • Roei Herzig, Elad Levi, Huijuan Xu, Hang Gao, Eli Brosh, Xiaolong Wang, Amir Globerson, Trevor Darrell
Events defined by the interaction of objects in a scene are often of critical importance; yet important events may have insufficient labeled examples to train a conventional deep model to generalize to future object appearance.
1 code implementation • 17 Nov 2015 • Huijuan Xu, Kate Saenko
We propose a novel spatial attention architecture that aligns words with image patches in the first hop, and obtain improved results by adding a second attention hop which considers the whole question to choose visual evidence based on the results of the first hop.
1 code implementation • ECCV 2020 • Zhekun Luo, Devin Guillory, Baifeng Shi, Wei Ke, Fang Wan, Trevor Darrell, Huijuan Xu
Weakly-supervised action localization requires training a model to localize the action segments in a video given only video-level action labels.
Ranked #9 on Weakly Supervised Action Localization on THUMOS’14
1 code implementation • 17 Mar 2024 • Shu Zhao, Xiaohan Zou, Tan Yu, Huijuan Xu
Meanwhile, our RebQ leverages extensive multi-modal knowledge from pre-trained LMMs to reconstruct the data of missing modality.
1 code implementation • Findings (NAACL) 2022 • Jin Liu, Chongfeng Fan, Fengyu Zhou, Huijuan Xu
Knowledge graph-to-text (KG-to-text) generation aims to generate easy-to-understand sentences from the KG while maintaining semantic consistency between the generated sentences and the KG.
1 code implementation • 2 Oct 2023 • Shu Zhao, Huijuan Xu
To fill this gap, we present a new task called Local Scene Graph Generation.
no code implementations • 28 Jan 2018 • Yancheng Bai, Huijuan Xu, Kate Saenko, Bernard Ghanem
In this paper, we propose the contextual multi-scale region convolutional 3D network (CMS-RC3D) for activity detection.
no code implementations • 21 May 2015 • Huijuan Xu, Subhashini Venugopalan, Vasili Ramanishka, Marcus Rohrbach, Kate Saenko
Most state-of-the-art methods for solving this problem borrow existing deep convolutional neural network (CNN) architectures (AlexNet, GoogLeNet) to extract a visual representation of the input video.
no code implementations • 25 Dec 2018 • Huijuan Xu, Bingyi Kang, Ximeng Sun, Jiashi Feng, Kate Saenko, Trevor Darrell
In this paper, we present a conceptually simple and general yet novel framework for few-shot temporal activity detection which detects the start and end time of the few-shot input activities in an untrimmed video.
no code implementations • 5 Jun 2019 • Huijuan Xu, Abir Das, Kate Saenko
We address the problem of temporal activity detection in continuous, untrimmed video streams.
Ranked #4 on Action Recognition on THUMOS’14
no code implementations • 19 Jun 2019 • Lei Lei, Huijuan Xu, Xiong Xiong, Kan Zheng, Wei Xiang, Xianbin Wang
By leveraging the concept of mobile edge computing (MEC), the massive amounts of data generated by a large number of Internet of Things (IoT) devices can be offloaded to an MEC server at the edge of the wireless network for further computationally intensive processing.
no code implementations • 27 Sep 2019 • Reuben Tan, Huijuan Xu, Kate Saenko, Bryan A. Plummer
However, while such approaches tend to focus on identifying relationships between elements of the video and language modalities, there is less emphasis on modeling relational context between video frames given the semantic context of the query.
no code implementations • 1 Apr 2020 • Huijuan Xu, Lizhi Yang, Stan Sclaroff, Kate Saenko, Trevor Darrell
Spatio-temporal action detection in videos requires localizing the action both spatially and temporally in the form of an "action tube".
no code implementations • 31 Mar 2020 • Huijuan Xu, Ximeng Sun, Eric Tzeng, Abir Das, Kate Saenko, Trevor Darrell
In this paper, we present a conceptually simple and general yet novel framework for few-shot temporal activity detection based on proposal regression which detects the start and end time of the activities in untrimmed videos.
no code implementations • NeurIPS 2020 • Baifeng Shi, Judy Hoffman, Kate Saenko, Trevor Darrell, Huijuan Xu
By adjusting the auxiliary task weights to minimize the divergence between the surrogate prior and the true prior of the main task, we obtain a more accurate prior estimation, achieving the goal of minimizing the required amount of training data for the main task and avoiding a costly grid search.
no code implementations • ICCV 2021 • Baifeng Shi, Qi Dai, Judy Hoffman, Kate Saenko, Trevor Darrell, Huijuan Xu
We extensively benchmark against the baselines for SSAD and OSAD on our created data splits in THUMOS14 and ActivityNet1.2, and demonstrate the effectiveness of the proposed UFA and IB methods.
no code implementations • 25 Sep 2019 • Reuben Tan, Huijuan Xu, Kate Saenko, Bryan A. Plummer
Given a video and a sentence, the goal of weakly-supervised video moment retrieval is to locate the video segment which is described by the sentence without having access to temporal annotations during training.
no code implementations • NAACL 2022 • Zhekun Luo, Shalini Ghosh, Devin Guillory, Keizo Kato, Trevor Darrell, Huijuan Xu
In this paper, we aim to improve the generalization ability of the compositional action recognition model to novel verbs or novel nouns that are unseen during training time, by leveraging the power of knowledge graphs.
no code implementations • 12 Dec 2022 • Tianliang Zhang, Zhenjun Han, Huijuan Xu, Baochang Zhang, Qixiang Ye
In this paper, we propose a novel feature learning model, referred to as CircleNet, to achieve feature adaptation by mimicking how humans look at low-resolution and occluded objects: focusing on the object again, at a finer scale, if it cannot be identified clearly the first time.
no code implementations • 2 Oct 2023 • Shu Zhao, Huijuan Xu
Specifically, considering that the text modifier may refer to semantic concepts that do not exist in the reference image and need to be added to the target image, we learn the multi-modal concept alignment between the text modifier and the concatenation of the reference and target images, under a multiple-instance learning framework with image-level and sentence-level weak supervision.