2 code implementations • 7 Jun 2022 • Jie Lei, Tamara L. Berg, Mohit Bansal
Training an effective video-and-language model intuitively requires multiple frames as model inputs.
Ranked #5 on Video Retrieval on SSv2-template retrieval (using extra training data)
no code implementations • 3 May 2022 • Andrew Brown, Cheng-Yang Fu, Omkar Parkhi, Tamara L. Berg, Andrea Vedaldi
We consider the targeted image editing problem: blending a region in a source image with a driver image that specifies the desired change.
no code implementations • 10 Mar 2022 • Jie Lei, Xinlei Chen, Ning Zhang, Mengjiao Wang, Mohit Bansal, Tamara L. Berg, Licheng Yu
In this work, we propose LoopITR, which combines dual encoders and cross encoders in the same network for joint learning.
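The combination described here follows the familiar retrieve-then-rerank pattern: a cheap dual encoder scores all image-text pairs, and a heavier cross encoder re-scores the top candidates. Below is a minimal sketch of that pattern under assumed feature dimensions and placeholder module names; it is not the LoopITR implementation itself.

```python
# Sketch: dual encoder for coarse scoring, cross encoder for re-ranking.
# Names, dimensions, and the MLP scorer are illustrative assumptions.
import torch
import torch.nn as nn

class DualEncoder(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.img_proj = nn.Linear(2048, dim)   # e.g. pooled image features
        self.txt_proj = nn.Linear(768, dim)    # e.g. pooled text features

    def forward(self, img_feats, txt_feats):
        # Cosine-similarity matrix between every image and every text.
        i = nn.functional.normalize(self.img_proj(img_feats), dim=-1)
        t = nn.functional.normalize(self.txt_proj(txt_feats), dim=-1)
        return i @ t.T

class CrossEncoder(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.img_proj = nn.Linear(2048, dim)
        self.txt_proj = nn.Linear(768, dim)
        self.scorer = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, 1))

    def forward(self, img_feats, txt_feats):
        # Joint scoring of explicitly paired inputs: slower but more accurate.
        pair = torch.cat([self.img_proj(img_feats), self.txt_proj(txt_feats)], dim=-1)
        return self.scorer(pair).squeeze(-1)

def retrieve(img_feats, txt_feats, dual, cross, k=10):
    """Retrieve-then-rerank: dual encoder narrows candidates, cross encoder re-scores them."""
    sims = dual(img_feats, txt_feats)            # (num_imgs, num_txts)
    topk = sims.topk(k, dim=1).indices           # top-k candidate texts per image
    reranked = []
    for i, cand in enumerate(topk):
        scores = cross(img_feats[i].expand(len(cand), -1), txt_feats[cand])
        reranked.append(cand[scores.argsort(descending=True)])
    return torch.stack(reranked)
```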
no code implementations • 15 Feb 2022 • Licheng Yu, Jun Chen, Animesh Sinha, Mengjiao MJ Wang, Hugo Chen, Tamara L. Berg, Ning Zhang
We introduce CommerceMM - a multimodal model capable of providing a diverse and granular understanding of commerce topics associated with a given piece of content (image, text, image+text), and of generalizing to a wide range of tasks, including Multimodal Categorization, Image-Text Retrieval, Query-to-Product Retrieval, Image-to-Product Retrieval, etc.
1 code implementation • ACL 2021 • Jie Lei, Tamara L. Berg, Mohit Bansal
We introduce mTVR, a large-scale multilingual video moment retrieval dataset, containing 218K English and Chinese queries from 21.8K TV show video clips.
4 code implementations • 20 Jul 2021 • Jie Lei, Tamara L. Berg, Mohit Bansal
Each video in the dataset is annotated with: (1) a human-written free-form NL query, (2) relevant moments in the video w.r.t. the query.
Ranked #15 on Highlight Detection on QVHighlights
1 code implementation • CVPR 2021 • Jie Lei, Linjie Li, Luowei Zhou, Zhe Gan, Tamara L. Berg, Mohit Bansal, Jingjing Liu
Experiments on text-to-video retrieval and video question answering on six datasets demonstrate that ClipBERT outperforms (or is on par with) existing methods that exploit full-length videos. This suggests that end-to-end learning with just a few sparsely sampled clips is often more accurate than using densely extracted offline features from full-length videos, proving the proverbial less-is-more principle.
Ranked #27 on Visual Question Answering (VQA) on MSRVTT-QA (using extra training data)
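The sparse-sampling idea above can be illustrated with a short sketch: at each step only a few short clips are drawn from the full video, each clip is scored jointly with the text, and the clip-level predictions are aggregated. The model stub, dimensions, and mean-pooling aggregation below are assumptions for illustration, not ClipBERT's actual code.

```python
# Sketch of sparse clip sampling: score a few randomly sampled clips per video
# and average their predictions (late fusion). All modules are placeholders.
import torch
import torch.nn as nn

class ClipScorer(nn.Module):
    """Stand-in for an end-to-end clip+text model."""
    def __init__(self, num_answers=1000):
        super().__init__()
        self.visual = nn.Linear(3 * 224 * 224, 256)   # placeholder per-frame encoder
        self.text = nn.Linear(768, 256)                # placeholder text encoder
        self.head = nn.Linear(512, num_answers)

    def forward(self, clip, text_feat):
        # clip: (B, T, 3, 224, 224); mean-pool frame encodings within the clip.
        v = self.visual(clip.flatten(2)).mean(dim=1)   # (B, 256)
        x = torch.cat([v, self.text(text_feat)], dim=-1)
        return self.head(x)                            # (B, num_answers)

def sparse_sample_clips(video, num_clips=2, clip_len=4):
    """Randomly pick a few short clips of `clip_len` frames from the full video."""
    total = video.shape[1]
    clips = []
    for _ in range(num_clips):
        start = torch.randint(0, total - clip_len + 1, (1,)).item()
        clips.append(video[:, start:start + clip_len])
    return clips

def predict(model, video, text_feat):
    """Aggregate clip-level logits by averaging across the sampled clips."""
    logits = [model(c, text_feat) for c in sparse_sample_clips(video)]
    return torch.stack(logits).mean(dim=0)
```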
1 code implementation • EMNLP 2020 • Jie Lei, Licheng Yu, Tamara L. Berg, Mohit Bansal
Given a video with aligned dialogue, people can often infer what is more likely to happen next.
1 code implementation • ACL 2020 • Jie Lei, Li-Wei Wang, Yelong Shen, Dong Yu, Tamara L. Berg, Mohit Bansal
Generating multi-sentence descriptions for videos is one of the most challenging captioning tasks due to its high requirements for not only visual relevance but also discourse-based coherence across the sentences in the paragraph.
Ranked #5 on Video Captioning on ActivityNet Captions
2 code implementations • ECCV 2020 • Jie Lei, Licheng Yu, Tamara L. Berg, Mohit Bansal
The queries are also labeled with query types that indicate whether each of them is more related to video or subtitle or both, allowing for in-depth analysis of the dataset and of methods built on top of it.
Ranked #2 on Video Retrieval on TVR
no code implementations • ICCV 2019 • Cheng-Yang Fu, Tamara L. Berg, Alexander C. Berg
In addition, the instance mask projection operator works well on other (non-clothing) datasets, providing an improvement of 3 points in mIoU on Thing classes of Cityscapes, a self-driving dataset, on top of a state-of-the-art approach.
3 code implementations • ACL 2020 • Jie Lei, Licheng Yu, Tamara L. Berg, Mohit Bansal
We present the task of Spatio-Temporal Video Question Answering, which requires intelligent systems to simultaneously retrieve relevant moments and detect referenced visual concepts (people and objects) to answer natural language questions about videos.
Ranked #6 on Video Question Answering on TVQA
1 code implementation • CVPR 2019 • Licheng Yu, Xinlei Chen, Georgia Gkioxari, Mohit Bansal, Tamara L. Berg, Dhruv Batra
To address this, we propose a modular architecture composed of a program generator, a controller, a navigator, and a VQA module.
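The entry above describes a modular pipeline in which a program generator decomposes the question, a controller dispatches subtasks, a navigator moves the agent, and a VQA module produces the answer. The toy sketch below only illustrates how such modules might be wired together; every class, method, and string in it is a hypothetical placeholder, not the paper's implementation.

```python
# Toy wiring of program generator -> controller -> navigator / VQA module.
from dataclasses import dataclass

@dataclass
class Subtask:
    action: str   # e.g. "nav" or "answer"
    target: str   # e.g. "kitchen" or the question itself

class ProgramGenerator:
    def __call__(self, question: str) -> list[Subtask]:
        # Would be a learned seq2seq model; here a hard-coded toy parse.
        return [Subtask("nav", "kitchen"), Subtask("answer", question)]

class Navigator:
    def __call__(self, target: str) -> str:
        # Would move the agent in the environment and return the final observation.
        return f"observation_at_{target}"

class VQAModule:
    def __call__(self, observation: str, question: str) -> str:
        # Would run a vision-language model on the observation.
        return "red"

class Controller:
    """Executes the generated program by dispatching subtasks to the modules."""
    def __init__(self):
        self.navigator, self.vqa = Navigator(), VQAModule()

    def run(self, program: list[Subtask], question: str) -> str:
        obs = ""
        for task in program:
            if task.action == "nav":
                obs = self.navigator(task.target)
            elif task.action == "answer":
                return self.vqa(obs, question)
        return ""

question = "What color is the mug in the kitchen?"
program = ProgramGenerator()(question)
print(Controller().run(program, question))   # -> "red" (toy output)
```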
no code implementations • 30 Mar 2019 • Yipin Zhou, Zhaowen Wang, Chen Fang, Trung Bui, Tamara L. Berg
This work presents computational methods for transferring body movements from one person to another with videos collected in the wild.
4 code implementations • EMNLP 2018 • Jie Lei, Licheng Yu, Mohit Bansal, Tamara L. Berg
Recent years have witnessed an increasing interest in image-based question-answering (QA) tasks.
Ranked #4 on Video Question Answering on SUTD-TrafficQA
no code implementations • 27 Jan 2018 • Yipin Zhou, Yale Song, Tamara L. Berg
Given a still photograph, one can imagine how dynamic objects might move against a static background.
1 code implementation • CVPR 2018 • Licheng Yu, Zhe Lin, Xiaohui Shen, Jimei Yang, Xin Lu, Mohit Bansal, Tamara L. Berg
In this paper, we address referring expression comprehension: localizing an image region described by a natural language expression.
Tasks: Generalized Referring Expression Segmentation, Referring Expression, +1
3 code implementations • CVPR 2018 • Yipin Zhou, Zhaowen Wang, Chen Fang, Trung Bui, Tamara L. Berg
As two of the five traditional human senses (sight, hearing, taste, smell, and touch), vision and sound are basic sources through which humans understand the world.
no code implementations • EMNLP 2017 • Licheng Yu, Mohit Bansal, Tamara L. Berg
For this task, we make use of the Visual Storytelling dataset and a model composed of three hierarchically-attentive Recurrent Neural Nets (RNNs) to: encode the album photos, select representative (summary) photos, and compose the story.
Ranked #15 on Visual Storytelling on VIST (BLEU-3 metric)
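The three stages named in the entry above (encode the album photos, select summary photos, compose the story) can be sketched as a single module. The dimensions, the top-k photo selection, and the toy decoding scheme below are assumptions for illustration, not the paper's architecture.

```python
# Simplified three-stage pipeline: encode photos, attend to pick summary
# photos, decode word logits. All sizes and the decoding scheme are assumed.
import torch
import torch.nn as nn

class AlbumStoryteller(nn.Module):
    def __init__(self, img_dim=2048, hid=512, vocab=10000, num_summary=5):
        super().__init__()
        self.photo_enc = nn.GRU(img_dim, hid, batch_first=True)   # encode album photos
        self.select = nn.Linear(hid, 1)                           # attention score per photo
        self.decoder = nn.GRU(hid, hid, batch_first=True)         # compose the story
        self.word_head = nn.Linear(hid, vocab)
        self.num_summary = num_summary

    def forward(self, album_feats, sent_len=15):
        # album_feats: (B, num_photos, img_dim)
        enc, _ = self.photo_enc(album_feats)                      # (B, N, hid)
        scores = self.select(enc).squeeze(-1)                     # (B, N) attention over photos
        topk = scores.topk(self.num_summary, dim=1).indices       # indices of summary photos
        summary = torch.gather(enc, 1, topk.unsqueeze(-1).expand(-1, -1, enc.size(-1)))
        # One sentence per summary photo: feed its encoding at every decode step (toy decoding).
        steps = summary.repeat_interleave(sent_len, dim=1)        # (B, num_summary*sent_len, hid)
        dec, _ = self.decoder(steps)
        return self.word_head(dec)                                # word logits

model = AlbumStoryteller()
logits = model(torch.randn(2, 10, 2048))   # a 10-photo album -> story word logits
```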
2 code implementations • CVPR 2017 • Licheng Yu, Hao Tan, Mohit Bansal, Tamara L. Berg
The speaker generates referring expressions, the listener comprehends referring expressions, and the reinforcer introduces a reward function to guide sampling of more discriminative expressions.
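One way to picture the speaker-listener-reinforcer setup described above is as three terms in a joint objective: a generation loss for the speaker, a discrimination loss for the listener, and a reward-weighted term that encourages sampled expressions the listener can resolve. The sketch below is schematic; all modules, the reward definition, and the loss weights are illustrative assumptions, not the paper's exact formulation.

```python
# Schematic joint objective for a speaker, a listener, and a reinforcer reward.
import torch
import torch.nn as nn

class Speaker(nn.Module):
    """Scores words for a target region (single-step placeholder generator)."""
    def __init__(self, dim=256, vocab=5000):
        super().__init__()
        self.head = nn.Linear(dim, vocab)

    def forward(self, region_feat):
        return self.head(region_feat)               # word logits

class Listener(nn.Module):
    """Scores how well an expression embedding matches each candidate region."""
    def __init__(self, dim=256):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, expr_emb, region_feats):
        return region_feats @ self.proj(expr_emb)   # (num_regions,)

def joint_loss(speaker, listener, region_feat, all_regions, target_idx, expr_emb,
               gt_word, lambda_listen=1.0, lambda_reward=0.5):
    # Speaker: maximize likelihood of the ground-truth word for the target region.
    gen_loss = nn.functional.cross_entropy(speaker(region_feat).unsqueeze(0),
                                           torch.tensor([gt_word]))
    # Listener: the expression should single out the target region among candidates.
    listen_loss = nn.functional.cross_entropy(listener(expr_emb, all_regions).unsqueeze(0),
                                              torch.tensor([target_idx]))
    # Reinforcer: reward expressions the listener resolves correctly (here the
    # reward is simply the listener's probability of the target region).
    reward = listener(expr_emb, all_regions).softmax(-1)[target_idx].detach()
    reinforce_loss = -reward * speaker(region_feat).log_softmax(-1)[gt_word]
    return gen_loss + lambda_listen * listen_loss + lambda_reward * reinforce_loss
```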
no code implementations • 1 Nov 2016 • Tatiana Tommasi, Arun Mallya, Bryan Plummer, Svetlana Lazebnik, Alexander C. Berg, Tamara L. Berg
This paper presents an approach for answering fill-in-the-blank multiple choice questions from the Visual Madlibs dataset.
no code implementations • 27 Aug 2016 • Yipin Zhou, Tamara L. Berg
Based on life-long observations of physical, chemical, and biological phenomena in the natural world, humans can often easily picture in their minds what an object will look like in the future.
no code implementations • 12 Aug 2016 • Sirion Vittayakorn, Alexander C. Berg, Tamara L. Berg
Toward this goal, we utilize features from existing deep networks and also fine-tune new networks for temporal estimation.
no code implementations • 11 Aug 2016 • Tatiana Tommasi, Arun Mallya, Bryan Plummer, Svetlana Lazebnik, Alexander C. Berg, Tamara L. Berg
This paper focuses on answering fill-in-the-blank style multiple choice questions from the Visual Madlibs dataset.
4 code implementations • 31 Jul 2016 • Licheng Yu, Patrick Poirson, Shan Yang, Alexander C. Berg, Tamara L. Berg
Humans refer to objects in their environments all the time, especially in dialogue with other people.
no code implementations • ICCV 2015 • Yipin Zhou, Tamara L. Berg
Given a video of an activity, can we predict what will happen next?
no code implementations • ICCV 2015 • M. Hadi Kiapour, Xufeng Han, Svetlana Lazebnik, Alexander C. Berg, Tamara L. Berg
In this paper, we define a new task, Exact Street to Shop, where our goal is to match a real-world example of a garment item to the same item in an online shop.
no code implementations • ICCV 2015 • Licheng Yu, Eunbyung Park, Alexander C. Berg, Tamara L. Berg
In this paper, we introduce a new dataset consisting of 360,001 focused natural language descriptions for 10,738 images.
no code implementations • 31 May 2015 • Licheng Yu, Eunbyung Park, Alexander C. Berg, Tamara L. Berg
In this paper, we introduce a new dataset consisting of 360,001 focused natural language descriptions for 10,738 images.
no code implementations • TACL 2014 • Polina Kuznetsova, Vicente Ordonez, Tamara L. Berg, Yejin Choi
We present a new tree-based approach to composing expressive image descriptions that makes use of naturally occurring web images with captions.
no code implementations • CVPR 2013 • Kiwon Yun, Yifan Peng, Dimitris Samaras, Gregory J. Zelinsky, Tamara L. Berg
We posit that user behavior during natural viewing of images contains an abundance of information about the content of images as well as information related to user intent and user-defined content importance.
no code implementations • International Workshop on Human Activity Understanding from 3D Data at Conference on Computer Vision and Pattern Recognition (HAU3D-CVPRW) 2012 • Kiwon Yun, Jean Honorio, Debaleena Chattopadhyay, Tamara L. Berg, Dimitris Samaras
Human activity recognition has the potential to impact a wide range of applications, from surveillance to human-computer interfaces to content-based video retrieval.
no code implementations • NeurIPS 2011 • Vicente Ordonez, Girish Kulkarni, Tamara L. Berg
We develop and demonstrate automatic image description methods using a large captioned photo collection.