Search Results for author: Tamara L. Berg

Found 33 papers, 14 papers with code

Revealing Single Frame Bias for Video-and-Language Learning

2 code implementations • 7 Jun 2022 • Jie Lei, Tamara L. Berg, Mohit Bansal

Training an effective video-and-language model intuitively requires multiple frames as model inputs.

Ranked #5 on Video Retrieval on SSv2-template retrieval (using extra training data)

Fine-grained Action Recognition Language Modelling +6

End-to-End Visual Editing with a Generatively Pre-Trained Artist

no code implementations • 3 May 2022 • Andrew Brown, Cheng-Yang Fu, Omkar Parkhi, Tamara L. Berg, Andrea Vedaldi

We consider the targeted image editing problem: blending a region in a source image with a driver image that specifies the desired change.

LoopITR: Combining Dual and Cross Encoder Architectures for Image-Text Retrieval

no code implementations • 10 Mar 2022 • Jie Lei, Xinlei Chen, Ning Zhang, Mengjiao Wang, Mohit Bansal, Tamara L. Berg, Licheng Yu

In this work, we propose LoopITR, which combines dual and cross encoders in the same network for joint learning.

Retrieval Text Retrieval

CommerceMM: Large-Scale Commerce MultiModal Representation Learning with Omni Retrieval

no code implementations • 15 Feb 2022 • Licheng Yu, Jun Chen, Animesh Sinha, Mengjiao MJ Wang, Hugo Chen, Tamara L. Berg, Ning Zhang

We introduce CommerceMM, a multimodal model that provides a diverse and granular understanding of the commerce topics associated with a given piece of content (image, text, or image+text) and generalizes to a wide range of tasks, including Multimodal Categorization, Image-Text Retrieval, Query-to-Product Retrieval, and Image-to-Product Retrieval.

Representation Learning Retrieval +1

MTVR: Multilingual Moment Retrieval in Videos

1 code implementation • ACL 2021 • Jie Lei, Tamara L. Berg, Mohit Bansal

We introduce mTVR, a large-scale multilingual video moment retrieval dataset, containing 218K English and Chinese queries from 21.8K TV show video clips.

Moment Retrieval Retrieval

QVHighlights: Detecting Moments and Highlights in Videos via Natural Language Queries

3 code implementations • 20 Jul 2021 • Jie Lei, Tamara L. Berg, Mohit Bansal

Each video in the dataset is annotated with: (1) a human-written free-form NL query, (2) relevant moments in the video w.r.t.

Ranked #11 on Highlight Detection on QVHighlights

Highlight Detection Moment Retrieval +2

Less is More: ClipBERT for Video-and-Language Learning via Sparse Sampling

1 code implementation • CVPR 2021 • Jie Lei, Linjie Li, Luowei Zhou, Zhe Gan, Tamara L. Berg, Mohit Bansal, Jingjing Liu

Experiments on text-to-video retrieval and video question answering on six datasets demonstrate that ClipBERT outperforms (or is on par with) existing methods that exploit full-length videos, suggesting that end-to-end learning with just a few sparsely sampled clips is often more accurate than using densely extracted offline features from full-length videos, proving the proverbial less-is-more principle.

Ranked #24 on Visual Question Answering (VQA) on MSRVTT-QA (using extra training data)

Question Answering Retrieval +4

What is More Likely to Happen Next? Video-and-Language Future Event Prediction

1 code implementation • EMNLP 2020 • Jie Lei, Licheng Yu, Tamara L. Berg, Mohit Bansal

Given a video with aligned dialogue, people can often infer what is more likely to happen next.

MART: Memory-Augmented Recurrent Transformer for Coherent Video Paragraph Captioning

1 code implementation • ACL 2020 • Jie Lei, Li-Wei Wang, Yelong Shen, Dong Yu, Tamara L. Berg, Mohit Bansal

Generating multi-sentence descriptions for videos is one of the most challenging captioning tasks due to its high requirements for not only visual relevance but also discourse-based coherence across the sentences in the paragraph.

Ranked #5 on Video Captioning on ActivityNet Captions

Sentence

TVR: A Large-Scale Dataset for Video-Subtitle Moment Retrieval

2 code implementations • ECCV 2020 • Jie Lei, Licheng Yu, Tamara L. Berg, Mohit Bansal

The queries are also labeled with query types that indicate whether each of them is more related to video, subtitle, or both, allowing for in-depth analysis of the dataset and the methods built on top of it.

Ranked #2 on Video Retrieval on TVR

Moment Retrieval Retrieval +2

IMP: Instance Mask Projection for High Accuracy Semantic Segmentation of Things

no code implementations • ICCV 2019 • Cheng-Yang Fu, Tamara L. Berg, Alexander C. Berg

In addition, the instance mask projection operator works well on other (non-clothing) datasets, providing an improvement of 3 points in mIOU on Thing classes of Cityscapes, a self-driving dataset, on top of a state-of-the-art approach.

Instance Segmentation Scene Segmentation +2

TVQA+: Spatio-Temporal Grounding for Video Question Answering

3 code implementations • ACL 2020 • Jie Lei, Licheng Yu, Tamara L. Berg, Mohit Bansal

We present the task of Spatio-Temporal Video Question Answering, which requires intelligent systems to simultaneously retrieve relevant moments and detect referenced visual concepts (people and objects) to answer natural language questions about videos.

Ranked #6 on Video Question Answering on TVQA

Question Answering Video Question Answering

Multi-Target Embodied Question Answering

1 code implementation • CVPR 2019 • Licheng Yu, Xinlei Chen, Georgia Gkioxari, Mohit Bansal, Tamara L. Berg, Dhruv Batra

To address this, we propose a modular architecture composed of a program generator, a controller, a navigator, and a VQA module.

Embodied Question Answering Navigate +1

Dance Dance Generation: Motion Transfer for Internet Videos

no code implementations • 30 Mar 2019 • Yipin Zhou, Zhaowen Wang, Chen Fang, Trung Bui, Tamara L. Berg

This work presents computational methods for transferring body movements from one person to another with videos collected in the wild.

TVQA: Localized, Compositional Video Question Answering

4 code implementations • EMNLP 2018 • Jie Lei, Licheng Yu, Mohit Bansal, Tamara L. Berg

Recent years have witnessed an increasing interest in image-based question-answering (QA) tasks.

Ranked #4 on Video Question Answering on SUTD-TrafficQA

Video Question Answering

Image2GIF: Generating Cinemagraphs using Recurrent Deep Q-Networks

no code implementations • 27 Jan 2018 • Yipin Zhou, Yale Song, Tamara L. Berg

Given a still photograph, one can imagine how dynamic objects might move against a static background.

MAttNet: Modular Attention Network for Referring Expression Comprehension

1 code implementation • CVPR 2018 • Licheng Yu, Zhe Lin, Xiaohui Shen, Jimei Yang, Xin Lu, Mohit Bansal, Tamara L. Berg

In this paper, we address referring expression comprehension: localizing an image region described by a natural language expression.

Ranked #7 on Generalized Referring Expression Segmentation on gRefCOCO

Generalized Referring Expression Segmentation Referring Expression +1

Visual to Sound: Generating Natural Sound for Videos in the Wild

3 code implementations • CVPR 2018 • Yipin Zhou, Zhaowen Wang, Chen Fang, Trung Bui, Tamara L. Berg

As two of the five traditional human senses (sight, hearing, taste, smell, and touch), vision and sound are basic sources through which humans understand the world.

Hierarchically-Attentive RNN for Album Summarization and Storytelling

no code implementations • EMNLP 2017 • Licheng Yu, Mohit Bansal, Tamara L. Berg

For this task, we make use of the Visual Storytelling dataset and a model composed of three hierarchically-attentive Recurrent Neural Nets (RNNs) to: encode the album photos, select representative (summary) photos, and compose the story.

Ranked #15 on Visual Storytelling on VIST (BLEU-3 metric)

Retrieval Visual Storytelling

A Joint Speaker-Listener-Reinforcer Model for Referring Expressions

2 code implementations • CVPR 2017 • Licheng Yu, Hao Tan, Mohit Bansal, Tamara L. Berg

The speaker generates referring expressions, the listener comprehends referring expressions, and the reinforcer introduces a reward function to guide sampling of more discriminative expressions.

Referring Expression Referring Expression Comprehension

Combining Multiple Cues for Visual Madlibs Question Answering

no code implementations • 1 Nov 2016 • Tatiana Tommasi, Arun Mallya, Bryan Plummer, Svetlana Lazebnik, Alexander C. Berg, Tamara L. Berg

This paper presents an approach for answering fill-in-the-blank multiple choice questions from the Visual Madlibs dataset.

Attribute General Classification +3

Learning Temporal Transformations From Time-Lapse Videos

no code implementations • 27 Aug 2016 • Yipin Zhou, Tamara L. Berg

Based on life-long observations of physical, chemical, and biologic phenomena in the natural world, humans can often easily picture in their minds what an object will look like in the future.

Object

When was that made?

no code implementations • 12 Aug 2016 • Sirion Vittayakorn, Alexander C. Berg, Tamara L. Berg

Toward this goal, we utilize features from existing deep networks and also fine-tune new networks for temporal estimation.

Retrieval

Solving Visual Madlibs with Multiple Cues

no code implementations • 11 Aug 2016 • Tatiana Tommasi, Arun Mallya, Bryan Plummer, Svetlana Lazebnik, Alexander C. Berg, Tamara L. Berg

This paper focuses on answering fill-in-the-blank style multiple choice questions from the Visual Madlibs dataset.

Activity Prediction Attribute +4

Modeling Context in Referring Expressions

4 code implementations • 31 Jul 2016 • Licheng Yu, Patrick Poirson, Shan Yang, Alexander C. Berg, Tamara L. Berg

Humans refer to objects in their environments all the time, especially in dialogue with other people.

Referring Expression Referring expression generation +1

Where to Buy It: Matching Street Clothing Photos in Online Shops

no code implementations • ICCV 2015 • M. Hadi Kiapour, Xufeng Han, Svetlana Lazebnik, Alexander C. Berg, Tamara L. Berg

In this paper, we define a new task, Exact Street to Shop, where our goal is to match a real-world example of a garment item to the same item in an online shop.

Retrieval

Visual Madlibs: Fill in the Blank Description Generation and Question Answering

no code implementations • ICCV 2015 • Licheng Yu, Eunbyung Park, Alexander C. Berg, Tamara L. Berg

In this paper, we introduce a new dataset consisting of 360,001 focused natural language descriptions for 10,738 images.

Multiple-choice Question Answering

Temporal Perception and Prediction in Ego-Centric Video

no code implementations • ICCV 2015 • Yipin Zhou, Tamara L. Berg

Given a video of an activity, can we predict what will happen next?

Visual Madlibs: Fill in the blank Image Generation and Question Answering

no code implementations • 31 May 2015 • Licheng Yu, Eunbyung Park, Alexander C. Berg, Tamara L. Berg

In this paper, we introduce a new dataset consisting of 360,001 focused natural language descriptions for 10,738 images.

Image Generation Multiple-choice +1

TreeTalk: Composition and Compression of Trees for Image Descriptions

no code implementations • TACL 2014 • Polina Kuznetsova, Vicente Ordonez, Tamara L. Berg, Yejin Choi

We present a new tree-based approach to composing expressive image descriptions that makes use of naturally occurring web images with captions.

Image Captioning Image Retrieval

Studying Relationships between Human Gaze, Description, and Computer Vision

no code implementations • CVPR 2013 • Kiwon Yun, Yifan Peng, Dimitris Samaras, Gregory J. Zelinsky, Tamara L. Berg

We posit that user behavior during natural viewing of images contains an abundance of information about the content of images as well as information related to user intent and user-defined content importance.

Two-person interaction detection using body-pose features and multiple instance learning

no code implementations • International Workshop on Human Activity Understanding from 3D Data at Conference on Computer Vision and Pattern Recognition (HAU3D-CVPRW) 2012 • Kiwon Yun, Jean Honorio, Debaleena Chattopadhyay, Tamara L. Berg, Dimitris Samaras

Human activity recognition has the potential to impact a wide range of applications, from surveillance to human-computer interfaces to content-based video retrieval.

Human Activity Recognition Multiple Instance Learning +2

Im2Text: Describing Images Using 1 Million Captioned Photographs

no code implementations • NeurIPS 2011 • Vicente Ordonez, Girish Kulkarni, Tamara L. Berg

We develop and demonstrate automatic image description methods using a large captioned photo collection.

Image Captioning
