Search Results for author: Yale Song

Found 43 papers, 16 papers with code

Ego-Exo4D: Understanding Skilled Human Activity from First- and Third-Person Perspectives

no code implementations30 Nov 2023 Kristen Grauman, Andrew Westbury, Lorenzo Torresani, Kris Kitani, Jitendra Malik, Triantafyllos Afouras, Kumar Ashutosh, Vijay Baiyya, Siddhant Bansal, Bikram Boote, Eugene Byrne, Zach Chavis, Joya Chen, Feng Cheng, Fu-Jen Chu, Sean Crane, Avijit Dasgupta, Jing Dong, Maria Escobar, Cristhian Forigua, Abrham Gebreselasie, Sanjay Haresh, Jing Huang, Md Mohaiminul Islam, Suyog Jain, Rawal Khirodkar, Devansh Kukreja, Kevin J Liang, Jia-Wei Liu, Sagnik Majumder, Yongsen Mao, Miguel Martin, Effrosyni Mavroudi, Tushar Nagarajan, Francesco Ragusa, Santhosh Kumar Ramakrishnan, Luigi Seminara, Arjun Somayazulu, Yale Song, Shan Su, Zihui Xue, Edward Zhang, Jinxu Zhang, Angela Castillo, Changan Chen, Xinzhu Fu, Ryosuke Furuta, Cristina Gonzalez, Prince Gupta, Jiabo Hu, Yifei HUANG, Yiming Huang, Weslie Khoo, Anush Kumar, Robert Kuo, Sach Lakhavani, Miao Liu, Mi Luo, Zhengyi Luo, Brighid Meredith, Austin Miller, Oluwatumininu Oguntola, Xiaqing Pan, Penny Peng, Shraman Pramanick, Merey Ramazanova, Fiona Ryan, Wei Shan, Kiran Somasundaram, Chenan Song, Audrey Southerland, Masatoshi Tateno, Huiyu Wang, Yuchen Wang, Takuma Yagi, Mingfei Yan, Xitong Yang, Zecheng Yu, Shengxin Cindy Zha, Chen Zhao, Ziwei Zhao, Zhifan Zhu, Jeff Zhuo, Pablo Arbelaez, Gedas Bertasius, David Crandall, Dima Damen, Jakob Engel, Giovanni Maria Farinella, Antonino Furnari, Bernard Ghanem, Judy Hoffman, C. V. Jawahar, Richard Newcombe, Hyun Soo Park, James M. Rehg, Yoichi Sato, Manolis Savva, Jianbo Shi, Mike Zheng Shou, Michael Wray

We present Ego-Exo4D, a diverse, large-scale multimodal multiview video dataset and benchmark challenge.

Video Understanding

Egocentric Video Task Translation @ Ego4D Challenge 2022

no code implementations3 Feb 2023 Zihui Xue, Yale Song, Kristen Grauman, Lorenzo Torresani

With no modification to the baseline architectures, our proposed approach achieves competitive performance on two Ego4D challenges, ranking 1st in the Talking to Me challenge and 3rd in the PNR keyframe localization challenge.

Translation

Egocentric Video Task Translation

no code implementations CVPR 2023 Zihui Xue, Yale Song, Kristen Grauman, Lorenzo Torresani

Different video understanding tasks are typically treated in isolation, often with distinct types of curated data (e.g., classifying sports in one dataset, tracking animals in another).

Multi-Task Learning Translation +1

PatchBlender: A Motion Prior for Video Transformers

no code implementations11 Nov 2022 Gabriele Prato, Yale Song, Janarthanan Rajendran, R Devon Hjelm, Neel Joshi, Sarath Chandar

We show that our method is successful at enabling vision transformers to encode the temporal component of video data.
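
The listing gives only the one-line summary above, so here is a minimal sketch of the general idea as we understand it: a learnable temporal mixing of per-patch embeddings across frames, initialized near the identity so blending starts as a no-op. The module name, tensor shapes, and initialization are our assumptions for illustration, not the paper's exact formulation.

```python
import torch
import torch.nn as nn

class TemporalPatchBlend(nn.Module):
    """Hedged sketch: blend each spatial patch embedding with the same
    patch at other frames via a learnable T x T mixing matrix."""
    def __init__(self, num_frames: int):
        super().__init__()
        # Initialize near identity so blending starts as a pass-through.
        self.mix = nn.Parameter(torch.eye(num_frames))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, patches, dim)
        w = self.mix.softmax(dim=-1)        # each row: blend weights over frames
        return torch.einsum("st,btpd->bspd", w, x)

# Usage: insert between transformer blocks of a video ViT.
blend = TemporalPatchBlend(num_frames=8)
tokens = torch.randn(2, 8, 196, 768)        # (B, T, P, D)
out = blend(tokens)                         # same shape, temporally smoothed
```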

Video Summarization Overview

no code implementations21 Oct 2022 Mayu Otani, Yale Song, Yang Wang

With the rapid growth of video capture devices and web applications, there is increasing demand to deliver desired video content to users efficiently.

Video Summarization

Neural-Sim: Learning to Generate Training Data with NeRF

1 code implementation22 Jul 2022 Yunhao Ge, Harkirat Behl, Jiashu Xu, Suriya Gunasekar, Neel Joshi, Yale Song, Xin Wang, Laurent Itti, Vibhav Vineet

However, existing approaches either require human experts to manually tune each scene property or use automatic methods that provide little to no control; this requires rendering large amounts of random data variations, which is slow and often suboptimal for the target domain.

Object Detection

Scaling Novel Object Detection with Weakly Supervised Detection Transformers

1 code implementation11 Jul 2022 Tyler LaBonte, Yale Song, Xin Wang, Vibhav Vineet, Neel Joshi

A critical object detection task is finetuning an existing model to detect novel objects, but the standard workflow requires bounding box annotations which are time-consuming and expensive to collect.

Multiple Instance Learning Novel Object Detection +4

Visual Attention Emerges from Recurrent Sparse Reconstruction

1 code implementation23 Apr 2022 Baifeng Shi, Yale Song, Neel Joshi, Trevor Darrell, Xin Wang

We present VARS, Visual Attention from Recurrent Sparse reconstruction, a new attention formulation built on two prominent features of the human visual attention mechanism: recurrency and sparsity.
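
As an illustration of "recurrency and sparsity", below is a hedged sketch using ISTA-style iterations, a standard recurrent procedure for sparse reconstruction. It is not claimed to be the exact VARS update; the dictionary `D`, step count, and sparsity weight are placeholders.

```python
import torch
import torch.nn.functional as F

def soft_threshold(z, lam):
    return torch.sign(z) * F.relu(z.abs() - lam)

def recurrent_sparse_codes(x, D, lam=0.1, steps=10):
    """ISTA-style recurrence: find sparse codes z with x ~= z @ D.
    x: (n, d) inputs; D: (k, d) dictionary."""
    L = torch.linalg.matrix_norm(D, ord=2) ** 2   # Lipschitz constant of the gradient
    z = torch.zeros(x.shape[0], D.shape[0])
    for _ in range(steps):
        grad = (z @ D - x) @ D.T                  # gradient of the reconstruction loss
        z = soft_threshold(z - grad / L, lam / L) # gradient step + sparsifying shrinkage
    return z
```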

Robust Contrastive Learning against Noisy Views

1 code implementation CVPR 2022 Ching-Yao Chuang, R Devon Hjelm, Xin Wang, Vibhav Vineet, Neel Joshi, Antonio Torralba, Stefanie Jegelka, Yale Song

Contrastive learning relies on an assumption that positive pairs contain related views, e.g., patches of an image or co-occurring multimodal signals of a video, that share certain underlying information about an instance.

Binary Classification Contrastive Learning

Contrastive Learning of Global and Local Video Representations

no code implementations NeurIPS 2021 Shuang Ma, Zhaoyang Zeng, Daniel McDuff, Yale Song

In this work, we propose to learn video representations that generalize both to tasks that require global semantic information (e.g., classification) and to tasks that require local fine-grained spatio-temporal information (e.g., localization).

Classification Contrastive Learning +4

Contrastive Learning of Global-Local Video Representations

1 code implementation7 Apr 2021 Shuang Ma, Zhaoyang Zeng, Daniel McDuff, Yale Song

In this work, we propose to learn video representations that generalize both to tasks that require global semantic information (e.g., classification) and to tasks that require local fine-grained spatio-temporal information (e.g., localization).

Classification Contrastive Learning +6

DOC2PPT: Automatic Presentation Slides Generation from Scientific Documents

no code implementations28 Jan 2021 Tsu-Jui Fu, William Yang Wang, Daniel McDuff, Yale Song

Creating presentation materials requires complex multimodal reasoning skills to summarize key concepts and arrange them in a logical and visually pleasing manner.

Document Summarization Multimodal Reasoning +2

ACAV100M: Automatic Curation of Large-Scale Datasets for Audio-Visual Video Representation Learning

1 code implementation ICCV 2021 Sangho Lee, Jiwan Chung, Youngjae Yu, Gunhee Kim, Thomas Breuel, Gal Chechik, Yale Song

We demonstrate that our approach finds videos with high audio-visual correspondence and show that self-supervised models trained on our data achieve competitive performances compared to models trained on existing manually curated datasets.

Representation Learning

Self-Supervised Learning of Compressed Video Representations

no code implementations ICLR 2021 Youngjae Yu, Sangho Lee, Gunhee Kim, Yale Song

We show that our approach achieves competitive performance on self-supervised learning of video representations with a considerable improvement in speed compared to the traditional methods.

Computational Efficiency Self-Supervised Learning

Parameter Efficient Multimodal Transformers for Video Representation Learning

no code implementations ICLR 2021 Sangho Lee, Youngjae Yu, Gunhee Kim, Thomas Breuel, Jan Kautz, Yale Song

The recent success of Transformers in the language domain has motivated adapting them to multimodal settings, where a new visual model is trained in tandem with an already pretrained language model.

Language Modelling Representation Learning

Learning to Transfer Visual Effects from Videos to Images

no code implementations3 Dec 2020 Christopher Thomas, Yale Song, Adriana Kovashka

We study the problem of animating images by transferring spatio-temporal visual effects (such as melting) from a collection of videos.

Optical Flow Estimation

Active Contrastive Learning of Audio-Visual Video Representations

1 code implementation ICLR 2021 Shuang Ma, Zhaoyang Zeng, Daniel McDuff, Yale Song

Contrastive learning has been shown to produce generalizable representations of audio and visual data by maximizing the lower bound on the mutual information (MI) between different views of an instance.

Contrastive Learning Representation Learning +1
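
The abstract above refers to maximizing a lower bound on mutual information between views; InfoNCE is the standard such bound, sketched below for paired audio-visual embeddings. This illustrates only the contrastive objective, not the paper's active sampling of negatives.

```python
import torch
import torch.nn.functional as F

def info_nce(z_a, z_v, temperature=0.07):
    """InfoNCE: a lower bound on MI between paired views.
    z_a, z_v: (n, d) embeddings; row i of each is a positive pair."""
    z_a = F.normalize(z_a, dim=-1)
    z_v = F.normalize(z_v, dim=-1)
    logits = z_a @ z_v.T / temperature     # (n, n) cross-view similarity matrix
    labels = torch.arange(z_a.shape[0])    # positives sit on the diagonal
    return F.cross_entropy(logits, labels)
```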

Multi-Reference Neural TTS Stylization with Adversarial Cycle Consistency

no code implementations25 Oct 2019 Matt Whitehill, Shuang Ma, Daniel McDuff, Yale Song

We use this method to transfer emotion from a dataset containing four emotions to a dataset with only a single emotion.

Emotion Classification Style Transfer

Unpaired Image-to-Speech Synthesis with Multimodal Information Bottleneck

1 code implementation ICCV 2019 Shuang Ma, Daniel McDuff, Yale Song

We propose a multimodal information bottleneck approach that learns the correspondence between modalities from unpaired data (image and speech) by leveraging the shared modality (text).

Image Generation Speech Synthesis

Image to Video Domain Adaptation Using Web Supervision

no code implementations5 Aug 2019 Andrew Kae, Yale Song

Training deep neural networks typically requires large amounts of labeled data which may be scarce or expensive to obtain for a particular target domain.

Domain Adaptation

M3D-GAN: Multi-Modal Multi-Domain Translation with Universal Attention

no code implementations9 Jul 2019 Shuang Ma, Daniel McDuff, Yale Song

Generative adversarial networks have led to significant advances in cross-modal/domain translation.

Dialogue Generation Image Captioning +5

Polysemous Visual-Semantic Embedding for Cross-Modal Retrieval

1 code implementation CVPR 2019 Yale Song, Mohammad Soleymani

In this work, we introduce Polysemous Instance Embedding Networks (PIE-Nets) that compute multiple and diverse representations of an instance by combining global context with locally-guided features via multi-head self-attention and residual learning.

Cross-Modal Retrieval Multiple Instance Learning +4
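
A hedged sketch of the ingredients named in the abstract: a global feature is combined with locally-guided features via multi-head attention and a residual connection to produce K diverse embeddings per instance. The learned queries and exact fusion below are our assumptions, not the paper's precise architecture.

```python
import torch
import torch.nn as nn

class PIESketch(nn.Module):
    """Hedged sketch of a polysemous embedding: K learned queries attend over
    local features, and each summary is fused with the global feature."""
    def __init__(self, dim, num_embeds=4):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_embeds, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, global_feat, local_feats):
        # global_feat: (B, D); local_feats: (B, L, D)
        B = global_feat.shape[0]
        q = self.queries.unsqueeze(0).expand(B, -1, -1)       # (B, K, D)
        attended, _ = self.attn(q, local_feats, local_feats)  # (B, K, D)
        # Residual: each locally-guided summary is combined with global context.
        return self.norm(global_feat.unsqueeze(1) + attended)

pie = PIESketch(dim=256, num_embeds=4)
g, loc = torch.randn(8, 256), torch.randn(8, 36, 256)
embeds = pie(g, loc)    # (8, 4, 256): K diverse embeddings per instance
```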

Neural TTS Stylization with Adversarial and Collaborative Games

no code implementations ICLR 2019 Shuang Ma, Daniel McDuff, Yale Song

The synthesized audio waveform is expected to contain the verbal content of x_txt and the auditory style of x_aud.

Disentanglement Style Transfer

Video Prediction with Appearance and Motion Conditions

no code implementations ICML 2018 Yunseok Jang, Gunhee Kim, Yale Song

Video prediction aims to generate realistic future frames by learning dynamic visual patterns.

Video Prediction

Cross-Modal Retrieval with Implicit Concept Association

no code implementations12 Apr 2018 Yale Song, Mohammad Soleymani

Traditional cross-modal retrieval assumes explicit association of concepts across modalities, where there is no ambiguity in how the concepts are linked to each other; e.g., when we run an image search with the query "dogs", we expect to see dog images.

Cross-Modal Retrieval Image Retrieval +3

Image2GIF: Generating Cinemagraphs using Recurrent Deep Q-Networks

no code implementations27 Jan 2018 Yipin Zhou, Yale Song, Tamara L. Berg

Given a still photograph, one can imagine how dynamic objects might move against a static background.

Improving Pairwise Ranking for Multi-label Image Classification

4 code implementations CVPR 2017 Yuncheng Li, Yale Song, Jiebo Luo

Pairwise ranking, in particular, has been successful in multi-label image classification, achieving state-of-the-art results on various benchmarks.

Classification General Classification +2
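
As a concrete illustration of pairwise ranking for multi-label classification, here is a sketch of a smooth log-sum-exp pairwise loss in the spirit of what the paper studies: any negative label scored above a positive label incurs a penalty. The reduction and masking details are our choices for the sketch.

```python
import torch

def pairwise_rank_loss(scores, labels):
    """Smooth pairwise ranking for multi-label classification (log-sum-exp form).
    scores: (n, c) raw logits; labels: (n, c) binary {0, 1}."""
    pos = labels.bool()
    # diff[i, v, u] = scores[i, v] - scores[i, u]
    diff = scores.unsqueeze(2) - scores.unsqueeze(1)   # (n, c, c)
    mask = (~pos).unsqueeze(2) & pos.unsqueeze(1)      # negative label v, positive label u
    per_example = torch.log1p((diff.exp() * mask).sum(dim=(1, 2)))
    return per_example.mean()
```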

Learning from Noisy Labels with Distillation

no code implementations ICCV 2017 Yuncheng Li, Jianchao Yang, Yale Song, Liangliang Cao, Jiebo Luo, Li-Jia Li

The ability to learn from noisy labels is very useful in many visual recognition tasks, as a vast amount of data with noisy labels is relatively easy to obtain.

Real-Time Video Highlights for Yahoo Esports

no code implementations27 Nov 2016 Yale Song

We present a technique for detecting highlights from live streaming videos of esports game matches.

Dota 2 Video Understanding

To Click or Not To Click: Automatic Selection of Beautiful Thumbnails from Videos

2 code implementations6 Sep 2016 Yale Song, Miriam Redi, Jordi Vallmitjana, Alejandro Jaimes

Our system selects attractive thumbnails by analyzing various visual quality and aesthetic metrics of video frames, and performs a clustering analysis to determine the relevance to video content, thus making the resulting thumbnails more representative of the video.

Multimedia
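
An illustrative pipeline in the spirit of the abstract: score frames by a simple quality proxy, cluster them to find content-representative groups, then pick the best-scoring frame from the largest cluster. Laplacian-variance sharpness, tiny-grayscale features, and KMeans here are stand-ins for the paper's actual quality/aesthetic metrics and clustering analysis.

```python
import cv2
import numpy as np
from sklearn.cluster import KMeans

def pick_thumbnail(frames, k=5):
    """frames: list of BGR images (H, W, 3). Returns one candidate thumbnail."""
    gray = [cv2.cvtColor(f, cv2.COLOR_BGR2GRAY) for f in frames]
    # Quality proxy: variance of the Laplacian (higher = sharper frame).
    sharpness = np.array([cv2.Laplacian(g, cv2.CV_64F).var() for g in gray])
    # Cheap content features for clustering: downsampled grayscale pixels.
    feats = np.stack([cv2.resize(g, (16, 16)).ravel() for g in gray]).astype(float)
    labels = KMeans(n_clusters=k, n_init=10).fit_predict(feats)
    biggest = np.bincount(labels).argmax()             # most representative content
    idx = np.where(labels == biggest)[0]
    return frames[idx[sharpness[idx].argmax()]]        # sharpest frame in that cluster
```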

Video2GIF: Automatic Generation of Animated GIFs from Video

1 code implementation CVPR 2016 Michael Gygli, Yale Song, Liangliang Cao

We introduce the novel problem of automatically generating animated GIFs from video.

Balancing Appearance and Context in Sketch Interpretation

no code implementations25 Apr 2016 Yale Song, Randall Davis, Kaichen Ma, Dana L. Penny

We describe a sketch interpretation system that detects and classifies clock numerals created by subjects taking the Clock Drawing Test, a clinical tool widely used to screen for cognitive impairments (e.g., dementia).

TGIF: A New Dataset and Benchmark on Animated GIF Description

1 code implementation CVPR 2016 Yuncheng Li, Yale Song, Liangliang Cao, Joel Tetreault, Larry Goldberg, Alejandro Jaimes, Jiebo Luo

The motivation for this work is to develop a testbed for image sequence description systems, where the task is to generate natural language descriptions for animated GIFs or video clips.

Image Captioning Machine Translation +3

Video Co-Summarization: Video Summarization by Visual Co-Occurrence

no code implementations CVPR 2015 Wen-Sheng Chu, Yale Song, Alejandro Jaimes

We present video co-summarization, a novel perspective to video summarization that exploits visual co-occurrence across multiple videos.

Video Summarization

TVSum: Summarizing Web Videos Using Titles

no code implementations CVPR 2015 Yale Song, Jordi Vallmitjana, Amanda Stent, Alejandro Jaimes

We observe that a video title is often carefully chosen to be maximally descriptive of its main topic, and hence images related to the title can serve as a proxy for important visual concepts of the main topic.

Descriptive Image Retrieval +1

Action Recognition by Hierarchical Sequence Summarization

no code implementations CVPR 2013 Yale Song, Louis-Philippe Morency, Randall Davis

We develop an efficient learning method to train our model and show that its complexity grows sublinearly with the size of the hierarchy.

Action Recognition Temporal Action Localization
