Search Results for author: Yale Song

Found 43 papers, 16 papers with code

Ego-Exo4D: Understanding Skilled Human Activity from First- and Third-Person Perspectives

no code implementations30 Nov 2023 Kristen Grauman, Andrew Westbury, Lorenzo Torresani, Kris Kitani, Jitendra Malik, Triantafyllos Afouras, Kumar Ashutosh, Vijay Baiyya, Siddhant Bansal, Bikram Boote, Eugene Byrne, Zach Chavis, Joya Chen, Feng Cheng, Fu-Jen Chu, Sean Crane, Avijit Dasgupta, Jing Dong, Maria Escobar, Cristhian Forigua, Abrham Gebreselasie, Sanjay Haresh, Jing Huang, Md Mohaiminul Islam, Suyog Jain, Rawal Khirodkar, Devansh Kukreja, Kevin J Liang, Jia-Wei Liu, Sagnik Majumder, Yongsen Mao, Miguel Martin, Effrosyni Mavroudi, Tushar Nagarajan, Francesco Ragusa, Santhosh Kumar Ramakrishnan, Luigi Seminara, Arjun Somayazulu, Yale Song, Shan Su, Zihui Xue, Edward Zhang, Jinxu Zhang, Angela Castillo, Changan Chen, Xinzhu Fu, Ryosuke Furuta, Cristina Gonzalez, Prince Gupta, Jiabo Hu, Yifei HUANG, Yiming Huang, Weslie Khoo, Anush Kumar, Robert Kuo, Sach Lakhavani, Miao Liu, Mi Luo, Zhengyi Luo, Brighid Meredith, Austin Miller, Oluwatumininu Oguntola, Xiaqing Pan, Penny Peng, Shraman Pramanick, Merey Ramazanova, Fiona Ryan, Wei Shan, Kiran Somasundaram, Chenan Song, Audrey Southerland, Masatoshi Tateno, Huiyu Wang, Yuchen Wang, Takuma Yagi, Mingfei Yan, Xitong Yang, Zecheng Yu, Shengxin Cindy Zha, Chen Zhao, Ziwei Zhao, Zhifan Zhu, Jeff Zhuo, Pablo Arbelaez, Gedas Bertasius, David Crandall, Dima Damen, Jakob Engel, Giovanni Maria Farinella, Antonino Furnari, Bernard Ghanem, Judy Hoffman, C. V. Jawahar, Richard Newcombe, Hyun Soo Park, James M. Rehg, Yoichi Sato, Manolis Savva, Jianbo Shi, Mike Zheng Shou, Michael Wray

We present Ego-Exo4D, a diverse, large-scale multimodal multiview video dataset and benchmark challenge.

Video Understanding

Egocentric Video Task Translation @ Ego4D Challenge 2022

no code implementations3 Feb 2023 Zihui Xue, Yale Song, Kristen Grauman, Lorenzo Torresani

With no modification to the baseline architectures, our proposed approach achieves competitive performance on two Ego4D challenges, ranking 1st in the Talking to Me challenge and 3rd in the PNR keyframe localization challenge.

Translation

Egocentric Video Task Translation

no code implementations CVPR 2023 Zihui Xue, Yale Song, Kristen Grauman, Lorenzo Torresani

Different video understanding tasks are typically treated in isolation, often with distinct types of curated data (e.g., classifying sports in one dataset, tracking animals in another).

Multi-Task Learning Translation +1

PatchBlender: A Motion Prior for Video Transformers

no code implementations11 Nov 2022 Gabriele Prato, Yale Song, Janarthanan Rajendran, R Devon Hjelm, Neel Joshi, Sarath Chandar

We show that our method is successful at enabling vision transformers to encode the temporal component of video data.
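
The listing gives only the one-line summary above, so here is a minimal sketch of the general idea as we understand it: a learnable temporal mixing of per-patch embeddings across frames, initialized near the identity so blending starts as a no-op. The module name, tensor shapes, and initialization are our assumptions for illustration, not the paper's exact formulation.

```python
import torch
import torch.nn as nn

class TemporalPatchBlend(nn.Module):
    """Hedged sketch: blend each spatial patch embedding with the same
    patch at other frames via a learnable T x T mixing matrix."""
    def __init__(self, num_frames: int):
        super().__init__()
        # Initialize near identity so blending starts as a pass-through.
        self.mix = nn.Parameter(torch.eye(num_frames))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, patches, dim)
        w = self.mix.softmax(dim=-1)        # each row: blend weights over frames
        return torch.einsum("st,btpd->bspd", w, x)

# Usage: insert between transformer blocks of a video ViT.
blend = TemporalPatchBlend(num_frames=8)
tokens = torch.randn(2, 8, 196, 768)        # (B, T, P, D)
out = blend(tokens)                         # same shape, temporally smoothed
```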

Video Summarization Overview

no code implementations21 Oct 2022 Mayu Otani, Yale Song, Yang Wang

With the rapid growth of video capture devices and web applications, there is increasing demand to deliver desired video content to users efficiently.

Video Summarization

Neural-Sim: Learning to Generate Training Data with NeRF

1 code implementation22 Jul 2022 Yunhao Ge, Harkirat Behl, Jiashu Xu, Suriya Gunasekar, Neel Joshi, Yale Song, Xin Wang, Laurent Itti, Vibhav Vineet

However, existing approaches either require human experts to manually tune each scene property or use automatic methods that provide little to no control; this requires rendering large amounts of random data variations, which is slow and often suboptimal for the target domain.

Object Detection

Scaling Novel Object Detection with Weakly Supervised Detection Transformers

1 code implementation11 Jul 2022 Tyler LaBonte, Yale Song, Xin Wang, Vibhav Vineet, Neel Joshi

A critical object detection task is finetuning an existing model to detect novel objects, but the standard workflow requires bounding box annotations which are time-consuming and expensive to collect.

Multiple Instance Learning Novel Object Detection +4

Visual Attention Emerges from Recurrent Sparse Reconstruction

1 code implementation23 Apr 2022 Baifeng Shi, Yale Song, Neel Joshi, Trevor Darrell, Xin Wang

We present VARS, Visual Attention from Recurrent Sparse reconstruction, a new attention formulation built on two prominent features of the human visual attention mechanism: recurrency and sparsity.
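
As an illustration of "recurrency and sparsity", below is a hedged sketch using ISTA-style iterations, a standard recurrent procedure for sparse reconstruction. It is not claimed to be the exact VARS update; the dictionary `D`, step count, and sparsity weight are placeholders.

```python
import torch
import torch.nn.functional as F

def soft_threshold(z, lam):
    return torch.sign(z) * F.relu(z.abs() - lam)

def recurrent_sparse_codes(x, D, lam=0.1, steps=10):
    """ISTA-style recurrence: find sparse codes z with x ~= z @ D.
    x: (n, d) inputs; D: (k, d) dictionary."""
    L = torch.linalg.matrix_norm(D, ord=2) ** 2   # Lipschitz constant of the gradient
    z = torch.zeros(x.shape[0], D.shape[0])
    for _ in range(steps):
        grad = (z @ D - x) @ D.T                  # gradient of the reconstruction loss
        z = soft_threshold(z - grad / L, lam / L) # gradient step + sparsifying shrinkage
    return z
```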

Robust Contrastive Learning against Noisy Views

1 code implementation CVPR 2022 Ching-Yao Chuang, R Devon Hjelm, Xin Wang, Vibhav Vineet, Neel Joshi, Antonio Torralba, Stefanie Jegelka, Yale Song

Contrastive learning relies on an assumption that positive pairs contain related views, e.g., patches of an image or co-occurring multimodal signals of a video, that share certain underlying information about an instance.

Binary Classification Contrastive Learning

Contrastive Learning of Global and Local Video Representations

no code implementations NeurIPS 2021 Shuang Ma, Zhaoyang Zeng, Daniel McDuff, Yale Song

In this work, we propose to learn video representations that generalize both to tasks that require global semantic information (e.g., classification) and to tasks that require local fine-grained spatio-temporal information (e.g., localization).

Classification Contrastive Learning +4

Contrastive Learning of Global-Local Video Representations

1 code implementation7 Apr 2021 Shuang Ma, Zhaoyang Zeng, Daniel McDuff, Yale Song

In this work, we propose to learn video representations that generalize both to tasks that require global semantic information (e.g., classification) and to tasks that require local fine-grained spatio-temporal information (e.g., localization).

Classification Contrastive Learning +6

DOC2PPT: Automatic Presentation Slides Generation from Scientific Documents

no code implementations28 Jan 2021 Tsu-Jui Fu, William Yang Wang, Daniel McDuff, Yale Song

Creating presentation materials requires complex multimodal reasoning skills to summarize key concepts and arrange them in a logical and visually pleasing manner.

Document Summarization Multimodal Reasoning +2

ACAV100M: Automatic Curation of Large-Scale Datasets for Audio-Visual Video Representation Learning

1 code implementation ICCV 2021 Sangho Lee, Jiwan Chung, Youngjae Yu, Gunhee Kim, Thomas Breuel, Gal Chechik, Yale Song

We demonstrate that our approach finds videos with high audio-visual correspondence and show that self-supervised models trained on our data achieve competitive performances compared to models trained on existing manually curated datasets.

Representation Learning

Self-Supervised Learning of Compressed Video Representations

no code implementations ICLR 2021 Youngjae Yu, Sangho Lee, Gunhee Kim, Yale Song

We show that our approach achieves competitive performance on self-supervised learning of video representations with a considerable improvement in speed compared to the traditional methods.

Computational Efficiency Self-Supervised Learning

Parameter Efficient Multimodal Transformers for Video Representation Learning

no code implementations ICLR 2021 Sangho Lee, Youngjae Yu, Gunhee Kim, Thomas Breuel, Jan Kautz, Yale Song

The recent success of Transformers in the language domain has motivated adapting them to multimodal settings, where a new visual model is trained in tandem with an already pretrained language model.

Language Modelling Representation Learning

Learning to Transfer Visual Effects from Videos to Images

no code implementations3 Dec 2020 Christopher Thomas, Yale Song, Adriana Kovashka

We study the problem of animating images by transferring spatio-temporal visual effects (such as melting) from a collection of videos.

Optical Flow Estimation

Active Contrastive Learning of Audio-Visual Video Representations

1 code implementation ICLR 2021 Shuang Ma, Zhaoyang Zeng, Daniel McDuff, Yale Song

Contrastive learning has been shown to produce generalizable representations of audio and visual data by maximizing the lower bound on the mutual information (MI) between different views of an instance.

Contrastive Learning Representation Learning +1
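
The abstract above refers to maximizing a lower bound on mutual information between views; InfoNCE is the standard such bound, sketched below for paired audio-visual embeddings. This illustrates only the contrastive objective, not the paper's active sampling of negatives.

```python
import torch
import torch.nn.functional as F

def info_nce(z_a, z_v, temperature=0.07):
    """InfoNCE: a lower bound on MI between paired views.
    z_a, z_v: (n, d) embeddings; row i of each is a positive pair."""
    z_a = F.normalize(z_a, dim=-1)
    z_v = F.normalize(z_v, dim=-1)
    logits = z_a @ z_v.T / temperature     # (n, n) cross-view similarity matrix
    labels = torch.arange(z_a.shape[0])    # positives sit on the diagonal
    return F.cross_entropy(logits, labels)
```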

Multi-Reference Neural TTS Stylization with Adversarial Cycle Consistency

no code implementations25 Oct 2019 Matt Whitehill, Shuang Ma, Daniel McDuff, Yale Song

We use this method to transfer emotion from a dataset containing four emotions to a dataset with only a single emotion.

Emotion Classification Style Transfer

Unpaired Image-to-Speech Synthesis with Multimodal Information Bottleneck

1 code implementation ICCV 2019 Shuang Ma, Daniel McDuff, Yale Song

We propose a multimodal information bottleneck approach that learns the correspondence between modalities from unpaired data (image and speech) by leveraging the shared modality (text).

Image Generation Speech Synthesis

Image to Video Domain Adaptation Using Web Supervision

no code implementations5 Aug 2019 Andrew Kae, Yale Song

Training deep neural networks typically requires large amounts of labeled data which may be scarce or expensive to obtain for a particular target domain.

Domain Adaptation

M3D-GAN: Multi-Modal Multi-Domain Translation with Universal Attention

no code implementations9 Jul 2019 Shuang Ma, Daniel McDuff, Yale Song

Generative adversarial networks have led to significant advances in cross-modal/domain translation.

Dialogue Generation Image Captioning +5

Polysemous Visual-Semantic Embedding for Cross-Modal Retrieval

1 code implementation CVPR 2019 Yale Song, Mohammad Soleymani

In this work, we introduce Polysemous Instance Embedding Networks (PIE-Nets) that compute multiple and diverse representations of an instance by combining global context with locally-guided features via multi-head self-attention and residual learning.

Cross-Modal Retrieval Multiple Instance Learning +4
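
A hedged sketch of the ingredients named in the abstract: a global feature is combined with locally-guided features via multi-head attention and a residual connection to produce K diverse embeddings per instance. The learned queries and exact fusion below are our assumptions, not the paper's precise architecture.

```python
import torch
import torch.nn as nn

class PIESketch(nn.Module):
    """Hedged sketch of a polysemous embedding: K learned queries attend over
    local features, and each summary is fused with the global feature."""
    def __init__(self, dim, num_embeds=4):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_embeds, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, global_feat, local_feats):
        # global_feat: (B, D); local_feats: (B, L, D)
        B = global_feat.shape[0]
        q = self.queries.unsqueeze(0).expand(B, -1, -1)       # (B, K, D)
        attended, _ = self.attn(q, local_feats, local_feats)  # (B, K, D)
        # Residual: each locally-guided summary is combined with global context.
        return self.norm(global_feat.unsqueeze(1) + attended)

pie = PIESketch(dim=256, num_embeds=4)
g, loc = torch.randn(8, 256), torch.randn(8, 36, 256)
embeds = pie(g, loc)    # (8, 4, 256): K diverse embeddings per instance
```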

Neural TTS Stylization with Adversarial and Collaborative Games

no code implementations ICLR 2019 Shuang Ma, Daniel McDuff, Yale Song

The synthesized audio waveform is expected to contain the verbal content of x_txt and the auditory style of x_aud.

Disentanglement Style Transfer

Video Prediction with Appearance and Motion Conditions

no code implementations ICML 2018 Yunseok Jang, Gunhee Kim, Yale Song

Video prediction aims to generate realistic future frames by learning dynamic visual patterns.

Video Prediction

Cross-Modal Retrieval with Implicit Concept Association

no code implementations12 Apr 2018 Yale Song, Mohammad Soleymani

Traditional cross-modal retrieval assumes explicit association of concepts across modalities, where there is no ambiguity in how the concepts are linked to each other; e.g., when we run an image search with the query "dogs", we expect to see dog images.

Cross-Modal Retrieval Image Retrieval +3

Image2GIF: Generating Cinemagraphs using Recurrent Deep Q-Networks

no code implementations27 Jan 2018 Yipin Zhou, Yale Song, Tamara L. Berg

Given a still photograph, one can imagine how dynamic objects might move against a static background.

Improving Pairwise Ranking for Multi-label Image Classification

4 code implementations CVPR 2017 Yuncheng Li, Yale Song, Jiebo Luo

Pairwise ranking, in particular, has been successful in multi-label image classification, achieving state-of-the-art results on various benchmarks.

Classification General Classification +2
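
As a concrete illustration of pairwise ranking for multi-label classification, here is a sketch of a smooth log-sum-exp pairwise loss in the spirit of what the paper studies: any negative label scored above a positive label incurs a penalty. The reduction and masking details are our choices for the sketch.

```python
import torch

def pairwise_rank_loss(scores, labels):
    """Smooth pairwise ranking for multi-label classification (log-sum-exp form).
    scores: (n, c) raw logits; labels: (n, c) binary {0, 1}."""
    pos = labels.bool()
    # diff[i, v, u] = scores[i, v] - scores[i, u]
    diff = scores.unsqueeze(2) - scores.unsqueeze(1)   # (n, c, c)
    mask = (~pos).unsqueeze(2) & pos.unsqueeze(1)      # negative label v, positive label u
    per_example = torch.log1p((diff.exp() * mask).sum(dim=(1, 2)))
    return per_example.mean()
```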

Learning from Noisy Labels with Distillation

no code implementations ICCV 2017 Yuncheng Li, Jianchao Yang, Yale Song, Liangliang Cao, Jiebo Luo, Li-Jia Li

The ability to learn from noisy labels is very useful in many visual recognition tasks, as a vast amount of data with noisy labels is relatively easy to obtain.

Real-Time Video Highlights for Yahoo Esports

no code implementations27 Nov 2016 Yale Song

We present a technique for detecting highlights from live streaming videos of esports game matches.

Dota 2 Video Understanding

To Click or Not To Click: Automatic Selection of Beautiful Thumbnails from Videos

2 code implementations6 Sep 2016 Yale Song, Miriam Redi, Jordi Vallmitjana, Alejandro Jaimes

Our system selects attractive thumbnails by analyzing various visual quality and aesthetic metrics of video frames, and performs a clustering analysis to determine the relevance to video content, thus making the resulting thumbnails more representative of the video.

Multimedia
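
An illustrative pipeline in the spirit of the abstract: score frames by a simple quality proxy, cluster them to find content-representative groups, then pick the best-scoring frame from the largest cluster. Laplacian-variance sharpness, tiny-grayscale features, and KMeans here are stand-ins for the paper's actual quality/aesthetic metrics and clustering analysis.

```python
import cv2
import numpy as np
from sklearn.cluster import KMeans

def pick_thumbnail(frames, k=5):
    """frames: list of BGR images (H, W, 3). Returns one candidate thumbnail."""
    gray = [cv2.cvtColor(f, cv2.COLOR_BGR2GRAY) for f in frames]
    # Quality proxy: variance of the Laplacian (higher = sharper frame).
    sharpness = np.array([cv2.Laplacian(g, cv2.CV_64F).var() for g in gray])
    # Cheap content features for clustering: downsampled grayscale pixels.
    feats = np.stack([cv2.resize(g, (16, 16)).ravel() for g in gray]).astype(float)
    labels = KMeans(n_clusters=k, n_init=10).fit_predict(feats)
    biggest = np.bincount(labels).argmax()             # most representative content
    idx = np.where(labels == biggest)[0]
    return frames[idx[sharpness[idx].argmax()]]        # sharpest frame in that cluster
```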

Video2GIF: Automatic Generation of Animated GIFs from Video

1 code implementation CVPR 2016 Michael Gygli, Yale Song, Liangliang Cao

We introduce the novel problem of automatically generating animated GIFs from video.

Balancing Appearance and Context in Sketch Interpretation

no code implementations25 Apr 2016 Yale Song, Randall Davis, Kaichen Ma, Dana L. Penny

We describe a sketch interpretation system that detects and classifies clock numerals created by subjects taking the Clock Drawing Test, a clinical tool widely used to screen for cognitive impairments (e.g., dementia).

TGIF: A New Dataset and Benchmark on Animated GIF Description

1 code implementation CVPR 2016 Yuncheng Li, Yale Song, Liangliang Cao, Joel Tetreault, Larry Goldberg, Alejandro Jaimes, Jiebo Luo

The motivation for this work is to develop a testbed for image sequence description systems, where the task is to generate natural language descriptions for animated GIFs or video clips.

Image Captioning Machine Translation +3

Video Co-Summarization: Video Summarization by Visual Co-Occurrence

no code implementations CVPR 2015 Wen-Sheng Chu, Yale Song, Alejandro Jaimes

We present video co-summarization, a novel perspective to video summarization that exploits visual co-occurrence across multiple videos.

Video Summarization

TVSum: Summarizing Web Videos Using Titles

no code implementations CVPR 2015 Yale Song, Jordi Vallmitjana, Amanda Stent, Alejandro Jaimes

We observe that a video title is often carefully chosen to be maximally descriptive of its main topic, and hence images related to the title can serve as a proxy for important visual concepts of the main topic.

Descriptive Image Retrieval +1

Action Recognition by Hierarchical Sequence Summarization

no code implementations CVPR 2013 Yale Song, Louis-Philippe Morency, Randall Davis

We develop an efficient learning method to train our model and show that its complexity grows sublinearly with the size of the hierarchy.

Action Recognition Temporal Action Localization
