1 code implementation • 17 Apr 2025 • Jang Hyun Cho, Andrea Madotto, Effrosyni Mavroudi, Triantafyllos Afouras, Tushar Nagarajan, Muhammad Maaz, Yale Song, Tengyu Ma, Shuming Hu, Suyog Jain, Miguel Martin, Huiyu Wang, Hanoona Rasheed, Peize Sun, Po-Yao Huang, Daniel Bolya, Nikhila Ravi, Shashank Jain, Tammy Stark, Shane Moon, Babak Damavandi, Vivian Lee, Andrew Westbury, Salman Khan, Philipp Krähenbühl, Piotr Dollár, Lorenzo Torresani, Kristen Grauman, Christoph Feichtenhofer
In this paper, we study building a Perception Language Model (PLM) in a fully open and reproducible framework for transparent research in image and video understanding.
2 code implementations • CVPR 2024 • Kristen Grauman, Andrew Westbury, Lorenzo Torresani, Kris Kitani, Jitendra Malik, Triantafyllos Afouras, Kumar Ashutosh, Vijay Baiyya, Siddhant Bansal, Bikram Boote, Eugene Byrne, Zach Chavis, Joya Chen, Feng Cheng, Fu-Jen Chu, Sean Crane, Avijit Dasgupta, Jing Dong, Maria Escobar, Cristhian Forigua, Abrham Gebreselasie, Sanjay Haresh, Jing Huang, Md Mohaiminul Islam, Suyog Jain, Rawal Khirodkar, Devansh Kukreja, Kevin J Liang, Jia-Wei Liu, Sagnik Majumder, Yongsen Mao, Miguel Martin, Effrosyni Mavroudi, Tushar Nagarajan, Francesco Ragusa, Santhosh Kumar Ramakrishnan, Luigi Seminara, Arjun Somayazulu, Yale Song, Shan Su, Zihui Xue, Edward Zhang, Jinxu Zhang, Angela Castillo, Changan Chen, Xinzhu Fu, Ryosuke Furuta, Cristina Gonzalez, Prince Gupta, Jiabo Hu, Yifei HUANG, Yiming Huang, Weslie Khoo, Anush Kumar, Robert Kuo, Sach Lakhavani, Miao Liu, Mi Luo, Zhengyi Luo, Brighid Meredith, Austin Miller, Oluwatumininu Oguntola, Xiaqing Pan, Penny Peng, Shraman Pramanick, Merey Ramazanova, Fiona Ryan, Wei Shan, Kiran Somasundaram, Chenan Song, Audrey Southerland, Masatoshi Tateno, Huiyu Wang, Yuchen Wang, Takuma Yagi, Mingfei Yan, Xitong Yang, Zecheng Yu, Shengxin Cindy Zha, Chen Zhao, Ziwei Zhao, Zhifan Zhu, Jeff Zhuo, Pablo Arbelaez, Gedas Bertasius, David Crandall, Dima Damen, Jakob Engel, Giovanni Maria Farinella, Antonino Furnari, Bernard Ghanem, Judy Hoffman, C. V. Jawahar, Richard Newcombe, Hyun Soo Park, James M. Rehg, Yoichi Sato, Manolis Savva, Jianbo Shi, Mike Zheng Shou, Michael Wray
We present Ego-Exo4D, a diverse, large-scale multimodal multiview video dataset and benchmark challenge.
1 code implementation • ICCV 2023 • Shraman Pramanick, Yale Song, Sayan Nag, Kevin Qinghong Lin, Hardik Shah, Mike Zheng Shou, Rama Chellappa, Pengchuan Zhang
Video-language pre-training (VLP) has become increasingly important due to its ability to generalize to various vision and language tasks.
no code implementations • 3 Feb 2023 • Zihui Xue, Yale Song, Kristen Grauman, Lorenzo Torresani
With no modification to the baseline architectures, our proposed approach achieves competitive performance on two Ego4D challenges, ranking the 1st in the talking to me challenge and the 3rd in the PNR keyframe localization challenge.
no code implementations • CVPR 2023 • Zihui Xue, Yale Song, Kristen Grauman, Lorenzo Torresani
Different video understanding tasks are typically treated in isolation, and even with distinct types of curated data (e. g., classifying sports in one dataset, tracking animals in another).
no code implementations • 11 Nov 2022 • Gabriele Prato, Yale Song, Janarthanan Rajendran, R Devon Hjelm, Neel Joshi, Sarath Chandar
We show that our method is successful at enabling vision transformers to encode the temporal component of video data.
no code implementations • 21 Oct 2022 • Mayu Otani, Yale Song, Yang Wang
With the broad growth of video capturing devices and applications on the web, it is more demanding to provide desired video content for users efficiently.
1 code implementation • 22 Jul 2022 • Yunhao Ge, Harkirat Behl, Jiashu Xu, Suriya Gunasekar, Neel Joshi, Yale Song, Xin Wang, Laurent Itti, Vibhav Vineet
However, existing approaches either require human experts to manually tune each scene property or use automatic methods that provide little to no control; this requires rendering large amounts of random data variations, which is slow and is often suboptimal for the target domain.
1 code implementation • 11 Jul 2022 • Tyler LaBonte, Yale Song, Xin Wang, Vibhav Vineet, Neel Joshi
A critical object detection task is finetuning an existing model to detect novel objects, but the standard workflow requires bounding box annotations which are time-consuming and expensive to collect.
1 code implementation • 23 Apr 2022 • Baifeng Shi, Yale Song, Neel Joshi, Trevor Darrell, Xin Wang
We present VARS, Visual Attention from Recurrent Sparse reconstruction, a new attention formulation built on two prominent features of the human visual attention mechanism: recurrency and sparsity.
no code implementations • 15 Mar 2022 • Sharath Girish, Debadeepta Dey, Neel Joshi, Vibhav Vineet, Shital Shah, Caio Cesar Teodoro Mendes, Abhinav Shrivastava, Yale Song
We conduct a large-scale study with over 100 variants of ResNet and MobileNet architectures and evaluate them across 11 downstream scenarios in the SSL setting.
1 code implementation • CVPR 2022 • Ching-Yao Chuang, R Devon Hjelm, Xin Wang, Vibhav Vineet, Neel Joshi, Antonio Torralba, Stefanie Jegelka, Yale Song
Contrastive learning relies on an assumption that positive pairs contain related views, e. g., patches of an image or co-occurring multimodal signals of a video, that share certain underlying information about an instance.
no code implementations • NeurIPS 2021 • Shuang Ma, Zhaoyang Zeng, Daniel McDuff, Yale Song
In this work, we propose to learn video representations that generalize to both the tasks which require global semantic information (e. g., classification) and the tasks that require local fine-grained spatio-temporal information (e. g., localization).
no code implementations • 25 Jun 2021 • Daniel McDuff, Yale Song, Jiyoung Lee, Vibhav Vineet, Sai Vemprala, Nicholas Gyde, Hadi Salman, Shuang Ma, Kwanghoon Sohn, Ashish Kapoor
The ability to perform causal and counterfactual reasoning are central properties of human intelligence.
1 code implementation • 7 Apr 2021 • Shuang Ma, Zhaoyang Zeng, Daniel McDuff, Yale Song
In this work, we propose to learn video representations that generalize to both the tasks which require global semantic information (e. g., classification) and the tasks that require local fine-grained spatio-temporal information (e. g., localization).
no code implementations • 28 Jan 2021 • Tsu-Jui Fu, William Yang Wang, Daniel McDuff, Yale Song
Creating presentation materials requires complex multimodal reasoning skills to summarize key concepts and arrange them in a logical and visually pleasing manner.
1 code implementation • ICCV 2021 • Sangho Lee, Jiwan Chung, Youngjae Yu, Gunhee Kim, Thomas Breuel, Gal Chechik, Yale Song
We demonstrate that our approach finds videos with high audio-visual correspondence and show that self-supervised models trained on our data achieve competitive performances compared to models trained on existing manually curated datasets.
no code implementations • 1 Jan 2021 • Shuang Ma, Zhaoyang Zeng, Daniel McDuff, Yale Song
Contrastive self-supervised learning has delivered impressive results in many audio-visual recognition tasks.
no code implementations • ICLR 2021 • Youngjae Yu, Sangho Lee, Gunhee Kim, Yale Song
We show that our approach achieves competitive performance on self-supervised learning of video representations with a considerable improvement in speed compared to the traditional methods.
no code implementations • ICLR 2021 • Sangho Lee, Youngjae Yu, Gunhee Kim, Thomas Breuel, Jan Kautz, Yale Song
The recent success of Transformers in the language domain has motivated adapting it to a multimodal setting, where a new visual model is trained in tandem with an already pretrained language model.
no code implementations • 3 Dec 2020 • Christopher Thomas, Yale Song, Adriana Kovashka
We study the problem of animating images by transferring spatio-temporal visual effects (such as melting) from a collection of videos.
1 code implementation • ICLR 2021 • Shuang Ma, Zhaoyang Zeng, Daniel McDuff, Yale Song
Contrastive learning has been shown to produce generalizable representations of audio and visual data by maximizing the lower bound on the mutual information (MI) between different views of an instance.
no code implementations • 25 Oct 2019 • Matt Whitehill, Shuang Ma, Daniel McDuff, Yale Song
We use this method to transfer emotion from a dataset containing four emotions to a dataset with only a single emotion.
1 code implementation • ICCV 2019 • Shuang Ma, Daniel McDuff, Yale Song
We propose a multimodal information bottleneck approach that learns the correspondence between modalities from unpaired data (image and speech) by leveraging the shared modality (text).
no code implementations • 5 Aug 2019 • Andrew Kae, Yale Song
Training deep neural networks typically requires large amounts of labeled data which may be scarce or expensive to obtain for a particular target domain.
no code implementations • 9 Jul 2019 • Shuang Ma, Daniel McDuff, Yale Song
Generative adversarial networks have led to significant advances in cross-modal/domain translation.
1 code implementation • CVPR 2019 • Yale Song, Mohammad Soleymani
In this work, we introduce Polysemous Instance Embedding Networks (PIE-Nets) that compute multiple and diverse representations of an instance by combining global context with locally-guided features via multi-head self-attention and residual learning.
Ranked #34 on
Cross-Modal Retrieval
on COCO 2014
1 code implementation • NeurIPS 2019 • Daniel McDuff, Shuang Ma, Yale Song, Ashish Kapoor
Models that are learned from real-world data are often biased because the data used to train them is biased.
no code implementations • ICLR 2019 • Shuang Ma, Daniel McDuff, Yale Song
The synthesized audio waveform is expected to contain the verbal content of x_txt and the auditory style of x_aud.
no code implementations • ICML 2018 • Yunseok Jang, Gunhee Kim, Yale Song
Video prediction aims to generate realistic future frames by learning dynamic visual patterns.
no code implementations • 12 Apr 2018 • Yale Song, Mohammad Soleymani
Traditional cross-modal retrieval assumes explicit association of concepts across modalities, where there is no ambiguity in how the concepts are linked to each other, e. g., when we do the image search with a query "dogs", we expect to see dog images.
no code implementations • 27 Jan 2018 • Yipin Zhou, Yale Song, Tamara L. Berg
Given a still photograph, one can imagine how dynamic objects might move against a static background.
no code implementations • 23 Aug 2017 • Haojian Jin, Yale Song, Koji Yatani
Video consumption is being shifted from sit-and-watch to selective skimming.
2 code implementations • CVPR 2017 • Yunseok Jang, Yale Song, Youngjae Yu, Youngjin Kim, Gunhee Kim
In this paper, we focus on extending VQA to the video domain and contribute to the literature in three important ways.
Ranked #33 on
Visual Question Answering (VQA)
on MSRVTT-QA
4 code implementations • CVPR 2017 • Yuncheng Li, Yale Song, Jiebo Luo
Pairwise ranking, in particular, has been successful in multi-label image classification, achieving state-of-the-art results on various benchmarks.
no code implementations • ICCV 2017 • Yuncheng Li, Jianchao Yang, Yale Song, Liangliang Cao, Jiebo Luo, Li-Jia Li
The ability of learning from noisy labels is very useful in many visual recognition tasks, as a vast amount of data with noisy labels are relatively easy to obtain.
no code implementations • 27 Nov 2016 • Yale Song
We present a technique for detecting highlights from live streaming videos of esports game matches.
2 code implementations • 6 Sep 2016 • Yale Song, Miriam Redi, Jordi Vallmitjana, Alejandro Jaimes
Our system selects attractive thumbnails by analyzing various visual quality and aesthetic metrics of video frames, and performs a clustering analysis to determine the relevance to video content, thus making the resulting thumbnails more representative of the video.
Multimedia
1 code implementation • CVPR 2016 • Michael Gygli, Yale Song, Liangliang Cao
We introduce the novel problem of automatically generating animated GIFs from video.
no code implementations • 25 Apr 2016 • Yale Song, Randall Davis, Kaichen Ma, Dana L. Penny
We describe a sketch interpretation system that detects and classifies clock numerals created by subjects taking the Clock Drawing Test, a clinical tool widely used to screen for cognitive impairments (e. g., dementia).
1 code implementation • CVPR 2016 • Yuncheng Li, Yale Song, Liangliang Cao, Joel Tetreault, Larry Goldberg, Alejandro Jaimes, Jiebo Luo
The motivation for this work is to develop a testbed for image sequence description systems, where the task is to generate natural language descriptions for animated GIFs or video clips.
no code implementations • CVPR 2015 • Yale Song, Jordi Vallmitjana, Amanda Stent, Alejandro Jaimes
We observe that a video title is often carefully chosen to be maximally descriptive of its main topic, and hence images related to the title can serve as a proxy for important visual concepts of the main topic.
no code implementations • CVPR 2015 • Wen-Sheng Chu, Yale Song, Alejandro Jaimes
We present video co-summarization, a novel perspective to video summarization that exploits visual co-occurrence across multiple videos.
no code implementations • CVPR 2013 • Yale Song, Louis-Philippe Morency, Randall Davis
We develop an efficient learning method to train our model and show that its complexity grows sublinearly with the size of the hierarchy.