no code implementations • 31 Jan 2025 • Zhengqi Gao, Kaiwen Zha, Tianyuan Zhang, Zihui Xue, Duane S. Boning
Experiments in 1D and 2D settings demonstrate that REG provides a better approximation to the optimal solution than prior guidance techniques, validating the proposed theoretical framework.
no code implementations • 3 Dec 2024 • Zihui Xue, Joungbin An, Xitong Yang, Kristen Grauman
The results demonstrate that ProgressCaptioner significantly surpasses leading captioning models, producing precise captions that accurately capture action progression and set a new standard for temporal precision in video captioning.
no code implementations • 13 Jun 2024 • Changan Chen, Puyuan Peng, Ami Baid, Zihui Xue, Wei-Ning Hsu, David Harwath, Kristen Grauman
We propose a novel ambient-aware audio generation model, AV-LDM.
no code implementations • 11 Jun 2024 • Zihui Xue, Mi Luo, Changan Chen, Kristen Grauman
We study the problem of precisely swapping objects in videos, with a focus on objects manipulated by hands, given one user-provided reference object image.
no code implementations • 11 Mar 2024 • Mi Luo, Zihui Xue, Alex Dimakis, Kristen Grauman
We investigate exocentric-to-egocentric cross-view translation, which aims to generate a first-person (egocentric) view of an actor based on a video recording that captures the actor from a third-person (exocentric) perspective.
no code implementations • CVPR 2024 • Kumar Ashutosh, Zihui Xue, Tushar Nagarajan, Kristen Grauman
We introduce the video detours problem for navigating instructional videos.
no code implementations • CVPR 2024 • Zihui Xue, Kumar Ashutosh, Kristen Grauman
Object State Changes (OSCs) are pivotal for video understanding.
2 code implementations • CVPR 2024 • Kristen Grauman, Andrew Westbury, Lorenzo Torresani, Kris Kitani, Jitendra Malik, Triantafyllos Afouras, Kumar Ashutosh, Vijay Baiyya, Siddhant Bansal, Bikram Boote, Eugene Byrne, Zach Chavis, Joya Chen, Feng Cheng, Fu-Jen Chu, Sean Crane, Avijit Dasgupta, Jing Dong, Maria Escobar, Cristhian Forigua, Abrham Gebreselasie, Sanjay Haresh, Jing Huang, Md Mohaiminul Islam, Suyog Jain, Rawal Khirodkar, Devansh Kukreja, Kevin J Liang, Jia-Wei Liu, Sagnik Majumder, Yongsen Mao, Miguel Martin, Effrosyni Mavroudi, Tushar Nagarajan, Francesco Ragusa, Santhosh Kumar Ramakrishnan, Luigi Seminara, Arjun Somayazulu, Yale Song, Shan Su, Zihui Xue, Edward Zhang, Jinxu Zhang, Angela Castillo, Changan Chen, Xinzhu Fu, Ryosuke Furuta, Cristina Gonzalez, Prince Gupta, Jiabo Hu, Yifei HUANG, Yiming Huang, Weslie Khoo, Anush Kumar, Robert Kuo, Sach Lakhavani, Miao Liu, Mi Luo, Zhengyi Luo, Brighid Meredith, Austin Miller, Oluwatumininu Oguntola, Xiaqing Pan, Penny Peng, Shraman Pramanick, Merey Ramazanova, Fiona Ryan, Wei Shan, Kiran Somasundaram, Chenan Song, Audrey Southerland, Masatoshi Tateno, Huiyu Wang, Yuchen Wang, Takuma Yagi, Mingfei Yan, Xitong Yang, Zecheng Yu, Shengxin Cindy Zha, Chen Zhao, Ziwei Zhao, Zhifan Zhu, Jeff Zhuo, Pablo Arbelaez, Gedas Bertasius, David Crandall, Dima Damen, Jakob Engel, Giovanni Maria Farinella, Antonino Furnari, Bernard Ghanem, Judy Hoffman, C. V. Jawahar, Richard Newcombe, Hyun Soo Park, James M. Rehg, Yoichi Sato, Manolis Savva, Jianbo Shi, Mike Zheng Shou, Michael Wray
We present Ego-Exo4D, a diverse, large-scale multimodal multiview video dataset and benchmark challenge.
no code implementations • 3 Feb 2023 • Zihui Xue, Yale Song, Kristen Grauman, Lorenzo Torresani
With no modification to the baseline architectures, our proposed approach achieves competitive performance on two Ego4D challenges, ranking 1st in the Talking-to-Me challenge and 3rd in the PNR keyframe localization challenge.
no code implementations • CVPR 2023 • Zihui Xue, Yale Song, Kristen Grauman, Lorenzo Torresani
Different video understanding tasks are typically treated in isolation, and even with distinct types of curated data (e.g., classifying sports in one dataset, tracking animals in another).
1 code implementation • 13 Jun 2022 • Zihui Xue, Zhengqi Gao, Sucheng Ren, Hang Zhao
Crossmodal knowledge distillation (KD) extends traditional knowledge distillation to the area of multimodal learning and demonstrates great success in various applications.
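The crossmodal setting can be sketched as standard knowledge distillation where the teacher and student consume different modalities and only softened logits cross the modality boundary. A minimal sketch (the temperature value and modality names are illustrative, not from the paper):

```python
import numpy as np

def softmax(z, T=1.0):
    """Temperature-scaled softmax over the last axis."""
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def crossmodal_kd_loss(teacher_logits, student_logits, T=2.0):
    """KL(teacher || student) on softened class distributions.

    The teacher sees one modality (e.g. audio) and the student another
    (e.g. video); knowledge transfers only through the soft targets.
    The T*T factor keeps gradient magnitudes comparable across temperatures.
    """
    p = softmax(teacher_logits, T)   # teacher soft targets
    q = softmax(student_logits, T)   # student predictions
    return float((p * (np.log(p) - np.log(q))).sum(axis=-1).mean() * T * T)
```

The loss is zero when the student reproduces the teacher's logits exactly and positive otherwise.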
no code implementations • 5 Apr 2022 • Zhengqi Gao, Sucheng Ren, Zihui Xue, Siting Li, Hang Zhao
Multimodal fusion has emerged as an appealing technique for improving model performance on many tasks.
1 code implementation • 31 Mar 2022 • Zihui Xue, Radu Marculescu
In this work, we propose dynamic multimodal fusion (DynMM), a new approach that adaptively fuses multimodal data and generates data-dependent forward paths during inference.
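The data-dependent forward path can be illustrated with a tiny gating network that scores candidate fusion branches per sample and takes a hard decision at inference, so easy inputs can skip the expensive fused path. A minimal sketch with two hypothetical branches (a cheap unimodal path and a full fusion path; the branch designs are assumptions for illustration):

```python
import numpy as np

def gate(x_a, x_b, W):
    """Tiny gating network: softmax scores over fusion branches."""
    z = W @ np.concatenate([x_a, x_b])
    e = np.exp(z - z.max())
    return e / e.sum()

def dynamic_fuse(x_a, x_b, W):
    """Per-sample fusion path selection (hard decision at inference).

    Branch 0: cheap path, use modality A alone.
    Branch 1: full path, average the two modality features.
    """
    probs = gate(x_a, x_b, W)
    branch = int(np.argmax(probs))
    if branch == 0:
        return x_a, branch
    return (x_a + x_b) / 2.0, branch
```

During training the hard argmax would be relaxed (e.g. with Gumbel-softmax) so the gate remains differentiable; that detail is omitted here.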
Ranked #49 on Semantic Segmentation on NYU Depth v2
no code implementations • 31 Jan 2022 • Zihui Xue, Yuedong Yang, Mengtian Yang, Radu Marculescu
Graph Neural Networks (GNNs) have demonstrated great potential in a variety of graph-based applications, such as recommender systems, drug discovery, and object recognition.
no code implementations • CVPR 2022 • Sucheng Ren, Zhengqi Gao, Tianyu Hua, Zihui Xue, Yonglong Tian, Shengfeng He, Hang Zhao
Transformers have recently been adapted from the natural language processing community as a promising substitute for convolution-based neural networks in visual learning tasks.
no code implementations • NeurIPS 2021 • Yu Huang, Chenzhuang Du, Zihui Xue, Xuanyao Chen, Hang Zhao, Longbo Huang
The world provides us with data of multiple modalities.
1 code implementation • ICCV 2021 • Tianyu Hua, Wenxiao Wang, Zihui Xue, Sucheng Ren, Yue Wang, Hang Zhao
In self-supervised representation learning, most state-of-the-art approaches share a common idea: enforce the robustness of the learned representations to predefined augmentations.
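The invariance idea these approaches share reduces to pulling together the embeddings of two augmented views of the same input. A minimal sketch of that objective as a negative cosine similarity (the function name is illustrative, not the paper's):

```python
import numpy as np

def l2_normalize(v):
    """Project a vector onto the unit sphere."""
    return v / np.linalg.norm(v)

def invariance_loss(z1, z2):
    """Negative cosine similarity between embeddings of two augmented views.

    Minimizing this pulls representations of the same image under
    different predefined augmentations toward each other; it reaches
    its minimum of -1 when the two embeddings point the same way.
    """
    return float(-l2_normalize(z1) @ l2_normalize(z2))
```

Methods in this family differ mainly in how they avoid the trivial collapsed solution (negatives, stop-gradient, predictors), which this sketch omits.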
1 code implementation • ICCV 2021 • Zihui Xue, Sucheng Ren, Zhengqi Gao, Hang Zhao
The popularity of multimodal sensors and the accessibility of the Internet have brought us a massive amount of unlabeled multimodal data.
Ranked #69 on Semantic Segmentation on NYU Depth v2