no code implementations • 29 Sep 2024 • Zechen Bai, Tong He, Haiyang Mei, Pichao Wang, Ziteng Gao, Joya Chen, Lei Liu, Zheng Zhang, Mike Zheng Shou
We introduce VideoLISA, a video-based multimodal large language model designed to tackle the problem of language-instructed reasoning segmentation in videos.
no code implementations • 29 Aug 2024 • Shiwei Wu, Joya Chen, Kevin Qinghong Lin, Qimeng Wang, Yan Gao, Qianli Xu, Tong Xu, Yao Hu, Enhong Chen, Mike Zheng Shou
Our method, VideoLLM-MoD, is inspired by mixture-of-depths LLMs and addresses the challenge of numerous vision tokens in long-term or streaming video.
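The underlying mixture-of-depths pattern can be sketched as a learned router that lets only a fraction of vision tokens through each expensive layer, while the rest skip it via the residual path. Below is a minimal sketch of that generic pattern (class and parameter names are ours; the paper's actual design may differ):

```python
import torch
import torch.nn as nn

class MixtureOfDepthsLayer(nn.Module):
    """Generic mixture-of-depths routing: a learned router picks the
    top-k tokens to pass through the expensive block; all other tokens
    skip it via the residual path. Illustrative sketch only, not
    VideoLLM-MoD's exact design."""

    def __init__(self, block: nn.Module, dim: int, capacity: float = 0.25):
        super().__init__()
        self.block = block               # e.g., a transformer layer
        self.router = nn.Linear(dim, 1)  # per-token routing score
        self.capacity = capacity         # fraction of tokens to process

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, n, d = x.shape                              # (batch, tokens, dim)
        k = max(1, int(n * self.capacity))
        scores = self.router(x).squeeze(-1)            # (b, n)
        topk = scores.topk(k, dim=1).indices           # (b, k)
        idx = topk.unsqueeze(-1).expand(-1, -1, d)     # (b, k, d)
        selected = x.gather(1, idx)
        # Gate by the router score so routing receives gradient.
        gate = torch.sigmoid(scores.gather(1, topk)).unsqueeze(-1)
        updated = selected + gate * self.block(selected)   # residual update
        return x.scatter(1, idx, updated)              # others pass unchanged
```

With `capacity=0.25`, roughly three quarters of the vision tokens bypass the block at that layer, cutting its attention and MLP cost accordingly.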
1 code implementation • 31 Jul 2024 • Kevin Qinghong Lin, Pengchuan Zhang, Difei Gao, Xide Xia, Joya Chen, Ziteng Gao, Jinheng Xie, Xuhong Xiao, Mike Zheng Shou
In this paper, we introduce MovieSeq, a multimodal language model developed to address the wide range of challenges in understanding video contexts.
no code implementations • CVPR 2024 • Joya Chen, Zhaoyang Lv, Shiwei Wu, Kevin Qinghong Lin, Chenan Song, Difei Gao, Jia-Wei Liu, Ziteng Gao, Dongxing Mao, Mike Zheng Shou
Recent Large Language Models have been enhanced with vision capabilities, enabling them to comprehend images, videos, and interleaved vision-language content.
no code implementations • 12 Jun 2024 • Shiwei Wu, Chao Zhang, Joya Chen, Tong Xu, Likang Wu, Yao Hu, Enhong Chen
People's social relationships are often manifested through their surroundings, with certain objects or interactions acting as symbols for specific relationships, e.g., wedding rings, roses, hugs, or holding hands.
1 code implementation • CVPR 2024 • Ziteng Gao, Zhan Tong, Kevin Qinghong Lin, Joya Chen, Mike Zheng Shou
In this paper, we propose to bootstrap SparseFormers from ViT-based vision foundation models in a simple and efficient way.
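One plausible reading of the bootstrapping recipe: a small set of learned latent tokens cross-attends into the frozen foundation model's features, and the resulting sparse representation is aligned with the teacher's own embedding. A hedged sketch under that assumption (module names and the distillation loss are ours, not the paper's):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentTokenReader(nn.Module):
    """A few learned latent tokens cross-attend to frozen ViT features.
    Hypothetical sketch of the bootstrapping idea, not the paper's
    actual architecture."""

    def __init__(self, dim: int = 768, num_latents: int = 49):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_latents, dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=8,
                                                batch_first=True)
        self.proj = nn.Linear(dim, dim)

    def forward(self, teacher_feats: torch.Tensor) -> torch.Tensor:
        # teacher_feats: (batch, num_patches, dim) from a frozen ViT
        q = self.latents.unsqueeze(0).expand(teacher_feats.size(0), -1, -1)
        out, _ = self.cross_attn(q, teacher_feats, teacher_feats)
        return self.proj(out).mean(dim=1)  # pooled sparse embedding

def distill_loss(student_emb, teacher_cls):
    # Align the sparse embedding with the frozen teacher's [CLS] output.
    return 1 - F.cosine_similarity(student_emb, teacher_cls, dim=-1).mean()
```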
2 code implementations • CVPR 2024 • Kristen Grauman, Andrew Westbury, Lorenzo Torresani, Kris Kitani, Jitendra Malik, Triantafyllos Afouras, Kumar Ashutosh, Vijay Baiyya, Siddhant Bansal, Bikram Boote, Eugene Byrne, Zach Chavis, Joya Chen, Feng Cheng, Fu-Jen Chu, Sean Crane, Avijit Dasgupta, Jing Dong, Maria Escobar, Cristhian Forigua, Abrham Gebreselasie, Sanjay Haresh, Jing Huang, Md Mohaiminul Islam, Suyog Jain, Rawal Khirodkar, Devansh Kukreja, Kevin J Liang, Jia-Wei Liu, Sagnik Majumder, Yongsen Mao, Miguel Martin, Effrosyni Mavroudi, Tushar Nagarajan, Francesco Ragusa, Santhosh Kumar Ramakrishnan, Luigi Seminara, Arjun Somayazulu, Yale Song, Shan Su, Zihui Xue, Edward Zhang, Jinxu Zhang, Angela Castillo, Changan Chen, Xinzhu Fu, Ryosuke Furuta, Cristina Gonzalez, Prince Gupta, Jiabo Hu, Yifei HUANG, Yiming Huang, Weslie Khoo, Anush Kumar, Robert Kuo, Sach Lakhavani, Miao Liu, Mi Luo, Zhengyi Luo, Brighid Meredith, Austin Miller, Oluwatumininu Oguntola, Xiaqing Pan, Penny Peng, Shraman Pramanick, Merey Ramazanova, Fiona Ryan, Wei Shan, Kiran Somasundaram, Chenan Song, Audrey Southerland, Masatoshi Tateno, Huiyu Wang, Yuchen Wang, Takuma Yagi, Mingfei Yan, Xitong Yang, Zecheng Yu, Shengxin Cindy Zha, Chen Zhao, Ziwei Zhao, Zhifan Zhu, Jeff Zhuo, Pablo Arbelaez, Gedas Bertasius, David Crandall, Dima Damen, Jakob Engel, Giovanni Maria Farinella, Antonino Furnari, Bernard Ghanem, Judy Hoffman, C. V. Jawahar, Richard Newcombe, Hyun Soo Park, James M. Rehg, Yoichi Sato, Manolis Savva, Jianbo Shi, Mike Zheng Shou, Michael Wray
We present Ego-Exo4D, a diverse, large-scale multimodal multiview video dataset and benchmark challenge.
1 code implementation • ICCV 2023 • Kevin Qinghong Lin, Pengchuan Zhang, Joya Chen, Shraman Pramanick, Difei Gao, Alex Jinpeng Wang, Rui Yan, Mike Zheng Shou
Most video temporal grounding (VTG) methods develop task-specific models trained with type-specific labels, such as moment retrieval (time interval) and highlight detection (worthiness curve), which limits their ability to generalize across VTG tasks and labels.
Ranked #6 on Natural Language Moment Retrieval on TACoS
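One way to picture such unification is a single per-clip target that subsumes both label types: a retrieved moment becomes boundary offsets on foreground clips, and a worthiness curve becomes per-clip saliency scores. A hypothetical schema (field names are ours, not the paper's):

```python
from dataclasses import dataclass

@dataclass
class ClipTarget:
    """Unified per-clip target covering interval- and curve-style VTG
    labels. Illustrative schema, not the paper's exact format."""
    is_foreground: bool   # does this clip fall inside the queried moment?
    offset_start: float   # seconds from clip center to moment start
    offset_end: float     # seconds from clip center to moment end
    saliency: float       # query relevance in [0, 1] ("worthiness")

# A moment-retrieval interval and a highlight curve both reduce to a
# sequence of ClipTarget, so one model head can train on either label type.
targets = [ClipTarget(True, -1.5, 2.0, 0.8), ClipTarget(False, 0.0, 0.0, 0.1)]
```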
no code implementations • 14 Jun 2023 • Difei Gao, Lei Ji, Luowei Zhou, Kevin Qinghong Lin, Joya Chen, Zihan Fan, Mike Zheng Shou
2) Flexible inputs and intermediate results.
1 code implementation • CVPR 2023 • Joya Chen, Difei Gao, Kevin Qinghong Lin, Mike Zheng Shou
Humans excel at learning from expert demonstrations and solving their own problems.
Ranked #1 on Video-to-image Affordance Grounding on EPIC-Hotspot
4 code implementations • 8 Mar 2022 • Benita Wong, Joya Chen, You Wu, Stan Weixian Lei, Dongxing Mao, Difei Gao, Mike Zheng Shou
In this paper, we define a new task called Affordance-centric Question-driven Task Completion, where the AI assistant should learn from instructional videos to provide step-by-step help in the user's view.
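Concretely, a training sample for such a task might pair an instructional video with a question asked from the user's current view and step-wise answers grounded to interface regions. The layout below is purely illustrative; the benchmark's actual schema likely differs:

```python
# Hypothetical sample layout for affordance-centric, question-driven
# task completion (field names are ours, not the benchmark's schema).
sample = {
    "instructional_video": "how_to_use_microwave.mp4",
    "user_view_image": "user_kitchen.jpg",
    "question": "How do I defrost chicken?",
    "answer_steps": [
        {"text": "Press the defrost button",
         "button_box": [412, 310, 448, 344]},   # region in the user's view
        {"text": "Set the weight, then press start",
         "button_box": [412, 360, 448, 394]},
    ],
}
```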
1 code implementation • 28 Feb 2022 • Joya Chen, Kai Xu, Yuhui Wang, Yifei Cheng, Angela Yao
A standard hardware bottleneck when training deep neural networks is GPU memory.
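That bottleneck is dominated by intermediate activations cached for the backward pass. For background (this is the standard gradient-checkpointing remedy, not necessarily this paper's technique), PyTorch can recompute activations instead of storing them:

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint_sequential

# Gradient checkpointing trades compute for memory: store activations
# only at segment boundaries and rebuild the rest during backward.
model = nn.Sequential(*[nn.Sequential(nn.Linear(4096, 4096), nn.ReLU())
                        for _ in range(16)])
x = torch.randn(32, 4096, requires_grad=True)
out = checkpoint_sequential(model, 4, x, use_reentrant=False)
out.sum().backward()  # each segment's activations are recomputed on the fly
```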
no code implementations • 16 Jun 2020 • Joya Chen, Qi Wu, Dong Liu, Tong Xu
Recent years have witnessed remarkable progress in deep learning techniques for object detection, a fundamental yet challenging problem in computer vision.
1 code implementation • 27 Oct 2019 • Xianfeng Liang, Likang Wu, Joya Chen, Yang Liu, Runlong Yu, Min Hou, Han Wu, Yuyang Ye, Qi Liu, Enhong Chen
Traffic congestion in modern cities has recently become a growing concern for residents.
13 code implementations • 11 Sep 2019 • Joya Chen, Dong Liu, Tong Xu, Shiwei Wu, Yifei Cheng, Enhong Chen
In this paper, we challenge the necessity of hard/soft sampling heuristics (e.g., OHEM, focal loss) for training accurate deep object detectors.
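A sampling-free setup can be sketched as plain binary cross-entropy over all anchors, stabilized by initializing the classifier's bias to the foreground prior, the same trick RetinaNet uses to start training with sensible background predictions. Illustrative only; the paper's exact recipe may differ:

```python
import math
import torch.nn as nn
import torch.nn.functional as F

# Classify *all* anchors with plain BCE: no hard mining, no focal-style
# down-weighting. Biasing the head toward the foreground prior pi keeps
# the all-anchor loss from being swamped by background early in training.
num_anchors, num_classes, pi = 9, 80, 0.01
cls_head = nn.Conv2d(256, num_anchors * num_classes, kernel_size=3, padding=1)
nn.init.constant_(cls_head.bias, -math.log((1 - pi) / pi))

def sampling_free_cls_loss(logits, targets, num_foreground):
    # Plain BCE over every anchor, normalized by the foreground count.
    loss = F.binary_cross_entropy_with_logits(logits, targets, reduction="sum")
    return loss / max(num_foreground, 1)
```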
no code implementations • 24 Aug 2019 • Joya Chen, Dong Liu, Bin Luo, Xuezheng Peng, Tong Xu, Enhong Chen
For a long time, object detectors have suffered from an extreme imbalance between foreground and background examples.
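For context, the canonical earlier remedy in this line of work is the focal loss, which reshapes cross-entropy so that easy (overwhelmingly background) examples contribute little. Shown here as background, not as this paper's own method:

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Focal loss (Lin et al., 2017): down-weight easy examples so the
    rare foregrounds dominate the gradient. Simplified with a plain
    mean reduction for illustration."""
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p = torch.sigmoid(logits)
    p_t = p * targets + (1 - p) * (1 - targets)        # prob. of true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()
```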