no code implementations • 18 Mar 2025 • Chi Hsuan Wu, Kumar Ashutosh, Kristen Grauman
When obtaining visual illustrations from text descriptions, today's methods take a description with a single text context (a caption, or an action description) and retrieve or generate the matching visual context.
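As a rough sketch of the retrieval route this sentence describes, matching a description to a visual reduces to nearest-neighbor search in a joint text-image embedding space. The encoders are assumed (e.g., CLIP-style models) and `retrieve_illustration` is a hypothetical helper, not the paper's method.

```python
import torch
import torch.nn.functional as F

def retrieve_illustration(text_emb: torch.Tensor,
                          image_embs: torch.Tensor) -> int:
    """text_emb: (D,) description embedding; image_embs: (N, D) gallery.
    Returns the index of the best-matching image by cosine similarity."""
    sims = F.normalize(image_embs, dim=-1) @ F.normalize(text_emb, dim=-1)
    return int(sims.argmax())
```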
1 code implementation • 30 Jan 2025 • Kumar Ashutosh, Yossi Gandelsman, Xinlei Chen, Ishan Misra, Rohit Girdhar
We present MILS: Multimodal Iterative LLM Solver, a surprisingly simple, training-free approach to imbue multimodal capabilities into your favorite LLM.
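The generate-score-feedback loop behind such a training-free solver can be sketched as follows. Both models stay frozen; `llm_generate` and `clip_score` are hypothetical stand-ins for any text generator and any multimodal scorer, and the prompt format is an assumption, not the paper's exact recipe.

```python
def mils_loop(task_prompt, image, llm_generate, clip_score,
              n_candidates=8, n_steps=10):
    """Training-free generate-score-feedback loop (a sketch)."""
    feedback = ""
    best_score, best_text = float("-inf"), None
    for _ in range(n_steps):
        # 1) The frozen LLM proposes candidates, conditioned on past scores.
        candidates = [llm_generate(task_prompt + feedback)
                      for _ in range(n_candidates)]
        # 2) The frozen multimodal scorer ranks them against the image.
        scored = sorted(((clip_score(image, c), c) for c in candidates),
                        reverse=True)
        if scored[0][0] > best_score:
            best_score, best_text = scored[0]
        # 3) Verbalize the scores back into the next prompt as feedback.
        feedback = "\nPrevious attempts and scores:\n" + "\n".join(
            f"{s:.3f}: {c}" for s, c in scored)
    return best_text
```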
no code implementations • 1 Dec 2024 • Kumar Ashutosh, Georgios Pavlakos, Kristen Grauman
Anticipating how a person will interact with objects in an environment is essential for activity understanding, but existing methods are limited to the 2D space of video frames, capturing physically ungrounded predictions of 'what' and ignoring the 'where' and 'how'.
no code implementations • 1 Aug 2024 • Kumar Ashutosh, Tushar Nagarajan, Georgios Pavlakos, Kris Kitani, Kristen Grauman
Our method takes a video demonstration and its accompanying 3D body pose and generates (1) free-form expert commentary describing what the person is doing well and what they could improve, and (2) a visual expert demonstration that incorporates the required corrections.
no code implementations • CVPR 2024 • Changan Chen, Kumar Ashutosh, Rohit Girdhar, David Harwath, Kristen Grauman
We propose a novel self-supervised embedding to learn how actions sound from narrated in-the-wild egocentric videos.
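One plausible form for such an embedding is a symmetric InfoNCE objective tying together audio, video, and narration features from the same moment; this is a hedged illustration of the general recipe, not the paper's exact loss, and the equal weighting of the pairwise terms is an assumption.

```python
import torch
import torch.nn.functional as F

def infonce(a, b, temperature=0.07):
    """Symmetric InfoNCE between two batches of paired embeddings."""
    a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature          # (B, B); positives on diagonal
    targets = torch.arange(a.size(0))
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))

def av_language_loss(video_emb, audio_emb, text_emb):
    """Tie all three modalities so the space encodes how actions sound."""
    return (infonce(video_emb, audio_emb)
            + infonce(video_emb, text_emb)
            + infonce(audio_emb, text_emb))
```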
no code implementations • CVPR 2024 • Kumar Ashutosh, Zihui Xue, Tushar Nagarajan, Kristen Grauman
We introduce the video detours problem for navigating instructional videos.
no code implementations • CVPR 2024 • Zihui Xue, Kumar Ashutosh, Kristen Grauman
Object State Changes (OSCs) are pivotal for video understanding.
2 code implementations • CVPR 2024 • Kristen Grauman, Andrew Westbury, Lorenzo Torresani, Kris Kitani, Jitendra Malik, Triantafyllos Afouras, Kumar Ashutosh, Vijay Baiyya, Siddhant Bansal, Bikram Boote, Eugene Byrne, Zach Chavis, Joya Chen, Feng Cheng, Fu-Jen Chu, Sean Crane, Avijit Dasgupta, Jing Dong, Maria Escobar, Cristhian Forigua, Abrham Gebreselasie, Sanjay Haresh, Jing Huang, Md Mohaiminul Islam, Suyog Jain, Rawal Khirodkar, Devansh Kukreja, Kevin J Liang, Jia-Wei Liu, Sagnik Majumder, Yongsen Mao, Miguel Martin, Effrosyni Mavroudi, Tushar Nagarajan, Francesco Ragusa, Santhosh Kumar Ramakrishnan, Luigi Seminara, Arjun Somayazulu, Yale Song, Shan Su, Zihui Xue, Edward Zhang, Jinxu Zhang, Angela Castillo, Changan Chen, Xinzhu Fu, Ryosuke Furuta, Cristina Gonzalez, Prince Gupta, Jiabo Hu, Yifei HUANG, Yiming Huang, Weslie Khoo, Anush Kumar, Robert Kuo, Sach Lakhavani, Miao Liu, Mi Luo, Zhengyi Luo, Brighid Meredith, Austin Miller, Oluwatumininu Oguntola, Xiaqing Pan, Penny Peng, Shraman Pramanick, Merey Ramazanova, Fiona Ryan, Wei Shan, Kiran Somasundaram, Chenan Song, Audrey Southerland, Masatoshi Tateno, Huiyu Wang, Yuchen Wang, Takuma Yagi, Mingfei Yan, Xitong Yang, Zecheng Yu, Shengxin Cindy Zha, Chen Zhao, Ziwei Zhao, Zhifan Zhu, Jeff Zhuo, Pablo Arbelaez, Gedas Bertasius, David Crandall, Dima Damen, Jakob Engel, Giovanni Maria Farinella, Antonino Furnari, Bernard Ghanem, Judy Hoffman, C. V. Jawahar, Richard Newcombe, Hyun Soo Park, James M. Rehg, Yoichi Sato, Manolis Savva, Jianbo Shi, Mike Zheng Shou, Michael Wray
We present Ego-Exo4D, a diverse, large-scale multimodal multiview video dataset and benchmark challenge.
no code implementations • 5 Jan 2023 • Kumar Ashutosh, Rohit Girdhar, Lorenzo Torresani, Kristen Grauman
Narrated "how-to" videos have emerged as a promising data source for a wide range of learning problems, from learning visual representations to training robot policies.
1 code implementation • CVPR 2023 • Kumar Ashutosh, Rohit Girdhar, Lorenzo Torresani, Kristen Grauman
Video-language embeddings are a promising avenue for injecting semantics into visual representations, but existing methods capture only short-term associations between seconds-long video clips and their accompanying text.
Ranked #3 on Long Term Action Anticipation on Ego4D
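The long-term association idea can be sketched as a two-level contrastive objective: clip features align with their accompanying narrations, while an aggregate of each video's clips aligns with a long-term summary text. The mean-pooling aggregation below is an assumption for illustration, not necessarily the paper's aggregator.

```python
import torch
import torch.nn.functional as F

def nce(a, b, temperature=0.07):
    """Symmetric InfoNCE over a batch of positive pairs."""
    a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature
    targets = torch.arange(a.size(0))
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))

def hierarchical_loss(clip_feats, narr_feats, summary_text_feats,
                      clips_per_video):
    """clip_feats/narr_feats: (B*C, D) paired clip and narration embeddings;
    summary_text_feats: (B, D), one long-term summary per video."""
    child = nce(clip_feats, narr_feats)            # seconds-long associations
    video_feats = clip_feats.view(-1, clips_per_video,
                                  clip_feats.size(-1)).mean(dim=1)
    parent = nce(video_feats, summary_text_feats)  # long-term associations
    return child + parent
```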
no code implementations • 15 Oct 2022 • Ajay Jaiswal, Kumar Ashutosh, Justin F Rousseau, Yifan Peng, Zhangyang Wang, Ying Ding
Our extensive experiments on popular medical imaging classification tasks (cardiopulmonary disease and lesion classification) using real-world datasets show the performance benefit of RoS-KD, its ability to distill knowledge from several popular large networks (ResNet-50, DenseNet-121, MobileNet-V2) into a comparatively small network, and its robustness to adversarial attacks (PGD, FGSM).
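A minimal sketch of multi-teacher distillation in the spirit of RoS-KD: the small student matches the averaged, temperature-softened distribution of several large teachers. The plain averaging and the hyperparameters `T` and `alpha` are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def multi_teacher_kd_loss(student_logits, teacher_logits_list, labels,
                          T=4.0, alpha=0.5):
    """Distill a small student from several teachers; averaging the
    teachers' softened distributions gives a smoother target."""
    soft_target = torch.stack(
        [F.softmax(t / T, dim=-1) for t in teacher_logits_list]).mean(0)
    kd = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                  soft_target, reduction="batchmean") * T * T
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1 - alpha) * ce
```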
no code implementations • 3 Dec 2020 • Kumar Ashutosh, Saurabh Kumar, Subhasis Chaudhuri
We present a classification based approach for the next best view selection and show how we can plausibly obtain a supervisory signal for this task.
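The classification framing can be sketched directly: discretize the candidate viewpoints and predict one logit per view. The network below is only a generic stand-in, and the training label is assumed here to be the view that most improves some downstream metric, computed offline; how the supervisory signal is actually obtained is the paper's contribution.

```python
import torch.nn as nn

class NBVClassifier(nn.Module):
    """Next-best-view as classification over discretized viewpoints."""
    def __init__(self, feat_dim=512, n_views=16):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(feat_dim, 256), nn.ReLU(),
            nn.Linear(256, n_views))   # one logit per candidate viewpoint

    def forward(self, scene_feat):
        return self.head(scene_feat)   # argmax = predicted next best view
```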
no code implementations • 16 Sep 2020 • Kumar Ashutosh, Sarthak Consul, Bhishma Dedhia, Parthasarathi Khirwadkar, Sahil Shah, Shivaram Kalyanakrishnan
An important theoretical question is how many iterations a specified PI variant will take to terminate as a function of the number of states $n$ and the number of actions $k$ in the input MDP.
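To make the counted quantity concrete, here is a plain implementation of Howard's policy iteration (the greedy, switch-every-improvable-state variant) that returns its iteration count on a given MDP. The discount factor is an assumption added so the evaluation step is well posed; the paper studies other PI variants and settings as well.

```python
import numpy as np

def howard_pi(P, R, gamma=0.9):
    """P: (k, n, n) transition matrices; R: (n, k) rewards.
    Returns the number of iterations until the policy is stable."""
    k, n, _ = P.shape
    pi = np.zeros(n, dtype=int)
    for it in range(1, 10_000):          # PI always terminates; cap for safety
        # Policy evaluation: solve (I - gamma * P_pi) V = R_pi exactly.
        P_pi = P[pi, np.arange(n)]                      # (n, n)
        R_pi = R[np.arange(n), pi]
        V = np.linalg.solve(np.eye(n) - gamma * P_pi, R_pi)
        # Greedy improvement over all k actions in every state.
        Q = R + gamma * np.einsum("kij,j->ik", P, V)    # (n, k)
        new_pi = Q.argmax(axis=1)
        if np.array_equal(new_pi, pi):
            return it
        pi = new_pi

# Example: count iterations on a random 10-state, 3-action MDP.
rng = np.random.default_rng(0)
n, k = 10, 3
P = rng.random((k, n, n)); P /= P.sum(axis=2, keepdims=True)
R = rng.random((n, k))
print(howard_pi(P, R))
```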
1 code implementation • 22 Jun 2020 • Kumar Ashutosh, Jayakrishnan Nair, Anmol Kagrecha, Krishna Jagannathan
We study regret minimization in a stochastic multi-armed bandit setting and establish a fundamental trade-off between the regret suffered under an algorithm, and its statistical robustness.
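One way to visualize that trade-off: inflate UCB's confidence radius by an extra, slowly growing factor. Setting the exponent to zero recovers (roughly) logarithmic-regret UCB1; a positive exponent accepts slightly super-logarithmic regret in exchange for robustness to misspecified reward distributions. This is an illustration of the phenomenon, not the paper's algorithm.

```python
import math
import random

def inflated_ucb(arms, horizon, eps=0.1):
    """`arms` is a list of zero-argument callables returning rewards in [0, 1].
    The confidence radius carries an extra log(t)**eps factor controlling
    the regret-vs-robustness trade-off."""
    k = len(arms)
    counts, means = [0] * k, [0.0] * k
    for t in range(1, horizon + 1):
        if t <= k:
            a = t - 1                       # pull each arm once to start
        else:
            radius = lambda i: math.sqrt(
                2 * math.log(t) * (math.log(t) ** eps) / counts[i])
            a = max(range(k), key=lambda i: means[i] + radius(i))
        r = arms[a]()
        counts[a] += 1
        means[a] += (r - means[a]) / counts[a]   # running mean update
    return means

# Example: two Bernoulli arms with means 0.4 and 0.6.
means = inflated_ucb([lambda: float(random.random() < 0.4),
                      lambda: float(random.random() < 0.6)], horizon=5000)
```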
no code implementations • 28 Nov 2019 • Sarthak Consul, Bhishma Dedhia, Kumar Ashutosh, Parthasarathi Khirwadkar
We generalize the previous result and prove a novel exponential lower bound on the number of iterations taken by policy iteration for $N$-state, $k$-action MDPs.