no code implementations • 30 Jan 2025 • Yinbo Chen, Rohit Girdhar, Xiaolong Wang, Sai Saketh Rambhatla, Ishan Misra
Our key insight is that a single learning objective, diffusion L2 loss, can be used for training scalable image tokenizers.
2 code implementations • 17 Oct 2024 • Adam Polyak, Amit Zohar, Andrew Brown, Andros Tjandra, Animesh Sinha, Ann Lee, Apoorv Vyas, Bowen Shi, Chih-Yao Ma, Ching-Yao Chuang, David Yan, Dhruv Choudhary, Dingkang Wang, Geet Sethi, Guan Pang, Haoyu Ma, Ishan Misra, Ji Hou, Jialiang Wang, Kiran Jagadeesh, Kunpeng Li, Luxin Zhang, Mannat Singh, Mary Williamson, Matt Le, Matthew Yu, Mitesh Kumar Singh, Peizhao Zhang, Peter Vajda, Quentin Duval, Rohit Girdhar, Roshan Sumbaly, Sai Saketh Rambhatla, Sam Tsai, Samaneh Azadi, Samyak Datta, Sanyuan Chen, Sean Bell, Sharadh Ramaswamy, Shelly Sheynin, Siddharth Bhattacharya, Simran Motwani, Tao Xu, Tianhe Li, Tingbo Hou, Wei-Ning Hsu, Xi Yin, Xiaoliang Dai, Yaniv Taigman, Yaqiao Luo, Yen-Cheng Liu, Yi-Chiao Wu, Yue Zhao, Yuval Kirstain, Zecheng He, Zijian He, Albert Pumarola, Ali Thabet, Artsiom Sanakoyeu, Arun Mallya, Baishan Guo, Boris Araya, Breena Kerr, Carleigh Wood, Ce Liu, Cen Peng, Dimitry Vengertsev, Edgar Schonfeld, Elliot Blanchard, Felix Juefei-Xu, Fraylie Nord, Jeff Liang, John Hoffman, Jonas Kohler, Kaolin Fire, Karthik Sivakumar, Lawrence Chen, Licheng Yu, Luya Gao, Markos Georgopoulos, Rashel Moritz, Sara K. Sampson, Shikai Li, Simone Parmeggiani, Steve Fine, Tara Fowler, Vladan Petrovic, Yuming Du
Our models set a new state of the art on multiple tasks: text-to-video synthesis, video personalization, video editing, video-to-audio generation, and text-to-audio generation.
no code implementations • 25 Jul 2024 • Pulkit Kumar, Namitha Padmanabhan, Luke Luo, Sai Saketh Rambhatla, Abhinav Shrivastava
We propose a simple yet effective approach for few-shot action recognition, emphasizing the disentanglement of motion and appearance representations.
no code implementations • 11 Jun 2024 • Shuaiyi Huang, Saksham Suri, Kamal Gupta, Sai Saketh Rambhatla, Ser-Nam Lim, Abhinav Shrivastava
To improve the quality of video instance segmentation (VIS) predictions in the unsupervised setup, we introduce a dual-memory design.
1 code implementation • CVPR 2024 • Xudong Wang, Trevor Darrell, Sai Saketh Rambhatla, Rohit Girdhar, Ishan Misra
Text-to-image diffusion models produce high-quality images but do not offer control over individual instances in the image.
Ranked #3 on Conditional Text-to-Image Synthesis on COCO-MIG
no code implementations • 17 Nov 2023 • Rohit Girdhar, Mannat Singh, Andrew Brown, Quentin Duval, Samaneh Azadi, Sai Saketh Rambhatla, Akbar Shah, Xi Yin, Devi Parikh, Ishan Misra
We present Emu Video, a text-to-video generation model that factorizes the generation into two steps: first generating an image conditioned on the text, and then generating a video conditioned on the text and the generated image.
no code implementations • 17 Nov 2023 • Sai Saketh Rambhatla, Ishan Misra
We present an automated way to evaluate the text alignment of text-to-image generative diffusion models using standard image-text recognition datasets.
Ranked #62 on Visual Reasoning on Winoground
no code implementations • ICCV 2023 • Sai Saketh Rambhatla, Ishan Misra, Rama Chellappa, Abhinav Shrivastava
In this work, we present Multiple Object localization with Self-supervised Transformers (MOST), which uses features of transformers trained using self-supervised learning to localize multiple objects in real-world images.
no code implementations • ICCV 2023 • Saksham Suri, Sai Saketh Rambhatla, Rama Chellappa, Abhinav Shrivastava
On average, we improve by $2.6$, $3.9$ and $9.6$ mAP over previous state-of-the-art methods on three splits of increasing sparsity on COCO.
no code implementations • 26 Oct 2021 • Steven Schwarcz, Sai Saketh Rambhatla, Rama Chellappa
This architecture, which we call a Self-Denoising Neural Network (SDNN), can be applied easily to most modern convolutional neural architectures, and can be used as a supplement to many existing few-shot learning techniques.
no code implementations • 28 Jul 2021 • Sai Saketh Rambhatla, Michael Jones, Rama Chellappa
Boosting is a method for finding a highly accurate hypothesis by linearly combining many "weak" hypotheses, each of which may be only moderately accurate.
no code implementations • ICCV 2021 • Sai Saketh Rambhatla, Rama Chellappa, Abhinav Shrivastava
We tackle object category discovery, which is the problem of discovering and localizing novel objects in a large unlabeled dataset.
no code implementations • 9 Apr 2020 • Ankan Bansal, Sai Saketh Rambhatla, Abhinav Shrivastava, Rama Chellappa
The proposed method consists of a layout module which primes a visual module to predict the type of interaction between a human and an object.
1 code implementation • ICCV 2019 • Pirazh Khorramshahi, Amit Kumar, Neehar Peri, Sai Saketh Rambhatla, Jun-Cheng Chen, Rama Chellappa
In this paper, we present a novel dual-path adaptive attention model for vehicle re-identification (AAVER).
Vehicle Key-Point and Orientation Estimation
Vehicle Re-Identification
no code implementations • 5 Apr 2019 • Ankan Bansal, Sai Saketh Rambhatla, Abhinav Shrivastava, Rama Chellappa
We present an approach for detecting human-object interactions (HOIs) in images, based on the idea that humans interact with functionally similar objects in a similar manner.