no code implementations • 19 Dec 2024 • Estelle Aflalo, Gabriela Ben Melech Stan, Tiep Le, Man Luo, Shachar Rosenman, Sayak Paul, Shao-Yen Tseng, Vasudev Lal
Large Vision Language Models (LVLMs) have achieved significant progress in integrating visual and textual inputs for multimodal reasoning.
no code implementations • 5 Dec 2024 • Donghoon Ahn, Jiwon Kang, SangHyun Lee, Jaewon Min, Minjae Kim, Wooseok Jang, Hyoungwon Cho, Sayak Paul, SeonHwa Kim, Eunju Cha, Kyong Hwan Jin, Seungryong Kim
Observing that noise obtained via diffusion inversion can reconstruct high-quality images without guidance, we focus on the initial noise of the denoising pipeline.
no code implementations • 2 Dec 2024 • Gabriela Ben-Melech Stan, Estelle Aflalo, Man Luo, Shachar Rosenman, Tiep Le, Sayak Paul, Shao-Yen Tseng, Vasudev Lal
While Large Vision Language Models (LVLMs) have become highly capable of reasoning over human prompts and visual inputs, they are still prone to producing responses that contain misinformation.
1 code implementation • 24 Aug 2024 • Chansung Park, Juyong Jiang, Fan Wang, Sayak Paul, Jing Tang
The widespread adoption of cloud-based proprietary large language models (LLMs) has introduced significant challenges, including operational dependencies, privacy concerns, and the necessity of continuous internet connectivity.
no code implementations • 10 Jun 2024 • Jiwoo Hong, Sayak Paul, Noah Lee, Kashif Rasul, James Thorne, Jongheon Jeong
In this paper, we focus on the alignment of recent text-to-image diffusion models, such as Stable Diffusion XL (SDXL), and find that this "reference mismatch" is indeed a significant problem in aligning these models due to the unstructured nature of visual modalities: e.g., a preference for a particular stylistic aspect can easily induce such a discrepancy.
1 code implementation • 1 Apr 2024 • Agneet Chatterjee, Gabriela Ben Melech Stan, Estelle Aflalo, Sayak Paul, Dhruba Ghosh, Tejas Gokhale, Ludwig Schmidt, Hannaneh Hajishirzi, Vasudev Lal, Chitta Baral, Yezhou Yang
One of the key shortcomings in current text-to-image (T2I) models is their inability to consistently generate images which faithfully follow the spatial relationships specified in the text prompt.
no code implementations • 27 Feb 2024 • Shyam Marjit, Harshit Singh, Nityanand Mathur, Sayak Paul, Chia-Mu Yu, Pin-Yu Chen
In the realm of subject-driven text-to-image (T2I) generative models, recent developments like DreamBooth and BLIP-Diffusion have led to impressive results yet encounter limitations due to their intensive fine-tuning demands and substantial parameter requirements.
1 code implementation • 10 Jan 2024 • Junsong Chen, Yue Wu, Simian Luo, Enze Xie, Sayak Paul, Ping Luo, Hang Zhao, Zhenguo Li
As a state-of-the-art, open-source image generation model, PIXART-δ offers a promising alternative to the Stable Diffusion family of models, contributing significantly to text-to-image synthesis.
1 code implementation • 5 Jan 2024 • Yatharth Gupta, Vishnu V. Jaddipal, Harish Prabhala, Sayak Paul, Patrick von Platen
In this work, we introduce two scaled-down variants, Segmind Stable Diffusion (SSD-1B) and Segmind-Vega, with 1.3B and 0.74B parameter UNets, respectively, achieved through progressive removal using layer-level losses focusing on reducing the model size while preserving generative quality.
1 code implementation • NeurIPS Workshop AI4Scien 2021 • Sayak Paul, Siddha Ganju
Floods wreak havoc throughout the world, causing billions of dollars in damages, and uprooting communities, ecosystems and economies.
1 code implementation • 17 May 2021 • Sayak Paul, Pin-Yu Chen
Transformers, composed of multiple self-attention layers, hold strong promises toward a generic learning primitive applicable to different data modalities, including the recent breakthroughs in computer vision achieving state-of-the-art (SOTA) standard accuracy.
no code implementations • 17 May 2021 • Andrey Ignatov, Grigory Malivenko, Radu Timofte, Sheng Chen, Xin Xia, Zhaoyan Liu, Yuwei Zhang, Feng Zhu, Jiashi Li, Xuefeng Xiao, Yuan Tian, Xinglong Wu, Christos Kyrkou, Yixin Chen, Zexin Zhang, Yunbo Peng, Yue Lin, Saikat Dutta, Sourya Dipta Das, Nisarg A. Shah, Himanshu Kumar, Chao Ge, Pei-Lin Wu, Jin-Hua Du, Andrew Batutin, Juan Pablo Federico, Konrad Lyda, Levon Khojoyan, Abhishek Thanki, Sayak Paul, Shahid Siddiqui
To address this problem, we introduce the first Mobile AI challenge, where the target is to develop quantized deep learning-based camera scene classification solutions that can demonstrate real-time performance on smartphones and IoT platforms.
1 code implementation • 28 Sep 2020 • Souradip Chakraborty, Aritra Roy Gosthipaty, Sayak Paul
In this work, we propose that, with the normalized temperature-scaled cross-entropy (NT-Xent) loss function (as used in SimCLR), it is beneficial to not have images of the same category in the same batch.
1 code implementation • 25 Sep 2020 • Souradip Chakraborty, Aritra Roy Gosthipaty, Sayak Paul
In this work, we propose that, with the normalized temperature-scaled cross-entropy (NT-Xent) loss function (as used in SimCLR), it is beneficial to not have images of the same category in the same batch.