1 code implementation • 12 Jun 2024 • Xuehai He, Weixi Feng, Kaizhi Zheng, Yujie Lu, Wanrong Zhu, Jiachen Li, Yue Fan, Jianfeng Wang, Linjie Li, Zhengyuan Yang, Kevin Lin, William Yang Wang, Lijuan Wang, Xin Eric Wang
Multimodal Large Language Models (MLLMs) demonstrate the emerging abilities of "world models" -- interpreting and reasoning about complex real-world dynamics.
1 code implementation • 12 Jun 2024 • Weixi Feng, Jiachen Li, Michael Saxon, Tsu-Jui Fu, Wenhu Chen, William Yang Wang
Video generation has many unique challenges beyond those of image generation.
1 code implementation • 29 May 2024 • Jiachen Li, Weixi Feng, Tsu-Jui Fu, Xinyi Wang, Sugato Basu, Wenhu Chen, William Yang Wang
In this work, we aim to break the quality bottleneck of a video consistency model (VCM) to achieve both fast and high-quality video generation.
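For intuition, here is a minimal sketch of few-step consistency sampling; `consistency_fn`, the timestep grid, and the toy noise schedule are illustrative assumptions, not the paper's pipeline.

```python
import torch

# Hypothetical sketch: multi-step sampling with a (video) consistency model.
# `consistency_fn(x_t, t)` is assumed to map a noisy latent directly to a
# clean estimate; a real pipeline would decode the result with a VAE.
@torch.no_grad()
def sample_vcm(consistency_fn, shape, timesteps=(999, 749, 499, 249), device="cpu"):
    x = torch.randn(shape, device=device)              # start from pure noise
    for i, t in enumerate(timesteps):
        t_batch = torch.full((shape[0],), t, device=device, dtype=torch.long)
        x0 = consistency_fn(x, t_batch)                # one-jump clean estimate
        if i < len(timesteps) - 1:                     # re-noise for the next step
            a = 1.0 - timesteps[i + 1] / 1000.0        # toy linear schedule
            x = a**0.5 * x0 + (1.0 - a)**0.5 * torch.randn_like(x0)
    return x0
```

Each extra step trades latency for quality, which is why closing the quality gap at a handful of steps matters.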
no code implementations • 16 Mar 2024 • Jiachen Li, Weixi Feng, Wenhu Chen, William Yang Wang
By distilling a latent consistency model (LCM) from a pre-trained teacher latent diffusion model (LDM), latent consistency distillation (LCD) facilitates the generation of high-fidelity images in merely 2 to 4 inference steps.
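As a rough illustration of the distillation objective, here is a hedged toy sketch; `student`, `ema_student`, `teacher_ode_step`, and the linear noise schedule are stand-ins, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def add_noise(z0, noise, t, T=1000):
    a = 1.0 - t / T                                   # toy linear schedule
    return a**0.5 * z0 + (1.0 - a)**0.5 * noise

# One LCD-style training step: the student must map two adjacent points on
# the teacher's ODE trajectory to the same clean latent (self-consistency).
def lcd_loss(student, ema_student, teacher_ode_step, z0, t_next, t_cur):
    noise = torch.randn_like(z0)
    z_next = add_noise(z0, noise, t_next)             # diffuse latent to t_next
    z_cur = teacher_ode_step(z_next, t_next, t_cur)   # teacher solver step back
    pred = student(z_next, t_next)                    # student's one-jump estimate
    with torch.no_grad():
        target = ema_student(z_cur, t_cur)            # EMA self-consistency target
    return F.mse_loss(pred, target)
```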
1 code implementation • 12 Jul 2023 • Raphael Schumann, Wanrong Zhu, Weixi Feng, Tsu-Jui Fu, Stefan Riezler, William Yang Wang
In this work, we propose VELMA, an embodied LLM agent that uses a verbalization of the trajectory and of visual environment observations as a contextual prompt for the next action.
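A toy sketch of what such a verbalized prompt could look like is below; the observation fields and action vocabulary are illustrative assumptions, not VELMA's exact format.

```python
# Hypothetical verbalization of a navigation trajectory into an LLM prompt.
ACTIONS = ("forward", "left", "right", "stop")

def build_prompt(instruction, history, current_obs):
    lines = [f"Instruction: {instruction}"]
    for i, (obs, act) in enumerate(history, 1):
        lines.append(f"{i}. You see {obs}. Action: {act}.")
    lines.append(f"{len(history) + 1}. You see {current_obs}. Action:")
    return "\n".join(lines)

prompt = build_prompt(
    "Turn left at the bakery and stop at the red door.",
    [("a bakery on your left", "left")],
    "a red door ahead",
)
# The LLM's completion is then matched against ACTIONS to pick the next move.
```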
1 code implementation • NeurIPS 2023 • Weixi Feng, Wanrong Zhu, Tsu-Jui Fu, Varun Jampani, Arjun Akula, Xuehai He, Sugato Basu, Xin Eric Wang, William Yang Wang
When combined with a downstream image generation model, LayoutGPT outperforms text-to-image models/systems by 20-40% and achieves performance comparable to human users in designing visual layouts with numerical and spatial correctness.
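LayoutGPT prompts the LLM for layouts in a CSS-like format; the sketch below shows how such a completion could be parsed into bounding boxes for a layout-to-image model. The example completion and the parsing details are assumptions for illustration.

```python
import re

completion = """
cat { width: 120px; height: 100px; left: 30px; top: 140px; }
dog { width: 150px; height: 110px; left: 180px; top: 130px; }
"""

def parse_layout(text):
    # Returns [{"label": ..., "box": (left, top, width, height)}, ...]
    pat = re.compile(r"(\w+)\s*\{\s*width:\s*(\d+)px;\s*height:\s*(\d+)px;"
                     r"\s*left:\s*(\d+)px;\s*top:\s*(\d+)px;\s*\}")
    return [{"label": name, "box": (int(x), int(y), int(w), int(h))}
            for name, w, h, x, y in pat.findall(text)]

print(parse_layout(completion))
```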
1 code implementation • 23 May 2023 • Siqi Liu, Weixi Feng, Tsu-Jui Fu, Wenhu Chen, William Yang Wang
Making image retrieval methods practical for real-world search applications requires significant progress in dataset scales, entity comprehension, and multimodal information fusion.
1 code implementation • 18 May 2023 • Xuehai He, Weixi Feng, Tsu-Jui Fu, Varun Jampani, Arjun Akula, Pradyumna Narayana, Sugato Basu, William Yang Wang, Xin Eric Wang
Diffusion models, such as Stable Diffusion, have shown remarkable performance on text-to-image generation.
1 code implementation • 9 Dec 2022 • Weixi Feng, Xuehai He, Tsu-Jui Fu, Varun Jampani, Arjun Akula, Pradyumna Narayana, Sugato Basu, Xin Eric Wang, William Yang Wang
In this work, we improve the compositional skills of T2I models, specifically targeting more accurate attribute binding and better image composition.
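One ingredient in this line of work is encoding a prompt's noun phrases separately so attributes stay bound to the right nouns. The sketch below is a simplification: averaging the embeddings is a stand-in for combining them inside cross-attention, and it assumes the spaCy model `en_core_web_sm` is installed.

```python
import spacy
import torch
from transformers import CLIPTokenizer, CLIPTextModel

nlp = spacy.load("en_core_web_sm")
tok = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
enc = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

prompt = "a red car and a white sheep"
# Encode the full prompt plus each noun phrase ("a red car", "a white sheep").
phrases = [prompt] + [np.text for np in nlp(prompt).noun_chunks]

with torch.no_grad():
    ids = tok(phrases, padding="max_length", truncation=True, return_tensors="pt")
    embs = enc(**ids).last_hidden_state        # one token sequence per phrase
combined = embs.mean(dim=0, keepdim=True)      # toy combination for the U-Net
```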
no code implementations • 19 Oct 2022 • Xuehai He, Diji Yang, Weixi Feng, Tsu-Jui Fu, Arjun Akula, Varun Jampani, Pradyumna Narayana, Sugato Basu, William Yang Wang, Xin Eric Wang
Prompt tuning is a new few-shot transfer learning technique that tunes only the learnable prompt for pre-trained vision and language models such as CLIP.
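A minimal CoOp-style sketch of the idea: only the context vectors are trained while the vision-language backbone stays frozen. Names and dimensions here are assumptions, not a specific method's API.

```python
import torch
import torch.nn as nn

class PromptLearner(nn.Module):
    """Learnable prompt prepended to frozen class-name token embeddings."""
    def __init__(self, n_ctx=16, dim=512, n_cls=2):
        super().__init__()
        self.ctx = nn.Parameter(torch.randn(n_ctx, dim) * 0.02)  # only trained weights
        self.n_cls = n_cls

    def forward(self, class_embs):
        # class_embs: (n_cls, n_tok, dim) frozen token embeddings of class names
        ctx = self.ctx.unsqueeze(0).expand(self.n_cls, -1, -1)
        return torch.cat([ctx, class_embs], dim=1)  # (n_cls, n_ctx + n_tok, dim)
```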
1 code implementation • 18 Oct 2022 • Weixi Feng, Tsu-Jui Fu, Yujie Lu, William Yang Wang
Vision-and-Language Navigation (VLN) is the task of guiding an embodied agent to a target position using natural language instructions.
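Schematically, the task reduces to the loop below; `env` and `agent` are hypothetical interfaces, not a specific simulator's API.

```python
def run_episode(env, agent, instruction, max_steps=30):
    obs = env.reset()
    for _ in range(max_steps):
        action = agent.act(instruction, obs)   # e.g. forward / left / right / stop
        if action == "stop":
            break
        obs = env.step(action)
    return env.distance_to_goal()              # success metric for the episode
```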
no code implementations • 10 Sep 2022 • Yujie Lu, Huiliang Zhang, Ping Nie, Weixi Feng, Wenda Xu, Xin Eric Wang, William Yang Wang
In this paper, we propose DAVIS, an Unseen Discrepancy Anticipating Vision and Language Navigation agent that learns to generalize to unseen environments by encouraging test-time visual consistency.
no code implementations • 6 Jun 2022 • Yujie Lu, Weixi Feng, Wanrong Zhu, Wenda Xu, Xin Eric Wang, Miguel Eckstein, William Yang Wang
Procedural planning aims to accomplish complex high-level goals by decomposing them into sequences of simpler low-level steps.
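As a toy illustration of such decomposition (not the paper's neuro-symbolic method), one can prompt a language model for numbered steps and parse them; `generate` stands in for any LLM call.

```python
def plan(goal, generate):
    prompt = (f"Goal: {goal}\n"
              "List the steps to achieve this goal, one per line, "
              "numbered 1., 2., 3., ...\nSteps:\n")
    text = generate(prompt)
    # Keep only numbered lines and strip the "N." prefix.
    return [line.split(".", 1)[1].strip()
            for line in text.splitlines()
            if line.strip()[:1].isdigit() and "." in line]
```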