no code implementations • 27 Sep 2024 • Xinlong Wang, Xiaosong Zhang, Zhengxiong Luo, Quan Sun, Yufeng Cui, Jinsheng Wang, Fan Zhang, Yueze Wang, Zhen Li, Qiying Yu, Yingli Zhao, Yulong Ao, Xuebin Min, Tao Li, Boya Wu, Bo Zhao, BoWen Zhang, Liangdong Wang, Guang Liu, Zheqi He, Xi Yang, Jingjing Liu, Yonghua Lin, Tiejun Huang, Zhongyuan Wang
While next-token prediction is considered a promising path towards artificial general intelligence, it has struggled to excel in multimodal tasks, which are still dominated by diffusion models (e.g., Stable Diffusion) and compositional approaches (e.g., CLIP combined with LLMs).
1 code implementation • 17 Sep 2024 • Shitao Xiao, Yueze Wang, Junjie Zhou, Huaying Yuan, Xingrun Xing, Ruiran Yan, Shuting Wang, Tiejun Huang, Zheng Liu
In this work, we introduce OmniGen, a new diffusion model for unified image generation.
1 code implementation • 11 Jul 2024 • Xiaotong Li, Fan Zhang, Haiwen Diao, Yueze Wang, Xinlong Wang, Ling-Yu Duan
To facilitate cutting-edge research on comprehensive vision perception in MLLMs, we propose Perceptual Fusion, a low-budget yet highly effective caption engine for complete and accurate image descriptions.
Ranked #78 on Visual Question Answering on MM-Vet
1 code implementation • 17 Jun 2024 • Haiwen Diao, Yufeng Cui, Xiaotong Li, Yueze Wang, Huchuan Lu, Xinlong Wang
Training pure VLMs that accept seamless vision and language inputs, i.e., without vision encoders, remains challenging and rarely explored.
1 code implementation • 15 Jun 2024 • Yexin Liu, Zhengyang Liang, Yueze Wang, Muyang He, Jian Li, Bo Zhao
To enhance MLLMs' understanding capability and robustness, we further present a training set with paired positive and negative visual question-answer samples.
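As a rough, hedged sketch of what one such paired sample could look like (the field names and example content below are assumptions for illustration, not the released dataset's actual schema):

```python
from dataclasses import dataclass

@dataclass
class PairedVQASample:
    """One training example pairing a positive and a negative answer for the same image.

    All field names here are hypothetical; they only illustrate the idea of
    positive/negative supervision for the same visual question.
    """
    image_path: str
    question: str
    positive_answer: str  # answer consistent with the image content
    negative_answer: str  # hard negative: fluent but contradicted by the image

# Hypothetical example of a paired sample.
sample = PairedVQASample(
    image_path="images/0001.jpg",
    question="What color is the umbrella?",
    positive_answer="The umbrella is red.",
    negative_answer="The umbrella is blue.",
)
print(sample.question, "->", sample.positive_answer)
```

Pairing each positive with a matched negative gives the model an explicit contrast signal, which is what the robustness claim above relies on.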
1 code implementation • 18 Feb 2024 • Muyang He, Yexin Liu, Boya Wu, Jianhao Yuan, Yueze Wang, Tiejun Huang, Bo Zhao
Multimodal Large Language Models (MLLMs) have demonstrated notable capabilities in general visual understanding and reasoning tasks.
no code implementations • 16 Feb 2024 • Zongyu Wu, Hongcheng Gao, Yueze Wang, Xiang Zhang, Suhang Wang
To guide the optimizer to convert toxic prompts into clean prompts while preserving semantic information, we design a novel reward function that measures the toxicity and text alignment of generated images, and train the optimizer with Proximal Policy Optimization.
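The excerpt above does not give the reward's exact form, but as a minimal sketch of the idea (the linear combination, weights, and score ranges are assumptions, not the paper's formulation), the two signals could be folded into a single scalar that PPO then maximizes:

```python
def prompt_reward(toxicity_score: float, alignment_score: float,
                  toxicity_weight: float = 1.0, alignment_weight: float = 1.0) -> float:
    """Combine image toxicity and prompt-image alignment into one scalar reward.

    toxicity_score:  estimated probability in [0, 1] that the generated image is
                     toxic (e.g., from an off-the-shelf NSFW/toxicity classifier).
    alignment_score: similarity in [0, 1] between the generated image and the
                     original prompt's semantics (e.g., a normalized CLIP score).
    Higher reward means the rewritten prompt yields a cleaner, better-aligned image.
    """
    return alignment_weight * alignment_score - toxicity_weight * toxicity_score

# A clean, well-aligned generation earns a high reward; a toxic or off-topic one is penalized.
print(prompt_reward(toxicity_score=0.05, alignment_score=0.85))
print(prompt_reward(toxicity_score=0.90, alignment_score=0.40))
```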
1 code implementation • CVPR 2024 • Quan Sun, Yufeng Cui, Xiaosong Zhang, Fan Zhang, Qiying Yu, Zhengxiong Luo, Yueze Wang, Yongming Rao, Jingjing Liu, Tiejun Huang, Xinlong Wang
The human ability to easily solve multimodal tasks in context (i.e., with only a few demonstrations or simple instructions) is what current multimodal systems have largely struggled to imitate.
Ranked #3 on Personalized Image Generation on DreamBooth
2 code implementations • 11 Jul 2023 • Quan Sun, Qiying Yu, Yufeng Cui, Fan Zhang, Xiaosong Zhang, Yueze Wang, Hongcheng Gao, Jingjing Liu, Tiejun Huang, Xinlong Wang
We present Emu, a Transformer-based multimodal foundation model, which can seamlessly generate images and text in a multimodal context.
Ranked #1 on Visual Question Answering on VizWiz
1 code implementation • NeurIPS 2023 • Lingfeng Yang, Yueze Wang, Xiang Li, Xinlong Wang, Jian Yang
Previous works have suggested that incorporating visual prompts, such as colorful boxes or circles, can improve the ability of models to recognize objects of interest.
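As a rough illustration of this style of visual prompting (a sketch only, not the method proposed in this paper; the helper function and its parameters are assumptions), one can overlay a colored circle or box on the region of interest before feeding the image to the model:

```python
from PIL import Image, ImageDraw  # requires Pillow

def add_visual_prompt(image_path: str, box: tuple[int, int, int, int],
                      shape: str = "circle", color: str = "red",
                      width: int = 4) -> Image.Image:
    """Draw a colored circle or rectangle around the object of interest.

    box: (left, top, right, bottom) pixel coordinates of the target region.
    """
    image = Image.open(image_path).convert("RGB")
    draw = ImageDraw.Draw(image)
    if shape == "circle":
        draw.ellipse(box, outline=color, width=width)  # ellipse inscribed in the box
    else:
        draw.rectangle(box, outline=color, width=width)
    return image

# Example: circle a hypothetical region, then pass the marked image to the model.
# prompted = add_visual_prompt("scene.jpg", box=(40, 60, 200, 220))
# prompted.save("scene_prompted.jpg")
```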