Search Results for author: Yueze Wang

Found 10 papers, 8 papers with code

Emu3: Next-Token Prediction is All You Need

no code implementations • 27 Sep 2024 • Xinlong Wang, Xiaosong Zhang, Zhengxiong Luo, Quan Sun, Yufeng Cui, Jinsheng Wang, Fan Zhang, Yueze Wang, Zhen Li, Qiying Yu, Yingli Zhao, Yulong Ao, Xuebin Min, Tao Li, Boya Wu, Bo Zhao, BoWen Zhang, Liangdong Wang, Guang Liu, Zheqi He, Xi Yang, Jingjing Liu, Yonghua Lin, Tiejun Huang, Zhongyuan Wang

While next-token prediction is considered a promising path towards artificial general intelligence, it has struggled to excel in multimodal tasks, which are still dominated by diffusion models (e.g., Stable Diffusion) and compositional approaches (e.g., CLIP combined with LLMs).

DenseFusion-1M: Merging Vision Experts for Comprehensive Multimodal Perception

1 code implementation • 11 Jul 2024 • Xiaotong Li, Fan Zhang, Haiwen Diao, Yueze Wang, Xinlong Wang, Ling-Yu Duan

To facilitate cutting-edge research on MLLMs for comprehensive visual perception, we propose Perceptual Fusion, which uses a low-budget but highly effective caption engine to produce complete and accurate image descriptions.

Visual Question Answering

Unveiling Encoder-Free Vision-Language Models

1 code implementation • 17 Jun 2024 • Haiwen Diao, Yufeng Cui, Xiaotong Li, Yueze Wang, Huchuan Lu, Xinlong Wang

Training pure VLMs that accept seamless vision and language inputs, i.e., without vision encoders, remains challenging and rarely explored.

Decoder • Inductive Bias • +1

Seeing Clearly, Answering Incorrectly: A Multimodal Robustness Benchmark for Evaluating MLLMs on Leading Questions

1 code implementation • 15 Jun 2024 • Yexin Liu, Zhengyang Liang, Yueze Wang, Muyang He, Jian Li, Bo Zhao

To enhance MLLMs' understanding capability and robustness, we further present a training set with paired positive and negative visual question-answer samples.
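
As a rough illustration of what such a paired positive/negative (leading-question) sample might look like, here is a minimal sketch; the field names and wording are assumptions for illustration, not the benchmark's actual schema.

```python
# Hypothetical illustration of a paired positive/negative (leading-question)
# VQA sample; field names and content are assumptions, not the actual schema.
paired_sample = {
    "image": "example.jpg",
    "positive": {
        "question": "What color is the umbrella in the image?",
        "answer": "Red",
    },
    "negative": {
        # A leading question that presupposes content absent from the image.
        "question": "Why is the blue umbrella lying on the ground?",
        "answer": "There is no blue umbrella in the image.",
    },
}
print(paired_sample["negative"]["question"])
```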

Efficient Multimodal Learning from Data-centric Perspective

1 code implementation • 18 Feb 2024 • Muyang He, Yexin Liu, Boya Wu, Jianhao Yuan, Yueze Wang, Tiejun Huang, Bo Zhao

Multimodal Large Language Models (MLLMs) have demonstrated notable capabilities in general visual understanding and reasoning tasks.

Universal Prompt Optimizer for Safe Text-to-Image Generation

no code implementations • 16 Feb 2024 • Zongyu Wu, Hongcheng Gao, Yueze Wang, Xiang Zhang, Suhang Wang

To guide the optimizer to convert toxic prompts into clean prompts while preserving semantic information, we design a novel reward function that measures the toxicity and text alignment of generated images, and we train the optimizer through Proximal Policy Optimization.

Blocking Text-to-Image Generation
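
As a rough illustration of the reward described in the entry above, the sketch below combines an image-toxicity score with a text-image alignment score into a single scalar of the kind a PPO loop would optimize. The helpers `toxicity_score` and `alignment_score` are hypothetical stubs (standing in for a real NSFW classifier and a CLIP-style similarity model), and the linear weighting is an assumption, not the paper's exact formulation.

```python
# Minimal sketch of a reward that favors low toxicity and high semantic
# alignment with the user's original intent. The scorers are stubs; in a
# real setup they would be learned models.

def toxicity_score(image) -> float:
    """Stub: probability in [0, 1] that the generated image is toxic."""
    return 0.1  # placeholder value

def alignment_score(image, original_prompt: str) -> float:
    """Stub: CLIP-style similarity in [0, 1] between image and original prompt."""
    return 0.8  # placeholder value

def reward(image, original_prompt: str, w_tox: float = 1.0, w_align: float = 1.0) -> float:
    # Reward images that are non-toxic and still match the original prompt.
    return w_tox * (1.0 - toxicity_score(image)) + w_align * alignment_score(image, original_prompt)

# In a PPO loop, this scalar would score the image generated from the
# rewritten ("clean") prompt produced by the optimizer.
print(reward(image=None, original_prompt="a painting of a battlefield"))
```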

Generative Multimodal Models are In-Context Learners

1 code implementation • CVPR 2024 • Quan Sun, Yufeng Cui, Xiaosong Zhang, Fan Zhang, Qiying Yu, Zhengxiong Luo, Yueze Wang, Yongming Rao, Jingjing Liu, Tiejun Huang, Xinlong Wang

The human ability to easily solve multimodal tasks in context (i.e., with only a few demonstrations or simple instructions) is what current multimodal systems have largely struggled to imitate.

In-Context Learning • Personalized Image Generation • +3

Emu: Generative Pretraining in Multimodality

2 code implementations • 11 Jul 2023 • Quan Sun, Qiying Yu, Yufeng Cui, Fan Zhang, Xiaosong Zhang, Yueze Wang, Hongcheng Gao, Jingjing Liu, Tiejun Huang, Xinlong Wang

We present Emu, a Transformer-based multimodal foundation model that can seamlessly generate images and text in a multimodal context.

Image Captioning • Temporal/Causal QA • +4

Fine-Grained Visual Prompting

1 code implementation • NeurIPS 2023 • Lingfeng Yang, Yueze Wang, Xiang Li, Xinlong Wang, Jian Yang

Previous works have suggested that incorporating visual prompts, such as colorful boxes or circles, can improve the ability of models to recognize objects of interest.

Visual Prompting
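
As a rough illustration of the coarse visual prompts mentioned in the excerpt above (colorful boxes or circles), here is a minimal Pillow sketch that overlays such a marker on an image before it would be passed to a model. The helper `add_visual_prompt` is hypothetical, and this only illustrates the generic idea, not the fine-grained prompting method the paper proposes.

```python
# Minimal sketch: draw a colored box or circle around a region of interest
# so a vision-language model can be steered toward that object.
from PIL import Image, ImageDraw

def add_visual_prompt(image: Image.Image, box, shape: str = "rectangle",
                      color: str = "red", width: int = 4) -> Image.Image:
    """Return a copy of `image` with a marker drawn at box = (x0, y0, x1, y1)."""
    prompted = image.copy()
    draw = ImageDraw.Draw(prompted)
    if shape == "rectangle":
        draw.rectangle(box, outline=color, width=width)
    else:
        draw.ellipse(box, outline=color, width=width)
    return prompted

# Example on a blank canvas; in practice the box would come from a detector
# or a user annotation on a real photo.
canvas = Image.new("RGB", (256, 256), "white")
prompted = add_visual_prompt(canvas, (60, 60, 180, 180), shape="circle")
prompted.save("visual_prompt_demo.png")
```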
