no code implementations • 13 Feb 2025 • Mingxiao Li, Fang Qu, Zhanpeng Chen, Na Su, Zhizhou Zhong, Ziyang Chen, Nan Du, Xiaolong Li
While MLLMs perform well on perceptual tasks, they lack precise multimodal alignment, limiting performance.
1 code implementation • 2 Feb 2025 • Teng Xiao, Yige Yuan, Zhengyu Chen, Mingxiao Li, Shangsong Liang, Zhaochun Ren, Vasant G Honavar
Existing preference optimization objectives for language model alignment require additional hyperparameters that must be extensively tuned to achieve optimal performance, increasing both the complexity and time required for fine-tuning large language models.
1 code implementation • 19 Jan 2025 • Zhanpeng Chen, Mingxiao Li, Ziyang Chen, Nan Du, Xiaolong Li, Yuexian Zou
Vision-language Models (VLMs) have shown remarkable capabilities in advancing general artificial intelligence, yet the irrational encoding of visual positions persists in inhibiting the models' comprehensive perception performance across different levels of granularity.
1 code implementation • 19 Dec 2024 • Mang Ning, Mingxiao Li, Jianlin Su, Haozhe Jia, Lanmiao Liu, Martin Beneš, Albert Ali Salah, Itir Onal Ertugrul
The effectiveness of DCTdiff and the introduced properties suggest a promising direction for image modeling in the frequency space.
1 code implementation • 19 Dec 2024 • Teng Xiao, Yige Yuan, Huaisheng Zhu, Mingxiao Li, Vasant G Honavar
Contrastive preference optimization has shown promising results in aligning LLMs with available preference data by optimizing the implicit reward associated with the policy.
no code implementations • 5 Dec 2024 • Maria Mihaela Trusca, Mingxiao Li, Marie-Francine Moens
We show substantial improvements in image editing using action-based text instructions and high reasoning capabilities that allow our model to use the input image as a starting scene for an action while generating a new image that shows the final scene of the action.
1 code implementation • 17 Nov 2024 • Tingyu Qu, Mingxiao Li, Tinne Tuytelaars, Marie-Francine Moens
For video understanding tasks, training-based video LLMs are difficult to build due to the scarcity of high-quality, curated video-text paired data.
Ranked #1 on Zero-Shot Video Question Answer on MVBench
Video-based Generative Performance Benchmarking (Consistency)
1 code implementation • 14 Oct 2024 • Teng Xiao, Mingxiao Li, Yige Yuan, Huaisheng Zhu, Chao Cui, Vasant G Honavar
This paper introduces a novel generalized self-imitation learning (GSIL) framework, which effectively and efficiently aligns large language models with offline demonstration data.
1 code implementation • 7 Oct 2024 • Daoan Zhang, Guangchen Lan, Dong-Jun Han, Wenlin Yao, Xiaoman Pan, Hongming Zhang, Mingxiao Li, Pengcheng Chen, Yu Dong, Christopher Brinton, Jiebo Luo
To address the limitations of both on- and off-policy RLHF, we propose a preference optimization method that aligns DMs with preferences without relying on reward models or paired human-annotated data.
1 code implementation • 2 May 2024 • Wei Sun, Mingxiao Li, Jingyuan Sun, Jesse Davis, Marie-Francine Moens
Argument structure learning (ASL) entails predicting relations between arguments.
no code implementations • 15 Mar 2024 • Mingxiao Li, Bo Wan, Marie-Francine Moens, Tinne Tuytelaars
For the first time, we integrate both semantic and motion cues within a diffusion model for video generation, as demonstrated in Fig 1.
no code implementations • 2 Feb 2024 • Jingyuan Sun, Mingxiao Li, Zijiao Chen, Marie-Francine Moens
In the pursuit of understanding the intricacies of the human brain's visual processing, reconstructing dynamic visual experiences from brain activities emerges as a challenging yet fascinating endeavor.
no code implementations • 2 Oct 2023 • Wei Sun, Mingxiao Li, Damien Sileo, Jesse Davis, Marie-Francine Moens
Medical Question Answering (medical QA) systems play an essential role in assisting healthcare workers in finding answers to their questions.
no code implementations • 30 Sep 2023 • Jingyuan Sun, Mingxiao Li, Marie-Francine Moens
Reconstructing visual stimuli from human brain activities provides a promising opportunity to advance our understanding of the brain's visual system and its connection with computer vision models.
5 code implementations • 29 Aug 2023 • Mang Ning, Mingxiao Li, Jianlin Su, Albert Ali Salah, Itir Onal Ertugrul
In this paper, we systematically investigate the exposure bias problem in diffusion models by first analytically modelling the sampling distribution, based on which we then identify the prediction error at each sampling step as the root cause of the exposure bias issue.
Ranked #10 on Image Generation on CIFAR-10
1 code implementation • NeurIPS 2023 • Jingyuan Sun, Mingxiao Li, Zijiao Chen, Yunhao Zhang, Shaonan Wang, Marie-Francine Moens
The second phase tunes the feature learner to attend to neural activation patterns most informative for visual reconstruction with guidance from an image auto-encoder.
Ranked #1 on Brain Visual Reconstruction from fMRI on GOD
1 code implementation • 24 May 2023 • Mingxiao Li, Tingyu Qu, Ruicong Yao, Wei Sun, Marie-Francine Moens
In this work, we conduct a systematic study of exposure bias in DPM and, intriguingly, we find that the exposure bias could be alleviated with a novel sampling method that we propose, without retraining the model.
no code implementations • 3 Apr 2023 • Mingxiao Li, Rui Jin, Liyao Xiang, Kaiming Shen, Shuguang Cui
The traditional methods for data compression are typically based on the symbol-level statistics, with the information source modeled as a long sequence of i.i.d.
1 code implementation • 30 Nov 2022 • Mingxiao Li, Zehao Wang, Tinne Tuytelaars, Marie-Francine Moens
In this work, we study the problem of Embodied Referring Expression Grounding, where an agent needs to navigate in a previously unseen environment and localize a remote object described by a concise high-level natural language instruction.
no code implementations • 7 Mar 2022 • Zehao Wang, Mingxiao Li, Minye Wu, Marie-Francine Moens, Tinne Tuytelaars
In this paper, we introduce the map-language navigation task where an agent executes natural language instructions and moves to the target position based only on a given 3D semantic map.
1 code implementation • 6 Mar 2022 • Mingxiao Li, Marie-Francine Moens
Knowledge-based visual question answering (VQA) is a vision-language task that requires an agent to correctly answer image-related questions using knowledge that is not presented in the given image.
no code implementations • EACL 2021 • Mingxiao Li, Marie-Francine Moens
Visual dialog is a vision-language task where an agent needs to answer a series of questions grounded in an image based on the understanding of the dialog history and the image.
no code implementations • 13 Jun 2021 • Jaron Maene, Mingxiao Li, Marie-Francine Moens
The lottery ticket hypothesis states that sparse subnetworks exist in randomly initialized dense networks that can be trained to the same accuracy as the dense network they reside in.
7 code implementations • 27 Aug 2020 • Yuhao Kang, Song Gao, Yunlei Liang, Mingxiao Li, Jinmeng Rao, Jake Kruse
Understanding dynamic human mobility changes and spatial interaction patterns at different geographic scales is crucial for monitoring and measuring the impacts of non-pharmaceutical interventions (such as stay-at-home orders) during the pandemic.
Social and Information Networks • Physics and Society