Enhancing Cognition and Explainability of Multimodal Foundation Models with Self-Synthesized Data

19 Feb 2025 · Yucheng Shi, Quanzheng Li, Jin Sun, Xiang Li, Ninghao Liu

Large multimodal models (LMMs) have shown impressive capabilities in a wide range of visual tasks. However, they often struggle with fine-grained visual reasoning, failing to identify domain-specific objectives and to provide justifiable explanations for their predictions. To address this, we propose a novel visual rejection sampling framework to improve the cognition and explainability of LMMs using self-synthesized data. Specifically, visual fine-tuning requires images, queries, and target answers. Our approach begins by synthesizing interpretable answers that include human-verifiable visual features. These features are grounded in expert-defined concepts, selected for their alignment with the image content. After each round of fine-tuning, we apply a reward-model-free filtering mechanism to select the highest-quality interpretable answers for the next round of tuning. This iterative process of data synthesis and fine-tuning progressively improves the model's ability to generate accurate and reasonable explanations. Experimental results demonstrate the effectiveness of our method in improving both accuracy and explainability on specialized visual classification tasks.
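To make the synthesize-filter-finetune cycle concrete, here is a minimal Python sketch of the iterative loop the abstract describes. All names are hypothetical illustrations, not the authors' code: `Candidate`, `filter_candidates`, `visual_rejection_sampling`, the `generate` and `finetune` callables, and in particular the `self_consistency` score, which stands in for whatever quality criterion the paper's reward-model-free filter actually applies.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence, Tuple

@dataclass
class Candidate:
    text: str                # generated answer citing visual features
    predicted_class: str     # class label the answer commits to
    self_consistency: float  # assumed quality proxy, not the paper's exact criterion

def filter_candidates(candidates: List[Candidate], label: str) -> Optional[Candidate]:
    """Reward-model-free filtering (sketch): discard answers whose predicted
    class disagrees with the ground truth, then keep the survivor with the
    highest self-consistency score."""
    correct = [c for c in candidates if c.predicted_class == label]
    return max(correct, key=lambda c: c.self_consistency, default=None)

def visual_rejection_sampling(
    generate: Callable[[object, str], Candidate],               # LMM sampling call (assumed)
    finetune: Callable[[List[Tuple[object, str, str]]], None],  # tuning step (assumed)
    dataset: Sequence[Tuple[object, str, str]],                 # (image, query, label) triples
    num_rounds: int = 3,
    samples_per_image: int = 8,
) -> None:
    """Iterate: synthesize interpretable answers, filter, fine-tune, repeat."""
    for _ in range(num_rounds):
        finetune_set: List[Tuple[object, str, str]] = []
        for image, query, label in dataset:
            # 1. Sample several candidate answers grounded in
            #    expert-defined, human-verifiable visual features.
            candidates = [generate(image, query) for _ in range(samples_per_image)]
            # 2. Keep only the highest-quality correct answer.
            best = filter_candidates(candidates, label)
            if best is not None:
                finetune_set.append((image, query, best.text))
        # 3. Fine-tune on the filtered self-synthesized data; the improved
        #    model produces better candidates in the next round.
        finetune(finetune_set)
```

The loop captures the paper's high-level recipe (sample, reject, tune, repeat); the filtering criterion and training details are placeholders under the stated assumptions.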

| Task | Dataset | Model | Metric | Value | Global Rank |
| --- | --- | --- | --- | --- | --- |
| Pneumonia Detection | Chest X-ray images | Selfsynthx | Accuracy (%) | 98.72 | #1 |
| Fine-Grained Visual Recognition | CUB-200-2011 | Selfsynthx | Accuracy (%) | 85.02 | #1 |
| Fine-Grained Visual Recognition | FGVC-Aircraft | Selfsynthx | Accuracy (%) | 91.99 | #1 |
| Fine-Grained Visual Recognition | New Plant Diseases Dataset | Selfsynthx | Accuracy (%) | 97.16 | #1 |
| Fine-Grained Visual Recognition | Stanford Dogs | Selfsynthx | Accuracy (%) | 86.91 | #1 |
