In this survey, we comprehensively review the existing evaluation protocols of multimodal reasoning, categorize and illustrate the frontiers of MLLMs, introduce recent trends in applications of MLLMs on reasoning-intensive tasks, and finally discuss current practices and future directions.
1 code implementation • 10 Jan 2024 • Xueyu Hu, Ziyu Zhao, Shuang Wei, Ziwei Chai, Qianli Ma, Guoyin Wang, Xuwu Wang, Jing Su, Jingjing Xu, Ming Zhu, Yao Cheng, Jianbo Yuan, Jiwei Li, Kun Kuang, Yang Yang, Hongxia Yang, Fei Wu
In this paper, we introduce InfiAgent-DABench, the first benchmark specifically designed to evaluate LLM-based agents on data analysis tasks.
In light of the remarkable success of in-context learning in large language models, its potential extension to the vision domain, particularly with visual foundation models like Stable Diffusion, has sparked considerable interest.
This work introduces self-infilling code generation, a general framework that incorporates infilling operations into auto-regressive decoding.
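As a rough illustration of the idea of mixing infilling with left-to-right decoding (not the paper's actual procedure or interface), a sketch with hypothetical placeholder model calls `lm_complete_suffix` and `lm_infill`:

```python
# Hypothetical sketch: interleave suffix generation and infilling with
# ordinary left-to-right decoding. The two lm_* functions are placeholders
# standing in for a fill-in-the-middle-capable code model.

def lm_complete_suffix(prefix: str) -> str:
    """Placeholder: draft the closing part of the code before the middle exists."""
    return "    return a + b\n"

def lm_infill(prefix: str, suffix: str) -> str:
    """Placeholder: produce text that connects `prefix` to `suffix`."""
    return "    # add two numbers\n"

def self_infilling_generate(prefix: str) -> str:
    # 1. Instead of decoding strictly left to right, first draft a suffix.
    suffix = lm_complete_suffix(prefix)
    # 2. Infill the middle so it stays consistent with both prefix and suffix.
    middle = lm_infill(prefix, suffix)
    # 3. Stitch the pieces back into a single left-to-right sequence.
    return prefix + middle + suffix

print(self_infilling_generate("def add(a, b):\n"))
```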
Recent advancements in text-to-image (T2I) generative models have shown remarkable capabilities in producing diverse and imaginative visuals based on text prompts.
To mitigate this issue, we manually curate a benchmark dataset specifically designed for MLLMs, with a focus on complex reasoning tasks.
Recent years have seen considerable advancements in multi-step reasoning with Large Language Models (LLMs).
Our empirical results demonstrate that LEMON reduces computational costs by 56.7% for Vision Transformers and 33.2% for BERT when compared to training from scratch.
Although natural language is an obvious choice for communication given LLMs' language understanding capabilities, the token sampling step required when generating natural language poses a potential risk of information loss, since a single token is used to represent the model's belief over the entire vocabulary.
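A toy illustration of that information loss (not code from the paper): once a single token is sampled and sent, the rest of the model's probability mass over the vocabulary is discarded.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["yes", "no", "maybe"]

# Toy next-token distribution: the model's full belief over the vocabulary.
logits = np.array([1.2, 1.0, 0.9])
probs = np.exp(logits) / np.exp(logits).sum()

# Natural-language communication commits to one sampled token.
sampled = rng.choice(vocab, p=probs)

# The one-hot "message" keeps none of the uncertainty present in `probs`.
one_hot = np.array([tok == sampled for tok in vocab], dtype=float)
print("full belief:", dict(zip(vocab, probs.round(3))))
print("message sent:", sampled, one_hot)
```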
Diffusion models excel at generating photo-realistic images but come with significant computational costs in both training and sampling.
However, direct aligning cross-modal information using such representations is challenging, as visual patches and text tokens differ in semantic levels and granularities.
The success of large-scale contrastive vision-language pretraining (CLIP) has benefited both visual recognition and multimodal content understanding.
Building on previous progress on RFA, we characterize this gap through the lens of control variates and show that RFA can be decomposed into a sum of multiple control variate estimators, one for each element in the sequence.
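For readers unfamiliar with control variates, the generic construction is shown below; the paper's specific decomposition of RFA is not reproduced here.

```latex
% Generic control variate estimator: given an unbiased estimator \hat{\mu}
% of \mu and a correlated statistic g with known mean \mathbb{E}[g], the estimator
% below stays unbiased for any coefficient c, and a well-chosen c lowers variance.
\[
  \hat{\mu}_{\mathrm{cv}} \;=\; \hat{\mu} \;+\; c\,\bigl(\mathbb{E}[g] - g\bigr),
  \qquad
  \mathbb{E}\bigl[\hat{\mu}_{\mathrm{cv}}\bigr] \;=\; \mathbb{E}[\hat{\mu}] \;=\; \mu .
\]
```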
Despite the success of fully supervised human skeleton sequence modeling, self-supervised pre-training for skeleton sequence representation learning has been an active area of research because acquiring task-specific skeleton annotations at large scale is difficult.
Cross-modal attention mechanisms have been widely applied to the image-text matching task and have achieved remarkable improvements thanks to their capability of learning fine-grained relevance across different modalities.
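A minimal sketch of a generic cross-modal attention step between image-region features and word features (an illustrative formulation, not the specific model described above):

```python
import numpy as np

def cross_modal_attention(regions, words):
    """Attend each word over image regions and return attended region features.

    regions: (n_regions, d) image-region features
    words:   (n_words, d)   word features
    """
    scores = words @ regions.T / np.sqrt(regions.shape[1])      # (n_words, n_regions)
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights = weights / weights.sum(axis=1, keepdims=True)      # softmax over regions
    return weights @ regions                                    # (n_words, d)

rng = np.random.default_rng(0)
attended = cross_modal_attention(rng.normal(size=(36, 64)), rng.normal(size=(7, 64)))
print(attended.shape)  # (7, 64)
```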
In addition, to enrich the decoder with descriptive semantics and to enforce the correctness of deterministic medical content such as mentions of organs or diagnoses, we extract medical concepts from the radiology reports in the training data and fine-tune the encoder to extract the most frequent medical concepts from the X-ray images.
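A hedged sketch of the kind of multi-label concept head such fine-tuning could use; all names and the stand-in encoder are hypothetical, and the paper's concrete architecture is not reproduced here.

```python
import torch
import torch.nn as nn

# Hypothetical sketch: fine-tune an image encoder to predict the most frequent
# medical concepts mined from the training reports (multi-label classification).
class ConceptClassifier(nn.Module):
    def __init__(self, encoder: nn.Module, feat_dim: int, num_concepts: int):
        super().__init__()
        self.encoder = encoder
        self.head = nn.Linear(feat_dim, num_concepts)

    def forward(self, images):
        feats = self.encoder(images)      # (batch, feat_dim)
        return self.head(feats)           # concept logits

encoder = nn.Sequential(nn.Flatten(), nn.Linear(224 * 224, 256), nn.ReLU())  # stand-in encoder
model = ConceptClassifier(encoder, feat_dim=256, num_concepts=50)
criterion = nn.BCEWithLogitsLoss()        # one sigmoid per concept

images = torch.randn(4, 1, 224, 224)
targets = torch.randint(0, 2, (4, 50)).float()  # concepts mentioned in each report
loss = criterion(model(images), targets)
loss.backward()
```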
We propose a Dynamic Graph-Based Spatial-Temporal Attention (DG-STA) method for hand gesture recognition.
Ranked #3 on Hand Gesture Recognition on SHREC 2017
Extensive experiments show that our method significantly outperforms the existing unsupervised models for image-to-image translation problems, and achieves comparable performance to existing supervised models on a synthesized dataset.
Sentiment analysis on large-scale social media data is important for bridging the gap between social media content and real-world activities, including political election prediction and the monitoring and analysis of individual and public emotional status.
One of the most important features of the proposed technique is that it can be applied on top of any existing CF-based recommendation engine without changing the CF core.
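A minimal sketch of what "on top of any existing CF engine" could mean in practice: a post-processing layer that re-scores the CF output while leaving the CF core untouched (illustrative only; the actual technique is not shown).

```python
def cf_recommend(user_id):
    """Stand-in for any existing collaborative-filtering engine."""
    return [("item_a", 0.91), ("item_b", 0.88), ("item_c", 0.75)]

def rescore(recommendations, adjust):
    """Post-process CF scores without modifying the CF core.

    `adjust` maps item -> additive score adjustment (e.g. from side information).
    """
    rescored = [(item, score + adjust.get(item, 0.0)) for item, score in recommendations]
    return sorted(rescored, key=lambda pair: pair[1], reverse=True)

print(rescore(cf_recommend(user_id=42), adjust={"item_c": 0.2}))
```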