There is a growing need to model interactions between data modalities (e.g., vision, language) — both to improve AI predictions on existing tasks and to enable new applications.
1 code implementation • 14 Sep 2023 • Dave Van Veen, Cara Van Uden, Louis Blankemeier, Jean-Benoit Delbrouck, Asad Aali, Christian Bluethgen, Anuj Pareek, Malgorzata Polacin, William Collins, Neera Ahuja, Curtis P. Langlotz, Jason Hom, Sergios Gatidis, John Pauly, Akshay S. Chaudhari
Sifting through vast textual data and summarizing key information impose a substantial burden on how clinicians allocate their time.
The first key contribution of this work is to demonstrate through systematic evaluations that as the pairwise complexity of the training dataset increases, standard VLMs struggle to learn region-attribute relationships, exhibiting performance degradations of up to 37% on retrieval tasks.
1 code implementation • 2 May 2023 • Dave Van Veen, Cara Van Uden, Maayane Attias, Anuj Pareek, Christian Bluethgen, Malgorzata Polacin, Wah Chiu, Jean-Benoit Delbrouck, Juan Manuel Zambrano Chaves, Curtis P. Langlotz, Akshay S. Chaudhari, John Pauly
We systematically investigate lightweight strategies to adapt large language models (LLMs) for the task of radiology report summarization (RRS).
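A minimal sketch of one such lightweight adaptation strategy, low-rank adaptation (LoRA) via the Hugging Face peft library; the base model, hyperparameters, and prompt format below are illustrative assumptions, not the paper's exact configuration:

```python
# Sketch: LoRA adaptation of a seq2seq LLM for report summarization.
# Model name and hyperparameters are illustrative, not the paper's setup.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from peft import LoraConfig, get_peft_model, TaskType

model_name = "google/flan-t5-base"  # assumption: any seq2seq LLM works here
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Wrap the frozen base model with small trainable low-rank adapters.
lora_config = LoraConfig(
    task_type=TaskType.SEQ_2_SEQ_LM,
    r=8,             # rank of the low-rank update matrices
    lora_alpha=32,   # scaling factor applied to the adapter output
    lora_dropout=0.1,
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of all weights

# Summarize the findings section of a report into an impression.
findings = "Findings: Mild cardiomegaly. No focal consolidation or effusion."
inputs = tokenizer("summarize: " + findings, return_tensors="pt")
summary_ids = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```

Because only the adapter matrices are trained, this kind of strategy keeps the memory and compute cost of domain adaptation low relative to full fine-tuning.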
no code implementations • 23 Nov 2022 • Pierre Chambon, Christian Bluethgen, Jean-Benoit Delbrouck, Rogier van der Sluijs, Małgorzata Połacin, Juan Manuel Zambrano Chaves, Tanishq Mathew Abraham, Shivanshu Purohit, Curtis P. Langlotz, Akshay Chaudhari
We present evidence that the resulting model (RoentGen) is able to create visually convincing, diverse synthetic CXR images, and that the output can be controlled to a new extent by using free-form text prompts including radiology-specific language.
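A hedged sketch of sampling from such a text-conditioned latent diffusion model with the diffusers library; the checkpoint path is a hypothetical placeholder, and the prompt and sampling settings are illustrative only:

```python
# Sketch: sampling a synthetic CXR from a fine-tuned latent diffusion model.
# The checkpoint path is a placeholder; the released weights and exact
# sampling settings may differ.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "path/to/cxr-finetuned-stable-diffusion",  # hypothetical checkpoint
    torch_dtype=torch.float16,
).to("cuda")

# Free-form, radiology-specific prompt controlling the generated image.
prompt = "Chest X-ray, PA view: right lower lobe consolidation, small pleural effusion"
image = pipe(prompt, num_inference_steps=50, guidance_scale=4.0).images[0]
image.save("synthetic_cxr.png")
```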
We then conduct extensive experiments to evaluate the performance of models both within and across modality-anatomy pairs in MIMIC-RRS.
To overcome this limitation, we propose a new method, the RadGraph reward, to further improve the factual completeness and correctness of generated radiology reports.
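A simplified sketch of an entity-overlap reward in this spirit: the real method scores entities and relations extracted by the RadGraph model, for which `extract_entities` below is only a token-level stand-in:

```python
# Simplified sketch of an entity-overlap reward: score a generated report by
# F1 overlap of its clinical entities with the reference. `extract_entities`
# stands in for the actual RadGraph parser, which also scores relations.
def extract_entities(report: str) -> set[str]:
    # Placeholder: the real method runs the RadGraph model to obtain
    # (entity, label) pairs; here we crudely fake it with long tokens.
    return {tok.lower().strip(".,") for tok in report.split() if len(tok) > 3}

def radgraph_style_reward(generated: str, reference: str) -> float:
    gen, ref = extract_entities(generated), extract_entities(reference)
    if not gen or not ref:
        return 0.0
    precision = len(gen & ref) / len(gen)
    recall = len(gen & ref) / len(ref)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# The reward can then weight policy-gradient updates (e.g., self-critical
# sequence training) when fine-tuning the report generator.
print(radgraph_style_reward(
    "Mild cardiomegaly without pleural effusion.",
    "Stable mild cardiomegaly. No effusion or pneumothorax.",
))
```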
In this work, we address these challenges by first designing a principled evaluation framework that enables a quantitative comparison of SDMs across 1,235 slice discovery settings in three input domains (natural images, medical images, and time-series data).
This paper presents a new lightweight yet powerful solution to the task of Emotion Recognition and Sentiment Analysis. Expressed sentiment and emotion are two crucial factors in human multimodal language.
Ranked #4 on Multimodal Sentiment Analysis on CMU-MOSEI (using extra training data)
As new datasets for real-world visual reasoning and compositional question answering emerge, it may become necessary to make visual feature extraction an end-to-end part of training.
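A minimal sketch of that idea, assuming a standard PyTorch setup: the visual backbone is trained jointly with the downstream model rather than serving frozen, pre-extracted features. The backbone choice and the reasoning head are illustrative placeholders:

```python
# Sketch: training the visual feature extractor end-to-end instead of
# reading frozen, pre-extracted features. The head is a stand-in module.
import torch
import torch.nn as nn
from torchvision.models import resnet50, ResNet50_Weights

backbone = resnet50(weights=ResNet50_Weights.DEFAULT)
backbone.fc = nn.Identity()      # expose the 2048-d pooled features
for p in backbone.parameters():
    p.requires_grad = True       # backbone is updated, not frozen

head = nn.Linear(2048, 1000)     # placeholder reasoning/answer head
optimizer = torch.optim.Adam(
    list(backbone.parameters()) + list(head.parameters()), lr=1e-5
)

images = torch.randn(8, 3, 224, 224)
targets = torch.randint(0, 1000, (8,))
optimizer.zero_grad()
logits = head(backbone(images))  # gradients flow into the extractor
loss = nn.functional.cross_entropy(logits, targets)
loss.backward()
optimizer.step()
```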
Even with the growing interest in problems at the intersection of Computer Vision and Natural Language, grounding (i.e., identifying) the components of a structured description in an image remains a challenging task.
When searching for an object, humans navigate through a scene using semantic information and spatial relationships.
This paper describes the UMONS solution for the Multimodal Machine Translation Task presented at the Third Conference on Machine Translation (WMT18).
So far, the goal has been to maximize scores on automated metrics, and doing so has required devising a host of new modules and techniques.
We propose a new, fully end-to-end approach to multimodal translation in which the source text encoder modulates the entire visual input processing via conditional batch normalization, in order to compute the image features most informative for the task.
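A minimal sketch of conditional batch normalization, assuming a PyTorch implementation; the feature and text dimensions are illustrative, and the module is a generic version of the idea rather than the paper's exact architecture:

```python
# Minimal sketch of conditional batch normalization: the source-text encoding
# predicts per-channel scale and shift that modulate normalized image features.
import torch
import torch.nn as nn

class ConditionalBatchNorm2d(nn.Module):
    def __init__(self, num_channels: int, text_dim: int):
        super().__init__()
        # Plain BN without its own affine parameters; the text supplies them.
        self.bn = nn.BatchNorm2d(num_channels, affine=False)
        self.gamma = nn.Linear(text_dim, num_channels)  # predicts scale
        self.beta = nn.Linear(text_dim, num_channels)   # predicts shift

    def forward(self, x: torch.Tensor, text: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W) image features; text: (B, text_dim) sentence encoding
        gamma = self.gamma(text).unsqueeze(-1).unsqueeze(-1)  # (B, C, 1, 1)
        beta = self.beta(text).unsqueeze(-1).unsqueeze(-1)
        return (1 + gamma) * self.bn(x) + beta

cbn = ConditionalBatchNorm2d(num_channels=256, text_dim=512)
feats = cbn(torch.randn(4, 256, 32, 32), torch.randn(4, 512))
print(feats.shape)  # torch.Size([4, 256, 32, 32])
```

Predicting a residual scale (1 + gamma) rather than gamma alone is a common stabilization choice, keeping the module close to identity early in training.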
In Multimodal Neural Machine Translation (MNMT), a neural model generates a translated sentence that describes an image, given the image itself and a source description in English.
In state-of-the-art Neural Machine Translation (NMT), an attention mechanism is used during decoding to enhance the translation.
Recently, the effectiveness of the attention mechanism has also been explored for multimodal tasks, where it becomes possible to focus on both sentence parts and image regions.
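A minimal sketch of such dual attention, assuming simple dot-product scoring in PyTorch; the dimensions, the concatenation of the two context vectors, and the function names are illustrative assumptions:

```python
# Sketch of dual attention in multimodal NMT: the decoder state attends
# separately over source-word encodings and image-region features, and the
# two context vectors are combined. Dimensions are illustrative.
import torch
import torch.nn.functional as F

def attend(query: torch.Tensor, keys: torch.Tensor) -> torch.Tensor:
    # query: (B, d); keys: (B, N, d) -> context: (B, d)
    scores = torch.bmm(keys, query.unsqueeze(-1)).squeeze(-1)  # (B, N)
    weights = F.softmax(scores, dim=-1)
    return torch.bmm(weights.unsqueeze(1), keys).squeeze(1)

B, d = 2, 512
decoder_state = torch.randn(B, d)
word_encodings = torch.randn(B, 20, d)   # 20 source tokens
region_features = torch.randn(B, 49, d)  # 7x7 grid of image regions

text_ctx = attend(decoder_state, word_encodings)
image_ctx = attend(decoder_state, region_features)
context = torch.cat([text_ctx, image_ctx], dim=-1)  # fed to the decoder
print(context.shape)  # torch.Size([2, 1024])
```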