19 papers with code • 1 benchmarks • 1 datasets
Multimodal generation refers to the process of generating outputs that incorporate multiple modalities, such as images, text, and sound. This can be done using deep learning models that are trained on data that includes multiple modalities, allowing the models to generate output that is informed by more than one type of data.
For example, a multimodal generation model could be trained to generate captions for images that incorporate both text and visual information. The model could learn to identify objects in the image and generate descriptions of them in natural language, while also taking into account contextual information and the relationships between the objects in the image.
Multimodal generation can also be used in other applications, such as generating realistic images from textual descriptions or generating audio descriptions of video content. By combining multiple modalities in this way, multimodal generation models can produce more accurate and comprehensive output, making them useful for a wide range of applications.
This adversarial loss guarantees the map is diverse -- a very wide range of anime can be produced from a single content code.
To learn a multimodal semantic correlation in a quantized space, we combine VQ-VAE with a Transformer encoder and apply an input masking strategy.
Multimodal Generation of Novel Action Appearances for Synthetic-to-Real Recognition of Activities of Daily Living
We tackle this challenge and introduce an activity domain generation framework which creates novel ADL appearances (novel domains) from different existing activity modalities (source domains) inferred from video training data.
Goal-oriented generative script learning aims to generate subsequent steps to reach a particular goal, which is an essential task to assist robots or humans in performing stereotypical activities.
The recently developed discrete diffusion models perform extraordinarily well in the text-to-image task, showing significant promise for handling the multi-modality signals.
We also introduce a novel reliability parameter that allows using different off-the-shelf diffusion models trained across various datasets during sampling time alone to guide it to the desired outcome satisfying multiple constraints.
Poster generation is a significant task for a wide range of applications, which is often time-consuming and requires lots of manual editing and artistic experience.
We propose an efficient method to ground pretrained text-only language models to the visual domain, enabling them to process arbitrarily interleaved image-and-text data, and generate text interleaved with retrieved images.