Recent text-to-image generative models have demonstrated an unparalleled ability to generate diverse and creative imagery guided by a target text prompt.
To reproduce the success of text-to-image (T2I) generation, recent works in text-to-video (T2V) generation employ large-scale text-video datasets for fine-tuning.
Pre-trained language models have attracted increasing attention in the biomedical domain, inspired by their great success in the general natural language domain.
Ranked #1 on Question Answering on PubMedQA.
The expressive power of Graph Neural Networks (GNNs) has been studied extensively through the Weisfeiler-Leman (WL) graph isomorphism test.
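As a concrete illustration of the WL test mentioned above, here is a minimal sketch of 1-WL color refinement (a generic version for illustration, not any specific construction from the paper). It shows the classic limitation: 1-WL, and hence standard message-passing GNNs, cannot distinguish certain regular graphs.

```python
# Minimal 1-WL color refinement sketch (illustrative, not from the paper).
from collections import Counter

def wl_colors(adj, rounds=3):
    """Run 1-WL color refinement on an adjacency-list graph.

    Returns the final multiset (histogram) of node colors.
    """
    colors = {v: 0 for v in adj}  # start with a uniform coloring
    for _ in range(rounds):
        # A node's new color combines its old color with the
        # multiset of its neighbors' colors.
        sigs = {v: (colors[v], tuple(sorted(colors[u] for u in adj[v])))
                for v in adj}
        # Compress signatures back down to small integer colors.
        palette = {s: i for i, s in enumerate(sorted(set(sigs.values())))}
        colors = {v: palette[sigs[v]] for v in adj}
    return Counter(colors.values())

# Two 6-node 2-regular graphs: a hexagon vs. two disjoint triangles.
hexagon = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}
triangles = {0: [1, 2], 1: [0, 2], 2: [0, 1],
             3: [4, 5], 4: [3, 5], 5: [3, 4]}

# 1-WL assigns both graphs identical color histograms,
# even though they are not isomorphic.
print(wl_colors(hexagon) == wl_colors(triangles))  # True
```

Because every node in both graphs has degree 2, refinement never breaks the symmetry, so the two non-isomorphic graphs get identical color histograms.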
We propose a method for editing images from human instructions: given an input image and a written instruction that tells the model what to do, our model follows these instructions to edit the image.
The recent surge in popularity of diffusion models for image generation has brought new attention to the potential of these models in other areas of media generation.
We propose Cut-and-LEaRn (CutLER), a simple approach for training unsupervised object detection and segmentation models.
The cost of vision-and-language pre-training has become increasingly prohibitive due to end-to-end training of large-scale models.
Ranked #1 on Image Retrieval on COCO.
To overcome the problem, we propose a prompt retrieval framework to automate the selection of in-context examples.
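The idea of retrieving in-context examples can be sketched generically as nearest-neighbor search over embeddings. The snippet below is an assumption-laden toy version: the bag-of-words `embed` function stands in for the learned retriever the framework would actually use, and the names (`retrieve_examples`, `pool`) are hypothetical.

```python
# Hedged sketch of prompt retrieval: pick the k training examples most
# similar to the query and use them as in-context demonstrations.
# The bag-of-words embedding below is a toy stand-in for a learned encoder.
import math
from collections import Counter

def embed(text):
    # Toy embedding: token-count vector (a real system uses a sentence encoder).
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve_examples(query, pool, k=2):
    # Rank the candidate pool by similarity to the query; keep the top k.
    q = embed(query)
    ranked = sorted(pool, key=lambda ex: cosine(q, embed(ex)), reverse=True)
    return ranked[:k]

pool = [
    "Translate 'bonjour' to English.",
    "Summarize the following paragraph.",
    "Translate 'gracias' to English.",
    "What is the capital of France?",
]
# For a translation query, the two translation demonstrations rank highest.
print(retrieve_examples("Translate 'danke' to English.", pool, k=2))
```

The retrieved examples would then be prepended to the prompt as demonstrations, which is the selection step the framework automates.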
By learning the latent representations of audio signals and their compositions without modeling the cross-modal relationship, AudioLDM achieves advantages in both generation quality and computational efficiency.