The unification brings two benefits: 1) it allows GLIP to learn from both detection and grounding data, improving both tasks and bootstrapping a good grounding model; 2) GLIP can leverage massive image-text pairs by generating grounding boxes in a self-training fashion, making the learned representations semantically rich.
Ranked #1 on Phrase Grounding on Flickr30k Entities Test (using extra training data)
A reverse dictionary takes descriptions of words as input and outputs words semantically matching the input descriptions.
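The core retrieval step can be sketched as ranking candidate words by how well their stored definitions match the input description. The toy below uses bag-of-words Jaccard overlap as the matching function; a real reverse dictionary, such as the paper's model, would use learned semantic embeddings, and the glosses here are illustrative placeholders.

```python
# Toy reverse dictionary: rank candidate words by lexical overlap
# between the input description and each word's stored gloss.

def jaccard(a: set, b: set) -> float:
    """Jaccard similarity between two token sets."""
    return len(a & b) / len(a | b) if a | b else 0.0

def reverse_lookup(description: str, glosses: dict) -> list:
    """Return candidate words sorted best-first by similarity to the description."""
    query = set(description.lower().split())
    scored = [(jaccard(query, set(g.lower().split())), w)
              for w, g in glosses.items()]
    return [w for _, w in sorted(scored, reverse=True)]

glosses = {
    "telescope": "an instrument for viewing distant objects",
    "microscope": "an instrument for viewing very small objects",
    "binoculars": "a handheld instrument for viewing distant objects with both eyes",
}
print(reverse_lookup("instrument for viewing distant objects", glosses)[0])  # → telescope
```

Swapping the overlap function for cosine similarity over sentence embeddings turns this sketch into the embedding-based formulation the task actually uses.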
To further improve performance, we introduce a skeleton-based search space that reduces false positive detections.
YourTTS brings the power of a multilingual approach to the task of zero-shot multi-speaker TTS.
Blind face restoration usually relies on facial priors, such as a facial geometry prior or a reference prior, to restore realistic and faithful details.
Ranked #1 on Blind Face Restoration on CelebA-Test
While only the semantics of each task differ, current research focuses on designing specialized architectures for each task.
Ranked #1 on Panoptic Segmentation on COCO minival
To modify style, we obtain a similarity score between a text prompt (describing the style) and a stylized mesh by harnessing the representational power of CLIP.
Ranked #1 on Neural Stylization on Meshes
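The similarity score described above can be sketched minimally: both the style prompt and a render of the stylized mesh are mapped into a shared embedding space (by CLIP's text and image encoders in the paper), and their cosine similarity is the quantity to maximize. The 4-d vectors below are toy stand-ins for real CLIP embeddings.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

text_emb = [0.2, 0.9, 0.1, 0.3]      # embedding of the style prompt (toy)
mesh_emb = [0.25, 0.85, 0.05, 0.35]  # embedding of the rendered mesh (toy)
print(round(cosine_similarity(text_emb, mesh_emb), 3))
```

In the full method this score is differentiable through the renderer and the encoders, so mesh style parameters can be updated by gradient ascent on it.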
We perform subjective and objective evaluations to compare the performance of each vocoder along different axes.
We approach text-to-image generation by combining the power of the pretrained CLIP representation with an off-the-shelf image generator (a GAN), optimizing in the latent space of the GAN to find images that achieve the maximum CLIP score for the given input text.
Ranked #1 on Zero-Shot Text-to-Image Generation on COCO
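The latent-space optimization loop can be illustrated with a toy stand-in: a fixed linear map plays the role of GAN-plus-CLIP-image-encoder, and simple random-search hill climbing adjusts the latent code to maximize cosine similarity with a fixed text embedding. Methods of this kind use gradients through the real models; everything below (the map, the embeddings, the search) is a hypothetical sketch.

```python
import math
import random

random.seed(0)
TEXT = [0.6, 0.8]  # toy text embedding (unit norm)

def generate(z):
    """Stand-in for GAN + CLIP image encoder: a fixed linear map of latent z."""
    return [z[0] + 0.5 * z[1], 0.5 * z[0] + z[1]]

def clip_score(z):
    """Cosine similarity between the 'generated image' embedding and TEXT."""
    img = generate(z)
    dot = sum(a * b for a, b in zip(img, TEXT))
    norm = math.sqrt(sum(a * a for a in img)) * math.sqrt(sum(a * a for a in TEXT))
    return dot / norm if norm else -1.0

z0 = [random.gauss(0, 1), random.gauss(0, 1)]  # random starting latent
z, best = z0[:], clip_score(z0)
for _ in range(500):  # random-search hill climbing in latent space
    cand = [zi + random.gauss(0, 0.1) for zi in z]
    s = clip_score(cand)
    if s > best:
        z, best = cand, s
print(round(best, 3))
```

The loop only ever accepts latents that raise the score, so the final score is at least the starting one; gradient-based optimization replaces the random perturbations with steps along the CLIP-score gradient.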