Text-to-image synthesis has recently seen significant progress thanks to large pretrained language models, large-scale training data, and the introduction of scalable model families such as diffusion and autoregressive models.
Ranked #9 on Text-to-Image Generation on COCO
We introduce k-planes, a white-box model for radiance fields in arbitrary dimensions.
Ranked #1 on Novel View Synthesis on LLFF
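A minimal sketch of the planar factorization behind the k-planes model above: a d-dimensional space is represented by one 2D feature grid per pair of axes, and a point's feature is obtained by projecting it onto each plane, bilinearly interpolating, and combining the per-plane results. The grid sizes, feature dimension, and elementwise-product combination here are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch of a k-planes-style feature lookup (not the paper's code).
import numpy as np
from itertools import combinations

def make_planes(dim, resolution, n_features, rng):
    # One 2D feature grid per pair of axes: C(dim, 2) planes in total.
    return {
        axes: rng.standard_normal((resolution, resolution, n_features)).astype(np.float32)
        for axes in combinations(range(dim), 2)
    }

def bilinear(plane, u, v):
    # Bilinearly interpolate an (R, R, F) grid at continuous coords (u, v) in [0, R-1].
    R = plane.shape[0]
    u0, v0 = int(np.floor(u)), int(np.floor(v))
    u1, v1 = min(u0 + 1, R - 1), min(v0 + 1, R - 1)
    du, dv = u - u0, v - v0
    top = plane[u0, v0] * (1 - dv) + plane[u0, v1] * dv
    bot = plane[u1, v0] * (1 - dv) + plane[u1, v1] * dv
    return top * (1 - du) + bot * du

def kplanes_features(planes, point, resolution):
    # Project the d-dimensional point onto each axis-aligned plane,
    # interpolate, and combine per-plane features by elementwise product.
    feats = None
    for (i, j), plane in planes.items():
        u = point[i] * (resolution - 1)   # point coordinates assumed in [0, 1]
        v = point[j] * (resolution - 1)
        f = bilinear(plane, u, v)
        feats = f if feats is None else feats * f
    return feats  # a full model would feed this to a small decoder (e.g. an MLP)

rng = np.random.default_rng(0)
planes = make_planes(dim=4, resolution=32, n_features=16, rng=rng)  # e.g. x, y, z, t
print(kplanes_features(planes, np.array([0.2, 0.5, 0.7, 0.1]), 32).shape)  # (16,)
```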
To reproduce the success of text-to-image (T2I) generation, recent works in text-to-video (T2V) generation employ large-scale text-video datasets for fine-tuning.
In this paper, we demonstrate that casting blind face restoration as a code prediction task over a learned discrete codebook prior in a small proxy space largely reduces the uncertainty and ambiguity of the restoration mapping, while providing rich visual atoms for generating high-quality faces.
Ranked #1 on Blind Face Restoration on WIDER
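The code-prediction formulation above can be sketched as follows: degraded-face features are mapped to logits over a learned codebook, the predicted indices select discrete codebook entries ("visual atoms"), and a decoder would map those back to pixels. The predictor, codebook size, and feature shapes here are hypothetical stand-ins, not the paper's architecture.

```python
# Illustrative sketch of restoration as discrete code prediction (stand-in components).
import numpy as np

rng = np.random.default_rng(0)
CODEBOOK_SIZE, CODE_DIM, N_TOKENS = 1024, 256, 16 * 16

codebook = rng.standard_normal((CODEBOOK_SIZE, CODE_DIM))      # learned visual atoms (random here)

def predict_code_logits(lq_features):
    # Stand-in for a learned predictor that maps degraded-image features
    # to logits over codebook entries, one per spatial token.
    W = rng.standard_normal((lq_features.shape[-1], CODEBOOK_SIZE)) * 0.01
    return lq_features @ W                                      # (N_TOKENS, CODEBOOK_SIZE)

def restore(lq_features):
    logits = predict_code_logits(lq_features)
    indices = logits.argmax(axis=-1)                            # code prediction, not feature regression
    quantized = codebook[indices]                               # (N_TOKENS, CODE_DIM)
    return quantized                                            # a decoder would map this back to a face image

lq_features = rng.standard_normal((N_TOKENS, 512))              # hypothetical encoder output
print(restore(lq_features).shape)                               # (256, 256)
```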
In this report, we present a fast and accurate object detection method dubbed DAMO-YOLO, which achieves higher performance than the state-of-the-art YOLO series.
Ranked #12 on Real-Time Object Detection on COCO
We present Fast Language-Image Pre-training (FLIP), a simple and more efficient method for training CLIP.
Attention-based models trained on protein sequences have demonstrated incredible success at classification and generation tasks relevant for artificial intelligence-driven protein design.
With such multi-dimensional and multi-scale factorization, our MorphMLP block achieves a strong accuracy-computation trade-off.
Ranked #18 on Action Recognition on Something-Something V2 (using extra training data)
Retrieval-augmented in-context learning has emerged as a powerful approach for addressing knowledge-intensive tasks using frozen language models (LMs) and retrieval models (RMs).
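A minimal sketch of the retrieval-augmented in-context learning pattern described above: a frozen RM fetches relevant passages, which are packed into the prompt of a frozen LM, with no parameter updates anywhere. The `retrieve` and `generate` callables are hypothetical stand-ins; no specific framework's API is implied.

```python
# Illustrative retrieval-augmented in-context learning loop with frozen models.
from typing import Callable, List

def answer(question: str,
           retrieve: Callable[[str, int], List[str]],
           generate: Callable[[str], str],
           k: int = 3) -> str:
    # 1. Frozen retrieval model: fetch the k most relevant passages.
    passages = retrieve(question, k)
    # 2. Pack passages and question into one prompt; no weights are updated.
    context = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    prompt = (
        "Answer the question using only the passages below.\n\n"
        f"{context}\n\nQuestion: {question}\nAnswer:"
    )
    # 3. Frozen language model conditions on the retrieved context in-context.
    return generate(prompt)

# Toy usage with stub models.
docs = ["COCO is a common object detection benchmark.", "LLFF contains forward-facing scenes."]
stub_retrieve = lambda q, k: docs[:k]
stub_generate = lambda prompt: prompt.splitlines()[-1]   # placeholder: echoes the final prompt line
print(answer("What is COCO?", stub_retrieve, stub_generate))
```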
This is the first use of sparse convolution for 2D masked modeling.
Ranked #1 on Instance Segmentation on COCO 2017 val
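The core idea of sparse convolution for masked modeling, mentioned above, is that only visible (unmasked) positions participate in the convolution. The sketch below emulates this on a dense grid by zeroing masked positions before and after a naive convolution, so masked regions neither receive nor leak information; a real implementation would use a sparse-tensor library rather than this dense emulation, and the shapes and masking ratio are illustrative assumptions.

```python
# Dense emulation of convolving only over visible positions during masked modeling.
import numpy as np

def masked_conv2d(x, mask, kernel):
    # x: (H, W) feature map; mask: (H, W) with 1 = visible, 0 = masked; kernel: (k, k).
    H, W = x.shape
    k = kernel.shape[0]
    pad = k // 2
    xp = np.pad(x * mask, pad)                  # masked inputs contribute nothing
    out = np.zeros_like(x)
    for i in range(H):
        for j in range(W):
            out[i, j] = np.sum(xp[i:i + k, j:j + k] * kernel)
    return out * mask                           # keep outputs only at visible positions

rng = np.random.default_rng(0)
feat = rng.standard_normal((8, 8))
mask = (rng.random((8, 8)) > 0.6).astype(feat.dtype)   # roughly 40% of positions visible
kern = rng.standard_normal((3, 3)) / 9.0
print(masked_conv2d(feat, mask, kern).shape)           # (8, 8)
```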