To reproduce the success of text-to-image (T2I) generation, recent works in text-to-video (T2V) generation employ large-scale text-video datasets for fine-tuning.
In this paper, we answer these questions by first defining human-level quality based on the statistical significance of subjective measures and introducing appropriate guidelines to judge it, and then developing a TTS system called NaturalSpeech that achieves human-level quality on a benchmark dataset.
Ranked #1 on Text-To-Speech Synthesis on LJSpeech
We also perform a case study of a large codebase where PyGlove led to an 80% reduction in the number of lines of code.
Proteins power a vast array of functional processes in living cells.
This paper presents a spiking neural network (SNN) accelerator made using fully open-source EDA tools, process design kit (PDK), and memory macros synthesized using OpenRAM.
Accelerated MRI aims to find a sampler and reconstructor pair that reduces acquisition time while maintaining reconstruction quality.
Recent text-to-image generative models have demonstrated an unparalleled ability to generate diverse and creative imagery guided by a target text prompt.
This is the first use of sparse convolution for 2D masked modeling.
Ranked #1 on Instance Segmentation on COCO 2017 val
We present a new method for lightweight novel-view synthesis that generalizes to an arbitrary forward-facing scene.
In this work, we present a conceptually simple and effective method to train a strong bilingual/multilingual multimodal representation model.