We present Imagen, a text-to-image diffusion model with an unprecedented degree of photorealism and a deep level of language understanding.
Ranked #1 on Text-to-Image Generation on COCO (using extra training data)
PaddleSpeech is an open-source all-in-one speech toolkit.
Automatic Speech Recognition
Environmental Sound Classification
We present SymForce, a library for fast symbolic computation, code generation, and nonlinear optimization for robotics applications like computer vision, motion planning, and controls.
This paper describes the system developed by the NPU team for the 2020 personalized voice trigger challenge.
We propose a new method named OnePose for object pose estimation.
To facilitate optimal control applications and in particular sampling and finite differencing, the dynamics can be evaluated for different states and controls in parallel.
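A minimal sketch of the pattern described above: finite differencing needs the dynamics at many perturbed states, and those evaluations are independent, so they can be dispatched together. Everything named here is an assumption made for illustration and not the toolkit's API: `pendulum_step` is a toy dynamics function and `finite_difference_jacobian` is a hypothetical helper. Python threads only illustrate the dispatch pattern; they do not give true parallel speedup for pure-Python code.

```python
from concurrent.futures import ThreadPoolExecutor
import math


def pendulum_step(state, control, dt=0.01):
    """Toy dynamics (illustrative assumption): one Euler step of a damped pendulum."""
    theta, omega = state
    alpha = -9.81 * math.sin(theta) - 0.1 * omega + control
    return (theta + dt * omega, omega + dt * alpha)


def finite_difference_jacobian(step, state, control, eps=1e-6):
    """Central-difference Jacobian of step() with respect to the state.

    All perturbed states are built first, then evaluated in one batch,
    because each evaluation is independent of the others.
    """
    n = len(state)
    perturbed = []
    for i in range(n):
        for sign in (1.0, -1.0):
            s = list(state)
            s[i] += sign * eps
            perturbed.append(tuple(s))

    # Evaluate the dynamics for every perturbed state concurrently.
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(lambda s: step(s, control), perturbed))

    # jac[r][c] approximates d(next_state[r]) / d(state[c]).
    jac = [[0.0] * n for _ in range(n)]
    for c in range(n):
        plus, minus = results[2 * c], results[2 * c + 1]
        for r in range(n):
            jac[r][c] = (plus[r] - minus[r]) / (2.0 * eps)
    return jac


if __name__ == "__main__":
    for row in finite_difference_jacobian(pendulum_step, (0.0, 0.0), 0.0):
        print(row)
```

The same batching idea applies to sampling-based control, where many candidate control sequences are rolled out from the same state.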
Recent studies show that Transformers have a strong capability for building long-range dependencies, yet are poor at capturing the high frequencies that predominantly convey local information.
Toward more descriptive and distinctive caption generation, we propose using CLIP, a multimodal encoder trained on a huge number of image-text pairs from the web, to calculate multimodal similarity and use it as a reward function.
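As a rough illustration of using embedding similarity as a reward, the sketch below scores captions by cosine similarity to an image embedding. It does not call CLIP: the embeddings are placeholder vectors, and `similarity_reward` and `pick_best_caption` are hypothetical names. The exact reward form used in the paper is not stated in this listing, so plain cosine similarity is an assumption.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def similarity_reward(image_embedding, caption_embedding):
    """Reward for a caption: similarity of its embedding to the image embedding.

    Assumption: both embeddings come from the same multimodal encoder.
    """
    return cosine_similarity(image_embedding, caption_embedding)


def pick_best_caption(image_embedding, caption_embeddings):
    """Return the index of the candidate caption with the highest reward."""
    rewards = [similarity_reward(image_embedding, c) for c in caption_embeddings]
    return max(range(len(rewards)), key=rewards.__getitem__)


if __name__ == "__main__":
    image = [0.9, 0.1, 0.0]                          # placeholder image embedding
    captions = [[0.0, 1.0, 0.0], [1.0, 0.2, 0.0]]    # placeholder caption embeddings
    print(pick_best_caption(image, captions))
```

In a reinforcement-learning setup, a scalar like this would stand in for a hand-designed caption metric when weighting sampled captions.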
We introduce Ivy, a templated Deep Learning (DL) framework which abstracts existing DL frameworks.
Contrastive models like CLIP have been shown to learn robust representations of images that capture both semantics and style.
Ranked #5 on Text-to-Image Generation on COCO (using extra training data)
Conditional Image Generation
Zero-Shot Text-to-Image Generation