Masked AutoEncoder (MAE) has recently led the trend in visual self-supervised learning with an elegant asymmetric encoder-decoder design, which significantly improves both pre-training efficiency and fine-tuning accuracy.
Ranked #16 on Object Detection on COCO minival
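The core of MAE's asymmetric design is that the encoder only ever sees a small random subset of image patches; the rest are masked out and reconstructed by a lightweight decoder. A minimal sketch of that random-masking step (the patch count, dimension, and 75% ratio follow the common ViT-B/16 setup and are assumptions, not taken from this text):

```python
import numpy as np

def random_masking(patches, mask_ratio=0.75, rng=None):
    """MAE-style random masking: keep a random subset of patches.

    patches: (num_patches, dim) array of flattened patch embeddings.
    Returns the visible patches and the kept indices, which a decoder
    would later use to restore the original patch ordering.
    """
    rng = np.random.default_rng(rng)
    num_patches = patches.shape[0]
    num_keep = int(num_patches * (1 - mask_ratio))
    # Shuffle patch indices and keep the first num_keep of them.
    perm = rng.permutation(num_patches)
    keep_idx = np.sort(perm[:num_keep])
    return patches[keep_idx], keep_idx

# A 14x14 grid of 768-dim patch embeddings (ViT-B/16 on a 224x224 image).
patches = np.random.randn(196, 768)
visible, keep_idx = random_masking(patches, mask_ratio=0.75, rng=0)
print(visible.shape)  # (49, 768): the encoder processes only 25% of patches
```

Because the heavy encoder runs on just the visible quarter of the patches, pre-training is much cheaper than running a full ViT over every patch.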
Recently, Transformer- and Convolutional Neural Network (CNN)-based models have shown promising results in Automatic Speech Recognition (ASR), outperforming Recurrent Neural Networks (RNNs).
Ranked #9 on Speech Recognition on LibriSpeech test-clean
We introduce Ivy, a templated Deep Learning (DL) framework which abstracts existing DL frameworks.
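The idea behind such an abstraction layer is a thin dispatch table: user code calls one common function name, and the framework routes it to whichever backend is currently active. The toy sketch below illustrates that pattern only — it is not Ivy's actual API, and only a NumPy backend is registered:

```python
import numpy as np

# Hypothetical dispatch layer (illustrative only, NOT Ivy's real API):
# each backend maps common operation names to its own implementations.
_BACKENDS = {"numpy": {"matmul": np.matmul, "mean": np.mean}}
_active = "numpy"

def set_backend(name):
    """Select which registered framework handles subsequent calls."""
    global _active
    if name not in _BACKENDS:
        raise ValueError(f"unknown backend: {name}")
    _active = name

def matmul(a, b):
    # Route the framework-agnostic call to the active backend.
    return _BACKENDS[_active]["matmul"](a, b)

set_backend("numpy")
out = matmul(np.eye(2), np.array([[1.0, 2.0], [3.0, 4.0]]))
print(out)  # identity @ M returns M unchanged
```

In a real templated framework the same user code would run unmodified once a PyTorch, TensorFlow, or JAX backend is registered in place of NumPy.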
PaddleSpeech is an open-source all-in-one speech toolkit.
To facilitate optimal control applications and in particular sampling and finite differencing, the dynamics can be evaluated for different states and controls in parallel.
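Batched dynamics evaluation is what makes sampling-based control and finite differencing cheap: one vectorized call evaluates many (state, control) pairs at once, so derivatives fall out of paired perturbed evaluations. A minimal sketch with toy point-mass dynamics (the model and step size are assumptions for illustration):

```python
import numpy as np

def dynamics(states, controls, dt=0.01):
    """Toy point-mass step: pos' = pos + dt*vel, vel' = vel + dt*u.

    Vectorized over a batch: states is (N, 2), controls is (N, 1),
    so N different (state, control) pairs are advanced in one call.
    """
    pos, vel = states[:, :1], states[:, 1:]
    return np.hstack([pos + dt * vel, vel + dt * controls])

# Central finite difference of next-state velocity w.r.t. the control,
# computed by evaluating the +eps and -eps perturbations in one batch.
eps = 1e-4
x = np.tile(np.array([[0.0, 1.0]]), (2, 1))  # same state, duplicated
u = np.array([[1.0 + eps], [1.0 - eps]])     # control perturbed +/- eps
next_states = dynamics(x, u)
dvel_du = (next_states[0, 1] - next_states[1, 1]) / (2 * eps)
print(round(dvel_du, 6))  # analytically d(vel')/du = dt = 0.01
```

Stacking thousands of perturbed rollouts into one batch in this way is what parallel dynamics evaluation buys for sampling and finite-differencing-based optimal control.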
Deep learning is increasingly moving towards a transfer learning paradigm whereby large foundation models are fine-tuned on downstream tasks, starting from an initialization learned on the source task.
Specifically, BEVerse first performs shared feature extraction and lifting to generate 4D BEV representations from multi-timestamp and multi-view images.
Also, FastDiff enables sampling 58x faster than real-time on a V100 GPU, making diffusion models practically applicable to speech synthesis deployment for the first time.
Ranked #6 on Text-To-Speech Synthesis on LJSpeech
We present the Berkeley Crossword Solver, a state-of-the-art approach for automatically solving crossword puzzles.
We present an efficient method for joint optimization of topology, materials and lighting from multi-view image observations.