Vision-language pre-training (VLP) on large-scale datasets has shown premier performance on various downstream tasks.
Basically, there are two main methods for SLU tasks: (1) Two-stage method, which uses a speech model to transfer speech to text, then uses a language model to get the results of downstream tasks; (2) One-stage method, which just fine-tunes a pre-trained speech model to fit in the downstream tasks.
Panoramic segmentation is a scene where image segmentation tasks is more difficult.
Face detection has witnessed significant progress due to the advances of deep convolutional neural networks (CNNs).
Ranked #1 on Face Detection on WIDER Face (Easy)