In this work we focus on low-resource spoken named entity recognition (NER) and address the question: Beyond self-supervised pre-training, how can we use external speech and/or text data that are not annotated for the task?
Historically, speech benchmarks have focused on automatic speech recognition (ASR), speaker identification, or other lower-level tasks.
Recently proposed self-supervised learning approaches have been successful for pre-training speech representation models.
Many computer vision tasks address the problem of scene understanding and are naturally interrelated, e.g., object classification, detection, scene segmentation, and depth estimation.
We leverage unsupervised learning of depth, egomotion, and camera intrinsics to improve the performance of single-image semantic segmentation, by enforcing 3D-geometric and temporal consistency of segmentation masks across video frames.
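To make the cross-frame consistency idea concrete, here is a minimal PyTorch sketch, not the paper's implementation: it warps a per-class segmentation mask from frame t+1 back into frame t's pixel grid using predicted depth, egomotion, and camera intrinsics, then penalizes disagreement with the mask predicted directly on frame t. All function names, tensor shapes, and the L1 penalty are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def warp_mask_to_frame_t(mask_t1, depth_t, pose_t_to_t1, K):
    """Resample the frame-(t+1) mask into frame t's pixel grid.

    mask_t1:      (B, C, H, W) per-class probabilities at frame t+1
    depth_t:      (B, 1, H, W) predicted depth at frame t
    pose_t_to_t1: (B, 4, 4) predicted rigid transform from t to t+1
    K:            (B, 3, 3) predicted camera intrinsics
    """
    B, _, H, W = depth_t.shape
    device = depth_t.device

    # Pixel grid of frame t in homogeneous coordinates: (3, H*W).
    ys, xs = torch.meshgrid(
        torch.arange(H, device=device, dtype=torch.float32),
        torch.arange(W, device=device, dtype=torch.float32),
        indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)], dim=0).reshape(3, -1)

    # Back-project to 3D camera coordinates at frame t: X = D * K^{-1} p.
    cam_pts = torch.inverse(K) @ pix.unsqueeze(0)        # (B, 3, H*W)
    cam_pts = cam_pts * depth_t.reshape(B, 1, -1)

    # Move the points into frame t+1's camera via the egomotion estimate.
    ones = torch.ones(B, 1, H * W, device=device)
    cam_pts_t1 = (pose_t_to_t1 @ torch.cat([cam_pts, ones], dim=1))[:, :3]

    # Project into frame t+1's image plane (assumes points stay in front
    # of the camera; the clamp only guards against division by zero).
    proj = K @ cam_pts_t1
    uv = proj[:, :2] / proj[:, 2:].clamp(min=1e-6)       # (B, 2, H*W)

    # Normalize to [-1, 1] and bilinearly sample the frame-(t+1) mask.
    u = 2.0 * uv[:, 0] / (W - 1) - 1.0
    v = 2.0 * uv[:, 1] / (H - 1) - 1.0
    grid = torch.stack([u, v], dim=-1).reshape(B, H, W, 2)
    return F.grid_sample(mask_t1, grid, align_corners=True)

def consistency_loss(mask_t, mask_t1, depth_t, pose_t_to_t1, K):
    # A pixel should keep its class label under the estimated 3D motion.
    warped = warp_mask_to_frame_t(mask_t1, depth_t, pose_t_to_t1, K)
    return F.l1_loss(warped, mask_t)
```

In this formulation the consistency term is self-supervised: it needs no extra labels beyond the per-frame segmentation predictions, since depth, pose, and intrinsics are themselves learned without annotation.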
Recent work has shown that speech paired with images can be used to learn semantically meaningful speech representations even without any textual supervision.
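As one illustration of how paired speech and images can supervise each other without transcripts, the sketch below uses a symmetric InfoNCE-style contrastive loss between audio and image embeddings from separate encoders. The loss form, names, and temperature are assumptions for illustration (early visually-grounded speech models used related margin-based objectives), not a specific paper's recipe.

```python
import torch
import torch.nn.functional as F

def audio_image_contrastive_loss(audio_emb, image_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of matched (speech, image) pairs.

    audio_emb, image_emb: (B, D) embeddings from an audio encoder and an
    image encoder; the i-th spoken caption describes the i-th image.
    No textual supervision is involved.
    """
    a = F.normalize(audio_emb, dim=-1)
    v = F.normalize(image_emb, dim=-1)
    logits = a @ v.t() / temperature           # (B, B) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)
    # Matched pairs sit on the diagonal; all other pairs act as negatives,
    # pulling each clip toward its image and apart from the rest.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```

Because the only training signal is which audio clip goes with which image, the speech encoder is pushed to capture semantic content (objects, scenes) rather than speaker or channel characteristics.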