Increasing model size when pretraining natural language representations often results in improved performance on downstream tasks.
State of the art for Natural Language Inference on QNLI.
While representations are learned from an unlabeled collection of task-related videos, robot behaviors such as pouring are learned by watching a single third-person demonstration by a human.
We propose vq-wav2vec to learn discrete representations of audio segments through a wav2vec-style self-supervised context prediction task.
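The core discretization step can be illustrated with a minimal sketch: continuous frame embeddings are snapped to their nearest codebook entry, yielding one discrete token per frame. This is a simplified nearest-neighbor variant for illustration only; the paper itself uses Gumbel-softmax or online k-means quantizers, and the function and variable names here are hypothetical.

```python
import numpy as np

def quantize(z, codebook):
    """Nearest-neighbor vector quantization (illustrative sketch).

    z:        (T, d) continuous frame embeddings
    codebook: (V, d) learnable codebook of V discrete entries
    Returns the quantized embeddings and the discrete token ids.
    """
    # squared Euclidean distance from every frame to every codebook entry
    d2 = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)  # (T, V)
    idx = d2.argmin(axis=1)        # discrete code id per frame
    return codebook[idx], idx      # quantized vectors, token sequence
```

The resulting token sequence is what makes the audio amenable to downstream models that expect discrete inputs, such as standard NLP-style language models.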
Per-pixel ground-truth depth data is challenging to acquire at scale.
This paper presents SimCLR: a simple framework for contrastive learning of visual representations.