A well-trained lens with a ViT backbone has the potential to serve as one of these foundation models, supervising the learning of subsequent modalities.
Ranked #1 on Zero-shot 3D classification on Objaverse LVIS (using extra training data)
To replicate the success of text-to-image (T2I) generation, recent works employ large-scale video datasets to train a text-to-video (T2V) generator.
For example, when we learn mathematics at school, we build upon our knowledge of addition to learn multiplication.
For the class-center involved triplet loss, the positive and negative samples in each triplet are replaced by their corresponding class centers, which enforces data representations of the same class closer to the class center.