Vision and Language Pre-Trained Models

Computer Vision • 31 methods

Vision and Language Pre-Trained Models adapt pre-training to Vision-and-Language (V-L) learning in order to improve performance on downstream tasks such as visual question answering and visual captioning.

According to Du et al. (2022), information coming from the different modalities can be encoded in three ways: fusion encoder, dual encoder, and a combination of both.
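To make the distinction concrete, here is a minimal toy sketch in plain Python. It is not taken from Du et al. (2022) or from any listed method. The function names (`dual_encoder_score`, `fusion_encoder_output`), the single-head attention with identity projections, and the tiny hand-written token vectors are all assumptions chosen for illustration. The dual encoder processes each modality on its own and lets them meet only at a similarity score, while the fusion encoder places both modalities in one sequence so that attention can mix them.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Toy single-head self-attention with identity Q/K/V projections."""
    dim = len(tokens[0])
    scale = math.sqrt(dim)
    out = []
    for q in tokens:
        weights = softmax([dot(q, k) / scale for k in tokens])
        out.append([sum(w * v[i] for w, v in zip(weights, tokens))
                    for i in range(dim)])
    return out


def mean_pool(tokens):
    n = len(tokens)
    return [sum(col) / n for col in zip(*tokens)]


def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def dual_encoder_score(image_tokens, text_tokens):
    # Each modality is encoded separately; the only interaction is the
    # similarity between the two pooled embeddings.
    image_vec = mean_pool(self_attention(image_tokens))
    text_vec = mean_pool(self_attention(text_tokens))
    return cosine(image_vec, text_vec)


def fusion_encoder_output(image_tokens, text_tokens):
    # Both modalities share one sequence, so every token can attend to
    # tokens of the other modality.
    return self_attention(image_tokens + text_tokens)


# Hypothetical 3-dimensional features: 2 image regions and 3 text tokens.
image_tokens = [[1.0, 0.0, 0.5], [0.2, 0.9, 0.1]]
text_tokens = [[0.9, 0.1, 0.4], [0.0, 1.0, 0.0], [0.3, 0.3, 0.3]]

score = dual_encoder_score(image_tokens, text_tokens)
fused = fusion_encoder_output(image_tokens, text_tokens)
print("dual-encoder similarity:", score)
print("fused sequence length:", len(fused))
```

Under these assumptions, a combined design would compute both: the inexpensive similarity score for retrieval-style matching, and the fused sequence for tasks that need fine-grained cross-modal interaction.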

Method Year Papers
2021 1940
2021 1443
2022 44
2019 39
2019 29
2019 22
2020 22
2022 18
2021 17
2021 11
2021 9
2021 6
2021 5
2000 5
2019 4
2020 4
2021 4
2021 4
2021 3
2021 3
2023 3
2020 2
2000 2
2022 2
2022 2
2020 1
2019 1
2020 1
2021 1
2022 1