Computer Vision • 31 methods
This category covers models that adapt pre-training to Vision-and-Language (V-L) learning and improve performance on downstream tasks such as visual question answering and visual captioning.
According to Du et al. (2022), information from the different modalities can be encoded in three ways: with a fusion encoder, with a dual encoder, or with a combination of both.
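The difference between the first two encoding styles can be illustrated with a toy sketch. This is not taken from Du et al. or from any listed method. It assumes a single-head attention layer with identity projections as a stand-in for a real Transformer encoder, and the function names (`self_attention`, `dual_encoder_score`, `fusion_encoder`) are hypothetical. The fusion variant shown is the simple concatenation form. Cross-attention between separate streams is another common form and is not shown.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def mean_pool(tokens):
    """Average a list of token vectors into one vector."""
    n = len(tokens)
    return [sum(col) / n for col in zip(*tokens)]


def self_attention(tokens):
    """Toy single-head attention with identity projections (assumption:
    a stand-in for a full Transformer layer, not a real model)."""
    scale = math.sqrt(len(tokens[0]))
    out = []
    for q in tokens:
        weights = softmax([dot(q, k) / scale for k in tokens])
        out.append([sum(w * v[i] for w, v in zip(weights, tokens))
                    for i in range(len(q))])
    return out


def dual_encoder_score(image_tokens, text_tokens):
    """Dual encoder: each modality is encoded on its own, and the two
    pooled embeddings only meet in a cosine similarity."""
    img = mean_pool(self_attention(image_tokens))
    txt = mean_pool(self_attention(text_tokens))
    return dot(img, txt) / (math.sqrt(dot(img, img)) * math.sqrt(dot(txt, txt)))


def fusion_encoder(image_tokens, text_tokens):
    """Fusion encoder (concatenation form): both token sequences go
    through one attention layer, so every token attends across modalities."""
    return self_attention(image_tokens + text_tokens)


# Hypothetical 2-dimensional features: two image regions and one word.
image_tokens = [[1.0, 0.0], [0.0, 1.0]]
text_tokens = [[1.0, 1.0]]

score = dual_encoder_score(image_tokens, text_tokens)
fused = fusion_encoder(image_tokens, text_tokens)
```

In this sketch the dual encoder returns one similarity score per image-text pair, which suits retrieval because each side can be embedded independently. The fusion encoder returns one contextualised vector per input token, which suits tasks that need fine-grained interaction between words and image regions. A combined design would run the separate encoders first and pass their outputs to a fusion module.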
| Method | Year | Papers |
|---|---|---|
| | 2021 | 4622 |
| | 2021 | 2747 |
| | 2022 | 81 |
| | 2019 | 40 |
| | 2020 | 34 |
| | 2019 | 30 |
| | 2022 | 27 |
| | 2019 | 24 |
| | 2021 | 17 |
| | 2021 | 11 |
| | 2021 | 8 |
| | 2021 | 6 |
| | 2021 | 5 |
| | 2021 | 5 |
| | 2022 | 5 |
| | 2000 | 5 |
| | 2023 | 5 |
| | 2019 | 4 |
| | 2020 | 4 |
| | 2021 | 4 |
| | 2021 | 4 |
| | 2021 | 3 |
| | 2020 | 2 |
| | 2000 | 2 |
| | 2022 | 2 |
| | 2020 | 1 |
| | 2019 | 1 |
| | 2020 | 1 |
| | 2021 | 1 |
| | 2022 | 1 |