Transformer-based architectures represent the state of the art in sequence modeling tasks like machine translation and language understanding. Their applicability to multi-modal contexts like image captioning, however, is still largely under-explored...