Unifying Vision-and-Language Tasks via Text Generation

Existing methods for vision-and-language learning typically require designing task-specific architectures and objectives for each task: for example, a multi-label answer classifier for visual question answering, a region scorer for referring expression comprehension, and a language decoder for image captioning...
