Existing methods for vision-and-language learning typically require designing task-specific architectures and objectives for each task. For example, a multi-label answer classifier for visual question answering, a region scorer for referring expression comprehension, and a language decoder for image captioning, etc... (read more)
PDF AbstractMETHOD | TYPE | |
---|---|---|
🤖 No Methods Found | Help the community by adding them if they're not listed; e.g. Deep Residual Learning for Image Recognition uses ResNet |