We show that Vision-Language Transformers can be learned without human labels (e.g., class labels, bounding boxes, etc.).
The primary focus of recent work with large-scale transformers has been on optimizing the amount of information packed into the model's parameters.
This paper surveys recent work on image-based and text-based person search from the perspective of challenges and solutions.
For example, given an image, we want not only to detect and recognize the objects in it, but also to infer the relationships between objects (visual relationship detection) and to generate a text description (image captioning) based on the image content.
In the last couple of years, weakly labeled learning for sound events has emerged as a promising approach for audio event detection.