We investigate knowledge retrieval with multi-modal queries, i.e., queries containing information split across image and text inputs, a challenging task that differs from previous work on cross-modal retrieval.
Our MicroSeg is based on the assumption that background regions with strong objectness likely belong to concepts from past or future learning stages.
In videos that contain actions performed unintentionally, agents do not achieve their desired goals.
In this paper, we aim at a better-performing detector-free image captioning model and propose a pure vision-transformer-based image captioning model, dubbed ViTCAP, in which grid representations are used without extracting regional features.
In this paper, we study knowledge distillation (KD) to effectively compress a transformer-based large VL model into a small VL model.
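The abstract names knowledge distillation as the compression tool but does not detail its loss. A minimal sketch of the standard temperature-scaled distillation objective (soft teacher targets matched by the student via KL divergence), not the paper's specific method:

```python
import numpy as np

def softmax(logits, T=1.0):
    # Temperature-scaled softmax; higher T yields softer distributions.
    z = logits / T
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def kd_loss(student_logits, teacher_logits, T=2.0):
    # KL(teacher || student) on temperature-softened outputs,
    # scaled by T^2 so gradients stay comparable across temperatures.
    p = softmax(teacher_logits, T)
    q = softmax(student_logits, T)
    return float((p * (np.log(p) - np.log(q))).sum(axis=-1).mean() * T * T)
```

In practice this term is typically combined with a supervised task loss on ground-truth labels; the weighting between the two is a hyperparameter.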
This paper is concerned with self-supervised learning for small models.
The referring attention is a mechanism we design that acts as a scoring function for temporally grounding the given queries over frames.
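The abstract does not spell out the scoring function, so the following is a hypothetical sketch only: cosine similarity between a query embedding and per-frame features, normalized over time with a softmax to yield a temporal grounding distribution.

```python
import numpy as np

def referring_scores(query_vec, frame_feats):
    # query_vec: (d,) query embedding; frame_feats: (T, d) per-frame features.
    # Cosine similarity per frame, then softmax over the time axis.
    q = query_vec / np.linalg.norm(query_vec)
    f = frame_feats / np.linalg.norm(frame_feats, axis=1, keepdims=True)
    sims = f @ q
    e = np.exp(sims - sims.max())
    return e / e.sum()  # scores sum to 1 across frames
```

Frames whose features align with the query receive higher mass; the real mechanism may use learned projections rather than raw cosine similarity.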
By extracting various features from high to low resolutions, the MD-IPN improves small-object detection performance while maintaining performance on medium and large objects.
Person search by natural language aims at retrieving, from a large-scale image pool, the specific person who matches a given textual description.
In videos that involve active agents such as humans, the agent's actions can bring about myriad changes in the scene.
The process of identifying changes or transformations in a scene, together with the ability to reason about their causes and effects, is a key aspect of intelligence.
Computer Vision applications often require a textual grounding module with precision, interpretability, and resilience to counterfactual inputs/queries.
Grounding textual phrases in visual content is a meaningful yet challenging problem with various potential applications such as image-text inference or text-driven multimedia interaction.
Unlike these works, this paper investigates how long-tailed data impact the training of face CNNs and develops a novel loss function, called range loss, to effectively utilize the tailed data in the training process.
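The abstract only names the range loss; below is a hedged sketch of a range-style loss under assumptions that may differ from the paper: an intra-class term penalizing the harmonic mean of the k largest within-class pairwise distances, and an inter-class term pushing the two closest class centers at least a margin apart.

```python
import numpy as np
from itertools import combinations

def range_loss(feats, labels, k=2, margin=1.0, alpha=1.0, beta=1.0):
    # Sketch only; assumes the mini-batch contains at least two classes.
    classes = np.unique(labels)
    intra = 0.0
    centers = []
    for c in classes:
        x = feats[labels == c]
        centers.append(x.mean(axis=0))
        # k largest intra-class pairwise distances for this class.
        dists = sorted((np.linalg.norm(a - b) for a, b in combinations(x, 2)),
                       reverse=True)[:k]
        if dists:
            # Harmonic mean of the k largest "ranges".
            intra += k / sum(1.0 / (d + 1e-12) for d in dists)
    # Hinge on the shortest distance between class centers.
    d_min = min(np.linalg.norm(a - b) for a, b in combinations(centers, 2))
    inter = max(0.0, margin - d_min)
    return alpha * intra + beta * inter
```

The weights `alpha`, `beta`, the margin, and `k` are hyperparameters; well-separated classes zero out the inter-class term, leaving only the intra-class compactness penalty.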
Convolutional neural networks have achieved great improvements in face recognition in recent years because of their extraordinary ability to learn discriminative features of people with different identities.