In this paper, we identify a principled model design space with two axes: how to represent videos and how to fuse video and text information.
In this paper, we study multimodal few-shot object detection (FSOD), which uses both few-shot visual examples and class semantic information for detection.
Inspired by recent work on vision transformers and vision-language transformers, we propose a novel Fully Cross-Transformer based model (FCT) for FSOD, incorporating cross-transformer layers into both the feature backbone and the detection head.
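As a minimal sketch of the underlying idea, and not the exact FCT architecture, a cross-transformer layer can mix query-image tokens and few-shot support tokens through joint attention; the module names, dimensions, and token layout below are illustrative assumptions.

```python
import torch
import torch.nn as nn

class CrossTransformerBlock(nn.Module):
    """Illustrative cross-attention block: query-image tokens and
    support-image tokens attend to each other (a sketch, not the
    exact FCT design)."""

    def __init__(self, dim=256, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, query_tokens, support_tokens):
        # Concatenate both branches so attention mixes query and support
        # features, the key idea behind a "fully cross" transformer layer.
        x = torch.cat([query_tokens, support_tokens], dim=1)
        h = self.norm1(x)
        x = x + self.attn(h, h, h)[0]
        x = x + self.mlp(self.norm2(x))
        n_q = query_tokens.shape[1]
        return x[:, :n_q], x[:, n_q:]  # split back into the two branches

# Example: one query image (196 tokens) and 5 support shots (49 tokens each)
block = CrossTransformerBlock()
q_out, s_out = block(torch.randn(1, 196, 256), torch.randn(1, 5 * 49, 256))
print(q_out.shape, s_out.shape)  # (1, 196, 256) and (1, 245, 256)
```

Stacking such blocks in both the backbone and the detection head lets the query and support branches condition on each other throughout the network, rather than only at a single fusion point.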
Few-shot object detection (FSOD) aims to detect objects never seen during training from only a few examples.
To improve fine-grained few-shot proposal classification, we propose a novel attentive feature alignment method that addresses the spatial misalignment between noisy proposals and the few-shot class examples, thereby improving detection performance.
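A minimal sketch of one way to realize such alignment is given below, assuming PyTorch; the function name, tensor shapes, and the prototype-based formulation are illustrative assumptions rather than the paper's exact method. Each spatial location of a proposal feature map is re-weighted by its similarity to a class prototype before pooling, so poorly localized proposals contribute mostly class-relevant regions to the classification feature.

```python
import torch
import torch.nn.functional as F

def attentive_alignment(proposal_feats, class_protos):
    """Illustrative attentive alignment (a sketch, not the exact method):
    re-weight spatial positions of each proposal feature map by their
    similarity to each class prototype before pooling.

    proposal_feats: (P, C, H, W) RoI-pooled features for P proposals
    class_protos:   (K, C) one prototype vector per few-shot class
    returns:        (P, K, C) class-conditioned, spatially pooled features
    """
    P, C, H, W = proposal_feats.shape
    feats = proposal_feats.flatten(2)                        # (P, C, H*W)
    # Similarity of every spatial location to every class prototype.
    sim = torch.einsum('pcn,kc->pkn', feats, class_protos)   # (P, K, H*W)
    attn = F.softmax(sim / C ** 0.5, dim=-1)                 # attend over locations
    # Pool locations with class-specific attention -> aligned features.
    return torch.einsum('pkn,pcn->pkc', attn, feats)

protos = F.normalize(torch.randn(10, 256), dim=-1)  # 10 novel-class prototypes
rois = torch.randn(32, 256, 7, 7)                   # 32 noisy proposals
print(attentive_alignment(rois, protos).shape)      # torch.Size([32, 10, 256])
```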
In this paper, we instead propose task-adaptive negative class envision for few-shot open-set recognition (FSOR), which integrates threshold tuning into the learning process.
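To make the intuition concrete, the following is a rough sketch of the general idea under our own simplifying assumptions (the generator design, names, and scoring rule here are illustrative, not the proposed method): a task-specific negative prototype is generated from the support set, so open-set rejection becomes an extra class rather than a hand-tuned score threshold.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NegativeEnvision(nn.Module):
    """Sketch of the general idea (not the exact method): generate a
    task-adaptive 'negative' prototype from the few-shot supports so
    unknown rejection is learned jointly with classification."""

    def __init__(self, dim=256):
        super().__init__()
        self.neg_gen = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(),
                                     nn.Linear(dim, dim))

    def forward(self, query, class_protos):
        # query: (B, dim); class_protos: (K, dim) from the support set.
        neg_proto = self.neg_gen(class_protos.mean(0, keepdim=True))  # (1, dim)
        protos = torch.cat([class_protos, neg_proto], dim=0)          # (K+1, dim)
        logits = F.cosine_similarity(query.unsqueeze(1),
                                     protos.unsqueeze(0), dim=-1)
        return logits  # argmax == K means "unknown"; no explicit threshold

model = NegativeEnvision()
logits = model(torch.randn(4, 256), torch.randn(5, 256))
print(logits.shape)  # torch.Size([4, 6]): 5 known classes + 1 negative class
```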
Recent works instead use modern compressed video modalities as an alternative to the RGB spatial stream, improving inference speed by orders of magnitude.
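As a minimal sketch of how such modalities are consumed (assuming PyTorch, and assuming motion vectors and residuals have already been extracted from the bitstream offline; the network layout, class count, and input sizes are illustrative assumptions), a compressed-domain model can operate directly on 2-channel motion-vector fields and 3-channel residual frames, skipping most of the expensive full-RGB decoding.

```python
import torch
import torch.nn as nn

class CompressedVideoNet(nn.Module):
    """Illustrative two-branch network over compressed-domain inputs:
    motion vectors (2 channels: dx, dy) and RGB residuals (3 channels).
    A toy stand-in, not any specific published architecture."""

    def __init__(self, num_classes=174):
        super().__init__()
        def branch(in_ch):
            return nn.Sequential(
                nn.Conv2d(in_ch, 64, kernel_size=7, stride=2, padding=3),
                nn.ReLU(),
                nn.AdaptiveAvgPool2d(1),
                nn.Flatten())
        self.mv_branch = branch(2)   # motion-vector stream
        self.res_branch = branch(3)  # residual stream
        self.fc = nn.Linear(128, num_classes)

    def forward(self, mv, res):
        # Only P-frame side information is processed here; avoiding full
        # RGB decoding is where the large inference speedups come from.
        feats = torch.cat([self.mv_branch(mv), self.res_branch(res)], dim=1)
        return self.fc(feats)

net = CompressedVideoNet()
mv = torch.randn(8, 2, 112, 112)   # a batch of motion-vector fields
res = torch.randn(8, 3, 112, 112)  # matching residual frames
print(net(mv, res).shape)          # torch.Size([8, 174])
```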