Weakly-supervised HOI Detection via Prior-guided Bi-level Representation Learning

2 Mar 2023

Human-object interaction (HOI) detection plays a crucial role in human-centric scene understanding and serves as a fundamental building block for many vision tasks.

Intention-aware Feature Propagation Network for Interactive Segmentation

10 Mar 2022

We aim to tackle the problem of point-based interactive segmentation, in which two key challenges are to infer the user's intention correctly and to propagate the user-provided annotations to unlabeled regions efficiently.

KD-VLP: Improving End-to-End Vision-and-Language Pretraining with Object Knowledge Distillation

Findings (NAACL) 2022

Self-supervised vision-and-language pretraining (VLP) aims to learn transferable multi-modal representations from large-scale image-text data and to achieve strong performance on a broad range of vision-language tasks after finetuning.

GEM: A General Evaluation Benchmark for Multimodal Tasks

Findings (ACL) 2021

Compared with existing multimodal datasets such as MSCOCO and Flickr30K for image-language tasks, and YouCook2 and MSR-VTT for video-language tasks, GEM is not only the largest vision-language dataset covering image-language and video-language tasks at the same time, but is also labeled in multiple languages.

Relation-aware Instance Refinement for Weakly Supervised Visual Grounding

CVPR 2021

Visual grounding, which aims to build a correspondence between visual objects and their language entities, plays a key role in cross-modal scene understanding.

Learning Cross-modal Context Graph for Visual Grounding

20 Nov 2019

To address their limitations, this paper proposes a language-guided graph representation to capture the global context of grounding entities and their relations, and develops a cross-modal graph matching strategy for the multiple-phrase visual grounding task.

Pose-aware Multi-level Feature Network for Human Object Interaction Detection

ICCV 2019

Reasoning about human-object interactions is a core problem in human-centric scene understanding, and detecting such relations poses a unique challenge to vision systems due to large variations in human-object configurations, multiple co-occurring relation instances, and subtle visual differences between relation categories.

