ViTAA: Visual-Textual Attributes Alignment in Person Search by Natural Language

ECCV 2020  ·  Zhe Wang, Zhiyuan Fang, Jun Wang, Yezhou Yang

Person search by natural language aims to retrieve, from a large-scale image pool, the specific person who matches a given textual description. While most current methods treat the task as holistic matching of visual and textual features, we approach it from an attribute-aligning perspective that allows specific attribute phrases to be grounded to the corresponding visual regions. This yields robust feature learning in which the referred identity is accurately bound to multiple attribute-level visual cues, which in turn boosts performance. Concretely, our Visual-Textual Attribute Alignment model (dubbed ViTAA) learns to disentangle the feature space of a person into subspaces corresponding to attributes, using a light auxiliary attribute segmentation branch. It then aligns these visual features with the textual attributes parsed from the sentences by using a novel contrastive learning loss. We validate the ViTAA framework through extensive experiments on person search by natural language and by attribute-phrase queries, on which our system achieves state-of-the-art performance. Code will be publicly available upon publication.
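
The alignment step described in the abstract can be illustrated with a small sketch. It is not the paper's loss: the abstract only says that a novel contrastive loss is used, so the hinge-style formulation, the function names, the attribute keys, and the margin value below are all assumptions chosen for illustration. The sketch pulls the visual and textual features of each shared attribute together when image and sentence describe the same identity, and pushes them apart otherwise.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def attribute_alignment_loss(visual, textual, same_identity, margin=0.2):
    """Generic per-attribute contrastive loss (hypothetical, not ViTAA's exact loss).

    visual, textual: dicts mapping an attribute name to a feature vector.
    same_identity:   True if the image and the sentence refer to the same person.
    margin:          assumed similarity threshold for mismatched pairs.
    """
    shared = sorted(set(visual) & set(textual))
    if not shared:
        return 0.0
    total = 0.0
    for name in shared:
        sim = cosine(visual[name], textual[name])
        if same_identity:
            total += 1.0 - sim               # pull matched attributes together
        else:
            total += max(0.0, sim - margin)  # push mismatched attributes apart
    return total / len(shared)


# Toy features: one vector per attribute subspace (values are made up).
image_feats = {"upper_clothes": [1.0, 0.0], "lower_clothes": [0.0, 1.0]}
text_feats = {"upper_clothes": [1.0, 0.0], "lower_clothes": [0.0, 1.0]}

positive_loss = attribute_alignment_loss(image_feats, text_feats, same_identity=True)
negative_loss = attribute_alignment_loss(image_feats, text_feats, same_identity=False)
```

Averaging over only the attributes present in both modalities reflects the fact that a sentence rarely mentions every attribute; how the actual model handles missing attributes is not stated in the abstract.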



Task                         Dataset     Model  Metric  Value  Global Rank
Text based Person Retrieval  CUHK-PEDES  ViTAA  R@1     55.97  # 15
                                                R@5     75.84  # 14
                                                R@10    83.52  # 14
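
The R@K values reported above are Recall@K (rank-K accuracy), expressed as percentages. As a minimal sketch of how such a metric is commonly computed, assume each text query has one target identity and a gallery ranking sorted from best to worst match. The function name and the toy data are hypothetical, and the benchmark's exact evaluation protocol may differ.

```python
def recall_at_k(ranked_gallery_ids, query_ids, k):
    """Percentage of queries whose target identity appears in the top-k results.

    ranked_gallery_ids: one list per query, gallery identities sorted best-first.
    query_ids:          the target identity for each query.
    """
    hits = 0
    for ranked, target in zip(ranked_gallery_ids, query_ids):
        if target in ranked[:k]:
            hits += 1
    return 100.0 * hits / len(query_ids)


# Toy example: three queries over a three-identity gallery (made-up rankings).
rankings = [[3, 1, 2], [2, 3, 1], [1, 2, 3]]
targets = [1, 2, 3]

r1 = recall_at_k(rankings, targets, k=1)
r2 = recall_at_k(rankings, targets, k=2)
r3 = recall_at_k(rankings, targets, k=3)
```

Here only the second query has its target ranked first, two of the three queries have it within the top two, and all three within the top three, so the metric is non-decreasing in K, matching the pattern in the table.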

