As a highlighting research topic in the multimedia area, cross-media
retrieval aims to capture the complex correlations among multiple media types.
Learning better shared representation and distance metric for multimedia data
is important to boost the cross-media retrieval. Motivated by the strong
ability of deep neural network in feature representation and comparison
functions learning, we propose the Unified Network for Cross-media Similarity
Metric (UNCSM) to associate cross-media shared representation learning with
distance metric in a unified framework. First, we design a two-pathway deep
network pretrained with contrastive loss, and employ double triplet similarity
loss for fine-tuning to learn the shared representation for each media type by
modeling the relative semantic similarity. Second, the metric network is
designed for effectively calculating the cross-media similarity of the shared
representation, by modeling the pairwise similar and dissimilar constraints.
Compared to the existing methods which mostly ignore the dissimilar constraints
and only use sample distance metric as Euclidean distance separately, our UNCSM
approach unifies the representation learning and distance metric to preserve
the relative similarity as well as embrace more complex similarity functions
for further improving the cross-media retrieval accuracy. The experimental
results show that our UNCSM approach outperforms 8 state-of-the-art methods on
4 widely-used cross-media datasets.