Associating Neural Word Embeddings With Deep Image Representations Using Fisher Vectors

CVPR 2015 · Benjamin Klein, Guy Lev, Gil Sadeh, Lior Wolf

In recent years, the problem of associating a sentence with an image has gained a lot of attention. This work continues that line and improves performance on both image annotation and sentence-based image search. We use the Fisher Vector as a sentence representation, pooling the word2vec embeddings of the words in the sentence. The Fisher Vector is typically computed as the gradient of the descriptors' log-likelihood with respect to the parameters of a Gaussian Mixture Model (GMM). In this work we present two other mixture models and derive their Expectation-Maximization updates and Fisher Vector expressions. The first is a Laplacian Mixture Model (LMM), based on the Laplacian distribution. The second is a Hybrid Gaussian-Laplacian Mixture Model (HGLMM), whose components are weighted geometric means of Gaussian and Laplacian distributions. Finally, using the new Fisher Vectors derived from HGLMMs to represent sentences, we achieve state-of-the-art results on both image annotation and sentence-based image search on four benchmarks: Pascal1K, Flickr8K, Flickr30K, and COCO.
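As a reading aid, below is a minimal sketch of the baseline the abstract starts from: Fisher Vector pooling of per-word embeddings under a diagonal-covariance GMM. This is not the authors' code; the function name, the component count, the embedding dimension, and the use of scikit-learn are illustrative assumptions, and the gradient formulas are the standard mean/variance FV expressions with power and L2 normalization. The paper's contribution (the HGLMM-derived Fisher Vector) would replace the Gaussian component density with a weighted geometric mean of Gaussian and Laplacian densities.

```python
# A minimal sketch of GMM-based Fisher Vector pooling for sentence
# representation, assuming scikit-learn and standard FV formulas.
import numpy as np
from sklearn.mixture import GaussianMixture

def fisher_vector(descriptors, gmm):
    """Encode a set of descriptors (e.g. the word2vec vectors of one
    sentence's words) as a Fisher Vector under a fitted diagonal GMM."""
    T = descriptors.shape[0]
    gamma = gmm.predict_proba(descriptors)            # (T, K) posteriors
    mu, var, w = gmm.means_, gmm.covariances_, gmm.weights_
    sigma = np.sqrt(var)                              # (K, D) std devs

    # Normalized deviation of each descriptor from each component mean.
    diff = (descriptors[:, None, :] - mu[None, :, :]) / sigma[None, :, :]  # (T, K, D)

    # Gradients of the log-likelihood w.r.t. means and std devs.
    g_mu = np.einsum('tk,tkd->kd', gamma, diff) / (T * np.sqrt(w)[:, None])
    g_sigma = np.einsum('tk,tkd->kd', gamma, diff ** 2 - 1.0) / (T * np.sqrt(2 * w)[:, None])

    fv = np.concatenate([g_mu.ravel(), g_sigma.ravel()])
    # Power normalization followed by L2 normalization, as is standard.
    fv = np.sign(fv) * np.sqrt(np.abs(fv))
    return fv / (np.linalg.norm(fv) + 1e-12)

# Illustrative usage: fit the GMM on word vectors from a training corpus,
# then encode each sentence (here: 7 words, 50-dim embeddings) as one
# fixed-length vector of size 2 * K * D.
rng = np.random.default_rng(0)
gmm = GaussianMixture(n_components=16, covariance_type='diag',
                      random_state=0).fit(rng.standard_normal((1000, 50)))
sentence_vec = fisher_vector(rng.standard_normal((7, 50)), gmm)
```

The resulting fixed-length vector is what gets matched against the image representation; the design choice the paper explores is which mixture density (Gaussian, Laplacian, or the hybrid) the gradients are taken against.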


Results from the Paper

| Task | Dataset | Model | Metric Name | Metric Value | Rank |
|---|---|---|---|---|---|
| Video Retrieval | YouCook2 | HGLMM FV CCA | text-to-video Median Rank | 75 | #9 |
| Video Retrieval | YouCook2 | HGLMM FV CCA | text-to-video R@1 | 4.6 | #14 |
| Video Retrieval | YouCook2 | HGLMM FV CCA | text-to-video R@10 | 21.6 | #15 |
| Video Retrieval | YouCook2 | HGLMM FV CCA | text-to-video R@5 | 14.3 | #13 |
