Most recent CNN architectures use average pooling as a final feature encoding
step. In the field of fine-grained recognition, however, recent global
representations like bilinear pooling offer improved performance. In this
paper, we generalize average and bilinear pooling to "alpha-pooling", allowing
for learning the pooling strategy during training. In addition, we present a
novel way to visualize decisions made by these approaches. We identify parts of
training images having the highest influence on the prediction of a given test
image. This allows for justifying decisions to users and also for analyzing the
influence of semantic parts. For example, we can show that the higher-capacity
VGG16 model focuses much more on the bird's head than, e.g., the lower-capacity
VGG-M model when recognizing fine-grained bird categories. Both contributions
allow us to analyze the difference when moving between average and bilinear
pooling. In addition, experiments show that our generalized approach can
outperform both across a variety of standard datasets.
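
To make the idea of interpolating between the two pooling schemes concrete, the following is a minimal sketch of one plausible form of alpha-pooling. The abstract itself does not state the formula, so the signed-power definition used here, the function name `alpha_pool`, and the input layout (one row per spatial location) are all assumptions for illustration. Under this assumed form, alpha = 1 reproduces average pooling for positive features and alpha = 2 reproduces bilinear pooling.

```python
import numpy as np


def alpha_pool(features, alpha):
    """Pool n local descriptors of dimension d into one d*d vector.

    Assumed form (not taken from the abstract):
        vec( (1/n) * sum_i (sign(x_i) * |x_i|^(alpha - 1)) x_i^T )
    `features` is a hypothetical (n, d) array, one row per spatial location.
    """
    x = np.asarray(features, dtype=float)
    # Element-wise signed power of each local descriptor.
    powered = np.sign(x) * np.abs(x) ** (alpha - 1.0)
    # Mean over locations of the outer products powered_i x_i^T.
    pooled = powered.T @ x / x.shape[0]
    return pooled.reshape(-1)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Strictly positive features, e.g. activations after a ReLU layer.
    x = rng.uniform(0.1, 1.0, size=(5, 3))

    # alpha = 1: every row of the d x d result equals the average-pooled vector.
    avg = alpha_pool(x, 1.0).reshape(3, 3)
    print(np.allclose(avg, np.tile(x.mean(axis=0), (3, 1))))

    # alpha = 2: the result equals the mean of outer products (bilinear pooling).
    print(np.allclose(alpha_pool(x, 2.0), (x.T @ x / 5).reshape(-1)))
```

In a trainable setting, alpha would be a learnable scalar updated by backpropagation together with the network weights; the sketch above only evaluates the pooling for a fixed value.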