We study the problem of explaining a rich class of behavioral properties of
deep neural networks. Distinctively, our influence-directed explanations
approach this problem by peering inside the network: we identify neurons with
high influence on a quantity of interest, relative to a distribution of
interest, using an axiomatically justified influence measure, and then provide
an interpretation for the concepts these neurons represent.
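
In practice, an influence measure of this kind can be estimated as the average gradient of the quantity of interest (e.g., a class logit) with respect to an internal layer's activations, taken over inputs drawn from the distribution of interest. The sketch below illustrates this idea in PyTorch; it is a minimal approximation, not the paper's implementation, and the names `internal_influence`, `model`, `layer`, and the example layer index are illustrative assumptions.

```python
# Minimal sketch of a gradient-based internal influence estimate (assumed
# PyTorch API; helper and variable names are hypothetical).
import torch

def internal_influence(model, layer, images, class_idx):
    """Estimate the influence of each unit in `layer` on the logit for
    `class_idx`, averaged over `images` (the distribution of interest)."""
    acts = {}

    def hook(_module, _inp, out):
        out.retain_grad()          # keep gradients on the internal activation
        acts["z"] = out

    handle = layer.register_forward_hook(hook)
    logits = model(images)         # forward pass records the layer's activations
    handle.remove()

    qoi = logits[:, class_idx].sum()   # quantity of interest: the class logit
    qoi.backward()

    grads = acts["z"].grad             # d(qoi)/d(activation), per example
    # Average over examples (and spatial positions for conv feature maps),
    # yielding one influence score per unit/channel.
    dims = [0] + list(range(2, grads.dim()))
    return grads.mean(dim=dims)

# Example usage (hypothetical): rank channels of a VGG16 conv layer by their
# influence on a chosen ImageNet class.
# model = torchvision.models.vgg16(pretrained=True).eval()
# scores = internal_influence(model, model.features[28], images, class_idx=281)
# top_units = scores.argsort(descending=True)[:5]
```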
We evaluate our approach by
demonstrating a number of its unique capabilities on convolutional neural
networks trained on ImageNet. Our evaluation demonstrates that
influence-directed explanations (1) identify influential concepts that
generalize across instances, (2) can be used to extract the "essence" of what
the network learned about a class, and (3) isolate individual features the
network uses to make decisions and to distinguish related classes.