Many visual recognition problems can be approached by counting instances. To
determine whether an event is present in a long internet video, one could count
how many frames seem to contain the activity. Classifying the activity of a
group of people can be done by counting the actions of individual people.
Encoding these cardinality relationships can reduce sensitivity to clutter, in
the form of irrelevant frames or individuals not involved in a group activity. Learned parameters can encode how many instances tend to occur in a class of
interest. To this end, this paper develops a powerful and flexible framework to
infer any cardinality relation between latent labels in a multi-instance model. Hard or soft cardinality relations can be encoded to tackle diverse levels of
ambiguity. Experiments on tasks such as human activity recognition, video event
detection, and video summarization demonstrate the effectiveness of using
cardinality relations for improving recognition results.
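
To make the idea of a cardinality relation over latent instance labels concrete, the following is a minimal sketch and not the model developed in this paper. It assumes binary latent labels, one real-valued score per instance, and a potential that depends only on how many instances are labeled positive. Under those assumptions the best labeling with exactly k positives takes the k highest-scoring instances, so sorting and scanning over k is enough. The names `infer_labels`, `at_least_one`, and `soft_ratio` are hypothetical and chosen for illustration only.

```python
from typing import Callable, List, Tuple


def infer_labels(
    scores: List[float],
    cardinality: Callable[[int, int], float],
) -> Tuple[List[int], float]:
    """Pick binary instance labels maximizing sum(selected scores) + cardinality(k, n).

    Assumption: the cardinality potential depends only on the count k of
    positive instances, so for each k the top-k scores are optimal.
    """
    n = len(scores)
    order = sorted(range(n), key=lambda i: scores[i], reverse=True)

    best_k, best_val = 0, cardinality(0, n)
    running = 0.0
    for k in range(1, n + 1):
        running += scores[order[k - 1]]  # add the k-th highest score
        val = running + cardinality(k, n)
        if val > best_val:
            best_k, best_val = k, val

    labels = [0] * n
    for i in order[:best_k]:
        labels[i] = 1
    return labels, best_val


def at_least_one(k: int, n: int) -> float:
    """Hard relation (illustrative): forbid labelings with no positive instance."""
    return 0.0 if k >= 1 else float("-inf")


def soft_ratio(k: int, n: int, target: float = 0.5, weight: float = 2.0) -> float:
    """Soft relation (illustrative): penalize deviation from a target positive ratio."""
    return -weight * abs(k / n - target)


frame_scores = [1.5, -0.5, -2.0, 0.3]  # made-up per-frame scores
hard_labels, hard_val = infer_labels(frame_scores, at_least_one)
soft_labels, soft_val = infer_labels(frame_scores, soft_ratio)
```

In a learned model the fixed `target` and `weight` values would be replaced by parameters estimated from data, which is where knowledge of how many instances tend to occur in a class would be encoded.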