Recognizing that the visual 'arrest' event is a subevent of the broader 'protest' event is a challenging yet important problem that prior work has not explored.
In this paper, we propose an extension of this task, where the goal is to predict the logical relationship of fine-grained knowledge elements within a piece of text to an image.
We introduce the new task of Video MultiMedia Event Extraction (Video M2E2) and propose two novel components to build the first system towards this task.
An effective mechanism to defend against machine-generated fake news is urgently needed.
We study the problem of animating images by transferring spatio-temporal visual effects (such as melting) from a collection of videos.
Oscillations in the local field potential (LFP) of the brain are key signatures of neural information processing.
The abundance of multimodal data (e.g., social media posts) has inspired interest in cross-modal retrieval methods.
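As context for what cross-modal retrieval means operationally (this is not the paper's method): items from both modalities are embedded into a shared space, and items of one modality are ranked by similarity to a query from the other. A minimal sketch, assuming the embeddings have already been produced by some encoder:

```python
# Toy cross-modal retrieval: rank images by cosine similarity to a text
# query, assuming both have been embedded into a shared space by some
# encoder (the encoders themselves are out of scope here).
import torch
import torch.nn.functional as F

def retrieve(text_emb, image_embs, k=5):
    """text_emb: (D,) query; image_embs: (N, D) gallery. Returns top-k indices."""
    sims = F.cosine_similarity(text_emb.unsqueeze(0), image_embs, dim=1)
    return sims.topk(k).indices

# Example with random embeddings standing in for real encoder outputs.
query = torch.randn(256)
gallery = torch.randn(1000, 256)
print(retrieve(query, gallery))
```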
We collect a dataset of over one million unique images and associated news articles from left- and right-leaning news sources, and develop a method to predict the image's political leaning.
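The sentence does not detail the prediction method; purely as an illustration of the kind of binary classifier the task implies, here is a minimal sketch that fine-tunes a pretrained CNN to predict left vs. right leaning. The choice of ResNet-18 and all hyperparameters are assumptions, not the paper's actual approach.

```python
# Hypothetical sketch: fine-tune a pretrained CNN to predict an image's
# political leaning (left vs. right). ResNet-18 and the hyperparameters
# are illustrative assumptions, not the paper's method.
import torch
import torch.nn as nn
from torchvision import models

model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
model.fc = nn.Linear(model.fc.in_features, 2)  # 2 classes: left, right

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

def train_step(images, labels):
    """images: (B, 3, 224, 224) tensor; labels: 0 = left, 1 = right."""
    optimizer.zero_grad()
    loss = loss_fn(model(images), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```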
To do so, we introduce a complementary training modality constructed to be similar in artistic style to the target domain, and enforce that the network learn features that are invariant between the two training modalities.
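One way to read this sentence as code: a shared encoder processes both the original modality and the style-matched complementary modality, and an invariance penalty ties their features together. The sketch below is an assumption, not the paper's implementation; names such as `Encoder` and `lambda_inv` are hypothetical, and a simple L2 penalty stands in for whatever invariance loss the paper actually uses.

```python
# Minimal sketch of modality-invariant feature learning (assumed L2
# invariance penalty; Encoder and lambda_inv are hypothetical names).
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Toy convolutional feature extractor shared across modalities."""
    def __init__(self, feat_dim=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(64, feat_dim)

    def forward(self, x):
        return self.fc(self.conv(x).flatten(1))

encoder = Encoder()
classifier = nn.Linear(128, 10)   # task head (e.g., 10 object classes)
task_loss_fn = nn.CrossEntropyLoss()
lambda_inv = 0.1                  # weight on the invariance penalty

def training_step(x_source, x_styled, labels):
    """x_styled is the complementary modality rendered in the target style."""
    f_source = encoder(x_source)
    f_styled = encoder(x_styled)
    # Supervised task loss on the source modality.
    task_loss = task_loss_fn(classifier(f_source), labels)
    # Pull features from the two modalities together.
    inv_loss = (f_source - f_styled).pow(2).mean()
    return task_loss + lambda_inv * inv_loss
```

An adversarial modality discriminator with gradient reversal would be a common alternative to the L2 penalty for enforcing this kind of invariance.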
There is more to images than their objective physical content: for example, advertisements are created to persuade a viewer to take a certain action.
In this technical report, we present our publicly downloadable implementation of the SALICON saliency model.
To explore the feasibility of current computer vision techniques to address this problem, we created a new dataset of over 180,000 images taken by 41 well-known photographers.