Descriptive
327 papers with code • 1 benchmark • 1 dataset
Most implemented papers
Aligning Books and Movies: Towards Story-like Visual Explanations by Watching Movies and Reading Books
Books are a rich source of both fine-grained information (how a character, an object, or a scene looks) and high-level semantics (what someone is thinking and feeling, and how these states evolve through a story).
Improving LSTM-based Video Description with Linguistic Knowledge Mined from Text
This paper investigates how linguistic knowledge mined from large text corpora can aid the generation of natural language descriptions of videos.
A Hierarchical Approach for Generating Descriptive Image Paragraphs
Recent progress on image captioning has made it possible to generate novel sentences describing images in natural language, but compressing an image into a single sentence can describe visual content in only coarse detail.
PL-SLAM: a Stereo SLAM System through the Combination of Points and Line Segments
This paper proposes PL-SLAM, a stereo visual SLAM system that combines both points and line segments to work robustly in a wider variety of scenarios, particularly in those where point features are scarce or not well-distributed in the image.
CLEVRER: CoLlision Events for Video REpresentation and Reasoning
While these models thrive on the perception-based task (descriptive), they perform poorly on the causal tasks (explanatory, predictive and counterfactual), suggesting that a principled approach for causal reasoning should incorporate the capability of both perceiving complex visual and language inputs, and understanding the underlying dynamics and causal relations.
Uninformed Students: Student-Teacher Anomaly Detection with Discriminative Latent Embeddings
Our experiments demonstrate improvements over state-of-the-art methods on a number of real-world datasets, including the recently introduced MVTec Anomaly Detection dataset that was specifically designed to benchmark anomaly segmentation algorithms.
Generating images from caption and vice versa via CLIP-Guided Generative Latent Space Search
In this research work we present CLIP-GLaSS, a novel zero-shot framework to generate an image (or a caption) corresponding to a given caption (or image).
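The core idea of such zero-shot generation is to search a pretrained generator's latent space for a point whose output best matches the caption under an image-text similarity score. Below is a minimal, self-contained sketch of that loop using hill climbing; `generate` and `clip_score` are hypothetical stand-ins (in CLIP-GLaSS they would be a pretrained GAN and the CLIP similarity, and the search is more sophisticated than plain hill climbing).

```python
import random

# Hypothetical stand-ins: a real system would use a pretrained
# generator (e.g. a GAN) and the CLIP image-text similarity score.
def generate(latent):
    return latent  # a real generator maps latent vector -> image

def clip_score(image, caption):
    # toy score for illustration: closeness to a fixed target vector
    target = [0.5] * len(image)
    return -sum((a - b) ** 2 for a, b in zip(image, target))

def latent_search(caption, dim=8, iters=200, sigma=0.1, seed=0):
    """Hill-climb over the latent space, maximizing the similarity score."""
    rng = random.Random(seed)
    z = [rng.gauss(0, 1) for _ in range(dim)]
    best = clip_score(generate(z), caption)
    for _ in range(iters):
        cand = [x + rng.gauss(0, sigma) for x in z]  # perturb the latent
        s = clip_score(generate(cand), caption)
        if s > best:  # keep the mutation only if the score improves
            z, best = cand, s
    return z, best
```

The same loop runs in the opposite direction for captioning: search over caption candidates instead of latents, scoring each against the fixed input image.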
Music transcription modelling and composition using deep learning
We apply deep learning methods, specifically long short-term memory (LSTM) networks, to music transcription modelling and composition.
Picture It In Your Mind: Generating High Level Visual Representations From Textual Descriptions
We choose to implement the actual search process as a similarity search in a visual feature space, by learning to translate a textual query into a visual representation.
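At retrieval time this amounts to projecting the text query into the visual feature space and ranking indexed images by similarity. A minimal sketch, with `text_to_visual` as a hypothetical stand-in for the learned text-to-visual projection (the paper trains a network for this mapping):

```python
import math

def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

# Hypothetical stand-in: a real system would use a learned network
# that regresses a visual feature vector from the textual query.
def text_to_visual(query, dim=4):
    vec = [0.0] * dim
    for i, ch in enumerate(query):
        vec[i % dim] += ord(ch) / 1000.0
    return vec

def search(query, image_features, top_k=3):
    """Rank indexed image features by similarity to the projected query."""
    dim = len(next(iter(image_features.values())))
    q = text_to_visual(query, dim=dim)
    ranked = sorted(image_features.items(),
                    key=lambda kv: cosine(q, kv[1]), reverse=True)
    return ranked[:top_k]
```

With the projection learned, retrieval reduces to a standard nearest-neighbor search over precomputed image features, so no captions need to be stored at index time.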
A dataset and exploration of models for understanding video data through fill-in-the-blank question-answering
In addition to presenting statistics and a description of the dataset, we perform a detailed analysis of 5 different models' predictions, and compare these with human performance.