Gender biases are known to exist within large-scale visual datasets and can be reflected or even amplified in downstream models.
Specifically, we develop a novel explanation framework, ELUDE (Explanation via Labelled and Unlabelled DEcomposition), that decomposes a model's prediction into two parts: one explainable through a linear combination of semantic attributes, and another dependent on a set of uninterpretable features.
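A minimal sketch of this decomposition idea (not the authors' implementation), assuming binary attribute annotations and the target model's logits are available as NumPy arrays; all names and numbers below are hypothetical placeholders:

```python
# Fit a sparse linear map from semantic attributes to the model's logit, and
# treat the leftover residual as the part attributable to uninterpretable
# features. This only illustrates the labelled/unlabelled split, not ELUDE itself.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
attributes = rng.integers(0, 2, size=(1000, 40)).astype(float)  # hypothetical binary attribute labels
logits = rng.normal(size=1000)                                  # hypothetical model outputs

explainer = Lasso(alpha=0.01).fit(attributes, logits)
explained = explainer.predict(attributes)   # part captured by the labelled attributes
residual = logits - explained               # part left to uninterpretable features

explained_variance = 1 - residual.var() / logits.var()
print(f"fraction of the prediction explained by attributes: {explained_variance:.2f}")
```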
We propose an algorithm that compresses the critical information of a large dataset into compact addressable memories.
In this work, we grapple with questions that arise along three stages of the machine learning pipeline when incorporating intersectionality as multiple demographic attributes: (1) which demographic attributes to include as dataset labels, (2) how to handle the progressively smaller size of subgroups during model training, and (3) how to move beyond existing evaluation metrics when benchmarking model fairness for more subgroups.
We introduce CARETS, a systematic test suite to measure consistency and robustness of modern VQA models through a series of six fine-grained capability tests.
In this paper, we focus on the less-studied setting of multi-query video retrieval, where multiple queries are provided to the model for searching over the video archive.
Despite the recent growth of interpretability work, there is a lack of systematic evaluation of proposed techniques.
In real-world applications, however, there are multiple protected attributes yielding a large number of intersectional protected groups.
Image captioning is an important task for benchmarking visual reasoning and for enabling accessibility for people with vision impairments.
The implementation of most (7 of 10) methods was straightforward, especially after we received additional details from the original authors.
In this paper, we explore the effects of face obfuscation on the popular ImageNet challenge visual recognition benchmark.
Fairness in visual recognition is becoming a prominent and critical topic of discussion as recognition systems are deployed at scale in the real world.
Concretely, we (1) introduce and motivate point-input questions as an extension of VQA, (2) define three novel classes of questions within this space, and (3) for each class, introduce both a benchmark dataset and a series of baseline models to handle its unique challenges.
We find that modern captioning systems return higher likelihoods for incorrect distractor sentences than for ground-truth captions, and that evaluation metrics like SPICE can be 'topped' by simple captioning systems relying on object detectors.
The ability to perform effective planning is crucial for building an instruction-following agent.
Machine learning models are known to perpetuate and even amplify the biases present in the data.
In the Vision-and-Language Navigation (VLN) task, an agent with egocentric vision navigates to a destination given natural language instructions.
Computer vision technology is being used by many but remains representative of only a few.
Temporal grounding entails establishing a correspondence between natural language event descriptions and their visual depictions.
We design a simple but surprisingly effective visual recognition benchmark for studying bias mitigation.
We then show that, while contemporary classifiers fail to exhibit human-like uncertainty on their own, explicit training on our dataset closes this gap, supports improved generalization to increasingly out-of-training-distribution test datasets, and confers robustness to adversarial attacks.
Together these two variants address the two critical use cases in efficient object detection: improving efficiency without sacrificing accuracy, and improving accuracy at real-time efficiency.
We present the many kinds of information that will be needed to achieve substantial gains in activity understanding: objects, verbs, intent, and sequential reasoning.
Our method uses Q-learning to learn a data labeling policy on a small labeled training dataset, and then uses this policy to automatically label noisy web data for new visual concepts.
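A toy, hedged illustration of learning a labeling policy with Q-learning, not the paper's actual method: states are binned classifier confidences, the action decides whether to keep a candidate web label, and the reward comes from agreement with the small labeled set. All quantities below are simulated for illustration:

```python
# Tabular Q-learning over a one-step labeling decision: states are confidence
# bins, actions are {discard, keep}, reward is +1/-1 for keeping a correct or
# incorrect web label. Because each decision is independent, no future term is used.
import numpy as np

rng = np.random.default_rng(0)
n_bins, n_actions = 10, 2
Q = np.zeros((n_bins, n_actions))
alpha, epsilon = 0.1, 0.1

for _ in range(5000):
    confidence = rng.random()                       # simulated classifier confidence
    correct = rng.random() < confidence             # label is right more often when confident
    state = min(int(confidence * n_bins), n_bins - 1)
    action = rng.integers(n_actions) if rng.random() < epsilon else int(Q[state].argmax())
    reward = 0.0 if action == 0 else (1.0 if correct else -1.0)
    Q[state, action] += alpha * (reward - Q[state, action])

keep_bin = int(np.argmax(Q[:, 1] > Q[:, 0]))
print(f"learned policy keeps web labels once confidence exceeds roughly {keep_bin / n_bins:.1f}")
```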
While deep feature learning has revolutionized techniques for static-image understanding, the same does not quite hold for video processing.
We conclude that the optimal strategy is to ask as many questions as possible in a HIT (Human Intelligence Task); in our experiments, up to 52 binary questions after watching a 30-second video clip.
In this work we introduce a fully end-to-end approach for action detection in videos that learns to directly predict the temporal bounds of actions.
Every moment counts in action recognition.
The semantic image segmentation task presents a trade-off between test time accuracy and training-time annotation cost.
This paper brings together the latest advancements in object detection and in crowd engineering into a principled framework for accurately and efficiently localizing objects in images.
We formulate joint calibration as a constrained optimization problem and devise an efficient optimization algorithm to find its global optimum.
The ImageNet Large Scale Visual Recognition Challenge is a benchmark in object category classification and detection on hundreds of object categories and millions of images.