Gender biases are known to exist within large-scale visual datasets and can be reflected or even amplified in downstream models.
In this work, we grapple with questions that arise along three stages of the machine learning pipeline when incorporating intersectionality, i.e., multiple demographic attributes considered jointly: (1) which demographic attributes to include as dataset labels, (2) how to handle the progressively smaller sizes of intersectional subgroups during model training, and (3) how to move beyond existing evaluation metrics when benchmarking model fairness across a larger number of subgroups.
Image captioning is an important task for benchmarking visual reasoning and for enabling accessibility for people with vision impairments.
Machine learning models are known to perpetuate and even amplify the biases present in their training data.
We further demonstrate our approach to learning to imagine and execute in three environments, the last of which is deformable rope manipulation on a PR2 robot.
We posit that a generative approach is the natural remedy for this problem, and propose a method for classification using generative models.
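The source does not specify the proposed method, but the general idea of classification with generative models can be illustrated with a minimal sketch: fit a class-conditional density p(x | y) per class and classify via Bayes' rule, argmax over p(x | y) p(y). The diagonal-Gaussian class model below is purely an illustrative assumption, not the authors' method.

```python
import numpy as np

class GaussianGenerativeClassifier:
    """Illustrative generative classifier: model p(x | y) per class
    with a diagonal Gaussian, then predict argmax_y p(x | y) p(y)."""

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.means_, self.vars_, self.priors_ = [], [], []
        for c in self.classes_:
            Xc = X[y == c]
            self.means_.append(Xc.mean(axis=0))
            self.vars_.append(Xc.var(axis=0) + 1e-6)  # variance floor for stability
            self.priors_.append(len(Xc) / len(X))
        return self

    def predict(self, X):
        # Score each class: log p(y) + sum_d log N(x_d; mu_d, var_d)
        scores = []
        for mu, var, prior in zip(self.means_, self.vars_, self.priors_):
            ll = -0.5 * np.sum(np.log(2 * np.pi * var) + (X - mu) ** 2 / var, axis=1)
            scores.append(ll + np.log(prior))
        return self.classes_[np.argmax(np.stack(scores, axis=1), axis=1)]
```

Because the classifier models the data distribution itself rather than only the decision boundary, it can in principle flag inputs that are unlikely under every class, which is one motivation often given for generative approaches.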