One of the distinguishing aspects of human language is its compositionality,
which allows us to describe complex environments with limited vocabulary. Previously, it has been shown that neural network agents can learn to
communicate in a highly structured, possibly compositional language based on
disentangled input (e.g. hand- engineered features)...
Humans, however, do not
learn to communicate based on well-summarized features. In this work, we train
neural agents to simultaneously develop visual perception from raw image
pixels, and learn to communicate with a sequence of discrete symbols. The
agents play an image description game where the image contains factors such as
colors and shapes. We train the agents using the obverter technique where an
agent introspects to generate messages that maximize its own understanding. Through qualitative analysis, visualization and a zero-shot test, we show that
the agents can develop, out of raw image pixels, a language with compositional
properties, given a proper pressure from the environment.