Revealing the Incentive to Cause Distributional Shift

29 Sep 2021 · David Krueger, Tegan Maharaj, Jan Leike ·

Decisions made by machine learning systems have increasing influence on the world, yet it is common for machine learning algorithms to assume that no such influence exists. An example is the use of the i.i.d. assumption in content recommendation: In fact, the (choice of) content displayed can change users’ perceptions and preferences, or even drive them away, causing a shift in the distribution of users. We introduce the term auto-induced distributional shift (ADS) to describe the phenomenon of an algorithm causing change in the distribution of its own inputs. Leveraging ADS can be a means of increasing performance. But this is not always desirable, since performance metrics often underspecify what type of behaviour is desirable. When real-world conditions violate assumptions (such as i.i.d. data), this underspecification can result in unexpected behaviour. To diagnose such issues, we introduce the approach of unit tests for incentives: simple environments designed to show whether an algorithm will hide or reveal incentives to achieve performance via certain means (in our case, via ADS). We use these unit tests to demonstrate that changes to the learning algorithm (e.g. introducing meta-learning) can cause previously hidden incentives to be revealed, resulting in qualitatively different behaviour despite no change in performance metric. We further introduce a toy environment for modelling real-world issues with ADS in content recommendation, where we demonstrate that strong meta-learners achieve gains in performance via ADS. These experiments confirm that the unit tests work – an algorithm’s failure of the unit test correctly diagnoses its propensity to reveal incentives for ADS.

PDF Abstract