15 Dec 2021 • Andrew Wang, Mohit Sudhakar, Yangfeng Ji
We hypothesize that a low-dimensional toxic subspace exists in the latent space of pre-trained language models, which suggests that toxic features follow an underlying pattern and can therefore be removed.