White-Box Attacks on Hate-speech BERT Classifiers in German with Explicit and Implicit Character Level Defense

11 Feb 2022  ·  Shahrukh Khan, Mahnoor Shahid, Navdeeppal Singh

In this work, we evaluate the adversarial robustness of BERT models trained on German hate speech datasets. We complement this evaluation with two novel white-box attacks, at the character and word level, thereby extending the range of available attacks. We also compare two novel character-level defense strategies and evaluate their robustness relative to one another.
