no code implementations • 23 Nov 2024 • Manuel Tonneau, Diyi Liu, Niyati Malhotra, Scott A. Hale, Samuel P. Fraiberger, Victor Orozco-Olvera, Paul Röttger
To address this issue, we introduce HateDay, the first global hate speech dataset representative of social media settings, randomly sampled from all tweets posted on September 21, 2022, covering eight languages and four English-speaking countries.
no code implementations • 4 Oct 2024 • Xinpeng Wang, Chengzhi Hu, Paul Röttger, Barbara Plank
We also show that our approach can be used for fine-grained calibration of model safety.
no code implementations • 8 Aug 2024 • Fabio Pernisi, Dirk Hovy, Paul Röttger
We contribute towards closing this gap by investigating the effectiveness of many-shot jailbreaking in Italian, where models are prompted with unsafe demonstrations to induce unsafe behaviour.
1 code implementation • 20 Jun 2024 • Kobi Hackenburg, Ben M. Tappin, Paul Röttger, Scott Hale, Jonathan Bright, Helen Margetts
Large language models can now generate political messages as persuasive as those written by humans, raising concerns about how far this persuasiveness may continue to increase with model size.
1 code implementation • 27 Apr 2024 • Manuel Tonneau, Diyi Liu, Samuel Fraiberger, Ralph Schroeder, Scott A. Hale, Paul Röttger
We find that hate speech datasets for these languages exhibit a strong geo-cultural bias, largely overrepresenting a handful of countries (e.g., the US and UK for English) relative to their prominence in both the broader social media population and the general population speaking these languages.
no code implementations • 25 Apr 2024 • Francisco Eiras, Aleksandar Petrov, Bertie Vidgen, Christian Schroeder de Witt, Fabio Pizzati, Katherine Elkins, Supratik Mukhopadhyay, Adel Bibi, Botos Csaba, Fabro Steibel, Fazl Barez, Genevieve Smith, Gianluca Guadagni, Jon Chun, Jordi Cabot, Joseph Marvin Imperial, Juan A. Nolazco-Flores, Lori Landay, Matthew Jackson, Paul Röttger, Philip H. S. Torr, Trevor Darrell, Yong Suk Lee, Jakob Foerster
In the next few years, applications of Generative AI are expected to revolutionize a number of different areas, ranging from science & medicine to education.
1 code implementation • 24 Apr 2024 • Hannah Rose Kirk, Alexander Whitefield, Paul Röttger, Andrew Bean, Katerina Margatina, Juan Ciro, Rafael Mosquera, Max Bartolo, Adina Williams, He He, Bertie Vidgen, Scott A. Hale
Human feedback is central to the alignment of Large Language Models (LLMs).
1 code implementation • 18 Apr 2024 • Bertie Vidgen, Adarsh Agrawal, Ahmed M. Ahmed, Victor Akinwande, Namir Al-Nuaimi, Najla Alfaraj, Elie Alhajjar, Lora Aroyo, Trupti Bavalatti, Max Bartolo, Borhane Blili-Hamelin, Kurt Bollacker, Rishi Bomassani, Marisa Ferrara Boston, Siméon Campos, Kal Chakra, Canyu Chen, Cody Coleman, Zacharie Delpierre Coudert, Leon Derczynski, Debojyoti Dutta, Ian Eisenberg, James Ezick, Heather Frase, Brian Fuller, Ram Gandikota, Agasthya Gangavarapu, Ananya Gangavarapu, James Gealy, Rajat Ghosh, James Goel, Usman Gohar, Sujata Goswami, Scott A. Hale, Wiebke Hutiri, Joseph Marvin Imperial, Surgan Jandial, Nick Judd, Felix Juefei-Xu, Foutse khomh, Bhavya Kailkhura, Hannah Rose Kirk, Kevin Klyman, Chris Knotz, Michael Kuchnik, Shachi H. Kumar, Srijan Kumar, Chris Lengerich, Bo Li, Zeyi Liao, Eileen Peters Long, Victor Lu, Sarah Luger, Yifan Mai, Priyanka Mary Mammen, Kelvin Manyeki, Sean McGregor, Virendra Mehta, Shafee Mohammed, Emanuel Moss, Lama Nachman, Dinesh Jinenhally Naganna, Amin Nikanjam, Besmira Nushi, Luis Oala, Iftach Orr, Alicia Parrish, Cigdem Patlak, William Pietri, Forough Poursabzi-Sangdeh, Eleonora Presani, Fabrizio Puletti, Paul Röttger, Saurav Sahay, Tim Santos, Nino Scherrer, Alice Schoenauer Sebag, Patrick Schramowski, Abolfazl Shahbazi, Vin Sharma, Xudong Shen, Vamsi Sistla, Leonard Tang, Davide Testuggine, Vithursan Thangarasa, Elizabeth Anne Watkins, Rebecca Weiss, Chris Welty, Tyler Wilbers, Adina Williams, Carole-Jean Wu, Poonam Yadav, Xianjun Yang, Yi Zeng, Wenhui Zhang, Fedor Zhdanov, Jiacheng Zhu, Percy Liang, Peter Mattson, Joaquin Vanschoren
We created a new taxonomy of 13 hazard categories, of which 7 have tests in the v0.5 benchmark.
1 code implementation • 12 Apr 2024 • Xinpeng Wang, Chengzhi Hu, Bolei Ma, Paul Röttger, Barbara Plank
We show that text answers are more robust to question perturbations than first-token probabilities when the first-token answers mismatch the text answers.
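For illustration, here is a minimal sketch, not the authors' evaluation code, of the two answer-extraction strategies being contrasted: reading the option letter with the highest first-token probability versus parsing the option from the generated text. The model name, prompt format, and regex are assumptions.

```python
# Contrast (1) first-token probability evaluation with (2) text-answer evaluation
# for a multiple-choice question. Model and prompt format are placeholder assumptions.
import re
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "meta-llama/Llama-2-7b-chat-hf"  # assumed; any instruction-tuned causal LM works
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.float16, device_map="auto")

prompt = (
    "Question: Which planet is known as the Red Planet?\n"
    "A. Venus\nB. Mars\nC. Jupiter\nD. Saturn\n"
    "Answer with a single letter.\nAnswer:"
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# (1) First-token evaluation: compare next-token probabilities of the option letters.
with torch.no_grad():
    logits = model(**inputs).logits[0, -1]
probs = logits.softmax(dim=-1)
option_ids = {opt: tokenizer.encode(opt, add_special_tokens=False)[0] for opt in "ABCD"}
first_token_answer = max(option_ids, key=lambda opt: probs[option_ids[opt]].item())

# (2) Text-answer evaluation: generate freely and parse the first option letter mentioned.
gen = model.generate(**inputs, max_new_tokens=20, do_sample=False)
text = tokenizer.decode(gen[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True)
match = re.search(r"\b([ABCD])\b", text)
text_answer = match.group(1) if match else None

print(f"first-token answer: {first_token_answer}, text answer: {text_answer}")
print("mismatch" if text_answer and text_answer != first_token_answer else "agreement")
```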
1 code implementation • 8 Apr 2024 • Paul Röttger, Fabio Pernisi, Bertie Vidgen, Dirk Hovy
Researchers and practitioners have met these concerns by introducing an abundance of new datasets for evaluating and improving LLM safety.
1 code implementation • 28 Mar 2024 • Janis Goldzycher, Paul Röttger, Gerold Schneider
Our experiments show that the resulting dataset is challenging even for state-of-the-art hate speech detection models, and that training on GAHD clearly improves model robustness.
1 code implementation • 6 Mar 2024 • Carolin Holtermann, Paul Röttger, Timm Dill, Anne Lauscher
Therefore, in this paper, we investigate the basic multilingual capabilities of state-of-the-art open LLMs beyond their intended use.
1 code implementation • 26 Feb 2024 • Paul Röttger, Valentin Hofmann, Valentina Pyatkin, Musashi Hinck, Hannah Rose Kirk, Hinrich Schütze, Dirk Hovy
Motivated by this discrepancy, we challenge the prevailing constrained evaluation paradigm for values and opinions in LLMs and explore more realistic unconstrained evaluations.
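As a rough illustration of the contrast, the sketch below poses the same value-laden statement once as a forced-choice question and once open-ended; the model, statement, and prompt wording are placeholders, not the paper's evaluation setup.

```python
# Constrained (forced-choice) vs. unconstrained (open-ended) prompting of the same statement.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")  # placeholder model

statement = "The freer the market, the freer the people."

constrained_prompt = (
    f"Statement: {statement}\n"
    "Do you agree or disagree? Respond with exactly one of: "
    "Strongly disagree, Disagree, Agree, Strongly agree.\nAnswer:"
)
unconstrained_prompt = (
    f"Statement: {statement}\n"
    "What is your view on this statement? Answer in one or two sentences.\nAnswer:"
)

for name, prompt in [("constrained", constrained_prompt), ("unconstrained", unconstrained_prompt)]:
    out = generator(prompt, max_new_tokens=40, do_sample=False)[0]["generated_text"]
    print(f"--- {name} ---\n{out[len(prompt):].strip()}\n")
```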
1 code implementation • 22 Feb 2024 • Xinpeng Wang, Bolei Ma, Chengzhi Hu, Leon Weber-Genzel, Paul Röttger, Frauke Kreuter, Dirk Hovy, Barbara Plank
The open-ended nature of language generation makes the evaluation of autoregressive large language models (LLMs) challenging.
no code implementations • 14 Nov 2023 • Bertie Vidgen, Nino Scherrer, Hannah Rose Kirk, Rebecca Qian, Anand Kannappan, Scott A. Hale, Paul Röttger
While some of the models do not give a single unsafe response, most give unsafe responses to more than 20% of the prompts, with over 50% unsafe responses in the extreme.
no code implementations • 11 Oct 2023 • Hannah Rose Kirk, Andrew M. Bean, Bertie Vidgen, Paul Röttger, Scott A. Hale
Human feedback is increasingly used to steer the behaviours of Large Language Models (LLMs).
no code implementations • 3 Oct 2023 • Hannah Rose Kirk, Bertie Vidgen, Paul Röttger, Scott A. Hale
In this paper, we address the concept of "alignment" in large language models (LLMs) through the lens of post-structuralist socio-political theory, specifically examining its parallels to empty signifiers.
4 code implementations • 14 Sep 2023 • Federico Bianchi, Mirac Suzgun, Giuseppe Attanasio, Paul Röttger, Dan Jurafsky, Tatsunori Hashimoto, James Zou
Training large language models to follow instructions makes them perform better on a wide range of tasks and generally become more helpful.
1 code implementation • 2 Aug 2023 • Paul Röttger, Hannah Rose Kirk, Bertie Vidgen, Giuseppe Attanasio, Federico Bianchi, Dirk Hovy
In this paper, we introduce a new test suite called XSTest to identify such eXaggerated Safety behaviours in a systematic way.
1 code implementation • 20 Jun 2023 • Matthias Orlikowski, Paul Röttger, Philipp Cimiano, Dirk Hovy
To account for sociodemographics in models of individual annotator behaviour, we introduce group-specific layers to multi-annotator models.
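A minimal PyTorch sketch of the idea, assuming a shared encoder whose pooled output passes through a layer selected by the annotator's sociodemographic group before an annotator-specific classification head; dimensions and structure are illustrative, not the authors' implementation.

```python
# Multi-annotator model with group-specific layers: shared text encoding -> group layer -> annotator head.
import torch
import torch.nn as nn

class GroupSpecificMultiAnnotatorModel(nn.Module):
    def __init__(self, encoder_dim: int, num_groups: int, num_annotators: int, num_labels: int):
        super().__init__()
        # One transformation per sociodemographic group (e.g., age bracket, gender).
        self.group_layers = nn.ModuleList(
            [nn.Linear(encoder_dim, encoder_dim) for _ in range(num_groups)]
        )
        # One classification head per individual annotator.
        self.annotator_heads = nn.ModuleList(
            [nn.Linear(encoder_dim, num_labels) for _ in range(num_annotators)]
        )

    def forward(self, encoded_text: torch.Tensor, group_id: int, annotator_id: int) -> torch.Tensor:
        # encoded_text: (batch, encoder_dim), e.g. a pooled embedding from a text encoder.
        h = torch.relu(self.group_layers[group_id](encoded_text))
        return self.annotator_heads[annotator_id](h)

# Toy usage with random tensors standing in for real encoder outputs.
model = GroupSpecificMultiAnnotatorModel(encoder_dim=768, num_groups=4, num_annotators=20, num_labels=2)
fake_pooled = torch.randn(8, 768)
logits = model(fake_pooled, group_id=1, annotator_id=7)  # (8, 2) annotator-specific logits
print(logits.shape)
```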
no code implementations • 9 Mar 2023 • Hannah Rose Kirk, Bertie Vidgen, Paul Röttger, Scott A. Hale
Large language models (LLMs) are used to generate content for a wide range of tasks, and are set to reach a growing audience in coming years due to integration in product interfaces like ChatGPT or search engines like Bing.
1 code implementation • 7 Mar 2023 • Hannah Rose Kirk, Wenjie Yin, Bertie Vidgen, Paul Röttger
Online sexism is a widespread and harmful phenomenon.
1 code implementation • 20 Oct 2022 • Paul Röttger, Debora Nozza, Federico Bianchi, Dirk Hovy
More data is needed, but annotating hateful content is expensive, time-consuming and potentially harmful to annotators.
1 code implementation • NAACL (WOAH) 2022 • Paul Röttger, Haitham Seelawi, Debora Nozza, Zeerak Talat, Bertie Vidgen
To help address this issue, we introduce Multilingual HateCheck (MHC), a suite of functional tests for multilingual hate speech detection models.
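As an illustration of how such functional tests are applied, the sketch below runs labelled template cases through a hate speech classifier and scores accuracy per functionality; the model, label mapping, and English test cases are assumptions rather than MHC data, but the same per-functionality scoring applies in every MHC language.

```python
# Score a hate speech classifier per functional test: each case has a functionality, text, and gold label.
from collections import defaultdict
from transformers import pipeline

classifier = pipeline("text-classification", model="Hate-speech-CNERG/dehatebert-mono-english")  # assumed model

test_cases = [
    {"functionality": "neutral_statement", "text": "I had a great time at the party.", "gold": "non-hate"},
    {"functionality": "direct_threat", "text": "People like you should be hurt.", "gold": "hate"},
    {"functionality": "negated_hate", "text": "I would never say I hate them.", "gold": "non-hate"},
]

per_functionality = defaultdict(list)
for case in test_cases:
    raw_label = classifier(case["text"])[0]["label"]
    # Map the model's label space onto hate / non-hate; this mapping is an assumption.
    pred = "hate" if raw_label.upper() in {"HATE", "LABEL_1"} else "non-hate"
    per_functionality[case["functionality"]].append(pred == case["gold"])

for functionality, results in per_functionality.items():
    print(f"{functionality}: {sum(results)}/{len(results)} correct")
```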
1 code implementation • NAACL 2022 • Paul Röttger, Bertie Vidgen, Dirk Hovy, Janet B. Pierrehumbert
To address this issue, we propose two contrasting paradigms for data annotation.
1 code implementation • NAACL 2022 • Hannah Rose Kirk, Bertram Vidgen, Paul Röttger, Tristan Thrush, Scott A. Hale
Using the test suite, we expose weaknesses in existing hate detection models.
2 code implementations • Findings (EMNLP) 2021 • Paul Röttger, Janet B. Pierrehumbert
Token-level analysis shows that temporal adaptation captures event-driven changes in language use in the downstream task, but not those changes that are actually relevant to task performance.
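A common recipe for temporal adaptation is to continue masked-language-model pretraining on unlabelled text from the target time period before fine-tuning on the downstream task. The sketch below follows that recipe under assumed placeholder data, model, and hyperparameters.

```python
# Continued MLM pretraining on unlabelled text from the target time period ("temporal adaptation").
from datasets import Dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

# Placeholder: unlabelled posts from the target time period (e.g., the evaluation month).
target_period_texts = ["example post from the target month", "another post from the same period"]
dataset = Dataset.from_dict({"text": target_period_texts})
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
    batched=True,
    remove_columns=["text"],
)

collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)
args = TrainingArguments(output_dir="bert-temporally-adapted", num_train_epochs=1,
                         per_device_train_batch_size=16)
Trainer(model=model, args=args, train_dataset=dataset, data_collator=collator).train()
# The adapted checkpoint in `bert-temporally-adapted` is then fine-tuned on labelled task data.
```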
3 code implementations • ACL 2021 • Paul Röttger, Bertram Vidgen, Dong Nguyen, Zeerak Waseem, Helen Margetts, Janet B. Pierrehumbert
Detecting online hate is a difficult task that even state-of-the-art models struggle with.