Search Results for author: Paul Röttger

Found 22 papers, 17 papers with code

The PRISM Alignment Project: What Participatory, Representative and Individualised Human Feedback Reveals About the Subjective and Multicultural Alignment of Large Language Models

1 code implementation • 24 Apr 2024 • Hannah Rose Kirk, Alexander Whitefield, Paul Röttger, Andrew Bean, Katerina Margatina, Juan Ciro, Rafael Mosquera, Max Bartolo, Adina Williams, He He, Bertie Vidgen, Scott A. Hale

Human feedback plays a central role in the alignment of Large Language Models (LLMs).

Paper
Code

Introducing v0.5 of the AI Safety Benchmark from MLCommons

1 code implementation • 18 Apr 2024 • Bertie Vidgen, Adarsh Agrawal, Ahmed M. Ahmed, Victor Akinwande, Namir Al-Nuaimi, Najla Alfaraj, Elie Alhajjar, Lora Aroyo, Trupti Bavalatti, Borhane Blili-Hamelin, Kurt Bollacker, Rishi Bomassani, Marisa Ferrara Boston, Siméon Campos, Kal Chakra, Canyu Chen, Cody Coleman, Zacharie Delpierre Coudert, Leon Derczynski, Debojyoti Dutta, Ian Eisenberg, James Ezick, Heather Frase, Brian Fuller, Ram Gandikota, Agasthya Gangavarapu, Ananya Gangavarapu, James Gealy, Rajat Ghosh, James Goel, Usman Gohar, Sujata Goswami, Scott A. Hale, Wiebke Hutiri, Joseph Marvin Imperial, Surgan Jandial, Nick Judd, Felix Juefei-Xu, Foutse khomh, Bhavya Kailkhura, Hannah Rose Kirk, Kevin Klyman, Chris Knotz, Michael Kuchnik, Shachi H. Kumar, Chris Lengerich, Bo Li, Zeyi Liao, Eileen Peters Long, Victor Lu, Yifan Mai, Priyanka Mary Mammen, Kelvin Manyeki, Sean McGregor, Virendra Mehta, Shafee Mohammed, Emanuel Moss, Lama Nachman, Dinesh Jinenhally Naganna, Amin Nikanjam, Besmira Nushi, Luis Oala, Iftach Orr, Alicia Parrish, Cigdem Patlak, William Pietri, Forough Poursabzi-Sangdeh, Eleonora Presani, Fabrizio Puletti, Paul Röttger, Saurav Sahay, Tim Santos, Nino Scherrer, Alice Schoenauer Sebag, Patrick Schramowski, Abolfazl Shahbazi, Vin Sharma, Xudong Shen, Vamsi Sistla, Leonard Tang, Davide Testuggine, Vithursan Thangarasa, Elizabeth Anne Watkins, Rebecca Weiss, Chris Welty, Tyler Wilbers, Adina Williams, Carole-Jean Wu, Poonam Yadav, Xianjun Yang, Yi Zeng, Wenhui Zhang, Fedor Zhdanov, Jiacheng Zhu, Percy Liang, Peter Mattson, Joaquin Vanschoren

We created a new taxonomy of 13 hazard categories, of which 7 have tests in the v0. 5 benchmark.

Paper
Code

Look at the Text: Instruction-Tuned Language Models are More Robust Multiple Choice Selectors than You Think

no code implementations • 12 Apr 2024 • Xinpeng Wang, Chengzhi Hu, Bolei Ma, Paul Röttger, Barbara Plank

We show that the text answers are more robust to question perturbations than the first token probabilities, when the first token answers mismatch the text answers.

Multiple-choice

Paper
Add Code

SafetyPrompts: a Systematic Review of Open Datasets for Evaluating and Improving Large Language Model Safety

2 code implementations • 8 Apr 2024 • Paul Röttger, Fabio Pernisi, Bertie Vidgen, Dirk Hovy

Researchers and practitioners have met these concerns by introducing an abundance of new datasets for evaluating and improving LLM safety.

Language Modelling Large Language Model

152

Paper
Code

Improving Adversarial Data Collection by Supporting Annotators: Lessons from GAHD, a German Hate Speech Dataset

1 code implementation • 28 Mar 2024 • Janis Goldzycher, Paul Röttger, Gerold Schneider

Our experiments show that the resulting dataset is challenging even for state-of-the-art hate speech detection models, and that training on GAHD clearly improves model robustness.

Hate Speech Detection

Paper
Code

Evaluating the Elementary Multilingual Capabilities of Large Language Models with MultiQ

1 code implementation • 6 Mar 2024 • Carolin Holtermann, Paul Röttger, Timm Dill, Anne Lauscher

Therefore, in this paper, we investigate the basic multilingual capabilities of state-of-the-art open LLMs beyond their intended use.

Open-Ended Question Answering

Paper
Code

Political Compass or Spinning Arrow? Towards More Meaningful Evaluations for Values and Opinions in Large Language Models

1 code implementation • 26 Feb 2024 • Paul Röttger, Valentin Hofmann, Valentina Pyatkin, Musashi Hinck, Hannah Rose Kirk, Hinrich Schütze, Dirk Hovy

Motivated by this discrepancy, we challenge the prevailing constrained evaluation paradigm for values and opinions in LLMs and explore more realistic unconstrained evaluations.

Multiple-choice

Paper
Code

"My Answer is C": First-Token Probabilities Do Not Match Text Answers in Instruction-Tuned Language Models

1 code implementation • 22 Feb 2024 • Xinpeng Wang, Bolei Ma, Chengzhi Hu, Leon Weber-Genzel, Paul Röttger, Frauke Kreuter, Dirk Hovy, Barbara Plank

The open-ended nature of language generation makes the evaluation of autoregressive large language models (LLMs) challenging.

Multiple-choice Text Generation

Paper
Code

SimpleSafetyTests: a Test Suite for Identifying Critical Safety Risks in Large Language Models

no code implementations • 14 Nov 2023 • Bertie Vidgen, Nino Scherrer, Hannah Rose Kirk, Rebecca Qian, Anand Kannappan, Scott A. Hale, Paul Röttger

While some of the models do not give a single unsafe response, most give unsafe responses to more than 20% of the prompts, with over 50% unsafe responses in the extreme.

Paper
Add Code

The Past, Present and Better Future of Feedback Learning in Large Language Models for Subjective Human Preferences and Values

no code implementations • 11 Oct 2023 • Hannah Rose Kirk, Andrew M. Bean, Bertie Vidgen, Paul Röttger, Scott A. Hale

Human feedback is increasingly used to steer the behaviours of Large Language Models (LLMs).

Paper
Add Code

The Empty Signifier Problem: Towards Clearer Paradigms for Operationalising "Alignment" in Large Language Models

no code implementations • 3 Oct 2023 • Hannah Rose Kirk, Bertie Vidgen, Paul Röttger, Scott A. Hale

In this paper, we address the concept of "alignment" in large language models (LLMs) through the lens of post-structuralist socio-political theory, specifically examining its parallels to empty signifiers.

Paper
Add Code

Safety-Tuned LLaMAs: Lessons From Improving the Safety of Large Language Models that Follow Instructions

1 code implementation • 14 Sep 2023 • Federico Bianchi, Mirac Suzgun, Giuseppe Attanasio, Paul Röttger, Dan Jurafsky, Tatsunori Hashimoto, James Zou

Training large language models to follow instructions makes them perform better on a wide range of tasks and generally become more helpful.

Paper
Code

XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models

1 code implementation • 2 Aug 2023 • Paul Röttger, Hannah Rose Kirk, Bertie Vidgen, Giuseppe Attanasio, Federico Bianchi, Dirk Hovy

In this paper, we introduce a new test suite called XSTest to identify such eXaggerated Safety behaviours in a systematic way.

Language Modelling

Paper
Code

The Ecological Fallacy in Annotation: Modelling Human Label Variation goes beyond Sociodemographics

1 code implementation • 20 Jun 2023 • Matthias Orlikowski, Paul Röttger, Philipp Cimiano, Dirk Hovy

To account for sociodemographics in models of individual annotator behaviour, we introduce group-specific layers to multi-annotator models.

Paper
Code

Personalisation within bounds: A risk taxonomy and policy framework for the alignment of large language models with personalised feedback

no code implementations • 9 Mar 2023 • Hannah Rose Kirk, Bertie Vidgen, Paul Röttger, Scott A. Hale

Large language models (LLMs) are used to generate content for a wide range of tasks, and are set to reach a growing audience in coming years due to integration in product interfaces like ChatGPT or search engines like Bing.

Paper
Add Code

SemEval-2023 Task 10: Explainable Detection of Online Sexism

1 code implementation • 7 Mar 2023 • Hannah Rose Kirk, Wenjie Yin, Bertie Vidgen, Paul Röttger

Online sexism is a widespread and harmful phenomenon.

Paper
Code

Data-Efficient Strategies for Expanding Hate Speech Detection into Under-Resourced Languages

1 code implementation • 20 Oct 2022 • Paul Röttger, Debora Nozza, Federico Bianchi, Dirk Hovy

More data is needed, but annotating hateful content is expensive, time-consuming and potentially harmful to annotators.

Hate Speech Detection

Paper
Code

Multilingual HateCheck: Functional Tests for Multilingual Hate Speech Detection Models

1 code implementation • NAACL (WOAH) 2022 • Paul Röttger, Haitham Seelawi, Debora Nozza, Zeerak Talat, Bertie Vidgen

To help address this issue, we introduce Multilingual HateCheck (MHC), a suite of functional tests for multilingual hate speech detection models.

Hate Speech Detection

Paper
Code

Two Contrasting Data Annotation Paradigms for Subjective NLP Tasks

1 code implementation • NAACL 2022 • Paul Röttger, Bertie Vidgen, Dirk Hovy, Janet B. Pierrehumbert

To address this issue, we propose two contrasting paradigms for data annotation.

Descriptive valid +1

Paper
Code

Hatemoji: A Test Suite and Adversarially-Generated Dataset for Benchmarking and Detecting Emoji-based Hate

1 code implementation • NAACL 2022 • Hannah Rose Kirk, Bertram Vidgen, Paul Röttger, Tristan Thrush, Scott A. Hale

Using the test suite, we expose weaknesses in existing hate detection models.

Benchmarking

Paper
Code

Temporal Adaptation of BERT and Performance on Downstream Document Classification: Insights from Social Media

2 code implementations • Findings (EMNLP) 2021 • Paul Röttger, Janet B. Pierrehumbert

Token-level analysis shows that temporal adaptation captures event-driven changes in language use in the downstream task, but not those changes that are actually relevant to task performance.

Document Classification Domain Adaptation +2

Paper
Code

HateCheck: Functional Tests for Hate Speech Detection Models

3 code implementations • ACL 2021 • Paul Röttger, Bertram Vidgen, Dong Nguyen, Zeerak Waseem, Helen Margetts, Janet B. Pierrehumbert

Detecting online hate is a difficult task that even state-of-the-art models struggle with.

Hate Speech Detection

Paper
Code

Cannot find the paper you are looking for? You can Submit a new open access paper.