no code implementations • 7 Jan 2025 • Yuxi Xia, Pedro Henrique Luz de Araujo, Klim Zaporojets, Benjamin Roth
Concretely, we build Calib-n, a novel framework that trains an auxiliary model for confidence estimation that aggregates responses from multiple LLMs to capture inter-model agreement.
1 code implementation • 2 Jul 2024 • Pedro Henrique Luz de Araujo, Benjamin Roth
We also compare personas' generations to two baseline settings: a control persona setting with 30 paraphrases of "a helpful assistant" to control for models' prompt sensitivity, and an empty persona setting where no persona is assigned.
no code implementations • 7 Jun 2024 • Andreas Stephan, Lukas Miklautz, Collin Leiber, Pedro Henrique Luz de Araujo, Dominik Répás, Claudia Plant, Benjamin Roth
Traditional image clustering techniques only find a single grouping within visual data.
no code implementations • 5 May 2024 • Yuxi Xia, Anastasiia Sedova, Pedro Henrique Luz de Araujo, Vasiliki Kougia, Lisa Nußbaumer, Benjamin Roth
Finally, the prompt performance of detecting model memorization is quantified by the percentage of name pairs for which the model has higher confidence for the name from the training set.
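The metric described above can be sketched as a simple win rate over name pairs. This is a minimal illustration and not the authors' implementation: the function name `memorization_detection_rate`, the input layout (one tuple of confidences per pair), and the decision to count ties as misses are all assumptions made for the example.

```python
from typing import Sequence, Tuple


def memorization_detection_rate(pairs: Sequence[Tuple[float, float]]) -> float:
    """Percentage of pairs where the training-set name gets higher confidence.

    Each pair is (confidence_for_training_set_name, confidence_for_other_name).
    Hypothetical helper: ties are counted as misses (an assumption).
    """
    if not pairs:
        raise ValueError("at least one name pair is required")
    # Count pairs where the model is strictly more confident in the seen name.
    wins = sum(1 for seen, unseen in pairs if seen > unseen)
    return 100.0 * wins / len(pairs)


# Made-up confidences for four name pairs, for illustration only.
example_pairs = [(0.9, 0.1), (0.2, 0.8), (0.7, 0.3), (0.6, 0.6)]
rate = memorization_detection_rate(example_pairs)
```

Under this reading, a rate near 50% would suggest the prompt cannot tell seen names from unseen ones, while a higher rate would suggest it can.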
no code implementations • 13 Mar 2024 • Benjamin Roth, Pedro Henrique Luz de Araujo, Yuxi Xia, Saskia Kaltenbrunner, Christoph Korab
Machine learning (ML) and artificial intelligence (AI) approaches are often criticized for their inherent bias and for their lack of control, accountability, and transparency.
no code implementations • 14 Nov 2023 • Pedro Henrique Luz de Araujo, Benjamin Roth
We combine the specification instructions to create specification-augmented prompts, which we feed to language models pre-trained on natural instruction data.
1 code implementation • 22 May 2023 • Pedro Henrique Luz de Araujo, Benjamin Roth
In behavioural testing, system functionalities underrepresented in the standard evaluation setting (with a held-out test set) are validated through controlled input-output pairs.
1 code implementation • nlppower (ACL) 2022 • Pedro Henrique Luz de Araujo, Benjamin Roth
Behavioural testing -- verifying system capabilities by validating human-designed input-output pairs -- is an alternative evaluation method of natural language processing systems proposed to address the shortcomings of the standard approach: computing metrics on held-out data.
1 code implementation • 1 Dec 2020 • Pedro Henrique Luz de Araujo, Teófilo Emidio de Campos
The data consist of a corpus of 45,532 lawsuits manually annotated by the Court’s experts with theme labels, a multi-class and multi-label classification task.
1 code implementation • LREC 2020 • Pedro Henrique Luz de Araujo, Teófilo Emídio de Campos, Fabricio Ataides Braz, Nilton Correia da Silva
This paper describes VICTOR, a novel dataset built from Brazil's Supreme Court digitalized legal documents, composed of more than 45 thousand appeals, which includes roughly 692 thousand documents (about 4.6 million pages).
Ranked #1 on Multi-Label Text Classification on MVICTOR (theme)
1 code implementation • International Conference on Computational Processing of the Portuguese Language 2020 • Pedro Henrique Luz de Araujo, Teófilo Emidio de Campos, Marcelo Magalhães Silva de Sousa
Official Gazettes are a rich source of relevant information to the public.
Ranked #1 on Text Classification on DODF Data