Search Results for author: Steffen Eger

Found 99 papers, 62 papers with code

TUDa at WMT21: Sentence-Level Direct Assessment with Adapters

no code implementations WMT (EMNLP) 2021 Gregor Geigle, Jonas Stadtmüller, Wei Zhao, Jonas Pfeiffer, Steffen Eger

This paper presents our submissions to the WMT2021 Shared Task on Quality Estimation, Task 1 Sentence-Level Direct Assessment.

Sentence

Evaluation of Coreference Resolution Systems Under Adversarial Attacks

no code implementations EMNLP (CODI) 2020 Haixia Chai, Wei Zhao, Steffen Eger, Michael Strube

A substantial overlap of coreferent mentions in the CoNLL dataset magnifies the recent progress on coreference resolution.

coreference-resolution

End-to-end style-conditioned poetry generation: What does it take to learn from examples alone?

no code implementations EMNLP (LaTeCHCLfL, CLFL, LaTeCH) 2021 Jörg Wöckener, Thomas Haider, Tristan Miller, The-Khang Nguyen, Thanh Tung Linh Nguyen, Minh Vu Pham, Jonas Belouadi, Steffen Eger

In this work, we design an end-to-end model for poetry generation based on conditioned recurrent neural network (RNN) language models whose goal is to learn stylistic features (poem length, sentiment, alliteration, and rhyming) from examples alone.

TUDA-Reproducibility @ ReproGen: Replicability of Human Evaluation of Text-to-Text and Concept-to-Text Generation

no code implementations INLG (ACL) 2021 Christian Richter, Yanran Chen, Steffen Eger

This paper describes our contribution to the Shared Task ReproGen by Belz et al. (2021), which investigates the reproducibility of human evaluations in the context of Natural Language Generation.

Concept-To-Text Generation Paper generation

LLLMs: A Data-Driven Survey of Evolving Research on Limitations of Large Language Models

no code implementations 25 May 2025 Aida Kostikova, Zhipin Wang, Deidamea Bajri, Ole Pütz, Benjamin Paaßen, Steffen Eger

In this survey, we conduct a data-driven, semi-automated review of research on limitations of LLMs (LLLMs) from 2022 to 2024 using a bottom-up approach.

Hallucination knowledge editing +1

CROC: Evaluating and Training T2I Metrics with Pseudo- and Human-Labeled Contrastive Robustness Checks

1 code implementation 16 May 2025 Christoph Leiter, Yuki M. Asano, Margret Keuper, Steffen Eger

We address this gap and propose CROC: a scalable framework for automated Contrastive Robustness Checks that systematically probes and quantifies metric robustness by synthesizing contrastive test cases across a comprehensive taxonomy of image properties.

Negation

DeepSeek vs. o3-mini: How Well can Reasoning LLMs Evaluate MT and Summarization?

no code implementations 10 Apr 2025 Daniil Larionov, Sotaro Takeshita, Ran Zhang, Yanran Chen, Christoph Leiter, Zhipin Wang, Christian Greisinger, Steffen Eger

Reasoning-enabled large language models (LLMs) have recently demonstrated impressive performance in complex logical and mathematical tasks, yet their effectiveness in evaluating natural language generation remains unexplored.

Machine Translation nlg evaluation +2

ContrastScore: Towards Higher Quality, Less Biased, More Efficient Evaluation Metrics with Contrastive Evaluation

no code implementations 2 Apr 2025 Xiao Wang, Daniil Larionov, Siwei Wu, Yiqi Liu, Steffen Eger, Nafise Sadat Moosavi, Chenghua Lin

In this work, we introduce ContrastScore, a contrastive evaluation metric designed to enable higher-quality, less biased, and more efficient assessment of generated text.

Machine Translation Text Generation

TikZero: Zero-Shot Text-Guided Graphics Program Synthesis

2 code implementations 14 Mar 2025 Jonas Belouadi, Eddy Ilg, Margret Keuper, Hideki Tanaka, Masao Utiyama, Raj Dabre, Steffen Eger, Simone Paolo Ponzetto

Meanwhile, large amounts of unaligned graphics programs and captioned raster images are more readily available.

Program Synthesis

BatchGEMBA: Token-Efficient Machine Translation Evaluation with Batched Prompting and Prompt Compression

no code implementations 4 Mar 2025 Daniil Larionov, Steffen Eger

Recent advancements in Large Language Model (LLM)-based Natural Language Generation evaluation have largely focused on single-example prompting, resulting in significant token overhead and computational inefficiencies.

Large Language Model Machine Translation +3

Argument Summarization and its Evaluation in the Era of Large Language Models

no code implementations 2 Mar 2025 Moritz Altemeyer, Steffen Eger, Johannes Daxenberger, Tim Altendorf, Philipp Cimiano, Benjamin Schiller

Large Language Models (LLMs) have revolutionized various Natural Language Generation (NLG) tasks, including Argument Summarization (ArgSum), a key subfield of Argument Mining (AM).

Argument Mining Text Generation

Do Emotions Really Affect Argument Convincingness? A Dynamic Approach with LLM-based Manipulation Checks

no code implementations 24 Feb 2025 Yanran Chen, Steffen Eger

Emotions have been shown to play a role in argument convincingness, yet this aspect is underexplored in the natural language processing (NLP) community.

PromptOptMe: Error-Aware Prompt Compression for LLM-based MT Evaluation Metrics

no code implementations 20 Dec 2024 Daniil Larionov, Steffen Eger

Evaluating the quality of machine-generated natural language content is a challenging task in Natural Language Processing (NLP).

Language Modeling Language Modelling +1

Graph-Guided Textual Explanation Generation Framework

no code implementations 16 Dec 2024 Shuzhou Yuan, Jingyi Sun, Ran Zhang, Michael Färber, Steffen Eger, Pepa Atanasova, Isabelle Augenstein

Specifically, highlight explanations are extracted as highly faithful cues representing the model's reasoning and are subsequently encoded through a graph neural network layer, which explicitly guides the NLE generation process.

Explanation Generation Graph Neural Network

NLLG Quarterly arXiv Report 09/24: What are the most influential current AI Papers?

2 code implementations 2 Dec 2024 Christoph Leiter, Jonas Belouadi, Yanran Chen, Ran Zhang, Daniil Larionov, Aida Kostikova, Steffen Eger

The NLLG (Natural Language Learning & Generation) arXiv reports assist in navigating the rapidly evolving landscape of NLP and AI research across the cs.CL, cs.CV, cs.AI, and cs.LG categories.

State Space Models

How Good Are LLMs for Literary Translation, Really? Literary Translation Evaluation with Humans and LLMs

1 code implementation 24 Oct 2024 Ran Zhang, Wei Zhao, Steffen Eger

We find that Multidimensional Quality Metrics (MQM), as the de facto standard in non-literary human MT evaluation, is inadequate for literary translation: While Best-Worst Scaling (BWS) with students and Scalar Quality Metric (SQM) with professional translators prefer human translations at rates of ~82% and ~94%, respectively, MQM with student annotators prefers human professional translations over the translations of the best-performing LLMs in only ~42% of cases.

2k Machine Translation +1
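
The Best-Worst Scaling preference rates above rest on the standard BWS aggregation, where an item's score is its number of "best" picks minus "worst" picks, divided by how often it was shown. A minimal sketch of that aggregation (the item names and judgments below are hypothetical; the paper's exact annotation protocol is not shown in this snippet):

```python
from collections import Counter

def bws_scores(judgments):
    """Best-Worst Scaling: score(item) = (#best - #worst) / #appearances.

    `judgments` is a list of (items_shown, best_pick, worst_pick) annotations.
    """
    best, worst, seen = Counter(), Counter(), Counter()
    for items, b, w in judgments:
        seen.update(items)   # count how often each item was shown
        best[b] += 1
        worst[w] += 1
    return {item: (best[item] - worst[item]) / seen[item] for item in seen}

# Hypothetical annotations comparing a human translation with two LLM outputs.
judgments = [
    (("human", "llm_a", "llm_b"), "human", "llm_b"),
    (("human", "llm_a", "llm_b"), "human", "llm_a"),
    (("human", "llm_a", "llm_b"), "llm_a", "llm_b"),
]
scores = bws_scores(judgments)
```

Scores fall in [-1, 1], so systems can be ranked directly even though each annotator only ever makes relative judgments.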

LLM-based multi-agent poetry generation in non-cooperative environments

1 code implementation 5 Sep 2024 Ran Zhang, Steffen Eger

Prompting-based agents in our framework also benefit from non-cooperative environments, and a more diverse ensemble of models with non-homogeneous agents has the potential to further enhance diversity, with an increase of 7.0-17.5 pp according to our experiments.

Diversity

Evaluating Diversity in Automatic Poetry Generation

1 code implementation 21 Jun 2024 Yanran Chen, Hannes Gröner, Sina Zarrieß, Steffen Eger

Natural Language Generation (NLG), and more generally generative AI, are among the currently most impactful research fields.

Diversity Text Generation

xCOMET-lite: Bridging the Gap Between Efficiency and Quality in Learned MT Evaluation Metrics

1 code implementation 20 Jun 2024 Daniil Larionov, Mikhail Seleznyov, Vasiliy Viskov, Alexander Panchenko, Steffen Eger

State-of-the-art trainable machine translation evaluation metrics like xCOMET achieve high correlation with human judgment but rely on large encoders (up to 10.7B parameters), making them computationally expensive and inaccessible to researchers with limited resources.

Machine Translation Quantization

DeTikZify: Synthesizing Graphics Programs for Scientific Figures and Sketches with TikZ

2 code implementations 24 May 2024 Jonas Belouadi, Simone Paolo Ponzetto, Steffen Eger

To achieve this, we create three new datasets: DaTikZv2, the largest TikZ dataset to date, containing over 360k human-created TikZ graphics; SketchFig, a dataset that pairs hand-drawn sketches with their corresponding scientific figures; and MetaFig, a collection of diverse scientific figures and associated metadata.

Language Modeling Language Modelling

Evaluating Large Language Models for Structured Science Summarization in the Open Research Knowledge Graph

no code implementations 3 May 2024 Vladyslav Nechakhin, Jennifer D'Souza, Steffen Eger

Current methods, such as those used by the Open Research Knowledge Graph (ORKG), involve manually curating properties to describe research papers' contributions in a structured manner, but this is labor-intensive and inconsistent among the domain-expert human curators.

Recommendation Systems

Syntactic Language Change in English and German: Metrics, Parsers, and Convergences

1 code implementation 18 Feb 2024 Yanran Chen, Wei Zhao, Anne Breitbarth, Manuel Stoeckel, Alexander Mehler, Steffen Eger

Even though we have evidence that recent parsers trained on modern treebanks are not heavily affected by data 'noise' such as spelling changes and OCR errors in our historic data, we find that results of syntactic language change are sensitive to the parsers involved, which is a caution against using a single parser for evaluating syntactic language change as done in previous work.

Optical Character Recognition (OCR) Sentence

Is there really a Citation Age Bias in NLP?

no code implementations 7 Jan 2024 Hoa Nguyen, Steffen Eger

Recently, it has been noted that there is a citation age bias in the Natural Language Processing (NLP) community, one of the currently fastest growing AI subfields, in that the mean age of the bibliography of NLP papers has become ever younger in the last few years, leading to 'citation amnesia' in which older knowledge is increasingly forgotten.

NLLG Quarterly arXiv Report 09/23: What are the most influential current AI Papers?

2 code implementations 9 Dec 2023 Ran Zhang, Aida Kostikova, Christoph Leiter, Jonas Belouadi, Daniil Larionov, Yanran Chen, Vivian Fresen, Steffen Eger

Artificial Intelligence (AI) has witnessed rapid growth, especially in the subfields Natural Language Processing (NLP), Machine Learning (ML) and Computer Vision (CV).

Navigate

The Eval4NLP 2023 Shared Task on Prompting Large Language Models as Explainable Metrics

1 code implementation 30 Oct 2023 Christoph Leiter, Juri Opitz, Daniel Deutsch, Yang Gao, Rotem Dror, Steffen Eger

Specifically, we propose a novel competition setting in which we select a list of allowed LLMs and disallow fine-tuning to ensure a focus on prompting.

Machine Translation Text Generation

AutomaTikZ: Text-Guided Synthesis of Scientific Vector Graphics with TikZ

1 code implementation 30 Sep 2023 Jonas Belouadi, Anne Lauscher, Steffen Eger

To address this, we propose the use of TikZ, a well-known abstract graphics language that can be compiled to vector graphics, as an intermediate representation of scientific figures.

Language Modeling Language Modelling +3

NLLG Quarterly arXiv Report 06/23: What are the most influential current AI Papers?

2 code implementations 31 Jul 2023 Steffen Eger, Christoph Leiter, Jonas Belouadi, Ran Zhang, Aida Kostikova, Daniil Larionov, Yanran Chen, Vivian Fresen

In particular, we compile a list of the 40 most popular papers based on normalized citation counts from the first half of 2023.
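
The snippet does not spell out the normalization, but the point of normalized citation counts is to keep recent papers comparable to older ones. Purely as an illustration (the per-week normalization and the field names are my assumptions, not the report's actual method), one common age-normalization divides raw citations by paper age:

```python
from datetime import date

def rank_by_normalized_citations(papers, today):
    """Rank papers by citations per week since upload (illustrative scheme)."""
    def citations_per_week(p):
        weeks = max((today - p["uploaded"]).days / 7, 1.0)  # avoid division by zero
        return p["citations"] / weeks
    return sorted(papers, key=citations_per_week, reverse=True)

# Hypothetical entries: the newer paper ranks first despite fewer raw citations.
papers = [
    {"id": "2301.00001", "uploaded": date(2023, 1, 2), "citations": 300},
    {"id": "2306.00002", "uploaded": date(2023, 6, 5), "citations": 120},
]
ranked = rank_by_normalized_citations(papers, today=date(2023, 7, 31))
```

Under this scheme the June paper wins (15 citations/week vs. 10), which is exactly the recency correction such normalization aims for.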

Cross-lingual Cross-temporal Summarization: Dataset, Models, Evaluation

1 code implementation 22 Jun 2023 Ran Zhang, Jihed Ouni, Steffen Eger

While summarization has been extensively researched in natural language processing (NLP), cross-lingual cross-temporal summarization (CLCTS) is a largely unexplored area that has the potential to improve cross-cultural accessibility and understanding.

Adversarial Attack Negation

Towards Explainable Evaluation Metrics for Machine Translation

no code implementations 22 Jun 2023 Christoph Leiter, Piyawat Lertvittayakumjorn, Marina Fomicheva, Wei Zhao, Yang Gao, Steffen Eger

In this context, we also discuss the latest state-of-the-art approaches to explainable metrics based on generative models such as ChatGPT and GPT4.

Machine Translation Translation

Cross-Genre Argument Mining: Can Language Models Automatically Fill in Missing Discourse Markers?

no code implementations 7 Jun 2023 Gil Rocha, Henrique Lopes Cardoso, Jonas Belouadi, Steffen Eger

We demonstrate the impact of our approach on an Argument Mining downstream task, evaluated on different corpora, showing that language models can be trained to automatically fill in discourse markers across different corpora, improving the performance of a downstream model in some, but not all, cases.

Argument Mining Discourse Parsing

ChatGPT: A Meta-Analysis after 2.5 Months

no code implementations 20 Feb 2023 Christoph Leiter, Ran Zhang, Yanran Chen, Jonas Belouadi, Daniil Larionov, Vivian Fresen, Steffen Eger

ChatGPT, a chatbot developed by OpenAI, has gained widespread popularity and media attention since its release in November 2022.

Chatbot Ethics

Transformers Go for the LOLs: Generating (Humourous) Titles from Scientific Abstracts End-to-End

1 code implementation 20 Dec 2022 Yanran Chen, Steffen Eger

Our human evaluation suggests that our best end-to-end system performs similarly to human authors (but arguably slightly worse).

Layer or Representation Space: What makes BERT-based Evaluation Metrics Robust?

1 code implementation COLING 2022 Doan Nam Long Vu, Nafise Sadat Moosavi, Steffen Eger

The evaluation of recent embedding-based evaluation metrics for text generation is primarily based on measuring their correlation with human evaluations on standard benchmarks.

Text Generation Word Embeddings

MENLI: Robust Evaluation Metrics from Natural Language Inference

1 code implementation 15 Aug 2022 Yanran Chen, Steffen Eger

Recently proposed BERT-based evaluation metrics for text generation perform well on standard benchmarks but are vulnerable to adversarial attacks, e.g., relating to information correctness.

Adversarial Attack Adversarial Robustness +4

Reproducibility Issues for BERT-based Evaluation Metrics

1 code implementation 30 Mar 2022 Yanran Chen, Jonas Belouadi, Steffen Eger

We find that reproduction of claims and results often fails because of (i) heavy undocumented preprocessing involved in the metrics, (ii) missing code and (iii) reporting weaker results for the baseline metrics.

Machine Translation Text Generation

Towards Explainable Evaluation Metrics for Natural Language Generation

1 code implementation 21 Mar 2022 Christoph Leiter, Piyawat Lertvittayakumjorn, Marina Fomicheva, Wei Zhao, Yang Gao, Steffen Eger

We also provide a synthesizing overview over recent approaches for explainable machine translation metrics and discuss how they relate to those goals and properties.

Machine Translation Text Generation +2

Did AI get more negative recently?

2 code implementations 28 Feb 2022 Dominik Beese, Begüm Altunbaş, Görkem Güzeler, Steffen Eger

We annotate over 1.5k papers from NLP and ML to train a SciBERT-based model to automatically predict the stance of a paper based on its title and abstract.

Articles

USCORE: An Effective Approach to Fully Unsupervised Evaluation Metrics for Machine Translation

1 code implementation 21 Feb 2022 Jonas Belouadi, Steffen Eger

We show that our fully unsupervised metrics are effective, i.e., they beat supervised competitors on 4 out of our 5 evaluation datasets.

Machine Translation Parallel Corpus Mining +3

Constrained Density Matching and Modeling for Cross-lingual Alignment of Contextualized Representations

no code implementations 31 Jan 2022 Wei Zhao, Steffen Eger

Multilingual representations pre-trained with monolingual data exhibit considerably unequal task performances across languages.

Attribute

DiscoScore: Evaluating Text Generation with BERT and Discourse Coherence

1 code implementation 26 Jan 2022 Wei Zhao, Michael Strube, Steffen Eger

Still, recent BERT-based evaluation metrics are weak in recognizing coherence, and thus are not reliable in a way to spot the discourse-level improvements of those text generation systems.

Document Level Machine Translation Machine Translation +1

Better than Average: Paired Evaluation of NLP Systems

1 code implementation ACL 2021 Maxime Peyrard, Wei Zhao, Steffen Eger, Robert West

Evaluation in NLP is usually done by comparing the scores of competing systems independently averaged over a common set of test instances.

Constrained Density Matching and Modeling for Effective Contextualized Alignment

no code implementations 29 Sep 2021 Wei Zhao, Steffen Eger

In this work, we analyze the limitations according to which previous alignments become very resource-intensive, viz. (i) the inability to sufficiently leverage data and (ii) that alignments are not trained properly.

Diachronic Analysis of German Parliamentary Proceedings: Ideological Shifts through the Lens of Political Biases

1 code implementation 13 Aug 2021 Tobias Walter, Celina Kirschner, Steffen Eger, Goran Glavaš, Anne Lauscher, Simone Paolo Ponzetto

We analyze bias in historical corpora as encoded in diachronic distributional semantic models by focusing on two specific forms of bias, namely a political (i.e., anti-communism) and a racist (i.e., antisemitism) one.

Diachronic Word Embeddings Word Embeddings

Graph Routing between Capsules

no code implementations 22 Jun 2021 Yang Li, Wei Zhao, Erik Cambria, Suhang Wang, Steffen Eger

Therefore, in this paper, we introduce a new capsule network with graph routing to learn both relationships, where capsules in each layer are treated as the nodes of a graph.

Relation text-classification +1

CMCE at SemEval-2020 Task 1: Clustering on Manifolds of Contextualized Embeddings to Detect Historical Meaning Shifts

1 code implementation SEMEVAL 2020 David Rother, Thomas Haider, Steffen Eger

Remarkably, with only 10-dimensional mBERT embeddings (reduced from the original size of 768), our submitted model performs best on subtask 1 for English and ranks third in subtask 2 for English.

Change Detection Clustering +1

Probing Multilingual BERT for Genetic and Typological Signals

no code implementations COLING 2020 Taraka Rama, Lisa Beinborn, Steffen Eger

We probe the layers in multilingual BERT (mBERT) for phylogenetic and geographic language signals across 100 languages and compute language distances based on the mBERT representations.

regression

Vec2Sent: Probing Sentence Embeddings with Natural Language Generation

1 code implementation COLING 2020 Martin Kerscher, Steffen Eger

We introspect black-box sentence embeddings by conditionally generating from them with the objective to retrieve the underlying discrete sentence.

Sentence Sentence Embeddings +1

From Hero to Zéroe: A Benchmark of Low-Level Adversarial Attacks

1 code implementation 12 Oct 2020 Steffen Eger, Yannik Benz

Adversarial attacks are label-preserving modifications to inputs of machine learning classifiers designed to fool machines but not humans.

Natural Language Inference Part-Of-Speech Tagging +1

How to Probe Sentence Embeddings in Low-Resource Languages: On Structural Design Choices for Probing Task Evaluation

1 code implementation CONLL 2020 Steffen Eger, Johannes Daxenberger, Iryna Gurevych

We then probe embeddings in a multilingual setup with design choices that lie in a 'stable region', as we identify for English, and find that results on English do not transfer to other languages.

Sentence Sentence Embeddings

On the Limitations of Cross-lingual Encoders as Exposed by Reference-Free Machine Translation Evaluation

1 code implementation ACL 2020 Wei Zhao, Goran Glavaš, Maxime Peyrard, Yang Gao, Robert West, Steffen Eger

We systematically investigate a range of metrics based on state-of-the-art cross-lingual semantic representations obtained with pretrained M-BERT and LASER.

Language Modeling Language Modelling +5

PO-EMO: Conceptualization, Annotation, and Modeling of Aesthetic Emotions in German and English Poetry

1 code implementation LREC 2020 Thomas Haider, Steffen Eger, Evgeny Kim, Roman Klinger, Winfried Menninghaus

Thus, we conceptualize a set of aesthetic emotions that are predictive of aesthetic appreciation in the reader, and allow the annotation of multiple labels per line to capture mixed emotions within their context.

Emotion Classification Emotion Recognition

Semantic Change and Emerging Tropes In a Large Corpus of New High German Poetry

1 code implementation WS 2019 Thomas Haider, Steffen Eger

Due to its semantic succinctness and novelty of expression, poetry is a great test bed for semantic change analysis.

Towards Scalable and Reliable Capsule Networks for Challenging NLP Applications

5 code implementations ACL 2019 Wei Zhao, Haiyun Peng, Steffen Eger, Erik Cambria, Min Yang

Obstacles hindering the development of capsule networks for challenging NLP applications include poor scalability to large output spaces and less reliable routing processes.

 Ranked #1 on Text Classification on RCV1 (P@1 metric)

General Classification Multi-Label Text Classification +1

Pitfalls in the Evaluation of Sentence Embeddings

no code implementations WS 2019 Steffen Eger, Andreas Rücklé, Iryna Gurevych

Our motivation is to challenge the current evaluation of sentence embeddings and to provide an easy-to-access reference for future research.

Sentence Sentence Embeddings

Text Processing Like Humans Do: Visually Attacking and Shielding NLP Systems

1 code implementation NAACL 2019 Steffen Eger, Gözde Gül Şahin, Andreas Rücklé, Ji-Ung Lee, Claudia Schulz, Mohsen Mesgar, Krishnkant Swarnkar, Edwin Simpson, Iryna Gurevych

Visual modifications to text are often used to obfuscate offensive comments in social media (e.g., "!d10t") or as a writing style ("1337" in "leet speak"), among other scenarios.

Adversarial Attack Sentence

Does My Rebuttal Matter? Insights from a Major NLP Conference

1 code implementation NAACL 2019 Yang Gao, Steffen Eger, Ilia Kuznetsov, Iryna Gurevych, Yusuke Miyao

We then focus on the role of the rebuttal phase, and propose a novel task to predict after-rebuttal (i. e., final) scores from initial reviews and author responses.

4k

Predicting Research Trends From Arxiv

1 code implementation 7 Mar 2019 Steffen Eger, Chao Li, Florian Netzer, Iryna Gurevych

By extrapolation, we predict that these topics will remain lead problems/approaches in their fields in the short- and mid-term.

reinforcement-learning Reinforcement Learning +2

Is it Time to Swish? Comparing Deep Learning Activation Functions Across NLP tasks

1 code implementation EMNLP 2018 Steffen Eger, Paul Youssef, Iryna Gurevych

Activation functions play a crucial role in neural networks because they are the nonlinearities which have been attributed to the success story of deep learning.

image-classification Image Classification

One Size Fits All? A simple LSTM for non-literal token and construction-level classification

no code implementations COLING 2018 Erik-Lân Do Dinh, Steffen Eger, Iryna Gurevych

In this paper, we tackle four different tasks of non-literal language classification: token and construction level metaphor detection, classification of idiomatic use of infinitive-verb compounds, and classification of non-literal particle verbs.

All Classification +2

Multi-Task Learning for Argumentation Mining in Low-Resource Settings

1 code implementation NAACL 2018 Claudia Schulz, Steffen Eger, Johannes Daxenberger, Tobias Kahse, Iryna Gurevych

We investigate whether and where multi-task learning (MTL) can improve performance on NLP problems related to argumentation mining (AM), in particular argument component identification.

Multi-Task Learning

Neural End-to-End Learning for Computational Argumentation Mining

2 code implementations ACL 2017 Steffen Eger, Johannes Daxenberger, Iryna Gurevych

Contrary to models that operate on the argument component level, we find that framing AM as dependency parsing leads to subpar performance results.

Dependency Parsing General Classification +1

EELECTION at SemEval-2017 Task 10: Ensemble of nEural Learners for kEyphrase ClassificaTION

1 code implementation SEMEVAL 2017 Steffen Eger, Erik-Lân Do Dinh, Ilia Kuznetsov, Masoud Kiaeeha, Iryna Gurevych

From these approaches, we created an ensemble of differently hyper-parameterized systems, achieving a micro-F1 score of 0.63 on the test data.

General Classification

Complex Decomposition of the Negative Distance kernel

no code implementations 5 Jan 2016 Tim vor der Brück, Steffen Eger, Alexander Mehler

Our evaluation shows that the power kernel produces F-scores that are comparable to the reference kernels, but is -- except for the linear kernel -- faster to compute.

Document Classification General Classification +2

On the Number of Many-to-Many Alignments of Multiple Sequences

no code implementations 2 Nov 2015 Steffen Eger

We provide a new asymptotic formula for the case $S=\{(s_1,\ldots, s_N) \:|\: 1\le s_i\le 2\}$.
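
For intuition, the quantities the asymptotic formula describes can be computed exactly by a small dynamic program: an alignment is a sequence of columns, and each column consumes $s_i$ symbols from sequence $i$, with $(s_1,\ldots,s_N)$ drawn from the step set $S$ above (here $1 \le s_i \le 2$). A sketch (function and argument names are mine):

```python
from functools import lru_cache
from itertools import product

def count_alignments(lengths, lo=1, hi=2):
    """Count many-to-many alignments of N sequences with the given lengths,
    where every alignment column consumes between `lo` and `hi` symbols
    from each sequence (the step set S from the abstract)."""
    steps = list(product(range(lo, hi + 1), repeat=len(lengths)))

    @lru_cache(maxsize=None)
    def ways(remaining):
        if all(r == 0 for r in remaining):
            return 1  # every sequence fully consumed
        total = 0
        for step in steps:
            nxt = tuple(r - s for r, s in zip(remaining, step))
            if all(n >= 0 for n in nxt):  # a column may not overshoot any sequence
                total += ways(nxt)
        return total

    return ways(tuple(lengths))

# Two sequences of length 3: compositions of 3 into parts of size 1-2 are
# (1,1,1), (1,2), (2,1); pairing compositions with equal part counts gives
# 1*1 + 2*2 = 5 alignments.
assert count_alignments((3, 3)) == 5
```

The paper's contribution is the asymptotic growth of these counts as the sequence lengths tend to infinity; the recursion above is just the exact finite-length count.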
